Running Data Pipelines
Data Pipelines allow you to process data stored in one location and send the result to another, such as from a data lake to an analytics database or into a payment processing system. You can also use the same source and sink so the pipeline only processes data.
With the Hazelcast Platform Operator, you can run Data Pipelines from existing JAR files. Depending on the data source, Data Pipelines can be used for stream or batch processing. To create a Data Pipeline using the `JetJob` CR, the Jet engine must be configured in the `Hazelcast` CR: both `enabled` and `resourceUploadEnabled` must be set to `true`. To learn more about Data Pipelines and the Jet engine, refer to the Platform documentation.
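For reference, the relevant part of the `Hazelcast` CR is sketched below; the full example at the end of this page shows it in context:

```yaml
apiVersion: hazelcast.com/v1alpha1
kind: Hazelcast
metadata:
  name: hazelcast
spec:
  jet:
    enabled: true                # turns the Jet engine on
    resourceUploadEnabled: true  # allows JAR files to be uploaded to members
```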
Configuring the JetJob Resource
Below are the configuration options for the `JetJob` resource. You can find more detailed information on the API Reference page.
Field | Description |
---|---|
`name` | Name of the Jet Job to be created. If empty, the CR name will be used. It cannot be updated after the Jet Job is created successfully. |
`hazelcastResourceName` | Name of the `Hazelcast` resource that the job will run on. |
`state` | Used to manage the job state. The default value is `Running`, and the job must be created with the `Running` state. |
`jarName` | Name of the JAR file to run that is present on the member. |
`mainClass` | Name of the main class that will be run on the submitted job. |
`bucketConfig` | Bucket configuration from which the JAR file that is specified in the `jarName` field will be downloaded. |
`remoteURL` | URL from where the JAR file will be downloaded. |
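For instance, a minimal `JetJob` that uses `mainClass` might look like the following sketch; the JAR name and the class name are placeholders:

```yaml
apiVersion: hazelcast.com/v1alpha1
kind: JetJob
metadata:
  name: jet-job-with-main-class
spec:
  hazelcastResourceName: hazelcast
  state: Running
  jarName: my-data-pipeline.jar      # JAR available to the member
  mainClass: com.example.MyPipeline  # hypothetical main class inside the JAR
```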
Providing the JAR File for the Data Pipeline
To run the Data Pipeline, you need to provide a JAR file that contains the Pipeline. The JAR file can be downloaded before the cluster starts by configuring `jet.bucketConfig`, `jet.remoteURLs`, or `jet.configMaps` in the `Hazelcast` CR. This way, all the files in the bucket are accessible to the members when the cluster starts.
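For example, a `Hazelcast` CR that pre-downloads a JAR file from a remote URL might be sketched as follows; the URL is a placeholder:

```yaml
apiVersion: hazelcast.com/v1alpha1
kind: Hazelcast
metadata:
  name: hazelcast
spec:
  jet:
    enabled: true
    resourceUploadEnabled: true
    # Files at these URLs are downloaded to every member at startup
    remoteURLs:
      - "https://example.com/jars/my-data-pipeline.jar"
```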
Another option is to configure `bucketConfig` or `remoteURL` in the `JetJob` CR. This way, only the JAR file specified in the `jarName` parameter is downloaded at runtime, before the Data Pipeline starts.
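A `JetJob` that downloads its JAR at runtime might be sketched as follows; the URL is a placeholder:

```yaml
apiVersion: hazelcast.com/v1alpha1
kind: JetJob
metadata:
  name: jet-job-remote
spec:
  hazelcastResourceName: hazelcast
  state: Running
  jarName: my-data-pipeline.jar
  # Only this JAR is fetched, just before the job is submitted
  remoteURL: "https://example.com/jars/my-data-pipeline.jar"
```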
JetJob state management
Once the job is created, you can use the `state` field to manage its lifecycle. The following `state` values are available:
- `Running`. All jobs must be created with the `Running` state. It will run the newly created job or will start a `Suspended` job.
- `Suspended`. Gracefully suspends the `Running` job.
- `Canceled`. Gracefully stops the job.
- `Restarted`. Suspends and resumes the job in one step.
Deleting the `JetJob` resource will forcefully cancel the job.
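For example, to suspend the job from the example below, you could change the `state` field and reapply the resource:

```yaml
apiVersion: hazelcast.com/v1alpha1
kind: JetJob
metadata:
  name: jet-job-sample
spec:
  name: my-test-jet-job
  hazelcastResourceName: hazelcast
  state: Suspended  # was Running; gracefully suspends the job
  jarName: my-data-pipeline.jar
```

Setting `state` back to `Running` resumes the suspended job.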
Example Configuration
The following `JetJob` resource runs the Data Pipeline from `my-data-pipeline.jar` on the cluster defined by the `Hazelcast` resource.
```yaml
apiVersion: hazelcast.com/v1alpha1
kind: Hazelcast
metadata:
  name: hazelcast
spec:
  clusterSize: 3
  repository: 'docker.io/hazelcast/hazelcast-enterprise'
  jet:
    enabled: true
    resourceUploadEnabled: true
    bucketConfig:
      secretName: br-secret-gcp
      bucketURI: "gs://your-bucket/path/to/jars"
  licenseKeySecretName: hazelcast-license-key
---
apiVersion: hazelcast.com/v1alpha1
kind: JetJob
metadata:
  name: jet-job-sample
spec:
  name: my-test-jet-job
  hazelcastResourceName: hazelcast
  state: Running
  jarName: my-data-pipeline.jar
```
For further information about accessing resources on different cloud providers, see Authorization Methods to Access Cloud Provider Resources.