Spark configuration
The Spark configuration view contains the Spark-specific properties that you can define for your Job, depending on the distribution and the Spark mode you are using.
The information in this section applies only to users who have subscribed to Talend Data Fabric or to any Talend product with Big Data; it is not applicable to Talend Open Studio for Big Data users.
Defining the connection to the Azure Storage account to be used in the Studio
Set up this connection in the Repository of the Studio so that you can reuse it in your Jobs.
- You have an Azure account with appropriate rights and permissions to Azure Storage.
- The Azure Storage account to be used has been properly created and you have the appropriate permissions to access it. For further information about Azure Storage, see Azure Storage tutorials from the Azure documentation.
- You are using one of the Talend solutions with Big Data.
-
Obtain the access key to the Azure Storage account to be used on https://portal.azure.com/.
- Click All services on the menu bar on the left of the Azure welcome page.
- Click Storage accounts in the STORAGE section.
- Click the storage account to be used.
- On the list that is displayed, click Access keys to open the corresponding blade.
- Copy the key that is displayed and keep it somewhere appropriate so as to use it in the steps to come.
-
In the Integration perspective
of the Studio, expand the Metadata node in the
Repository, right-click the Azure
Storage node and, from the contextual menu, select
Create an Azure Storage Connection to open the
wizard. -
Complete the fields in the wizard:
- Name: enter the name you want to use for the connection to be defined.
- Account Name: enter the name of the Azure Storage account to be connected to.
- Account Key: enter the access key you obtained in the previous steps.
-
Click Test connection to verify the configuration. Once
a message pops up to say that the connection is successful, the
Next button is activated. -
Click Next to access the list of containers available on
Azure under this Azure Storage account. This list is empty if this Azure Storage account does not contain any
containers. -
Select the container to connect to and click Next or
just click Next to skip this step. You can revise this
step anytime later by coming back to this wizard. -
Do the same for the queue list and the table list that are respectively
displayed in the wizard. -
Click Finish to validate the creation. The connection
appears under the Azure Storage node in the
Repository.
You can now reuse this connection in your Jobs to work with the Azure services that are associated with this Azure Storage
account.
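If you want to double-check the account name and access key outside the wizard, the following Scala sketch lists the content of a container through the Hadoop Azure connector. This is only an illustration, not Studio-generated code: it assumes the hadoop-azure and azure-storage libraries are on the classpath, and the account, container and key values are placeholders.
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object AzureStorageConnectionCheck {
  def main(args: Array[String]): Unit = {
    val account = "mystorageaccount"   // placeholder: your Azure Storage account name
    val container = "mycontainer"      // placeholder: an existing container
    val key = "<access-key>"           // placeholder: the key copied from the Access keys blade

    val conf = new Configuration()
    conf.set(s"fs.azure.account.key.$account.blob.core.windows.net", key)
    // Depending on the Hadoop version, the wasbs scheme may also need fs.wasbs.impl to be set explicitly.

    // List the content of the container to confirm that the account name and key are valid.
    val fs = FileSystem.get(new URI(s"wasbs://$container@$account.blob.core.windows.net/"), conf)
    fs.listStatus(new Path("/")).foreach(status => println(status.getPath))
  }
}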
Defining the Azure Databricks connection parameters for Spark Jobs
Complete the Databricks connection configuration in the Spark configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
- When running a Spark Streaming Job, only one Job is allowed to run on the same Databricks cluster at a time.
- When running a Spark Batch Job, you can send more than one Job to run in parallel on the same Databricks cluster only if you have selected the Do not restart the cluster when submitting check box; otherwise, since each run automatically restarts the cluster, the Jobs that are launched in parallel interrupt each other and cause execution failures.
-
From the Cloud provider drop-down list, select
Azure. -
Enter the basic connection information to Databricks.
Standalone
-
In the Endpoint
field, enter the URL address of your Azure Databricks workspace.
This URL can be found in the Overview blade
of your Databricks workspace page on your Azure portal. For example,
this URL could look like https://westeurope.azuredatabricks.net. -
In the Cluster ID
field, enter the ID of the Databricks cluster to be used. This ID is
the value of the
spark.databricks.clusterUsageTags.clusterId
property of your Spark cluster. You can find this property on the
properties list in the Environment tab in the
Spark UI view of your cluster. You can also easily find this ID from
the URL of your Databricks cluster. It is present immediately after
cluster/ in this URL. -
Click the […] button
next to the Token field to enter the
authentication token generated for your Databricks user account. You
can generate or find this token on the User
settings page of your Databricks workspace. For
further information, see Token management from the
Azure documentation. -
In the DBFS dependencies
folder field, enter the directory that is used to
store your Job related dependencies on Databricks Filesystem at
runtime, putting a slash (/) at the end of this directory. For
example, enter /jars/ to store the dependencies
in a folder named jars. This folder is created
on the fly if it does not exist. -
Poll interval when retrieving Job status (in
ms): enter, without the quotation marks, the time
interval (in milliseconds) at the end of which you want the Studio
to ask Spark for the status of your Job. For example, this status
could be Pending or Running. The default value is 300000, that is, 300
seconds. This interval is recommended by Databricks to correctly
retrieve the Job status. -
Use
transient cluster: select this check box to
use transient Databricks clusters. The custom properties you defined in the Advanced properties table are automatically taken into account by the transient clusters at runtime.
- Autoscale: select or clear this check box to define how the number of workers used by your transient cluster is determined.
  - If you select this check box, autoscaling is enabled. Define the minimum number of workers in Min workers and the maximum number of workers in Max workers. Your transient cluster is scaled up and down within this scope based on its workload. According to the Databricks documentation, autoscaling works best with Databricks runtime versions 3.0 onwards.
  - If you clear this check box, autoscaling is deactivated. Define the number of workers the transient cluster is expected to have. This number does not include the Spark driver node.
- Node type
and Driver node type:
select the node types for the workers and the Spark driver node.
These types determine the capacity of your nodes and their
pricing by Databricks. For details about
these node types and the Databricks Units they use, see
Supported Instance
Types from the Databricks documentation. - Elastic
disk: select this check box to enable your
transient cluster to automatically scale up its disk space when
its Spark workers are running low on disk space. For more details about this elastic disk
feature, search for the section about autoscaling local
storage from your Databricks documentation. - SSH public
key: if SSH access has been set up for your
cluster, enter the public key of the generated SSH key pair.
This public key is automatically added to each node of your
transient cluster. If no SSH access has been set up, ignore this
field. For further information about SSH
access to your cluster, see SSH access to
clusters from the Databricks
documentation. - Configure cluster
log: select this check box to define where to
store your Spark logs for the long term. This storage system can
be S3 or DBFS.
- Do not restart the cluster
when submitting: select this check box to prevent
the Studio from restarting the cluster when it submits your
Jobs. However, if you make changes in your Jobs, clear this check
box so that the Studio restarts your cluster to take these changes
into account.
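As a side note, the endpoint, cluster ID and token entered above can be checked outside the Studio with a plain REST call to the Databricks clusters/get API. The following Scala sketch is only an illustration; the workspace URL, cluster ID and token are placeholders.
import java.net.{HttpURLConnection, URL}
import scala.io.Source

object DatabricksClusterCheck {
  def main(args: Array[String]): Unit = {
    val endpoint = "https://westeurope.azuredatabricks.net"   // placeholder: your workspace URL
    val clusterId = "0123-456789-abcde123"                    // placeholder: your cluster ID
    val token = "<personal-access-token>"                     // placeholder: your Databricks token

    val connection = new URL(s"$endpoint/api/2.0/clusters/get?cluster_id=$clusterId")
      .openConnection().asInstanceOf[HttpURLConnection]
    connection.setRequestProperty("Authorization", s"Bearer $token")

    // A 200 response with a JSON description of the cluster confirms that the three values are consistent.
    println(Source.fromInputStream(connection.getInputStream).mkString)
  }
}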
-
If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the
Spark checkpointing operation. In the field that is displayed, enter the
directory in which Spark stores, in the file system of the cluster, the context
data of the computations such as the metadata and the generated RDDs of this
computation.
For further information about the Spark checkpointing operation, see http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing .
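To illustrate what this checkpoint directory is used for, here is a minimal Spark sketch (not Talend-generated code) that sets a checkpoint directory on the cluster file system and checkpoints an RDD; the /checkpoint path is a placeholder.
import org.apache.spark.sql.SparkSession

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CheckpointSketch").getOrCreate()

    // The directory entered in the checkpointing field plays the same role as this call:
    // Spark persists checkpoint data (for example RDD contents and streaming metadata) there.
    spark.sparkContext.setCheckpointDir("/checkpoint")   // placeholder directory on the cluster file system

    val rdd = spark.sparkContext.parallelize(1 to 100).map(_ * 2)
    rdd.checkpoint()        // truncates the lineage and writes the data to the checkpoint directory
    println(rdd.count())    // the action triggers the actual checkpoint
    spark.stop()
  }
}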
Adding Azure specific properties to access the Azure storage system from Databricks
Add the Azure specific properties to the Spark configuration of your Databricks
cluster so that your cluster can access Azure Storage.
You need to do this only when you want your Talend
Jobs for Apache Spark to use Azure Blob Storage or Azure Data Lake Storage with
Databricks.
-
Ensure that your Spark cluster in Databricks has been properly created and is
running and its version is supported by the Studio. If you use Azure Data
Lake Storage Gen 2, only Databricks 5.4 is supported. For further information, see Create Databricks workspace from
Azure documentation. - You have an Azure account.
- The Azure Blob Storage or Azure Data Lake Storage service to be used has been
properly created and you have the appropriate permissions to access it. For
further information about Azure Storage, see Azure Storage tutorials from Azure
documentation.
-
On the Configuration tab of your Databricks cluster
page, scroll down to the Spark tab at the bottom of the
page. -
Click Edit to make the fields on this page
editable. -
In this Spark tab, enter the Spark properties regarding the credentials to be used to access your Azure Storage system.
Azure Blob Storage: when you need to use Azure Blob Storage with Azure Databricks, add the following Spark properties:
- The property that provides the account key:
spark.hadoop.fs.azure.account.key.<storage_account>.blob.core.windows.net <key>
Ensure that the account to be used has the appropriate read/write rights and permissions.
- If you need to append data to an existing file, add this property:
spark.hadoop.fs.azure.enable.append.support true
Azure Data Lake Storage (Gen1): when you need to use Azure Data Lake Storage Gen1 with Databricks, add the following Spark properties, one per line:
spark.hadoop.dfs.adls.oauth2.access.token.provider.type ClientCredential
spark.hadoop.dfs.adls.oauth2.client.id <your_app_id>
spark.hadoop.dfs.adls.oauth2.credential <your_authentication_key>
spark.hadoop.dfs.adls.oauth2.refresh.url https://login.microsoftonline.com/<your_app_TENANT-ID>/oauth2/token
Azure Data Lake Storage (Gen2): when you need to use Azure Data Lake Storage Gen2 with Databricks, add the following Spark properties, one per line:
- The property that provides the account key:
spark.hadoop.fs.azure.account.key.<storage_account>.dfs.core.windows.net <key>
This key is associated with the storage account to be used. You can find it in the Access keys blade of this storage account. Two keys are available for each account and, by default, either of them can be used for this access. Ensure that the account to be used has the appropriate read/write rights and permissions.
- If the ADLS file system to be used does not exist yet, add the following property:
spark.hadoop.fs.azure.createRemoteFileSystemDuringInitialization true
For further information about how to find your application ID and authentication key, see Get application ID and authentication key from the Azure documentation. In the same documentation, you can also find details about how to find your tenant ID at Get tenant ID. -
-
If you need to run Spark Streaming Jobs with Databricks, in the same
Spark tab, add the following property to define a
default Spark serializer. If you do not plan to run Spark Streaming Jobs, you
can ignore this step.
spark.serializer org.apache.spark.serializer.KryoSerializer
- Restart your Spark cluster.
-
In the Spark UI tab of your Databricks cluster page,
click Environment to display the list of properties and
verify that each of the properties you added in the previous steps is present on
that list.
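As a complement, the same credentials can also be set at session level, for example from a Databricks notebook, to validate them before editing the cluster configuration. The following Scala sketch assumes it runs on a Databricks cluster; the account name, key, container and file path are placeholders, and note that session-level properties drop the spark.hadoop. prefix used in the cluster configuration.
import org.apache.spark.sql.SparkSession

// Assumes execution on a Databricks cluster, where the Azure storage drivers are available.
val spark = SparkSession.builder().getOrCreate()

val storageAccount = "mystorageaccount"   // placeholder
val accountKey = "<access-key>"           // placeholder

// Session-level equivalents of the cluster-level properties above.
spark.conf.set(s"fs.azure.account.key.$storageAccount.blob.core.windows.net", accountKey)   // Azure Blob Storage
spark.conf.set(s"fs.azure.account.key.$storageAccount.dfs.core.windows.net", accountKey)    // Azure Data Lake Storage Gen2

// Quick read to confirm that the key gives access to an existing container and file (placeholders).
spark.read.text(s"wasbs://mycontainer@$storageAccount.blob.core.windows.net/sample.txt").show()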
Defining the Databricks-on-AWS connection parameters for Spark Jobs
Complete the Databricks connection configuration in the Spark configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
-
- When running a Spark Streaming Job, only one Job is allowed to run on the same Databricks cluster at a time.
- When running a Spark Batch Job, you can send more than one Job to run in parallel on the same Databricks cluster only if you have selected the Do not restart the cluster when submitting check box; otherwise, since each run automatically restarts the cluster, the Jobs that are launched in parallel interrupt each other and cause execution failures.
- Ensure that the AWS account to be used has the proper read/write permissions to the S3 bucket to be used. For this purpose, contact the administrator of your AWS system.
Standalone
If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the
Spark checkpointing operation. In the field that is displayed, enter the
directory in which Spark stores, in the file system of the cluster, the context
data of the computations such as the metadata and the generated RDDs of this
computation.
For further information about the Spark checkpointing operation, see http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing .
Defining the AWS Qubole connection parameters for Spark Jobs
Complete the Qubole connection configuration in the Spark
configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
Qubole is supported only in the traditional data integration framework (the Standard framework) and in the Spark frameworks.
- You have properly set up your Qubole cluster on AWS. For further information about
how to do this, see Getting Started with Qubole on AWS from the
Qubole documentation. - Ensure that the Qubole account to be used has the proper IAM role that is allowed to
read/write to the S3 bucket to be used. For further details, contact the
administrator of your Qubole system or see Cross-account IAM Role for QDS from
the Qubole documentation. - Ensure that the AWS account to be used has the proper read/write permissions to
the S3 bucket to be used. For this purpose, contact the administrator of your
AWS system.
-
Enter the basic connection information to Qubole.
Connection configuration
-
Click the … button next to the
API Token field to enter the
authentication token generated for the Qubole user account
to be used. For further information about how to obtain this
token, see Manage Qubole
account from the Qubole documentation. This
token allows you to specify the user account you want to
use to access Qubole. Your Job automatically uses
the rights and permissions granted to this user account
in Qubole. -
Select the Cluster label check
box and enter the name of the Qubole cluster to be used. If
you leave this check box clear, the default cluster is
used. If you need details about your default cluster,
ask the administrator of your Qubole service. You can
also read this article
from the Qubole documentation to find more information
about configuring a default Qubole cluster. -
Select the Change API endpoint
check box and select the region to be used. If you leave this
check box clear, the default region is used. For further
information about the Qubole Endpoints supported on
QDS-on-AWS, see Supported Qubole
Endpoints on Different Cloud
Providers.
-
-
Configure the connection to the S3 file system to be used to temporarily store the dependencies of your Job so that your Qubole cluster has access to these dependencies.
This configuration is used for your Job dependencies only. Use a
tS3Configuration in your Job to write your actual
business data in the S3 system with Qubole. Without
tS3Configuration, this business data is written in the
Qubole HDFS system and is destroyed once you shut down your cluster.-
Access key and
Secret key:
enter the authentication information required to connect to the Amazon
S3 bucket to be used.To enter the password, click the […] button next to the
password field, and then in the pop-up dialog box enter the password between double quotes
and click OK to save the settings. - Bucket name: enter the name of the bucket in
which you want to store the dependencies of your Job. This bucket must
already exist on S3. - Temporary resource folder: enter the
directory in which you want to store the dependencies of your Job. For
example, enter temp_resources to write the dependencies in
the /temp_resources folder in the bucket. If this folder already exists at runtime, its contents are overwritten
by the upcoming dependencies; otherwise, this folder is automatically
created. -
Region: specify the AWS region by selecting a region name from the
list. For more information about the AWS Region, see Regions and Endpoints.
-
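For illustration only, the access key, secret key and bucket entered above correspond to the standard Hadoop s3a properties that Spark uses to reach S3. The Scala sketch below writes a small dataset to a placeholder bucket, which is roughly the role that tS3Configuration plays for your business data in a Job; it is not the Talend-generated code, it assumes the s3a (hadoop-aws) libraries are available on the cluster, and all values are placeholders.
import org.apache.spark.sql.SparkSession

object QuboleS3Sketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("QuboleS3Sketch").getOrCreate()
    import spark.implicits._

    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.set("fs.s3a.access.key", "<access-key>")   // placeholder
    hadoopConf.set("fs.s3a.secret.key", "<secret-key>")   // placeholder

    // Write data to the bucket directly so that it survives the shutdown of the cluster.
    val df = Seq(("id1", 42), ("id2", 7)).toDF("id", "value")
    df.write.mode("overwrite").parquet("s3a://my-bucket/output/")   // placeholder bucket and folder
    spark.stop()
  }
}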
-
After the connection is configured, you can tune
the Spark performance, although not required, by following the process explained in:-
for Spark Batch Jobs.
-
for Spark Streaming Jobs.
-
-
If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the
Spark checkpointing operation. In the field that is displayed, enter the
directory in which Spark stores, in the file system of the cluster, the context
data of the computations such as the metadata and the generated RDDs of this
computation. For further information about the Spark checkpointing operation, see http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing.
Defining the EMR connection parameters
Complete the EMR connection configuration in the Spark configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
-
Enter the basic connection information to EMR:
Yarn client
The Studio runs the Spark driver to orchestrate how the Job should be
performed and then sends the orchestration to the Yarn service of a given
Hadoop cluster so that the Resource Manager of this Yarn service
requests execution resources accordingly.If you are using the Yarn client
mode, you need to set the following parameters in their corresponding
fields (if you leave the check box of a service clear, then at runtime,
the configuration about this parameter in the Hadoop cluster to be used
will be ignored):-
In the Resource manager
field, enter the address of the ResourceManager service of the Hadoop cluster to
be used. -
Select the Set resourcemanager
scheduler address check box and enter the Scheduler address in
the field that appears. -
Select the Set jobhistory
address check box and enter the location of the JobHistory
server of the Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server. -
Select the Set staging
directory check box and enter this directory defined in your
Hadoop cluster for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
such as yarn-site.xml or mapred-site.xml of your distribution. -
If you are accessing the Hadoop cluster running with Kerberos security,
select this check box, then, enter the Kerberos principal names for the
ResourceManager service and the JobHistory service in the displayed fields. This
enables you to use your user name to authenticate against the credentials stored in
Kerberos. These principals can be found in the configuration files of your
distribution, such as in yarn-site.xml and in mapred-site.xml. If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. You need to enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored in the machine in which your Job actually
runs, for example, on a Talend
Jobserver.Note that the user that executes a keytab-enabled Job is not necessarily
the one a principal designates but must have the right to read the keytab file being
used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
situation, ensure that user1 has the right to read the keytab
file to be used. -
The User name field is available when you are not using
Kerberos to authenticate. In the User name field, enter the
login user name for your distribution. If you leave it empty, the user name of the machine
hosting the Studio will be used. -
If the Spark cluster cannot recognize the machine in which the Job is
launched, select the Define the driver hostname or IP
address check box and enter the host name or the IP address of
this machine. This allows the Spark master and its workers to recognize this
machine to find the Job and thus its driver. Note that in this situation, you also need to add the name and the IP
address of this machine to its host file.
Yarn cluster
The Spark driver runs in your Yarn cluster to orchestrate how the Job
should be performed.If you are using the Yarn cluster mode, you need
to define the following parameters in their corresponding fields (if you
leave the check box of a service clear, then at runtime, the
configuration about this parameter in the Hadoop cluster to be used will
be ignored):-
In the Resource manager
field, enter the address of the ResourceManager service of the Hadoop cluster to
be used. -
Select the Set resourcemanager
scheduler address check box and enter the Scheduler address in
the field that appears. -
Select the Set jobhistory
address check box and enter the location of the JobHistory
server of the Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server. -
Select the Set staging
directory check box and enter this directory defined in your
Hadoop cluster for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
such as yarn-site.xml or mapred-site.xml of your distribution. -
Set path to custom Hadoop
configuration JAR: if you are using
connections defined in Repository to
connect to your Cloudera or Hortonworks cluster, you can
select this check box in the
Repository wizard and in the
field that is displayed, specify the path to the JAR file
that provides the connection parameters of your Hadoop
environment. Note that this file must be accessible from the
machine where your Job is launched. This kind of Hadoop configuration JAR file is
automatically generated when you build a Big Data Job from the
Studio. This JAR file is by default named with this pattern:
hadoop-conf-[name_of_the_metadata_in_the_repository]_[name_of_the_context].jar
You can also download this JAR file from the web console of your
cluster or simply create a JAR file yourself by putting the
configuration files in the root of your JAR file. For example:
hdfs-site.xml
core-site.xml
The parameters from your custom JAR file override the parameters
you put in the Spark configuration field.
They also override the configuration you set in the
configuration components such as
tHDFSConfiguration or
tHBaseConfiguration when the related
storage system such as HDFS, HBase or Hive are native to Hadoop.
But they do not override the configuration set in the
configuration components for the third-party storage system such
as tAzureFSConfiguration. -
If you are accessing the Hadoop cluster running with Kerberos security,
select this check box, then, enter the Kerberos principal names for the
ResourceManager service and the JobHistory service in the displayed fields. This
enables you to use your user name to authenticate against the credentials stored in
Kerberos. These principals can be found in the configuration files of your
distribution, such as in yarn-site.xml and in mapred-site.xml. If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. You need to enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored in the machine in which your Job actually
runs, for example, on a Talend
Jobserver.Note that the user that executes a keytab-enabled Job is not necessarily
the one a principal designates but must have the right to read the keytab file being
used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
situation, ensure that user1 has the right to read the keytab
file to be used. -
The User name field is available when you are not using
Kerberos to authenticate. In the User name field, enter the
login user name for your distribution. If you leave it empty, the user name of the machine
hosting the Studio will be used. -
Select the Wait for the Job to complete check box to make your Studio or,
if you use Talend
Jobserver, your Job JVM keep monitoring the Job until the execution of the Job
is over. By selecting this check box, you actually set the spark.yarn.submit.waitAppCompletion property to be true. While
it is generally useful to select this check box when running a Spark Batch Job,
it makes more sense to keep this check box clear when running a Spark Streaming
Job.
Ensure that the user name in the Yarn
client mode is the same one you put in
tS3Configuration, the component used to provide S3
connection information to Spark. A sketch of a keytab-based Kerberos login follows this step.
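The keytab options described in the table above amount to a programmatic Kerberos login. The following Scala sketch shows the underlying Hadoop call; the principal and keytab path are placeholders, and it assumes the Hadoop client libraries and a valid Kerberos configuration are available on the machine running it.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

object KeytabLoginSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("hadoop.security.authentication", "kerberos")
    UserGroupInformation.setConfiguration(conf)

    // Log in with a principal/keytab pair instead of an interactive password prompt.
    UserGroupInformation.loginUserFromKeytab(
      "guest@EXAMPLE.COM",                     // placeholder principal
      "/etc/security/keytabs/guest.keytab")    // placeholder keytab path, readable by the executing user

    println(UserGroupInformation.getCurrentUser)
  }
}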
-
With the Yarn client mode, the
Property type list is displayed to allow you
to select an established Hadoop connection from the Repository, on the condition that you have created this connection
in the Repository. Then the Studio will reuse
that set of connection information for this Job.
-
If you need to launch your Job from Windows, it is recommended to specify where
the winutils.exe program to be used is stored.
-
If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box
and enter the directory where your winutils.exe is
stored. -
Otherwise, leave this check box clear; the Studio generates a winutils.exe file
by itself and automatically uses it for this Job.
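For reference, defining the Hadoop home directory comes down to pointing the Hadoop client libraries at a local folder that contains bin\winutils.exe before any Spark or Hadoop class is initialized, as in the hedged Scala sketch below; the path is a placeholder.
object WinutilsHomeSketch {
  def main(args: Array[String]): Unit = {
    // The Hadoop libraries look for %HADOOP_HOME%\bin\winutils.exe or the hadoop.home.dir system property.
    System.setProperty("hadoop.home.dir", "C:\\hadoop")   // placeholder: expects C:\hadoop\bin\winutils.exe

    // ... create the SparkSession and run the Job after this point.
  }
}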
-
-
In the Spark “scratch” directory
field, enter the local directory in which the Studio stores
temporary files such as the JAR files to be transferred. If you launch the Job
on Windows, the default disk is C:. So if you leave /tmp in this field, this directory is C:/tmp.
-
After the connection is configured, you can tune
the Spark performance, although not required, by following the process explained in:-
for Spark Batch Jobs.
-
for Spark Streaming Jobs.
-
-
It is recommended to activate the Spark logging and
checkpointing system in the Spark configuration tab of the Run view of your Spark
Job, in order to help debug and resume your Spark Job when issues arise:-
.
-
Defining the Cloudera connection parameters
Complete the Cloudera connection configuration in the Spark
configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
If you cannot
find the Cloudera version to be used from this drop-down list, you can add your distribution
by using the dynamic distribution settings in the Studio.
-
Select the type of the Spark cluster you need to connect to.
Standalone
The Studio connects to a Spark-enabled cluster to run the Job from this
cluster.If you are using the Standalone mode, you need to
set the following parameters:-
In the Spark host field, enter the URI
of the Spark Master of the Hadoop cluster to be used. -
In the Spark home field, enter the
location of the Spark executable installed in the Hadoop cluster to be used. -
If the Spark cluster cannot recognize the machine in which the Job is
launched, select the Define the driver hostname or IP
address check box and enter the host name or the IP address of
this machine. This allows the Spark master and its workers to recognize this
machine to find the Job and thus its driver. Note that in this situation, you also need to add the name and the IP
address of this machine to its host file.
Yarn client
The Studio runs the Spark driver to orchestrate how the Job should be
performed and then sends the orchestration to the Yarn service of a given
Hadoop cluster so that the Resource Manager of this Yarn service
requests execution resources accordingly.If you are using the Yarn client
mode, you need to set the following parameters in their corresponding
fields (if you leave the check box of a service clear, then at runtime,
the configuration about this parameter in the Hadoop cluster to be used
will be ignored):-
In the Resource manager
field, enter the address of the ResourceManager service of the Hadoop cluster to
be used. -
Select the Set resourcemanager
scheduler address check box and enter the Scheduler address in
the field that appears. -
Select the Set jobhistory
address check box and enter the location of the JobHistory
server of the Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server. -
Select the Set staging
directory check box and enter this directory defined in your
Hadoop cluster for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
such as yarn-site.xml or mapred-site.xml of your distribution. -
If you are accessing the Hadoop cluster running with Kerberos security,
select this check box, then, enter the Kerberos principal names for the
ResourceManager service and the JobHistory service in the displayed fields. This
enables you to use your user name to authenticate against the credentials stored in
Kerberos. These principals can be found in the configuration files of your
distribution, such as in yarn-site.xml and in mapred-site.xml. If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. You need to enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored in the machine in which your Job actually
runs, for example, on a Talend
Jobserver.Note that the user that executes a keytab-enabled Job is not necessarily
the one a principal designates but must have the right to read the keytab file being
used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
situation, ensure that user1 has the right to read the keytab
file to be used. -
The User name field is available when you are not using
Kerberos to authenticate. In the User name field, enter the
login user name for your distribution. If you leave it empty, the user name of the machine
hosting the Studio will be used. -
If the Spark cluster cannot recognize the machine in which the Job is
launched, select the Define the driver hostname or IP
address check box and enter the host name or the IP address of
this machine. This allows the Spark master and its workers to recognize this
machine to find the Job and thus its driver. Note that in this situation, you also need to add the name and the IP
address of this machine to its host file.
Yarn cluster
The Spark driver runs in your Yarn cluster to orchestrate how the Job
should be performed.If you are using the Yarn cluster mode, you need
to define the following parameters in their corresponding fields (if you
leave the check box of a service clear, then at runtime, the
configuration about this parameter in the Hadoop cluster to be used will
be ignored):-
In the Resource manager
field, enter the address of the ResourceManager service of the Hadoop cluster to
be used. -
Select the Set resourcemanager
scheduler address check box and enter the Scheduler address in
the field that appears. -
Select the Set jobhistory
address check box and enter the location of the JobHistory
server of the Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server. -
Select the Set staging
directory check box and enter this directory defined in your
Hadoop cluster for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
such as yarn-site.xml or mapred-site.xml of your distribution. -
Set path to custom Hadoop
configuration JAR: if you are using
connections defined in Repository to
connect to your Cloudera or Hortonworks cluster, you can
select this check box in the
Repository wizard and in the
field that is displayed, specify the path to the JAR file
that provides the connection parameters of your Hadoop
environment. Note that this file must be accessible from the
machine where your Job is launched. This kind of Hadoop configuration JAR file is
automatically generated when you build a Big Data Job from the
Studio. This JAR file is by default named with this pattern:
hadoop-conf-[name_of_the_metadata_in_the_repository]_[name_of_the_context].jar
You can also download this JAR file from the web console of your
cluster or simply create a JAR file yourself by putting the
configuration files in the root of your JAR file. For example:
hdfs-site.xml
core-site.xml
A sketch of how these files are picked up from the classpath is given after this step. The parameters from your custom JAR file override the parameters
you put in the Spark configuration field.
They also override the configuration you set in the
configuration components such as
tHDFSConfiguration or
tHBaseConfiguration when the related
storage system such as HDFS, HBase or Hive are native to Hadoop.
But they do not override the configuration set in the
configuration components for the third-party storage system such
as tAzureFSConfiguration. -
If you are accessing the Hadoop cluster running with Kerberos security,
select this check box, then, enter the Kerberos principal names for the
ResourceManager service and the JobHistory service in the displayed fields. This
enables you to use your user name to authenticate against the credentials stored in
Kerberos. These principals can be found in the configuration files of your
distribution, such as in yarn-site.xml and in mapred-site.xml. If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. You need to enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored in the machine in which your Job actually
runs, for example, on a Talend
Jobserver.Note that the user that executes a keytab-enabled Job is not necessarily
the one a principal designates but must have the right to read the keytab file being
used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
situation, ensure that user1 has the right to read the keytab
file to be used. -
The User name field is available when you are not using
Kerberos to authenticate. In the User name field, enter the
login user name for your distribution. If you leave it empty, the user name of the machine
hosting the Studio will be used. -
Select the Wait for the Job to complete check box to make your Studio or,
if you use Talend
Jobserver, your Job JVM keep monitoring the Job until the execution of the Job
is over. By selecting this check box, you actually set the spark.yarn.submit.waitAppCompletion property to be true. While
it is generally useful to select this check box when running a Spark Batch Job,
it makes more sense to keep this check box clear when running a Spark Streaming
Job.
Ensure that the user name in the Yarn
client mode is the same one you put in
tHDFSConfiguration, the component used to provide HDFS
connection information to Spark.
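To illustrate why packaging core-site.xml and hdfs-site.xml at the root of a JAR works, remember that Hadoop's Configuration class loads these files from the classpath. The Scala sketch below only illustrates that mechanism; fs.defaultFS is a standard Hadoop property used here as an example.
import org.apache.hadoop.conf.Configuration

object HadoopConfJarSketch {
  def main(args: Array[String]): Unit = {
    // core-site.xml found on the classpath (for example at the root of the configuration JAR)
    // is loaded automatically when a Configuration object is created.
    val conf = new Configuration()
    println(conf.get("fs.defaultFS"))   // value coming from the packaged core-site.xml, if present
  }
}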
-
With the Yarn client mode, the
Property type list is displayed to allow you
to select an established Hadoop connection from the Repository, on the condition that you have created this connection
in the Repository. Then the Studio will reuse
that set of connection information for this Job.
-
If you need to launch your Job from Windows, it is recommended to specify where
the winutils.exe program to be used is stored.
-
If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box
and enter the directory where your winutils.exe is
stored. -
Otherwise, leave this check box clear; the Studio generates a winutils.exe file
by itself and automatically uses it for this Job.
-
-
In the Spark “scratch” directory
field, enter the local directory in which the Studio stores
temporary files such as the JAR files to be transferred. If you launch the Job
on Windows, the default disk is C:. So if you leave /tmp in this field, this directory is C:/tmp.
-
After the connection is configured, you can tune
the Spark performance, although not required, by following the process explained in:-
for Spark Batch Jobs.
-
for Spark Streaming Jobs.
-
-
It is recommended to activate the Spark logging and
checkpointing system in the Spark configuration tab of the Run view of your Spark
Job, in order to help debug and resume your Spark Job when issues arise:-
.
-
-
If you are using Cloudera V5.5+ to run your MapReduce or Apache Spark Batch
Jobs, you can make use of Cloudera Navigator to trace the lineage of a given
data flow to discover how this data flow was generated by a Job.
Defining the Dataproc connection parameters
Complete the Google Dataproc connection configuration in the Spark
configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
Only the Yarn client mode is available for this type of cluster.
-
Enter the basic connection information to Dataproc:
Project identifier: enter the ID of your Google Cloud Platform project. If you are not certain about your project ID, check it in the Manage Resources page of your Google Cloud Platform services.
Cluster identifier: enter the ID of the Dataproc cluster to be used.
Region: from this drop-down list, select the Google Cloud region to be used.
Google Storage staging bucket: as a Talend Job expects its dependent JAR files for execution, specify the Google Storage directory to which these JAR files are transferred so that your Job can access them at execution time. The directory to be entered must end with a slash (/). If the directory does not exist, it is created on the fly, but the bucket to be used must already exist.
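As an illustration of how the staging bucket is reached, the Google Cloud Storage connector is available by default on a Dataproc cluster, so a gs:// path such as the staging directory above can be written to directly from Spark. The Scala sketch below is not Talend-generated code and the bucket name is a placeholder.
import org.apache.spark.sql.SparkSession

object GcsStagingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("GcsStagingSketch").getOrCreate()

    // Writing under the staging bucket confirms that the bucket exists and is reachable from the cluster.
    spark.range(10).write.mode("overwrite").parquet("gs://my-staging-bucket/check/")   // placeholder bucket
    spark.stop()
  }
}
-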
Provide the authentication information to your Google Dataproc cluster:
Provide Google Credentials in file
Leave this check box clear when you
launch your Job from a machine in which the Google Cloud SDK has been
installed and authorized to use your user account credentials to access
Google Cloud Platform. In this situation, this machine is often your
local machine. When you launch your Job from a remote
machine, such as a Jobserver, select this check box and, in the
Path to Google Credentials file field that is
displayed, enter the directory in which this JSON file is stored on the
Jobserver machine. For further information about this Google
Credentials file, see the administrator of your Google Cloud Platform or
visit Google Cloud Platform Auth
Guide. -
With the Yarn client mode, the
Property type list is displayed to allow you
to select an established Hadoop connection from the Repository, on the condition that you have created this connection
in the Repository. Then the Studio will reuse
that set of connection information for this Job.
-
In the Spark “scratch” directory
field, enter the local directory in which the Studio stores
temporary files such as the JAR files to be transferred. If you launch the Job
on Windows, the default disk is C:. So if you leave /tmp in this field, this directory is C:/tmp.
-
After the connection is configured, you can tune
the Spark performance, although not required, by following the process explained in:-
for Spark Batch Jobs.
-
for Spark Streaming Jobs.
-
-
It is recommended to activate the Spark logging and
checkpointing system in the Spark configuration tab of the Run view of your Spark
Job, in order to help debug and resume your Spark Job when issues arise:-
.
-
Defining the Hortonworks connection parameters
Complete the Hortonworks connection configuration in the Spark configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
-
Enter the basic connection information to Hortonworks
Yarn client
The Studio runs the Spark driver to orchestrate how the Job should be
performed and then sends the orchestration to the Yarn service of a given
Hadoop cluster so that the Resource Manager of this Yarn service
requests execution resources accordingly.If you are using the Yarn client
mode, you need to set the following parameters in their corresponding
fields (if you leave the check box of a service clear, then at runtime,
the configuration about this parameter in the Hadoop cluster to be used
will be ignored):-
In the Resource manager
field, enter the address of the ResourceManager service of the Hadoop cluster to
be used. -
Select the Set resourcemanager
scheduler address check box and enter the Scheduler address in
the field that appears. -
Select the Set jobhistory
address check box and enter the location of the JobHistory
server of the Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server. -
Select the Set staging
directory check box and enter this directory defined in your
Hadoop cluster for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
such as yarn-site.xml or mapred-site.xml of your distribution. -
If you are accessing the Hadoop cluster running with Kerberos security,
select this check box, then, enter the Kerberos principal names for the
ResourceManager service and the JobHistory service in the displayed fields. This
enables you to use your user name to authenticate against the credentials stored in
Kerberos. These principals can be found in the configuration files of your
distribution, such as in yarn-site.xml and in mapred-site.xml. If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. You need to enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored in the machine in which your Job actually
runs, for example, on a Talend
Jobserver.Note that the user that executes a keytab-enabled Job is not necessarily
the one a principal designates but must have the right to read the keytab file being
used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
situation, ensure that user1 has the right to read the keytab
file to be used. -
The User name field is available when you are not using
Kerberos to authenticate. In the User name field, enter the
login user name for your distribution. If you leave it empty, the user name of the machine
hosting the Studio will be used. -
If the Spark cluster cannot recognize the machine in which the Job is
launched, select the Define the driver hostname or IP
address check box and enter the host name or the IP address of
this machine. This allows the Spark master and its workers to recognize this
machine to find the Job and thus its driver. Note that in this situation, you also need to add the name and the IP
address of this machine to its host file.
Yarn cluster
The Spark driver runs in your Yarn cluster to orchestrate how the Job
should be performed.If you are using the Yarn cluster mode, you need
to define the following parameters in their corresponding fields (if you
leave the check box of a service clear, then at runtime, the
configuration about this parameter in the Hadoop cluster to be used will
be ignored):-
In the Resource manager
field, enter the address of the ResourceManager service of the Hadoop cluster to
be used. -
Select the Set resourcemanager
scheduler address check box and enter the Scheduler address in
the field that appears. -
Select the Set jobhistory
address check box and enter the location of the JobHistory
server of the Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server. -
Select the Set staging
directory check box and enter this directory defined in your
Hadoop cluster for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
such as yarn-site.xml or mapred-site.xml of your distribution. -
Set path to custom Hadoop
configuration JAR: if you are using
connections defined in Repository to
connect to your Cloudera or Hortonworks cluster, you can
select this check box in the
Repository wizard and in the
field that is displayed, specify the path to the JAR file
that provides the connection parameters of your Hadoop
environment. Note that this file must be accessible from the
machine where your Job is launched. This kind of Hadoop configuration JAR file is
automatically generated when you build a Big Data Job from the
Studio. This JAR file is by default named with this pattern:
hadoop-conf-[name_of_the_metadata_in_the_repository]_[name_of_the_context].jar
You can also download this JAR file from the web console of your
cluster or simply create a JAR file yourself by putting the
configuration files in the root of your JAR file. For example:
hdfs-site.xml
core-site.xml
The parameters from your custom JAR file override the parameters
you put in the Spark configuration field.
They also override the configuration you set in the
configuration components such as
tHDFSConfiguration or
tHBaseConfiguration when the related
storage system such as HDFS, HBase or Hive are native to Hadoop.
But they do not override the configuration set in the
configuration components for the third-party storage system such
as tAzureFSConfiguration. -
If you are accessing the Hadoop cluster running with Kerberos security,
select this check box, then, enter the Kerberos principal names for the
ResourceManager service and the JobHistory service in the displayed fields. This
enables you to use your user name to authenticate against the credentials stored in
Kerberos. These principals can be found in the configuration files of your
distribution, such as in yarn-site.xml and in mapred-site.xml. If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. You need to enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored in the machine in which your Job actually
runs, for example, on a Talend
Jobserver.Note that the user that executes a keytab-enabled Job is not necessarily
the one a principal designates but must have the right to read the keytab file being
used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
situation, ensure that user1 has the right to read the keytab
file to be used. -
The User name field is available when you are not using
Kerberos to authenticate. In the User name field, enter the
login user name for your distribution. If you leave it empty, the user name of the machine
hosting the Studio will be used. -
Select the Wait for the Job to complete check box to make your Studio or,
if you use Talend
Jobserver, your Job JVM keep monitoring the Job until the execution of the Job
is over. By selecting this check box, you actually set the spark.yarn.submit.waitAppCompletion property to be true. While
it is generally useful to select this check box when running a Spark Batch Job,
it makes more sense to keep this check box clear when running a Spark Streaming
Job.
Ensure that the user name in the Yarn
client mode is the same one you put in
tHDFSConfiguration, the component used to provide HDFS
connection information to Spark. -
-
With the Yarn client mode, the
Property type list is displayed to allow you
to select an established Hadoop connection from the Repository, on the condition that you have created this connection
in the Repository. Then the Studio will reuse
that set of connection information for this Job.
-
If you need to launch your Job from Windows, it is recommended to specify where
the winutils.exe program to be used is stored.
-
If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box
and enter the directory where your winutils.exe is
stored. -
Otherwise, leave this check box clear; the Studio generates a winutils.exe file
by itself and automatically uses it for this Job.
-
-
In the Spark “scratch” directory
field, enter the local directory in which the Studio stores
temporary files such as the JAR files to be transferred. If you launch the Job
on Windows, the default disk is C:. So if you leave /tmp in this field, this directory is C:/tmp.
-
If you encounter the
hdp.version is not found
issue when
executing your Job, select the Set hdp.version check box
to define the hdp.version variable in your Job and also in
your cluster. For more details, see Set up the hdp.version parameter to resolve the Hortonworks version
issue.
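For reference, the usual Hortonworks workaround for this issue amounts to defining hdp.version as a JVM option on both the Spark driver and the YARN application master sides. A hedged sketch of the corresponding Spark properties is shown below (they could be added, for example, in the Advanced properties table); the version number is a placeholder to replace with the exact HDP version of your cluster.
spark.driver.extraJavaOptions -Dhdp.version=2.6.5.0-292
spark.yarn.am.extraJavaOptions -Dhdp.version=2.6.5.0-292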
-
After the connection is configured, you can tune
the Spark performance, although not required, by following the process explained in:-
for Spark Batch Jobs.
-
for Spark Streaming Jobs.
-
-
It is recommended to activate the Spark logging and
checkpointing system in the Spark configuration tab of the Run view of your Spark
Job, in order to help debug and resume your Spark Job when issues arise:-
.
-
-
If you are using Hortonworks Data Platform V2.4 onwards to run your MapReduce
or Spark Batch Jobs and Apache Atlas has been installed in your Hortonworks
cluster, you can make use of Atlas to trace the lineage of a given data flow
to discover how this data was generated by a Job.
Defining the MapR connection parameters
Complete the MapR connection configuration in the Spark configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
-
Select the type of the Spark cluster you need to connect to.
Standalone
The Studio connects to a Spark-enabled cluster to run the Job from this
cluster.If you are using the Standalone mode, you need to
set the following parameters:-
In the Spark host field, enter the URI
of the Spark Master of the Hadoop cluster to be used. -
In the Spark home field, enter the
location of the Spark executable installed in the Hadoop cluster to be used. -
If the Spark cluster cannot recognize the machine in which the Job is
launched, select the Define the driver hostname or IP
address check box and enter the host name or the IP address of
this machine. This allows the Spark master and its workers to recognize this
machine to find the Job and thus its driver. Note that in this situation, you also need to add the name and the IP
address of this machine to its host file.
Yarn client
The Studio runs the Spark driver to orchestrate how the Job should be
performed and then sends the orchestration to the Yarn service of a given
Hadoop cluster so that the Resource Manager of this Yarn service
requests execution resources accordingly.If you are using the Yarn client
mode, you need to set the following parameters in their corresponding
fields (if you leave the check box of a service clear, then at runtime,
the configuration about this parameter in the Hadoop cluster to be used
will be ignored):-
In the Resource manager
field, enter the address of the ResourceManager service of the Hadoop cluster to
be used. -
Select the Set resourcemanager
scheduler address check box and enter the Scheduler address in
the field that appears. -
Select the Set jobhistory
address check box and enter the location of the JobHistory
server of the Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server. -
Select the Set staging
directory check box and enter this directory defined in your
Hadoop cluster for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
such as yarn-site.xml or mapred-site.xml of your distribution. -
If you are accessing the Hadoop cluster running with Kerberos security,
select this check box, then, enter the Kerberos principal names for the
ResourceManager service and the JobHistory service in the displayed fields. This
enables you to use your user name to authenticate against the credentials stored in
Kerberos. These principals can be found in the configuration files of your
distribution, such as in yarn-site.xml and in mapred-site.xml. If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. You need to enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored in the machine in which your Job actually
runs, for example, on a Talend
Jobserver.Note that the user that executes a keytab-enabled Job is not necessarily
the one a principal designates but must have the right to read the keytab file being
used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
situation, ensure that user1 has the right to read the keytab
file to be used. -
The User name field is available when you are not using
Kerberos to authenticate. In the User name field, enter the
login user name for your distribution. If you leave it empty, the user name of the machine
hosting the Studio will be used. -
If the Spark cluster cannot recognize the machine in which the Job is
launched, select the Define the driver hostname or IP
address check box and enter the host name or the IP address of
this machine. This allows the Spark master and its workers to recognize this
machine to find the Job and thus its driver. A sketch of the corresponding Spark property is given after this step. Note that in this situation, you also need to add the name and the IP
address of this machine to its host file.
Yarn cluster
The Spark driver runs in your Yarn cluster to orchestrate how the Job
should be performed.If you are using the Yarn cluster mode, you need
to define the following parameters in their corresponding fields (if you
leave the check box of a service clear, then at runtime, the
configuration about this parameter in the Hadoop cluster to be used will
be ignored):-
In the Resource manager
field, enter the address of the ResourceManager service of the Hadoop cluster to
be used. -
Select the Set resourcemanager
scheduler address check box and enter the Scheduler address in
the field that appears. -
Select the Set jobhistory
address check box and enter the location of the JobHistory
server of the Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server. -
Select the Set staging
directory check box and enter this directory defined in your
Hadoop cluster for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
such as yarn-site.xml or mapred-site.xml of your distribution. -
Set path to custom Hadoop
configuration JAR: if you are using
connections defined in Repository to
connect to your Cloudera or Hortonworks cluster, you can
select this check box in the
Repository wizard and in the
field that is displayed, specify the path to the JAR file
that provides the connection parameters of your Hadoop
environment. Note that this file must be accessible from the
machine where your Job is launched. This kind of Hadoop configuration JAR file is
automatically generated when you build a Big Data Job from the
Studio. This JAR file is by default named with this pattern:
hadoop-conf-[name_of_the_metadata_in_the_repository]_[name_of_the_context].jar
You can also download this JAR file from the web console of your
cluster or simply create a JAR file yourself by putting the
configuration files in the root of your JAR file. For example:
hdfs-site.xml
core-site.xml
The parameters from your custom JAR file override the parameters
you put in the Spark configuration field.
They also override the configuration you set in the
configuration components such as
tHDFSConfiguration or
tHBaseConfiguration when the related
storage system such as HDFS, HBase or Hive are native to Hadoop.
But they do not override the configuration set in the
configuration components for the third-party storage system such
as tAzureFSConfiguration. -
If you are accessing the Hadoop cluster running with Kerberos security,
select this check box, then, enter the Kerberos principal names for the
ResourceManager service and the JobHistory service in the displayed fields. This
enables you to use your user name to authenticate against the credentials stored in
Kerberos. These principals can be found in the configuration files of your
distribution, such as in yarn-site.xml and in mapred-site.xml. If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. You need to enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored in the machine in which your Job actually
runs, for example, on a Talend
Jobserver. Note that the user that executes a keytab-enabled Job is not necessarily
the one a principal designates but must have the right to read the keytab file being
used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
situation, ensure that user1 has the right to read the keytab
file to be used. -
The User name field is available when you are not using
Kerberos to authenticate. In the User name field, enter the
login user name for your distribution. If you leave it empty, the user name of the machine
hosting the Studio will be used. -
Select the Wait for the Job to complete check box to make your Studio or,
if you use Talend
Jobserver, your Job JVM keep monitoring the Job until the execution of the Job
is over. By selecting this check box, you actually set the spark.yarn.submit.waitAppCompletion property to be true. While
it is generally useful to select this check box when running a Spark Batch Job,
it makes more sense to keep this check box clear when running a Spark Streaming
Job.
Ensure that the user name in the Yarn
client mode is the same one you put in
tHDFSConfiguration, the component used to provide HDFS
connection information to Spark. -
-
With the Yarn client mode, the
Property type list is displayed to allow you
to select an established Hadoop connection from the Repository, on the condition that you have created this connection
in the Repository. Then the Studio will reuse
that set of connection information for this Job.
-
If you launch your Job from Windows, it is recommended to specify where
the winutils.exe program to be used is stored.
-
If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box
and enter the directory where your winutils.exe is
stored. -
Otherwise, leave this check box clear; the Studio then generates one
by itself and automatically uses it for this Job.
-
-
Verify, for example with your cluster administrator, whether your MapR cluster
is secured with the MapR ticket authentication mechanism.-
If the MapR cluster to be used is secured with the MapR ticket authentication mechanism,
set the MapR ticket authentication configuration by following the explanation in Setting up the MapR ticket authentication. -
Otherwise, leave the Use MapR Ticket
authentication check box clear.
-
-
In the Spark “scratch” directory
field, enter the directory in which the Studio stores, in the local system, the
temporary files such as the JAR files to be transferred. If you launch the Job
on Windows, the default disk is C:. So if you leave /tmp in this field, this directory is C:/tmp.
-
After the connection is configured, you can, although it is not required, tune
the Spark performance by following the process explained in:-
for Spark Batch Jobs.
-
for Spark Streaming Jobs.
-
-
It is recommended to activate the Spark logging and
checkpointing system in the Spark configuration tab of the Run view of your Spark
Job, in order to help debug and resume your Spark Job when issues arise:-
.
-
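As a reference for the Set path to custom Hadoop configuration JAR option described above: a JAR file is a ZIP archive, so a custom configuration JAR can be assembled by hand by placing the configuration files at its root. The following is a minimal Python sketch, not the Studio's own tooling; the file names and the output name are examples only.

    # Package local copies of the cluster configuration files into a JAR
    # whose root contains the files themselves.
    import zipfile

    config_files = ["core-site.xml", "hdfs-site.xml", "yarn-site.xml", "mapred-site.xml"]

    with zipfile.ZipFile("hadoop-conf-custom.jar", "w") as jar:
        for name in config_files:
            jar.write(name, arcname=name)  # keep each file at the root of the archive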
Creating an HDInsight cluster on Azure
Create on the Azure portal the HDInsight cluster to be used.
- You have an Azure account with appropriate rights and permissions to the
HDInsight service.
- Navigate to the Azure portal: https://portal.azure.com/.
-
Click All services on the menu bar on the left of the
Azure welcome page. - In the Analytics section, click HDInsight clusters to open the corresponding blade.
- Click Add to create an HDInsight cluster.
-
Click Quick Create to display the Basics blade and enter the basic configuration information on this blade.
Among the parameters, the Cluster
type to be used must be the one officially supported by Talend. For example, select
Spark 2.1.0 (HDI 3.6). For further information about the supported versions, search for supported
Big Data platforms on Talend Help Center (https://help.talend.com). For further information about the parameters on this blade,
see Create clusters from Azure
documentation. -
Click Next to open the Storage
blade to set the storage settings for the cluster.-
From the Primary storage type list, select
Azure storage. -
For the storage account to be used, select the My
subscriptions radio button, then click Select
a storage account and choose the account to be used from
the blade that is displayed. -
For the other parameters, leave them as they are to use the default
values or enter the values you want to use.
-
-
Click Next to pass to the Summary
step, in which you review and confirm the configuration made in the previous
steps. - Once you confirm the configuration, click Create.
Once created, the cluster appears in your HDInsight clusters list.
Defining the HD Insight connection parameters
Complete the HD Insight connection configuration in the Spark
configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
Only the Yarn cluster mode is available for this type of
cluster.
-
Enter the basic connection information to Microsoft HD Insight:
Livy configuration
- The Hostname of
Livy is the URL of your HDInsight cluster. This URL can be
found in the Overview blade of your
cluster. Enter this URL without the https:// part. - The default Port is 443.
- The Username is the one defined when
creating your cluster. You can find it in the SSH
+ Cluster login blade of your cluster.
For
further information about the Livy service used by HD Insight, see
Submit Spark
jobs using Livy. (A connectivity-check sketch is given at the end of this section.)
HDInsight
configuration- The Username is the one defined when
creating your cluster. You can find it in the SSH
+ Cluster login blade of your cluster. - The Password is defined when creating your HDInsight
cluster for authentication to this cluster.
Windows Azure Storage configuration
Enter the address and the authentication information of the Azure Storage
account to be used. In this configuration, you do not define where to read or write
your business data but only where to deploy your Job. Therefore, always use
the Azure Storage system for this configuration. In the Container field, enter the name
of the container to be
used. You can
find the available containers in the Blob blade of the Azure
Storage account to be used. In the Deployment Blob field, enter the
location in which you want to store the current Job and its dependent libraries in
this Azure Storage account. In the Hostname field, enter the
Primary Blob Service Endpoint of your Azure Storage account without the https:// part. You can find this endpoint in the Properties blade of this storage account. In the Username field, enter the name of the Azure Storage account to be used.
In the Password field, enter the access key of the Azure Storage account to be used. This key can be found in the Access keys blade of this storage account.
-
-
In the Spark “scratch” directory
field, enter the directory in which the Studio stores, in the local system, the
temporary files such as the JAR files to be transferred. If you launch the Job
on Windows, the default disk is C:. So if you leave /tmp in this field, this directory is C:/tmp.
-
Select the Wait for the Job to complete check box to make your Studio or,
if you use Talend
Jobserver, your Job JVM keep monitoring the Job until the execution of the Job
is over. By selecting this check box, you actually set the spark.yarn.submit.waitAppCompletion property to be true. While
it is generally useful to select this check box when running a Spark Batch Job,
it makes more sense to keep this check box clear when running a Spark Streaming
Job.
-
After the connection is configured, you can, although it is not required, tune
the Spark performance by following the process explained in:-
for Spark Batch Jobs.
-
for Spark Streaming Jobs.
-
-
It is recommended to activate the Spark logging and
checkpointing system in the Spark configuration tab of the Run view of your Spark
Job, in order to help debug and resume your Spark Job when issues arise:-
.
-
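If you need to verify outside the Studio that the Livy service of your HDInsight cluster is reachable with the hostname, the port 443 and the cluster login entered above, the following is a minimal sketch using Python and the requests library; the cluster name and the credentials are placeholders.

    import requests

    # The same host, without the https:// part, goes into the Hostname of Livy field.
    livy_url = "https://your-cluster.azurehdinsight.net/livy/batches"

    # Cluster login user name and password, as defined when creating the cluster.
    response = requests.get(livy_url, auth=("your_cluster_login", "your_password"))
    print(response.status_code)  # 200 indicates that the Livy service answered
    print(response.json())       # the batch sessions currently known to Livy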
Defining the Cloudera Altus connection parameters
Complete the Altus connection configuration in the Spark
configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
Only the Yarn cluster mode is available for this type of cluster.
Prerequisites: the Cloudera Altus CLI must have been installed and activated on the machine in which your
Job is executed:
-
To install the Cloudera Altus CLI on Linux, see Cloudera Altus Client Setup for
Linux from the Cloudera documentation. -
To install the Cloudera Altus CLI on Windows, see Cloudera Altus Client Setup for
Windows from the Cloudera documentation.
-
In the Spark configuration tab of the
Run view of your Job, enter the basic connection
information to Cloudera Altus.
Force Cloudera Altus credentials
Select this check box to provide the credentials with your Job. If you want to provide the credentials separately, for example manually by running the altus configure command in your terminal, clear this check box.
Path to Cloudera Altus CLI
Enter the path to the Cloudera Altus
client, which must have been installed and activated in the
machine in which your Job is executed. In production
environments, this machine is typically a Talend Jobserver. -
Configure the virtual Cloudera cluster to be used.
Use an existing Cloudera Altus cluster
Select this check box to use a Cloudera Altus cluster already existing in your Cloud service. Otherwise, leave this check box clear to allow the Job to create a cluster on the fly. With this check box selected, only the Cluster name parameter is useful and the other parameters for the cluster configuration are hidden.
Cluster name
Enter the name of the cluster to be used.
Environment
Enter the name of the Cloudera Altus environment to be used to describe the resources allocated to the given cluster. If you do not know which environment to select, contact your Cloudera Altus administrator.
Delete cluster after execution
Select this check box if you want to remove the given cluster after the execution of your Job.
Override with a JSON configuration
Select this check box to manually edit JSON code in the Custom JSON field that is displayed to configure the cluster.
Instance type
Select the instance type for the instances in the cluster. All nodes that are deployed in this cluster use the same instance type.
Worker node
Enter the number of worker nodes to be created for the cluster. For details about the allowed number of worker nodes, see the documentation of Cloudera Altus.
Cloudera Manager username and Cloudera Manager password
Enter the authentication information to your Cloudera Manager service.
SSH private key
Browse, or enter the path to the SSH private key in order to upload and register it in the region specified in the Cloudera Altus environment. The Data Engineering service of Cloudera Altus uses this private key to access and configure instances of the cluster to be used.
Custom bootstrap script
If you want to create a cluster with a bootstrap script you provide, browse, or enter the path to this script in the Custom Bootstrap script field. For an example of an Altus bootstrap script, see Install a custom Python environment when creating a cluster from the Cloudera documentation. -
From the Cloud provider list, select the Cloud service
that runs your Cloudera Altus cluster.-
If your cloud provider is AWS, select AWS and
define the Amazon S3 directory in which you store your Job
dependencies.AWS
-
Access key and
Secret key:
enter the authentication information required to connect to the Amazon
S3 bucket to be used.To enter the password, click the […] button next to the
password field, and then in the pop-up dialog box enter the password between double quotes
and click OK to save the settings. -
Specify the AWS region by selecting a region name from the
list or entering a region between double quotation marks (e.g. “us-east-1”) in the list. For more information about the AWS
Region, see Regions and Endpoints. -
S3 bucket name: enter the name of the bucket
to be used to store the dependencies of your Job. This bucket must already
exist. -
S3 storage path: enter the directory in
which you want to store the dependencies of your Job in this given bucket,
for example, altus/jobjar. This directory is created if
it does not exist at runtime.
The Amazon S3 you specify here is used to store your Job dependencies
only. To connect to the S3 system which hosts your actual data, use a
tS3Configuration component in your Job. -
-
If your cloud provider is Azure, select Azure to
store your Job dependencies in your Azure Data Lake Storage.-
In your Azure portal, assign the Read/Write/Execute permissions
to the Azure application to be used by the Job to access your
Azure Data Lake Storage. For details about how to assign
permissions, see Azure documentation: Assign the Azure AD
application to the Azure Data Lake Storage account file or
folder. Without appropriate permissions, your Job dependencies cannot be
transferred to your Azure Data Lake Storage. -
In your Altus console, identify the Data Lake Storage AAD Group
Name used by your Altus environment in the Instance
Settings section. -
In your Azure portal, assign the Read/Write/Execute permissions
to this AAD group using the same procedure explained in Azure
documentation: Assign the Azure AD
application to the Azure Data Lake Storage account file or
folder. Without appropriate permissions, your Job dependencies cannot be
transferred to your Azure Data Lake Storage. -
In the Spark configuration tab, configure
the connection to your Azure Data Lake Storage.
Azure (technical preview)
-
ADLS account FQDN:
Enter the address without the scheme part of the Azure Data Lake Storage
account to be used, for example,
ychendls.azuredatalakestore.net. This account must already exist in your Azure portal.
-
Azure App ID and Azure App
key: In the
Client ID and the Client
key fields, enter, respectively, the authentication
ID and the authentication key generated upon the registration of the
application that the current Job you are developing uses to access
Azure Data Lake Storage. This application must be the one to
which you assigned permissions to access your Azure Data Lake Storage in
the previous step. -
Token endpoint:
In the
Token endpoint field, copy and paste the
OAuth 2.0 token endpoint that you can obtain from the
Endpoints list accessible on the
App registrations page on your Azure
portal. (A sketch showing how this endpoint and the application credentials are used to request a token is given after this list.)
-
The Azure Data Lake Storage you specify here is used to store your Job
dependencies only. To connect to the Azure system which hosts your
actual data, use a tAzureFSConfiguration component in
your Job. -
-
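For reference, the Azure App ID, the Azure App key and the OAuth 2.0 token endpoint entered above are what a client uses to obtain an access token for Azure Data Lake Storage. The following is a minimal sketch with Python and the requests library; the endpoint and the credentials are placeholders to be replaced with the values from your Azure portal.

    import requests

    # OAuth 2.0 token endpoint copied from the Endpoints list of the App registrations page.
    token_endpoint = "https://login.microsoftonline.com/<tenant-id>/oauth2/token"

    payload = {
        "grant_type": "client_credentials",
        "client_id": "<azure-app-id>",        # Azure App ID (Client ID)
        "client_secret": "<azure-app-key>",   # Azure App key (Client key)
        "resource": "https://datalake.azure.net/",
    }

    token = requests.post(token_endpoint, data=payload).json()["access_token"]
    print(token[:20], "...")  # bearer token used to access Azure Data Lake Storage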
-
Select the Wait for the Job to complete check box to make your Studio or,
if you use Talend
Jobserver, your Job JVM keep monitoring the Job until the execution of the Job
is over. By selecting this check box, you actually set the spark.yarn.submit.waitAppCompletion property to be true. While
it is generally useful to select this check box when running a Spark Batch Job,
it makes more sense to keep this check box clear when running a Spark Streaming
Job.
-
After the connection is configured, you can, although it is not required, tune
the Spark performance by following the process explained in:-
for Spark Batch Jobs.
-
for Spark Streaming Jobs.
-
-
It is recommended to activate the Spark logging and
checkpointing system in the Spark configuration tab of the Run view of your Spark
Job, in order to help debug and resume your Spark Job when issues arise:-
.
-
-
If you need to consult the Altus related logs, check them in your Cloudera
Manager service or on your Altus cluster instances.
Configuring a Spark stream for your Apache Spark streaming Job
-
In the Batch size
field, enter the time interval at the end of which the Job reviews the source
data to identify changes and processes the new micro batches. -
If need be, select the Define a streaming
timeout check box and in the field that is displayed, enter the
time frame at the end of which the streaming Job automatically stops
running.
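In Spark terms, the Batch size field corresponds to the batch interval of the StreamingContext and the streaming timeout to a termination timeout. The following is a minimal PySpark sketch for illustration only; the input source and the values are placeholders.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="streaming-sketch")
    ssc = StreamingContext(sc, batchDuration=5)        # Batch size: review the source data every 5 seconds

    lines = ssc.socketTextStream("localhost", 9999)    # hypothetical input stream
    lines.pprint()

    ssc.start()
    ssc.awaitTerminationOrTimeout(3600)                # streaming timeout: stop automatically after one hour
    ssc.stop(stopSparkContext=True, stopGraceFully=True)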
Tuning Spark for Apache Spark Batch Jobs
You can define the tuning parameters in the Spark
configuration tab of the Run view of your Spark
Job to obtain better performance of the Job, if the default values of these parameters
do not produce sufficient performance.
Generally speaking, Spark performs better with a small number of large tasks than with a large number of small tasks.
-
Select the Set Tuning properties
check box to optimize the allocation of the resources to be used to run this
Job. These properties are not mandatory for the Job to run successfully, but
they are useful when Spark is bottlenecked by any resource issue in the cluster
such as CPU, bandwidth or memory. -
Calculate the initial resource allocation as the starting point of the
tuning. A generic formula for this calculation is as follows (a worked example is given after this list):-
Number of executors = (Total cores of the cluster) / 2
-
Number of cores per executor = 2
-
Memory per executor = (Up to total memory of the cluster) / (Number of executors)
-
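A worked example of this formula, assuming a hypothetical cluster with 40 cores and 160 GB of memory in total:

    total_cores = 40
    total_memory_gb = 160

    num_executors = total_cores // 2                           # 40 / 2 = 20 executors
    cores_per_executor = 2
    memory_per_executor_gb = total_memory_gb / num_executors   # up to 160 / 20 = 8 GB per executor

    print(num_executors, cores_per_executor, memory_per_executor_gb)  # 20 2 8.0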
-
Define each parameter and, if needed, revise them until you obtain
satisfactory performance. The following table provides the exhaustive list of the tuning properties.
The actual properties available in the Spark
configuration tab can vary depending on the distribution you
are using. (A sketch of the corresponding Spark properties is given at the end of this section.)
Spark Standalone mode
-
Driver memory and Driver
core: enter the allocation size of memory and
the number of cores to be used by the driver of the current
Job. -
Executor memory: enter the allocation size
of memory to be used by each Spark executor. -
Core per executor: select this check box
and in the displayed field, enter the number of cores to be used
by each executor. If you leave this check box clear, the default
allocation defined by Spark is used, for example, all available
cores are used by one single executor in the
Standalone mode. -
Set Web UI port: if you need to change the
default port of the Spark Web UI, select this check box and
enter the port number you want to use. -
Broadcast factory: select the broadcast
implementation to be used to cache variables on each worker
machine. -
Customize Spark serializer: if you need to
import an external Spark serializer, select this check box and
in the field that is displayed, enter the fully qualified class
name of the serializer to be used. - Job progress polling rate (in
ms): when using Spark V2.3 and onwards, enter the
time interval (in milliseconds) at the end of which you want the
Studio to ask Spark for the execution progress of your Job. Before
V2.3, Spark automatically sends this information to the Studio when
updates occur; the default value, 50 milliseconds, of this parameter
allows the Studio to reproduce more or less the same scenario with
Spark V2.3 and onwards. If you set this interval too long, you may
lose information about the progress; if too short, you may send
too many requests to Spark for only insignificant progress
information.
Spark Yarn client mode
-
Set application master tuning properties:
select this check box and in the fields that are displayed,
enter the amount of memory and the number of CPUs to be
allocated to the ApplicationMaster service of your cluster. If you want to use the default allocation of your cluster, leave
this check box clear. -
Executor memory: enter the allocation size
of memory to be used by each Spark executor. -
Set executor memory overhead: select this
check box and in the field that is displayed, enter the amount
of off-heap memory (in MB) to be allocated per executor. This is
actually the spark.yarn.executor.memoryOverhead property. -
Core per executor: select this check box
and in the displayed field, enter the number of cores to be used
by each executor. If you leave this check box clear, the default
allocation defined by Spark is used, for example, all available
cores are used by one single executor in the
Standalone mode. -
Yarn resource allocation: select how you
want Yarn to allocate resources among executors.-
Auto: you let Yarn use its
default number of executors. This number is
2. -
Fixed: you need to enter the
number of executors to be used in the Num
executors field that is displayed. -
Dynamic: Yarn adapts the
number of executors to suit the workload. You need
to define the scale of this dynamic allocation by
defining the initial number of executors to run in
the Initial executors field,
the lowest number of executors in the Min
executors field and the largest number
of executors in the Max
executors field.
-
-
Set Web UI port: if you need to change the
default port of the Spark Web UI, select this check box and
enter the port number you want to use. -
Broadcast factory: select the broadcast
implementation to be used to cache variables on each worker
machine. -
Customize Spark serializer: if you need to
import an external Spark serializer, select this check box and
in the field that is displayed, enter the fully qualified class
name of the serializer to be used. - Job progress polling rate (in
ms): when using Spark V2.3 and onwards, enter the
time interval (in milliseconds) at the end of which you want the
Studio to ask Spark for the execution progress of your Job. Before
V2.3, Spark automatically sends this information to the Studio when
updates occur; the default value, 50 milliseconds, of this parameter
allows the Studio to reproduce more or less the same scenario with
Spark V2.3 and onwards. If you set this interval too long, you may
lose information about the progress; if too short, you may send
too many requests to Spark for only insignificant progress
information.
Spark Yarn cluster mode
-
Driver memory and Driver
core: enter the allocation size of memory and
the number of cores to be used by the driver of the current
Job. -
Executor memory: enter the allocation size
of memory to be used by each Spark executor. -
Set executor memory overhead: select this
check box and in the field that is displayed, enter the amount
of off-heap memory (in MB) to be allocated per executor. This is
actually the spark.yarn.executor.memoryOverhead property. -
Core per executor: select this check box
and in the displayed field, enter the number of cores to be used
by each executor. If you leave this check box clear, the default
allocation defined by Spark is used, for example, all available
cores are used by one single executor in the
Standalone mode. -
Yarn resource allocation: select how you
want Yarn to allocate resources among executors.-
Auto: you let Yarn use its
default number of executors. This number is
2. -
Fixed: you need to enter the
number of executors to be used in the Num
executors field that is displayed. -
Dynamic: Yarn adapts the
number of executors to suit the workload. You need
to define the scale of this dynamic allocation by
defining the initial number of executors to run in
the Initial executors field,
the lowest number of executors in the Min
executors field and the largest number
of executors in the Max
executors field.
-
-
Set Web UI port: if you need to change the
default port of the Spark Web UI, select this check box and
enter the port number you want to use. -
Broadcast factory: select the broadcast
implementation to be used to cache variables on each worker
machine. -
Customize Spark serializer: if you need to
import an external Spark serializer, select this check box and
in the field that is displayed, enter the fully qualified class
name of the serializer to be used. - Job progress polling rate (in
ms): when using Spark V2.3 and onwards, enter the
time interval (in milliseconds) at the end of which you want the
Studio to ask Spark for the execution progress of your Job. Before
V2.3, Spark automatically sends this information to the Studio when
updates occur; the default value, 50 milliseconds, of this parameter
allows the Studio to reproduce more or less the same scenario with
Spark V2.3 and onwards. If you set this interval too long, you may
lose information about the progress; if too short, you may send
too many requests to Spark for only insignificant progress
information.
-
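As an indication only, the tuning fields described above commonly map to standard Spark properties. The following sketch shows how the equivalent properties would look if set programmatically; the values are placeholders, and in the Studio these settings are made through the Spark configuration tab rather than in code.

    from pyspark import SparkConf

    conf = (
        SparkConf()
        .set("spark.driver.memory", "2g")                    # Driver memory
        .set("spark.driver.cores", "1")                      # Driver core
        .set("spark.executor.memory", "8g")                  # Executor memory
        .set("spark.yarn.executor.memoryOverhead", "1024")   # Set executor memory overhead (MB)
        .set("spark.executor.cores", "2")                    # Core per executor
        .set("spark.dynamicAllocation.enabled", "true")      # Yarn resource allocation: Dynamic
        .set("spark.dynamicAllocation.initialExecutors", "4")
        .set("spark.dynamicAllocation.minExecutors", "2")
        .set("spark.dynamicAllocation.maxExecutors", "20")
    )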
Tuning Spark for Apache Spark Streaming Jobs
You can define the tuning parameters in the Spark
configuration tab of the Run view of your Spark
Job to obtain better performance of the Job, if the default values of these parameters
do not produce sufficient performance.
Generally speaking, Spark performs better with a small number of large tasks than with a large number of small tasks.
-
Select the Set Tuning properties
check box to optimize the allocation of the resources to be used to run this
Job. These properties are not mandatory for the Job to run successfully, but
they are useful when Spark is bottlenecked by any resource issue in the cluster
such as CPU, bandwidth or memory. -
Calculate the initial resource allocation as the starting point of the
tuning. A generic formula for this calculation is as follows:-
Number of executors = (Total cores of the cluster) / 2
-
Number of cores per executor = 2
-
Memory per executor = (Up to total memory of the cluster) /
(Number of executors)
-
-
Define each parameter and, if needed, revise them until you obtain
satisfactory performance.
Spark Standalone mode
-
Driver memory and Driver
core: enter the allocation size of memory and
the number of cores to be used by the driver of the current
Job. -
Executor memory: enter the allocation size
of memory to be used by each Spark executor. -
Core per executor: select this check box
and in the displayed field, enter the number of cores to be used
by each executor. If you leave this check box clear, the default
allocation defined by Spark is used, for example, all available
cores are used by one single executor in the
Standalone mode. -
Set Web UI port: if you need to change the
default port of the Spark Web UI, select this check box and
enter the port number you want to use. -
Broadcast factory: select the broadcast
implementation to be used to cache variables on each worker
machine. -
Customize Spark serializer: if you need to
import an external Spark serializer, select this check box and
in the field that is displayed, enter the fully qualified class
name of the serializer to be used. -
Activate backpressure: select this check
box to enable the backpressure feature of Spark. The
backpressure feature is available in the Spark version 1.5 and
onwards. With backpressure enabled, Spark automatically finds the
optimal receiving rate and dynamically adapts the rate based on
current batch scheduling delays and processing time, in order to
receive data only as fast as it can process.
Spark Yarn client mode
-
Set application master tuning properties:
select this check box and in the fields that are displayed,
enter the amount of memory and the number of CPUs to be
allocated to the ApplicationMaster service of your cluster. If you want to use the default allocation of your cluster, leave
this check box clear. -
Executor memory: enter the allocation size
of memory to be used by each Spark executor. -
Set executor memory overhead: select this
check box and in the field that is displayed, enter the amount
of off-heap memory (in MB) to be allocated per executor. This is
actually the spark.yarn.executor.memoryOverhead property. -
Core per executor: select this check box
and in the displayed field, enter the number of cores to be used
by each executor. If you leave this check box clear, the default
allocation defined by Spark is used, for example, all available
cores are used by one single executor in the
Standalone mode. -
Yarn resource allocation: select how you
want Yarn to allocate resources among executors.-
Auto: you let Yarn use its
default number of executors. This number is
2. -
Fixed: you need to enter the
number of executors to be used in the Num
executors field that is displayed. -
Dynamic: Yarn adapts the
number of executors to suit the workload. You need
to define the scale of this dynamic allocation by
defining the initial number of executors to run in
the Initial executors field,
the lowest number of executors in the Min
executors field and the largest number
of executors in the Max
executors field.
-
-
Set Web UI port: if you need to change the
default port of the Spark Web UI, select this check box and
enter the port number you want to use. -
Broadcast factory: select the broadcast
implementation to be used to cache variables on each worker
machine. -
Customize Spark serializer: if you need to
import an external Spark serializer, select this check box and
in the field that is displayed, enter the fully qualified class
name of the serializer to be used. -
Activate backpressure: select this check
box to enable the backpressure feature of Spark. The
backpressure feature is available in the Spark version 1.5 and
onwards. With backpressure enabled, Spark automatically finds the
optimal receiving rate and dynamically adapts the rate based on
current batch scheduling delays and processing time, in order to
receive data only as fast as it can process.
Spark Yarn cluster mode
-
Driver memory and Driver
core: enter the allocation size of memory and
the number of cores to be used by the driver of the current
Job. -
Executor memory: enter the allocation size
of memory to be used by each Spark executor. -
Set executor memory overhead: select this
check box and in the field that is displayed, enter the amount
of off-heap memory (in MB) to be allocated per executor. This is
actually the spark.yarn.executor.memoryOverhead property. -
Core per executor: select this check box
and in the displayed field, enter the number of cores to be used
by each executor. If you leave this check box clear, the default
allocation defined by Spark is used, for example, all available
cores are used by one single executor in the
Standalone mode. -
Yarn resource allocation: select how you
want Yarn to allocate resources among executors.-
Auto: you let Yarn use its
default number of executors. This number is
2. -
Fixed: you need to enter the
number of executors to be used in the Num
executors field that is displayed. -
Dynamic: Yarn adapts the
number of executors to suit the workload. You need
to define the scale of this dynamic allocation by
defining the initial number of executors to run in
the Initial executors field,
the lowest number of executors in the Min
executors field and the largest number
of executors in the Max
executors field.
-
-
Set Web UI port: if you need to change the
default port of the Spark Web UI, select this check box and
enter the port number you want to use. -
Broadcast factory: select the broadcast
implementation to be used to cache variables on each worker
machine. -
Customize Spark serializer: if you need to
import an external Spark serializer, select this check box and
in the field that is displayed, enter the fully qualified class
name of the serializer to be used. -
Activate backpressure: select this check
box to enable the backpressure feature of Spark. The
backpressure feature is available in the Spark version 1.5 and
onwards. With backpressure enabled, Spark automatically finds the
optimal receiving rate and dynamically adapts the rate based on
current batch scheduling delays and processing time, in order to
receive data only as fast as it can process.
-
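For reference, the backpressure feature described above is controlled in Spark by the spark.streaming.backpressure.enabled property; a minimal sketch for illustration only:

    from pyspark import SparkConf

    # Let Spark adapt the receiving rate to the current batch scheduling delays and processing time.
    conf = SparkConf().set("spark.streaming.backpressure.enabled", "true")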
Logging and checkpointing the activities of your Apache Spark Job
It is recommended to activate the Spark logging and checkpointing system in the Spark configuration tab of the Run
view of your Spark Job, in order to help debug and resume your Spark Job when issues
arise.
-
If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the
Spark checkpointing operation. In the field that is displayed, enter the
directory in which Spark stores, in the file system of the cluster, the context
data of the computations such as the metadata and the generated RDDs of this
computation.For further information about the Spark checkpointing operation, see http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing .
-
In the Yarn client mode or the
Yarn cluster mode, you can enable the Spark
application logs of this Job to be persistent in the file system. To do this,
select the Enable Spark event logging check
box. The parameters relevant to Spark logs are displayed (a sketch of the corresponding Spark properties is given at the end of this section):-
Spark event logs directory:
enter the directory in which Spark events are logged. This is actually
the spark.eventLog.dir property. -
Spark history server address:
enter the location of the history server. This is actually the spark.yarn.historyServer.address
property. -
Compress Spark event logs: if
need be, select this check box to compress the logs. This is actually
the spark.eventLog.compress property.
Since the administrator of your cluster could have
defined these properties in the cluster configuration files, it is recommended
to contact the administrator for the exact values. -
-
If you want the configuration of the Spark context started by your Job to be printed in the log, add the
spark.logConf property in the Advanced
properties table and enter, within double quotation marks,
true in the Value column of
this table.Since the administrator of your cluster could have
defined these properties in the cluster configuration files, it is recommended
to contact the administrator for the exact values.
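For reference, the options described above rely on standard Spark properties. The following sketch shows how the same properties would look if set programmatically; the directory and the server address are placeholders to be replaced with the values provided by your cluster administrator.

    from pyspark import SparkConf

    conf = (
        SparkConf()
        .set("spark.eventLog.enabled", "true")                           # Enable Spark event logging
        .set("spark.eventLog.dir", "hdfs://namenode:8020/spark-logs")    # Spark event logs directory
        .set("spark.yarn.historyServer.address", "historyserver:18080")  # Spark history server address
        .set("spark.eventLog.compress", "true")                          # Compress Spark event logs
        .set("spark.logConf", "true")                                    # print the Spark context configuration in the log
    )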
Adding advanced Spark properties to solve issues
Depending on the distribution you are using or the issues you encounter, you may need to
add specific Spark properties to the Advanced properties table in
the Spark configuration tab of the Run
view of your Job.
Alternatively, define a Hadoop connection metadata in the
Repository and in its wizard, select the Use Spark
properties check box to open the properties table and add the property
or properties to be used, for example, from spark-defaults.conf of
your cluster. When you reuse this connection in your Apache Spark Jobs, the advanced
Spark properties you have added there are automatically added to the Spark
configurations for those Jobs.
The advanced properties required by different Hadoop distributions or by some common
issues and their values are listed below:
For further information about the valid Spark properties, see Spark
documentation at https://spark.apache.org/docs/latest/configuration.
Specific Spark timeout
When encountering network issues, Spark by Add the following properties to

Hortonworks Data Platform V2.4
In addition, you need to add -Dhdp.version=2.4.0.0-169

MapR V5.1 and V5.2
When the cluster is used with the HBase or the MapRDB components:
spark.hadoop.yarn.application.classpath: enter the value of this For example, if the HBase version installed in the cluster is 1.1.1, copy and paste all the paths defined with the spark.hadoop.yarn.application.classpath parameter from your For a step-by-step explanation about how to add this

Security
In the machine where the Studio with Big Data is installed, some scanning