Spark configuration
The Spark configuration view contains the Spark-specific properties that you can define for your Job, depending on the distribution and the Spark mode you are using.
The information in this section applies only to users who have subscribed to Talend Data Fabric or to any Talend product with Big Data; it is not applicable to Talend Open Studio for Big Data users.
Defining the connection to the Azure Storage account to be used in the Studio
Set up this connection in the Repository of the Studio so that you can reuse it in your Jobs.
- You have an Azure account with appropriate rights and permissions to Azure Storage.
- The Azure Storage account to be used has been properly created and you have the appropriate permissions to access it. For further information about Azure Storage, see Azure Storage tutorials from the Azure documentation.
- You are using one of the Talend solutions with Big Data.
-
Obtain the access key to the Azure Storage account to be used on https://portal.azure.com/.
- Click All services on the menu bar on the left of the Azure welcome page.
- Click Storage accounts in the STORAGE section.
- Click the storage account to be used.
- On the list that is displayed, click Access keys to open the corresponding blade.
- Copy the key that is displayed and keep it somewhere appropriate so as to use it in the steps to come.
-
In the Integration perspective
of the Studio, expand the Metadata node in the
Repository, right-click the Azure
Storage node and, from the contextual menu, select
Create an Azure Storage Connection to open the
wizard. -
Complete the fields in the wizard:
- Name: enter the name you want to use for the connection to be defined.
- Account Name: enter the name of the Azure Storage account to be connected to.
- Account Key: enter the access key you obtained in the previous steps.
-
Click Test connection to verify the configuration. Once
a message pops up to say that the connection is successful, the
Next button is activated. -
Click Next to access the list of containers available on
Azure under this Azure Storage account. This list is empty if this Azure Storage account does not contain any
containers. -
Select the container to connect to and click Next or
just click Next to skip this step. You can revise this
step anytime later by coming back to this wizard. -
Do the same for the queue list and the table list that are respectively
displayed in the wizard. -
Click Finish to validate the creation. The connection
appears under the Azure Storage node in the
Repository.
You can now reuse this connection in your Jobs to work with the Azure services that are associated with this Azure Storage
account.
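If you want to double-check the account name and access key outside the wizard, the following Scala sketch lists the content of a container through the Hadoop Azure connector. This is only an illustration, not Studio-generated code: it assumes the hadoop-azure and azure-storage libraries are on the classpath, and the account, container and key values are placeholders.
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object AzureStorageConnectionCheck {
  def main(args: Array[String]): Unit = {
    val account = "mystorageaccount"   // placeholder: your Azure Storage account name
    val container = "mycontainer"      // placeholder: an existing container
    val key = "<access-key>"           // placeholder: the key copied from the Access keys blade

    val conf = new Configuration()
    conf.set(s"fs.azure.account.key.$account.blob.core.windows.net", key)
    // Depending on the Hadoop version, the wasbs scheme may also need fs.wasbs.impl to be set explicitly.

    // List the content of the container to confirm that the account name and key are valid.
    val fs = FileSystem.get(new URI(s"wasbs://$container@$account.blob.core.windows.net/"), conf)
    fs.listStatus(new Path("/")).foreach(status => println(status.getPath))
  }
}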
Defining the Azure Databricks connection parameters for Spark Jobs
Complete the Databricks connection configuration in the Spark configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
- When running a Spark Streaming Job, only one Job is allowed to run on the same Databricks cluster at a time.
- When running a Spark Batch Job, you can send more than one Job to run in parallel on the same Databricks cluster only if you have selected the Do not restart the cluster when submitting check box; otherwise, since each run automatically restarts the cluster, the Jobs that are launched in parallel interrupt each other and cause execution failures.
-
From the Cloud provider drop-down list, select
Azure. -
Enter the basic connection information to Databricks.
Standalone
-
In the Endpoint
field, enter the URL address of your Azure Databricks workspace.
This URL can be found in the Overview blade
of your Databricks workspace page on your Azure portal. For example,
this URL could look like https://westeurope.azuredatabricks.net. -
In the Cluster ID
field, enter the ID of the Databricks cluster to be used. This ID is
the value of the
spark.databricks.clusterUsageTags.clusterId
property of your Spark cluster. You can find this property on the
properties list in the Environment tab in the
Spark UI view of your cluster. You can also easily find this ID from
the URL of your Databricks cluster. It is present immediately after
cluster/ in this URL. -
Click the […] button
next to the Token field to enter the
authentication token generated for your Databricks user account. You
can generate or find this token on the User
settings page of your Databricks workspace. For
further information, see Token management from the
Azure documentation. -
In the DBFS dependencies
folder field, enter the directory that is used to
store your Job related dependencies on Databricks Filesystem at
runtime, putting a slash (/) at the end of this directory. For
example, enter /jars/ to store the dependencies
in a folder named jars. This folder is created
on the fly if it does not exist. -
Poll interval when retrieving Job status (in
ms): enter, without the quotation marks, the time
interval (in milliseconds) at the end of which you want the Studio
to ask Spark for the status of your Job. For example, this status
could be Pending or Running. The default value is 300000, that is, 300
seconds. This interval is recommended by Databricks to correctly
retrieve the Job status. -
Use
transient cluster: select this check box to
use transient Databricks clusters. The custom properties you defined in the Advanced properties table are automatically taken into account by the transient clusters at runtime.
- Autoscale: select or clear this check box to define how the number of workers used by your transient cluster is determined.
  - If you select this check box, autoscaling is enabled. Define the minimum number of workers in Min workers and the maximum number of workers in Max workers. Your transient cluster is scaled up and down within this scope based on its workload. According to the Databricks documentation, autoscaling works best with Databricks runtime versions 3.0 onwards.
  - If you clear this check box, autoscaling is deactivated. Define the number of workers the transient cluster is expected to have. This number does not include the Spark driver node.
- Node type
and Driver node type:
select the node types for the workers and the Spark driver node.
These types determine the capacity of your nodes and their
pricing by Databricks. For details about
these node types and the Databricks Units they use, see
Supported Instance
Types from the Databricks documentation. - Elastic
disk: select this check box to enable your
transient cluster to automatically scale up its disk space when
its Spark workers are running low on disk space. For more details about this elastic disk
feature, search for the section about autoscaling local
storage from your Databricks documentation. - SSH public
key: if SSH access has been set up for your
cluster, enter the public key of the generated SSH key pair.
This public key is automatically added to each node of your
transient cluster. If no SSH access has been set up, ignore this
field. For further information about SSH
access to your cluster, see SSH access to
clusters from the Databricks
documentation. - Configure cluster
log: select this check box to define where to
store your Spark logs for the long term. This storage system can
be S3 or DBFS.
- Do not restart the cluster
when submitting: select this check box to prevent
the Studio from restarting the cluster when it submits your
Jobs. However, if you make changes in your Jobs, clear this check
box so that the Studio restarts your cluster to take these changes
into account.
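As a side note, the endpoint, cluster ID and token entered above can be checked outside the Studio with a plain REST call to the Databricks clusters/get API. The following Scala sketch is only an illustration; the workspace URL, cluster ID and token are placeholders.
import java.net.{HttpURLConnection, URL}
import scala.io.Source

object DatabricksClusterCheck {
  def main(args: Array[String]): Unit = {
    val endpoint = "https://westeurope.azuredatabricks.net"   // placeholder: your workspace URL
    val clusterId = "0123-456789-abcde123"                    // placeholder: your cluster ID
    val token = "<personal-access-token>"                     // placeholder: your Databricks token

    val connection = new URL(s"$endpoint/api/2.0/clusters/get?cluster_id=$clusterId")
      .openConnection().asInstanceOf[HttpURLConnection]
    connection.setRequestProperty("Authorization", s"Bearer $token")

    // A 200 response with a JSON description of the cluster confirms that the three values are consistent.
    println(Source.fromInputStream(connection.getInputStream).mkString)
  }
}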
-
If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the
Spark checkpointing operation. In the field that is displayed, enter the
directory in which Spark stores, in the file system of the cluster, the context
data of the computations such as the metadata and the generated RDDs of this
computation.
For further information about the Spark checkpointing operation, see http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing .
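To illustrate what this checkpoint directory is used for, here is a minimal Spark sketch (not Talend-generated code) that sets a checkpoint directory on the cluster file system and checkpoints an RDD; the /checkpoint path is a placeholder.
import org.apache.spark.sql.SparkSession

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CheckpointSketch").getOrCreate()

    // The directory entered in the checkpointing field plays the same role as this call:
    // Spark persists checkpoint data (for example RDD contents and streaming metadata) there.
    spark.sparkContext.setCheckpointDir("/checkpoint")   // placeholder directory on the cluster file system

    val rdd = spark.sparkContext.parallelize(1 to 100).map(_ * 2)
    rdd.checkpoint()        // truncates the lineage and writes the data to the checkpoint directory
    println(rdd.count())    // the action triggers the actual checkpoint
    spark.stop()
  }
}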
Adding Azure specific properties to access the Azure storage system from Databricks
Add the Azure specific properties to the Spark configuration of your Databricks
cluster so that your cluster can access Azure Storage.
You need to do this only when you want your Talend
Jobs for Apache Spark to use Azure Blob Storage or Azure Data Lake Storage with
Databricks.
-
Ensure that your Spark cluster in Databricks has been properly created and is
running and its version is supported by the Studio. If you use Azure Data
Lake Storage Gen 2, only Databricks 5.4 is supported. For further information, see Create Databricks workspace from
Azure documentation. - You have an Azure account.
- The Azure Blob Storage or Azure Data Lake Storage service to be used has been
properly created and you have the appropriate permissions to access it. For
further information about Azure Storage, see Azure Storage tutorials from Azure
documentation.
-
On the Configuration tab of your Databricks cluster
page, scroll down to the Spark tab at the bottom of the
page. -
Click Edit to make the fields on this page
editable. -
In this Spark tab, enter the Spark properties regarding the credentials to be used to access your Azure Storage system.
Azure Blob Storage: when you need to use Azure Blob Storage with Azure Databricks, add the following Spark properties:
- The property that provides the account key:
spark.hadoop.fs.azure.account.key.<storage_account>.blob.core.windows.net <key>
Ensure that the account to be used has the appropriate read/write rights and permissions.
- If you need to append data to an existing file, add this property:
spark.hadoop.fs.azure.enable.append.support true
Azure Data Lake Storage (Gen1): when you need to use Azure Data Lake Storage Gen1 with Databricks, add the following Spark properties, one per line:
spark.hadoop.dfs.adls.oauth2.access.token.provider.type ClientCredential
spark.hadoop.dfs.adls.oauth2.client.id <your_app_id>
spark.hadoop.dfs.adls.oauth2.credential <your_authentication_key>
spark.hadoop.dfs.adls.oauth2.refresh.url https://login.microsoftonline.com/<your_app_TENANT-ID>/oauth2/token
Azure Data Lake Storage (Gen2): when you need to use Azure Data Lake Storage Gen2 with Databricks, add the following Spark properties, one per line:
- The property that provides the account key:
spark.hadoop.fs.azure.account.key.<storage_account>.dfs.core.windows.net <key>
This key is associated with the storage account to be used. You can find it in the Access keys blade of this storage account. Two keys are available for each account and, by default, either of them can be used for this access. Ensure that the account to be used has the appropriate read/write rights and permissions.
- If the ADLS file system to be used does not exist yet, add the following property:
spark.hadoop.fs.azure.createRemoteFileSystemDuringInitialization true
For further information about how to find your application ID and authentication key, see Get application ID and authentication key from the Azure documentation. In the same documentation, you can also find details about how to find your tenant ID at Get tenant ID. -
-
If you need to run Spark Streaming Jobs with Databricks, in the same
Spark tab, add the following property to define a
default Spark serializer. If you do not plan to run Spark Streaming Jobs, you
can ignore this step.
spark.serializer org.apache.spark.serializer.KryoSerializer
- Restart your Spark cluster.
-
In the Spark UI tab of your Databricks cluster page,
click Environment to display the list of properties and
verify that each of the properties you added in the previous steps is present on
that list.
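As a complement, the same credentials can also be set at session level, for example from a Databricks notebook, to validate them before editing the cluster configuration. The following Scala sketch assumes it runs on a Databricks cluster; the account name, key, container and file path are placeholders, and note that session-level properties drop the spark.hadoop. prefix used in the cluster configuration.
import org.apache.spark.sql.SparkSession

// Assumes execution on a Databricks cluster, where the Azure storage drivers are available.
val spark = SparkSession.builder().getOrCreate()

val storageAccount = "mystorageaccount"   // placeholder
val accountKey = "<access-key>"           // placeholder

// Session-level equivalents of the cluster-level properties above.
spark.conf.set(s"fs.azure.account.key.$storageAccount.blob.core.windows.net", accountKey)   // Azure Blob Storage
spark.conf.set(s"fs.azure.account.key.$storageAccount.dfs.core.windows.net", accountKey)    // Azure Data Lake Storage Gen2

// Quick read to confirm that the key gives access to an existing container and file (placeholders).
spark.read.text(s"wasbs://mycontainer@$storageAccount.blob.core.windows.net/sample.txt").show()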
Defining the Databricks-on-AWS connection parameters for Spark Jobs
Complete the Databricks connection configuration in the Spark configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
-
- When running a Spark Streaming Job, only one Job is allowed to run on the same Databricks cluster at a time.
- When running a Spark Batch Job, you can send more than one Job to run in parallel on the same Databricks cluster only if you have selected the Do not restart the cluster when submitting check box; otherwise, since each run automatically restarts the cluster, the Jobs that are launched in parallel interrupt each other and cause execution failures.
- Ensure that the AWS account to be used has the proper read/write permissions to the S3 bucket to be used. For this purpose, contact the administrator of your AWS system.
Standalone
If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the
Spark checkpointing operation. In the field that is displayed, enter the
directory in which Spark stores, in the file system of the cluster, the context
data of the computations such as the metadata and the generated RDDs of this
computation.
For further information about the Spark checkpointing operation, see http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing .
Defining the AWS Qubole connection parameters for Spark Jobs
Complete the Qubole connection configuration in the Spark
configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
Qubole is supported only in the traditional data integration framework (the Standard framework) and in the Spark frameworks.
- You have properly set up your Qubole cluster on AWS. For further information about
how to do this, see Getting Started with Qubole on AWS from the
Qubole documentation. - Ensure that the Qubole account to be used has the proper IAM role that is allowed to
read/write to the S3 bucket to be used. For further details, contact the
administrator of your Qubole system or see Cross-account IAM Role for QDS from
the Qubole documentation. - Ensure that the AWS account to be used has the proper read/write permissions to
the S3 bucket to be used. For this purpose, contact the administrator of your
AWS system.
-
Enter the basic connection information to Qubole.
Connection configuration
-
Click the … button next to the
API Token field to enter the
authentication token generated for the Qubole user account
to be used. For further information about how to obtain this
token, see Manage Qubole
account from the Qubole documentation. This
token allows you to specify the user account you want to
use to access Qubole. Your Job automatically uses
the rights and permissions granted to this user account
in Qubole. -
Select the Cluster label check
box and enter the name of the Qubole cluster to be used. If
you leave this check box clear, the default cluster is
used. If you need details about your default cluster,
ask the administrator of your Qubole service. You can
also read this article
from the Qubole documentation to find more information
about configuring a default Qubole cluster. -
Select the Change API endpoint
check box and select the region to be used. If you leave this
check box clear, the default region is used. For further
information about the Qubole Endpoints supported on
QDS-on-AWS, see Supported Qubole
Endpoints on Different Cloud
Providers.
-
-
Configure the connection to the S3 file system to be used to temporarily store the dependencies of your Job so that your Qubole cluster has access to these dependencies.
This configuration is used for your Job dependencies only. Use a
tS3Configuration in your Job to write your actual
business data in the S3 system with Qubole. Without
tS3Configuration, this business data is written in the
Qubole HDFS system and is destroyed once you shut down your cluster.-
Access key and
Secret key:
enter the authentication information required to connect to the Amazon
S3 bucket to be used.To enter the password, click the […] button next to the
password field, and then in the pop-up dialog box enter the password between double quotes
and click OK to save the settings. - Bucket name: enter the name of the bucket in
which you want to store the dependencies of your Job. This bucket must
already exist on S3. - Temporary resource folder: enter the
directory in which you want to store the dependencies of your Job. For
example, enter temp_resources to write the dependencies in
the /temp_resources folder in the bucket. If this folder already exists at runtime, its contents are overwritten
by the upcoming dependencies; otherwise, this folder is automatically
created. -
Region: specify the AWS region by selecting a region name from the
list. For more information about the AWS Region, see Regions and Endpoints.
-
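For illustration only, the access key, secret key and bucket entered above correspond to the standard Hadoop s3a properties that Spark uses to reach S3. The Scala sketch below writes a small dataset to a placeholder bucket, which is roughly the role that tS3Configuration plays for your business data in a Job; it is not the Talend-generated code, it assumes the s3a (hadoop-aws) libraries are available on the cluster, and all values are placeholders.
import org.apache.spark.sql.SparkSession

object QuboleS3Sketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("QuboleS3Sketch").getOrCreate()
    import spark.implicits._

    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.set("fs.s3a.access.key", "<access-key>")   // placeholder
    hadoopConf.set("fs.s3a.secret.key", "<secret-key>")   // placeholder

    // Write data to the bucket directly so that it survives the shutdown of the cluster.
    val df = Seq(("id1", 42), ("id2", 7)).toDF("id", "value")
    df.write.mode("overwrite").parquet("s3a://my-bucket/output/")   // placeholder bucket and folder
    spark.stop()
  }
}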
-
After the connection is configured, you can tune
the Spark performance, although not required, by following the process explained in:-
for Spark Batch Jobs.
-
for Spark Streaming Jobs.
-
-
If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the
Spark checkpointing operation. In the field that is displayed, enter the
directory in which Spark stores, in the file system of the cluster, the context
data of the computations such as the metadata and the generated RDDs of this
computation. For further information about the Spark checkpointing operation, see http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing.
Defining the EMR connection parameters
Complete the EMR connection configuration in the Spark configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
-
Enter the basic connection information to EMR:
Yarn client
The Studio runs the Spark driver to orchestrate how the Job should be
performed and then sends the orchestration to the Yarn service of a given
Hadoop cluster so that the Resource Manager of this Yarn service
requests execution resources accordingly.If you are using the Yarn client
mode, you need to set the following parameters in their corresponding
fields (if you leave the check box of a service clear, then at runtime,
the configuration about this parameter in the Hadoop cluster to be used
will be ignored):-
In the Resource manager
field, enter the address of the ResourceManager service of the Hadoop cluster to
be used. -
Select the Set resourcemanager
scheduler address check box and enter the Scheduler address in
the field that appears. -
Select the Set jobhistory
address check box and enter the location of the JobHistory
server of the Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server. -
Select the Set staging
directory check box and enter this directory defined in your
Hadoop cluster for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
such as yarn-site.xml or mapred-site.xml of your distribution. -
If you are accessing the Hadoop cluster running with Kerberos security,
select this check box, then, enter the Kerberos principal names for the
ResourceManager service and the JobHistory service in the displayed fields. This
enables you to use your user name to authenticate against the credentials stored in
Kerberos. These principals can be found in the configuration files of your
distribution, such as in yarn-site.xml and in mapred-site.xml. If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. You need to enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored in the machine in which your Job actually
runs, for example, on a Talend
Jobserver.Note that the user that executes a keytab-enabled Job is not necessarily
the one a principal designates but must have the right to read the keytab file being
used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
situation, ensure that user1 has the right to read the keytab
file to be used. -
The User name field is available when you are not using
Kerberos to authenticate. In the User name field, enter the
login user name for your distribution. If you leave it empty, the user name of the machine
hosting the Studio will be used. -
If the Spark cluster cannot recognize the machine in which the Job is
launched, select the Define the driver hostname or IP
address check box and enter the host name or the IP address of
this machine. This allows the Spark master and its workers to recognize this
machine to find the Job and thus its driver. Note that in this situation, you also need to add the name and the IP
address of this machine to its host file.
Yarn cluster
The Spark driver runs in your Yarn cluster to orchestrate how the Job
should be performed.If you are using the Yarn cluster mode, you need
to define the following parameters in their corresponding fields (if you
leave the check box of a service clear, then at runtime, the
configuration about this parameter in the Hadoop cluster to be used will
be ignored):-
In the Resource manager
field, enter the address of the ResourceManager service of the Hadoop cluster to
be used. -
Select the Set resourcemanager
scheduler address check box and enter the Scheduler address in
the field that appears. -
Select the Set jobhistory
address check box and enter the location of the JobHistory
server of the Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server. -
Select the Set staging
directory check box and enter this directory defined in your
Hadoop cluster for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
such as yarn-site.xml or mapred-site.xml of your distribution. -
Set path to custom Hadoop
configuration JAR: if you are using
connections defined in Repository to
connect to your Cloudera or Hortonworks cluster, you can
select this check box in the
Repository wizard and in the
field that is displayed, specify the path to the JAR file
that provides the connection parameters of your Hadoop
environment. Note that this file must be accessible from the
machine where your Job is launched. This kind of Hadoop configuration JAR file is
automatically generated when you build a Big Data Job from the
Studio. This JAR file is by default named with this pattern:
hadoop-conf-[name_of_the_metadata_in_the_repository]_[name_of_the_context].jar
You can also download this JAR file from the web console of your
cluster or simply create a JAR file yourself by putting the
configuration files in the root of your JAR file. For example:
hdfs-site.xml
core-site.xml
The parameters from your custom JAR file override the parameters
you put in the Spark configuration field.
They also override the configuration you set in the
configuration components such as
tHDFSConfiguration or
tHBaseConfiguration when the related
storage system such as HDFS, HBase or Hive are native to Hadoop.
But they do not override the configuration set in the
configuration components for the third-party storage system such
as tAzureFSConfiguration. -
If you are accessing the Hadoop cluster running with Kerberos security,
select this check box, then, enter the Kerberos principal names for the
ResourceManager service and the JobHistory service in the displayed fields. This
enables you to use your user name to authenticate against the credentials stored in
Kerberos. These principals can be found in the configuration files of your
distribution, such as in yarn-site.xml and in mapred-site.xml. If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. You need to enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored in the machine in which your Job actually
runs, for example, on a Talend
Jobserver.Note that the user that executes a keytab-enabled Job is not necessarily
the one a principal designates but must have the right to read the keytab file being
used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
situation, ensure that user1 has the right to read the keytab
file to be used. -
The User name field is available when you are not using
Kerberos to authenticate. In the User name field, enter the
login user name for your distribution. If you leave it empty, the user name of the machine
hosting the Studio will be used. -
Select the Wait for the Job to complete check box to make your Studio or,
if you use Talend
Jobserver, your Job JVM keep monitoring the Job until the execution of the Job
is over. By selecting this check box, you actually set the spark.yarn.submit.waitAppCompletion property to be true. While
it is generally useful to select this check box when running a Spark Batch Job,
it makes more sense to keep this check box clear when running a Spark Streaming
Job.
Ensure that the user name in the Yarn
client mode is the same one you put in
tS3Configuration, the component used to provide S3
connection information to Spark. A sketch of a keytab-based Kerberos login follows this step.
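The keytab options described in the table above amount to a programmatic Kerberos login. The following Scala sketch shows the underlying Hadoop call; the principal and keytab path are placeholders, and it assumes the Hadoop client libraries and a valid Kerberos configuration are available on the machine running it.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

object KeytabLoginSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("hadoop.security.authentication", "kerberos")
    UserGroupInformation.setConfiguration(conf)

    // Log in with a principal/keytab pair instead of an interactive password prompt.
    UserGroupInformation.loginUserFromKeytab(
      "guest@EXAMPLE.COM",                     // placeholder principal
      "/etc/security/keytabs/guest.keytab")    // placeholder keytab path, readable by the executing user

    println(UserGroupInformation.getCurrentUser)
  }
}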
-
With the Yarn client mode, the
Property type list is displayed to allow you
to select an established Hadoop connection from the Repository, on the condition that you have created this connection
in the Repository. Then the Studio will reuse
that set of connection information for this Job.
-
If you need to launch your Job from Windows, it is recommended to specify where
the winutils.exe program to be used is stored.
-
If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box
and enter the directory where your winutils.exe is
stored. -
Otherwise, leave this check box clear; the Studio generates a winutils.exe file
by itself and automatically uses it for this Job.
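For reference, defining the Hadoop home directory comes down to pointing the Hadoop client libraries at a local folder that contains bin\winutils.exe before any Spark or Hadoop class is initialized, as in the hedged Scala sketch below; the path is a placeholder.
object WinutilsHomeSketch {
  def main(args: Array[String]): Unit = {
    // The Hadoop libraries look for %HADOOP_HOME%\bin\winutils.exe or the hadoop.home.dir system property.
    System.setProperty("hadoop.home.dir", "C:\\hadoop")   // placeholder: expects C:\hadoop\bin\winutils.exe

    // ... create the SparkSession and run the Job after this point.
  }
}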
-
-
In the Spark “scratch” directory
field, enter the local directory in which the Studio stores
temporary files such as the JAR files to be transferred. If you launch the Job
on Windows, the default disk is C:. So if you leave /tmp in this field, this directory is C:/tmp.
-
After the connection is configured, you can tune
the Spark performance, although not required, by following the process explained in:-
for Spark Batch Jobs.
-
for Spark Streaming Jobs.
-
-
It is recommended to activate the Spark logging and
checkpointing system in the Spark configuration tab of the Run view of your Spark
Job, in order to help debug and resume your Spark Job when issues arise:-
.
-
Defining the Cloudera connection parameters
Complete the Cloudera connection configuration in the Spark
configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
If you cannot
find the Cloudera version to be used from this drop-down list, you can add your distribution
by using the dynamic distribution settings in the Studio.
-
Select the type of the Spark cluster you need to connect to.
Standalone
The Studio connects to a Spark-enabled cluster to run the Job from this
cluster.If you are using the Standalone mode, you need to
set the following parameters:-
In the Spark host field, enter the URI
of the Spark Master of the Hadoop cluster to be used. -
In the Spark home field, enter the
location of the Spark executable installed in the Hadoop cluster to be used. -
If the Spark cluster cannot recognize the machine in which the Job is
launched, select the Define the driver hostname or IP
address check box and enter the host name or the IP address of
this machine. This allows the Spark master and its workers to recognize this
machine to find the Job and thus its driver. Note that in this situation, you also need to add the name and the IP
address of this machine to its host file.
Yarn client
The Studio runs the Spark driver to orchestrate how the Job should be
performed and then sends the orchestration to the Yarn service of a given
Hadoop cluster so that the Resource Manager of this Yarn service
requests execution resources accordingly.If you are using the Yarn client
mode, you need to set the following parameters in their corresponding
fields (if you leave the check box of a service clear, then at runtime,
the configuration about this parameter in the Hadoop cluster to be used
will be ignored):-
In the Resource manager
field, enter the address of the ResourceManager service of the Hadoop cluster to
be used. -
Select the Set resourcemanager
scheduler address check box and enter the Scheduler address in
the field that appears. -
Select the Set jobhistory
address check box and enter the location of the JobHistory
server of the Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server. -
Select the Set staging
directory check box and enter this directory defined in your
Hadoop cluster for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
such as yarn-site.xml or mapred-site.xml of your distribution. -
If you are accessing the Hadoop cluster running with Kerberos security,
select this check box, then, enter the Kerberos principal names for the
ResourceManager service and the JobHistory service in the displayed fields. This
enables you to use your user name to authenticate against the credentials stored in
Kerberos. These principals can be found in the configuration files of your
distribution, such as in yarn-site.xml and in mapred-site.xml. If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. You need to enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored in the machine in which your Job actually
runs, for example, on a Talend
Jobserver.Note that the user that executes a keytab-enabled Job is not necessarily
the one a principal designates but must have the right to read the keytab file being
used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
situation, ensure that user1 has the right to read the keytab
file to be used. -
The User name field is available when you are not using
Kerberos to authenticate. In the User name field, enter the
login user name for your distribution. If you leave it empty, the user name of the machine
hosting the Studio will be used. -
If the Spark cluster cannot recognize the machine in which the Job is
launched, select the Define the driver hostname or IP
address check box and enter the host name or the IP address of
this machine. This allows the Spark master and its workers to recognize this
machine to find the Job and thus its driver. Note that in this situation, you also need to add the name and the IP
address of this machine to its host file.
Yarn cluster
The Spark driver runs in your Yarn cluster to orchestrate how the Job
should be performed.If you are using the Yarn cluster mode, you need
to define the following parameters in their corresponding fields (if you
leave the check box of a service clear, then at runtime, the
configuration about this parameter in the Hadoop cluster to be used will
be ignored):-
In the Resource manager
field, enter the address of the ResourceManager service of the Hadoop cluster to
be used. -
Select the Set resourcemanager
scheduler address check box and enter the Scheduler address in
the field that appears. -
Select the Set jobhistory
address check box and enter the location of the JobHistory
server of the Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server. -
Select the Set staging
directory check box and enter this directory defined in your
Hadoop cluster for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
such as yarn-site.xml or mapred-site.xml of your distribution. -
Set path to custom Hadoop
configuration JAR: if you are using
connections defined in Repository to
connect to your Cloudera or Hortonworks cluster, you can
select this check box in the
Repository wizard and in the
field that is displayed, specify the path to the JAR file
that provides the connection parameters of your Hadoop
environment. Note that this file must be accessible from the
machine where your Job is launched. This kind of Hadoop configuration JAR file is
automatically generated when you build a Big Data Job from the
Studio. This JAR file is by default named with this pattern:
hadoop-conf-[name_of_the_metadata_in_the_repository]_[name_of_the_context].jar
You can also download this JAR file from the web console of your
cluster or simply create a JAR file yourself by putting the
configuration files in the root of your JAR file. For example:
hdfs-site.xml
core-site.xml
A sketch of how these files are picked up from the classpath is given after this step. The parameters from your custom JAR file override the parameters
you put in the Spark configuration field.
They also override the configuration you set in the
configuration components such as
tHDFSConfiguration or
tHBaseConfiguration when the related
storage system such as HDFS, HBase or Hive are native to Hadoop.
But they do not override the configuration set in the
configuration components for the third-party storage system such
as tAzureFSConfiguration. -
If you are accessing the Hadoop cluster running with Kerberos security,
select this check box, then, enter the Kerberos principal names for the
ResourceManager service and the JobHistory service in the displayed fields. This
enables you to use your user name to authenticate against the credentials stored in
Kerberos. These principals can be found in the configuration files of your
distribution, such as in yarn-site.xml and in mapred-site.xml. If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. You need to enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored in the machine in which your Job actually
runs, for example, on a Talend
Jobserver.Note that the user that executes a keytab-enabled Job is not necessarily
the one a principal designates but must have the right to read the keytab file being
used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
situation, ensure that user1 has the right to read the keytab
file to be used. -
The User name field is available when you are not using
Kerberos to authenticate. In the User name field, enter the
login user name for your distribution. If you leave it empty, the user name of the machine
hosting the Studio will be used. -
Select the Wait for the Job to complete check box to make your Studio or,
if you use Talend
Jobserver, your Job JVM keep monitoring the Job until the execution of the Job
is over. By selecting this check box, you actually set the spark.yarn.submit.waitAppCompletion property to be true. While
it is generally useful to select this check box when running a Spark Batch Job,
it makes more sense to keep this check box clear when running a Spark Streaming
Job.
Ensure that the user name in the Yarn
client mode is the same one you put in
tHDFSConfiguration, the component used to provide HDFS
connection information to Spark.
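To illustrate why packaging core-site.xml and hdfs-site.xml at the root of a JAR works, remember that Hadoop's Configuration class loads these files from the classpath. The Scala sketch below only illustrates that mechanism; fs.defaultFS is a standard Hadoop property used here as an example.
import org.apache.hadoop.conf.Configuration

object HadoopConfJarSketch {
  def main(args: Array[String]): Unit = {
    // core-site.xml found on the classpath (for example at the root of the configuration JAR)
    // is loaded automatically when a Configuration object is created.
    val conf = new Configuration()
    println(conf.get("fs.defaultFS"))   // value coming from the packaged core-site.xml, if present
  }
}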
-
With the Yarn client mode, the
Property type list is displayed to allow you
to select an established Hadoop connection from the Repository, on the condition that you have created this connection
in the Repository. Then the Studio will reuse
that set of connection information for this Job.
-
If you need to launch your Job from Windows, it is recommended to specify where
the winutils.exe program to be used is stored.
-
If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box
and enter the directory where your winutils.exe is
stored. -
Otherwise, leave this check box clear; the Studio generates a winutils.exe file
by itself and automatically uses it for this Job.
-
-
In the Spark “scratch” directory
field, enter the local directory in which the Studio stores
temporary files such as the JAR files to be transferred. If you launch the Job
on Windows, the default disk is C:. So if you leave /tmp in this field, this directory is C:/tmp.
-
After the connection is configured, you can tune
the Spark performance, although not required, by following the process explained in:-
for Spark Batch Jobs.
-
for Spark Streaming Jobs.
-
-
It is recommended to activate the Spark logging and
checkpointing system in the Spark configuration tab of the Run view of your Spark
Job, in order to help debug and resume your Spark Job when issues arise:-
.
-
-
If you are using Cloudera V5.5+ to run your MapReduce or Apache Spark Batch
Jobs, you can make use of Cloudera Navigator to trace the lineage of a given
data flow to discover how this data flow was generated by a Job.
Defining the Dataproc connection parameters
Complete the Google Dataproc connection configuration in the Spark
configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
Only the Yarn client mode is available for this type of cluster.
-
Enter the basic connection information to Dataproc:
Project identifier: enter the ID of your Google Cloud Platform project. If you are not certain about your project ID, check it in the Manage Resources page of your Google Cloud Platform services.
Cluster identifier: enter the ID of the Dataproc cluster to be used.
Region: from this drop-down list, select the Google Cloud region to be used.
Google Storage staging bucket: as a Talend Job expects its dependent JAR files for execution, specify the Google Storage directory to which these JAR files are transferred so that your Job can access them at execution time. The directory to be entered must end with a slash (/). If the directory does not exist, it is created on the fly, but the bucket to be used must already exist.
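As an illustration of how the staging bucket is reached, the Google Cloud Storage connector is available by default on a Dataproc cluster, so a gs:// path such as the staging directory above can be written to directly from Spark. The Scala sketch below is not Talend-generated code and the bucket name is a placeholder.
import org.apache.spark.sql.SparkSession

object GcsStagingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("GcsStagingSketch").getOrCreate()

    // Writing under the staging bucket confirms that the bucket exists and is reachable from the cluster.
    spark.range(10).write.mode("overwrite").parquet("gs://my-staging-bucket/check/")   // placeholder bucket
    spark.stop()
  }
}
-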
Provide the authentication information to your Google Dataproc cluster:
Provide Google Credentials in file
Leave this check box clear when you
launch your Job from a machine in which the Google Cloud SDK has been
installed and authorized to use your user account credentials to access
Google Cloud Platform. In this situation, this machine is often your
local machine. When you launch your Job from a remote
machine, such as a Jobserver, select this check box and, in the
Path to Google Credentials file field that is
displayed, enter the directory in which this JSON file is stored on the
Jobserver machine. For further information about this Google
Credentials file, see the administrator of your Google Cloud Platform or
visit Google Cloud Platform Auth
Guide. -
With the Yarn client mode, the
Property type list is displayed to allow you
to select an established Hadoop connection from the Repository, on the condition that you have created this connection
in the Repository. Then the Studio will reuse
that set of connection information for this Job.
-
In the Spark “scratch” directory
field, enter the local directory in which the Studio stores
temporary files such as the JAR files to be transferred. If you launch the Job
on Windows, the default disk is C:. So if you leave /tmp in this field, this directory is C:/tmp.
-
After the connection is configured, you can tune
the Spark performance, although not required, by following the process explained in:-
for Spark Batch Jobs.
-
for Spark Streaming Jobs.
-
-
It is recommended to activate the Spark logging and
checkpointing system in the Spark configuration tab of the Run view of your Spark
Job, in order to help debug and resume your Spark Job when issues arise:-
.
-
Defining the Hortonworks connection parameters
Complete the Hortonworks connection configuration in the Spark configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
-
Enter the basic connection information to Hortonworks
Yarn client
The Studio runs the Spark driver to orchestrate how the Job should be
performed and then sends the orchestration to the Yarn service of a given
Hadoop cluster so that the Resource Manager of this Yarn service
requests execution resources accordingly.If you are using the Yarn client
mode, you need to set the following parameters in their corresponding
fields (if you leave the check box of a service clear, then at runtime,
the configuration about this parameter in the Hadoop cluster to be used
will be ignored):-
In the Resource manager
field, enter the address of the ResourceManager service of the Hadoop cluster to
be used. -
Select the Set resourcemanager
scheduler address check box and enter the Scheduler address in
the field that appears. -
Select the Set jobhistory
address check box and enter the location of the JobHistory
server of the Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server. -
Select the Set staging
directory check box and enter this directory defined in your
Hadoop cluster for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
such as yarn-site.xml or mapred-site.xml of your distribution. -
If you are accessing the Hadoop cluster running with Kerberos security,
select this check box, then, enter the Kerberos principal names for the
ResourceManager service and the JobHistory service in the displayed fields. This
enables you to use your user name to authenticate against the credentials stored in
Kerberos. These principals can be found in the configuration files of your
distribution, such as in yarn-site.xml and in mapred-site.xml. If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. You need to enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored in the machine in which your Job actually
runs, for example, on a Talend
Jobserver.Note that the user that executes a keytab-enabled Job is not necessarily
the one a principal designates but must have the right to read the keytab file being
used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
situation, ensure that user1 has the right to read the keytab
file to be used. -
The User name field is available when you are not using
Kerberos to authenticate. In the User name field, enter the
login user name for your distribution. If you leave it empty, the user name of the machine
hosting the Studio will be used. -
If the Spark cluster cannot recognize the machine in which the Job is
launched, select the Define the driver hostname or IP
address check box and enter the host name or the IP address of
this machine. This allows the Spark master and its workers to recognize this
machine to find the Job and thus its driver. Note that in this situation, you also need to add the name and the IP
address of this machine to its host file.
Yarn cluster
The Spark driver runs in your Yarn cluster to orchestrate how the Job
should be performed.If you are using the Yarn cluster mode, you need
to define the following parameters in their corresponding fields (if you
leave the check box of a service clear, then at runtime, the
configuration about this parameter in the Hadoop cluster to be used will
be ignored):-
In the Resource manager
field, enter the address of the ResourceManager service of the Hadoop cluster to
be used. -
Select the Set resourcemanager
scheduler address check box and enter the Scheduler address in
the field that appears. -
Select the Set jobhistory
address check box and enter the location of the JobHistory
server of the Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server. -
Select the Set staging
directory check box and enter this directory defined in your
Hadoop cluster for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
such as yarn-site.xml or mapred-site.xml of your distribution. -
Set path to custom Hadoop
configuration JAR: if you are using
connections defined in Repository to
connect to your Cloudera or Hortonworks cluster, you can
select this check box in the
Repository wizard and in the
field that is displayed, specify the path to the JAR file
that provides the connection parameters of your Hadoop
environment. Note that this file must be accessible from the
machine where your Job is launched. This kind of Hadoop configuration JAR file is
automatically generated when you build a Big Data Job from the
Studio. This JAR file is by default named with this pattern:
hadoop-conf-[name_of_the_metadata_in_the_repository]_[name_of_the_context].jar
You can also download this JAR file from the web console of your
cluster or simply create a JAR file yourself by putting the
configuration files in the root of your JAR file. For example:
hdfs-site.xml
core-site.xml
The parameters from your custom JAR file override the parameters
you put in the Spark configuration field.
They also override the configuration you set in the
configuration components such as
tHDFSConfiguration or
tHBaseConfiguration when the related
storage system such as HDFS, HBase or Hive are native to Hadoop.
But they do not override the configuration set in the
configuration components for the third-party storage system such
as tAzureFSConfiguration. -
If you are accessing the Hadoop cluster running with Kerberos security,
select this check box, then, enter the Kerberos principal names for the
ResourceManager service and the JobHistory service in the displayed fields. This
enables you to use your user name to authenticate against the credentials stored in
Kerberos. These principals can be found in the configuration files of your
distribution, such as in yarn-site.xml and in mapred-site.xml. If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. You need to enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored in the machine in which your Job actually
runs, for example, on a Talend
Jobserver.Note that the user that executes a keytab-enabled Job is not necessarily
the one a principal designates but must have the right to read the keytab file being
used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
situation, ensure that user1 has the right to read the keytab
file to be used. -
The User name field is available when you are not using
Kerberos to authenticate. In the User name field, enter the
login user name for your distribution. If you leave it empty, the user name of the machine
hosting the Studio will be used. -
Select the Wait for the Job to complete check box to make your Studio or,
if you use Talend
Jobserver, your Job JVM keep monitoring the Job until the execution of the Job
is over. By selecting this check box, you actually set the spark.yarn.submit.waitAppCompletion property to be true. While
it is generally useful to select this check box when running a Spark Batch Job,
it makes more sense to keep this check box clear when running a Spark Streaming
Job.
Ensure that the user name in the Yarn
client mode is the same one you put in
tHDFSConfiguration, the component used to provide HDFS
connection information to Spark. -
-
With the Yarn client mode, the
Property type list is displayed to allow you
to select an established Hadoop connection from the Repository, on the condition that you have created this connection
in the Repository. Then the Studio will reuse
that set of connection information for this Job.
-
If you need to launch your Job from Windows, it is recommended to specify where
the winutils.exe program to be used is stored.
-
If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box
and enter the directory where your winutils.exe is
stored. -
Otherwise, leave this check box clear; the Studio generates a winutils.exe file
by itself and automatically uses it for this Job.
-
-
In the Spark “scratch” directory
field, enter the local directory in which the Studio stores
temporary files such as the JAR files to be transferred. If you launch the Job
on Windows, the default disk is C:. So if you leave /tmp in this field, this directory is C:/tmp.
-
If you encounter the
hdp.version is not found
issue when
executing your Job, select the Set hdp.version check box
to define the hdp.version variable in your Job and also in
your cluster. For more details, see Set up the hdp.version parameter to resolve the Hortonworks version
issue.
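For reference, the usual Hortonworks workaround for this issue amounts to defining hdp.version as a JVM option on both the Spark driver and the YARN application master sides. A hedged sketch of the corresponding Spark properties is shown below (they could be added, for example, in the Advanced properties table); the version number is a placeholder to replace with the exact HDP version of your cluster.
spark.driver.extraJavaOptions -Dhdp.version=2.6.5.0-292
spark.yarn.am.extraJavaOptions -Dhdp.version=2.6.5.0-292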
-
After the connection is configured, you can tune
the Spark performance, although not required, by following the process explained in:-
for Spark Batch Jobs.
-
for Spark Streaming Jobs.
-
-
It is recommended to activate the Spark logging and
checkpointing system in the Spark configuration tab of the Run view of your Spark
Job, in order to help debug and resume your Spark Job when issues arise:-
.
-
-
If you are using Hortonworks Data Platform V2.4 onwards to run your MapReduce
or Spark Batch Jobs and Apache Atlas has been installed in your Hortonworks
cluster, you can make use of Atlas to trace the lineage of a given data flow
to discover how this data was generated by a Job.
Defining the MapR connection parameters
Complete the MapR connection configuration in the Spark configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
-
Select the type of the Spark cluster you need to connect to.
Standalone
The Studio connects to a Spark-enabled cluster to run the Job from this
cluster.If you are using the Standalone mode, you need to
set the following parameters:-
In the Spark host field, enter the URI
of the Spark Master of the Hadoop cluster to be used. -
In the Spark home field, enter the
location of the Spark executable installed in the Hadoop cluster to be used. -
If the Spark cluster cannot recognize the machine in which the Job is
launched, select the Define the driver hostname or IP
address check box and enter the host name or the IP address of
this machine. This allows the Spark master and its workers to recognize this
machine to find the Job and thus its driver. Note that in this situation, you also need to add the name and the IP
address of this machine to its host file.
Yarn client
The Studio runs the Spark driver to orchestrate how the Job should be
performed and then sends the orchestration to the Yarn service of a given
Hadoop cluster so that the Resource Manager of this Yarn service
requests execution resources accordingly.If you are using the Yarn client
mode, you need to set the following parameters in their corresponding
fields (if you leave the check box of a service clear, then at runtime,
the configuration about this parameter in the Hadoop cluster to be used
will be ignored):-
In the Resource manager
field, enter the address of the ResourceManager service of the Hadoop cluster to
be used. -
Select the Set resourcemanager
scheduler address check box and enter the Scheduler address in
the field that appears. -
Select the Set jobhistory
address check box and enter the location of the JobHistory
server of the Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server. -
Select the Set staging
directory check box and enter this directory defined in your
Hadoop cluster for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
such as yarn-site.xml or mapred-site.xml of your distribution. -
If you are accessing the Hadoop cluster running with Kerberos security,
select this check box, then, enter the Kerberos principal names for the
ResourceManager service and the JobHistory service in the displayed fields. This
enables you to use your user name to authenticate against the credentials stored in
Kerberos. These principals can be found in the configuration files of your
distribution, such as in yarn-site.xml and in mapred-site.xml. If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. You need to enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored in the machine in which your Job actually
runs, for example, on a Talend
Jobserver.Note that the user that executes a keytab-enabled Job is not necessarily
the one a principal designates but must have the right to read the keytab file being
used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
situation, ensure that user1 has the right to read the keytab
file to be used. -
The User name field is available when you are not using
Kerberos to authenticate. In the User name field, enter the
login user name for your distribution. If you leave it empty, the user name of the machine
hosting the Studio will be used. -
If the Spark cluster cannot recognize the machine in which the Job is
launched, select the Define the driver hostname or IP
address check box and enter the host name or the IP address of
this machine. This allows the Spark master and its workers to recognize this
machine to find the Job and thus its driver. A sketch of the corresponding Spark property is given after this step. Note that in this situation, you also need to add the name and the IP
address of this machine to its host file.
Yarn cluster
The Spark driver runs in your Yarn cluster to orchestrate how the Job
should be performed.If you are using the Yarn cluster mode, you need
to define the following parameters in their corresponding fields (if you
leave the check box of a service clear, then at runtime, the
configuration about this parameter in the Hadoop cluster to be used will
be ignored):-
In the Resource manager
field, enter the address of the ResourceManager service of the Hadoop cluster to
be used. -
Select the Set resourcemanager
scheduler address check box and enter the Scheduler address in
the field that appears. -
Select the Set jobhistory
address check box and enter the location of the JobHistory
server of the Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server. -
Select the Set staging
directory check box and enter this directory defined in your
Hadoop cluster for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
such as yarn-site.xml or mapred-site.xml of your distribution. -
Set path to custom Hadoop
configuration JAR: if you are using
connections defined in Repository to
connect to your Cloudera or Hortonworks cluster, you can
select this check box in the
Repository wizard and in the
field that is displayed, specify the path to the JAR file
that provides the connection parameters of your Hadoop
environment. Note that this file must be accessible from the
machine where your Job is launched. This kind of Hadoop configuration JAR file is
automatically generated when you build a Big Data Job from the
Studio. This JAR file is by default named with this pattern:
hadoop-conf-[name_of_the_metadata_in_the_repository]_[name_of_the_context].jar
You can also download this JAR file from the web console of your
cluster or simply create a JAR file yourself by putting the
configuration files in the root of your JAR file. For example:
hdfs-site.xml
core-site.xml
The parameters from your custom JAR file override the parameters
you put in the Spark configuration field.
They also override the configuration you set in the
configuration components such as
tHDFSConfiguration or
tHBaseConfiguration when the related
storage system such as HDFS, HBase or Hive are native to Hadoop.
But they do not override the configuration set in the
configuration components for the third-party storage system such
as tAzureFSConfiguration. -
If you are accessing the Hadoop cluster running with Kerberos security,
select this check box, then, enter the Kerberos principal names for the
ResourceManager service and the JobHistory service in the displayed fields. This
enables you to use your user name to authenticate against the credentials stored in
Kerberos. These principals can be found in the configuration files of your
distribution, such as in yarn-site.xml and in mapred-site.xml. If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. You need to enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored in the machine in which your Job actually
runs, for example, on a Talend
Jobserver. Note that the user that executes a keytab-enabled Job is not necessarily
the one a principal designates but must have the right to read the keytab file being
used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
situation, ensure that user1 has the right to read the keytab
file to be used. -
The User name field is available when you are not using
Kerberos to authenticate. In the User name field, enter the
login user name for your distribution. If you leave it empty, the user name of the machine
hosting the Studio will be used. -
Select the Wait for the Job to complete check box to make your Studio or,
if you use Talend
Jobserver, your Job JVM keep monitoring the Job until the execution of the Job
is over. By selecting this check box, you actually set the spark.yarn.submit.waitAppCompletion property to be true. While
it is generally useful to select this check box when running a Spark Batch Job,
it makes more sense to keep this check box clear when running a Spark Streaming
Job.
Ensure that the user name in the Yarn
client mode is the same one you put in
tHDFSConfiguration, the component used to provide HDFS
connection information to Spark. -
-
With the Yarn client mode, the
Property type list is displayed to allow you
to select an established Hadoop connection from the Repository, on the condition that you have created this connection
in the Repository. Then the Studio will reuse
that set of connection information for this Job.
-
If you launch your Job from Windows, it is recommended to specify where
the winutils.exe program to be used is stored.
-
If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box
and enter the directory where your winutils.exe is
stored. -
Otherwise, leave this check box clear; the Studio then generates one
by itself and automatically uses it for this Job.
-
-
Verify, for example with your cluster administrator, whether your MapR cluster
is secured with the MapR ticket authentication mechanism.-
If the MapR cluster to be used is secured with the MapR ticket authentication mechanism,
set the MapR ticket authentication configuration by following the explanation in Setting up the MapR ticket authentication. -
Otherwise, leave the Use MapR Ticket
authentication check box clear.
-
-
In the Spark “scratch” directory
field, enter the directory in which the Studio stores, in the local system, the
temporary files such as the JAR files to be transferred. If you launch the Job
on Windows, the default disk is C:. So if you leave /tmp in this field, this directory is C:/tmp.
-
After the connection is configured, you can, although it is not required, tune
the Spark performance by following the process explained in:-
for Spark Batch Jobs.
-
for Spark Streaming Jobs.
-
-
It is recommended to activate the Spark logging and
checkpointing system in the Spark configuration tab of the Run view of your Spark
Job, in order to help debug and resume your Spark Job when issues arise:-
.
-
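As a reference for the Set path to custom Hadoop configuration JAR option described above: a JAR file is a ZIP archive, so a custom configuration JAR can be assembled by hand by placing the configuration files at its root. The following is a minimal Python sketch, not the Studio's own tooling; the file names and the output name are examples only.

    # Package local copies of the cluster configuration files into a JAR
    # whose root contains the files themselves.
    import zipfile

    config_files = ["core-site.xml", "hdfs-site.xml", "yarn-site.xml", "mapred-site.xml"]

    with zipfile.ZipFile("hadoop-conf-custom.jar", "w") as jar:
        for name in config_files:
            jar.write(name, arcname=name)  # keep each file at the root of the archive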
Creating an HDInsight cluster on Azure
Create on the Azure portal the HDInsight cluster to be used.
- You have an Azure account with appropriate rights and permissions to the
HDInsight service.
- Navigate to the Azure portal: https://portal.azure.com/.
-
Click All services on the menu bar on the left of the
Azure welcome page. - In the Analytics section, click HDInsight clusters to open the corresponding blade.
- Click Add to create an HDInsight cluster.
-
Click Quick Create to display the Basics blade and enter the basic configuration information on this blade.
Among the parameters, the Cluster
type to be used must be the one officially supported by Talend. For example, select
Spark 2.1.0 (HDI 3.6). For further information about the supported versions, search for supported
Big Data platforms on Talend Help Center (https://help.talend.com). For further information about the parameters on this blade,
see Create clusters from Azure
documentation. -
Click Next to open the Storage
blade to set the storage settings for the cluster.-
From the Primary storage type list, select
Azure storage. -
For the storage account to be used, select the My
subscriptions radio button, then click Select
a storage account and choose the account to be used from
the blade that is displayed. -
For the other parameters, leave them as they are to use the default
values or enter the values you want to use.
-
-
Click Next to pass to the Summary
step, in which you review and confirm the configuration made in the previous
steps. - Once you confirm the configuration, click Create.
Once created, the cluster appears in your HDInsight clusters list.
Defining the HD Insight connection parameters
Complete the HD Insight connection configuration in the Spark
configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
Only the Yarn cluster mode is available for this type of
cluster.
-
Enter the basic connection information to Microsoft HD Insight:
Livy configuration
- The Hostname of
Livy is the URL of your HDInsight cluster. This URL can be
found in the Overview blade of your
cluster. Enter this URL without the https:// part. - The default Port is 443.
- The Username is the one defined when
creating your cluster. You can find it in the SSH
+ Cluster login blade of your cluster.
For
further information about the Livy service used by HD Insight, see
Submit Spark
jobs using Livy. (A connectivity-check sketch is given at the end of this section.)
HDInsight
configuration- The Username is the one defined when
creating your cluster. You can find it in the SSH
+ Cluster login blade of your cluster. - The Password is defined when creating your HDInsight
cluster for authentication to this cluster.
Windows Azure Storage configuration
Enter the address and the authentication information of the Azure Storage
account to be used. In this configuration, you do not define where to read or write
your business data but only where to deploy your Job. Therefore, always use
the Azure Storage system for this configuration. In the Container field, enter the name
of the container to be
used. You can
find the available containers in the Blob blade of the Azure
Storage account to be used. In the Deployment Blob field, enter the
location in which you want to store the current Job and its dependent libraries in
this Azure Storage account. In the Hostname field, enter the
Primary Blob Service Endpoint of your Azure Storage account without the https:// part. You can find this endpoint in the Properties blade of this storage account. In the Username field, enter the name of the Azure Storage account to be used.
In the Password field, enter the access key of the Azure Storage account to be used. This key can be found in the Access keys blade of this storage account.
-
-
In the Spark “scratch” directory
field, enter the directory in which the Studio stores, in the local system, the
temporary files such as the JAR files to be transferred. If you launch the Job
on Windows, the default disk is C:. So if you leave /tmp in this field, this directory is C:/tmp.
-
Select the Wait for the Job to complete check box to make your Studio or,
if you use Talend
Jobserver, your Job JVM keep monitoring the Job until the execution of the Job
is over. By selecting this check box, you actually set the spark.yarn.submit.waitAppCompletion property to be true. While
it is generally useful to select this check box when running a Spark Batch Job,
it makes more sense to keep this check box clear when running a Spark Streaming
Job.
-
After the connection is configured, you can, although it is not required, tune
the Spark performance by following the process explained in:-
for Spark Batch Jobs.
-
for Spark Streaming Jobs.
-
-
It is recommended to activate the Spark logging and
checkpointing system in the Spark configuration tab of the Run view of your Spark
Job, in order to help debug and resume your Spark Job when issues arise:-
.
-
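If you need to verify outside the Studio that the Livy service of your HDInsight cluster is reachable with the hostname, the port 443 and the cluster login entered above, the following is a minimal sketch using Python and the requests library; the cluster name and the credentials are placeholders.

    import requests

    # The same host, without the https:// part, goes into the Hostname of Livy field.
    livy_url = "https://your-cluster.azurehdinsight.net/livy/batches"

    # Cluster login user name and password, as defined when creating the cluster.
    response = requests.get(livy_url, auth=("your_cluster_login", "your_password"))
    print(response.status_code)  # 200 indicates that the Livy service answered
    print(response.json())       # the batch sessions currently known to Livy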
Defining the Cloudera Altus connection parameters
Complete the Altus connection configuration in the Spark
configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
Only the Yarn cluster mode is available for this type of cluster.
Prerequisites: the Cloudera Altus CLI must have been installed and activated on the machine in which your
Job is executed:
-
To install the Cloudera Altus CLI on Linux, see Cloudera Altus Client Setup for
Linux from the Cloudera documentation. -
To install the Cloudera Altus CLI on Windows, see Cloudera Altus Client Setup for
Windows from the Cloudera documentation.
-
In the Spark configuration tab of the
Run view of your Job, enter the basic connection
information to Cloudera Altus.
Force Cloudera Altus credentials
Select this check box to provide the credentials with your Job. If you want to provide the credentials separately, for example manually by running the altus configure command in your terminal, clear this check box.
Path to Cloudera Altus CLI
Enter the path to the Cloudera Altus
client, which must have been installed and activated in the
machine in which your Job is executed. In production
environments, this machine is typically a Talend Jobserver. -
Configure the virtual Cloudera cluster to be used.
Use an existing Cloudera Altus cluster
Select this check box to use a Cloudera Altus cluster already existing in your Cloud service. Otherwise, leave this check box clear to allow the Job to create a cluster on the fly. With this check box selected, only the Cluster name parameter is useful and the other parameters for the cluster configuration are hidden.
Cluster name
Enter the name of the cluster to be used.
Environment
Enter the name of the Cloudera Altus environment to be used to describe the resources allocated to the given cluster. If you do not know which environment to select, contact your Cloudera Altus administrator.
Delete cluster after execution
Select this check box if you want to remove the given cluster after the execution of your Job.
Override with a JSON configuration
Select this check box to manually edit JSON code in the Custom JSON field that is displayed to configure the cluster.
Instance type
Select the instance type for the instances in the cluster. All nodes that are deployed in this cluster use the same instance type.
Worker node
Enter the number of worker nodes to be created for the cluster. For details about the allowed number of worker nodes, see the documentation of Cloudera Altus.
Cloudera Manager username and Cloudera Manager password
Enter the authentication information to your Cloudera Manager service.
SSH private key
Browse, or enter the path to the SSH private key in order to upload and register it in the region specified in the Cloudera Altus environment. The Data Engineering service of Cloudera Altus uses this private key to access and configure instances of the cluster to be used.
Custom bootstrap script
If you want to create a cluster with a bootstrap script you provide, browse, or enter the path to this script in the Custom Bootstrap script field. For an example of an Altus bootstrap script, see Install a custom Python environment when creating a cluster from the Cloudera documentation. -
From the Cloud provider list, select the Cloud service
that runs your Cloudera Altus cluster.-
If your cloud provider is AWS, select AWS and
define the Amazon S3 directory in which you store your Job
dependencies.AWS
-
Access key and
Secret key:
enter the authentication information required to connect to the Amazon
S3 bucket to be used.To enter the password, click the […] button next to the
password field, and then in the pop-up dialog box enter the password between double quotes
and click OK to save the settings. -
Specify the AWS region by selecting a region name from the
list or entering a region between double quotation marks (e.g. “us-east-1”) in the list. For more information about the AWS
Region, see Regions and Endpoints. -
S3 bucket name: enter the name of the bucket
to be used to store the dependencies of your Job. This bucket must already
exist. -
S3 storage path: enter the directory in
which you want to store the dependencies of your Job in this given bucket,
for example, altus/jobjar. This directory is created if
it does not exist at runtime.
The Amazon S3 you specify here is used to store your Job dependencies
only. To connect to the S3 system which hosts your actual data, use a
tS3Configuration component in your Job. -
-
If your cloud provider is Azure, select Azure to
store your Job dependencies in your Azure Data Lake Storage.-
In your Azure portal, assign the Read/Write/Execute permissions
to the Azure application to be used by the Job to access your
Azure Data Lake Storage. For details about how to assign
permissions, see Azure documentation: Assign the Azure AD
application to the Azure Data Lake Storage account file or
folder. Without appropriate permissions, your Job dependencies cannot be
transferred to your Azure Data Lake Storage. -
In your Altus console, identify the Data Lake Storage AAD Group
Name used by your Altus environment in the Instance
Settings section. -
In your Azure portal, assign the Read/Write/Execute permissions
to this AAD group using the same procedure explained in Azure
documentation: Assign the Azure AD
application to the Azure Data Lake Storage account file or
folder. Without appropriate permissions, your Job dependencies cannot be
transferred to your Azure Data Lake Storage. -
In the Spark configuration tab, configure
the connection to your Azure Data Lake Storage.
Azure (technical preview)
-
ADLS account FQDN:
Enter the address without the scheme part of the Azure Data Lake Storage
account to be used, for example,
ychendls.azuredatalakestore.net. This account must already exist in your Azure portal.
-
Azure App ID and Azure App
key: In the
Client ID and the Client
key fields, enter, respectively, the authentication
ID and the authentication key generated upon the registration of the
application that the current Job you are developing uses to access
Azure Data Lake Storage. This application must be the one to
which you assigned permissions to access your Azure Data Lake Storage in
the previous step. -
Token endpoint:
In the
Token endpoint field, copy and paste the
OAuth 2.0 token endpoint that you can obtain from the
Endpoints list accessible on the
App registrations page on your Azure
portal. (A sketch showing how this endpoint and the application credentials are used to request a token is given after this list.)
-
The Azure Data Lake Storage you specify here is used to store your Job
dependencies only. To connect to the Azure system which hosts your
actual data, use a tAzureFSConfiguration component in
your Job. -
-
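For reference, the Azure App ID, the Azure App key and the OAuth 2.0 token endpoint entered above are what a client uses to obtain an access token for Azure Data Lake Storage. The following is a minimal sketch with Python and the requests library; the endpoint and the credentials are placeholders to be replaced with the values from your Azure portal.

    import requests

    # OAuth 2.0 token endpoint copied from the Endpoints list of the App registrations page.
    token_endpoint = "https://login.microsoftonline.com/<tenant-id>/oauth2/token"

    payload = {
        "grant_type": "client_credentials",
        "client_id": "<azure-app-id>",        # Azure App ID (Client ID)
        "client_secret": "<azure-app-key>",   # Azure App key (Client key)
        "resource": "https://datalake.azure.net/",
    }

    token = requests.post(token_endpoint, data=payload).json()["access_token"]
    print(token[:20], "...")  # bearer token used to access Azure Data Lake Storage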
-
Select the Wait for the Job to complete check box to make your Studio or,
if you use Talend
Jobserver, your Job JVM keep monitoring the Job until the execution of the Job
is over. By selecting this check box, you actually set the spark.yarn.submit.waitAppCompletion property to be true. While
it is generally useful to select this check box when running a Spark Batch Job,
it makes more sense to keep this check box clear when running a Spark Streaming
Job.
-
After the connection is configured, you can, although it is not required, tune
the Spark performance by following the process explained in:-
for Spark Batch Jobs.
-
for Spark Streaming Jobs.
-
-
It is recommended to activate the Spark logging and
checkpointing system in the Spark configuration tab of the Run view of your Spark
Job, in order to help debug and resume your Spark Job when issues arise:-
.
-
-
If you need to consult the Altus related logs, check them in your Cloudera
Manager service or on your Altus cluster instances.
Configuring a Spark stream for your Apache Spark streaming Job
-
In the Batch size
field, enter the time interval at the end of which the Job reviews the source
data to identify changes and processes the new micro batches. -
If need be, select the Define a streaming
timeout check box and in the field that is displayed, enter the
time frame at the end of which the streaming Job automatically stops
running.
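In Spark terms, the Batch size field corresponds to the batch interval of the StreamingContext and the streaming timeout to a termination timeout. The following is a minimal PySpark sketch for illustration only; the input source and the values are placeholders.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="streaming-sketch")
    ssc = StreamingContext(sc, batchDuration=5)        # Batch size: review the source data every 5 seconds

    lines = ssc.socketTextStream("localhost", 9999)    # hypothetical input stream
    lines.pprint()

    ssc.start()
    ssc.awaitTerminationOrTimeout(3600)                # streaming timeout: stop automatically after one hour
    ssc.stop(stopSparkContext=True, stopGraceFully=True)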
Tuning Spark for Apache Spark Batch Jobs
You can define the tuning parameters in the Spark
configuration tab of the Run view of your Spark
Job to obtain better performance of the Job, if the default values of these parameters
do not produce sufficient performance.
Generally speaking, Spark performs better with a small number of large tasks than with a large number of small tasks.
-
Select the Set Tuning properties
check box to optimize the allocation of the resources to be used to run this
Job. These properties are not mandatory for the Job to run successfully, but
they are useful when Spark is bottlenecked by any resource issue in the cluster
such as CPU, bandwidth or memory. -
Calculate the initial resource allocation as the starting point of the
tuning. A generic formula for this calculation is as follows (a worked example is given after this list):-
Number of executors = (Total cores of the cluster) / 2
-
Number of cores per executor = 2
-
Memory per executor = (Up to total memory of the cluster) / (Number of executors)
-
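A worked example of this formula, assuming a hypothetical cluster with 40 cores and 160 GB of memory in total:

    total_cores = 40
    total_memory_gb = 160

    num_executors = total_cores // 2                           # 40 / 2 = 20 executors
    cores_per_executor = 2
    memory_per_executor_gb = total_memory_gb / num_executors   # up to 160 / 20 = 8 GB per executor

    print(num_executors, cores_per_executor, memory_per_executor_gb)  # 20 2 8.0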
-
Define each parameter and, if needed, revise them until you obtain
satisfactory performance. The following table provides the exhaustive list of the tuning properties.
The actual properties available in the Spark
configuration tab can vary depending on the distribution you
are using. (A sketch of the corresponding Spark properties is given at the end of this section.)
Spark Standalone mode
-
Driver memory and Driver
core: enter the allocation size of memory and
the number of cores to be used by the driver of the current
Job. -
Executor memory: enter the allocation size
of memory to be used by each Spark executor. -
Core per executor: select this check box
and in the displayed field, enter the number of cores to be used
by each executor. If you leave this check box clear, the default
allocation defined by Spark is used, for example, all available
cores are used by one single executor in the
Standalone mode. -
Set Web UI port: if you need to change the
default port of the Spark Web UI, select this check box and
enter the port number you want to use. -
Broadcast factory: select the broadcast
implementation to be used to cache variables on each worker
machine. -
Customize Spark serializer: if you need to
import an external Spark serializer, select this check box and
in the field that is displayed, enter the fully qualified class
name of the serializer to be used. - Job progress polling rate (in
ms): when using Spark V2.3 and onwards, enter the
time interval (in milliseconds) at the end of which you want the
Studio to ask Spark for the execution progress of your Job. Before
V2.3, Spark automatically sends this information to the Studio when
updates occur; the default value, 50 milliseconds, of this parameter
allows the Studio to reproduce more or less the same scenario with
Spark V2.3 and onwards. If you set this interval too long, you may
lose information about the progress; if too short, you may send
too many requests to Spark for only insignificant progress
information.
Spark Yarn client mode
-
Set application master tuning properties:
select this check box and in the fields that are displayed,
enter the amount of memory and the number of CPUs to be
allocated to the ApplicationMaster service of your cluster. If you want to use the default allocation of your cluster, leave
this check box clear. -
Executor memory: enter the allocation size
of memory to be used by each Spark executor. -
Set executor memory overhead: select this
check box and in the field that is displayed, enter the amount
of off-heap memory (in MB) to be allocated per executor. This is
actually the spark.yarn.executor.memoryOverhead property. -
Core per executor: select this check box
and in the displayed field, enter the number of cores to be used
by each executor. If you leave this check box clear, the default
allocation defined by Spark is used, for example, all available
cores are used by one single executor in the
Standalone mode. -
Yarn resource allocation: select how you
want Yarn to allocate resources among executors.-
Auto: you let Yarn use its
default number of executors. This number is
2. -
Fixed: you need to enter the
number of executors to be used in the Num
executors field that is displayed. -
Dynamic: Yarn adapts the
number of executors to suit the workload. You need
to define the scale of this dynamic allocation by
defining the initial number of executors to run in
the Initial executors field,
the lowest number of executors in the Min
executors field and the largest number
of executors in the Max
executors field.
-
-
Set Web UI port: if you need to change the
default port of the Spark Web UI, select this check box and
enter the port number you want to use. -
Broadcast factory: select the broadcast
implementation to be used to cache variables on each worker
machine. -
Customize Spark serializer: if you need to
import an external Spark serializer, select this check box and
in the field that is displayed, enter the fully qualified class
name of the serializer to be used. - Job progress polling rate (in
ms): when using Spark V2.3 and onwards, enter the
time interval (in milliseconds) at the end of which you want the
Studio to ask Spark for the execution progress of your Job. Before
V2.3, Spark automatically sends this information to the Studio when
updates occur; the default value, 50 milliseconds, of this parameter
allows the Studio to reproduce more or less the same scenario with
Spark V2.3 and onwards. If you set this interval too long, you may
lose information about the progress; if too short, you may send
too many requests to Spark for only insignificant progress
information.
Spark Yarn cluster mode
-
Driver memory and Driver
core: enter the allocation size of memory and
the number of cores to be used by the driver of the current
Job. -
Executor memory: enter the allocation size
of memory to be used by each Spark executor. -
Set executor memory overhead: select this
check box and in the field that is displayed, enter the amount
of off-heap memory (in MB) to be allocated per executor. This is
actually the spark.yarn.executor.memoryOverhead property. -
Core per executor: select this check box
and in the displayed field, enter the number of cores to be used
by each executor. If you leave this check box clear, the default
allocation defined by Spark is used, for example, all available
cores are used by one single executor in the
Standalone mode. -
Yarn resource allocation: select how you
want Yarn to allocate resources among executors.-
Auto: you let Yarn use its
default number of executors. This number is
2. -
Fixed: you need to enter the
number of executors to be used in the Num
executors field that is displayed. -
Dynamic: Yarn adapts the
number of executors to suit the workload. You need
to define the scale of this dynamic allocation by
defining the initial number of executors to run in
the Initial executors field,
the lowest number of executors in the Min
executors field and the largest number
of executors in the Max
executors field.
-
-
Set Web UI port: if you need to change the
default port of the Spark Web UI, select this check box and
enter the port number you want to use. -
Broadcast factory: select the broadcast
implementation to be used to cache variables on each worker
machine. -
Customize Spark serializer: if you need to
import an external Spark serializer, select this check box and
in the field that is displayed, enter the fully qualified class
name of the serializer to be used. - Job progress polling rate (in
ms): when using Spark V2.3 and onwards, enter the
time interval (in milliseconds) at the end of which you want the
Studio to ask Spark for the execution progress of your Job. Before
V2.3, Spark automatically sends this information to the Studio when
updates occur; the default value, 50 milliseconds, of this parameter
allows the Studio to reproduce more or less the same scenario with
Spark V2.3 and onwards. If you set this interval too long, you may
lose information about the progress; if too short, you may send
too many requests to Spark for only insignificant progress
information.
-
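As an indication only, the tuning fields described above commonly map to standard Spark properties. The following sketch shows how the equivalent properties would look if set programmatically; the values are placeholders, and in the Studio these settings are made through the Spark configuration tab rather than in code.

    from pyspark import SparkConf

    conf = (
        SparkConf()
        .set("spark.driver.memory", "2g")                    # Driver memory
        .set("spark.driver.cores", "1")                      # Driver core
        .set("spark.executor.memory", "8g")                  # Executor memory
        .set("spark.yarn.executor.memoryOverhead", "1024")   # Set executor memory overhead (MB)
        .set("spark.executor.cores", "2")                    # Core per executor
        .set("spark.dynamicAllocation.enabled", "true")      # Yarn resource allocation: Dynamic
        .set("spark.dynamicAllocation.initialExecutors", "4")
        .set("spark.dynamicAllocation.minExecutors", "2")
        .set("spark.dynamicAllocation.maxExecutors", "20")
    )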
Tuning Spark for Apache Spark Streaming Jobs
You can define the tuning parameters in the Spark
configuration tab of the Run view of your Spark
Job to obtain better performance of the Job, if the default values of these parameters
do not produce sufficient performance.
Generally speaking, Spark performs better with a small number of large tasks than with a large number of small tasks.
-
Select the Set Tuning properties
check box to optimize the allocation of the resources to be used to run this
Job. These properties are not mandatory for the Job to run successfully, but
they are useful when Spark is bottlenecked by any resource issue in the cluster
such as CPU, bandwidth or memory. -
Calculate the initial resource allocation as the starting point of the
tuning. A generic formula for this calculation is as follows:-
Number of executors = (Total cores of the cluster) / 2
-
Number of cores per executor = 2
-
Memory per executor = (Up to total memory of the cluster) /
(Number of executors)
-
-
Define each parameter and, if needed, revise them until you obtain
satisfactory performance.
Spark Standalone mode
-
Driver memory and Driver
core: enter the allocation size of memory and
the number of cores to be used by the driver of the current
Job. -
Executor memory: enter the allocation size
of memory to be used by each Spark executor. -
Core per executor: select this check box
and in the displayed field, enter the number of cores to be used
by each executor. If you leave this check box clear, the default
allocation defined by Spark is used, for example, all available
cores are used by one single executor in the
Standalone mode. -
Set Web UI port: if you need to change the
default port of the Spark Web UI, select this check box and
enter the port number you want to use. -
Broadcast factory: select the broadcast
implementation to be used to cache variables on each worker
machine. -
Customize Spark serializer: if you need to
import an external Spark serializer, select this check box and
in the field that is displayed, enter the fully qualified class
name of the serializer to be used. -
Activate backpressure: select this check
box to enable the backpressure feature of Spark. The
backpressure feature is available in the Spark version 1.5 and
onwards. With backpressure enabled, Spark automatically finds the
optimal receiving rate and dynamically adapts the rate based on
current batch scheduling delays and processing time, in order to
receive data only as fast as it can process.
Spark Yarn client mode
-
Set application master tuning properties:
select this check box and in the fields that are displayed,
enter the amount of memory and the number of CPUs to be
allocated to the ApplicationMaster service of your cluster. If you want to use the default allocation of your cluster, leave
this check box clear. -
Executor memory: enter the allocation size
of memory to be used by each Spark executor. -
Set executor memory overhead: select this
check box and in the field that is displayed, enter the amount
of off-heap memory (in MB) to be allocated per executor. This is
actually the spark.yarn.executor.memoryOverhead property. -
Core per executor: select this check box
and in the displayed field, enter the number of cores to be used
by each executor. If you leave this check box clear, the default
allocation defined by Spark is used, for example, all available
cores are used by one single executor in the
Standalone mode. -
Yarn resource allocation: select how you
want Yarn to allocate resources among executors.-
Auto: you let Yarn use its
default number of executors. This number is
2. -
Fixed: you need to enter the
number of executors to be used in the Num
executors field that is displayed. -
Dynamic: Yarn adapts the
number of executors to suit the workload. You need
to define the scale of this dynamic allocation by
defining the initial number of executors to run in
the Initial executors field,
the lowest number of executors in the Min
executors field and the largest number
of executors in the Max
executors field.
-
-
Set Web UI port: if you need to change the
default port of the Spark Web UI, select this check box and
enter the port number you want to use. -
Broadcast factory: select the broadcast
implementation to be used to cache variables on each worker
machine. -
Customize Spark serializer: if you need to
import an external Spark serializer, select this check box and
in the field that is displayed, enter the fully qualified class
name of the serializer to be used. -
Activate backpressure: select this check
box to enable the backpressure feature of Spark. The
backpressure feature is available in the Spark version 1.5 and
onwards. With backpressure enabled, Spark automatically finds the
optimal receiving rate and dynamically adapts the rate based on
current batch scheduling delays and processing time, in order to
receive data only as fast as it can process.
Spark Yarn cluster mode
-
Driver memory and Driver
core: enter the allocation size of memory and
the number of cores to be used by the driver of the current
Job. -
Executor memory: enter the allocation size
of memory to be used by each Spark executor. -
Set executor memory overhead: select this
check box and in the field that is displayed, enter the amount
of off-heap memory (in MB) to be allocated per executor. This is
actually the spark.yarn.executor.memoryOverhead property. -
Core per executor: select this check box
and in the displayed field, enter the number of cores to be used
by each executor. If you leave this check box clear, the default
allocation defined by Spark is used, for example, all available
cores are used by one single executor in the
Standalone mode. -
Yarn resource allocation: select how you
want Yarn to allocate resources among executors.-
Auto: you let Yarn use its
default number of executors. This number is
2. -
Fixed: you need to enter the
number of executors to be used in the Num
executors field that is displayed. -
Dynamic: Yarn adapts the
number of executors to suit the workload. You need
to define the scale of this dynamic allocation by
defining the initial number of executors to run in
the Initial executors field,
the lowest number of executors in the Min
executors field and the largest number
of executors in the Max
executors field.
-
-
Set Web UI port: if you need to change the
default port of the Spark Web UI, select this check box and
enter the port number you want to use. -
Broadcast factory: select the broadcast
implementation to be used to cache variables on each worker
machine. -
Customize Spark serializer: if you need to
import an external Spark serializer, select this check box and
in the field that is displayed, enter the fully qualified class
name of the serializer to be used. -
Activate backpressure: select this check
box to enable the backpressure feature of Spark. The
backpressure feature is available in the Spark version 1.5 and
onwards. With backpressure enabled, Spark automatically finds the
optimal receiving rate and dynamically adapts the rate based on
current batch scheduling delays and processing time, in order to
receive data only as fast as it can process.
-
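For reference, the backpressure feature described above is controlled in Spark by the spark.streaming.backpressure.enabled property; a minimal sketch for illustration only:

    from pyspark import SparkConf

    # Let Spark adapt the receiving rate to the current batch scheduling delays and processing time.
    conf = SparkConf().set("spark.streaming.backpressure.enabled", "true")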
Logging and checkpointing the activities of your Apache Spark Job
It is recommended to activate the Spark logging and checkpointing system in the Spark configuration tab of the Run
view of your Spark Job, in order to help debug and resume your Spark Job when issues
arise.
-
If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the
Spark checkpointing operation. In the field that is displayed, enter the
directory in which Spark stores, in the file system of the cluster, the context
data of the computations such as the metadata and the generated RDDs of this
computation.For further information about the Spark checkpointing operation, see http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing .
-
In the Yarn client mode or the
Yarn cluster mode, you can enable the Spark
application logs of this Job to be persistent in the file system. To do this,
select the Enable Spark event logging check
box. The parameters relevant to Spark logs are displayed (a sketch of the corresponding Spark properties is given at the end of this section):-
Spark event logs directory:
enter the directory in which Spark events are logged. This is actually
the spark.eventLog.dir property. -
Spark history server address:
enter the location of the history server. This is actually the spark.yarn.historyServer.address
property. -
Compress Spark event logs: if
need be, select this check box to compress the logs. This is actually
the spark.eventLog.compress property.
Since the administrator of your cluster could have
defined these properties in the cluster configuration files, it is recommended
to contact the administrator for the exact values. -
-
If you want the configuration of the Spark context started by your Job to be printed in the log, add the
spark.logConf property in the Advanced
properties table and enter, within double quotation marks,
true in the Value column of
this table.Since the administrator of your cluster could have
defined these properties in the cluster configuration files, it is recommended
to contact the administrator for the exact values.
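For reference, the options described above rely on standard Spark properties. The following sketch shows how the same properties would look if set programmatically; the directory and the server address are placeholders to be replaced with the values provided by your cluster administrator.

    from pyspark import SparkConf

    conf = (
        SparkConf()
        .set("spark.eventLog.enabled", "true")                           # Enable Spark event logging
        .set("spark.eventLog.dir", "hdfs://namenode:8020/spark-logs")    # Spark event logs directory
        .set("spark.yarn.historyServer.address", "historyserver:18080")  # Spark history server address
        .set("spark.eventLog.compress", "true")                          # Compress Spark event logs
        .set("spark.logConf", "true")                                    # print the Spark context configuration in the log
    )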
Adding advanced Spark properties to solve issues
Depending on the distribution you are using or the issues you encounter, you may need to
add specific Spark properties to the Advanced properties table in
the Spark configuration tab of the Run
view of your Job.
Alternatively, define a Hadoop connection metadata in the
Repository and in its wizard, select the Use Spark
properties check box to open the properties table and add the property
or properties to be used, for example, from spark-defaults.conf of
your cluster. When you reuse this connection in your Apache Spark Jobs, the advanced
Spark properties you have added there are automatically added to the Spark
configurations for those Jobs.
The advanced properties required by different Hadoop distributions or by some common
issues and their values are listed below:
For further information about the valid Spark properties, see Spark
documentation at https://spark.apache.org/docs/latest/configuration.
Specific Spark timeout
When encountering network issues, Spark by Add the following properties to

Hortonworks Data Platform V2.4
In addition, you need to add -Dhdp.version=2.4.0.0-169

MapR V5.1 and V5.2
When the cluster is used with the HBase or the MapRDB components:
spark.hadoop.yarn.application.classpath: enter the value of this For example, if the HBase version installed in the cluster is 1.1.1, copy and paste all the paths defined with the spark.hadoop.yarn.application.classpath parameter from your For a step-by-step explanation about how to add this

Security
In the machine where the Studio with Big Data is installed, some scanning