Spark configuration
The Spark configuration view contains the Spark-specific properties you can define for your Job, depending on the distribution and the Spark mode you are using.
The information in this section is only for users who have subscribed to one of the Talend solutions with Big Data and is not applicable to Talend Open Studio for Big Data users.
Defining the EMR connection parameters
Complete the EMR connection configuration in the Spark
configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
Only the Yarn client mode is available for this type of
cluster.
-
Enter the basic connection information to EMR:
Yarn client
The Studio runs the Spark driver to orchestrate how the Job should be
performed and then sends the orchestration to the Yarn service of a given
Hadoop cluster so that the Resource Manager of this Yarn service
requests execution resources accordingly.
If you are using the Yarn client mode, you need to enter the addresses of the
following services in their corresponding fields (if you leave the check box of a
service clear, then at runtime, the configuration of this parameter in the Hadoop
cluster to be used is ignored). A sketch showing how to look up these addresses
is provided at the end of this section:-
In the Resource manager
field, enter the address of the ResourceManager service of the Hadoop cluster to
be used. -
Select the Set resourcemanager
scheduler address check box and enter the Scheduler address in
the field that appears. -
Select the Set jobhistory
address check box and enter the location of the JobHistory
server of the Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server. -
Select the Set staging
directory check box and enter this directory defined in your
Hadoop cluster for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
such as yarn-site.xml or mapred-site.xml of your distribution. -
If you are accessing a Hadoop cluster running with Kerberos security,
select this check box, then enter the Kerberos principal names for the
ResourceManager service and the JobHistory service in the displayed fields. This
enables you to use your user name to authenticate against the credentials stored in
Kerberos. These principals can be found in the configuration files of your
distribution, such as yarn-site.xml and mapred-site.xml.
If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. Enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored on the machine in which your Job actually
runs, for example, on a Talend Jobserver.
Note that the user who executes a keytab-enabled Job is not necessarily
the one the principal designates but must have the right to read the keytab file being
used. For example, if the user name you are using to execute a Job is user1 and
the principal to be used is guest, ensure that user1 has the right to read the keytab
file to be used. -
The User name field is available when you are not using
Kerberos to authenticate. In the User name field, enter the
login user name for your distribution. If you leave it empty, the user name of the machine
hosting the Studio will be used.
Ensure that the user name in the Yarn
client mode is the same one you put in
tS3Configuration, the component used to provide S3
connection information to Spark. -
-
With the Yarn client mode, the
Property type list is displayed to allow you
to select an established Hadoop connection from the Repository, on the condition that you have created this connection
in the Repository. Then the Studio will reuse
that set of connection information for this Job.
-
If you need to launch your Job from Windows, it is recommended to specify where
the winutils.exe program to be used is stored.
-
If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box
and enter the directory where your winutils.exe is
stored. -
Otherwise, leave this check box clear; the Studio generates a winutils.exe file
by itself and automatically uses it for this Job.
-
-
If the Spark cluster cannot recognize the machine in which the Job is
launched, select the Define the driver hostname or IP
address check box and enter the host name or the IP address of this machine.
This allows the Spark master and its workers to recognize this machine to find the Job
and thus its driver.
Note that in this situation, you also need to add the name and the IP address
of this machine to its hosts file. -
In the Spark “scratch” directory field,
enter the directory in which the Studio stores, in the local system, the temporary files
such as the jar files to be transferred. If you launch the Job on Windows, the default
disk is C:. So if you leave /tmp in this field,
this directory is C:/tmp.
-
For Spark Batch Jobs, see Tuning Spark for Apache Spark Batch Jobs.
-
For Spark Streaming Jobs, see Tuning Spark for Apache Spark Streaming Jobs.
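The service addresses described above for the Yarn client mode normally come from the configuration files of the cluster itself. The following is a minimal, hypothetical Java sketch (it is not generated by the Studio) that reads the usual Hadoop property names from local copies of yarn-site.xml and mapred-site.xml with the Hadoop Configuration API; the file paths are placeholders to adapt to your environment.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class YarnAddressLookup {
    public static void main(String[] args) {
        // Load local copies of the cluster configuration files (paths are placeholders).
        Configuration conf = new Configuration(false);
        conf.addResource(new Path("/path/to/yarn-site.xml"));
        conf.addResource(new Path("/path/to/mapred-site.xml"));

        // Values to paste into the corresponding fields of the Spark configuration tab.
        System.out.println("Resource manager:  " + conf.get("yarn.resourcemanager.address"));
        System.out.println("Scheduler address: " + conf.get("yarn.resourcemanager.scheduler.address"));
        System.out.println("JobHistory:        " + conf.get("mapreduce.jobhistory.address"));
        System.out.println("Staging directory: " + conf.get("yarn.app.mapreduce.am.staging-dir"));
    }
}
```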
Defining the Cloudera connection parameters
Complete the Cloudera connection configuration in the Spark
configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
-
Select the type of the Spark cluster you need to connect to.
Standalone
The Studio connects to a Spark-enabled cluster to run the Job from this
cluster.
If you are using the Standalone mode, you need to
set the following parameters:-
In the Spark host field, enter the URI of
the Spark Master of the Hadoop cluster to be used (see the URI sketch at the end of this section). -
In the Spark home field, enter the location
of the Spark executable installed in the Hadoop cluster to be used.
Yarn client
The Studio runs the Spark driver to orchestrate how the Job should be
performed and then sends the orchestration to the Yarn service of a given
Hadoop cluster so that the Resource Manager of this Yarn service
requests execution resources accordingly.
If you are using the Yarn client mode, you need to enter the addresses of the
following services in their corresponding fields (if you leave the check box of a
service clear, then at runtime, the configuration of this parameter in the Hadoop
cluster to be used is ignored):-
In the Resource manager
field, enter the address of the ResourceManager service of the Hadoop cluster to
be used. -
Select the Set resourcemanager
scheduler address check box and enter the Scheduler address in
the field that appears. -
Select the Set jobhistory
address check box and enter the location of the JobHistory
server of the Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server. -
Select the Set staging
directory check box and enter this directory defined in your
Hadoop cluster for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
such as yarn-site.xml or mapred-site.xml of your distribution. -
If you are accessing a Hadoop cluster running with Kerberos security,
select this check box, then enter the Kerberos principal names for the
ResourceManager service and the JobHistory service in the displayed fields. This
enables you to use your user name to authenticate against the credentials stored in
Kerberos. These principals can be found in the configuration files of your
distribution, such as yarn-site.xml and mapred-site.xml.
If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. Enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored on the machine in which your Job actually
runs, for example, on a Talend Jobserver.
Note that the user who executes a keytab-enabled Job is not necessarily
the one the principal designates but must have the right to read the keytab file being
used. For example, if the user name you are using to execute a Job is user1 and
the principal to be used is guest, ensure that user1 has the right to read the keytab
file to be used. -
The User name field is available when you are not using
Kerberos to authenticate. In the User name field, enter the
login user name for your distribution. If you leave it empty, the user name of the machine
hosting the Studio will be used.
Ensure that the user name in the Yarn
client mode is the same one you put in
tHDFSConfiguration, the component used to provide HDFS
connection information to Spark. -
-
With the Yarn client mode, the
Property type list is displayed to allow you
to select an established Hadoop connection from the Repository, on the condition that you have created this connection
in the Repository. Then the Studio will reuse
that set of connection information for this Job.
-
If you need to launch your Job from Windows, it is recommended to specify where
the winutils.exe program to be used is stored.
-
If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box
and enter the directory where your winutils.exe is
stored. -
Otherwise, leave this check box clear; the Studio generates a winutils.exe file
by itself and automatically uses it for this Job.
-
-
If the Spark cluster cannot recognize the machine in which the Job is
launched, select the Define the driver hostname or IP
address check box and enter the host name or the IP address of this machine.
This allows the Spark master and its workers to recognize this machine to find the Job
and thus its driver.
Note that in this situation, you also need to add the name and the IP address
of this machine to its hosts file. -
In the Spark “scratch” directory field,
enter the directory in which the Studio stores, in the local system, the temporary files
such as the jar files to be transferred. If you launch the Job on Windows, the default
disk is C:. So if you leave /tmp in this field,
this directory is C:/tmp.
-
For Spark Batch Jobs, see Tuning Spark for Apache Spark Batch Jobs.
-
For Spark Streaming Jobs, see Tuning Spark for Apache Spark Streaming Jobs.
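For the Standalone mode described above, the Spark host field expects the URI of the Spark Master. The following minimal sketch only illustrates the usual URI format; the host name, port and application name are placeholders, and the code is not what the Studio generates.
```java
import org.apache.spark.SparkConf;

public class StandaloneMasterUriSketch {
    public static void main(String[] args) {
        // The Spark host field expects the Spark Master URI, which in Standalone mode
        // typically has the form spark://<master_host>:7077 (host and port are placeholders).
        SparkConf conf = new SparkConf()
                .setMaster("spark://master.example.com:7077")
                .setAppName("talend_spark_job_sketch");
        System.out.println("spark.master = " + conf.get("spark.master"));
    }
}
```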
Defining the Dataproc connection parameters
Complete the Google Dataproc connection configuration in the Spark
configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
Only the Yarn client mode is available for this type of cluster.
-
Enter the basic connection information to Dataproc:
Project identifier
Enter the ID of your Google Cloud Platform project.
If you are not certain about your project ID, check it in the Manage
Resources page of your Google Cloud Platform services.
Cluster identifier
Enter the ID of your Dataproc cluster to be used.
Region
Enter the geographic zones in which the computing resources are used and your
data is stored and processed. If you do not need to specify a particular
region, leave the default value global.
For further information about the available regions and the zones each region
groups, see Regions and Zones.
Google Storage staging bucket
Because a Talend Job expects its dependent jar files to be available at execution, specify the Google Storage directory to
which these jar files are transferred so that your Job can access these
files at execution.
The directory to be entered must end with a slash (/). If it does not exist, the
directory is created on the fly, but the bucket to be used must already
exist. -
Provide the authentication information to your Google Dataproc cluster:
Provide Google Credentials in file
Leave this check box clear when you
launch your Job from a given machine in which the Google Cloud SDK has been
installed and authorized to use your user account credentials to access
Google Cloud Platform. In this situation, this machine is often your
local machine.
When you launch your Job from a remote
machine, such as a Jobserver, select this check box and, in the
Path to Google Credentials file field that is
displayed, enter the directory in which this JSON file is stored in the
Jobserver machine.
For further information about this Google
Credentials file, see the administrator of your Google Cloud Platform or
visit the Google Cloud Platform Auth
Guide. -
With the Yarn client mode, the
Property type list is displayed to allow you
to select an established Hadoop connection from the Repository, on the condition that you have created this connection
in the Repository. Then the Studio will reuse
that set of connection information for this Job.
-
In the Spark “scratch” directory field,
enter the directory in which the Studio stores, in the local system, the temporary files
such as the jar files to be transferred. If you launch the Job on Windows, the default
disk is C:. So if you leave /tmp in this field,
this directory is C:/tmp.
-
For Spark Batch Jobs, see Tuning Spark for Apache Spark Batch Jobs.
-
For Spark Streaming Jobs, see Tuning Spark for Apache Spark Streaming Jobs.
Defining the Hortonworks connection parameters
Complete the Hortonworks connection configuration in the Spark
configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
Only the Yarn client mode is available for this type of
cluster.
-
Enter the basic connection information to Hortonworks:
Yarn client
The Studio runs the Spark driver to orchestrate how the Job should be
performed and then sends the orchestration to the Yarn service of a given
Hadoop cluster so that the Resource Manager of this Yarn service
requests execution resources accordingly.
If you are using the Yarn client mode, you need to enter the addresses of the
following services in their corresponding fields (if you leave the check box of a
service clear, then at runtime, the configuration of this parameter in the Hadoop
cluster to be used is ignored):-
In the Resource manager
field, enter the address of the ResourceManager service of the Hadoop cluster to
be used. -
Select the Set resourcemanager
scheduler address check box and enter the Scheduler address in
the field that appears. -
Select the Set jobhistory
address check box and enter the location of the JobHistory
server of the Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server. -
Select the Set staging
directory check box and enter this directory defined in your
Hadoop cluster for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
such as yarn-site.xml or mapred-site.xml of your distribution. -
If you are accessing a Hadoop cluster running with Kerberos security,
select this check box, then enter the Kerberos principal names for the
ResourceManager service and the JobHistory service in the displayed fields. This
enables you to use your user name to authenticate against the credentials stored in
Kerberos. These principals can be found in the configuration files of your
distribution, such as yarn-site.xml and mapred-site.xml.
If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. Enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored on the machine in which your Job actually
runs, for example, on a Talend Jobserver.
Note that the user who executes a keytab-enabled Job is not necessarily
the one the principal designates but must have the right to read the keytab file being
used. For example, if the user name you are using to execute a Job is user1 and
the principal to be used is guest, ensure that user1 has the right to read the keytab
file to be used. A sketch for checking a principal and keytab pair outside the Studio is provided at the end of this section. -
The User name field is available when you are not using
Kerberos to authenticate. In the User name field, enter the
login user name for your distribution. If you leave it empty, the user name of the machine
hosting the Studio will be used.
Ensure that the user name in the Yarn
client mode is the same one you put in
tHDFSConfiguration, the component used to provide HDFS
connection information to Spark. -
-
With the Yarn client mode, the
Property type list is displayed to allow you
to select an established Hadoop connection from the Repository, on the condition that you have created this connection
in the Repository. Then the Studio will reuse
that set of connection information for this Job.
-
If you need to launch your Job from Windows, it is recommended to specify where
the winutils.exe program to be used is stored.
-
If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box
and enter the directory where your winutils.exe is
stored. -
Otherwise, leave this check box clear; the Studio generates a winutils.exe file
by itself and automatically uses it for this Job.
-
-
If the Spark cluster cannot recognize the machine in which the Job is
launched, select the Define the driver hostname or IP
address check box and enter the host name or the IP address of this machine.
This allows the Spark master and its workers to recognize this machine to find the Job
and thus its driver.
Note that in this situation, you also need to add the name and the IP address
of this machine to its hosts file. -
In the Spark “scratch” directory field,
enter the directory in which the Studio stores, in the local system, the temporary files
such as the jar files to be transferred. If you launch the Job on Windows, the default
disk is C:. So if you leave /tmp in this field,
this directory is C:/tmp.
-
For Spark Batch Jobs, see Tuning Spark for Apache Spark Batch Jobs.
-
For Spark Streaming Jobs, see Tuning Spark for Apache Spark Streaming Jobs.
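As announced above, here is a minimal, hypothetical sketch (not part of the Studio) for checking, outside the Studio, that a principal and keytab pair works and that the user running the check can read the keytab file; the principal, keytab path and security settings are placeholders.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KeytabLoginCheck {
    public static void main(String[] args) throws Exception {
        // Placeholders: use the same values as in the Principal and Keytab fields.
        String principal = "guest@EXAMPLE.COM";
        String keytabPath = "/path/to/guest.keytab";

        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Fails if the principal is wrong or the current OS user cannot read the keytab file.
        UserGroupInformation.loginUserFromKeytab(principal, keytabPath);
        System.out.println("Logged in as: " + UserGroupInformation.getLoginUser());
    }
}
```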
Defining the MapR connection parameters
Complete the MapR connection configuration in the Spark
configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
-
Select the type of the Spark cluster you need to connect to.
Standalone
The Studio connects to a Spark-enabled cluster to run the Job from this
cluster.
If you are using the Standalone mode, you need to
set the following parameters:-
In the Spark host field, enter the URI of
the Spark Master of the Hadoop cluster to be used. -
In the Spark home field, enter the location
of the Spark executable installed in the Hadoop cluster to be used.
Yarn client
The Studio runs the Spark driver to orchestrate how the Job should be
performed and then sends the orchestration to the Yarn service of a given
Hadoop cluster so that the Resource Manager of this Yarn service
requests execution resources accordingly.
If you are using the Yarn client mode, you need to enter the addresses of the
following services in their corresponding fields (if you leave the check box of a
service clear, then at runtime, the configuration of this parameter in the Hadoop
cluster to be used is ignored):-
In the Resource manager
field, enter the address of the ResourceManager service of the Hadoop cluster to
be used. -
Select the Set resourcemanager
scheduler address check box and enter the Scheduler address in
the field that appears. -
Select the Set jobhistory
address check box and enter the location of the JobHistory
server of the Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server. -
Select the Set staging
directory check box and enter this directory defined in your
Hadoop cluster for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
such as yarn-site.xml or mapred-site.xml of your distribution. -
If you are accessing a Hadoop cluster running with Kerberos security,
select this check box, then enter the Kerberos principal names for the
ResourceManager service and the JobHistory service in the displayed fields. This
enables you to use your user name to authenticate against the credentials stored in
Kerberos. These principals can be found in the configuration files of your
distribution, such as yarn-site.xml and mapred-site.xml.
If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. Enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored on the machine in which your Job actually
runs, for example, on a Talend Jobserver.
Note that the user who executes a keytab-enabled Job is not necessarily
the one the principal designates but must have the right to read the keytab file being
used. For example, if the user name you are using to execute a Job is user1 and
the principal to be used is guest, ensure that user1 has the right to read the keytab
file to be used. -
The User name field is available when you are not using
Kerberos to authenticate. In the User name field, enter the
login user name for your distribution. If you leave it empty, the user name of the machine
hosting the Studio will be used.
Ensure that the user name in the Yarn
client mode is the same one you put in
tHDFSConfiguration, the component used to provide HDFS
connection information to Spark. -
-
With the Yarn client mode, the
Property type list is displayed to allow you
to select an established Hadoop connection from the Repository, on the condition that you have created this connection
in the Repository. Then the Studio will reuse
that set of connection information for this Job.
-
If you need to launch your Job from Windows, it is recommended to specify where
the winutils.exe program to be used is stored.
-
If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box
and enter the directory where your winutils.exe is
stored. -
Otherwise, leave this check box clear; the Studio generates a winutils.exe file
by itself and automatically uses it for this Job.
-
-
If the Spark cluster cannot recognize the machine in which the Job is
launched, select the Define the driver hostname or IP
address check box and enter the host name or the IP address of this machine.
This allows the Spark master and its workers to recognize this machine to find the Job
and thus its driver.
Note that in this situation, you also need to add the name and the IP address
of this machine to its hosts file. -
Verify, for example with your cluster administrator, whether your MapR cluster
is secured with the MapR ticket authentication mechanism.-
If the MapR cluster to be used is secured with the MapR ticket authentication mechanism,
set the MapR ticket authentication configuration by following the explanation in Setting up the MapR ticket authentication. -
Otherwise, leave the Use MapR Ticket authentication check box clear.
-
-
In the Spark “scratch” directory field,
enter the directory in which the Studio stores, in the local system, the temporary files
such as the jar files to be transferred. If you launch the Job on Windows, the default
disk is C:. So if you leave /tmp in this field,
this directory is C:/tmp.
-
For Spark Batch Jobs, see Tuning Spark for Apache Spark Batch Jobs.
-
For Spark Streaming Jobs, see Tuning Spark for Apache Spark Streaming Jobs.
Defining the HD Insight connection parameters
Complete the HD Insight connection configuration in the Spark
configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
Only the Yarn client mode is available for this type of
cluster.
-
Enter the basic connection information to Microsoft HD Insight:
Livy configuration
The Hostname of Livy uses
the following syntax: your_spark_cluster_name.azurehdinsight.net. For further
information about the Livy service used by HD Insight, see Submit Spark
jobs using Livy.
HDInsight configuration
Enter the authentication information of the HD Insight cluster to be
used.
Windows Azure Storage configuration
Enter the address and the authentication information of the Azure Storage
account to be used.
In the Container field, enter the name
of the container to be used.
In the Deployment Blob field, enter the
location in which you want to store the current Job and its dependent libraries in
this Azure Storage account. -
With the Yarn client mode, the
Property type list is displayed to allow you
to select an established Hadoop connection from the Repository, on the condition that you have created this connection
in the Repository. Then the Studio will reuse
that set of connection information for this Job.
-
In the Spark “scratch” directory field,
enter the directory in which the Studio stores, in the local system, the temporary files
such as the jar files to be transferred. If you launch the Job on Windows, the default
disk is C:. So if you leave /tmp in this field,
this directory is C:/tmp.
-
For Spark Batch Jobs, see Tuning Spark for Apache Spark Batch Jobs.
-
For Spark Streaming Jobs, see Tuning Spark for Apache Spark Streaming Jobs.
Defining the Cloudera Altus connection parameters (technical preview)
Complete the Altus connection configuration in the Spark
configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
Only the Yarn cluster mode is available for this type of cluster.
-
To install the Cloudera Altus CLI on Linux, see Cloudera Altus Client Setup for Linux from the Cloudera documentation.
-
To install the Cloudera Altus CLI on Windows, see Cloudera Altus Client Setup for Windows from the Cloudera documentation.
-
In the Spark configuration tab of the
Run view of your Job, enter the basic connection
information to Cloudera Altus.
Force Cloudera Altus credentials
Select this check box to provide the credentials with your
Job.
If you want to provide the credentials separately, for example
manually using the command altus configure
in your terminal, clear this check box.
Path to Cloudera Altus CLI
Enter the path to the Cloudera Altus client, which must have been
installed and activated in the machine in which your Job is
executed. In production environments, this machine is typically
a Talend Jobserver. -
Configure the virtual Cloudera cluster to be used.
Use an existing Cloudera Altus cluster
Select this check box to use a Cloudera Altus cluster already
existing in your Cloud service. Otherwise, leave this check box
clear to allow the Job to create a cluster on the fly.
With this check box selected, only the Cluster name parameter is useful and the other parameters for the cluster configuration are hidden.
Cluster name
Enter the name of the cluster to be used.
Environment
Enter the name of the Cloudera Altus environment to be used to describe
the resources allocated to the given cluster.
If you do not know which environment to select, contact
your Cloudera Altus administrator.
Delete cluster after execution
Select this check box if you want to remove the given cluster after the execution of your Job.
Override with a JSON configuration
Select this check box to manually edit JSON code in the Custom JSON field
that is displayed to configure the cluster.
Instance type
Select the instance type for the instances in the cluster. All
nodes that are deployed in this cluster use the same instance
type.
Worker node
Enter the number of worker nodes to be created for the
cluster.
For details about the allowed number of worker nodes, see the documentation of Cloudera Altus.
Cloudera Manager username and Cloudera Manager password
Enter the authentication information to your Cloudera Manager
service.
SSH private key
Browse, or enter the path to the SSH private key in order to
upload and register it in the region specified in the Cloudera
Altus environment.
The Data Engineering service of Cloudera Altus uses this private key to
access and configure instances of the cluster to be used. -
From the Cloud provider list, select the Cloud service
that runs your Cloudera Altus cluster. Currently, only AWS is available.
AWS
-
Access key and Secret key: enter the authentication
information required to connect to the Amazon S3 bucket to be used.
To enter the password, click the […] button next to the
password field, and then in the pop-up dialog box enter the password between double quotes
and click OK to save the settings. -
Specify the AWS region by selecting a region name from the list or entering
a region between double quotation marks (e.g. “us-east-1”) in the
list. For more information about the AWS Region, see Regions and Endpoints. -
S3 bucket name: enter the name of the bucket to be
used to store the dependencies of your Job. This bucket must already
exist. -
S3 storage path: enter the directory in which you want
to store the dependencies of your Job in this given bucket, for example,
altus/jobjar. This directory is created at runtime if it
does not exist.
The Amazon S3 location you specify here is used to store your Job dependencies only.
To connect to the S3 system which hosts your actual data, use a
tS3Configuration component in your Job. -
-
For Spark Batch Jobs, see Tuning Spark for Apache Spark Batch Jobs.
-
For Spark Streaming Jobs, see Tuning Spark for Apache Spark Streaming Jobs.
Configuring a Spark stream for your Apache Spark streaming Job
-
In the Batch size field, enter the time
interval at the end of which the Job reviews the source data to identify changes and
processes the new micro batches. -
If need be, select the Define a streaming
timeout check box and in the field that is displayed, enter the time frame
at the end of which the streaming Job automatically stops running.
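To illustrate what these two fields control in Spark terms, here is a minimal, hypothetical Spark Streaming sketch (it is not the code the Studio generates); the batch interval of 5 seconds, the timeout of 60 seconds and the socket source are placeholder values.
```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingBatchAndTimeoutSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
                .setMaster("local[2]")
                .setAppName("streaming_batch_sketch");

        // Batch size: the interval at which micro batches are created and processed.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Placeholder source and output operation so the context can start.
        jssc.socketTextStream("localhost", 9999).print();

        jssc.start();
        // Streaming timeout: stop the streaming Job after the given time frame.
        jssc.awaitTerminationOrTimeout(60_000);
        jssc.stop(true, true);
    }
}
```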
Tuning Spark for Apache Spark Batch Jobs
You can define the tuning parameters in the Spark
configuration tab of the Run view of your Spark
Job to obtain better performance of the Job, if the default values of these parameters
do not produce sufficient performance.
Generally speaking, Spark performs better with a small number of large tasks than with a large number of small tasks.
-
Select the Set Tuning properties check
box to optimize the allocation of the resources to be used to run this Job.
These properties are not mandatory for the Job to run successfully, but they are
useful when Spark is bottlenecked by any resource issue in the cluster such as
CPU, bandwidth or memory. -
Calculate the initial resource allocation as the point to start the
tuning. A generic formula for this calculation is:-
Number of executors = (Total cores of the cluster) / 2
-
Number of cores per executor = 2
-
Memory per executor = (Up to total memory of the cluster) / (Number of executors)
-
-
Define each parameter and, if needed, revise them until you obtain
satisfactory performance. A sketch mapping these fields to the underlying Spark properties follows this section.
Spark Standalone mode
-
Driver memory and
Driver core: enter the allocation size
of memory and the number of cores to be used by the driver of the current
Job. -
Executor memory: enter
the allocation size of memory to be used by each Spark executor. -
Core per executor: select
this check box and in the displayed field, enter the number of cores to be used
by each executor. If you leave this check box clear, the default allocation
defined by Spark is used, for example, all available cores are used by one
single executor in the Standalone
mode. -
Set Web UI port: if you
need to change the default port of the Spark Web UI, select this check box and
enter the port number you want to use. -
Broadcast factory: select
the broadcast implementation to be used to cache variables on each worker
machine. -
Customize Spark
serializer: if you need to import an external Spark serializer,
select this check box and in the field that is displayed, enter the fully
qualified class name of the serializer to be used.
Spark Yarn client mode
-
Executor memory: enter
the allocation size of memory to be used by each Spark executor. -
Set executor memory overhead:
select this check box and in the field that is displayed, enter the amount of
off-heap memory (in MB) to be allocated per executor. This is actually the spark.yarn.executor.memoryOverhead property. -
Core per executor: select
this check box and in the displayed field, enter the number of cores to be used
by each executor. If you leave this check box clear, the default allocation
defined by Spark is used, for example, all available cores are used by one
single executor in the Standalone
mode. -
Yarn resource allocation:
select how you want Yarn to allocate resources among executors.-
Auto:
you let Yarn use its default number of executors. This number is 2. -
Fixed: you need to enter the number of executors to be used
in the Num executors field that
is displayed. -
Dynamic: Yarn adapts the number of executors to suit the
workload. You need to define the scale of this dynamic allocation by
defining the initial number of executors to run in the Initial executors field, the lowest
number of executors in the Min
executors field and the largest number of executors in the
Max executors field.
-
-
Set Web UI port: if you
need to change the default port of the Spark Web UI, select this check box and
enter the port number you want to use. -
Broadcast factory: select
the broadcast implementation to be used to cache variables on each worker
machine. -
Customize Spark
serializer: if you need to import an external Spark serializer,
select this check box and in the field that is displayed, enter the fully
qualified class name of the serializer to be used.
-
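As a rough illustration only, the following sketch sets the standard Spark properties that these tuning fields most likely correspond to (this mapping is an assumption, except for the memory overhead property named above) with values derived from the formula for a hypothetical cluster of 40 cores and 160 GB of memory in total; the Studio applies the values you enter in the tab for you.
```java
import org.apache.spark.SparkConf;

public class TuningSketch {
    public static void main(String[] args) {
        // Hypothetical cluster: 40 cores and 160 GB of memory in total.
        // Formula: 40 / 2 = 20 executors, 2 cores per executor, 160 GB / 20 = 8 GB per executor.
        SparkConf conf = new SparkConf()
                .set("spark.driver.memory", "2g")                   // Driver memory
                .set("spark.driver.cores", "1")                     // Driver core
                .set("spark.executor.memory", "8g")                 // Executor memory
                .set("spark.executor.cores", "2")                   // Core per executor
                .set("spark.yarn.executor.memoryOverhead", "1024")  // Set executor memory overhead (MB)
                .set("spark.ui.port", "4040")                       // Set Web UI port
                .set("spark.serializer",                            // Customize Spark serializer
                     "org.apache.spark.serializer.KryoSerializer")
                // Yarn resource allocation set to Dynamic:
                .set("spark.dynamicAllocation.enabled", "true")
                .set("spark.dynamicAllocation.initialExecutors", "5")
                .set("spark.dynamicAllocation.minExecutors", "2")
                .set("spark.dynamicAllocation.maxExecutors", "20");
        System.out.println(conf.toDebugString());
    }
}
```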
Tuning Spark for Apache Spark Streaming Jobs
You can define the tuning parameters in the Spark
configuration tab of the Run view of your Spark
Job to obtain better performance of the Job, if the default values of these parameters
do not produce sufficient performance.
Generally speaking, Spark performs better with a small number of large tasks than with a large number of small tasks.
-
Select the Set Tuning properties check
box to optimize the allocation of the resources to be used to run this Job.
These properties are not mandatory for the Job to run successfully, but they are
useful when Spark is bottlenecked by any resource issue in the cluster such as
CPU, bandwidth or memory. -
Calculate the initial resource allocation as the point to start the
tuning. A generic formula for this calculation is:-
Number of executors = (Total cores of the cluster) / 2
-
Number of cores per executor = 2
-
Memory per executor = (Up to total memory of the cluster) / (Number of executors)
-
-
Define each parameter and, if needed, revise them until you obtain
satisfactory performance.
Spark Standalone mode
-
Driver memory and
Driver core: enter the allocation size
of memory and the number of cores to be used by the driver of the current
Job. -
Executor memory: enter
the allocation size of memory to be used by each Spark executor. -
Core per executor: select
this check box and in the displayed field, enter the number of cores to be used
by each executor. If you leave this check box clear, the default allocation
defined by Spark is used, for example, all available cores are used by one
single executor in the Standalone
mode. -
Set Web UI port: if you
need to change the default port of the Spark Web UI, select this check box and
enter the port number you want to use. -
Broadcast factory: select
the broadcast implementation to be used to cache variables on each worker
machine. -
Customize Spark
serializer: if you need to import an external Spark serializer,
select this check box and in the field that is displayed, enter the fully
qualified class name of the serializer to be used. -
Activate backpressure:
select this check box to enable the backpressure feature of Spark. The
backpressure feature is available in Spark version 1.5 and onwards. With
backpressure enabled, Spark automatically finds the optimal receiving rate and
dynamically adapts the rate based on current batch scheduling delays and
processing time, in order to receive data only as fast as it can process.
Spark Yarn client mode
-
Executor memory: enter
the allocation size of memory to be used by each Spark executor. -
Set executor memory overhead:
select this check box and in the field that is displayed, enter the amount of
off-heap memory (in MB) to be allocated per executor. This is actually the spark.yarn.executor.memoryOverhead property. -
Core per executor: select
this check box and in the displayed field, enter the number of cores to be used
by each executor. If you leave this check box clear, the default allocation
defined by Spark is used, for example, all available cores are used by one
single executor in the Standalone
mode. -
Yarn resource allocation:
select how you want Yarn to allocate resources among executors.-
Auto:
you let Yarn use its default number of executors. This number is 2. -
Fixed: you need to enter the number of executors to be used
in the Num executors field that
is displayed. -
Dynamic: Yarn adapts the number of executors to suit the
workload. You need to define the scale of this dynamic allocation by
defining the initial number of executors to run in the Initial executors field, the lowest
number of executors in the Min
executors field and the largest number of executors in the
Max executors field.
-
-
Set Web UI port: if you
need to change the default port of the Spark Web UI, select this check box and
enter the port number you want to use. -
Broadcast factory: select
the broadcast implementation to be used to cache variables on each worker
machine. -
Customize Spark
serializer: if you need to import an external Spark serializer,
select this check box and in the field that is displayed, enter the fully
qualified class name of the serializer to be used. -
Activate backpressure:
select this check box to enable the backpressure feature of Spark. The
backpressure feature is available in Spark version 1.5 and onwards. With
backpressure enabled, Spark automatically finds the optimal receiving rate and
dynamically adapts the rate based on current batch scheduling delays and
processing time, in order to receive data only as fast as it can process.
-
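As a minimal illustration, the Activate backpressure check box most likely corresponds to the standard Spark Streaming backpressure property shown below (a hypothetical sketch, not the code the Studio generates; the application name is a placeholder).
```java
import org.apache.spark.SparkConf;

public class BackpressureSketch {
    public static void main(String[] args) {
        // Backpressure (Spark 1.5+): let Spark adapt the receiving rate to the processing rate.
        SparkConf conf = new SparkConf()
                .setAppName("backpressure_sketch")
                .set("spark.streaming.backpressure.enabled", "true");
        System.out.println("spark.streaming.backpressure.enabled = "
                + conf.get("spark.streaming.backpressure.enabled"));
    }
}
```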
Logging and checkpointing the activities of your Apache Spark Job
You can define checkpointing and logging properties in the
Spark configuration tab of the Run
view of your Spark Job, in order to help debug and resume your Spark Job when issues
arise.
-
If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the
Spark checkpointing operation. In the field that is displayed, enter the
directory in which Spark stores, in the file system of the cluster, the context
data of the computations such as the metadata and the generated RDDs of this
computation.
For further information about the Spark checkpointing operation, see http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing.
-
In the Yarn client mode, you can enable
the Spark application logs of this Job to be persistent in the file system. To do this,
select the Enable Spark event logging check box.
The parameters relevant to Spark logs are displayed (a sketch of the corresponding properties follows this section):-
Spark event logs
directory: enter the directory in which Spark events are logged.
This is actually the spark.eventLog.dir property. -
Spark history server
address: enter the location of the history server. This is actually
the spark.yarn.historyServer.address property. -
Compress Spark event
logs: if need be, select this check box to compress the logs. This
is actually the spark.eventLog.compress property.
Since the administrator of your cluster could have defined these properties in
the cluster configuration files, it is recommended to contact the administrator for the
exact values. -
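For orientation, here is a hypothetical sketch showing what these settings mean at the level of the Spark API and of the properties named above; the checkpoint directory, log directory and history server address are placeholders, and the Studio applies the values you enter in the tab for you.
```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class LoggingAndCheckpointingSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setMaster("local[2]")
                .setAppName("logging_checkpointing_sketch")
                // Spark event logging and the related properties named in the list above.
                .set("spark.eventLog.enabled", "true")
                .set("spark.eventLog.dir", "hdfs://namenode:8020/spark-logs")             // Spark event logs directory
                .set("spark.yarn.historyServer.address", "historyserver.example:18080")   // Spark history server address
                .set("spark.eventLog.compress", "true");                                  // Compress Spark event logs

        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));
        // Activate checkpointing: Spark stores the context data of the computations
        // (metadata and generated RDDs) in this directory of the cluster file system.
        jssc.checkpoint("hdfs://namenode:8020/spark-checkpoints");

        jssc.stop(true, false);
    }
}
```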
Adding advanced Spark properties to solve issues
Depending on the distribution you are using or the issues you encounter, you may need to
add specific Spark properties to the Advanced properties table in
the Spark configuration tab of the Run
view of your Job.
The advanced properties required by different Hadoop distributions and their values are listed below:
For further information about the valid Spark properties, see Spark
documentation at https://spark.apache.org/docs/latest/configuration.
Distribution | Advanced properties
Hortonworks Data Platform V2.4 | In addition, you need to add -Dhdp.version=2.4.0.0-169 to the JVM arguments.
MapR V5.1 and V5.2 | When the cluster is used with the HBase or the MapRDB components, add the spark.hadoop.yarn.application.classpath property and enter the value of this parameter for your cluster. For example, if the HBase version installed in the cluster is 1.1.1, copy and paste all the paths defined with the spark.hadoop.yarn.application.classpath parameter from your cluster and then add /opt/mapr/hbase/hbase-1.1.1/lib/* and /opt/mapr/lib/* to these paths, separating each path with a comma (,). For a step-by-step explanation about how to add this parameter, and for more details about how to run HBase/MapR-DB on Spark with a MapR distribution, see Talend Help Center (https://help.talend.com).
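As a hypothetical illustration of the MapR row above, such an advanced property amounts to a Spark key/value pair like the following; the cluster classpath value is a placeholder, and in the Studio you enter the key and value in the Advanced properties table rather than in code.
```java
import org.apache.spark.SparkConf;

public class AdvancedPropertySketch {
    public static void main(String[] args) {
        // Placeholder: the classpath copied from your cluster, with the MapR HBase paths appended.
        String clusterClasspath = "/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop"; // hypothetical value
        SparkConf conf = new SparkConf()
                .setAppName("advanced_property_sketch")
                .set("spark.hadoop.yarn.application.classpath",
                     clusterClasspath + ",/opt/mapr/hbase/hbase-1.1.1/lib/*,/opt/mapr/lib/*");
        System.out.println(conf.get("spark.hadoop.yarn.application.classpath"));
    }
}
```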