
Spark configuration – Docs for ESB 6.x

Spark configuration

The Spark configuration view contains the Spark-specific properties that you can define for your Job, depending on the distribution and the Spark mode you are using.

The information in this section is only for users who have subscribed to one of the Talend solutions with Big Data and is not applicable to Talend Open Studio for Big Data users.

Defining the EMR connection parameters

Complete the EMR connection configuration in the Spark
configuration
tab of the Run view of your Job.
This configuration is effective on a per-Job basis.

Only the Yarn client mode is available for this type of
cluster.

  1. Enter the basic connection information to EMR:

    Yarn client

    The Studio runs the Spark driver to orchestrate how the Job should be
    performed and then sends the orchestration to the Yarn service of a given
    Hadoop cluster so that the Resource Manager of this Yarn service
    requests execution resources accordingly.

    If you are using the Yarn client
    mode, you need to enter the addresses of the following different
    services in their corresponding fields (if you leave the check box of a
    service clear, the configuration for this parameter in the Hadoop cluster
    to be used is ignored at runtime):

    • In the Resource manager
      field, enter the address of the ResourceManager service of the Hadoop cluster to
      be used.

    • Select the Set resourcemanager
      scheduler address
      check box and enter the Scheduler address in
      the field that appears.

    • Select the Set jobhistory
      address
      check box and enter the location of the JobHistory
      server of the Hadoop cluster to be used. This allows the metrics information of
      the current Job to be stored in that JobHistory server.

    • Select the Set staging
      directory
      check box and enter the directory defined in your
      Hadoop cluster for temporary files created by running programs. Typically, this
      directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
      such as yarn-site.xml or mapred-site.xml of your distribution.

    • If you are accessing the Hadoop cluster running with Kerberos security,
      select this check box, then enter the Kerberos principal names for the
      ResourceManager service and the JobHistory service in the displayed fields. This
      enables you to use your user name to authenticate against the credentials stored in
      Kerberos. These principals can be found in the configuration files of your
      distribution, such as in yarn-site.xml and in mapred-site.xml.

      If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
      pairs of Kerberos principals and encrypted keys. You need to enter the principal to
      be used in the Principal field and the access
      path to the keytab file itself in the Keytab
      field. This keytab file must be stored in the machine in which your Job actually
      runs, for example, on a Talend
      Jobserver.

      Note that the user that executes a keytab-enabled Job is not necessarily
      the one a principal designates but must have the right to read the keytab file being
      used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
      situation, ensure that user1 has the right to read the keytab
      file to be used.

    • The User name field is available when you are not using
      Kerberos to authenticate. In the User name field, enter the
      login user name for your distribution. If you leave it empty, the user name of the machine
      hosting the Studio will be used.

    Ensure that the user name in the Yarn
    client
    mode is the same one you put in
    tS3Configuration, the component used to provide S3
    connection information to Spark.


  2. With the Yarn client mode, the
    Property type list is displayed to allow you
    to select an established Hadoop connection from the Repository, on the condition that you have created this connection
    in the Repository. Then the Studio will reuse
    that set of connection information for this Job.


  3. If you need to launch from Windows, it is recommended to specify where
    the winutils.exe program to be used is stored.

    • If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box
      and enter the directory where your winutils.exe is
      stored.

    • Otherwise, leave this check box clear; the Studio generates one
      by itself and automatically uses it for this Job.

  4. If the Spark cluster cannot recognize the machine in which the Job is
    launched, select the Define the driver hostname or IP
    address
    check box and enter the host name or the IP address of this machine.
    This allows the Spark master and its workers to recognize this machine to find the Job
    and thus its driver.

    Note that in this situation, you also need to add the name and the IP address
    of this machine to its hosts file.

  5. In the Spark “scratch” directory field,
    enter the directory in the local system in which the Studio stores the temporary files,
    such as the jar files to be transferred. If you launch the Job on Windows, the default
    disk is C:. So if you leave /tmp in this field,
    this directory is C:/tmp.
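
The Studio builds this Yarn client connection for you from the fields described in these steps; there is no code to write. Purely as a reference, the following minimal Java sketch shows the standard Spark properties that those fields roughly correspond to, using hypothetical host names and default Hadoop ports:

    import org.apache.spark.SparkConf;

    public class EmrYarnClientSketch {
        public static void main(String[] args) {
            // Hypothetical master host and ports; use the values of your own EMR cluster.
            SparkConf conf = new SparkConf()
                    .setMaster("yarn-client")
                    .setAppName("my_talend_job")
                    // Resource manager field
                    .set("spark.hadoop.yarn.resourcemanager.address", "emr-master:8032")
                    // Set resourcemanager scheduler address check box
                    .set("spark.hadoop.yarn.resourcemanager.scheduler.address", "emr-master:8030")
                    // Set jobhistory address check box
                    .set("spark.hadoop.mapreduce.jobhistory.address", "emr-master:10020")
                    // Set staging directory check box
                    .set("spark.hadoop.yarn.app.mapreduce.am.staging-dir", "/tmp/hadoop-yarn/staging");
            // Print the resulting configuration for inspection.
            System.out.println(conf.toDebugString());
        }
    }

Any property prefixed with spark.hadoop. is forwarded by Spark to the Hadoop configuration of the Job, which is why the Yarn and JobHistory addresses can be expressed this way.
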
After the connection is configured, you can optionally tune the Spark performance by following the process explained in Tuning Spark for Apache Spark Batch Jobs or Tuning Spark for Apache Spark Streaming Jobs.

Defining the Cloudera connection parameters

Complete the Cloudera connection configuration in the Spark
configuration
tab of the Run view of your Job.
This configuration is effective on a per-Job basis.

  1. Select the type of the Spark cluster you need to connect to.

    Standalone

    The Studio connects to a Spark-enabled cluster to run the Job from this
    cluster.

    If you are using the Standalone mode, you need to
    set the following parameters:

    • In the Spark host field, enter the URI of
      the Spark Master of the Hadoop cluster to be used.

    • In the Spark home field, enter the location
      of the Spark executable installed in the Hadoop cluster to be used.

    Yarn client

    The Studio runs the Spark driver to orchestrate how the Job should be
    performed and then sends the orchestration to the Yarn service of a given
    Hadoop cluster so that the Resource Manager of this Yarn service
    requests execution resources accordingly.

    If you are using the Yarn client
    mode, you need to enter the addresses of the following different
    services in their corresponding fields (if you leave the check box of a
    service clear, the configuration for this parameter in the Hadoop cluster
    to be used is ignored at runtime):

    • In the Resource manager
      field, enter the address of the ResourceManager service of the Hadoop cluster to
      be used.

    • Select the Set resourcemanager
      scheduler address
      check box and enter the Scheduler address in
      the field that appears.

    • Select the Set jobhistory
      address
      check box and enter the location of the JobHistory
      server of the Hadoop cluster to be used. This allows the metrics information of
      the current Job to be stored in that JobHistory server.

    • Select the Set staging
      directory
      check box and enter the directory defined in your
      Hadoop cluster for temporary files created by running programs. Typically, this
      directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
      such as yarn-site.xml or mapred-site.xml of your distribution.

    • If you are accessing the Hadoop cluster running with Kerberos security,
      select this check box, then enter the Kerberos principal names for the
      ResourceManager service and the JobHistory service in the displayed fields. This
      enables you to use your user name to authenticate against the credentials stored in
      Kerberos. These principals can be found in the configuration files of your
      distribution, such as in yarn-site.xml and in mapred-site.xml.

      If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
      pairs of Kerberos principals and encrypted keys. You need to enter the principal to
      be used in the Principal field and the access
      path to the keytab file itself in the Keytab
      field. This keytab file must be stored in the machine in which your Job actually
      runs, for example, on a Talend
      Jobserver.

      Note that the user that executes a keytab-enabled Job is not necessarily
      the one a principal designates but must have the right to read the keytab file being
      used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
      situation, ensure that user1 has the right to read the keytab
      file to be used.

    • The User name field is available when you are not using
      Kerberos to authenticate. In the User name field, enter the
      login user name for your distribution. If you leave it empty, the user name of the machine
      hosting the Studio will be used.

    Ensure that the user name in the Yarn
    client
    mode is the same one you put in
    tHDFSConfiguration, the component used to provide HDFS
    connection information to Spark.


  2. With the Yarn client mode, the
    Property type list is displayed to allow you
    to select an established Hadoop connection from the Repository, on the condition that you have created this connection
    in the Repository. Then the Studio will reuse
    that set of connection information for this Job.


  3. If you need to launch from Windows, it is recommended to specify where
    the winutils.exe program to be used is stored.

    • If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box
      and enter the directory where your winutils.exe is
      stored.

    • Otherwise, leave this check box clear; the Studio generates one
      by itself and automatically uses it for this Job.

  4. If the Spark cluster cannot recognize the machine in which the Job is
    launched, select the Define the driver hostname or IP
    address
    check box and enter the host name or the IP address of this machine.
    This allows the Spark master and its workers to recognize this machine to find the Job
    and thus its driver.

    Note that in this situation, you also need to add the name and the IP address
    of this machine to its hosts file.


  5. In the Spark “scratch” directory field,
    enter the directory in the local system in which the Studio stores the temporary files,
    such as the jar files to be transferred. If you launch the Job on Windows, the default
    disk is C:. So if you leave /tmp in this field,
    this directory is C:/tmp.

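When the Use a keytab to authenticate option described above is selected, the underlying mechanism is a standard Hadoop keytab login. The following minimal Java sketch, with a hypothetical principal and keytab path, only illustrates that mechanism; the Studio performs the equivalent login for you:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KeytabLoginSketch {
        public static void main(String[] args) throws IOException {
            // Hypothetical principal and keytab path. The keytab only needs to be readable
            // by the user that runs the Job (for example user1), even if the principal it
            // contains designates another user (for example guest).
            Configuration hadoopConf = new Configuration();
            hadoopConf.set("hadoop.security.authentication", "kerberos");
            UserGroupInformation.setConfiguration(hadoopConf);
            UserGroupInformation.loginUserFromKeytab(
                    "guest@EXAMPLE.COM",
                    "/etc/security/keytabs/guest.keytab");
            System.out.println("Logged in as: " + UserGroupInformation.getLoginUser());
        }
    }
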
After the connection is configured, you can optionally tune the Spark performance by following the process explained in Tuning Spark for Apache Spark Batch Jobs or Tuning Spark for Apache Spark Streaming Jobs.

Defining the Dataproc connection parameters

Complete the Google Dataproc connection configuration in the Spark
configuration
tab of the Run view of your Job.
This configuration is effective on a per-Job basis.

Only the Yarn client mode is available for this type of cluster.

  1. Enter the basic connection information to Dataproc:

    Project identifier

    Enter the ID of your Google Cloud Platform project.

    If you are not certain about your project ID, check it in the Manage
    Resources page of your Google Cloud Platform services.

    Cluster identifier

    Enter the ID of your Dataproc cluster to be used.

    Region

    Enter the geographic zones in which the computing resources are used and your
    data is stored and processed. If you do not need to specify a particular
    region, leave the default value global.

    For further information about the available regions and the zones each region
    groups, see Regions and Zones.

    Google Storage staging bucket

    As a Talend Job expects its
    dependent jar files for execution, specify the Google Storage directory to
    which these jar files are transferred so that your Job can access these
    files at execution.

    The directory to be entered must end with a slash (/). If it does not exist,
    the directory is created on the fly, but the bucket to be used must already
    exist.

  2. Provide the authentication information to your Google Dataproc cluster:

    Provide Google Credentials in file

    Leave this check box clear when you
    launch your Job from a machine on which the Google Cloud SDK has been
    installed and authorized to use your user account credentials to access
    Google Cloud Platform. In this situation, this machine is often your
    local machine.

    When you launch your Job from a remote
    machine, such as a Jobserver, select this check box and in the
    Path to Google Credentials file field that is
    displayed, enter the directory in which this JSON file is stored in the
    Jobserver machine.

    For further information about this Google
    Credentials file, contact the administrator of your Google Cloud Platform or
    visit the Google Cloud Platform Auth Guide.


  3. With the Yarn client mode, the
    Property type list is displayed to allow you
    to select an established Hadoop connection from the Repository, on the condition that you have created this connection
    in the Repository. Then the Studio will reuse
    that set of connection information for this Job.


  4. In the Spark “scratch” directory field,
    enter the directory in the local system in which the Studio stores the temporary files,
    such as the jar files to be transferred. If you launch the Job on Windows, the default
    disk is C:. So if you leave /tmp in this field,
    this directory is C:/tmp.

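As noted above, the Google Storage staging bucket directory must end with a slash (/). The short Java sketch below, built around a hypothetical gs:// bucket and folder, does nothing more than illustrate that rule:

    public class StagingDirectorySketch {
        // The staging directory entered in the Spark configuration tab must end with a slash.
        static String normalizeStagingDir(String dir) {
            return dir.endsWith("/") ? dir : dir + "/";
        }

        public static void main(String[] args) {
            // Hypothetical bucket and folder names.
            System.out.println(normalizeStagingDir("gs://my-bucket/talend/jobjars"));
            // Prints: gs://my-bucket/talend/jobjars/
        }
    }
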
After the connection is configured, you can optionally tune the Spark performance by following the process explained in Tuning Spark for Apache Spark Batch Jobs or Tuning Spark for Apache Spark Streaming Jobs.

Defining the Hortonworks connection parameters

Complete the Hortonworks connection configuration in the Spark
configuration
tab of the Run view of your Job.
This configuration is effective on a per-Job basis.

Only the Yarn client mode is available for this type of
cluster.

  1. Enter the basic connection information to Hortonworks:

    Yarn client

    The Studio runs the Spark driver to orchestrate how the Job should be
    performed and then sends the orchestration to the Yarn service of a given
    Hadoop cluster so that the Resource Manager of this Yarn service
    requests execution resources accordingly.

    If you are using the Yarn client
    mode, you need to enter the addresses of the following different
    services in their corresponding fields (if you leave the check box of a
    service clear, the configuration for this parameter in the Hadoop cluster
    to be used is ignored at runtime):

    • In the Resource manager
      field, enter the address of the ResourceManager service of the Hadoop cluster to
      be used.

    • Select the Set resourcemanager
      scheduler address
      check box and enter the Scheduler address in
      the field that appears.

    • Select the Set jobhistory
      address
      check box and enter the location of the JobHistory
      server of the Hadoop cluster to be used. This allows the metrics information of
      the current Job to be stored in that JobHistory server.

    • Select the Set staging
      directory
      check box and enter the directory defined in your
      Hadoop cluster for temporary files created by running programs. Typically, this
      directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
      such as yarn-site.xml or mapred-site.xml of your distribution.

    • If you are accessing the Hadoop cluster running with Kerberos security,
      select this check box, then enter the Kerberos principal names for the
      ResourceManager service and the JobHistory service in the displayed fields. This
      enables you to use your user name to authenticate against the credentials stored in
      Kerberos. These principals can be found in the configuration files of your
      distribution, such as in yarn-site.xml and in mapred-site.xml.

      If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
      pairs of Kerberos principals and encrypted keys. You need to enter the principal to
      be used in the Principal field and the access
      path to the keytab file itself in the Keytab
      field. This keytab file must be stored in the machine in which your Job actually
      runs, for example, on a Talend
      Jobserver.

      Note that the user that executes a keytab-enabled Job is not necessarily
      the one a principal designates but must have the right to read the keytab file being
      used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
      situation, ensure that user1 has the right to read the keytab
      file to be used.

    • The User name field is available when you are not using
      Kerberos to authenticate. In the User name field, enter the
      login user name for your distribution. If you leave it empty, the user name of the machine
      hosting the Studio will be used.

    Ensure that the user name in the Yarn
    client
    mode is the same one you put in
    tHDFSConfiguration, the component used to provide HDFS
    connection information to Spark.


  2. With the Yarn client mode, the
    Property type list is displayed to allow you
    to select an established Hadoop connection from the Repository, on the condition that you have created this connection
    in the Repository. Then the Studio will reuse
    that set of connection information for this Job.


  3. If you need to launch from Windows, it is recommended to specify where
    the winutils.exe program to be used is stored.

    • If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box
      and enter the directory where your winutils.exe is
      stored.

    • Otherwise, leave this check box clear; the Studio generates one
      by itself and automatically uses it for this Job.

  4. If the Spark cluster cannot recognize the machine in which the Job is
    launched, select the Define the driver hostname or IP
    address
    check box and enter the host name or the IP address of this machine.
    This allows the Spark master and its workers to recognize this machine to find the Job
    and thus its driver.

    Note that in this situation, you also need to add the name and the IP address
    of this machine to its hosts file.


  5. In the Spark “scratch” directory field,
    enter the directory in the local system in which the Studio stores the temporary files,
    such as the jar files to be transferred. If you launch the Job on Windows, the default
    disk is C:. So if you leave /tmp in this field,
    this directory is C:/tmp.

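The Define the driver hostname or IP address option in step 4 corresponds to the standard spark.driver.host property. The minimal Java sketch below, with a hypothetical host name, shows that property only for reference; the Studio sets it for you from the check box:

    import org.apache.spark.SparkConf;

    public class DriverHostSketch {
        public static void main(String[] args) {
            // Hypothetical host name; use the name or IP address of the machine that
            // launches the Job, such as the Studio machine or a Talend Jobserver.
            SparkConf conf = new SparkConf()
                    .setMaster("yarn-client")
                    .setAppName("my_talend_job")
                    .set("spark.driver.host", "studio-host.example.com");
            System.out.println(conf.get("spark.driver.host"));
        }
    }
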
After the connection is configured, you can optionally tune the Spark performance by following the process explained in Tuning Spark for Apache Spark Batch Jobs or Tuning Spark for Apache Spark Streaming Jobs.

Defining the MapR connection parameters

Complete the MapR connection configuration in the Spark
configuration
tab of the Run view of your Job.
This configuration is effective on a per-Job basis.

  1. Select the type of the Spark cluster you need to connect to.

    Standalone

    The Studio connects to a Spark-enabled cluster to run the Job from this
    cluster.

    If you are using the Standalone mode, you need to
    set the following parameters:

    • In the Spark host field, enter the URI of
      the Spark Master of the Hadoop cluster to be used.

    • In the Spark home field, enter the location
      of the Spark executable installed in the Hadoop cluster to be used.

    Yarn client

    The Studio runs the Spark driver to orchestrate how the Job should be
    performed and then sends the orchestration to the Yarn service of a given
    Hadoop cluster so that the Resource Manager of this Yarn service
    requests execution resources accordingly.

    If you are using the Yarn client
    mode, you need to enter the addresses of the following different
    services in their corresponding fields (if you leave the check box of a
    service clear, the configuration for this parameter in the Hadoop cluster
    to be used is ignored at runtime):

    • In the Resource manager
      field, enter the address of the ResourceManager service of the Hadoop cluster to
      be used.

    • Select the Set resourcemanager
      scheduler address
      check box and enter the Scheduler address in
      the field that appears.

    • Select the Set jobhistory
      address
      check box and enter the location of the JobHistory
      server of the Hadoop cluster to be used. This allows the metrics information of
      the current Job to be stored in that JobHistory server.

    • Select the Set staging
      directory
      check box and enter the directory defined in your
      Hadoop cluster for temporary files created by running programs. Typically, this
      directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
      such as yarn-site.xml or mapred-site.xml of your distribution.

    • If you are accessing the Hadoop cluster running with Kerberos security,
      select this check box, then enter the Kerberos principal names for the
      ResourceManager service and the JobHistory service in the displayed fields. This
      enables you to use your user name to authenticate against the credentials stored in
      Kerberos. These principals can be found in the configuration files of your
      distribution, such as in yarn-site.xml and in mapred-site.xml.

      If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
      pairs of Kerberos principals and encrypted keys. You need to enter the principal to
      be used in the Principal field and the access
      path to the keytab file itself in the Keytab
      field. This keytab file must be stored in the machine in which your Job actually
      runs, for example, on a Talend
      Jobserver.

      Note that the user that executes a keytab-enabled Job is not necessarily
      the one a principal designates but must have the right to read the keytab file being
      used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
      situation, ensure that user1 has the right to read the keytab
      file to be used.

    • The User name field is available when you are not using
      Kerberos to authenticate. In the User name field, enter the
      login user name for your distribution. If you leave it empty, the user name of the machine
      hosting the Studio will be used.

    Ensure that the user name in the Yarn
    client
    mode is the same one you put in
    tHDFSConfiguration, the component used to provide HDFS
    connection information to Spark.


  2. With the Yarn client mode, the
    Property type list is displayed to allow you
    to select an established Hadoop connection from the Repository, on the condition that you have created this connection
    in the Repository. Then the Studio will reuse
    that set of connection information for this Job.


  3. If you need to launch from Windows, it is recommended to specify where
    the winutils.exe program to be used is stored.

    • If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box
      and enter the directory where your winutils.exe is
      stored.

    • Otherwise, leave this check box clear; the Studio generates one
      by itself and automatically uses it for this Job.

  4. If the Spark cluster cannot recognize the machine in which the Job is
    launched, select the Define the driver hostname or IP
    address
    check box and enter the host name or the IP address of this machine.
    This allows the Spark master and its workers to recognize this machine to find the Job
    and thus its driver.

    Note that in this situation, you also need to add the name and the IP address
    of this machine to its hosts file.

  5. Verify, for example with your cluster administrator, whether your MapR cluster
    is secured with the MapR ticket authentication mechanism.

    • If the MapR cluster to be used is secured with the MapR ticket authentication mechanism,
      set the MapR ticket authentication configuration by following the explanation in Setting up the MapR ticket authentication.

    • Otherwise, leave the Use MapR Ticket authentication check box clear.


  6. In the Spark “scratch” directory field,
    enter the directory in the local system in which the Studio stores the temporary files,
    such as the jar files to be transferred. If you launch the Job on Windows, the default
    disk is C:. So if you leave /tmp in this field,
    this directory is C:/tmp.

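The Define the Hadoop home directory option in step 3 relies on a standard Hadoop mechanism on Windows: Hadoop looks for winutils.exe in the bin subfolder of the Hadoop home directory, which can be supplied either through the HADOOP_HOME environment variable or through the hadoop.home.dir system property. The minimal Java sketch below, with a hypothetical directory, illustrates that mechanism only:

    public class HadoopHomeSketch {
        public static void main(String[] args) {
            // Hypothetical directory; winutils.exe is then expected under C:\hadoop\bin.
            System.setProperty("hadoop.home.dir", "C:\\hadoop");
            System.out.println("Hadoop home: " + System.getProperty("hadoop.home.dir"));
        }
    }
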
After the connection is configured, you can optionally tune the Spark performance by following the process explained in Tuning Spark for Apache Spark Batch Jobs or Tuning Spark for Apache Spark Streaming Jobs.

Defining the HD Insight connection parameters

Complete the HD Insight connection configuration in the Spark
configuration
tab of the Run view of your Job.
This configuration is effective on a per-Job basis.

Only the Yarn client mode is available for this type of
cluster.

  1. Enter the basic connection information to Microsoft HD Insight:

    Livy configuration

    The Hostname of Livy uses
    the following syntax: your_spark_cluster_name.azurehdinsight.net. For further
    information about the Livy service used by HD Insight, see Submit Spark
    jobs using Livy.

    HDInsight
    configuration

    Enter the authentication information of the HD Insight cluster to be
    used.

    Windows Azure Storage
    configuration

    Enter the address and the authentication information of the Azure Storage
    account to be used.

    In the Container field, enter the name
    of the container to be used.

    In the Deployment Blob field, enter the
    location in which you want to store the current Job and its dependent libraries in
    this Azure Storage account.


  2. With the Yarn client mode, the
    Property type list is displayed to allow you
    to select an established Hadoop connection from the Repository, on the condition that you have created this connection
    in the Repository. Then the Studio will reuse
    that set of connection information for this Job.


  3. In the Spark “scratch” directory field,
    enter the directory in the local system in which the Studio stores the temporary files,
    such as the jar files to be transferred. If you launch the Job on Windows, the default
    disk is C:. So if you leave /tmp in this field,
    this directory is C:/tmp.

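The Livy host name entered above points to the REST service through which HD Insight accepts Spark jobs. The Java sketch below, with a hypothetical cluster name, credentials, jar location and class name, shows what a basic batch submission to that service can look like; it only illustrates the role of the Livy configuration and is not how the Studio submits your Job:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class LivySubmitSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical cluster name, credentials, jar location and class name.
            String livyUrl = "https://your_spark_cluster_name.azurehdinsight.net/livy/batches";
            String auth = Base64.getEncoder()
                    .encodeToString("admin:myPassword".getBytes(StandardCharsets.UTF_8));
            String payload = "{\"file\": \"wasb:///example/jars/my_talend_job.jar\","
                    + " \"className\": \"my_project.my_talend_job\"}";

            HttpURLConnection conn = (HttpURLConnection) new URL(livyUrl).openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setRequestProperty("X-Requested-By", "talend");
            conn.setRequestProperty("Authorization", "Basic " + auth);
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(payload.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("Livy responded with HTTP " + conn.getResponseCode());
        }
    }
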
After the connection is configured, you can optionally tune the Spark performance by following the process explained in Tuning Spark for Apache Spark Batch Jobs or Tuning Spark for Apache Spark Streaming Jobs.

Defining the Cloudera Altus connection parameters (technical preview)

Complete the Altus connection configuration in the Spark
configuration
tab of the Run view of your Job.
This configuration is effective on a per-Job basis.

Only the Yarn cluster mode is available for this type of cluster.

The Cloudera Altus client, Altus CLI, must be installed on the machine on which your Job is executed.

  1. In the Spark configuration tab of the
    Run view of your Job, enter the basic connection
    information to Cloudera Altus.

    Force Cloudera Altus credentials

    Select this check box to provide the credentials with your
    Job.

    If you want to provide the credentials separately, for example
    manually using the command altus configure in
    your terminal, clear this check box.

    Path to Cloudera Altus CLI

    Enter the path to the Cloudera Altus client, which must have been
    installed and activated in the machine in which your Job is
    executed. In production environments, this machine is typically
    a Talend
    Jobserver.

  2. Configure the virtual Cloudera cluster to be used.

    Use an existing Cloudera Altus cluster

    Select this check box to use a Cloudera Altus cluster already
    existing in your Cloud service. Otherwise, leave this check box
    clear to allow the Job to create a cluster on the fly.

    With this check box selected, only the Cluster name parameter is useful and the other parameters for the cluster configuration are hidden.

    Cluster name

    Enter the name of the cluster to be used.

    Environment

    Enter the name of the Cloudera Altus environment to be used to describe
    the resources allocated to the given cluster.

    If you do not know which environment to select, contact
    your Cloudera Altus administrator.

    Delete cluster after execution

    Select this check box if you want to remove the given cluster after the execution of your Job.

    Override with a JSON configuration

    Select this check box to manually edit JSON code in the Custom JSON field
    that is displayed to configure the cluster.

    Instance type

    Select the instance type for the instances in the cluster. All
    nodes that are deployed in this cluster use the same instance
    type.

    Worker node

    Enter the number of worker nodes to be created for the
    cluster.

    For details about the allowed number of worker nodes, see the documentation of Cloudera
    Altus.

    Cloudera Manager username and
    Cloudera Manager password

    Enter the authentication information to your Cloudera Manager
    service.

    SSH private key

    Browse to, or enter the path to, the SSH private key in order to
    upload and register it in the region specified in the Cloudera
    Altus environment.

    The Data Engineering service of Cloudera Altus uses this private key to
    access and configure instances of the cluster to be used.

  3. From the Cloud provider list, select the Cloud service
    that runs your Cloudera Altus cluster. Currently, only AWS is available.

    AWS

    • Access key and Secret key: enter the authentication
      information required to connect to the Amazon S3 bucket to be used.

      To enter the password, click the […] button next to the
      password field, and then in the pop-up dialog box enter the password between double quotes
      and click OK to save the settings.

    • Specify the AWS region by selecting a region name from the list or entering
      a region between double quotation marks (e.g. “us-east-1”) in the
      list. For more information about the AWS Region, see Regions and Endpoints.

    • S3 bucket name: enter the name of the bucket to be
      used to store the dependencies of your Job. This bucket must already
      exist.

    • S3 storage path: enter the directory in which you want
      to store the dependencies of your Job in this given bucket, for example,
      altus/jobjar. This directory is created if it
      does not exist at runtime.

    The Amazon S3 bucket you specify here is used to store your Job dependencies only.
    To connect to the S3 system that hosts your actual data, use a
    tS3Configuration component in your Job.

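Because the Job depends on the Altus CLI installed at the path entered in Path to Cloudera Altus CLI, a quick pre-check of that path can avoid a failed run. The Java sketch below, with a hypothetical installation path, only verifies that the path points to an executable file:

    import java.io.File;

    public class AltusCliPathSketch {
        public static void main(String[] args) {
            // Hypothetical installation path; use the value entered in Path to Cloudera Altus CLI.
            File altusCli = new File("/usr/local/bin/altus");
            if (altusCli.isFile() && altusCli.canExecute()) {
                System.out.println("Altus CLI found at " + altusCli.getAbsolutePath());
            } else {
                System.out.println("Altus CLI not found; install and activate it on the machine "
                        + "that executes the Job, for example a Talend Jobserver.");
            }
        }
    }
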
After the connection is configured, you can optionally tune the Spark performance by following the process explained in Tuning Spark for Apache Spark Batch Jobs or Tuning Spark for Apache Spark Streaming Jobs.

Configuring a Spark stream for your Apache Spark streaming Job

Define how often your Spark Job creates and processes micro batches.
  1. In the Batch size field, enter the time
    interval at the end of which the Job reviews the source data to identify changes and
    processes the new micro batches.
  2. If needed, select the Define a streaming
    timeout
    check box and, in the field that is displayed, enter the time frame
    at the end of which the streaming Job automatically stops running.

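The Batch size and Define a streaming timeout parameters correspond to the batch interval of the Spark streaming context and to a bounded wait on its termination. The following Java sketch uses hypothetical values (a 2000 ms batch size, a 60000 ms timeout, and a socket source on localhost:9999) to illustrate those two notions:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class StreamingBatchSketch {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("streaming_sketch");
            // Batch size: a new micro batch is created every 2000 ms.
            JavaStreamingContext jssc =
                    new JavaStreamingContext(conf, Durations.milliseconds(2000));

            // Hypothetical source and output, standing in for the actual Job logic.
            jssc.socketTextStream("localhost", 9999).print();

            jssc.start();
            // Streaming timeout: stop automatically after 60000 ms at the latest.
            jssc.awaitTerminationOrTimeout(60000);
            jssc.stop(true, true);
        }
    }
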
Tuning Spark for Apache Spark Batch Jobs

You can define the tuning parameters in the Spark
configuration
tab of the Run view of your Spark
Job to obtain better performance of the Job, if the default values of these parameters
do not produce sufficient performance.

Generally speaking, Spark performs better with a small number of big tasks than with a large number of small tasks.

  1. Select the Set Tuning properties check
    box to optimize the allocation of the resources to be used to run this Job.
    These properties are not mandatory for the Job to run successfully, but they are
    useful when Spark is bottlenecked by any resource issue in the cluster such as
    CPU, bandwidth or memory.
  2. Calculate the initial resource allocation as the point to start the
    tuning.

    A generic formula for this calculation is

    • Number of executors = (Total cores of the cluster) / 2

    • Number of cores per executor = 2

    • Memory per executor = (Up to total memory of the cluster) / (Number of executors)

    For example, a cluster with 40 cores in total gives 20 executors with 2 cores each; if
    the cluster offers 320 GB of memory in total, each executor can be allocated up to 16 GB.

  3. Define each parameter and if needed, revise them until you obtain the
    satisfactory performance.

    Spark Standalone mode

    • Driver memory and
      Driver core: enter the allocation size
      of memory and the number of cores to be used by the driver of the current
      Job.

    • Executor memory: enter
      the allocation size of memory to be used by each Spark executor.

    • Core per executor: select
      this check box and in the displayed field, enter the number of cores to be used
      by each executor. If you leave this check box clear, the default allocation
      defined by Spark is used, for example, all available cores are used by one
      single executor in the Standalone
      mode.

    • Set Web UI port: if you
      need to change the default port of the Spark Web UI, select this check box and
      enter the port number you want to use.

    • Broadcast factory: select
      the broadcast implementation to be used to cache variables on each worker
      machine.

    • Customize Spark
      serializer
      : if you need to import an external Spark serializer,
      select this check box and in the field that is displayed, enter the fully
      qualified class name of the serializer to be used.

    Spark Yarn client mode

    • Executor memory: enter
      the allocation size of memory to be used by each Spark executor.

    • Set executor memory overhead:
      select this check box and in the field that is displayed, enter the amount of
      off-heap memory (in MB) to be allocated per executor. This is actually the spark.yarn.executor.memoryOverhead property.

    • Core per executor: select
      this check box and in the displayed field, enter the number of cores to be used
      by each executor. If you leave this check box clear, the default allocation
      defined by Spark is used, for example, all available cores are used by one
      single executor in the Standalone
      mode.

    • Yarn resource allocation:
      select how you want Yarn to allocate resources among executors.

      • Auto:
        you let Yarn use its default number of executors. This number is 2.

      • Fixed: you need to enter the number of executors to be used
        in the Num executors field that
        is displayed.

      • Dynamic: Yarn adapts the number of executors to suit the
        workload. You need to define the scale of this dynamic allocation by
        defining the initial number of executors to run in the Initial executors field, the lowest
        number of executors in the Min
        executors
        field and the largest number of executors in the
        Max executors field.

    • Set Web UI port: if you
      need to change the default port of the Spark Web UI, select this check box and
      enter the port number you want to use.

    • Broadcast factory: select
      the broadcast implementation to be used to cache variables on each worker
      machine.

    • Customize Spark
      serializer
      : if you need to import an external Spark serializer,
      select this check box and in the field that is displayed, enter the fully
      qualified class name of the serializer to be used.

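The tuning options listed above map to standard Spark properties that the Studio sets for you when Set Tuning properties is selected. The following Java sketch shows those properties for the Yarn client mode with hypothetical values, including a Dynamic Yarn resource allocation:

    import org.apache.spark.SparkConf;

    public class YarnTuningSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setMaster("yarn-client")
                    .setAppName("tuned_talend_job")
                    // Executor memory and Set executor memory overhead (in MB)
                    .set("spark.executor.memory", "8g")
                    .set("spark.yarn.executor.memoryOverhead", "1024")
                    // Core per executor
                    .set("spark.executor.cores", "2")
                    // Yarn resource allocation set to Dynamic; dynamic allocation also
                    // requires the external shuffle service to be enabled on the cluster.
                    .set("spark.dynamicAllocation.enabled", "true")
                    .set("spark.dynamicAllocation.initialExecutors", "10")
                    .set("spark.dynamicAllocation.minExecutors", "2")
                    .set("spark.dynamicAllocation.maxExecutors", "20")
                    // Set Web UI port
                    .set("spark.ui.port", "4041");
            System.out.println(conf.toDebugString());
        }
    }
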
Tuning Spark for Apache Spark Streaming Jobs

You can define the tuning parameters in the Spark
configuration
tab of the Run view of your Spark
Job to obtain better performance of the Job, if the default values of these parameters
do not produce sufficient performance.

Generally speaking, Spark performs better with a small number of big tasks than with a large number of small tasks.

  1. Select the Set Tuning properties check
    box to optimize the allocation of the resources to be used to run this Job.
    These properties are not mandatory for the Job to run successfully, but they are
    useful when Spark is bottlenecked by any resource issue in the cluster such as
    CPU, bandwidth or memory.
  2. Calculate the initial resource allocation as the point to start the
    tuning.

    A generic formula for this calculation is

    • Number of executors = (Total cores of the cluster) / 2

    • Number of cores per executor = 2

    • Memory per executor = (Up to total memory of the cluster) / (Number of executors)

  3. Define each parameter and if needed, revise them until you obtain the
    satisfactory performance.

    Spark Standalone mode

    • Driver memory and
      Driver core: enter the allocation size
      of memory and the number of cores to be used by the driver of the current
      Job.

    • Executor memory: enter
      the allocation size of memory to be used by each Spark executor.

    • Core per executor: select
      this check box and in the displayed field, enter the number of cores to be used
      by each executor. If you leave this check box clear, the default allocation
      defined by Spark is used, for example, all available cores are used by one
      single executor in the Standalone
      mode.

    • Set Web UI port: if you
      need to change the default port of the Spark Web UI, select this check box and
      enter the port number you want to use.

    • Broadcast factory: select
      the broadcast implementation to be used to cache variables on each worker
      machine.

    • Customize Spark
      serializer
      : if you need to import an external Spark serializer,
      select this check box and in the field that is displayed, enter the fully
      qualified class name of the serializer to be used.

    • Activate backpressure:
      select this check box to enable the backpressure feature of Spark. The
      backpressure feature is available in Spark version 1.5 and onwards. With
      backpressure enabled, Spark automatically finds the optimal receiving rate and
      dynamically adapts the rate based on current batch scheduling delays and
      processing time, in order to receive data only as fast as it can process it.

    Spark Yarn client mode

    • Executor memory: enter
      the allocation size of memory to be used by each Spark executor.

    • Set executor memory overhead:
      select this check box and in the field that is displayed, enter the amount of
      off-heap memory (in MB) to be allocated per executor. This is actually the spark.yarn.executor.memoryOverhead property.

    • Core per executor: select
      this check box and in the displayed field, enter the number of cores to be used
      by each executor. If you leave this check box clear, the default allocation
      defined by Spark is used, for example, all available cores are used by one
      single executor in the Standalone
      mode.

    • Yarn resource allocation:
      select how you want Yarn to allocate resources among executors.

      • Auto:
        you let Yarn use its default number of executors. This number is 2.

      • Fixed: you need to enter the number of executors to be used
        in the Num executors field that
        is displayed.

      • Dynamic: Yarn adapts the number of executors to suit the
        workload. You need to define the scale of this dynamic allocation by
        defining the initial number of executors to run in the Initial executors field, the lowest
        number of executors in the Min
        executors
        field and the largest number of executors in the
        Max executors field.

    • Set Web UI port: if you
      need to change the default port of the Spark Web UI, select this check box and
      enter the port number you want to use.

    • Broadcast factory: select
      the broadcast implementation to be used to cache variables on each worker
      machine.

    • Customize Spark
      serializer
      : if you need to import an external Spark serializer,
      select this check box and in the field that is displayed, enter the fully
      qualified class name of the serializer to be used.

    • Activate backpressure:
      select this check box to enable the backpressure feature of Spark. The
      backpressure feature is available in Spark version 1.5 and onwards. With
      backpressure enabled, Spark automatically finds the optimal receiving rate and
      dynamically adapts the rate based on current batch scheduling delays and
      processing time, in order to receive data only as fast as it can process it.

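The backpressure feature described above is exposed in standard Spark through the spark.streaming.backpressure.enabled property (available since Spark 1.5), which is presumably what the Activate backpressure check box sets. A minimal Java sketch:

    import org.apache.spark.SparkConf;

    public class BackpressureSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setMaster("yarn-client")
                    .setAppName("streaming_backpressure_sketch")
                    // Let Spark adapt the receiving rate to the current processing speed.
                    .set("spark.streaming.backpressure.enabled", "true");
            System.out.println(conf.get("spark.streaming.backpressure.enabled"));
        }
    }
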
Logging and checkpointing the activities of your Apache Spark Job

It is recommended to activate the Spark logging and checkpointing system in the
Spark configuration tab of the Run
view of your Spark Job, in order to help debug and resume your Spark Job when issues
arise.

  1. If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the
    Spark checkpointing operation. In the field that is displayed, enter the
    directory in which Spark stores, in the file system of the cluster, the context
    data of the computations such as the metadata and the generated RDDs of this
    computation.

    For further information about the Spark checkpointing operation, see http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing.

  2. In the Yarn client mode, you can enable
    the Spark application logs of this Job to be persistent in the file system. To do this,
    select the Enable Spark event logging check box.

    The parameters relevant to Spark logs are displayed:

    • Spark event logs
      directory
      : enter the directory in which Spark events are logged.
      This is actually the spark.eventLog.dir property.

    • Spark history server
      address
      : enter the location of the history server. This is actually
      the spark.yarn.historyServer.address property.

    • Compress Spark event
      logs
      : if needed, select this check box to compress the logs. This
      is actually the spark.eventLog.compress property.

    Since the administrator of your cluster could have defined these properties in
    the cluster configuration files, it is recommended to contact the administrator for the
    exact values.

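The checkpointing and event logging options above correspond to a checkpoint directory set on the streaming context and to the spark.eventLog.dir, spark.yarn.historyServer.address and spark.eventLog.compress properties named in this section. The Java sketch below puts them together with hypothetical directories and a hypothetical history server address; ask your cluster administrator for the real values:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class LoggingCheckpointSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    // local[2] is used here only so that the sketch does not need a Yarn cluster.
                    .setMaster("local[2]")
                    .setAppName("logging_checkpoint_sketch")
                    // Enable Spark event logging and the related fields of this section.
                    .set("spark.eventLog.enabled", "true")
                    .set("spark.eventLog.dir", "hdfs://namenode:8020/spark-logs")
                    .set("spark.yarn.historyServer.address", "historyserver:18080")
                    .set("spark.eventLog.compress", "true");

            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(2));
            // Activate checkpointing: store the context data of the computations in the
            // file system of the cluster.
            jssc.checkpoint("hdfs://namenode:8020/user/talend/checkpoints");
            jssc.stop(true, false);
        }
    }
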
Adding advanced Spark properties to solve issues

Depending on the distribution you are using or the issues you encounter, you may need to
add specific Spark properties to the Advanced properties table in
the Spark configuration tab of the Run
view of your Job.

The advanced properties required by different Hadoop distributions and their values are listed below:

For further information about the valid Spark properties, see Spark
documentation at https://spark.apache.org/docs/latest/configuration.

Hortonworks Data Platform V2.4

  • spark.yarn.am.extraJavaOptions:
    -Dhdp.version=2.4.0.0-169

  • spark.driver.extraJavaOptions:
    -Dhdp.version=2.4.0.0-169

In addition, you need to add -Dhdp.version=2.4.0.0-169 to the JVM
settings
area either in the Advanced
settings
tab of the Run view or
in the Talend > Run/Debug view of the [Preferences] window. Setting this argument in the
[Preferences] window applies it to all the
Jobs that are designed in the same Studio.

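Outside the Advanced properties table, the same two Hortonworks Data Platform V2.4 entries can be written as plain Spark properties, as this minimal Java sketch shows:

    import org.apache.spark.SparkConf;

    public class HdpAdvancedPropertiesSketch {
        public static void main(String[] args) {
            // The two entries of the Advanced properties table expressed as Spark properties.
            SparkConf conf = new SparkConf()
                    .set("spark.yarn.am.extraJavaOptions", "-Dhdp.version=2.4.0.0-169")
                    .set("spark.driver.extraJavaOptions", "-Dhdp.version=2.4.0.0-169");
            System.out.println(conf.toDebugString());
        }
    }
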
MapR V5.1 and V5.2

When the cluster is used with the HBase or the MapRDB
components:

spark.hadoop.yarn.application.classpath: enter the value
of this parameter specific to your cluster and add, if missing, the classpath to HBase
to ensure that the Job to be used can find the required classes and packages in the
cluster.

For example, if the HBase version installed in the cluster is 1.1.1, copy and paste all the paths defined with the spark.hadoop.yarn.application.classpath parameter from your cluster, and then add /opt/mapr/hbase/hbase-1.1.1/lib/* and /opt/mapr/lib/* to these paths, separating each path with a comma (,). The added
paths are where HBase is usually installed in a MapR cluster. If your HBase is installed
elsewhere, contact the administrator of your cluster for details and adapt these paths
accordingly.

For a step-by-step explanation about how to add this parameter, see the article about how to run HBase/MapR-DB on Spark with a MapR distribution in Talend Help Center (https://help.talend.com).

