
Spark configuration – Docs for ESB 7.x

Spark configuration

The Spark configuration view contains the Spark specific properties you can define for your Job depending on the distribution and the Spark mode you are using.

The information in this section applies only to users who have subscribed to
Talend Data Fabric or to any Talend product with Big Data; it is not
applicable to Talend Open Studio for Big Data users.

Defining the connection to the Azure Storage account to be used in the Studio

Define the connection metadata to Azure Storage in the
Repository of the Studio.

  • You have an Azure account with appropriate rights and permissions to the
    Azure Storage.
  • The Azure Storage account to be used has been properly created and you have the
    appropriate permissions to access it. For further information about Azure
    Storage, see Azure Storage tutorials from Azure
    documentation.
  • You are using one of the Talend solutions with Big Data.
  1. Obtain the access key to the Azure Storage account to be used on https://portal.azure.com/.

    1. Click All services on the menu bar on the left
      of the Azure welcome page.
    2. Click Storage accounts in the
      STORAGE section.
    3. Click the storage account to be used.
    4. On the list that is displayed, click Access keys
      to open the corresponding blade.
    5. Copy the key that is displayed and keep it somewhere safe so that you
      can use it in the steps to come.
  2. In the Integration perspective
    of the Studio, expand the Metadata node in the
    Repository, right-click the Azure
    Storage
    node and, from the contextual menu, select
    Create an Azure Storage Connection to open the
    wizard.
  3. Complete the fields in the wizard:

    Name: Enter the name you want to use for this connection.
    Account Name: Enter the name of the Azure Storage account to be connected to.
    Account Key: Enter the access key you obtained in the previous steps.
  4. Click Test connection to verify the configuration. Once
    a message pops up to say that the connection is successful, the
    Next button is activated.
  5. Click Next to access the list of containers available on
    Azure under this Azure Storage account.

    This list is empty if this Azure Storage account does not contain any
    containers.
  6. Select the container to connect to and click Next, or
    just click Next to skip this step. You can return to this
    step at any time by reopening this wizard.
  7. Do the same for the query list and the table list that are respectively
    displayed in the wizard.
  8. Click Finish to validate the creation. The connection
    appears under the Azure Storage node in the
    Repository.
The Azure Storage connection is now defined in the Studio and ready to be used by
your Jobs to work with the Azure services that are associated with this Azure Storage
account.

Defining the Azure Databricks connection parameters for Spark Jobs

Complete the Databricks connection configuration in the Spark
configuration
tab of the Run view of your Job.
This configuration is effective on a per-Job basis.

  1. When running a Spark Streaming Job, only one Job is allowed to run on the same Databricks cluster at a time.
  2. When running a Spark Batch Job, you can send more than one Job to run in parallel on the same
    Databricks cluster only if you have selected the Do not restart the cluster
    when submitting
    check box; otherwise, since each run automatically restarts the cluster,
    Jobs launched in parallel interrupt each other and cause execution
    failure.
  1. From the Cloud provider drop-down list, select
    Azure.
  2. Enter the basic connection information to Databricks.

    Standalone

    • In the Endpoint
      field, enter the URL address of your Azure Databricks workspace.
      This URL can be found in the Overview blade
      of your Databricks workspace page on your Azure portal. For example,
      this URL could look like https://westeurope.azuredatabricks.net.

    • In the Cluster ID
      field, enter the ID of the Databricks cluster to be used. This ID is
      the value of the
      spark.databricks.clusterUsageTags.clusterId
      property of your Spark cluster. You can find this property on the
      properties list in the Environment tab in the
      Spark UI view of your cluster.

      You can also easily find this ID from
      the URL of your Databricks cluster. It is present immediately after
      cluster/ in this URL.

    • Click the […] button
      next to the Token field to enter the
      authentication token generated for your Databricks user account. You
      can generate or find this token on the User
      settings
      page of your Databricks workspace. For
      further information, see Token management from the
      Azure documentation.

    • In the DBFS dependencies
      folder
      field, enter the directory that is used to
      store your Job related dependencies on Databricks Filesystem at
      runtime, putting a slash (/) at the end of this directory. For
      example, enter /jars/ to store the dependencies
      in a folder named jars. This folder is created
      on the fly if it does not already exist.

    • Poll interval when retrieving Job status (in
      ms)
      : enter, without the quotation marks, the time
      interval (in milliseconds) at the end of which you want the Studio
      to ask Spark for the status of your Job. For example, this status
      could be Pending or Running.

      The default value is 300000, that is, 5 minutes. This interval is
      recommended by Databricks to correctly retrieve the Job status.

    • Use
      transient cluster
      : you can select this check box to
      leverage the transient Databricks clusters.

      The custom properties you defined in the Advanced properties table are automatically taken into account by the transient clusters at runtime.

      1. Autoscale: select or clear this check box to define
        the number of workers to be used by your transient cluster.

        1. If you select this check box,
          autoscaling is enabled. Then define the minimum number
          of workers in Min
          workers
          and the maximum number of
          workers in Max
          workers
          . Your transient cluster is
          scaled up and down within this scope based on its
          workload.

          According to the Databricks
          documentation, autoscaling works best with
          Databricks runtime versions 3.0 or later.

        2. If you clear this check box, autoscaling
          is deactivated. Then define the number of workers a
          transient cluster is expected to have. This number does
          not include the Spark driver node.
      2. Node type
        and Driver node type:
        select the node types for the workers and the Spark driver node.
        These types determine the capacity of your nodes and their
        pricing by Databricks.

        For details about
        these node types and the Databricks Units they use, see
        Supported Instance
        Types
        from the Databricks documentation.

      3. Elastic
        disk
        : select this check box to enable your
        transient cluster to automatically scale up its disk space when
        its Spark workers are running low on disk space.

        For more details about this elastic disk
        feature, search for the section about autoscaling local
        storage from your Databricks documentation.

      4. SSH public
        key
        : if SSH access has been set up for your
        cluster, enter the public key of the generated SSH key pair.
        This public key is automatically added to each node of your
        transient cluster. If no SSH access has been set up, ignore this
        field.

        For further information about SSH
        access to your cluster, see SSH access to
        clusters
        from the Databricks
        documentation.

      5. Configure cluster
        log
        : select this check box to define where to
        store your Spark logs for the long term. This storage system could
        be S3 or DBFS.
    • Do not restart the cluster
      when submitting
      : select this check box to prevent
      the Studio from restarting the cluster when the Studio submits your
      Jobs. However, if you make changes in your Jobs, clear this check
      box so that the Studio restarts your cluster to take these changes
      into account.
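
    As a minimal sketch, assuming hypothetical workspace and cluster values, the
    Standalone fields described above might be filled in as follows. The Studio
    usually expects string values between double quotes, while the poll interval
    is entered without them; the cluster ID and token shown here are placeholders:

      Endpoint:                                           "https://westeurope.azuredatabricks.net"
      Cluster ID:                                         "0914-183045-abcd123"   (hypothetical ID)
      Token:                                              the token generated on your User settings page
      DBFS dependencies folder:                           "/jars/"
      Poll interval when retrieving Job status (in ms):   300000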

If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the
Spark checkpointing operation. In the field that is displayed, enter the
directory in which Spark stores, in the file system of the cluster, the context
data of the computations such as the metadata and the generated RDDs of this
computation.

For further information about the Spark checkpointing operation, see http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing .

Adding Azure specific properties to access the Azure storage system from Databricks

Add the Azure specific properties to the Spark configuration of your Databricks
cluster so that your cluster can access Azure Storage.

You need to do this only when you want your Talend
Jobs for Apache Spark to use Azure Blob Storage or Azure Data Lake Storage with
Databricks.

  • Ensure that your Spark cluster in Databricks has been properly created, is
    running, and that its version is supported by the Studio. If you use Azure Data
    Lake Storage Gen 2, only Databricks 5.4 is supported.

    For further information, see Create Databricks workspace from
    Azure documentation.

  • You have an Azure account.
  • The Azure Blob Storage or Azure Data Lake Storage service to be used has been
    properly created and you have the appropriate permissions to access it. For
    further information about Azure Storage, see Azure Storage tutorials from Azure
    documentation.
  1. On the Configuration tab of your Databricks cluster
    page, scroll down to the Spark tab at the bottom of the
    page.


  2. Click Edit to make the fields on this page
    editable.
  3. In this Spark tab, enter the Spark properties regarding
    the credentials to be used to access your Azure Storage system.

    Azure Blob Storage

    When you need to use Azure Blob Storage with Azure Databricks, add the
    following Spark properties, each per line:

    • The parameter to provide the account key.

      Ensure that the account to be used has the appropriate read/write rights and permissions.

    • If you need to append data to an existing file, the parameter that
      enables appending.

    Azure Data Lake Storage (Gen1)

    When you need to use Azure Data Lake Storage Gen1 with Databricks, add the
    following Spark properties, each per line.

    Azure Data Lake Storage (Gen2)

    When you need to use Azure Data Lake Storage Gen2 with Databricks, add the
    following Spark properties, each per line:

    • The parameter to provide the account key.

      This key is associated with the storage account to be used. You can find
      it in the Access keys blade of this storage account. Two keys are
      available for each account and, by default, either of them can be used
      for this access.

      Ensure that the account to be used has the appropriate read/write rights and permissions.

    • If the ADLS file system to be used does not exist yet, the parameter that
      creates it during initialization.

    An illustrative sketch of these properties is given below.

    For further information about how to find your application ID and
    authentication key, see Get application ID and authentication key from the
    Azure documentation. In the same documentation, you can also find details
    about how to find your tenant ID at Get tenant ID.
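
    As an illustrative sketch only, the commonly used hadoop-azure connector
    properties for these scenarios look like the following. The values
    <storage_account>, <account_key>, <application_id>, <authentication_key> and
    <tenant_id> are placeholders to replace with your own values; verify the
    exact property names against your Databricks and Azure documentation:

      Azure Blob Storage (account key, plus optional append support):
        spark.hadoop.fs.azure.account.key.<storage_account>.blob.core.windows.net <account_key>
        spark.hadoop.fs.azure.enable.append.support true

      Azure Data Lake Storage Gen1 (OAuth2 service principal credentials):
        spark.hadoop.dfs.adls.oauth2.access.token.provider.type ClientCredential
        spark.hadoop.dfs.adls.oauth2.client.id <application_id>
        spark.hadoop.dfs.adls.oauth2.credential <authentication_key>
        spark.hadoop.dfs.adls.oauth2.refresh.url https://login.microsoftonline.com/<tenant_id>/oauth2/token

      Azure Data Lake Storage Gen2 (account key, plus optional creation of the file system):
        spark.hadoop.fs.azure.account.key.<storage_account>.dfs.core.windows.net <account_key>
        spark.hadoop.fs.azure.createRemoteFileSystemDuringInitialization true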
  4. If you need to run Spark Streaming Jobs with Databricks, in the same
    Spark tab, add the following property to define a
    default Spark serializer. If you do not plan to run Spark Streaming Jobs, you
    can ignore this step.
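
    A minimal sketch of such a serializer property, assuming the commonly used
    Kryo serializer as the default:

      spark.serializer org.apache.spark.serializer.KryoSerializer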

  5. Restart your Spark cluster.
  6. In the Spark UI tab of your Databricks cluster page,
    click Environment to display the list of properties and
    verify that each of the properties you added in the previous steps is present on
    that list.

Defining the Databricks-on-AWS connection parameters for Spark Jobs

Complete the Databricks connection configuration in the Spark
configuration
tab of the Run view of your Job.
This configuration is effective on a per-Job basis.

    1. When running a Spark Streaming Job, only one Job is allowed to run on the same Databricks cluster at a time.
    2. When running a Spark Batch Job, you can send more than one Job to run in parallel on the same
      Databricks cluster only if you have selected the Do not restart the cluster
      when submitting
      check box; otherwise, since each run automatically restarts the cluster,
      Jobs launched in parallel interrupt each other and cause execution
      failure.
  • Ensure that the AWS account to be used has the proper read/write permissions to the S3 bucket to be used. For this purpose, contact the administrator of your AWS system.
Enter the basic connection information to Databricks on AWS.

Standalone

  • In the Endpoint
    field, enter the URL address of the workspace of your Databricks on
    AWS. For example, this URL could look like
    https://<your_endpoint>.cloud.databricks.com.

  • In the Cluster
    ID
    field, enter the ID of the Databricks cluster to
    be used. This ID is the value of the
    spark.databricks.clusterUsageTags.clusterId
    property of your Spark cluster. You can find this property on the
    properties list in the Environment tab in the
    Spark UI view of your cluster.

    You can also easily find this ID
    from the URL of your Databricks cluster. It is present immediately
    after cluster/ in this URL.

    This field is not used and thus not available if you are using transient clusters.

  • Click the […]
    button next to the Token field to enter the
    authentication token generated for your Databricks user account. You
    can generate or find this token on the User
    settings
    page of your Databricks workspace. For
    further information, see Token management from the
    Databricks documentation.

  • In the DBFS
    dependencies folder
    field, enter the directory that
    is used to store your Job related dependencies on Databricks
    Filesystem at runtime, putting a slash (/) at the end of this
    directory. For example, enter /jars/ to store
    the dependencies in a folder named jars. This
    folder is created on the fly if it does not already exist.

    This directory stores your Job dependencies on DBFS only. In your
    Job, use tS3Configuration,
    tDynamoDBConfiguration or, in a Spark
    Streaming Job, the Kinesis components, to read or write your
    business data to the related systems.

  • Poll interval when retrieving Job status (in
    ms)
    : enter, without the quotation marks, the time
    interval (in milliseconds) at the end of which you want the Studio
    to ask Spark for the status of your Job. For example, this status
    could be Pending or Running.

    The default value is 300000, that is, 5 minutes. This interval is
    recommended by Databricks to correctly retrieve the Job status.

  • Use
    transient cluster
    : you can select this check box to
    leverage the transient Databricks clusters.

    The custom properties you defined in the Advanced properties table are automatically taken into account by the transient clusters at runtime.

    1. Autoscale: select or clear this check box to define
      the number of workers to be used by your transient cluster.

      1. If you select this check box,
        autoscaling is enabled. Then define the minimum number
        of workers in Min
        workers
        and the maximum number of
        workers in Max
        workers
        . Your transient cluster is
        scaled up and down within this scope based on its
        workload.

        According to the Databricks
        documentation, autoscaling works best with
        Databricks runtime versions 3.0 or later.

      2. If you clear this check box, autoscaling
        is deactivated. Then define the number of workers a
        transient cluster is expected to have. This number does
        not include the Spark driver node.
    2. Node type
      and Driver node type:
      select the node types for the workers and the Spark driver node.
      These types determine the capacity of your nodes and their
      pricing by Databricks.

      For details about
      these node types and the Databricks Units they use, see
      Supported Instance
      Types
      from the Databricks documentation.

    3. Elastic
      disk
      : select this check box to enable your
      transient cluster to automatically scale up its disk space when
      its Spark workers are running low on disk space.

      For more details about this elastic disk
      feature, search for the section about autoscaling local
      storage from your Databricks documentation.

    4. SSH public
      key
      : if SSH access has been set up for your
      cluster, enter the public key of the generated SSH key pair.
      This public key is automatically added to each node of your
      transient cluster. If no SSH access has been set up, ignore this
      field.

      For further information about SSH
      access to your cluster, see SSH access to
      clusters
      from the Databricks
      documentation.

    5. Configure cluster
      log
      : select this check box to define where to
      store your Spark logs for the long term. This storage system could
      be S3 or DBFS.
  • Do not restart the cluster
    when submitting
    : select this check box to prevent
    the Studio from restarting the cluster when the Studio submits your
    Jobs. However, if you make changes in your Jobs, clear this check
    box so that the Studio restarts your cluster to take these changes
    into account.

If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the
Spark checkpointing operation. In the field that is displayed, enter the
directory in which Spark stores, in the file system of the cluster, the context
data of the computations such as the metadata and the generated RDDs of this
computation.

For further information about the Spark checkpointing operation, see http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing .

Defining the AWS Qubole connection parameters for Spark Jobs

Complete the Qubole connection configuration in the Spark
configuration
tab of the Run view of your Job.
This configuration is effective on a per-Job basis.

Qubole is supported only in the traditional data integration framework (the Standard framework) and in the Spark frameworks.

  • You have properly set up your Qubole cluster on AWS. For further information about
    how to do this, see Getting Started with Qubole on AWS from the
    Qubole documentation.
  • Ensure that the Qubole account to be used has the proper IAM role that is allowed to
    read/write to the S3 bucket to be used. For further details, contact the
    administrator of your Qubole system or see Cross-account IAM Role for QDS from
    the Qubole documentation.
  • Ensure that the AWS account to be used has the proper read/write permissions to
    the S3 bucket to be used. For this purpose, contact the administrator of your
    AWS system.
  1. Enter the basic connection information to Qubole.

    Connection configuration

    • Click the button next to the
      API Token field to enter the
      authentication token generated for the Qubole user account
      to be used. For further information about how to obtain this
      token, see Manage Qubole
      account
      from the Qubole documentation.

      This
      token allows you to specify the user account you want to
      use to access Qubole. Your Job automatically uses
      the rights and permissions granted to this user account
      in Qubole.

    • Select the Cluster label check
      box and enter the name of the Qubole cluster to be used. If
      you leave this check box clear, the default cluster is used.

      If you need details about your default cluster,
      ask the administrator of your Qubole service. You can
      also read this article
      from the Qubole documentation to find more information
      about configuring a default Qubole cluster.

    • Select the Change API endpoint
      check box and select the region to be used. If you leave this
      check box clear, the default region is used.

      For further information about the Qubole Endpoints supported on
      QDS-on-AWS, see Supported Qubole Endpoints on Different Cloud Providers.

  2. Configure the connection to the S3 file system to be used to temporarily store the dependencies of your Job so that your Qubole cluster has access to these dependencies.

    This configuration is used for your Job dependencies only. Use a
    tS3Configuration in your Job to write your actual
    business data in the S3 system with Qubole. Without
    tS3Configuration, this business data is written in the
    Qubole HDFS system and destroyed once you shut down your cluster.

    • Access key and
      Secret key:
      enter the authentication information required to connect to the Amazon
      S3 bucket to be used.

      To enter the password, click the […] button next to the
      password field, and then in the pop-up dialog box enter the password between double quotes
      and click OK to save the settings.

    • Bucket name: enter the name of the bucket in
      which you want to store the dependencies of your Job. This bucket must
      already exist on S3.
    • Temporary resource folder: enter the
      directory in which you want to store the dependencies of your Job. For
      example, enter temp_resources to write the dependencies in
      the /temp_resources folder in the bucket.

      If this folder already exists at runtime, its contents are overwritten
      by the upcoming dependencies; otherwise, this folder is automatically
      created.

    • Region: specify the AWS region by selecting a region name from the
      list. For more information about the AWS Region, see Regions and Endpoints.
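
    As a minimal sketch, assuming a hypothetical AWS access key and bucket name
    and reusing the example folder name given above, the dependency-storage
    fields might be filled in as follows:

      Access key:                 "AKIAXXXXXXXXXXXXXXXX"   (hypothetical value)
      Secret key:                 entered through the [...] button, between double quotes
      Bucket name:                "my-qubole-bucket"       (this bucket must already exist on S3)
      Temporary resource folder:  "temp_resources"
      Region:                     selected from the list, for example us-east-1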

  • After the connection is configured, you can optionally tune the Spark
    performance by following the process explained in:

    • for Spark Batch Jobs.

    • for Spark Streaming Jobs.

  • If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the
    Spark checkpointing operation. In the field that is displayed, enter the
    directory in which Spark stores, in the file system of the cluster, the context
    data of the computations such as the metadata and the generated RDDs of this
    computation.

    For further information about the Spark checkpointing operation, see http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing .

Defining the EMR connection parameters

Complete the EMR connection configuration in the Spark
configuration
tab of the Run view of your Job.
This configuration is effective on a per-Job basis.

  1. Enter the basic connection information to EMR:

    Yarn client

    The Studio runs the Spark driver to orchestrate how the Job should be
    performed and then sends the orchestration to the Yarn service of a given
    Hadoop cluster so that the Resource Manager of this Yarn service
    requests execution resources accordingly.

    If you are using the Yarn client
    mode, you need to set the following parameters in their corresponding
    fields (if you leave the check box of a service clear, then at runtime,
    the configuration about this parameter in the Hadoop cluster to be used
    will be ignored):

    • In the Resource manager
      field, enter the address of the ResourceManager service of the Hadoop cluster to
      be used.

    • Select the Set resourcemanager
      scheduler address
      check box and enter the Scheduler address in
      the field that appears.

    • Select the Set jobhistory
      address
      check box and enter the location of the JobHistory
      server of the Hadoop cluster to be used. This allows the metrics information of
      the current Job to be stored in that JobHistory server.

    • Select the Set staging
      directory
      check box and enter this directory defined in your
      Hadoop cluster for temporary files created by running programs. Typically, this
      directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
      such as yarn-site.xml or mapred-site.xml of your distribution.

    • If you are accessing the Hadoop cluster running with Kerberos security,
      select this check box, then enter the Kerberos principal names for the
      ResourceManager service and the JobHistory service in the displayed fields. This
      enables you to use your user name to authenticate against the credentials stored in
      Kerberos. These principals can be found in the configuration files of your
      distribution, such as in yarn-site.xml and in mapred-site.xml.

      If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
      pairs of Kerberos principals and encrypted keys. You need to enter the principal to
      be used in the Principal field and the access
      path to the keytab file itself in the Keytab
      field. This keytab file must be stored in the machine in which your Job actually
      runs, for example, on a Talend
      Jobserver.

      Note that the user that executes a keytab-enabled Job is not necessarily
      the one a principal designates but must have the right to read the keytab file being
      used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
      situation, ensure that user1 has the right to read the keytab
      file to be used.

    • The User name field is available when you are not using
      Kerberos to authenticate. In the User name field, enter the
      login user name for your distribution. If you leave it empty, the user name of the machine
      hosting the Studio will be used.

    • If the Spark cluster cannot recognize the machine in which the Job is
      launched, select this Define the driver hostname or IP
      address
      check box and enter the host name or the IP address of
      this machine. This allows the Spark master and its workers to recognize this
      machine to find the Job and thus its driver.

      Note that in this situation, you also need to add the name and the IP
      address of this machine to its host file.
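
    As a minimal sketch, assuming a hypothetical master host name and the
    standard Hadoop default ports (ResourceManager 8032, scheduler 8030,
    JobHistory 10020) and the usual default staging directory, the Yarn client
    fields described above might look like the following; check yarn-site.xml
    and mapred-site.xml of your cluster for the actual values:

      Resource manager:                     "emr-master.example.com:8032"
      Resourcemanager scheduler address:    "emr-master.example.com:8030"
      Jobhistory address:                   "emr-master.example.com:10020"
      Staging directory:                    "/tmp/hadoop-yarn/staging"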

    Yarn cluster

    The Spark driver runs in your Yarn cluster to orchestrate how the Job
    should be performed.

    If you are using the Yarn cluster mode, you need
    to define the following parameters in their corresponding fields (if you
    leave the check box of a service clear, then at runtime, the
    configuration about this parameter in the Hadoop cluster to be used will
    be ignored):

    • In the Resource manager
      field, enter the address of the ResourceManager service of the Hadoop cluster to
      be used.

    • Select the Set resourcemanager
      scheduler address
      check box and enter the Scheduler address in
      the field that appears.

    • Select the Set jobhistory
      address
      check box and enter the location of the JobHistory
      server of the Hadoop cluster to be used. This allows the metrics information of
      the current Job to be stored in that JobHistory server.

    • Select the Set staging
      directory
      check box and enter this directory defined in your
      Hadoop cluster for temporary files created by running programs. Typically, this
      directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
      such as yarn-site.xml or mapred-site.xml of your distribution.

    • Set path to custom Hadoop
      configuration JAR
      : if you are using
      connections defined in Repository to
      connect to your Cloudera or Hortonworks cluster, you can
      select this check box in the
      Repository wizard and in the
      field that is displayed, specify the path to the JAR file
      that provides the connection parameters of your Hadoop
      environment. Note that this file must be accessible from the
      machine where your Job is launched.

      This kind of Hadoop configuration JAR file is
      automatically generated when you build a Big Data Job from the
      Studio. This JAR file is by default named with this
      pattern: You
      can also download this JAR file from the web console of your
      cluster or simply create a JAR file yourself by putting the
      configuration files in the root of your JAR file. For
      example:

      The parameters from your custom JAR file override the parameters
      you put in the Spark configuration field.
      They also override the configuration you set in the
      configuration components such as
      tHDFSConfiguration or
      tHBaseConfiguration when the related
      storage system such as HDFS, HBase or Hive are native to Hadoop.
      But they do not override the configuration set in the
      configuration components for the third-party storage system such
      as tAzureFSConfiguration.

    • If you are accessing the Hadoop cluster running with Kerberos security,
      select this check box, then enter the Kerberos principal names for the
      ResourceManager service and the JobHistory service in the displayed fields. This
      enables you to use your user name to authenticate against the credentials stored in
      Kerberos. These principals can be found in the configuration files of your
      distribution, such as in yarn-site.xml and in mapred-site.xml.

      If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
      pairs of Kerberos principals and encrypted keys. You need to enter the principal to
      be used in the Principal field and the access
      path to the keytab file itself in the Keytab
      field. This keytab file must be stored in the machine in which your Job actually
      runs, for example, on a Talend
      Jobserver.

      Note that the user that executes a keytab-enabled Job is not necessarily
      the one a principal designates but must have the right to read the keytab file being
      used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
      situation, ensure that user1 has the right to read the keytab
      file to be used.

    • The User name field is available when you are not using
      Kerberos to authenticate. In the User name field, enter the
      login user name for your distribution. If you leave it empty, the user name of the machine
      hosting the Studio will be used.

    • Select the Wait for the Job to complete check box to make your Studio or,
      if you use Talend
      Jobserver, your Job JVM keep monitoring the Job until the execution of the Job
      is over. By selecting this check box, you actually set the spark.yarn.submit.waitAppCompletion property to be true. While
      it is generally useful to select this check box when running a Spark Batch Job,
      it makes more sense to keep this check box clear when running a Spark Streaming
      Job.

    Ensure that the user name in the Yarn
    client
    mode is the same one you put in
    tS3Configuration, the component used to provide S3
    connection information to Spark.


  2. With the Yarn client mode, the
    Property type list is displayed to allow you
    to select an established Hadoop connection from the Repository, on the condition that you have created this connection
    in the Repository. Then the Studio will reuse
    that set of connection information for this Job.


  3. If you need to launch from Windows, it is recommended to specify where
    the winutils.exe program to be used is stored.

    • If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box
      and enter the directory where your winutils.exe is
      stored.

    • Otherwise, leave this check box clear; the Studio generates one
      by itself and automatically uses it for this Job.


  4. In the Spark “scratch” directory
    field, enter the local directory in which the Studio stores temporary files
    such as the jar files to be transferred. If you launch the Job
    on Windows, the default disk is C:; so if you leave /tmp in this field, this directory is C:/tmp.

  • After the connection is configured, you can optionally tune the Spark
    performance by following the process explained in:

    • for Spark Batch Jobs.

    • for Spark Streaming Jobs.

  • It is recommended to activate the Spark logging and
    checkpointing system in the Spark configuration tab of the Run view of your Spark
    Job, to help you debug and resume your Spark Job when issues arise.

Defining the Cloudera connection parameters

Complete the Cloudera connection configuration in the Spark
configuration
tab of the Run view of your Job.
This configuration is effective on a per-Job basis.

If you cannot find the Cloudera version to be used in this drop-down list, you can
add your distribution via the dynamic distribution settings in the Studio.

  1. Select the type of the Spark cluster you need to connect to.

    Standalone

    The Studio connects to a Spark-enabled cluster to run the Job from this
    cluster.

    If you are using the Standalone mode, you need to
    set the following parameters:

    • In the Spark host field, enter the URI
      of the Spark Master of the Hadoop cluster to be used.

    • In the Spark home field, enter the
      location of the Spark executable installed in the Hadoop cluster to be used.

    • If the Spark cluster cannot recognize the machine in which the Job is
      launched, select this Define the driver hostname or IP
      address
      check box and enter the host name or the IP address of
      this machine. This allows the Spark master and its workers to recognize this
      machine to find the Job and thus its driver.

      Note that in this situation, you also need to add the name and the IP
      address of this machine to its host file.
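
    As a minimal sketch, assuming a hypothetical master host and a typical
    installation path, the Standalone fields described above might look like the
    following; 7077 is the default Spark standalone master port, but verify both
    values with your cluster administrator:

      Spark host:  "spark://spark-master.example.com:7077"
      Spark home:  "/usr/lib/spark"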

    Yarn client

    The Studio runs the Spark driver to orchestrate how the Job should be
    performed and then sends the orchestration to the Yarn service of a given
    Hadoop cluster so that the Resource Manager of this Yarn service
    requests execution resources accordingly.

    If you are using the Yarn client
    mode, you need to set the following parameters in their corresponding
    fields (if you leave the check box of a service clear, then at runtime,
    the configuration about this parameter in the Hadoop cluster to be used
    will be ignored):

    • In the Resource manager
      field, enter the address of the ResourceManager service of the Hadoop cluster to
      be used.

    • Select the Set resourcemanager
      scheduler address
      check box and enter the Scheduler address in
      the field that appears.

    • Select the Set jobhistory
      address
      check box and enter the location of the JobHistory
      server of the Hadoop cluster to be used. This allows the metrics information of
      the current Job to be stored in that JobHistory server.

    • Select the Set staging
      directory
      check box and enter this directory defined in your
      Hadoop cluster for temporary files created by running programs. Typically, this
      directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
      such as yarn-site.xml or mapred-site.xml of your distribution.

    • If you are accessing the Hadoop cluster running with Kerberos security,
      select this check box, then enter the Kerberos principal names for the
      ResourceManager service and the JobHistory service in the displayed fields. This
      enables you to use your user name to authenticate against the credentials stored in
      Kerberos. These principals can be found in the configuration files of your
      distribution, such as in yarn-site.xml and in mapred-site.xml.

      If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
      pairs of Kerberos principals and encrypted keys. You need to enter the principal to
      be used in the Principal field and the access
      path to the keytab file itself in the Keytab
      field. This keytab file must be stored in the machine in which your Job actually
      runs, for example, on a Talend
      Jobserver.

      Note that the user that executes a keytab-enabled Job is not necessarily
      the one a principal designates but must have the right to read the keytab file being
      used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
      situation, ensure that user1 has the right to read the keytab
      file to be used.

    • The User name field is available when you are not using
      Kerberos to authenticate. In the User name field, enter the
      login user name for your distribution. If you leave it empty, the user name of the machine
      hosting the Studio will be used.

    • If the Spark cluster cannot recognize the machine in which the Job is
      launched, select this Define the driver hostname or IP
      address
      check box and enter the host name or the IP address of
      this machine. This allows the Spark master and its workers to recognize this
      machine to find the Job and thus its driver.

      Note that in this situation, you also need to add the name and the IP
      address of this machine to its host file.

    Yarn cluster

    The Spark driver runs in your Yarn cluster to orchestrate how the Job
    should be performed.

    If you are using the Yarn cluster mode, you need
    to define the following parameters in their corresponding fields (if you
    leave the check box of a service clear, then at runtime, the
    configuration about this parameter in the Hadoop cluster to be used will
    be ignored):

    • In the Resource manager
      field, enter the address of the ResourceManager service of the Hadoop cluster to
      be used.

    • Select the Set resourcemanager
      scheduler address
      check box and enter the Scheduler address in
      the field that appears.

    • Select the Set jobhistory
      address
      check box and enter the location of the JobHistory
      server of the Hadoop cluster to be used. This allows the metrics information of
      the current Job to be stored in that JobHistory server.

    • Select the Set staging
      directory
      check box and enter this directory defined in your
      Hadoop cluster for temporary files created by running programs. Typically, this
      directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
      such as yarn-site.xml or mapred-site.xml of your distribution.

    • Set path to custom Hadoop
      configuration JAR
      : if you are using
      connections defined in Repository to
      connect to your Cloudera or Hortonworks cluster, you can
      select this check box in the
      Repository wizard and in the
      field that is displayed, specify the path to the JAR file
      that provides the connection parameters of your Hadoop
      environment. Note that this file must be accessible from the
      machine where your Job is launched.

      This kind of Hadoop configuration JAR file is
      automatically generated when you build a Big Data Job from the
      Studio. This JAR file is by default named with this
      pattern: You
      can also download this JAR file from the web console of your
      cluster or simply create a JAR file yourself by putting the
      configuration files in the root of your JAR file. For
      example:

      The parameters from your custom JAR file override the parameters
      you put in the Spark configuration field.
      They also override the configuration you set in the
      configuration components such as
      tHDFSConfiguration or
      tHBaseConfiguration when the related
      storage system such as HDFS, HBase or Hive are native to Hadoop.
      But they do not override the configuration set in the
      configuration components for the third-party storage system such
      as tAzureFSConfiguration.

    • If you are accessing the Hadoop cluster running with Kerberos security,
      select this check box, then enter the Kerberos principal names for the
      ResourceManager service and the JobHistory service in the displayed fields. This
      enables you to use your user name to authenticate against the credentials stored in
      Kerberos. These principals can be found in the configuration files of your
      distribution, such as in yarn-site.xml and in mapred-site.xml.

      If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
      pairs of Kerberos principals and encrypted keys. You need to enter the principal to
      be used in the Principal field and the access
      path to the keytab file itself in the Keytab
      field. This keytab file must be stored in the machine in which your Job actually
      runs, for example, on a Talend
      Jobserver.

      Note that the user that executes a keytab-enabled Job is not necessarily
      the one a principal designates but must have the right to read the keytab file being
      used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
      situation, ensure that user1 has the right to read the keytab
      file to be used.

    • The User name field is available when you are not using
      Kerberos to authenticate. In the User name field, enter the
      login user name for your distribution. If you leave it empty, the user name of the machine
      hosting the Studio will be used.

    • Select the Wait for the Job to complete check box to make your Studio or,
      if you use Talend
      Jobserver, your Job JVM keep monitoring the Job until the execution of the Job
      is over. By selecting this check box, you actually set the spark.yarn.submit.waitAppCompletion property to be true. While
      it is generally useful to select this check box when running a Spark Batch Job,
      it makes more sense to keep this check box clear when running a Spark Streaming
      Job.

    Ensure that the user name in the Yarn
    client
    mode is the same one you put in
    tHDFSConfiguration, the component used to provide HDFS
    connection information to Spark.


  2. With the Yarn client mode, the
    Property type list is displayed to allow you
    to select an established Hadoop connection from the Repository, on the condition that you have created this connection
    in the Repository. Then the Studio will reuse
    that set of connection information for this Job.


  3. If you need to launch from Windows, it is recommended to specify where
    the winutils.exe program to be used is stored.

    • If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box
      and enter the directory where your winutils.exe is
      stored.

    • Otherwise, leave this check box clear; the Studio generates one
      by itself and automatically uses it for this Job.


  4. In the Spark “scratch” directory
    field, enter the local directory in which the Studio stores temporary files
    such as the jar files to be transferred. If you launch the Job
    on Windows, the default disk is C:; so if you leave /tmp in this field, this directory is C:/tmp.

  • After the connection is configured, you can optionally tune the Spark
    performance by following the process explained in:

    • for Spark Batch Jobs.

    • for Spark Streaming Jobs.

  • It is recommended to activate the Spark logging and
    checkpointing system in the Spark configuration tab of the Run view of your Spark
    Job, to help you debug and resume your Spark Job when issues arise.

  • If you are using Cloudera V5.5+ to run your MapReduce or Apache Spark Batch
    Jobs, you can make use of Cloudera Navigator to trace the lineage of a given
    data flow to discover how this data flow was generated by a Job.

Defining the Dataproc connection parameters

Complete the Google Dataproc connection configuration in the Spark
configuration
tab of the Run view of your Job.
This configuration is effective on a per-Job basis.

Only the Yarn client mode is available for this type of cluster.

  1. Enter the basic connection information to Dataproc:

    Project identifier

    Enter the ID of your Google Cloud Platform project.

    If you are not certain about your project ID, check it in the Manage
    Resources page of your Google Cloud Platform services.

    Cluster identifier

    Enter the ID of your Dataproc cluster to be used.

    Region

    From this drop-down list, select the Google Cloud region to
    be used.

    Google Storage staging bucket

    Because a Talend Job expects its
    dependent jar files to be available for execution, specify the Google Storage
    directory to which these jar files are transferred so that your Job can access
    them at execution time.

    The directory to be entered must end with a slash (/). If the directory does
    not exist, it is created on the fly, but the bucket to be used must already
    exist.
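
    As a minimal sketch, assuming hypothetical project, cluster and bucket names
    (the exact form expected for the staging bucket field, with or without a
    gs:// prefix, should be checked against your Studio version):

      Project identifier:             "my-gcp-project"
      Cluster identifier:             "my-dataproc-cluster"
      Region:                         selected from the list, for example europe-west1
      Google Storage staging bucket:  "gs://my-bucket/talend-dependencies/"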

  2. Provide the authentication information to your Google Dataproc cluster:

    Provide Google Credentials in file

    Leave this check box clear when you launch your Job from a machine on which
    the Google Cloud SDK has been installed and authorized to use your user
    account credentials to access Google Cloud Platform. This machine is often
    your local machine.

    When you launch your Job from a remote
    machine, such as a Jobserver, select this check box and, in the
    Path to Google Credentials file field that is
    displayed, enter the path to the directory in which this JSON file is stored
    on the Jobserver machine.

    For further information about this Google Credentials file, contact the
    administrator of your Google Cloud Platform or visit the Google Cloud
    Platform Auth Guide.


  3. With the Yarn client mode, the
    Property type list is displayed to allow you
    to select an established Hadoop connection from the Repository, on the condition that you have created this connection
    in the Repository. Then the Studio will reuse
    that set of connection information for this Job.


  4. In the Spark “scratch” directory
    field, enter the local directory in which the Studio stores temporary files
    such as the jar files to be transferred. If you launch the Job
    on Windows, the default disk is C:; so if you leave /tmp in this field, this directory is C:/tmp.

  • After the connection is configured, you can optionally tune the Spark
    performance by following the process explained in:

    • for Spark Batch Jobs.

    • for Spark Streaming Jobs.

  • It is recommended to activate the Spark logging and
    checkpointing system in the Spark configuration tab of the Run view of your Spark
    Job, to help you debug and resume your Spark Job when issues arise.

Defining the Hortonworks connection parameters

Complete the Hortonworks connection configuration in the Spark
configuration
tab of the Run view of your Job.
This configuration is effective on a per-Job basis.

  1. Enter the basic connection information to Hortonworks:

    Yarn client

    The Studio runs the Spark driver to orchestrate how the Job should be
    performed and then sends the orchestration to the Yarn service of a given
    Hadoop cluster so that the Resource Manager of this Yarn service
    requests execution resources accordingly.

    If you are using the Yarn client
    mode, you need to set the following parameters in their corresponding
    fields (if you leave the check box of a service clear, then at runtime,
    the configuration about this parameter in the Hadoop cluster to be used
    will be ignored):

    • In the Resource manager
      field, enter the address of the ResourceManager service of the Hadoop cluster to
      be used.

    • Select the Set resourcemanager
      scheduler address
      check box and enter the Scheduler address in
      the field that appears.

    • Select the Set jobhistory
      address
      check box and enter the location of the JobHistory
      server of the Hadoop cluster to be used. This allows the metrics information of
      the current Job to be stored in that JobHistory server.

    • Select the Set staging
      directory
      check box and enter this directory defined in your
      Hadoop cluster for temporary files created by running programs. Typically, this
      directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
      such as yarn-site.xml or mapred-site.xml of your distribution.

    • If you are accessing the Hadoop cluster running with Kerberos security,
      select this check box, then enter the Kerberos principal names for the
      ResourceManager service and the JobHistory service in the displayed fields. This
      enables you to use your user name to authenticate against the credentials stored in
      Kerberos. These principals can be found in the configuration files of your
      distribution, such as in yarn-site.xml and in mapred-site.xml.

      If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
      pairs of Kerberos principals and encrypted keys. You need to enter the principal to
      be used in the Principal field and the access
      path to the keytab file itself in the Keytab
      field. This keytab file must be stored in the machine in which your Job actually
      runs, for example, on a Talend
      Jobserver.

      Note that the user that executes a keytab-enabled Job is not necessarily
      the one a principal designates but must have the right to read the keytab file being
      used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
      situation, ensure that user1 has the right to read the keytab
      file to be used.

    • The User name field is available when you are not using
      Kerberos to authenticate. In the User name field, enter the
      login user name for your distribution. If you leave it empty, the user name of the machine
      hosting the Studio will be used.

    • If the Spark cluster cannot recognize the machine in which the Job is
      launched, select this Define the driver hostname or IP
      address
      check box and enter the host name or the IP address of
      this machine. This allows the Spark master and its workers to recognize this
      machine to find the Job and thus its driver.

      Note that in this situation, you also need to add the name and the IP
      address of this machine to its host file.

    Yarn cluster

    The Spark driver runs in your Yarn cluster to orchestrate how the Job
    should be performed.

    If you are using the Yarn cluster mode, you need
    to define the following parameters in their corresponding fields (if you
    leave the check box of a service clear, then at runtime, the
    configuration about this parameter in the Hadoop cluster to be used will
    be ignored):

    • In the Resource manager
      field, enter the address of the ResourceManager service of the Hadoop cluster to
      be used.

    • Select the Set resourcemanager
      scheduler address
      check box and enter the Scheduler address in
      the field that appears.

    • Select the Set jobhistory
      address
      check box and enter the location of the JobHistory
      server of the Hadoop cluster to be used. This allows the metrics information of
      the current Job to be stored in that JobHistory server.

    • Select the Set staging
      directory
      check box and enter this directory defined in your
      Hadoop cluster for temporary files created by running programs. Typically, this
      directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
      such as yarn-site.xml or mapred-site.xml of your distribution.

    • Set path to custom Hadoop
      configuration JAR
      : if you are using
      connections defined in Repository to
      connect to your Cloudera or Hortonworks cluster, you can
      select this check box in the
      Repository wizard and in the
      field that is displayed, specify the path to the JAR file
      that provides the connection parameters of your Hadoop
      environment. Note that this file must be accessible from the
      machine where your Job is launched.

      This kind of Hadoop configuration JAR file is
      automatically generated when you build a Big Data Job from the
      Studio. This JAR file is by default named with this
      pattern: You
      can also download this JAR file from the web console of your
      cluster or simply create a JAR file yourself by putting the
      configuration files in the root of your JAR file. For
      example:

      The parameters from your custom JAR file override the parameters
      you put in the Spark configuration field.
      They also override the configuration you set in the
      configuration components such as
      tHDFSConfiguration or
      tHBaseConfiguration when the related
      storage system such as HDFS, HBase or Hive are native to Hadoop.
      But they do not override the configuration set in the
      configuration components for the third-party storage system such
      as tAzureFSConfiguration.

    • If you are accessing the Hadoop cluster running with Kerberos security,
      select this check box, then enter the Kerberos principal names for the
      ResourceManager service and the JobHistory service in the displayed fields. This
      enables you to use your user name to authenticate against the credentials stored in
      Kerberos. These principals can be found in the configuration files of your
      distribution, such as in yarn-site.xml and in mapred-site.xml.

      If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
      pairs of Kerberos principals and encrypted keys. You need to enter the principal to
      be used in the Principal field and the access
      path to the keytab file itself in the Keytab
      field. This keytab file must be stored in the machine in which your Job actually
      runs, for example, on a Talend
      Jobserver.

      Note that the user that executes a keytab-enabled Job is not necessarily
      the one a principal designates but must have the right to read the keytab file being
      used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
      situation, ensure that user1 has the right to read the keytab
      file to be used.

    • The User name field is available when you are not using
      Kerberos to authenticate. In the User name field, enter the
      login user name for your distribution. If you leave it empty, the user name of the machine
      hosting the Studio will be used.

    • Select the Wait for the Job to complete check box to make your Studio or,
      if you use Talend Jobserver, your Job JVM keep monitoring the Job until
      its execution is over. Selecting this check box actually sets the
      spark.yarn.submit.waitAppCompletion property to true. While it is
      generally useful to select this check box when running a Spark Batch Job,
      it makes more sense to keep it clear when running a Spark Streaming Job.

    Ensure that the user name in the Yarn
    client
    mode is the same one you put in
    tHDFSConfiguration, the component used to provide HDFS
    connection information to Spark.
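    As an illustration only of the custom Hadoop configuration JAR mentioned
    above, the following shell commands sketch how such a JAR file could be
    assembled by hand; the source path of the configuration files and the JAR
    file name are assumptions to be adapted to your own cluster.

      # Collect the client configuration files of your cluster
      # (the /etc/hadoop/conf path is an assumption).
      mkdir hadoop-conf
      cp /etc/hadoop/conf/core-site.xml \
         /etc/hadoop/conf/hdfs-site.xml \
         /etc/hadoop/conf/yarn-site.xml \
         /etc/hadoop/conf/mapred-site.xml hadoop-conf/

      # Put the configuration files at the root of the JAR file.
      cd hadoop-conf
      jar cf ../hadoop-conf-custom.jar *.xml

    You would then enter the path to hadoop-conf-custom.jar in the field that
    is displayed when you select the Set path to custom Hadoop configuration
    JAR check box.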


  2. With the Yarn client mode, the
    Property type list is displayed to allow you
    to select an established Hadoop connection from the Repository, on the condition that you have created this connection
    in the Repository. Then the Studio will reuse
    that set of connection information for this Job.


  3. If you need to launch from Windows, it is recommended to specify where
    the winutils.exe program to be used is stored.

    • If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box
      and enter the directory where your winutils.exe is
      stored.

    • Otherwise, leave this check box clear; the Studio generates one
      by itself and automatically uses it for this Job.


  4. In the Spark “scratch” directory field, enter the directory in which the
    Studio stores, in the local system, the temporary files such as the jar
    files to be transferred. If you launch the Job on Windows, the default
    disk is C:. So if you leave /tmp in this field, this directory is C:/tmp.

  5. If you encounter the hdp.version is not found issue when
    executing your Job, select the Set hdp.version check box
    to define the hdp.version variable in your Job and also in
    your cluster.

  • After the connection is configured, you can tune
    the Spark performance, although this is not required, by following the process explained in:

    • Tuning Spark for Apache Spark Batch Jobs, for Spark Batch Jobs.

    • Tuning Spark for Apache Spark Streaming Jobs, for Spark Streaming Jobs.

  • It is recommended to activate the Spark logging and
    checkpointing system in the Spark configuration tab of the Run view of your Spark
    Job, in order to help debug and resume your Spark Job when issues arise. For
    further information, see Logging and checkpointing the activities of your
    Apache Spark Job.

  • If you are using Hortonworks Data Platform V2.4 onwards to run your MapReduce
    or Spark Batch Jobs and Apache Atlas has been installed in your Hortonworks
    cluster, you can make use of Atlas to trace the lineage of a given data flow
    to discover how this data was generated by a Job.

Defining the MapR connection parameters

Complete the MapR connection configuration in the Spark
configuration
tab of the Run view of your Job.
This configuration is effective on a per-Job basis.

  1. Select the type of the Spark cluster you need to connect to.

    Standalone

    The Studio connects to a Spark-enabled cluster to run the Job from this
    cluster.

    If you are using the Standalone mode, you need to
    set the following parameters:

    • In the Spark host field, enter the URI
      of the Spark Master of the Hadoop cluster to be used.

    • In the Spark home field, enter the
      location of the Spark executable installed in the Hadoop cluster to be used.

    • If the Spark cluster cannot recognize the machine in which the Job is
      launched, select the Define the driver hostname or IP address check box
      and enter the host name or the IP address of this machine. This allows
      the Spark master and its workers to recognize this machine to find the
      Job and thus its driver.

      Note that in this situation, you also need to add the name and the IP
      address of this machine to its hosts file.

    Yarn client

    The Studio runs the Spark driver to orchestrate how the Job should be
    performed and then sends the orchestration to the Yarn service of a given
    Hadoop cluster so that the Resource Manager of this Yarn service
    requests execution resources accordingly.

    If you are using the Yarn client
    mode, you need to set the following parameters in their corresponding
    fields (if you leave the check box of a service clear, then at runtime,
    the configuration about this parameter in the Hadoop cluster to be used
    will be ignored):

    • In the Resource manager field, enter the address of the ResourceManager
      service of the Hadoop cluster to be used.

    • Select the Set resourcemanager
      scheduler address
      check box and enter the Scheduler address in
      the field that appears.

    • Select the Set jobhistory
      address
      check box and enter the location of the JobHistory
      server of the Hadoop cluster to be used. This allows the metrics information of
      the current Job to be stored in that JobHistory server.

    • Select the Set staging
      directory
      check box and enter this directory defined in your
      Hadoop cluster for temporary files created by running programs. Typically, this
      directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
      such as yarn-site.xml or mapred-site.xml of your distribution.

    • If you are accessing the Hadoop cluster running with Kerberos security,
      select this check box, then enter the Kerberos principal names for the
      ResourceManager service and the JobHistory service in the displayed fields. This
      enables you to use your user name to authenticate against the credentials stored in
      Kerberos. These principals can be found in the configuration files of your
      distribution, such as yarn-site.xml and mapred-site.xml (see the sketch
      at the end of this step).

      If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
      pairs of Kerberos principals and encrypted keys. You need to enter the principal to
      be used in the Principal field and the access
      path to the keytab file itself in the Keytab
      field. This keytab file must be stored in the machine in which your Job actually
      runs, for example, on a Talend
      Jobserver.

      Note that the user that executes a keytab-enabled Job is not necessarily
      the one a principal designates but must have the right to read the keytab file being
      used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
      situation, ensure that user1 has the right to read the keytab
      file to be used.

    • The User name field is available when you are not using
      Kerberos to authenticate. In the User name field, enter the
      login user name for your distribution. If you leave it empty, the user name of the machine
      hosting the Studio will be used.

    • If the Spark cluster cannot recognize the machine in which the Job is
      launched, select the Define the driver hostname or IP address check box
      and enter the host name or the IP address of this machine. This allows
      the Spark master and its workers to recognize this machine to find the
      Job and thus its driver.

      Note that in this situation, you also need to add the name and the IP
      address of this machine to its hosts file.

    Yarn cluster

    The Spark driver runs in your Yarn cluster to orchestrate how the Job
    should be performed.

    If you are using the Yarn cluster mode, you need
    to define the following parameters in their corresponding fields (if you
    leave the check box of a service clear, then at runtime, the
    configuration about this parameter in the Hadoop cluster to be used will
    be ignored):

    • In the Resource manager field, enter the address of the ResourceManager
      service of the Hadoop cluster to be used.

    • Select the Set resourcemanager
      scheduler address
      check box and enter the Scheduler address in
      the field that appears.

    • Select the Set jobhistory
      address
      check box and enter the location of the JobHistory
      server of the Hadoop cluster to be used. This allows the metrics information of
      the current Job to be stored in that JobHistory server.

    • Select the Set staging
      directory
      check box and enter this directory defined in your
      Hadoop cluster for temporary files created by running programs. Typically, this
      directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
      such as yarn-site.xml or mapred-site.xml of your distribution.

    • Set path to custom Hadoop configuration JAR: if you are using
      connections defined in the Repository to connect to your Cloudera or
      Hortonworks cluster, you can select this check box in the Repository
      wizard and, in the field that is displayed, specify the path to the JAR
      file that provides the connection parameters of your Hadoop environment.
      Note that this file must be accessible from the machine where your Job
      is launched.

      This kind of Hadoop configuration JAR file is automatically generated
      when you build a Big Data Job from the Studio and is named following a
      default pattern. You can also download this JAR file from the web
      console of your cluster, or simply create one yourself by putting the
      configuration files at the root of the JAR file.

      The parameters from your custom JAR file override the parameters you
      put in the Spark configuration field. They also override the
      configuration you set in the configuration components such as
      tHDFSConfiguration or tHBaseConfiguration when the related storage
      systems, such as HDFS, HBase or Hive, are native to Hadoop. However,
      they do not override the configuration set in the configuration
      components for third-party storage systems such as
      tAzureFSConfiguration.

    • If you are accessing the Hadoop cluster running with Kerberos security,
      select this check box, then enter the Kerberos principal names for the
      ResourceManager service and the JobHistory service in the displayed fields. This
      enables you to use your user name to authenticate against the credentials stored in
      Kerberos. These principals can be found in the configuration files of your
      distribution, such as yarn-site.xml and mapred-site.xml.

      If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
      pairs of Kerberos principals and encrypted keys. You need to enter the principal to
      be used in the Principal field and the access
      path to the keytab file itself in the Keytab
      field. This keytab file must be stored in the machine in which your Job actually
      runs, for example, on a Talend
      Jobserver.

      Note that the user that executes a keytab-enabled Job is not necessarily
      the one a principal designates but must have the right to read the keytab file being
      used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
      situation, ensure that user1 has the right to read the keytab
      file to be used.

    • The User name field is available when you are not using
      Kerberos to authenticate. In the User name field, enter the
      login user name for your distribution. If you leave it empty, the user name of the machine
      hosting the Studio will be used.

    • Select the Wait for the Job to complete check box to make your Studio or,
      if you use Talend Jobserver, your Job JVM keep monitoring the Job until
      its execution is over. Selecting this check box actually sets the
      spark.yarn.submit.waitAppCompletion property to true. While it is
      generally useful to select this check box when running a Spark Batch Job,
      it makes more sense to keep it clear when running a Spark Streaming Job.

    Ensure that the user name in the Yarn
    client
    mode is the same one you put in
    tHDFSConfiguration, the component used to provide HDFS
    connection information to Spark.
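    As a minimal sketch only of how to find the Kerberos information mentioned
    above, the following commands can be run on a cluster node; the
    configuration file paths and the keytab file name are assumptions to be
    adapted to your environment.

      # Look up the ResourceManager and JobHistory principals in the
      # configuration files of the distribution.
      grep -A 1 "yarn.resourcemanager.principal" /etc/hadoop/conf/yarn-site.xml
      grep -A 1 "mapreduce.jobhistory.principal" /etc/hadoop/conf/mapred-site.xml

      # List the principals contained in the keytab file to be used by the Job.
      klist -k -t /path/to/user1.keytab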


  2. With the Yarn client mode, the
    Property type list is displayed to allow you
    to select an established Hadoop connection from the Repository, on the condition that you have created this connection
    in the Repository. Then the Studio will reuse
    that set of connection information for this Job.


  3. If you need to launch from Windows, it is recommended to specify where
    the winutils.exe program to be used is stored.

    • If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box
      and enter the directory where your winutils.exe is
      stored.

    • Otherwise, leave this check box clear; the Studio generates one
      by itself and automatically uses it for this Job.

  4. Verify, for example with your cluster administrator, whether your MapR cluster
    is secured with the MapR ticket authentication mechanism.

    • If the MapR cluster to be used is secured with the MapR ticket authentication mechanism,
      set the MapR ticket authentication configuration by following the explanation in
      Setting up the MapR ticket authentication (a minimal ticket-generation sketch is
      given after this step).

    • Otherwise, leave the Use MapR Ticket
      authentication
      check box clear.
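    As a minimal sketch only, a MapR ticket is typically generated from a
    terminal on the machine that runs the Job before the Job is launched; the
    user name below is an assumption.

      # Generate a MapR ticket for the user running the Job
      # (you are prompted for the password of this user).
      maprlogin password -user user1

      # Check the ticket that has been generated.
      maprlogin print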


  5. In the Spark “scratch” directory field, enter the directory in which the
    Studio stores, in the local system, the temporary files such as the jar
    files to be transferred. If you launch the Job on Windows, the default
    disk is C:. So if you leave /tmp in this field, this directory is C:/tmp.

  • After the connection is configured, you can tune
    the Spark performance, although this is not required, by following the process explained in:

    • Tuning Spark for Apache Spark Batch Jobs, for Spark Batch Jobs.

    • Tuning Spark for Apache Spark Streaming Jobs, for Spark Streaming Jobs.

  • It is recommended to activate the Spark logging and
    checkpointing system in the Spark configuration tab of the Run view of your Spark
    Job, in order to help debug and resume your Spark Job when issues arise. For
    further information, see Logging and checkpointing the activities of your
    Apache Spark Job.

Creating an HDInsight cluster on Azure

Log in to your Azure account to create the HDInsight cluster to be
used.

  • You have an Azure account with appropriate rights and permissions to the
    HDInsight service.
  1. Navigate to the Azure portal: https://portal.azure.com/.
  2. Click All services on the menu bar on the left of the
    Azure welcome page.
  3. In the Analytics section, click HDInsight clusters to open the corresponding blade.
  4. Click Add to create an HDInsight cluster.
  5. Click Quick Create to display the Basics blade and enter the basic configuration information on this blade.

    Among the parameters, the Cluster
    type
    to be used must be the one officially supported by Talend. For example, select
    Spark 2.1.0 (HDI 3.6).

    For further information about the supported versions, search for supported
    Big Data platforms on Talend Help Center (https://help.talend.com).

    For further information about the parameters on this blade,
    see Create clusters from Azure
    documentation.

  6. Click Next to open the Storage
    blade to set the storage settings for the cluster.

    1. From the Primary storage type list, select
      Azure storage.
    2. For the storage account to be used, select the My
      subscriptions
      radio button, then click Select
      a storage account
      and choose the account to be used from
      the blade that is displayed.
    3. For the other parameters, leave them as they are to use the default
      values or enter the values you want to use.
  7. Click Next to proceed to the Summary
    step, in which you review and confirm the configuration made in the previous
    steps.
  8. Once you confirm the configuration, click Create.
Azure starts to create your cluster. Once done, your cluster appears on the
HDInsight clusters list.

Defining the HDInsight connection parameters

Complete the HDInsight connection configuration in the Spark
configuration
tab of the Run view of your Job.
This configuration is effective on a per-Job basis.

Only the Yarn cluster mode is available for this type of
cluster.

  1. Enter the basic connection information to Microsoft HDInsight (an illustrative sketch of these values is given at the end of this step):

    Livy configuration

    • The Hostname of
      Livy is the URL of your HDInsight cluster. This URL can be
      found in the Overview blade of your
      cluster. Enter this URL without the https:// part.
    • The default Port is 443.
    • The Username is the one defined when
      creating your cluster. You can find it in the SSH
      + Cluster login
      blade of your cluster.

    For further information about the Livy service used by HDInsight, see
    Submit Spark jobs using Livy.

    HDInsight
    configuration

    • The Username is the one defined when
      creating your cluster. You can find it in the SSH
      + Cluster login
      blade of your cluster.
    • The Password is defined when creating your HDInsight
      cluster for authentication to this cluster.

    Windows Azure Storage
    configuration

    Enter the address and the authentication information of the Azure Storage
    account to be used. In this configuration, you do not define where to read or write
    your business data but define where to deploy your Job only. Therefore always use
    the Azure
    Storage
    system for this configuration.

    In the Container field, enter the name
    of the container to be
    used. You can
    find the available containers in the Blob blade of the Azure
    Storage account to be used.

    In the Deployment Blob field, enter the
    location in which you want to store the current Job and its dependent libraries in
    this Azure Storage account.

    In the Hostname field, enter the
    Primary Blob Service Endpoint of your Azure Storage account without the https:// part. You can find this endpoint in the Properties blade of this storage account.

    In the Username field, enter the name of the Azure Storage account to be used.

    In the Password field, enter the access key of the Azure Storage account to be used. This key can be found in the Access keys blade of this storage account.
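    As an illustration only, the values entered in this step could look like
    the following; the cluster name, storage account name, container name and
    blob path are assumptions to be replaced with your own values.

      Hostname of Livy:   mycluster.azurehdinsight.net            (cluster URL without https://)
      Port:               443
      Username/Password:  the cluster login defined when creating the HDInsight cluster
      Hostname (WASB):    mystorageaccount.blob.core.windows.net  (Primary Blob Service Endpoint without https://)
      Container:          mycontainer
      Deployment Blob:    talend/jobs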


  2. In the Spark “scratch” directory field, enter the directory in which the
    Studio stores, in the local system, the temporary files such as the jar
    files to be transferred. If you launch the Job on Windows, the default
    disk is C:. So if you leave /tmp in this field, this directory is C:/tmp.

  3. Select the Wait for the Job to complete check box to make your Studio or,
    if you use Talend Jobserver, your Job JVM keep monitoring the Job until its
    execution is over. Selecting this check box actually sets the
    spark.yarn.submit.waitAppCompletion property to true. While it is generally
    useful to select this check box when running a Spark Batch Job, it makes
    more sense to keep it clear when running a Spark Streaming Job.
  • After the connection is configured, you can tune
    the Spark performance, although this is not required, by following the process explained in:

    • Tuning Spark for Apache Spark Batch Jobs, for Spark Batch Jobs.

    • Tuning Spark for Apache Spark Streaming Jobs, for Spark Streaming Jobs.

  • It is recommended to activate the Spark logging and
    checkpointing system in the Spark configuration tab of the Run view of your Spark
    Job, in order to help debug and resume your Spark Job when issues arise. For
    further information, see Logging and checkpointing the activities of your
    Apache Spark Job.

Defining the Cloudera Altus connection parameters

Complete the Altus connection configuration in the Spark
configuration
tab of the Run view of your Job.
This configuration is effective on a per-Job basis.

Only the Yarn cluster mode is available for this type of cluster.

Prerequisites:

The Cloudera Altus client, Altus CLI, must be installed on the machine in which your
Job is executed (a minimal installation sketch is given below).
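As a minimal sketch only, the Altus CLI could be installed and configured from a
terminal as follows; the pip package name is an assumption to be checked against the
Cloudera Altus documentation.

  # Install the Cloudera Altus client on the machine that executes the Job.
  pip install altuscli

  # Register your Altus access key and private key so that the Job can reuse them,
  # unless you prefer to select the Force Cloudera Altus credentials check box.
  altus configure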

  1. In the Spark configuration tab of the
    Run view of your Job, enter the basic connection
    information to Cloudera Altus.

    Force Cloudera Altus credentials

    Select this check box to provide the credentials with your
    Job.

    If you want to provide the credentials separately, for example by running
    the altus configure command manually in your terminal, clear this check
    box.

    Path to Cloudera Altus CLI

    Enter the path to the Cloudera Altus
    client, which must have been installed and activated in the
    machine in which your Job is executed. In production
    environments, this machine is typically a Talend Jobserver.

  2. Configure the virtual Cloudera cluster to be used.

    Use an existing Cloudera Altus cluster

    Select this check box to use a Cloudera Altus cluster already
    existing in your Cloud service. Otherwise, leave this check box
    clear to allow the Job to create a cluster on the fly.

    With this check box selected, only the Cluster name parameter is
    useful and the other parameters for the cluster configuration
    are hidden.

    Cluster name

    Enter the name of the cluster to be
    used.

    Environment

    Enter the name of the Cloudera Altus
    environment to be used to describe the resources allocated to
    the given cluster.

    If you do not know which environment to select, contact your
    Cloudera Altus administrator.

    Delete cluster after execution

    Select this check box if you want to remove the given cluster
    after the execution of your Job.

    Override with a JSON configuration

    Select this check box to manually edit JSON code in the
    Custom JSON field that is displayed
    to configure the cluster.

    Instance type

    Select the instance type for the instances in the cluster. All
    nodes that are deployed in this cluster use the same instance
    type.

    Worker node

    Enter the number of worker nodes to
    be created for the cluster.

    For details about the allowed number of worker nodes, see the documentation of Cloudera
    Altus
    .

    Cloudera Manager username and
    Cloudera Manager password

    Enter the authentication
    information to your Cloudera Manager service.

    SSH private key

    Browse, or enter the path to the SSH private key in order to
    upload and register it in the region specified in the Cloudera
    Altus environment.

    The Data Engineering service of Cloudera Altus uses this private
    key to access and configure instances of the cluster to be
    used.

    Custom bootstrap script

    If you want to create a cluster with a bootstrap script you
    provide, browse, or enter the path to this script in the
    Custom Bootstrap script field.

    For an example of an Altus bootstrap script, see Install a custom Python
    environment when creating a cluster
    from the Cloudera
    documentation.

  3. From the Cloud provider list, select the Cloud service
    that runs your Cloudera Altus cluster.

    • If your cloud provider is AWS, select AWS and
      define the Amazon S3 directory in which you store your Job
      dependencies.

      AWS

      • Access key and
        Secret key:
        enter the authentication information required to connect to the Amazon
        S3 bucket to be used.

        To enter the password, click the […] button next to the
        password field, and then in the pop-up dialog box enter the password between double quotes
        and click OK to save the settings.

      • Specify the AWS region by selecting a region name from the
        list or entering a region between double quotation marks (e.g. “us-east-1”) in the list. For more information about the AWS
        Region, see Regions and Endpoints.

      • S3 bucket name: enter the name of the bucket
        to be used to store the dependencies of your Job. This bucket must already
        exist.

      • S3 storage path: enter the directory in
        which you want to store the dependencies of your Job in this given bucket,
        for example, altus/jobjar. This directory is created if
        it does not exist at runtime.

      The Amazon S3 you specify here is used to store your Job dependencies
      only. To connect to the S3 system which hosts your actual data, use a
      tS3Configuration component in your Job.

    • If your cloud provider is Azure, select Azure to
      store your Job dependencies in your Azure Data Lake Storage.

      1. In your Azure portal, assign the Read/Write/Execute permissions
        to the Azure application to be used by the Job to access your
        Azure Data Lake Storage. For details about how to assign
        permissions, see Azure documentation: Assign the Azure AD
        application to the Azure Data Lake Storage account file or
        folder
        . For example:

        Spark configuration_3.png

        Without appropriate permissions, your Job dependencies cannot be
        transferred to your Azure Data Lake Storage.

      2. In your Altus console, identify the Data Lake Storage AAD Group
        Name used by your Altus environment in the Instance
        Settings
        section.

      3. In your Azure portal, assign the Read/Write/Execute permissions
        to this AAD group using the same procedure explained in Azure
        documentation: Assign the Azure AD
        application to the Azure Data Lake Storage account file or
        folder
        .

        Without appropriate permissions, your Job dependencies cannot be
        transferred to your Azure Data Lake Storage.

      4. In the Spark configuration tab, configure
        the connection to your Azure Data Lake Storage.

        Azure (technical preview)

        • ADLS account FQDN:

          Enter the address without the scheme part of the Azure Data Lake Storage
          account to be used, for example,
          ychendls.azuredatalakestore.net.

          This account must already exist in your Azure portal.

        • Azure App ID and Azure App
          key
          :

          In the
          Client ID and the Client
          key
          fields, enter, respectively, the authentication
          ID and the authentication key generated upon the registration of the
          application that the current Job you are developing uses to access
          Azure Data Lake Storage.

          This application must be the one to
          which you assigned permissions to access your Azure Data Lake Storage in
          the previous step.

        • Token endpoint:

          In the
          Token endpoint field, copy-paste the
          OAuth 2.0 token endpoint that you can obtain from the
          Endpoints list accessible on the
          App registrations page on your Azure
          portal.

      The Azure Data Lake Storage you specify here is used to store your Job
      dependencies only. To connect to the Azure system which hosts your
      actual data, use a tAzureFSConfiguration component in
      your Job.

  4. Select the Wait for the Job to complete check box to make your Studio or,
    if you use Talend Jobserver, your Job JVM keep monitoring the Job until its
    execution is over. Selecting this check box actually sets the
    spark.yarn.submit.waitAppCompletion property to true. While it is generally
    useful to select this check box when running a Spark Batch Job, it makes
    more sense to keep it clear when running a Spark Streaming Job.
  • After the connection is configured, you can tune
    the Spark performance, although this is not required, by following the process explained in:

    • Tuning Spark for Apache Spark Batch Jobs, for Spark Batch Jobs.

    • Tuning Spark for Apache Spark Streaming Jobs, for Spark Streaming Jobs.

  • It is recommended to activate the Spark logging and
    checkpointing system in the Spark configuration tab of the Run view of your Spark
    Job, in order to help debug and resume your Spark Job when issues arise. For
    further information, see Logging and checkpointing the activities of your
    Apache Spark Job.

  • If you need to consult the Altus related logs, check them in your Cloudera
    Manager service or on your Altus cluster instances.

Configuring a Spark stream for your Apache Spark streaming Job

Define how often your Spark Job creates and processes micro batches.
  1. In the Batch size
    field, enter the time interval at the end of which the Job reviews the source
    data to identify changes and processes the new micro batches.
  2. If need be, select the Define a streaming timeout check box and, in the
    field that is displayed, enter the time frame at the end of which the
    streaming Job automatically stops running (see the example below).
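    For example, assuming both fields are expressed in milliseconds, the
    following illustrative values make the Job create one micro batch every 5
    seconds and stop automatically after one hour:

      Batch size:         5000       (5 seconds per micro batch)
      Streaming timeout:  3600000    (the Job stops after 1 hour)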

Tuning Spark for Apache Spark Batch Jobs

You can define the tuning parameters in the Spark
configuration
tab of the Run view of your Spark
Job to obtain better performance of the Job, if the default values of these parameters
do not produce sufficient performance.

Generally speaking, Spark performs better with a lower number of big tasks than with a high number of small tasks.

  1. Select the Set Tuning properties
    check box to optimize the allocation of the resources to be used to run this
    Job. These properties are not mandatory for the Job to run successfully, but
    they are useful when Spark is bottlenecked by any resource issue in the cluster
    such as CPU, bandwidth or memory.
  2. Calculate the initial resource allocation as the point to start the
    tuning.

    A generic formula for this calculation is as follows (a worked example is given after this list):

    • Number of executors = (Total cores of the cluster) / 2

    • Number of cores per executor = 2

    • Memory per executor = (Up to total memory of the cluster) / (Number of executors)
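    As a worked example with illustrative figures, on a cluster offering 40
    cores and 160 GB of memory to Spark, this formula gives 20 executors with
    2 cores and up to 8 GB of memory each. The tuning fields of the Studio
    roughly correspond to the following Spark properties:

      # 40 cores / 2 = 20 executors; 160 GB / 20 executors = 8 GB per executor
      spark.executor.instances  20
      spark.executor.cores      2
      spark.executor.memory     8g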

  3. Define each parameter and, if needed, revise them until you obtain
    satisfactory performance.

    The following provides the exhaustive list of the tuning properties.
    The actual properties available in the Spark
    configuration
    tab may vary depending on the distribution you
    are using.

    Spark Standalone mode

    • Driver memory and Driver
      core
      : enter the allocation size of memory and
      the number of cores to be used by the driver of the current
      Job.

    • Executor memory: enter the allocation size
      of memory to be used by each Spark executor.

    • Core per executor: select this check box
      and in the displayed field, enter the number of cores to be used
      by each executor. If you leave this check box clear, the default
      allocation defined by Spark is used, for example, all available
      cores are used by one single executor in the
      Standalone mode.

    • Set Web UI port: if you need to change the
      default port of the Spark Web UI, select this check box and
      enter the port number you want to use.

    • Broadcast factory: select the broadcast
      implementation to be used to cache variables on each worker
      machine.

    • Customize Spark serializer: if you need to
      import an external Spark serializer, select this check box and
      in the field that is displayed, enter the fully qualified class
      name of the serializer to be used.

    • Job progress polling rate (in
      ms)
      : when using Spark V2.3 and onwards, enter the
      time interval (in milliseconds) at the end of which you want the
      Studio to ask Spark for the execution progress of your Job. Before
      V2.3, Spark automatically sends this information to the Studio when
      updates occur; the default value, 50 milliseconds, of this parameter
      allows the Studio to reproduce more or less the same scenario with
      Spark V2.3 and onwards.

      If you set this interval too long, you may
      lose information about the progress; if too short, you may send
      too many requests to Spark for only insignificant progress
      information.

    Spark Yarn client mode

    • Set application master tuning properties:
      select this check box and in the fields that are displayed,
      enter the amount of memory and the number of CPUs to be
      allocated to the ApplicationMaster service of your cluster.

      If you want to use the default allocation of your cluster, leave
      this check box clear.

    • Executor memory: enter the allocation size
      of memory to be used by each Spark executor.

    • Set executor memory overhead: select this
      check box and in the field that is displayed, enter the amount
      of off-heap memory (in MB) to be allocated per executor. This is
      actually the spark.yarn.executor.memoryOverhead property.

    • Core per executor: select this check box
      and in the displayed field, enter the number of cores to be used
      by each executor. If you leave this check box clear, the default
      allocation defined by Spark is used, for example, all available
      cores are used by one single executor in the
      Standalone mode.

    • Yarn resource allocation: select how you
      want Yarn to allocate resources among executors.

      • Auto: you let Yarn use its
        default number of executors. This number is
        2.

      • Fixed: you need to enter the number of executors to be used in the
        Num executors field that is displayed.

      • Dynamic: Yarn adapts the
        number of executors to suit the workload. You need
        to define the scale of this dynamic allocation by
        defining the initial number of executors to run in
        the Initial executors field,
        the lowest number of executors in the Min
        executors
        field and the largest number
        of executors in the Max
        executors
        field.

    • Set Web UI port: if you need to change the
      default port of the Spark Web UI, select this check box and
      enter the port number you want to use.

    • Broadcast factory: select the broadcast
      implementation to be used to cache variables on each worker
      machine.

    • Customize Spark serializer: if you need to
      import an external Spark serializer, select this check box and
      in the field that is displayed, enter the fully qualified class
      name of the serializer to be used.

    • Job progress polling rate (in
      ms)
      : when using Spark V2.3 and onwards, enter the
      time interval (in milliseconds) at the end of which you want the
      Studio to ask Spark for the execution progress of your Job. Before
      V2.3, Spark automatically sends this information to the Studio when
      updates occur; the default value, 50 milliseconds, of this parameter
      allows the Studio to reproduce more or less the same scenario with
      Spark V2.3 and onwards.

      If you set this interval too long, you may
      lose information about the progress; if too short, you may send
      too many requests to Spark for only insignificant progress
      information.

    Spark Yarn cluster mode

    • Driver memory and Driver
      core
      : enter the allocation size of memory and
      the number of cores to be used by the driver of the current
      Job.

    • Executor memory: enter the allocation size
      of memory to be used by each Spark executor.

    • Set executor memory overhead: select this
      check box and in the field that is displayed, enter the amount
      of off-heap memory (in MB) to be allocated per executor. This is
      actually the spark.yarn.executor.memoryOverhead property.

    • Core per executor: select this check box
      and in the displayed field, enter the number of cores to be used
      by each executor. If you leave this check box clear, the default
      allocation defined by Spark is used, for example, all available
      cores are used by one single executor in the
      Standalone mode.

    • Yarn resource allocation: select how you
      want Yarn to allocate resources among executors.

      • Auto: you let Yarn use its
        default number of executors. This number is
        2.

      • Fixed: you need to enter the number of executors to be used in the
        Num executors field that is displayed.

      • Dynamic: Yarn adapts the
        number of executors to suit the workload. You need
        to define the scale of this dynamic allocation by
        defining the initial number of executors to run in
        the Initial executors field,
        the lowest number of executors in the Min
        executors
        field and the largest number
        of executors in the Max
        executors
        field.

    • Set Web UI port: if you need to change the
      default port of the Spark Web UI, select this check box and
      enter the port number you want to use.

    • Broadcast factory: select the broadcast
      implementation to be used to cache variables on each worker
      machine.

    • Customize Spark serializer: if you need to
      import an external Spark serializer, select this check box and
      in the field that is displayed, enter the fully qualified class
      name of the serializer to be used.

    • Job progress polling rate (in
      ms)
      : when using Spark V2.3 and onwards, enter the
      time interval (in milliseconds) at the end of which you want the
      Studio to ask Spark for the execution progress of your Job. Before
      V2.3, Spark automatically sends this information to the Studio when
      updates occur; the default value, 50 milliseconds, of this parameter
      allows the Studio to reproduce more or less the same scenario with
      Spark V2.3 and onwards.

      If you set this interval too long, you may
      lose information about the progress; if too short, you may send
      too many requests to Spark for only insignificant progress
      information.

Tuning Spark for Apache Spark Streaming Jobs

You can define the tuning parameters in the Spark
configuration
tab of the Run view of your Spark
Job to obtain better performance of the Job, if the default values of these parameters
do not produce sufficient performance.

Generally speaking, Spark performs better with a lower number of big tasks than with a high number of small tasks.

  1. Select the Set Tuning properties
    check box to optimize the allocation of the resources to be used to run this
    Job. These properties are not mandatory for the Job to run successfully, but
    they are useful when Spark is bottlenecked by any resource issue in the cluster
    such as CPU, bandwidth or memory.
  2. Calculate the initial resource allocation as the point to start the
    tuning.

    A generic formula for this calculation is

    • Number of executors = (Total cores of the cluster) / 2

    • Number of cores per executor = 2

    • Memory per executor = (Up to total memory of the cluster) /
      (Number of executors)

  3. Define each parameter and, if needed, revise them until you obtain
    satisfactory performance.

    Spark Standalone mode

    • Driver memory and Driver
      core
      : enter the allocation size of memory and
      the number of cores to be used by the driver of the current
      Job.

    • Executor memory: enter the allocation size
      of memory to be used by each Spark executor.

    • Core per executor: select this check box
      and in the displayed field, enter the number of cores to be used
      by each executor. If you leave this check box clear, the default
      allocation defined by Spark is used, for example, all available
      cores are used by one single executor in the
      Standalone mode.

    • Set Web UI port: if you need to change the
      default port of the Spark Web UI, select this check box and
      enter the port number you want to use.

    • Broadcast factory: select the broadcast
      implementation to be used to cache variables on each worker
      machine.

    • Customize Spark serializer: if you need to
      import an external Spark serializer, select this check box and
      in the field that is displayed, enter the fully qualified class
      name of the serializer to be used.

    • Activate backpressure: select this check box to enable the backpressure
      feature of Spark. The backpressure feature is available in Spark version
      1.5 and onwards. With backpressure enabled, Spark automatically finds
      the optimal receiving rate and dynamically adapts the rate based on
      current batch scheduling delays and processing time, in order to receive
      data only as fast as it can process it.

    Spark Yarn client mode

    • Set application master tuning properties:
      select this check box and in the fields that are displayed,
      enter the amount of memory and the number of CPUs to be
      allocated to the ApplicationMaster service of your cluster.

      If you want to use the default allocation of your cluster, leave
      this check box clear.

    • Executor memory: enter the allocation size
      of memory to be used by each Spark executor.

    • Set executor memory overhead: select this
      check box and in the field that is displayed, enter the amount
      of off-heap memory (in MB) to be allocated per executor. This is
      actually the spark.yarn.executor.memoryOverhead property.

    • Core per executor: select this check box
      and in the displayed field, enter the number of cores to be used
      by each executor. If you leave this check box clear, the default
      allocation defined by Spark is used, for example, all available
      cores are used by one single executor in the
      Standalone mode.

    • Yarn resource allocation: select how you
      want Yarn to allocate resources among executors.

      • Auto: you let Yarn use its
        default number of executors. This number is
        2.

      • Fixed: you need to enter the number of executors to be used in the
        Num executors field that is displayed.

      • Dynamic: Yarn adapts the
        number of executors to suit the workload. You need
        to define the scale of this dynamic allocation by
        defining the initial number of executors to run in
        the Initial executors field,
        the lowest number of executors in the Min
        executors
        field and the largest number
        of executors in the Max
        executors
        field.

    • Set Web UI port: if you need to change the
      default port of the Spark Web UI, select this check box and
      enter the port number you want to use.

    • Broadcast factory: select the broadcast
      implementation to be used to cache variables on each worker
      machine.

    • Customize Spark serializer: if you need to
      import an external Spark serializer, select this check box and
      in the field that is displayed, enter the fully qualified class
      name of the serializer to be used.

    • Activate backpressure: select this check box to enable the backpressure
      feature of Spark. The backpressure feature is available in Spark version
      1.5 and onwards. With backpressure enabled, Spark automatically finds
      the optimal receiving rate and dynamically adapts the rate based on
      current batch scheduling delays and processing time, in order to receive
      data only as fast as it can process it.

    Spark Yarn cluster mode

    • Driver memory and Driver
      core
      : enter the allocation size of memory and
      the number of cores to be used by the driver of the current
      Job.

    • Executor memory: enter the allocation size
      of memory to be used by each Spark executor.

    • Set executor memory overhead: select this
      check box and in the field that is displayed, enter the amount
      of off-heap memory (in MB) to be allocated per executor. This is
      actually the spark.yarn.executor.memoryOverhead property.

    • Core per executor: select this check box
      and in the displayed field, enter the number of cores to be used
      by each executor. If you leave this check box clear, the default
      allocation defined by Spark is used, for example, all available
      cores are used by one single executor in the
      Standalone mode.

    • Yarn resource allocation: select how you
      want Yarn to allocate resources among executors.

      • Auto: you let Yarn use its
        default number of executors. This number is
        2.

      • Fixed: you need to enter the number of executors to be used in the
        Num executors field that is displayed.

      • Dynamic: Yarn adapts the
        number of executors to suit the workload. You need
        to define the scale of this dynamic allocation by
        defining the initial number of executors to run in
        the Initial executors field,
        the lowest number of executors in the Min
        executors
        field and the largest number
        of executors in the Max
        executors
        field.

    • Set Web UI port: if you need to change the
      default port of the Spark Web UI, select this check box and
      enter the port number you want to use.

    • Broadcast factory: select the broadcast
      implementation to be used to cache variables on each worker
      machine.

    • Customize Spark serializer: if you need to
      import an external Spark serializer, select this check box and
      in the field that is displayed, enter the fully qualified class
      name of the serializer to be used.

    • Activate backpressure: select this check box to enable the backpressure
      feature of Spark. The backpressure feature is available in Spark version
      1.5 and onwards. With backpressure enabled, Spark automatically finds
      the optimal receiving rate and dynamically adapts the rate based on
      current batch scheduling delays and processing time, in order to receive
      data only as fast as it can process it (see the property sketch after
      this list).
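    As a sketch only, backpressure can also be expressed through Spark
    properties in the Advanced properties table; the initial rate value below
    is an assumption to be adapted to your workload.

      "spark.streaming.backpressure.enabled"      "true"
      "spark.streaming.backpressure.initialRate"  "1000"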

Logging and checkpointing the activities of your Apache Spark Job

It is recommended to activate the Spark logging and checkpointing system in the
Spark configuration tab of the Run
view of your Spark Job, in order to help debug and resume your Spark Job when issues
arise.

  1. If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the
    Spark checkpointing operation. In the field that is displayed, enter the
    directory in which Spark stores, in the file system of the cluster, the context
    data of the computations such as the metadata and the generated RDDs of this
    computation.

    For further information about the Spark checkpointing operation, see http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing .

  2. In the Yarn client mode or the
    Yarn cluster mode, you can enable the Spark
    application logs of this Job to be persistent in the file system. To do this,
    select the Enable Spark event logging check
    box.

    The parameters relevant to Spark logs are displayed (a property sketch is given at the end of this section):

    • Spark event logs directory:
      enter the directory in which Spark events are logged. This is actually
      the spark.eventLog.dir property.

    • Spark history server address:
      enter the location of the history server. This is actually the spark.yarn.historyServer.address
      property.

    • Compress Spark event logs: if
      needs be, select this check box to compress the logs. This is actually
      the spark.eventLog.compress property.

    Since the administrator of your cluster could have
    defined these properties in the cluster configuration files, it is recommended
    to contact the administrator for the exact values.

  3. If you want to print the Spark context that your Job starts in the log, add the
    spark.logConf property in the Advanced
    properties
    table and enter, within double quotation marks,
    true in the Value column of
    this table.

    Spark configuration_4.png

    Since the administrator of your cluster could have
    defined these properties in the cluster configuration files, it is recommended
    to contact the administrator for the exact values.
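    As an illustration only, the check boxes and fields described in this
    section roughly correspond to the following Spark properties; the log
    directory and the history server address are assumptions to be replaced
    with the values provided by your administrator.

      spark.eventLog.enabled            true
      spark.eventLog.dir                hdfs:///spark-history
      spark.yarn.historyServer.address  sparkhistory.example.com:18080
      spark.eventLog.compress           true
      spark.logConf                     true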

Adding advanced Spark properties to solve issues

Depending on the distribution you are using or the issues you encounter, you may need to
add specific Spark properties to the Advanced properties table in
the Spark configuration tab of the Run
view of your Job.

Alternatively, define a Hadoop connection metadata in the
Repository and in its wizard, select the Use Spark
properties
check box to open the properties table and add the property
or properties to be used, for example, from spark-defaults.conf of
your cluster. When you reuse this connection in your Apache Spark Jobs, the advanced
Spark properties you have added there are automatically added to the Spark
configurations for those Jobs.

The advanced properties required by different Hadoop distributions or by some common
issues and their values are listed below:

For further information about the valid Spark properties, see Spark
documentation at https://spark.apache.org/docs/latest/configuration.

Specific Spark timeout

When encountering network issues, Spark by default waits for up to 45 minutes
before stopping its attempts to submit Jobs, and then triggers the automatic
stop of your Job.

Add the following properties to the Hadoop properties table of
tHDFSConfiguration to reduce this duration (a sketch of these entries is given
after this list).

  • ipc.client.ping: false. This prevents pinging if the server does not
    answer.

  • ipc.client.connect.max.retries: 0. This indicates the number of retries if
    the connection request is answered but refused.

  • yarn.resourcemanager.connect.retry-interval.ms: any number. This indicates
    how often to try to connect to the ResourceManager service until Spark
    gives up.
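As a sketch only, these entries could look like the following in the Hadoop
properties table of tHDFSConfiguration; the retry interval value is only an
example.

  "ipc.client.ping"                                  "false"
  "ipc.client.connect.max.retries"                   "0"
  "yarn.resourcemanager.connect.retry-interval.ms"   "30000"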

Hortonworks Data Platform V2.4

  • spark.yarn.am.extraJavaOptions:
    -Dhdp.version=2.4.0.0-169

  • spark.driver.extraJavaOptions:
    -Dhdp.version=2.4.0.0-169

In addition, you need to add -Dhdp.version=2.4.0.0-169
to the JVM settings area either in the
Advanced settings tab of the
Run view or in the Talend > Run/Debug view of the
Preferences window. Setting this argument in the
Preferences window applies it on all the Jobs that
are designed in the same Studio.

MapR V5.1 and V5.2

When the cluster is used with the HBase or the MapRDB components:

spark.hadoop.yarn.application.classpath: enter the value of this
parameter specific to your cluster and add, if missing, the classpath to HBase to
ensure that the Job to be used can find the required classes and packages in the
cluster.

For example, if the HBase version installed in the cluster is 1.1.1, copy and paste all
the paths defined with the spark.hadoop.yarn.application.classpath parameter from your
cluster and then add /opt/mapr/hbase/hbase-1.1.1/lib/* and /opt/mapr/lib/* to these
paths, separating each path with a comma (,), as sketched below. The added paths are
where HBase is usually installed in a MapR cluster. If your HBase is installed
elsewhere, contact the administrator of your cluster for details and adapt these paths
accordingly.
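As a sketch only, the resulting property could look like the following, where the
first part must be the classpath value copied from your own cluster:

  spark.hadoop.yarn.application.classpath
    "<classpath copied from your cluster>,/opt/mapr/hbase/hbase-1.1.1/lib/*,/opt/mapr/lib/*"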

For a step-by-step explanation about how to add this parameter, and for more details
about how to run HBase/MapR-DB on Spark with a MapR distribution, see Talend Help
Center (https://help.talend.com).

Security

In the machine where the Studio with Big Data is installed, some scanning
tools could report a CVE vulnerability issue related to Spark, while
this issue does not actually impact Spark, as is explained by the Spark community, because this vulnerability
concerns the Apache Thrift Go client library only but Spark does not use
this library. Therefore this alert is not relevant to the Studio and thus no
action is required.

