
tS3Configuration

Reuses the connection configuration to S3N or S3A in the same Job. The Spark
cluster to be used reads this configuration to eventually connect to S3N (S3 Native
Filesystem) or S3A.

Only one tS3Configuration component is allowed per Job.

Depending on the Talend
product you are using, this component can be used in one, some or all of the following
Job frameworks:

tS3Configuration properties for Apache Spark Batch

These properties are used to configure tS3Configuration running in the Spark
Batch
Job framework.

The Spark Batch
tS3Configuration component belongs to the Storage family.

The component in this framework is available in all subscription-based Talend products with Big Data
and Talend Data Fabric.

Basic settings

Access Key

Enter the access key ID that uniquely identifies an AWS
Account. For further information about how to get your Access Key and Secret Key,
see Getting Your AWS Access
Keys
.

Access Secret

Enter the secret access key, constituting the security
credentials in combination with the access Key.

To enter the secret key, click the […] button next to
the secret key field, and then in the pop-up dialog box enter the password between double
quotes and click OK to save the settings.

Bucket name

Enter the bucket name and its folder you need to use. You
need to separate the bucket name and the folder name using a slash (/).

Temp folder

Enter the location of the temp folder in S3. This folder
is automatically created at execution time if it does not already
exist.

Use s3a filesystem

Select this check box to use the S3A filesystem instead
of S3N, the filesystem used by default by tS3Configuration.

This feature is available when you are using one of the
following distributions with Spark:

  • Amazon EMR V4.5 and onwards

  • MapR V5.0 and onwards

  • Hortonworks Data Platform V2.4 and
    onwards

  • Cloudera V5.8 and onwards. For Cloudera V5.8, the Spark version must be
    2.0.

  • Cloudera
    Altus
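As background, the Access Key, Access Secret and Use s3a filesystem settings map onto the standard Hadoop S3 connector properties that Spark reads at runtime. The sketch below sets those properties directly on a SparkConf; the property names are those of the Hadoop S3A and S3N connectors, given here as an assumption about what the generated Job configures rather than the exact code the Studio produces, and the key values are placeholders.

    import org.apache.spark.SparkConf;

    public class S3ConfigSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("s3-config-sketch");

            // S3A (selected with "Use s3a filesystem"): the s3a connector credentials.
            conf.set("spark.hadoop.fs.s3a.access.key", "<your_access_key>");
            conf.set("spark.hadoop.fs.s3a.secret.key", "<your_secret_key>");

            // S3N (the default filesystem of tS3Configuration) uses the older s3n properties.
            conf.set("spark.hadoop.fs.s3n.awsAccessKeyId", "<your_access_key>");
            conf.set("spark.hadoop.fs.s3n.awsSecretAccessKey", "<your_secret_key>");

            // Data is then addressed as s3a://<bucket>/<folder>/... or s3n://<bucket>/<folder>/...
        }
    }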
Inherit credentials from AWS role

If you are using the S3A filesystem with EMR, you can select this check box
to obtain AWS security credentials from your EMR instance metadata. To use
this option, the Amazon EMR cluster must be started and your Job must be
running on this cluster. For more information, see Using an IAM Role to Grant
Permissions to Applications Running on Amazon EC2
Instances.

This option enables you to develop your Job without having to put any
AWS keys in the Job, thus easily complying with the security policy of your
organization.

Use SSE-KMS encryption with CMK

If you are using the S3A filesystem with EMR, you can select this check box
to use the SSE-KMS encryption service enabled on AWS to read or write the
encrypted data on S3.

On the EMR side, the SSE-KMS service must have been enabled with the
Default encryption feature
and a customer managed CMK specified for the encryption.

For further information about the AWS SSE-KMS encryption, see Protecting Data Using Server-Side
Encryption
from the AWS documentation.

For further information about how to enable the Default encryption feature for an
Amazon S3 bucket, see Default encryption from the
AWS documentation.
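For reference, SSE-KMS on the S3A connector is controlled by two Hadoop properties: the encryption algorithm and the customer managed CMK to use. The sketch below shows them on a SparkConf; this is a minimal illustration of the underlying Hadoop configuration, assumed rather than taken from the Studio's generated code, and the key ARN is a placeholder.

    import org.apache.spark.SparkConf;

    public class S3SseKmsSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("s3-sse-kms-sketch");

            // Ask the S3A connector to write objects with SSE-KMS encryption...
            conf.set("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "SSE-KMS");
            // ...using the customer managed CMK enabled with Default encryption on the EMR side.
            conf.set("spark.hadoop.fs.s3a.server-side-encryption.key",
                    "arn:aws:kms:<region>:<aws_account_number>:key/<key_id>");
        }
    }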

Use S3 bucket policy

If you have defined a bucket policy for the bucket to be used, select
this check box and add the following parameter about AWS signature versions
to the JVM argument list of your Job in the Advanced
settings of the Run
tab:
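A value commonly used for this parameter, which switches the AWS SDK to Signature Version 4 (given here as an assumption to verify against the screenshot later on this page and the AWS documentation), is:

    -Dcom.amazonaws.services.s3.enableV4=true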

Assume Role

If you are using the S3A filesystem, you can select this check box to
make your Job temporarily assume a role and the permissions associated
with this role.

Ensure that access to this role has been
granted to your user account by the trust policy associated to this role. If you are not
certain about this, ask the owner of this role or your AWS administrator.

After selecting this check box, specify the parameters that the
administrator of the AWS system to be used has defined for this role.

  • Role ARN: the Amazon Resource Name (ARN) of the role to assume. You
    can find this ARN name on the Summary page
    of the role to be used on your AWS portal, for example, this role ARN could read
    like arn:aws:iam::[aws_account_number]:role/[role_name].

  • Role session name: enter the name you want to use to uniquely
    identify your assumed role session. This name can contain upper- and lower-case
    alphanumeric characters with no spaces. You can also include underscores or any of
    the following characters: =,.@-.

  • Session duration (minutes): the duration (in minutes) for which you
    want the assumed role session to be active. This duration cannot exceed the
    maximum duration which your AWS administrator has set.

The External ID
parameter is required only if your AWS administrator or the owner of
this role has defined an external ID when they set up a trust policy for
this role.

In addition, if the AWS administrator has enabled the STS endpoints for
given regions you want to use for better response performance, use the
Set STS region check box or the
Set STS endpoint check box in the
Advanced settings tab.

This check box is available only for the following distributions
Talend supports:

  • CDH 5.10 and onwards (including the dynamic
    support for the latest Cloudera distributions)

  • HDP 2.5 and onwards

This check box is also available when you are using Spark V1.6 and
onwards in the Local Spark mode in the Spark configuration tab.
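The Role ARN, Role session name and Session duration parameters correspond directly to the inputs of an STS AssumeRole call. The sketch below shows that mapping with the AWS SDK for Java v1; it is an illustration of the AWS API, not the code generated by the Studio, and the role ARN and session name are placeholders.

    import com.amazonaws.services.securitytoken.AWSSecurityTokenService;
    import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClientBuilder;
    import com.amazonaws.services.securitytoken.model.AssumeRoleRequest;
    import com.amazonaws.services.securitytoken.model.Credentials;

    public class AssumeRoleSketch {
        public static void main(String[] args) {
            AWSSecurityTokenService sts = AWSSecurityTokenServiceClientBuilder.defaultClient();

            AssumeRoleRequest request = new AssumeRoleRequest()
                    // Role ARN: the role the Job temporarily assumes.
                    .withRoleArn("arn:aws:iam::<aws_account_number>:role/<role_name>")
                    // Role session name: uniquely identifies this assumed role session.
                    .withRoleSessionName("talend_s3_session")
                    // Session duration: the SDK takes seconds, while tS3Configuration asks for minutes.
                    .withDurationSeconds(30 * 60);
                    // .withExternalId("...") is only needed if the trust policy defines an external ID.

            Credentials temporary = sts.assumeRole(request).getCredentials();
            System.out.println("Temporary access key: " + temporary.getAccessKeyId());
        }
    }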

Set region

Select this check box and select the region to connect
to.

This feature is available when you are using one of the
following distributions with Spark:

  • Amazon EMR V4.5 and onwards

  • MapR V5.0 and onwards

  • Hortonworks Data Platform V2.4 and
    onwards

  • Cloudera V5.8 and onwards. For Cloudera V5.8, the Spark version must be
    2.0.

  • Cloudera
    Altus

Set endpoint

Select this check box and in the Endpoint field that is displayed, enter the Amazon
region endpoint you need to use. For a list of the available endpoints, see Regions and Endpoints.

If you leave this check box clear, the endpoint used is
the default one defined by your Hadoop distribution. This check box is not
available when you have selected the Set region check box;
in that case, the value selected from the
Set region list is
used.

This feature is available when you are using one of the
following distributions with Spark:

  • Amazon EMR V4.5 and onwards

  • MapR V5.0 and onwards

  • Hortonworks Data Platform V2.4 and
    onwards

  • Cloudera V5.8 and onwards. For Cloudera V5.8, the Spark version must be
    2.0.

  • Cloudera
    Altus
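At the Hadoop level, the endpoint entered here typically corresponds to the S3A endpoint property. A minimal sketch, assuming the standard property name and using the EU (Ireland) endpoint as an example value:

    import org.apache.spark.SparkConf;

    public class S3EndpointSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("s3-endpoint-sketch");
            // Regional endpoint for the S3A connector, for example the EU (Ireland) endpoint.
            conf.set("spark.hadoop.fs.s3a.endpoint", "s3.eu-west-1.amazonaws.com");
        }
    }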

Advanced settings

Set STS region and Set STS endpoint

If the AWS administrator has enabled the STS endpoints for the regions
you want to use for better response performance, select the Set STS region check box and then select
the regional endpoint to be used.

If the endpoint you want to use is not available in this regional
endpoint list, clear the Set STS region check box, then select
the Set STS endpoint check box and enter the
endpoint to be used.

This service allows you to request temporary,
limited-privilege credentials for the AWS user you authenticate; therefore, you still
need to provide the access key and secret key to authenticate the AWS account to be
used.

For a list of the STS endpoints you can use, see
AWS Security Token Service. For further information about the
STS temporary credentials, see Temporary Security Credentials. Both articles are from the AWS
documentation.

These check boxes are available only when you have selected the
Assume Role check box in the Basic settings tab.
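In AWS SDK terms, using a regional STS endpoint amounts to building the STS client against that endpoint rather than the global one. A sketch with the AWS SDK for Java v1, for illustration only; the endpoint and region values are examples:

    import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
    import com.amazonaws.services.securitytoken.AWSSecurityTokenService;
    import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClientBuilder;

    public class StsEndpointSketch {
        public static void main(String[] args) {
            // Point the STS client at a regional endpoint instead of the global one.
            AWSSecurityTokenService sts = AWSSecurityTokenServiceClientBuilder.standard()
                    .withEndpointConfiguration(
                            new EndpointConfiguration("sts.eu-west-1.amazonaws.com", "eu-west-1"))
                    .build();
            System.out.println("STS client configured: " + sts);
        }
    }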

Usage

Usage rule

This component is used with no need to be connected to other
components.

You need to drop tS3Configuration along with the file system related Subjob to be
run in the same Job so that the configuration is used by the whole Job at
runtime.

Spark Connection

In the Spark
Configuration
tab in the Run
view, define the connection to a given Spark cluster for the whole Job. In
addition, since the Job expects its dependent jar files for execution, you must
specify the directory in the file system to which these jar files are
transferred so that Spark can access these files:

  • Yarn mode (Yarn client or Yarn cluster):

    • When using Google Dataproc, specify a bucket in the
      Google Storage staging bucket
      field in the Spark configuration
      tab.

    • When using HDInsight, specify the blob to be used for Job
      deployment in the Windows Azure Storage
      configuration
      area in the Spark
      configuration
      tab.

    • When using Altus, specify the S3 bucket or the Azure
      Data Lake Storage for Job deployment in the Spark
      configuration
      tab.
    • When using Qubole, add a
      tS3Configuration to your Job to write
      your actual business data in the S3 system with Qubole. Without
      tS3Configuration, this business data is
      written in the Qubole HDFS system and destroyed once you shut
      down your cluster.
    • When using on-premise
      distributions, use the configuration component corresponding
      to the file system your cluster is using. Typically, this
      system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the
    configuration component corresponding to the file system your cluster is
    using, such as tHDFSConfiguration or
    tS3Configuration.

    If you are using Databricks without any configuration component present
    in your Job, your business data is written directly in DBFS (Databricks
    Filesystem).

This connection is effective on a per-Job basis.

Limitation

Due to license incompatibility, one or more JARs required to use
this component are not provided. You can install the missing JARs for this particular
component by clicking the Install button
on the Component tab view. You can also
find out and add all missing JARs easily on the Modules tab in the
Integration
perspective of your studio. You can find more details about how to install
external modules in Talend Help Center (https://help.talend.com)
.

Creating an IAM role on AWS

You need an IAM role to delegate permissions to the AWS service to be used by your Job. If this IAM role does not exist, define it on AWS.

  • You have the appropriate rights and permissions to create a new role on AWS.
  1. Log in to your account on AWS and navigate to the AWS console.
  2. Select IAM.
  3. In the navigation pane of the IAM console, select Roles, and then select Create role.
  4. Select AWS service and in the Choose the service that will use this role section, select the AWS service to be run with your Job. For example, select Redshift.
  5. Select the use case to be used for this service. A use case, in AWS terms, is defined by the service and includes the trust policy that this service requires. Depending on the service and the use case that you selected, the available options vary. For example, with Redshift, you can choose a use case from:

    • Redshift (with a pre-defined Amazon Redshift
      Service Linked Role Policy);
    • Redshift – Customizable. In this use case, you are prompted to select either read-only policies or full-access policies.
  6. In the Role name field, enter the name to be used for the role being created.
  7. Select Create role.
A custom role has been created to delegate permissions to an AWS service. For the
full documentation about creating a role on AWS, see Role creation from the AWS
documentation.
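The same role can also be created programmatically. The sketch below performs the equivalent of steps 3 to 7 with the AWS SDK for Java v1, using Redshift as the trusted service; the role name and the attached managed policy are placeholders chosen for this illustration, and the console procedure above remains the reference.

    import com.amazonaws.services.identitymanagement.AmazonIdentityManagement;
    import com.amazonaws.services.identitymanagement.AmazonIdentityManagementClientBuilder;
    import com.amazonaws.services.identitymanagement.model.AttachRolePolicyRequest;
    import com.amazonaws.services.identitymanagement.model.CreateRoleRequest;

    public class CreateRoleSketch {
        public static void main(String[] args) {
            AmazonIdentityManagement iam = AmazonIdentityManagementClientBuilder.defaultClient();

            // Trust policy: lets the Redshift service assume the role (the "use case" of step 5).
            String trustPolicy = "{"
                    + "\"Version\":\"2012-10-17\","
                    + "\"Statement\":[{\"Effect\":\"Allow\","
                    + "\"Principal\":{\"Service\":\"redshift.amazonaws.com\"},"
                    + "\"Action\":\"sts:AssumeRole\"}]}";

            iam.createRole(new CreateRoleRequest()
                    .withRoleName("my_redshift_role")            // step 6: the role name
                    .withAssumeRolePolicyDocument(trustPolicy)); // steps 4 and 5: service and use case

            // Attach a managed policy, for example read-only access, to the new role.
            iam.attachRolePolicy(new AttachRolePolicyRequest()
                    .withRoleName("my_redshift_role")
                    .withPolicyArn("arn:aws:iam::aws:policy/AmazonRedshiftReadOnlyAccess"));
        }
    }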

Setting up SSE KMS for your EMR cluster

If required by the security policy of your organization, you need to set up SSE KMS, the server-side encryption service of Amazon, for the EMR cluster to be used, before creating this cluster.
This procedure explains only the
SSE KMS related operations for getting started with the security configuration for EMR.
If you need the complete information about all the available EMR security configurations
provided by AWS, see Create a Security Configuration from the
Amazon documentation.
  1. If not yet done, go to https://console.aws.amazon.com/kms
    to create a customer managed CMK to be used by the SSE KMS service. For detailed
    instructions about how to do this, see this tutorial from the AWS
    documentation.

    • When adding roles, among other roles to be added depending on your
      security policy, you must add the EMR_EC2_DefaultRole role.

      The EMR_EC2_DefaultRole role allows your
      Jobs for Apache Spark to read or write files encrypted with SSE-KMS on
      S3.

      This role is a default AWS role that is
      automatically created along with the creation of your first EMR
      cluster. If this role and its associated policies do not exist in
      your account, see Use Default IAM Roles and
      Managed Policies
      from the AWS documentation

  2. On the Amazon EMR page of
    AWS, select the Security configurations
    tab and click Create to open the
    Create security configuration
    view.
  3. Select the At-rest encryption check box
    to enable SSE KMS.
  4. Under S3 data encryption, select
    SSE-KMS for Encryption mode
    and select the CMK key mentioned at the beginning of this procedure for
    AWS KMS Key.
  5. Under Local disk encryption, select AWS
    KMS
    for Key provider type and select the
    CMK key mentioned at the beginning of this procedure for AWS KMS
    Key
    .

    tS3Configuration_1.png

  6. Click Create to validate your security configuration.

    In real-world practice, you can also configure the other security options such as Kerberos and IAM roles for EMRFS before clicking this Create button.
  7. Click Clusters and once the Create Cluster page is open, click Go to advanced options to start creating the EMR cluster step by step.
  8. At the last step called Security, in the Authentication and
    encryption
    section, select the Security Configuration created in the previous steps.

Setting up SSE KMS for your S3 bucket

If required by the security policy of your organization, you need to set up SSE KMS for the S3 bucket to be used.
Prerequisite: you must have created the CMK key to be used. For detailed
instructions about how to do this, see this tutorial from the AWS
documentation.
This procedure explains only the
SSE KMS related operations for getting started with the security configuration for EMR.
If you need the complete information about all the available EMR security configurations
provided by AWS, see Create a Security Configuration from the
Amazon documentation.
  1. Open your S3 service at https://s3.console.aws.amazon.com/.
  2. From the S3 bucket list, select the bucket to be used. Ensure
    that you have proper rights and permissions to access this bucket.
  3. Select the Properties tab
    and then Default encryption.
  4. Select AWS-KMS.
  5. Select the KMS CMK key to be used.

    tS3Configuration_2.png

  6. Select the Permissions tab, then select
    Bucket Policy and enter your policy in the
    console.

    This article from AWS provides detailed explanations and a simple policy
    example: How to Prevent Uploads of Unencrypted Objects
    to Amazon S3
    .
  7. Click Save to save your policy.
Now your bucket policy is set up. When you need to use this bucket with a Job, enter
the following parameter about AWS signature versions to the JVM argument list of this Job:

tS3Configuration_3.png

For further information about AWS Signature Versions, see Specifying the Signature Version in Request
Authentication
.

Writing and reading data from S3 (Databricks on AWS)

In this scenario, you create a Spark Batch Job using
tS3Configuration and the Parquet components to write data on
S3 and then read the data from S3.

This scenario applies only to subscription-based Talend products with Big
Data
.

tS3Configuration_4.png
The sample data reads as
follows:

This data contains a user name and the ID number distributed to this user.

Note that the sample data is created for demonstration purposes only.

Design the data flow of the Job working with S3 and Databricks on AWS

  1. In the
    Integration
    perspective of the Studio, create an empty
    Spark Batch Job from the Job Designs node in
    the Repository tree view.
  2. In the workspace, enter the name of the component to be used and select this
    component from the list that appears. In this scenario, the components are
    tS3Configuration,
    tFixedFlowInput,
    tFileOutputParquet,
    tFileInputParquet and
    tLogRow.

    The tFixedFlowInput component is used to load the
    sample data into the data flow. In real-world practice, you could use the
    File input components, along with the processing components, to design a
    more sophisticated process to prepare your data.
  3. Connect tFixedFlowInput to tFileOutputParquet using the Row > Main link.
  4. Connect tFileInputParquet to tLogRow using the Row > Main link.
  5. Connect tFixedFlowInput to tFileInputParquet using the Trigger > OnSubjobOk link.
  6. Leave tS3Configuration alone without any
    connection.

Defining the Databricks-on-AWS connection parameters for Spark Jobs

Complete the Databricks connection configuration in the Spark
configuration
tab of the Run view of your Job.
This configuration is effective on a per-Job basis.

    1. When running a Spark Streaming Job, only one Job is allowed to run on the same Databricks cluster at a time.
    2. When running a Spark Batch Job, you can send more than one Job to run in parallel on the same Databricks cluster only if you have selected the Do not restart the cluster
      when submitting
      check box; otherwise, since each run automatically restarts the cluster, the Jobs
      launched in parallel interrupt each other and thus cause execution failure.
  • Ensure that the AWS account to be used has the proper read/write permissions to the S3 bucket to be used. For this purpose, contact the administrator of your AWS system.
Enter the basic connection information to Databricks on AWS.

Standalone

  • In the Endpoint
    field, enter the URL address of the workspace of your Databricks on
    AWS. For example, this URL could look like
    https://<your_endpoint>.cloud.databricks.com.

  • In the Cluster
    ID
    field, enter the ID of the Databricks cluster to
    be used. This ID is the value of the
    spark.databricks.clusterUsageTags.clusterId
    property of your Spark cluster. You can find this property on the
    properties list in the Environment tab in the
    Spark UI view of your cluster.

    You can also easily find this ID
    from the URL of your Databricks cluster. It is present immediately
    after cluster/ in this URL.

    This field is not used and thus not available if you are using transient clusters.

  • Click the […]
    button next to the Token field to enter the
    authentication token generated for your Databricks user account. You
    can generate or find this token on the User
    settings
    page of your Databricks workspace. For
    further information, see Token management from the
    Databricks documentation.

  • In the DBFS
    dependencies folder
    field, enter the directory that
    is used to store your Job related dependencies on Databricks
    Filesystem at runtime, putting a slash (/) at the end of this
    directory. For example, enter /jars/ to store
    the dependencies in a folder named jars. This
    folder is created on the fly if it does not exist then.

    This directory stores your Job dependencies on DBFS only. In your
    Job, use tS3Configuration,
    tDynamoDBConfiguration or, in a Spark
    Streaming Job, the Kinesis components, to read or write your
    business data to the related systems.

  • Poll interval when retrieving Job status (in
    ms)
    : enter, without the quotation marks, the time
    interval (in milliseconds) at the end of which you want the Studio
    to ask Spark for the status of your Job. For example, this status
    could be Pending or Running.

    The default value is 300000, meaning 300
    seconds. This interval is recommended by Databricks to correctly
    retrieve the Job status.

  • Use
    transient cluster
    : you can select this check box to
    leverage the transient Databricks clusters.

    The custom properties you defined in the Advanced properties table are automatically taken into account by the transient clusters at runtime.

    1. Autoscale: select or clear this check box to define
      the number of workers to be used by your transient cluster.

      1. If you select this check box,
        autoscaling is enabled. Then define the minimum number
        of workers in Min
        workers
        and the maximum number of
        workers in Max
        workers
        . Your transient cluster is
        scaled up and down within this scope based on its
        workload.

        According to the Databricks
        documentation, autoscaling works best with
        Databricks runtime versions 3.0 or onwards.

      2. If you clear this check box, autoscaling
        is deactivated. Then define the number of workers a
        transient cluster is expected to have. This number does
        not include the Spark driver node.
    2. Node type
      and Driver node type:
      select the node types for the workers and the Spark driver node.
      These types determine the capacity of your nodes and their
      pricing by Databricks.

      For details about
      these node types and the Databricks Units they use, see
      Supported Instance
      Types
      from the Databricks documentation.

    3. Elastic
      disk
      : select this check box to enable your
      transient cluster to automatically scale up its disk space when
      its Spark workers are running low on disk space.

      For more details about this elastic disk
      feature, search for the section about autoscaling local
      storage from your Databricks documentation.

    4. SSH public
      key
      : if an SSH access has been set up for your
      cluster, enter the public key of the generated SSH key pair.
      This public key is automatically added to each node of your
      transient cluster. If no SSH access has been set up, ignore this
      field.

      For further information about SSH
      access to your cluster, see SSH access to
      clusters
      from the Databricks
      documentation.

    5. Configure cluster
      log
      : select this check box to define where to
      store your Spark logs for a long term. This storage system could
      be S3 or DBFS.
  • Do not restart the cluster
    when submitting
    : select this check box to prevent
    the Studio from restarting the cluster when the Studio submits your
    Jobs. However, if you make changes in your Jobs, clear this check
    box so that the Studio restarts your cluster to take these changes
    into account.

If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the
Spark checkpointing operation. In the field that is displayed, enter the
directory in which Spark stores, in the file system of the cluster, the context
data of the computations such as the metadata and the generated RDDs of this
computation.

For further information about the Spark checkpointing operation, see http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing .
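In Spark API terms, this option corresponds to declaring a checkpoint directory and checkpointing the data sets you want to be recoverable. A minimal sketch with the Spark Java API, for illustration only; the checkpoint directory is a placeholder:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CheckpointSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("checkpoint-sketch").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Directory in the cluster file system where Spark stores the checkpoint data.
            sc.setCheckpointDir("/tmp/spark_checkpoints");

            JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3));
            rdd.checkpoint();                 // persist the RDD lineage to the checkpoint directory
            System.out.println(rdd.count());  // an action triggers the checkpoint

            sc.stop();
        }
    }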

Configuring the connection to the S3 service to be used by Spark

  1. Double-click tS3Configuration to open its Component view.

    Spark uses this component to connect to the S3 system in which your Job writes
    the actual business data. If you place neither
    tS3Configuration nor any other configuration
    component that supports Databricks on AWS, this business data is written in the
    Databricks Filesystem (DBFS).
    tS3Configuration_5.png

  2. In the Access key and the Secret
    key
    fields, enter the keys to be used to authenticate to
    S3.
  3. In the Bucket name field, enter the name of the bucket
    and the folder in this bucket to be used to store the business data, for
    example, my_bucket/my_folder. This folder is created on the
    fly if it does not exist but the bucket must already exist at runtime.

Writing the sample data to S3

  1. Double-click the tFixedFlowInput component to
    open its Component view.

    tS3Configuration_6.png

  2. Click the […] button next to Edit schema to open the schema editor.
  3. Click the [+] button to add the schema
    columns as shown in this image.

    tS3Configuration_7.png

  4. Click OK to validate these changes and accept
    the propagation prompted by the pop-up dialog box.
  5. In the Mode area, select the Use Inline
    Content
    radio button and paste the previously mentioned sample data
    into the Content field that is displayed.
  6. In the Field separator field, enter a
    semicolon (;).
  7. Double-click the tFileOutputParquet component to
    open its Component view.

    tS3Configuration_8.png
  8. Select the Define a storage configuration component
    check box and then select the tS3Configuration component
    you configured in the previous steps.
  9. Click Sync columns to ensure that
    tFileOutputParquet has the same schema as
    tFixedFlowInput.
  10. In the Folder/File field, enter the name of the S3
    folder to be used to store the sample data. For example, enter
    /sample_user, then as you have specified
    my_bucket/my_folder to use in
    tS3Configuration to store the business data on S3,
    the eventual directory on S3 becomes
    my_bucket/my_folder/sample_user.
  11. From the Action drop-down list, select
    Create if the sample_user folder
    does not exist yet; if this folder already exists, select
    Overwrite.

Reading the sample data from S3

  1. Double-click tFileInputParquet to open its
    Component view.

    tS3Configuration_9.png

  2. Select the Define a storage configuration component
    check box and then select the tS3Configuration
    component you configured in the previous steps.
  3. Click the […] button next to Edit
    schema
    to open the schema editor.
  4. Click the [+] button to add the schema columns for
    output as shown in this image.

    tS3Configuration_10.png

  5. Click OK to validate these changes and accept the
    propagation prompted by the pop-up dialog box.
  6. In the Folder/File field, enter the name of the folder
    from which you need to read data. In this scenario, it is
    sample_user.
  7. Double click tLogRow to open its
    Component view and select the Table radio button to present the result in a table.
  8. Press F6 to run this Job.
Once done, you can find your Job on the Job page on the Web UI
of your Databricks cluster and then check the execution log of your Job.

tS3Configuration_11.png
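As a point of reference, the write and read performed by this Job are equivalent to a plain Spark write and read of a Parquet folder under the bucket and folder declared in tS3Configuration. A sketch with the Spark Java API, assuming the S3 credentials are already set on the Hadoop configuration as tS3Configuration does for the Job; the data set is a stand-in for the sample data:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ParquetOnS3Sketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("parquet-on-s3-sketch").getOrCreate();

            // tS3Configuration: bucket my_bucket/my_folder; tFileOutputParquet: folder /sample_user.
            String target = "s3a://my_bucket/my_folder/sample_user";

            // Equivalent of tFileOutputParquet with Action set to Overwrite.
            spark.range(5).write().mode("overwrite").parquet(target);

            // Equivalent of tFileInputParquet followed by tLogRow in Table mode.
            Dataset<Row> back = spark.read().parquet(target);
            back.show();

            spark.stop();
        }
    }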

Writing server-side KMS encrypted data on EMR

If the AWS SSE-KMS encryption (at-rest encryption) service is
enabled to set Default encryption to protect data
on the S3A system of your EMR cluster, select the SSE-KMS option in tS3Configuration when writing data to that S3A
system.

The sample data used in this scenario is about different types of
incidents that people reported occurring on Paris streets within one
day.
The sample data is used for demonstration purposes only.

The Job calculates the occurrence of each incident type.

tS3Configuration_12.png

Prerequisites:

  • The S3 system to be used is S3A.
  • The SSE-KMS encryption service on AWS is enabled with the
    Default encryption feature and a
    customer managed CMK has been specified for it.
  • The EMR cluster to be used is created with SSE-KMS and the
    EMR_EC2_DefaultRole role has been added to the above-mentioned CMK.
  • The administrator of your EMR cluster has granted the appropriate
    rights and permissions to the AWS account you are using in your Jobs.
  • Your EMR cluster has been properly set up and is running.
  • A Talend
    Jobserver has been deployed on an instance within the network of your EMR
    cluster, such as the instance for the master of your cluster.
All these operations are done on the AWS side.

  • In your Studio or in Talend Administration Center, define this Jobserver as the execution server of your
    Jobs.

Ensure that the client machine on which the Talend Jobs are executed can
recognize the host names of the nodes of the Hadoop cluster to be used. For this purpose, add
the IP address/hostname mapping entries for the services of that Hadoop cluster in the
hosts file of the client machine.

If this is the first time your EMR cluster is set up to run with Talend Jobs, search for Amazon EMR –
Getting Started on Talend Help Center (https://help.talend.com) to verify your setup so as to help your
Jobs work more efficiently on top of EMR.

Designing the flow of the data to write and encrypt onto EMR

Link the components to construct the data flow.
  1. In the
    Integration
    perspective of the Studio, create an empty Spark Batch Job from the Job
    Designs
    node in the Repository tree view.
  2. In the workspace, enter the name of the component to be used and select this
    component from the list that appears. In this scenario, the components are
    tHDFSConfiguration (labeled emr_hdfs), tS3Configuration, tFixedFlowInput, tAggregateRow and tFileOutputParquet.

    The tFixedFlowInput component is used to load
    the sample data into the data flow. In real-world practice, use the input component specific to the data format or the source system to be used instead of tFixedFlowInput.
  3. Connect tFixedFlowInput, tAggregateRow and
    tFileOutputParquet using the Row > Main link.
  4. Leave the tHDFSConfiguration component and the tS3Configuration component alone
    without any connection.

Configuring the connection to the HDFS file system of your EMR
cluster

  1. Double-click tHDFSConfiguration to open its Component view.

    Spark uses this component to connect to the HDFS system to
    which the jar files dependent on the Job are transferred.

  2. If you have defined the HDFS connection metadata under the
    Hadoop cluster node in Repository, select Repository from the Property
    type
    drop-down list and then click the […] button to select the HDFS connection you
    have defined from the Repository content
    wizard.

    For further information about setting up a reusable HDFS
    connection, search for centralizing HDFS metadata on Talend Help Center (https://help.talend.com).

    If you complete this step, you can skip the following steps
    about configuring tHDFSConfiguration
    because all the required fields should have been filled automatically.

  3. In the Version area,
    select the Hadoop distribution you need to connect to and its version.
  4. In the NameNode URI
    field, enter the location of the machine hosting the NameNode service of the
    cluster. If you are using WebHDFS, the location should be
    webhdfs://masternode:portnumber; WebHDFS with SSL is not
    supported yet.
  5. In the Username field,
    enter the authentication information used to connect to the HDFS system to be
    used. Note that the user name must be the same as you have put in the Spark configuration tab.
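For reference, the NameNode URI and Username entered above are what a plain HDFS client connection needs. A minimal sketch with the Hadoop Java API, for illustration only; the NameNode address, port and user name are placeholders to adapt to your cluster:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsConnectionSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // NameNode URI of the cluster, as entered in tHDFSConfiguration
            // (8020 is a common NameNode RPC port; adjust it to your cluster).
            URI nameNode = new URI("hdfs://masternode:8020");

            // Connect with the same user name as in the Username field.
            FileSystem fs = FileSystem.get(nameNode, conf, "hadoop");
            System.out.println(fs.exists(new Path("/user")));
        }
    }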

Configuring the connection to S3 to be used to store the business data

  1. Double-click tS3Configuration to open its Component view.

    Spark uses this component to connect to the S3 system in which the business data is stored. In this scenario, the sample data about the street incidents is written and encrypted on S3.

  2. Select the Use s3a filesystem check box. The Inherit credentials from AWS role check box and the Use SSE-KMS encryption check box become available.
  3. Enter the access credentials of the AWS account to be used.

    • If allowed by the security policy of your organization, in the Access Key and the Secret Key fields, enter the credentials.

      If you do not know the credentials to be used, contact the administrator of your AWS system or check Getting Your AWS Access
      Keys
      from the AWS documentation.

    • If the security policy of your organization does not allow you to expose the credentials in a client application, select Inherit credentials from AWS role to obtain the role-based temporary AWS security credentials from your EMR instance metadata. An IAM role must have been specified to associate with this EMR instance.

      For further information about using an IAM role to grant permissions, see Using IAM roles from the AWS documentation.

  4. Select the Use SSE-KMS encryption check box to enable the Job to verify and use the SSE-KMS encryption service of your cluster.
  5. In the Bucket name field, enter the name of the bucket to be used to store the sample data. This bucket must already exist when you launch your Job. For example, enter my_bucket/my_folder.

Loading the sample data about street incidents into the Job

  1. Double-click tFixedFlowInput to display its
    Basic settings view.

    tS3Configuration_13.png

  2. Click the […] button next to Edit
    Schema
    to open the schema editor.

    tS3Configuration_14.png

  3. Click the [+] button to add four columns, namely
    id, address,
    incident_type and description.
  4. Click OK to close the schema editor and accept
    propagating the changes when prompted by the system.

  5. In the Mode area, select the Use Inline Content(delimited file) radio button to display the Content area and enter the sample data.

Calculating the incident occurrence

  1. Double-click tAggregateRow to open its
    Component view.

    tS3Configuration_15.png

  2. Click the […] button next to Edit schema to open the schema editor.
  3. On the output side (right), click the [+]
    button twice to add two rows and in the Column column, rename them to incident_type and incident_number, respectively.

    tS3Configuration_16.png

  4. In the Type column of the incident_number row of the output side, select Integer.
  5. Click OK to validate these changes and accept
    the propagation prompted by the pop-up dialog box.
  6. In the Group by table, add one row by
    clicking the [+] button and configure
    this row as follows to group the outputted data:

    • Output column:
      select the columns from the output schema to be used as the
      conditions to group the outputted data. In this example, it is the
      incident_type from the output schema.

    • Input column
      position
      : select the columns from the input schema
      to send data to the output columns you have selected in the
      Output
      column
      column. In this scenario, it is the
      incident_type column from the input schema.

  7. In the Operations table, add one row by clicking the [+] button once and configure this
    row as follows to calculate the occurrence of each incident type:

    • Output column:
      select the column from the output schema to store the calculation
      results. In this scenario, it is incident_number.

    • Function: select
      the function to be used to process the incoming data. In this
      scenario, select count. It counts the frequency of each
      incident.

    • Input column
      position
      : select the column from the input schema to
      provide the data to be processed. In this scenario, it is incident_type.
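The Group by and Operations settings above amount to counting the rows per incident_type. The sketch below expresses the equivalent computation with the Spark DataFrame API; it is an illustration of the aggregation, not the code the Studio generates, and the input file path is a placeholder for the sample data:

    import static org.apache.spark.sql.functions.count;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class AggregateSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("aggregate-sketch").getOrCreate();

            // The flow coming from tFixedFlowInput: id, address, incident_type, description.
            Dataset<Row> incidents = spark.read()
                    .option("delimiter", ";")
                    .csv("/tmp/street_incidents.csv")
                    .toDF("id", "address", "incident_type", "description");

            // Group by incident_type and count the occurrences, as tAggregateRow is configured to do.
            Dataset<Row> occurrences = incidents
                    .groupBy("incident_type")
                    .agg(count("incident_type").alias("incident_number"));

            occurrences.show();
            spark.stop();
        }
    }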

Writing the aggregated data about street incidents to EMR

  1. Double-click the tFileOutputParquet component to
    open its Component view.

    tS3Configuration_17.png
  2. Select the Define a storage
    configuration component
    check box and then select the tS3Configuration component you configured in the
    previous steps.
  3. Click Sync columns to
    ensure that tFileOutputParquet retrieves the
    schema from the output side of tAggregateRow.
  4. In the Folder/File
    field, enter the name of the folder to be used to store the aggregated data in
    the S3 bucket specified in tS3Configuration. For example,
    enter /sample_user, then at runtime, the folder called
    sample_user at the root of the bucket is used to
    store the output of your Job.
  5. From the Action
    drop-down list, select Create if the
    folder to be used does not exist yet in the bucket to be used; if this folder
    already exists, select Overwrite.
  6. Click Run to open its view and then click the
    Spark Configuration tab to display its view
    for configuring the Spark connection.
  7. Select the Use local mode check box to test your Job locally before eventually submitting it to the remote Spark cluster.

    In the local mode, the Studio builds the Spark environment within itself on the fly in order to
    run the Job. Each processor of the local machine is used as a Spark
    worker to perform the computations.

  8. In this mode, your local file system is used; therefore, deactivate the
    configuration components such as tS3Configuration or
    tHDFSConfiguration that provide connection
    information to a remote file system, if you have placed these components
    in your Job.
  9. In the Component view of tFileOutputParquet, change the file path in the Folder/File field to a local directory and adapt the action to be taken on the Action drop-down list, that is to say, creating a new folder or overwriting the existing one.
  10. On the Run tab, click Basic
    Run
    and in this view, click Run to execute your Job locally to test its design
    logic.
  11. When your Job runs successfully, clear the Use local
    mode
    check box in the Spark Configuration
    view of the Run tab, then in the design workspace of your
    Job, activate the configuration components and revert the changes you just made
    in tFileOutputParquet for the local test.

Defining the EMR connection parameters

Complete the EMR connection configuration in the Spark
configuration
tab of the Run view of your Job.
This configuration is effective on a per-Job basis.

  1. Enter the basic connection information to EMR:

    Yarn client

    The Studio runs the Spark driver to orchestrate how the Job should be
    performed and then sends the orchestration to the Yarn service of a given
    Hadoop cluster so that the Resource Manager of this Yarn service
    requests execution resources accordingly.

    If you are using the Yarn client
    mode, you need to set the following parameters in their corresponding
    fields (if you leave the check box of a service clear, then at runtime,
    the configuration about this parameter in the Hadoop cluster to be used
    will be ignored):

    • In the Resource manager
      field, enter the address of the ResourceManager service of the Hadoop cluster to
      be used.

    • Select the Set resourcemanager
      scheduler address
      check box and enter the Scheduler address in
      the field that appears.

    • Select the Set jobhistory
      address
      check box and enter the location of the JobHistory
      server of the Hadoop cluster to be used. This allows the metrics information of
      the current Job to be stored in that JobHistory server.

    • Select the Set staging
      directory
      check box and enter this directory defined in your
      Hadoop cluster for temporary files created by running programs. Typically, this
      directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
      such as yarn-site.xml or mapred-site.xml of your distribution.

    • If you are accessing the Hadoop cluster running with Kerberos security,
      select this check box, then, enter the Kerberos principal names for the
      ResourceManager service and the JobHistory service in the displayed fields. This
      enables you to use your user name to authenticate against the credentials stored in
      Kerberos. These principals can be found in the configuration files of your
      distribution, such as in yarn-site.xml and in mapred-site.xml.

      If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
      pairs of Kerberos principals and encrypted keys. You need to enter the principal to
      be used in the Principal field and the access
      path to the keytab file itself in the Keytab
      field. This keytab file must be stored in the machine in which your Job actually
      runs, for example, on a Talend
      Jobserver.

      Note that the user that executes a keytab-enabled Job is not necessarily
      the one a principal designates but must have the right to read the keytab file being
      used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
      situation, ensure that user1 has the right to read the keytab
      file to be used.

    • The User name field is available when you are not using
      Kerberos to authenticate. In the User name field, enter the
      login user name for your distribution. If you leave it empty, the user name of the machine
      hosting the Studio will be used.

    • If the Spark cluster cannot recognize the machine in which the Job is
      launched, select this Define the driver hostname or IP
      address
      check box and enter the host name or the IP address of
      this machine. This allows the Spark master and its workers to recognize this
      machine to find the Job and thus its driver.

      Note that in this situation, you also need to add the name and the IP
      address of this machine to its host file.

    Yarn cluster

    The Spark driver runs in your Yarn cluster to orchestrate how the Job
    should be performed.

    If you are using the Yarn cluster mode, you need
    to define the following parameters in their corresponding fields (if you
    leave the check box of a service clear, then at runtime, the
    configuration about this parameter in the Hadoop cluster to be used will
    be ignored):

    • In the Resource manager
      field, enter the address of the ResourceManager service of the Hadoop cluster to
      be used.

    • Select the Set resourcemanager
      scheduler address
      check box and enter the Scheduler address in
      the field that appears.

    • Select the Set jobhistory
      address
      check box and enter the location of the JobHistory
      server of the Hadoop cluster to be used. This allows the metrics information of
      the current Job to be stored in that JobHistory server.

    • Select the Set staging
      directory
      check box and enter this directory defined in your
      Hadoop cluster for temporary files created by running programs. Typically, this
      directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
      such as yarn-site.xml or mapred-site.xml of your distribution.

    • Set path to custom Hadoop
      configuration JAR
      : if you are using
      connections defined in Repository to
      connect to your Cloudera or Hortonworks cluster, you can
      select this check box in the
      Repository wizard and in the
      field that is displayed, specify the path to the JAR file
      that provides the connection parameters of your Hadoop
      environment. Note that this file must be accessible from the
      machine where your Job is launched.

      This kind of Hadoop configuration JAR file is
      automatically generated when you build a Big Data Job from the
      Studio and is named following a default pattern. You
      can also download this JAR file from the web console of your
      cluster or simply create a JAR file yourself by putting the
      configuration files in the root of your JAR file.

      The parameters from your custom JAR file override the parameters
      you put in the Spark configuration field.
      They also override the configuration you set in the
      configuration components such as
      tHDFSConfiguration or
      tHBaseConfiguration when the related
      storage system such as HDFS, HBase or Hive are native to Hadoop.
      But they do not override the configuration set in the
      configuration components for the third-party storage system such
      as tAzureFSConfiguration.

    • If you are accessing the Hadoop cluster running with Kerberos security,
      select this check box, then, enter the Kerberos principal names for the
      ResourceManager service and the JobHistory service in the displayed fields. This
      enables you to use your user name to authenticate against the credentials stored in
      Kerberos. These principals can be found in the configuration files of your
      distribution, such as in yarn-site.xml and in mapred-site.xml.

      If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
      pairs of Kerberos principals and encrypted keys. You need to enter the principal to
      be used in the Principal field and the access
      path to the keytab file itself in the Keytab
      field. This keytab file must be stored in the machine in which your Job actually
      runs, for example, on a Talend
      Jobserver.

      Note that the user that executes a keytab-enabled Job is not necessarily
      the one a principal designates but must have the right to read the keytab file being
      used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
      situation, ensure that user1 has the right to read the keytab
      file to be used.

    • The User name field is available when you are not using
      Kerberos to authenticate. In the User name field, enter the
      login user name for your distribution. If you leave it empty, the user name of the machine
      hosting the Studio will be used.

    • Select the Wait for the Job to complete check box to make your Studio or,
      if you use Talend
      Jobserver, your Job JVM keep monitoring the Job until the execution of the Job
      is over. By selecting this check box, you actually set the spark.yarn.submit.waitAppCompletion property to be true. While
      it is generally useful to select this check box when running a Spark Batch Job,
      it makes more sense to keep this check box clear when running a Spark Streaming
      Job.

    Ensure that the user name in the Yarn
    client
    mode is the same one you put in
    tS3Configuration, the component used to provide S3
    connection information to Spark.


  2. With the Yarn client mode, the
    Property type list is displayed to allow you
    to select an established Hadoop connection from the Repository, on the condition that you have created this connection
    in the Repository. Then the Studio will reuse
    that set of connection information for this Job.


  3. If you need to launch from Windows, it is recommended to specify where
    the winutils.exe program to be used is stored.

    • If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box
      and enter the directory where your winutils.exe is
      stored.

    • Otherwise, leave this check box clear; the Studio generates one
      by itself and automatically uses it for this Job.


  4. In the Spark “scratch” directory
    field, enter the directory in which the Studio stores in the local system the
    temporary files such as the jar files to be transferred. If you launch the Job
    on Windows, the default disk is C:. So if you leave /tmp in this field, this directory is C:/tmp.

  • After the connection is configured, you can tune
    the Spark performance, although not required, by following the tuning process
    documented for Spark Batch Jobs and for Spark Streaming Jobs.

  • It is recommended to activate the Spark logging and
    checkpointing system in the Spark configuration tab of the Run view of your Spark
    Job, in order to help debug and resume your Spark Job when issues arise.

Running the Job to write KMS encrypted data to EMR

  1. Press Ctrl+S to save your Job.
  2. In the Run tab, click Run to execute the Job.

tS3Configuration properties for Apache Spark Streaming

These properties are used to configure tS3Configuration running in the Spark Streaming Job framework.

The Spark Streaming
tS3Configuration component belongs to the Storage family.

This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.

Basic settings

Access Key

Enter the access key ID that uniquely identifies an AWS
Account. For further information about how to get your Access Key and Secret Key,
see Getting Your AWS Access
Keys
.

Access Secret

Enter the secret access key, constituting the security
credentials in combination with the access Key.

To enter the secret key, click the […] button next to
the secret key field, and then in the pop-up dialog box enter the password between double
quotes and click OK to save the settings.

Bucket name

Enter the bucket name and its folder you need to use. You
need to separate the bucket name and the folder name using a slash (/).

Use s3a filesystem

Select this check box to use the S3A filesystem instead
of S3N, the filesystem used by default by tS3Configuration.

This feature is available when you are using one of the
following distributions with Spark:

  • Amazon EMR V4.5 and onwards

  • MapR V5.0 and onwards

  • Hortonworks Data Platform V2.4 and
    onwards

  • Cloudera V5.8 and onwards. For Cloudera V5.8, the Spark version must be
    2.0.

  • Cloudera
    Altus
Inherit credentials from AWS role

If you are using the S3A filesystem with EMR, you can select this check box
to obtain AWS security credentials from your EMR instance metadata. To use
this option, the Amazon EMR cluster must be started and your Job must be
running on this cluster. For more information, see Using an IAM Role to Grant
Permissions to Applications Running on Amazon EC2
Instances.

This option enables you to develop your Job without having to put any
AWS keys in the Job, thus easily complying with the security policy of your
organization.

Use SSE-KMS encryption with CMK

If you are using the S3A filesystem with EMR, you can select this check box
to use the SSE-KMS encryption service enabled on AWS to read or write the
encrypted data on S3.

On the EMR side, the SSE-KMS service must have been enabled with the
Default encryption feature
and a customer managed CMK specified for the encryption.

For further information about the AWS SSE-KMS encryption, see Protecting Data Using Server-Side
Encryption
from the AWS documentation.

For further information about how to enable the Default encryption feature for an
Amazon S3 bucket, see Default encryption from the
AWS documentation.

Use S3 bucket policy

If you have defined a bucket policy for the bucket to be used, select
this check box and add the parameter about AWS signature versions
to the JVM argument list of your Job in the Advanced
settings of the Run
tab (the same parameter as shown in the Spark Batch properties above).

Assume Role

If you are using the S3A filesystem, you can select this check box to
make your Job temporarily assume a role and the permissions associated
with this role.

Ensure that access to this role has been
granted to your user account by the trust policy associated to this role. If you are not
certain about this, ask the owner of this role or your AWS administrator.

After selecting this check box, specify the parameters that the
administrator of the AWS system to be used has defined for this role.

  • Role ARN: the Amazon Resource Name (ARN) of the role to assume. You
    can find this ARN name on the Summary page
    of the role to be used on your AWS portal, for example, this role ARN could read
    like arn:aws:iam::[aws_account_number]:role/[role_name].

  • Role session name: enter the name you want to use to uniquely
    identify your assumed role session. This name can contain upper- and lower-case
    alphanumeric characters with no spaces. You can also include underscores or any of
    the following characters: =,.@-.

  • Session duration (minutes): the duration (in minutes) for which you
    want the assumed role session to be active. This duration cannot exceed the
    maximum duration which your AWS administrator has set.

The External ID
parameter is required only if your AWS administrator or the owner of
this role has defined an external ID when they set up a trust policy for
this role.

In addition, if the AWS administrator has enabled the STS endpoints for
given regions you want to use for better response performance, use the
Set STS region check box or the
Set STS endpoint check box in the
Advanced settings tab.

This check box is available only for the following distributions
Talend supports:

  • CDH 5.10 and onwards (including the dynamic
    support for the latest Cloudera distributions)

  • HDP 2.5 and onwards

This check box is also available when you are using Spark V1.6 and
onwards in the Local Spark mode in the Spark configuration tab.

Set region

Select this check box and select the region to connect
to.

This feature is available when you are using one of the
following distributions with Spark:

  • Amazon EMR V4.5 and onwards

  • MapR V5.0 and onwards

  • Hortonworks Data Platform V2.4 and
    onwards

  • Cloudera V5.8 and onwards. For Cloudera V5.8, the Spark version must be
    2.0.

  • Cloudera
    Altus

Set endpoint

Select this check box and in the Endpoint field that is displayed, enter the Amazon
region endpoint you need to use. For a list of the available endpoints, see Regions and Endpoints.

If you leave this check box clear, the endpoint used is
the default one defined by your Hadoop distribution. This check box is not
available when you have selected the Set region check box;
in that case, the value selected from the
Set region list is
used.

This feature is available when you are using one of the
following distributions with Spark:

  • Amazon EMR V4.5 and onwards

  • MapR V5.0 and onwards

  • Hortonworks Data Platform V2.4 and
    onwards

  • Cloudera V5.8 and onwards. For Cloudera V5.8, the Spark version must be
    2.0.

  • Cloudera
    Altus

Advanced settings

Set STS region and Set STS endpoint

If the AWS administrator has enabled the STS endpoints for the regions
you want to use for better response performance, select the Set STS region check box and then select
the regional endpoint to be used.

If the endpoint you want to use is not available in this regional
endpoint list, clear the Set STS region check box, then select
the Set STS endpoint check box and enter the
endpoint to be used.

This service allows you to request temporary,
limited-privilege credentials for the AWS user you authenticate; therefore, you still
need to provide the access key and secret key to authenticate the AWS account to be
used.

For a list of the STS endpoints you can use, see
AWS Security Token Service. For further information about the
STS temporary credentials, see Temporary Security Credentials. Both articles are from the AWS
documentation.

These check boxes are available only when you have selected the
Assume Role check box in the Basic settings tab.

Usage

Usage rule

This component is used with no need to be connected to other components.

You need to drop tS3Configuration along with the file system related Subjob to be
run in the same Job so that the configuration is used by the whole Job at
runtime.

Spark Connection

In the Spark
Configuration
tab in the Run
view, define the connection to a given Spark cluster for the whole Job. In
addition, since the Job expects its dependent jar files for execution, you must
specify the directory in the file system to which these jar files are
transferred so that Spark can access these files:

  • Yarn mode (Yarn client or Yarn cluster):

    • When using Google Dataproc, specify a bucket in the
      Google Storage staging bucket
      field in the Spark configuration
      tab.

    • When using HDInsight, specify the blob to be used for Job
      deployment in the Windows Azure Storage
      configuration
      area in the Spark
      configuration
      tab.

    • When using Altus, specify the S3 bucket or the Azure
      Data Lake Storage for Job deployment in the Spark
      configuration
      tab.
    • When using Qubole, add a
      tS3Configuration to your Job to write
      your actual business data in the S3 system with Qubole. Without
      tS3Configuration, this business data is
      written in the Qubole HDFS system and destroyed once you shut
      down your cluster.
    • When using on-premise
      distributions, use the configuration component corresponding
      to the file system your cluster is using. Typically, this
      system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the
    configuration component corresponding to the file system your cluster is
    using, such as tHDFSConfiguration or
    tS3Configuration.

    If you are using Databricks without any configuration component present
    in your Job, your business data is written directly in DBFS (Databricks
    Filesystem).

This connection is effective on a per-Job basis.

Limitation

Due to license incompatibility, one or more JARs required to use
this component are not provided. You can install the missing JARs for this particular
component by clicking the Install button
on the Component tab view. You can also
find out and add all missing JARs easily on the Modules tab in the
Integration
perspective of your studio. You can find more details about how to install
external modules in Talend Help Center (https://help.talend.com)
.

