July 31, 2023

tS3Configuration properties for Apache Spark Batch – Docs for ESB Google Drive 7.x

tS3Configuration properties for Apache Spark Batch

These properties are used to configure tS3Configuration running in the Spark Batch
Job framework.

The Spark Batch tS3Configuration component belongs to the Storage family.

The component in this framework is available in all subscription-based Talend products with Big Data
and Talend Data Fabric.

Basic settings

Access Key

Enter the access key ID that uniquely identifies an AWS
account. For further information about how to get your access key and secret key,
see Getting Your AWS Access Keys.

Access Secret

Enter the secret access key, which constitutes the security
credentials in combination with the access key.

To enter the secret key, click the […] button next to
the secret key field, and then in the pop-up dialog box enter the password between double
quotes and click OK to save the settings.

Bucket name

Enter the name of the bucket and the folder you need to use. Separate
the bucket name and the folder name with a slash (/).
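
As an illustration of the bucket/folder convention described above, the following minimal Python sketch splits such a value at the first slash. The function name split_bucket_and_folder and the sample value are hypothetical and are not part of the component.

```python
from typing import Tuple


def split_bucket_and_folder(value: str) -> Tuple[str, str]:
    """Split a 'bucket/folder' value at the first slash.

    Everything before the first slash is treated as the bucket name,
    everything after it as the folder path (possibly empty).
    """
    # Ignore surrounding whitespace and stray leading/trailing slashes.
    bucket, _, folder = value.strip().strip("/").partition("/")
    if not bucket:
        raise ValueError("bucket name must not be empty")
    return bucket, folder


print(split_bucket_and_folder("my-bucket/input/2023"))
```

Only the first slash separates the bucket from the folder, so nested folders such as input/2023 stay intact in the second element.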

Temp folder

Enter the location of the temp folder in S3. This folder
is created automatically if it does not exist at execution time.

Use s3a filesystem

Select this check box to use the S3A filesystem instead
of S3N, the filesystem used by default by tS3Configuration.
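
To make the difference between the two filesystems concrete, here is a minimal Python sketch of how the URI scheme and the credential property names differ between the Hadoop S3A and S3N connectors. The property names are the standard Hadoop ones; whether the component sets exactly these properties internally is an assumption, and the helper names are hypothetical.

```python
def s3_credential_properties(access_key, secret_key, use_s3a):
    """Return the Hadoop properties that carry the keys for the chosen connector."""
    if use_s3a:
        # Property names used by the Hadoop S3A connector.
        return {
            "fs.s3a.access.key": access_key,
            "fs.s3a.secret.key": secret_key,
        }
    # Property names used by the older Hadoop S3N connector.
    return {
        "fs.s3n.awsAccessKeyId": access_key,
        "fs.s3n.awsSecretAccessKey": secret_key,
    }


def s3_uri(bucket, folder, use_s3a):
    """Build the URI under which Spark would address the bucket and folder."""
    scheme = "s3a" if use_s3a else "s3n"
    path = f"{bucket}/{folder}" if folder else bucket
    return f"{scheme}://{path}"


print(s3_uri("my-bucket", "input", use_s3a=True))
print(sorted(s3_credential_properties("KEY_ID", "SECRET", use_s3a=False)))
```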

This feature is available when you are using one of the
following distributions with Spark:

  • Amazon EMR V4.5 and onwards

  • MapR V5.0 and onwards

  • Hortonworks Data Platform V2.4 and
    onwards

  • Cloudera V5.8 and onwards. For Cloudera V5.8, the Spark version must be
    2.0.

  • Cloudera Altus

Inherit credentials from AWS

If you are using the S3A filesystem with EMR, you can select this check box
to obtain AWS security credentials from your EMR instance metadata. To use
this option, the Amazon EMR cluster must be started and your Job must be
running on this cluster. For more information, see Using an IAM Role to Grant
Permissions to Applications Running on Amazon EC2 Instances.

This option enables you to develop your Job without having to put any
AWS keys in the Job, which makes it easier to comply with the security policy
of your organization.
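
One way to picture a key-free configuration is sketched below in Python. It assumes that inheriting credentials corresponds to pointing the S3A connector at the AWS SDK instance profile credentials provider through the standard fs.s3a.aws.credentials.provider property. This mapping is an assumption about the component's behavior, not a statement of its implementation, and the helper names are hypothetical.

```python
# Hadoop S3A properties that would hold static keys.
STATIC_KEY_PROPERTIES = ("fs.s3a.access.key", "fs.s3a.secret.key")


def inherit_credentials_properties():
    """Return S3A properties that rely on instance metadata instead of keys."""
    return {
        # AWS SDK (v1) provider that reads credentials from the EC2/EMR
        # instance profile; assumed here, not confirmed by the source.
        "fs.s3a.aws.credentials.provider":
            "com.amazonaws.auth.InstanceProfileCredentialsProvider",
    }


def contains_static_keys(properties):
    """Report whether any static AWS key property is present."""
    return any(name in properties for name in STATIC_KEY_PROPERTIES)


props = inherit_credentials_properties()
print(contains_static_keys(props))
```

The check at the end shows the point of the option: no access key or secret key appears anywhere in the resulting configuration.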

Use SSE-KMS encryption with CMK

If you are using the S3A filesystem with EMR, you can select this check box
to use the SSE-KMS encryption service enabled on AWS to read or write the
encrypted data on S3.

On the EMR side, the SSE-KMS service must have been enabled with the
Default encryption feature and a customer managed CMK specified for the encryption.

For further information about the AWS SSE-KMS encryption, see Protecting Data Using
Server-Side Encryption from the AWS documentation.

For further information about how to enable the Default encryption feature for an
Amazon S3 bucket, see Default encryption from the
AWS documentation.
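
For readers who configure the S3A connector by hand, the following Python sketch shows the Hadoop properties commonly used to request SSE-KMS. It assumes a Hadoop version that uses the fs.s3a.server-side-encryption-* property names (newer Hadoop releases also accept renamed equivalents). The helper name and the key ARN are hypothetical placeholders, and whether the component sets these properties itself is an assumption.

```python
def sse_kms_properties(kms_key_arn=None):
    """Return S3A properties that request SSE-KMS encryption.

    When no key ARN is given, only the algorithm is set and S3 falls back
    to the key configured for the bucket's default encryption.
    """
    props = {"fs.s3a.server-side-encryption-algorithm": "SSE-KMS"}
    if kms_key_arn:
        props["fs.s3a.server-side-encryption.key"] = kms_key_arn
    return props


# Hypothetical customer managed key ARN, for illustration only.
example_key = "arn:aws:kms:us-east-1:123456789012:key/example-key-id"
for name, value in sse_kms_properties(example_key).items():
    print(f"{name}={value}")
```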

Use S3 bucket policy

If you have defined a bucket policy for the bucket to be used, select
this check box and add the following parameter about AWS signature versions
to the JVM argument list of your Job in the Advanced settings of the Run tab:

Assume Role

If you are using the S3A filesystem, you can select this check box to
make your Job temporarily assume a role and the permissions associated
with this role.

Ensure that access to this role has been
granted to your user account by the trust policy associated with this role. If you are not
certain about this, ask the owner of this role or your AWS administrator.

After selecting this check box, specify the parameters that the
AWS administrator has defined for this role.

  • Role ARN: the Amazon Resource Name (ARN) of the role to assume. You
    can find this ARN on the Summary page
    of the role to be used on your AWS portal. For example, a role ARN could read
    like arn:aws:iam::[aws_account_number]:role/[role_name].

  • Role session name: enter the name you want to use to uniquely
    identify your assumed role session. This name can contain upper- and lower-case
    alphanumeric characters with no spaces. You can also include underscores or any of
    the following characters: =,.@-.

  • Session duration (minutes): the duration (in minutes) for which you
    want the assumed role session to be active. This duration cannot exceed the
    maximum duration which your AWS administrator has set.
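
The three parameters above can be checked before a Job is run. The Python sketch below validates them and converts the duration to seconds, using the parameter names of the AWS STS AssumeRole API (RoleArn, RoleSessionName, DurationSeconds). The function name, the regular expressions, and the sample values are assumptions made for illustration. The ARN pattern only covers the standard aws partition shown in the example above.

```python
import re

# Role ARN in the form arn:aws:iam::<12-digit account>:role/<name or path>.
ROLE_ARN_RE = re.compile(r"^arn:aws:iam::\d{12}:role/[A-Za-z0-9_+=,.@/-]+$")
# Alphanumerics, underscores and =,.@- with no spaces, as described above.
SESSION_NAME_RE = re.compile(r"^[A-Za-z0-9_=,.@-]+$")


def build_assume_role_request(role_arn, session_name, duration_minutes):
    """Validate the Assume Role settings and return STS-style parameters."""
    if not ROLE_ARN_RE.match(role_arn):
        raise ValueError(f"not a valid IAM role ARN: {role_arn!r}")
    if not SESSION_NAME_RE.match(session_name):
        raise ValueError(f"invalid role session name: {session_name!r}")
    if duration_minutes <= 0:
        raise ValueError("session duration must be positive")
    return {
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        # The component takes minutes; STS expects seconds.
        "DurationSeconds": duration_minutes * 60,
    }


request = build_assume_role_request(
    "arn:aws:iam::123456789012:role/etl-reader", "talend_job-1", 60
)
print(request["DurationSeconds"])
```

The sketch does not check the duration against the maximum set by the AWS administrator, since that limit is only known on the AWS side.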

The External ID parameter is required only if your AWS administrator or the owner of
this role has defined an external ID when they set up a trust policy for
this role.

In addition, if the AWS administrator has enabled the STS endpoints for
given regions you want to use for better response performance, use the
Set STS region check box or the Set STS endpoint check box in the
Advanced settings tab.

This check box is available only for the following distributions
supported by Talend:

  • CDH 5.10 and onwards (including the dynamic
    support for the latest Cloudera distributions)

  • HDP 2.5 and onwards

This check box is also available when you are using Spark V1.6 and
onwards in the Local Spark mode in the Spark configuration tab.

Set region

Select this check box and select the region to connect
to.

This feature is available when you are using one of the
following distributions with Spark:

  • Amazon EMR V4.5 and onwards

  • MapR V5.0 and onwards

  • Hortonworks Data Platform V2.4 and
    onwards

  • Cloudera V5.8 and onwards. For Cloudera V5.8, the Spark version must be
    2.0.

  • Cloudera
    Altus

Set endpoint

Select this check box and in the Endpoint field that is displayed, enter the Amazon
region endpoint you need to use. For a list of the available endpoints, see Regions and Endpoints.

If you leave this check box clear, the endpoint is
the default one defined by your Hadoop distribution. This check box is not
available when you have selected the Set region check box; in that situation,
the value selected from the Set region list is used.
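
The precedence between Set region, Set endpoint, and the distribution default can be summarized with a short Python sketch. It assumes the standard fs.s3a.endpoint Hadoop property and the usual s3.<region>.amazonaws.com endpoint form of the commercial AWS regions. The function name is hypothetical, and how the component applies the region internally is not stated in the source.

```python
def effective_s3a_endpoint(region=None, endpoint=None):
    """Resolve which endpoint setting applies, following the rules above.

    A selected region wins; otherwise an explicit endpoint is used;
    otherwise nothing is set and the distribution default applies.
    """
    if region:
        # Usual regional endpoint form for commercial AWS regions (assumed).
        return {"fs.s3a.endpoint": f"s3.{region}.amazonaws.com"}
    if endpoint:
        return {"fs.s3a.endpoint": endpoint}
    return {}


print(effective_s3a_endpoint(region="eu-west-1"))
print(effective_s3a_endpoint(endpoint="s3.us-west-2.amazonaws.com"))
print(effective_s3a_endpoint())
```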

This feature is available when you are using one of the
following distributions with Spark:

  • Amazon EMR V4.5 and onwards

  • MapR V5.0 and onwards

  • Hortonworks Data Platform V2.4 and
    onwards

  • Cloudera V5.8 and onwards. For Cloudera V5.8, the Spark version must be
    2.0.

  • Cloudera
    Altus

Advanced settings

Set STS region and Set STS endpoint

If the AWS administrator has enabled the STS endpoints for the regions
you want to use for better response performance, select the Set STS region check box and then select
the regional endpoint to be used.

If the endpoint you want to use is not available in this regional
endpoint list, clear the Set STS region check box, then select
the Set STS endpoint check box and enter the
endpoint to be used.
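
The choice between the two check boxes can be sketched as follows in Python. The sketch assumes the usual sts.<region>.amazonaws.com form of regional STS endpoints and the global sts.amazonaws.com endpoint as the fallback when neither check box is selected. The function name and the sample custom endpoint are hypothetical.

```python
def sts_endpoint(region=None, custom_endpoint=None):
    """Pick the STS endpoint from either a region or a custom endpoint.

    The two settings are mutually exclusive, mirroring the instruction to
    clear Set STS region before selecting Set STS endpoint.
    """
    if region and custom_endpoint:
        raise ValueError("set either an STS region or an STS endpoint, not both")
    if region:
        return f"sts.{region}.amazonaws.com"
    if custom_endpoint:
        return custom_endpoint
    # Neither option selected: the global STS endpoint (assumed default).
    return "sts.amazonaws.com"


print(sts_endpoint(region="eu-central-1"))
print(sts_endpoint(custom_endpoint="sts.example.internal"))
```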

This service allows you to request temporary,
limited-privilege credentials for the AWS user you authenticate; therefore, you still
need to provide the access key and secret key to authenticate the AWS account to be
used.

For a list of the STS endpoints you can use, see
AWS Security Token Service. For further information about the
STS temporary credentials, see Temporary Security Credentials. Both articles are from the AWS
documentation.

These check boxes are available only when you have selected the
Assume Role check box in the Basic settings tab.

Usage

Usage rule

This component is used standalone, with no need to be connected to other
components.

Drop tS3Configuration into the same Job as the file-system-related Subjob to be
run, so that the configuration is used by the whole Job at runtime.

Spark Connection

In the Spark Configuration tab in the Run
view, define the connection to a given Spark cluster for the whole Job. In
addition, since the Job expects its dependent jar files for execution, you must
specify the directory in the file system to which these jar files are
transferred so that Spark can access these files:

  • Yarn mode (Yarn client or Yarn cluster):

    • When using Google Dataproc, specify a bucket in the Google Storage staging
      bucket field in the Spark configuration tab.

    • When using HDInsight, specify the blob to be used for Job deployment in the
      Windows Azure Storage configuration area in the Spark configuration tab.

    • When using Altus, specify the S3 bucket or the Azure Data Lake Storage for
      Job deployment in the Spark configuration tab.

    • When using Qubole, add a tS3Configuration to your Job to write your actual
      business data in the S3 system with Qubole. Without tS3Configuration, this
      business data is written in the Qubole HDFS system and destroyed once you
      shut down your cluster.

    • When using on-premise distributions, use the configuration component
      corresponding to the file system your cluster is using. Typically, this
      system is HDFS, so use tHDFSConfiguration.

  • Standalone mode: use the
    configuration component corresponding to the file system your cluster is
    using, such as tHDFSConfiguration or
    tS3Configuration.

    If you are using Databricks without any configuration component present
    in your Job, your business data is written directly in DBFS (Databricks
    Filesystem).

This connection is effective on a per-Job basis.

Limitation

Due to license incompatibility, one or more JARs required to use
this component are not provided. You can install the missing JARs for this particular
component by clicking the Install button on the Component tab view. You can also
find and add all missing JARs easily on the Modules tab in the Integration
perspective of your studio. You can find more details about how to install
external modules in Talend Help Center (https://help.talend.com).


Document source: Talend, https://help.talend.com