
tS3Configuration

Reuses the connection configuration to S3N or S3A in the same Job. The Spark
cluster to be used reads this configuration to eventually connect to S3N (S3 Native
Filesystem) or S3A.

Only one tS3Configuration component is allowed per Job.

Depending on the Talend
product you are using, this component can be used in one, some or all of the following
Job frameworks:

tS3Configuration properties for Apache Spark Batch

These properties are used to configure tS3Configuration running in the Spark
Batch
Job framework.

The Spark Batch
tS3Configuration component belongs to the Storage family.

The component in this framework is available in all subscription-based Talend products with Big Data
and Talend Data Fabric.

Basic settings

Access Key

Enter the access key ID that uniquely identifies an AWS
Account. For further information about how to get your Access Key and Secret Key,
see Getting Your AWS Access
Keys
.

Access Secret

Enter the secret access key, constituting the security
credentials in combination with the access Key.

To enter the secret key, click the […] button next to
the secret key field, and then in the pop-up dialog box enter the password between double
quotes and click OK to save the settings.

Bucket name

Enter the bucket name and its folder you need to use. You
need to separate the bucket name and the folder name using a slash (/).

Temp folder

Enter the location of the temp folder in S3. This folder
is automatically created at execution time if it does not already
exist.

Use s3a filesystem

Select this check box to use the S3A filesystem instead
of S3N, the filesystem used by default by tS3Configuration.

This feature is available when you are using one of the
following distributions with Spark:

  • Amazon EMR V4.5 and onwards

  • MapR V5.0 and onwards

  • Hortonworks Data Platform V2.4 and
    onwards

  • Cloudera V5.8 and onwards. For Cloudera V5.8, the Spark version must be
    2.0.

  • Cloudera
    Altus
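As background, the Access Key, Access Secret and Use s3a filesystem settings map onto the standard Hadoop S3 connector properties that Spark reads at runtime. The sketch below sets those properties directly on a SparkConf; the property names are those of the Hadoop S3A and S3N connectors, given here as an assumption about what the generated Job configures rather than the exact code the Studio produces, and the key values are placeholders.

    import org.apache.spark.SparkConf;

    public class S3ConfigSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("s3-config-sketch");

            // S3A (selected with "Use s3a filesystem"): the s3a connector credentials.
            conf.set("spark.hadoop.fs.s3a.access.key", "<your_access_key>");
            conf.set("spark.hadoop.fs.s3a.secret.key", "<your_secret_key>");

            // S3N (the default filesystem of tS3Configuration) uses the older s3n properties.
            conf.set("spark.hadoop.fs.s3n.awsAccessKeyId", "<your_access_key>");
            conf.set("spark.hadoop.fs.s3n.awsSecretAccessKey", "<your_secret_key>");

            // Data is then addressed as s3a://<bucket>/<folder>/... or s3n://<bucket>/<folder>/...
        }
    }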
Inherit credentials from AWS role

If you are using the S3A filesystem with EMR, you can select this check box
to obtain AWS security credentials from your EMR instance metadata. To use
this option, the Amazon EMR cluster must be started and your Job must be
running on this cluster. For more information, see Using an IAM Role to Grant
Permissions to Applications Running on Amazon EC2
Instances.

This option enables you to develop your Job without having to put any
AWS keys in the Job, thus easily complying with the security policy of your
organization.

Use SSE-KMS encryption with CMK

If you are using the S3A filesystem with EMR, you can select this check box
to use the SSE-KMS encryption service enabled on AWS to read or write the
encrypted data on S3.

On the EMR side, the SSE-KMS service must have been enabled with the
Default encryption feature
and a customer managed CMK specified for the encryption.

For further information about the AWS SSE-KMS encryption, see Protecting Data Using Server-Side
Encryption
from the AWS documentation.

For further information about how to enable the Default encryption feature for an
Amazon S3 bucket, see Default encryption from the
AWS documentation.
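For reference, SSE-KMS on the S3A connector is controlled by two Hadoop properties: the encryption algorithm and the customer managed CMK to use. The sketch below shows them on a SparkConf; this is a minimal illustration of the underlying Hadoop configuration, assumed rather than taken from the Studio's generated code, and the key ARN is a placeholder.

    import org.apache.spark.SparkConf;

    public class S3SseKmsSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("s3-sse-kms-sketch");

            // Ask the S3A connector to write objects with SSE-KMS encryption...
            conf.set("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "SSE-KMS");
            // ...using the customer managed CMK enabled with Default encryption on the EMR side.
            conf.set("spark.hadoop.fs.s3a.server-side-encryption.key",
                    "arn:aws:kms:<region>:<aws_account_number>:key/<key_id>");
        }
    }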

Use S3 bucket policy

If you have defined a bucket policy for the bucket to be used, select
this check box and add the following parameter about AWS signature versions
to the JVM argument list of your Job in the Advanced
settings of the Run
tab:
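A value commonly used for this parameter, which switches the AWS SDK to Signature Version 4 (given here as an assumption to verify against the screenshot later on this page and the AWS documentation), is:

    -Dcom.amazonaws.services.s3.enableV4=true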

Assume Role

If you are using the S3A filesystem, you can select this check box to
make your Job temporarily assume a role and the permissions associated
with this role.

Ensure that access to this role has been
granted to your user account by the trust policy associated to this role. If you are not
certain about this, ask the owner of this role or your AWS administrator.

After selecting this check box, specify the parameters that the
administrator of the AWS system to be used has defined for this role.

  • Role ARN: the Amazon Resource Name (ARN) of the role to assume. You
    can find this ARN name on the Summary page
    of the role to be used on your AWS portal, for example, this role ARN could read
    like arn:aws:iam::[aws_account_number]:role/[role_name].

  • Role session name: enter the name you want to use to uniquely
    identify your assumed role session. This name can contain upper- and lower-case
    alphanumeric characters with no spaces. You can also include underscores or any of
    the following characters: =,.@-.

  • Session duration (minutes): the duration (in minutes) for which you
    want the assumed role session to be active. This duration cannot exceed the
    maximum duration which your AWS administrator has set.

The External ID
parameter is required only if your AWS administrator or the owner of
this role has defined an external ID when they set up a trust policy for
this role.

In addition, if the AWS administrator has enabled the STS endpoints for
given regions you want to use for better response performance, use the
Set STS region check box or the
Set STS endpoint check box in the
Advanced settings tab.

This check box is available only for the following distributions
Talend supports:

  • CDH 5.10 and onwards (including the dynamic
    support for the latest Cloudera distributions)

  • HDP 2.5 and onwards

This check box is also available when you are using Spark V1.6 and
onwards in the Local Spark mode in the Spark configuration tab.
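The Role ARN, Role session name and Session duration parameters correspond directly to the inputs of an STS AssumeRole call. The sketch below shows that mapping with the AWS SDK for Java v1; it is an illustration of the AWS API, not the code generated by the Studio, and the role ARN and session name are placeholders.

    import com.amazonaws.services.securitytoken.AWSSecurityTokenService;
    import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClientBuilder;
    import com.amazonaws.services.securitytoken.model.AssumeRoleRequest;
    import com.amazonaws.services.securitytoken.model.Credentials;

    public class AssumeRoleSketch {
        public static void main(String[] args) {
            AWSSecurityTokenService sts = AWSSecurityTokenServiceClientBuilder.defaultClient();

            AssumeRoleRequest request = new AssumeRoleRequest()
                    // Role ARN: the role the Job temporarily assumes.
                    .withRoleArn("arn:aws:iam::<aws_account_number>:role/<role_name>")
                    // Role session name: uniquely identifies this assumed role session.
                    .withRoleSessionName("talend_s3_session")
                    // Session duration: the SDK takes seconds, while tS3Configuration asks for minutes.
                    .withDurationSeconds(30 * 60);
                    // .withExternalId("...") is only needed if the trust policy defines an external ID.

            Credentials temporary = sts.assumeRole(request).getCredentials();
            System.out.println("Temporary access key: " + temporary.getAccessKeyId());
        }
    }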

Set region

Select this check box and select the region to connect
to.

This feature is available when you are using one of the
following distributions with Spark:

  • Amazon EMR V4.5 and onwards

  • MapR V5.0 and onwards

  • Hortonworks Data Platform V2.4 and
    onwards

  • Cloudera V5.8 and onwards. For Cloudera V5.8, the Spark version must be
    2.0.

  • Cloudera
    Altus

Set endpoint

Select this check box and in the Endpoint field that is displayed, enter the Amazon
region endpoint you need to use. For a list of the available endpoints, see Regions and Endpoints.

If you leave this check box clear, the endpoint used is
the default one defined by your Hadoop distribution. This check box is not
available when you have selected the Set region check box;
in that case, the value selected from the
Set region list is
used.

This feature is available when you are using one of the
following distributions with Spark:

  • Amazon EMR V4.5 and onwards

  • MapR V5.0 and onwards

  • Hortonworks Data Platform V2.4 and
    onwards

  • Cloudera V5.8 and onwards. For Cloudera V5.8, the Spark version must be
    2.0.

  • Cloudera
    Altus
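At the Hadoop level, the endpoint entered here typically corresponds to the S3A endpoint property. A minimal sketch, assuming the standard property name and using the EU (Ireland) endpoint as an example value:

    import org.apache.spark.SparkConf;

    public class S3EndpointSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("s3-endpoint-sketch");
            // Regional endpoint for the S3A connector, for example the EU (Ireland) endpoint.
            conf.set("spark.hadoop.fs.s3a.endpoint", "s3.eu-west-1.amazonaws.com");
        }
    }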

Advanced settings

Set STS region and Set STS endpoint

If the AWS administrator has enabled the STS endpoints for the regions
you want to use for better response performance, select the Set STS region check box and then select
the regional endpoint to be used.

If the endpoint you want to use is not available in this regional
endpoint list, clear the Set STS region check box, then select
the Set STS endpoint check box and enter the
endpoint to be used.

This service allows you to request temporary,
limited-privilege credentials for the AWS user you authenticate; therefore, you still
need to provide the access key and secret key to authenticate the AWS account to be
used.

For a list of the STS endpoints you can use, see
AWS Security Token Service. For further information about the
STS temporary credentials, see Temporary Security Credentials. Both articles are from the AWS
documentation.

These check boxes are available only when you have selected the
Assume Role check box in the Basic settings tab.
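In AWS SDK terms, using a regional STS endpoint amounts to building the STS client against that endpoint rather than the global one. A sketch with the AWS SDK for Java v1, for illustration only; the endpoint and region values are examples:

    import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
    import com.amazonaws.services.securitytoken.AWSSecurityTokenService;
    import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClientBuilder;

    public class StsEndpointSketch {
        public static void main(String[] args) {
            // Point the STS client at a regional endpoint instead of the global one.
            AWSSecurityTokenService sts = AWSSecurityTokenServiceClientBuilder.standard()
                    .withEndpointConfiguration(
                            new EndpointConfiguration("sts.eu-west-1.amazonaws.com", "eu-west-1"))
                    .build();
            System.out.println("STS client configured: " + sts);
        }
    }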

Usage

Usage rule

This component is used with no need to be connected to other
components.

You need to drop tS3Configuration along with the file system related Subjob to be
run in the same Job so that the configuration is used by the whole Job at
runtime.

Spark Connection

In the Spark
Configuration
tab in the Run
view, define the connection to a given Spark cluster for the whole Job. In
addition, since the Job expects its dependent jar files for execution, you must
specify the directory in the file system to which these jar files are
transferred so that Spark can access these files:

  • Yarn mode (Yarn client or Yarn cluster):

    • When using Google Dataproc, specify a bucket in the
      Google Storage staging bucket
      field in the Spark configuration
      tab.

    • When using HDInsight, specify the blob to be used for Job
      deployment in the Windows Azure Storage
      configuration
      area in the Spark
      configuration
      tab.

    • When using Altus, specify the S3 bucket or the Azure
      Data Lake Storage for Job deployment in the Spark
      configuration
      tab.
    • When using Qubole, add a
      tS3Configuration to your Job to write
      your actual business data in the S3 system with Qubole. Without
      tS3Configuration, this business data is
      written in the Qubole HDFS system and destroyed once you shut
      down your cluster.
    • When using on-premise
      distributions, use the configuration component corresponding
      to the file system your cluster is using. Typically, this
      system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the
    configuration component corresponding to the file system your cluster is
    using, such as tHDFSConfiguration or
    tS3Configuration.

    If you are using Databricks without any configuration component present
    in your Job, your business data is written directly in DBFS (Databricks
    Filesystem).

This connection is effective on a per-Job basis.

Limitation

Due to license incompatibility, one or more JARs required to use
this component are not provided. You can install the missing JARs for this particular
component by clicking the Install button
on the Component tab view. You can also
find out and add all missing JARs easily on the Modules tab in the
Integration
perspective of your studio. You can find more details about how to install
external modules in Talend Help Center (https://help.talend.com)
.

Creating an IAM role on AWS

You need an IAM role to delegate permissions to the AWS service to be used by your Job. If this IAM role does not exist, define it on AWS.

  • You have the appropriate rights and permissions to create a new role on AWS.
  1. Log in to your account on AWS and navigate to the AWS console.
  2. Select IAM.
  3. In the navigation pane of the IAM console, select Roles, and then select Create role.
  4. Select AWS service and in the Choose the service that will use this role section, select the AWS service to be run with your Job. For example, select Redshift.
  5. Select the use case to be used for this service. A use case, in AWS terms, is defined by the service and includes the trust policy that this service requires. Depending on the service and the use case that you selected, the available options vary. For example, with Redshift, you can choose a use case from:

    • Redshift (with a pre-defined Amazon Redshift
      Service Linked Role Policy);
    • Redshift – Customizable. In this use case, you are prompted to select either read-only policies or full-access policies.
  6. In the Role name field, enter the name to be used for the role being created.
  7. Select Create role.
A custom role has been created to delegate permissions to an AWS service. For the
full documentation about creating a role on AWS, see Role creation from the AWS
documentation.
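The same role can also be created programmatically. The sketch below performs the equivalent of steps 3 to 7 with the AWS SDK for Java v1, using Redshift as the trusted service; the role name and the attached managed policy are placeholders chosen for this illustration, and the console procedure above remains the reference.

    import com.amazonaws.services.identitymanagement.AmazonIdentityManagement;
    import com.amazonaws.services.identitymanagement.AmazonIdentityManagementClientBuilder;
    import com.amazonaws.services.identitymanagement.model.AttachRolePolicyRequest;
    import com.amazonaws.services.identitymanagement.model.CreateRoleRequest;

    public class CreateRoleSketch {
        public static void main(String[] args) {
            AmazonIdentityManagement iam = AmazonIdentityManagementClientBuilder.defaultClient();

            // Trust policy: lets the Redshift service assume the role (the "use case" of step 5).
            String trustPolicy = "{"
                    + "\"Version\":\"2012-10-17\","
                    + "\"Statement\":[{\"Effect\":\"Allow\","
                    + "\"Principal\":{\"Service\":\"redshift.amazonaws.com\"},"
                    + "\"Action\":\"sts:AssumeRole\"}]}";

            iam.createRole(new CreateRoleRequest()
                    .withRoleName("my_redshift_role")            // step 6: the role name
                    .withAssumeRolePolicyDocument(trustPolicy)); // steps 4 and 5: service and use case

            // Attach a managed policy, for example read-only access, to the new role.
            iam.attachRolePolicy(new AttachRolePolicyRequest()
                    .withRoleName("my_redshift_role")
                    .withPolicyArn("arn:aws:iam::aws:policy/AmazonRedshiftReadOnlyAccess"));
        }
    }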

Setting up SSE KMS for your EMR cluster

If required by the security policy of your organization, you need to set up SSE KMS, the server-side encryption service of Amazon, for the EMR cluster to be used, before creating this cluster.
This procedure explains only the
SSE KMS related operations for getting started with the security configuration for EMR.
If you need the complete information about all the available EMR security configurations
provided by AWS, see Create a Security Configuration from the
Amazon documentation.
  1. If not yet done, go to https://console.aws.amazon.com/kms
    to create a customer managed CMK to be used by the SSE KMS service. For detailed
    instructions about how to do this, see this tutorial from the AWS
    documentation.

    • When adding roles, among other roles to be added depending on your
      security policy, you must add the EMR_EC2_DefaultRole role.

      The EMR_EC2_DefaultRole role allows your
      Jobs for Apache Spark to read or write files encrypted with SSE-KMS on
      S3.

      This role is a default AWS role that is
      automatically created along with the creation of your first EMR
      cluster. If this role and its associated policies do not exist in
      your account, see Use Default IAM Roles and
      Managed Policies
      from the AWS documentation

  2. On the Amazon EMR page of
    AWS, select the Security configurations
    tab and click Create to open the
    Create security configuration
    view.
  3. Select the At-rest encryption check box
    to enable SSE KMS.
  4. Under S3 data encryption, select
    SSE-KMS for Encryption mode
    and select the CMK key mentioned at the beginning of this procedure for
    AWS KMS Key.
  5. Under Local disk encryption, select AWS
    KMS
    for Key provider type and select the
    CMK key mentioned at the beginning of this procedure for AWS KMS
    Key
    .

    tS3Configuration_1.png

  6. Click Create to validate your security configuration.

    In real-world practice, you can also configure the other security options such as Kerberos and IAM roles for EMRFS before clicking this Create button.
  7. Click Clusters and once the Create Cluster page is open, click Go to advanced options to start creating the EMR cluster step by step.
  8. At the last step called Security, in the Authentication and
    encryption
    section, select the Security Configuration created in the previous steps.

Setting up SSE KMS for your S3 bucket

If required by the security policy of your organization, you need to set up SSE KMS for the S3 bucket to be used.
Prerequisite: you must have created the CMK key to be used. For detailed
instructions about how to do this, see this tutorial from the AWS
documentation.
This procedure explains only the
SSE KMS related operations for getting started with the security configuration for EMR.
If you need the complete information about all the available EMR security configurations
provided by AWS, see Create a Security Configuration from the
Amazon documentation.
  1. Open your S3 service at https://s3.console.aws.amazon.com/.
  2. From the S3 bucket list, select the bucket to be used. Ensure
    that you have proper rights and permissions to access this bucket.
  3. Select the Properties tab
    and then Default encryption.
  4. Select AWS-KMS.
  5. Select the KMS CMK key to be used.

    tS3Configuration_2.png

  6. Select the Permissions tab, then select
    Bucket Policy and enter your policy in the
    console.

    This article from AWS provides detailed explanations and a simple policy
    example: How to Prevent Uploads of Unencrypted Objects
    to Amazon S3
    .
  7. Click Save to save your policy.
Now your bucket policy is set up. When you need to use this bucket with a Job, enter
the following parameter about AWS signature versions to the JVM argument list of this Job:

tS3Configuration_3.png

For further information about AWS Signature Versions, see Specifying the Signature Version in Request
Authentication
.

Writing and reading data from S3 (Databricks on AWS)

In this scenario, you create a Spark Batch Job using
tS3Configuration and the Parquet components to write data on
S3 and then read the data from S3.

This scenario applies only to subscription-based Talend products with Big
Data
.

tS3Configuration_4.png
The sample data reads as
follows:

This data contains a user name and the ID number distributed to this user.

Note that the sample data is created for demonstration purposes only.

Design the data flow of the Job working with S3 and Databricks on AWS

  1. In the
    Integration
    perspective of the Studio, create an empty
    Spark Batch Job from the Job Designs node in
    the Repository tree view.
  2. In the workspace, enter the name of the component to be used and select this
    component from the list that appears. In this scenario, the components are
    tS3Configuration,
    tFixedFlowInput,
    tFileOutputParquet,
    tFileInputParquet and
    tLogRow.

    The tFixedFlowInput component is used to load the
    sample data into the data flow. In real-world practice, you could use the
    File input components, along with the processing components, to design a
    more sophisticated process to prepare your data.
  3. Connect tFixedFlowInput to tFileOutputParquet using the Row > Main link.
  4. Connect tFileInputParquet to tLogRow using the Row > Main link.
  5. Connect tFixedFlowInput to tFileInputParquet using the Trigger > OnSubjobOk link.
  6. Leave tS3Configuration alone without any
    connection.

Defining the Databricks-on-AWS connection parameters for Spark Jobs

Complete the Databricks connection configuration in the Spark
configuration
tab of the Run view of your Job.
This configuration is effective on a per-Job basis.

    1. When running a Spark Streaming Job, only one Job is allowed to run on the same Databricks cluster at a time.
    2. When running a Spark Batch Job, you can send more than one Job to run in parallel on the same Databricks cluster only if you have selected the Do not restart the cluster
      when submitting
      check box; otherwise, since each run automatically restarts the cluster, the Jobs
      launched in parallel interrupt each other and thus cause execution failure.
  • Ensure that the AWS account to be used has the proper read/write permissions to the S3 bucket to be used. For this purpose, contact the administrator of your AWS system.
Enter the basic connection information to Databricks on AWS.

Standalone

  • In the Endpoint
    field, enter the URL address of the workspace of your Databricks on
    AWS. For example, this URL could look like
    https://<your_endpoint>.cloud.databricks.com.

  • In the Cluster
    ID
    field, enter the ID of the Databricks cluster to
    be used. This ID is the value of the
    spark.databricks.clusterUsageTags.clusterId
    property of your Spark cluster. You can find this property on the
    properties list in the Environment tab in the
    Spark UI view of your cluster.

    You can also easily find this ID
    from the URL of your Databricks cluster. It is present immediately
    after cluster/ in this URL.

    This field is not used and thus not available if you are using transient clusters.

  • Click the […]
    button next to the Token field to enter the
    authentication token generated for your Databricks user account. You
    can generate or find this token on the User
    settings
    page of your Databricks workspace. For
    further information, see Token management from the
    Databricks documentation.

  • In the DBFS
    dependencies folder
    field, enter the directory that
    is used to store your Job related dependencies on Databricks
    Filesystem at runtime, putting a slash (/) at the end of this
    directory. For example, enter /jars/ to store
    the dependencies in a folder named jars. This
    folder is created on the fly if it does not exist then.

    This directory stores your Job dependencies on DBFS only. In your
    Job, use tS3Configuration,
    tDynamoDBConfiguration or, in a Spark
    Streaming Job, the Kinesis components, to read or write your
    business data to the related systems.

  • Poll interval when retrieving Job status (in
    ms)
    : enter, without the quotation marks, the time
    interval (in milliseconds) at the end of which you want the Studio
    to ask Spark for the status of your Job. For example, this status
    could be Pending or Running.

    The default value is 300000, meaning 300
    seconds. This interval is recommended by Databricks to correctly
    retrieve the Job status.

  • Use
    transient cluster
    : you can select this check box to
    leverage the transient Databricks clusters.

    The custom properties you defined in the Advanced properties table are automatically taken into account by the transient clusters at runtime.

    1. Autoscale: select or clear this check box to define
      the number of workers to be used by your transient cluster.

      1. If you select this check box,
        autoscaling is enabled. Then define the minimum number
        of workers in Min
        workers
        and the maximum number of
        workers in Max
        workers
        . Your transient cluster is
        scaled up and down within this scope based on its
        workload.

        According to the Databricks
        documentation, autoscaling works best with
        Databricks runtime versions 3.0 or onwards.

      2. If you clear this check box, autoscaling
        is deactivated. Then define the number of workers a
        transient cluster is expected to have. This number does
        not include the Spark driver node.
    2. Node type
      and Driver node type:
      select the node types for the workers and the Spark driver node.
      These types determine the capacity of your nodes and their
      pricing by Databricks.

      For details about
      these node types and the Databricks Units they use, see
      Supported Instance
      Types
      from the Databricks documentation.

    3. Elastic
      disk
      : select this check box to enable your
      transient cluster to automatically scale up its disk space when
      its Spark workers are running low on disk space.

      For more details about this elastic disk
      feature, search for the section about autoscaling local
      storage from your Databricks documentation.

    4. SSH public
      key
      : if an SSH access has been set up for your
      cluster, enter the public key of the generated SSH key pair.
      This public key is automatically added to each node of your
      transient cluster. If no SSH access has been set up, ignore this
      field.

      For further information about SSH
      access to your cluster, see SSH access to
      clusters
      from the Databricks
      documentation.

    5. Configure cluster
      log
      : select this check box to define where to
      store your Spark logs for a long term. This storage system could
      be S3 or DBFS.
  • Do not restart the cluster
    when submitting
    : select this check box to prevent
    the Studio from restarting the cluster when the Studio submits your
    Jobs. However, if you make changes in your Jobs, clear this check
    box so that the Studio restarts your cluster to take these changes
    into account.

If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the
Spark checkpointing operation. In the field that is displayed, enter the
directory in which Spark stores, in the file system of the cluster, the context
data of the computations such as the metadata and the generated RDDs of this
computation.

For further information about the Spark checkpointing operation, see http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing .
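In Spark API terms, this option corresponds to declaring a checkpoint directory and checkpointing the data sets you want to be recoverable. A minimal sketch with the Spark Java API, for illustration only; the checkpoint directory is a placeholder:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CheckpointSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("checkpoint-sketch").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Directory in the cluster file system where Spark stores the checkpoint data.
            sc.setCheckpointDir("/tmp/spark_checkpoints");

            JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3));
            rdd.checkpoint();                 // persist the RDD lineage to the checkpoint directory
            System.out.println(rdd.count());  // an action triggers the checkpoint

            sc.stop();
        }
    }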

Configuring the connection to the S3 service to be used by Spark

  1. Double-click tS3Configuration to open its Component view.

    Spark uses this component to connect to the S3 system in which your Job writes
    the actual business data. If you place neither
    tS3Configuration nor any other configuration
    component that supports Databricks on AWS, this business data is written in the
    Databricks Filesystem (DBFS).
    tS3Configuration_5.png

  2. In the Access key and the Secret
    key
    fields, enter the keys to be used to authenticate to
    S3.
  3. In the Bucket name field, enter the name of the bucket
    and the folder in this bucket to be used to store the business data, for
    example, my_bucket/my_folder. This folder is created on the
    fly if it does not exist but the bucket must already exist at runtime.

Writing the sample data to S3

  1. Double-click the tFixedFlowInput component to
    open its Component view.

    tS3Configuration_6.png

  2. Click the […] button next to Edit schema to open the schema editor.
  3. Click the [+] button to add the schema
    columns as shown in this image.

    tS3Configuration_7.png

  4. Click OK to validate these changes and accept
    the propagation prompted by the pop-up dialog box.
  5. In the Mode area, select the Use Inline
    Content
    radio button and paste the previously mentioned sample data
    into the Content field that is displayed.
  6. In the Field separator field, enter a
    semicolon (;).
  7. Double-click the tFileOutputParquet component to
    open its Component view.

    tS3Configuration_8.png
  8. Select the Define a storage configuration component
    check box and then select the tS3Configuration component
    you configured in the previous steps.
  9. Click Sync columns to ensure that
    tFileOutputParquet has the same schema as
    tFixedFlowInput.
  10. In the Folder/File field, enter the name of the S3
    folder to be used to store the sample data. For example, enter
    /sample_user, then as you have specified
    my_bucket/my_folder to use in
    tS3Configuration to store the business data on S3,
    the eventual directory on S3 becomes
    my_bucket/my_folder/sample_user.
  11. From the Action drop-down list, select
    Create if the sample_user folder
    does not exist yet; if this folder already exists, select
    Overwrite.

Reading the sample data from S3

  1. Double-click tFileInputParquet to open its
    Component view.

    tS3Configuration_9.png

  2. Select the Define a storage configuration component
    check box and then select the tS3Configuration
    component you configured in the previous steps.
  3. Click the […] button next to Edit
    schema
    to open the schema editor.
  4. Click the [+] button to add the schema columns for
    output as shown in this image.

    tS3Configuration_10.png

  5. Click OK to validate these changes and accept the
    propagation prompted by the pop-up dialog box.
  6. In the Folder/File field, enter the name of the folder
    from which you need to read data. In this scenario, it is
    sample_user.
  7. Double click tLogRow to open its
    Component view and select the Table radio button to present the result in a table.
  8. Press F6 to run this Job.
Once done, you can find your Job on the Job page on the Web UI
of your Databricks cluster and then check the execution log of your Job.

tS3Configuration_11.png
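As a point of reference, the write and read performed by this Job are equivalent to a plain Spark write and read of a Parquet folder under the bucket and folder declared in tS3Configuration. A sketch with the Spark Java API, assuming the S3 credentials are already set on the Hadoop configuration as tS3Configuration does for the Job; the data set is a stand-in for the sample data:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ParquetOnS3Sketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("parquet-on-s3-sketch").getOrCreate();

            // tS3Configuration: bucket my_bucket/my_folder; tFileOutputParquet: folder /sample_user.
            String target = "s3a://my_bucket/my_folder/sample_user";

            // Equivalent of tFileOutputParquet with Action set to Overwrite.
            spark.range(5).write().mode("overwrite").parquet(target);

            // Equivalent of tFileInputParquet followed by tLogRow in Table mode.
            Dataset<Row> back = spark.read().parquet(target);
            back.show();

            spark.stop();
        }
    }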

Writing server-side KMS encrypted data on EMR

If the AWS SSE-KMS encryption (at-rest encryption) service is
enabled to set Default encryption to protect data
on the S3A system of your EMR cluster, select the SSE-KMS option in tS3Configuration when writing data to that S3A
system.

The sample data used in this scenario is about different types of
incidents that people reported occurring on Paris streets within one
day.
The sample data is used for demonstration purposes only.

The Job calculates the occurrence of each incident type.

tS3Configuration_12.png

Prerequisites:

  • The S3 system to be used is S3A.
  • The SSE-KMS encryption service on AWS is enabled with the
    Default encryption feature and a
    customer managed CMK has been specified for it.
  • The EMR cluster to be used is created with SSE-KMS and the
    EMR_EC2_DefaultRole role has been added to the above-mentioned CMK.
  • The administrator of your EMR cluster has granted the appropriate
    rights and permissions to the AWS account you are using in your Jobs.
  • Your EMR cluster has been properly set up and is running.
  • A Talend
    Jobserver has been deployed on an instance within the network of your EMR
    cluster, such as the instance for the master of your cluster.
All these operations are done on the AWS side.

  • In your Studio or in Talend Administration Center, define this Jobserver as the execution server of your
    Jobs.

Ensure that the client machine on which the Talend Jobs are executed can
recognize the host names of the nodes of the Hadoop cluster to be used. For this purpose, add
the IP address/hostname mapping entries for the services of that Hadoop cluster in the
hosts file of the client machine.

If this is the first time your EMR cluster is set up to run with Talend Jobs, search for Amazon EMR –
Getting Started on Talend Help Center (https://help.talend.com) to verify your setup so as to help your
Jobs work more efficiently on top of EMR.

Designing the flow of the data to write and encrypt onto EMR

Link the components to construct the data flow.
  1. In the
    Integration
    perspective of the Studio, create an empty Spark Batch Job from the Job
    Designs
    node in the Repository tree view.
  2. In the workspace, enter the name of the component to be used and select this
    component from the list that appears. In this scenario, the components are
    tHDFSConfiguration (labeled emr_hdfs), tS3Configuration, tFixedFlowInput, tAggregateRow and tFileOutputParquet.

    The tFixedFlowInput component is used to load
    the sample data into the data flow. In real-world practice, use the input component specific to the data format or the source system to be used instead of tFixedFlowInput.
  3. Connect tFixedFlowInput, tAggregateRow and
    tFileOutputParquet using the Row > Main link.
  4. Leave the tHDFSConfiguration component and the tS3Configuration component alone
    without any connection.

Configuring the connection to the HDFS file system of your EMR
cluster

  1. Double-click tHDFSConfiguration to open its Component view.

    Spark uses this component to connect to the HDFS system to
    which the jar files dependent on the Job are transferred.

  2. If you have defined the HDFS connection metadata under the
    Hadoop cluster node in Repository, select Repository from the Property
    type
    drop-down list and then click the […] button to select the HDFS connection you
    have defined from the Repository content
    wizard.

    For further information about setting up a reusable HDFS
    connection, search for centralizing HDFS metadata on Talend Help Center (https://help.talend.com).

    If you complete this step, you can skip the following steps
    about configuring tHDFSConfiguration
    because all the required fields should have been filled automatically.

  3. In the Version area,
    select the Hadoop distribution you need to connect to and its version.
  4. In the NameNode URI
    field, enter the location of the machine hosting the NameNode service of the
    cluster. If you are using WebHDFS, the location should be
    webhdfs://masternode:portnumber; WebHDFS with SSL is not
    supported yet.
  5. In the Username field,
    enter the authentication information used to connect to the HDFS system to be
    used. Note that the user name must be the same as you have put in the Spark configuration tab.
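For reference, the NameNode URI and Username entered above are what a plain HDFS client connection needs. A minimal sketch with the Hadoop Java API, for illustration only; the NameNode address, port and user name are placeholders to adapt to your cluster:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsConnectionSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // NameNode URI of the cluster, as entered in tHDFSConfiguration
            // (8020 is a common NameNode RPC port; adjust it to your cluster).
            URI nameNode = new URI("hdfs://masternode:8020");

            // Connect with the same user name as in the Username field.
            FileSystem fs = FileSystem.get(nameNode, conf, "hadoop");
            System.out.println(fs.exists(new Path("/user")));
        }
    }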

Configuring the connection to S3 to be used to store the business data

  1. Double-click tS3Configuration to open its Component view.

    Spark uses this component to connect to the S3 system in which the business data is stored. In this scenario, the sample data about the street incidents is written and encrypted on S3.

  2. Select the Use s3a filesystem check box. The Inherit credentials from AWS role check box and the Use SSE-KMS encryption check box become available.
  3. Enter the access credentials of the AWS account to be used.

    • If allowed by the security policy of your organization, in the Access Key and the Secret Key fields, enter the credentials.

      If you do not know the credentials to be used, contact the administrator of your AWS system or check Getting Your AWS Access
      Keys
      from the AWS documentation.

    • If the security policy of your organization does not allow you to expose the credentials in a client application, select Inherit credentials from AWS role to obtain the role-based temporary AWS security credentials from your EMR instance metadata. An IAM role must have been specified to associate with this EMR instance.

      For further information about using an IAM role to grant permissions, see Using IAM roles from the AWS documentation.

  4. Select the Use SSE-KMS encryption check box to enable the Job to verify and use the SSE-KMS encryption service of your cluster.
  5. In the Bucket name field, enter the name of the bucket to be used to store the sample data. This bucket must already exist when you launch your Job. For example, enter my_bucket/my_folder.

Loading the sample data about street incidents into the Job

  1. Double-click tFixedFlowInput to display its
    Basic settings view.

    tS3Configuration_13.png

  2. Click the […] button next to Edit
    Schema
    to open the schema editor.

    tS3Configuration_14.png

  3. Click the [+] button to add four columns, namely
    id, address,
    incident_type and description.
  4. Click OK to close the schema editor and accept
    propagating the changes when prompted by the system.

  5. In the Mode area, select the Use Inline Content(delimited file) radio button to display the Content area and enter the sample data.

Calculating the incident occurrence

  1. Double-click tAggregateRow to open its
    Component view.

    tS3Configuration_15.png

  2. Click the […] button next to Edit schema to open the schema editor.
  3. On the output side (right), click the [+]
    button twice to add two rows and in the Column column, rename them to incident_type and incident_number, respectively.

    tS3Configuration_16.png

  4. In the Type column of the incident_number row of the output side, select Integer.
  5. Click OK to validate these changes and accept
    the propagation prompted by the pop-up dialog box.
  6. In the Group by table, add one row by
    clicking the [+] button and configure
    this row as follows to group the outputted data:

    • Output column:
      select the columns from the output schema to be used as the
      conditions to group the outputted data. In this example, it is the
      incident_type from the output schema.

    • Input column
      position
      : select the columns from the input schema
      to send data to the output columns you have selected in the
      Output
      column
      column. In this scenario, it is the
      incident_type column from the input schema.

  7. In the Operations table, add one row by clicking the [+] button once and configure this
    row as follows to calculate the occurrence of each incident type:

    • Output column:
      select the column from the output schema to store the calculation
      results. In this scenario, it is incident_number.

    • Function: select
      the function to be used to process the incoming data. In this
      scenario, select count. It counts the frequency of each
      incident.

    • Input column
      position
      : select the column from the input schema to
      provide the data to be processed. In this scenario, it is incident_type.
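The Group by and Operations settings above amount to counting the rows per incident_type. The sketch below expresses the equivalent computation with the Spark DataFrame API; it is an illustration of the aggregation, not the code the Studio generates, and the input file path is a placeholder for the sample data:

    import static org.apache.spark.sql.functions.count;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class AggregateSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("aggregate-sketch").getOrCreate();

            // The flow coming from tFixedFlowInput: id, address, incident_type, description.
            Dataset<Row> incidents = spark.read()
                    .option("delimiter", ";")
                    .csv("/tmp/street_incidents.csv")
                    .toDF("id", "address", "incident_type", "description");

            // Group by incident_type and count the occurrences, as tAggregateRow is configured to do.
            Dataset<Row> occurrences = incidents
                    .groupBy("incident_type")
                    .agg(count("incident_type").alias("incident_number"));

            occurrences.show();
            spark.stop();
        }
    }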

Writing the aggregated data about street incidents to EMR

  1. Double-click the tFileOutputParquet component to
    open its Component view.

    tS3Configuration_17.png
  2. Select the Define a storage
    configuration component
    check box and then select the tS3Configuration component you configured in the
    previous steps.
  3. Click Sync columns to
    ensure that tFileOutputParquet retrieves the
    schema from the output side of tAggregateRow.
  4. In the Folder/File
    field, enter the name of the folder to be used to store the aggregated data in
    the S3 bucket specified in tS3Configuration. For example,
    enter /sample_user, then at runtime, the folder called
    sample_user at the root of the bucket is used to
    store the output of your Job.
  5. From the Action
    drop-down list, select Create if the
    folder to be used does not exist yet in the bucket to be used; if this folder
    already exists, select Overwrite.
  6. Click Run to open its view and then click the
    Spark Configuration tab to display its view
    for configuring the Spark connection.
  7. Select the Use local mode check box to test your Job locally before eventually submitting it to the remote Spark cluster.

    In the local mode, the Studio builds the Spark environment within itself on the fly in order to
    run the Job. Each processor of the local machine is used as a Spark
    worker to perform the computations.

  8. In this mode, your local file system is used; therefore, deactivate the
    configuration components such as tS3Configuration or
    tHDFSConfiguration that provide connection
    information to a remote file system, if you have placed these components
    in your Job.
  9. In the Component view of tFileOutputParquet, change the file path in the Folder/File field to a local directory and adapt the action to be taken on the Action drop-down list, that is to say, creating a new folder or overwriting the existing one.
  10. On the Run tab, click Basic
    Run
    and in this view, click Run to execute your Job locally to test its design
    logic.
  11. When your Job runs successfully, clear the Use local
    mode
    check box in the Spark Configuration
    view of the Run tab, then in the design workspace of your
    Job, activate the configuration components and revert the changes you just made
    in tFileOutputParquet for the local test.

Defining the EMR connection parameters

Complete the EMR connection configuration in the Spark
configuration
tab of the Run view of your Job.
This configuration is effective on a per-Job basis.

  1. Enter the basic connection information to EMR:

    Yarn client

    The Studio runs the Spark driver to orchestrate how the Job should be
    performed and then sends the orchestration to the Yarn service of a given
    Hadoop cluster so that the Resource Manager of this Yarn service
    requests execution resources accordingly.

    If you are using the Yarn client
    mode, you need to set the following parameters in their corresponding
    fields (if you leave the check box of a service clear, then at runtime,
    the configuration about this parameter in the Hadoop cluster to be used
    will be ignored):

    • In the Resource manager
      field, enter the address of the ResourceManager service of the Hadoop cluster to
      be used.

    • Select the Set resourcemanager
      scheduler address
      check box and enter the Scheduler address in
      the field that appears.

    • Select the Set jobhistory
      address
      check box and enter the location of the JobHistory
      server of the Hadoop cluster to be used. This allows the metrics information of
      the current Job to be stored in that JobHistory server.

    • Select the Set staging
      directory
      check box and enter this directory defined in your
      Hadoop cluster for temporary files created by running programs. Typically, this
      directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
      such as yarn-site.xml or mapred-site.xml of your distribution.

    • If you are accessing the Hadoop cluster running with Kerberos security,
      select this check box, then, enter the Kerberos principal names for the
      ResourceManager service and the JobHistory service in the displayed fields. This
      enables you to use your user name to authenticate against the credentials stored in
      Kerberos. These principals can be found in the configuration files of your
      distribution, such as in yarn-site.xml and in mapred-site.xml.

      If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
      pairs of Kerberos principals and encrypted keys. You need to enter the principal to
      be used in the Principal field and the access
      path to the keytab file itself in the Keytab
      field. This keytab file must be stored in the machine in which your Job actually
      runs, for example, on a Talend
      Jobserver.

      Note that the user that executes a keytab-enabled Job is not necessarily
      the one a principal designates but must have the right to read the keytab file being
      used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
      situation, ensure that user1 has the right to read the keytab
      file to be used.

    • The User name field is available when you are not using
      Kerberos to authenticate. In the User name field, enter the
      login user name for your distribution. If you leave it empty, the user name of the machine
      hosting the Studio will be used.

    • If the Spark cluster cannot recognize the machine in which the Job is
      launched, select this Define the driver hostname or IP
      address
      check box and enter the host name or the IP address of
      this machine. This allows the Spark master and its workers to recognize this
      machine to find the Job and thus its driver.

      Note that in this situation, you also need to add the name and the IP
      address of this machine to its host file.

    Yarn cluster

    The Spark driver runs in your Yarn cluster to orchestrate how the Job
    should be performed.

    If you are using the Yarn cluster mode, you need
    to define the following parameters in their corresponding fields (if you
    leave the check box of a service clear, then at runtime, the
    configuration about this parameter in the Hadoop cluster to be used will
    be ignored):

    • In the Resource manager
      field, enter the address of the ResourceManager service of the Hadoop cluster to
      be used.

    • Select the Set resourcemanager
      scheduler address
      check box and enter the Scheduler address in
      the field that appears.

    • Select the Set jobhistory
      address
      check box and enter the location of the JobHistory
      server of the Hadoop cluster to be used. This allows the metrics information of
      the current Job to be stored in that JobHistory server.

    • Select the Set staging
      directory
      check box and enter this directory defined in your
      Hadoop cluster for temporary files created by running programs. Typically, this
      directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
      such as yarn-site.xml or mapred-site.xml of your distribution.

    • Set path to custom Hadoop
      configuration JAR
      : if you are using
      connections defined in Repository to
      connect to your Cloudera or Hortonworks cluster, you can
      select this check box in the
      Repository wizard and in the
      field that is displayed, specify the path to the JAR file
      that provides the connection parameters of your Hadoop
      environment. Note that this file must be accessible from the
      machine where your Job is launched.

      This kind of Hadoop configuration JAR file is
      automatically generated when you build a Big Data Job from the
      Studio and is named following a default pattern. You
      can also download this JAR file from the web console of your
      cluster or simply create a JAR file yourself by putting the
      configuration files in the root of your JAR file.

      The parameters from your custom JAR file override the parameters
      you put in the Spark configuration field.
      They also override the configuration you set in the
      configuration components such as
      tHDFSConfiguration or
      tHBaseConfiguration when the related
      storage system such as HDFS, HBase or Hive are native to Hadoop.
      But they do not override the configuration set in the
      configuration components for the third-party storage system such
      as tAzureFSConfiguration.

    • If you are accessing the Hadoop cluster running with Kerberos security,
      select this check box, then, enter the Kerberos principal names for the
      ResourceManager service and the JobHistory service in the displayed fields. This
      enables you to use your user name to authenticate against the credentials stored in
      Kerberos. These principals can be found in the configuration files of your
      distribution, such as in yarn-site.xml and in mapred-site.xml.

      If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
      pairs of Kerberos principals and encrypted keys. You need to enter the principal to
      be used in the Principal field and the access
      path to the keytab file itself in the Keytab
      field. This keytab file must be stored in the machine in which your Job actually
      runs, for example, on a Talend
      Jobserver.

      Note that the user that executes a keytab-enabled Job is not necessarily
      the one a principal designates but must have the right to read the keytab file being
      used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
      situation, ensure that user1 has the right to read the keytab
      file to be used.

    • The User name field is available when you are not using
      Kerberos to authenticate. In the User name field, enter the
      login user name for your distribution. If you leave it empty, the user name of the machine
      hosting the Studio will be used.

    • Select the Wait for the Job to complete check box to make your Studio or,
      if you use Talend
      Jobserver, your Job JVM keep monitoring the Job until the execution of the Job
      is over. By selecting this check box, you actually set the spark.yarn.submit.waitAppCompletion property to be true. While
      it is generally useful to select this check box when running a Spark Batch Job,
      it makes more sense to keep this check box clear when running a Spark Streaming
      Job.

    Ensure that the user name in the Yarn
    client
    mode is the same one you put in
    tS3Configuration, the component used to provide S3
    connection information to Spark.


  2. With the Yarn client mode, the
    Property type list is displayed to allow you
    to select an established Hadoop connection from the Repository, on the condition that you have created this connection
    in the Repository. Then the Studio will reuse
    that set of connection information for this Job.


  3. If you need to launch from Windows, it is recommended to specify where
    the winutils.exe program to be used is stored.

    • If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box
      and enter the directory where your winutils.exe is
      stored.

    • Otherwise, leave this check box clear; the Studio generates one
      by itself and automatically uses it for this Job.


  4. In the Spark “scratch” directory
    field, enter the directory in which the Studio stores in the local system the
    temporary files such as the jar files to be transferred. If you launch the Job
    on Windows, the default disk is C:. So if you leave /tmp in this field, this directory is C:/tmp.

  • After the connection is configured, you can tune
    the Spark performance, although not required, by following the tuning process
    documented for Spark Batch Jobs and for Spark Streaming Jobs.

  • It is recommended to activate the Spark logging and
    checkpointing system in the Spark configuration tab of the Run view of your Spark
    Job, in order to help debug and resume your Spark Job when issues arise.

Running the Job to write KMS encrypted data to EMR

  1. Press Ctrl+S to save your Job.
  2. In the Run tab, click Run to execute the Job.

tS3Configuration properties for Apache Spark Streaming

These properties are used to configure tS3Configuration running in the Spark Streaming Job framework.

The Spark Streaming
tS3Configuration component belongs to the Storage family.

This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.

Basic settings

Access Key

Enter the access key ID that uniquely identifies an AWS
Account. For further information about how to get your Access Key and Secret Key,
see Getting Your AWS Access
Keys
.

Access Secret

Enter the secret access key, constituting the security
credentials in combination with the access Key.

To enter the secret key, click the […] button next to
the secret key field, and then in the pop-up dialog box enter the password between double
quotes and click OK to save the settings.

Bucket name

Enter the bucket name and its folder you need to use. You
need to separate the bucket name and the folder name using a slash (/).

Use s3a filesystem

Select this check box to use the S3A filesystem instead
of S3N, the filesystem used by default by tS3Configuration.

This feature is available when you are using one of the
following distributions with Spark:

  • Amazon EMR V4.5 and onwards

  • MapR V5.0 and onwards

  • Hortonworks Data Platform V2.4 and
    onwards

  • Cloudera V5.8 and onwards. For Cloudera V5.8, the Spark version must be
    2.0.

  • Cloudera
    Altus
Inherit credentials from AWS role

If you are using the S3A filesystem with EMR, you can select this check box
to obtain AWS security credentials from your EMR instance metadata. To use
this option, the Amazon EMR cluster must be started and your Job must be
running on this cluster. For more information, see Using an IAM Role to Grant
Permissions to Applications Running on Amazon EC2
Instances.

This option enables you to develop your Job without having to put any
AWS keys in the Job, thus easily complying with the security policy of your
organization.

Use SSE-KMS encryption with CMK

If you are using the S3A filesystem with EMR, you can select this check box
to use the SSE-KMS encryption service enabled on AWS to read or write the
encrypted data on S3.

On the EMR side, the SSE-KMS service must have been enabled with the
Default encryption feature
and a customer managed CMK specified for the encryption.

For further information about the AWS SSE-KMS encryption, see Protecting Data Using Server-Side
Encryption
from the AWS documentation.

For further information about how to enable the Default encryption feature for an
Amazon S3 bucket, see Default encryption from the
AWS documentation.

Use S3 bucket policy

If you have defined a bucket policy for the bucket to be used, select
this check box and add the parameter about AWS signature versions
to the JVM argument list of your Job in the Advanced
settings of the Run
tab (the same parameter as shown in the Spark Batch properties above).

Assume Role

If you are using the S3A filesystem, you can select this check box to
make your Job temporarily assume a role and the permissions associated
with this role.

Ensure that access to this role has been
granted to your user account by the trust policy associated to this role. If you are not
certain about this, ask the owner of this role or your AWS administrator.

After selecting this check box, specify the parameters that the
administrator of the AWS system to be used has defined for this role.

  • Role ARN: the Amazon Resource Name (ARN) of the role to assume. You
    can find this ARN name on the Summary page
    of the role to be used on your AWS portal, for example, this role ARN could read
    like arn:aws:iam::[aws_account_number]:role/[role_name].

  • Role session name: enter the name you want to use to uniquely
    identify your assumed role session. This name can contain upper- and lower-case
    alphanumeric characters with no spaces. You can also include underscores or any of
    the following characters: =,.@-.

  • Session duration (minutes): the duration (in minutes) for which you
    want the assumed role session to be active. This duration cannot exceed the
    maximum duration which your AWS administrator has set.

The External ID
parameter is required only if your AWS administrator or the owner of
this role has defined an external ID when they set up a trust policy for
this role.

In addition, if the AWS administrator has enabled the STS endpoints for
given regions you want to use for better response performance, use the
Set STS region check box or the
Set STS endpoint check box in the
Advanced settings tab.

This check box is available only for the following distributions
Talend supports:

  • CDH 5.10 and onwards (including the dynamic
    support for the latest Cloudera distributions)

  • HDP 2.5 and onwards

This check box is also available when you are using Spark V1.6 and
onwards in the Local Spark mode in the Spark configuration tab.

Set region

Select this check box and select the region to connect
to.

This feature is available when you are using one of the
following distributions with Spark:

  • Amazon EMR V4.5 and onwards

  • MapR V5.0 and onwards

  • Hortonworks Data Platform V2.4 and
    onwards

  • Cloudera V5.8 and onwards. For Cloudera V5.8, the Spark version must be
    2.0.

  • Cloudera
    Altus

Set endpoint

Select this check box and in the Endpoint field that is displayed, enter the Amazon
region endpoint you need to use. For a list of the available endpoints, see Regions and Endpoints.

If you leave this check box clear, the endpoint used is
the default one defined by your Hadoop distribution. This check box is not
available when you have selected the Set region check box;
in that case, the value selected from the
Set region list is
used.

This feature is available when you are using one of the
following distributions with Spark:

  • Amazon EMR V4.5 and onwards

  • MapR V5.0 and onwards

  • Hortonworks Data Platform V2.4 and
    onwards

  • Cloudera V5.8 and onwards. For Cloudera V5.8, the Spark version must be
    2.0.

  • Cloudera
    Altus

Advanced settings

Set STS region and Set STS endpoint

If the AWS administrator has enabled the STS endpoints for the regions
you want to use for better response performance, select the Set STS region check box and then select
the regional endpoint to be used.

If the endpoint you want to use is not available in this regional
endpoint list, clear the Set STS region check box, then select
the Set STS endpoint check box and enter the
endpoint to be used.

This service allows you to request temporary,
limited-privilege credentials for the AWS user you authenticate; therefore, you still
need to provide the access key and secret key to authenticate the AWS account to be
used.

For a list of the STS endpoints you can use, see
AWS Security Token Service. For further information about the
STS temporary credentials, see Temporary Security Credentials. Both articles are from the AWS
documentation.

These check boxes are available only when you have selected the
Assume Role check box in the Basic settings tab.

Usage

Usage rule

This component is used with no need to be connected to other components.

You need to drop tS3Configuration along with the file system related Subjob to be
run in the same Job so that the configuration is used by the whole Job at
runtime.

Spark Connection

In the Spark
Configuration
tab in the Run
view, define the connection to a given Spark cluster for the whole Job. In
addition, since the Job expects its dependent jar files for execution, you must
specify the directory in the file system to which these jar files are
transferred so that Spark can access these files:

  • Yarn mode (Yarn client or Yarn cluster):

    • When using Google Dataproc, specify a bucket in the
      Google Storage staging bucket
      field in the Spark configuration
      tab.

    • When using HDInsight, specify the blob to be used for Job
      deployment in the Windows Azure Storage
      configuration
      area in the Spark
      configuration
      tab.

    • When using Altus, specify the S3 bucket or the Azure
      Data Lake Storage for Job deployment in the Spark
      configuration
      tab.
    • When using Qubole, add a
      tS3Configuration to your Job to write
      your actual business data in the S3 system with Qubole. Without
      tS3Configuration, this business data is
      written in the Qubole HDFS system and destroyed once you shut
      down your cluster.
    • When using on-premise
      distributions, use the configuration component corresponding
      to the file system your cluster is using. Typically, this
      system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the
    configuration component corresponding to the file system your cluster is
    using, such as tHDFSConfiguration or
    tS3Configuration.

    If you are using Databricks without any configuration component present
    in your Job, your business data is written directly in DBFS (Databricks
    Filesystem).

This connection is effective on a per-Job basis.

Limitation

Due to license incompatibility, one or more JARs required to use
this component are not provided. You can install the missing JARs for this particular
component by clicking the Install button
on the Component tab view. You can also
find out and add all missing JARs easily on the Modules tab in the
Integration
perspective of your studio. You can find more details about how to install
external modules in Talend Help Center (https://help.talend.com)
.

