tHDFSConfiguration
Enables the reuse of the connection configuration to HDFS in the same
Job.
tHDFSConfiguration provides the HDFS connection information for the file-system-related components used in the same Spark Job. The Spark cluster to be used reads this configuration to connect to HDFS.
Depending on the Talend product you are using, this component can be used in one, some or all of the following Job frameworks:
- Spark Batch: see tHDFSConfiguration properties for Apache Spark Batch.
  The component in this framework is available in all subscription-based Talend products with Big Data and Talend Data Fabric.
- Spark Streaming: see tHDFSConfiguration properties for Apache Spark Streaming.
  This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.
tHDFSConfiguration properties for Apache Spark Batch
These properties are used to configure tHDFSConfiguration running in the Spark Batch Job framework.
The Spark Batch tHDFSConfiguration component belongs to the Storage family.
The component in this framework is available in all subscription-based Talend products with Big Data and Talend Data Fabric.
Basic settings
Property type |
Either Built-In or Repository. Built-In: No property data is stored centrally.
Repository: Select the repository file where the |
Distribution |
Select the cluster you are using from the drop-down list. The options in the list vary depending on the component you are using. Among these options, the following ones require specific configuration:
|
Hadoop version |
Select the version of the Hadoop distribution you are using. The available |
Use kerberos authentication |
If you are accessing a Hadoop cluster that runs with Kerberos security, select this check box, then enter the Kerberos principal name for the NameNode in the field displayed. This enables you to use your user name to authenticate against the credentials stored in Kerberos.
This check box is available depending on the Hadoop distribution you are |
Use a keytab to authenticate |
Select the Use a keytab to authenticate Note that the user that executes a keytab-enabled Job is not necessarily |
NameNode URI |
Type in the URI of the Hadoop NameNode, the master node of a |
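Before typing a value into this field, it can help to check that it has the general shape of an HDFS URI. The sketch below is illustrative only: `check_namenode_uri` is a hypothetical helper, not part of Talend Studio or Hadoop, and the host `namenode.example.com` and port `8020` are placeholder assumptions rather than values from this documentation. The port is treated as optional, since some clusters are addressed without one.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse


def check_namenode_uri(uri: str) -> Tuple[str, Optional[int]]:
    """Return (host, port) if the URI has the shape of an HDFS URI.

    Hypothetical helper for illustration; the port is None when omitted.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"expected scheme 'hdfs', got {parsed.scheme!r}")
    if not parsed.hostname:
        raise ValueError("missing NameNode host name")
    return parsed.hostname, parsed.port


# Placeholder address (assumption); replace it with your own NameNode.
host, port = check_namenode_uri("hdfs://namenode.example.com:8020")
print(host, port)
```

The check only validates the syntax of the URI; it does not contact the cluster.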
User name |
The User name field is available when you are not using |
Group |
Enter the membership including the authentication user under which the HDFS instances were |
Use datanode hostname |
Select the Use datanode hostname check box to allow the |
Hadoop properties |
Talend Studio uses a default configuration for its engine to perform operations in a Hadoop distribution. If you need to use a custom configuration in a specific situation, complete this table with the property or properties to be customized. At runtime, each customized property overrides the corresponding default.
For further information about the properties required by Hadoop and its related systems such as HDFS and Hive, see the documentation of the Hadoop distribution you are using, or see Apache’s Hadoop documentation at http://hadoop.apache.org/docs and select the version of the documentation you want. For demonstration purposes, the links to some properties are listed below:
|
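The override behavior described for this table can be pictured as a simple merge in which a customized row wins over the default with the same name. The sketch below is a minimal model of that rule, not the way Talend Studio implements it. The function `effective_properties` is hypothetical, and the default values shown are assumptions chosen for illustration; `dfs.replication` and `dfs.client.use.datanode.hostname` are standard HDFS property names.

```python
def effective_properties(defaults, custom_rows):
    """Merge custom (property, value) rows over a default configuration.

    Hypothetical model: a custom row replaces the default of the same name,
    and defaults that are not customized are kept unchanged.
    """
    merged = dict(defaults)  # copy so the defaults stay untouched
    for name, value in custom_rows:
        merged[name] = value
    return merged


# Assumed defaults, for illustration only.
defaults = {
    "dfs.replication": "3",
    "dfs.client.use.datanode.hostname": "false",
}
# One row as it might be entered in the Hadoop properties table.
custom_rows = [("dfs.client.use.datanode.hostname", "true")]

effective = effective_properties(defaults, custom_rows)
print(effective)
```

Only the customized property changes; every other default is carried over as is.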
Setup HDFS encryption configurations |
If the HDFS transparent encryption has been enabled in your cluster, select For further information about the HDFS transparent encryption and its KMS proxy, see Transparent Encryption in HDFS. |
Usage
Usage rule |
This component is used with no need to be connected to other You need to drop tHDFSConfiguration along with the file This component, along with the Spark Batch component Palette it belongs to, Note that in this documentation, unless otherwise explicitly stated, a |
Prerequisites |
The Hadoop distribution must be properly installed, so as to guarantee the interaction
For further information about how to install a Hadoop distribution, see the manuals |
Spark Connection |
In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files:
This connection is effective on a per-Job basis. |
Specific Spark timeout |
When encountering network issues, Spark by Add the following properties to
|
Related scenarios
For a related scenario, see Performing download analysis using a Spark Batch Job.
tHDFSConfiguration properties for Apache Spark Streaming
These properties are used to configure tHDFSConfiguration running in the Spark Streaming Job framework.
The Spark Streaming tHDFSConfiguration component belongs to the Storage family.
This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.
Basic settings
Property type |
Either Built-In or Repository. Built-In: No property data is stored centrally.
Repository: Select the repository file where the |
Distribution |
Select the cluster you are using from the drop-down list. The options in the list vary depending on the component you are using. Among these options, the following ones require specific configuration:
|
Hadoop version |
Select the version of the Hadoop distribution you are using. The available |
Use kerberos authentication |
If you are accessing a Hadoop cluster that runs with Kerberos security, select this check box, then enter the Kerberos principal name for the NameNode in the field displayed. This enables you to use your user name to authenticate against the credentials stored in Kerberos.
This check box is available depending on the Hadoop distribution you are |
Use a keytab to authenticate |
Select the Use a keytab to authenticate Note that the user that executes a keytab-enabled Job is not necessarily |
NameNode URI |
Type in the URI of the Hadoop NameNode, the master node of a |
User name |
The User name field is available when you are not using |
Group |
Enter the membership including the authentication user under which the HDFS instances were |
Use datanode hostname |
Select the Use datanode hostname check box to allow the |
Hadoop properties |
Talend Studio uses a default configuration for its engine to perform operations in a Hadoop distribution. If you need to use a custom configuration in a specific situation, complete this table with the property or properties to be customized. At runtime, each customized property overrides the corresponding default.
For further information about the properties required by Hadoop and its related systems such as HDFS and Hive, see the documentation of the Hadoop distribution you are using, or see Apache’s Hadoop documentation at http://hadoop.apache.org/docs and select the version of the documentation you want. For demonstration purposes, the links to some properties are listed below:
|
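Each row of this table is a name/value pair, which is the same shape Hadoop uses in its XML configuration files such as hdfs-site.xml. When comparing the table against a cluster-side file, it can be convenient to render the rows in that layout. The sketch below does this with the Python standard library. `to_hadoop_xml` is a hypothetical helper, the sample row is an assumption, and nothing here implies that Talend Studio itself produces such a file.

```python
import xml.etree.ElementTree as ET


def to_hadoop_xml(properties):
    """Render name/value pairs in the Hadoop <configuration> XML layout.

    Hypothetical helper for illustration only.
    """
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Sample row (assumption), using a standard HDFS client property name.
xml_text = to_hadoop_xml({"dfs.client.use.datanode.hostname": "true"})
print(xml_text)
```

The result is a single-line XML string with one `property` element per row.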
Setup HDFS encryption configurations |
If the HDFS transparent encryption has been enabled in your cluster, select For further information about the HDFS transparent encryption and its KMS proxy, see Transparent Encryption in HDFS. |
Usage
Usage rule |
This component is used with no need to be connected to other components. You need to drop tHDFSConfiguration along with the file This component, along with the Spark Streaming component Palette it belongs to, appears Note that in this documentation, unless otherwise explicitly stated, a scenario presents |
Prerequisites |
The Hadoop distribution must be properly installed, so as to guarantee the interaction
For further information about how to install a Hadoop distribution, see the manuals |
Spark Connection |
In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files:
This connection is effective on a per-Job basis. |
Specific Spark timeout |
When encountering network issues, Spark by Add the following properties to
|
Related scenarios
For a related scenario, see Analyzing a Twitter flow in near real-time.