tHDFSInput
Extracts the data in an HDFS file for other components to process it.
tHDFSInput reads a file located on a given Hadoop Distributed
File System (HDFS) and puts the data of interest from this file into a
Talend schema. Then it passes the data to the component that follows.
Depending on the Talend
product you are using, this component can be used in one, some or all of the following
Job frameworks:
- Standard: see tHDFSInput Standard properties. The component in this framework is available in all Talend products with Big Data and in Talend Data Fabric.
- MapReduce: see tHDFSInput MapReduce properties (deprecated). The component in this framework is available in all subscription-based Talend products with Big Data and Talend Data Fabric.
tHDFSInput Standard properties
These properties are used to configure tHDFSInput running in the Standard Job framework.
The Standard
tHDFSInput component belongs to the Big Data and the File families.
The component in this framework is available in all Talend products with Big Data
and in Talend Data Fabric.
Basic settings
Property type |
Either Built-In or Repository. Built-In: No property data stored centrally.
Repository: Select the repository file where the properties are stored. |
Schema and Edit Schema |
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. Click Edit schema to make changes to the schema. Note: If you make changes, the schema automatically becomes built-in. |
 |
Built-In: You create and store the schema locally for this component only. |
 |
Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. |
Use an existing connection |
Select this check box and, in the Component List, click the HDFS connection component from which you want to reuse the connection details already defined. Note that when a Job contains the parent Job and the child Job, the Component List presents only the connection components in the same Job level. |
Distribution |
Select the cluster you are using from the drop-down list. The options in the
list vary depending on the component you are using. Among these options, some require specific configuration.
|
Version |
Select the version of the Hadoop distribution you are using. The available versions vary depending on the distribution you have selected. |
Scheme |
Select the URI scheme of the file system to be used from the Scheme drop-down list. The schemes present on this list vary depending on the distribution you are using, and only the schemes that appear on this list with a given distribution are officially supported. Once a scheme is selected, the corresponding URI syntax is displayed in the field for the location of your file system. If you have selected ADLS, the connection parameters to be defined become the ADLS-specific ones, such as Client ID, Client key and Token endpoint. |
NameNode URI |
Type in the URI of the Hadoop NameNode, the master node of a Hadoop cluster. For example, if you have chosen a machine called masternode as the NameNode, the location is hdfs://masternode:portnumber (see the sketch after this table). |
Use kerberos authentication |
If you are accessing the Hadoop cluster running with Kerberos security, select this check box, then enter the Kerberos principal name for the NameNode in the field displayed. This enables you to use your user name to authenticate against the credentials stored in Kerberos.
This check box is available depending on the Hadoop distribution you are using. |
Use a keytab to authenticate |
Select the Use a keytab to authenticate check box to log into a Kerberos-enabled system using a given keytab file, then enter the principal to be used and the access path to the keytab file itself. Note that the user that executes a keytab-enabled Job is not necessarily the one the principal designates, but that user must have the right to read the keytab file being used. |
User name |
The User name field is available when you are not using Kerberos to authenticate. In this field, enter the user name to be used to connect to HDFS. |
Group |
Enter the membership, including the authentication user, under which the HDFS instances were started. |
File Name |
Browse to, or enter the path pointing to the data to be used in the file system. If the path you set points to a folder, this component will read all of the files stored in that folder. |
Type |
Select the type of the file to be processed. The type of the file may be Text File or Sequence File.
|
Row separator |
The separator used to identify the end of a row. This field is not available for a Sequence file. |
Field separator |
Enter a character, a string, or a regular expression to separate fields of the transferred data. This field is not available for a Sequence file. |
Header |
Set values to ignore the header of the transferred data. For example, enter 0 to ignore no rows for data without a header. This field is not available for a Sequence file. |
Custom encoding |
You may encounter encoding issues when you process the stored data. In that situation, select this check box to display the Encoding list. Then select the encoding to be used from the list, or select Custom and define it manually. This option is not available for a Sequence file. |
Compression |
Select the Uncompress the data check box to uncompress the input data. Hadoop provides different compression formats that help reduce the space needed for storing files and speed up data transfer. This option is not available for a Sequence file. |
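The following sketch is purely illustrative and is not the code the Studio generates: under assumed values, it shows roughly what the settings above (NameNode URI, Kerberos with a keytab, User name, File Name, Field separator and Header) correspond to when reading a delimited text file with the plain Hadoop client API. The host, principal, keytab path and file path are hypothetical placeholders.

```java
// Illustrative sketch only: not Talend-generated code. It maps the Basic settings of
// tHDFSInput to their rough equivalents in the plain Hadoop client API.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class HdfsInputSketch {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // NameNode URI (Basic settings > NameNode URI); host and port are placeholders.
        String nameNodeUri = "hdfs://masternode:8020";
        conf.set("fs.defaultFS", nameNodeUri);

        // Kerberos + keytab (Use kerberos authentication / Use a keytab to authenticate).
        boolean useKerberos = false;                   // toggled like the check box
        if (useKerberos) {
            conf.set("hadoop.security.authentication", "kerberos");
            UserGroupInformation.setConfiguration(conf);
            UserGroupInformation.loginUserFromKeytab(
                    "hdfs-user@EXAMPLE.COM",           // principal (hypothetical)
                    "/path/to/hdfs-user.keytab");      // keytab file (hypothetical)
        }

        // User name (simple authentication only; ignored when Kerberos is enabled).
        System.setProperty("HADOOP_USER_NAME", "talend");

        // File Name, Field separator and Header (skip the first row, as Header = 1 would).
        Path file = new Path("/user/talend/in/customers.csv");   // hypothetical path
        String fieldSeparator = ";";
        int header = 1;

        try (FileSystem fs = FileSystem.get(URI.create(nameNodeUri), conf);
             BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(file), "UTF-8"))) {
            String line;
            int lineNumber = 0;
            while ((line = reader.readLine()) != null) {
                if (++lineNumber <= header) {
                    continue;                          // ignore the header rows
                }
                String[] fields = line.split(fieldSeparator, -1);
                System.out.println(String.join(" | ", fields));
            }
        }
    }
}
```

In a real Job, tHDFSInput handles these details for you; the sketch is only meant to clarify what each Basic setting controls.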
Advanced settings
Include sub-directories if path is a directory |
Select this check box to read not only the folder you have specified in the File Name field but also the sub-folders under it. |
Hadoop properties |
Talend Studio uses a default configuration for its engine to perform operations in a Hadoop distribution. If you need to use a custom configuration in a specific situation, complete this table with the property or properties to be customized. Then at runtime, the customized property or properties will override those default ones.
For further information about the properties required by Hadoop and its related systems such
as HDFS and Hive, see the documentation of the Hadoop distribution you are using or see Apache's Hadoop documentation on http://hadoop.apache.org/docs and then select the version of the documentation you want. For demonstration purposes, a short sketch that overrides two standard HDFS client properties follows this table.
|
tStatCatcher Statistics |
Select this check box to collect log data at the component level. |
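As a rough illustration of what the Hadoop properties table does, the following sketch (assumed values, not Talend-generated code) sets two standard HDFS client properties, dfs.replication and dfs.blocksize, on a plain org.apache.hadoop.conf.Configuration object; properties set this way override the defaults loaded from the cluster configuration files.

```java
// Minimal sketch, not Talend-generated code: explicit settings on a Configuration
// object override the defaults from core-site.xml/hdfs-site.xml, which is the same
// idea as filling the Hadoop properties table above.
import org.apache.hadoop.conf.Configuration;

public class HadoopPropertiesSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();   // loads the default resources

        // Example overrides; dfs.replication and dfs.blocksize are standard HDFS
        // client properties, and the values below are placeholders.
        conf.set("dfs.replication", "2");
        conf.set("dfs.blocksize", "134217728");     // 128 MB, expressed in bytes

        System.out.println("dfs.replication = " + conf.get("dfs.replication"));
        System.out.println("dfs.blocksize   = " + conf.get("dfs.blocksize"));
    }
}
```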
Global Variables
Global Variables |
ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. A Flow variable functions during the execution of a component, while an After variable functions after the execution of the component. To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it. For further information about variables, see the Talend Studio documentation. |
Usage
Usage rule |
This component needs an output link. |
Dynamic settings |
Click the [+] button to add a row in the table and fill the Code field with a context variable to choose your connection dynamically from multiple connections planned in your Job. The Dynamic settings table is available only when the Use an existing connection check box is selected in the Basic settings view. For examples on using dynamic parameters, see Reading data from databases through context-based dynamic connections and Reading data from different MySQL databases using dynamically loaded connection parameters. For more information on Dynamic settings and context variables, see the Talend Studio documentation. |
Prerequisites |
The Hadoop distribution must be properly installed, so as to guarantee the interaction with Talend Studio.
For further information about how to install a Hadoop distribution, see the manuals corresponding to the distribution you are using. |
Limitations |
JRE 1.6+ is required. |
Using HDFS components to work with Azure Data Lake Storage (ADLS)
This scenario applies only to Talend products with Big Data.
- tFixedFlowInput: it provides sample data to the Job.
- tHDFSOutput: it writes sample data to Azure Data Lake Store.
- tHDFSInput: it reads sample data from Azure Data Lake Store.
- tLogRow: it displays the output of the Job on the console of the Run view of the Job.
Grant your application access to your ADLS Gen2
An Azure subscription is required.
- Create your Azure Data Lake Storage Gen2 account if you do not have it yet. For more details, see Create an Azure Data Lake Storage Gen2 account from the Azure documentation.
- Create an Azure Active Directory application on your Azure portal. For more details about how to do this, see the “Create an Azure Active Directory application” section in the Azure documentation: Use portal to create an Azure Active Directory application.
- Obtain the application ID, object ID and the client secret of the application to be used from the portal:
  - On the list of the registered applications, click the application you created and registered in the previous step to display its information blade.
  - Click Overview to open its blade and, from the top section of the blade, copy the Object ID and the application ID displayed as Application (client) ID. Keep them somewhere safe for later use.
  - Click Certificates & secrets to open its blade and then create the authentication key (client secret) to be used, in the Client secrets section of this blade.
- Go back to the Overview blade of the application to be used, click Endpoints at the top of this blade, copy the value of OAuth 2.0 token endpoint (v1) from the endpoint list that appears, and keep it somewhere safe for later use.
- Set the read and write permissions on the ADLS Gen2 filesystem to be used for the service principal of your application.
  It is very likely that the administrator of your Azure system has included your account and your applications in the group that has access to a given ADLS Gen2 storage account and a given ADLS Gen2 filesystem. In this case, ask your administrator to ensure that you have the proper access and then ignore this step.
  - Start your Microsoft Azure Storage Explorer and find your ADLS Gen2 storage account on the Storage Accounts list. If you have not installed Microsoft Azure Storage Explorer, you can download it from the Microsoft Azure official site.
  - Expand this account and the Blob Containers node under it, then click the ADLS Gen2 hierarchical filesystem to be used under this node. If you do not have one yet, create the filesystem to be used under the Blob Containers node in your Microsoft Azure Storage Explorer (the filesystem used here is for demonstration purposes only).
  - On the blade that is opened, click Manage Access to open its wizard.
  - At the bottom of this wizard, add the object ID of your application to the Add user or group field and click Add.
  - Select the object ID just added from the Users and groups list and select all the permissions for Access and Default.
  - Click Save to validate these changes and close this wizard.
Creating an HDFS Job in the Studio
- In the Integration perspective, drop the following components from the Palette onto the design workspace: tFixedFlowInput, tHDFSOutput, tHDFSInput and tLogRow.
- Connect tFixedFlowInput to tHDFSOutput using a Row > Main link.
- Do the same to connect tHDFSInput to tLogRow.
- Connect tFixedFlowInput to tHDFSInput using a Trigger > OnSubjobOk link.
Configuring the HDFS components to work with Azure Data Lake Storage
- Double-click tFixedFlowInput to open its Component view to provide sample data to the Job. The sample data to be used contains only one row with two columns: id and name.
- Click the […] button next to Edit schema to open the schema editor.
- Click the [+] button to add the two columns and rename them to id and name.
- Click OK to close the schema editor and validate the schema.
- In the Mode area, select Use single table. The id and the name columns automatically appear in the Value table and you can enter the values you want, within double quotation marks, in the Value column for the two schema values.
- Double-click tHDFSOutput to open its Component view.
- In the Version area, select Hortonworks or Cloudera, depending on the distribution you are using. In the Standard framework, only these two distributions are supported with ADLS by the HDFS components.
- From the Scheme drop-down list, select ADLS. The ADLS-related parameters appear in the Component view.
- In the URI field, enter the NameNode service of your application. The location of this service is actually the address of your Data Lake Store. For example, if your Data Lake Storage name is data_lake_store_name, the NameNode URI to be used is adl://data_lake_store_name.azuredatalakestore.net.
- In the Client ID and the Client key fields, enter, respectively, the authentication ID and the authentication key generated upon the registration of the application that the current Job you are developing uses to access Azure Data Lake Storage. Ensure that the application to be used has appropriate permissions to access Azure Data Lake. You can check this on the Required permissions view of this application on Azure. For further information, see the Azure documentation Assign the Azure AD application to the Azure Data Lake Storage account file or folder. This application must be the one to which you assigned permissions to access your Azure Data Lake Storage in the previous step.
- In the Token endpoint field, copy-paste the OAuth 2.0 token endpoint that you can obtain from the Endpoints list accessible on the App registrations page on your Azure portal.
- In the File name field, enter the directory to be used to store the sample data on Azure Data Lake Storage.
- From the Action drop-down list, select Create if the directory to be used does not exist yet on Azure Data Lake Storage; if this folder already exists, select Overwrite.
- Do the same configuration for tHDFSInput (a configuration sketch in plain Hadoop terms follows this list).
- If you run your Job on Windows, follow this procedure to add the winutils.exe program to your Job.
- Press F6 to run your Job.
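For reference, the Client ID, Client key, Token endpoint and NameNode URI fields used above roughly correspond to the OAuth 2.0 settings of Hadoop's Azure Data Lake (adl://) connector. The sketch below is an assumption-based illustration rather than the code generated by the Job: it uses the standard hadoop-azure-datalake property names with placeholder credentials, a placeholder store name and a placeholder path, and it requires the hadoop-azure-datalake library on the classpath.

```java
// Illustrative sketch only (not Studio-generated code): the hadoop-azure-datalake
// properties that the ADLS scheme parameters map to. All values are placeholders.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdlsConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Service-principal (client credential) authentication.
        conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential");
        conf.set("fs.adl.oauth2.client.id", "<application-client-id>");      // Client ID field
        conf.set("fs.adl.oauth2.credential", "<client-secret>");             // Client key field
        conf.set("fs.adl.oauth2.refresh.url",                                // Token endpoint field
                "https://login.microsoftonline.com/<tenant-id>/oauth2/token");

        // NameNode URI field: adl://<data_lake_store_name>.azuredatalakestore.net
        URI store = URI.create("adl://data_lake_store_name.azuredatalakestore.net");

        try (FileSystem fs = FileSystem.get(store, conf)) {
            // List the directory used for the sample data (hypothetical path).
            for (FileStatus status : fs.listStatus(new Path("/user/talend/adls_sample"))) {
                System.out.println(status.getPath());
            }
        }
    }
}
```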
tHDFSInput MapReduce properties (deprecated)
These properties are used to configure tHDFSInput running in the MapReduce Job framework.
The MapReduce
tHDFSInput component belongs to the MapReduce family.
The component in this framework is available in all subscription-based Talend products with Big Data
and Talend Data Fabric.
The MapReduce framework is deprecated from Talend 7.3 onwards. Use Talend Jobs for Apache Spark to accomplish your integration tasks.
Basic settings
Property type |
Either Built-In or Repository. Built-In: No property data stored centrally.
Repository: Select the repository file where the properties are stored. The properties are stored centrally under the Hadoop Cluster node of the Repository tree. For further information about the Hadoop Cluster node, see the Talend Studio documentation. |
Schema and Edit Schema |
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. Click Edit schema to make changes to the schema. Note: If you make changes, the schema automatically becomes built-in. |
 |
Built-In: You create and store the schema locally for this component only. |
 |
Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. |
Folder/File |
Browse to, or enter the path pointing to the data to be used in the file system. If the path you set points to a folder, this component will read all of the files stored in that folder, for example, /user/talend/in; if sub-folders exist, the sub-folders are automatically ignored unless you define the property mapreduce.input.fileinputformat.input.dir.recursive to be true in the Hadoop properties table (see the sketch after this table). If you want to specify more than one file or directory in this field, separate each path using a comma (,). If the file to be read is a compressed one, enter the file name with its extension; then this component automatically decompresses it at runtime.
Note that you need to ensure the connection to the Hadoop distribution to be used is properly configured in the Hadoop Configuration tab of the Run view. |
Die on error |
Select this check box to stop the execution of the Job when an error occurs. Clear the check box to skip any rows on error and complete the process for error-free rows. |
Type |
Select the type of the file to be processed. The type of the file may be Text File or Sequence File.
|
Row separator |
The separator used to identify the end of a row. This field is not available for a Sequence file. |
Field separator |
Enter a character, a string, or a regular expression to separate fields of the transferred data. This field is not available for a Sequence file. |
Header |
Enter the number of rows to be skipped in the beginning of the file. For example, enter 0 to ignore no rows for data without a header. This field is not available for a Sequence file. |
Custom Encoding |
You may encounter encoding issues when you process the stored data. In that situation, select this check box to display the Encoding list. Then select the encoding to be used from the list or select Custom and define it manually. This option is not available for a Sequence file. |
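To make the Folder/File behaviour described above concrete, here is a minimal sketch in plain MapReduce API terms (not Talend-generated code): comma-separated paths are registered as input paths, and the mapreduce.input.fileinputformat.input.dir.recursive property controls whether sub-folders are read. The paths and the job name are placeholders.

```java
// Minimal sketch, not Talend-generated code: comma-separated input paths and the
// recursive-input property with the standard MapReduce FileInputFormat.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class FolderFileSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Read sub-folders as well, as described for
        // mapreduce.input.fileinputformat.input.dir.recursive above.
        conf.setBoolean("mapreduce.input.fileinputformat.input.dir.recursive", true);

        Job job = Job.getInstance(conf, "folder-file-sketch");
        job.setInputFormatClass(TextInputFormat.class);

        // More than one file or directory: separate each path with a comma.
        FileInputFormat.addInputPaths(job, "/user/talend/in,/user/talend/archive");

        // Mapper, reducer and output settings would follow in a real job.
    }
}
```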
Advanced settings
Advanced separator (for number) |
Select this check box to change the separator used for numbers. By default, the thousands separator is a comma (,) and the decimal separator is a period (.). |
Trim all columns |
Select this check box to remove the leading and trailing whitespaces from all columns. When this check box is cleared, the Check column to trim table is displayed, which lets you select the particular columns to trim. |
Check column to trim |
This table is filled automatically with the schema being used. Select the check boxes corresponding to the columns to be trimmed. |
tStatCatcher Statistics |
Select this check box to collect log data at the component level. |
Global Variables
Global Variables |
ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. A Flow variable functions during the execution of a component, while an After variable functions after the execution of the component. To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it. For further information about variables, see the Talend Studio documentation. |
Usage
Usage rule |
In a Talend Map/Reduce Job, this component is used as a start component and requires a transformation component as output link. The other components used along with it must be Map/Reduce components too; they generate native Map/Reduce code that can be executed directly in Hadoop. Once a Map/Reduce Job is opened in the workspace, tHDFSInput as well as the MapReduce family appears in the Palette of the Studio. Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs, and not Map/Reduce Jobs. |
Hadoop Connection |
You need to use the Hadoop Configuration tab in the Run view to define the connection to a given Hadoop distribution for the whole Job. This connection is effective on a per-Job basis. |
Related scenarios
If you are a subscription-based Big Data user, you can consult the scenarios that use the Map/Reduce version of tHDFSInput in a Talend Map/Reduce Job.