Warning
This component will be available in the Palette of Talend Studio on the condition that you have subscribed to one of the Talend solutions with Big Data.
Component family
Big Data / Hadoop

Function
tHDFSGet copies files from HDFS (Hadoop Distributed File System) to a user-defined local directory, renaming them if needed.

Purpose
tHDFSGet connects to HDFS to retrieve sizeable files with optimal performance.
Basic settings

Property type
Either Built-in or Repository.
Built-in: No property data stored centrally.
Repository: Select the repository file in which the properties are stored.
Since version 5.6, both the Built-In mode and the Repository mode are available in any of the Talend solutions.

Use an existing connection
Select this check box and in the Component List click the relevant connection component to reuse the connection details you already defined.
Note: When a Job contains the parent Job and the child Job, Component List presents only the connection components in the same Job level.
Version

Distribution
Select the cluster you are using from the drop-down list. The options in the list vary depending on the component you are using.
In order to connect to a custom distribution, once selecting Custom, click the [...] button to display the dialog box in which you can import the configuration required by that custom distribution.

Hadoop version
Select the version of the Hadoop distribution you are using. The available options vary depending on the distribution you selected.
Use kerberos authentication
If you are accessing the Hadoop cluster running with Kerberos security, select this check box, then enter the Kerberos principal name in the field that appears. This check box is available depending on the Hadoop distribution you are connecting to.

Use a keytab to authenticate
Select the Use a keytab to authenticate check box to log in using a given keytab file. Note that the user that executes a keytab-enabled Job is not necessarily the one a principal designates but must have the right to read the keytab file being used.
Connection

NameNode URI
Type in the URI of the Hadoop NameNode. The NameNode is the master node of a Hadoop system.
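The NameNode URI typically takes the form hdfs://<host>:<port>. A quick way to sanity-check such a URI before pasting it into the field, sketched in Python (the host name masternode and port 8020 below are illustrative values, not required settings):

```python
from urllib.parse import urlparse

def check_namenode_uri(uri):
    """Validate that a string looks like an HDFS NameNode URI.

    Returns the (host, port) pair, or raises ValueError.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "hdfs":
        raise ValueError("expected an hdfs:// URI, got: %s" % uri)
    if not parsed.hostname:
        raise ValueError("missing NameNode host in: %s" % uri)
    return parsed.hostname, parsed.port

# 'masternode' and 8020 are illustrative, not required settings.
host, port = check_namenode_uri("hdfs://masternode:8020")
print(host, port)  # masternode 8020
```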
User name
Enter the user authentication name of HDFS.
Group
Enter the membership including the authentication user under which the HDFS instances were started.
HDFS directory
Browse to, or enter the directory in HDFS where the data you need to use is stored.
Local directory
Browse to, or enter the local directory to store the files retrieved from HDFS.
Overwrite file
Options to overwrite or not the existing file with the new one.
Append
Select this check box to add the new rows at the end of the existing file instead of replacing it.
Include subdirectories
Select this check box if the selected input source type includes subdirectories.
Files
In the Files area, the fields to be completed are:
– File mask: type in the file name or file mask using wildcard characters or regular expressions.
– New name: give a new name to the retrieved file.
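A file mask such as *.txt selects files by shell-style wildcard matching. The behavior can be sketched with Python's fnmatch, which implements the same wildcard semantics (the directory listing below is made up for illustration):

```python
from fnmatch import fnmatch

def select_files(names, mask):
    """Return the file names that match a shell-style mask such as '*.txt'."""
    return [n for n in names if fnmatch(n, mask)]

# Hypothetical directory listing, for illustration only.
listing = ["in.txt", "notes.txt", "data.csv", "readme.md"]
print(select_files(listing, "*.txt"))  # ['in.txt', 'notes.txt']
```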
Die on error
This check box is selected by default. Clear the check box to skip the rows on error and complete the process for error-free rows.
Advanced settings

tStatCatcher Statistics
Select this check box to collect log data at the component level.
Hadoop properties
Talend Studio uses a default configuration for its engine to perform operations in a Hadoop distribution. If you need a custom configuration in a specific situation, complete this table with the property or properties to be customized; they then override the corresponding defaults at runtime.
For further information about the properties required by Hadoop and its related systems such as HDFS, see the documentation of the Hadoop distribution you are using or the Apache Hadoop documentation.
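The override semantics of this table can be sketched as a merge where the custom entries win over the cluster defaults (dfs.replication and dfs.blocksize are standard HDFS property keys, but the values below are illustrative):

```python
def effective_config(defaults, overrides):
    """Merge custom Hadoop properties over the cluster defaults; overrides win."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged

# Standard HDFS keys; the values are illustrative, not recommendations.
defaults = {"dfs.replication": "3", "dfs.blocksize": "134217728"}
overrides = {"dfs.replication": "1"}
print(effective_config(defaults, overrides))
```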
Dynamic settings
Click the [+] button to add a row in the table and fill the Code field with a context variable to choose your HDFS connection dynamically from multiple connections planned in your Job. The Dynamic settings table is available only when the Use an existing connection check box is selected in the Basic settings view. For more information on Dynamic settings and context variables, see Talend Studio User Guide.
Global Variables
NB_FILE: the number of files processed. This is an After variable and it returns an integer.
CURRENT_STATUS: the execution result of the component. This is a Flow variable and it returns a string.
TRANSFER_MESSAGES: file transferred information. This is an After variable and it returns a string.
ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string.
A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.
To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.
For further information about variables, see Talend Studio User Guide.
Usage
This component combines HDFS connection and data extraction, thus it is usually used as a one-component subjob. Different from the tHDFSInput and the tHDFSOutput components, which handle data flows row by row, tHDFSGet transfers files as a whole. It is often connected to the Job using an OnSubjobOk or OnComponentOk link, depending on the context.
Prerequisites
The Hadoop distribution must be properly installed, so as to guarantee the interaction with Talend Studio.
For further information about how to install a Hadoop distribution, see the manuals corresponding to the Hadoop distribution you are using.
Log4j
The activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User Guide. For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.
Limitations |
JRE 1.6+ is required. |
Scenario
The following scenario describes a simple Job that creates a file in a defined directory, puts it into HDFS, retrieves it from HDFS into another local directory, and reads it at the end of the Job.
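The Job's round trip can be sketched outside the Studio with local directories standing in for HDFS (the directory names mirror the scenario, but all paths below are temporary stand-ins; a real Job transfers through HDFS via tHDFSPut and tHDFSGet, not shutil):

```python
import shutil
import tempfile
from pathlib import Path

def round_trip(content):
    """Write a file, 'put' it to a staging dir, 'get' it back, and read it.

    Local directories stand in for HDFS here, purely for illustration.
    """
    base = Path(tempfile.mkdtemp())
    put_dir = base / "putFile"    # mirrors C:/hadoopfiles/putFile/
    hdfs_dir = base / "testFile"  # mirrors the /testFile HDFS directory
    get_dir = base / "getFile"    # mirrors C:/hadoopfiles/getFile/
    for d in (put_dir, hdfs_dir, get_dir):
        d.mkdir()

    (put_dir / "in.txt").write_text(content)              # tFixedFlowInput + tFileOutputDelimited
    shutil.copy(put_dir / "in.txt", hdfs_dir / "in.txt")  # tHDFSPut
    shutil.copy(hdfs_dir / "in.txt", get_dir / "in.txt")  # tHDFSGet
    return (get_dir / "in.txt").read_text()               # tFileInputDelimited + tLogRow

print(round_trip("Hello world!"))  # Hello world!
```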
- Drop the following components from the Palette onto the design workspace: tFixedFlowInput, tFileOutputDelimited, tHDFSPut, tHDFSGet, tFileInputDelimited and tLogRow.
- Connect tFixedFlowInput to tFileOutputDelimited using a Row > Main connection.
- Connect tFileInputDelimited to tLogRow using a Row > Main connection.
- Connect tFixedFlowInput to tHDFSPut using an OnSubjobOk connection.
- Connect tHDFSPut to tHDFSGet using an OnSubjobOk connection.
- Connect tHDFSGet to tFileInputDelimited using an OnSubjobOk connection.
- Double-click tFixedFlowInput to define the component in its Basic settings view.
- Set the Schema to Built-In and click the three-dot […] button next to Edit Schema to describe the data structure you want to create from internal variables. In this scenario, the schema contains one column: content.
- Click the plus button to add the parameter line.
- Click OK to close the dialog box and accept to propagate the changes when prompted by the studio.
- In Basic settings, define the corresponding value in the Mode area using the Use Single Table option. In this scenario, the value is “Hello world!”.
- Double-click tFileOutputDelimited to define the component in its Basic settings view.
- Click the […] button next to the File Name field and browse to the output file you want to write data in, in.txt in this example.
- Double-click tHDFSPut to define the component in its Basic settings view.
- Select, for example, Apache 0.20.2 from the Hadoop version list.
- In the NameNode URI, the Username and the Group fields, enter the connection parameters to the HDFS.
- Next to the Local directory field, click the three-dot […] button to browse to the folder with the file to be loaded into the HDFS. In this scenario, the directory has been specified while configuring tFileOutputDelimited: C:/hadoopfiles/putFile/.
- In the HDFS directory field, type in the intended location in HDFS to store the file to be loaded. In this example, it is /testFile.
- Click the Overwrite file field to stretch the drop-down.
- From the menu, select always.
- In the Files area, click the plus button to add a row in which you define the file to be loaded.
- In the File mask column, enter *.txt to replace newLine between quotation marks and leave the New name column as it is. This allows you to extract all the .txt files in the specified directory without changing their names. In this example, the file is in.txt.
- Double-click tHDFSGet to define the component in its Basic settings view.
- Select, for example, Apache 0.20.2 from the Hadoop version list.
- In the NameNode URI, the Username and the Group fields, enter the connection parameters to the HDFS.
- In the HDFS directory field, type in the location storing the loaded file in HDFS. In this example, it is /testFile.
- Next to the Local directory field, click the three-dot […] button to browse to the folder intended to store the files that are extracted out of the HDFS. In this scenario, the directory is: C:/hadoopfiles/getFile/.
- Click the Overwrite file field to stretch the drop-down.
- From the menu, select always.
- In the Files area, click the plus button to add a row in which you define the file to be extracted.
- In the File mask column, enter *.txt to replace newLine between quotation marks and leave the New name column as it is. This allows you to extract all the .txt files from the specified directory in the HDFS without changing their names. In this example, the file is in.txt.
- Double-click tFileInputDelimited to define the component in its Basic settings view.
- Set the Property Type to Built-In.
- Next to the File Name/Stream field, click the three-dot button to browse to the file you have obtained from the HDFS. In this scenario, the directory is C:/hadoopfiles/getFile/in.txt.
- Set Schema to Built-In and click Edit schema to define the data to pass on to the tLogRow component.
- Click the plus button to add a new column.
- Click OK to close the dialog box and accept to propagate the changes when prompted by the studio.
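At run time, tFileInputDelimited parses the retrieved file row by row and tLogRow prints each row to the console. The parsing step can be sketched in Python (the single-column content follows the scenario; the ";" delimiter is an assumption standing in for the component's configured field separator):

```python
import csv
import io

def read_delimited(text, delimiter=";"):
    """Parse delimited text into rows, as tFileInputDelimited would."""
    return [row for row in csv.reader(io.StringIO(text), delimiter=delimiter)]

# The file holds the single value written by tFixedFlowInput.
rows = read_delimited("Hello world!\n")
print(rows)  # [['Hello world!']]
```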