Warning
This component will be available in the Palette of Talend Studio on the condition that you have subscribed to one of the Talend solutions with Big Data.
Component family |
Big Data/File |
|||
Function |
tHDFSList iterates on files or |
|||
Purpose |
tHDFSList retrieves a list of |
|||
Basic settings |
Property type |
Either Built-in or Repository
Built-in: No property data stored
Repository: Select the repository
Since version 5.6, both the Built-In mode and the Repository mode are |
||
Use an existing connection |
Select this check box and in the Component List click the
Note: When a Job contains the parent Job and the child Job, Component |
|||
Version |
Distribution |
Select the cluster you are using from the drop-down list. The options in the list vary
To connect to a custom distribution, select Custom, then click the button to display the dialog box in which you can
|
||
Hadoop version |
Select the version of the Hadoop distribution you are using. The available options vary
|
|||
Authentication |
Use kerberos authentication |
If you are accessing the Hadoop cluster running with Kerberos security, select this check
This check box is available depending on the Hadoop distribution you are connecting |
||
Use a keytab to authenticate |
Select the Use a keytab to authenticate check box to log
Note that the user that executes a keytab-enabled Job is not necessarily the one a |
|||
|
NameNode URI |
Type in the URI of the Hadoop NameNode. The NameNode is the master node of a Hadoop system. |
||
|
User name |
Enter the user authentication name of HDFS. |
||
Group |
Enter the membership including the authentication user under which the HDFS instances were |
|||
HDFS Directory |
Browse to, or enter, the directory in HDFS that holds the data you need to use. |
|||
|
FileList Type |
Select the type of input you want to iterate on from the
Files if the input is a set of
Directories if the input is a set
Both if the input is a set of the |
||
|
Include subdirectories |
Select this check box if the selected input source type includes |
||
|
Case Sensitive |
Set the case mode from the list to either create or not create |
||
|
Use Glob Expressions as Filemask |
This check box is selected by default. It filters the results |
||
|
Files |
Click the plus button to add as many filter lines as needed:
Filemask: in the added filter |
||
|
Order by |
The folders are listed first of all, then the files. You can
By default: alphabetical order, by
By file name: alphabetical order or
By file size: smallest to largest
By modified date: most recent to
Note: If ordering by file name, in |
||
Order action |
Select a sort order by clicking one of the following radio buttons:
ASC: ascending order;
DESC: descending order; |
|||
Advanced settings |
Use Exclude Filemask |
Select this check box to enable Exclude
Exclude Filemask: Fill in the
Note: File types in this field should be quoted with double |
||
Hadoop properties |
Talend Studio uses a default configuration for its engine to perform
For further information about the properties required by Hadoop and its related systems such
|
|||
tStatCatcher Statistics |
Select this check box to gather the Job processing metadata at a Job level as well as at each component level. |
|||
Dynamic settings |
Click the [+] button to add a row in the table and fill the
The Dynamic settings table is available only when the
For more information on Dynamic settings and context |
|||
Global Variables |
CURRENT_FILE: the current file name. This is a Flow
CURRENT_FILEDIRECTORY: the current file directory. This
CURRENT_FILEEXTENSION: the extension of the current file.
CURRENT_FILEPATH: the current file path. This is a Flow
NB_FILE: the number of files iterated upon so far. This is
ERROR_MESSAGE: the error message generated by the
A Flow variable functions during the execution of a component while an After variable
To fill up a field or expression with a variable, press Ctrl +
For further information about variables, see Talend Studio |
|||
Connections |
Outgoing links (from this component to another):
Row: Iterate
Trigger: On Subjob Ok; On Subjob
Incoming links (from one component to this one):
Row: Iterate.
Trigger: Run if; On Subjob Ok; On
For further information regarding connections, see Talend Studio |
|||
Usage |
tHDFSList provides a list of |
|||
Prerequisites |
The Hadoop distribution must be properly installed, so as to guarantee the interaction
For further information about how to install a Hadoop distribution, see the manuals |
|||
Log4j |
The activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User
For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html. |
|||
Limitation |
JRE 1.6+ is required. |
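The Files filemask, Use Glob Expressions as Filemask and Case Sensitive settings listed above can be pictured with a small, self-contained Java sketch. It is not the component's implementation: the class FilemaskSketch and its methods are hypothetical, only the * and ? wildcards are handled, and the way the case mode is applied is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Talend Studio or Hadoop.
class FilemaskSketch {

    // Translate a simple glob ('*' and '?' only) into a regular expression.
    static Pattern globToPattern(String glob, boolean caseSensitive) {
        StringBuilder regex = new StringBuilder();
        for (char c : glob.toCharArray()) {
            if (c == '*') {
                regex.append(".*");            // any run of characters
            } else if (c == '?') {
                regex.append('.');             // exactly one character
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        int flags = caseSensitive ? 0 : Pattern.CASE_INSENSITIVE;
        return Pattern.compile(regex.toString(), flags);
    }

    // Keep only the file names that match the filemask as a whole.
    static List<String> filter(List<String> names, String glob, boolean caseSensitive) {
        Pattern pattern = globToPattern(glob, caseSensitive);
        List<String> kept = new ArrayList<String>();
        for (String name : names) {
            if (pattern.matcher(name).matches()) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("a.csv", "B.CSV", "notes.txt");
        System.out.println(filter(names, "*.csv", true));   // prints [a.csv]
        System.out.println(filter(names, "*.csv", false));  // prints [a.csv, B.CSV]
        System.out.println(filter(names, "*", true));       // prints [a.csv, B.CSV, notes.txt]
    }
}
```

The last call mirrors a filemask of "*", which keeps every name in the list.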
This scenario uses a two-component Job to iterate over a specified directory in HDFS and copy the selected files to a local directory.
-
Create the files to be iterated on in the HDFS system you want to use. In this scenario, two files are created in the directory /user/ychen/data/hdfs/out.
You can design a Job in the Studio to create the two files. For further information, see tHDFSPut or tHDFSOutput.
-
In the Integration perspective of Talend Studio, create an empty Job, named HDFSList for example, from the Job Designs node in the Repository tree view.
For further information about how to create a Job, see the Talend Studio User Guide.
-
Drop tHDFSList and tHDFSGet onto the workspace.
-
Connect them using the Row > Iterate link.
-
Double-click tHDFSList to open its Component view.
-
In the Version area, select the Hadoop distribution you are connecting to and its version.
-
In the Connection area, enter the values of the parameters required to connect to HDFS.
In real-world practice, you may use tHDFSConnection to create a connection and reuse it from the current component. For further information, see tHDFSConnection.
-
In the HDFS Directory field, enter the path to the folder containing the files to be iterated on. In this example, as presented earlier, the directory is /user/ychen/data/hdfs/out/.
-
In the FileList Type field, select Files.
-
In the Files table, click the plus button to add one row and enter * between the quotation marks to iterate on all existing files.
-
Double-click tHDFSGet to open its Component view.
-
In the Version area, select the Hadoop distribution you are connecting to and its version.
-
In the Connection area, enter the values of the parameters required to connect to HDFS.
In real-world practice, you may have used tHDFSConnection to create a connection, which you can then reuse from the current component. For further information, see tHDFSConnection.
-
In the HDFS directory field, enter the path to the folder holding the files to be retrieved.
To do this with the auto-completion list, place the mouse pointer in this field, then press Ctrl+Space to display the list and select the tHDFSList_1_CURRENT_FILEDIRECTORY variable to reuse the directory you have defined in tHDFSList. In this variable, tHDFSList_1 is the label of the component. If you label it differently, select the variable accordingly.
Once this variable is selected, the directory reads, for example, ((String)globalMap.get("tHDFSList_1_CURRENT_FILEDIRECTORY")) in this field.
For further information about how to label a component, see the Talend Studio User Guide.
-
In the Local directory field, enter the path, or browse to the folder where you want to place the selected files. This folder will be created if it does not exist. In this example, it is C:/hdfsFiles.
-
In the Overwrite file field, select always.
-
In the Files table, click the plus button to add one row and enter * between the quotation marks in the Filemask column to get all existing files.
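The globalMap expression used in the HDFS directory field can be read as a plain map lookup keyed by the component label plus the variable name. The following sketch illustrates that reading under stated assumptions: the HashMap merely stands in for the globalMap object available inside a Job, the class GlobalMapSketch and its helper methods are hypothetical, and the stored value is the example directory from this scenario.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not code generated by Talend Studio.
class GlobalMapSketch {

    // Build the key as it appears in the scenario: <component label>_<variable name>.
    static String key(String componentLabel, String variable) {
        return componentLabel + "_" + variable;
    }

    // Read a String variable, mirroring ((String)globalMap.get("...")).
    static String read(Map<String, Object> globalMap, String componentLabel, String variable) {
        return (String) globalMap.get(key(componentLabel, variable));
    }

    public static void main(String[] args) {
        // Stand-in for the Job's globalMap (assumption for this sketch).
        Map<String, Object> globalMap = new HashMap<String, Object>();
        globalMap.put("tHDFSList_1_CURRENT_FILEDIRECTORY", "/user/ychen/data/hdfs/out");

        // Same lookup as the expression shown in the HDFS directory field.
        String directory = read(globalMap, "tHDFSList_1", "CURRENT_FILEDIRECTORY");
        System.out.println(directory);  // prints /user/ychen/data/hdfs/out
    }
}
```

If the tHDFSList component carries a different label, the first part of the key changes accordingly, which is why the variable must be selected to match the label.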