August 17, 2023

tFileInputDelimited – Docs for ESB 5.x

tFileInputDelimited

tFileInputDelimited.png

tFileInputDelimited properties

Component family

File/Input

 

Function

tFileInputDelimited reads a given
file row by row and splits each row into fields using a simple separator.

If you have subscribed to one of the Talend solutions with Big Data, you can
use this component in a Talend Map/Reduce Job to generate
Map/Reduce code. For further information, see tFileInputDelimited in Talend
Map/Reduce Jobs.

Purpose

Opens a file and reads it row by row, splits each row into fields,
then sends the fields as defined in the Schema to the next Job
component via a Row link.

Basic settings

Property type

Either Built-in or Repository.

 

 

Built-in: No property data stored
centrally.

 

 

Repository: Select the repository
file where the properties are stored. The fields that follow are
completed automatically using the data retrieved.

 

File Name/Stream

File name: Name and path of the
file to be processed.

Stream: The data flow to be
processed. The data must be added to the flow so that tFileInputDelimited can fetch it
via the corresponding representative variable.

This variable may already be pre-defined in your Studio or
provided by the context or the components you are using along with
this component; otherwise, you can define it manually and use it
according to the design of your Job, for example, using tJava or tJavaFlex.
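As a rough illustration of defining the variable manually, the sketch below shows the kind of code that could sit in a tJava component to publish a stream, and the cast used to read it back. The key name MY_INPUT_STREAM and the class StreamVariableSketch are hypothetical, and the map declared here only stands in for the globalMap that a generated Job provides.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class StreamVariableSketch {
    // Stand-in for the globalMap of a generated Job (assumption of this sketch).
    static final Map<String, Object> globalMap = new HashMap<>();

    // What a tJava component might do: store a stream under a hypothetical key.
    static void publishStream(String content) {
        InputStream in = new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
        globalMap.put("MY_INPUT_STREAM", in);
    }

    // What a consumer does: fetch the stream back with a cast and read its first row.
    static String readFirstRow() {
        InputStream in = (InputStream) globalMap.get("MY_INPUT_STREAM");
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        publishStream("1;Alice\n2;Bob\n");
        System.out.println(readFirstRow());
    }
}
```

With such a variable in place, the File name/Stream field would hold an expression of the form ((java.io.InputStream)globalMap.get("MY_INPUT_STREAM")).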

To avoid typing the variable by hand, you can
select it from the auto-completion list
(Ctrl+Space) to fill the
current field, provided that the variable has been properly
defined.

For a related topic on the available variables, see Talend Studio User Guide.

 

Row separator

Enter the separator used to identify the end of a row.

 

Field separator

Enter a character, a string, or a regular expression used to separate fields in the transferred
data.
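The difference between a literal separator and a regular expression can be sketched in plain Java as follows. This is not the component's generated code; the class and method names are hypothetical and only illustrate why a character such as | needs quoting when it is meant literally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class FieldSplitSketch {
    // Split on a literal separator; Pattern.quote stops characters such as '|'
    // from being interpreted as regular-expression syntax.
    static List<String> splitLiteral(String row, String separator) {
        return Arrays.asList(row.split(Pattern.quote(separator), -1));
    }

    // Split on a regular expression; the -1 limit keeps trailing empty fields.
    static List<String> splitRegex(String row, String regex) {
        return Arrays.asList(row.split(regex, -1));
    }

    public static void main(String[] args) {
        System.out.println(splitLiteral("1|Alice||NY", "|"));
        System.out.println(splitRegex("a , b,c", "\\s*,\\s*"));
    }
}
```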

 

CSV options

Select this check box to include CSV-specific parameters such as
Escape char and Text enclosure.
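To show what the two parameters mean in practice, here is a minimal sketch of a single-row parser that honors a text enclosure and an escape character. It is an illustration only, not the parser used by the component: the class name is hypothetical, and it assumes fields never span several rows.

```java
import java.util.ArrayList;
import java.util.List;

class CsvOptionsSketch {
    // Splits one row on `sep`, treating separators inside the enclosure as data.
    // An escape character directly followed by the enclosure yields a literal enclosure.
    static List<String> parse(String row, char sep, char enclosure, char escape) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean enclosed = false;
        for (int i = 0; i < row.length(); i++) {
            char c = row.charAt(i);
            boolean hasNext = i + 1 < row.length();
            if (enclosed && c == escape && hasNext && row.charAt(i + 1) == enclosure) {
                current.append(enclosure); // escaped enclosure inside an enclosed field
                i++;
            } else if (c == enclosure) {
                enclosed = !enclosed;      // opening or closing enclosure, not part of the value
            } else if (c == sep && !enclosed) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("1;\"Doe; John\";\"say \"\"hi\"\"\"", ';', '"', '"'));
    }
}
```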

 

Header

Enter the number of rows to be skipped at the beginning of the file.

Note

When using the dynamic schema feature, the first row of the
input file will be read as the header row whether the Header value is set to 0 or to 1. If
you want to use another row as the header row, set the Header value accordingly.

For further information about dynamic schemas, see
Talend Studio User Guide.

 

Footer

Number of rows to be skipped at the end of the file.

 

Limit

Maximum number of rows to be processed. If Limit = 0, no row is
read or processed.
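The combined effect of Header, Footer and Limit can be pictured with the following sketch, which works on rows already held in memory. The class name is hypothetical, and treating a null limit as "no limit" is an assumption of this example rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RowWindowSketch {
    // Skips `header` leading rows and `footer` trailing rows, then keeps at most `limit` rows.
    // A null limit means "no limit" here (assumption); a limit of 0 yields no rows.
    static List<String> select(List<String> rows, int header, int footer, Integer limit) {
        int start = Math.min(header, rows.size());
        int end = Math.max(start, rows.size() - footer);
        List<String> selected = new ArrayList<>(rows.subList(start, end));
        if (limit != null && limit >= 0 && selected.size() > limit) {
            return new ArrayList<>(selected.subList(0, limit));
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("header", "a", "b", "c", "footer");
        System.out.println(select(rows, 1, 1, 2));
    }
}
```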

 

Schema and Edit Schema

A schema is a row description. It defines the number of fields to
be processed and passed on to the next component. The schema is
either Built-in or stored remotely
in the Repository.

Click Edit schema to make changes to the schema. If the
current schema is of the Repository type, three options are
available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this option
    to change the schema to Built-in for local
    changes.

  • Update repository connection: choose this option to change
    the schema stored in the repository and decide whether to propagate the changes to
    all the Jobs upon completion. If you just want to propagate the changes to the
    current Job, you can select No upon completion and
    choose this schema metadata again in the [Repository Content]
    window.

Note that if the input value of any non-nullable primitive field is null, the row of
data including that field will be rejected.

This component offers the advantage of the dynamic schema feature. This allows you to
retrieve unknown columns from source files or to copy batches of columns from a source
without mapping each column individually. For further information about dynamic schemas,
see Talend Studio
User Guide.

This dynamic schema feature is designed for retrieving unknown columns
of a table and is recommended for this purpose only; it is not recommended
for creating tables.

Warning

When using the dynamic schema feature, the dynamic column does
not contain the actual column names of the input file. If you
want your output flow to include the actual column names, make
sure that your input file has a header row and the Header value is set properly.

 

 

Built-in: The schema will be
created and stored locally for this component only. Related topic:
see Talend Studio User Guide.

 

 

Repository: The schema already
exists and is stored in the Repository, hence can be reused in
various projects and Job flowcharts. Related topic: see
Talend Studio User Guide.

 

Skip empty rows

Select this check box to skip the empty rows.

 

Uncompress as zip file

Select this check box to uncompress the input file.
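For readers who want to picture what reading from a zip archive involves, the sketch below reads the first entry of an archive as text using the standard java.util.zip classes. It assumes the delimited file is the first entry and is UTF-8 encoded; the class and helper names are hypothetical and this is not the component's own code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipInputSketch {
    // Reads the first entry of a zip archive as UTF-8 text (empty string if there is none).
    static String readFirstEntry(InputStream zipped) {
        try (ZipInputStream zin = new ZipInputStream(zipped)) {
            if (zin.getNextEntry() == null) {
                return "";
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = zin.read(buffer)) != -1) { // -1 marks the end of the current entry
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: builds a one-entry archive in memory.
    static byte[] zipOf(String entryName, String content) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
            zout.putNextEntry(new ZipEntry(entryName));
            zout.write(content.getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] archive = zipOf("in.csv", "1;Alice\n2;Bob\n");
        System.out.print(readFirstEntry(new ByteArrayInputStream(archive)));
    }
}
```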

 

Die on error

Select this check box to stop the execution of the Job when an error occurs.

Clear the check box to skip any rows on error and complete the process for error-free rows.
When errors are skipped, you can collect the rows on error using a Row > Reject
link.

To catch the FileNotFoundException, you also need to
select this check box.
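The two behaviors of Die on error can be sketched as follows: either the first bad value stops processing, or bad values are set aside while the valid ones continue. The class, the Result holder and the choice of an integer column are all hypothetical and serve only as an illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RejectFlowSketch {
    // Holds the two output flows of this illustration.
    static final class Result {
        final List<Integer> accepted = new ArrayList<>();
        final List<String> rejects = new ArrayList<>();
    }

    // Parses each raw value as an integer. With dieOnError the first failure is rethrown;
    // otherwise the offending value goes to the reject list and processing continues.
    static Result parseAges(List<String> rawValues, boolean dieOnError) {
        Result result = new Result();
        for (String raw : rawValues) {
            try {
                result.accepted.add(Integer.parseInt(raw.trim()));
            } catch (NumberFormatException e) {
                if (dieOnError) {
                    throw e;
                }
                result.rejects.add(raw);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result result = parseAges(Arrays.asList("31", "abc", "45"), false);
        System.out.println(result.accepted + " / rejected: " + result.rejects);
    }
}
```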

Advanced settings

Advanced separator (for numbers)

Select this check box to modify the separators used for
numbers:

Thousands separator: define
separators for thousands.

Decimal separator: define
separators for decimals.
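One simple way to picture custom number separators is to normalize the raw text before parsing it, as in the sketch below. The class name is hypothetical and this is only one possible approach, not necessarily how the component parses numbers.

```java
import java.math.BigDecimal;

class NumberSeparatorSketch {
    // Removes the thousands separator, then rewrites the decimal separator as '.'
    // so that BigDecimal can parse the result.
    static BigDecimal parse(String raw, String thousands, String decimal) {
        String normalized = raw.trim().replace(thousands, "").replace(decimal, ".");
        return new BigDecimal(normalized);
    }

    public static void main(String[] args) {
        // European-style input: '.' for thousands, ',' for decimals.
        System.out.println(parse("1.234.567,89", ".", ","));
    }
}
```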

 

Extract lines at random

Select this check box to set the number of lines to be extracted
randomly.

 

Encoding

Select the encoding from the list or select Custom and
define it manually. This field is compulsory for database data handling.
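Why the encoding matters can be seen with a few bytes decoded under two different charsets. The sketch uses only standard Java classes; the class name is hypothetical.

```java
import java.nio.charset.Charset;

class EncodingSketch {
    // Decodes raw file bytes with the named charset.
    static String decode(byte[] data, String charsetName) {
        return new String(data, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "Caf" followed by e-acute, encoded as ISO-8859-1 (one byte per character).
        byte[] latin1 = {0x43, 0x61, 0x66, (byte) 0xE9};
        System.out.println(decode(latin1, "ISO-8859-1")); // decoded as intended
        System.out.println(decode(latin1, "UTF-8"));      // last byte is not valid UTF-8
    }
}
```

Decoding with the wrong charset does not fail; the invalid byte is silently replaced, which is why a mismatch usually shows up as corrupted characters rather than as an error.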

 

Trim all column

Select this check box to remove leading and trailing
whitespace from all columns. When this check box is cleared, the
Check column to trim table is
displayed, which lets you select particular columns to trim.

 

Check each row structure against schema

Select this check box to check whether the total number of columns
in each row is consistent with the schema. If not consistent, an
error message will be displayed on the console.

 

Check date

Select this check box to check the date format strictly against the input schema.
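As an illustration of what a strict date check means, the sketch below uses java.time with a strict resolver, so that impossible dates and loosely formatted values are refused. The class name is hypothetical, and the pattern letters (uuuu for the year) follow java.time conventions, which may differ from the date patterns entered in a schema.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

class StrictDateSketch {
    // Returns true only if `raw` matches `pattern` exactly and denotes a real calendar date.
    static boolean matches(String raw, String pattern) {
        DateTimeFormatter formatter =
                DateTimeFormatter.ofPattern(pattern).withResolverStyle(ResolverStyle.STRICT);
        try {
            LocalDate.parse(raw, formatter);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(matches("2023-02-28", "uuuu-MM-dd"));
        System.out.println(matches("2023-02-30", "uuuu-MM-dd"));
    }
}
```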

 

Check columns to trim

This table is filled automatically with the schema being used. Select the check box(es)
corresponding to the column(s) to be trimmed.

 

Split row before field

Select this check box to split rows before splitting
fields.

 

Permit hexadecimal (0xNNN) or octal (0NNNN) for numeric types

Select this check box if any of your numeric types (long, integer, short, or byte)
will be parsed from a hexadecimal or octal string.

In the table that displays, select the check box next to the
column or columns of interest to transform the input string of each
selected column to the type defined in the schema.

Select the Permit hexadecimal or octal check box to select all the columns.

This table appears only when the Permit hexadecimal (0xNNN) or octal (0NNNN) for numeric types
check box is selected.
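The 0xNNN and 0NNNN notations correspond to what the standard Long.decode method accepts in Java, whereas Long.parseLong reads digits as decimal only. The sketch below contrasts the two; the class name is hypothetical and it does not claim to reproduce the component's generated code.

```java
class DecodeSketch {
    // Accepts decimal, hexadecimal (0x...) and octal (leading 0) notations.
    static long decodeLong(String raw) {
        return Long.decode(raw.trim());
    }

    public static void main(String[] args) {
        System.out.println(decodeLong("0x1F"));      // hexadecimal
        System.out.println(decodeLong("017"));       // octal
        System.out.println(Long.parseLong("017"));   // plain decimal parsing ignores the leading zero
    }
}
```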

 

tStatCatcher Statistics

Select this check box to gather the processing metadata at the Job
level as well as at each component level.

Global Variables

NB_LINE: the number of rows processed. This is an After
variable and it returns an integer.

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl+Space
to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio
User Guide.
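As a hedged sketch of how an After variable might be read in a later component, the example below looks up NB_LINE and ERROR_MESSAGE in a map. The key format (component name, underscore, variable name) mirrors the tFileFetch_1_INPUT_STREAM variable used elsewhere in this document but is an assumption here, as are the class name and the stand-in globalMap.

```java
import java.util.HashMap;
import java.util.Map;

class GlobalVariableSketch {
    // Stand-in for the globalMap of a generated Job (assumption of this sketch).
    static final Map<String, Object> globalMap = new HashMap<>();

    // Builds a short report from the After variables of the named component.
    static String summary(String componentName) {
        Integer nbLine = (Integer) globalMap.get(componentName + "_NB_LINE");
        String error = (String) globalMap.get(componentName + "_ERROR_MESSAGE");
        String text = componentName + " read " + (nbLine == null ? 0 : nbLine) + " rows";
        return error == null ? text : text + " (last error: " + error + ")";
    }

    public static void main(String[] args) {
        globalMap.put("tFileInputDelimited_1_NB_LINE", 50);
        System.out.println(summary("tFileInputDelimited_1"));
    }
}
```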

Usage

Use this component to read a file and separate the fields it contains
using a defined separator. It lets you create a data
flow using a Row > Main link, or
a Row > Reject link, in which
case the flow carries the rows that do not correspond to the
type defined. For further information, see Scenario 2: Extracting correct and erroneous data from an XML field in a delimited
file.

Log4j

The activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User
Guide.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.

Limitation

Due to license incompatibility, one or more JARs required to use this component are not
provided. You can install the missing JARs for this particular component by clicking the
Install button on the Component tab view. You can also find and add all missing JARs easily on
the Modules tab in the Integration perspective
of your studio. For details, see https://help.talend.com/display/KB/How+to+install+external+modules+in+the+Talend+products
or the section describing how to configure the Studio in the Talend Installation and Upgrade
Guide.

tFileInputDelimited in Talend Map/Reduce Jobs

Warning

The information in this section is only for users that have subscribed to one of
the Talend solutions with Big Data and is not applicable to
Talend Open Studio for Big Data users.

In a Talend Map/Reduce Job, tFileInputDelimited, as well as the whole Map/Reduce Job using it,
generates native Map/Reduce code. This section presents the specific properties of
tFileInputDelimited when it is used in that
situation. For further information about a Talend Map/Reduce Job, see the Talend Big Data Getting Started Guide.

Component family

MapReduce / Input

 

Basic settings

Property type

Either Built-in or Repository.

   

Built-in: no property data stored
centrally.

   

Repository: reuse properties
stored centrally under the Hadoop Cluster
node of the Repository tree.

The fields that come after are pre-filled using the fetched
data.

For further information about the Hadoop Cluster
node, see the Getting Started Guide.

 

Schema and Edit Schema

A schema is a row description. It defines the number of fields to be processed and passed on
to the next component. The schema is either Built-In or
stored remotely in the Repository.

Click Edit Schema to make changes to the schema. Note that if you make changes, the schema automatically becomes
built-in.

   

Built-In: You create and store the schema locally for this
component only. Related topic: see Talend Studio
User Guide.

   

Repository: You have already created the schema and
stored it in the Repository. You can reuse it in various projects and Job designs. Related
topic: see Talend Studio User Guide.

 

Folder/File

Browse to, or enter, the HDFS directory that holds the data to be used.

If the path you set points to a folder, this component will read
all of the files stored in that folder, for example, /user/talend/in; if sub-folders exist,
the sub-folders are automatically ignored unless you define the path
like /user/talend/in/*.

If you want to specify more than one file or directory in this
field, separate each path using a comma (,).

If the file to be read is compressed, enter the file name
with its extension; the component then
automatically decompresses it at runtime. The supported compression
formats and their corresponding extensions are:

  • DEFLATE: *.deflate

  • gzip: *.gz

  • bzip2: *.bz2

  • LZO: *.lzo
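The list above amounts to a lookup from file extension to compression format, which can be sketched as follows. The class and method names are hypothetical, and the actual decompression performed at runtime is not shown.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class CompressionSketch {
    // Extension-to-format table taken from the list above.
    static final Map<String, String> FORMATS = new LinkedHashMap<>();
    static {
        FORMATS.put(".deflate", "DEFLATE");
        FORMATS.put(".gz", "gzip");
        FORMATS.put(".bz2", "bzip2");
        FORMATS.put(".lzo", "LZO");
    }

    // Returns the format implied by the file extension, or "none" if it is not listed.
    static String formatFor(String path) {
        String lower = path.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : FORMATS.entrySet()) {
            if (lower.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return "none";
    }

    public static void main(String[] args) {
        System.out.println(formatFor("/user/talend/in/data.csv.gz"));
    }
}
```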

Note that you need
to ensure you have properly configured the connection to the Hadoop
distribution to be used in the Hadoop configuration
tab in the Run view.

 

Die on error

Clear the check box to skip any rows on error and complete the process for error-free rows.
When errors are skipped, you can collect the rows on error using a Row > Reject
link.

 

Row separator

Enter the separator used to identify the end of a row.

 

Field separator

Enter a character, a string, or a regular expression used to separate fields in the transferred
data.

 

Header

Enter the number of rows to be skipped at the beginning of the file.

 

CSV options

Select this check box to include CSV-specific parameters such as
Escape char and Text enclosure.

 

Skip empty rows

Select this check box to skip the empty rows.

Advanced settings

Custom Encoding

You may encounter encoding issues when you process the stored data. In that situation, select
this check box to display the Encoding list.

Then select the encoding to be used from the list or select
Custom and define it
manually.

Advanced separator (for number)

Select this check box to change the separator used for numbers. By
default, the thousands separator is a comma (,) and the decimal separator is a period (.).

 

Trim all columns

Select this check box to remove leading and trailing
whitespace from all columns. When this check box is cleared, the
Check column to trim table is
displayed, which lets you select particular columns to trim.

 

Check column to trim

This table is filled automatically with the schema being used. Select the check box(es)
corresponding to the column(s) to be trimmed.

 

Check each row structure against schema

Select this check box to check whether the total number of columns
in each row is consistent with the schema. If not consistent, an
error message will be displayed on the console.

 

Check date

Select this check box to check the date format strictly against the input schema.

 

Decode String for long, int, short, byte Types

Select this check box if any of your numeric types (long, integer, short, or byte)
will be parsed from a hexadecimal or octal string.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl+Space
to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio
User Guide.

Usage

In a Talend Map/Reduce Job, this component is used as a start component and requires
a transformation component on its output link. The other components used along with it must be
Map/Reduce components, too. They generate native Map/Reduce code that can be executed
directly in Hadoop.

Once a Map/Reduce Job is opened in the workspace, tFileInputDelimited as well as the
MapReduce family appears in the Palette of the Studio.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents
only Standard Jobs, that is to say traditional Talend data
integration Jobs, and not Map/Reduce Jobs.

Hadoop Connection

You need to use the Hadoop Configuration tab in the
Run view to define the connection to a given Hadoop
distribution for the whole Job.

This connection is effective on a per-Job basis.

Scenario: Delimited file content display

The following scenario creates a two-component Job, which aims at reading each row of
a file, selecting delimited data and displaying the output in the Run log console.

UseCase_tFileInputDelimited1.png

Dropping and linking components

  1. Drop a tFileInputDelimited component and
    a tLogRow component from the Palette to the design workspace.

  2. Right-click on the tFileInputDelimited
    component and select Row > Main. Then drag it onto the tLogRow component and release when the plug symbol shows
    up.

Configuring the components

  1. Select the tFileInputDelimited component
    again, and define its Basic settings:

    UseCase_tFileInputDelimited.png
  2. Fill in a path to the file in the File Name
    field. This field is mandatory.

    Warning

    If the path of the file contains accented characters, you will
    get an error message when executing your Job. For more information
    regarding the procedures to follow when the support of accented
    characters is missing, see the Talend Installation
    and Upgrade Guide of the Talend
    Solution you are using.

  3. Define the Row separator used to
    identify the end of a row. Then define the Field separator
    used to delimit fields in a row.

  4. In this scenario, the header and footer limits are not set, and the
    Limit number of processed rows is set
    to 50.

  5. Set the Schema as either local
    (Built-in) or remotely managed
    (Repository) to define the data to pass
    on to the tLogRow component.

  6. You can load and/or edit the schema via the Edit Schema
    function.

    Related topics: see Talend Studio User Guide.

  7. Enter the encoding standard the input file is encoded in. This setting is
    meant to ensure encoding consistency throughout all input and output
    files.

  8. Select the tLogRow and define the
    Field separator to use for the output
    display. Related topic: tLogRow.

  9. Select the Print schema column name in front of each value
    check box to retrieve the column labels in the output
    displayed.

Saving and executing the Job

  1. Press Ctrl+S to save your Job.

  2. Go to the Run tab and click Run to execute the Job.

    The file is read row by row and the extracted fields are displayed on the
    Run log as defined in both components'
    Basic settings.

    UseCase_tFileInputDelimited3.png

    The Log sums up all parameters in a header followed by the result of the
    Job.
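Outside the Studio, the logic of this two-component Job can be approximated with the plain Java sketch below: read rows, split them on the field separator, stop at the limit, and print each value prefixed by its column name. The class name, the column names ID and Name, and the exact output layout are assumptions made for illustration; the code generated by the Studio is different.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class DelimitedDisplaySketch {
    // Reads at most `limit` rows and formats each one as "column: value|column: value".
    static List<String> display(Reader source, String fieldSeparator, String[] columns, int limit) {
        List<String> output = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String row;
            while (output.size() < limit && (row = reader.readLine()) != null) {
                String[] fields = row.split(Pattern.quote(fieldSeparator), -1);
                StringBuilder line = new StringBuilder();
                for (int i = 0; i < columns.length; i++) {
                    if (i > 0) {
                        line.append('|');
                    }
                    // Missing trailing fields are shown as empty values.
                    String value = i < fields.length ? fields[i] : "";
                    line.append(columns[i]).append(": ").append(value);
                }
                output.add(line.toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return output;
    }

    public static void main(String[] args) {
        Reader source = new StringReader("1;Alice\n2;Bob\n3;Eve\n");
        for (String line : display(source, ";", new String[] {"ID", "Name"}, 50)) {
            System.out.println(line);
        }
    }
}
```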

Scenario 2: Reading data from a remote file in streaming mode

This scenario describes a four-component Job used to fetch data from a voluminous file
almost as soon as it has been read. The data is displayed in the Run view. The advantage of this technique is that you do not have to
wait for the entire file to be downloaded before viewing the data.

Use_Case_tFileInputDelimited2_1.png

Dropping and linking components

  1. Drop the following components onto the workspace: tFileFetch, tSleep,
    tFileInputDelimited, and tLogRow.

  2. Connect tSleep and tFileInputDelimited using a Trigger > OnComponentOk
    link and connect tFileInputDelimited to
    tLogRow using a Row > Main link.

Configuring the components

  1. Double-click tFileFetch to display the
    Basic settings tab in the Component view and set the properties.

    Use_Case_tFileInputDelimited2_2.png
  2. From the Protocol list, select the
    appropriate protocol to access the server on which your data is
    stored.

  3. In the URI field, enter the URI required
    to access the server on which your file is stored.

  4. Select the Use cache to save the resource
    check box to add your file data to the cache memory. This option allows you
    to use the streaming mode to transfer the data.

  5. In the workspace, click tSleep to display
    the Basic settings tab in the Component view and set the properties.

    By default, tSleep's Pause field is set to 1
    second. Do not change this setting. It pauses the second Job in order to
    give the first Job, containing tFileFetch,
    the time to read the file data.

  6. In the workspace, double-click tFileInputDelimited to display its Basic settings tab in the Component view and set the properties.

    Use_Case_tFileInputDelimited2_3.png
  7. In the File name/Stream field:

    – Delete the default content.

    – Press Ctrl+Space to view the variables
    available for this component.

    – Select tFileFetch_1_INPUT_STREAM from the
    auto-completion list, to add the following variable to the Filename field:
    ((java.io.InputStream)globalMap.get("tFileFetch_1_INPUT_STREAM")).

  8. From the Schema list, select Built-in and click […] next to the Edit schema
    field to describe the structure of the file that you
    want to fetch. The US_Employees file is composed of six
    columns: ID, Employee, Age, Address, State, EntryDate.

    Click [+] to add the six columns and set
    them as indicated in the above screenshot. Click OK.

    Use_Case_tFileInputDelimited2_4.png
  9. In the workspace, double-click tLogRow to
    display its Basic settings in the Component view and click Sync Columns to ensure that the schema structure is properly
    retrieved from the preceding component.

Configuring Job execution and executing the Job

  1. Click the Job tab and then the
    Extra view.

    Use_Case_tFileInputDelimited2_5.png
  2. Select the Multi thread execution check
    box in order to run the two Jobs at the same time. Bear in mind that the
    second Job has a one-second delay according to the properties set in
    tSleep. This option allows you to fetch
    the data almost as soon as it is read by tFileFetch, thanks to the tFileInputDelimited component.

  3. Save the Job and press F6 to run it.

    Use_Case_tFileInputDelimited2_6.png

    The data is displayed in the console almost as soon as it is
    read.

For a scenario concerning the use of dynamic
schemas in tFileInputDelimited, see Scenario 4: Writing dynamic columns from a MySQL database to an output file.


Document get from Talend https://help.talend.com