August 17, 2023

tFileInputJSON – Docs for ESB 5.x

tFileInputJSON

tFileInputJSON_icon32_white.png

tFileInputJSON properties

Component Family

File / Input

 

Function

tFileInputJSON extracts JSON data
from a file according to the JSONPath query.

If you have subscribed to one of the Talend solutions with Big Data, you are
able to use this component in a Talend Map/Reduce Job to generate
Map/Reduce code. For further information, see tFileInputJSON in Talend
Map/Reduce Jobs. In that situation, tFileInputJSON belongs
to the MapReduce component family.

Purpose

tFileInputJSON extracts JSON data
from a file according to a JSONPath query, then transfers the
data to a file, a database table, and so on.

Basic settings

Property type

Either Built-in or Repository.

Since version 5.6, both the Built-In mode and the Repository mode are
available in any of the Talend solutions.

 

 

Built-in: No property data stored
centrally.

 

 

Repository: Select the repository
file where the properties are stored. The fields that follow are
completed automatically using the data retrieved.

 

Schema and Edit
Schema

A schema is a row description. It defines the number of fields to
be processed and passed on to the next component. The schema is
either Built-in or stored remotely
in the Repository.

Since version 5.6, both the Built-In mode and the Repository mode are
available in any of the Talend solutions.

Click Edit schema to make changes to the schema. If the
current schema is of the Repository type, three options are
available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this option
    to change the schema to Built-in for local
    changes.

  • Update repository connection: choose this option to change
    the schema stored in the repository and decide whether to propagate the changes to
    all the Jobs upon completion. If you just want to propagate the changes to the
    current Job, you can select No upon completion and
    choose this schema metadata again in the [Repository
    Content]
    window.

 

 

Built-in: The schema will be
created and stored locally for this component only. Related topic:
see Talend Studio
User Guide.

 

 

Repository: The schema already
exists and is stored in the Repository, hence can be reused in
various projects and Job flowcharts. Related topic: see
Talend Studio
User Guide.

 

Read By XPath

This check box is selected by default. It allows you to show the
Loop JSONPath query field and
the Get nodes check box in the
Mapping table.

 

Use URL

Select this check box to retrieve data directly from the Web.

URL: type in the URL path from
which you will retrieve data.

 

Filename

This field is not available if you select the Use URL check box.

Click the […] button next to
the field to browse to the file from which you will retrieve data or
enter the full path to the file directly.

 

Loop JSONPath query

JSONPath query to specify the loop node of the JSON data.

Available when Read by XPath is
selected.

 

Mapping

Column: shows the schema defined
in the Schema editor.

JSONPath Query: specifies the JSON
node that holds the desired data. For details about JSONPath
expressions, go to http://goessner.net/articles/JsonPath/.

Get nodes: available when Read by XPath is selected. Select this
check box to extract the JSON data of all the nodes specified in the
XPath query list or select the
check box next to a specific node to extract its JSON data only.
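
The JSONPath shapes used in this column can be pictured with a short sketch. Python has no JSONPath engine in its standard library, so the snippet below mimics two common query forms with plain dict access; the movie-collection JSON is an illustrative assumption shaped like the file used in the scenarios later on this page, not the actual file.

```python
import json

# Illustrative JSON shaped like the movie-collection file used in the
# scenarios below; the exact contents are an assumption for this sketch.
doc = json.loads("""
{"movieCollection": [
  {"type": "fiction", "name": "Movie A",
   "details": {"release": "2003", "rating": 8.0, "starring": "Actor A"}},
  {"type": "drama", "name": "Movie B",
   "details": {"release": "2010", "rating": 7.5, "starring": "Actor B"}}
]}
""")

# "$.movieCollection[*].type": from the root ($), loop over every element
# of the movieCollection array and take its "type" field.
types = [movie["type"] for movie in doc["movieCollection"]]

# "$..rating": recursive descent -- collect every "rating" key found
# anywhere under the root, which here lives inside each "details" node.
ratings = [movie["details"]["rating"] for movie in doc["movieCollection"]]

print(types)    # ['fiction', 'drama']
print(ratings)  # [8.0, 7.5]
```

In a Job, each query result is mapped to the schema column listed on the same row of the Mapping table.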

 

Die on error

Select this check box to stop the execution of the Job when an
error occurs. Clear the check box to skip the row on error and
complete the process for error-free rows. If needed, you can collect
the rows on error using a Row >
Reject link.

Advanced settings

Advanced separator (for numbers)

Select this check box to modify the separators used for
numbers:

Thousands separator: define
separators for thousands.

Decimal separator: define
separators for decimals.
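
To illustrate why these settings matter, the hypothetical helper below (a sketch, not Talend's implementation) parses a European-style number in which the thousands separator is a period and the decimal separator is a comma:

```python
def parse_number(text: str, thousands: str = ".", decimal: str = ",") -> float:
    """Parse a number written with custom separators, e.g. "1.234,56"."""
    # Drop the thousands separator, then normalize the decimal separator
    # to the period that float() expects.
    return float(text.replace(thousands, "").replace(decimal, "."))

print(parse_number("1.234,56"))  # 1234.56
```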

 

Validate date

Select this check box to check the date format strictly against
the input schema. This check box is available only if the Read By XPath check box is
selected.

 

Encoding

Select the encoding type from the list or select Custom and define it manually. This field
is compulsory for DB data handling.

 

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at a
Job level as well as at each component level.

Global Variables

NB_LINE: the number of rows processed. This is an After
variable and it returns an integer.

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl +
Space
to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio
User Guide.

Usage

This component is a start component of a Job and always needs an
output link.

Usage in Map/Reduce Jobs

In a Talend Map/Reduce Job, it is used as a start component and requires
a transformation component as output link. The other components used along with it must be
Map/Reduce components, too. They generate native Map/Reduce code that can be executed
directly in Hadoop.

You need to use the Hadoop Configuration tab in the
Run view to define the connection to a given Hadoop
distribution for the whole Job.

For further information about a Talend Map/Reduce Job, see the sections
describing how to create, convert and configure a Talend Map/Reduce Job of the
Talend Big Data Getting Started Guide.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents
only Standard Jobs, that is to say traditional Talend data
integration Jobs, and not Map/Reduce Jobs.

Log4j

The activity of this component can be logged using the log4j feature. For more information on this feature, see the Talend Studio User Guide.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.

tFileInputJSON in Talend
Map/Reduce Jobs

Warning

The information in this section is only for users that have subscribed to one of
the Talend solutions with Big Data and is not applicable to
Talend Open Studio for Big Data users.

In a Talend Map/Reduce Job, tFileInputJSON, as well as the whole Map/Reduce Job using it, generates
native Map/Reduce code. This section presents the specific properties of tFileInputJSON when it is used in that situation. For further
information about a Talend Map/Reduce Job, see the Talend Big Data Getting Started Guide.

Component family

MapReduce / Input

 

Function

In a Map/Reduce Job, tFileInputJSON extracts data from one or more JSON
files on HDFS and sends it to the following transformation
component.

Basic settings

Property type

Either Built-in or Repository.

   

Built-in: no property data stored
centrally.

   

Repository: reuse properties
stored centrally under the File
Json
node of the Repository tree.

The fields that follow are pre-filled with the fetched
data.

For further information about the File
Json node, see the section about setting up a JSON
file schema in the Talend Studio User Guide.

 

Schema and Edit
Schema

A schema is a row description. It defines the number of fields to be processed and passed on
to the next component. The schema is either Built-In or
stored remotely in the Repository.

Click Edit schema to make changes to the schema. If the
current schema is of the Repository type, three options are
available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this option
    to change the schema to Built-in for local
    changes.

  • Update repository connection: choose this option to change
    the schema stored in the repository and decide whether to propagate the changes to
    all the Jobs upon completion. If you just want to propagate the changes to the
    current Job, you can select No upon completion and
    choose this schema metadata again in the [Repository
    Content]
    window.

   

Built-In: You create and store the schema locally for this
component only. Related topic: see Talend Studio
User Guide.

   

Repository: You have already created the schema and
stored it in the Repository. You can reuse it in various projects and Job designs. Related
topic: see Talend Studio User Guide.

 

Folder/File

Enter the path to the file or folder on HDFS from which the data
will be extracted.

If the path you entered points to a folder, all files stored in
that folder will be read.

If the file to be read is a compressed one, enter the file name
with its extension; then tFileInputJSON automatically decompresses it at
runtime. The supported compression formats and their corresponding
extensions are:

  • DEFLATE: *.deflate

  • gzip: *.gz

  • bzip2: *.bz2

  • LZO: *.lzo
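
The extension-to-codec mapping above can be sketched as follows. This is an illustrative Python sketch, not the component's Java implementation; LZO is omitted because the Python standard library has no LZO codec, and the DEFLATE branch assumes zlib framing.

```python
import bz2
import gzip
import io
import zlib
from pathlib import Path

def open_maybe_compressed(path: str):
    """Pick a decompressor from the file extension, as the component does."""
    ext = Path(path).suffix
    if ext == ".gz":
        return gzip.open(path, "rb")
    if ext == ".bz2":
        return bz2.open(path, "rb")
    if ext == ".deflate":
        # Assumes zlib-framed DEFLATE data; a raw stream would need
        # zlib.decompressobj(-zlib.MAX_WBITS) instead.
        return io.BytesIO(zlib.decompress(Path(path).read_bytes()))
    return open(path, "rb")  # plain, uncompressed file
```

For example, pointing it at movies.json.gz would transparently yield the decompressed JSON bytes.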

Note that you need to ensure you have properly configured the connection to
the Hadoop distribution to be used in the Hadoop
configuration tab in the Run view.

 

Die on error

Select this check box to stop the execution of the Job when an error occurs.

Clear the check box to skip any rows on error and complete the process for error-free rows.
When errors are skipped, you can collect the rows on error using a Row
> Reject
link.

 

Loop XPath query

Node within the JSON field, on which the loop is based.

 

Mapping

Complete the Mapping table to
extract the desired data.

  • Column: columns defined
    in the schema to hold the data extracted from the JSON
    field.

  • XPath query: XPath query
    to specify the node within the JSON field to be
    extracted.

  • Get Nodes: select this check box
    to get values from a nested node within the
    JSON field.

Advanced settings

Advanced separator (for number)

Select this check box to change the separators used for numbers. By
default, the thousands separator is a comma (,) and the decimal separator is a period (.).

 

Validate date

Select this check box to check the date format strictly against
the input schema.

 

Encoding

Select the encoding from the list or select Custom and define it manually.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl +
Space
to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio
User Guide.

Usage

In a Talend Map/Reduce Job, it is used as a start component and requires
a transformation component as output link. The other components used along with it must be
Map/Reduce components, too. They generate native Map/Reduce code that can be executed
directly in Hadoop.

Once a Map/Reduce Job is opened in the workspace, tFileInputJSON as well as the MapReduce
family appears in the Palette of
the Studio.

For further information about a Talend Map/Reduce Job, see the sections
describing how to create, convert and configure a Talend Map/Reduce Job of the
Talend Big Data Getting Started Guide.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents
only Standard Jobs, that is to say traditional Talend data
integration Jobs, and not Map/Reduce Jobs.

Hadoop Connection

You need to use the Hadoop Configuration tab in the
Run view to define the connection to a given Hadoop
distribution for the whole Job.

This connection is effective on a per-Job basis.

Prerequisites

The Hadoop distribution must be properly installed, so as to guarantee the interaction
with Talend Studio. The following list presents MapR related information for
example.

  • Ensure that you have installed the MapR client on the machine where the Studio is,
    and added the MapR client library to the PATH variable of that machine. According
    to MapR’s documentation, the library or libraries of a MapR client corresponding to
    each OS version can be found under MAPR_INSTALL\hadoop\hadoop-VERSION\lib\native.
    For example, the library for Windows is lib\native\MapRClient.dll in the MapR
    client jar file. For further information, see the following link from MapR: http://www.mapr.com/blog/basic-notes-on-configuring-eclipse-as-a-hadoop-development-environment-for-mapr.

    Without adding the specified library or libraries, you may encounter the following
    error: no MapRClient in java.library.path.

  • Set the -Djava.library.path argument, for example, in the Job Run VM arguments area
    of the Run/Debug view in the [Preferences] dialog box. This argument provides the
    Studio with the path to the native library of that MapR client. This allows
    subscription-based users to make full use of the Data viewer to view
    locally in the Studio the data stored in MapR. For further information about how to
    set this argument, see the section describing how to view data in the Talend Big Data Getting Started Guide.

For further information about how to install a Hadoop distribution, see the manuals
corresponding to the Hadoop distribution you are using.

Scenario 1: Extracting JSON data from a file

In this scenario, the tFileInputJSON component reads
the JSON data from a file using JSONPath queries and the tLogRow component shows the flat data extracted.

The JSON file contains information about a movie collection.

Linking the components

  1. Drop tFileInputJSON and tLogRow from the Palette onto the Job designer.

  2. Rename tFileInputJSON as read_JSON_data and tLogRow as show_data.

  3. Link the components using a Row >
    Main connection.

    use_case_tfileinputjson_1.png

Configuring the components

  1. Double-click tFileInputJSON to open its
    Basic settings view:

    use_case_tfileinputjson_2.png
  2. Click the […] button next to the
    Edit schema field to open the schema
    editor.

    use_case_tfileinputjson_3.png
  3. Click the [+] button to add five columns,
    namely type, movie_name, release,
    rating and starring, all of type String except the column rating, which is of type Double.

    Click OK to close the editor.

  4. In the pop-up Propagate box, click
    Yes to propagate the schema to the
    subsequent components.

  5. In the Filename field, fill in the path
    to the JSON file.

    In this example, the JSON file is as follows:

  6. Clear the Read By XPath check box.

  7. In the Mapping table, the schema
    automatically appears in the Column
    part.

    use_case_tfileinputjson_4.png

    In the JSONPath query column, enter the
    following queries:

    • For the columns type and
      movie_name, enter the JSONPath
      queries “$.movieCollection[*].type” and “$.movieCollection[*].name”
      respectively. They correspond to the first nodes of the JSON
      data.

      Here, “$.movieCollection[*]”
      stands for the root node relative to the nodes type and name, namely movieCollection.

    • For the columns release,
      rating and starring, enter the JSONPath queries
      “$..release”, “$..rating” and “$..starring” respectively.

      Here, “..” stands for a
      recursive descent, matching the nodes release, rating and starring
      nested under each details node.

  8. Double-click tLogRow to display the
    Basic settings view and select
    Table (print values in cells of a
    table)
    for a better display of the results.

    use_case_tfileinputjson_5.png

Executing the Job

  1. Press Ctrl+S to save the Job.

  2. Press F6 to execute the Job.

    use_case_tfileinputjson_6.png

    As shown above, the source JSON data is collected in a flat file
    table.

Scenario 2: Extracting JSON data from a file using XPath

In this scenario, the tFileInputJSON component reads
the JSON data from a file using XPath queries and the tLogRow component shows the flat data extracted.

The JSON file contains information about a movie collection.

Linking the components

  1. Drop tFileInputJSON and tLogRow from the Palette onto the Job designer.

  2. Rename tFileInputJSON as read_JSON_data and tLogRow as show_data.

  3. Link the components using a Row >
    Main connection.

    use_case_tfileinputjson_1.png

Configuring the components

  1. Double-click tFileInputJSON to open its
    Basic settings view:

    use_case_tfileinputjson_7.png
  2. Click the […] button next to the
    Edit schema field to open the schema
    editor.

    use_case_tfileinputjson_3.png
  3. Click the [+] button to add five columns,
    namely type, movie_name, release,
    rating and starring, all of type String except the column rating, which is of type Double.

    Click OK to close the editor.

  4. In the pop-up Propagate box, click
    Yes to propagate the schema to the
    subsequent components.

  5. In the Filename field, enter the path to
    the JSON file.

    In this example, the JSON file is as follows:

  6. Make sure that the Read By XPath check
    box is selected.

  7. In the Loop JSONPath query field, enter
    “/movieCollection/details”.

  8. In the Mapping table, the schema
    automatically appears in the Column
    part.

    use_case_tfileinputjson_8.png

    In the XPath query column, enter the
    following queries:

    • For the columns type and
      movie_name, enter the XPath queries
      “../type” and “../name” respectively. They correspond
      to the first nodes of the JSON data.

    • For the columns release,
      rating and starring, enter the XPath queries
      “release”, “rating” and “starring” respectively.

  9. Double-click tLogRow to display the
    Basic settings view and select
    Table (print values in cells of a
    table)
    for a better display of the results.

    use_case_tfileinputjson_5.png
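
The loop-plus-relative-query mechanics of steps 7 and 8 can be mimicked in a short sketch. Python has no XPath engine for JSON, so the hypothetical snippet below reproduces the same navigation with plain dict access; the sample data is an assumption shaped like the scenario's movie collection.

```python
# Illustrative data shaped like the scenario's movie collection (an assumption).
movie_collection = [
    {"type": "fiction", "name": "Movie A",
     "details": {"release": "2003", "rating": 8.0, "starring": "Actor A"}},
]

rows = []
# Loop query "/movieCollection/details": each iteration is one details node.
for movie in movie_collection:
    details = movie["details"]
    rows.append({
        "type": movie["type"],            # "../type": step up to the parent
        "movie_name": movie["name"],      # "../name"
        "release": details["release"],    # "release": relative to the loop node
        "rating": details["rating"],      # "rating"
        "starring": details["starring"],  # "starring"
    })

print(rows[0]["movie_name"])  # Movie A
```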

Executing the Job

  1. Press Ctrl+S to save the Job.

  2. Press F6 to execute the Job.

    use_case_tfileinputjson_6.png

    As shown above, the source JSON data is collected in a flat file
    table.

Scenario 3: Extracting JSON data from a URL

In this scenario, tFileInputJSON retrieves the
friends node from a JSON file that contains the
data of a Facebook user and tExtractJSONFields extracts
the data from the friends node for flat data
output.

Note that the JSON file is deployed on the Tomcat server, specifically, located in the
folder <tomcat path>/webapps/docs.

Linking the components

  1. Drop the following components from the Palette onto the design workspace: tFileInputJSON, tExtractJSONFields and tLogRow.

  2. Link tFileInputJSON and tExtractJSONFields using a Row > Main connection.

  3. Link tExtractJSONFields and tLogRow using a Row > Main connection.

    use_case_tfileinputjson_2_1.png

Configuring the components

  1. Double-click tFileInputJSON to display
    its Basic settings view.

    use_case_tfileinputjson_2_2.png
  2. Click the […] button next to the
    Edit schema field to open the schema
    editor.

    use_case_tfileinputjson_2_3.png

    Click the [+] button to add one column,
    namely friends, of the String type.

    Click OK to close the editor.

  3. Clear the Read by XPath check box and
    select the Use URL check box.

    In the URL field, enter the JSON file
    URL, “http://localhost:8080/docs/facebook.json” in this
    case.

    The JSON file is as follows:

  4. Enter the URL in a browser. If the Tomcat server is running, the browser
    displays:

    use_case_tfileinputjson_2_9.png
  5. In the Studio, in the Mapping table,
    enter the JSONPath query “$.user.friends[*]” next to the friends column to retrieve the entire friends node from the source file.

  6. Double-click tExtractJSONFields to
    display its Basic settings view.

    use_case_tfileinputjson_2_4.png
  7. Click the […] button next to the
    Edit schema field to open the schema
    editor.

    use_case_tfileinputjson_2_5.png
  8. Click the [+] button in the right panel
    to add five columns, namely id, name, like_id, like_name and
    like_category, which will hold the
    data of relevant nodes in the JSON field friends.

    Click OK to close the editor.

  9. In the pop-up Propagate box, click
    Yes to propagate the schema to the
    subsequent components.

    use_case_tfileinputjson_2_6.png
  10. In the Loop XPath query field, enter
    “/likes/data”.

  11. In the Mapping table, type in the queries
    of the JSON nodes in the XPath query
    column. The data of those nodes will be extracted and passed to their
    counterpart columns defined in the output schema.

  12. Specifically, define the XPath query “../../id” (querying the “/friends/id” node) for the column id, “../../name”
    (querying the “/friends/name” node) for
    the column name, “id” for the column like_id, “name” for the
    column like_name, and “category” for the column like_category.

  13. Double-click tLogRow to display its
    Basic settings view.

    use_case_tfileinputjson_2_7.png

    Select Table (print values in cells of a
    table)
    for a better display of the results.
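
The two-level step-up of step 12 can be pictured with a short sketch. As before, this is an illustrative Python snippet (an assumption shaped like the scenario's JSON), not Talend's implementation: with the loop set to /likes/data, "../../id" reaches two levels up to the friend record itself.

```python
# Illustrative "friends" record shaped like the scenario's JSON (an assumption).
friend = {
    "id": "100000",
    "name": "Kelly Clarkson",
    "likes": {"data": [
        {"id": "1", "name": "Music", "category": "Interest"},
        {"id": "2", "name": "Movies", "category": "Interest"},
    ]},
}

rows = []
# Loop query "/likes/data": each iteration is one entry of the likes array.
for like in friend["likes"]["data"]:
    rows.append({
        "id": friend["id"],                # "../../id": two levels up
        "name": friend["name"],            # "../../name"
        "like_id": like["id"],             # "id": relative to the loop node
        "like_name": like["name"],         # "name"
        "like_category": like["category"], # "category"
    })

print(len(rows))  # 2
```

Each like thus produces one flat output row that repeats the parent friend's id and name.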

Executing the Job

  1. Press Ctrl + S to save the Job.

  2. Press F6 to execute the Job.

    use_case_tfileinputjson_2_8.png

    As shown above, the friends data of the Facebook user Kelly Clarkson is
    extracted correctly.


Document retrieved from Talend: https://help.talend.com