
tFileOutputJSON – Docs for ESB 7.x

tFileOutputJSON

Receives data and rewrites it in a JSON structured data block in an output
file.

Depending on the Talend product you are using, this component can be used in one, some, or all of the following Job frameworks: Standard, MapReduce (deprecated), Spark Batch, and Spark Streaming.

tFileOutputJSON Standard properties

These properties are used to configure tFileOutputJSON running in the Standard Job framework.

The Standard tFileOutputJSON component belongs to the File family.

The component in this framework is available in all Talend products.

Basic settings

File Name

Name and path of the output file.

Warning: Use an absolute path (rather than a relative path) in this field to avoid possible errors.
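
In the Studio, this field expects a Java string, for example (the path below is a hypothetical illustration; adjust it to your own file system):

    "D:/output/result.json"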

Generate an array json

Select this check box to generate an array JSON file.

Name of data block

Enter a name for the data block to be written, between double
quotation marks.

This field disappears when the Generate an array json check box is selected.
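
For illustration, assuming a two-column schema (id, name) and the data block name "results" (all hypothetical), the component writes a structure of this shape:

    {"results":[{"id":1,"name":"Andrew"},{"id":2,"name":"Bill"}]}

With the Generate an array json check box selected, the same rows are written as a top-level JSON array instead:

    [{"id":1,"name":"Andrew"},{"id":2,"name":"Bill"}]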

Schema and Edit Schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this
    option to view the schema only.

  • Change to built-in property:
    choose this option to change the schema to Built-in for local changes.

  • Update repository connection:
    choose this option to change the schema stored in the repository and decide whether
    to propagate the changes to all the Jobs upon completion. If you just want to
    propagate the changes to the current Job, you can select No upon completion and choose this schema metadata
    again in the Repository Content
    window.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Sync columns

Click to synchronize the output file schema with the input file
schema. The Sync function only displays once the Row connection is
linked with the Output component.

Advanced settings

Create directory if not exists

This check box is selected by default. This option creates the
directory that will hold the output files if it does not already
exist.

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at a
Job level as well as at each component level.

Global Variables

Global Variables

NB_LINE: the number of rows read by an input component or
transferred to an output component. This is an After variable and it returns an
integer.

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see the Talend Studio User Guide.
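
As a minimal sketch, assuming the component is named tFileOutputJSON_1 in your Job (a hypothetical name), a tJava component placed after it on an OnSubjobOk trigger could read these After variables from the globalMap:

    // Code for a tJava component; globalMap is provided by the generated Job.
    Integer nbLine = (Integer) globalMap.get("tFileOutputJSON_1_NB_LINE");
    String errorMessage = (String) globalMap.get("tFileOutputJSON_1_ERROR_MESSAGE");
    System.out.println("Rows written: " + nbLine);
    if (errorMessage != null) {
        System.err.println("Component error: " + errorMessage);
    }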

Usage

Usage rule

Use this component to rewrite received data in a JSON structured
output file.

Writing a JSON structured file

This is a two-component scenario in which a tRowGenerator component generates random data that a tFileOutputJSON component then writes to a JSON structured output file.

tFileOutputJSON_1.png

Procedure

  1. Drop a tRowGenerator and a tFileOutputJSON component onto the workspace from the
    Palette.
  2. Link the components using a Row > Main
    connection.
  3. Double-click tRowGenerator to define its Basic Settings properties in the
    Component view.

    tFileOutputJSON_2.png

  4. Click […] next to Edit Schema to display the corresponding dialog box and define
    the schema.

    tFileOutputJSON_3.png

  5. Click [+] to add the number of columns
    desired.
  6. Under Columns type in the column
    names.
  7. Under Type, select the data type from the
    list.
  8. Click OK to close the dialog box.
  9. Click [+] next to RowGenerator Editor to open the corresponding dialog box.

    tFileOutputJSON_4.png

  10. Under Functions, select pre-defined functions
    for the columns, if required, or select […]
    to set customized function parameters in the Function
    parameters
    tab.
  11. Enter the number of rows to be generated in the corresponding field.
  12. Click OK to close the dialog box.
  13. Click tFileOutputJSON to set its Basic Settings properties in the Component view.

    tFileOutputJSON_5.png

  14. Click […] to browse to where you want the
    output JSON file to be generated and enter the file name.
  15. Enter a name for the data block to be generated in the corresponding field,
    between double quotation marks.
  16. Select Built-In as the Schema type.
  17. Click Sync Columns to retrieve the schema
    from the preceding component.
  18. Press F6 to run the Job.

    tFileOutputJSON_6.png

The data from the input schema is written in a JSON structured data block in the
output file.
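
For reference, the following plain-Java sketch mirrors what this Job does conceptually: it generates a few random rows and writes them under a named data block. The file path, block name, column names, and row count are hypothetical stand-ins for the values you set in the Studio.

    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.Random;

    public class JsonBlockSketch {
        public static void main(String[] args) throws IOException {
            Random rnd = new Random();
            // Build {"results":[{...},{...},...]} the way tFileOutputJSON
            // structures its output: one named block holding all rows.
            StringBuilder sb = new StringBuilder("{\"results\":[");
            int rows = 5; // stands in for the row count set in the RowGenerator Editor
            for (int i = 0; i < rows; i++) {
                if (i > 0) sb.append(',');
                sb.append("{\"id\":").append(i)
                  .append(",\"value\":").append(rnd.nextInt(100)).append('}');
            }
            sb.append("]}");
            try (FileWriter out = new FileWriter("out.json")) {
                out.write(sb.toString());
            }
        }
    }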

tFileOutputJSON MapReduce properties (deprecated)

These properties are used to configure tFileOutputJSON running in the MapReduce Job framework.

The MapReduce
tFileOutputJSON component belongs to the MapReduce family.

The component in this framework is available in all subscription-based Talend products with Big Data
and Talend Data Fabric.

The MapReduce framework is deprecated from Talend 7.3 onwards. Use Talend Jobs for Apache Spark to accomplish your integration tasks.

Basic settings

Schema and Edit Schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this
    option to view the schema only.

  • Change to built-in property:
    choose this option to change the schema to Built-in for local changes.

  • Update repository connection:
    choose this option to change the schema stored in the repository and decide whether
    to propagate the changes to all the Jobs upon completion. If you just want to
    propagate the changes to the current Job, you can select No upon completion and choose this schema metadata
    again in the Repository Content
    window.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Folder

Enter the folder on HDFS where you want to store the JSON
output file(s).

The folder will be created automatically if it does not
exist.

Note that you need to ensure you have properly configured the connection to the Hadoop distribution to be used, in the Hadoop configuration tab of the Run view.

Output type

Select the structure for the JSON output file(s).

  • All in one block: the received
    data will be written into one data block.

  • One row per record: the received
    data will be written into separate data blocks row by row.

Name of data block

Type in the name of the data block for the JSON output
file(s).

This field will be available only if you select All in one block from the Output type list.
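
As a hypothetical illustration with two records (field names invented), the two output types differ as follows. All in one block, with the data block named "results":

    {"results":[{"id":1},{"id":2}]}

One row per record, a sketch of the likely shape, with each record written as its own block:

    {"id":1}
    {"id":2}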

Action

Select the action that you want to perform on the data:

  • Overwrite: the data on HDFS will
    be overwritten if it already exists.

  • Create: the data will be
    created.

Advanced settings

Use local timezone for date

Select this check box to use the local date of the machine on which your Job is executed. If you leave this check box clear, UTC is automatically used to format Date-type data.

Global Variables

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see the Talend Studio User Guide.

Usage

Usage rule

Use this component to rewrite received data in a JSON
structured output file.

In a Talend Map/Reduce Job, it is used as an end component and requires a transformation component as its input link. The other components used along with it must be Map/Reduce components too. They generate native Map/Reduce code that can be executed directly in Hadoop.

Once a Map/Reduce Job is opened in the workspace, tFileOutputJSON as well as the MapReduce family
appears in the Palette of the
Studio.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is, traditional Talend data integration Jobs, not Map/Reduce Jobs.

Hadoop Connection

You need to use the Hadoop Configuration tab in the
Run view to define the connection to a given Hadoop
distribution for the whole Job.

This connection is effective on a per-Job basis.

Prerequisites

The Hadoop distribution must be properly installed to guarantee interaction with Talend Studio. The following list presents MapR-related information as an example.

  • Ensure that you have installed the MapR client on the machine where the Studio is,
    and added the MapR client library to the PATH variable of that machine. According
    to MapR's documentation, the library or libraries of a MapR client corresponding to
    each OS version can be found under MAPR_INSTALL\hadoop\hadoop-VERSION\lib\native.
    For example, the library for Windows is \lib\native\MapRClient.dll in the MapR
    client jar file. For further information, see the following link from MapR: http://www.mapr.com/blog/basic-notes-on-configuring-eclipse-as-a-hadoop-development-environment-for-mapr.

    Without adding the specified library or libraries, you may encounter the following
    error: no MapRClient in java.library.path.

  • Set the -Djava.library.path argument, for example, in the Job Run VM arguments area
    of the Run/Debug view in the Preferences dialog box in the Window menu. This
    argument provides the Studio with the path to the native library of that MapR
    client, and allows subscription-based users to make full use of the Data viewer to
    view, locally in the Studio, the data stored in MapR. An example is shown below.
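
For example, the VM argument could look like the following line, where the path is a hypothetical MapR client installation directory; replace it with the actual location of the native library on your machine:

    -Djava.library.path="C:\opt\mapr\hadoop\hadoop-2.7.0\lib\native"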

For further information about how to install a Hadoop distribution, see the manuals
corresponding to the Hadoop distribution you are using.

Related scenarios

No scenario is available for the Map/Reduce version of this component yet.

tFileOutputJSON properties for Apache Spark Batch

These properties are used to configure tFileOutputJSON running in the Spark Batch Job framework.

The Spark Batch
tFileOutputJSON component belongs to the File family.

The component in this framework is available in all subscription-based Talend products with Big Data
and Talend Data Fabric.

Basic settings

Define a storage configuration component

Select the configuration component to be used to provide the configuration
information for the connection to the target file system such as HDFS.

If you leave this check box clear, the target file system is the local
system.

The configuration component to be used must be present in the same Job.
For example, if you have dropped a tHDFSConfiguration component in the Job, you can select it to write
the result in a given HDFS system.

Schema and Edit Schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this
    option to view the schema only.

  • Change to built-in property:
    choose this option to change the schema to Built-in for local changes.

  • Update repository connection:
    choose this option to change the schema stored in the repository and decide whether
    to propagate the changes to all the Jobs upon completion. If you just want to
    propagate the changes to the current Job, you can select No upon completion and choose this schema metadata
    again in the Repository Content
    window.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Folder

Browse to, or enter the path pointing to the data to be used in the file system.

Note that this path must point to a folder rather than a
file.

The button for browsing does not work with the Spark Local mode; if you are using the other Spark Yarn modes that the Studio supports with your distribution, ensure that you have properly configured the connection in a configuration component in the same Job, such as tHDFSConfiguration. Use the configuration component that corresponds to the file system to be used.

Output type

Select the structure for the JSON output file(s).

  • All in one block: the received
    data will be written into one data block.

  • One row per record: the received
    data will be written into separate data blocks row by row.

Name of data block

Type in the name of the data block for the JSON output
file(s).

This field will be available only if you select All in one block from the Output type list.

Action

Select the action that you want to perform on the data:

  • Overwrite: the data on HDFS will
    be overwritten if it already exists.

  • Create: the data will be
    created.

Advanced settings

Use local timezone for date

Select this check box to use the local date of the machine on which your Job is executed. If you leave this check box clear, UTC is automatically used to format Date-type data.

Usage

Usage rule

This component is used as an end component and requires an input link.

This component, along with the Spark Batch component Palette it belongs to,
appears only when you are creating a Spark Batch Job.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs.

Spark Connection

In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files:

  • Yarn mode (Yarn client or Yarn cluster):

    • When using Google Dataproc, specify a bucket in the
      Google Storage staging bucket
      field in the Spark configuration
      tab.

    • When using HDInsight, specify the blob to be used for Job
      deployment in the Windows Azure Storage
      configuration
      area in the Spark
      configuration
      tab.

    • When using Altus, specify the S3 bucket or the Azure
      Data Lake Storage for Job deployment in the Spark
      configuration
      tab.
    • When using Qubole, add a
      tS3Configuration to your Job to write
      your actual business data in the S3 system with Qubole. Without
      tS3Configuration, this business data is
      written in the Qubole HDFS system and destroyed once you shut
      down your cluster.
    • When using on-premise
      distributions, use the configuration component corresponding
      to the file system your cluster is using. Typically, this
      system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the
    configuration component corresponding to the file system your cluster is
    using, such as tHDFSConfiguration or
    tS3Configuration.

    If you are using Databricks without any configuration component present
    in your Job, your business data is written directly in DBFS (Databricks
    Filesystem).

This connection is effective on a per-Job basis.

Related scenarios

No scenario is available for the Spark Batch version of this component
yet.

tFileOutputJSON properties for Apache Spark Streaming

These properties are used to configure tFileOutputJSON running in the Spark Streaming Job framework.

The Spark Streaming
tFileOutputJSON component belongs to the File family.

This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.

Basic settings

Define a storage configuration component

Select the configuration component to be used to provide the configuration
information for the connection to the target file system such as HDFS.

If you leave this check box clear, the target file system is the local
system.

The configuration component to be used must be present in the same Job.
For example, if you have dropped a tHDFSConfiguration component in the Job, you can select it to write
the result in a given HDFS system.

Schema and Edit Schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this
    option to view the schema only.

  • Change to built-in property:
    choose this option to change the schema to Built-in for local changes.

  • Update repository connection:
    choose this option to change the schema stored in the repository and decide whether
    to propagate the changes to all the Jobs upon completion. If you just want to
    propagate the changes to the current Job, you can select No upon completion and choose this schema metadata
    again in the Repository Content
    window.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Folder

Browse to, or enter the path pointing to the data to be used in the file system.

Note that this path must point to a folder rather than a
file.

The button for browsing does not work with the Spark Local mode; if you are using the other Spark Yarn modes that the Studio supports with your distribution, ensure that you have properly configured the connection in a configuration component in the same Job, such as tHDFSConfiguration. Use the configuration component that corresponds to the file system to be used.

Output type

Select the structure for the JSON output file(s).

  • All in one block: the received
    data will be written into one data block.

  • One row per record: the received
    data will be written into separate data blocks row by row.

Name of data block

Type in the name of the data block for the JSON output
file(s).

This field will be available only if you select All in one block from the Output type list.

Action

Select the action that you want to perform on the data:

  • Overwrite: the data on HDFS will
    be overwritten if it already exists.

  • Create: the data will be
    created.

Advanced settings

Write empty batches

Select this check box to allow your Spark Job to create an empty batch when the incoming batch is empty.

For further information about when this is desirable behavior, see this discussion.

Use local timezone for date

Select this check box to use the local date of the machine on which your Job is executed. If you leave this check box clear, UTC is automatically used to format Date-type data.

Usage

Usage rule

This component is used as an end component and requires an input link.

This component, along with the Spark Streaming component Palette it belongs to, appears
only when you are creating a Spark Streaming Job.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs.

Spark Connection

In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files:

  • Yarn mode (Yarn client or Yarn cluster):

    • When using Google Dataproc, specify a bucket in the
      Google Storage staging bucket
      field in the Spark configuration
      tab.

    • When using HDInsight, specify the blob to be used for Job
      deployment in the Windows Azure Storage
      configuration
      area in the Spark
      configuration
      tab.

    • When using Altus, specify the S3 bucket or the Azure
      Data Lake Storage for Job deployment in the Spark
      configuration
      tab.
    • When using Qubole, add a
      tS3Configuration to your Job to write
      your actual business data in the S3 system with Qubole. Without
      tS3Configuration, this business data is
      written in the Qubole HDFS system and destroyed once you shut
      down your cluster.
    • When using on-premise
      distributions, use the configuration component corresponding
      to the file system your cluster is using. Typically, this
      system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the
    configuration component corresponding to the file system your cluster is
    using, such as tHDFSConfiguration or
    tS3Configuration.

    If you are using Databricks without any configuration component present
    in your Job, your business data is written directly in DBFS (Databricks
    Filesystem).

This connection is effective on a per-Job basis.

Related scenarios

No scenario is available for the Spark Streaming version of this component
yet.


Source: Talend, https://help.talend.com