July 30, 2023

tHMapInput – Docs for ESB 7.x

tHMapInput

Runs a Talend Data Mapper map, where the input and output structures may differ, as a Spark batch execution, and sends the data for use by a downstream component.

tHMapInput transforms data from a
single source, in a Spark environment, for use by a downstream
component.

tHMapInput properties for Apache Spark Batch

These properties are used to configure tHMapInput running in the Spark Batch Job framework.

The Spark Batch
tHMapInput component belongs to the Processing family.

This component is available in Talend Platform products with Big Data and
in Talend Data Fabric.

Basic settings

Storage

To connect to an HDFS installation, select the Define a storage configuration component
check box and then select the name of the component to use from those
available in the drop-down list.

This option requires you to have previously configured the
connection to the HDFS installation to be used, as described in the
documentation for the tHDFSConfiguration component.

If you leave the Define a storage configuration component check box unselected, you can only convert files locally.

Configure Component

Before you configure this component, you must have already added a downstream component, linked it to the tHMapInput component, and retrieved the schema from the downstream component.

To configure the component, click the […] button and, in the Component Configuration window, perform
the following actions.

  1. Click the Select button next to the Record structure field and
    then, in the Select a Structure dialog box that opens, select the
    structure you want to use and then click OK.

    This structure must have been previously created in Talend Data Mapper.

  2. Select the Input Representation to use from the drop-down list.

    Supported input formats are Avro, COBOL, EDI,
    Flat, IDocs, JSON and XML.

  3. Click Next.

  4. Tell the component where each new record begins. To do so, you
    need to fully understand the structure of your data.

    Exactly how you do this varies depending on
    the input representation being used, and you will be
    presented with one of the following options.

    1. Select an appropriate record
      delimiter for your data. Note that you must specify
      this value without quotes.

      • Separator lets you specify a separator indicator, such as \n,
        to identify a new line.

        Supported indicators are \n for a Unix-type new line, \r\n
        for Windows and \r for Mac, and \t for tab characters.

      • Start/End with lets you specify the initial characters that
        indicate a new record, such as <root, or the characters that
        indicate where a record ends.

        Start with also supports new lines: \n for a Unix-type new
        line, \r\n for Windows and \r for Mac, and \t for tab
        characters.

        Select the Regular Expression check box if you wish to enter
        a regular expression to match the start of a record (a brief
        sketch of such an expression follows this procedure). When
        you select XML or JSON, this check box is selected by default
        and a pre-configured regular expression is provided.

      • Sample File: To test the
        signature with a sample file, click the
        […] button, browse to the
        file you want to use as a sample, click
        Open, and then click
        Run to test your
        sample.

        Testing the signature lets you check that the total number of
        records and their minimum and maximum length correspond to
        what you expect based on your knowledge of the data. This
        step assumes you have a local subset of your data to use as a
        sample.

      • Click Finish.

    2. If your input representation is COBOL
      or Flat with positional and/or binary encoding
      properties, define the signature for the input
      record structure:

      • Input Record root
        corresponds to the root element in your input
        record.
      • Minimum Record Size corresponds to the size in bytes of the
        smallest record. If you set this value too low, you may
        encounter performance issues, since the component will
        perform more checks than necessary when looking for a new
        record.

      • Maximum Record Size corresponds to the size in bytes of the
        largest record, and is used to determine how much memory is
        allocated to read the input.

      • Sample from Workspace or
        Sample from File System: To
        test the signature with a sample file, click the
        […]
        button, and then browse to the file you want to
        use as a sample.

        Testing the signature lets you check that the total number of
        records and their minimum and maximum length correspond to
        what you expect based on your knowledge of the data. This
        step assumes you have a local subset of your data to use as a
        sample.

      • Footer Size
        corresponds to the size in bytes of the footer, if
        any. At runtime, the footer will be ignored rather
        than being mistakenly included in the last record.
        Leave this field empty if there is no footer.

      • Click the Next button to open the Signature Parameters
        window, select the fields that define the signature of your
        record input structure (that is, the fields that identify
        where a new record begins), update the Operation and Value
        columns as appropriate, and then click Next.

      • In the Record Signature Test window that opens, check that
        your records are correctly delineated by scrolling through
        them with the Back and Next buttons and performing a visual
        check, and then click Finish.

  5. Map the elements from the input structure to
    the output structure in the new map that opens, and then
    press Ctrl+S to save
    your map.

    For more information on creating maps, see the Talend Data Mapper User Guide.
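
The following sketch illustrates, outside of Talend Studio, the kind of record detection described in step 4: splitting flat text on a separator such as \n, and matching the start of an XML record with a regular expression. The element name <record, the sample data, and the class name are assumptions made for this illustration only, not values taken from this page.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hypothetical sketch of record-start detection; this code is not generated
    // by Talend Studio and only illustrates how the settings above behave.
    public class RecordStartSketch {
        public static void main(String[] args) {
            // A Separator value such as \n splits flat text into one record per line.
            String flatData = "rec1\nrec2\nrec3";
            System.out.println(flatData.split("\n").length); // prints 3

            // A Start with value or regular expression marks where each XML record
            // begins, for example every occurrence of "<record".
            String xmlData = "<records><record id=\"1\"/><record id=\"2\"/></records>";
            Pattern recordStart = Pattern.compile("<record[\\s/>]");
            Matcher matcher = recordStart.matcher(xmlData);
            int count = 0;
            while (matcher.find()) {
                count++;
            }
            System.out.println(count); // prints 2
        }
    }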

Input

Click the […] button to define the path where the input file is stored.

Open Map Editor

Click the […] button to open the map for editing in the Map Editor of Talend Data Mapper.

For more information, see the Talend Data Mapper User Guide.

Die on error

This check box is selected by default.

Clear the check box to skip any rows on error and complete the
process for error-free rows.

If you clear the check box, you can do any of the following:

  • Connect the tHMapInput component to an output
    component, for example tAvroOutput, using a Row > Rejects connection. In the output component, ensure that
    you add a fixed metadata with the following columns:

    • inputRecord: contains the input record that was rejected
      during the transformation.
    • recordId: refers to the record identifier. For a text or
      binary input, the recordId specifies the start offset of the
      record in the input file. For an Avro input, the recordId
      specifies the timestamp when the input was processed.
    • errorMessage: contains the
      transformation status with details of the cause of the
      transformation error.
  • You can retrieve the rejected records in a file (a configuration
    sketch follows the note below). This feature is triggered by
    either of these mechanisms: (1) a context variable
    (talend_transform_reject_file_path) or (2) a system variable set
    in the Advanced job parameters (spark.hadoop.talend.transform.reject.file.path).

    When you set the file path on the Hadoop Distributed File System
    (HDFS), no further configuration is needed. When you set the file
    path on Amazon S3 or any other Hadoop-compatible file system, add
    the associated Spark advanced configuration parameter.

    In case of errors at runtime, tHMapInput checks whether one of
    these mechanisms exists and, if so, appends the rejected record
    to the designated file. The reject file content is the
    concatenation of the rejected records without any additional
    metadata.

    If the file system you use does not support appending to a file,
    a separate file is created for each rejection. The file uses the
    provided file path as the prefix and adds a suffix made of the
    offset of the record in the input file and the size of the
    rejected record.

Note: Any errors that occur while trying to store a rejected record are logged, and processing continues.
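
As a minimal sketch of the second mechanism, the snippet below shows how the spark.hadoop.talend.transform.reject.file.path property could be set in a standalone Spark application. In a Talend Job you would normally add this key/value pair in the Advanced properties of the Spark configuration instead of writing code; the application name and path below are placeholders, not values taken from this page.

    import org.apache.spark.SparkConf;

    // Sketch only: a real Talend Spark Batch Job generates its own configuration.
    // This merely shows the reject-path property name with a sample value.
    public class RejectPathSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("tHMapInput-reject-demo") // placeholder application name
                    // Rejected records are appended to this file at runtime.
                    .set("spark.hadoop.talend.transform.reject.file.path",
                         "/user/talend/rejects/thmapinput"); // placeholder HDFS path
            // When rejects are routed through a Row > Rejects connection instead,
            // each record carries inputRecord, recordId and errorMessage, as
            // described above.
        }
    }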

Usage

Usage rule

This component is used with a tHDFSConfiguration component which defines the
connection to the HDFS storage.

It is an input component and requires an output flow.

Related scenarios

For a related scenario, see Transforming data in a Spark environment.

