July 30, 2023

tStem – Docs for ESB 7.x

tStem

Standardizes data in columns before this data is matched.

tStem standardizes data in columns through the process of linguistic
normalization, in which the variant forms of a word are reduced to a common form.
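
To make the idea of reducing variant forms to a common form concrete, here is a minimal sketch in Python. The toy_stem function and its suffix list are hypothetical and far cruder than a real stemming algorithm; this is an illustration only, not the algorithm that tStem applies.

```python
# Toy suffix-stripping stemmer: an illustration only,
# not the algorithm that tStem uses.
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order


def toy_stem(word: str) -> str:
    """Lowercase the word and strip the first matching suffix.

    A suffix is stripped only if at least three letters remain.
    """
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


# The four variant forms below all reduce to the same common form.
for form in ("Walk", "walks", "walked", "walking"):
    print(form, "->", toy_stem(form))
```

A production stemmer handles many more cases (irregular forms, doubled consonants, language-specific rules), which is why the component asks you to choose an algorithm language per column.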

tStem Standard properties

These properties are used to configure tStem running in the Standard Job framework.

The Standard tStem component belongs to the Data Quality family.

This component is available in Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

Basic settings

Schema and Edit schema

A schema is a row description. It defines the number of fields to be
processed and passed on to the next component. The schema is either
Built-in or stored remotely in the Repository.

 

Built-in: You create the schema and store it locally for this component
only. Related topic: see Talend Studio User Guide.

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.
Related topic: see Talend Studio User Guide.

Select Algorithm

Set a stemming algorithm for each analyzed column of the type string.

Column: lists the analyzed columns.

Algorithm: select from the list the language of the stemming algorithm
to apply to the column data.

Advanced settings

tStatCatcher Statistics

Select this check box to collect log data at
the component level.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space
to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.

Usage

Usage rule

This component is an intermediary step. It requires an input flow as
well as an output.

 

Generating stems for a list of English words

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

This basic scenario describes a four-component Job that reads a list of
English words from a one-column delimited file, extracts the stems of the
words, and displays both the list of words and the corresponding stems on
the Run console.
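
The data flow of this scenario can be sketched outside the Studio as follows. This is a rough Python analogy, not generated Job code: toy_stem is a hypothetical stand-in for the stemming algorithm selected in tStem, run_job is an invented helper name, and the sample words are made up for illustration.

```python
import csv
import io


def toy_stem(word):
    """Hypothetical stand-in for the stemming algorithm selected in tStem."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def run_job(source):
    """Read one word per row, copy it into Fullform and Stem, then stem only Stem."""
    rows = []
    for record in csv.reader(source):
        if not record:
            continue  # skip empty lines
        word = record[0]
        row = {"Fullform": word, "Stem": word}  # tMap-like step: Word -> two columns
        row["Stem"] = toy_stem(row["Stem"])     # tStem-like step: only Stem is changed
        rows.append(row)
    return rows


# Invented sample input: a one-column file without a header.
sample = io.StringIO("connected\nconnecting\nconnects\n")
for row in run_job(sample):
    print(f"{row['Fullform']:<12}|{row['Stem']}")  # tLogRow-like table output
```

Duplicating the word into two columns before stemming is what lets the final display show each original word next to its stem.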

Setting up the Job

  1. Drop the following components from the Palette onto the design
    workspace: tFileInputDelimited, tMap, tStem, and tLogRow.
  2. Link the tFileInputDelimited component to the tMap component
    using a Row > Main connection.
  3. Link the tMap component to the tStem component using a
    Row > Main connection, and give the output row connection a
    name, out in this example.
  4. Link the tStem component to the tLogRow component using a
    Row > Main connection.

    tStem_1.png

Configuring the components

  1. Double-click the tFileInputDelimited component to open its
    Basic settings view.

    tStem_2.png

  2. Browse to the input file, and set the basic properties
    based on the structure of the input file. In this
    example, the input file provides a list of English
    words in different variant forms, and does not have
    a header. The following is an excerpt of the file
    content.

  3. Click the […] button next to Edit schema to open the Schema
    dialog box, and set the input schema, which should contain one
    column named Word in this example.

    tStem_3.png

    When done, click OK to close the dialog box.
  4. Double-click the tMap
    component to open the map editor. We will use this
    component to map the single-column input flow to a
    two-column data flow to feed the tStem component.

    tStem_4.png

  5. Click the [+] button to add two columns to the output schema and
    name them Fullform and Stem respectively. Then, drag the Word
    column from the input table onto the Fullform column, then onto
    the Stem column, in the output table.

    When done, click OK to close the map editor and propagate the
    changes to the next component.
  6. Double-click the tStem component to open its Basic settings view.

    tStem_5.png

  7. In the Select Algorithm table, click in the Algorithm field for
    the Stem column, which will carry the word stems extracted from
    the input data, and select English as the algorithm language.
  8. Double-click the tLogRow component to open its Basic settings
    view, and select the Table option for a more readable display of
    the Job execution result.

    tStem_6.png

Executing the Job

  1. Press Ctrl+S to save your Job.
  2. Press F6 or click the Run button on the Run tab to execute the
    Job.

    The list of words read from the input data and their
    corresponding stems are displayed on the Run console.

    tStem_7.png

Extracting the stems of English words from a specific DB column

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

This scenario describes a six-component Job that carries out linguistic
normalization on data in the translation column and
extracts the base part (word stem) of all English words.

The aim of this Job is to create a kind of dictionary of stems of the English
words listed in the translation column. This dictionary
may be used at a later stage in order to check new words to be put in the
selected table. The extracted English stems are written in an output file
along with the number of their occurrences in the
translation column.

In this scenario, we have already stored the main input schema in the
Repository. For more information about storing schema metadata in the
Repository, see Talend Studio User Guide.

The main input table contains eight columns: id_key,
id_lang, translation,
id_status, id_user_trans,
id_user_validate,
id_editor and date. We
want to extract the stem of the English words in the
translation column.

Setting up the Job

  1. In the Repository tree view, expand Metadata > DB Connections,
    where you have stored the main input schema, and drop the
    relevant file onto the design workspace.

    The Components dialog box opens with the corresponding component
    selected by default.

    tStem_8.png

  2. Click OK to drop the tMysqlInput component onto the workspace.

    The input table used in this scenario is called translation. It
    holds several columns including the translation column that holds
    the English words we want to stem.
  3. Drop the following components from the Palette onto the design
    workspace: tNormalize, tFilterRow, tStem, tAggregateRow and
    tFileOutputExcel.
  4. Connect the components together using Main links, with the
    exception of the connection from tFilterRow to tStem, which
    should use a Filter link.

    tStem_9.png

Configuring the data input

  1. Double-click the main input database component to display its
    Basic settings view.

    tStem_10.png

    The property fields for tMysqlInput are automatically filled
    in. If you have not stored your input schema in the Repository,
    fill in the details manually after selecting Built-in in the
    Schema and Property Type fields.
  2. If required, modify the query in the Query box.

    In this example, we want to work only on the English words,
    which is why id_lang is set to 1.
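
The effect of restricting the input to id_lang = 1 can be sketched as follows. This is only an approximation: it uses Python's built-in sqlite3 module with an in-memory database in place of MySQL, the query text is an assumption (the scenario does not show the actual query), english_rows is an invented helper name, and the sample rows are made up.

```python
import sqlite3


def english_rows(conn):
    """Return (id_key, translation) for the rows whose id_lang is 1."""
    # Assumed query shape; the real Job query is not shown in this scenario.
    query = (
        "SELECT id_key, translation FROM translation "
        "WHERE id_lang = ? ORDER BY id_key"
    )
    return conn.execute(query, (1,)).fetchall()


# In-memory stand-in for the translation table, reduced to three columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE translation (id_key INTEGER, id_lang INTEGER, translation TEXT)"
)
conn.executemany(
    "INSERT INTO translation VALUES (?, ?, ?)",
    [(1, 1, "linked tables"), (2, 2, "tables liées"), (3, 1, "running jobs")],
)
print(english_rows(conn))
```

Filtering in the query keeps non-English rows out of the flow before any normalization or stemming takes place.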

Configuring the preprocessing process

  1. Double-click tNormalize to display its Basic settings view and
    define the component properties.

    tStem_11.png

  2. From the Column to normalize list, select translation.

    This will split the data strings in the translation column into
    words.
  3. In the Item separator field, enter the separator that delimits
    data in the translation column, a space character in this
    example.
  4. Double-click tFilterRow to display its Basic settings view and
    define the component properties.

    tStem_12.png

  5. Select the logical operator you want to use to combine simple
    filtering and advanced mode.
  6. In the Conditions area, click the plus button to add one or more
    conditions to the output flow. Then, in the corresponding table
    column:

    • select the input column you want to operate on,

    • select the needed function from the list,

    • select the operator to bind the input column with the value,

    • type in the value for content filtering.

      In this example, we want to filter out all words in the
      translation column that have fewer than three letters.
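
The two preprocessing steps above (splitting the translation column into words, then filtering on word length) can be sketched in Python as follows. The function names normalize and keep_long_words are hypothetical, the sample row is invented, and the sketch assumes that the intent is to drop words shorter than three letters and to ignore empty items; the components themselves may behave differently depending on their settings.

```python
def normalize(rows, column, separator=" "):
    """Emit one row per item found in `column`, copying the other fields."""
    for row in rows:
        for item in row[column].split(separator):
            if item:  # assumption: empty items from repeated separators are ignored
                yield {**row, column: item}


def keep_long_words(rows, column, min_length=3):
    """Keep only the rows whose value has at least `min_length` characters."""
    return [row for row in rows if len(row[column]) >= min_length]


# Invented sample row; only two of the eight columns are shown.
rows = [{"id_key": 1, "translation": "go to the linked tables"}]
words = keep_long_words(normalize(rows, "translation"), "translation")
print([w["translation"] for w in words])
```

Each output row keeps the id_key of the row it came from, which matters later when occurrences are counted.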

Configuring word stem extraction

  1. Double-click tStem to display its Basic settings view and define
    the component properties.

    tStem_13.png

  2. In the Select Algorithm area, click in the Algorithm cell that
    corresponds to the translation column. Then select from the list
    the language of the algorithm to apply to the column data,
    English in this scenario.

Configuring the data output

  1. Double-click tAggregateRow to display its Basic settings view and
    define the component properties.

    tStem_14.png

  2. Click the […] button next to Edit schema to open a dialog box.
    Here you can define the output flow.

    tStem_15.png

  3. In the output flow to the right of the dialog box,
    click the plus button to add as many columns as you
    need in the output flow.

    In this example, we want to have two output columns, the
    translation column and a new output column called count.

    When done, click OK to close the dialog box and proceed to the
    next step.
  4. In the tAggregateRow Basic settings view and in the Group by
    area, click the plus button to add as many lines as needed. Here
    you can define the group-by values.

    • Click in the Output column line and select the output column
      that will hold the aggregated data, the translation column in
      this example.

    • Click in the Input column position line and select the input
      column from which you want to collect the values to be
      aggregated, the translation column in this example.

  5. In the Operations
    area, click the plus button to add lines for the
    columns that will hold the aggregated data. Here you
    can define the calculation values.

    • Click in the Output column line and select the destination
      column from the list, the count column in this example.

    • Click in the Function column line and select any of the listed
      operations.

      In this example, we want to count the occurrences of each stem,
      so that each distinct stem is listed only once in the output
      column.

    • Click in the Input column position line and select the input
      column from which you want to collect the values to be
      aggregated, the id_key column in this example.

  6. Double-click tFileOutputExcel to display its Basic settings view
    and define the component properties.

    tStem_16.png

  7. Set the destination file path and define the settings
    of the file according to your needs.
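
The grouping and counting configured in tAggregateRow can be sketched in Python as follows. The count_stems name and the sample flow are invented for illustration, and the sketch assumes that id_key is never empty, so that counting id_key values per group is the same as counting rows per group.

```python
from collections import Counter


def count_stems(rows, group_by="translation"):
    """Group rows by the stem column and count the rows in each group."""
    counts = Counter(row[group_by] for row in rows)
    # One output row per distinct stem, in first-seen order.
    return [{group_by: stem, "count": n} for stem, n in counts.items()]


# Invented sample flow, as it might look after the stemming step.
stemmed = [
    {"id_key": 1, "translation": "link"},
    {"id_key": 1, "translation": "tabl"},
    {"id_key": 3, "translation": "link"},
]
print(count_stems(stemmed))
```

Grouping on the stem rather than on the original word is what merges variant forms into a single dictionary entry with a combined count.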

Executing the Job

Save the Job and press F6 to execute it.

The output Excel file is written to the defined path. This file holds
the extracted English word stems in the translation column and the
count of each stem in the count column.

The figure below shows an extract of the output file.

tStem_17.png


Document sourced from Talend: https://help.talend.com