tStem
Enables to standardize data in columns before matching this data.
tStem standardizes data in columns through the process of linguistic
normalization, in which the variant forms of a word are reduced to a common form.
tStem Standard properties
These properties are used to configure tStem running in the Standard Job framework.
The Standard
tStem component belongs to the Data Quality family.
This component is available in Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.
Basic
settings
Schema and |
A schema is a row description, it defines |
 |
Built-in: |
 |
Repository: |
Select Algorithm |
Set a stemming algorithm for each analyzed
Column: list of
Algorithm: Select |
Advanced
settings
tStatCatcher |
Select this check box to collect log data at |
Global
Variables
Global |
ERROR_MESSAGE: the error message generated by the A Flow variable functions during the execution of a component while an After variable To fill up a field or expression with a variable, press Ctrl + For further information about variables, see |
Usage
Usage rule |
This component is an intermediary step. It |
Â
Generating stems for a list of English words
This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.
This basic scenario describes a four-component Job that reads a list of
English words from a one-column delimited file, extracts the stems of the
words, and displays both the list of words and the corresponding stems on
the Run console.
Setting up the Job
-
Drop the following components from the Palette onto the design
workspace: tFileInputDelimited, tMap, tStem, and tLogRow. -
Link the tFileInputDelimited component to the
tMap component
using a Row >
Main
connection. -
Link the tMap
component to the tStem component using a Row > Main connection, and give the output
row connection a name, out in this example. -
Link the tStem
component to the tLogRow component using a Row > Main connection.
Configuring the components
-
Double-click the tFileInputDelimited component to open
its Basic settings
view. -
Browse to the input file, and set basic properties
based on the structure of the input file. In this
example, the input file provides a list of English
words in different variant forms, and does not have
a header. The following is an exact of the file
content.123456789computerizecomputerizedcomputerizingprogramprogrammingcookingcookedcooksevaporable -
Click the […]
button next to Edit
schema to open the Schema dialog box, and
set the input schema, which should contain one
column named Word
in this example.When done, click OK
to close the dialog box. -
Double-click the tMap
component to open the map editor. We will use this
component to map the single-column input flow to a
two-column data flow to feed the tStem component. -
Click the [+] button
to add two columns to the output schema and name
them Fullform and
Stem
respectively. Then, drag the Word column from the input table onto
the Fullform
column, then onto the Stem column, in the output table.When done, click OK
to close the map editor and propagate the changes to
the next component. -
Double-click the tStem component to open its Basic settings view.
-
In the Select
Algorithm table, click in the Algorithm field for the
Stem column,
which will carry the word stems extracted from the
input data, and select English as the algorithm
language. -
Double-click the tLogRow component to open its
Basic settings
view, and select the Table option for better readable
display of the Job execution result.
Executing the Job
-
Press Ctrl+S to save
your Job. -
Press F6 or click the
Run button on the
Run tab to
execute the Job.The list of words read from the input data and their
corresponding stems are displayed on the Run console.
Extracting the stems of English words from a specific DB
column
This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.
This scenario describes a six-component Job that carries out linguistic
normalization on data in the translation column and
extract the base part (word stem) of all English words.
The aim of this Job is to create a kind of dictionary of stems of the English
words listed in the translation column. This dictionary
may be used at a later stage in order to check new words to be put in the
selected table. The extracted English stems are written in an output file
along with the number of their occurrences in the
translation column.
In this scenario, we have already stored the main input schema in the
Repository. For more information about storing schema metadata in the
Repository, see
Talend Studio User
Guide.
The main input table contains eight columns: id_key,
id_lang, translation,
id_status, id_user_trans,
id_user_validate,
id_editor and date. We
want to extract the stem of the English words in the
translation column.
Setting up the Job
-
In the Repository
tree view, expand Metadata – DB
Connections where you have stored the
main input schema and drop the relevant file onto
the design workspace.The Components
dialog box displays with the corresponding component
selected by default. -
Click OK to drop the
tMysqlInput
component onto the workspace.The input table used in this scenario is called
translation. It holds several
columns including the
translation column that holds
the English words we want to stem. -
Drop the following components from the Palette onto the design
workspace: tNormalize, tFilterRow, tStem, tAggregateRow and tFileOutputExcel. -
Connect the component together using the Main links with the
exception of the tFilterRow – tStem connection that should use a
Filter
link.
Configuring the data input
-
Double-click the main input database component to
display its Basic
settings views.The property fields for tMysqlInput are automatically filled
in. If you do not define your input schema locally
in the Repository, fill in the details manually
after selecting Built-in in the Schema and Property Type fields. -
If required, modify the query in the Query box.
In this example, we want to work only on the English
words and this is why theid_lang
is
set to1
.
Configuring the preprocessing process
-
Double-click tNormalize to display its Basic settings view and
define the component properties. -
From the Column to
normalize list, select
translation.This will split the data strings in the
translation column into
words. -
In the Item separator
field, enter the separator which will delimits data
in the translation column, a
space character in this example. -
Double-click tFilterRow to display its Basic settings view and
define the component properties. -
Select the logical operator you want to use in order
to combine simple filtering and advanced
mode. -
In the Conditions
area, click the plus button to add one or more
conditions to the output flow. And then in the
corresponding table column:-
select the input column you want to operate
on, -
select the needed function on the
list, -
select the operator to bind the input column
with the value, -
type in the value for content
filtering.In this example, we want to filter all
words in the translation
column that have less than three letters.
-
Configuring word stem extraction
-
Double-click tStem to
display its Basic
settings view and define the component
properties. -
In the Select
Algorithm area, click in the Algorithm cell that
corresponds to the translation column. And then select
from the list the algorithm language you want to
check the column data against, English in this scenario.
Configuring the data output
-
Double-click tAggregateRow to display its Basic settings view and
define the component properties. -
Click the […]
button next to Edit
schema to open a dialog box. Here you
can define the output flow. -
In the output flow to the right of the dialog box,
click the plus button to add as many columns as you
need in the output flow.In this example, we want to have two output columns,
the translation column and a
new output column called
count.When done, click OK
to close the dialog box and proceed to the next
step. -
In the tAggregateRow
basic settings view and in the Group by area, click the plus button
to add an many lines as needed. Here you can define
the group-by values.-
Click in the Output
column line and select the output
column that will hold the aggregated data, the
translation column in this
example. -
Click in the Input
column position line and select the
input column from which you want to collect the
values to be aggregated, the
translation column in this
example.
-
-
In the Operations
area, click the plus button to add lines for the
columns that will hold the aggregated data. Here you
can define the calculation values.-
Click in the Output
column line and select the destination
column from the list, the
translation column in this
example. -
Click in the Function column line and select any of
the listed operations.In this example, we want to count the
number of distinct stems to be listed only once in
the output column. -
Click in the Input
column position line and select the
input column from which you want to collect the
values to be aggregated, the
id_key column in this
example.
-
-
Double-click tFileOutputExcel to display its
Basic settings
view and define the component properties. -
Set the destination file path and define the settings
of the file according to your needs.
Executing the Job
to execute it.
This file holds the extracted English word stems in
the translation column and the
count of each stem in the count
column.
output file.