tStandardizeRow
Normalizes the incoming data in a separate XML or JSON data flow, separating the rule-compliant data from the non-compliant data.
tStandardizeRow tokenizes the data flow it has received from the preceding component and applies user-defined parser rules to analyze the data. Based on this analysis, the component normalizes the analyzed data, writes it to a separate data flow and tags it with the user-defined rule names. It does not make any changes to your raw data.
The standardization option adds a supplementary column to the output
flow where the normalized data are then standardized.
The Java library ANTLR is used to parse and tokenize the incoming data. For further information about ANTLR, see the ANTLR website at https://www.antlr.org.
Note: Index rules rely on Lucene indexes. As of Studio 7.3, if your indexes were created with version 7.2 or lower, you need to update them. The location of the migration procedure depends on the Studio installation:
- With the installer: /addons/scripts/Lucene_Migration_Tool/README.md
- With no installer: in the license email, click the link in Migration tool for Lucene Indexes from version 4 to version 8
Depending on the Talend
product you are using, this component can be used in one, some or all of the following
Job frameworks:
- Standard: see tStandardizeRow Standard properties.
  The component in this framework is available in Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and in Talend Data Fabric.
- MapReduce: see tStandardizeRow MapReduce properties (deprecated).
  The component in this framework is available in all Talend Platform products with Big Data and in Talend Data Fabric.
- Spark Batch: see tStandardizeRow properties for Apache Spark Batch.
  The component in this framework is available in all Talend Platform products with Big Data and in Talend Data Fabric.
- Spark Streaming: see tStandardizeRow properties for Apache Spark Streaming.
  This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.
Rule types
tStandardizeRow uses two categories of rules: the basic rule types and the advanced rule types.
- Basic rule types: Enumeration, Format and Combination. Rules of these types are composed with some given ANTLR symbols.
- Advanced rule types: Regex, Index and Shape. Rules of these types match the tokenized data and standardize them when needed.
The advanced rule types are always executed after the ANTLR specific rules regardless
of rule order. For further information about basic and advanced rules, see Different rule types for different parsing levels and
Using two parsing levels to extract information from unstructured data.
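For instance, the liquid-amount example developed later on this page pairs one rule of each category (the values are taken from that scenario):

Basic rule (Format): Name “Amount”, Value "INT WHITESPACE* WORD"
Advanced rule (Regex): Name “LiquidAmount”, Value "\d+\s*(L|ML)\b"

The Format rule first tokenizes any word preceded by a numeric value; the Regex rule is then applied to those tokens to keep only true liquid amounts such as 32ML.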
Talend provides the following pre-defined, case-sensitive elements (ANTLR tokens) for defining the composition of a string to be matched:
- INT: integer;
- WORD: word;
- WORD+: literals of several words;
- CAPWORD: capitalized word;
- DECIMAL: decimal float;
- FRACTION: fraction float;
- CURRENCY: currencies;
- ROMAN_NUMERAL: Roman numerals;
- ALPHANUM: combination of alphabetic and numeric characters;
- WHITESPACE: whitespace;
- UNDEFINED: unexpected strings, such as ASCII codes, that no other token can recognize.
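For example, with these tokens, sample strings would be tokenized as follows (an illustration; the exact boundaries depend on the generated lexer):

26 lbs -> INT WHITESPACE WORD
4.5 cm -> DECIMAL WHITESPACE WORD
PROJECT -> CAPWORD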
The following three tables successively present detailed information about the basic
types, the advanced types and the ANTLR symbols used by the basic rule types. These
three tables help you to complete the Conversion rules
table in the Basic settings of this component.
For basic rule types:
Basic Rule Type | Usage | Example | Conditions of rule composition
---|---|---|---
Enumeration | A rule of this type provides a list of possible matches. | RuleName: LengthUnit | Each option must be put in a pair of single quotation marks unless this option is a pre-defined element. Defined options must be separated by a vertical bar.
Format (rule name starts with upper case) | A rule of this type uses the pre-defined elements along with any user-defined Enumeration, Format or Combination rules to define the composition of a string. | RuleName: Length | When the name of a Format rule starts with upper case, this rule requires an exact matching result. This means that you need to define exactly every single element of a string, even a whitespace.
Format (rule name starts with lower case) | A rule of this type is almost the same as a Format rule whose name starts with upper case. The difference is that the Format rule with a lower-case initial does not require an exact match. | RuleName: length. The rule matches strings like 1.4 cm or 1.4cm. | N/A
Combination | A rule of this type is used when you need to create several rules of the same name. | RuleName: Size (or size). The rule matches strings like 1.4 cm by 1.4 cm. | Literal texts or characters are not accepted as a part of the rule value. When literal texts or characters are needed, you must create an Enumeration rule to define them and then use this Enumeration rule instead. When several Combination rules use the same name, they are executed in top-down order.

Note that the literal texts or characters accepted by a rule type must be put in single quotation marks when used; otherwise they will be treated as ANTLR grammar symbols or variables and generate errors or unexpected results at runtime.
For advanced rule types:
Advanced Rule Type | Usage | Example | Conditions
---|---|---|---
Regex | A rule of this type uses regular expressions to match the incoming data tokenized by ANTLR. | RuleName: ZipCode. The rule matches strings like "92150". | Regular expressions must be Java compliant.
Index | A rule of this type uses a synonym index as a reference to search for matching incoming data. | A scenario is available in Standardizing addresses from unstructured data. | On Windows, the backslashes in the index path need to be doubled or replaced by slashes (/) if the path is copied from the file system.
Shape | A rule of this type uses pre-defined elements along with established Regex or Index rules, or both, to match the incoming data. | RuleName: Address. This rule matches addresses. | Only the contents put in < > are recognizable; other contents are considered errors or are omitted.
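For instance, the “Address” Shape rule used in the address scenario later on this page reads:

Name: “Address”
Type: Shape
Value: " <INT> <StreetName> <Zip> <City> "

It recomposes a full address from previously matched elements, each of them referenced inside < >.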
For the given ANTLR symbols:
Symbols | Description
---|---
the vertical bar | alternative
's' | char or string literal
+ | 1 or more
* | 0 or more
? | optional or semantic predicate
~ | match not
For more information about ANTLR symbols, see: https://theantlrguy.atlassian.net/wiki/display/ANTLR3/ANTLR+Cheat+Sheet.
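The SKU rule from the first scenario below combines several of these symbols:

Name: "SKU"
Type: "Format"
Value: "(DIGIT DIGIT|LETTER LETTER) '-'? DIGIT DIGIT DIGIT DIGIT (DIGIT DIGIT?)? "

Here the vertical bar selects between two leading digits and two leading letters, the quoted '-' is a character literal, and the ? makes the hyphen and the trailing digits optional.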
Search modes for Index rules
One type of advanced rule used by the tStandardizeRow component is the Index rule. Index rules use synonym indexes as a reference to search for matching data.
If you could not specify which type of match (exact, partial, fuzzy, and so on) an Index rule applies to the input flow, the rule would not standardize and output the data you expect.
tStandardizeRow allows you to select one of
the following search modes for each Index rule
you define in the component:
Search mode | Description
---|---
Match all | Each word of the input string must exist in the index string.
Match all fuzzy | Each word of the input string must match similar words in the index string.
Match any | The input string should have at least one word that exists in the index string.
Match any fuzzy | The input string should have at least one word that matches similar words in the index string.
Match exact | The exact input string should match the exact index string.
Match partial | Each word of the input string must exist in the index string, allowing for a limited number of extra words (see Word distance for partial match in the Advanced settings).
Suppose, for example, that you have the following record:

DULUX PAINTPOD EXTRA REACH HANDLE

and that you have created a color index that contains the Extra Deep Base string.
If you define an Index rule in tStandardizeRow and set the search mode to Match any, the component will return Extra Deep Base as a color for the above record, because the word EXTRA matches a word of the index string. But if you want the component to return a match only when the exact search string is found in the index, set the search mode of the rule to Match exact; the component will then not return a color for the record.
For a Job example, see Extracting exact match by using Index rules.
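To summarize the example above, with the index entry Extra Deep Base:

Input: DULUX PAINTPOD EXTRA REACH HANDLE
Match any -> returns Extra Deep Base, because the word EXTRA exists in the index string
Match exact -> returns no color, because the full input string never equals the index string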
Different rule types for different parsing levels
The tStandardizeRow component uses basic rules based on ANTLR grammar, and advanced rules that are defined by Talend and are not based on ANTLR. Basic rules tokenize the incoming data; advanced rules then match and, when needed, standardize the resulting tokens. Suppose, for example, that you want to extract the liquid amount in the following three records:
3M PROJECT LAMP 7 LUMENS 32ML
A 5 LUMINES 5 LOW VANILLA 5L 5LIGHT 5 L DULUX L
54MLP FAC 32 ML
You may start by defining a liquid unit and a liquid amount in basic parser rules as
the following:
If you test these rules in the Profiling perspective of the Studio, you can see that they extract 7 L from 7 LUMENS, which is not what you expect: you do not want the word LUMENS to be split into two tokens.
The basic rules you have defined above are ANTLR lexer rules, and lexer rules are used for tokenizing the input string. ANTLR does not provide a word boundary symbol like the \b used in regular expressions. You must therefore be careful when choosing lexer rules, because they define how the input strings are split into tokens.
You can solve such a problem using two approaches:
The first approach is to define another basic rule that matches a word with a numeric
value in front of it, the Amount rule in this example:
This basic rule is a lexer rule, a Format rule whose name starts with an upper-case letter. If you test this rule in the Profiling perspective of the Studio, you can see that non-liquid amounts are matched by this rule and that the LiquidAmount rule only matches the expected sequence of characters.
The second approach is to use an advanced rule such as a regular expression and define a word boundary with \b. You can use a lexer rule to tokenize amounts, matching any word with a numeric value in front of it, and then use a regular expression that matches liquid amounts as follows: a digit optionally followed by a space, followed by L or ML, and terminated by a word boundary.
Note that the regular expression will be applied on the tokens created by the basic
lexer rule.
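Because the regular expressions used in advanced rules must be Java compliant, you can check the behavior of such an expression outside the Studio with plain Java. The sketch below applies the same expression to a few tokens that the Amount lexer rule would produce from the sample records; the class name and token list are illustrative only:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LiquidAmountCheck {
    public static void main(String[] args) {
        // Same expression as the LiquidAmount rule: digits, optional
        // whitespace, then L or ML, terminated by a word boundary (\b).
        Pattern liquidAmount = Pattern.compile("\\d+\\s*(L|ML)\\b");
        // Tokens the basic Amount rule would create from the sample records.
        String[] tokens = {"3M", "7 LUMENS", "32ML", "5 L", "54MLP"};
        for (String token : tokens) {
            Matcher m = liquidAmount.matcher(token);
            // Only 32ML and 5 L match: in 7 LUMENS and 54MLP, the L is
            // followed by another letter, so \b rejects the match.
            System.out.println(token + " -> " + (m.find() ? "liquid amount" : "no match"));
        }
    }
}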
You cannot test advanced rules in the Profiling perspective of the Studio as you do with basic rules. The only means to see the results of advanced rules is to use them in a Job.
The results will look as follows:

3M PROJECT LAMP 7 LUMENS 32ML

<record>
  <Amount>3M</Amount>
  <Amount>7 LUMENS</Amount>
  <LiquidAmount>32ML</LiquidAmount>
  <UNMATCHED>
    <CAPWORD>PROJECT</CAPWORD>
    <CAPWORD>LAMP</CAPWORD>
  </UNMATCHED>
</record>
For a Job example about the use of the above rules, see Using two parsing levels to extract information from unstructured data.
Comparing these two approaches, the first one uses only ANTLR grammar and may be more efficient than the second, which requires a second parsing pass to check each token against the regular expression. On the other hand, regular expressions let experienced users create advanced rules that would be hard to express using ANTLR alone.
tStandardizeRow Standard properties
These properties are used to configure tStandardizeRow running
in the Standard Job framework.
The Standard
tStandardizeRow component belongs to the Data Quality
family.
The component in this framework is available in Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and in Talend Data Fabric.
Basic settings
Schema and Edit schema |
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. |
Built-In: You create and store the schema locally for this component only. |
Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. |
Column to parse |
Select the column to be parsed from the received data flow. |
Standardize this field |
Select this check box to standardize the rule-compliant data, that is, to replace duplicates of the rule-compliant data with the corresponding standardized data from a synonym index. For further information about this index providing the standardized data, see tSynonymOutput. Every time you select or clear this check box, the schema of this component is changed automatically. |
Generate parser code in Routines |
Click this button to enable the data parser of your Studio to recognize the rules defined in the Conversion rules table. In a given Job, when a rule is created, this operation is required for the rule to be taken into account at execution time. |
Import and export |
Click the import or export button to exchange a given standardization rule set with the DQ Repository.
– When you click the export button, your Studio is switched to the Profiling perspective and the rule set is exported to the DQ Repository.
– When you click the import button, an import wizard opens to let you select the rule set to import.
For further information, see Talend Studio User Guide. |
Conversion rules |
Define the rules you need to apply as the following:
– In the Name column, type in a name for the rule.
– In the Type column, select the type of the rule.
– In the Value column, type in the rule value.
– In the Search mode column, select a search mode for Index rules.
A test view is provided to help you create the parser rules you need. |
Advanced settings
Advanced options for INDEX rules |
– Search UNDEFINED tokens
– Word distance for partial match
– Max edits for fuzzy match
Fuzzy match gains much in performance with Max edits for fuzzy match.
Note: Jobs migrated into the Studio from older releases run correctly, but results might slightly differ, because Max edits for fuzzy match is now used in place of Minimum similarity for fuzzy match. |
Output format |
–XML: this option is selected by default. The normalized data are written in XML format.
–JSON: select this option to write the normalized data in JSON format.
–Pretty print: this option is selected by default. Each instance of the output data is then written on a separate row. |
tStatCatcher Statistics |
Select this check box to collect log data at the component level. |
Global Variables
Global Variables |
ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string.
A Flow variable functions during the execution of a component, while an After variable functions after the execution of the component.
To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.
For further information about variables, see Talend Studio User Guide. |
Usage
Usage rule |
This component is an intermediary step. It requires an input flow as well as an output flow.
This component generates Java code in the Routines node of the Repository tree view. This code implements the grammar defined in the Conversion rules table.
When you modify/add/delete a rule, the code must be generated again. To generate it, click the Generate parser code in Routines button. |
Connections |
Outgoing links (from this component to another):
Row: Main; Reject.
Trigger: Run if; On Component Ok; On Component Error.
Incoming links (from one component to this one):
Row: Main; Reject.
For further information regarding connections, see Talend Studio User Guide. |
Normalizing data using rules of basic types
This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.
In this scenario, two steps are performed to:
- normalize the incoming data (separate the compliant data from the non-compliant data), and
- extract the data of interest and display it.
Before replicating these two steps, we need to analyze the source data in order to
figure out what rules need to be composed. For this scenario, the source data is stored
in a .csv file called partsmaster.
There are 59 rows of raw data in total, but only some of them are shown in the capture.
Through observation, you can expect that the third row will not be recognized, as it contains Asian characters. Furthermore, you can figure out that:
- the SKU data contains values such as 34-9923, XC-3211 and pb710125. So the rule used to parse the SKU data could be:

  Name: "SKU"
  Type: "Format"
  Value: "(DIGIT DIGIT|LETTER LETTER) '-'? DIGIT DIGIT DIGIT DIGIT (DIGIT DIGIT?)? "
- for the Size data, the correct format is the multiplication of two or three lengths plus the length units. Therefore, the rules used to parse the Size data could be:

  Name: "LengthUnit"
  Type: "Enumeration"
  Value: " 'm' | ''' | 'inch' | 'inches' | '"'"

  Name: "BY"
  Type: "Enumeration"
  Value: "'X' | 'x' | 'by' "

  Name: "Length"
  Type: "Format"
  Value: "(INT | FRACTION | DECIMAL) WHITESPACE* LengthUnit "

  Name: "Size"
  Type: "Combination"
  Value: "Length BY Length BY Length"

  Name: "Size"
  Type: "Combination"
  Value: "Length BY Length"

  Two Combination rules use the same name; in this case, they are executed in the top-down order presented here.
- for the Weight data, the correct format is the weight plus the weight unit. Therefore, the rules used to parse the Weight data are:

  Name: "WeightUnit"
  Type: "Enumeration"
  Value: " 'lb' | 'lbs' | 'pounds' | 'Kg' | 'pinds'"

  Name: "Weight"
  Type: "Format"
  Value: "(INT | FRACTION | DECIMAL) WHITESPACE* WeightUnit "
Now, you can begin to replicate the two steps of this scenario.
Separating the compliant data and the non-compliant data
In this first step, four components are used. They are:
- tFileInputDelimited (partsmaster): this component loads and outputs the source data to the component that follows.
- tStandardizeRow: this component normalizes the incoming data and rejects the exception data that it cannot recognize.
- tLogRow: this component is used twice, to display respectively the exception data and the output XML data.
To replicate this step, proceed as the following sections illustrate.
Dropping and linking the components
To replicate this step, proceed as follows:
- Set up the file delimited schema for the partsmaster.csv file in the Repository of your Studio. For further information about how to set up a file delimited schema, see Talend Studio User Guide.
- Under the Metadata node of the Repository of your Studio, select the file delimited schema for partsmaster.csv and drop it on the design workspace.
- From the dialog box that pops up, select tFileInputDelimited. The tFileInputDelimited (partsmaster) component then displays in the design workspace.
- From the Palette, drop the tStandardizeRow component and two tLogRow components into the design workspace.
- Right-click tFileInputDelimited (partsmaster) to open the contextual menu.
- From the contextual menu, select Row > Main to connect this component to tStandardizeRow.
- Accept the schema propagation prompted by the pop-up dialog box.
- From tStandardizeRow, do the same to connect this component to one of the two tLogRow components using a Row > Main link and to the other using a Row > Reject link, and accept the schema propagation. You can change the name displayed for each of these components, as has been done for one of the tLogRow components, named Exception in this scenario. For further information, see Talend Studio User Guide.
Then you can continue to configure each component to be used.
Configuring the process of normalizing rows
To do this, proceed as follows:
- Double-click tStandardizeRow to open its Component view.
- In the Column to parse field, select SKU_Description_Size_Weight. This is the only column that the incoming schema has.
- Under the Conversion rules table, click the plus button eight times to add eight rows to this table.
- To complete these rows, type in the rules you have figured out when analyzing the raw data at the beginning of this scenario. The two Size rules are executed in top-down order. In this example, this order allows the component to match first the sizes with three numbers and then those with two numbers. If you reverse this order, the component will match the first two numbers of all sizes and then treat the last number of the three-number sizes as unmatched.
- Click the Generate parser code in Routines button.
- In the Advanced settings view, leave the options selected by default in the Output format area as they are. The Max edits for fuzzy match option is set to 1 by default.
Executing the normalization Job
Press F6 to run this Job.
In the console of the Run view, you can read
the output data structure. Each instance of the XML data is written on a
separate row because the Pretty print check box
is selected in the Advanced settings view of
the tStandardizeRow component.
For a raw data row such as 34-9923, Monolithic Membrance, 4′ x 8′, 26 lbs, an XML segment is generated in which the compliant data are separated from the non-compliant data (unmatched). Thus the raw data are normalized.
Again from this console, you can read that a row is rejected as it contains
unrecognizable data.
Extracting data of interest from the normalized XML data
This section presents the second step of this scenario. Four components are added
to receive the normalized XML data in place of the tLogRow component which was set for this action in the first step.
The four components are:
- tExtractXMLField: this component reads an input XML field and extracts desired data;
- tFilterRow: this component filters input rows by setting conditions on the selected columns;
- tLogRow (used twice): the two components receive and display respectively the data that is valid or invalid against the conditions set in tFilterRow.
To replicate the second step, proceed as the following sections illustrate.
Dropping and linking the additional components
To do this, proceed as follows:
- On the design workspace, delete the tLogRow component that is connected to tStandardizeRow with the Main row link.
- From the Palette, drop tExtractXMLField, tFilterRow and two tLogRow components into the design workspace. In this scenario, the two tLogRow components are named respectively ValidSize and InvalidSize. For further information about how to rename a component, see Talend Studio User Guide.
- Connect tExtractXMLField to tFilterRow with a Main link, as you did in the first step to connect partsmaster to tStandardizeRow.
- Do the same, but with a Filter link, to connect tFilterRow to ValidSize.
- Connect tFilterRow to InvalidSize with a Reject link, as you did in the first step to connect tStandardizeRow to Exception.
Configuring the process of extracting the XML fields of interest
To do this, proceed as follows:
- Double-click tExtractXMLField to open its Component view.
- Next to Edit schema, click the three-dot button to open the schema editor.
- On the left side of this editor, select the SKU_Description_Size_Weight row.
- Click the arrow button to add this row to the right side of the editor.
- On the right side of this editor, click the plus button on the bottom toolbar three times to add three rows.
- Rename the three rows as SKU, Size and Weight, all of which are tags of the normalized XML data.
- Click OK to validate this editing and accept the schema propagation prompted by the pop-up dialog box. Four rows are automatically filled in the Mapping table of the Component view.
- In the XML field field of the Component view, select the desired XML field of the incoming data. In this example, it is Normalized_field.
- In the Loop XPath Query field, type in the name of the XML tree root tag to replace the default value. In this example, type in "/record".
- In the XPath query column of the Mapping table, type in the XML tree tags corresponding to the desired normalized data between quotation marks. In this example, they are "SKU", "Size" and "Weight" (see the sample record below).
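For instance, for the raw row 34-9923, Monolithic Membrance, 4′ x 8′, 26 lbs, the normalized record read by tExtractXMLField may look like the following sketch (the exact layout of the unmatched parts depends on your rules):

<record>
  <SKU>34-9923</SKU>
  <UNMATCHED>
    <CAPWORD>Monolithic</CAPWORD>
    <CAPWORD>Membrance</CAPWORD>
  </UNMATCHED>
  <Size>4' x 8'</Size>
  <Weight>26 lbs</Weight>
</record>

The XPath queries "SKU", "Size" and "Weight" are relative to the /record loop node and map each tag to its output column.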
Configuring the process of filtering the rows of interest
To do this, proceed as follows:
- Double-click tFilterRow to open its Component view.
- Under the Conditions table, click the plus button to add one row.
- In the InputColumn column of this row, select Size from the drop-down list.
- In the Function column, select Empty.
- In the Operator column, select Not equal to.
- In the Value column, type in two quotation marks: "".
The condition is set up: the normalized data whose Size tag is not empty will be output, while those whose Size tag is empty will be rejected.
Executing the Job
Press F6 to run this Job.
In the console of the Run view, you can read the desired data selected from the normalized XML data flow. These are the data of interest extracted from the raw data.
For reasons of page space, not all of the results are displayed.
Standardizing addresses from unstructured data
This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.
In this scenario, six components are used to standardize addresses from
unstructured input data copied from a Website, by matching the data with the data contained
in an index previously generated by a Job.
For more information about index creation, see tSynonymOutput.
Drop the following components from the Palette to the design workspace.
- tFixedFlowInput: this component holds the unstructured data from which the addresses are extracted.
- tStandardizeRow: this component defines the address rules and generates the addresses in XML format with the defined tags. This is the process of normalizing and standardizing the initially unstructured data.
- tFilterColumns: this component filters the standardized addresses.
- tExtractXMLField: this component extracts the attributes from the Address node of the XML tree in order to output every address item in formatted columns.
- two tLogRow components: these components are used to display the output data. The first tLogRow returns the errors, if any. The second tLogRow displays the result in the console.
Before starting to replicate this scenario, you have to retrieve the content of an index in order to match the unstructured data with the index data. The content of the index reads as follows:
The left column holds Paris street names, and the right column holds the synonyms used in the data. The data will be used as references to standardize the address data collected from the website of interest, for example, from http://paris.conciergerie.com/travel_paris/address.php.
To replicate this scenario, proceed as the following sections
illustrate.
Preparing the unstructured data
To do this, proceed as follows:
- Double-click the tFixedFlowInput component to display its Basic settings view.
- Click the […] button to open the Schema dialog box, click the [+] button to add a column, and name it input. Then, click OK to validate and close the editor.
- In the Mode area, select Use Inline Content (delimited file).
- In the Content field, paste the data copied from the Website. This data is unstructured.
Then you can continue to configure the standardization process.
Configuring the process of standardizing the unstructured data
To do this, proceed as follows:
- Connect the tFixedFlowInput component to the tStandardizeRow component with a Row > Main link.
- Double-click the tStandardizeRow component to display its Basic settings view. Remember that the advanced rule types are always executed after the ANTLR specific rules, regardless of rule order.
- In the Column to parse list, select input.
- Select the Standardize this field check box.
- In the Conversion rules table, click six times on the [+] button to add six rows. Name them and configure them as shown in the capture. For the “StreetName” rule:
  - Select the Index type in the Type field.
  - Enter the path to your index in the Value field. Before the path, enter the protocol: file://, or hdfs:// if the index is on a cluster.
  - Select Match exact as the search mode from the Search mode list. By using the Match exact mode, you will extract from the input flow only the strings that exactly match the street name indexes. For further information about available search modes, see Search modes for Index rules.
  The other rules are:

  Name: “Zip”
  Type: Format
  Value: " DIGIT DIGIT DIGIT DIGIT DIGIT "

  Name: “City”
  Type: Enumeration
  Value: " 'Paris' | 'Paris' | 'PARIS' | 'Bagnolet' "

  Name: “SpecialStreetNumber”
  Type: Format
  Value: " (INT ('bis'|'ter')) | ( INT '-' INT) "

  Name: “INT”
  Type: Shape
  Value: "<SpecialStreetNumber>"

  Name: “Address”
  Type: Shape
  Value: " <INT> <StreetName> <Zip> <City> "

  As advanced rule types are always executed after the ANTLR specific rules, the basic rules “Zip”, “City” and “SpecialStreetNumber” are executed first, followed by “INT” and “Address”, the advanced rules.
Click the Generate parser code in
Routines button in order to generate the code in the
Routines.If you do not click on this button, the Job will not be executed. -
In the Advanced settings view, leave the
options selected by default in the Output
format area as they are.The Max edits for fuzzy
match is set to 1 by default. -
In the design workspace, right-click the tStandardizeRow component to connect it to the tLogRow below with a Row > Reject
link. -
Double-click the tLogRow component linked
to the tStandardizeRow component to display
its Basic settings view. -
In the Mode area, select the Table (print values in cells of a table)
option.
Then continue to configure the process of filtering and extracting the data of
interest.
Configuring the process of filtering and extracting the data of interest
To do this, proceed as follows:
- In the design workspace, connect tStandardizeRow to tFilterColumns with a Row > Main link.
- In the Basic settings view of tFilterColumns, click the […] button next to the Edit schema field to open the schema editor.
- Click the STANDARDIZED_FIELD column, click the first arrow to filter this column, and then click OK to close the editor and validate the changes.
- In the design workspace, connect tFilterColumns to tExtractXMLField with a Row > Main link.
- Double-click the tExtractXMLField component to display its Basic settings view.
- Click the […] button next to the Edit schema field to open the schema editor, add four columns named respectively number, street, zip and city, and then click OK to validate and close the editor.
- In the XML field field, check that the STANDARDIZED_FIELD column is selected. The content of the STANDARDIZED_FIELD column is an XML field with a <record> tag as root. Under this tag, the structure defined by the rules configured in the tStandardizeRow component allows you to extract each item of the address, as sketched after these steps.
- In the Loop XPath query field, type in “/record/Address”.
- In the Mapping table, in the XPath query field, enter “INT” for the number column, “StreetName” for the street column, “Zip” for the zip column and “City” for the city column.
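An extracted address record may look like the following sketch (the street number, name and zip values are illustrative):

<record>
  <Address>
    <INT>12</INT>
    <StreetName>Avenue des Champs Elysées</StreetName>
    <Zip>75008</Zip>
    <City>Paris</City>
  </Address>
</record>

With the loop set on /record/Address, the relative XPath queries INT, StreetName, Zip and City fill the number, street, zip and city columns respectively.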
Then you can prepare to execute the Job.
Executing the Job
Before executing the Job, you need to add the tLogRow component to display the execution result.
To do this, perform the following operations:
- In the design workspace, connect tExtractXMLField to the second tLogRow with a Row > Main link.
- Double-click the tLogRow component to open its Basic settings view.
- In the Schema list, select Built-In.
- Click the […] button next to the Edit schema field and check that the input and the output schemas in the tLogRow are synchronized. If not, click the Sync columns button.
- In the Mode area, select the Table (print values in cells of a table) option.
Press F6 to execute the Job.
The second tLogRow displays the normalized and
standardized addresses in the console according to the index.
For example, you can see that the Champs Elysées address has been selected and
formatted, and only input street names that exactly match an index string are
extracted.
The first tLogRow does not return anything in
this scenario because no error has been generated.
Extracting exact match by using Index rules
This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.
In this scenario, you will standardize some long descriptions of customer
products by matching the input flow with the data contained in an index. This scenario
explains how to use Index rules to tokenize product
data and then check each token against an index to extract exact match.
For this scenario, you must first create an index by using a Job with the
tSynonymOutput component. You need to create
indexes for the brand, range, color and unit of the customer products. Use the tSynonymOutput component to generate the indexes and feed
them with entries and synonyms. The below capture shows an example Job:
Below is a sample of the generated indexes for this scenario:
Each of the generated indexes has strings (sequences of words) in one column
and their corresponding synonyms in the second column. These strings are used as a
reference data against which the product data, generated by
tFixedFlowInput, will be matched. For further information about index
creation, see tSynonymOutput.
In this scenario, the generated indexes are defined as context variables. For further information about context variables, see Talend Studio User Guide.
Setting up the Job
- Drop the following components from the Palette to the design workspace: tFixedFlowInput, tStandardizeRow, tExtractXMLField and tFileOutputExcel.
- Connect the components together using Main links.
- Drop tLogRow to the Job and link tStandardizeRow to it using a Reject link.
Generating unstructured data
- Double-click tFixedFlowInput to display the component Basic settings view. This component will generate the unstructured product data from which the brand, range, color and unit are extracted.
- Click the […] button to open the Schema dialog box.
- Click the [+] button and add two columns, named Long_Description and Supplier_Product_Code. Click OK to close the dialog box.
- In the Mode area, select Use Inline Content (delimited file).
- In the Content field, enter the unstructured product data you want to standardize, for example as the following:
HONEYWELL 7DAY SNGL CHNL TIMESWITCH C/W EX LARGE BACKLIT DISPLAY ST9100C1106|ST9100C1006||Data to be standardised||
HONEYWELL AQUASTAT SINGLE IMMERSION T/STAT WITH CAPILLARY 70-140C|L6188B2018U||||
HONEYWELL S PLAN ZONE VALVE CONTROL PACK TW6Z910C|TW6Z910C||||
H/WELL ST6100S1001 24HR 1CH SERVIC TIMER|||||
H/WELL Y603A1133 BOXED Y PLAN |||||
HWEL V8600N2056 LOW VOLT COMBI VALVE |||||
HWEL VR4700C4022 GAS VALVE |||||
DULUX PAINTPOD 2M EXTENSION TUBE |6030950||||
DULUX PAINTPOD REPLACEMENT ROLLER SLEEVE|6030948||||
DULUX PAINTPOD EXTRA REACH HANDLE |6030951||||
DULUX PAINTPOD REPLACEMENT EDGING BRUSH |6030952||||
DT TRADE WEATHERSHIELD ULTIMATE OPAQUE BLACK 1L|5180583||||
DT TRADE WEATHERSHIELD ULTIMATE OPAQUE BLACK 2.5L|5180652||||
DT TRADE WEATHERSHIELD ULTIMATE OPAQUE BLACK 5L|5180744||||
DT TRADE WEATHERSHIELD ULTIMATE OPAQUE EXTRA DEEP BASE 1L|5180584||||
Defining parsing rules to standardize data
- Double-click tStandardizeRow to display the component Basic settings view. This component helps you to define the rules necessary to standardize the unstructured input flow and generates the brand, range, color and unit in XML format.
- From the Column to parse list, select Long_Description.
- Select the Standardize this field check box.
- Define your rules as the following (a sample rule set is sketched at the end of this section):
  - In the Conversion rules table, click the [+] button to add the rows necessary to define the rules. This scenario focuses on the rules of the type Index. For detailed examples about the other rule types defined in the capture above, refer to the other tStandardizeRow scenarios.
  - Define three rules named Brand, Range and Color.
  - From the Type list, select Index and fill in the Value field with the context variable of the indexes you generated. For further information about how to create and use context variables, see Talend Studio User Guide.
  - From the Search mode list, select Match exact. Search modes are only applicable to Index rules. Using the Match exact mode, you will extract from the input flow only the strings that exactly match the brand, range and color index strings you generated with the tSynonymOutput component. For further information about available search modes, see Search modes for Index rules.
- Click the Generate parser code in Routines button in order to generate the code under the Routines folder in the DQ Repository tree view of the Profiling perspective. This step is mandatory; otherwise the Job will not be executed.
- In the Advanced settings view, leave the options selected by default in the Output format area as they are. The Max edits for fuzzy match option is set to 1 by default.
- Double-click tLogRow and define the component settings in the Basic settings view.
- In the Mode area, select the Table (print values in cells of a table) option. This component displays the tokens from the input flow that could not be analyzed and matched to any of the index strings.
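The three Index rules defined above might look like the following sketch; the context variable names are hypothetical and depend on how you saved your index paths:

Name: "Brand"
Type: Index
Value: context.brand_index_path
Search mode: Match exact

Name: "Range"
Type: Index
Value: context.range_index_path
Search mode: Match exact

Name: "Color"
Type: Index
Value: context.color_index_path
Search mode: Match exact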
Extracting exact match
- Double-click tExtractXMLField to display the component Basic settings view.
- Click the […] button next to the Edit schema field to open the schema editor, add four columns in the output flow and name them Brand, Range, Color and Amount. Click OK to validate and close the editor.
- In the XML field field, check that the STANDARDIZED_FIELD column is selected. The content of the STANDARDIZED_FIELD column is an XML field with a <record> tag as root. Under this tag, the structure defined by the rules configured in the tStandardizeRow component allows you to extract items from the input flow.
- In the Loop XPath query field, enter “/” to define the root as the loop node.
- In the XPath query field of the Mapping table, enter respectively “record/Brand”, “record/Range”, “record/Color” and “record/Measure”.
- Double-click tFileOutputExcel to display the component Basic settings view.
- Set the destination file name as well as the Sheet name, and select the Include header check box.
Executing the Job
- Save your Job and press F6 to execute it. The tLogRow component displays on the Studio console the data the Job could not analyze. The normalized and standardized data is written in the output Excel file.
- Right-click the tFileOutputExcel component and select Data viewer. A preview of the standardized data opens in the Studio.
  The Brand, Range and Color rules you defined have checked each token in the input flow against the index strings, and only input strings that exactly match an index string are extracted. Take for example line 11: no color has been extracted from this record, because the record does not have an exact match to any of the color index strings.
  Try changing, in the tStandardizeRow basic settings, the search mode of the Color rule to Match any, and then execute the Job. You can see that a color has now been extracted for the record on line 11. This record has one word, “EXTRA”, that matches the color index “Extra Deep Base”, and that is enough, according to the Match any search mode, to consider that the record has a color.
  The different search modes available in the tStandardizeRow component allow you to standardize and output the data you expect.
Using two parsing levels to extract information from unstructured data
This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.
This scenario describes how to build a set of rules to extract some information from
unstructured data. It explains how to use a basic ANTLR rule to tokenize data then how
to use an advanced rule to check each token created by ANTLR against a regular
expression.
This scenario uses:
- a tFixedFlowInput component to create the unstructured data strings.
- a tStandardizeRow component to define the rules necessary to extract the liquid amounts from the data strings.
- a tLogRow component to display the output data.
Setting up the Job
- Drop the following components from the Palette onto the design workspace: tFixedFlowInput, tStandardizeRow and tLogRow.
- Connect the three components together using Main links.
Creating the unstructured data
- Double-click the tFixedFlowInput component to display its Basic settings view.
- Click the […] button to open the Schema dialog box, click the [+] button to add a column, name the column product, and finally click OK to validate and close the box.
- In the Mode area, select Use Inline Content (delimited file).
- In the Content field, enter the following three strings:
3M PROJECT LAMP 7 LUMENS 32ML
A 5 LUMINES 5 LOW VANILLA 5L 5LIGHT 5 L DULUX L
54MLP FAC 32 ML
Creating the parsing rules
- Double-click the tStandardizeRow component to display its Basic settings view.
- From the Column to parse list, select product.
- In the Conversion rules table, define a basic rule and an advanced rule as the following:
  - Click twice on the [+] button to add two rows. Name the first “Amount” and the second “LiquidAmount”.
  - Select Format as the type for the basic rule, and define it to read "INT WHITESPACE* WORD".
  - Select RegExp as the type for the advanced rule, and define it to read "\d+\s*(L|ML)\b".
  The advanced rule will be executed after the basic ANTLR rule. The “Amount” rule will tokenize the amounts in the three strings: it matches any word with a numeric value in front of it. The “LiquidAmount” RegExp rule will then check each token created by ANTLR against the regular expression.
- Click the Generate parser code in Routines button in order to generate the code under the Routines folder in the DQ Repository tree view of the Profiling perspective. This step is mandatory; otherwise the Job will not be executed.
- In the Advanced settings view, leave the options selected by default in the Output format area as they are. The Max edits for fuzzy match option is set to 1 by default.
- Double-click the tLogRow component and select the Table (print values in cells of a table) option in the Mode area.
Executing the Job
Save your Job and press F6 to execute it.
The liquid amounts are extracted in two passes: the basic ANTLR rule has tokenized the amounts, then the advanced rule has checked each token created by ANTLR against the regular expression.
Each instance of the XML data is written on a separate row, because the Pretty print check box is selected in the Advanced settings of the tStandardizeRow component.
tStandardizeRow MapReduce properties (deprecated)
These properties are used to configure tStandardizeRow running
in the MapReduce Job framework.
The MapReduce
tStandardizeRow component belongs to the Data Quality
family.
The component in this framework is available in all Talend Platform products with Big Data and in Talend Data Fabric.
The MapReduce framework is deprecated from Talend 7.3 onwards. Use Talend Jobs for Apache Spark to accomplish your integration tasks.
Basic settings
Schema and Edit schema |
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. |
Built-In: You create and store the schema locally for this component only. |
Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. |
Column to parse |
Select the column to be parsed from the received data flow. |
Standardize this field |
Select this check box to standardize the rule-compliant data, that is, to replace duplicates of the rule-compliant data with the corresponding standardized data from a synonym index. For further information about this index providing the standardized data, see tSynonymOutput. Every time you select or clear this check box, the schema of this component is changed automatically. |
Generate parser code in Routines |
Click this button to enable the data parser of your Studio to recognize the rules defined in the Conversion rules table. In a given Job, when a rule is created, this operation is required for the rule to be taken into account at execution time. |
Import and export |
Click the import or export button to exchange a given standardization rule set with the DQ Repository.
– When you click the export button, your Studio is switched to the Profiling perspective and the rule set is exported to the DQ Repository.
– When you click the import button, an import wizard opens to let you select the rule set to import.
For further information, see Talend Studio User Guide. |
Conversion rules |
Define the rules you need to apply as the following:
– In the Name column, type in a name for the rule.
– In the Type column, select the type of the rule.
– In the Value column, type in the rule value.
– In the Search mode column, select a search mode for Index rules.
If you use this component in a MapReduce Job, make sure that the index paths point to a location that is accessible from the cluster.
A test view is provided to help you create the parser rules you need. |
Advanced settings
Advanced options for INDEX rules |
– Search UNDEFINED tokens
– Word distance for partial match
– Max edits for fuzzy match
Fuzzy match gains much in performance with Max edits for fuzzy match.
Note: Jobs migrated into the Studio from older releases run correctly, but results might slightly differ, because Max edits for fuzzy match is now used in place of Minimum similarity for fuzzy match. |
Output format |
–XML: this option is selected by default. The normalized data are written in XML format.
–JSON: select this option to write the normalized data in JSON format.
The output will always be written in one single flow. |
Global Variables
Global Variables |
ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string.
A Flow variable functions during the execution of a component, while an After variable functions after the execution of the component.
To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.
For further information about variables, see Talend Studio User Guide. |
Usage
Usage rule |
This component generates code in the Routines node of the Repository tree view. This code implements the grammar defined in the Conversion rules table.
In a Talend Map/Reduce Job, this component is used as an intermediate step, and the other components used along with it must be Map/Reduce components too. They generate native Map/Reduce code that can be executed directly in Hadoop.
For further information about a Talend Map/Reduce Job, see Talend Big Data Getting Started Guide.
Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs. |
Connections |
Outgoing links (from this component to another):
Row: Main; Reject.
Trigger: Run if; On Component Ok; On Component Error.
Incoming links (from one component to this one):
Row: Main; Reject.
For further information regarding connections, see Talend Studio User Guide. |
Related scenarios
No scenario is available for the Map/Reduce version of this component yet.
tStandardizeRow properties for Apache Spark Batch
These properties are used to configure tStandardizeRow running
in the Spark Batch Job framework.
The Spark Batch
tStandardizeRow component belongs to the Data Quality family.
The component in this framework is available in all Talend Platform products with Big Data and in Talend Data Fabric.
Basic settings
Schema and Edit schema |
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. |
Built-In: You create and store the schema locally for this component only. |
Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. |
Column to parse |
Select the column to be parsed from the received data flow. |
Standardize this field |
Select this check box to standardize the rule-compliant data, that is, to replace duplicates of the rule-compliant data with the corresponding standardized data from a synonym index. For further information about this index providing the standardized data, see tSynonymOutput. Every time you select or clear this check box, the schema of this component is changed automatically. |
Generate analyzer code as routine |
Click this button to enable the data parser of your Studio to recognize the rules defined in the Conversion rules table. In a given Job, when a rule is created, this operation is required for the rule to be taken into account at execution time. |
Import and export |
Click the import or export button to exchange a given standardization rule set with the DQ Repository.
– When you click the export button, your Studio is switched to the Profiling perspective and the rule set is exported to the DQ Repository.
– When you click the import button, an import wizard opens to let you select the rule set to import.
For further information, see Talend Studio User Guide. |
Conversion rules |
Define the rules you need to apply as the following:
– In the Name column, type in a name for the rule.
– In the Type column, select the type of the rule.
– In the Value column, type in the rule value.
– In the Search mode column, select a search mode for Index rules.
A test view is provided to help you create the parser rules you need. |
Advanced settings
Advanced options for INDEX rules |
– Search UNDEFINED tokens
– Word distance for partial match
– Max edits for fuzzy match
Fuzzy match gains much in performance with Max edits for fuzzy match.
Note: Jobs migrated into the Studio from older releases run correctly, but results might slightly differ, because Max edits for fuzzy match is now used in place of Minimum similarity for fuzzy match. |
Output format |
–XML: this option is selected by default. The normalized data are written in XML format.
–JSON: select this option to write the normalized data in JSON format. |
Usage
Usage rule |
This component is used as an intermediate step.
This component, along with the Spark Batch component Palette it belongs to, appears only when you are creating a Spark Batch Job.
Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs. |
Connections |
Outgoing links (from this component to another): Row: Main; Reject.
Incoming links (from one component to this one): Row: Main; Reject.
For further information regarding connections, see Talend Studio User Guide. |
Spark Connection |
In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files.
This connection is effective on a per-Job basis. |
Related scenarios
No scenario is available for the Spark Batch version of this component
yet.
tStandardizeRow properties for Apache Spark Streaming
These properties are used to configure tStandardizeRow running
in the Spark Streaming Job framework.
The Spark Streaming
tStandardizeRow component belongs to the Data Quality family.
This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.
Basic settings
Schema and Edit schema |
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. |
Built-In: You create and store the schema locally for this component only. |
Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. |
Column to parse |
Select the column to be parsed from the received data flow. |
Standardize this field |
Select this check box to standardize the rule-compliant data, that is, to replace duplicates of the rule-compliant data with the corresponding standardized data from a synonym index. For further information about this index providing the standardized data, see tSynonymOutput. Every time you select or clear this check box, the schema of this component is changed automatically. |
Generate analyzer code as routine |
Click this button to enable the data parser of your Studio to recognize the rules defined in the Conversion rules table. In a given Job, when a rule is created, this operation is required for the rule to be taken into account at execution time. |
Import and export |
Click the import or export button to exchange a given standardization rule set with the DQ Repository.
– When you click the export button, your Studio is switched to the Profiling perspective and the rule set is exported to the DQ Repository.
– When you click the import button, an import wizard opens to let you select the rule set to import.
For further information, see Talend Studio User Guide. |
Conversion rules |
Define the rules you need to apply as the following:
– In the Name column, type in a name for the rule.
– In the Type column, select the type of the rule.
– In the Value column, type in the rule value.
– In the Search mode column, select a search mode for Index rules.
A test view is provided to help you create the parser rules you need. |
Advanced settings
Advanced options for INDEX rules |
– Search UNDEFINED tokens
– Word distance for partial match
– Max edits for fuzzy match
Fuzzy match gains much in performance with Max edits for fuzzy match.
Note: Jobs migrated into the Studio from older releases run correctly, but results might slightly differ, because Max edits for fuzzy match is now used in place of Minimum similarity for fuzzy match. |
Output format |
–XML: this option is selected by default. The normalized data are written in XML format.
–JSON: select this option to write the normalized data in JSON format. |
Usage
Usage rule |
This component, along with the Spark Streaming component Palette it belongs to, appears only when you are creating a Spark Streaming Job.
This component is used as an intermediate step.
You need to use the Spark Configuration tab in the Run view to define the connection to a given Spark cluster for the whole Job. This connection is effective on a per-Job basis.
For further information about a Talend Spark Streaming Job, see Talend Big Data Getting Started Guide.
Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs. |
Connections |
Outgoing links (from this component to another): Row: Main; Reject.
Incoming links (from one component to this one): Row: Main; Reject.
For further information regarding connections, see Talend Studio User Guide. |
Spark Connection |
In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files.
This connection is effective on a per-Job basis. |
Related scenarios
No scenario is available for the Spark Streaming version of this component
yet.