tStandardizeRow – Docs for ESB 5.x

tStandardizeRow

tStandardizeRow_icon32.png

Warning

This component is available in the Palette of Talend Studio only if you have subscribed to one of the Talend Platform products.

tStandardizeRow properties

Component family

Data quality

 

Function

tStandardizeRow tokenizes the data flow it receives from the preceding component and applies user-defined parser rules to analyze the data. Based on this analysis, the component normalizes the analyzed data, writes it to a separate data flow, and tags it with the user-defined rule names.

The standardization option adds a supplementary column to the
output flow where the normalized data are then standardized.

The Java library ANTLR is used to parse and tokenize the incoming data. For further information about ANTLR, see http://www.antlr.org/.

Note

This component does not make any changes to your raw data.

Purpose

tStandardizeRow normalizes the incoming data in a separate XML or JSON data flow. This allows you to separate, and if need be standardize, the rule-compliant data from the non-compliant data.

Basic settings

Schema and Edit
schema

A schema is a row description; it defines the number of fields to be processed and passed on to the next component. The schema is either Built-in or stored remotely in the Repository.

Since version 5.6, both the Built-In mode and the Repository mode are
available in any of the Talend solutions.

 

 

Built-in: You create the schema and store it locally for this component only. Related topic: see the Talend Studio User Guide.

 

 

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see the Talend Studio User Guide.

 

Column to parse

Select the column to be parsed from the received data flow.

 

Standardize this field

Select this check box to standardize the rule-compliant data
identified, that is, to replace the duplicates of the identified
data with the corresponding standardized data from a given
index.

For further information about this index
providing standardized data, see tSynonymOutput.

Every time you select or clear this check box, the schema of this component changes automatically, so in a given Job you need to click the activated Sync columns button to fix any inconsistencies in the schema.

 

Generate parser code in Routines

Click this button to enable the data parser of your Studio to
recognize the rules defined in the Conversion
rules
table.

In a given Job, this operation is required to execute a newly created rule; for an existing rule that you have modified, it is required only when the modified rule is of type Enumeration, Format or Combination. For further information about all of the rule types, see Rule types.

 

import.png and export.png

Click the import or export button to exchange a given standardization rule set with the DQ Repository.

– When you click the export button, your Studio switches to the Profiling perspective and the Parser rule Settings view opens on the workspace with the relevant contents filled in automatically. Then, if need be, you can edit the exported rule set and save it to the Libraries > Rules > Parser folder in the DQ Repository tree view.

– When you click the import button, an import wizard opens to help you import the standardization rule set of interest.

For further information, see the Talend Studio User Guide.

 

Conversion rules

Define the rules you need to apply as follows:

– In the Name column, type in the name of the rule you want to use. This name is used as the XML tag or the JSON attribute name and as the token name to label the incoming data identified by this rule.

– In the Type column, select the
type of the rule you need to apply. For further information about
available rule types, see Rule types.

– In the Value column, type in
the syntax of the rule.

– In the Search mode column,
select a search mode from the list. The search modes can be used
only with the Index rule type. For
further information about available search modes, see Search modes for Index rules.

If you use this component in a Talend Map/Reduce Job, bear in mind the
prerequisites for an Index rule
stated in the section Rule types.

A test view is provided to help you create the parser rules of
interest. For further information, see Talend Studio User Guide.

Advanced settings

Advanced options for INDEX rules

Search UNDEFINED fields: select this check box if you want the component to search for undefined tokens in the index run results.

Word distance for partial match (available for the Match partial mode): set the maximum number of words allowed to come inside a sequence of words that may be found in the index. The default value is 1.

Minimum similarity for fuzzy match (based on the Levenshtein algorithm and available for fuzzy modes): set the minimum similarity required for a token to match an index string. The value is between 0 and 1; the default value is 0.8. The closer the value is to 1, the more similar the matched terms must be.
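For intuition only, the following standalone sketch normalizes a Levenshtein distance into a 0-to-1 similarity and applies the 0.8 default threshold. The exact similarity formula used by the component is not spelled out in this documentation; the normalization by the longer string length below is an assumption.

    // Illustrative only: the component's exact similarity formula may differ.
    public class SimilarityDemo {

        // Standard Levenshtein edit distance (two-row dynamic programming).
        static int levenshtein(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                       prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length()];
        }

        // Assumed normalization: 1 - distance / length of the longer string.
        static double similarity(String a, String b) {
            int max = Math.max(a.length(), b.length());
            return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
        }

        public static void main(String[] args) {
            System.out.println(similarity("colour", "color")); // ~0.83, passes a 0.8 threshold
            System.out.println(similarity("cm", "mm"));        // 0.5, rejected
        }
    }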

Output format

XML: this option is selected by
default. It outputs normalized data in XML format.

JSON: select this option to output
normalized data in JSON format.

Pretty print: this option is selected by default. It writes the output over several rows. If you clear this check box, the output is written in one row.

You cannot use the Pretty print option when you use the component as a Map/Reduce component; the output flow is always written in one row.

tStatCatcher
Statistics

Select this check box to collect log data at the component level.
Note that this check box is not
available in the Map/Reduce version of the
component.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component, while an After variable functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio
User Guide.
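For example, in a tJava component placed after this one, the variable can be read from the globalMap. A minimal sketch, assuming the component instance uses the Studio's default name tStandardizeRow_1:

    // Assumption: the component instance is named tStandardizeRow_1 in the Job.
    String errorMessage = (String) globalMap.get("tStandardizeRow_1_ERROR_MESSAGE");
    System.out.println("Last error: " + errorMessage);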

Usage

This component is an intermediary step. It requires an input flow
as well as an output.

This component generates Java code in the Routines node of the Repository tree view. This code implements the
grammar defined in rules. It is automatically compiled to be used by
the component.

When you modify, add or delete a rule, the code must be generated again so that the Job takes the modifications into account and becomes runnable.

To generate it, click the Generate parser code in Routines button.

Usage in Map/Reduce Jobs

If you have subscribed to one of the Talend solutions with Big Data, you can also
use this component as a Map/Reduce component. In a Talend Map/Reduce Job, this
component is used as an intermediate step and other components used along with it must be
Map/Reduce components, too. They generate native Map/Reduce code that can be executed
directly in Hadoop.

For further information about a Talend Map/Reduce Job, see the sections
describing how to create, convert and configure a Talend Map/Reduce Job of the
Talend Big Data Getting Started Guide.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs, not Map/Reduce Jobs.

Connections

Outgoing links (from this component to another):

Row: Main; Reject

Trigger: Run if; On Component Ok;
On Component Error.

Incoming links (from one component to this one):

Row: Main; Reject

For further information regarding connections, see
Talend Studio User
Guide
.

Limitation

n/a

Rule types

Two groups of rule types are provided: the basic rule types and the advanced rule
types.

  • Basic rule types: Enumeration, Format and Combination. Rules of these types are composed with some
    given ANTLR symbols.

  • Advanced rule types: Regex, Index and Shape.
    Rules of these types match the tokenized data and standardize them when
    needed.

The advanced rule types are always executed after the ANTLR-specific rules, regardless of rule order. For further information about basic and advanced rules, see Different rule types for different parsing levels and Scenario 4: Using two parsing levels to extract information from unstructured data.

To create rules of any type, Talend provides the following pre-defined, case-sensitive elements (ANTLR tokens) for defining the composition of a string to be matched:

  • INT: integer;

  • WORD: word;

  • WORD+: literals of several words;

  • CAPWORD: capitalized word;

  • DECIMAL: decimal float;

  • FRACTION: fraction float;

  • CURRENCY: currencies;

  • ROMAN_NUMERAL: Roman numerals;

  • ALPHANUM: combination of alphabetic and numeric characters;

  • WHITESPACE: whitespace

  • UNDEFINED: unexpected strings such as ASCII codes that any other token
    cannot recognize.

The following sections successively present detailed information about the basic types, the advanced types and the ANTLR symbols used by the basic rule types. This information helps you to complete the Conversion rules table in the Basic settings of this component.

For basic rule types:

Enumeration

Usage: A rule of this type provides a list of possible matches.

Example:
RuleName: LengthUnit
RuleValue: " 'inch' | 'cm' "

Conditions of rule composition:
– Each option must be put in a pair of single quotation marks unless this option is a pre-defined element.
– Defined options must be separated by the | symbol.

Format (rule name starts with upper case)

Usage: A rule of this type uses the pre-defined elements along with any user-defined Enumeration, Format or Combination rules to define the composition of a string.

Example:
RuleName: Length
RuleValue: "DECIMAL WHITESPACE LengthUnit"
This rule means that a whitespace between DECIMAL and LengthUnit is required, so it matches strings like 1.4 cm but does not match a string like 1.4cm. To match both of these cases, you need to define this rule as, for example, "DECIMAL WHITESPACE* LengthUnit". LengthUnit is an Enumeration rule defining " 'inch' | 'cm' ".

Conditions of rule composition:
– When the name of a Format rule starts with upper case, the rule requires an exact matching result. This means that you need to define exactly every single element of a string, even a whitespace.

Format (rule name starts with lower case)

Usage: A rule of this type is almost the same as a Format rule whose name starts with upper case. The difference is that a Format rule with a lower-case initial does not require an exact match.

Example:
RuleName: length
RuleValue: "DECIMAL LengthUnit"
The rule matches strings like 1.4 cm or 1.4cm, where DECIMAL is one of the pre-defined element types and LengthUnit is an Enumeration rule defining " 'inch' | 'cm' ".

Conditions of rule composition: n/a

Combination

Usage: A rule of this type is used when you need to create several rules of the same name.

Example:
RuleName: Size (or size)
RuleValue: "length BY length"
The rule matches strings like 1.4 cm by 1.4 cm, where length is a Format rule (starting with lower case) and BY is an Enumeration rule defining " 'By' | 'by' | 'x' | 'X' ".

Conditions of rule composition:
– Literal texts or characters are not accepted as a part of the rule value. When literal texts or characters are needed, you must create an Enumeration rule to define them and then use this Enumeration rule instead.
– When several Combination rules use the identical rule name, they are executed in top-down order in the Conversion rules table of the Basic settings of tStandardizeRow, so arrange them properly in order to obtain the best result. For an example, see the following scenario.

Warning

Any characters or string literals, if accepted by a rule type, must be put in
single quotation marks when used, otherwise they will be treated as ANTLR
grammar symbols or variables and generate errors or unexpected results at
runtime.

For advanced rule types:

Regex

Usage: A rule of this type uses regular expressions to match the incoming data tokenized by ANTLR.

Example:
RuleName: ZipCode
RuleValue: "\d{5}"
The rule matches strings like 92150.

Conditions: Regular expressions must be Java compliant.
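A quick way to verify that a rule value such as \d{5} is a valid Java regular expression is to test it with java.util.regex before typing it into the Conversion rules table. A minimal standalone sketch, for illustration only:

    import java.util.regex.Pattern;

    public class ZipCodeRegexCheck {
        public static void main(String[] args) {
            // The rule value, written as a Java string literal.
            Pattern zip = Pattern.compile("\\d{5}");
            System.out.println(zip.matcher("92150").matches()); // true
            System.out.println(zip.matcher("9215A").matches()); // false
        }
    }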

Index

Usage: A rule of this type uses a synonym index as a reference to search for matching incoming data. For further information about available synonym indexes, see the appendix about data synonym dictionaries in the Talend Studio User Guide.

Example: A scenario is available in Scenario 2: Standardizing addresses from unstructured data.

Conditions:
– In Windows, the backslashes need to be doubled or replaced by slashes / if the path is copied from the file system.
– When processing a record, a given Index rule matches only the first string identified as matchable.
– In a Talend Map/Reduce Job, you need to compress each synonym index to be used as a zip file; moreover, if you use the Talend Oozie scheduler to run that Job, you have to place the zip file in the Hadoop distribution where the Job is run.

Shape

Usage: A rule of this type uses pre-defined elements along with the established Regex or Index rules, or both, to match the incoming data.

Example:
RuleName: Address
RuleValue: "<INT><WORD><StreetType>"
This rule matches addresses like 12 main street, where INT and WORD are pre-defined tokens (rule elements) and StreetType is an Index rule which you define along with this example rule in the Basic settings view of this component. For further information about the Shape rule type, see Scenario 2: Standardizing addresses from unstructured data.

Conditions: Only the contents put between < > are recognized. Any other contents are treated as errors or are omitted.

For the given ANTLR symbols:

Symbol – Meaning

| – alternative

's' – char or string literal

+ – 1 or more

* – 0 or more

? – optional or semantic predicate

~ – match not

Examples of using these symbols are presented in the following scenarios, but you
can also find more examples on the following site:

https://theantlrguy.atlassian.net/wiki/display/ANTLR3/ANTLR+Cheat+Sheet.

Search modes for Index rules

One type of advanced rule used by the tStandardizeRow component is the Index rule. Index rules use synonym indexes as a reference to search for matching data.

If you could not specify what type of match (exact, partial, fuzzy, and so on) an Index rule uses on the input flow, the component would not standardize and output the data you expect. tStandardizeRow therefore allows you to select one of the following search modes for each Index rule you define in the component:

Match all: each word of the input string must exist in the index string, but the index string may contain other words too.

Match all fuzzy: each word of the input string must match similar words in the index string.

Match any: the input string must have at least one word that matches a word of the index string.

Match any fuzzy: the input string must have at least one word that is similar to a word in the index string.

Match exact: the exact input string must match the exact index string.

Match partial: each word of the input string must exist in the index string, but the input string may contain other words too, up to a given limit (1 by default). This means that one word of the input string may not match any word of the index string.

Suppose, for example, that you have the below record in the input flow:

And you have created a color index that has the Extra Deep
Base
string.

If you define an Index rule in tStandardizeRow and set the search mode to Match any, the component will return Extra Deep Base as a color for the above record because the word Extra matches the index string. But if you want the component to return a match only when the exact search string is found in the index, set the search mode of the rule to Match exact, and the component will not return a color for the record.
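To make the difference concrete, here is a minimal sketch, not Talend's implementation, of how Match any and Match exact could decide whether the token EXTRA matches the index string Extra Deep Base. All names are illustrative, and case-insensitive comparison is an assumption:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class SearchModeDemo {

        // Match exact: the exact input string must match the exact index string.
        static boolean matchExact(String input, String indexEntry) {
            return input.equalsIgnoreCase(indexEntry);
        }

        // Match any: at least one word of the input must match a word of the index string.
        static boolean matchAny(String input, String indexEntry) {
            Set<String> indexWords = new HashSet<>(
                    Arrays.asList(indexEntry.toLowerCase().split("\\s+")));
            for (String word : input.toLowerCase().split("\\s+")) {
                if (indexWords.contains(word)) return true;
            }
            return false;
        }

        public static void main(String[] args) {
            String indexEntry = "Extra Deep Base";              // index string
            String token = "EXTRA";                             // token from the input flow
            System.out.println(matchAny(token, indexEntry));    // true: "extra" matches one word
            System.out.println(matchExact(token, indexEntry));  // false: not the full string
        }
    }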

For a Job example, see Scenario 3: Extracting exact match by using Index rules.

Different rule types for different parsing levels

The tStandardizeRow component uses basic rules based on ANTLR grammar and advanced rules defined by Talend, which are not based on ANTLR.

Sometimes ANTLR rules alone cannot meet all your expectations when normalizing and standardizing data. Suppose, for example, that you want to extract the liquid amount in the following three records:

You may start by defining a liquid unit and a liquid amount in basic parser rules
as the following:

parser_rules.png

If you test these rules in the Profiling perspective of the Studio, you can see that they extract 7 L from 7 LUMENS, which is not what you expect: you do not want the word LUMENS to be split into two tokens.

parser_rules2.png

The basic rules you have defined above are ANTLR lexer rules, and lexer rules are used for tokenizing the input string. ANTLR does not provide a word boundary symbol like the \b used in regular expressions. You must therefore be careful when choosing lexer rules, because they define how the input strings will be split into tokens.

You can solve such a problem using two approaches:

The first approach is to define another basic rule that matches a word with a
numeric value in front of it, the Amount rule in this example:

parser_rules3.png

This basic rule is a lexer rule, a Format rule whose name starts with an upper case. If you test this rule in the Profiling perspective of the Studio, you can see that non-liquid amounts are matched by this rule and that the LiquidAmount rule only matches the expected sequence of characters.

parser_rules4.png

The second approach is to use an advanced rule, such as a regular expression, and define a word boundary with \b. You can use a lexer rule to tokenize amounts, matching any word with a numeric value in front of it, and then use a regular expression that matches liquid amounts as follows: a digit, optionally followed by a space, followed by L or ML and terminated by a word boundary.

parser_rules5.png

Note that the regular expression will be applied to the tokens created by the basic lexer rule.
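As a sketch of this second approach, the following standalone Java test shows how the word boundary \b keeps 7 L while rejecting 7 LUMENS. The regular expression is the one this scenario reuses in Scenario 4; the token 20 ML is an invented extra example:

    import java.util.regex.Pattern;

    public class LiquidAmountDemo {
        public static void main(String[] args) {
            // A digit, optional whitespace, L or ML, ended by a word boundary.
            Pattern liquid = Pattern.compile("\\d+\\s*(L|ML)\\b");
            for (String token : new String[] {"7 L", "7 LUMENS", "20 ML"}) {
                System.out.println(token + " -> " + liquid.matcher(token).find());
            }
            // Prints: 7 L -> true, 7 LUMENS -> false, 20 ML -> true
            // \b rejects "7 LUMENS" because a word character follows the "L".
        }
    }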

You cannot check the results of an advanced rule by testing the rule in the Profiling perspective of the Studio as you do with basic rules. The only way to see the results of advanced rules is to use them in a Job. The results will look like the following:

For a Job example about the use of the above rules, see Scenario 4:
Using two parsing levels to extract information from unstructured data
.

Comparing these two approaches, the first one uses only ANTLR grammar and may be more efficient than the second, which requires a second parsing pass to check each token against the regular expression. However, regular expressions allow experienced users to create more advanced rules that could hardly be created using ANTLR alone.

Scenario 1: Normalizing data using rules of basic types

In this scenario, two steps are performed to:

  1. normalize the incoming data (separate the compliant data from the
    non-compliant data) and,

  2. extract the data of interest and display it.

Before replicating these two steps, we need to analyze the source data in order to
figure out what rules need to be composed. For this scenario, the source data is stored
in a .csv file called partsmaster.

Use_Case_tStandardizeRow_rawdata.png

There are 59 rows of raw data in total, but only some of them are shown in the capture above.

Through observation, you can expect that the third row will not be recognized, as it contains Asian characters. Furthermore, you can figure out that:

  • the SKU data contains 34-9923, XC-3211, pb710125 and so on. So the rule
    used to parse the SKU data could be:

    "SKU" (Format): "(DIGIT DIGIT|LETTER LETTER) '-'? DIGIT DIGIT DIGIT DIGIT (DIGIT DIGIT?)? "

  • for the Size data, the correct format is the multiplication of two or
    three lengths plus the length units. Therefore, the rules used to parse
    the Size data could be:

    "LengthUnit" (Enumeration): " 'm' | '''' | 'inch' | 'inches' | '"' "

    "BY" (Enumeration): "'X' | 'x' | 'by' "

    "Length" (Format): "(INT | FRACTION | DECIMAL) WHITESPACE* LengthUnit "

    "Size" (Combination): "Length BY Length BY Length"

    "Size" (Combination): "Length BY Length"

The two Combination rules use the same name, in which case they are executed in the top-down order presented above.

  • for the Weight data, the correct format is the weight plus the weight
    unit. Therefore, the rules used to parse the Weight data are:

    "WeightUnit" (Enumeration): " 'lb' | 'lbs' | 'pounds' | 'Kg' | 'pinds'"

    "Weight" (Format): "(INT | FRACTION | DECIMAL) WHITESPACE* WeightUnit "

Now, you can begin to replicate the two steps of this scenario.

Separating the compliant data and the non-compliant data

In this first step, four components are used. They are

  • tFileInputDelimited
    (partsmaster): this component loads and outputs the
    source data to the component that follows.

  • tStandardizeRow: this component
    normalizes the incoming data and rejects the exception data that it cannot
    recognize.

  • tLogRow: this component is used twice, to display respectively the
    exception data and the output XML data.

Use_Case_tStandardizeRow_Normalization.png

To replicate this step, proceed as the following sections illustrate.

Dropping and linking the components

To replicate this step, proceed as follows:

  1. Set up the file delimited schema for the
    partsmaster.csv file in the Repository of your Studio. For further
    information about how to set up a file delimited schema, see
    Talend Studio User Guide.

  2. Under the Metadata node of the
    Repository of your Studio, select
    the file delimited schema for partsmaster.csv and
    drop it on the design workspace.

  3. From the dialog box that pops up, select tFileInputDelimited.

    Use_Case_tStandardizeRow2.png

    Then the tFileInputDelimited
    (partsmaster) component displays in the design
    workspace.

  4. From the Palette, drop the tStandardizeRow component and two tLogRow components into the design
    workspace.

  5. Right-click tFileInputDelimited
    (partsmaster) to open the contextual
    menu.

  6. From the contextual menu, select Row >
    Main
    link to connect this component to tStandardizeRow.

  7. Accept the schema propagation prompted by the pop-up dialog
    box.

  8. From tStandardizeRow, do the same to
    connect this component to one of the two tLogRow components using Row >
    Main
    link and to the other using Row > Reject link and accept the schema
    propagation.

    You can change the name displayed for each of these components, as has
    been done for one of the tLogRow components, named Exception in this
    scenario. For further information, see the Talend Studio User Guide.

Then you can continue to configure each component to be used.

Configuring the process of normalizing rows

To do this, proceed as follows:

  1. Double-click tStandardizeRow to open
    its Component view.

    Use_Case_tStandardizeRow3.png
  2. In the Column to parse field, select
    SKU_Description_Size_Weight. This is the only
    column that the incoming schema has.

  3. Under the Conversion rules table,
    click the plus button eight times to add eight rows in this
    table.

  4. To complete these rows, type in the rules you have figured out when
    analyzing the raw data at the beginning of this scenario.

    The two Size rules are executed in top-down order. In this example, this
    order allows the component to first match the sizes with three numbers and
    then those with two numbers. If you reverse this order, the component will
    match the first two numbers of all sizes and then treat the last number of
    the three-number sizes as unmatched.

  5. Click the Generate parser code in
    routines
    button.

  6. In the Advanced settings view, leave
    the options selected by default in the Output
    format
    area as they are.

Executing the normalization Job

Press F6 to run this Job.

In the console of the Run view, you can read
the output data structure. Each instance of the XML data is written on a
separate row because the Pretty print check box
is selected in the Advanced settings view of
the tStandardizeRow component.

Use_Case_tStandardizeRow4.png

Corresponding to a raw data row such as 34-9923, Monolithic Membrance, 4′ x 8′, 26 lbs, an XML segment is generated in which the compliant data are separated from the non-compliant data (unmatched). Thus the raw data are normalized.
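A sketch of what such a segment could look like: the SKU, Size and Weight tags come from the rule names defined earlier and the record root is described later in this scenario, while the wrapper shown here for unmatched data is indicative only:

    <record>
      <SKU>34-9923</SKU>
      <UNMATCHED>Monolithic Membrance</UNMATCHED>
      <Size>4′ x 8′</Size>
      <Weight>26 lbs</Weight>
    </record>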

Again from this console, you can read that a row is rejected as it contains
unrecognizable data.

Use_Case_tStandardizeRow5.png

Extracting data of interest from the normalized XML data

This section presents the second step of this scenario. Four components are added
to receive the normalized XML data in place of the tLogRow component which was set for this action in the first step.
The four components are:

  • tExtractXMLField: this component reads
    an input XML field and extracts desired data;

  • tFilterRow: this component filters input
    rows by setting conditions on the selected columns;

  • tLogRow (two times): the two components
    receive and display respectively the valid or invalid data against
    conditions set in tFilterRow.

Use_Case_tStandardizeRow1.png

To replicate the second step, proceed as the following sections illustrate.

Dropping and linking the additional components

To do this, proceed as follows:

  1. On the design workspace, delete the tLogRow component that is connected to tStandardizeRow with Main row link.

  2. From the Palette, drop tExtractXMLField, tFilterRow and two tLogRow components into the design workspace. In this
    scenario, the two tLogRow components
    are named respectively ValidSize and
    InvalidSize.

    For further information about how to rename a component, see
    Talend Studio User Guide.

  3. Connect tExtractXMLField to tFilterRow with Main link as you did in the first step to connect
    partsmaster to tStandardizeRow.

  4. Do the same but with Filter link to
    connect tFilterRow to
    ValidSize.

  5. Connect tFilterRow to
    InvalidSize with Reject link as you did in the first step to connect
    tSandardizeRow to
    Exception.

Configuring the process of extracting the XML fields of interest

To do this, proceed as follows:

  1. Double-click tExtractXMLField to open its Component view.

    Use_Case_tStandardizeRow6.png
  2. Next to Edit schema, click the
    three-dot button to open the schema editor.

    Use_Case_tStandardizeRow7.png
  3. On the left side of this editor, select the
    SKU_Description_Size_Weight row.

  4. Click the Schema_icon_RightArrow.png button to add this row to the right side of the
    editor.

  5. On the right side of this editor, click the plus button on the bottom
    toolbar three times to add three rows.

  6. Rename the three rows as SKU,
    Size, Weight, all of which
    are tags of the normalized XML data.

  7. Click OK to validate this editing and
    accept the schema propagation prompted by the pop-up dialog box. Four
    rows are automatically filled in the Mapping table on the Component view.

  8. In the XML field field of the
    Component view, select the desired
    XML field of the incoming data. In this example, it is
    Normalized_field.

  9. In the Loop XPath Query field, type
    in the name of the XML Tree root tag to replace the default value. In
    this example, type in "/record".

  10. In the XPath query column of the
    Mapping table, type in the XML Tree
    tags corresponding to desired normalized data between quotation marks.
    In this example, they are "SKU", "Size",
    "Weight".

Configuring the process of filtering the rows of interest

To do this, proceed as follows:

  1. Double-click tFilterRow to open its Component view.

    Use_Case_tStandardizeRow8.png
  2. Under the Conditions table, click the
    plus button to add one row.

  3. In the InputColumn column of this
    row, select Size from the drop-down list.

  4. In the Function column, select
    Empty.

  5. In the Operator column, select
    Not equal to.

  6. In the Value column, type in a pair of empty quotation marks: "".

The condition is set up: the normalized data whose Size tag is not empty will be output, while the data whose Size tag is empty will be rejected.

Executing the Job
  • Press F6 to run this Job.

    In the console of the Run view, you
    can read the desired data selected from the normalized XML data
    flow.

    Use_Case_tStandardizeRow9.png

    This Job lists the normalized data of interest in parallel with the
    raw data.

    You can read the data rejected by tFilterRow as well.

    Use_Case_tStandardizeRow10.png

    Note

    For reasons of page space, not all of the results are displayed.

Scenario 2: Standardizing addresses from unstructured data

In this scenario, six components are used to standardize addresses extracted from unstructured input data copied from a website, by matching the data against the data contained in an index previously generated by a Job.

For more information about index creation, see tSynonymOutput.

Drop the following components from the Palette to the
design workspace.

  • tFixedFlowInput: this component holds the
    unstructured data from which the addresses are extracted.

  • tStandardizeRow: this component defines the
    address rules and generates the addresses in XML format with the defined tags.
    This is the process of normalizing and standardizing the initially unstructured
    data.

  • tFilterColumns: this component filters the
    standardized addresses.

  • tExtractXMLField: this component extracts
    the attributes from the Address node of the XML tree in
    order to output every address item in formatted columns.

  • two tLogRow: these components are used to
    display the output data. The first tLogRow
    returns the errors, if any. The second tLogRow
    displays the result in the console.

Use_Case_tStandardizeRow1_s2.png

Before starting up to replicate this scenario, you have to retrieve the content of an
index in order to match the unstructured data with the index data. The content of the
index reads as follows:

Use_Case_tStandardizeRow_s2_Index.png

The left side holds Paris street names and the right side holds the synonyms used in the data. This data will be used as a reference to standardize the address data collected from the website of interest, for example from http://paris.conciergerie.com/travel_paris/address.php.
To replicate this scenario, proceed as the following sections illustrate.

Preparing the unstructured data

To do this, proceed as follows:

  1. Double-click the tFixedFlowInput
    component to display its Basic settings
    view.

    Use_Case_tStandardizeRow2_s2.png
  2. Click the […] button to open the
    [Schema] dialog box, click the
    [+] button to add a column, and name it
    input. Then, click OK to validate and close the editor.

    Use_Case_tStandardizeRow_s2_SchemaFixedFlow.png
  3. In the Mode area, select Use Inline Content (delimited file).

  4. In the Content field, paste the data
    copied from the Website. This data is unstructured.

Then you can continue to configure the standardization process.

Configuring the process of standardizing the unstructured data

To do this, proceed as follows:

  1. Connect the tFixedFlowInput component to
    the tStandardizeRow component with a
    Row > Main link.

  2. Double-click the tStandardizeRow
    component to display its Basic settings
    view.

    Use_Case_tStandardizeRow3_s2.png

    The advanced rule types are always executed after the ANTLR specific rules
    regardless of rule order.

  3. In the Column to parse list, select
    input.

  4. Select the Standardize this field check
    box.

  5. In the Conversion rules table, click six
    times on the [+] button to add six columns.
    Name them and configure them as shown in the capture.

    For the “StreetName” rule:

    • Select the Index type from the Type list.

    • Enter the path to your index in the Value field.

    • Select Match exact as the search mode from the Search mode list.

      By using the Match exact mode, you will extract from the input flow
      only the strings that exactly match the street name indexes.

      For further information about available search modes, see Search modes
      for Index rules.

    Then the other rules are:

    “Zip” (Format): " DIGIT DIGIT DIGIT DIGIT DIGIT "

    “City” (Enumeration): " 'Paris' | 'Paris' | 'PARIS' | 'Bagnolet' "

    “SpecialStreetNumber” (Format): " (INT ('bis'|'ter')) | ( INT '-' INT) "

    “INT” (Shape): "<SpecialStreetNumber>"

    “Address” (Shape): " <INT> <StreetName> <Zip> <City> "

    As advanced rule types are always executed after the ANTLR-specific
    rules, the basic rules “Zip”, “City” and “SpecialStreetNumber” are
    executed first, followed by the advanced rules “INT” and “Address”.

  6. Click the Generate parser code in
    Routines
    button in order to generate the code in the
    Routines.

    Warning

    If you do not click on this button, the Job will not be executed.

  7. In the Advanced settings view, leave the
    options selected by default in the Output
    format
    area as they are.

  8. In the design workspace, right-click the tStandardizeRow component to connect it to the tLogRow below with a Row > Reject
    link.

  9. Double-click the tLogRow component linked
    to the tStandardizeRow component to display
    its Basic settings view.

    Use_Case_tStandardizeRow6_s2.png
  10. In the Mode area, select the Table (print values in cells of a table)
    option.

Then continue to configure the process of filtering and extracting the data of
interest.

Configuring the process of filtering and extracting the data of interest

To do this, proceed as follows:

  1. In the design workspace, connect tStandardizeRow to tFilterColumns with a Row
    > Main link.

  2. In the Basic settings view of tFilterColumns, click the […] button next to the Edit
    schema
    field to open the schema editor.

    Use_Case_tStandardizeRow_s2_FilterColumnsschema.png
  3. Click on the STANDARDIZED_FIELD column, click the
    first arrow to filter this column, and then click OK to close the editor and validate the changes.

  4. In the design workspace, connect tFilterColumns to tExtractXMLField with a Row
    > Main link.

  5. Double-click the tExtractXMLField
    component to display its Basic settings
    view.

  6. Click the […] button next to the
    Edit schema field to open the schema
    editor, add four columns and name them respectively
    number, street,
    zip and city, and then click
    OK to validate and close the
    editor.

    Use_Case_tStandardizeRow_s2_SchemaExtractXML.png
  7. In the XML field field, check that the
    STANDARDIZED_FIELD column is selected.

    The content of the STANDARDIZED_FIELD column is an XML field with a
    <record> tag as root. Under this tag, the structure defined by the rules
    configured in the tStandardizeRow component allows you to extract each
    item of the address, as you can see in the following capture.

    Use_Case_tStandardizeRow_s2_Result1.png
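    For instance, a standardized address record could be structured as
    follows. This is a sketch built from the rule names above, not a verbatim
    console output, and the address values are invented:

    <record>
      <Address>
        <INT>12</INT>
        <StreetName>rue de la Paix</StreetName>
        <Zip>75002</Zip>
        <City>Paris</City>
      </Address>
    </record>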
  8. In the Loop XPath query field, type in
    “/record/Address”.

  9. In the Mapping table, in the XPath query field, enter
    “INT” for the number column,
    “StreetName” for the street
    column, “Zip” for the zip column
    and “City” for the city
    column.

Then you can prepare to execute the Job.

Executing the Job

Before executing the Job, you need to add the tLogRow component to display the execution result.

To do this, perform the following operations:

  1. In the design workspace, connect tExtractXMLField to the second tLogRow with a Row >
    Main link.

  2. Double-click the tLogRow component to
    open its Basic settings view.

    Use_Case_tStandardizeRow6_s2.png
  3. In the Schema list, select Built-In.

  4. Click the […] button next to the
    Edit schema field and check that the
    input and the output schemas in the tLogRow
    are synchronized.

    If not, click on the Sync columns
    button.

  5. In the Mode area, select the Table (print values in cells of a table) option.

Press F6 to execute the Job.

Use_Case_tStandardizeRow8_s2.png

The second tLogRow displays the normalized and
standardized addresses in the console according to the index.

For example, you can see that the Champs Elysées address has been selected and
formatted, and only input street names that exactly match an index string are
extracted.

The first tLogRow does not return anything in
this scenario because no error has been generated.

Scenario 3: Extracting exact match by using Index rules

In this scenario, you will standardize some long descriptions of customer products by
matching the input flow with the data contained in an index. This scenario explains how
to use Index rules to tokenize product data and then
check each token against an index to extract exact match.

For this scenario, you must first create an index by using a Job with the tSynonymOutput component. You need to create indexes for the
brand, range, color and unit of the customer products. Use the tSynonymOutput component to generate the indexes and feed them with
entries and synonyms. The below capture shows an example Job:

use_case-tstandardizerow-index_rules.png

Below is a sample of the generated indexes for this scenario:

use_case-tstandardizerow-indexes.png

Each of the generated indexes has strings (sequences of words) in one column and their corresponding synonyms in the second column. These strings are used as reference data against which the product data, generated by tFixedFlowInput, will be matched. For further information about index creation, see tSynonymOutput.

In this scenario, the generated indexes are defined as context variables. For further information about context variables, see the Talend Studio User Guide.

Setting up the Job

  1. Drop the following components from the Palette to the design workspace: tFixedFlowInput,
    tStandardizeRow
    ,
    tExtractXMLField
    and tFileOutputExcel.

    use_case-tstandardizerow-index_rules2.png
  2. Connect the components together using Main links.

  3. Drop tLogRow to the Job and link
    tStandardizeRow to it using a Reject link.

Generating unstructured data

  1. Double-click tFixedFlowInput to display
    the component Basic settings view.

    use_case-tstandardizerow-index_rules3.png

    This component will generate the unstructured product data from which the
    brand, range, color and unit are extracted.

  2. Click the […] button to open the
    [Schema] dialog box.

    use_case-tstandardizerow-index_rules4.png
  3. Click the [+] button and add two columns,
    name them as Long_Description and
    Supplier_Product_Code. Click
    OK to close the dialog box.

  4. In the Mode area, select Use Inline Content (delimited file).

  5. In the Content field, enter the
    unstructured product data you want to standardize, for example as the
    following:

Defining parsing rules to standardize data

  1. Double-click tStandardizeRow to display
    the component Basic settings view.

    use_case-tstandardizerow-index_rules5.png

    This component helps you to define the rules necessary to standardize the
    unstructured input flow and generates the brand, range, color and unit in
    XML format.

  2. From the Column to parse list, select
    Long_Description.

  3. Select the Standardize this field check
    box.

  4. Define your rules as the following:

    • In the Conversion rules
      table, click on the [+] button
      to add the columns necessary to define the rules.

      This scenario focuses on rules of the Index type. For detailed
      examples of the other rule types defined in the capture above, refer
      to the other tStandardizeRow scenarios.

    • Define three rules as Brand, Range and
      Color.

    • From the Type list, select
      Index and fill in the
      Value field with the
      context variable of the indexes you generated.

      For further information about how to create and use context
      variables, see Talend Studio User
      Guide
      .

    • From the Search mode list,
      select Match exact. Search
      modes are only applicable to the Index rules.

      Using the Match exact mode,
      you will extract from the input flow only the strings that
      exactly match the brand, range and color index strings you
      generated with the tSynonymOutput component.

      For further information about available search modes, see
      Search modes for Index rules

  5. Click the Generate parser code in
    Routines
    button in order to generate the code under the
    Routines folder in the DQ Repository tree view of the Profiling perspective.

    This step is mandatory, otherwise the Job will not be executed.

  6. In the Advanced settings view, leave the
    options selected by default in the Output
    format
    area as they are.

  7. Double-click tLogRow and define the
    component settings in the Basic settings
    view.

  8. In the Mode area, select the Table (print values in cells of a table)
    option.

    This component displays the tokens from the input flow that could not be
    analyzed and matched to any of the index strings.

Extracting exact match

  1. Double-click tExtractXMLField to display
    the component Basic settings view.

    use_case-tstandardizerow-index_rules7.png
  2. Click the […] button next to the
    Edit schema field to open the schema
    editor, add four columns in the output flow and name them as
    Brand, Range,
    Color and Amount. Click
    OK to validate and close the
    editor.

    use_case-tstandardizerow-index_rules6.png
  3. In the XML field field, check that the
    STANDARDIZED_FIELD column is
    selected.

    The content of the STANDARDIZED_FIELD column is an XML field with a
    <record> tag as root. Under this tag, the structure defined by the rules
    configured in the tStandardizeRow component allows you to extract items
    from the input flow.

  4. In the Loop XPath query field, enter “/”
    to define the root as the loop node.

  5. In the XPath query field of the Mapping table, enter respectively
    “record/Brand”, “record/Range”, “record/Color” and “record/Measure”.

  6. Double-click tFileOutputExcel to display
    the component Basic settings view.

    use_case-tstandardizerow-index_rules8.png
  7. Set the destination file name as well as the Sheet name and select the
    Include header check box.

Executing the Job

  1. Save your Job and press F6 to execute
    it.

    The tLogRow component displays on the
    studio console the data the Job could not analyze:

    use_case-tstandardizerow-index_rules9.png

    The normalized and standardized data is written in the output excel
    file.

  2. Right-click the tFileOutputExcel component and select Data viewer.

    A preview of the standardized data opens in the Studio.

    use_case-tstandardizerow-index_rules10.png

    The Brand, Range and Color rules you defined have checked each token in
    the input flow against the index strings, and only input strings that
    exactly match an index string are extracted. Take line 11, for example:
    no color has been extracted from this record because the record does not
    have an exact match to any of the color index strings.

    In the tStandardizeRow basic settings, try changing the search mode for
    the Color rule to Match any, and then execute the Job.

    use_case-tstandardizerow-index_rules11.png

    You can see that a color has now been extracted for the record on line
    11. This record has one word, “EXTRA”, that matches the color index
    “Extra Deep Base”, and according to the Match any search mode that is
    enough to consider that the record has a color.

    The different search modes available in the tStandardizeRow component allow you to standardize and
    output the data you expect.

Scenario 4:
Using two parsing levels to extract information from unstructured data

This scenario describes how to build a set of rules to extract information from unstructured data. It explains how to use a basic ANTLR rule to tokenize data, then how to use an advanced rule to check each token created by ANTLR against a regular expression.

This scenario uses:

  • a tFixedFlowInput component to create the
    unstructured data strings.

  • a tStandardizeRow component to define the
    rules necessary to extract the liquid amounts from the data strings.

  • a tLogRow component to display the output
    data.

use_case-tstandardizerow_rules.png

Setting up the Job

  1. Drop the following components from the Palette onto the design workspace:
    tFixedFlowInput
    ,
    tStandardizeRow
    and tLogRow.

  2. Connect the three components together using the Main links.

Creating the unstructured data

  1. Double-click the tFixedFlowInput
    component to display its Basic settings
    view.

    use_case-tstandardizerow_rules2.png
  2. Click the […] button to open the
    [Schema] dialog box, click the
    [+] button to add a column, name the
    column product and finally click
    OK to validate and close the
    box.

    use_case-tstandardizerow_rules3.png
  3. In the Mode area, select Use Inline Content (delimited file).

  4. In the Content field, enter the following
    three strings:

Creating the parsing rules

  1. Double-click the tStandardizeRow
    component to display its Basic settings
    view.

    use_case-tstandardizerow_rules4.png
  2. From the Column to parse list, select
    product.

  3. In the Conversion rules table, define a
    basic rule and an advanced rule as the following:

    • Click twice on the [+] button to
      add two columns. Name the first as “Amount” and
      the second as “LiquidAmount”.

    • Select Format as the type for
      the basic rule, and define it to read "INT WHITESPACE*
      WORD"
      .

    • Select RegExp as the type for
      the advanced rule, and define it to read
      "\d+\s*(L|ML)\b".

      The advanced rule will be executed after the basic ANTLR rule. The
      “Amount” rule tokenizes the amounts in the three strings: it matches
      any word with a numeric value in front of it. Then the RegExp rule
      checks each token created by ANTLR against the regular expression.

  4. Click the Generate parser code in
    Routines
    button in order to generate the code under the
    Routines folder in the DQ Repository tree view of the Profiling perspective.

    This step is mandatory, otherwise the Job will not be executed.

  5. In the Advanced settings view, leave the
    options selected by default in the Output
    format
    area as they are.

  6. Double-click the tLogRow component and
    select the Table (print values in cells of a
    table)
    option in the Mode
    area.

Executing the Job

  • Save your Job and press F6 to execute
    it.

    use_case-tstandardizerow_rules5.png

    The liquid amount has been extracted from the unstructured data by using a
    basic ANTLR rule that has tokenized amounts followed by an advanced rule
    that has checked each token created by ANTLR against the regular
    expression.

    Each instance of the XML data is written on a separate row because the
    Pretty print check box is selected in
    the Advanced settings of the tStandardizeRow component.

