
tStandardizeRow

Normalizes the incoming data into a separate XML or JSON data flow, in order to
separate, and optionally standardize, the rule-compliant data from the non-compliant data.

tStandardizeRow tokenizes the data flow it receives from the preceding component and
applies user-defined parser rules to analyze the data. Based on this analysis, the
component normalizes the analyzed data, writes it to a separate data flow and tags it
with the user-defined rule names. It does not make any changes to your raw data.

The standardization option adds a supplementary column to the output
flow, in which the normalized data is then standardized.

The Java library ANTLR is used to parse and tokenize the incoming
data. For further information about ANTLR, see http://www.antlr.org/.

Note: This component has been enhanced as of Studio version
7.3. If your indexes were created with version 7.2 or earlier, you need to update them. The
location of the migration procedure depends on the Studio installation:

  • With the installer: /addons/scripts/Lucene_Migration_Tool/README.md
  • With no installer: in the license email, click the link in Migration tool for Lucene Indexes from version 4 to version 8

Depending on the Talend product you are using, this component can be used in one, some
or all of the Job frameworks described in the properties sections below.

Rule types

Two groups of rule types are provided: the basic rule types
and the advanced rule types.

  • Basic rule types: Enumeration, Format and Combination. Rules of these types are composed
    of given ANTLR symbols.
  • Advanced rule types: Regex, Index and Shape. Rules of these types match the tokenized
    data and standardize them when needed.

The advanced rule types are always executed after the ANTLR specific rules regardless
of rule order. For further information about basic and advanced rules, see Different rule types for different parsing levels and
Using two parsing levels to extract information from unstructured data.

To create rules of any type, Talend provides the following pre-defined, case-sensitive
elements (ANTLR tokens) for defining the composition of a string to be matched:

  • INT: integer;
  • WORD: word;
  • WORD+: literals of several words;
  • CAPWORD: capitalized word;
  • DECIMAL: decimal float;
  • FRACTION: fraction float;
  • CURRENCY: currencies;
  • ROMAN_NUMERAL: Roman numerals;
  • ALPHANUM: combination of alphabetic and numeric
    characters;
  • WHITESPACE: whitespace;
  • UNDEFINED: unexpected strings, such as ASCII
    codes, that no other token can recognize.

The following three tables successively present detailed information about the basic
types, the advanced types and the ANTLR symbols used by the basic rule types. These
three tables help you to complete the Conversion rules
table in the Basic settings of this component.

For basic rule types:

Enumeration

Usage: A rule of this type provides a list of possible matches.

Example:

RuleName: LengthUnit

RuleValue: " 'inch' | 'cm' "

Conditions of rule composition: Each option must be put in a pair of single
quotation marks unless the option is a pre-defined element. The defined options
must be separated by the | symbol.

Format (rule name starts with upper case)

Usage: A rule of this type uses the pre-defined elements along with any
user-defined Enumeration, Format or Combination rules to define the
composition of a string.

Example:

RuleName: Length

RuleValue: "DECIMAL WHITESPACE LengthUnit"

This rule requires a whitespace between DECIMAL and LengthUnit, so it matches
strings like 1.4 cm but does not match a string like 1.4cm. To match both of
these cases, define the rule as, for example, "DECIMAL WHITESPACE* LengthUnit".
LengthUnit is an Enumeration rule defined as " 'inch' | 'cm' ".

Conditions of rule composition: When the name of a Format rule starts with
upper case, the rule requires an exact match: you need to define every single
element of the string, even a whitespace.

Format (rule name starts with lower case)

Usage: A rule of this type is almost the same as a Format rule whose name
starts with upper case. The difference is that a Format rule with a lower-case
initial does not require an exact match.

Example:

RuleName: length

RuleValue: "DECIMAL LengthUnit"

This rule matches strings like 1.4 cm or 1.4cm, where DECIMAL is one of the
pre-defined elements and LengthUnit is an Enumeration rule defined as
" 'inch' | 'cm' ".

Conditions of rule composition: N/A

Combination

Usage: A rule of this type is used when you need to create several rules of
the same name.

Example:

RuleName: Size (or size)

RuleValue: "length BY length"

This rule matches strings like 1.4 cm by 1.4 cm, where length is a Format rule
(starting with lower case) and BY is an Enumeration rule defined as
" 'By' | 'by' | 'x' | 'X' ".

Conditions of rule composition: Literal texts or characters are not accepted
as part of the rule value. When literal texts or characters are needed, create
an Enumeration rule to define them and use that Enumeration rule instead. When
several Combination rules use the identical rule name, they are executed in
top-down order in the Conversion rules table of the Basic settings of
tStandardizeRow, so arrange them properly to obtain the best result. For an
example, see the following scenario.

Warning: Any characters or string literals, if accepted by a rule type, must be
put in single quotation marks when used; otherwise they will be treated as ANTLR
grammar symbols or variables and will generate errors or unexpected results at runtime.
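
For example, an Enumeration rule value should read " 'inch' | 'cm' " and not
" inch | cm ": without the single quotation marks, inch and cm would be read as
ANTLR rule or variable names rather than literals.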

For advanced rule types:

Regex

Usage: A rule of this type uses regular expressions to match the incoming data
tokenized by ANTLR.

Example:

RuleName: ZipCode

RuleValue: "\d{5}"

This rule matches strings like 92150.

Conditions: Regular expressions must be Java compliant. A verification sketch
follows this table.

Index

Usage: A rule of this type uses a synonym index as reference to search for
matching incoming data. For further information about available synonym
indexes, see the appendix about data synonym dictionaries in the
Talend Studio User Guide.

Example: A scenario is available in Standardizing addresses from unstructured data.

Conditions: On Windows, the backslashes in the index path must be doubled or
replaced by slashes (/) if the path is copied from the file system. If you run
the Job using Spark Local mode or run the Job locally, the path to the index
folder must start with file:///; if the index is stored in HDFS, the path must
start with hdfs://. When processing a record, a given Index rule matches only
the first string identified as matchable. In a Talend Map/Reduce Job, each
synonym index to be used must be compressed as a zip file.

Shape

Usage: A rule of this type uses pre-defined elements along with established
Regex or Index rules, or both, to match the incoming data.

Example:

RuleName: Address

RuleValue: "<INT><WORD><StreetType>"

This rule matches addresses like 12 main street, where INT and WORD are
pre-defined tokens (rule elements) and StreetType is an Index rule that you
define along with this example rule in the Basic settings view of this
component. For further information about the Shape rule type, see
Standardizing addresses from unstructured data.

Conditions: Only the contents put in < > are recognized. Any other contents
are treated as errors or are omitted.
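
Because Regex rule values must be Java compliant, you can verify a pattern with the
standard java.util.regex API before putting it in the Conversion rules table. A
minimal sketch using the ZipCode example above:

    import java.util.regex.Pattern;

    public class ZipCodeRuleCheck {
        public static void main(String[] args) {
            // Same pattern as the Regex rule value "\d{5}";
            // Java string escaping doubles the backslash.
            Pattern zip = Pattern.compile("\\d{5}");
            System.out.println(zip.matcher("92150").matches()); // true
            System.out.println(zip.matcher("9215A").matches()); // false
        }
    }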

For the given ANTLR symbols:

Symbol   Description
|        alternative
's'      character or string literal
+        one or more
*        zero or more
?        optional, or semantic predicate
~        match not

For more information about ANTLR symbols, see: https://theantlrguy.atlassian.net/wiki/display/ANTLR3/ANTLR+Cheat+Sheet.

Search modes for Index rules

One type of advanced rule used by the tStandardizeRow component is the Index rule.
Index rules use synonym indexes as a reference to search for matching data.

If you could not specify what type of match (exact, partial, fuzzy, and so on) an
Index rule uses on the input flow, the rule would not standardize and output the
data you expect. tStandardizeRow therefore allows you to select one of
the following search modes for each Index rule
you define in the component:

Match all: each word of the input string must exist in the
index string, but the index string may contain other words too.

Match all fuzzy: each word of the input string must match similar
words of the index string.

Match any: the input string must have at least one word that
matches a word of the index string.

Match any fuzzy: the input string must have at least one word that
is similar to a word in the index string.

Match exact: the exact input string must match the exact index
string.

Match partial: each word of the input string must exist in the
index string, but the input string may contain other words too, up to
a given limit (1 by default). This means that up to that number of words
in the input string may not match any word of the index string.

Suppose, for example, that the input flow holds a record such as DT TRADE
WEATHERSHIELD ULTIMATE OPAQUE EXTRA DEEP BASE 1L (the record used in a scenario
later on this page), and that you have created a color index that contains the
Extra Deep Base string.

If you define an Index rule in tStandardizeRow and set the search mode to
Match any, the component returns Extra Deep Base as the color for the above
record, because the word Extra matches the index string. But if you want the
component to return a match only when the exact search string is found in the
index, set the search mode of the rule to Match exact; the component then does
not return a color for the record.

For a Job example, see Extracting exact match by using Index rules.
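
To make these modes concrete, the following standalone Java sketch mimics the
word-level logic of Match all, Match any and Match exact on the example above. This
is a conceptual illustration only: the component actually matches against the
synonym index, not with this simple set logic.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class SearchModeSketch {
        static Set<String> words(String s) {
            return new HashSet<>(Arrays.asList(s.toLowerCase().split("\\s+")));
        }
        static boolean matchAll(String input, String index) {
            // every input word must exist in the index string
            return words(index).containsAll(words(input));
        }
        static boolean matchAny(String input, String index) {
            // one shared word is enough
            Set<String> idx = words(index);
            for (String w : words(input)) if (idx.contains(w)) return true;
            return false;
        }
        static boolean matchExact(String input, String index) {
            // the whole strings must match
            return input.equalsIgnoreCase(index);
        }
        public static void main(String[] args) {
            String index = "Extra Deep Base";
            String input = "WEATHERSHIELD ULTIMATE OPAQUE EXTRA DEEP BASE";
            System.out.println(matchAny(input, index));   // true: EXTRA matches Extra
            System.out.println(matchAll(input, index));   // false: WEATHERSHIELD is not in the index string
            System.out.println(matchExact(input, index)); // false: the strings differ as a whole
        }
    }

The fuzzy variants additionally tolerate small spelling differences between words,
and Match partial tolerates a limited number of unmatched input words.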

Different rule types for different parsing levels

The tStandardizeRow component uses basic rules, which are based on ANTLR grammar,
and advanced rules, which are defined by Talend and are not based on ANTLR.

ANTLR rules alone cannot always meet your expectations when normalizing and
standardizing data. Suppose, for example, that you want to extract the liquid amount
from the following three records:

3M PROJECT LAMP 7 LUMENS 32ML
A 5 LUMINES 5 LOW VANILLA 5L 5LIGHT 5 L DULUX L
54MLP FAC 32 ML

You may start by defining a liquid unit and a liquid amount in basic parser rules as
follows:

tStandardizeRow_1.png

If you test these rules in the Profiling perspective of the Studio, you can see that
they extract 7 L from 7 LUMENS, which is not what you expect: you do not want the
word LUMENS to be split into two tokens.

tStandardizeRow_2.png

The basic rules you have defined above are ANTLR lexer rules, and lexer rules are used
for tokenizing the input string. ANTLR does not provide a word boundary symbol like the
\b used in regular expressions. You must therefore be careful when choosing
lexer rules, because they define how the input strings are split into tokens.

You can solve such a problem using two approaches:

The first approach is to define another basic rule that matches a word with a numeric
value in front of it, the Amount rule in this example:

tStandardizeRow_3.png

This basic rule is a lexer rule, a Format rule whose name
starts with an upper-case letter. If you test this rule in the
Profiling perspective of the Studio, you can see that non-liquid amounts are matched
by this rule and that the LiquidAmount rule only matches the expected sequence of
characters.

tStandardizeRow_4.png

The second approach is to use an advanced rule such as a regular expression and define
a word boundary with \b. You can use a lexer rule to tokenize amounts, matching any
word with a numeric in front of it, and then use a regular expression that matches
liquid amounts as follows: one or more digits, optionally followed by a space,
followed by L or ML and terminated by a word boundary.

tStandardizeRow_5.png

Note that the regular expression will be applied to the tokens created by the basic
lexer rule.
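
Outside the Studio, you can reproduce this two-level logic in plain Java to check
the expected behavior. In this sketch the first pattern is only a stand-in for the
generated ANTLR lexer rule:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LiquidAmountSketch {
        public static void main(String[] args) {
            // First level: tokenize any word with a numeric value in front of it
            // (the real first level is the ANTLR code generated by the Studio).
            Pattern token = Pattern.compile("\\d+\\s*[A-Za-z]+");
            // Second level: the advanced rule's regular expression, with \b as
            // the word boundary that ANTLR lexer rules lack.
            Pattern liquid = Pattern.compile("\\d+\\s*(L|ML)\\b");

            Matcher m = token.matcher("3M PROJECT LAMP 7 LUMENS 32ML");
            while (m.find()) {
                String t = m.group();
                // "7 LUMENS" is rejected: the "L" is followed by "U", so \b fails.
                // "32ML" is accepted: the token ends right after "ML".
                System.out.println(t + " -> liquid amount? " + liquid.matcher(t).find());
            }
        }
    }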

You cannot check the results of an advanced rule by testing it in the
Profiling perspective of the Studio as you do with basic rules. The only way to see
the results of advanced rules is to use them in a Job.
The results will look like the following:

For a Job example about the use of the above rules, see Using two parsing levels to extract information from unstructured data.

Comparing the two approaches, the first uses only ANTLR grammar and may be more
efficient than the second, which requires a second parsing pass to check each
token against the regular expression. But regular expressions allow users
experienced with them to create more advanced rules that could hardly be
created using ANTLR alone.

tStandardizeRow Standard properties

These properties are used to configure tStandardizeRow running
in the Standard Job framework.

The Standard
tStandardizeRow component belongs to the Data Quality
family.

The component in this framework is available in Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and in Talend Data Fabric.

Basic settings

Schema and Edit
schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Column to parse

Select the column to be parsed from the received data flow.

Standardize this field

Select this check box to standardize the rule-compliant data
identified, that is, to replace the duplicates of the identified data with the
corresponding standardized data from a given index.

For further information about this index providing
standardized data, see tSynonymOutput.

Every time you select or clear this check box, the schema of
this component is changed automatically, so in a given Job, you need to click
the activated Sync columns button to fix
the inconsistencies in the schema.

Generate parser code in Routines

Click this button to enable the data parser of your Studio to
recognize the rules defined in the Conversion
rules
table.

In a given Job, this operation is required whenever a rule is created. For an
existing rule that you have modified, the operation is required only when the
modified rule is of type Enumeration, Format or Combination. For further
information about all of the rule types, see Rule types.

tStandardizeRow_6.png and
tStandardizeRow_7.png

Click the import or export button to exchange a given
standardization rule set with the DQ
Repository
.

– When you click the export button, your Studio switches to the
Profiling perspective and the Parser rule Settings view opens on the workspace
with the relevant contents filled in automatically. If need be, you can then
edit the exported rule set and save it to the Libraries >
Rules > Parser folder in the DQ Repository tree view.

– When you click the import button, an import wizard opens
to help you import the standardization rule of interest.

For further information, see
Talend Studio User Guide
.

Conversion rules

Define the rules you need to apply as follows:

– In the Name column,
type in a name of the rule you want to use. This name is used as the XML tag or
the JSON attribute name and the token name to label the incoming data
identified by this rule.

– In the Type column,
select the type of the rule you need to apply. For further information about
available rule types, see Rule types.

– In the Value column,
type in the syntax of the rule.

– In the Search mode
column, select a search mode from the list. The search modes can be used only
with the Index rule type. For further
information about available search modes, see Search modes for Index rules.

A test view is provided to help you create the parser rules of
interest. For further information, see
Talend Studio User Guide
.

Advanced settings

Advanced options for INDEX rules

Search UNDEFINED fields: select this check box if you want the component to
search for undefined tokens in the index run results.

Word distance for partial match (available for the Match partial
mode): set the maximum number of words allowed to come inside a sequence of
words that may be found in the index; the default value is 1.

Max edits for fuzzy match (based on the Levenshtein algorithm and available for
fuzzy modes): select an edit distance, 1 or 2, from the list. Any terms within the
selected edit distance from the input data are matched. With a maximum edit
distance of 2, for example, you can have up to two insertions, deletions or
substitutions. The score for each match is based on the edit distance of that
term. For intuition about edit distances, see the sketch after the note below.

Fuzzy matching gains much in performance with Max edits for fuzzy match.

Note:

Jobs migrated into the Studio from older releases run correctly, but results
might be slightly different because Max edits for fuzzy match is now used in
place of Minimum similarity for fuzzy match.
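
The following Java sketch computes the Levenshtein distance that underlies the
fuzzy modes (a conceptual illustration; the component relies on its index engine
for the actual matching):

    public class EditDistanceSketch {
        // Classic two-row dynamic-programming Levenshtein distance.
        static int levenshtein(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] cur = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                cur[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                      prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = cur; cur = tmp;
            }
            return prev[b.length()];
        }
        public static void main(String[] args) {
            // "pinds" (a misspelling from the scenario below) is within distance 2
            // of "pounds": one substitution (i -> o) and one insertion (u).
            System.out.println(levenshtein("pinds", "pounds")); // 2
            System.out.println(levenshtein("inch", "inches"));  // 2: two insertions
        }
    }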

Output format

XML: this option is
selected by default. It outputs normalized data in XML format.

JSON: select this
option to output normalized data in JSON format.

Pretty print: this
option is selected by default. It allows you to have the output on several
rows. If you clear this check box, you will have the output in one row.
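
As an illustration of the two formats, a record matched by the ZipCode rule from
the Rule types section would be rendered along these lines (a sketch; the exact
structure depends on your rules and options):

XML (with Pretty print):

    <record>
      <ZipCode>92150</ZipCode>
    </record>

JSON:

    {"record":{"ZipCode":"92150"}}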

tStatCatcher
Statistics

Select this check box to collect log data at the component
level.

Global Variables

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl +
Space to access the variable list and choose the variable to use from it.

For further information about variables, see
Talend Studio

User Guide.
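
In the Java code of a Job, an After variable such as this one is read from the
global map, for example in a tJava component placed after this component (the
component name tStandardizeRow_1 is an assumption for this sketch):

    String err = (String) globalMap.get("tStandardizeRow_1_ERROR_MESSAGE");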

Usage

Usage rule

This component is an intermediary step. It requires an input
flow as well as an output.

This component generates Java code in the Routines node of the Repository tree view. This code implements the grammar defined
in rules. It is automatically compiled to be used by the component.

When you modify, add or delete a rule, the code must be generated again so that
the Job takes the modifications into account and remains runnable.

To generate it, click the Generate parser code in Routines button.

Connections

Outgoing links (from this component to another):

Row: Main; Reject

Trigger: Run if; On Component Ok; On
Component Error.

Incoming links (from one component to this one):

Row: Main; Reject

For further information regarding connections, see
Talend Studio User Guide
.

Normalizing data using rules of basic types

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

In this scenario, two steps are performed to:

  1. normalize the incoming data (separate the compliant data from the
    non-compliant data) and,

  2. extract the data of interest and display it.

Before replicating these two steps, we need to analyze the source data in order to
figure out what rules need to be composed. For this scenario, the source data is stored
in a .csv file called partsmaster.

tStandardizeRow_8.png

There are 59 rows of raw data in total, but some of them are not shown in the
capture.

Through observation, you can expect that the third row will not be recognized as it
contains Oriental characters. Furthermore, you can figure out that:

  • the SKU data contains values such as 34-9923, XC-3211 and
    pb710125. So the rule used to parse the SKU data could be the
    following (a Java regex cross-check of this rule appears after this
    list):

    Name

    Type

    Value

    "SKU"

    "Format"

    "(DIGIT DIGIT|LETTER LETTER) '-'? DIGIT DIGIT DIGIT
    DIGIT (DIGIT DIGIT?)? "

  • for the Size data, the correct format is the
    multiplication of two or three lengths plus the length units. Therefore, the
    rules used to parse the Size data could be:

    Name

    Type

    Value

    "LengthUnit"

    "Enumeration"

    " 'm' | ''' | 'inch' | 'inches' | '"'"

    "BY"

    "Enumeration"

    "'X' | 'x' | 'by' "

    "Length"

    "Format"

    "(INT | FRACTION | DECIMAL) WHITESPACE* LengthUnit
    "

    "Size"

    "Combination"

    "Length BY Length BY Length"

    "Size"

    "Combination"

    "Length BY Length"

The two Combination rules use the same name, in which case they are executed in
the top-down order presented in this table.

  • for the Weight data, the correct format is the weight
    plus the weight unit. Therefore, the rules used to parse the
    Weight data are:

    Name

    Type

    Value

    "WeightUnit"

    "Enumeration"

    " 'lb' | 'lbs' | 'pounds' | 'Kg' | 'pinds'"

    "Weight"

    "Format"

    "(INT | FRACTION | DECIMAL) WHITESPACE* WeightUnit
    "

Now, you can begin to replicate the two steps of this scenario.

Separating the compliant data and the non-compliant data

In this first step, four components are used. They are

  • tFileInputDelimited (partsmaster): this component loads and outputs the source data to the component
    that follows.

  • tStandardizeRow: this component normalizes the
    incoming data and rejects the exception data that it cannot recognize.

  • tLogRow: this component is used twice, to
    display the exception data and the output XML data respectively.

tStandardizeRow_9.png

To replicate this step, proceed as the following sections illustrate.

Dropping and linking the components

To replicate this step, proceed as follows:

  1. Set up the file delimited schema for the
    partsmaster.csv file in the Repository of your Studio. For further
    information about how to set up a file delimited schema, see

    Talend Studio User Guide
    .
  2. Under the Metadata node of the
    Repository of your Studio, select
    the file delimited schema for partsmaster.csv and
    drop it on the design workspace.
  3. From the dialog box that pops up, select tFileInputDelimited.

    tStandardizeRow_10.png

    Then the tFileInputDelimited
    (partsmaster) component displays in the design
    workspace.
  4. From the Palette, drop the tStandardizeRow component and two tLogRow components into the design
    workspace.
  5. Right-click tFileInputDelimited
    (partsmaster) to open the contextual
    menu.
  6. From the contextual menu, select Row >
    Main
    link to connect this component to tStandardizeRow.
  7. Accept the schema propagation prompted by the pop-up dialog
    box.
  8. From tStandardizeRow, do the same to
    connect this component to one of the two tLogRow components using Row >
    Main
    link and to the other using Row > Reject link and accept the schema
    propagation.

    You can change the name displayed for each of these components, as
    has been done for one of the tLogRow
    components, named Exception in this scenario. For
    further information, see
    Talend Studio User
    Guide
    .

Then you can continue to configure each component to be used.

Configuring the process of normalizing rows

To do this, proceed as follows:

  1. Double-click tStandardizeRow to open
    its Component view.

    tStandardizeRow_11.png

  2. In the Column to parse field, select
    SKU_Description_Size_Weight. This is the only
    column that the incoming schema has.
  3. Under the Conversion rules table,
    click the plus button eight times to add eight rows in this
    table.
  4. To complete these rows, type in the rules you have figured out when
    analyzing the raw data at the beginning of this scenario.

    The two Size rules are executed in top-down
    order. In this example, this order allows the component to match
    the sizes with three numbers first and then those with two numbers. If
    you reverse this order, the component matches only the first two numbers
    of all sizes and then treats the last number of the three-number sizes
    as unmatched.
  5. Click the Generate parser code in
    routines
    button.
  6. In the Advanced settings view, leave the
    options selected by default in the Output
    format
    area as they are.

    The Max edits for fuzzy match is set to
    1 by default.

Executing the normalization Job

Press F6 to run this Job.

In the console of the Run view, you can read
the output data structure. Each instance of the XML data is written on a
separate row because the Pretty print check box
is selected in the Advanced settings view of
the tStandardizeRow component.

tStandardizeRow_12.png

For a raw data record such as 34-9923, Monolithic
Membrance, 4′ x 8′, 26 lbs, an XML segment is generated in which the
compliant data are separated from the non-compliant data (unmatched). The
raw data are thus normalized.
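
For that record, the generated XML segment could look roughly like the following
(an illustrative sketch only; the exact tag layout, including how unmatched tokens
are wrapped, may differ):

    <record>
      <SKU>34-9923</SKU>
      <unmatched>Monolithic Membrance</unmatched>
      <Size>4′ x 8′</Size>
      <Weight>26 lbs</Weight>
    </record>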

Again from this console, you can read that a row is rejected as it contains
unrecognizable data.

tStandardizeRow_13.png

Extracting data of interest from the normalized XML data

This section presents the second step of this scenario. Four components are added
to receive the normalized XML data in place of the tLogRow component which was set for this action in the first step.
The four components are:

  • tExtractXMLField: this component reads
    an input XML field and extracts desired data;

  • tFilterRow: this component filters input
    rows by setting conditions on the selected columns;

  • tLogRow (two times): the two components
    receive and display respectively the valid or invalid data against
    conditions set in tFilterRow.

tStandardizeRow_14.png

To replicate the second step, proceed as the following sections illustrate.

Dropping and linking the additional components

To do this, proceed as follows:

  1. On the design workspace, delete the tLogRow component that is connected to tStandardizeRow with Main row link.
  2. From the Palette, drop tExtractXMLField, tFilterRow and two tLogRow components into the design workspace. In this
    scenario, the two tLogRow components
    are named respectively ValidSize and
    InvalidSize.

    For further information about how to rename a component, see

    Talend Studio User Guide
    .
  3. Connect tExtractXMLField to tFilterRow with Main link as you did in the first step to connect
    partsmaster to tStandardizeRow.
  4. Do the same but with Filter link to
    connect tFilterRow to
    ValidSize.
  5. Connect tFilterRow to
    InvalidSize with Reject link as you did in the first step to connect
    tSandardizeRow to
    Exception.

Configuring the process of extracting the XML fields of interest

To do this, proceed as follows:

  1. Double click tExtractXMLField to open
    its Component view.

    tStandardizeRow_15.png

  2. Next to Edit schema, click the
    three-dot button to open the schema editor.

    tStandardizeRow_16.png

  3. On the left side of this editor, select the
    SKU_Description_Size_Weight row.
  4. Click the

    tStandardizeRow_17.png

    button to add this row to the right side of the
    editor.

  5. On the right side of this editor, click the plus button on the bottom
    toolbar three times to add three rows.
  6. Rename the three rows as SKU,
    Size, Weight, all of which
    are tags of the normalized XML data.
  7. Click OK to validate this editing and
    accept the schema propagation prompted by the pop-up dialog box. Four
    rows are automatically filled in the Mapping table on the Component view.
  8. In the XML field field of the
    Component view, select the desired
    XML field of the incoming data. In this example, it is
    Normalized_field.
  9. In the Loop XPath Query field, type
    in the name of the XML Tree root tag to replace the default value. In
    this example, type in "/record".
  10. In the XPath query column of the
    Mapping table, type in the XML Tree
    tags corresponding to desired normalized data between quotation marks.
    In this example, they are "SKU", "Size",
    "Weight".

Configuring the process of filtering the rows of interest

To do this, proceed as follows:

  1. Double click tFilterRow to open its
    Component view.

    tStandardizeRow_18.png

  2. Under the Conditions table, click the
    plus button to add one row.
  3. In the InputColumn column of this
    row, select Size from the drop-down list.
  4. In the Function column, select
    Empty.
  5. In the Operator column, select
    Not equal to.
  6. In the Value column, type in "" (a pair of
    quotation marks with nothing between them).

The condition is now set up. Normalized data in which the
Size tag is not empty will be output, while data in which the
Size tag is empty will be rejected.

Executing the Job

Press F6 to run this Job.

In the console of the Run view, you
can read the desired data selected from the normalized XML data
flow.
tStandardizeRow_19.png

This Job lists the normalized data of interest in parallel with the
raw data.
You can read the data rejected by tFilterRow as well.
tStandardizeRow_20.png

Note:

For reasons of space, not all of the results are displayed.

Standardizing addresses from unstructured data

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

In this scenario, six components are used to standardize addresses from
unstructured input data copied from a Website, by matching the data with the data contained
in an index previously generated by a Job.

For more information about index creation, see tSynonymOutput.

Drop the following components from the Palette to the design workspace.

  • tFixedFlowInput: this component holds the
    unstructured data from which the addresses are extracted.

  • tStandardizeRow: this component defines the
    address rules and generates the addresses in XML format with the defined tags. This
    is the process of normalizing and standardizing the initially unstructured data.

  • tFilterColumns: this component filters the
    standardized addresses.

  • tExtractXMLField: this component extracts the
    attributes from the Address node of the XML tree in order to
    output every address item in formatted columns.

  • two tLogRow: these components
    are used to display the output data. The first tLogRow returns the errors, if any. The second tLogRow displays the result in the console.

tStandardizeRow_21.png

Before starting to replicate this scenario, you have to retrieve the
content of an index in order to match the unstructured data with the index data. The
content of the index reads as follows:

tStandardizeRow_22.png

The left side holds Paris street names and the right side holds
the synonyms used in the data. The data will be used as reference to standardize the
address data collected from the website of interest, for example from http://paris.conciergerie.com/travel_paris/address.php.

To replicate this scenario, proceed as the following sections
illustrate.

Preparing the unstructured data

To do this, proceed as follows:

  1. Double-click the tFixedFlowInput
    component to display its Basic settings
    view.

    tStandardizeRow_23.png

  2. Click the […] button to open the
    Schema dialog box, click the
    [+] button to add a column, and name it
    input. Then, click OK to validate and close the editor.

    tStandardizeRow_24.png

  3. In the Mode area, select Use Inline Content (delimited file).
  4. In the Content field, paste the data
    copied from the Website. This data is unstructured.

Then you can continue to configure the standardization process.

Configuring the process of standardizing the unstructured data

To do this, proceed as follows:

  1. Connect the tFixedFlowInput component to
    the tStandardizeRow component with a
    Row > Main link.
  2. Double-click the tStandardizeRow
    component to display its Basic settings
    view.

    tStandardizeRow_25.png

    The advanced rule types are always executed after the ANTLR specific rules
    regardless of rule order.
  3. In the Column to parse list, select
    input.
  4. Select the Standardize this field check
    box.
  5. In the Conversion rules
    table, click six times on the [+] button
    to add six rows. Name and configure them as shown in the capture. For
    the “StreetName” rule:

    1. Select the Index type in the Type field.
    2. Enter the path to your index in the Value field.
      Before the path, enter the protocol: file:// or
      hdfs:// if the index is on a cluster.
    3. Select Match exact as the search mode from the
      Search mode list.

      By using the Match exact mode, you will extract
      from the input flow only the strings that exactly match the street name
      indexes. For further information about available search modes, see Search modes for Index rules.
    Then the other rules are:

    Name                   Type         Value
    “Zip”                  Format       " DIGIT DIGIT DIGIT DIGIT DIGIT "
    “City”                 Enumeration  " 'Paris' | 'paris' | 'PARIS' | 'Bagnolet' "
    “SpecialStreetNumber”  Format       " (INT ('bis'|'ter')) | ( INT '-' INT) "
    “INT”                  Shape        "<SpecialStreetNumber>"
    “Address”              Shape        " <INT> <StreetName> <Zip> <City> "

    As advanced rule types are always executed after the ANTLR-specific
    rules, the basic rules “Zip”,
    “City” and “SpecialStreetNumber”
    are executed first, followed by “INT” and
    “Address”, the advanced rules.

  6. Click the Generate parser code in
    Routines
    button in order to generate the code in the
    Routines.

    If you do not click on this button, the Job will not be executed.
  7. In the Advanced settings view, leave the
    options selected by default in the Output
    format
    area as they are.

    The Max edits for fuzzy
    match
    is set to 1 by default.
  8. In the design workspace, right-click the tStandardizeRow component to connect it to the tLogRow below with a Row > Reject
    link.
  9. Double-click the tLogRow component linked
    to the tStandardizeRow component to display
    its Basic settings view.

    tStandardizeRow_26.png

  10. In the Mode area, select the Table (print values in cells of a table)
    option.

Then continue to configure the process of filtering and extracting the data of
interest.

Configuring the process of filtering and extracting the data of interest

To do this, proceed as follows:

  1. In the design workspace, connect tStandardizeRow to tFilterColumns with a Row
    > Main link.
  2. In the Basic settings view of tFilterColumns, click the […] button next to the Edit
    schema
    field to open the schema editor.

    tStandardizeRow_27.png

  3. Click on the STANDARDIZED_FIELD column, click the
    first arrow to filter this column, and then click OK to close the editor and validate the changes.
  4. In the design workspace, connect tFilterColumns to tExtractXMLField with a Row
    > Main link.
  5. Double-click the tExtractXMLField
    component to display its Basic settings
    view.
  6. Click the […] button next to the
    Edit schema field to open the schema
    editor, add four columns and name them respectively
    number, street,
    zip and city, and then click
    OK to validate and close the
    editor.

    tStandardizeRow_28.png

  7. In the XML field field, check that the
    STANDARDIZED_FIELD column is selected.

    The content of the STANDARDIZED_FIELD column is an
    XML field with a <record> tag as root. Under this tag,
    the structure defined by the rules configured in the tStandardizeRow component allows you to extract each item of the
    address, as you can see in the following capture; a sketch of this
    structure is also shown after this list.
    tStandardizeRow_29.png

  8. In the Loop XPath query field, type in
    “/record/Address”.
  9. In the Mapping table, in the XPath query field, enter
    “INT” for the number column,
    “StreetName” for the street
    column, “Zip” for the zip column
    and “City” for the city
    column.
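
Based on the rules defined above, the STANDARDIZED_FIELD content for one matched
address could look roughly like the following sketch (the number, street and zip
code are hypothetical examples; actual output may also carry unmatched tokens):

    <record>
      <Address>
        <INT>12</INT>
        <StreetName>Rue de Rivoli</StreetName>
        <Zip>75001</Zip>
        <City>Paris</City>
      </Address>
    </record>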

Then you can prepare to execute the Job.

Executing the Job

Before executing the Job, you need to add the tLogRow component to display the execution result.

To do this, perform the following operations:

  1. In the design workspace, connect tExtractXMLField to the second tLogRow with a Row >
    Main link.
  2. Double-click the tLogRow component to
    open its Basic settings view.

    tStandardizeRow_26.png

  3. In the Schema list, select Built-In.
  4. Click the […] button next to the
    Edit schema field and check that the
    input and the output schemas in the tLogRow
    are synchronized.

    If not, click on the Sync columns
    button.
  5. In the Mode area, select the Table (print values in cells of a table) option.

Press F6 to execute the Job.

tStandardizeRow_31.png

The second tLogRow displays the normalized and
standardized addresses in the console according to the index.

For example, you can see that the Champs Elysées address has been selected and
formatted, and only input street names that exactly match an index string are
extracted.

The first tLogRow does not return anything in
this scenario because no error has been generated.

Extracting exact match by using Index rules

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

In this scenario, you will standardize some long descriptions of customer
products by matching the input flow with the data contained in an index. This scenario
explains how to use Index rules to tokenize product
data and then check each token against an index to extract exact matches.

For this scenario, you must first create an index by using a Job with the
tSynonymOutput component. You need to create
indexes for the brand, range, color and unit of the customer products. Use the tSynonymOutput component to generate the indexes and feed
them with entries and synonyms. The below capture shows an example Job:

tStandardizeRow_32.png

Below is a sample of the generated indexes for this scenario:

tStandardizeRow_33.png

Each of the generated indexes has strings (sequences of words) in one column
and their corresponding synonyms in the second column. These strings are used as
reference data against which the product data generated by
tFixedFlowInput will be matched. For further information about index
creation, see tSynonymOutput.

In this scenario, the generated indexes are defined as context variables. For
further information about context variables, see
Talend Studio User Guide
.

Setting up the Job

  1. Drop the following components from the Palette to the design workspace: tFixedFlowInput,
    tStandardizeRow
    ,
    tExtractXMLField
    and tFileOutputExcel.

    tStandardizeRow_34.png

  2. Connect the components together using Main links.
  3. Drop tLogRow to the Job and link
    tStandardizeRow to it using a Reject link.

Generating unstructured data

  1. Double-click tFixedFlowInput to display
    the component Basic settings view.

    tStandardizeRow_35.png

    This component will generate the unstructured product data from which the
    brand, range, color and unit are extracted.
  2. Click the […] button to open the
    Schema dialog box.

    tStandardizeRow_36.png

  3. Click the [+] button and add two columns,
    name them as Long_Description and
    Supplier_Product_Code. Click
    OK to close the dialog box.
  4. In the Mode area, select Use Inline Content (delimited file).
  5. In the Content field, enter the
    unstructured product data you want to standardize, for example as the
    following:HONEYWELL 7DAY SNGL CHNL TIMESWITCH C/W EX LARGE BACKLIT DISPLAY ST9100C1106|ST9100C1006||Data to be standardised||
    HONEYWELL AQUASTAT SINGLE IMMERSION T/STAT WITH CAPILLARY 70-140C|L6188B2018U||||
    HONEYWELL S PLAN ZONE VALVE CONTROL PACK TW6Z910C|TW6Z910C||||
    H/WELL ST6100S1001 24HR 1CH SERVIC TIMER|||||
    H/WELL Y603A1133 BOXED Y PLAN |||||
    HWEL V8600N2056 LOW VOLT COMBI VALVE |||||
    HWEL VR4700C4022 GAS VALVE |||||
    DULUX PAINTPOD 2M EXTENSION TUBE |6030950||||
    DULUX PAINTPOD REPLACEMENT ROLLER SLEEVE|6030948||||
    DULUX PAINTPOD EXTRA REACH HANDLE |6030951||||
    DULUX PAINTPOD REPLACEMENT EDGING BRUSH |6030952||||
    DT TRADE WEATHERSHIELD ULTIMATE OPAQUE BLACK 1L|5180583||||
    DT TRADE WEATHERSHIELD ULTIMATE OPAQUE BLACK 2.5L|5180652||||
    DT TRADE WEATHERSHIELD ULTIMATE OPAQUE BLACK 5L|5180744||||
    DT TRADE WEATHERSHIELD ULTIMATE OPAQUE EXTRA DEEP BASE 1L|5180584||||

Defining parsing rules to standardize data

  1. Double-click tStandardizeRow to display
    the component Basic settings view.

    tStandardizeRow_37.png

    This component helps you to define the rules necessary to standardize the
    unstructured input flow and generates the brand, range, color and unit in
    XML format.
  2. From the Column to parse
    list, select Long_Description.
  3. Select the Standardize this field check
    box.
  4. Define your rules as the following:

    1. In the Conversion rules table, click on the
      [+] button to add the columns necessary to define
      the rules.

      This scenario focuses on rules of the type
      Index. For detailed examples of the other
      rule types defined in the capture above, see the other
      tStandardizeRow scenarios.

    2. Define three rules as Brand,
      Range and Color.
    3. From the Type list, select
      Index and fill in the
      Value field with the context variable of the
      indexes you generated.

      For further information about how to create and use context
      variables, see
      Talend Studio User Guide
      .
    4. From the Search mode list, select
      Match exact. Search modes are only applicable to
      the Index rules.

      Using the Match exact mode, you will extract
      from the input flow only the strings that exactly match the brand,
      range and color index strings you generated with the
      tSynonymOutput component. For further
      information about available search modes, see Search modes for Index rules.

  5. Click the Generate parser code in
    Routines
    button in order to generate the code under the
    Routines folder in the DQ Repository tree view of the
    Profiling
    perspective.

    This step is mandatory, otherwise the Job will not be executed.
  6. In the Advanced settings view, leave the
    options selected by default in the Output
    format
    area as they are.

    The Max edits for fuzzy
    match
    is set to 1 by default.
  7. Double-click tLogRow and define the
    component settings in the Basic settings
    view.
  8. In the Mode area, select the Table (print values in cells of a table)
    option.

    This component displays the tokens from the input flow that could not be
    analyzed and matched to any of the index strings.

Extracting exact match

  1. Double-click tExtractXMLField to display
    the component Basic settings view.

    tStandardizeRow_38.png

  2. Click the […] button next to the
    Edit schema field to open the schema
    editor, add four columns in the output flow and name them as
    Brand, Range,
    Color and Amount. Click
    OK to validate and close the
    editor.

    tStandardizeRow_39.png

  3. In the XML field field, check that the
    STANDARDIZED_FIELD column is
    selected.

    The content of the STANDARDIZED_FIELD
    column is an XML field with a <record> tag as root. Under
    this tag, the structure defined by the rules configured in the tStandardizeRow component allows you to extract items
    from the input flow.
  4. In the Loop XPath query field, enter “/”
    to define the root as the loop node.
  5. In the XPath query field in the Mapping table, enter
    “record/Brand”,
    “record/Range”,
    “record/Color”
    and “record/Measure” respectively.
  6. Double-click tFileOutputExcel to display
    the component Basic settings view.

    tStandardizeRow_40.png

  7. Set the destination file name as well as the Sheet name and select the
    Include header check box.

Executing the Job

  1. Save your Job and press F6 to execute
    it.

    The tLogRow component displays on the
    Studio console the data the Job could not analyze:
    tStandardizeRow_41.png

    The normalized and standardized data is written in the output excel
    file.
  2. Right-click the tFileOutputExcel
    component and select Data viewer.

    A preview of the standardized data is opened in the Studio.
    tStandardizeRow_42.png

    The Brand, Range and
    Color rules you defined have checked each token in
    the input flow against the index strings, and only input strings that exactly
    match an index string are extracted. Take line 11 for example: no color has
    been extracted from this record because it has no exact match to any of the
    color index strings.
    In the tStandardizeRow basic settings, change
    the search mode for the Color rule to
    Match any, and then execute the Job again.
    tStandardizeRow_43.png

    You can see that a color has now been extracted for the record on line 11.
    This record has one word, “EXTRA”, that matches the color index “Extra
    Deep Base”, and that is enough according to the Match
    any search mode to consider that the record has a color.
    The different search modes available in the tStandardizeRow component allow you to standardize and
    output the data you expect.

Using two parsing levels to extract information from unstructured data

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

This scenario describes how to build a set of rules to extract information from
unstructured data. It explains how to use a basic ANTLR rule to tokenize the data,
then how to use an advanced rule to check each token created by ANTLR against a
regular expression.

This scenario uses:

  • a tFixedFlowInput component to create the
    unstructured data strings.

  • a tStandardizeRow component to define the
    rules necessary to extract the liquid amounts from the data strings.

  • a tLogRow component to display the output
    data.

tStandardizeRow_44.png

Setting up the Job

  1. Drop the following components from the Palette onto the design workspace:
    tFixedFlowInput
    ,
    tStandardizeRow
    and tLogRow.
  2. Connect the three components together using the Main links.

Creating the unstructured data

  1. Double-click the tFixedFlowInput
    component to display its Basic settings
    view.

    tStandardizeRow_45.png

  2. Click the […] button to open the
    Schema dialog box, click the
    [+] button to add a column, name the
    column product and finally click
    OK to validate and close the
    box.

    tStandardizeRow_46.png

  3. In the Mode area, select Use Inline Content (delimited file).
  4. In the Content field, enter the following
    three strings:
    3M PROJECT LAMP 7 LUMENS 32ML
    A 5 LUMINES 5 LOW VANILLA 5L 5LIGHT 5 L DULUX L
    54MLP FAC 32 ML

Creating the parsing rules

  1. Double-click the tStandardizeRow
    component to display its Basic settings
    view.

    tStandardizeRow_47.png

  2. From the Column to parse list, select
    product.
  3. In the Conversion rules table, define a
    basic rule and an advanced rule as the following:

    • Click twice on the [+] button to
      add two columns. Name the first as “Amount” and
      the second as “LiquidAmount”.

    • Select Format as the type for
      the basic rule, and define it to read "INT WHITESPACE*
      WORD"
      .

    • Select RegExp as the type for
      the advanced rule, and define it to read
      "\d+\s*(L|ML)\b".

      The advanced rule is executed after the basic ANTLR rule. The
      “Amount” rule tokenizes the amounts in
      the three strings; it matches any word with a numeric in front of
      it. The RegExp rule then
      checks each token created by ANTLR against the regular
      expression.

  4. Click the Generate parser code in
    Routines
    button in order to generate the code under the
    Routines folder in the DQ Repository tree view of the
    Profiling
    perspective.

    This step is mandatory, otherwise the Job will not be executed.
  5. In the Advanced settings view, leave the
    options selected by default in the Output
    format
    area as they are.

    The Max edits for fuzzy match is set to
    1 by default.
  6. Double-click the tLogRow component and
    select the Table (print values in cells of a
    table)
    option in the Mode
    area.

Executing the Job

Save your Job and press F6 to execute
it.

tStandardizeRow_48.png

The liquid amount has been extracted from the unstructured data by using a
basic ANTLR rule that has tokenized amounts followed by an advanced rule
that has checked each token created by ANTLR against the regular
expression.
Each instance of the XML data is written on a separate row because the
Pretty print check box is selected in
the Advanced settings of the tStandardizeRow component.

tStandardizeRow MapReduce properties (deprecated)

These properties are used to configure tStandardizeRow running
in the MapReduce Job framework.

The MapReduce
tStandardizeRow component belongs to the Data Quality
family.

The component in this framework is available in all Talend Platform products with Big Data and in Talend Data Fabric.

The MapReduce framework is deprecated from Talend 7.3 onwards. Use Talend Jobs for Apache Spark to accomplish your integration tasks.

Basic settings

Schema and Edit schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Column to parse

Select the column to be parsed from the received data flow.

Standardize this field

Select this check box to standardize the rule-compliant data
identified, that is, to replace the duplicates of the identified data with the
corresponding standardized data from a given index.

For further information about this index providing
standardized data, see tSynonymOutput.

Every time you select or clear this check box, the schema of
this component is changed automatically, so in a given Job, you need to click
the activated Sync columns button to fix
the inconsistencies in the schema.

Generate parser code in Routines

Click this button to enable the data parser of your Studio to
recognize the rules defined in the Conversion
rules
table.

In a given Job, this operation is required whenever a rule is created. For an
existing rule that you have modified, the operation is required only when the
modified rule is of type Enumeration, Format or Combination. For further
information about all of the rule types, see Rule types.

tStandardizeRow_6.png and
tStandardizeRow_7.png

Click the import or export button to exchange a given
standardization rule set with the DQ
Repository
.

– When you click the export button, your Studio switches to the
Profiling perspective and the Parser rule Settings view opens on the workspace
with the relevant contents filled in automatically. If need be, you can then
edit the exported rule set and save it to the Libraries >
Rules > Parser folder in the DQ Repository tree view.

– When you click the import button, an import wizard opens
to help you import the standardization rule of interest.

For further information, see
Talend Studio User Guide
.

Conversion rules

Define the rules you need to apply as follows:

– In the Name column,
type in a name of the rule you want to use. This name is used as the XML tag or
the JSON attribute name and the token name to label the incoming data
identified by this rule.

– In the Type column,
select the type of the rule you need to apply. For further information about
available rule types, see Rule types.

– In the Value column,
type in the syntax of the rule.

– In the Search mode
column, select a search mode from the list. The search modes can be used only
with the Index rule type. For further
information about available search modes, see Search modes for Index rules.

If you use this component in a Talend Map/Reduce Job, bear in mind the
prerequisites for an Index rule stated in the section Rule types.

A test view is provided to help you create the parser rules of
interest. For further information, see
Talend Studio User Guide
.

Advanced settings

Advanced options for INDEX rules

Search UNDEFINED fields: select this check box if you want the component to
search for undefined tokens in the index run results.

Word distance for partial match (available for the Match partial
mode): set the maximum number of words allowed to come inside a sequence of
words that may be found in the index; the default value is 1.

Max edits for fuzzy match (based on the Levenshtein algorithm and available for
fuzzy modes): select an edit distance, 1 or 2, from the list. Any terms within the
selected edit distance from the input data are matched. With a maximum edit
distance of 2, for example, you can have up to two insertions, deletions or
substitutions. The score for each match is based on the edit distance of that
term.

Fuzzy matching gains much in performance with Max edits for fuzzy match.

Note:

Jobs migrated into the Studio from older releases run correctly, but results
might be slightly different because Max edits for fuzzy match is now used in
place of Minimum similarity for fuzzy match.

Output format

XML: this option is
selected by default. It outputs normalized data in XML format.

JSON: select this
option to output normalized data in JSON format.

The output flow is always written as a single flow.

Global Variables

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl +
Space to access the variable list and choose the variable to use from it.

For further information about variables, see
Talend Studio

User Guide.

Usage

Usage rule

This component generates code in the Routines node of the
Repository tree view. This code implements the grammar defined in the rules and
is automatically compiled to be used by the component. When you modify, add or
delete a rule, the code must be generated again so that the Job takes the
modifications into account and remains runnable. To generate it, click the
Generate parser code in Routines button.

In a Talend Map/Reduce Job, this component is used as an intermediate
step, and the other components used along with it must be Map/Reduce components
too. They generate native Map/Reduce code that can be executed directly in Hadoop.

For further information about Talend Map/Reduce Jobs, see the sections
describing how to create, convert and configure a Talend Map/Reduce Job in the
Talend Open Studio for Big Data Getting Started Guide.

Note that in this documentation, unless otherwise
explicitly stated, a scenario presents only Standard Jobs,
that is to say traditional Talend data integration Jobs, and not Map/Reduce Jobs.

Connections

Outgoing links (from this component to another):

Row: Main; Reject

Trigger: Run if; On Component Ok; On
Component Error.

Incoming links (from one component to this one):

Row: Main; Reject

For further information regarding connections, see
Talend Studio User Guide
.

Related scenarios

No scenario is available for the Map/Reduce version of this component yet.

tStandardizeRow properties for Apache Spark Batch

These properties are used to configure tStandardizeRow running
in the Spark Batch Job framework.

The Spark Batch
tStandardizeRow component belongs to the Data Quality family.

The component in this framework is available in all Talend Platform products with Big Data and in Talend Data Fabric.

Basic settings

Schema and Edit schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Column to parse

Select the column to be parsed from the received data flow.

Standardize this field

Select this check box to standardize the rule-compliant
data identified, that is, to replace the duplicates of the identified
data with the corresponding standardized data from a given index.

For further information about this index providing
standardized data, see tSynonymOutput.

Every time you select or clear this check box, the schema
of this component is changed automatically, so in a given Job, you need
to click the activated Sync
columns
button to fix the inconsistencies in the
schema.
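
As a hypothetical illustration, with a synonym index that maps the variants
'kilogram' and 'kilo' to the standardized form 'kg', the supplementary column
would carry 'kg' for every row containing one of these variants:

  identified data    standardized data
  'kilogram'         'kg'
  'kilo'             'kg'
  'kg'               'kg'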

Generate analyzer code as routine

Click this button to enable the data parser of your
Studio to recognize the rules defined in the Conversion rules table.

In a given Job, this operation is required to execute a newly created rule;
for an existing rule that you have modified, it is required only when the
modified rule is of type Enumeration, Format or Combination. For further
information about all of the rule types, see Rule types.

Import and export buttons

Click the import or export button to exchange a given
standardization rule set with the DQ
Repository
.

– When you click the export button, your Studio is switched to the
Profiling perspective and the Parser rule Settings view opens in the
workspace with the relevant contents filled in automatically. If need be,
you can then edit the exported rule set and save it to the Libraries >
Rules > Parser folder in the DQ Repository tree view.

– When you click the import button, an import wizard opens to help you
import the standardization rule of interest.

For further information, see
Talend Studio User Guide
.

Conversion rules

Define the rules you need to apply as follows:

– In the Name column, type in a name for the rule you want to use. This
name is used as the XML tag or the JSON attribute name and as the token name
to label the incoming data identified by this rule.

– In the Type
column, select the type of the rule you need to apply. For further
information about available rule types, see Rule types.

– In the Value
column, type in the syntax of the rule.

– In the Search
mode
column, select a search mode from the list. The
search modes can be used only with the Index rule type. For further information about
available search modes, see Search modes for Index rules.

A test view is provided to help you create the parser
rules of interest. For further information, see
Talend Studio User Guide
.
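
For instance (a hypothetical rule set invented for illustration), the
following two rows of the Conversion rules table define an enumeration of
weight units and a format rule combining it with the pre-defined tokens to
match strings such as "2.5 kg":

  Name         Type          Value
  WeightUnit   Enumeration   "'kg' | 'g' | 'lb'"
  Weight       Format        "DECIMAL WHITESPACE WeightUnit"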

Advanced settings

Advanced options for INDEX rules

Search UNDEFINED
fields
: select this check box if you want the component
to search for undefined tokens in the index run results.

Word distance for partial match (available for the Match partial mode): set
the maximum number of words allowed to occur within a sequence of words that
may be found in the index. The default value is 1.

Max edits for fuzzy match (based on the Levenshtein algorithm and available
for fuzzy modes): select an edit distance, 1 or 2, from the list. Any term
within the selected edit distance of the input data is matched. With a
maximum edit distance of 2, for example, you can have up to two insertions,
deletions or substitutions. The score for each match is based on the edit
distance of that term.

Fuzzy matching gains greatly in performance with Max edits for fuzzy match.

Note:

Jobs migrated in the Studio from older releases run correctly, but results
might be slightly different because Max edits for fuzzy match is now used in
place of Minimum similarity for fuzzy match.

Output format

XML: this option
is selected by default. It outputs normalized data in XML format.

JSON: select this
option to output normalized data in JSON format.

Usage

Usage rule

This component is used as an intermediate step.

This component, along with the Spark Batch component Palette it belongs to,
appears only when you are creating a Spark Batch Job.

Note that in this documentation, unless otherwise explicitly stated, a
scenario presents only Standard Jobs, that is to
say traditional
Talend
data integration Jobs.

Connections

Outgoing links (from this component to another):

Row: Main; Reject

Incoming links (from one component to this one):

Row: Main; Reject

For further information regarding connections, see
Talend Studio User Guide
.

Spark Connection

In the Spark
Configuration
tab in the Run
view, define the connection to a given Spark cluster for the whole Job. In
addition, since the Job expects its dependent jar files for execution, you must
specify the directory in the file system to which these jar files are
transferred so that Spark can access these files:

  • Yarn mode (Yarn client or Yarn cluster):

    • When using Google Dataproc, specify a bucket in the
      Google Storage staging bucket
      field in the Spark configuration
      tab.

    • When using HDInsight, specify the blob to be used for Job
      deployment in the Windows Azure Storage
      configuration
      area in the Spark
      configuration
      tab.

    • When using Altus, specify the S3 bucket or the Azure
      Data Lake Storage for Job deployment in the Spark
      configuration
      tab.
    • When using Qubole, add a
      tS3Configuration to your Job to write
      your actual business data in the S3 system with Qubole. Without
      tS3Configuration, this business data is
      written in the Qubole HDFS system and destroyed once you shut
      down your cluster.
    • When using on-premise
      distributions, use the configuration component corresponding
      to the file system your cluster is using. Typically, this
      system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the
    configuration component corresponding to the file system your cluster is
    using, such as tHDFSConfiguration or
    tS3Configuration.

    If you are using Databricks without any configuration component present
    in your Job, your business data is written directly in DBFS (Databricks
    Filesystem).

This connection is effective on a per-Job basis.

Related scenarios

No scenario is available for the Spark Batch version of this component
yet.

tStandardizeRow properties for Apache Spark Streaming

These properties are used to configure tStandardizeRow running
in the Spark Streaming Job framework.

The Spark Streaming
tStandardizeRow component belongs to the Data Quality family.

This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.

Basic settings

Schema and Edit schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Column to parse

Select the column to be parsed from the received data flow.

Standardize this field

Select this check box to standardize the rule-compliant
data identified, that is, to replace the duplicates of the identified
data with the corresponding standardized data from a given index.

For further information about this index providing
standardized data, see tSynonymOutput.

Every time you select or clear this check box, the schema
of this component is changed automatically, so in a given Job, you need
to click the activated Sync
columns
button to fix the inconsistencies in the
schema.

Generate analyzer code as routine

Click this button to enable the data parser of your
Studio to recognize the rules defined in the Conversion rules table.

In a given Job, this operation is required to execute a newly created rule;
for an existing rule that you have modified, it is required only when the
modified rule is of type Enumeration, Format or Combination. For further
information about all of the rule types, see Rule types.

Import and export buttons

Click the import or export button to exchange a given
standardization rule set with the DQ
Repository
.

– When you click the export button, your Studio is switched to the
Profiling perspective and the Parser rule Settings view opens in the
workspace with the relevant contents filled in automatically. If need be,
you can then edit the exported rule set and save it to the Libraries >
Rules > Parser folder in the DQ Repository tree view.

– When you click the import button, an import wizard opens to help you
import the standardization rule of interest.

For further information, see
Talend Studio User Guide
.

Conversion rules

Define the rules you need to apply as follows:

– In the Name column, type in a name for the rule you want to use. This
name is used as the XML tag or the JSON attribute name and as the token name
to label the incoming data identified by this rule.

– In the Type
column, select the type of the rule you need to apply. For further
information about available rule types, see Rule types.

– In the Value
column, type in the syntax of the rule.

– In the Search
mode
column, select a search mode from the list. The
search modes can be used only with the Index rule type. For further information about
available search modes, see Search modes for Index rules.

A test view is provided to help you create the parser
rules of interest. For further information, see
Talend Studio User Guide
.

Advanced settings

Advanced options for INDEX rules

Search UNDEFINED
fields
: select this check box if you want the component
to search for undefined tokens in the index run results.

Word distance for partial match (available for the Match partial mode): set
the maximum number of words allowed to occur within a sequence of words that
may be found in the index. The default value is 1.

Max edits for fuzzy match (based on the Levenshtein algorithm and available
for fuzzy modes): select an edit distance, 1 or 2, from the list. Any term
within the selected edit distance of the input data is matched. With a
maximum edit distance of 2, for example, you can have up to two insertions,
deletions or substitutions. The score for each match is based on the edit
distance of that term.

Fuzzy matching gains greatly in performance with Max edits for fuzzy match.

Note:

Jobs migrated in the Studio from older releases run correctly, but results
might be slightly different because Max edits for fuzzy match is now used in
place of Minimum similarity for fuzzy match.

Output format

XML: this option
is selected by default. It outputs normalized data in XML format.

JSON: select this
option to output normalized data in JSON format.

Usage

Usage rule

This component, along with the Spark Streaming component Palette it belongs to, appears
only when you are creating a Spark Streaming Job.

This component is used as an intermediate step.

You need to use the Spark Configuration tab in the
Run view to define the connection to a given Spark cluster
for the whole Job.

This connection is effective on a per-Job basis.

For further information about a
Talend
Spark Streaming Job, see the sections
describing how to create, convert and configure a
Talend
Spark Streaming Job of the

Talend Open Studio for Big Data Getting Started Guide
.

Note that in this documentation, unless otherwise explicitly stated, a
scenario presents only Standard Jobs, that is to
say traditional
Talend
data integration Jobs.

Connections

Outgoing links (from this component to another):

Row: Main; Reject

Incoming links (from one component to this one):

Row: Main; Reject

For further information regarding connections, see
Talend Studio User Guide
.

Spark Connection

In the Spark
Configuration
tab in the Run
view, define the connection to a given Spark cluster for the whole Job. In
addition, since the Job expects its dependent jar files for execution, you must
specify the directory in the file system to which these jar files are
transferred so that Spark can access these files:

  • Yarn mode (Yarn client or Yarn cluster):

    • When using Google Dataproc, specify a bucket in the
      Google Storage staging bucket
      field in the Spark configuration
      tab.

    • When using HDInsight, specify the blob to be used for Job
      deployment in the Windows Azure Storage
      configuration
      area in the Spark
      configuration
      tab.

    • When using Altus, specify the S3 bucket or the Azure
      Data Lake Storage for Job deployment in the Spark
      configuration
      tab.
    • When using Qubole, add a
      tS3Configuration to your Job to write
      your actual business data in the S3 system with Qubole. Without
      tS3Configuration, this business data is
      written in the Qubole HDFS system and destroyed once you shut
      down your cluster.
    • When using on-premise
      distributions, use the configuration component corresponding
      to the file system your cluster is using. Typically, this
      system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the
    configuration component corresponding to the file system your cluster is
    using, such as tHDFSConfiguration or
    tS3Configuration.

    If you are using Databricks without any configuration component present
    in your Job, your business data is written directly in DBFS (Databricks
    Filesystem).

This connection is effective on a per-Job basis.

Related scenarios

No scenario is available for the Spark Streaming version of this component
yet.

