Warning
This component will be available in the Palette of
Talend Studio on the condition that you have subscribed to one of
the Talend Platform products.
Component family | Data quality |
---|---|
Function | tStandardizeRow tokenizes the incoming data flow and applies user-defined parser rules to it. The standardization option adds a supplementary column to the output, in which the standardized data is written. The Java library ANTLR is used to parse and tokenize the incoming data. Note: this component does not make any changes to your raw data. |
Purpose | tStandardizeRow normalizes the incoming data and separates the rule-compliant data from the non-compliant data. |
Basic settings | |
Schema and Edit schema | A schema is a row description; it defines the number of fields to be processed and passed on to the next component. Since version 5.6, both the Built-In mode and the Repository mode are available in any of the Talend solutions. Built-In: you create and store the schema locally for this component only. Repository: you have already created the schema and stored it in the Repository; you can reuse it in various projects and Job designs. |
Column to parse | Select the column to be parsed from the received data flow. |
Standardize this field | Select this check box to standardize the rule-compliant data, that is, to replace the matched data with the corresponding standardized data from a given synonym index. For further information about this index, see tSynonymOutput. Every time you select or clear this check box, the schema of this component is changed automatically. |
Generate parser code in Routines | Click this button to enable the data parser of your Studio to recognize the rules defined in the Conversion rules table. In a given Job, this operation is required every time a rule is created, modified or deleted. |
Import rules and Export rules | Click the import or export button to exchange a given set of rules with the DQ Repository. – When you click the export button, your Studio is switched to the Profiling perspective, where you can save the rule set. – When you click the import button, an import wizard is opened so that you can select the rule set to use. For further information, see Talend Studio User Guide. |
Conversion rules | Define the rules you need to apply as follows: – In the Name column, type in a name for the rule. – In the Type column, select the type of the rule. – In the Value column, type in the rule definition. – In the Search mode column, select a search mode from the list (search modes apply to Index rules only). If you use this component in a Talend Map/Reduce Job, bear in mind the constraints described in Usage in Map/Reduce Jobs. A test view is provided to help you create the parser rules of interest. For further information, see Talend Studio User Guide. |
Advanced settings | |
Advanced options for INDEX rules | – Search UNDEFINED fields: select this check box to have the component search the index for tokens that remain undefined after parsing. – Word distance for partial match: set the maximum number of words allowed between matched words when the Match partial mode is used. – Minimum similarity for fuzzy match: set the minimum similarity score a word must reach to be considered a match when a fuzzy mode is used. |
Output format | – XML: this option is selected by default; the normalized data is written as XML. – JSON: select this option to output the normalized data as JSON. – Pretty print: this option is selected by default; it formats the output with line breaks so that each instance of the output data is written on a separate row. You cannot use the Pretty print option when you use the component as a Map/Reduce component. |
tStatCatcher Statistics | Select this check box to collect log data at the component level. |
Global Variables | ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. A Flow variable functions during the execution of a component while an After variable functions after the execution of the component. To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it. For further information about variables, see Talend Studio User Guide. |
Usage | This component is an intermediary step. It requires an input flow as well as an output flow. This component generates Java code in the Routines node of the Repository tree view. This code implements the grammar defined in the Conversion rules table. When you modify, add or delete a rule, the code must be generated again so that the modifications are taken into account. To generate it, click the Generate parser code in Routines button. |
Usage in Map/Reduce Jobs | If you have subscribed to one of the Talend solutions with Big Data, you can also use this component as a Map/Reduce component. For further information about a Talend Map/Reduce Job, see the sections about creating and configuring Talend Map/Reduce Jobs in the Talend Big Data Getting Started Guide. Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs. |
Connections | Outgoing links (from this component to another): Row: Main; Reject. Trigger: Run if; On Component Ok; On Component Error. Incoming links (from one component to this one): Row: Main; Reject. For further information regarding connections, see Talend Studio User Guide. |
Limitation | n/a |
Two groups of rule types are provided: the basic rule types and the advanced rule types.
- Basic rule types: Enumeration, Format and Combination. Rules of these types are composed with some given ANTLR symbols.
- Advanced rule types: Regex, Index and Shape. Rules of these types match the tokenized data and standardize it when needed.
The advanced rule types are always executed after the ANTLR-specific rules, regardless of rule order. For further information about basic and advanced rules, see Different rule types for different parsing levels and Scenario 4: Using two parsing levels to extract information from unstructured data.
To create the rules of any type, Talend provides the following pre-defined and case-sensitive elements (ANTLR tokens) for defining the composition of a string to be matched:
- INT: integer;
- WORD: word;
- WORD+: literals of several words;
- CAPWORD: capitalized word;
- DECIMAL: decimal float;
- FRACTION: fraction float;
- CURRENCY: currencies;
- ROMAN_NUMERAL: Roman numerals;
- ALPHANUM: combination of alphabetic and numeric characters;
- WHITESPACE: whitespace;
- UNDEFINED: unexpected strings, such as ASCII codes, that no other token can recognize.
For example, in the record 3M PROJECT LAMP, 3M is an ALPHANUM token while PROJECT and LAMP are CAPWORD tokens.
The following three tables successively present detailed information about the
basic types, the advanced types and the ANTLR symbols used by the basic rule types.
These three tables help you to complete the Conversion
rules table in the Basic settings of
this component.
For basic rule types:
Basic Rule Type | Usage | Example | Conditions of rule composition |
---|---|---|---|
Enumeration | A rule of this type provides a list of possible matches. | RuleName: BY; RuleValue: "'X' | 'x' | 'by'" | – Each option must be put in a pair of single quotation marks. – Defined options must be separated by the | symbol. |
Format (Rule name starts with upper case) | A rule of this type uses the pre-defined elements (ANTLR tokens), other rules, and character or string literals to define the composition of a string. | RuleName: Length; RuleValue: "DECIMAL WHITESPACE 'cm'". This rule means that a whitespace between the decimal and the unit is required, so it matches strings like 1.4 cm but not 1.4cm. | – When the name of a Format rule starts with upper case, the rule is a lexer rule: it is used to tokenize the incoming data and whitespace must be matched explicitly. |
Format (Rule name starts with lower case) | A rule of this type is almost the same as a Format rule starting its name with upper case, except that whitespace does not need to be matched explicitly. | RuleName: length; RuleValue: "DECIMAL 'cm'". The rule matches strings like 1.4 cm or 1.4cm. | n/a |
Combination | A rule of this type is used when you need to create a rule composed of several other rules. | RuleName: Size; RuleValue: "Length BY Length". The rule matches strings like 1.4 cm by 1.4 cm. | – Literal texts or characters are not accepted as a part of a Combination rule. – When several Combination rules use the same name, they are executed in the top-down order in which they are defined in the Conversion rules table. |
Warning
Any characters or string literals, if accepted by a rule type, must be put in
single quotation marks when used, otherwise they will be treated as ANTLR
grammar symbols or variables and generate errors or unexpected results at
runtime.
For advanced rule types:
Advanced Rule Type | Usage | Example | Conditions |
---|---|---|---|
Regex | A rule of this type uses regular expressions to match the tokenized data. | RuleName: LiquidAmount; RuleValue: "\d+\s*(L|ML)\b". The rule matches strings like 32ML or 5 L. | Regular expressions must be Java compliant. |
Index | A rule of this type uses a synonym index as reference to search for matching data. For further information about creating synonym indexes, see tSynonymOutput. | A scenario is available in Scenario 2: Standardizing addresses from unstructured data. | – In Windows, the backslashes in the path to an index must be escaped or replaced with forward slashes. – When processing a record, a given Index rule matches only the first string it recognizes in that record. – In a Talend Map/Reduce Job, you need to make the index files available to the cluster. |
Shape | A rule of this type uses pre-defined elements along with the rules already defined to match the tokenized data. | RuleName: Address; RuleValue: "<INT> <StreetName> <Zip> <City>". This rule matches addresses like 12 main street 75011 Paris. For further information about the Shape rule type, see Scenario 2: Standardizing addresses from unstructured data. | Only the contents put in angle brackets (< >) are recognized as rule elements. |
For the given ANTLR symbols:
Symbols | Meaning |
---|---|
| (vertical bar) | alternative |
'…' | char or string literal |
+ | 1 or more |
* | 0 or more |
? | optional or semantic predicate |
~ | match not |
Examples of using these symbols are presented in the following scenarios, but you
can also find more examples on the following site:
https://theantlrguy.atlassian.net/wiki/display/ANTLR3/ANTLR+Cheat+Sheet.
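For illustration, here are two rule values taken from the scenarios later in this page, annotated with the symbols they use:
- "(INT | FRACTION | DECIMAL) WHITESPACE* LengthUnit" — the | symbol chooses between alternative tokens and * allows zero or more whitespaces.
- "(DIGIT DIGIT|LETTER LETTER) '-'? DIGIT DIGIT DIGIT DIGIT (DIGIT DIGIT?)?" — the quoted '-' is a literal character and ? makes the hyphen and the trailing digits optional.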
One type of advanced rule used by the tStandardizeRow component is the Index rule. Index rules use synonym indexes as a reference to search for matching data.
If you could not specify which type of match (exact, partial, fuzzy, etc.) to apply to the input flow, an Index rule would not standardize and output the data you expect. tStandardizeRow therefore allows you to select one of the following search modes for each Index rule you define in the component:
Search mode | Description |
---|---|
Match all | each word of the input string must exist in the index string, regardless of word order. |
Match all fuzzy | each word of the input string must match a similar word of the index string. |
Match any | the input string should have at least one word that matches a word of the index string. |
Match any fuzzy | the input string should have at least one word that is similar to a word of the index string. |
Match exact | the exact input string should match the exact index string. |
Match partial | each word of the input string must exist in the index string, allowing for a limited number of unmatched words in between; the limit is set with the Word distance for partial match option. |
Suppose, for example, that you have the below record in the input flow:
1 | DULUX PAINTPOD EXTRA REACH HANDLE |
And you have created a color index that contains the Extra Deep Base string.
If you define an Index rule in tStandardizeRow and set the search mode to Match any, the component will return Extra Deep Base as a color for the above record, because the word EXTRA matches a word of the index string. But if you want the component to return a match only when the exact search string is found in the index, you set the search mode of the rule to Match exact, and the component will not return a color for the record.
For a Job example, see Scenario 3: Extracting exact match by using Index rules.
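As an aside, the following minimal Java sketch illustrates the logic behind the Match exact and Match any modes. It assumes a plain whitespace tokenizer and uses a single string in place of a real synonym index, so it is an illustration of the matching idea, not the component's actual implementation:

```java
import java.util.*;

// Minimal sketch of "Match exact" vs "Match any" against one index entry.
// The real component reads synonym indexes created with tSynonymOutput;
// here a plain string stands in for the index entry.
public class SearchModeDemo {
    static boolean matchExact(String input, String indexEntry) {
        return input.equalsIgnoreCase(indexEntry);
    }

    static boolean matchAny(String input, String indexEntry) {
        Set<String> indexWords = new HashSet<>(
                Arrays.asList(indexEntry.toUpperCase().split("\\s+")));
        for (String word : input.toUpperCase().split("\\s+")) {
            if (indexWords.contains(word)) return true; // one shared word is enough
        }
        return false;
    }

    public static void main(String[] args) {
        String record = "DULUX PAINTPOD EXTRA REACH HANDLE";
        String index = "Extra Deep Base";
        System.out.println(matchAny(record, index));   // true: "EXTRA" is shared
        System.out.println(matchExact(record, index)); // false: the strings differ
    }
}
```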
The tStandardizeRow component uses basic rules based on ANTLR grammar and advanced rules that are defined by Talend and are not based on ANTLR.
Sometimes, ANTLR rules alone cannot meet all your expectations when normalizing and standardizing data. Suppose, for example, that you want to extract the liquid amount in the following three records:
1 | 3M PROJECT LAMP 7 LUMENS 32ML |
2 | A 5 LUMINES 5 LOW VANILLA 5L 5LIGHT 5 L |
3 | DULUX L 54MLP FAC 32 ML |
You may start by defining a liquid unit and a liquid amount in basic parser rules as follows:
If you test these rules in the Profiling perspective of the Studio, you can see that these rules extract 7 L from 7 LUMENS, and this is not what you expect: you do not want the word LUMENS to be split into two tokens.
The basic rules you have defined above are ANTLR lexer rules, and lexer rules are used for tokenizing the input string. ANTLR does not provide a word boundary symbol like the \b used in regular expressions. You must then be careful when choosing lexer rules, because they define how the input strings will be split into tokens.
You can solve such a problem using two approaches:
The first approach is to define another basic rule that matches a word with a
numeric value in front of it, the Amount rule in this example:
This basic rule is a lexer rule, a Format rule whose name starts with an upper-case letter. If you test this rule in the Profiling perspective of the Studio, you can see that non-liquid amounts are matched by this rule, and the LiquidAmount rule only matches the expected sequences of characters.
The second approach is to use an advanced rule, such as a regular expression, and define a word boundary with \b. You can use a lexer rule to tokenize amounts, matching any word with a numeric value in front of it. Then use a regular expression that matches liquid amounts as follows: a digit, optionally followed by a space, followed by L or ML and terminated by a word boundary.
Note that the regular expression will be applied on the tokens created by the basic lexer rule.
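The following small Java program (an illustration, not part of the component) shows how this regular expression behaves on tokens such as those produced by the lexer rule:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The regular expression used by the advanced rule in Scenario 4:
// a digit, optional whitespace, L or ML, then a word boundary.
public class LiquidAmountRegex {
    public static void main(String[] args) {
        Pattern liquidAmount = Pattern.compile("\\d+\\s*(L|ML)\\b");
        // Tokens such as those produced by the "Amount" lexer rule:
        for (String token : new String[]{"32ML", "7 LUMENS", "5 L", "3M"}) {
            Matcher m = liquidAmount.matcher(token);
            System.out.println(token + " -> " + m.find());
        }
        // Prints: 32ML -> true, 7 LUMENS -> false, 5 L -> true, 3M -> false
    }
}
```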
You cannot check the results of the advanced rule by testing the rule in the Profiling perspective of the Studio as you do with basic rules. The only way to see the results of advanced rules is to use them in a Job. The results will look as follows:
3M PROJECT LAMP 7 LUMENS 32ML
<record>
  <Amount>3M</Amount>
  <Amount>7 LUMENS</Amount>
  <LiquidAmount>32ML</LiquidAmount>
  <UNMATCHED>
    <CAPWORD>PROJECT</CAPWORD>
    <CAPWORD>LAMP</CAPWORD>
  </UNMATCHED>
</record>
For a Job example about the use of the above rules, see Scenario 4:
Using two parsing levels to extract information from unstructured data.
Comparing these two approaches, the first one uses only ANTLR grammar and may be more efficient than the second, which requires a second parsing pass to check each token against the regular expression. But regular expressions allow users experienced with them to create advanced rules that would be hard to express using ANTLR only.
In this scenario, two steps are performed to:
- normalize the incoming data (separate the compliant data from the non-compliant data), and
- extract the data of interest and display it.
Before replicating these two steps, we need to analyze the source data in order to figure out what rules need to be composed. For this scenario, the source data is stored in a .csv file called partsmaster.
There are 59 rows of raw data in total, but some of them are not shown in the capture.
Through observation, you can expect that the third row will not be recognized, as it contains Asian characters. Furthermore, you can figure out that:
- the SKU data contains 34-9923, XC-3211, pb710125 and so on. So the rule used to parse the SKU data could be:
  Name: "SKU"; Type: "Format"; Value: "(DIGIT DIGIT|LETTER LETTER) '-'? DIGIT DIGIT DIGIT DIGIT (DIGIT DIGIT?)?"
- for the Size data, the correct format is the multiplication of two or three lengths plus the length units. Therefore, the rules used to parse the Size data could be:
  Name: "LengthUnit"; Type: "Enumeration"; Value: " 'm' | '''' | 'inch' | 'inches' | '"' "
  Name: "BY"; Type: "Enumeration"; Value: "'X' | 'x' | 'by'"
  Name: "Length"; Type: "Format"; Value: "(INT | FRACTION | DECIMAL) WHITESPACE* LengthUnit"
  Name: "Size"; Type: "Combination"; Value: "Length BY Length BY Length"
  Name: "Size"; Type: "Combination"; Value: "Length BY Length"
  Two Combination rules use the same name; in this case, they are executed in the top-down order in which they are presented in this table.
- for the Weight data, the correct format is the weight plus the weight unit. Therefore, the rules used to parse the Weight data are:
  Name: "WeightUnit"; Type: "Enumeration"; Value: " 'lb' | 'lbs' | 'pounds' | 'Kg' | 'pinds' "
  Name: "Weight"; Type: "Format"; Value: "(INT | FRACTION | DECIMAL) WHITESPACE* WeightUnit"
Now, you can begin to replicate the two steps of this scenario.
In this first step, four components are used. They are:
- tFileInputDelimited (partsmaster): this component loads and outputs the source data to the component that follows.
- tStandardizeRow: this component normalizes the incoming data and rejects the exception data that it cannot recognize.
- tLogRow: this component is used twice, to display respectively the exception data and the output XML data.
To replicate this step, proceed as follows:
- Set up the file delimited schema for the partsmaster.csv file in the Repository of your Studio. For further information about how to set up a file delimited schema, see Talend Studio User Guide.
- Under the Metadata node of the Repository of your Studio, select the file delimited schema for partsmaster.csv and drop it on the design workspace.
- From the dialog box that pops up, select tFileInputDelimited. Then the tFileInputDelimited (partsmaster) component displays in the design workspace.
- From the Palette, drop the tStandardizeRow component and two tLogRow components into the design workspace.
- Right-click tFileInputDelimited (partsmaster) to open the contextual menu.
- From the contextual menu, select Row > Main link to connect this component to tStandardizeRow.
- Accept the schema propagation prompted by the pop-up dialog box.
- From tStandardizeRow, do the same to connect this component to one of the two tLogRow components using a Row > Main link and to the other using a Row > Reject link, and accept the schema propagation. You can change the name displayed for each of these components, as has been done for one of the tLogRow components, named Exception in this scenario. For further information, see Talend Studio User Guide.
Then you can continue to configure each component to be used.
To do this, proceed as follows:
- Double-click tStandardizeRow to open its Component view.
- In the Column to parse field, select SKU_Description_Size_Weight. This is the only column that the incoming schema has.
- Under the Conversion rules table, click the plus button eight times to add eight rows to this table.
- To complete these rows, type in the rules you have figured out when analyzing the raw data at the beginning of this scenario. The two Size rules are executed in top-down order. In this example, this order allows the component to match first the sizes with three numbers and then those with two. If you reverse this order, the component will match the first two numbers of all sizes and then treat the last number of the three-number sizes as unmatched.
- Click the Generate parser code in Routines button.
- In the Advanced settings view, leave the options selected by default in the Output format area as they are.
Press F6 to run this Job.
In the console of the Run view, you can read
the output data structure. Each instance of the XML data is written on a
separate row because the Pretty print check box
is selected in the Advanced settings view of
the tStandardizeRow component.
For a raw data row such as 34-9923, Monolithic Membrance, 4′ x 8′, 26 lbs, an XML segment is generated in which the compliant data is separated from the non-compliant (unmatched) data. Thus the raw data is normalized.
Again from this console, you can read that a row is rejected as it contains
unrecognizable data.
This section presents the second step of this scenario. Four components are added to receive the normalized XML data in place of the tLogRow component which was set for this purpose in the first step.
The four components are:
- tExtractXMLField: this component reads an input XML field and extracts the desired data;
- tFilterRow: this component filters input rows by setting conditions on the selected columns;
- tLogRow (used twice): the two components receive and display respectively the data that is valid or invalid against the conditions set in tFilterRow.
To replicate the second step, proceed as the following sections illustrate.
To do this, proceed as follows:
- On the design workspace, delete the tLogRow component that is connected to tStandardizeRow with a Main row link.
- From the Palette, drop tExtractXMLField, tFilterRow and two tLogRow components into the design workspace. In this scenario, the two tLogRow components are named respectively ValidSize and InvalidSize. For further information about how to rename a component, see Talend Studio User Guide.
- Connect tExtractXMLField to tFilterRow with a Main link, as you did in the first step to connect partsmaster to tStandardizeRow.
- Do the same, but with a Filter link, to connect tFilterRow to ValidSize.
- Connect tFilterRow to InvalidSize with a Reject link, as you did in the first step to connect tStandardizeRow to Exception.
To do this, proceed as follows:
- Double-click tExtractXMLField to open its Component view.
- Next to Edit schema, click the three-dot button to open the schema editor.
- On the left side of this editor, select the SKU_Description_Size_Weight row.
- Click the button to add this row to the right side of the editor.
- On the right side of this editor, click the plus button on the bottom toolbar three times to add three rows.
- Rename the three rows SKU, Size and Weight, all of which are tags of the normalized XML data.
- Click OK to validate this editing and accept the schema propagation prompted by the pop-up dialog box. Four rows are automatically filled in the Mapping table of the Component view.
- In the XML field field of the Component view, select the desired XML field of the incoming data. In this example, it is Normalized_field.
- In the Loop XPath Query field, type in the name of the XML tree root tag to replace the default value. In this example, type in "/record".
- In the XPath query column of the Mapping table, type in the XML tree tags corresponding to the desired normalized data, between quotation marks. In this example, they are "SKU", "Size" and "Weight".
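For readers who want to see the extraction logic outside the Studio, here is a small, self-contained Java sketch using the standard XPath API. It mirrors the loop query "/record" and the three mapping queries, but it is only an illustration, not the code the component generates:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.*;
import org.w3c.dom.*;
import java.io.ByteArrayInputStream;

// Illustration of the Loop XPath query "/record" with the mapping
// queries "SKU", "Size" and "Weight".
public class ExtractRecordFields {
    public static void main(String[] args) throws Exception {
        String xml = "<record><SKU>34-9923</SKU><Size>4' x 8'</Size>"
                   + "<Weight>26 lbs</Weight></record>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList records = (NodeList) xpath.evaluate("/record", doc, XPathConstants.NODESET);
        for (int i = 0; i < records.getLength(); i++) {
            Node record = records.item(i);
            for (String tag : new String[]{"SKU", "Size", "Weight"}) {
                System.out.println(tag + ": " + xpath.evaluate(tag, record));
            }
        }
    }
}
```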
To do this, proceed as follows:
- Double-click tFilterRow to open its Component view.
- Under the Conditions table, click the plus button to add one row.
- In the InputColumn column of this row, select Size from the drop-down list.
- In the Function column, select Empty.
- In the Operator column, select Not equal to.
- In the Value column, type in a pair of quotation marks: "".
The condition is set up. The normalized data whose Size tag is not empty will be output, while the data whose Size tag is empty will be rejected.
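The condition amounts to keeping records whose Size value is a non-empty string, as in this trivial Java illustration (the empty-size row is hypothetical sample data):

```java
import java.util.*;

// The tFilterRow condition "Size Not equal to """ simply keeps
// records whose Size field is non-empty (illustration only).
public class SizeFilterDemo {
    public static void main(String[] args) {
        Map<String, String> records = new LinkedHashMap<>();
        records.put("34-9923", "4' x 8'");
        records.put("XC-3211", ""); // hypothetical row with no extracted size
        records.forEach((sku, size) -> {
            String target = size.isEmpty() ? "InvalidSize (Reject)" : "ValidSize (Filter)";
            System.out.println(sku + " -> " + target);
        });
    }
}
```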
- Press F6 to run this Job.
In the console of the Run view, you can read the desired data selected from the normalized XML data flow. This Job lists the normalized data of interest in parallel with the raw data. You can read the data rejected by tFilterRow as well.
Note
For reasons of page space, not all of the results are displayed.
In this scenario, six components are used to standardize addresses from unstructured
input data copied from a Website, by matching the data with the data contained in an
index previously generated by a Job.
For more information about index creation, see tSynonymOutput.
Drop the following components from the Palette to the design workspace.
- tFixedFlowInput: this component holds the unstructured data from which the addresses are extracted.
- tStandardizeRow: this component defines the address rules and generates the addresses in XML format with the defined tags. This is the process of normalizing and standardizing the initially unstructured data.
- tFilterColumns: this component filters the standardized addresses.
- tExtractXMLField: this component extracts the attributes from the Address node of the XML tree in order to output every address item in formatted columns.
- two tLogRow components: these components are used to display the output data. The first tLogRow returns the errors, if any. The second tLogRow displays the result in the console.
Before starting to replicate this scenario, you have to retrieve the content of an index in order to match the unstructured data with the index data. The content of the index reads as follows:
The left side holds Paris street names and the right side holds the synonyms used in the data. The data will be used as a reference to standardize the address data collected from the website of interest, for example, from http://paris.conciergerie.com/travel_paris/address.php.
To replicate this scenario, proceed as the following sections illustrate.
To do this, proceed as follows:
- Double-click the tFixedFlowInput component to display its Basic settings view.
- Click the […] button to open the [Schema] dialog box, click the [+] button to add a column, and name it input. Then, click OK to validate and close the editor.
- In the Mode area, select Use Inline Content (delimited file).
- In the Content field, paste the data copied from the Website. This data is unstructured.
Then you can continue to configure the standardization process.
To do this, proceed as follows:
- Connect the tFixedFlowInput component to the tStandardizeRow component with a Row > Main link.
- Double-click the tStandardizeRow component to display its Basic settings view. Remember that the advanced rule types are always executed after the ANTLR-specific rules, regardless of rule order.
- In the Column to parse list, select input.
- Select the Standardize this field check box.
- In the Conversion rules table, click six times on the [+] button to add six rows. Name them and configure them as shown in the capture. For the "StreetName" rule:
  - Select the Index type from the Type list.
  - Enter the path to your index in the Value field.
  - Select Match exact as the search mode from the Search mode list. By using the Match exact mode, you will extract from the input flow only the strings that exactly match the street name indexes. For further information about available search modes, see Search modes for Index rules.
Then the other rules are:
Name: "Zip"; Type: Format; Value: " DIGIT DIGIT DIGIT DIGIT DIGIT "
Name: "City"; Type: Enumeration; Value: " 'Paris' | 'Paris' | 'PARIS' | 'Bagnolet' "
Name: "SpecialStreetNumber"; Type: Format; Value: " (INT ('bis'|'ter')) | ( INT '-' INT) "
Name: "INT"; Type: Shape; Value: "<SpecialStreetNumber>"
Name: "Address"; Type: Shape; Value: " <INT> <StreetName> <Zip> <City> "
As advanced rule types are always executed after the ANTLR-specific rules, the basic rules "Zip", "City" and "SpecialStreetNumber" are executed first, followed by "INT" and "Address", the advanced rules.
- Click the Generate parser code in Routines button in order to generate the code in the Routines.
Warning
If you do not click this button, the Job will not be executed.
- In the Advanced settings view, leave the options selected by default in the Output format area as they are.
- In the design workspace, right-click the tStandardizeRow component to connect it to the tLogRow below with a Row > Reject link.
- Double-click the tLogRow component linked to the tStandardizeRow component to display its Basic settings view.
- In the Mode area, select the Table (print values in cells of a table) option.
Then continue to configure the process of filtering and extracting the data of
interest.
To do this, proceed as follows:
- In the design workspace, connect tStandardizeRow to tFilterColumns with a Row > Main link.
- In the Basic settings view of tFilterColumns, click the […] button next to the Edit schema field to open the schema editor.
- Click the STANDARDIZED_FIELD column, click the first arrow to filter this column, and then click OK to close the editor and validate the changes.
- In the design workspace, connect tFilterColumns to tExtractXMLField with a Row > Main link.
- Double-click the tExtractXMLField component to display its Basic settings view.
- Click the […] button next to the Edit schema field to open the schema editor, add four columns and name them respectively number, street, zip and city, and then click OK to validate and close the editor.
- In the XML field field, check that the STANDARDIZED_FIELD column is selected. The content of the STANDARDIZED_FIELD column is an XML field with a <record> tag as root. Under this tag, the structure defined by the rules configured in the tStandardizeRow component allows you to extract each item of the address, as you can see in the following capture.
- In the Loop XPath query field, type in "/record/Address".
- In the Mapping table, in the XPath query field, enter "INT" for the number column, "StreetName" for the street column, "Zip" for the zip column and "City" for the city column.
Then you can prepare to execute the Job.
Before executing the Job, you need to add the tLogRow component to display the execution result.
To do this, perform the following operations:
- In the design workspace, connect tExtractXMLField to the second tLogRow with a Row > Main link.
- Double-click the tLogRow component to open its Basic settings view.
- In the Schema list, select Built-In.
- Click the […] button next to the Edit schema field and check that the input and the output schemas in the tLogRow are synchronized. If not, click the Sync columns button.
- In the Mode area, select the Table (print values in cells of a table) option.
Press F6 to execute the Job.
The second tLogRow displays the normalized and
standardized addresses in the console according to the index.
For example, you can see that the Champs Elysées address has been selected and
formatted, and only input street names that exactly match an index string are
extracted.
The first tLogRow does not return anything in
this scenario because no error has been generated.
In this scenario, you will standardize some long descriptions of customer products by matching the input flow with the data contained in an index. This scenario explains how to use Index rules to tokenize product data and then check each token against an index to extract exact matches.
For this scenario, you must first create an index by using a Job with the tSynonymOutput component. You need to create indexes for the
brand, range, color and unit of the customer products. Use the tSynonymOutput component to generate the indexes and feed them with
entries and synonyms. The below capture shows an example Job:
Below is a sample of the generated indexes for this scenario:
Each of the generated indexes has strings (sequences of words) in one column and their corresponding synonyms in the second column. These strings are used as reference data against which the product data, generated by tFixedFlowInput, will be matched. For further information about index creation, see tSynonymOutput.
In this scenario, the generated indexes are defined as context variables. For further information about context variables, see Talend Studio User Guide.
- Drop the following components from the Palette to the design workspace: tFixedFlowInput, tStandardizeRow, tExtractXMLField and tFileOutputExcel.
- Connect the components together using Main links.
- Drop tLogRow to the Job and link tStandardizeRow to it using a Reject link.
- Double-click tFixedFlowInput to display the component Basic settings view. This component will generate the unstructured product data from which the brand, range, color and unit are extracted.
- Click the […] button to open the [Schema] dialog box.
- Click the [+] button and add two columns, named Long_Description and Supplier_Product_Code. Click OK to close the dialog box.
- In the Mode area, select Use Inline Content (delimited file).
- In the Content field, enter the unstructured product data you want to standardize, for example:
HONEYWELL 7DAY SNGL CHNL TIMESWITCH C/W EX LARGE BACKLIT DISPLAY ST9100C1106|ST9100C1006||Data to be standardised||
HONEYWELL AQUASTAT SINGLE IMMERSION T/STAT WITH CAPILLARY 70-140C|L6188B2018U||||
HONEYWELL S PLAN ZONE VALVE CONTROL PACK TW6Z910C|TW6Z910C||||
H/WELL ST6100S1001 24HR 1CH SERVIC TIMER|||||
H/WELL Y603A1133 BOXED Y PLAN |||||
HWEL V8600N2056 LOW VOLT COMBI VALVE |||||
HWEL VR4700C4022 GAS VALVE |||||
DULUX PAINTPOD 2M EXTENSION TUBE |6030950||||
DULUX PAINTPOD REPLACEMENT ROLLER SLEEVE|6030948||||
DULUX PAINTPOD EXTRA REACH HANDLE |6030951||||
DULUX PAINTPOD REPLACEMENT EDGING BRUSH |6030952||||
DT TRADE WEATHERSHIELD ULTIMATE OPAQUE BLACK 1L|5180583||||
DT TRADE WEATHERSHIELD ULTIMATE OPAQUE BLACK 2.5L|5180652||||
DT TRADE WEATHERSHIELD ULTIMATE OPAQUE BLACK 5L|5180744||||
DT TRADE WEATHERSHIELD ULTIMATE OPAQUE EXTRA DEEP BASE 1L|5180584||||
- Double-click tStandardizeRow to display the component Basic settings view. This component helps you define the rules necessary to standardize the unstructured input flow, and it generates the brand, range, color and unit in XML format.
- From the Column to parse list, select Long_Description.
- Select the Standardize this field check box.
- Define your rules as follows:
  - In the Conversion rules table, click the [+] button to add the rows necessary to define the rules. This scenario focuses on the rules of the type Index. For detailed examples of the other rule types defined in the capture above, please refer to the other tStandardizeRow scenarios.
  - Define three rules named Brand, Range and Color.
  - From the Type list, select Index and fill in the Value field with the context variable of the indexes you generated. For further information about how to create and use context variables, see Talend Studio User Guide.
  - From the Search mode list, select Match exact. Search modes are only applicable to Index rules. Using the Match exact mode, you will extract from the input flow only the strings that exactly match the brand, range and color index strings you generated with the tSynonymOutput component. For further information about available search modes, see Search modes for Index rules.
- Click the Generate parser code in Routines button in order to generate the code under the Routines folder in the DQ Repository tree view of the Profiling perspective. This step is mandatory; otherwise the Job will not be executed.
- In the Advanced settings view, leave the options selected by default in the Output format area as they are.
- Double-click tLogRow and define the component settings in the Basic settings view.
- In the Mode area, select the Table (print values in cells of a table) option. This component displays the tokens from the input flow that could not be analyzed and matched to any of the index strings.
- Double-click tExtractXMLField to display the component Basic settings view.
- Click the […] button next to the Edit schema field to open the schema editor, add four columns to the output flow and name them Brand, Range, Color and Amount. Click OK to validate and close the editor.
- In the XML field field, check that the STANDARDIZED_FIELD column is selected. The content of the STANDARDIZED_FIELD column is an XML field with a <record> tag as root. Under this tag, the structure defined by the rules configured in the tStandardizeRow component allows you to extract items from the input flow.
- In the Loop XPath query field, enter "/" to define the root as the loop node.
- In the XPath query field of the Mapping table, enter respectively "record/Brand", "record/Range", "record/Color" and "record/Measure".
- Double-click tFileOutputExcel to display the component Basic settings view.
- Set the destination file name as well as the Sheet name, and select the Include header check box.
- Save your Job and press F6 to execute it. The tLogRow component displays on the Studio console the data the Job could not analyze. The normalized and standardized data is written in the output Excel file.
- Right-click the tFileOutputExcel component and select Data viewer. A preview of the standardized data is opened in the Studio.
The Brand, Range and Color rules you defined have checked each token in the input flow against the index strings, and only input strings that exactly match an index string are extracted. Take for example line 11: no color has been extracted from this record because it does not have an exact match to any of the color index strings.
Try changing, in the tStandardizeRow basic settings, the search mode for the Color rule to Match any, and then execute the Job. You can see that a color has now been extracted for the record on line 11. This record has one word, "EXTRA", that matches the color index "Extra Deep Base", and that is enough, according to the Match any search mode, to consider that the record has a color.
The different search modes available in the tStandardizeRow component allow you to standardize and output the data you expect.
This scenario describes how to build a set of rules to extract some information from unstructured data. It explains how to use a basic ANTLR rule to tokenize data, and then how to use an advanced rule to check each token created by ANTLR against a regular expression.
This scenario uses:
- a tFixedFlowInput component to create the unstructured data strings.
- a tStandardizeRow component to define the rules necessary to extract the liquid amounts from the data strings.
- a tLogRow component to display the output data.
- Drop the following components from the Palette onto the design workspace: tFixedFlowInput, tStandardizeRow and tLogRow.
- Connect the three components together using Main links.
- Double-click the tFixedFlowInput component to display its Basic settings view.
- Click the […] button to open the [Schema] dialog box, click the [+] button to add a column, name the column product and finally click OK to validate and close the box.
- In the Mode area, select Use Inline Content (delimited file).
- In the Content field, enter the following three strings:
3M PROJECT LAMP 7 LUMENS 32ML
A 5 LUMINES 5 LOW VANILLA 5L 5LIGHT 5 L
DULUX L 54MLP FAC 32 ML
- Double-click the tStandardizeRow component to display its Basic settings view.
- From the Column to parse list, select product.
- In the Conversion rules table, define a basic rule and an advanced rule as follows:
  - Click twice on the [+] button to add two rows. Name the first "Amount" and the second "LiquidAmount".
  - Select Format as the type for the basic rule, and define it to read "INT WHITESPACE* WORD".
  - Select RegExp as the type for the advanced rule, and define it to read "\d+\s*(L|ML)\b". The advanced rule will be executed after the basic ANTLR rule. The "Amount" rule will tokenize the amounts in the three strings: it matches any word with a numeric value in front of it. Then the RegExp rule will check each token created by ANTLR against the regular expression.
- Click the Generate parser code in Routines button in order to generate the code under the Routines folder in the DQ Repository tree view of the Profiling perspective. This step is mandatory; otherwise the Job will not be executed.
- In the Advanced settings view, leave the options selected by default in the Output format area as they are.
- Double-click the tLogRow component and select the Table (print values in cells of a table) option in the Mode area.
- Save your Job and press F6 to execute it. The liquid amount has been extracted from the unstructured data by using a basic ANTLR rule that tokenized the amounts, followed by an advanced rule that checked each token created by ANTLR against the regular expression. Each instance of the XML data is written on a separate row because the Pretty print check box is selected in the Advanced settings of the tStandardizeRow component.