Different rule types for different parsing levels

The tStandardizeRow component uses basic rules based on ANTLR grammar and advanced rules
defined by Talend that are not based on ANTLR.

Sometimes ANTLR rules alone cannot meet all your needs when normalizing and
standardizing data. Suppose, for example, that you want to extract the liquid amount from
the following three records:

3M PROJECT LAMP 7 LUMENS 32ML
A 5 LUMINES 5 LOW VANILLA 5L 5LIGHT 5 L DULUX L
54MLP FAC 32 ML


You may start by defining a liquid unit and a liquid amount as basic parser rules.

If you test these rules in the Profiling perspective of the Studio, you can see that
they extract 7 L from 7 LUMENS, which is not what you expect. You do not want the word
LUMENS to be split into two tokens.
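
The effect can be reproduced outside the Studio with a small sketch. It uses a Python regular expression as a stand-in for a liquid amount pattern that has no boundary check; the pattern and the function name are assumptions made for illustration, not the rules defined in the Studio.

```python
import re

# Hypothetical stand-in for a liquid amount pattern without a boundary check:
# digits, an optional space, then ML or L.
LIQUID_NO_BOUNDARY = re.compile(r"\d+ ?(?:ML|L)")


def extract_liquid_amounts(record):
    """Return every substring that the unbounded pattern accepts."""
    return LIQUID_NO_BOUNDARY.findall(record)


if __name__ == "__main__":
    for record in (
        "3M PROJECT LAMP 7 LUMENS 32ML",
        "A 5 LUMINES 5 LOW VANILLA 5L 5LIGHT 5 L DULUX L",
        "54MLP FAC 32 ML",
    ):
        print(record, "->", extract_liquid_amounts(record))
```

For the first record the sketch reports both 7 L and 32ML, because nothing stops the match in the middle of LUMENS.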

The basic rules you have defined above are ANTLR lexer rules, and lexer rules are used
for tokenizing the input string. ANTLR does not provide a word boundary symbol like the
\b used in regular expressions. You must therefore be careful when choosing
lexer rules because they define how the input strings are split into tokens.

You can solve such a problem using two approaches:

The first approach is to define another basic rule that matches a word with a numeric
value in front of it. This is the Amount rule in this example.

This basic rule is a lexer rule: a rule of type Format whose name starts with an
uppercase letter. If you test this rule in the Profiling perspective of the Studio, you
can see that non-liquid amounts are matched by this rule and that the LiquidAmount rule
matches only the expected sequences of characters.
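
A minimal sketch of why the extra rule helps, assuming the lexer keeps the longest match at each position and gives ties to the rule defined first. The patterns below are hypothetical regular expression stand-ins for the LiquidAmount and Amount rules, and WORD is an illustrative catch-all; none of this is the ANTLR grammar itself.

```python
import re

# Hypothetical stand-ins for the lexer rules. Order matters only for ties:
# when two rules match the same length, the earlier one wins.
RULES = [
    ("LiquidAmount", re.compile(r"\d+ *(?:ML|L)")),
    ("Amount", re.compile(r"\d+ *[A-Z]+")),
    ("WORD", re.compile(r"[A-Z]+")),
]


def tokenize(text):
    """Split text into (rule_name, lexeme) pairs using longest-match selection."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # whitespace between tokens is skipped
            continue
        best_name, best_end = None, pos
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Strictly longer matches replace the current best; ties keep the earlier rule.
            if match and match.end() > best_end:
                best_name, best_end = name, match.end()
        if best_name is None:
            # No rule applies: emit the single character as unmatched.
            best_name, best_end = "UNMATCHED", pos + 1
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


if __name__ == "__main__":
    for token in tokenize("3M PROJECT LAMP 7 LUMENS 32ML"):
        print(token)
```

Under these assumptions 7 LUMENS is kept whole as an Amount because that match is longer than 7 L, while 32ML matches both rules at the same length and goes to LiquidAmount because it is listed first.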

The second approach is to use an advanced rule such as a regular expression and define a
word boundary with \b. You can use a lexer rule to tokenize amounts, matching any word
with a numeric value in front of it. Then use a regular expression that matches liquid
amounts as follows: a digit, optionally followed by a space, followed by L or ML, and
terminated by a word boundary.

Note that the regular expression is applied to the tokens created by the basic
lexer rule.
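
The second pass can be sketched as follows. The regular expression is one possible reading of the description above, written with Python's re module rather than the engine used by the component, and the function name and the token list are assumptions for illustration.

```python
import re

# One reading of the description: digits, an optional space, ML or L,
# then a word boundary so that "7 LUMENS" or "54MLP" are rejected.
LIQUID_AMOUNT = re.compile(r"\d+ ?(?:ML|L)\b")


def classify_amount_tokens(amount_tokens):
    """Relabel tokens produced by the Amount lexer rule using the regular expression."""
    return [
        ("LiquidAmount" if LIQUID_AMOUNT.match(token) else "Amount", token)
        for token in amount_tokens
    ]


if __name__ == "__main__":
    # Tokens the Amount rule would produce for the first record (assumed).
    for label, token in classify_amount_tokens(["3M", "7 LUMENS", "32ML"]):
        print(label, token)
```

Only 32ML is relabeled as a liquid amount here; 3M and 7 LUMENS stay plain amounts because the word boundary check fails or the unit is absent.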

You cannot check the results of an advanced rule by testing it in the Profiling
perspective of the Studio as you do with basic rules. The only way to see the results of
advanced rules is to use them in a Job. The results look like the following:

3M PROJECT LAMP 7 LUMENS 32ML
<record>
	<Amount>3M</Amount>
	<Amount>7 LUMENS</Amount>
	<LiquidAmount>32ML</LiquidAmount>
	<UNMATCHED>
		<CAPWORD>PROJECT</CAPWORD>
		<CAPWORD>LAMP</CAPWORD>
	</UNMATCHED>
</record>


For a Job example about the use of the above rules, see Using two parsing levels to extract information from unstructured data.

Comparing the two approaches, the first uses only ANTLR grammar and may be more
efficient than the second, which requires a second parsing pass to check each
token against the regular expression. However, users who are comfortable with regular
expressions can use them to create more advanced rules that would be hard to
write using ANTLR alone.

Source: Talend documentation, https://help.talend.com
