August 15, 2023

Different rule types for different parsing levels – Docs for ESB 6.x

The tStandardizeRow component uses basic rules based on ANTLR grammar and advanced
rules defined by Talend and not based on ANTLR.

Sometimes, ANTLR rules alone cannot meet all your expectations when normalizing and
standardizing data. Suppose, for example, that you want to extract the liquid amount in
the following three records:

You may start by defining a liquid unit and a liquid amount in basic parser rules as
follows:

parser_rules.png

If you test these rules in the Profiling perspective of the Studio, you can see that
they extract 7 L from 7 LUMENS, which is not what you expect. You do not want the word
LUMENS to be split into two tokens.

parser_rules2.png

The basic rules you have defined above are ANTLR lexer rules, and lexer rules are used
for tokenizing the input string. ANTLR does not provide a word boundary symbol like the
\b used in regular expressions. You must therefore be careful when choosing
lexer rules because they define how the input strings are split into tokens.
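
The effect can be reproduced outside the Studio with a small Python sketch. It is only
an analogy: the pattern below is a Python regular expression standing in for the liquid
amount rule, not ANTLR syntax, and the name LIQUID_NO_BOUNDARY and the helper
extract_prefix are hypothetical.

```python
import re

# Hypothetical stand-in for the liquid amount rule: digits, an optional
# space, then ML or L. There is deliberately no word boundary at the end.
LIQUID_NO_BOUNDARY = re.compile(r"\d+ ?(?:ML|L)")


def extract_prefix(text):
    """Return the characters the pattern consumes at the start of text, or None."""
    match = LIQUID_NO_BOUNDARY.match(text)
    return match.group(0) if match else None


print(extract_prefix("7 L"))       # 7 L
print(extract_prefix("7 LUMENS"))  # 7 L (LUMENS is cut after its first letter)
```

Because nothing in the pattern says where the unit must end, the L of LUMENS is accepted
as a liquid unit, which is the same split observed with the basic rules.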

You can solve such a problem using two approaches:

The first approach is to define another basic rule that matches a word with a numeric
value in front of it, the Amount rule in this example:

parser_rules3.png

This basic rule is a lexer rule: a Format rule, whose name starts with an uppercase
letter. If you test this rule in the Profiling perspective of the Studio, you can see
that non-liquid amounts are matched by this rule and that the LiquidAmount rule matches
only the expected sequence of characters.

parser_rules4.png
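
The reason a more general rule helps is that an ANTLR lexer prefers the rule that
matches the longest run of characters, and on a tie it takes the rule defined first. The
Python sketch below imitates that selection. The two patterns are hypothetical
approximations of the LiquidAmount and Amount rules (the real definitions are in the
screenshot above), and it assumes LiquidAmount is defined before Amount.

```python
import re

# Hypothetical approximations of the two rules, in definition order.
# On a tie in match length, the earlier rule wins.
RULES = [
    ("LiquidAmount", re.compile(r"\d+ ?(?:ML|L)")),
    ("Amount", re.compile(r"\d+ ?[A-Z]+")),
]


def classify(text):
    """Return (rule name, matched text) for the longest match at the start of text."""
    best_name, best_len = None, 0
    for name, pattern in RULES:
        match = pattern.match(text)
        # Strictly greater: an equal-length match does not displace an earlier rule.
        if match and len(match.group(0)) > best_len:
            best_name, best_len = name, len(match.group(0))
    return best_name, text[:best_len]


print(classify("7 LUMENS"))  # ('Amount', '7 LUMENS')
print(classify("7 L"))       # ('LiquidAmount', '7 L')
```

With the general rule present, 7 LUMENS is consumed as a single Amount token, so
LiquidAmount no longer gets the chance to cut the word in two.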

The second approach is to use an advanced rule, such as a regular expression, and define
a word boundary with \b. You can use a lexer rule to tokenize amounts by
matching any word with a numeric value in front of it. Then use a regular expression
that matches liquid amounts as follows: a digit, optionally followed by a space,
followed by L or ML, and terminated by a word boundary.

parser_rules5.png

Note that the regular expression is applied to the tokens created by the basic
lexer rule.
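
The two passes can be sketched in Python as follows. This is an illustration under
assumptions, not the component's implementation: AMOUNT_TOKEN approximates the basic
lexer rule, LIQUID approximates the advanced regular expression, and the sample records
are invented for the example.

```python
import re

# Pass 1 (hypothetical analogue of the basic lexer rule):
# a number, an optional space, then a word.
AMOUNT_TOKEN = re.compile(r"\d+ ?[A-Z]+")

# Pass 2 (analogue of the advanced rule): a number, an optional space,
# L or ML, terminated by a word boundary.
LIQUID = re.compile(r"\d+ ?(?:ML|L)\b")


def liquid_amounts(record):
    """Tokenize amounts first, then keep only the tokens the regex accepts."""
    tokens = AMOUNT_TOKEN.findall(record)
    return [token for token in tokens if LIQUID.match(token)]


# Invented sample records, for illustration only.
print(liquid_amounts("SODA 2 L BOTTLE"))  # ['2 L']
print(liquid_amounts("WATER 500ML"))      # ['500ML']
print(liquid_amounts("LAMP 7 LUMENS"))    # []
```

The word boundary rejects 7 LUMENS because L is immediately followed by another word
character, while 2 L and 500ML end exactly where the token ends.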

You cannot check the results of the advanced rule by testing it in the Profiling
perspective of the Studio as you do with basic rules. The only way to see the results
of advanced rules is to use them in a Job.
The results will look like the following:

For a Job example about the use of the above rules, see Using two parsing levels to extract information from unstructured data.

Comparing these two approaches, the first one uses only ANTLR grammar and may be more
efficient than the second, which requires a second parsing pass to check each
token against the regular expression. However, users who are comfortable with regular
expressions can use them to create more advanced rules that would be hard to
write using ANTLR only.


Document retrieved from Talend: https://help.talend.com