August 15, 2023

Rule types – Docs for ESB 6.x

Rule types

Two groups of rule types are provided: the basic rule types and the advanced rule
types.

  • Basic rule types: Enumeration, Format and Combination. Rules of these types are composed using the
    ANTLR symbols listed in the third table below.

  • Advanced rule types: Regex, Index and Shape.
    Rules of these types match the tokenized data and standardize them when
    needed.

Advanced rules are always executed after the ANTLR-specific (basic) rules, regardless
of the order in which the rules are listed. For further information about basic and advanced rules, see Different rule types for different parsing levels and
Using two parsing levels to extract information from unstructured data.

To create rules of any type, Talend provides the following pre-defined,
case-sensitive elements (ANTLR tokens) for defining the composition of a
string to be matched:

  • INT: integer;

  • WORD: word;

  • WORD+: literals of several words;

  • CAPWORD: capitalized word;

  • DECIMAL: decimal float;

  • FRACTION: fraction float;

  • CURRENCY: currencies;

  • ROMAN_NUMERAL: Roman numerals;

  • ALPHANUM: combination of alphabetic and numeric characters;

  • WHITESPACE: whitespace;

  • UNDEFINED: unexpected strings, such as ASCII codes, that no other token can
    recognize.
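To get a feel for how these elements partition an input string, the following Python sketch approximates some of them with regular expressions. The patterns and the `tokenize` helper are assumptions made for illustration only: they are not Talend's actual ANTLR token definitions, and CURRENCY and ROMAN_NUMERAL are left out.

```python
import re

# Illustrative approximations only; NOT Talend's real ANTLR token definitions.
# When two patterns match the same length, the one listed first wins.
TOKEN_PATTERNS = [
    ("WHITESPACE", r"\s+"),
    ("FRACTION", r"\d+/\d+"),
    ("DECIMAL", r"\d+\.\d+"),
    ("INT", r"\d+"),
    ("CAPWORD", r"[A-Z][a-z]+"),
    ("WORD", r"[A-Za-z]+"),
    ("ALPHANUM", r"[A-Za-z0-9]+"),
    ("UNDEFINED", r"."),  # fallback for anything no other pattern recognizes
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in TOKEN_PATTERNS]


def tokenize(text):
    """Split text into (element, lexeme) pairs, taking the longest match at each position."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, rx in COMPILED:
            m = rx.match(text, pos)
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


print(tokenize("1.4 cm"))
# [('DECIMAL', '1.4'), ('WHITESPACE', ' '), ('WORD', 'cm')]
```

Because the element names are case-sensitive, a rule that refers to `Decimal` or `word` would not refer to the elements listed above.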

The following three tables successively present detailed information about the basic
types, the advanced types and the ANTLR symbols used by the basic rule types. These
three tables help you to complete the Conversion rules
table in the Basic settings of this component.

For basic rule types:

Each basic rule type below is described by its usage, an example, and the conditions of rule composition.

Enumeration

  Usage: A rule of this type provides a list of possible matches.

  Example:
    RuleName: LengthUnit
    RuleValue: " 'inch' | 'cm' "

  Conditions:
    – Each option must be enclosed in single quotation marks unless it is a pre-defined element.
    – Defined options must be separated by the | symbol.
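The two conditions above can be checked mechanically. The sketch below is a hypothetical helper (`parse_enumeration` is not part of any Talend API) that splits an Enumeration rule value on `|` and rejects any option that is neither quoted nor a pre-defined element. It assumes that no quoted option itself contains a `|` character.

```python
import re

# The pre-defined, case-sensitive elements listed earlier in this section.
PREDEFINED = {
    "INT", "WORD", "CAPWORD", "DECIMAL", "FRACTION", "CURRENCY",
    "ROMAN_NUMERAL", "ALPHANUM", "WHITESPACE", "UNDEFINED",
}


def parse_enumeration(rule_value):
    """Return the options of an Enumeration rule value.

    Raises ValueError for an option that is neither single-quoted
    nor a pre-defined element. Assumes '|' never appears inside a literal.
    """
    options = []
    for raw in rule_value.split("|"):
        option = raw.strip()
        quoted = re.fullmatch(r"'([^']*)'", option)
        if quoted:
            options.append(quoted.group(1))   # literal text without its quotes
        elif option in PREDEFINED:
            options.append(option)            # pre-defined element name
        else:
            raise ValueError(f"option must be quoted or pre-defined: {option!r}")
    return options


print(parse_enumeration(" 'inch' | 'cm' "))   # ['inch', 'cm']
```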

Format (rule name starts with upper case)

  Usage: A rule of this type uses the pre-defined elements along with any user-defined Enumeration, Format or Combination rules to define the composition of a string.

  Example:
    RuleName: Length
    RuleValue: "DECIMAL WHITESPACE LengthUnit"

    This rule requires a whitespace between the decimal and the length unit, so it matches strings like 1.4 cm but does not match a string like 1.4cm. To match both cases, define the rule as, for example, "DECIMAL WHITESPACE* LengthUnit".

    LengthUnit is an Enumeration rule defining " 'inch' | 'cm' ".

  Conditions:
    – When the name of a Format rule starts with upper case, the rule requires an exact match. This means that you must define every single element of the string, including whitespace.

Format (rule name starts with lower case)

  Usage: A rule of this type is almost the same as a Format rule whose name starts with upper case. The difference is that a Format rule with a lower-case initial does not require an exact match.

  Example:
    RuleName: length
    RuleValue: "DECIMAL LengthUnit"

    The rule matches strings like 1.4 cm or 1.4cm, where DECIMAL is one of the pre-defined element types and LengthUnit is an Enumeration rule defining " 'inch' | 'cm' ".

  Conditions: n/a
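The difference between an upper-case (exact) and a lower-case (non-exact) Format rule can be pictured with regular-expression analogues. This is only a sketch: the patterns chosen for DECIMAL and WHITESPACE are assumptions, and the component itself works with ANTLR grammars rather than with these expressions.

```python
import re

DECIMAL = r"\d+\.\d+"         # assumed shape of the DECIMAL element
WHITESPACE = r"\s+"           # assumed shape of the WHITESPACE element
LENGTH_UNIT = r"(?:inch|cm)"  # Enumeration rule: 'inch' | 'cm'

# Upper-case "Length": "DECIMAL WHITESPACE LengthUnit", every element is required.
Length = re.compile(DECIMAL + WHITESPACE + LENGTH_UNIT)

# Upper-case rule rewritten as "DECIMAL WHITESPACE* LengthUnit": whitespace is optional.
Length_optional_ws = re.compile(DECIMAL + r"\s*" + LENGTH_UNIT)

# Lower-case "length": "DECIMAL LengthUnit", whitespace between elements is tolerated.
length = re.compile(DECIMAL + r"\s*" + LENGTH_UNIT)

for sample in ("1.4 cm", "1.4cm"):
    print(
        sample,
        bool(Length.fullmatch(sample)),
        bool(Length_optional_ws.fullmatch(sample)),
        bool(length.fullmatch(sample)),
    )
# 1.4 cm True True True
# 1.4cm False True True
```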

Combination

  Usage: A rule of this type is used when you need to create several rules of the same name.

  Example:
    RuleName: Size (or size)
    RuleValue: "length BY length"

    The rule matches strings like 1.4 cm by 1.4 cm, where length is a Format rule (starting with lower case) and BY is an Enumeration rule defining " 'By' | 'by' | 'x' | 'X' ".

  Conditions:
    – Literal text or characters are not accepted as part of the rule value. When literal text or characters are needed, you must create an Enumeration rule to define them and then use that Enumeration rule instead.
    – When several Combination rules use the identical rule name, they are executed in top-down order in the Conversion rules table of the Basic settings of tStandardizeRow, so arrange them carefully in order to obtain the best result. For an example, see the following scenario.
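Why the order of same-name Combination rules matters can be sketched as follows. The example assumes, for illustration only, that the first rule that matches is the one applied; the documentation states only that the rules are executed top-down. The regular expressions are stand-ins for the length and BY rules, and the three-length variant of Size is hypothetical.

```python
import re

LENGTH = r"\d+\.\d+\s*(?:inch|cm)"  # stand-in for the 'length' Format rule
BY = r"(?:By|by|x|X)"               # stand-in for the BY Enumeration rule


def first_match(rules, text):
    """Try the rules top-down and return (label, matched_text) for the first hit."""
    for label, pattern in rules:
        m = re.match(pattern, text)
        if m:
            return label, m.group(0)
    return None


# Two rules that would share the name "Size" in the Conversion rules table.
two_lengths = ("Size: length BY length", rf"{LENGTH}\s*{BY}\s*{LENGTH}")
three_lengths = (
    "Size: length BY length BY length",
    rf"{LENGTH}\s*{BY}\s*{LENGTH}\s*{BY}\s*{LENGTH}",
)

text = "1.4 cm by 2.0 cm by 3.5 cm"
print(first_match([two_lengths, three_lengths], text))
# ('Size: length BY length', '1.4 cm by 2.0 cm')
print(first_match([three_lengths, two_lengths], text))
# ('Size: length BY length BY length', '1.4 cm by 2.0 cm by 3.5 cm')
```

Under this assumption, placing the more specific rule above the more general one lets it capture the whole string.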

Warning:

Any characters or string literals, if accepted by a rule type, must be put in
single quotation marks when used, otherwise they will be treated as ANTLR grammar
symbols or variables and generate errors or unexpected results at runtime.

For advanced rule types:

Each advanced rule type below is described by its usage, an example, and its conditions.

Regex

  Usage: A rule of this type uses regular expressions to match the incoming data tokenized by ANTLR.

  Example:
    RuleName: ZipCode
    RuleValue: "\d{5}"

    The rule matches strings like “92150”.

  Conditions:
    – Regular expressions must be Java compliant.
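The ZipCode example can be tried out on a few tokens before it is entered in the rule table. The sketch below uses Python's `re` module as a stand-in for the Java regex engine and assumes that a Regex rule has to match a whole token; `tag_tokens` is a hypothetical helper. The `re.ASCII` flag mirrors Java's default, where `\d` matches only the digits 0 to 9.

```python
import re

# RuleValue of the ZipCode example; re.ASCII keeps \d limited to 0-9 as in Java's default mode.
ZIP_CODE = re.compile(r"\d{5}", re.ASCII)


def tag_tokens(tokens):
    """Label each token 'ZipCode' when the whole token matches the rule, else None."""
    return [(tok, "ZipCode" if ZIP_CODE.fullmatch(tok) else None) for tok in tokens]


print(tag_tokens(["92150", "9215", "921500", "9215O"]))
# [('92150', 'ZipCode'), ('9215', None), ('921500', None), ('9215O', None)]
```

Simple patterns such as this one behave the same in both engines, but constructs that exist in only one of them should be verified against the Java syntax.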

Index

  Usage: A rule of this type uses a synonym index as a reference to search for matching incoming data.

    For further information about available synonym indexes, see the appendix about data synonym dictionaries in the Talend Studio User Guide.

  Example: A scenario is available in Scenario 2: Standardizing addresses from unstructured data.

  Conditions:
    – In Windows, the backslashes need to be doubled or replaced by slashes / if the path is copied from the file system.
    – Before the full path to the index, you need to enter the protocol: file://, even if you run the Job in local mode, or hdfs:// if the index is on a cluster.
    – When processing a record, a given Index rule matches only the first string identified as matchable.
    – In a Talend Map/Reduce Job, you need to compress each synonym index to be used as a zip file; moreover, if you use the Talend Oozie scheduler to run that Job, you have to place the zip file in the Hadoop distribution where the Job is run.
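The two path conditions above can be applied with a small helper before you paste a path into the rule value. This is a minimal sketch: `index_uri` and the example paths are hypothetical, and the exact form accepted by the component should be checked in your own Job.

```python
def index_uri(path, on_cluster=False):
    """Prefix an index path with its protocol and replace backslashes with slashes.

    Follows the conditions stated above: file:// for a local index,
    hdfs:// for an index stored on a cluster.
    """
    scheme = "hdfs://" if on_cluster else "file://"
    return scheme + path.replace("\\", "/")


# Hypothetical locations, for illustration only.
print(index_uri(r"C:\indexes\street_types"))
# file://C:/indexes/street_types
print(index_uri("/user/talend/indexes/street_types", on_cluster=True))
# hdfs:///user/talend/indexes/street_types
```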

Shape

  Usage: A rule of this type uses pre-defined elements along with the established Regex or Index rules, or both, to match the incoming data.

  Example:
    RuleName: Address
    RuleValue: "<INT><WORD><StreetType>"

    This rule matches addresses like 12 main street, where INT and WORD are pre-defined tokens (rule elements) and StreetType is an Index rule which you define along with this example rule in the Basic settings view of this component.

    For further information about the Shape rule type, see Scenario 2: Standardizing addresses from unstructured data.

  Conditions:
    – Only the contents put in < > are recognized. Any other content is treated as an error or omitted.
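To see which parts of a Shape rule value would be recognized, you can list what sits inside the < > pairs and what falls outside them. The helpers below (`shape_elements`, `ignored_text`) are hypothetical and only illustrate the condition above; they do not reproduce how the component parses the rule.

```python
import re


def shape_elements(rule_value):
    """Return the element names written inside < > in a Shape rule value."""
    return re.findall(r"<([^<>]+)>", rule_value)


def ignored_text(rule_value):
    """Return the text left outside the < > pairs, which would not be recognized."""
    return re.sub(r"<[^<>]+>", "", rule_value).strip()


print(shape_elements("<INT><WORD><StreetType>"))  # ['INT', 'WORD', 'StreetType']
print(ignored_text("<INT> road <WORD>"))          # road
```

A non-empty result from `ignored_text` is a hint that part of the rule value needs to be wrapped in < > or moved into a Regex or Index rule.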

For the given ANTLR symbols:

Symbol   Meaning

|        alternative
's'      char or string literal
+        1 or more
*        0 or more
?        optional or semantic predicate
~        match not

Examples of using these symbols are presented in the following scenarios, but you can
also find more examples on the following site:

https://theantlrguy.atlassian.net/wiki/display/ANTLR3/ANTLR+Cheat+Sheet.


Document obtained from Talend: https://help.talend.com