Rule types

Two groups of rule types are provided: the basic rule types and the advanced rule
types.

Basic rule types: Enumeration, Format and Combination. Rules of these types are composed with a given set of
ANTLR symbols.

Advanced rule types: Regex, Index and Shape. Rules of these types match the tokenized data and standardize it when
needed.

The advanced rule types are always executed after the ANTLR-specific rules, regardless
of rule order. For further information about basic and advanced rules, see Different rule types for different parsing levels and
Using two parsing levels to extract information from unstructured data.

To create rules of any type, Talend provides the following pre-defined,
case-sensitive elements (ANTLR tokens) for defining the composition of a
string to be matched:

INT: integer;
WORD: word;
WORD+: literals of several words;
CAPWORD: capitalized word;
DECIMAL: decimal float;
FRACTION: fraction float;
CURRENCY: currencies;
ROMAN_NUMERAL: Roman numerals;
ALPHANUM: combination of alphabetic and numeric characters;
WHITESPACE: whitespace;
UNDEFINED: unexpected strings, such as ASCII codes, that no other token can
recognize.

The following three tables successively present detailed information about the basic
types, the advanced types and the ANTLR symbols used by the basic rule types. These
three tables help you to complete the Conversion rules
table in the Basic settings of this component.

For basic rule types:

Basic Rule Type	Usage	Example	Conditions of rule composition
Enumeration	A rule of this type provides a list of possible matches.	RuleName: `LengthUnit` RuleValue: `" 'inch' \| 'cm' "`	– Each option must be put in a pair of single quotation marks unless this option is a pre-defined element. – Defined options must be separated by the \| symbol.
Format (Rule name starts with upper case)	A rule of this type uses the pre-defined elements along with any user-defined Enumeration, Format or Combination rules to define the composition of a string.	RuleName: `Length` RuleValue: `"DECIMAL WHITESPACE LengthUnit"` This rule requires a whitespace between the decimal and the length unit, so it matches strings like 1.4 cm but does not match a string like 1.4cm. To match both cases, define the rule as, for example, `"DECIMAL WHITESPACE* LengthUnit"`. `LengthUnit` is an Enumeration rule defining `" 'inch' \| 'cm' "`.	– When the name of a Format rule starts with upper case, the rule requires an exact match: you must define every single element of the string, even a whitespace.
Format (Rule name starts with lower case)	A rule of this type is almost the same as a Format rule whose name starts with upper case. The difference is that a Format rule with a lower-case initial does not require an exact match.	RuleName: `length` RuleValue: `"DECIMAL LengthUnit"` The rule matches strings like 1.4 cm or 1.4cm, where `DECIMAL` is one of the pre-defined element types and `LengthUnit` is an Enumeration rule defining `" 'inch' \| 'cm' "`.	n/a
Combination	A rule of this type is used when you need to create several rules of the same name.	RuleName: `Size` (or `size`) RuleValue: `"length BY length"` The rule matches strings like 1.4 cm by 1.4 cm, where `length` is a Format rule (starting with lower case) and `BY` is an Enumeration rule defining `" 'By' \| 'by' \| 'x' \| 'X' "`.	– Literal texts or characters are not accepted as a part of the rule value. When the literal texts or characters are needed, you must create an Enumeration rule to define these texts or characters and then use this Enumeration rule instead. – When several Combination rules use the identical rule name, they are executed in top-down order in the Conversion rules table of the Basic settings of tStandardizeRow, so arrange them properly in order to obtain the best result. For an example, see the following scenario.
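
As a rough illustration of the difference between the two upper-case Format rule values shown in the table above, the following Java sketch emulates them with `java.util.regex`. It is an analogy only: the component evaluates these rules through ANTLR, the regular expressions standing in for `DECIMAL` and `WHITESPACE` are assumed approximations, and the class name `FormatRuleSketch` is hypothetical.

```java
import java.util.regex.Pattern;

// Analogy only: regex stand-ins for the pre-defined tokens,
// not the grammar that tStandardizeRow actually generates.
class FormatRuleSketch {
    static final String DECIMAL = "\\d+\\.\\d+";      // assumed shape of DECIMAL
    static final String WHITESPACE = "\\s";             // assumed: one whitespace character
    static final String LENGTH_UNIT = "(?:inch|cm)";    // Enumeration: 'inch' | 'cm'

    // Emulates "DECIMAL WHITESPACE LengthUnit": the whitespace is mandatory.
    static final Pattern STRICT =
            Pattern.compile(DECIMAL + WHITESPACE + LENGTH_UNIT);

    // Emulates "DECIMAL WHITESPACE* LengthUnit": zero or more whitespaces.
    static final Pattern RELAXED =
            Pattern.compile(DECIMAL + WHITESPACE + "*" + LENGTH_UNIT);

    static boolean strict(String s) {
        return STRICT.matcher(s).matches();
    }

    static boolean relaxed(String s) {
        return RELAXED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(strict("1.4 cm"));   // true
        System.out.println(strict("1.4cm"));    // false: whitespace missing
        System.out.println(relaxed("1.4 cm"));  // true
        System.out.println(relaxed("1.4cm"));   // true
    }
}
```

The `*` after `WHITESPACE` plays the same role here as the ANTLR `*` symbol (0 or more) in the rule value.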

Warning:

Any characters or string literals, if accepted by a rule type, must be put in
single quotation marks when used, otherwise they will be treated as ANTLR grammar
symbols or variables and generate errors or unexpected results at runtime.

For advanced rule types:

Advanced Rule Type	Usage	Example	Conditions
Regex	A rule of this type uses regular expressions to match the incoming data tokenized by ANTLR.	RuleName: `ZipCode` RuleValue: `"\d{5}"` The rule matches strings like “92150”	Regular expressions must be Java compliant.
Index	A rule of this type uses a synonym index as reference to search for the matched incoming data. For further information about available synonym indexes, see the appendix about data synonym dictionaries in the Talend Studio User Guide.	A scenario is available in Scenario 2: Standardizing addresses from unstructured data.	– In Windows, the backslashes need to be doubled or replaced by slashes `/` if the path is copied from the file system. – Before the full path to the index, you need to enter the protocol: file://, even if you run the Job in local mode, or hdfs:// if the index is on a cluster. – When processing a record, a given Index rule matches only the first string identified as matchable. – In a Talend Map/Reduce Job, you need to compress each synonym index to be used as a zip file; moreover, if you use Talend Oozie scheduler to run that Job, you have to place the zip file in the Hadoop distribution where the Job is run.
Shape	A rule of this type uses pre-defined elements along with the established Regex or Index rules or both to match the incoming data.	RuleName: `Address` RuleValue: `"<INT><WORD><StreetType>"` This rule matches the addresses like 12 main street, where INT and WORD are pre-defined tokens (rule elements) and StreetType is an Index rule which you define along with this example rule in the Basic settings view of this component. For further information about the Shape rule type, see Scenario 2: Standardizing addresses from unstructured data.	Only the contents put in `< >` are recognizable. In the other cases, the contents are considered as error or are omitted.
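
Because Regex rule values must be Java compliant, it can help to try an expression in plain Java before entering it in the Conversion rules table. The sketch below does that for the `ZipCode` example and also shows one way to turn a copied Windows path into the `file://` form described for Index rules. The class name, the helper methods and the path `C:\indexes\street` are hypothetical and are not part of any Talend API.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper for checking rule values outside the Studio.
class AdvancedRuleSketch {

    // Returns true when java.util.regex can compile the expression.
    static boolean isJavaCompliant(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Returns true when the whole token matches the expression.
    static boolean matchesToken(String regex, String token) {
        return Pattern.matches(regex, token);
    }

    // Replaces backslashes with slashes and prepends the file:// protocol.
    static String toIndexUri(String windowsPath) {
        return "file://" + windowsPath.replace('\\', '/');
    }

    public static void main(String[] args) {
        String zipCode = "\\d{5}"; // the RuleValue \d{5}, escaped for a Java string literal

        System.out.println(isJavaCompliant(zipCode));        // true
        System.out.println(isJavaCompliant("[0-9"));         // false: unclosed character class
        System.out.println(matchesToken(zipCode, "92150"));  // true
        System.out.println(matchesToken(zipCode, "9215"));   // false: only four digits

        // Hypothetical index location copied from the Windows file system.
        System.out.println(toIndexUri("C:\\indexes\\street")); // file://C:/indexes/street
    }
}
```

Replacing backslashes with slashes is one of the two options listed in the table; doubling the backslashes is the other.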

For the given ANTLR symbols:

Symbols	Meaning
`\|`	alternative
`'s'`	char or string literal
`+`	1 or more
`*`	0 or more
`?`	optional or semantic predicate
`~`	match not

Examples of using these symbols are presented in the following scenarios, but you can
also find more examples on the following site:

https://theantlrguy.atlassian.net/wiki/display/ANTLR3/ANTLR+Cheat+Sheet.

Source: Talend documentation, https://help.talend.com
