Rule types
Two groups of rule types are provided: the basic rule types and the advanced rule
types.
- Basic rule types: Enumeration, Format and Combination. Rules of these types are composed with some given ANTLR symbols.
- Advanced rule types: Regex, Index and Shape. Rules of these types match the tokenized data and standardize them when needed.
Rules of the advanced types are always executed after the ANTLR-specific (basic) rules, regardless of the order in which the rules are listed. For further information about basic and advanced rules, see Different rule types for different parsing levels and
Using two parsing levels to extract information from unstructured data.
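The execution order described above can be pictured with a minimal Python sketch. It is not the component's implementation: the rule names and the `(name, type)` tuple layout are hypothetical, and keeping the listed order inside each group is an assumption of this sketch. It only illustrates that advanced rules run after basic ones, whatever their position in the table.

```python
# Hypothetical model of a Conversion rules table: (rule name, rule type) pairs.
BASIC_TYPES = {"Enumeration", "Format", "Combination"}
ADVANCED_TYPES = {"Regex", "Index", "Shape"}


def execution_order(rules):
    """Return the rules with basic types first and advanced types last.

    The relative order inside each group is preserved (an assumption
    of this sketch, not something the documentation states).
    """
    basic = [rule for rule in rules if rule[1] in BASIC_TYPES]
    advanced = [rule for rule in rules if rule[1] in ADVANCED_TYPES]
    return basic + advanced


rules = [
    ("ZipCode", "Regex"),           # listed first, but still runs last
    ("LengthUnit", "Enumeration"),
    ("Length", "Format"),
]
ordered = execution_order(rules)
for name, rule_type in ordered:
    print(f"{rule_type}: {name}")
```

Moving the Regex rule up or down in the list does not change when it runs in this model; only its position relative to other advanced rules could matter.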
To create rules of any type, Talend provides the following pre-defined, case-sensitive elements (ANTLR tokens) for defining the composition of a string to be matched:
- INT: integer;
- WORD: word;
- WORD+: literals of several words;
- CAPWORD: capitalized word;
- DECIMAL: decimal float;
- FRACTION: fraction float;
- CURRENCY: currencies;
- ROMAN_NUMERAL: Roman numerals;
- ALPHANUM: combination of alphabetic and numeric characters;
- WHITESPACE: whitespace;
- UNDEFINED: unexpected strings, such as ASCII codes, that no other token can recognize.
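To get a feel for how a string falls into one of these elements, the sketch below classifies single tokens in Python. The regular expressions are rough approximations chosen for illustration, not the grammar the component actually uses; CURRENCY is left out because its definition is not given here, and the order in which the patterns are tried is an assumption of this sketch.

```python
import re

# Rough approximations of the pre-defined elements (assumptions, not Talend's grammar).
# More specific patterns are tried before more general ones.
TOKEN_PATTERNS = [
    ("WHITESPACE", r"\s+"),
    ("FRACTION", r"[0-9]+/[0-9]+"),
    ("DECIMAL", r"[0-9]+\.[0-9]+"),
    ("INT", r"[0-9]+"),
    ("ROMAN_NUMERAL", r"[IVXLCDM]+"),
    ("CAPWORD", r"[A-Z][a-z]+"),
    ("WORD", r"[a-z]+"),
    ("ALPHANUM", r"[A-Za-z0-9]+"),
]


def classify(text):
    """Return the name of the first element whose pattern matches the whole text."""
    for name, pattern in TOKEN_PATTERNS:
        if re.fullmatch(pattern, text):
            return name
    # Nothing matched: mirror the UNDEFINED catch-all element.
    return "UNDEFINED"


for sample in ["12", "1.4", "3/4", "Main", "main", "XIV", "A12", "#"]:
    print(repr(sample), "->", classify(sample))
```

The element names are case-sensitive, so the sketch keeps them in upper case exactly as listed above.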
The following three tables successively present detailed information about the basic
types, the advanced types and the ANTLR symbols used by the basic rule types. These
three tables help you to complete the Conversion rules
table in the Basic settings of this component.
For basic rule types:
| Basic Rule Type | Usage | Example | Conditions of rule composition |
|---|---|---|---|
| Enumeration | A rule of this type provides a list of possible matches. | RuleName: RuleValue: “ | – Each option must be put in a pair of single quotation marks – Defined options must be separated by the \| symbol. |
| Format (Rule name starts with upper case) | A rule of this type uses the pre-defined elements along with any | RuleName: RuleValue: This rule means that a whitespace between decimal and lengthunit | – When the name of a Format rule |
| Format (Rule name starts with | A rule of this type is almost the same as a Format rule starting its name with upper case. The | RuleName: RuleValue: The rule matches strings like 1.4 cm or | n/a |
| Combination | A rule of this type is used when you need to create several rules | RuleName: RuleValue: The rule matches strings like 1.4 cm by 1.4 | – Literal texts or characters are not accepted as a part of the – When several Combination rules |
Any characters or string literals, if accepted by a rule type, must be put in
single quotation marks when used, otherwise they will be treated as ANTLR grammar
symbols or variables and generate errors or unexpected results at runtime.
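The quoting and separator conditions for Enumeration rules can be illustrated with a small Python helper that builds a rule value from a list of options. The function name, the sample options and the spaces around the `|` symbol are assumptions made for this sketch; it only shows each option wrapped in single quotation marks and the options separated by `|`.

```python
def enumeration_value(options):
    """Wrap each option in single quotation marks and join the options with '|'."""
    for option in options:
        # How a quote inside an option would be escaped is not covered here,
        # so this sketch simply rejects such options.
        if "'" in option:
            raise ValueError(f"option contains a single quote: {option!r}")
    return " | ".join(f"'{option}'" for option in options)


rule_value = enumeration_value(["m", "cm", "km"])
print(rule_value)  # 'm' | 'cm' | 'km'
```

A value written without the quotation marks, such as `m | cm | km`, would be read as grammar symbols or variables rather than literals.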
For advanced rule types:
| Advanced Rule Type | Usage | Example | Conditions |
|---|---|---|---|
| Regex | A rule of this type uses regular expressions to match the incoming | RuleName: RuleValue: The rule matches strings like “92150” | Regular expressions must be Java compliant. |
| Index | A rule of this type uses a synonym index as reference to search For further information about available synonym indexes, see the | A scenario is available in Scenario 2: Standardizing addresses from unstructured data. | – In Windows, the backslashes – Before the full path to the index, you need to enter the protocol: – When processing a record, a given Index rule matches up only the first string – In a |
| Shape | A rule of this type uses pre-defined elements along with the | RuleName: RuleValue: This rule matches the addresses like 12 main For further information about the Shape rule type, see Scenario 2: Standardizing addresses from unstructured data. | Only the contents put in |
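As an illustration of a Regex rule, the following Python sketch tests tokens against a pattern that accepts strings like “92150”. The pattern itself is hypothetical, since the actual rule value is not shown in this section, and matching the whole token is an assumption. Python's `re` module stands in for the Java regular expression engine; a character class with a fixed repetition count, as used here, is written the same way in both.

```python
import re

# Hypothetical pattern: exactly five ASCII digits, for example 92150.
ZIP_PATTERN = re.compile(r"[0-9]{5}")


def matches_zip(token):
    """Return True when the whole token consists of exactly five digits."""
    return ZIP_PATTERN.fullmatch(token) is not None


for token in ["92150", "9215", "92150a"]:
    print(token, "->", matches_zip(token))
```

When writing the real rule, check the expression against the Java `java.util.regex.Pattern` syntax, since constructs beyond this simple subset can differ between engines.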
For the given ANTLR symbols:
| Symbols | Meaning |
|---|---|
| \| | alternative |
| '...' | char or string literal |
| + | 1 or more |
| * | 0 or more |
| ? | optional or semantic predicate |
| ~ | match not |
Examples of using these symbols are presented in the following scenarios, but you can
also find more examples on the following site:
https://theantlrguy.atlassian.net/wiki/display/ANTLR3/ANTLR+Cheat+Sheet.