
tMatchGroup

Creates groups of similar data records in any source data including large volumes
of data by using one or several match rules.

tMatchGroup compares columns in both
standard input data flows and in M/R input data flows by using matching methods and
groups similar encountered duplicates together.

Several tMatchGroup components can be used sequentially to match data
against different blocking keys. This will refine the groups received by
each of the tMatchGroup components through
creating different data partitions that overlap previous data blocks and so
on.

In defining a group, the first processed record of each
group is the master record of the group. The remaining records are
compared with the master records, and their distances from those
masters determine the group to which each record is assigned.

Depending on the Talend
product you are using, this component can be used in one, some or all of the following
Job frameworks:

tMatchGroup Standard properties

These properties are used to configure tMatchGroup running in the Standard Job framework.

The Standard
tMatchGroup component belongs to the Data Quality family.

The component in this framework is available in Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and in Talend Data Fabric.

Basic settings

Schema and Edit
schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

Click Sync
columns
to retrieve the schema from the previous component connected in the
Job.

The output schema of this component contains the following read-only
fields:

GID: provides a group identifier of the data type String.

Note: In Jobs migrated from previous releases to your
current Talend Studio,
the group identifier may be of the Long data type. To have a group
identifier of the String data type, replace the tMatchGroup components in the
migrated Jobs with tMatchGroup
components from the Palette.

GRP_SIZE: counts the number of records in the
group, computed only on the master record.

MASTER: identifies, by true or
false, if the record used in the matching comparisons
is a master record. There is only one master record per group.

Each input record is compared with the master records; if a record
matches a master record, it is added to that group.

SCORE: measures the distance between the input
record and the master record according to the matching algorithm
used.

When the tMatchGroup component is configured
with multiple output flows, the score in this column determines the
output flow to which the record is sent.

GRP_QUALITY depends on the
Matching Algorithm:

  • Simple VSR: GRP_QUALITY provides the
    quality of similarities in the group by taking the
    minimal matching value. Only the master record has a
    quality score.
  • T-Swoosh: GRP_QUALITY provides the
    quality of similarities in the group by taking the
    minimal matching value among all record pairs of the
    group. Only the master record has a quality score.

MERGE_INFO: this output column is available
only when you have more than one tMatchGroup component in the Job. The column indicates
with true or false whether the record was a
master record in the first pass.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Matching Algorithm

Select from the list the algorithm you want to use in the component:
Simple VSR or T-Swoosh.

tMatchGroup_1.png

Click the import icon to import a match rule from the
Studio repository.

In the Match Rule Selector wizard, import a match
rule that uses the same algorithm as the matching algorithm selected in the
basic settings of the component. Otherwise, the Job runs with default
values for the parameters which are not compatible between the Simple
VSR and T-Swoosh algorithms.

For further information about how to import rules, see Importing match rules from the studio repository.

Key Definition

Input Key Attribute

Select the column(s) from the input flow on which you want to apply a
matching algorithm.

Note: When you select a date column on which to apply an algorithm or a matching
algorithm, you can decide what to compare in the date format.

For example,
if you want to only compare the year in the date, in the component schema set the type
of the date column to Date and then enter “yyyy” in the Date Pattern field.
The component then converts the date format to a string according to the pattern defined
in the schema before starting a string comparison.

 

Matching Function

Select a matching algorithm from the list:

Exact: matches each processed entry
to all possible reference entries with exactly the same value. It
returns 1 when the two strings exactly match,
otherwise it returns 0.

Exact – ignore case: matches each
processed entry to all possible reference entries with exactly the same
value while ignoring the value case.

Soundex: matches
processed entries according to a standard English phonetic algorithm. It
indexes strings by sound, as pronounced in English, for example “Hello”:
“H400”. It does not support Chinese characters.
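If you want to see what a Soundex code looks like outside the Studio, a
library such as Apache Commons Codec can compute one; this is only an
illustration, not necessarily the implementation bundled with the
component:

    import org.apache.commons.codec.language.Soundex;

    public class SoundexPreview {
        public static void main(String[] args) {
            Soundex soundex = new Soundex();
            // "Hello" is indexed by sound as "H400".
            System.out.println(soundex.soundex("Hello")); // H400
            // Names that sound alike receive the same code.
            System.out.println(soundex.soundex("Smith")); // S530
            System.out.println(soundex.soundex("Smyth")); // S530
        }
    }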

Levenshtein (edit distance):
calculates the minimum number of edits (insertion, deletion or
substitution) required to transform one string into another. Using this
algorithm in the tMatchGroup
component, you do not need to specify a maximum distance. The component
automatically calculates a matching percentage based on the distance.
This matching score will be used for the global matching calculation,
based on the weight you assign in the Confidence
Weight
field.
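As an illustration of how an edit distance can be turned into a matching
percentage, the sketch below computes the Levenshtein distance and
normalizes it by the length of the longer string; the exact normalization
used by the component is not documented here, so treat the formula as an
assumption:

    public class LevenshteinSimilarity {

        // Classic dynamic-programming edit distance (insertions, deletions, substitutions).
        static int distance(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                       d[i - 1][j - 1] + cost);
                }
            }
            return d[a.length()][b.length()];
        }

        // One possible way to turn the distance into a score between 0 and 1.
        static double similarity(String a, String b) {
            int maxLen = Math.max(a.length(), b.length());
            if (maxLen == 0) return 1.0;
            return 1.0 - (double) distance(a, b) / maxLen;
        }

        public static void main(String[] args) {
            System.out.println(distance("John Doe", "Jon Doe"));   // 1
            System.out.println(similarity("John Doe", "Jon Doe")); // 0.875
        }
    }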

Metaphone: Based
on a phonetic algorithm for indexing entries by their pronunciation. It
first loads the phonetics of all entries of the lookup reference and
checks all entries of the main flow against the entries of the reference
flow. It does not support Chinese characters.

Double
Metaphone
: a new version of the Metaphone phonetic
algorithm, that produces more accurate results than the original
algorithm. It can return both a primary and a secondary code for a
string. This accounts for some ambiguous cases as well as for multiple
variants of surnames with common ancestry. It does not support Chinese characters.

Soundex FR:
matches processed entries according to a standard French phonetic
algorithm. It does not support Chinese characters.

Jaro: matches processed entries
according to spelling deviations. It counts the number of matched
characters between two strings. The higher the value is, the more
similar the strings are.

Jaro-Winkler: a variant of Jaro, but
it gives more importance to the beginning of the string.
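To get a feel for Jaro and Jaro-Winkler scores outside the Studio, you can
use a library implementation such as the one in Apache Commons Text; again,
this is only an illustration and not necessarily the implementation used by
the component:

    import org.apache.commons.text.similarity.JaroWinklerSimilarity;

    public class JaroWinklerPreview {
        public static void main(String[] args) {
            JaroWinklerSimilarity jw = new JaroWinklerSimilarity();
            // Identical strings score 1.0; near-identical strings score close to 1.0.
            System.out.println(jw.apply("John Doe", "John Doe")); // 1.0
            System.out.println(jw.apply("John Doe", "Jon Doe"));  // high, close to 1.0
            System.out.println(jw.apply("John Doe", "Smith"));    // low
        }
    }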

Fingerprint key: matches entries
after applying the following sequential process (see the sketch after
this list):

  1. Remove leading and trailing
    whitespace.
  2. Change all characters to their
    lowercase representation. It does not support Chinese characters.
  3. Remove all punctuation and
    control characters.
  4. Split the string into
    whitespace-separated tokens.
  5. Sort the tokens and remove
    duplicates.
  6. Join the tokens back together.
    Because the string parts are sorted, the given order of tokens
    does not matter. So, Cruise,
    Tom
    and Tom
    Cruise
    both end up with a fingerprint cruise tom and therefore end up in the
    same cluster.
  7. Normalize extended western
    characters to their ASCII representation, for example gödel to godel. This reproduces
    data entry mistakes performed when entering extended characters
    with an ASCII-only keyboard. However, this procedure can also
    lead to false positives, for example gödel and godél would both end up with godel as their
    fingerprint but they are likely to be different names. So this
    might work less effectively for datasets where extended
    characters play a substantial differentiation role.
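A minimal Java sketch of this fingerprinting sequence (an illustration of
the steps above, not the component's internal code; the accent
normalization relies on java.text.Normalizer):

    import java.text.Normalizer;
    import java.util.Arrays;
    import java.util.TreeSet;

    public class FingerprintKey {

        static String fingerprint(String value) {
            String s = value.trim();                                  // 1. remove leading/trailing whitespace
            s = s.toLowerCase();                                      // 2. lowercase
            s = s.replaceAll("\\p{Punct}|\\p{Cntrl}", "");            // 3. remove punctuation and control characters
            String[] tokens = s.split("\\s+");                        // 4. split into whitespace-separated tokens
            TreeSet<String> sorted = new TreeSet<>(Arrays.asList(tokens)); // 5. sort tokens and remove duplicates
            String joined = String.join(" ", sorted);                 // 6. join the tokens back together
            return Normalizer.normalize(joined, Normalizer.Form.NFD)
                    .replaceAll("\\p{M}", "");                        // 7. strip accents, e.g. gödel -> godel
        }

        public static void main(String[] args) {
            System.out.println(fingerprint("Cruise, Tom"));  // cruise tom
            System.out.println(fingerprint(" Tom Cruise ")); // cruise tom
            System.out.println(fingerprint("Gödel"));        // godel
        }
    }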

q-grams: matches processed entries by
dividing strings into letter blocks of length q in order to
create a number of q length grams. The matching result is
given as the number of q-gram matches over possible q-grams.
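As an illustration, the sketch below builds letter blocks of length q = 2
and scores two strings by the share of q-grams they have in common; the
exact formula used by the component may differ, so this only approximates
the idea:

    import java.util.HashSet;
    import java.util.Set;

    public class QGrams {

        static Set<String> qgrams(String s, int q) {
            Set<String> grams = new HashSet<>();
            for (int i = 0; i + q <= s.length(); i++) {
                grams.add(s.substring(i, i + q));
            }
            return grams;
        }

        // Ratio of shared q-grams over the number of distinct q-grams in both strings.
        static double similarity(String a, String b, int q) {
            Set<String> ga = qgrams(a, q);
            Set<String> gb = qgrams(b, q);
            Set<String> union = new HashSet<>(ga);
            union.addAll(gb);
            if (union.isEmpty()) return 1.0;
            Set<String> common = new HashSet<>(ga);
            common.retainAll(gb);
            return (double) common.size() / union.size();
        }

        public static void main(String[] args) {
            // "john" -> {jo, oh, hn}; "jon" -> {jo, on}
            System.out.println(similarity("john", "jon", 2)); // 0.25 (1 shared gram out of 4 distinct)
        }
    }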

Hamming: calculates the minimum
number of substitutions required to transform one string into another
string having the same length. For example, the Hamming distance between
“masking” and “pairing” is 3.
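A quick sketch that reproduces the example above (the two strings must have
the same length):

    public class HammingDistance {

        static int distance(String a, String b) {
            if (a.length() != b.length()) {
                throw new IllegalArgumentException("Strings must have the same length");
            }
            int d = 0;
            for (int i = 0; i < a.length(); i++) {
                if (a.charAt(i) != b.charAt(i)) d++; // count positions that differ
            }
            return d;
        }

        public static void main(String[] args) {
            // "masking" and "pairing" differ at positions 0, 2 and 3.
            System.out.println(distance("masking", "pairing")); // 3
        }
    }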

custom…: enables you to load an
external matching algorithm from a Java library using the Custom Matcher column.

For further information about how to load an external Java library,
see tLibraryLoad.

For further information about how to create a custom matching
algorithm, see Creating a custom matching algorithm.

For a related scenario about how to use a custom matching algorithm,
see Using a custom matching algorithm to match entries.

Custom Matcher

When you select Custom as the
matching type, enter the path pointing to the custom class (external
matching algorithm) you need to use. This path is defined by yourself in
the library file (.jar file) which you can import
by using the tLibraryLoad
component.

For example, to use a MyDistance.class class
stored in the directory org/talend/mydistance in a
user-defined mydistance.jar library, the path to be
entered is org.talend.mydistance.MyDistance.
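As a rough illustration of what such a class could contain, the sketch
below returns a similarity between 0 and 1 for two values. The exact
interface or abstract class a custom matcher must implement depends on your
Studio version and is not shown here; only the package and class name from
the example above are reused, and the method name is hypothetical:

    package org.talend.mydistance;

    // Hypothetical custom distance: exact match on the last 4 characters only.
    // Package the compiled class in mydistance.jar and load it with tLibraryLoad.
    public class MyDistance {

        public double getMatchingWeight(String record, String reference) {
            if (record == null || reference == null) {
                return 0.0;
            }
            String a = record.length() > 4 ? record.substring(record.length() - 4) : record;
            String b = reference.length() > 4 ? reference.substring(reference.length() - 4) : reference;
            return a.equalsIgnoreCase(b) ? 1.0 : 0.0; // 1 = match, 0 = no match
        }
    }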

 

Tokenized measure

Tokenization is the concept of splitting a string into words. Select the
method to use to compute a tokenized measure for the selected
algorithm:

NO: no tokenization method is used on the
string. With this option, “John Doe” and “Jon Doe” should match.

Same place: splits the two strings by words to
two lists, list1 and list2. Associates each element from list1 with the
element which has the same position in list2. Using this method, “She is
red and he is pink” and “Catherine is red and he is pink” should
match.

Same order: splits the two
strings by words to two lists, list1 and list2 and assumes that list1 is
shorter than list2. Tries to associate the elements from list1 with the
elements in list2 taken in the same order. Using this method, “John Doe”
and “John B. Doe” match.

This method should be used only with strings that contain only a few
words; otherwise, the number of possible combinations can be large.

Any order: splits the two strings by words to two
lists, list1 and list2 and assumes that list1 is shorter than list2.
Tries to assign each word of list1 to a word of list2, in order to have
the highest global similarity (with respect to the used similarity).

Using this method, “John Doe” and “Doe John” match.

 

Threshold

This column is displayed when you
select T-Swoosh as the matching algorithm.

Two data records match when the probability is greater than or equal to
the set value.

Set a threshold between 0 and 1. You can enter up to 6 decimals.

0 means that the
similarity between values in the column is not measured. 1 means that the two
compared values of the column must match exactly. The default value is 1.

 

Confidence Weight

Set a numerical weight for each attribute (column) of the key
definition. The values can be any number greater than or equal
to 0.

Handle Null

To handle null values, select from the list the null operator you want
to use on the column:

Null Match Null: a Null attribute
only matches another Null attribute.

Null Match None: a Null attribute
never matches any other attribute.

Null Match All: a Null attribute
matches any other value of an attribute.

For example, suppose we have two columns, name and
firstname, where the name is never null but the
first name can be null.

If we have two records:

“Doe”, “John”

“Doe”, “”

Depending on the operator you choose, these two records may or may not
match:

Null Match Null: they do not
match.

Null Match None: they do not
match.

Null Match All: they match.

And for the records:

“Doe”, “”

“Doe”, “”

Null Match Null: they match.

Null Match None: they do not
match.

Null Match All: they match.

 

Survivorship Function (only available when the
T-Swoosh algorithm is selected): Select how
two similar records will be merged from the drop-down list.

  • Concatenate: this function adds the content
    of the first record and the content of the second record together;
    for example, Bill and William will be merged into BillWilliam. In
    the Parameter field, you can specify a
    separator to be used to separate values.
  • Prefer True (for booleans): This function
    always sets booleans to True in the merged record, unless all
    booleans in the source records are False.
  • Prefer False (for booleans): This function
    always sets booleans to False in the merged record, unless all
    booleans in the source records are True.
  • Most common: This function validates the most
    frequently-occurring field value in each group of duplicate
    records.
  • Most
    recent
    : This function validates the latest date value in
    each group of duplicate records. If more than one date type is
    defined in the schema, select a column in Reference
    column
    . If no date type is defined in the schema,
    the data are sorted into most recent loading order.
  • Most
    ancient
    : This function validates the earliest date
    value in each group of duplicate records. If more than one date type
    is defined in the schema, select a column in Reference
    column
    . If no date type is defined in the schema,
    the data are sorted into most ancient loading order.
  • Longest (for strings): This function
    validates the longest field value in each group of duplicate
    records.
  • Shortest (for strings): This function
    validates the shortest field value in each group of duplicate
    records.
  • Largest (for numbers): This function
    validates the largest numerical value in each group of duplicate
    records.
  • Smallest (for numbers): This function
    validates the smallest numerical value in each group of duplicate
    records.
  • Most trusted source: This function takes the
    data coming from the source which has been defined as being most
    trustworthy. The most trusted data source is set in the
    Parameter field. This function is only
    used in the context of integrated matching in Talend MDM.
  Reference column

If you set
Survivorship Function to Most
recent
or Most ancient, this
column is used to select the reference column.

  Parameter

If you set Survivorship
Function
to Most trusted source,
this item is used to set the name of the data source you want to use as
a base for the master record.

If you set Survivorship
Function
to Concatenate, this
item is used to specify a separator you want to use for concatenating
data.

Match Threshold

Enter the match probability. Two data
records match when the probability is greater than or equal to the set
value.

You can enter a different match threshold for each match rule.

Survivorship Rules For Columns (only available
when the T-Swoosh algorithm is selected)

Input Column: Select the column(s) from the input
flow on which you want to apply a survivorship function.

Survivorship Function: Select how two similar
records will be merged from the drop-down list.

Default Survivorship Rules

(only
available when the T-Swoosh algorithm is
selected)

Data Type: Select the data type(s) from the input
flow on which you want to apply a survivorship function.

Survivorship Function: Select how two similar
records will be merged from the drop-down list.

Blocking Selection

Input Column

If required, select the column(s) from the input flow according to
which you want to partition the processed data in blocks; this is
usually referred to as “blocking”.

Blocking reduces the number of pairs of records that need to be
examined. In blocking, input data is partitioned into exhaustive blocks
designed to increase the proportion of matches observed while decreasing
the number of pairs to compare. Comparisons are restricted to record
pairs within each block.

Using blocking column(s) is very useful when you are processing very
large data sets.
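The idea behind blocking can be sketched as follows, assuming a simple
record type with a postal code used as the blocking column; only records
that share the same block value are ever paired:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class BlockingSketch {

        record Customer(String name, String postalCode) {}

        public static void main(String[] args) {
            List<Customer> input = List.of(
                    new Customer("John Doe", "44000"),
                    new Customer("Jon Doe", "44000"),
                    new Customer("Donna Lewis", "75001"));

            // Partition the input into blocks keyed by the blocking column.
            Map<String, List<Customer>> blocks = new HashMap<>();
            for (Customer c : input) {
                blocks.computeIfAbsent(c.postalCode(), k -> new ArrayList<>()).add(c);
            }

            // Comparisons are restricted to record pairs within each block.
            for (List<Customer> block : blocks.values()) {
                for (int i = 0; i < block.size(); i++) {
                    for (int j = i + 1; j < block.size(); j++) {
                        System.out.println("compare " + block.get(i).name()
                                + " with " + block.get(j).name());
                    }
                }
            }
        }
    }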

Advanced settings

Store on disk

Select the Store on disk check box if
you want to store processed data blocks on the disk to maximize system
performance.

Max buffer size: Type in the size of
physical memory you want to allocate to processed data.

Temporary data directory path: Set
the location where the temporary file should be stored.

Multiple output

Select the Separate output
check box to have several output flows:

  • Uniques: when the group size is equal to 1
    (the record did not match any other record), the record
    is listed in this flow.

    When records
    are not unique, they can be:

  • Matches: when the
    group quality is greater than or equal to the threshold
    you define in the Confident match
    threshold
    field, the record is listed
    in this flow.
  • Suspects: when the group quality
    is less than the threshold you define in the Confident match
    threshold
    field, the record is listed in
    this flow.
Note:

When using the Simple VSR
algorithm, the group quality is the minimal distance
computed in the record.

When using the T-Swoosh algorithm, the
group quality is the minimal distance computed among all record
pairs of the group.

Confident match
threshold
: set a numerical value between the
current Match
threshold
and 1. Above this threshold, you can be
confident in the quality of the group.

Multi-pass

Select this check box to enable a tMatchGroup
component to receive data sets from another tMatchGroup that precedes it in the Job.
This will refine the groups received by each of the tMatchGroup components through creating data
partitions based on different blocking keys.

Note: When using two tMatchGroup
components in a Job with this option, you must select this check box in
both tMatchGroup components before linking them
together. If you linked the components before selecting this check box,
select the check box first in the second component in the Job flow and
then in the first component. Otherwise, the output schema contains two
columns with the same name, and selecting this check box in only one
tMatchGroup component may cause schema mismatch issues.

For an example Job, see Matching customer data through multiple passes.

With multi-pass matching, all master records are generated but
intermediate master records are removed from the output flow. Only final
master and original records are kept at the end.

Propagate original values: This
option is available only with the T-Swoosh algorithm. Select this check box to allow the
original records from each pass (and not only the unmatched records) to
also be considered in the second pass of matching, both against each
other and against the survived masters. This helps to make sure that no
matches are missed.

Sort the output data by GID

Select this check box to group the output data by the group
identifier.

The output is sorted in ascending alphanumeric order by group
identifier.

Output distance details

Select this check box to add an output column
MATCHING_DISTANCES in the schema of the
component. This column provides the distance between the input and
master records in each group.

Note: When using two tMatchGroup
components in a Job with this option, you must select this check box in
both tMatchGroup components before linking them
together. If you linked the components before selecting this check box,
select the check box first in the second component in the Job flow and
then in the first component. Otherwise, the output schema contains two
columns with the same name, and selecting this check box in only one
tMatchGroup component may cause schema mismatch issues.

Display detailed labels

Select this check box to have in the output
MATCHING_DISTANCES column not only the matching
distance but also the names of the columns used as key attributes in the
applied rule.

For example, if you try to match on first name and last name fields,
lname and fname, the
output would be fname:1.0|lname:0.97 when the check box
is selected and 1.0|0.97 when it is not selected.

Deactivate matching computation when opening the
wizard

Select this check box to open the Configuration Wizard without running
the match rules defined in the wizard.

This enables you to have a better experience with the component.
Otherwise, the wizard may take some time to open.

tStatCatcher Statistics

Select this check box to collect log data at the component level. Note
that this check box is not available in the Map/Reduce version of the
component.

Global Variables

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl +
Space
to access the variable list and choose the variable to use from it.

For further information about variables, see the Talend Studio User Guide.

Usage

Usage rule

This component is an intermediary step. It requires an input flow as
well as an output flow.

Configuration wizard

The configuration wizard enables you to create different configurations
(production environments) and their match rules.

You can also use
the configuration wizard to import match rules created and tested in the studio and use them
in your match Jobs. For further information, see Importing match rules from the studio repository.

You cannot open the configuration wizard unless you link the input component to the
tMatchGroup component.

To open the configuration wizard:

  1. In the studio workspace, design your Job and link the components together, for
    example as below:

    tMatchGroup_2.png

  2. Double-click tMatchGroup; or right-click
    it and from the contextual menu select Configuration
    Wizard
    ; or click Preview in the
    basic settings view of
    tMatchGroup.

    tMatchGroup_3.png

  3. In the popup that opens, click Skip Computation
    if you want to open the Configuration Wizard without running the match rules defined
    in it.
The configuration wizard is composed of three areas:

  • the Configuration view, where you can set the
    match rules and the blocking column(s).

  • the matching chart, which presents the graphic matching result,

  • the matching table, which presents the details of the matching result.

The Limit field at the upper-left corner indicates the
maximum number of rows to be processed by the match rule(s) in the wizard. The default
maximum row number is 1000.

Configuration view

From this view, you can edit the configuration of the tMatchGroup component or define different configurations in which to execute
the Job.

You can use these different configurations for testing purposes for example, but
you can only save one configuration from the wizard, the open configuration.

In each configuration, you can define the parameters to generate match rules with the VSR
or the T-Swoosh algorithm. The settings of the Configuration view differ slightly depending on whether you select Simple VSR or T-Swoosh in the
basic settings of the tMatchGroup component.

You can define survivorship rules, blocking key(s) and multiple conditions using several
match rules. You can also set different match intervals for each rule. The match results on
multiple conditions will list data records that meet any of the defined rules. When a
configuration has multiple conditions, the Job conducts an OR match operation. It evaluates
data records against the first rule and the records that match are not evaluated against the
other rules.

The parameters required to edit or create a match rule are:

  • The Key definition parameters.

  • The Match Threshold field.

  • A blocking key in the Blocking Selection
    table (available only for rules with the VSR algorithm).

    Defining a blocking key is not mandatory but advisable as it partitions data
    in blocks to reduce the number of records that need to be examined. For further
    information about the blocking key, see Importing match rules from the studio repository.

  • The Survivorship Rules for Columns parameters
    (available only for rules with the T-Swoosh algorithm).

  • The Default Survivorship Rules parameters for data
    types (available only for rules with the T-Swoosh algorithm).

  1. In the basic settings of the tMatchGroup
    component, select Simple VSR from the Matching Algorithm list.

    It is important to have the same type of the matching algorithm selected in the
    basic settings of the component and defined in the configuration wizard. Otherwise
    the Job runs with default values for the parameters which are not compatible between
    the two algorithms.
  2. In the basic settings of the tMatchGroup
    component, click Preview to open the configuration
    wizard.
  3. Click the [+] button on the top right corner of
    the Configuration view.

    This creates, in a new tab, an exact copy of the last configuration.
    tMatchGroup_4.png

  4. Edit or set the parameters for the new configuration in the Key definition and Blocking
    Selection
    tables.
  5. If needed, define several match rules for the open configuration as
    the following:

    1. Click the [+] button on the match rule bar to
      create an exact copy of the last rule in a new tab.
    2. Set the parameters for the new rule in the Key
      definition
      table and define its match interval.
    3. Follow the steps above to create as many match rules for a
      configuration as needed. You can define a different match interval for
      each rule.
    When a configuration has multiple conditions, the Job conducts an OR
    match operation. It evaluates data records against the first rule and the
    records that match are not evaluated against the second rule and so
    on.
  6. Click the Chart button at the top right corner of
    the wizard to execute the Job in the open configuration.

    The matching results are displayed in the matching chart and table.
    Follow the steps above to create as many new configurations in the wizard as
    needed.
  7. To execute the Job in a specific configuration, open the configuration in the
    wizard and click the Chart button.

    The matching results are displayed in the matching chart and table.
  8. At the bottom right corner of the wizard, click either:

    • OK to save the open configuration.

      You can save only one configuration in the wizard.

    • Cancel to close the wizard and keep the
      configuration saved initially in the wizard.

For an example of a match rule with the T-Swoosh algorithm, see Using survivorship functions to merge two records and create a master record.

Matching chart

From the matching chart, you can have a global picture about the duplicates in the
analyzed data.

The Hide groups less than parameter, which is set to
2 by default, enables you to decide what groups to show in the result
chart. Usually you want to hide groups of small group size.

tMatchGroup_5.png
For example, the above matching chart indicates that:

  • 48 items are analyzed and classified into 18 groups according to a given match
    rule and after excluding items that are unique, by setting the Hide groups less than parameter to 2.

  • 11 groups have 2 items each. In each group, the 2 items are duplicates of each
    other.

  • 3 groups have 3 items each. In each group, these items are duplicates of one
    another.

  • 3 groups have 4 items each. In each group, these items are duplicates of one
    another.

  • One single group has 5 duplicate items.

Matching table

From the matching table, you can read details about the different duplicates.

tMatchGroup_6.png

This table indicates the matching details of items in each group and colors the groups in
accordance with their color in the matching chart.

You can decide what groups to show in this table by setting the Hide
groups of less than
parameter. This parameter enables you to hide groups of small
group size. It is set to 2 by default.

The buttons under the table help you to navigate back and forth through pages.

Importing match rules from the studio repository

From the tMatchGroup configuration wizard, you can import match keys from the
match rules created and tested in the
Profiling
perspective. You can then use these imported matching keys in your match
Jobs.

The tMatchGroup component enables you to import from the
Studio repository match rules based on the VSR or the T-Swoosh algorithms.

The VSR algorithm takes a set of records as input and groups similar encountered
duplicates together according to defined match rules. It compares pairs of records and
assigns them to groups. The first processed record of each group is the master record of the
group. The VSR algorithm compares each record with the master of each group and uses the
computed distances, from master records, to decide to what group the record should go.

The T-Swoosh algorithm enables you to find duplicates and to define how
two similar records are merged to create a master record, using a survivorship function.
These new merged records are used to find new duplicates. The difference with the VSR
algorithm is that the master record is in general a new record that does not exist in
the list of input records.

  1. From the configuration wizard, click the

    tMatchGroup_1.png

    icon on the top right corner.

    The Match Rule Selector wizard opens listing
    all match rules created in the studio and saved in the repository.
    tMatchGroup_8.png

  2. Select the match rule you want to import into the tMatchGroup component and use on your data.

    A warning message displays in the wizard if the match rule you want to import is
    defined on columns that do not exist in the input schema of tMatchGroup. You can define input columns later in the configuration
    wizard.
    It is important to have the same type of the matching algorithm selected in the
    basic settings of the component and imported from the configuration wizard.
    Otherwise the Job runs with default values for the parameters which are not
    compatible between the two algorithms.
  3. Select the Overwrite current Match Rule in the
    analysis
    check box if you want to replace the rule in the
    configuration wizard with the rule you import.

    If you leave the box unselected, the match keys will be imported in a new match
    rule tab without overwriting the current match rule in the wizard.
  4. Click OK.

    The matching key is imported from the match rule and listed as a new rule in the
    configuration wizard.
    tMatchGroup_9.png

  5. Click in the Input Key Attribute and select from
    the input data the column on which you want to apply the matching key.
  6. In the Match threshold
    field, enter the match probability threshold. Two data records match when the
    computed match score is greater than or equal to this value.
  7. In the Blocking Selection table, select the
    column(s) from the input flow which you want to use as a blocking key.

    Defining a blocking key is not mandatory but advisable. Using a blocking key
    partitions data in blocks and so reduces the number of records that need to be
    examined, as comparisons are restricted to record pairs within each block. Using
    blocking key(s) is very useful when you are processing large data sets.
    The Blocking Selection table in the component is
    different from the Generation of Blocking Key table
    in the match rule editor in the
    Profiling
    perspective.
    The blocking column in tMatchGroup could come
    from a tGenKey component (and would be called
    T_GEN_KEY) or directly from the input schema (it could be a
    ZIP column, for instance). The Generation of Blocking Key table in the match rule editor, on the other hand, defines
    the parameters necessary to generate a blocking key; this table is equivalent to the
    tGenKey component and generates a blocking column
    BLOCK_KEY used for blocking.
  8. Click the Chart button in the top right corner of
    the wizard to execute the Job using the imported match rule and show the matching
    results in the wizard.

Using tMatchGroup with the Simple VSR Matcher and T-Swoosh algorithms

Matching two records

You can use the tMatchGroup component to detect duplicates and
define how to merge similar records to create a master record.

Creating a master record is an iterative process: each new master record can be used to
find new duplicates.

You can choose between two different algorithms to create master records:

  • Simple VSR Matcher
  • T-Swoosh

The main difference between the two algorithms is that T-Swoosh creates, for each master
record, a new record that does not exist in the list of input records.

Match rules

You can define different match rules for the same dataset.

Two records match if at least one of the match rules is satisfied. As soon as two records
match according to a given rule, the other rules are not checked.

Matching several records

Blocking

To avoid doing a two-by-two comparison of all the input records, you can define one
or many blocking keys to split the input dataset into smaller datasets called
blocks.

Within a block, the blocking keys have the same value. Each block is then processed
independently.

Using blocking keys reduces the time needed by the Simple VSR Matcher and the T-Swoosh
algorithms to process data. For example, if 100,000 records are split into 100 blocks of
1,000 records each, the number of comparisons is reduced by a factor of 100. This means
the algorithm runs around 100 times faster.

It is recommended to use the tGenKey component to generate blocking
keys and to visualize the statistics regarding the number of blocks. In a Job,
right-click the tGenKey component and select View Key
Profile
in the contextual menu to visualize the distribution of the
number of blocks according to their size.

tMatchGroup_10.png

In this example, the average block size is around 40.

For the 13 blocks with 38 rows, there are 18,772 comparisons in these 13 blocks
(13 × 38²). If records are compared on four columns, this means there will be 75,088
string comparisons in these 13 blocks (18,772 × 4).

The Simple VSR Matcher algorithm

The Simple VSR Matcher algorithm compares each record within the same block with the
previous master records in the lookup table.

If a record does not match any of the previous master records, it is considered as a new
master record and added to the lookup table. This means that the first record of the
dataset is necessarily a master record. So, the order of the records is important and
can have an impact on the creation process of the master records.

When a record matches a master record, the Simple VSR Matcher algorithm does not
attempt to match it with other master records, because the master records in the lookup
table are not similar to one another. So, once a record matches a master record, the
chance of it matching another master record is low.

This means a record can only exist in one group of records and be linked to one master
record.
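A simplified sketch of this grouping loop, using a toy “same surname” test
in place of the component's weighted key matching, reproduces the grouping
of the example that follows:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class SimpleVsrSketch {

        // Toy similarity: records match when they share the same last token (surname).
        static boolean matches(String record, String master) {
            String[] a = record.split("\\s+");
            String[] b = master.split("\\s+");
            return a[a.length - 1].equalsIgnoreCase(b[b.length - 1]);
        }

        public static void main(String[] args) {
            List<String> input = List.of("John Doe", "Donna Lewis", "John B. Doe", "Louis Armstrong");

            List<String> masters = new ArrayList<>();          // lookup table of master records
            Map<String, List<String>> groups = new HashMap<>();

            for (String record : input) {
                String matchedMaster = null;
                for (String master : masters) {
                    if (matches(record, master)) {             // compare only with master records
                        matchedMaster = master;
                        break;                                 // stop at the first matching master
                    }
                }
                if (matchedMaster == null) {
                    masters.add(record);                       // first record of a new group is its master
                    groups.put(record, new ArrayList<>(List.of(record)));
                } else {
                    groups.get(matchedMaster).add(record);     // attach the record to the master's group
                }
            }
            System.out.println(groups);
            // {John Doe=[John Doe, John B. Doe], Donna Lewis=[Donna Lewis], Louis Armstrong=[Louis Armstrong]}
        }
    }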

For example, take the following set of records as input:

id fullName
1 John Doe
2 Donna Lewis
3 John B. Doe
4 Louis Armstrong

The algorithm processes the input records as follows:

  1. The algorithm takes record 1 and compares it with an empty set of records. Since
    record 1 does not match any record, it is added to the lookup table.
  2. The algorithm takes record 2 and compares it with record 1. Since it is not a match,
    record 2 is added to the lookup table.
  3. The algorithm takes record 3 and compares it with record 1 and record 2. Record 3
    matches record 1. So, record 3 is added to the group of record 1.
  4. The algorithm takes record 4 and compares it with record 1 and record 2 but not with
    record 3, which is not a master record. Since it is not a match, record 4 is added
    to the lookup table.

The output will look like this:

id fullName GID GRP_SIZE MASTER SCORE GRP_QUALITY
1 John Doe 0 2 true 1.0 0.72
3 John B. Doe 0 0 false 0.72 0
2 Donna Lewis 1 1 true 1.0 1.0
4 Louis Armstrong 2 1 true 1.0 1.0

The T-Swoosh algorithm

The T-Swoosh algorithm is based on the same idea as the Simple VSR Matcher algorithm,
but it creates a master record instead of considering existing records to be master
records.

The order of the input records does not impact the matching process.

To create master records, you can design survivorship rules to decide which attribute
will survive.

There are two types of survivorship rules:

  • The rules related to matching keys: each attribute used as a matching key can have a
    specific survivorship rule.
  • The default rules: they are applied to all the attributes of the same data type
    (Boolean, String, Date, Number).

If a column is a matching key, the rule related to matching keys specific to this column
is applied.

If the column is not a matching key, the default survivorship rule for this data type is
applied. If the default survivorship rule is not defined for the data type, the
Most common survivorship function is used.

tMatchGroup_11.png

Each time two records are merged to create a new master record, this new master record is
added to the queue of records to be examined. The two records that are merged are
removed from the lookup table.
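A simplified sketch of this queue-and-lookup loop; matching and merging are
reduced to toy placeholders (a “same surname” test and plain concatenation)
so that the structure of the iteration stands out:

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    public class TSwooshSketch {

        // Toy match: same last token (surname).
        static boolean matches(String a, String b) {
            String[] ta = a.split("\\s+");
            String[] tb = b.split("\\s+");
            return ta[ta.length - 1].equalsIgnoreCase(tb[tb.length - 1]);
        }

        // Toy survivorship: Concatenate with ", " as the separator.
        static String merge(String a, String b) {
            return a + ", " + b;
        }

        public static void main(String[] args) {
            Deque<String> queue = new ArrayDeque<>(
                    List.of("John Doe", "Donna Lewis", "John B. Doe", "Johnnie B. Doe"));
            List<String> lookup = new ArrayList<>();   // current master records

            while (!queue.isEmpty()) {
                String record = queue.poll();
                String matched = null;
                for (String master : lookup) {
                    if (matches(record, master)) {
                        matched = master;
                        break;
                    }
                }
                if (matched == null) {
                    lookup.add(record);                // no match: the record becomes a master
                } else {
                    lookup.remove(matched);            // the merged records leave the lookup table
                    queue.add(merge(matched, record)); // the new master goes back to the queue
                }
            }
            System.out.println(lookup);
            // [Donna Lewis, John Doe, John B. Doe, Johnnie B. Doe]
        }
    }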

For example, take the following set of records as input:

id fullName
1 John Doe
2 Donna Lewis
3 John B. Doe
4 Johnnie B. Doe

The survivorship rule uses the Concatenate function with
“,” as a parameter to separate values.

At the beginning of the process, the queue contains all the input records and the lookup
is empty. To process the input records, the algorithm iterates until the queue is
empty:

  1. The algorithm takes record 1 and compares it with an empty set of records. Since
    record 1 does not match any record, it is added to the set of master records. The
    queue contains now record 2, record 3 and record 4. The lookup contains record
    1.
  2. The algorithm takes record 2 and compares it with record 1. Since record 2 does not
    match any record, it is added to the set of master records. The queue contains now
    record 3 and record 4. The lookup contains record 1 and record 2.
  3. The algorithm takes record 3 and compares it with record 1. Record 3 matches record
    1. So, record 1 and record 3 are merged to create a new master record called record
    1,3. The queue contains now record 4 and record 1,3. The lookup contains record
    2.
  4. The algorithm takes record 4 and compares it with record 2. Since it is not a match,
    record 4 is added to the set of master records. The queue contains now record 1,3.
    The lookup table contains record 2 and record 4.
  5. The algorithm takes record 1,3 and compares it with record 2 and record 4. Record
    1,3 matches record 4. So, record 1,3 and record 4 are merged to create a new master
    record called record 1,3,4. Record 4 is removed from the lookup table. Since record
    1,3 was the result of a previous merge, it is removed from the table. The queue now
    contains record 1,3,4. The lookup contains record 2.
  6. The algorithm takes record 1,3,4 and compares it with record 2. Since it is not a
    match, record 1,3,4 is added to the set of master records. The queue is now empty.
    The lookup contains records 1,3,4 and record 2.

The output will look like this:

id fullName GID GRP_SIZE MASTER SCORE GRP_QUALITY
1,3,4 John Doe, John B. Doe, Johnnie B. Doe 0 3 true 1.0 0.449
1 John Doe 0 0 false 0.72 0
3 John B. Doe 0 0 false 0.72 0
4 Johnnie B. Doe 0 0 false 0.78 0
2 Donna Lewis 1 1 true 1.0 1.0

As you can see in this example, the value in the GRP_QUALITY column can be less than the
Match Threshold parameter. That is
because a group is created from record pairs with a matching score greater than
or equal to the Match Threshold but the
records are not all compared to each other; whereas GRP_QUALITY takes into account all record pairs in the
group.

The differences between the Simple VSR Matcher and the T-Swoosh algorithms

When processing the input data using the T-Swoosh algorithm, there
may be more iterations than the number of input records because a merged record may be
created on each iteration and added to the queue.

This is one of the main differences
between the Simple VSR Matcher and the T-Swoosh algorithms.

When comparing a record with a master record, the T-Swoosh algorithm makes more
comparisons per iteration than the Simple VSR matcher algorithm:

  • When using the Simple VSR matcher algorithm, the record from the queue is only
    compared with the value of the master record. There is no comparison between the
    record from the queue and the value of each of the records used to build this
    master record. Therefore, sort the input records so that the most trustworthy records
    appear first in the input data.
  • When using the T-Swoosh algorithm, the record from the queue is compared with
    the value of the master record and the value of each of the records used to
    build this master record, until records are considered a match.

    You can find an example of how to survive
    master records using the T-Swoosh algorithm in Talend Help Center (https://help.talend.com).

    In this example, the record “John Doe, John B. Doe” is compared
    with the record “Johnnie B. Doe” on iteration 5. There is a match if at least
    one of the three strings “John Doe, John B. Doe”, “John Doe” and “John B.
    Doe” matches the string “Johnnie B. Doe”.

Multi-pass matching

You can design a Job with consecutive tMatchGroup components to
create data partitions based on different blocking keys.

For example, you want to find duplicates having either the same city or the same zip code
in a customer database. In this particular case, you can use two consecutive
tMatchGroup to process the data partitions:

  • One tMatchGroup component in which the column “city” is defined
    as a blocking key.
  • One tMatchGroup component in which the column “ZipCode” is
    defined as a blocking key.

What is multi-pass matching?

The idea behind multi-pass matching is to reuse the master records defined in the
previous pass as the input of the current tMatchGroup component.
Multi-pass matching is more effective if the blocking keys are only weakly
correlated. For example, it is not relevant to define the column “country” as a
blocking key and the column “city” as another blocking key because all the
comparisons made with the blocking key “city” will also be done with blocking key
“country”.

When using multi-pass matching with the Simple VSR matcher algorithm, only master
records of size 1 – records that did not match any record – are compared with master
records of any size. There are no comparisons between two master records that are
derived from at least two children each.

An example of multi-pass matching

In the following example, the dataset contains four records. It is assumed that the
first tMatchGroup component has a blocking key on the column
“ZipCode”, and the second tMatchGroup component has a blocking
key on the column “city”. The attribute “name” is used as a matching key.

id name city ZipCode
1 John Doe Nantes 44000
2 John B. Doe Nantes  
3 Jon Doe Nantes 44000
4 John Doe Nantes  

After the first pass, records 1 and 3 are grouped, and records 2 and 4 are grouped.
In these groups, record 1 and record 2 are master records.

In the second tMatchGroup, only the master records from the first
pass, record 1 and record 2, are candidates for comparison. Since their group sizes
are both strictly greater than 1, they are not compared with each other. Therefore,
the order in which the input records are sorted is very important.

The following results are returned:

id name city ZipCode GID GRP_SIZE MASTER SCORE GRP_QUALITY
1 John Doe Nantes 44000 0 2 true 1.0 0.875
3 Jon Doe Nantes 44000 0 0 false 0.85 0
2 John B. Doe Nantes   1 2 true 1.0 0.727
4 John Doe Nantes   1 0 false 0.72 0

When running the T-Swoosh algorithm with the same parameters and the Most
common
survivorship function, the following results are
returned:

id name city ZipCode GID GRP_SIZE MASTER SCORE GRP_QUALITY
1 John Doe Nantes 44000 0 4 true 1.0 0.727
1 John Doe Nantes 44000 0 0 true 0.875 0
3 Jon Doe Nantes 44000 0 0 false 0.875 0
2 John B. Doe Nantes   0 0 true 0.72 0
4 John Doe Nantes   1 0 false 0.72 0

Grouping output data in separate flows according to the minimal distance
computed in each record

To create this Job, you can use different
algorithms
in the tMatchGroup component:

For another scenario that groups the output records in one single output flow, see
Comparing columns and grouping in the output flow duplicate records that have the same functional key.

Simple VSR algorithm

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

This scenario describes a basic Job that compares columns in the
input file using the Jaro-Winkler matching method on the lname and fname columns
and the q-grams matching method
on the address1 column. It then groups the
output records in output flows:

  • Uniques: lists the
    records whose group size is equal to 1 (records that did not match
    any other record).

  • Matches: lists the
    records whose group score (the minimal distance computed in the record) is
    greater than or equal to the threshold you define in the Confident match threshold
    field.

  • Suspects: lists
    the records whose group score (the minimal distance computed in the record)
    is less than the threshold you define in the Confident match threshold field.

tMatchGroup_2.png

Setting up the Job

  1. Drop the following components from the Palette onto the design workspace: tFileInputExcel, tMatchGroup and three tLogRow components.
  2. Connect tFileInputExcel to tMatchGroup using the Main
    link.
  3. Click the tMatchGroup component and select the Separate output check box in the Advanced settings view.
  4. Connect tMatchGroup to
    the three tLogRow components using the
    Uniques, Matches and Suspects
    links.

Configuring the input component

The main input file contains eight columns: account_num,
lname, fname,
mi, address1,
city, state_province and
postal_code. The data in this input file has problems
such as duplication, names spelled differently or wrongly, and different
information for the same customer.

tMatchGroup_13.png

You can create the input file used in this scenario by executing the
c0 and c1 Jobs included in the
data quality demo project, TDQEEDEMOJAVA, which you can import from the login window
of your Talend Studio. For
further information, see the Talend Studio User Guide.

  1. In the Basic settings view of tFileInputExcel, fill in the File Name field by browsing to the input file
    and set other properties in case they are not stored in the Repository.

    tMatchGroup_14.png

  2. Create the schema through the Edit Schema
    button, if the schema is not already stored in the Repository. Remember to set the data type in the Type column.

    tMatchGroup_15.png

Configuring the tMatchGroup component

  1. Double-click tMatchGroup to display the
    Basic settings view and define the
    component properties.
  2. From the Matching Algorithm list, select
    Simple VSR.

    In this scenario, the match rule is based on the VSR algorithm.
  3. Click Sync columns to retrieve the schema
    from the preceding component.
  4. Click the Edit schema button to view the
    input and output schema and do any modifications in the output schema, if
    necessary.

    tMatchGroup_16.png

    In the output schema of this component there are a few standard output columns
    that are read-only. For more information, see tMatchGroup Standard properties.
  5. Click OK to close the dialog box.
  6. Click the […] button next to
    Configure match rules to open the configuration
    wizard and define the component configuration and the match rule(s).

    tMatchGroup_3.png

    You can use the configuration wizard to import match rules
    created and tested in the studio and stored in the repository, and use them in
    your match Jobs. For further information, see Importing match rules from the studio repository.
    It is important to import or define the same type of the rule
    selected in the basic settings of the component, otherwise the Job runs with
    default values for the parameters which are not compatible between the two
    algorithms.
  7. Define the first match rule as the following:

    • In the Key definition table, click
      the [+] button to add to the list the
      column(s) on which you want to do the matching operation,
      lname and fname.

      Note: When you select a date column on which to apply an algorithm or a matching
      algorithm, you can decide what to compare in the date format.

      For example,
      if you want to only compare the year in the date, in the component schema set the type
      of the date column to Date and then enter “yyyy” in the Date Pattern field.
      The component then converts the date format to a string according to the pattern defined
      in the schema before starting a string comparison.

    • Click in the Matching Function column
      and select from the list Jaro-Winkler
      as the method to be used for the matching operation.

      If you select custom as a matching
      type, you must set in the Custom Matcher
      column the path pointing to the custom class (external
      matching algorithm) you need to use. This path is defined by yourself in
      the library file (.jar file).

    • From the Tokenized measure list,
      select NO so that no tokenized measure is used for the selected
      algorithms.

    • Click in the cell of the Weight
      column to set the numerical weights for the two columns used as key
      attributes.

    • Click in the cell of the Handle Null
      column and select the null operator you want to use to handle null
      attributes in the columns. In this example, select Null Match None in order to have matching results where
      null values have minimal effect.

    • Set the match probability in the Match
      Interval
      field.

  8. Follow the same procedure in the above step to define the second match rule.

    Set the address1 column as an input attribute and select
    Jaro as the matching type. Select Null Match None as the null operator. And finally set
    the match probability which can be different from the one set for the first
    rule.
  9. Set the Hide groups of less than parameter in
    order to decide what groups to show in the result chart and matching table. This
    parameter enables you to hide groups of small group size.
  10. Click the Advanced settings tab and set the
    advanced parameters for the tMatchGroup
    component as the following:

    tMatchGroup_18.png

    • Select the Separate output check box.

      The component will have three separate output flows: Unique rows, Confident
      groups
      and Uncertain
      groups
      .

      If this check box is not selected, the tMatchGroup component will have only one output flow
      where it groups all output data. For an example scenario, see Comparing columns and grouping in the output flow duplicate records that have the same functional key.

    • Select the Sort the output data by
      GID
      check box to sort the output data by their group
      identifier.

    • Select the Output distance details
      and Display detailed labels check
      boxes.

      The component will output the MATCHING_DISTANCES
      column. This column provides the distance between the input and the
      master records, also giving the names of the columns against which the
      records are matched.

    • Select the Deactivate matching computation when
      opening the wizard
      check box if you do not want to run
      the match rules the next time you open the wizard.

  11. Click the Chart button in the wizard to
    execute the Job in the defined configuration and have the matching results
    directly in the wizard.

    tMatchGroup_19.png

    The matching chart gives a global picture about the duplicates in the analyzed
    data. The matching table indicates the details of items in each group and colors
    the groups in accordance with their color in the matching chart.
    The Job conducts an OR match operation on the
    records. It evaluates the records against the first rule and the records that
    match are not evaluated against the second rule. The MATCHING_DISTANCES column allows you to
    understand which rule has been used on what records. Some records are matched
    according to the second rule that uses address1 as a key attribute, whereas the other records in the group are
    matched according to the first rule which uses the lname and fname
    as key attributes.
    You can set the Hide groups of less than
    parameter in order to decide what groups to show in the matching chart and
    table.

Finalizing the Job and executing it

  1. Double-click each of the tLogRow
    components to display the Basic
    settings
    view and define the component properties.
  2. Save your Job and press F6 to execute
    it.

You can see that records are grouped in three different output flows.
Each record is listed in one of the three flows according to the value of
the group score, which is the minimal distance computed in the group.

The identifier for each group, which is of String
data type, is listed in the GID column next to the
corresponding record. This identifier will be of the data type
Long for Jobs that are migrated from older
releases. To have the group identifier as String, you
must replace the tMatchGroup component in
the imported Job with tMatchGroup from the
studio Palette.

The number of records in each of the three output blocks is listed
in the GRP_SIZE column and computed only on the master record.
The MASTER column indicates with true or false whether the
corresponding record is a master record or not. The SCORE
column lists the calculated distance between the input record and the master record
according to the Jaro-Winkler and Jaro matching algorithms.

The Job evaluates the records against the first rule and the records that
match are not evaluated against the second rule.

All records with a group score between the match
interval, 0.95 or 0.85 depending on the applied rule, and the confidence
threshold defined in the advanced settings of tMatchGroup are listed in the Suspects output flow.

tMatchGroup_20.png

All records with a group score greater than or
equal to one of the match probabilities are listed in the Matches output flow.

tMatchGroup_21.png

All records with a group size equal to 1 are listed in the
Uniques output flow.

tMatchGroup_22.png

For another scenario that groups the output records in one single output flow
based on a generated functional key, see Comparing columns and grouping in the output flow duplicate records that have the same functional key.

T-Swoosh algorithm

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

This scenario describes a basic Job that compares columns in the
input file using the Jaro-Winkler matching method on the lname and fname columns.
It then groups the output records in output flows:

  • Uniques: lists the
    records whose group size is equal to 1 (records that did not match
    any other record).

  • Matches: lists the
    records whose group quality is greater than or equal to the threshold
    you define in the Confident
    match threshold
    field.

  • Suspects: lists
    the records whose group quality is less than the threshold you define in
    the Confident match
    threshold
    field.

The group quality is the minimal distance computed among all
record pairs of the group.

tMatchGroup_23.png

Setting up the Job

  1. Drop the following components from the Palette onto the design workspace: tFixedFlowInput, tMatchGroup and three tLogRow components.
  2. Connect tFixedFlowInput
    to tMatchGroup using the Main link.
  3. Click the tMatchGroup component and select the Separate output check box in the Advanced settings view.
  4. Connect tMatchGroup to
    the three tLogRow components using the
    Uniques, Matches and Suspects
    links.

Configuring the input component

The input data contain seven columns: lname, fname, middle_name, address, city, state_province
and postal_code. The data have problems such as duplication,
names spelled differently or incorrectly, and different information for the same customer.

  1. Double-click the tFixedFlowInput
    component to display its Basic settings view.
  2. Select Built-in and click the […]
    button next to Edit Schema.
  3. Define the seven columns and click OK.
  4. Select Use Inline Content(delimited file).
  5. Fill in the Row Separator and Field
    Separator
    fields.
  6. Enter the input data in the Content field.

Configuring the tMatchGroup component

  1. Click the tMatchGroup
    component to display its Basic settings
    view.
  2. From the Matching
    Algorithm
    list, select t-Swoosh.
  3. Click Sync columns to retrieve the schema
    from the preceding component.
  4. Click the Edit schema button to view the
    input and output schema and do any modifications in the output schema, if
    necessary.

    In the output schema of this component, there are standard output columns
    that are read-only. For more information, see tMatchGroup Standard properties.
  5. Click OK to close the dialog box.
  6. Click the Advanced
    settings
    tab and select the following check boxes:

    1. Separate output

      The component will have three separate output flows:
      Uniques,
      Matches
      and Suspects.

      If this check box is not selected, the tMatchGroup component will have only
      one output flow where it groups all output data. For an example
      scenario, see Comparing columns and grouping in the output flow duplicate records that have the same functional key.

    2. Sort the output data by
      GID
    3. Output distance details
    4. Display detailed labels
    5. Deactivate matching
      computation when opening the wizard
  7. Click the […] button next to Configure match
    rules
    to define the component configuration and the match
    rule(s).

    You can use the configuration wizard to import
    match rules created and tested in the studio and stored in the repository, and
    use them in your match Jobs. For further information, see Importing match rules from the studio repository.
    It is important to import or define a rule of the same
    type as the algorithm selected in the basic settings of the component. Otherwise, the
    Job runs with default values for the parameters, which are not compatible between
    the two algorithms.
  8. Define the match rule as follows:

    1. In the Key definition table,
      click the [+] button to add to
      the list the column(s) on which you want to do the matching operation,
      fname and lname.

      Note: When you select a date column on which to apply an algorithm or a matching
      algorithm, you can decide what to compare in the date format.

      For example,
      if you want to only compare the year in the date, in the component schema set the type
      of the date column to Date and then enter “yyyy” in the Date Pattern field.
      The component then converts the date format to a string according to the pattern defined
      in the schema before starting a string comparison.

    2. Click in the Matching Function
      column and select from the list Jaro-Winkler as the method to be used for the matching
      operation.

      If you select custom
      as the matching type, you must set in the Custom
      Matcher
      column the path pointing to the custom class
      (external matching algorithm) you want to use. This path is the one
      you define in the library (.jar) file.

    3. From the Tokenized measure list,
      select No.
    4. Click in the cell of the Threshold column and
      enter 0.7 for fname and 0.4 for lname.
    5. Click in the cell of the Confidence
      Weight
      column to set the numerical weights for the two
      columns used as key attributes: 1 for fname and 4 for lname.
    6. Click in the cell of the Handle Null column and select the null
      operator you want to use to handle null attributes in the columns. In
      this example, select Null
      Match NONE
      in order to have matching results where null
      values have minimal effect.
    7. Select Most common in the
      Survivorship Function column.
  9. Follow the same procedure as in the above step to define the second match rule and
    set the parameters as follows:

    1. Click the [+] button (Duplicate
      Rule
      ).
    2. Input Key Attribute: address
    3. Matching Function:
      Jaro
    4. Tokenized Measure:
      No
    5. Threshold: 0.8
    6. Confidence Weight: 1
    7. Handle Null: Null Match
      NONE
    8. Survivorship Function:
      Most common
  10. Set the Match Threshold
    parameter of each Match Rule to 0.8.
  11. Set the Hide groups of less than parameter to 2. This
    parameter enables you to hide groups of small size.
  12. Click the Chart button
    to execute the Job in the defined configuration and have the matching results
    directly in the wizard.

    tMatchGroup_24.png

    The matching chart gives a global picture about the duplicates
    in the analyzed data. The matching table indicates the details of items in each
    group and colors the groups in accordance with their color in the matching
    chart.
    The Job conducts an OR match operation on the records: it
    evaluates the records against the first rule, and the records that match are not
    evaluated against the second rule. The MATCHING_DISTANCES column allows you to
    understand which rule has been used on which records.

    For example, in the
    second data group (brick red), the last Amic record is matched according to the second rule that uses address1 as a key attribute,
    whereas the other records in the group are matched according to the first
    rule which uses the lname and fname as key
    attributes.

    As you can see in this example, the value in the GRP_QUALITY column can be less than the
    Match Threshold parameter. That is
    because a group is created from record pairs with a matching score greater than
    or equal to the Match Threshold but the
    records are not all compared to each other; whereas GRP_QUALITY takes into account all record pairs in the
    group.
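    The following minimal Java sketch illustrates this point with hypothetical
    pairwise scores (not taken from the scenario data): each record joins the group
    through at least one pair at or above the Match Threshold, yet GRP_QUALITY,
    the minimum over all pairs, can still fall below that threshold.

      public class GroupQualityExample {
          public static void main(String[] args) {
              // Hypothetical pairwise matching scores for three records A, B and C.
              // A-B and B-C reach the 0.8 Match Threshold, so the three records end
              // up in the same group even though A and C never matched directly.
              double scoreAB = 0.85, scoreBC = 0.85, scoreAC = 0.70;
              double matchThreshold = 0.8;

              // GRP_QUALITY (T-Swoosh): minimal matching value among all record pairs.
              double grpQuality = Math.min(scoreAB, Math.min(scoreBC, scoreAC));

              System.out.println("GRP_QUALITY = " + grpQuality
                      + " (below the Match Threshold " + matchThreshold + ")");
          }
      }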

Finalizing the Job and executing it

  1. Double-click each tLogRow
    component to display the Basic
    settings
    view.
  2. Select Table (print values in cells of a table).
  3. Save your Job and press F6 to execute
    it.

You can see that records are grouped together in three different groups.
Each record is listed in one of the three groups according to the value of
the group score which is the minimal distance computed in the group.

The identifier for each group, which is of String
data type, is listed in the GID column next to the
corresponding record. This identifier is of the data type Long for Jobs that are migrated from older releases. To have the group
identifier as String, replace the tMatchGroup component in the imported Job with
tMatchGroup from the studio
Palette.

The number of records in each of the three output blocks
is listed in the GRP_SIZE column and
computed only on the master record. The MASTER column indicates with true or false whether the corresponding
record is a master record or not. The SCORE column lists the calculated distance between the input record and
the master record according to the Jaro-Winkler and Jaro matching algorithms.

The Job evaluates the records against the first rule and the records that
match are not evaluated against the second rule.

All records with a group score between the match interval,
0.95 or 0.85 depending on the applied rule, and the confidence
threshold defined in the advanced settings of tMatchGroup are listed in the Suspects output flow.

tMatchGroup_25.png

All records with a group score greater than or equal to
one of the match probabilities are listed in the Matches output flow.

tMatchGroup_26.png

All records with a group size equal to 1 are listed in the
Uniques output
flow.

tMatchGroup_27.png

For another scenario that groups the output records in one single output flow
based on a generated functional key, see Comparing columns and grouping in the output flow duplicate records that have the same functional key.

Matching customer data through multiple passes

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

The Job in this scenario groups similar customer records by running
through two subsequent matching passes (tMatchGroup
components) and outputs the calculated matches in groups. Each pass provides its matches
to the pass that follows in order for the latter to add more matches identified with new
rules and blocking keys.

tMatchGroup_28.png

In this Job:

  • The tMysqlInput component
    connects to the customer records to be processed.

  • Each of the tGenKey
    components defines a way to partition data records. The first key partitions
    data into many groups and the second key creates fewer groups that overlap the
    previous blocks, depending on the blocking key definition.

  • The tMap component renames
    the key generated by the second tGenKey
    component.

  • The first tMatchGroup
    processes the partitions defined by the first tGenKey, and the second tMatchGroup processes those defined by the second tGenKey.

    Warning:

    The two tMatchGroup
    components must have the same schema.

  • The tLogRow component
    presents the matching results after the two passes.

Setting up the Job

In this scenario, the main input schema is already stored in the Repository. For more information about storing schema
metadata in the repository, see the
Talend Studio User
Guide
.

  1. In the Repository tree view, expand
    Metadata > DB Connections
    where you have stored the main input schema and
    drop the database table onto the design workspace. The input table used in
    this scenario is called customer.

    A dialog box is displayed with a list of components.
  2. Select the relevant database component, tMysqlInput in this example, and then click OK.
  3. Drop two tGenKey components, two
    tMatchGroup components, a tMap component and a tLogRow component from the Palette onto the design workspace.
  4. Link the input component to the tGenKey
    and tMap components using Main links.
  5. In the two tMatchGroup components, select the
    Output distance details check boxes in the
    Advanced settings view of both components
    before linking them together.

    This will provide the MATCHING_DISTANCES column in the
    output schema of each tMatchGroup.
    If the two tMatchGroup components are already
    linked to each other, you must select the Output distance
    details
    check box in the second component in the Job flow first;
    otherwise, you may have an issue.
  6. Link the two tMatchGroup components and
    the tLogRow component using Main links.
  7. If needed, give the components specific labels to reflect their usage in
    the Job.

    For further information about how to label a component, see
    Talend Studio

    User Guide.

Configuring input data and key generation

Connecting to the input data

  1. Double-click tMysqlInput to open its
    Component view.

    tMatchGroup_29.png

    The property fields for tMysqlInput are
    automatically filled in. If you do not define your input schema locally in
    the repository, fill in the details manually after selecting Built-in in the Schema and Property Type
    lists.
    The input table used in this scenario is called
    customer.
  2. Modify the query in the Query box to
    select only the columns you want to match:
    account_name, lname,
    fname, mi,
    address1, city,
    state_province and
    postal_code.

Configuring the key generation for the first pass

  1. Double-click the first tGenKey to open
    the Component view.

    tMatchGroup_30.png

  2. Click

    tMatchGroup_1.png

    and import blocking keys from match rules created and
    tested in the
    Profiling
    perspective
    of
    Talend Studio
    and use them in your Job. Otherwise,
    define the blocking key parameters as described in the below steps.

  3. Under the Algorithm table, click the
    [+] button to add two rows in the
    table.
  4. In the column column, click the newly added row and
    select from the list the column you want to process using an algorithm. In this
    example, select lname.
  5. Do the same on the second row to select
    postal_code.
  6. In the pre-algorithm column, click the newly
    added row and select from the list the pre-algorithm you want to apply to the
    corresponding column.

    In this example, select remove diacritical marks and
    convert to upper case
    to remove any diacritical marks and convert
    the fields of the lname column to upper
    case.
    This conversion does not change your raw data.
  7. In the algorithm column, click the newly
    added row and select from the list the algorithm you want to apply to the
    corresponding column. In this example, select N first
    characters of each word
    .

    If you select the Show help check box, you
    can display instructions on how to set algorithms/options parameters.
  8. Do the same on the second row of the algorithm column to select first N
    characters of the string
    .
  9. Click in the Value column next to the
    algorithm column and enter the value for
    the selected algorithm, when needed.

    In this scenario, enter 1 for both rows. The first letter
    of each field in the corresponding columns will be used to generate the
    key.
    Make sure to set a value for any algorithm which needs one; otherwise, you may
    get a compilation error when you run the Job.
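    As an illustration only (not the code generated by the Studio), the following
    Java sketch mimics the key generation configured above: remove diacritical marks
    and convert lname to upper case, keep the first character of each word of
    lname and the first character of postal_code, and concatenate the results into
    the blocking key.

      import java.text.Normalizer;

      public class GenKeySketch {
          // Pre-algorithm: remove diacritical marks and convert to upper case.
          static String removeDiacriticsUpperCase(String value) {
              String decomposed = Normalizer.normalize(value, Normalizer.Form.NFD);
              return decomposed.replaceAll("\\p{M}", "").toUpperCase();
          }

          // Algorithm "N first characters of each word", with the value set to 1.
          static String firstNCharsOfEachWord(String value, int n) {
              StringBuilder key = new StringBuilder();
              for (String word : value.trim().split("\\s+")) {
                  key.append(word, 0, Math.min(n, word.length()));
              }
              return key.toString();
          }

          // Algorithm "first N characters of the string", with the value set to 1.
          static String firstNChars(String value, int n) {
              return value.substring(0, Math.min(n, value.length()));
          }

          public static void main(String[] args) {
              String lname = "Müller Brown";   // hypothetical values
              String postalCode = "75012";
              String key = firstNCharsOfEachWord(removeDiacriticsUpperCase(lname), 1)
                      + firstNChars(postalCode, 1);
              System.out.println("T_GEN_KEY = " + key);   // MB7
          }
      }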

Configuring the key generation for the second pass

  1. Double-click the second tGenKey to open
    the Component view.

    tMatchGroup_32.png

  2. In the Algorithm table, define the column
    you want to use to partition data, account_num in this
    component. Select the first N characters of the
    string
    algorithm and set the value to 1
    in the Value column.

    Each of the two tGenKey components will
    generate a read-only T_GEN_KEY column in
    the output schema. You must rename one of the T_GEN_KEY columns to stop them from overwriting each
    other.
  3. Double-click the tMap component to open
    its editor.

    tMatchGroup_33.png

  4. In the Schema editor, copy the columns
    from the first table onto the second table and rename T_GEN_KEY to T_GEN_KEY1, for example.
  5. In the top part of the editor, drop all columns from the input table to
    the output table.
  6. Click OK to save the data transformation and
    close the editor.
  7. In the tGenKey basic settings, click the

    tMatchGroup_34.png

    button to verify that the two generated keys are named
    differently in the output schema.

    tMatchGroup_35.png

Configuring the two matching passes

Configuring the first pass

  1. In the basic settings of the tMatchGroup
    labelled pass1, select Simple VSR from the Matching
    Algorithm
    list.

    In this scenario, the match rule is based on the VSR algorithm.
  2. Click the Preview button to display the
    Configuration Wizard.

    tMatchGroup_36.png

  3. Click

    tMatchGroup_1.png

    and import matching keys from the match rules created and
    tested in the
    Profiling
    perspective of

    Talend Studio
    and use them in your Job. Otherwise, define
    the matching key parameters as described in the below steps.

    It is important to import or define a rule of the same type as the algorithm selected in the
    basic settings of the component; otherwise, the Job runs with default values for
    the parameters, which are not compatible between the two algorithms.
  4. In the Key definition table, click the
    [+] button to add the column(s) on which
    you want to do the matching operation, lname
    in this scenario.

    Note: When you select a date column on which to apply an algorithm or a matching
    algorithm, you can decide what to compare in the date format.

    For example,
    if you want to only compare the year in the date, in the component schema set the type
    of the date column to Date and then enter “yyyy” in the Date Pattern field.
    The component then converts the date format to a string according to the pattern defined
    in the schema before starting a string comparison.

  5. Select the Jaro-Winkler algorithm in the Matching Function column.
  6. From the Tokenized measure list, select
    Any order.
  7. Set Weight to 1 and in
    the Handle Null column, select the null
    operator you want to use to handle null attributes in the column, Null Match Null in this scenario.
  8. Click the [+] button below the Blocking Selection table to add one row in the table
    then click in the line and select from the list the column you want to use as a
    blocking value, T_GEN_KEY in this example.

    Using a blocking value reduces the number of pairs of records that need to be
    examined. The input data is partitioned into exhaustive blocks based on the
    functional key. This will decrease the number of pairs to compare, as comparison
    is restricted to record pairs within each block.
  9. If required, click Edit schema to open the
    schema editor and see the schema retrieved from the previous component in the
    Job.

    tMatchGroup_38.png

  10. Click the Advanced settings tab and select
    the Sort the output data by GID check box to
    arrange the output data by their group IDs.
  11. Select the Deactivate matching computation when opening
    the wizard
    check box if you do not want to run the match rules
    the next time you open the wizard.

Configuring the second pass

  1. In the basic settings of the tMatchGroup
    labelled pass2, select Simple VSR from the Matching
    Algorithm
    list.

    In this scenario, the match rule is based on the VSR algorithm.
  2. Click the Preview button to display the
    Configuration Wizard.

    If this component does not have the same schema as the preceding component, a
    warning icon appears. If so, click the Sync
    columns
    button to retrieve the schema from the preceding component;
    once done, the warning icon disappears.
  3. In the Key Definition table, click
    the [+] button to add the column(s) on
    which you want to do the matching operation, lname in this scenario.

    Note: When you select a date column on which to apply an algorithm or a matching
    algorithm, you can decide what to compare in the date format.

    For example,
    if you want to only compare the year in the date, in the component schema set the type
    of the date column to Date and then enter “yyyy” in the Date Pattern field.
    The component then converts the date format to a string according to the pattern defined
    in the schema before starting a string comparison.

  4. Select the Jaro-Winkler algorithm in the Matching Function column.
  5. Set Weight to 1 and in
    the Handle Null column, select the null
    operator you want to use to handle null attributes in the column, Null Match Null in this scenario.
  6. Click the [+] button below the Blocking Selection table to add one row in the table
    then click in the line and select from the list the column you want to use as a
    blocking value, T_GEN_KEY1 in this example.
  7. Click the Advanced settings tab and select
    the Multi-pass check box. This option enables
    tMatchGroup to receive data sets from the
    tMatchGroup that precedes it in the Job.
  8. In the Advanced settings view, select the
    Sort the output data by GID check box to
    arrange the output data by their group IDs.
  9. Select the Deactivate matching computation when opening
    the wizard
    check box if you do not want to run the match rules
    the next time you open the wizard.

Executing the Job and showing the results on the console

In order to show the match groups created after the first pass and compare
them with the groups created after the second pass, you must modify the Job as
follows:

  • tMatchGroup_39.png
  • Use a tReplicate component to
    replicate the input flow you want to process as shown in the above
    figure. Use a copy/paste operation to create the two parts of the
    Job.

  • Keep only the first pass (tMatchGroup component) in the upper part of the Job
    and show the match results in a tLogRow component.

  • Use two passes in the lower part of the Job and show the final
    match results in a tLogRow
    component.

  1. Double-click each of the tLogRow
    components to open the Component view and
    in the Mode area, select the Table (print values in cells of a table)
    option.
  2. Save your Job and press F6 to execute
    it.

    The results after the first pass read as follows:
    tMatchGroup_40.png

    The results after the second pass read as follows:
    tMatchGroup_41.png

    When you compare, for example, the customer name Wilson from the results of the two passes, you will find
    that more customers using the last name Wilson are grouped together after the second pass.

Using survivorship functions to merge two records and create a master
record

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

This scenario describes a basic Job that uses the T-Swoosh algorithm to find duplicates in
customer names. It compares values in the first_name column using
matching and survivorship methods, groups similar records together and creates single
representations of entities in each group by merging two similar records according to a
survivorship rule.

These master records are new records that do not exist in the input data.

tMatchGroup_42.png

You can use several matching passes with the T-Swoosh algorithm in a Job. Each pass
provides the survived masters of each group, together with the original records, to the pass that follows.
However, intermediate master records are removed from the output flow. Only final master and
original records are kept at the end.

In this Job:

  • The tFileInputDelimited component provides the
    customer records to be processed.

  • The tMatchGroup component processes data using
    matching and survivorship methods.

  • The tLogRow component presents the matching and
    survivorship results.

Setting up the Job

In this scenario, the main input schema is already stored in the Repository. For more information about storing schema metadata in the
repository, see the
Talend Studio User Guide
.

  1. In the Repository tree view, expand Metadata > File
    delimited
    where you have stored the main input schema and drop
    the file connection onto the design workspace. The file connection used in this
    scenario is called Customer_File.

    A dialog box is displayed with a list of components.
  2. Select tFileInputDelimited and click
    OK.
  3. Drop a tMatchGroup and a tLogRow from the Palette onto the design workspace.
  4. Link the components together using the Main
    links.

Configuring the input component

The main input file contains the columns: id,
first_name and age. The data in this input
file has problems such as duplication, names spelled differently or incorrectly, and different
information for the same customer.

  1. Double-click tFileInputDelimited to open its
    Component view.

    tMatchGroup_43.png

    The property fields are automatically filled in. If you do not define your
    input schema locally in the repository, fill in the details manually after
    selecting Built-in in the Schema and Property Type
    lists.
  2. Create the schema through the Edit Schema
    button, if the schema is not already stored in the Repository. Remember to set the data type in the Type column.

    tMatchGroup_44.png

Creating the match rule with a matching key and a survivorship function

Setting the schema and selecting the matching algorithm

  1. Double-click tMatchGroup to display the
    Basic settings view and define the
    component properties.
  2. From the Matching Algorithm list, select
    T-Swoosh.

    In this scenario, the match rule is based on the T-Swoosh algorithm.
  3. Click Sync columns to retrieve the schema
    from the preceding component.
  4. Click the Edit schema button to view the
    input and output schema and do any modifications in the output schema, if
    necessary.

    tMatchGroup_45.png

    In the output schema of this component, there are a few standard output columns
    that are read-only. For more information, see tMatchGroup Standard properties.
  5. In the Advanced settings view, leave the
    parameters selected by default.

Defining the match rule

  1. In the tMatchGroup basic settings, click
    Preview to open the configuration wizard
    and define the matching key and the survivorship function.

    tMatchGroup_46.png

    You can use the configuration wizard to import match rules created and tested
    in the studio and stored in the repository, and use them in your match Jobs. For
    further information, see Importing match rules from the studio repository.
    It is important that the matching algorithm selected in
    the basic settings of the component and the rule defined in the configuration wizard
    are of the same type. Otherwise, the Job runs with default values for the parameters,
    which are not compatible between the two algorithms.
  2. Define the match rule as follows:

    • In the Key definition table, click
      the [+] button to add a line in the
      table. Click in the Input Key Attribute
      column and select the column on which you want to do the matching
      operation, first_name in this scenario.

    • Click in the Matching Function column
      and select Soundex from the list. This
      method matches processed entries according to a standard English
      phonetic algorithm which indexes strings by sound, as pronounced in
      English.

    • From the Tokenized measure list,
      select No so that no tokenized distance is used for the selected
      algorithm.

    • Set the Threshold to 0.8 and the Confidence Weight to 1.

    • Select Null Match None in the
      Handle Null column in order to have
      matching results where null values have minimal effect.

    • Select Most common
      in the Survivorship Function
      column. This method validates the most frequent name value in each group
      of duplicates.

  3. Define the default survivorship rule as follows:

    • In the Default Survivorship
      Rules
      table, click the [+] button to add a line in the table. Click in the
      Data Type column and select
      Number.

    • Click in the Survivorship
      Function
      column and select Largest (for numbers) from the list. This method
      validates the largest numerical value in each group.

  4. Set the Hide groups of less than parameter in
    order to decide which groups to show in the result chart and matching table. This
    parameter enables you to hide groups of small size.
  5. Click the Chart button in the wizard to
    execute the Job in the defined configuration and have the results directly in
    the wizard.

    tMatchGroup_47.png

    The matching chart gives a global picture about the duplicates in the analyzed
    data. The matching table indicates the details of the items in each group,
    colors the groups in accordance with their color in the matching chart and
    indicates with true the records which are
    master records. The master record in each group is the result of merging two
    similar records according to the phonetic algorithm and survivorship rule. The
    master record is a new record that does not exist in the input data.
  6. Click OK to close the wizard.
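The following Java sketch is a simplified illustration, not the Studio's merge logic,
of how the survivorship functions chosen above could derive the master values of a
group: Most common keeps the most frequent first_name and Largest (for numbers) keeps
the largest age.

    import java.util.*;

    public class SurvivorshipSketch {
        public static void main(String[] args) {
            // Hypothetical group of duplicate records: first_name and age values.
            String[] names = {"Brian", "Bryan", "Brian"};
            int[] ages = {32, 33, 31};

            // Survivorship function "Most common" applied to first_name.
            Map<String, Integer> counts = new HashMap<>();
            for (String name : names) {
                counts.merge(name, 1, Integer::sum);
            }
            String survivedName = Collections.max(
                    counts.entrySet(), Map.Entry.comparingByValue()).getKey();

            // Default survivorship rule "Largest (for numbers)" applied to age.
            int survivedAge = Arrays.stream(ages).max().getAsInt();

            // The master record is a new record built from the survived values.
            System.out.println("master = " + survivedName + ", " + survivedAge);
        }
    }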

Finalizing the Job and executing it

  1. Double-click the tLogRow component to display
    its Basic settings view and define the
    component properties.
  2. Save your Job and press F6 to execute
    it.

    tMatchGroup_48.png

    Similar records are grouped together. The number of records in each group is
    listed in the GRP_SIZE column and computed only on the
    master record. The MASTER column indicates with true or false whether
    the corresponding record is a master record or not. The master
    record is created by merging the record which best matches the phonetic
    algorithms with the record which has the largest age value. The
    SCORE column lists the calculated similarity between
    the input record and the master record according to the Soundex matching algorithm.
    The identifier for each group, which is of String data
    type, is listed in the GID column next to the corresponding
    record. This identifier will be of the data type Long for
    Jobs that are migrated from older releases. To have the group identifier as
    String, you must replace the tMatchGroup component in the imported Job with tMatchGroup from the studio Palette.

tMatchGroup MapReduce properties (deprecated)

These properties are used to configure tMatchGroup running in the MapReduce Job framework.

The MapReduce
tMatchGroup component belongs to the Data Quality family.

The component in this framework is available in all Talend Platform products with Big Data and in Talend Data Fabric.

The MapReduce framework is deprecated from Talend 7.3 onwards. Use Talend Jobs for Apache Spark to accomplish your integration tasks.

Basic settings

Schema and Edit
schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

Click Sync
columns
to retrieve the schema from the previous component connected in the
Job.

The output schema of this component contains the following read-only
fields:

GID: provides a group identifier
of the data type String.

Note: In Jobs migrated from previous releases to your current Talend Studio, the group identifier may be of the Long data
type. To have a group identifier of the String data type, replace
the tMatchGroup components in
the migrated Jobs with tMatchGroup components from the Palette.

GRP_SIZE: counts the number of records in the
group, computed only on the master record.

MASTER: identifies, by true or
false, if the record used in the matching comparisons
is a master record. There is only one master record per group.

Each input record will be compared to the master record, if they
match, the input record will be in the group.

SCORE: measures the distance between the input
record and the master record according to the matching algorithm
used.

In case the tMatchGroup component is
used to have multiple output flows, the score in this column decides to
what output group the record should go.

GRP_QUALITY: provides the quality of similarities
in the group by taking the minimal matching value. Only the master
record has a quality score.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Matching Algorithm

Select from the list the algorithm you want to use in the component:
Simple VSR is the only matching
algorithm you can use with the Map/Reduce version of the
component.

If you converted a standard Job using tMatchGroup with the T-Swoosh algorithm to a Map/Reduce Job, select Simple VSR from the list and save the
converted Job before its execution. Otherwise, an error occurs.

Key Definition

Input Key Attribute

Select the column(s) from the input flow on which you want to apply a
matching algorithm.

Note: When you select a date column on which to apply an algorithm or a matching
algorithm, you can decide what to compare in the date format.

For example,
if you want to only compare the year in the date, in the component schema set the type
of the date column to Date and then enter “yyyy” in the Date Pattern field.
The component then converts the date format to a string according to the pattern defined
in the schema before starting a string comparison.
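As a rough illustration of this behavior (assuming the conversion is comparable to
formatting with java.text.SimpleDateFormat; this is not the component's internal code),
two dates from different days of the same year compare as equal once only the year is
kept:

    import java.text.SimpleDateFormat;
    import java.util.Calendar;
    import java.util.Date;
    import java.util.GregorianCalendar;

    public class DatePatternSketch {
        public static void main(String[] args) {
            SimpleDateFormat yearOnly = new SimpleDateFormat("yyyy");

            Date date1 = new GregorianCalendar(1985, Calendar.MARCH, 3).getTime();
            Date date2 = new GregorianCalendar(1985, Calendar.OCTOBER, 21).getTime();

            // With the Date Pattern "yyyy", both dates become the string "1985",
            // so the string comparison sees them as equal.
            String s1 = yearOnly.format(date1);
            String s2 = yearOnly.format(date2);
            System.out.println(s1 + " vs " + s2 + " -> match: " + s1.equals(s2));
        }
    }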

 

Matching Function

Select a matching algorithm from the list:

Exact: matches each processed entry
to all possible reference entries with exactly the same value. It
returns 1 when the two strings exactly match,
otherwise it returns 0.

Exact – ignore case: matches each
processed entry to all possible reference entries with exactly the same
value while ignoring the value case.

Soundex: matches
processed entries according to a standard English phonetic algorithm. It
indexes strings by sound, as pronounced in English, for example “Hello”:
“H400”. It does not support Chinese characters.

Levenshtein (edit distance):
calculates the minimum number of edits (insertion, deletion or
substitution) required to transform one string into another. Using this
algorithm in the tMatchGroup
component, you do not need to specify a maximum distance. The component
automatically calculates a matching percentage based on the distance.
This matching score will be used for the global matching calculation,
based on the weight you assign in the Confidence
Weight
field.

Metaphone: Based
on a phonetic algorithm for indexing entries by their pronunciation. It
first loads the phonetics of all entries of the lookup reference and
checks all entries of the main flow against the entries of the reference
flow. It does not support Chinese characters.

Double Metaphone:
a new version of the Metaphone phonetic algorithm, that produces more
accurate results than the original algorithm. It can return both a
primary and a secondary code for a string. This accounts for some
ambiguous cases as well as for multiple variants of surnames with common
ancestry. It does not support Chinese characters.

Soundex FR:
matches processed entries according to a standard French phonetic
algorithm. It does not support Chinese characters.

Jaro: matches processed entries
according to spelling deviations. It counts the number of matched
characters between two strings. The higher the distance is, the more
similar the strings are.

Jaro-Winkler: a variant of Jaro, but
it gives more importance to the beginning of the string.

Fingerprint key:
matches entries after doing the following sequential process:

  1. Remove leading and trailing
    whitespace.
  2. Change all characters to their
    lowercase representation. It does not support Chinese characters.
  3. Remove all punctuation and
    control characters.
  4. Split the string into
    whitespace-separated tokens.
  5. Sort the tokens and remove
    duplicates.
  6. Join the tokens back together.
    Because the string parts are sorted, the given order of tokens
    does not matter. So, Cruise,
    Tom
    and Tom
    Cruise
    both end up with a fingerprint cruise tom and therefore end up in the
    same cluster.
  7. Normalize extended western
    characters to their ASCII representation, for example gödel to godel. This reproduces
    data entry mistakes made when entering extended characters
    with an ASCII-only keyboard. However, this procedure can also
    lead to false positives, for example gödel and godél would both end up with godel as their
    fingerprint but they are likely to be different names. So this
    might work less effectively for datasets where extended
    characters play a substantial differentiation role.
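The sequence above can be sketched in plain Java as follows (an illustration of the
described steps, not the component's implementation):

    import java.text.Normalizer;
    import java.util.TreeSet;

    public class FingerprintKeySketch {
        static String fingerprint(String value) {
            String s = value.trim().toLowerCase();              // steps 1 and 2
            s = s.replaceAll("\\p{Punct}|\\p{Cntrl}", " ");     // step 3
            TreeSet<String> tokens = new TreeSet<>();           // step 5: sort and deduplicate
            for (String token : s.split("\\s+")) {              // step 4
                if (!token.isEmpty()) {
                    tokens.add(token);
                }
            }
            String joined = String.join(" ", tokens);           // step 6
            String decomposed = Normalizer.normalize(joined, Normalizer.Form.NFD);
            return decomposed.replaceAll("\\p{M}", "");         // step 7: gödel -> godel
        }

        public static void main(String[] args) {
            // "Cruise, Tom" and "Tom Cruise" end up with the same fingerprint.
            System.out.println(fingerprint("Cruise, Tom"));   // cruise tom
            System.out.println(fingerprint("Tom Cruise"));    // cruise tom
        }
    }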

q-grams: matches processed entries by
dividing strings into letter blocks of length q in order to
create a number of q length grams. The matching result is
given as the number of q-gram matches over possible q-grams.

Hamming: calculates the minimum
number of substitutions required to transform one string into another
string having the same length. For example, the Hamming distance between
“masking” and “pairing” is 3.

custom…: enables you to load an
external matching algorithm from a Java library using the custom Matcher column.

For further information about how to load an external Java library,
see tLibraryLoad.

For further information about how to create a custom matching
algorithm, see Creating a custom matching algorithm.

For a related scenario about how to use a custom matching algorithm,
see Using a custom matching algorithm to match entries.

Custom Matcher

When you select Custom as the
matching type, enter the path pointing to the custom class (external
matching algorithm) you need to use. This path is defined by yourself in
the library file (.jar file).

For example, to use a MyDistance.class class
stored in the directory org/talend/mydistance in a
user-defined mydistance.jar library, the path to be
entered is org.talend.mydistance.MyDistance.
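As a rough illustration, such an external class can encapsulate any similarity measure
returning a value between 0 and 1; the exact interface the class must implement is
described in Creating a custom matching algorithm. The hypothetical MyDistance class
below only shows the kind of logic involved:

    package org.talend.mydistance;   // hypothetical package matching the path example above

    // Hypothetical custom measure: ratio of positions holding the same character
    // over the length of the longer string. Returns 1.0 for identical strings
    // and 0.0 when no position matches.
    public class MyDistance {
        public double getSimilarity(String left, String right) {
            if (left == null || right == null) {
                return 0.0;
            }
            int shorter = Math.min(left.length(), right.length());
            int longer = Math.max(left.length(), right.length());
            if (longer == 0) {
                return 1.0;   // two empty strings are considered identical
            }
            int same = 0;
            for (int i = 0; i < shorter; i++) {
                if (left.charAt(i) == right.charAt(i)) {
                    same++;
                }
            }
            return (double) same / longer;
        }
    }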

 

Weight

Set a numerical weight for each attribute (column) of the key
definition. The values can be anything >= 0.
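As an illustration of how the weights influence the result, the sketch below assumes
that the global match score of a record pair is the weight-normalized average of the
per-attribute scores (a simplification of the actual computation):

    public class WeightedScoreSketch {
        // Per-attribute matching scores and the weights set in the key definition.
        static double globalScore(double[] scores, double[] weights) {
            double weighted = 0.0, total = 0.0;
            for (int i = 0; i < scores.length; i++) {
                weighted += scores[i] * weights[i];
                total += weights[i];
            }
            return total == 0.0 ? 0.0 : weighted / total;
        }

        public static void main(String[] args) {
            // Hypothetical scores: fname matches at 0.9, lname at 0.6.
            // With weights 1 and 4, lname dominates the global score.
            double[] scores = {0.9, 0.6};
            double[] weights = {1.0, 4.0};
            System.out.println(globalScore(scores, weights));   // ~0.66
        }
    }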

Handle Null

To handle null values, select from the list the null operator you want
to use on the column:

Null Match Null: a Null attribute
only matches another Null attribute.

Null Match None: a Null attribute
never matches another attribute.

Null Match All: a Null attribute
matches any other value of an attribute.

For example, suppose we have two columns, name and
firstname, where the name is never null but the
first name can be null.

If we have two records:

“Doe”, “John”

“Doe”, “”

Depending on the operator you choose, these two records may or may not
match:

Null Match Null: they do not
match.

Null Match None: they do not
match.

Null Match All: they match.

And for the records:

“Doe”, “”

“Doe”, “”

Null Match Null: they match.

Null Match None: they do not
match.

Null Match All: they match.
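A minimal Java sketch of these three behaviors (an illustration of the rules above,
not the component's code), treating an empty string as a null attribute as in the
examples:

    public class NullOperatorSketch {
        enum HandleNull { NULL_MATCH_NULL, NULL_MATCH_NONE, NULL_MATCH_ALL }

        // Returns true when the two attribute values are considered a match.
        static boolean attributesMatch(String a, String b, HandleNull rule) {
            boolean aNull = a == null || a.isEmpty();
            boolean bNull = b == null || b.isEmpty();
            if (aNull || bNull) {
                switch (rule) {
                    case NULL_MATCH_NULL: return aNull && bNull;
                    case NULL_MATCH_NONE: return false;
                    case NULL_MATCH_ALL:  return true;
                }
            }
            return a.equals(b);   // stand-in for the selected matching function
        }

        public static void main(String[] args) {
            // ("Doe", "John") compared with ("Doe", ""): only the first names differ.
            System.out.println(attributesMatch("John", "", HandleNull.NULL_MATCH_NULL)); // false
            System.out.println(attributesMatch("John", "", HandleNull.NULL_MATCH_ALL));  // true
            // ("Doe", "") compared with ("Doe", "").
            System.out.println(attributesMatch("", "", HandleNull.NULL_MATCH_NULL));     // true
            System.out.println(attributesMatch("", "", HandleNull.NULL_MATCH_NONE));     // false
        }
    }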

Match Threshold

Enter the match probability. Two data records match when the
probability is above the set value.

You can enter a different match threshold for each match rule.

Blocking Selection

Input Column

If required, select the column(s) from the input flow according to
which you want to partition the processed data in blocks; this is
usually referred to as “blocking”.

Blocking reduces the number of pairs of records that need to be
examined. In blocking, input data is partitioned into exhaustive blocks
designed to increase the proportion of matches observed while decreasing
the number of pairs to compare. Comparisons are restricted to record
pairs within each block.

Using blocking column(s) is very useful when you are processing very
large data sets.

Advanced settings

Store on disk

Select the Store on disk check box if
you want to store processed data blocks on the disk to maximize system
performance.

Max buffer size: Type in the size of
physical memory you want to allocate to processed data.

Temporary data directory path: Set
the location where the temporary file should be stored.

Multiple output

Select the Separate output check box
to have three different output flows:

Uniques: when the group score
(minimal distance computed in the record) is equal to
1, the record is listed in this flow.

Matches: when the group score
(minimal distance computed in the record) is higher than the threshold
you define in the Confidence threshold
field, the record is listed in this flow.

Suspects: when the group score
(minimal distance computed in the record) is below the threshold you
define in the Confidence threshold
field, the record is listed in this flow.

Confident match threshold: set a
numerical value between the current Match
threshold
and 1. Above this
threshold, you can be confident in the quality of the group.
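The routing described above can be summarized by the following sketch, which simply
mirrors the three conditions listed for the output flows:

    public class SeparateOutputSketch {
        // Returns the name of the output flow for a record, given the group score
        // and the Confident match threshold set in the advanced settings.
        static String outputFlow(double groupScore, double confidentMatchThreshold) {
            if (groupScore == 1.0) {
                return "Uniques";
            } else if (groupScore >= confidentMatchThreshold) {
                return "Matches";
            } else {
                return "Suspects";
            }
        }

        public static void main(String[] args) {
            double confidentMatchThreshold = 0.9;   // hypothetical value
            System.out.println(outputFlow(1.0, confidentMatchThreshold));    // Uniques
            System.out.println(outputFlow(0.93, confidentMatchThreshold));   // Matches
            System.out.println(outputFlow(0.82, confidentMatchThreshold));   // Suspects
        }
    }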

Multi-pass

Select this check box to enable a
tMatchGroup
component to receive data sets from another tMatchGroup that precedes it in the
Job. This will refine the groups received by each of the tMatchGroup components
through creating data partitions based on different blocking keys.

Note: When using two tMatchGroup
components in a Job and this option, you must select this check box
in both tMatchGroup components before linking
them together. If you linked the components before selecting this
check box, select this check box in the second component in the Job
flow and then, in the first component. Otherwise, you may have an
issue as there are two columns in the output schema with the same
name. Selecting this check box in only one
tMatchGroup component may cause schema
mismatch issues.

For an example Job, see Matching customer data through multiple passes

Sort the output data by GID

Select this check box to group the output data by the group
identifier.

The output is sorted in ascending alphanumeric order by group
identifier.

Output distance details

Select this check box to add an output column
MATCHING_DISTANCES in the schema of the
component. This column provides the distance between the input and
master records in each group.

Note: When using two tMatchGroup
components in a Job and this option, you must select this check box in
both tMatchGroup components before linking them
together. If you linked the components before selecting this check box,
select this check box in the second component in the Job flow and then,
in the first component. Otherwise, you may have an issue as there are
two columns in the output schema with the same name. Selecting this
check box in only one tMatchGroup component may cause
schema mismatch issues.

Display detailed labels

Select this check box to have in the output
MATCHING_DISTANCES column not only the matching
distance but also the names of the columns used as key attributes in the
applied rule.

For example, if you try to match on first name and last name fields,
lname and fname, the
output would be fname:1.0|lname:0.97 when the check box
is selected and 1.0|0.97 when it is not selected.

tStatCatcher Statistics

Select this check box to collect log data at the component level. Note
that this check box is not available in the Map/Reduce version of the
component.

Global Variables

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl +
Space
to access the variable list and choose the variable to use from it.

For further information about variables, see
Talend Studio

User Guide.

Usage

Usage rule

In a
Talend
Map/Reduce Job, this component is used as an intermediate
step and other components used along with it must be Map/Reduce components, too. They
generate native Map/Reduce code that can be executed directly in Hadoop.

You need to use the Hadoop Configuration tab in the
Run view to define the connection to a given Hadoop
distribution for the whole Job.

For further information about a
Talend
Map/Reduce Job, see the sections
describing how to create, convert and configure a
Talend
Map/Reduce Job of the

Talend Open Studio for Big Data Getting Started Guide
.

For a scenario demonstrating a Map/Reduce Job using this component,
see Matching data through multiple passes using Map/Reduce components.

Note that in this documentation, unless otherwise
explicitly stated, a scenario presents only Standard Jobs,
that is to say traditional
Talend
data integration Jobs, and non Map/Reduce Jobs.

Working principle

This component implements the MapReduce model, based on the blocking keys defined in
the Blocking definition table of the Basic settings view.

tMatchGroup_49.png

This implementation proceeds as follows:

  1. Splits the input rows in groups of a given size.

  2. Implements a Map Class that creates a map between each key and a list of
    records.

  3. Shuffles the records to group those with the same key together.

  4. Applies, on each key, the algorithm defined in the Key
    definition
    table of the Basic
    settings
    view.

    Then accordingly, this component reads the records, compares them with the
    master records, groups the similar ones, and designates each of the remaining
    records as a master record.

  5. Outputs the groups of similar records with their group IDs, group sizes,
    matching distances and scores.
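In plain Java, this grouping per blocking key can be sketched as follows (a
simplification of the model above, with an exact comparison standing in for the
algorithm configured in the Key definition table):

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class BlockingGroupingSketch {
        public static void main(String[] args) {
            // Hypothetical records: {blocking key, value to match}.
            String[][] records = {
                {"MB7", "muller"}, {"MB7", "muller"}, {"MB7", "miller"}, {"WD2", "wilson"}
            };

            // Map and shuffle phases: group the records by blocking key.
            Map<String, List<String>> blocks = new LinkedHashMap<>();
            for (String[] record : records) {
                blocks.computeIfAbsent(record[0], k -> new ArrayList<>()).add(record[1]);
            }

            // Within each block, compare each record to the master records found so
            // far; a record that matches no master becomes a new master itself.
            for (Map.Entry<String, List<String>> block : blocks.entrySet()) {
                Map<String, List<String>> groups = new LinkedHashMap<>();
                for (String value : block.getValue()) {
                    // Exact comparison as a stand-in for the configured algorithm:
                    // the group key is the master record of the group.
                    groups.computeIfAbsent(value, master -> new ArrayList<>()).add(value);
                }
                System.out.println("block " + block.getKey() + " -> groups " + groups);
            }
        }
    }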

Configuration wizard in Map/Reduce Jobs

The match results on multiple conditions will list data
records that meet any of the defined rules.

You cannot open
the configuration wizard unless you link an input component to the tMatchGroup component.

From the
configuration wizard in tMatchGroup, you can:

  • define multiple conditions using several match rules to group
    data,

  • set different match intervals for each rule,

  • import match rules created and tested in the studio and stored in
    the repository, and use them in your match Jobs. You can only import rules
    configured with the VSR algorithm. For further information, see Importing match rules from the studio repository.

  • select a blocking key to partition data.

tMatchGroup_50.png

  1. Click the [+] button on the match rule
    bar.
  2. Set the parameters for the new rule in the Key
    definition
    table and define its match interval.
  3. Repeat the above steps to create as many match rules as needed. You can define
    a different match interval for each rule.

    When you define multiple rules, the Job conducts an OR match operation. It
    evaluates data records against the first rule and the records that match are not
    evaluated against the second rule.
  4. In the Blocking Selection table, select the
    column(s) from the input flow which you want to use as a blocking key.

    Defining a blocking key is not mandatory but is very useful when you are
    processing big data sets. A blocking key partitions data in blocks and so
    reduces the number of records that need to be examined. This key can come from a
    tGenKey component (and would be called
    T_GEN_KEY) or directly from the input schema.
  5. At the bottom right corner of the wizard, click either:

    • OK to save the current
      configuration.

    • Cancel to close the wizard and keep
      the configuration saved initially in the wizard.

Matching data through multiple passes using Map/Reduce
components

This scenario applies only to subscription-based Talend Platform products with Big Data and Talend Data Fabric.

Note that
Talend
Map/Reduce components are available only to users who subscribed to Big
Data.

This scenario shows how to create a
Talend
Map/Reduce Job to match data by using Map/Reduce components. It generates
Map/Reduce code and runs right in Hadoop.

tMatchGroup_51.png

The Job in this scenario groups similar customer records by running through
two subsequent matching passes (tMatchGroup
components) and outputs the calculated matches in groups. Each pass provides its matches to
the pass that follows in order for the latter to add more matches identified with new rules
and blocking keys.

This Job is a duplication of the Standard data integration Job described in Matching customer data through multiple passes where standard components are replaced with Map/Reduce components.

You can use
Talend Studio
to automatically convert the standard Job in the previous section to a Map/Reduce Job.
This way, you do not need to redefine the settings of the components in the Job.

Before starting to replicate this scenario, ensure that you have appropriate
rights and permissions to access the Hadoop distribution to be used.

Converting the Job to a Big Data Batch Job

  1. In the Repository tree view, right-click the Job you have created in
    the earlier scenario to open its contextual menu and select Edit properties.

    Then the Edit properties dialog box is displayed. Note that the Job must
    be closed before you are able to make any changes in this dialog box.
    This dialog box looks like the image below:

    tMatchGroup_52.png

    Note that you can change the Job name as well as the other
    descriptive information about the Job from this dialog box.
  2. From the Job Type list, select
    Big Data Batch. Then a Map/Reduce Job
    using the same name appears under the Big Data
    Batch
    sub-node of the Job
    Design
    node.

Setting up the Job

  1. Double-click the new Map/Reduce Job to open it in the workspace.

    The Map/Reduce component Palette is opened.
    tMatchGroup_53.png

  2. Delete tMysqlInput in this scenario as it is
    not a Map/Reduce component and use tRowGenerator in its place. Link it to tGenKey with a Row > Main
    link.
  3. Double-click tRowGenerator to open its
    editor.

    tMatchGroup_54.png

  4. Define the schema you want to use to write data in Hadoop.
  5. Click OK to validate your schema and close
    the editor.
  6. Leave the settings of the other components as you defined initially in the
    standard version of the Job.

Setting up Hadoop connection

  1. Click Run to open its view and then click the
    Hadoop Configuration tab to display its
    view for configuring the Hadoop connection for this Job.
  2. From the Property type list,
    select Built-in. If you have created the
    connection to be used in the Repository,
    select Repository instead; the Studio will then
    reuse that set of connection information for this Job.
  3. In the Version area, select the
    Hadoop distribution to be used and its version.

    • If you use Google Cloud Dataproc, see Google Cloud Dataproc.

    • If you cannot
      find the Cloudera version to be used from this drop-down list, you can add your distribution
      via some dynamic distribution settings in the Studio.

    • If you cannot find from the list the distribution corresponding to
      yours, select Custom so as to connect to a
      Hadoop distribution not officially supported in the Studio. For a
      step-by-step example about how to use this
      Custom option, see Connecting to a custom Hadoop distribution.

  4. In the Name node field, enter the location of
    the master node, the NameNode, of the distribution to be used. For example,
    hdfs://tal-qa113.talend.lan:8020.

    • If you are using a MapR distribution, you can simply leave maprfs:/// as it is in this field; then the MapR
      client will take care of the rest on the fly for creating the connection. The
      MapR client must be properly installed. For further information about how to set
      up a MapR client, see the following link in MapR’s documentation: http://doc.mapr.com/display/MapR/Setting+Up+the+Client

    • If you are using WebHDFS, the location should be
      webhdfs://masternode:portnumber; WebHDFS with SSL is not
      supported yet.

  5. In the Resource Manager field,
    enter the location of the ResourceManager of your distribution. For example,
    tal-qa114.talend.lan:8050.

    • Then you can continue to set the following parameters depending on the
      configuration of the Hadoop cluster to be used (if you leave the check
      box of a parameter clear, then at runtime, the configuration about this
      parameter in the Hadoop cluster to be used will be ignored):

      • Select the Set resourcemanager
        scheduler address
        check box and enter the Scheduler address in
        the field that appears.

      • Select the Set jobhistory
        address
        check box and enter the location of the JobHistory
        server of the Hadoop cluster to be used. This allows the metrics information of
        the current Job to be stored in that JobHistory server.

      • Select the Set staging
        directory
        check box and enter this directory defined in your
        Hadoop cluster for temporary files created by running programs. Typically, this
        directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
        such as yarn-site.xml or mapred-site.xml of your distribution.

      • Select the Use datanode hostname check box to allow the
        Job to access datanodes via their hostnames. This actually sets the dfs.client.use.datanode.hostname
        property to true. When connecting to a
        S3N filesystem, you must select this check box.


  6. If you are accessing the Hadoop cluster running
    with Kerberos security, select the Use Kerberos authentication check box, then enter the Kerberos
    principal name for the NameNode in the field displayed. This enables you to use
    your user name to authenticate against the credentials stored in Kerberos.

    • If this cluster is a MapR cluster of the version 5.0.0 or later, you can set the
      MapR ticket authentication configuration in addition or as an alternative by following
      the explanation in Connecting to a security-enabled MapR.

      Keep in mind that this configuration generates a new MapR security ticket for the username
      defined in the Job in each execution. If you need to reuse an existing ticket issued for the
      same username, leave both the Force MapR ticket
      authentication
      check box and the Use Kerberos
      authentication
      check box clear, and then MapR should be able to automatically
      find that ticket on the fly.

    In addition, since this component performs Map/Reduce computations, you
    also need to authenticate the related services such as the Job history server and
    the Resource manager or Jobtracker depending on your distribution in the
    corresponding field. These principals can be found in the configuration files of
    your distribution. For example, in a CDH4 distribution, the Resource manager
    principal is set in the yarn-site.xml file and the Job history
    principal in the mapred-site.xml file.

    If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
    pairs of Kerberos principals and encrypted keys. You need to enter the principal to
    be used in the Principal field and the access
    path to the keytab file itself in the Keytab
    field. This keytab file must be stored in the machine in which your Job actually
    runs, for example, on a Talend
    Jobserver.

    Note that the user that executes a keytab-enabled Job is not necessarily
    the one a principal designates but must have the right to read the keytab file being
    used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
    situation, ensure that user1 has the right to read the keytab
    file to be used.

  7. In the User name field, enter the login user
    name for your distribution. If you leave it empty, the user name of the machine
    hosting the Studio will be used.
  8. In the Temp folder field, enter the path in
    HDFS to the folder where you store the temporary files generated during
    Map/Reduce computations.

  9. Leave the default value of the Path separator in
    server
    as it is, unless you have changed the separator used by your
    Hadoop distribution’s host machine for its PATH variable or in other words, that
    separator is not a colon (:). In that situation, you must change this value to the
    one you are using in that host.

  10. Leave the Clear temporary folder check box
    selected, unless you want to keep those temporary files.
  11. Leave the Compress intermediate map output to reduce
    network traffic
    check box selected, so as to spend less time
    transferring the mapper task partitions to the multiple reducers.

    However, if the data transfer in the Job is negligible, it is recommended to
    clear this check box to deactivate the compression step, because this
    compression consumes extra CPU resources.
  12. If you need to use custom Hadoop properties, complete the Hadoop properties table with the property or
    properties to be customized. Then at runtime, these changes will override the
    corresponding default properties used by the Studio for its Hadoop
    engine.

    For further information about the properties required by Hadoop, see Apache’s
    Hadoop documentation on http://hadoop.apache.org, or
    the documentation of the Hadoop distribution you need to use.

  13. If the HDFS transparent encryption has been enabled in your cluster, select
    the Setup HDFS encryption configurations check
    box and in the HDFS encryption key provider field
    that is displayed, enter the location of the KMS proxy.

    For further information about the HDFS transparent encryption and its KMS proxy, see Transparent Encryption in HDFS.
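
    The value expected in the HDFS encryption key provider field is a KMS URI. A
    typical format, with a placeholder host name and the commonly used default KMS port,
    looks like this (verify the scheme, host and port with your cluster
    administrator):

      kms://http@kms_host:16000/kms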

  14. You can tune the map and reduce computations by
    selecting the Set memory check box to set proper memory allocations
    for the computations to be performed by the Hadoop system.

    The memory parameters to be set are Map (in Mb),
    Reduce (in Mb) and ApplicationMaster (in Mb). These fields allow you to dynamically allocate
    memory to the map and the reduce computations and the ApplicationMaster of YARN.

    For further information about the Resource Manager, its scheduler and the
    ApplicationMaster, see YARN’s documentation such as http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/.

    For further information about how to determine YARN and MapReduce memory configuration
    settings, see the documentation of the distribution you are using, such as the following
    link provided by Hortonworks: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html.
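
    For reference, in typical YARN/MapReduce setups these Studio fields correspond to
    the following Hadoop properties (verify the exact names against the documentation of
    your distribution):

      Map (in Mb)                  mapreduce.map.memory.mb
      Reduce (in Mb)               mapreduce.reduce.memory.mb
      ApplicationMaster (in Mb)    yarn.app.mapreduce.am.resource.mb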


  15. If you are using Cloudera V5.5+, you can select the Use Cloudera Navigator check box to enable the Cloudera Navigator
    of your distribution to trace your Job lineage to the component level, including the
    schema changes between components.

    With this option activated, you need to set the following parameters:

    • Username and Password: these are the credentials you use to connect to your
      Cloudera Navigator.

    • Cloudera Navigator URL: enter the location of the Cloudera Navigator to be
      connected to.

    • Cloudera Navigator Metadata URL: enter the location
      of the Navigator Metadata.

    • Activate the autocommit option: select this check box to make Cloudera
      Navigator generate the lineage of the current Job at the end of its execution.

      Since this option actually forces Cloudera Navigator to generate lineages of
      all its available entities, such as HDFS files and directories, Hive queries or Pig
      scripts, it is not recommended in a production environment because it slows down
      the Job.

    • Kill the job if Cloudera Navigator fails: select this check
      box to stop the execution of the Job when the connection to your Cloudera Navigator fails.

      Otherwise, leave it clear to allow your Job to continue to run.

    • Disable SSL validation: select this check box to make your Job connect to
      Cloudera Navigator without the SSL validation process.

      This feature is meant to facilitate testing of your Job but is not recommended
      for use in a production cluster.


  16. If you are using Hortonworks Data Platform V2.4.0 onwards and you have
    installed Atlas in your cluster, you can select the Use Atlas check box to enable
    Job lineage to the component level, including the schema changes between
    components.

    With this option activated, you need to set the following parameters:

    • Atlas URL: enter the location of the Atlas to be
      connected to. It is often http://name_of_your_atlas_node:port

    • Die on error: select this check box to stop the Job
      execution when Atlas-related issues occur, such as connection issues to Atlas.

      Otherwise, leave it clear to allow your Job to continue to run.

    In the Username and Password fields, enter the authentication information for access to
    Atlas.

Executing the Job

Save the Job and press F6 to execute
it.

Match results are displayed in the Studio console.

Matches are calculated by running through the two passes. Each pass
provides its matches to the pass that follows, and more matches are
identified with the rule and blocking key of the second pass.
More customers with similar last names are grouped together after the
second pass.

