tMatchGroup
Creates groups of similar data records in any source data including large volumes
of data by using one or several match rules.
tMatchGroup compares columns in both
standard input data flows and in M/R input data flows by using matching methods and
groups similar encountered duplicates together.
Several tMatchGroup components can be used sequentially to match data
against different blocking keys. This refines the groups received by
each of the tMatchGroup components by
creating different data partitions that overlap the previous data blocks.
When a group is defined, the first processed record of each
group is the master record of the group. The distance between each of the other records
and the master records is computed, and each record is then assigned to the
appropriate master record accordingly.
Depending on the Talend
product you are using, this component can be used in one, some or all of the following
Job frameworks:
-
Standard: see tMatchGroup Standard properties.
The component in this framework is available in Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and in Talend Data Fabric.
-
MapReduce: see tMatchGroup MapReduce properties (deprecated).
The component in this framework is available in all Talend Platform products with Big Data and in Talend Data Fabric.
tMatchGroup Standard properties
These properties are used to configure tMatchGroup running in the Standard Job framework.
The Standard
tMatchGroup component belongs to the Data Quality family.
The component in this framework is available in Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and in Talend Data Fabric.
Basic settings
Schema and Edit schema |
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. Click Sync columns to retrieve the schema from the previous component. The output schema of this component contains the following read-only columns:
– GID: provides a group identifier of the data type String.
Note: In Jobs migrated from previous releases to your
current Talend Studio, the group identifier may be of the Long data type. To have a group identifier of the String data type, replace the tMatchGroup components in the migrated Jobs with tMatchGroup components from the Palette.
– GRP_SIZE: counts the number of records in the group; it is computed only on the master record.
– MASTER: identifies, by true or false, whether the record is the master record of its group. Each input record is compared to the master record; if they match, the input record is added to the group.
– SCORE: measures the distance between the input record and the master record, according to the selected matching algorithm.
– GRP_QUALITY: provides the quality of the group; only master records have a group quality score, which takes into account all record pairs in the group. Its exact computation depends on the selected
Matching Algorithm.
– MERGE_INFO: this output column is available only when the T-Swoosh algorithm is selected. |
 |
Built-In: You create and store the schema locally for this component only. |
 |
Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. |
Matching Algorithm |
Select from the list the algorithm you want to use in the component: Simple VSR or t-Swoosh. |
|
Click the import icon to import a match rule from the studio repository. In the Match Rule Selector wizard, select the match rule you want to import. For further information about how to import rules, see Importing match rules from the studio repository. |
Key Definition |
Input Key Attribute: Select the column(s) from the input flow on which you want to apply a matching algorithm. Note: When you select a date column on which to apply an algorithm or a matching
algorithm, you can decide what to compare in the date format. For example, if you want to only compare the year in the date, in the component schema set the type of the date column to Date and then enter “yyyy” in the Date Pattern field. The component then converts the date format to a string according to the pattern defined in the schema before starting a string comparison. |
 |
Matching Function Select a matching algorithm from the list:
Exact: matches each processed entry
Exact – ignore case: matches each
Soundex: matches
Levenshtein (edit distance):
Metaphone: based on a phonetic algorithm for indexing entries by their pronunciation.
Double Metaphone: based on a newer version of the Metaphone phonetic algorithm.
Soundex FR:
Jaro: matches processed entries
Jaro-Winkler: a variant of Jaro, but
Fingerprint key: matches entries
after doing the following sequential process:
q-grams: matches processed entries by
Hamming: calculates the minimum
custom…: enables you to load an external matching algorithm from a Java library. For further information, see the Custom Matcher option below. |
Custom Matcher | When you select custom as the matching type, set in this column the path pointing to the custom class (external matching algorithm) you need to use, for example a MyDistance.class class stored in a user-defined library (.jar file). An illustrative sketch of such a class follows the Basic settings table below. |
|
 |
Tokenized measure | Tokenization is the concept of splitting a string into words. Select the tokenization method to use:
NO: no tokenization method is used on the strings.
Same place: splits the two strings by words and compares the words at the same position.
Same order: splits the two strings by words and compares the words in the same order. This method should be used only with strings which have a few words.
Any order: splits the two strings by words and compares the words in any order. Using this method, “John Doe” and “Doe John” match. |
 |
Threshold | This column is displayed when you select the t-Swoosh algorithm. Two attribute values match when the computed probability is greater than or equal to this threshold. Set a threshold between 0 and 1. You can enter up to 6 decimals. |
 |
Confidence Weight | Set a numerical weight for each attribute (column) used as a key attribute. |
Handle Null | To handle null values, select from the list the null operator you want to use on the column:
Null Match Null: a Null attribute only matches another Null attribute.
Null Match None: a Null attribute never matches another attribute.
Null Match All: a Null attribute matches any value of an attribute.
For example, if we have two columns, name and firstname, and two records:
“Doe”, “John”
“Doe”, “”
Depending on the operator you choose, these two records may or may not match:
Null Match Null: they do not match.
Null Match None: they do not match.
Null Match All: they match.
And for the records:
“Doe”, “”
“Doe”, “”
Null Match Null: they match.
Null Match None: they do not match.
Null Match All: they match. |
|
 |
Survivorship Function (only available when the t-Swoosh algorithm is selected)
|
 | Reference column
If you set |
 | Parameter
If you set Survivorship If you set Survivorship |
Match Threshold |
Enter the match probability. Two data records match when the computed match score is greater than or equal to this value. You can enter a different match threshold for each match rule. |
Survivorship Rules For Columns (only available when the t-Swoosh algorithm is selected) |
Input Column: Select the column(s) from the input flow for which you want to define specific survivorship rules. Survivorship Function: Select how two similar records are merged to create the master record. |
Default Survivorship Rules (only available when the t-Swoosh algorithm is selected) |
Data Type: Select the data type(s) from the input flow for which you want to define default survivorship rules. Survivorship Function: Select how two similar records are merged to create the master record. |
Blocking Selection |
Input Column: If required, select the column(s) from the input flow to use as blocking key(s). Blocking reduces the number of pairs of records that need to be examined, as comparisons are restricted to record pairs within each block. Using blocking column(s) is very useful when you are processing very large data sets. |
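For illustration only, the following sketch shows what a custom matching class could look like. The class name, package and method signature are assumptions made for this example, not the actual Talend custom-matcher contract; refer to the documentation on custom matching algorithms for the exact interface to implement.

```java
// Hypothetical example only: the class name and method signature are assumptions,
// not the actual Talend custom-matcher API.
package org.example.matching;

/** Returns 1.0 for identical trimmed strings (ignoring case), 0.0 otherwise. */
public class MyDistance {

    // Assumed entry point: the component would call a similar method with the two
    // attribute values to compare and expect a similarity between 0 and 1.
    public double getMatchingWeight(String left, String right) {
        if (left == null || right == null) {
            return 0.0; // null handling is normally driven by the Handle Null option
        }
        return left.trim().equalsIgnoreCase(right.trim()) ? 1.0 : 0.0;
    }

    // Small standalone test so the class can be run outside a Job.
    public static void main(String[] args) {
        MyDistance d = new MyDistance();
        System.out.println(d.getMatchingWeight(" John Doe", "john doe")); // 1.0
        System.out.println(d.getMatchingWeight("John Doe", "Jon Doe"));   // 0.0
    }
}
```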
Advanced settings
Store on disk |
Select the Store on disk check box if you want to store processed data blocks on the disk to maximize system performance.
Max buffer size: Type in the size of physical memory, in number of rows, you want to allocate to processed data.
Temporary data directory path: Set the location where the temporary file should be stored. |
Multiple output |
Select the Separate output
check box to have several output flows: Uniques, Matches and Suspects.
Note:
The exact content of these output flows depends on whether you use the Simple VSR or the t-Swoosh algorithm; see the scenarios below. |
Confident match |
Set the confident match threshold: records with a group score greater than or equal to this threshold are grouped in the Matches output flow, and records with a lower group score are grouped in the Suspects output flow. |
Multi-pass |
Select this check box to enable a tMatchGroup component to receive the data sets from the tMatchGroup component that precedes it in the Job. Note: When using two tMatchGroup
components in a Job with this option, you must select this check box in both tMatchGroup components before linking them together. If you linked the components before selecting this check box, select this check box in the second component in the Job flow and then in the first component. Otherwise, you may have an issue as there are two columns in the output schema with the same name; selecting this check box in only one tMatchGroup component may cause schema mismatch issues. For an example Job, see Matching customer data through multiple passes.
Propagate original values: This |
Sort the output data by GID |
Select this check box to group the output data by the group identifier. The output is sorted in ascending alphanumeric order by group identifier. |
Output distance details |
Select this check box to add the MATCHING_DISTANCES output column, which provides the distance between the input record and the master record. Note: When using two tMatchGroup
components in a Job and this option, you must select this check box in both tMatchGroup components before linking them together. If you linked the components before selecting this check box, select this check box in the second component in the Job flow and then, in the first component. Otherwise, you may have an issue as there are two columns in the output schema with the same name. Selecting this check box in only one tMatchGroup component may cause schema mismatch issues. |
Display detailed labels |
Select this check box to have, in the output MATCHING_DISTANCES column, the names of the columns used as key attributes in addition to the computed distances. For example, if you try to match on first name and last name fields, the labels indicate which distance corresponds to which field. |
Deactivate matching computation when opening the wizard |
Select this check box to open the Configuration Wizard without running the match rules defined in it. This enables you to have a better experience with the component. |
tStatCatcher Statistics |
Select this check box to collect log data at the component level. Note |
Global Variables
Global Variables |
ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string.
A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.
To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.
For further information about variables, see the Talend Studio User Guide. |
Usage
Usage rule |
This component is an intermediary step. It requires an input flow as well as an output flow. |
Configuration wizard
The configuration wizard enables you to create different production environments,
Configurations, and their match rules.
You can also use
the configuration wizard to import match rules created and tested in the studio and use them
in your match Jobs. For further information, see Importing match rules from the studio repository.
You cannot open the configuration wizard unless you link the input component to the
tMatchGroup component.
To open the configuration wizard:
-
In the studio workspace, design your Job and link the components together, for
example as below: -
Double-click tMatchGroup; or right-click
it and from the contextual menu select Configuration
Wizard; or click Preview in the
basic settings view of
tMatchGroup. -
In the popup that opens, click Skip Computation
if you want to open the Configuration Wizard without running the match rules defined
in it. The configuration wizard is composed of:
-
the Configuration view, where you can set the
match rules and the blocking column(s). -
the matching chart, which presents the graphic matching result,
-
the matching table, which presents the details of the matching result.
The Limit field at the upper-left corner indicates the
maximum number of rows to be processed by the match rule(s) in the wizard. By default, the
maximum number of rows is 1000.
Configuration view
From this view, you can edit the configuration of the tMatchGroup component or define different configurations in which to execute
the Job.
You can use these different configurations for testing purposes for example, but
you can only save one configuration from the wizard, the open configuration.
In each configuration, you can define the parameters to generate match rules with the VSR
or the T-Swoosh algorithm. The settings of the Configuration view differ slightly depending on whether you select Simple VSR or T-Swoosh in the
basic settings of the tMatchGroup component.
You can define survivorship rules, blocking key(s) and multiple conditions using several
match rules. You can also set different match intervals for each rule. The match results on
multiple conditions will list data records that meet any of the defined rules. When a
configuration has multiple conditions, the Job conducts an OR match operation. It evaluates
data records against the first rule and the records that match are not evaluated against the
other rules.
-
The Key definition parameters.
-
The Match Threshold field.
-
A blocking key in the Blocking Selection
table (available only for rules with the VSR algorithm).Defining a blocking key is not mandatory but advisable as it partitions data
in blocks to reduce the number of records that need to be examined. For further
information about the blocking key, see Importing match rules from the studio repository. -
The Survivorship Rules for Columns parameters
(available only for rules with the T-Swoosh algorithm). -
The Default Survivorship Rules parameters for data
types (available only for rules with the T-Swoosh algorithm).
-
In the basic settings of the tMatchGroup
component, select Simple VSR from the Matching Algorithm list.It is important to have the same type of the matching algorithm selected in the
basic settings of the component and defined in the configuration wizard. Otherwise
the Job runs with default values for the parameters which are not compatible between
the two algorithms. -
In the basic settings of the tMatchGroup
component, click Preview to open the configuration
wizard. -
Click the [+] button on the top right corner of
the Configuration view.This creates, in a new tab, an exact copy of the last configuration. -
Edit or set the parameters for the new configuration in the Key definition and Blocking
Selection tables. -
If needed, define several match rules for the open configuration as
the following:-
Click the [+] button on the match rule bar to
create an exact copy of the last rule in a new tab. -
Set the parameters for the new rule in the Key
definition table and define its match interval. -
Follow the steps above to create as many match rules for a
configuration as needed. You can define a different match interval for
each rule.
When a configuration has multiple conditions, the Job conducts an OR
match operation. It evaluates data records against the first rule and the
records that match are not evaluated against the second rule and so
on. -
-
Click the Chart button at the top right corner of
the wizard to execute the Job in the open configuration. The matching results are displayed in the matching chart and table. Follow the steps above to create as many new configurations in the wizard as
needed. -
To execute the Job in a specific configuration, open the configuration in the
wizard and click the Chart button.The matching results are displayed in the matching chart and table. -
At the bottom right corner of the wizard, click either:
-
OK to save the open configuration.
You can save only one configuration in the wizard.
-
Cancel to close the wizard and keep the
configuration saved initially in the wizard.
-
For an example of a match rule with the T-Swoosh algorithm, see Using survivorship functions to merge two records and create a master record.
Matching chart
From the matching chart, you can have a global picture about the duplicates in the
analyzed data.
The Hide groups less than parameter, which is set to
2 by default, enables you to decide what groups to show in the result
chart. Usually you want to hide groups of small group size.
-
48 items are analyzed and classified into 18 groups according to a given match
rule and after excluding items that are unique, by setting the Hide groups less than parameter to 2. -
11 groups have 2 items each. In each group, the 2 items are duplicates of each
other. -
3 groups have 3 items each. In each group, these items are duplicates of one
another. -
3 groups have 4 items each. In each group, these items are duplicates of one
another. -
One single group has 5 duplicate items.
Matching table
From the matching table, you can read details about the different duplicates.
This table indicates the matching details of items in each group and colors the groups in
accordance with their color in the matching chart.
You can decide what groups to show in this table by setting the Hide
groups of less than parameter. This parameter enables you to hide groups of small
group size. It is set to 2 by default.
The buttons under the table help you to navigate back and forth through the pages.
Importing match rules from the studio repository
From the tMatchGroup configuration wizard, you can import match keys from the
match rules created and tested in the
Profiling
perspective. You can then use these imported matching keys in your match
Jobs.
The tMatchGroup component enables you to import from the
Studio repository match rules based on the VSR or the T-Swoosh algorithms.
The VSR algorithm takes a set of records as input and groups similar encountered
duplicates together according to defined match rules. It compares pairs of records and
assigns them to groups. The first processed record of each group is the master record of the
group. The VSR algorithm compares each record with the master of each group and uses the
computed distances from the master records to decide which group the record should go to.
The T-Swoosh algorithm enables you to find duplicates and to define how
two similar records are merged to create a master record, using a survivorship function.
These new merged records are used to find new duplicates. The difference with the VSR
algorithm is that the master record is in general a new record that does not exist in
the list of input records.
-
From the configuration wizard, click the
icon on the top right corner.
The Match Rule Selector wizard opens listing
all match rules created in the studio and saved in the repository. -
Select the match rule you want to import into the tMatchGroup component and use on your data.
A warning message displays in the wizard if the match rule you want to import is
defined on columns that do not exist in the input schema of tMatchGroup. You can define input columns later in the configuration
wizard.It is important to have the same type of the matching algorithm selected in the
basic settings of the component and imported from the configuration wizard.
Otherwise the Job runs with default values for the parameters which are not
compatible between the two algorithms. -
Select the Overwrite current Match Rule in the
analysis check box if you want to replace the rule in the
configuration wizard with the rule you import.If you leave the box unselected, the match keys will be imported in a new match
rule tab without overwriting the current match rule in the wizard. -
Click OK.
The matching key is imported from the match rule and listed as a new rule in the
configuration wizard. -
Click in the Input Key Attribute and select from
the input data the column on which you want to apply the matching key. -
In the Match threshold
field, enter the match probability threshold. Two data records match when the
computed match score is greater than or equal to this value. -
In the Blocking Selection table, select the
column(s) from the input flow which you want to use as a blocking key.Defining a blocking key is not mandatory but advisable. Using a blocking key
partitions data in blocks and so reduces the number of records that need to be
examined, as comparisons are restricted to record pairs within each block. Using
blocking key(s) is very useful when you are processing large data sets. The Blocking Selection table in the component is
different from the Generation of Blocking Key table
in the match rule editor in the
Profiling
perspective. The blocking column in tMatchGroup can come
from a tGenKey component (and would be called
T_GEN_KEY) or directly from the input schema (it could be a
ZIP column, for instance). The Generation of Blocking Key table in the match rule editor, on the other hand, defines
the parameters necessary to generate a blocking key; that table is equivalent to the
tGenKey component and generates a blocking column named
BLOCK_KEY which is used for blocking. -
Click the Chart button in the top right corner of
the wizard to execute the Job using the imported match rule and show the matching
results in the wizard.
Using tMatchGroup with the Simple VSR Matcher and T-Swoosh algorithms
Matching two records
When matching two records, you can also define how to merge similar records to create a master record.
Creating a master record is an iterative process: each new master record can be used to
find new duplicates.
You can choose between two different algorithms to create master records:
- Simple VSR Matcher
- T-Swoosh
The main difference between the two algorithms is that T-Swoosh creates, for each master
record, a new record that does not exist in the list of input records.
Match rules
Two records match if at least one of the match rules is satisfied. As soon as two records
match according to a given rule, the other rules are not checked.
Matching several records
Blocking
You can use one or many blocking keys to split the input dataset into smaller datasets called
blocks.
Within a block, all records have the same blocking key value. Then, each block is processed
independently.
Using blocking keys reduces the time needed by the Simple VSR Matcher and the T-Swoosh
algorithms to process data. For example, if 100,000 records are split into 100 blocks of
1,000 records each, the number of comparisons is reduced by a factor of 100. This means
the algorithm runs around 100 times faster.
It is recommended to use the tGenKey component to generate blocking
keys and to visualize the statistics regarding the number of blocks. In a Job,
right-click the tGenKey component and select View Key
Profile in the contextual menu to visualize the distribution of the
number of blocks according to their size.
In this example, the average block size is around 40.
For the 13 blocks with 38 rows, there are 18,772 comparisons in these 13 blocks
(13 × 38²). If records are compared with four columns, this means there will be 75,088
string comparisons in these 13 blocks (18,772 × 4).
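As a quick sanity check of the arithmetic above, the standalone snippet below reproduces the quoted counts using the same block-size-squared approximation; it is independent of any Talend component.

```java
/** Rough comparison-count estimate for blocking, using the same
 *  "block size squared" approximation as the text above. */
public class BlockingEstimate {
    public static void main(String[] args) {
        int blocks = 13;
        int rowsPerBlock = 38;
        int matchingColumns = 4;

        long recordComparisons = (long) blocks * rowsPerBlock * rowsPerBlock;
        long stringComparisons = recordComparisons * matchingColumns;

        System.out.println("Record comparisons: " + recordComparisons); // 18772
        System.out.println("String comparisons: " + stringComparisons); // 75088
    }
}
```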
The Simple VSR Matcher algorithm
The Simple VSR Matcher algorithm compares each record with the previous master records in the lookup table.
If a record does not match any of the previous master records, it is considered as a new
master record and added to the lookup table. This means that the first record of the
dataset is necessarily a master record. So, the order of the records is important and
can have an impact on the creation process of the master records.
When a record matches a master record, the Simple VSR Matcher algorithm does not
attempt to match it with other master records, because the master records in the lookup
table are not similar to one another: once a record matches a master record, the chance of it matching
another master record is low.
This means a record can only exist in one group of records and be linked to one master
record.
For example, take the following set of records as input:
id | fullName |
---|---|
1 | John Doe |
2 | Donna Lewis |
3 | John B. Doe |
4 | Louis Armstrong |
The algorithm processes the input records as follows:
- The algorithm takes record 1 and compares it with an empty set of records. Since
record 1 does not match any record, it is added to the lookup table. - The algorithm takes record 2 and compares it with record 1. Since it is not a match,
record 2 is added to the lookup table. - The algorithm takes record 3 and compares it with record 1 and record 2. Record 3
matches record 1. So, record 3 is added to the group of record 1. - The algorithm takes record 4 and compares it with record 1 and record 2 but not with
record 3, which is not a master record. Since it is not a match, record 4 is added
to the lookup table.
The output will look like this:
id | fullName | Grp_ID | Grp_Size | Master | Score | GRP_QUALITY |
---|---|---|---|---|---|---|
1 | John Doe | 0 | 2 | true | 1.0 | 0.72 |
3 | John B. Doe | 0 | 0 | false | 0.72 | 0 |
2 | Donna Lewis | 1 | 1 | true | 1.0 | 1.0 |
4 | Louis Armstrong | 2 | 1 | true | 1.0 | 1.0 |
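For readers who prefer code to prose, the standalone sketch below mimics the grouping logic just described: each record is compared only with the current master records and joins the first group whose master it matches, otherwise it becomes a new master. The record representation, the similarity function and the threshold are simplifications invented for this example; this is not the component's implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified illustration of the Simple VSR grouping idea (not the Talend code). */
public class SimpleVsrSketch {

    // Toy similarity: shared-token ratio; stands in for Jaro-Winkler, Levenshtein, etc.
    static double similarity(String a, String b) {
        List<String> ta = List.of(a.toLowerCase().split("\\s+"));
        List<String> tb = List.of(b.toLowerCase().split("\\s+"));
        long shared = ta.stream().filter(tb::contains).count();
        return (double) shared / Math.max(ta.size(), tb.size());
    }

    public static void main(String[] args) {
        double matchThreshold = 0.6; // assumed value for the example
        List<String> records = List.of("John Doe", "Donna Lewis", "John B. Doe", "Louis Armstrong");

        // Master record -> members of its group (the master is the first record of the group).
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String record : records) {
            String matchedMaster = null;
            for (String master : groups.keySet()) {
                if (similarity(record, master) >= matchThreshold) {
                    matchedMaster = master;
                    break; // a record is compared with masters only, and joins the first match
                }
            }
            if (matchedMaster == null) {
                groups.put(record, new ArrayList<>(List.of(record))); // new master record
            } else {
                groups.get(matchedMaster).add(record);
            }
        }
        groups.forEach((master, members) -> System.out.println(master + " -> " + members));
    }
}
```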
The T-Swoosh algorithm
The T-Swoosh algorithm is based on the same idea as the Simple VSR Matcher algorithm,
but it creates a master record instead of considering existing records to be master
records.
The order of the input records does not impact the matching process.
To create master records, you can design survivorship rules to decide which attribute
will survive.
There are two types of survivorship rules:
- The rules related to matching keys: each attribute used as a matching key can have a
specific survivorship rule. - The default rules: they are applied to all the attributes of the same data type
(Boolean, String, Date, Number).
If a column is a matching key, the rule related to matching keys specific to this column
is applied.
If the column is not a matching key, the default survivorship rule for this data type is
applied. If the default survivorship rule is not defined for the data type, the
Most common survivorship function is used.
Each time two records are merged to create a new master record, this new master record is
added to the queue of records to be examined. The two records that are merged are
removed from the lookup table.
For example, take the following set of records as input:
id | fullName |
---|---|
1 | John Doe |
2 | Donna Lewis |
3 | John B. Doe |
4 | Johnnie B. Doe |
The survivorship rule uses the Concatenate function with
“,” as a parameter to separate values.
At the beginning of the process, the queue contains all the input records and the lookup
is empty. To process the input records, the algorithm iterates until the queue is
empty:
- The algorithm takes record 1 and compares it with an empty set of records. Since
record 1 does not match any record, it is added to the set of master records. The
queue contains now record 2, record 3 and record 4. The lookup contains record
1. - The algorithm takes record 2 and compares it with record 1. Since record 2 does not
match any record, it is added to the set of master records. The queue contains now
record 3 and record 4. The lookup contains record 1 and record 2. - The algorithm takes record 3 and compares it with record 1. Record 3 matches record
1. So, record 1 and record 3 are merged to create a new master record called record
1,3. The queue contains now record 4 and record 1,3. The lookup contains record
2. - The algorithm takes record 4 and compares it with record 2. Since it is not a match,
record 4 is added to the set of master records. The queue contains now record 1,3.
The lookup table contains record 2 and record 4. - The algorithm takes record 1,3 and compares it with record 2 and record 4. Record
1,3 matches record 4. So, record 1,3 and record 4 are merged to create a new master
record called record 1,3,4. Record 4 is removed from the lookup table. Since record
1,3 was the result of a previous merge, it is removed from the table. The queue now
contains record 1,3,4. The lookup contains record 2. - The algorithm takes record 1,3,4 and compares it with record 2. Since it is not a
match, record 1,3,4 is added to the set of master records. The queue is now empty.
The lookup contains records 1,3,4 and record 2.
The output will look like this:
id | fullName | GRP_ID | GRP_SIZE | MASTER | SCORE | GRP_QUALITY |
---|---|---|---|---|---|---|
1,3,4 | John Doe, John B. Doe, Johnnie B. Doe | 0 | 3 | true | 1.0 | 0.449 |
1 | John Doe | 0 | 0 | false | 0.72 | 0 |
3 | John B. Doe | 0 | 0 | false | 0.72 | 0 |
4 | Johnnie B. Doe | 0 | 0 | false | 0.78 | 0 |
2 | Donna Lewis | 1 | 1 | true | 1.0 | 1.0 |
As you can see in this example, the value in the GRP_QUALITY column can be less than the
Match Threshold parameter. That is
because a group is created from record pairs with a matching score greater than
or equal to the Match Threshold but the
records are not all compared to each other; whereas GRP_QUALITY takes into account all record pairs in the
group.
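The following standalone sketch mimics the queue-and-merge behaviour described above, with a Concatenate-style survivorship that joins the merged values with a comma. The similarity test, the threshold and the data structures are simplifications invented for this example; the actual T-Swoosh implementation in the component is more elaborate.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Simplified illustration of the T-Swoosh merge loop (not the Talend code). */
public class TSwooshSketch {

    // Toy similarity: a record matches if any of its comma-separated parts
    // shares at least two tokens with any part of the other record.
    static boolean matches(String a, String b, int minSharedTokens) {
        for (String pa : a.split(",")) {
            for (String pb : b.split(",")) {
                List<String> ta = List.of(pa.trim().toLowerCase().split("\\s+"));
                long shared = List.of(pb.trim().toLowerCase().split("\\s+"))
                        .stream().filter(ta::contains).count();
                if (shared >= minSharedTokens) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Deque<String> queue = new ArrayDeque<>(
                List.of("John Doe", "Donna Lewis", "John B. Doe", "Johnnie B. Doe"));
        List<String> lookup = new ArrayList<>(); // current master records

        while (!queue.isEmpty()) {
            String record = queue.poll();
            String matched = null;
            for (String master : lookup) {
                if (matches(record, master, 2)) {
                    matched = master;
                    break;
                }
            }
            if (matched == null) {
                lookup.add(record); // no match: the record becomes a master record
            } else {
                lookup.remove(matched);            // merged records leave the lookup...
                queue.add(record + ", " + matched); // ...and the merged record is re-examined
            }
        }
        System.out.println(lookup); // e.g. [Donna Lewis, <merged John Doe record>]
    }
}
```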
The differences between the Simple VSR Matcher and the T-Swoosh algorithms
When processing the input data using the T-Swoosh algorithm, there
may be more iterations than the number of input records because a merged record may be
created on each iteration and added to the queue.
This is one of the main differences
between the Simple VSR Matcher and the T-Swoosh algorithms. The T-Swoosh algorithm also performs more
comparisons per iteration than the Simple VSR matcher algorithm:
- When using the Simple VSR matcher algorithm, the record from the queue is only
compared with the value of the master record. There is no comparison between the
record from the queue and the value of each of the records used to build this
master record. Therefore, sort the input records so that the most trustworthy records
appear first in the input data. - When using the T-Swoosh algorithm, the record from the queue is compared with
the value of the master record and the value of each of the records used to
build this master record, until records are considered a match. You can find an example of how to survive
master records using the T-Swoosh algorithm in Talend Help Center (https://help.talend.com). In this example, the record “John Doe, John B. Doe” is compared
with the record “John B. Doe” on iteration 5. There is a match if at least
one of the three strings “John Doe, John B. Doe”, “John Doe” and “John B.
Doe” matches the string “Johnnie B. Doe”.
Multi-pass matching
You can design a Job with consecutive tMatchGroup components to
create data partitions based on different blocking keys.
For example, you want to find duplicates having either the same city or the same zip code
in a customer database. In this particular case, you can use two consecutive
tMatchGroup to process the data partitions:
- One tMatchGroup component in which the column “city” is defined
as a blocking key. - One tMatchGroup component in which the column “ZipCode” is
defined as a blocking key.
What is multi-pass matching?
The idea behind multi-pass matching is to reuse the master records defined in the
previous pass as the input of the current tMatchGroup component.
Multi-pass matching is more effective if the blocking keys are only weakly
correlated. For example, it is not relevant to define the column “country” as a
blocking key and the column “city” as another blocking key because all the
comparisons made with the blocking key “city” will also be done with blocking key
“country”.
When using multi-pass matching with the Simple VSR matcher algorithm, only master
records of size 1 – records that did not match any record – are compared with master
records of any size. There are no comparisons between two master records that are
derived from at least two children each.
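The sketch below illustrates, outside of Talend, which pairs of first-pass master records a second pass would compare under this rule: a pair is skipped unless at least one of the two masters is a singleton. The record type and values are invented for the example.

```java
import java.util.List;

/** Illustration of which master-record pairs are compared in a second pass
 *  with the Simple VSR matcher (not the Talend implementation). */
public class MultiPassPairs {

    static final class Master {
        final String name;
        final int groupSize;
        Master(String name, int groupSize) { this.name = name; this.groupSize = groupSize; }
    }

    public static void main(String[] args) {
        // Master records produced by a first pass, with their group sizes.
        List<Master> masters = List.of(
                new Master("John Doe", 2),     // grouped with another record in the first pass
                new Master("John B. Doe", 2),  // grouped with another record in the first pass
                new Master("Donna Lewis", 1)); // unmatched in the first pass

        for (int i = 0; i < masters.size(); i++) {
            for (int j = i + 1; j < masters.size(); j++) {
                Master a = masters.get(i);
                Master b = masters.get(j);
                // A pair is compared only if at least one master is still a singleton.
                boolean compared = a.groupSize == 1 || b.groupSize == 1;
                System.out.println(a.name + " vs " + b.name + " -> "
                        + (compared ? "compared" : "skipped"));
            }
        }
    }
}
```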
An example of multi-pass matching
In the following example, the dataset contains four records. It is assumed that the
first tMatchGroup component has a blocking key on the column
“ZipCode”, and the second tMatchGroup component has a blocking
key on the column “city”. The attribute “name” is used as a matching key.
id | name | city | ZipCode |
---|---|---|---|
1 | John Doe | Nantes | 44000 |
2 | John B. Doe | Nantes |  |
3 | Jon Doe | Nantes | 44000 |
4 | John Doe | Nantes |  |
After the first pass, records 1 and 3 are grouped, and records 2 and 4 are grouped.
In these groups, record 1 and record 2 are master records.
In the second tMatchGroup, only the master records from the first
pass, record 1 and record 2, are considered. Since their group sizes are both strictly
greater than 1, they are not compared. Therefore, the order in which the input records
are sorted is very important.
The following results are returned:
id | name | city | ZipCode | GID | GRP_SIZE | MASTER | SCORE | GRP_QUALITY |
---|---|---|---|---|---|---|---|---|
1 | John Doe | Nantes | 44000 | 0 | 2 | true | 1.0 | 0.875 |
3 | Jon Doe | Nantes | 44000 | 0 | 0 | false | 0.85 | 0 |
2 | John B. Doe | Nantes |  | 1 | 2 | true | 1.0 | 0.727 |
4 | John Doe | Nantes |  | 1 | 0 | false | 0.72 | 0 |
When running the T-Swoosh algorithm with the same parameters and the Most
common survivorship function, the following results are
returned:
id | name | city | ZipCode | GID | GRP_SIZE | MASTER | SCORE | GRP_QUALITY |
---|---|---|---|---|---|---|---|---|
1 | John Doe | Nantes | 44000 | 0 | 4 | true | 1.0 | 0.727 |
1 | John Doe | Nantes | 44000 | 0 | 0 | true | 0.875 | 0 |
3 | Jon Doe | Nantes | 44000 | 0 | 0 | false | 0.875 | 0 |
2 | John B. Doe | Nantes |  | 0 | 0 | true | 0.72 | 0 |
4 | John Doe | Nantes |  | 1 | 0 | false | 0.72 | 0 |
Grouping output data in separate flows according to the minimal distance
computed in each record
The following scenarios group the output data using the Simple VSR and T-Swoosh algorithms in the tMatchGroup component:
For another scenario that groups the output records in one single output flow, see
Comparing columns and grouping in the output flow duplicate records that have the same functional key.
Simple VSR algorithm
This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.
In this scenario, the Job matches the records of an
input file using the Jaro-Winkler matching method on the lname and fname columns
and the q-grams matching method
on the address1 column. It then groups the
output records in three output flows (a standalone Jaro-Winkler scoring sketch follows the list below):
-
Uniques: lists the
records whose group size is
equal to 1. -
Matches: lists the
records whose group score (minimal distance computed in the group) is
greater than or equal to the threshold you define in the Confident match threshold
field. -
Suspects: lists
the records whose group score (minimal distance computed in the group)
is less than the threshold you define in the Confident match threshold field.
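If you want to get a feel for Jaro-Winkler scores before running the Job, you can compute them outside Talend, for example with the Apache Commons Text library as sketched below. This library is not used by the component itself, and the threshold value shown is only an example.

```java
import org.apache.commons.text.similarity.JaroWinklerSimilarity;

/** Standalone check of Jaro-Winkler scores, independent of the Talend Job. */
public class JaroWinklerCheck {
    public static void main(String[] args) {
        JaroWinklerSimilarity jw = new JaroWinklerSimilarity();
        double threshold = 0.85; // example match interval, adjust to your rule

        String[][] pairs = {
                {"Doe", "Doe"},
                {"Johnson", "Jonson"},
                {"Smith", "Smyth"}
        };
        for (String[] pair : pairs) {
            double score = jw.apply(pair[0], pair[1]);
            System.out.printf("%s / %s -> %.3f (%s)%n",
                    pair[0], pair[1], score, score >= threshold ? "match" : "no match");
        }
    }
}
```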
Setting up the Job
- Drop the following components from the Palette onto the design workspace: tFileInputExcel, tMatchGroup and three tLogRow components.
-
Connect tFileInputExcel to tMatchGroup using the Main
link. - Click the tMatchGroup component and select the Separate output check box in the Advanced settings view.
-
Connect tMatchGroup to
the three tLogRow components using the
Uniques, Matches and Suspects
links.
Configuring the input component
The main input file contains eight columns: account_num,
lname, fname,
mi, address1,
city, state_province and
postal_code. The data in this input file has problems
such as duplication, names spelled differently or wrongly, different information
for the same customer.
You can create the input file used in this scenario by executing the
c0 and c1 Jobs included in the
data quality demo project, TDQEEDEMOJAVA, which you can import from the login window
of your
Talend Studio
. For
further information, see the
Talend Studio User
Guide.
-
In the Basic settings view of tFileInputExcel, fill in the File Name field by browsing to the input file
and set other properties in case they are not stored in the Repository. -
Create the schema through the Edit Schema
button, if the schema is not already stored in the Repository. Remember to set the data type in the Type column.
Configuring the tMatchGroup component
-
Double-click tMatchGroup to display the
Basic settings view and define the
component properties. -
From the Matching Algorithm list, select
Simple VSR.In this scenario, the match rule is based on the VSR algorithm. -
Click Sync columns to retrieve the schema
from the preceding component. -
Click the Edit schema button to view the
input and output schema and do any modifications in the output schema, if
necessary. In the output schema of this component, there are a few standard output columns
that are read-only. For more information, see tMatchGroup Standard properties. - Click OK to close the dialog box.
-
Click the […] button next to
Configure match rules to open the configuration
wizard and define the component configuration and the match rule(s).You can use the configuration wizard to import match rules
created and tested in the studio and stored in the repository, and use them in
your match Jobs. For further information, see Importing match rules from the studio repository.It is important to import or define the same type of the rule
selected in the basic settings of the component, otherwise the Job runs with
default values for the parameters which are not compatible between the two
algorithms. -
Define the first match rule as the following:
-
In the Key definition table, click
the [+] button to add to the list the
column(s) on which you want to do the matching operation,
lname and fname.Note: When you select a date column on which to apply an algorithm or a matching
algorithm, you can decide what to compare in the date format.For example,
if you want to only compare the year in the date, in the component schema set the type
of the date column to Date and then enter “yyyy” in the Date Pattern field.
The component then converts the date format to a string according to the pattern defined
in the schema before starting a string comparison. -
Click in the Matching Function column
and select from the list Jaro-Winkler
as the method to be used for the matching operation.If you select custom as a matching
type, you must set in the Custom Matcher
column the path pointing to the custom class (external
matching algorithm) you need to use. This path is defined by yourself in
the library file (.jar file). -
From the Tokenized measure list,
select No so that no tokenized distance is used for the selected
algorithms. -
Click in the cell of the Weight
column to set the numerical weights for the two columns used as key
attributes. -
Click in the cell of the Handle Null
column and select the null operator you want to use to handle null
attributes in the columns. In this example, select Null Match None in order to have matching results where
null values have minimal effect. -
Set the match probability in the Match
Interval field.
-
-
Follow the same procedure in the above step to define the second match rule.
Set the address1 column as an input attribute and select
Jaro as the matching type. Select Null Match None as the null operator. And finally set
the match probability which can be different from the one set for the first
rule. -
Set the Hide groups of less than parameter in
order to decide what groups to show in the result chart and matching table. This
parameter enables you to hide groups of small group size. -
Click the Advanced settings tab and set the
advanced parameters for the tMatchGroup
component as the following:-
Select the Separate output check box.
The component will have three separate output flows: Unique rows, Confident
groups and Uncertain
groups.If this check box is not selected, the tMatchGroup component will have only one output flow
where it groups all output data. For an example scenario, see Comparing columns and grouping in the output flow duplicate records that have the same functional key. -
Select the Sort the output data by
GID check box to sort the output data by their group
identifier. -
Select the Output distance details
and Display detailed labels check
boxes.The component will output the MATCHING_DISTANCES
column. This column provides the distance between the input and the
master columns giving also the names of the columns against which the
records are matched. -
Select the Deactivate matching computation when
opening the wizard check box if you do not want to run
the match rules the next time you open the wizard.
-
-
Click the Chart button in the wizard to
execute the Job in the defined configuration and have the matching results
directly in the wizard.The matching chart gives a global picture about the duplicates in the analyzed
data. The matching table indicates the details of items in each group and colors
the groups in accordance with their color in the matching chart.The Job conducts an OR match operation on the
records. It evaluates the records against the first rule and the records that
match are not evaluated against the second rule. The MATCHING_DISTANCES column allows you to
understand which rule has been used on what records. Some records are matched
according to the second rule that uses address1 as a key attribute, whereas the other records in the group are
matched according to the first rule which uses the lname and fname
as key attributes.You can set the Hide groups of less than
parameter in order to decide what groups to show in the matching chart and
table
Finalizing the Job and executing it
-
Double-click each of the tLogRow
components to display the Basic
settings view and define the component properties. -
Save your Job and press F6 to execute
it.
You can see that records are grouped together in three different groups.
Each record is listed in one of the three groups according to the value of
the group score which is the minimal distance computed in the group.
The identifier for each group, which is of String
data type, is listed in the GID column next to the
corresponding record. This identifier will be of the data type
Long for Jobs that are migrated from older
releases. To have the group identifier as String, you
must replace the tMatchGroup component in
the imported Job with tMatchGroup from the
studio Palette.
The number of records in each of the three output blocks is listed
in the GRP_SIZE column and computed only on the master record.
The MASTER column indicates with true or false whether the
corresponding record is a master record or not. The SCORE
column lists the calculated distance between the input record and the master record
according to the Jaro-Winkler and Jaro matching algorithms.
The Job evaluates the records against the first rule and the records that
match are not evaluated against the second rule.
All records with a group score between the match
interval, 0.95 or 0.85 depending on the applied rule, and the confidence
threshold defined in the advanced settings of tMatchGroup are listed in the Suspects output flow.
All records with a group score greater than or
equal to the confident match threshold are listed in the Matches output flow.
All records with a group size equal to 1 are listed in the
Uniques output flow.
For another scenario that groups the output records in one single output flow
based on a generated functional key, see Comparing columns and grouping in the output flow duplicate records that have the same functional key.
T-Swoosh algorithm
This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.
In this scenario, the Job matches the records of an
input file using the Jaro-Winkler matching method on the lname and fname columns.
It then groups the output records in three output flows:
-
Uniques: lists the
records whose group size is
equal to 1. -
Matches: lists the
records whose group quality is greater than or equal to the threshold
you define in the Confident
match threshold field. -
Suspects: lists
the records whose group quality is less than the threshold you define in
the Confident match
threshold field.
The group quality is the minimal distance computed among all
record pairs of the group.
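As a small illustration of this definition, the snippet below takes the pairwise match scores of one group and keeps the minimum as the group quality; the scores are made up for the example and the real computation is performed by the component.

```java
import java.util.List;

/** Illustration: group quality as the minimum of the pairwise scores in a group. */
public class GroupQualitySketch {
    public static void main(String[] args) {
        // Made-up pairwise match scores for one group of three records.
        List<Double> pairScores = List.of(0.95, 0.88, 0.81);

        double groupQuality = pairScores.stream()
                .mapToDouble(Double::doubleValue)
                .min()
                .orElse(1.0); // a single-record group keeps a quality of 1.0

        System.out.println("GRP_QUALITY = " + groupQuality); // 0.81
    }
}
```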
Setting up the Job
- Drop the following components from the Palette onto the design workspace: tFixedFlowInput, tMatchGroup and three tLogRow components.
-
Connect tFixedFlowInput
to tMatchGroup using the Main link. - Click the tMatchGroup component and select the Separate output check box in the Advanced settings view.
-
Connect tMatchGroup to
the three tLogRow components using the
Uniques, Matches and Suspects
links.
Configuring the input component
The input data contain seven columns: lname, fname, middle_name, address, city, state_province
and postal_code. The data have problems such as duplication,
names spelled differently or wrongly, different information for the same customer.
-
Double-click the tFixedFlowInput
component to display its Basic settings view. -
Select Built-in and click the […]
button next to Edit Schema. - Define the seven columns and click OK.
- Select Use Inline Content (delimited file).
-
Fill in the Row Separator and Field
Separator fields. - Enter the input data in the Content field.
Configuring the tMatchGroup component
-
Click the tMatchGroup
component to display its Basic settings
view. -
From the Matching
Algorithm list, select t-Swoosh. -
Click Sync columns to retrieve the schema
from the preceding component. -
Click the Edit schema button to view the
input and output schema and do any modifications in the output schema, if
necessary.In the output schema of this component there are output standard columns
that are read-only. For more information, see the tMatchGroup
properties. - Click OK to close the dialog box.
-
Click the Advanced
settings tab and select the following check boxes:-
Separate output
The component will have three separate output flows:
Uniques,
Matches
and Suspects.If this check box is not selected, the tMatchGroup component will have only
one output flow where it groups all output data. For an example
scenario, see Comparing columns and grouping in the output flow duplicate records that have the same functional key. -
Sort the output data by
GID - Output distance details
- Display detailed labels
-
Deactivate matching
computation when opening the wizard
-
Separate output
-
Click the […] button next to Configure match
rules to define the component configuration and the match
rule(s).You can use the configuration wizard to import
match rules created and tested in the studio and stored in the repository, and
use them in your match Jobs. For further information, see Importing match rules from the studio repository.It is important to import or define the same
type of the rule selected in the basic settings of the component. Otherwise the
Job runs with default values for the parameters which are not compatible between
the two algorithms. -
Define the match rule as the following:
-
In the Key definition table,
click the [+] button to add to
the list the column(s) on which you want to do the matching operation,
fname and lname.Note: When you select a date column on which to apply an algorithm or a matching
algorithm, you can decide what to compare in the date format.For example,
if you want to only compare the year in the date, in the component schema set the type
of the date column to Date and then enter “yyyy” in the Date Pattern field.
The component then converts the date format to a string according to the pattern defined
in the schema before starting a string comparison. -
Click in the Matching Function
column and select from the list Jaro-Winkler as the method to be used for the matching
operation.If you select custom
as a matching type, you must set in the Custom
Matcher column the path pointing to the custom class
(external matching algorithm) you need to use. This path is defined by
yourself in the library file (.jar file). -
From the Tokenized measure list,
select No. -
Click in the cell of the Threshold column and
enter 0.7 for fname and 0.4 for lname. -
Click in the cell of the Confidence
Weight column to set the numerical weights for the two
columns used as key attributes: 1 for fname and 4 for lname. -
Click in the cell of the Handle Null column and select the null
operator you want to use to handle null attributes in the columns. In
this example, select Null
Match NONE in order to have matching results where null
values have minimal effect. -
Select Most common in the
Survivorship Function.
-
In the Key definition table,
-
Follow the same procedure in the above step to define the second match rule and
set the parameters as follows:-
Click the [+] button (Duplicate
Rule). - Input Key Attribute: address
-
Matching Function:
Jaro -
Tokenized Measure:
No - Threshold: 0.8
- Confidence Weight: 1
-
Handle Null: Null Match
NONE -
Survivorship Function:
Most common
-
Click the [+] button (Duplicate
-
Set the Match Threshold
parameter of each Match Rule to 0.8. -
Set the Hide groups of less than parameter to 2. This
parameter enables you to hide groups of small size. -
Click the Chart button
to execute the Job in the defined configuration and have the matching results
directly in the wizard.The matching chart gives a global picture about the duplicates
in the analyzed data. The matching table indicates the details of items in each
group and colors the groups in accordance with their color in the matching
chart.The Job conducts an OR match operation on the records. It
evaluates the records against the rule. The MATCHING_DISTANCES column allows you to
understand which rule has been used on what records. For example, in the
second data group (brick red), the last Amic record is matched according to the second rule that uses address as a key attribute,
whereas the other records in the group are matched according to the first
rule which uses the lname and fname as key
attributes.As you can see in this example, the value in the GRP_QUALITY column can be less than the
Match Threshold parameter. That is
because a group is created from record pairs with a matching score greater than
or equal to the Match Threshold but the
records are not all compared to each other; whereas GRP_QUALITY takes into account all record pairs in the
group.
Finalizing the Job and executing it
-
Double-click each tLogRow
component to display the Basic
settings view. - Select Table (print values in cells of a table).
-
Save your Job and press F6 to execute
it.
You can see that records are grouped together in three different groups.
Each record is listed in one of the three groups according to the value of
the group score which is the minimal distance computed in the group.
The identifier for each group, which is of String
data type, is listed in the GID column next to the
corresponding record. This identifier is of the data type Long for Jobs that are migrated from older releases. To have the group
identifier as String, replace the tMatchGroup component in the imported Job with
tMatchGroup from the studio
Palette.
The number of records in each of the three output blocks
is listed in the GRP_SIZE column and
computed only on the master record. The MASTER column indicates with true or false whether the corresponding
record is a master record or not. The SCORE column lists the calculated distance between the input record and
the master record according to the Jaro-Winkler and Jaro matching algorithms.
The Job evaluates the records against the first rule and the records that
match are not evaluated against the second rule.
All records with a group quality between the match
threshold, 0.8 in this scenario, and the confidence
threshold defined in the advanced settings of tMatchGroup are listed in the Suspects output flow.
All records with a group quality greater than or equal to
the confident match threshold are listed in the Matches output flow.
All records with a group size equal to 1 are listed in the
Uniques output
flow.
For another scenario that groups the output records in one single output flow
based on a generated functional key, see Comparing columns and grouping in the output flow duplicate records that have the same functional key.
Matching customer data through multiple passes
This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.
The Job in this scenario groups similar customer records by running
through two subsequent matching passes (tMatchGroup
components) and outputs the calculated matches in groups. Each pass provides its matches
to the pass that follows in order for the latter to add more matches identified with new
rules and blocking keys.
In this Job:
-
The tMysqlInput component
connects to the customer records to be processed. -
Each of the tGenKey
components defines a way to partition data records. The first key partitions
data into many groups and the second key creates fewer groups that overlap the
previous blocks, depending on the blocking key definition. -
The tMap component renames
the key generated by the second tGenKey
component. -
The first tMatchGroup
processes the partitions defined by the first tGenKey, and the second tMatchGroup processes those defined by the second tGenKey. Warning: The two tMatchGroup
components must have the same schema. -
The tLogRow component
presents the matching results after the two passes.
Setting up the Job
In this scenario, the main input schema is already stored in the Repository. For more information about storing schema
metadata in the repository, see the
Talend Studio User
Guide.
-
In the Repository tree view, expand
Metadata – DB
Connections where you have stored the main input schema and
drop the database table onto the design workspace. The input table used in
this scenario is called customer. A dialog box is displayed with a list of components. - Select the relevant database component, tMysqlInput in this example, and then click OK.
-
Drop two tGenKey components, two
tMatchGroup components, a tMap and a tLogRow components from Palette onto the design workspace. -
Link the input component to the tGenKey
and tMap components using Main links. -
In the two tMatchGroup components, select the
Output distance details check boxes in the
Advanced settings view of both components
before linking them together. This will provide the MATCHING_DISTANCES column in the
output schema of each tMatchGroup. If the two tMatchGroup components are already
linked to each other, you must select the Output distance
details check box in the second component in the Job flow first
otherwise you may have an issue. -
Link the two tMatchGroup components and
the tLogRow component using Main links. -
If needed, give the components specific labels to reflect their usage in
the Job.For further information about how to label a component, see
Talend Studio
User Guide.
Configuring input data and key generation
Connecting to the input data
-
Double-click tMysqlInput to open its
Component view.The property fields for tMysqlInput are
automatically filled in. If you do not define your input schema locally in
the repository, fill in the details manually after selecting Built-in in the Schema and Property Type
lists.The input table used in this scenario is called
customer. -
Modify the query in the Query box to
select only the columns you want to match:
account_name, lname,
fname, mi,
address1, city,
state_province and
postal_code.
Configuring the key generation for the first pass
-
Double-click the first tGenKey to open
the Component view. -
Click
and import blocking keys from match rules created and
tested in the
Profiling
perspective
of
Talend Studio
and use them in your Job. Otherwise,
define the blocking key parameters as described in the below steps. -
Under the Algorithm table, click the
[+] button to add two rows in the
table. -
In the column column, click the newly added row and
select from the list the column you want to process using an algorithm. In this
example, select lname. -
Do the same on the second row to select
postal_code. -
In the pre-algorithm column, click the newly
added row and select from the list the pre-algorithm you want to apply to the
corresponding column.In this example, select remove diacritical marks and
convert to upper case to remove any diacritical mark and converts
the fields of the lname column to upper
case.This conversion does not change your raw data. -
In the algorithm column, click the newly
added row and select from the list the algorithm you want to apply to the
corresponding column. In this example, select N first
characters of each word.If you select the Show help check box, you
can display instructions on how to set algorithms/options parameters. -
Do the same for the second row on the algorithm column to select first N
characters of the string. -
Click in the Value column next to the
algorithm column and enter the value for
the selected algorithm, when needed.In this scenario, enter 1 for both rows. The first letter
of each field in the corresponding columns will be used to generate the
key. Make sure to set a value for the algorithms which need one, otherwise you may
have a compilation error when you run the Job.
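The standalone sketch below illustrates the kind of blocking key this first pass produces: the first letter of lname, with diacritical marks removed and converted to upper case, followed by the first character of postal_code. It only illustrates the idea; the actual key is generated by the tGenKey component.

```java
import java.text.Normalizer;

/** Illustration of the blocking key built in the first pass:
 *  first letter of lname (upper case, diacritics removed) + first character of postal_code. */
public class BlockingKeySketch {

    static String firstChar(String value) {
        return value == null || value.isEmpty() ? "" : value.substring(0, 1);
    }

    static String normalize(String value) {
        // Remove diacritical marks and convert to upper case, as the pre-algorithm does.
        String decomposed = Normalizer.normalize(value, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "").toUpperCase();
    }

    public static void main(String[] args) {
        String lname = "Álvarez";
        String postalCode = "44000";

        String blockingKey = firstChar(normalize(lname)) + firstChar(postalCode);
        System.out.println(blockingKey); // A4
    }
}
```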
Configuring the key generation for the second pass
-
Double-click the second tGenKey to open
the Component view. -
In the Algorithm table, define the column
you want to use to partition data, account_num in this
component. Select the first N characters of the
string algorithm and set the value to 1
in the Value column. Each of the two tGenKey components will
generate a read-only T_GEN_KEY column in
the output schema. You must rename one of the T_GEN_KEY columns to stop them from overwriting each
other. -
Double-click the tMap component to open
its editor. -
In the Schema editor, copy the columns
from the first table onto the second table and rename T_GEN_KEY to T_GEN_KEY1, for example. -
In the top part of the editor, drop all columns from the input table to
the output table. -
Click OK to save the data transformation and
close the editor. -
In the tGenKey basic settings, click the
button to verify that the two generated keys are named
differently in the output schema.
Configuring the two matching passes
Configuring the first pass
-
In the basic settings of the tMatchGroup
labelled pass1, select Simple VSR from the Matching
Algorithm list.In this scenario, the match rule is based on the VSR algorithm. -
Click the Preview button to display the
Configuration Wizard. -
Click
and import matching keys from the match rules created and
tested in the
Profiling
perspective of
Talend Studio
and use them in your Job. Otherwise, define
the matching key parameters as described in the below steps.It is important to import or define the same type of the rule selected in the
basic settings of the component, otherwise the Job runs with default values for
the parameters which are not compatible between the two algorithms. -
In the Key definition table, click the
[+] button to add the column(s) on which
you want to do the matching operation, lname
in this scenario. Note: When you select a date column on which to apply an algorithm or a matching algorithm, you can decide what to compare in the date format. For example, if you want to only compare the year in the date, in the component schema set the type of the date column to Date and then enter “yyyy” in the Date Pattern field. The component then converts the date format to a string according to the pattern defined in the schema before starting a string comparison. -
Select the Jaro-Winkler algorithm in the Matching Function column.
-
From the Tokenized measure list, select
Any order. -
Set Weight to 1 and, in the Handle Null column, select the null operator you want to use to handle null attributes in the column, Null Match Null in this scenario. A sketch showing how such a weighted score can be compared with the match threshold is given after these steps. -
Click the [+] button below the Blocking Selection table to add one row to the table, then click in the line and select from the list the column you want to use as a blocking value, T_GEN_KEY in this example. Using a blocking value reduces the number of pairs of records that need to be examined. The input data is partitioned into exhaustive blocks based on the functional key. This decreases the number of pairs to compare, as comparison is restricted to record pairs within each block. -
If required, click Edit schema to open the
schema editor and see the schema retrieved from the previous component in the
Job. -
Click the Advanced settings tab and select
the Sort the output data by GID check box to
arrange the output data by their group IDs. -
Select the Deactivate matching computation when opening
the wizard check box if you do not want to run the match rules
the next time you open the wizard.
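As a rough illustration of how a weighted attribute similarity relates to the match threshold, the sketch below computes a weighted score for a record against a master record and compares it with a threshold. This is a simplified stand-in, not Talend's implementation: the similarity function here is an exact, case-insensitive comparison standing in for Jaro-Winkler, and the 0.85 threshold is only an example value.

```java
import java.util.Locale;

public class WeightedMatchSketch {
    // Stand-in similarity: 1.0 for an exact match ignoring case, 0.0 otherwise.
    // In the scenario above, the real measure on lname is Jaro-Winkler.
    static double similarity(String a, String b) {
        if (a == null || b == null) {
            return 0.0; // crude stand-in; the component offers Null Match Null/None/All operators
        }
        return a.toLowerCase(Locale.ROOT).equals(b.toLowerCase(Locale.ROOT)) ? 1.0 : 0.0;
    }

    // Weighted average of per-attribute similarities, compared with the match threshold.
    static boolean matches(String[] record, String[] master, double[] weights, double threshold) {
        double weighted = 0.0;
        double totalWeight = 0.0;
        for (int i = 0; i < record.length; i++) {
            weighted += weights[i] * similarity(record[i], master[i]);
            totalWeight += weights[i];
        }
        return (weighted / totalWeight) >= threshold;
    }

    public static void main(String[] args) {
        String[] master = {"WILSON"};      // hypothetical master record (lname only)
        String[] candidate = {"wilson"};   // hypothetical input record
        System.out.println(matches(candidate, master, new double[]{1.0}, 0.85)); // prints true
    }
}
```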
Configuring the second pass
-
In the basic settings of the tMatchGroup
labelled pass2, select Simple VSR from the Matching
Algorithm list. In this scenario, the match rule is based on the VSR algorithm. -
Click the Preview button to display the
Configuration Wizard. If this component does not have the same schema as the preceding component, a warning icon appears. If so, click the Sync columns button to retrieve the schema from the preceding component; once done, the warning icon disappears. -
In the Key Definition table, click
the [+] button to add the column(s) on
which you want to do the matching operation, lname in this scenario. Note: When you select a date column on which to apply an algorithm or a matching algorithm, you can decide what to compare in the date format. For example, if you want to only compare the year in the date, in the component schema set the type of the date column to Date and then enter “yyyy” in the Date Pattern field. The component then converts the date format to a string according to the pattern defined in the schema before starting a string comparison. -
Select the Jaro-Winkler algorithm in the Matching Function column.
-
Set Weight to 1 and in
the Handle Null column, select the null
operator you want to use to handle null attributes in the column, Null Match Null in this scenario. -
Click the [+] button below the Blocking Selection table to add one row to the table, then click in the line and select from the list the column you want to use as a blocking value, T_GEN_KEY1 in this example. A sketch of how a blocking value limits comparisons to records within the same block is given after these steps.
-
Click the Advanced settings tab and select
the Multi-pass check box. This option enables
tMatchGroup to receive data sets from the
tMatchGroup that precedes it in the Job.
-
In the Advanced settings view, select the
Sort the output data by GID check box to
arrange the output data by their group IDs. -
Select the Deactivate matching computation when opening
the wizard check box if you do not want to run the match rules
the next time you open the wizard.
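The sketch below illustrates why a blocking value such as T_GEN_KEY or T_GEN_KEY1 reduces the number of comparisons: records are grouped by their key and pairs are only generated inside each block. The records and the keys are hypothetical; in the Job, the real key comes from the tGenKey components configured earlier.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BlockingSketch {
    public static void main(String[] args) {
        // Hypothetical records: {blockingKey, lname}. The key plays the role of T_GEN_KEY.
        String[][] records = {
            {"W1", "Wilson"}, {"W1", "Wilsonn"}, {"S2", "Smith"}, {"S2", "Smyth"}
        };

        // Partition the records into blocks that share the same blocking key.
        Map<String, List<String[]>> blocks = new HashMap<>();
        for (String[] r : records) {
            blocks.computeIfAbsent(r[0], k -> new ArrayList<>()).add(r);
        }

        // Pairs are only generated inside each block, never across blocks.
        for (Map.Entry<String, List<String[]>> e : blocks.entrySet()) {
            List<String[]> block = e.getValue();
            for (int i = 0; i < block.size(); i++) {
                for (int j = i + 1; j < block.size(); j++) {
                    System.out.println("compare " + block.get(i)[1] + " / " + block.get(j)[1]
                            + " (block " + e.getKey() + ")");
                }
            }
        }
    }
}
```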
Executing the Job and showing the results on the console
To show the groups created after the first pass and compare them with the groups created after the second pass, you must modify the Job as follows:
-
-
Use a tReplicate component to
replicate the input flow you want to process as shown in the above
figure. Use a copy/paste operation to create the two parts of the
Job. -
Keep only the first-pass tMatchGroup component in the upper part of the Job and show the match results in a tLogRow component. -
Use two passes in the lower part of the Job and show the final
match results in a tLogRow
component.
-
Double-click each of the tLogRow
components to open the Component view and
in the Mode area, select the Table (print values in cells of a table)
option. -
Save your Job and press F6 to execute
it. The console displays the results after the first pass and the results after the second pass. When you compare, for example, the customer name Wilson in the results of the two passes, you will find that more customers with the last name Wilson are grouped together after the second pass.
Using survivorship functions to merge two records and create a master
record
This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.
This scenario describes a basic Job that uses the T-Swoosh algorithm to find duplicates in
customer names. It compares values in the first_name column using
matching and survivorship methods, groups similar records together and creates single
representations of entities in each group by merging two similar records according to a
survivorship rule.
These master records are new records that do not exist in the input data.
You can use several matching passes with the T-Swoosh algorithm in a Job. Each pass
provides the survived masters for each group and the original records to the next pass.
However, intermediate master records are removed from the output flow. Only final master and
original records are kept at the end.
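As an illustration of the survivorship idea used in this scenario, the sketch below merges a group of duplicates into a single master record by keeping the most common first name and the largest age. It is a simplified stand-in for the T-Swoosh survivorship functions, with hypothetical values, not Talend's implementation.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SurvivorshipSketch {
    // "Most common": keep the value that appears most often in the group.
    static String mostCommon(List<String> values) {
        Map<String, Long> counts = new HashMap<>();
        for (String v : values) {
            counts.merge(v, 1L, Long::sum);
        }
        return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    // "Largest (for numbers)": keep the largest numerical value in the group.
    static int largest(List<Integer> values) {
        return Collections.max(values);
    }

    public static void main(String[] args) {
        // Hypothetical group of duplicates for one customer.
        List<String> firstNames = Arrays.asList("John", "Jon", "John");
        List<Integer> ages = Arrays.asList(34, 36, 35);
        // The merged master record is a new record that does not exist in the input data.
        System.out.println("master = [" + mostCommon(firstNames) + ", " + largest(ages) + "]");
    }
}
```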
In this Job:
-
The tFileInputDelimited component provides the
customer records to be processed. -
The tMatchGroup component processes data using
matching and survivorship methods. -
The tLogRow component presents the matching and
survivorship results.
Setting up the Job
In this scenario, the main input schema is already stored in the Repository. For more information about storing schema metadata in the
repository, see the
Talend Studio User Guide.
-
In the Repository tree view, expand Metadata – File
delimited where you have stored the main input schema and drop
the file connection onto the design workspace. The file connection used in this
scenario is called Customer_File. A dialog box is displayed with a list of components. -
Select tFileInputDelimited and click
OK. -
Drop a tMatchGroup and a tLogRow from the Palette onto the design workspace.
-
Link the components together using the Main
links.
Configuring the input component
The main input file contains the columns: id,
first_name and age. The data in this input
file has problems such as duplication, names spelled differently or incorrectly, and different information for the same customer.
-
Double-click tFileInputDelimited to open its
Component view. The property fields are automatically filled in. If you do not define your input schema in the Repository, fill in the details manually after
selecting Built-in in the Schema and Property Type
lists. -
Create the schema through the Edit Schema
button, if the schema is not already stored in the Repository. Remember to set the data type in the Type column.
Creating the match rule with a matching key and a survivorship function
Setting the schema and selecting the matching algorithm
-
Double-click tMatchGroup to display the
Basic settings view and define the
component properties. -
From the Matching Algorithm list, select
T-Swoosh. In this scenario, the match rule is based on the T-Swoosh algorithm. -
Click Sync columns to retrieve the schema
from the preceding component. -
Click the Edit schema button to view the input and output schema and make any modifications to the output schema, if necessary. In the output schema of this component, there are a few standard output columns that are read-only. For more information, see tMatchGroup Standard properties. -
In the Advanced settings view, leave the
parameters selected by default.
Defining the match rule
-
In the tMatchGroup basic settings, click
Preview to open the configuration wizard
and define the matching key and the survivorship function. You can use the configuration wizard to import match rules created and tested in the studio and stored in the repository, and use them in your match Jobs. For further information, see Importing match rules from the studio repository. It is important to have the same type of matching algorithm selected in the basic settings of the component and defined in the configuration wizard. Otherwise, the Job runs with default values for parameters which are not compatible between the two algorithms. -
Define the match rule as follows:
-
In the Key definition table, click
the [+] button to add a line in the
table. Click in the Input Key Attribute
column and select the column on which you want to do the matching
operation, first_name in this scenario. -
Click in the Matching Function column
and select Soundex from the list. This
method matches processed entries according to a standard English
phonetic algorithm which indexes strings by sound, as pronounced in
English. -
From the Tokenized measure list, choose the option that does not use a tokenized distance for the selected algorithm. -
Set the Threshold to 0.8 and the Confidence Weight to 1.
-
Select Null Match None in the
Handle Null column in order to have
matching results where null values have minimal effect. -
Select Most common
in the Survivorship Function
column. This method validates the most frequent name value in each group
of duplicates.
-
-
Define the default survivorship rule as follows:
-
In the Default Survivorship
Rules table, click the [+] button to add a line in the table. Click in the
Data Type column and select
Number. -
Click in the Survivorship
Function column and select Largest (for numbers) from the list. This method
validates the largest numerical value in each group.
-
-
Set the Hide groups of less than parameter to decide which groups to show in the result chart and matching table. This parameter enables you to hide groups that are smaller than the specified size. -
Click the Chart button in the wizard to
execute the Job in the defined configuration and have the results directly in
the wizard. The matching chart gives a global picture of the duplicates in the analyzed data. The matching table indicates the details of the items in each group, colors the groups in accordance with their colors in the matching chart and indicates with true the records which are master records. The master record in each group is the result of merging two similar records according to the phonetic algorithm and survivorship rule. The master record is a new record that does not exist in the input data. -
Click OK to close the wizard.
Finalizing the Job and executing it
-
Double-click the tLogRow component to display
its Basic settings view and define the
component properties. -
Save your Job and press F6 to execute
it. Similar records are grouped together. The number of records in each group is
listed in the GRP_SIZE column and computed only on the
master record. The MASTER column indicates with true or false if
the corresponding record is a master record or not a master record. The master
record is created by merging the record which best matches the phonetic
algorithms with the record which has the largest age value. The
SCORE column lists the calculated similarity between
the input record and the master record according to the Soundex matching algorithm.The identifier for each group, which is of String data
type, is listed in the GID column next to the corresponding
record. This identifier will be of the data type Long for
Jobs that are migrated from older releases. To have the group identifier as
String, you must replace the tMatchGroup component in the imported Job with tMatchGroup from the studio Palette.
tMatchGroup MapReduce properties (deprecated)
These properties are used to configure tMatchGroup running in the MapReduce Job framework.
The MapReduce
tMatchGroup component belongs to the Data Quality family.
The component in this framework is available in all Talend Platform products with Big Data and in Talend Data Fabric.
The MapReduce framework is deprecated from Talend 7.3 onwards. Use Talend Jobs for Apache Spark to accomplish your integration tasks.
Basic settings
Schema and Edit |
A schema is a row description. It defines the number of fields Click Sync The output schema of this component contains the following read-only
GID: provides a group identifier
of the data type String. Note: In Jobs migrated from previous releases to your current Talend Studio, the group identifier may be of the Long data
type. To have a group identifier of the String data type, replace the tMatchGroup components in the migrated Jobs with tMatchGroup components from the Palette.
GRP_SIZE: counts the number of records in the
MASTER: identifies, by Each input record will be compared to the master record, if they
SCORE: measures the distance between the input In case the tMatchGroup component is
GRP_QUALITY: provides the quality of similarities |
 |
Built-In: You create and store the schema locally for this component |
 |
Repository: You have already created the schema and stored it in the |
Matching Algorithm |
Select from the list the algorithm you want to use in the component: If you converted a standard Job using tMatchGroup with the T-Swoosh algorithm to a Map/Reduce Job, select Simple VSR from the list and save the |
Key Definition |
Input Key Attribute Select the column(s) from the input flow on which you want to apply a Note: When you select a date column on which to apply an algorithm or a matching
algorithm, you can decide what to compare in the date format. For example, |
 |
Matching Function Select a matching algorithm from the list:
Exact: matches each processed entry
Exact – ignore case: matches each
Soundex: matches
Levenshtein (edit distance):
Metaphone: Based
Double Metaphone:
Soundex FR:
Jaro: matches processed entries
Jaro-Winkler: a variant of Jaro, but
Fingerprint key:
matches entries after doing the following sequential process:
q-grams: matches processed entries by
Hamming: calculates the minimum
custom…: enables you to load an For further information about how to load an external Java library, For further information about how to create a custom matching For a related scenario about how to use a custom matching algorithm, |
Custom Matcher When you select Custom as the For example, to use a MyDistance.class class |
|
 |
Weight Set a numerical weight for each attribute (column) of the key |
Handle Null To handle null values, select from the list the null operator you want
Null Match Null: a Null attribute
Null Match None: a Null attribute
Null Match All: a Null attribute For example, if we have two columns, name and If we have two records: “Doe”, “John” “Doe”, “” Depending on the operator you choose, these two records may or may not
Null Match Null: they do not
Null Match None: they do not Null Match All: they match. And for the records: “Doe”, “” “Doe”, “” Null Match Null: they match.
Null Match None: they do not Null Match All: they match. |
|
Match Threshold |
Enter the match probability. Two data records match when the You can enter a different match threshold for each match rule. |
Blocking Selection |
Input Column If required, select the column(s) from the input flow according to Blocking reduces the number of pairs of records that needs to be Using blocking column(s) is very useful when you are processing very |
Advanced settings
Store on disk |
Select the Store on disk check box if
Max buffer size: Type in the size of
Temporary data directory path: Set |
Multiple output |
Select the Separate output check box –Uniques: when the group score –Matches: when the group score –Suspects: when the group score
Confident match threshold: set a |
Multi-pass |
Select this check box to enable a
tMatchGroup component to receive data sets from another tMatchGroup that precedes it in the Job. This will refine the groups received by each of the tMatchGroup components through creating data partitions based on different blocking keys. Note: When using two tMatchGroup
components in a Job and this option, you must select this check box in both tMatchGroup components before linking them together. If you linked the components before selecting this check box, select this check box in the second component in the Job flow and then, in the first component. Otherwise, you may have an issue as there are two columns in the output schema with the same name. Selecting this check box in only one tMatchGroup component may cause schema mismatch issues. For an example Job, see Matching customer data through multiple passes |
Sort the output data by GID |
Select this check box to group the output data by the group The output is sorted in ascending alphanumeric order by group |
Output distance details |
Select this check box to add an output column Note: When using two tMatchGroup
components in a Job and this option, you must select this check box in both tMatchGroup components before linking them together. If you linked the components before selecting this check box, select this check box in the second component in the Job flow and then, in the first component. Otherwise, you may have an issue as there are two columns in the output schema with the same name. Selecting this check box in only one tMatchGroup component may cause schema mismatch issues. |
Display detailed labels |
Select this check box to have in the output For example, if you try to match on first name and last name fields, |
tStatCatcher Statistics |
Select this check box to collect log data at the component level. Note |
Global Variables
Global Variables |
ERROR_MESSAGE: the error message generated by the A Flow variable functions during the execution of a component while an After variable To fill up a field or expression with a variable, press Ctrl + For further information about variables, see |
Usage
Usage rule |
In a You need to use the Hadoop Configuration tab in the For further information about a For a scenario demonstrating a Map/Reduce Job using this component, Note that in this documentation, unless otherwise |
Working principle
This component implements the MapReduce model, based on the blocking keys defined in
the Blocking definition table of the Basic settings view.
This implementation proceeds as follows:
-
Splits the input rows in groups of a given size.
-
Implements a Map Class that creates a map between each key and a list of
records. -
Shuffles the records to group those with the same key together.
-
Applies, on each key, the algorithm defined in the Key
definition table of the Basic
settings view. Then, accordingly, this component reads the records, compares them with the master records, groups the similar ones, and makes each remaining unmatched record a master record. A minimal sketch of this processing flow is given after this list. -
Outputs the groups of similar records with their group IDs, group sizes,
matching distances and scores.
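The sketch below mimics this map, shuffle and reduce sequence in a single JVM to make the flow concrete: records are keyed by their blocking key, grouped, and the first record of each group acts as the master the others are compared with. It is only an illustration of the principle, not the Map/Reduce code generated by the component, and the matching logic itself is omitted.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReducePrincipleSketch {
    public static void main(String[] args) {
        // Hypothetical input rows: {blockingKey, lname}. The blocking key is the map key.
        List<String[]> rows = Arrays.asList(
                new String[]{"W", "Wilson"},
                new String[]{"W", "Wilsonn"},
                new String[]{"S", "Smith"});

        // Map phase: emit (blockingKey, record). Shuffle: group records sharing a key.
        Map<String, List<String[]>> shuffled =
                rows.stream().collect(Collectors.groupingBy(r -> r[0]));

        // Reduce phase: within each key, the first record becomes the master and the
        // following records are compared with it (the actual matching logic is omitted).
        shuffled.forEach((key, group) -> {
            String[] master = group.get(0);
            for (int i = 1; i < group.size(); i++) {
                System.out.println(key + ": compare " + group.get(i)[1]
                        + " with master " + master[1]);
            }
        });
    }
}
```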
Configuration wizard in Map/Reduce Jobs
The match results on multiple conditions will list data
records that meet any of the defined rules.
You cannot open the configuration wizard unless you link an input component to the tMatchGroup component.
From the configuration wizard in tMatchGroup, you can:
-
define multiple conditions using several match rules to group
data, -
set different match intervals for each rule,
-
import match rules created and tested in the studio and stored in
the repository, and use them in your match Jobs. You can only import rules
configured with the VSR algorithm. For further information, see Importing match rules from the studio repository. -
select a blocking key to partition data.
-
Click the [+] button on the match rule
bar. -
Set the parameters for the new rule in the Key
definition table and define its match interval. -
Repeat the above steps to create as many match rules as needed. You can define a different match interval for each rule. When you define multiple rules, the Job conducts an OR match operation: it evaluates data records against the first rule, and the records that match are not evaluated against the second rule. A sketch of this evaluation order is given after these steps. -
In the Blocking Selection table, select the
column(s) from the input flow which you want to use as a blocking key. Defining a blocking key is not mandatory but is very useful when you are processing big data sets. A blocking key partitions data into blocks and so reduces the number of records that need to be examined. This key can come from a
tGenKey component (and would be called
T_GEN_KEY) or directly from the input schema. -
At the bottom right corner of the wizard, click either:
-
OK to save the current
configuration. -
Cancel to close the wizard and keep
the configuration saved initially in the wizard.
-
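The sketch below illustrates the OR semantics of multiple match rules: rules are tried in order and, as soon as one of them matches a pair of records, the following rules are skipped for that pair. The two rules shown here are hypothetical examples, not the ones defined in the wizard.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;

public class MultiRuleSketch {
    public static void main(String[] args) {
        // Hypothetical rules, in the order they appear on the match rule bar.
        List<BiPredicate<String, String>> rules = Arrays.asList(
                (a, b) -> a.equalsIgnoreCase(b),                                   // rule 1: exact, ignore case
                (a, b) -> a.substring(0, 1).equalsIgnoreCase(b.substring(0, 1)));  // rule 2: same first letter

        String left = "Wilson";
        String right = "wilson";
        for (int i = 0; i < rules.size(); i++) {
            if (rules.get(i).test(left, right)) {
                System.out.println("matched by rule " + (i + 1));
                break; // records that match are not evaluated against the following rules
            }
        }
    }
}
```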
Matching data through multiple passes using Map/Reduce
components
This scenario applies only to subscription-based Talend Platform products with Big Data and Talend Data Fabric.
Note that
Talend
Map/Reduce components are available only to users who subscribed to Big
Data.
This scenario shows how to create a
Talend
Map/Reduce Job to match data by using Map/Reduce components. It generates
Map/Reduce code and runs right in Hadoop.
The Job in this scenario groups similar customer records by running through
two subsequent matching passes (tMatchGroup
components) and outputs the calculated matches in groups. Each pass provides its matches to
the pass that follows in order for the latter to add more matches identified with new rules
and blocking keys.
This Job is a duplication of the Standard data integration Job described in Matching customer data through multiple passes where standard components are replaced with Map/Reduce components.
You can use
Talend Studio
to automatically convert the standard Job in the previous section to a Map/Reduce Job.
This way, you do not need to redefine the settings of the components in the Job.
Before starting to replicate this scenario, ensure that you have appropriate
rights and permissions to access the Hadoop distribution to be used.
Converting the Job to a Big Data Batch Job
-
In the Repository tree view, right-click the Job you have created in
the earlier scenario to open its contextual menu and select Edit properties. The Edit properties dialog box is displayed. Note that the Job must be closed before you are able to make any changes in this dialog box. You can change the Job name as well as the other descriptive information about the Job from this dialog box. -
From the Job Type list, select
Big Data Batch. Then a Map/Reduce Job
using the same name appears under the Big Data
Batch sub-node of the Job
Design node.
Setting up the Job
-
Double-click the new Map/Reduce Job to open it in the workspace.
The Map/Reduce component Palette is opened.
-
Delete tMysqlInput in this scenario as it is
not a Map/Reduce component and use tRowGenerator in its place. Link it to tGenKey with a Row > Main
link. -
Double-click tRowGenerator to open its
editor. -
Define the schema you want to use to write data in Hadoop.
-
Click OK to validate your schema and close
the editor. -
Leave the settings of the other components as you defined initially in the
standard version of the Job.
Setting up Hadoop connection
-
Click Run to open its view and then click the
Hadoop Configuration tab to display its
view for configuring the Hadoop connection for this Job. -
From the Property type list,
select Built-in. If you have created the connection to be used in the Repository, select Repository and the Studio will reuse that set of connection information for this Job. -
In the Version area, select the
Hadoop distribution to be used and its version. -
If you use Google Cloud Dataproc, see Google Cloud Dataproc.
-
If you cannot
find the Cloudera version to be used from this drop-down list, you can add your distribution
via some dynamic distribution settings in the Studio. -
If you cannot find the distribution corresponding to yours in the list, select Custom so as to connect to a
Hadoop distribution not officially supported in the Studio. For a
step-by-step example about how to use this
Custom option, see Connecting to a custom Hadoop distribution.
-
-
In the Name node field, enter the location of
the master node, the NameNode, of the distribution to be used. For example,
hdfs://tal-qa113.talend.lan:8020. -
If you are using a MapR distribution, you can simply leave maprfs:/// as it is in this field; then the MapR
client will take care of the rest on the fly for creating the connection. The
MapR client must be properly installed. For further information about how to set
up a MapR client, see the following link in MapR’s documentation: http://doc.mapr.com/display/MapR/Setting+Up+the+Client -
If you are using WebHDFS, the location should be
webhdfs://masternode:portnumber; WebHDFS with SSL is not
supported yet.
-
-
In the Resource Manager field,
enter the location of the ResourceManager of your distribution. For example,
tal-qa114.talend.lan:8050. -
Then you can continue to set the following parameters depending on the
configuration of the Hadoop cluster to be used (if you leave the check
box of a parameter clear, then at runtime, the configuration about this
parameter in the Hadoop cluster to be used will be ignored):-
Select the Set resourcemanager
scheduler address check box and enter the Scheduler address in
the field that appears. -
Select the Set jobhistory
address check box and enter the location of the JobHistory
server of the Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server. -
Select the Set staging
directory check box and enter this directory defined in your
Hadoop cluster for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
such as yarn-site.xml or mapred-site.xml of your distribution. -
Select the Use datanode hostname check box to allow the
Job to access datanodes via their hostnames. This actually sets the dfs.client.use.datanode.hostname
property to true. When connecting to an
S3N filesystem, you must select this check box.
-
-
-
If you are accessing the Hadoop cluster running
with Kerberos security, select this check box, then enter the Kerberos
principal name for the NameNode in the field displayed. This enables you to use
your user name to authenticate against the credentials stored in Kerberos.
-
If this cluster is a MapR cluster of the version 5.0.0 or later, you can set the
MapR ticket authentication configuration in addition or as an alternative by following
the explanation in Connecting to a security-enabled MapR. Keep in mind that this configuration generates a new MapR security ticket for the username
defined in the Job in each execution. If you need to reuse an existing ticket issued for the
same username, leave both the Force MapR ticket
authentication check box and the Use Kerberos
authentication check box clear, and then MapR should be able to automatically
find that ticket on the fly.
In addition, since this component performs Map/Reduce computations, you
also need to authenticate the related services such as the Job history server and
the Resource manager or Jobtracker depending on your distribution in the
corresponding field. These principals can be found in the configuration files of
your distribution. For example, in a CDH4 distribution, the Resource manager
principal is set in the yarn-site.xml file and the Job history
principal in the mapred-site.xml file. If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. You need to enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored in the machine in which your Job actually
runs, for example, on a Talend
Jobserver. Note that the user that executes a keytab-enabled Job is not necessarily
the one a principal designates but must have the right to read the keytab file being
used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
situation, ensure that user1 has the right to read the keytab
file to be used. -
-
In the User name field, enter the login user
name for your distribution. If you leave it empty, the user name of the machine
hosting the Studio will be used. -
In the Temp folder field, enter the path in
HDFS to the folder where you store the temporary files generated during
Map/Reduce computations. -
Leave the default value of the Path separator in
server as it is, unless you have changed the separator used by your
Hadoop distribution's host machine for its PATH variable; in other words, unless that separator is not a colon (:). In that situation, you must change this value to the
one you are using in that host.
-
Leave the Clear temporary folder check box
selected, unless you want to keep those temporary files. -
Leave the Compress intermediate map output to reduce
network traffic check box selected, so as to spend less time transferring the mapper task partitions to the multiple reducers. However, if the data transfer in the Job is negligible, it is recommended to
clear this check box to deactivate the compression step, because this
compression consumes extra CPU resources. -
If you need to use custom Hadoop properties, complete the Hadoop properties table with the property or
properties to be customized. Then at runtime, these changes will override the
corresponding default properties used by the Studio for its Hadoop
engine.For further information about the properties required by Hadoop, see Apache’s
Hadoop documentation on http://hadoop.apache.org, or
the documentation of the Hadoop distribution you need to use. -
If the HDFS transparent encryption has been enabled in your cluster, select
the Setup HDFS encryption configurations check
box and in the HDFS encryption key provider field
that is displayed, enter the location of the KMS proxy.
For further information about the HDFS transparent encryption and its KMS proxy, see Transparent Encryption in HDFS.
-
You can tune the map and reduce computations by
selecting the Set memory check box to set proper memory allocations
for the computations to be performed by the Hadoop system. The memory parameters to be set are Map (in Mb), Reduce (in Mb) and ApplicationMaster (in Mb). These fields allow you to dynamically allocate memory to the map and the reduce computations and the ApplicationMaster of YARN. For further information about the Resource Manager, its scheduler and the ApplicationMaster, see YARN's documentation such as http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/. For further information about how to determine YARN and MapReduce memory configuration
settings, see the documentation of the distribution you are using, such as the following
link provided by Hortonworks: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html. -
If you are using Cloudera V5.5+, you can select the Use Cloudera Navigator check box to enable the Cloudera Navigator
of your distribution to trace your Job lineage to the component level, including the
schema changes between components.
With this option activated, you need to set the following parameters:
-
Username and Password: these are the credentials you use to connect to your Cloudera
Navigator. -
Cloudera Navigator URL: enter the location of the
Cloudera Navigator to be connected to. -
Cloudera Navigator Metadata URL: enter the location
of the Navigator Metadata. -
Activate the autocommit option: select this check box
to make Cloudera Navigator generate the lineage of the current Job at the end of the
execution of this Job.Since this option actually forces Cloudera Navigator to generate lineages of
all its available entities such as HDFS files and directories, Hive queries or Pig
scripts, it is not recommended for the production environment because it will slow the
Job. -
Kill the job if Cloudera Navigator fails: select this check
box to stop the execution of the Job when the connection to your Cloudera Navigator fails. Otherwise, leave it clear to allow your Job to continue to run.
-
Disable SSL validation: select this check box to
make your Job connect to Cloudera Navigator without the SSL validation process. This feature is meant to facilitate the testing of your Job but is not
recommended to be used in a production cluster.
-
-
If you are using Hortonworks Data Platform V2.4.0 onwards and you have
installed Atlas in your cluster, you can select the Use
Atlas check box to enable Job lineage to the component level, including the
schema changes between components.
With this option activated, you need to set the following parameters:
-
Atlas URL: enter the location of the Atlas to be
connected to. It is often http://name_of_your_atlas_node:port -
Die on error: select this check box to stop the Job
execution when Atlas-related issues occur, such as connection issues to Atlas. Otherwise, leave it clear to allow your Job to continue to run.
In the Username and Password fields, enter the authentication information for access to
Atlas. -
Executing the Job
Save your Job and press F6 to execute it. The first pass provides its matches to the pass that follows and more matches are identified with the rule and blocking key of the second pass. More customer records are grouped together after the second pass.