tMatchGroup
Creates groups of similar data records in any source data including large volumes
of data by using one or several match rules.
tMatchGroup compares columns in both
standard input data flows and in M/R input data flows by using matching methods and
groups similar encountered duplicates together.
Several tMatchGroup components can be used sequentially to match data
against different blocking keys. This refines the groups received by
each of the tMatchGroup components by
creating different data partitions that overlap the previous data blocks.
When a group is defined, the first processed record of each
group is the master record of the group. The distance between each of the other records
and the master records is computed, and each record is then assigned to the
appropriate master record accordingly.
Depending on the Talend
product you are using, this component can be used in one, some or all of the following
Job frameworks:
-
Standard: see tMatchGroup Standard properties.
The component in this framework is available in Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and in Talend Data Fabric.
-
MapReduce: see tMatchGroup MapReduce properties (deprecated).
The component in this framework is available in all Talend Platform products with Big Data and in Talend Data Fabric.
tMatchGroup Standard properties
These properties are used to configure tMatchGroup running in the Standard Job framework.
The Standard
tMatchGroup component belongs to the Data Quality family.
The component in this framework is available in Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and in Talend Data Fabric.
Basic settings
Schema and Edit schema |
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. Click Sync columns to retrieve the schema from the previous component. The output schema of this component contains the following read-only columns:
– GID: provides a group identifier of the data type String.
Note: In Jobs migrated from previous releases to your
current Talend Studio, the group identifier may be of the Long data type. To have a group identifier of the String data type, replace the tMatchGroup components in the migrated Jobs with tMatchGroup components from the Palette.
– GRP_SIZE: counts the number of records in the group; it is computed only on the master record.
– MASTER: identifies, by true or false, whether the record is the master record of its group. Each input record is compared to the master record; if they match, the input record is added to the group.
– SCORE: measures the distance between the input record and the master record, according to the selected matching algorithm.
– GRP_QUALITY: provides the quality of the group; only master records have a group quality score, which takes into account all record pairs in the group. Its exact computation depends on the selected
Matching Algorithm.
– MERGE_INFO: this output column is available only when the T-Swoosh algorithm is selected. |
 |
Built-In: You create and store the schema locally for this component only. |
 |
Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. |
Matching Algorithm |
Select from the list the algorithm you want to use in the component: Simple VSR or t-Swoosh. |
|
Click the import icon to import a match rule from the studio repository. In the Match Rule Selector wizard, select the match rule you want to import. For further information about how to import rules, see Importing match rules from the studio repository. |
Key Definition |
Input Key Attribute: Select the column(s) from the input flow on which you want to apply a matching algorithm. Note: When you select a date column on which to apply an algorithm or a matching
algorithm, you can decide what to compare in the date format. For example, if you want to only compare the year in the date, in the component schema set the type of the date column to Date and then enter “yyyy” in the Date Pattern field. The component then converts the date format to a string according to the pattern defined in the schema before starting a string comparison. |
 |
Matching Function Select a matching algorithm from the list:
Exact: matches each processed entry
Exact – ignore case: matches each
Soundex: matches
Levenshtein (edit distance):
Metaphone: based on a phonetic algorithm for indexing entries by their pronunciation.
Double Metaphone: based on a newer version of the Metaphone phonetic algorithm.
Soundex FR:
Jaro: matches processed entries
Jaro-Winkler: a variant of Jaro, but
Fingerprint key: matches entries
after doing the following sequential process:
q-grams: matches processed entries by
Hamming: calculates the minimum
custom…: enables you to load an external matching algorithm from a Java library. For further information, see the Custom Matcher option below. |
Custom Matcher | When you select custom as the matching type, set in this column the path pointing to the custom class (external matching algorithm) you need to use, for example a MyDistance.class class stored in a user-defined library (.jar file). An illustrative sketch of such a class follows the Basic settings table below. |
|
 |
Tokenized measure | Tokenization is the concept of splitting a string into words. Select the tokenization method to use:
NO: no tokenization method is used on the strings.
Same place: splits the two strings by words and compares the words at the same position.
Same order: splits the two strings by words and compares the words in the same order. This method should be used only with strings which have a few words.
Any order: splits the two strings by words and compares the words in any order. Using this method, “John Doe” and “Doe John” match. |
 |
Threshold | This column is displayed when you select the t-Swoosh algorithm. Two attribute values match when the computed probability is greater than or equal to this threshold. Set a threshold between 0 and 1. You can enter up to 6 decimals. |
 |
Confidence Weight | Set a numerical weight for each attribute (column) used as a key attribute. |
Handle Null | To handle null values, select from the list the null operator you want to use on the column:
Null Match Null: a Null attribute only matches another Null attribute.
Null Match None: a Null attribute never matches another attribute.
Null Match All: a Null attribute matches any value of an attribute.
For example, if we have two columns, name and firstname, and two records:
“Doe”, “John”
“Doe”, “”
Depending on the operator you choose, these two records may or may not match:
Null Match Null: they do not match.
Null Match None: they do not match.
Null Match All: they match.
And for the records:
“Doe”, “”
“Doe”, “”
Null Match Null: they match.
Null Match None: they do not match.
Null Match All: they match. |
|
 |
Survivorship Function (only available when the t-Swoosh algorithm is selected)
|
 | Reference column
If you set |
 | Parameter
If you set Survivorship If you set Survivorship |
Match Threshold |
Enter the match probability. Two data records match when the computed match score is greater than or equal to this value. You can enter a different match threshold for each match rule. |
Survivorship Rules For Columns (only available when the t-Swoosh algorithm is selected) |
Input Column: Select the column(s) from the input flow for which you want to define specific survivorship rules. Survivorship Function: Select how two similar records are merged to create the master record. |
Default Survivorship Rules (only available when the t-Swoosh algorithm is selected) |
Data Type: Select the data type(s) from the input flow for which you want to define default survivorship rules. Survivorship Function: Select how two similar records are merged to create the master record. |
Blocking Selection |
Input Column: If required, select the column(s) from the input flow to use as blocking key(s). Blocking reduces the number of pairs of records that need to be examined, as comparisons are restricted to record pairs within each block. Using blocking column(s) is very useful when you are processing very large data sets. |
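For illustration only, the following sketch shows what a custom matching class could look like. The class name, package and method signature are assumptions made for this example, not the actual Talend custom-matcher contract; refer to the documentation on custom matching algorithms for the exact interface to implement.

```java
// Hypothetical example only: the class name and method signature are assumptions,
// not the actual Talend custom-matcher API.
package org.example.matching;

/** Returns 1.0 for identical trimmed strings (ignoring case), 0.0 otherwise. */
public class MyDistance {

    // Assumed entry point: the component would call a similar method with the two
    // attribute values to compare and expect a similarity between 0 and 1.
    public double getMatchingWeight(String left, String right) {
        if (left == null || right == null) {
            return 0.0; // null handling is normally driven by the Handle Null option
        }
        return left.trim().equalsIgnoreCase(right.trim()) ? 1.0 : 0.0;
    }

    // Small standalone test so the class can be run outside a Job.
    public static void main(String[] args) {
        MyDistance d = new MyDistance();
        System.out.println(d.getMatchingWeight(" John Doe", "john doe")); // 1.0
        System.out.println(d.getMatchingWeight("John Doe", "Jon Doe"));   // 0.0
    }
}
```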
Advanced settings
Store on disk |
Select the Store on disk check box if you want to store processed data blocks on the disk to maximize system performance.
Max buffer size: Type in the size of physical memory, in number of rows, you want to allocate to processed data.
Temporary data directory path: Set the location where the temporary file should be stored. |
Multiple output |
Select the Separate output
check box to have several output flows: Uniques, Matches and Suspects.
Note:
The exact content of these output flows depends on whether you use the Simple VSR or the t-Swoosh algorithm; see the scenarios below. |
Confident match |
Set the confident match threshold: records with a group score greater than or equal to this threshold are grouped in the Matches output flow, and records with a lower group score are grouped in the Suspects output flow. |
Multi-pass |
Select this check box to enable a tMatchGroup component to receive the data sets from the tMatchGroup component that precedes it in the Job. Note: When using two tMatchGroup
components in a Job with this option, you must select this check box in both tMatchGroup components before linking them together. If you linked the components before selecting this check box, select this check box in the second component in the Job flow and then in the first component. Otherwise, you may have an issue as there are two columns in the output schema with the same name; selecting this check box in only one tMatchGroup component may cause schema mismatch issues. For an example Job, see Matching customer data through multiple passes.
Propagate original values: This |
Sort the output data by GID |
Select this check box to group the output data by the group identifier. The output is sorted in ascending alphanumeric order by group identifier. |
Output distance details |
Select this check box to add the MATCHING_DISTANCES output column, which provides the distance between the input record and the master record. Note: When using two tMatchGroup
components in a Job and this option, you must select this check box in both tMatchGroup components before linking them together. If you linked the components before selecting this check box, select this check box in the second component in the Job flow and then, in the first component. Otherwise, you may have an issue as there are two columns in the output schema with the same name. Selecting this check box in only one tMatchGroup component may cause schema mismatch issues. |
Display detailed labels |
Select this check box to have, in the output MATCHING_DISTANCES column, the names of the columns used as key attributes in addition to the computed distances. For example, if you try to match on first name and last name fields, the labels indicate which distance corresponds to which field. |
Deactivate matching computation when opening the wizard |
Select this check box to open the Configuration Wizard without running the match rules defined in it. This enables you to have a better experience with the component. |
tStatCatcher Statistics |
Select this check box to collect log data at the component level. Note |
Global Variables
Global Variables |
ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string.
A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.
To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.
For further information about variables, see the Talend Studio User Guide. |
Usage
Usage rule |
This component is an intermediary step. It requires an input flow as well as an output flow. |
Configuration wizard
The configuration wizard enables you to create different production environments,
Configurations, and their match rules.
You can also use
the configuration wizard to import match rules created and tested in the studio and use them
in your match Jobs. For further information, see Importing match rules from the studio repository.
You cannot open the configuration wizard unless you link the input component to the
tMatchGroup component.
To open the configuration wizard:
-
In the studio workspace, design your Job and link the components together, for
example as below: -
Double-click tMatchGroup; or right-click
it and from the contextual menu select Configuration
Wizard; or click Preview in the
basic settings view of
tMatchGroup. -
In the popup that opens, click Skip Computation
if you want to open the Configuration Wizard without running the match rules defined
in it. The configuration wizard is composed of:
-
the Configuration view, where you can set the
match rules and the blocking column(s). -
the matching chart, which presents the graphic matching result,
-
the matching table, which presents the details of the matching result.
The Limit field at the upper-left corner indicates the
maximum number of rows to be processed by the match rule(s) in the wizard. By default, the
maximum number of rows is 1000.
Configuration view
From this view, you can edit the configuration of the tMatchGroup component or define different configurations in which to execute
the Job.
You can use these different configurations for testing purposes for example, but
you can only save one configuration from the wizard, the open configuration.
In each configuration, you can define the parameters to generate match rules with the VSR
or the T-Swoosh algorithm. The settings of the Configuration view differ slightly depending on whether you select Simple VSR or T-Swoosh in the
basic settings of the tMatchGroup component.
You can define survivorship rules, blocking key(s) and multiple conditions using several
match rules. You can also set different match intervals for each rule. The match results on
multiple conditions will list data records that meet any of the defined rules. When a
configuration has multiple conditions, the Job conducts an OR match operation. It evaluates
data records against the first rule and the records that match are not evaluated against the
other rules.
-
The Key definition parameters.
-
The Match Threshold field.
-
A blocking key in the Blocking Selection
table (available only for rules with the VSR algorithm).Defining a blocking key is not mandatory but advisable as it partitions data
in blocks to reduce the number of records that need to be examined. For further
information about the blocking key, see Importing match rules from the studio repository. -
The Survivorship Rules for Columns parameters
(available only for rules with the T-Swoosh algorithm). -
The Default Survivorship Rules parameters for data
types (available only for rules with the T-Swoosh algorithm).
-
In the basic settings of the tMatchGroup
component, select Simple VSR from the Matching Algorithm list.It is important to have the same type of the matching algorithm selected in the
basic settings of the component and defined in the configuration wizard. Otherwise
the Job runs with default values for the parameters which are not compatible between
the two algorithms. -
In the basic settings of the tMatchGroup
component, click Preview to open the configuration
wizard. -
Click the [+] button on the top right corner of
the Configuration view.This creates, in a new tab, an exact copy of the last configuration. -
Edit or set the parameters for the new configuration in the Key definition and Blocking
Selection tables. -
If needed, define several match rules for the open configuration as
the following:-
Click the [+] button on the match rule bar to
create an exact copy of the last rule in a new tab. -
Set the parameters for the new rule in the Key
definition table and define its match interval. -
Follow the steps above to create as many match rules for a
configuration as needed. You can define a different match interval for
each rule.
When a configuration has multiple conditions, the Job conducts an OR
match operation. It evaluates data records against the first rule and the
records that match are not evaluated against the second rule and so
on. -
-
Click the Chart button at the top right corner of
the wizard to execute the Job in the open configuration. The matching results are displayed in the matching chart and table. Follow the steps above to create as many new configurations in the wizard as
needed. -
To execute the Job in a specific configuration, open the configuration in the
wizard and click the Chart button.The matching results are displayed in the matching chart and table. -
At the bottom right corner of the wizard, click either:
-
OK to save the open configuration.
You can save only one configuration in the wizard.
-
Cancel to close the wizard and keep the
configuration saved initially in the wizard.
-
For an example of a match rule with the T-Swoosh algorithm, see Using survivorship functions to merge two records and create a master record.
Matching chart
From the matching chart, you can have a global picture about the duplicates in the
analyzed data.
The Hide groups less than parameter, which is set to
2 by default, enables you to decide what groups to show in the result
chart. Usually you want to hide groups of small group size.
-
48 items are analyzed and classified into 18 groups according to a given match
rule and after excluding items that are unique, by setting the Hide groups less than parameter to 2. -
11 groups have 2 items each. In each group, the 2 items are duplicates of each
other. -
3 groups have 3 items each. In each group, these items are duplicates of one
another. -
3 groups have 4 items each. In each group, these items are duplicates of one
another. -
One single group has 5 duplicate items.
Matching table
From the matching table, you can read details about the different duplicates.
This table indicates the matching details of items in each group and colors the groups in
accordance with their color in the matching chart.
You can decide what groups to show in this table by setting the Hide
groups of less than parameter. This parameter enables you to hide groups of small
group size. It is set to 2 by default.
The buttons under the table help you to navigate back and forth through the pages.
Importing match rules from the studio repository
From the tMatchGroup configuration wizard, you can import match keys from the
match rules created and tested in the
Profiling
perspective. You can then use these imported matching keys in your match
Jobs.
The tMatchGroup component enables you to import from the
Studio repository match rules based on the VSR or the T-Swoosh algorithms.
The VSR algorithm takes a set of records as input and groups similar encountered
duplicates together according to defined match rules. It compares pairs of records and
assigns them to groups. The first processed record of each group is the master record of the
group. The VSR algorithm compares each record with the master of each group and uses the
computed distances from the master records to decide which group the record should go to.
The T-Swoosh algorithm enables you to find duplicates and to define how
two similar records are merged to create a master record, using a survivorship function.
These new merged records are used to find new duplicates. The difference with the VSR
algorithm is that the master record is in general a new record that does not exist in
the list of input records.
-
From the configuration wizard, click the
icon on the top right corner.
The Match Rule Selector wizard opens listing
all match rules created in the studio and saved in the repository. -
Select the match rule you want to import into the tMatchGroup component and use on your data.
A warning message displays in the wizard if the match rule you want to import is
defined on columns that do not exist in the input schema of tMatchGroup. You can define input columns later in the configuration
wizard.It is important to have the same type of the matching algorithm selected in the
basic settings of the component and imported from the configuration wizard.
Otherwise the Job runs with default values for the parameters which are not
compatible between the two algorithms. -
Select the Overwrite current Match Rule in the
analysis check box if you want to replace the rule in the
configuration wizard with the rule you import.If you leave the box unselected, the match keys will be imported in a new match
rule tab without overwriting the current match rule in the wizard. -
Click OK.
The matching key is imported from the match rule and listed as a new rule in the
configuration wizard. -
Click in the Input Key Attribute and select from
the input data the column on which you want to apply the matching key. -
In the Match threshold
field, enter the match probability threshold. Two data records match when the
computed match score is greater than or equal to this value. -
In the Blocking Selection table, select the
column(s) from the input flow which you want to use as a blocking key.Defining a blocking key is not mandatory but advisable. Using a blocking key
partitions data in blocks and so reduces the number of records that need to be
examined, as comparisons are restricted to record pairs within each block. Using
blocking key(s) is very useful when you are processing large data sets. The Blocking Selection table in the component is
different from the Generation of Blocking Key table
in the match rule editor in the
Profiling
perspective. The blocking column in tMatchGroup can come
from a tGenKey component (and would be called
T_GEN_KEY) or directly from the input schema (it could be a
ZIP column, for instance). The Generation of Blocking Key table in the match rule editor, on the other hand, defines
the parameters necessary to generate a blocking key; that table is equivalent to the
tGenKey component and generates a blocking column named
BLOCK_KEY which is used for blocking. -
Click the Chart button in the top right corner of
the wizard to execute the Job using the imported match rule and show the matching
results in the wizard.
Using tMatchGroup with the Simple VSR Matcher and T-Swoosh algorithms
Matching two records
When matching two records, you can also define how to merge similar records to create a master record.
Creating a master record is an iterative process: each new master record can be used to
find new duplicates.
You can choose between two different algorithms to create master records:
- Simple VSR Matcher
- T-Swoosh
The main difference between the two algorithms is that T-Swoosh creates, for each master
record, a new record that does not exist in the list of input records.
Match rules
Two records match if at least one of the match rules is satisfied. As soon as two records
match according to a given rule, the other rules are not checked.
Matching several records
Blocking
You can use one or many blocking keys to split the input dataset into smaller datasets called
blocks.
Within a block, all records have the same blocking key value. Then, each block is processed
independently.
Using blocking keys reduces the time needed by the Simple VSR Matcher and the T-Swoosh
algorithms to process data. For example, if 100,000 records are split into 100 blocks of
1,000 records each, the number of comparisons is reduced by a factor of 100. This means
the algorithm runs around 100 times faster.
It is recommended to use the tGenKey component to generate blocking
keys and to visualize the statistics regarding the number of blocks. In a Job,
right-click the tGenKey component and select View Key
Profile in the contextual menu to visualize the distribution of the
number of blocks according to their size.
In this example, the average block size is around 40.
For the 13 blocks with 38 rows, there are 18,772 comparisons in these 13 blocks
(13 × 38²). If records are compared with four columns, this means there will be 75,088
string comparisons in these 13 blocks (18,772 × 4).
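As a quick sanity check of the arithmetic above, the standalone snippet below reproduces the quoted counts using the same block-size-squared approximation; it is independent of any Talend component.

```java
/** Rough comparison-count estimate for blocking, using the same
 *  "block size squared" approximation as the text above. */
public class BlockingEstimate {
    public static void main(String[] args) {
        int blocks = 13;
        int rowsPerBlock = 38;
        int matchingColumns = 4;

        long recordComparisons = (long) blocks * rowsPerBlock * rowsPerBlock;
        long stringComparisons = recordComparisons * matchingColumns;

        System.out.println("Record comparisons: " + recordComparisons); // 18772
        System.out.println("String comparisons: " + stringComparisons); // 75088
    }
}
```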
The Simple VSR Matcher algorithm
The Simple VSR Matcher algorithm compares each record with the previous master records in the lookup table.
If a record does not match any of the previous master records, it is considered as a new
master record and added to the lookup table. This means that the first record of the
dataset is necessarily a master record. So, the order of the records is important and
can have an impact on the creation process of the master records.
When a record matches a master record, the Simple VSR Matcher algorithm does not
attempt to match it with other master records, because the master records in the lookup
table are not similar to one another: once a record matches a master record, the chance of it matching
another master record is low.
This means a record can only exist in one group of records and be linked to one master
record.
For example, take the following set of records as input:
id | fullName |
---|---|
1 | John Doe |
2 | Donna Lewis |
3 | John B. Doe |
4 | Louis Armstrong |
The algorithm processes the input records as follows:
- The algorithm takes record 1 and compares it with an empty set of records. Since
record 1 does not match any record, it is added to the lookup table. - The algorithm takes record 2 and compares it with record 1. Since it is not a match,
record 2 is added to the lookup table. - The algorithm takes record 3 and compares it with record 1 and record 2. Record 3
matches record 1. So, record 3 is added to the group of record 1. - The algorithm takes record 4 and compares it with record 1 and record 2 but not with
record 3, which is not a master record. Since it is not a match, record 4 is added
to the lookup table.
The output will look like this:
id | fullName | Grp_ID | Grp_Size | Master | Score | GRP_QUALITY |
---|---|---|---|---|---|---|
1 | John Doe | 0 | 2 | true | 1.0 | 0.72 |
3 | John B. Doe | 0 | 0 | false | 0.72 | 0 |
2 | Donna Lewis | 1 | 1 | true | 1.0 | 1.0 |
4 | Louis Armstrong | 2 | 1 | true | 1.0 | 1.0 |
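For readers who prefer code to prose, the standalone sketch below mimics the grouping logic just described: each record is compared only with the current master records and joins the first group whose master it matches, otherwise it becomes a new master. The record representation, the similarity function and the threshold are simplifications invented for this example; this is not the component's implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified illustration of the Simple VSR grouping idea (not the Talend code). */
public class SimpleVsrSketch {

    // Toy similarity: shared-token ratio; stands in for Jaro-Winkler, Levenshtein, etc.
    static double similarity(String a, String b) {
        List<String> ta = List.of(a.toLowerCase().split("\\s+"));
        List<String> tb = List.of(b.toLowerCase().split("\\s+"));
        long shared = ta.stream().filter(tb::contains).count();
        return (double) shared / Math.max(ta.size(), tb.size());
    }

    public static void main(String[] args) {
        double matchThreshold = 0.6; // assumed value for the example
        List<String> records = List.of("John Doe", "Donna Lewis", "John B. Doe", "Louis Armstrong");

        // Master record -> members of its group (the master is the first record of the group).
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String record : records) {
            String matchedMaster = null;
            for (String master : groups.keySet()) {
                if (similarity(record, master) >= matchThreshold) {
                    matchedMaster = master;
                    break; // a record is compared with masters only, and joins the first match
                }
            }
            if (matchedMaster == null) {
                groups.put(record, new ArrayList<>(List.of(record))); // new master record
            } else {
                groups.get(matchedMaster).add(record);
            }
        }
        groups.forEach((master, members) -> System.out.println(master + " -> " + members));
    }
}
```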
The T-Swoosh algorithm
The T-Swoosh algorithm is based on the same idea as the Simple VSR Matcher algorithm,
but it creates a master record instead of considering existing records to be master
records.
The order of the input records does not impact the matching process.
To create master records, you can design survivorship rules to decide which attribute
will survive.
There are two types of survivorship rules:
- The rules related to matching keys: each attribute used as a matching key can have a
specific survivorship rule. - The default rules: they are applied to all the attributes of the same data type
(Boolean, String, Date, Number).
If a column is a matching key, the rule related to matching keys specific to this column
is applied.
If the column is not a matching key, the default survivorship rule for this data type is
applied. If the default survivorship rule is not defined for the data type, the
Most common survivorship function is used.
Each time two records are merged to create a new master record, this new master record is
added to the queue of records to be examined. The two records that are merged are
removed from the lookup table.
For example, take the following set of records as input:
id | fullName |
---|---|
1 | John Doe |
2 | Donna Lewis |
3 | John B. Doe |
4 | Johnnie B. Doe |
The survivorship rule uses the Concatenate function with
“,” as a parameter to separate values.
At the beginning of the process, the queue contains all the input records and the lookup
is empty. To process the input records, the algorithm iterates until the queue is
empty:
- The algorithm takes record 1 and compares it with an empty set of records. Since
record 1 does not match any record, it is added to the set of master records. The
queue contains now record 2, record 3 and record 4. The lookup contains record
1. - The algorithm takes record 2 and compares it with record 1. Since record 2 does not
match any record, it is added to the set of master records. The queue contains now
record 3 and record 4. The lookup contains record 1 and record 2. - The algorithm takes record 3 and compares it with record 1. Record 3 matches record
1. So, record 1 and record 3 are merged to create a new master record called record
1,3. The queue contains now record 4 and record 1,3. The lookup contains record
2. - The algorithm takes record 4 and compares it with record 2. Since it is not a match,
record 4 is added to the set of master records. The queue contains now record 1,3.
The lookup table contains record 2 and record 4. - The algorithm takes record 1,3 and compares it with record 2 and record 4. Record
1,3 matches record 4. So, record 1,3 and record 4 are merged to create a new master
record called record 1,3,4. Record 4 is removed from the lookup table. Since record
1,3 was the result of a previous merge, it is removed from the table. The queue now
contains record 1,3,4. The lookup contains record 2. - The algorithm takes record 1,3,4 and compares it with record 2. Since it is not a
match, record 1,3,4 is added to the set of master records. The queue is now empty.
The lookup contains records 1,3,4 and record 2.
The output will look like this:
id | fullName | GRP_ID | GRP_SIZE | MASTER | SCORE | GRP_QUALITY |
---|---|---|---|---|---|---|
1,3,4 | John Doe, John B. Doe, Johnnie B. Doe | 0 | 3 | true | 1.0 | 0.449 |
1 | John Doe | 0 | 0 | false | 0.72 | 0 |
3 | John B. Doe | 0 | 0 | false | 0.72 | 0 |
4 | Johnnie B. Doe | 0 | 0 | false | 0.78 | 0 |
2 | Donna Lewis | 1 | 1 | true | 1.0 | 1.0 |
As you can see in this example, the value in the GRP_QUALITY column can be less than the
Match Threshold parameter. That is
because a group is created from record pairs with a matching score greater than
or equal to the Match Threshold but the
records are not all compared to each other; whereas GRP_QUALITY takes into account all record pairs in the
group.
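The following standalone sketch mimics the queue-and-merge behaviour described above, with a Concatenate-style survivorship that joins the merged values with a comma. The similarity test, the threshold and the data structures are simplifications invented for this example; the actual T-Swoosh implementation in the component is more elaborate.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Simplified illustration of the T-Swoosh merge loop (not the Talend code). */
public class TSwooshSketch {

    // Toy similarity: a record matches if any of its comma-separated parts
    // shares at least two tokens with any part of the other record.
    static boolean matches(String a, String b, int minSharedTokens) {
        for (String pa : a.split(",")) {
            for (String pb : b.split(",")) {
                List<String> ta = List.of(pa.trim().toLowerCase().split("\\s+"));
                long shared = List.of(pb.trim().toLowerCase().split("\\s+"))
                        .stream().filter(ta::contains).count();
                if (shared >= minSharedTokens) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Deque<String> queue = new ArrayDeque<>(
                List.of("John Doe", "Donna Lewis", "John B. Doe", "Johnnie B. Doe"));
        List<String> lookup = new ArrayList<>(); // current master records

        while (!queue.isEmpty()) {
            String record = queue.poll();
            String matched = null;
            for (String master : lookup) {
                if (matches(record, master, 2)) {
                    matched = master;
                    break;
                }
            }
            if (matched == null) {
                lookup.add(record); // no match: the record becomes a master record
            } else {
                lookup.remove(matched);            // merged records leave the lookup...
                queue.add(record + ", " + matched); // ...and the merged record is re-examined
            }
        }
        System.out.println(lookup); // e.g. [Donna Lewis, <merged John Doe record>]
    }
}
```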
The differences between the Simple VSR Matcher and the T-Swoosh algorithms
When processing the input data using the T-Swoosh algorithm, there
may be more iterations than the number of input records because a merged record may be
created on each iteration and added to the queue.
This is one of the main differences
between the Simple VSR Matcher and the T-Swoosh algorithms. The T-Swoosh algorithm also performs more
comparisons per iteration than the Simple VSR matcher algorithm:
- When using the Simple VSR matcher algorithm, the record from the queue is only
compared with the value of the master record. There is no comparison between the
record from the queue and the value of each of the records used to build this
master record. Therefore, sort the input records so that the most trustworthy records
appear first in the input data. - When using the T-Swoosh algorithm, the record from the queue is compared with
the value of the master record and the value of each of the records used to
build this master record, until records are considered a match. You can find an example of how to survive
master records using the T-Swoosh algorithm in Talend Help Center (https://help.talend.com). In this example, the record “John Doe, John B. Doe” is compared
with the record “John B. Doe” on iteration 5. There is a match if at least
one of the three strings “John Doe, John B. Doe”, “John Doe” and “John B.
Doe” matches the string “Johnnie B. Doe”.
Multi-pass matching
You can design a Job with consecutive tMatchGroup components to
create data partitions based on different blocking keys.
For example, you want to find duplicates having either the same city or the same zip code
in a customer database. In this particular case, you can use two consecutive
tMatchGroup to process the data partitions:
- One tMatchGroup component in which the column “city” is defined
as a blocking key. - One tMatchGroup component in which the column “ZipCode” is
defined as a blocking key.
What is multi-pass matching?
The idea behind multi-pass matching is to reuse the master records defined in the
previous pass as the input of the current tMatchGroup component.
Multi-pass matching is more effective if the blocking keys are only weakly
correlated. For example, it is not relevant to define the column “country” as a
blocking key and the column “city” as another blocking key because all the
comparisons made with the blocking key “city” will also be done with blocking key
“country”.
When using multi-pass matching with the Simple VSR matcher algorithm, only master
records of size 1 – records that did not match any record – are compared with master
records of any size. There are no comparisons between two master records that are
derived from at least two children each.
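The sketch below illustrates, outside of Talend, which pairs of first-pass master records a second pass would compare under this rule: a pair is skipped unless at least one of the two masters is a singleton. The record type and values are invented for the example.

```java
import java.util.List;

/** Illustration of which master-record pairs are compared in a second pass
 *  with the Simple VSR matcher (not the Talend implementation). */
public class MultiPassPairs {

    static final class Master {
        final String name;
        final int groupSize;
        Master(String name, int groupSize) { this.name = name; this.groupSize = groupSize; }
    }

    public static void main(String[] args) {
        // Master records produced by a first pass, with their group sizes.
        List<Master> masters = List.of(
                new Master("John Doe", 2),     // grouped with another record in the first pass
                new Master("John B. Doe", 2),  // grouped with another record in the first pass
                new Master("Donna Lewis", 1)); // unmatched in the first pass

        for (int i = 0; i < masters.size(); i++) {
            for (int j = i + 1; j < masters.size(); j++) {
                Master a = masters.get(i);
                Master b = masters.get(j);
                // A pair is compared only if at least one master is still a singleton.
                boolean compared = a.groupSize == 1 || b.groupSize == 1;
                System.out.println(a.name + " vs " + b.name + " -> "
                        + (compared ? "compared" : "skipped"));
            }
        }
    }
}
```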
An example of multi-pass matching
In the following example, the dataset contains four records. It is assumed that the
first tMatchGroup component has a blocking key on the column
“ZipCode”, and the second tMatchGroup component has a blocking
key on the column “city”. The attribute “name” is used as a matching key.
id | name | city | ZipCode |
---|---|---|---|
1 | John Doe | Nantes | 44000 |
2 | John B. Doe | Nantes |  |
3 | Jon Doe | Nantes | 44000 |
4 | John Doe | Nantes |  |
After the first pass, records 1 and 3 are grouped, and records 2 and 4 are grouped.
In these groups, record 1 and record 2 are master records.
In the second tMatchGroup, only the master records from the first
pass, record 1 and record 2, are considered. Since their group sizes are both strictly
greater than 1, they are not compared. Therefore, the order in which the input records
are sorted is very important.
The following results are returned:
id | name | city | ZipCode | GID | GRP_SIZE | MASTER | SCORE | GRP_QUALITY |
---|---|---|---|---|---|---|---|---|
1 | John Doe | Nantes | 44000 | 0 | 2 | true | 1.0 | 0.875 |
3 | Jon Doe | Nantes | 44000 | 0 | 0 | false | 0.85 | 0 |
2 | John B. Doe | Nantes |  | 1 | 2 | true | 1.0 | 0.727 |
4 | John Doe | Nantes |  | 1 | 0 | false | 0.72 | 0 |
When running the T-Swoosh algorithm with the same parameters and the Most
common survivorship function, the following results are
returned:
id | name | city | ZipCode | GID | GRP_SIZE | MASTER | SCORE | GRP_QUALITY |
---|---|---|---|---|---|---|---|---|
1 | John Doe | Nantes | 44000 | 0 | 4 | true | 1.0 | 0.727 |
1 | John Doe | Nantes | 44000 | 0 | 0 | true | 0.875 | 0 |
3 | Jon Doe | Nantes | 44000 | 0 | 0 | false | 0.875 | 0 |
2 | John B. Doe | Nantes |  | 0 | 0 | true | 0.72 | 0 |
4 | John Doe | Nantes |  | 1 | 0 | false | 0.72 | 0 |
Grouping output data in separate flows according to the minimal distance
computed in each record
The following scenarios group the output data using the Simple VSR and T-Swoosh algorithms in the tMatchGroup component:
For another scenario that groups the output records in one single output flow, see
Comparing columns and grouping in the output flow duplicate records that have the same functional key.
Simple VSR algorithm
This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.
In this scenario, the Job matches the records of an
input file using the Jaro-Winkler matching method on the lname and fname columns
and the q-grams matching method
on the address1 column. It then groups the
output records in three output flows (a standalone Jaro-Winkler scoring sketch follows the list below):
-
Uniques: lists the
records whose group size is
equal to 1. -
Matches: lists the
records whose group score (minimal distance computed in the group) is
greater than or equal to the threshold you define in the Confident match threshold
field. -
Suspects: lists
the records whose group score (minimal distance computed in the group)
is less than the threshold you define in the Confident match threshold field.
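If you want to get a feel for Jaro-Winkler scores before running the Job, you can compute them outside Talend, for example with the Apache Commons Text library as sketched below. This library is not used by the component itself, and the threshold value shown is only an example.

```java
import org.apache.commons.text.similarity.JaroWinklerSimilarity;

/** Standalone check of Jaro-Winkler scores, independent of the Talend Job. */
public class JaroWinklerCheck {
    public static void main(String[] args) {
        JaroWinklerSimilarity jw = new JaroWinklerSimilarity();
        double threshold = 0.85; // example match interval, adjust to your rule

        String[][] pairs = {
                {"Doe", "Doe"},
                {"Johnson", "Jonson"},
                {"Smith", "Smyth"}
        };
        for (String[] pair : pairs) {
            double score = jw.apply(pair[0], pair[1]);
            System.out.printf("%s / %s -> %.3f (%s)%n",
                    pair[0], pair[1], score, score >= threshold ? "match" : "no match");
        }
    }
}
```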
Setting up the Job
- Drop the following components from the Palette onto the design workspace: tFileInputExcel, tMatchGroup and three tLogRow components.
-
Connect tFileInputExcel to tMatchGroup using the Main
link. - Click the tMatchGroup component and select the Separate output check box in the Advanced settings view.
-
Connect tMatchGroup to
the three tLogRow components using the
Uniques, Matches and Suspects
links.
Configuring the input component
The main input file contains eight columns: account_num,
lname, fname,
mi, address1,
city, state_province and
postal_code. The data in this input file has problems
such as duplication, names spelled differently or wrongly, different information
for the same customer.
You can create the input file used in this scenario by executing the
c0 and c1 Jobs included in the
data quality demo project, TDQEEDEMOJAVA, which you can import from the login window
of your
Talend Studio
. For
further information, see the
Talend Studio User
Guide.
-
In the Basic settings view of tFileInputExcel, fill in the File Name field by browsing to the input file
and set other properties in case they are not stored in the Repository. -
Create the schema through the Edit Schema
button, if the schema is not already stored in the Repository. Remember to set the data type in the Type column.
Configuring the tMatchGroup component
-
Double-click tMatchGroup to display the
Basic settings view and define the
component properties. -
From the Matching Algorithm list, select
Simple VSR.In this scenario, the match rule is based on the VSR algorithm. -
Click Sync columns to retrieve the schema
from the preceding component. -
Click the Edit schema button to view the
input and output schema and do any modifications in the output schema, if
necessary. In the output schema of this component, there are a few standard output columns
that are read-only. For more information, see tMatchGroup Standard properties. - Click OK to close the dialog box.
-
Click the […] button next to
Configure match rules to open the configuration
wizard and define the component configuration and the match rule(s).You can use the configuration wizard to import match rules
created and tested in the studio and stored in the repository, and use them in
your match Jobs. For further information, see Importing match rules from the studio repository.It is important to import or define the same type of the rule
selected in the basic settings of the component, otherwise the Job runs with
default values for the parameters which are not compatible between the two
algorithms. -
Define the first match rule as the following:
-
In the Key definition table, click
the [+] button to add to the list the
column(s) on which you want to do the matching operation,
lname and fname.Note: When you select a date column on which to apply an algorithm or a matching
algorithm, you can decide what to compare in the date format.For example,
if you want to only compare the year in the date, in the component schema set the type
of the date column to Date and then enter “yyyy” in the Date Pattern field.
The component then converts the date format to a string according to the pattern defined
in the schema before starting a string comparison. -
Click in the Matching Function column
and select from the list Jaro-Winkler
as the method to be used for the matching operation.If you select custom as a matching
type, you must set in the Custom Matcher
column the path pointing to the custom class (external
matching algorithm) you need to use. This path is defined by yourself in
the library file (.jar file). -
From the Tokenized measure list,
select No so that no tokenized distance is used for the selected
algorithms. -
Click in the cell of the Weight
column to set the numerical weights for the two columns used as key
attributes. -
Click in the cell of the Handle Null
column and select the null operator you want to use to handle null
attributes in the columns. In this example, select Null Match None in order to have matching results where
null values have minimal effect. -
Set the match probability in the Match
Interval field.
-
-
Follow the same procedure in the above step to define the second match rule.
Set the address1 column as an input attribute and select
Jaro as the matching type. Select Null Match None as the null operator. And finally set
the match probability which can be different from the one set for the first
rule. -
Set the Hide groups of less than parameter in
order to decide what groups to show in the result chart and matching table. This
parameter enables you to hide groups of small group size. -
Click the Advanced settings tab and set the
advanced parameters for the tMatchGroup
component as the following:-
Select the Separate output check box.
The component will have three separate output flows: Unique rows, Confident
groups and Uncertain
groups.If this check box is not selected, the tMatchGroup component will have only one output flow
where it groups all output data. For an example scenario, see Comparing columns and grouping in the output flow duplicate records that have the same functional key. -
Select the Sort the output data by
GID check box to sort the output data by their group
identifier. -
Select the Output distance details
and Display detailed labels check
boxes.The component will output the MATCHING_DISTANCES
column. This column provides the distance between the input and the
master columns giving also the names of the columns against which the
records are matched. -
Select the Deactivate matching computation when
opening the wizard check box if you do not want to run
the match rules the next time you open the wizard.
-
-
Click the Chart button in the wizard to
execute the Job in the defined configuration and have the matching results
directly in the wizard.The matching chart gives a global picture about the duplicates in the analyzed
data. The matching table indicates the details of items in each group and colors
the groups in accordance with their color in the matching chart.The Job conducts an OR match operation on the
records. It evaluates the records against the first rule and the records that
match are not evaluated against the second rule. The MATCHING_DISTANCES column allows you to
understand which rule has been used on what records. Some records are matched
according to the second rule that uses address1 as a key attribute, whereas the other records in the group are
matched according to the first rule which uses the lname and fname
as key attributes.You can set the Hide groups of less than
parameter in order to decide what groups to show in the matching chart and
table
Finalizing the Job and executing it
-
Double-click each of the tLogRow
components to display the Basic
settings view and define the component properties. -
Save your Job and press F6 to execute
it.
You can see that records are grouped together in three different groups.
Each record is listed in one of the three groups according to the value of
the group score which is the minimal distance computed in the group.
The identifier for each group, which is of String
data type, is listed in the GID column next to the
corresponding record. This identifier will be of the data type
Long for Jobs that are migrated from older
releases. To have the group identifier as String, you
must replace the tMatchGroup component in
the imported Job with tMatchGroup from the
studio Palette.
The number of records in each of the three output blocks is listed
in the GRP_SIZE column and computed only on the master record.
The MASTER column indicates with true or false whether the
corresponding record is a master record or not. The SCORE
column lists the calculated distance between the input record and the master record
according to the Jaro-Winkler and Jaro matching algorithms.
The Job evaluates the records against the first rule and the records that
match are not evaluated against the second rule.
All records with a group score between the match
interval, 0.95 or 0.85 depending on the applied rule, and the confidence
threshold defined in the advanced settings of tMatchGroup are listed in the Suspects output flow.
All records with a group score greater than or
equal to the confident match threshold are listed in the Matches output flow.
All records with a group size equal to 1 are listed in the
Uniques output flow.
For another scenario that groups the output records in one single output flow
based on a generated functional key, see Comparing columns and grouping in the output flow duplicate records that have the same functional key.
T-Swoosh algorithm
This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.
In this scenario, the Job matches the records of an
input file using the Jaro-Winkler matching method on the lname and fname columns.
It then groups the output records in three output flows:
-
Uniques: lists the
records whose group size is
equal to 1. -
Matches: lists the
records whose group quality is greater than or equal to the threshold
you define in the Confident
match threshold field. -
Suspects: lists
the records whose group quality is less than the threshold you define in
the Confident match
threshold field.
The group quality is the minimal distance computed among all
record pairs of the group.
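As a small illustration of this definition, the snippet below takes the pairwise match scores of one group and keeps the minimum as the group quality; the scores are made up for the example and the real computation is performed by the component.

```java
import java.util.List;

/** Illustration: group quality as the minimum of the pairwise scores in a group. */
public class GroupQualitySketch {
    public static void main(String[] args) {
        // Made-up pairwise match scores for one group of three records.
        List<Double> pairScores = List.of(0.95, 0.88, 0.81);

        double groupQuality = pairScores.stream()
                .mapToDouble(Double::doubleValue)
                .min()
                .orElse(1.0); // a single-record group keeps a quality of 1.0

        System.out.println("GRP_QUALITY = " + groupQuality); // 0.81
    }
}
```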
Setting up the Job
- Drop the following components from the Palette onto the design workspace: tFixedFlowInput, tMatchGroup and three tLogRow components.
-
Connect tFixedFlowInput
to tMatchGroup using the Main link. - Click the tMatchGroup component and select the Separate output check box in the Advanced settings view.
-
Connect tMatchGroup to
the three tLogRow components using the
Uniques, Matches and Suspects
links.
Configuring the input component
The input data contain seven columns: lname, fname, middle_name, address, city, state_province
and postal_code. The data have problems such as duplication,
names spelled differently or wrongly, different information for the same customer.
-
Double-click the tFixedFlowInput
component to display its Basic settings view. -
Select Built-in and click the […]
button next to Edit Schema. - Define the seven columns and click OK.
- Select Use Inline Content (delimited file).
-
Fill in the Row Separator and Field
Separator fields. - Enter the input data in the Content field.
Configuring the tMatchGroup component
-
Click the tMatchGroup
component to display its Basic settings
view. -
From the Matching
Algorithm list, select t-Swoosh. -
Click Sync columns to retrieve the schema
from the preceding component. -
Click the Edit schema button to view the
input and output schema and do any modifications in the output schema, if
necessary.In the output schema of this component there are output standard columns
that are read-only. For more information, see the tMatchGroup
properties. - Click OK to close the dialog box.
-
Click the Advanced
settings tab and select the following check boxes:-
Separate output
The component will have three separate output flows:
Uniques,
Matches
and Suspects.If this check box is not selected, the tMatchGroup component will have only
one output flow where it groups all output data. For an example
scenario, see Comparing columns and grouping in the output flow duplicate records that have the same functional key. -
Sort the output data by
GID - Output distance details
- Display detailed labels
-
Deactivate matching
computation when opening the wizard
-
Separate output
-
Click the […] button next to Configure match
rules to define the component configuration and the match
rule(s).You can use the configuration wizard to import
match rules created and tested in the studio and stored in the repository, and
use them in your match Jobs. For further information, see Importing match rules from the studio repository.It is important to import or define the same
type of the rule selected in the basic settings of the component. Otherwise the
Job runs with default values for the parameters which are not compatible between
the two algorithms. -
Define the match rule as the following:
-
In the Key definition table,
click the [+] button to add to
the list the column(s) on which you want to do the matching operation,
fname and lname.Note: When you select a date column on which to apply an algorithm or a matching
algorithm, you can decide what to compare in the date format.For example,
if you want to only compare the year in the date, in the component schema set the type
of the date column to Date and then enter “yyyy” in the Date Pattern field.
The component then converts the date format to a string according to the pattern defined
in the schema before starting a string comparison. -
Click in the Matching Function
column and select from the list Jaro-Winkler as the method to be used for the matching
operation.If you select custom
as a matching type, you must set in the Custom
Matcher column the path pointing to the custom class
(external matching algorithm) you need to use. This path is defined by
yourself in the library file (.jar file). -
From the Tokenized measure list,
select No. -
Click in the cell of the Threshold column and
enter 0.7 for fname and 0.4 for lname. -
Click in the cell of the Confidence
Weight column to set the numerical weights for the two
columns used as key attributes: 1 for fname and 4 for lname. -
Click in the cell of the Handle Null column and select the null
operator you want to use to handle null attributes in the columns. In
this example, select Null
Match NONE in order to have matching results where null
values have minimal effect. -
Select Most common in the
Survivorship Function.
-
In the Key definition table,
-
Follow the same procedure in the above step to define the second match rule and
set the parameters as follows:-
Click the [+] button (Duplicate
Rule). - Input Key Attribute: address
-
Matching Function:
Jaro -
Tokenized Measure:
No - Threshold: 0.8
- Confidence Weight: 1
-
Handle Null: Null Match
NONE -
Survivorship Function:
Most common
-
Click the [+] button (Duplicate
-
Set the Match Threshold
parameter of each Match Rule to 0.8. -
Set the Hide groups of less than parameter to 2. This
parameter enables you to hide groups of small size. -
Click the Chart button
to execute the Job in the defined configuration and have the matching results
directly in the wizard.The matching chart gives a global picture about the duplicates
in the analyzed data. The matching table indicates the details of items in each
group and colors the groups in accordance with their color in the matching
chart.The Job conducts an OR match operation on the records. It
evaluates the records against the rule. The MATCHING_DISTANCES column allows you to
understand which rule has been used on what records. For example, in the
second data group (brick red), the last Amic record is matched according to the second rule that uses address as a key attribute,
whereas the other records in the group are matched according to the first
rule which uses the lname and fname as key
attributes.As you can see in this example, the value in the GRP_QUALITY column can be less than the
Match Threshold parameter. That is
because a group is created from record pairs with a matching score greater than
or equal to the Match Threshold but the
records are not all compared to each other; whereas GRP_QUALITY takes into account all record pairs in the
group.
Finalizing the Job and executing it
-
Double-click each tLogRow
component to display the Basic
settings view. - Select Table (print values in cells of a table).
-
Save your Job and press F6 to execute
it.
You can see that records are grouped together in three different groups.
Each record is listed in one of the three groups according to the value of
the group score which is the minimal distance computed in the group.
The identifier for each group, which is of String
data type, is listed in the GID column next to the
corresponding record. This identifier is of the data type Long for Jobs that are migrated from older releases. To have the group
identifier as String, replace the tMatchGroup component in the imported Job with
tMatchGroup from the studio
Palette.
The number of records in each of the three output blocks
is listed in the GRP_SIZE column and
computed only on the master record. The MASTER column indicates with true or false whether the corresponding
record is a master record or not. The SCORE column lists the calculated distance between the input record and
the master record according to the Jaro-Winkler and Jaro matching algorithms.
The Job evaluates the records against the first rule and the records that
match are not evaluated against the second rule.
All records with a group quality between the match
threshold, 0.8 in this scenario, and the confidence
threshold defined in the advanced settings of tMatchGroup are listed in the Suspects output flow.
All records with a group quality greater than or equal to
the confident match threshold are listed in the Matches output flow.
All records with a group size equal to 1 are listed in the
Uniques output
flow.
For another scenario that groups the output records in one single output flow
based on a generated functional key, see Comparing columns and grouping in the output flow duplicate records that have the same functional key.
Matching customer data through multiple passes
This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.
The Job in this scenario groups similar customer records by running
through two subsequent matching passes (tMatchGroup
components) and outputs the calculated matches in groups. Each pass provides its matches
to the pass that follows in order for the latter to add more matches identified with new
rules and blocking keys.
In this Job:
-
The tMysqlInput component
connects to the customer records to be processed. -
Each of the tGenKey
components defines a way to partition data records. The first key partitions
data into many groups and the second key creates fewer groups that overlap the
previous blocks, depending on the blocking key definition. -
The tMap component renames
the key generated by the second tGenKey
component. -
The first tMatchGroup
processes the partitions defined by the first tGenKey, and the second tMatchGroup processes those defined by the second tGenKey. Warning: The two tMatchGroup
components must have the same schema. -
The tLogRow component
presents the matching results after the two passes.
Setting up the Job
In this scenario, the main input schema is already stored in the Repository. For more information about storing schema
metadata in the repository, see the
Talend Studio User
Guide.
-
In the Repository tree view, expand
Metadata – DB
Connections where you have stored the main input schema and
drop the database table onto the design workspace. The input table used in
this scenario is called customer. A dialog box is displayed with a list of components. - Select the relevant database component, tMysqlInput in this example, and then click OK.
-
Drop two tGenKey components, two
tMatchGroup components, a tMap and a tLogRow components from Palette onto the design workspace. -
Link the input component to the tGenKey
and tMap components using Main links. -
In the two tMatchGroup components, select the
Output distance details check boxes in the
Advanced settings view of both components
before linking them together. This will provide the MATCHING_DISTANCES column in the
output schema of each tMatchGroup. If the two tMatchGroup components are already
linked to each other, you must select the Output distance
details check box in the second component in the Job flow first
otherwise you may have an issue. -
Link the two tMatchGroup components and
the tLogRow component using Main links. -
If needed, give the components specific labels to reflect their usage in
the Job.For further information about how to label a component, see
Talend Studio
User Guide.
Configuring input data and key generation
Connecting to the input data
-
Double-click tMysqlInput to open its
Component view.The property fields for tMysqlInput are
automatically filled in. If you do not define your input schema locally in
the repository, fill in the details manually after selecting Built-in in the Schema and Property Type
lists.The input table used in this scenario is called
customer. -
Modify the query in the Query box to
select only the columns you want to match:
account_name, lname,
fname, mi,
address1, city,
state_province and
postal_code.
Configuring the key generation for the first pass
-
Double-click the first tGenKey to open
the Component view. -
Click
and import blocking keys from match rules created and
tested in the
Profiling
perspective
of
Talend Studio
and use them in your Job. Otherwise,
define the blocking key parameters as described in the below steps. -
Under the Algorithm table, click the
[+] button to add two rows in the
table. -
In the column column, click the newly added row and
select from the list the column you want to process using an algorithm. In this
example, select lname. -
Do the same on the second row to select
postal_code. -
In the pre-algorithm column, click the newly
added row and select from the list the pre-algorithm you want to apply to the
corresponding column.In this example, select remove diacritical marks and
convert to upper case to remove any diacritical mark and converts
the fields of the lname column to upper
case.This conversion does not change your raw data. -
In the algorithm column, click the newly
added row and select from the list the algorithm you want to apply to the
corresponding column. In this example, select N first
characters of each word.If you select the Show help check box, you
can display instructions on how to set algorithms/options parameters. -
Do the same for the second row on the algorithm column to select first N
characters of the string. -
Click in the Value column next to the
algorithm column and enter the value for
the selected algorithm, when needed.In this scenario, enter 1 for both rows. The first letter
of each field in the corresponding columns will be used to generate the
key. Make sure to set a value for the algorithms which need one, otherwise you may
have a compilation error when you run the Job.
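The standalone sketch below illustrates the kind of blocking key this first pass produces: the first letter of lname, with diacritical marks removed and converted to upper case, followed by the first character of postal_code. It only illustrates the idea; the actual key is generated by the tGenKey component.

```java
import java.text.Normalizer;

/** Illustration of the blocking key built in the first pass:
 *  first letter of lname (upper case, diacritics removed) + first character of postal_code. */
public class BlockingKeySketch {

    static String firstChar(String value) {
        return value == null || value.isEmpty() ? "" : value.substring(0, 1);
    }

    static String normalize(String value) {
        // Remove diacritical marks and convert to upper case, as the pre-algorithm does.
        String decomposed = Normalizer.normalize(value, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "").toUpperCase();
    }

    public static void main(String[] args) {
        String lname = "Álvarez";
        String postalCode = "44000";

        String blockingKey = firstChar(normalize(lname)) + firstChar(postalCode);
        System.out.println(blockingKey); // A4
    }
}
```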
Configuring the key generation for the second pass
-
Double-click the second tGenKey to open
the Component view. -
In the Algorithm table, define the column
you want to use to partition data, account_num in this
component. Select the first N characters of the
string algorithm and set the value to 1
in the Value column. Each of the two tGenKey components will
generate a read-only T_GEN_KEY column in
the output schema. You must rename one of the T_GEN_KEY columns to stop them from overwriting each
other. -
Double-click the tMap component to open
its editor. -
In the Schema editor, copy the columns
from the first table onto the second table and rename T_GEN_KEY to T_GEN_KEY1, for example. -
In the top part of the editor, drop all columns from the input table to
the output table. -
Click OK to save the data transformation and
close the editor. -
In the tGenKey basic settings, click the
button to verify that the two generated keys are named
differently in the output schema.
Configuring the two matching passes
Configuring the first pass
-
In the basic settings of the tMatchGroup
labelled pass1, select Simple VSR from the Matching
Algorithm list.In this scenario, the match rule is based on the VSR algorithm. -
Click the Preview button to display the
Configuration Wizard. -
Click
and import matching keys from the match rules created and
tested in the
Profiling
perspective of
Talend Studio
and use them in your Job. Otherwise, define
the matching key parameters as described in the below steps.It is important to import or define the same type of the rule selected in the
basic settings of the component, otherwise the Job runs with default values for
the parameters which are not compatible between the two algorithms. -
In the Key definition table, click the
[+] button to add the column(s) on which
you want to do the matching operation, lname
in this scenario. Note: When you select a date column on which to apply an algorithm or a matching algorithm, you can decide what to compare in the date format. For example, if you want to only compare the year in the date, in the component schema set the type of the date column to Date and then enter “yyyy” in the Date Pattern field. The component then converts the date format to a string according to the pattern defined in the schema before starting a string comparison. -
Select the Jaro-Winkler algorithm in the Matching Function column.
-
From the Tokenized measure list, select
Any order. -
Set Weight to 1 and, in the Handle Null column, select the null operator you want to use to handle null attributes in the column, Null Match Null in this scenario. A sketch showing how such a weighted score can be compared with the match threshold is given after these steps. -
Click the [+] button below the Blocking Selection table to add one row to the table, then click in the line and select from the list the column you want to use as a blocking value, T_GEN_KEY in this example. Using a blocking value reduces the number of pairs of records that need to be examined. The input data is partitioned into exhaustive blocks based on the functional key. This decreases the number of pairs to compare, as comparison is restricted to record pairs within each block. -
If required, click Edit schema to open the
schema editor and see the schema retrieved from the previous component in the
Job. -
Click the Advanced settings tab and select
the Sort the output data by GID check box to
arrange the output data by their group IDs. -
Select the Deactivate matching computation when opening
the wizard check box if you do not want to run the match rules
the next time you open the wizard.
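As a rough illustration of how a weighted attribute similarity relates to the match threshold, the sketch below computes a weighted score for a record against a master record and compares it with a threshold. This is a simplified stand-in, not Talend's implementation: the similarity function here is an exact, case-insensitive comparison standing in for Jaro-Winkler, and the 0.85 threshold is only an example value.

```java
import java.util.Locale;

public class WeightedMatchSketch {
    // Stand-in similarity: 1.0 for an exact match ignoring case, 0.0 otherwise.
    // In the scenario above, the real measure on lname is Jaro-Winkler.
    static double similarity(String a, String b) {
        if (a == null || b == null) {
            return 0.0; // crude stand-in; the component offers Null Match Null/None/All operators
        }
        return a.toLowerCase(Locale.ROOT).equals(b.toLowerCase(Locale.ROOT)) ? 1.0 : 0.0;
    }

    // Weighted average of per-attribute similarities, compared with the match threshold.
    static boolean matches(String[] record, String[] master, double[] weights, double threshold) {
        double weighted = 0.0;
        double totalWeight = 0.0;
        for (int i = 0; i < record.length; i++) {
            weighted += weights[i] * similarity(record[i], master[i]);
            totalWeight += weights[i];
        }
        return (weighted / totalWeight) >= threshold;
    }

    public static void main(String[] args) {
        String[] master = {"WILSON"};      // hypothetical master record (lname only)
        String[] candidate = {"wilson"};   // hypothetical input record
        System.out.println(matches(candidate, master, new double[]{1.0}, 0.85)); // prints true
    }
}
```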
Configuring the second pass
-
In the basic settings of the tMatchGroup
labelled pass2, select Simple VSR from the Matching
Algorithm list. In this scenario, the match rule is based on the VSR algorithm. -
Click the Preview button to display the
Configuration Wizard. If this component does not have the same schema as the preceding component, a warning icon appears. If so, click the Sync columns button to retrieve the schema from the preceding component; once done, the warning icon disappears. -
In the Key Definition table, click
the [+] button to add the column(s) on
which you want to do the matching operation, lname in this scenario. Note: When you select a date column on which to apply an algorithm or a matching algorithm, you can decide what to compare in the date format. For example, if you want to only compare the year in the date, in the component schema set the type of the date column to Date and then enter “yyyy” in the Date Pattern field. The component then converts the date format to a string according to the pattern defined in the schema before starting a string comparison. -
Select the Jaro-Winkler algorithm in the Matching Function column.
-
Set Weight to 1 and in
the Handle Null column, select the null
operator you want to use to handle null attributes in the column, Null Match Null in this scenario. -
Click the [+] button below the Blocking Selection table to add one row to the table, then click in the line and select from the list the column you want to use as a blocking value, T_GEN_KEY1 in this example. A sketch of how a blocking value limits comparisons to records within the same block is given after these steps.
-
Click the Advanced settings tab and select
the Multi-pass check box. This option enables
tMatchGroup to receive data sets from the
tMatchGroup that precedes it in the Job.
-
In the Advanced settings view, select the
Sort the output data by GID check box to
arrange the output data by their group IDs. -
Select the Deactivate matching computation when opening
the wizard check box if you do not want to run the match rules
the next time you open the wizard.
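The sketch below illustrates why a blocking value such as T_GEN_KEY or T_GEN_KEY1 reduces the number of comparisons: records are grouped by their key and pairs are only generated inside each block. The records and the keys are hypothetical; in the Job, the real key comes from the tGenKey components configured earlier.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BlockingSketch {
    public static void main(String[] args) {
        // Hypothetical records: {blockingKey, lname}. The key plays the role of T_GEN_KEY.
        String[][] records = {
            {"W1", "Wilson"}, {"W1", "Wilsonn"}, {"S2", "Smith"}, {"S2", "Smyth"}
        };

        // Partition the records into blocks that share the same blocking key.
        Map<String, List<String[]>> blocks = new HashMap<>();
        for (String[] r : records) {
            blocks.computeIfAbsent(r[0], k -> new ArrayList<>()).add(r);
        }

        // Pairs are only generated inside each block, never across blocks.
        for (Map.Entry<String, List<String[]>> e : blocks.entrySet()) {
            List<String[]> block = e.getValue();
            for (int i = 0; i < block.size(); i++) {
                for (int j = i + 1; j < block.size(); j++) {
                    System.out.println("compare " + block.get(i)[1] + " / " + block.get(j)[1]
                            + " (block " + e.getKey() + ")");
                }
            }
        }
    }
}
```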
Executing the Job and showing the results on the console
To show the groups created after the first pass and compare them with the groups created after the second pass, you must modify the Job as follows:
-
-
Use a tReplicate component to
replicate the input flow you want to process as shown in the above
figure. Use a copy/paste operation to create the two parts of the
Job. -
Keep only the first-pass tMatchGroup component in the upper part of the Job and show the match results in a tLogRow component. -
Use two passes in the lower part of the Job and show the final
match results in a tLogRow
component.
-
Double-click each of the tLogRow
components to open the Component view and
in the Mode area, select the Table (print values in cells of a table)
option. -
Save your Job and press F6 to execute
it. The console displays the results after the first pass and the results after the second pass. When you compare, for example, the customer name Wilson in the results of the two passes, you will find that more customers with the last name Wilson are grouped together after the second pass.
Using survivorship functions to merge two records and create a master
record
This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.
This scenario describes a basic Job that uses the T-Swoosh algorithm to find duplicates in
customer names. It compares values in the first_name column using
matching and survivorship methods, groups similar records together and creates single
representations of entities in each group by merging two similar records according to a
survivorship rule.
These master records are new records that do not exist in the input data.
You can use several matching passes with the T-Swoosh algorithm in a Job. Each pass
provides the survived masters for each group and the original records to the next pass.
However, intermediate master records are removed from the output flow. Only final master and
original records are kept at the end.
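As an illustration of the survivorship idea used in this scenario, the sketch below merges a group of duplicates into a single master record by keeping the most common first name and the largest age. It is a simplified stand-in for the T-Swoosh survivorship functions, with hypothetical values, not Talend's implementation.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SurvivorshipSketch {
    // "Most common": keep the value that appears most often in the group.
    static String mostCommon(List<String> values) {
        Map<String, Long> counts = new HashMap<>();
        for (String v : values) {
            counts.merge(v, 1L, Long::sum);
        }
        return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    // "Largest (for numbers)": keep the largest numerical value in the group.
    static int largest(List<Integer> values) {
        return Collections.max(values);
    }

    public static void main(String[] args) {
        // Hypothetical group of duplicates for one customer.
        List<String> firstNames = Arrays.asList("John", "Jon", "John");
        List<Integer> ages = Arrays.asList(34, 36, 35);
        // The merged master record is a new record that does not exist in the input data.
        System.out.println("master = [" + mostCommon(firstNames) + ", " + largest(ages) + "]");
    }
}
```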
In this Job:
-
The tFileInputDelimited component provides the
customer records to be processed. -
The tMatchGroup component processes data using
matching and survivorship methods. -
The tLogRow component presents the matching and
survivorship results.
Setting up the Job
In this scenario, the main input schema is already stored in the Repository. For more information about storing schema metadata in the
repository, see the
Talend Studio User Guide.
-
In the Repository tree view, expand Metadata – File
delimited where you have stored the main input schema and drop
the file connection onto the design workspace. The file connection used in this
scenario is called Customer_File. A dialog box is displayed with a list of components. -
Select tFileInputDelimited and click
OK. -
Drop a tMatchGroup and a tLogRow from the Palette onto the design workspace.
-
Link the components together using the Main
links.
Configuring the input component
The main input file contains the columns: id,
first_name and age. The data in this input
file has problems such as duplication, names spelled differently or incorrectly, and different information for the same customer.
-
Double-click tFileInputDelimited to open its
Component view. The property fields are automatically filled in. If you do not define your input schema in the Repository, fill in the details manually after
selecting Built-in in the Schema and Property Type
lists. -
Create the schema through the Edit Schema
button, if the schema is not already stored in the Repository. Remember to set the data type in the Type column.
Creating the match rule with a matching key and a survivorship function
Setting the schema and selecting the matching algorithm
-
Double-click tMatchGroup to display the
Basic settings view and define the
component properties. -
From the Matching Algorithm list, select
T-Swoosh. In this scenario, the match rule is based on the T-Swoosh algorithm. -
Click Sync columns to retrieve the schema
from the preceding component. -
Click the Edit schema button to view the input and output schema and make any modifications to the output schema, if necessary. In the output schema of this component, there are a few standard output columns that are read-only. For more information, see tMatchGroup Standard properties. -
In the Advanced settings view, leave the
parameters selected by default.
Defining the match rule
-
In the tMatchGroup basic settings, click
Preview to open the configuration wizard
and define the matching key and the survivorship function. You can use the configuration wizard to import match rules created and tested in the studio and stored in the repository, and use them in your match Jobs. For further information, see Importing match rules from the studio repository. It is important to have the same type of matching algorithm selected in the basic settings of the component and defined in the configuration wizard. Otherwise, the Job runs with default values for parameters which are not compatible between the two algorithms. -
Define the match rule as follows:
-
In the Key definition table, click
the [+] button to add a line in the
table. Click in the Input Key Attribute
column and select the column on which you want to do the matching
operation, first_name in this scenario. -
Click in the Matching Function column
and select Soundex from the list. This
method matches processed entries according to a standard English
phonetic algorithm which indexes strings by sound, as pronounced in
English. -
From the Tokenized measure list, choose the option that does not use a tokenized distance for the selected algorithm. -
Set the Threshold to 0.8 and the Confidence Weight to 1.
-
Select Null Match None in the
Handle Null column in order to have
matching results where null values have minimal effect. -
Select Most common
in the Survivorship Function
column. This method validates the most frequent name value in each group
of duplicates.
-
-
Define the default survivorship rule as follows:
-
In the Default Survivorship
Rules table, click the [+] button to add a line in the table. Click in the
Data Type column and select
Number. -
Click in the Survivorship
Function column and select Largest (for numbers) from the list. This method
validates the largest numerical value in each group.
-
-
Set the Hide groups of less than parameter to decide which groups to show in the result chart and matching table. This parameter enables you to hide groups that are smaller than the specified size. -
Click the Chart button in the wizard to
execute the Job in the defined configuration and have the results directly in
the wizard. The matching chart gives a global picture of the duplicates in the analyzed data. The matching table indicates the details of the items in each group, colors the groups in accordance with their colors in the matching chart and indicates with true the records which are master records. The master record in each group is the result of merging two similar records according to the phonetic algorithm and survivorship rule. The master record is a new record that does not exist in the input data. -
Click OK to close the wizard.
Finalizing the Job and executing it
-
Double-click the tLogRow component to display
its Basic settings view and define the
component properties. -
Save your Job and press F6 to execute
it. Similar records are grouped together. The number of records in each group is
listed in the GRP_SIZE column and computed only on the
master record. The MASTER column indicates with true or false if
the corresponding record is a master record or not a master record. The master
record is created by merging the record which best matches the phonetic
algorithms with the record which has the largest age value. The
SCORE column lists the calculated similarity between
the input record and the master record according to the Soundex matching algorithm.The identifier for each group, which is of String data
type, is listed in the GID column next to the corresponding
record. This identifier will be of the data type Long for
Jobs that are migrated from older releases. To have the group identifier as
String, you must replace the tMatchGroup component in the imported Job with tMatchGroup from the studio Palette.
tMatchGroup MapReduce properties (deprecated)
These properties are used to configure tMatchGroup running in the MapReduce Job framework.
The MapReduce
tMatchGroup component belongs to the Data Quality family.
The component in this framework is available in all Talend Platform products with Big Data and in Talend Data Fabric.
The MapReduce framework is deprecated from Talend 7.3 onwards. Use Talend Jobs for Apache Spark to accomplish your integration tasks.
Basic settings
Schema and Edit |
A schema is a row description. It defines the number of fields Click Sync The output schema of this component contains the following read-only
GID: provides a group identifier
of the data type String. Note: In Jobs migrated from previous releases to your current Talend Studio, the group identifier may be of the Long data
type. To have a group identifier of the String data type, replace the tMatchGroup components in the migrated Jobs with tMatchGroup components from the Palette.
GRP_SIZE: counts the number of records in the
MASTER: identifies, by Each input record will be compared to the master record, if they
SCORE: measures the distance between the input In case the tMatchGroup component is
GRP_QUALITY: provides the quality of similarities |
 |
Built-In: You create and store the schema locally for this component |
 |
Repository: You have already created the schema and stored it in the |
Matching Algorithm |
Select from the list the algorithm you want to use in the component: If you converted a standard Job using tMatchGroup with the T-Swoosh algorithm to a Map/Reduce Job, select Simple VSR from the list and save the |
Key Definition |
Input Key Attribute Select the column(s) from the input flow on which you want to apply a Note: When you select a date column on which to apply an algorithm or a matching
algorithm, you can decide what to compare in the date format. For example, |
 |
Matching Function Select a matching algorithm from the list:
Exact: matches each processed entry
Exact – ignore case: matches each
Soundex: matches
Levenshtein (edit distance):
Metaphone: Based
Double Metaphone:
Soundex FR:
Jaro: matches processed entries
Jaro-Winkler: a variant of Jaro, but
Fingerprint key:
matches entries after doing the following sequential process:
q-grams: matches processed entries by
Hamming: calculates the minimum
custom…: enables you to load an For further information about how to load an external Java library, For further information about how to create a custom matching For a related scenario about how to use a custom matching algorithm, |
Custom Matcher When you select Custom as the For example, to use a MyDistance.class class |
|
 |
Weight Set a numerical weight for each attribute (column) of the key |
Handle Null To handle null values, select from the list the null operator you want
Null Match Null: a Null attribute
Null Match None: a Null attribute
Null Match All: a Null attribute For example, if we have two columns, name and If we have two records: “Doe”, “John” “Doe”, “” Depending on the operator you choose, these two records may or may not
Null Match Null: they do not
Null Match None: they do not Null Match All: they match. And for the records: “Doe”, “” “Doe”, “” Null Match Null: they match.
Null Match None: they do not Null Match All: they match. |
|
Match Threshold |
Enter the match probability. Two data records match when the You can enter a different match threshold for each match rule. |
Blocking Selection |
Input Column If required, select the column(s) from the input flow according to Blocking reduces the number of pairs of records that needs to be Using blocking column(s) is very useful when you are processing very |
Advanced settings
Store on disk |
Select the Store on disk check box if
Max buffer size: Type in the size of
Temporary data directory path: Set |
Multiple output |
Select the Separate output check box –Uniques: when the group score –Matches: when the group score –Suspects: when the group score
Confident match threshold: set a |
Multi-pass |
Select this check box to enable a
tMatchGroup component to receive data sets from another tMatchGroup that precedes it in the Job. This will refine the groups received by each of the tMatchGroup components through creating data partitions based on different blocking keys. Note: When using two tMatchGroup
components in a Job and this option, you must select this check box in both tMatchGroup components before linking them together. If you linked the components before selecting this check box, select this check box in the second component in the Job flow and then, in the first component. Otherwise, you may have an issue as there are two columns in the output schema with the same name. Selecting this check box in only one tMatchGroup component may cause schema mismatch issues. For an example Job, see Matching customer data through multiple passes |
Sort the output data by GID |
Select this check box to group the output data by the group The output is sorted in ascending alphanumeric order by group |
Output distance details |
Select this check box to add an output column Note: When using two tMatchGroup
components in a Job and this option, you must select this check box in both tMatchGroup components before linking them together. If you linked the components before selecting this check box, select this check box in the second component in the Job flow and then, in the first component. Otherwise, you may have an issue as there are two columns in the output schema with the same name. Selecting this check box in only one tMatchGroup component may cause schema mismatch issues. |
Display detailed labels |
Select this check box to have in the output For example, if you try to match on first name and last name fields, |
tStatCatcher Statistics |
Select this check box to collect log data at the component level. Note |
Global Variables
Global Variables |
ERROR_MESSAGE: the error message generated by the A Flow variable functions during the execution of a component while an After variable To fill up a field or expression with a variable, press Ctrl + For further information about variables, see |
Usage
Usage rule |
In a You need to use the Hadoop Configuration tab in the For further information about a For a scenario demonstrating a Map/Reduce Job using this component, Note that in this documentation, unless otherwise |
Working principle
This component implements the MapReduce model, based on the blocking keys defined in
the Blocking definition table of the Basic settings view.
This implementation proceeds as follows:
-
Splits the input rows in groups of a given size.
-
Implements a Map Class that creates a map between each key and a list of
records. -
Shuffles the records to group those with the same key together.
-
Applies, on each key, the algorithm defined in the Key
definition table of the Basic
settings view. Then, accordingly, this component reads the records, compares them with the master records, groups the similar ones, and makes each remaining unmatched record a master record. A minimal sketch of this processing flow is given after this list. -
Outputs the groups of similar records with their group IDs, group sizes,
matching distances and scores.
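The sketch below mimics this map, shuffle and reduce sequence in a single JVM to make the flow concrete: records are keyed by their blocking key, grouped, and the first record of each group acts as the master the others are compared with. It is only an illustration of the principle, not the Map/Reduce code generated by the component, and the matching logic itself is omitted.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReducePrincipleSketch {
    public static void main(String[] args) {
        // Hypothetical input rows: {blockingKey, lname}. The blocking key is the map key.
        List<String[]> rows = Arrays.asList(
                new String[]{"W", "Wilson"},
                new String[]{"W", "Wilsonn"},
                new String[]{"S", "Smith"});

        // Map phase: emit (blockingKey, record). Shuffle: group records sharing a key.
        Map<String, List<String[]>> shuffled =
                rows.stream().collect(Collectors.groupingBy(r -> r[0]));

        // Reduce phase: within each key, the first record becomes the master and the
        // following records are compared with it (the actual matching logic is omitted).
        shuffled.forEach((key, group) -> {
            String[] master = group.get(0);
            for (int i = 1; i < group.size(); i++) {
                System.out.println(key + ": compare " + group.get(i)[1]
                        + " with master " + master[1]);
            }
        });
    }
}
```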
Configuration wizard in Map/Reduce Jobs
The match results on multiple conditions will list data
records that meet any of the defined rules.
You cannot open the configuration wizard unless you link an input component to the tMatchGroup component.
From the configuration wizard in tMatchGroup, you can:
-
define multiple conditions using several match rules to group
data, -
set different match intervals for each rule,
-
import match rules created and tested in the studio and stored in
the repository, and use them in your match Jobs. You can only import rules
configured with the VSR algorithm. For further information, see Importing match rules from the studio repository. -
select a blocking key to partition data.
-
Click the [+] button on the match rule
bar. -
Set the parameters for the new rule in the Key
definition table and define its match interval. -
Repeat the above steps to create as many match rules as needed. You can define a different match interval for each rule. When you define multiple rules, the Job conducts an OR match operation: it evaluates data records against the first rule, and the records that match are not evaluated against the second rule. A sketch of this evaluation order is given after these steps. -
In the Blocking Selection table, select the
column(s) from the input flow which you want to use as a blocking key. Defining a blocking key is not mandatory but is very useful when you are processing big data sets. A blocking key partitions data into blocks and so reduces the number of records that need to be examined. This key can come from a
tGenKey component (and would be called
T_GEN_KEY) or directly from the input schema. -
At the bottom right corner of the wizard, click either:
-
OK to save the current
configuration. -
Cancel to close the wizard and keep
the configuration saved initially in the wizard.
-
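The sketch below illustrates the OR semantics of multiple match rules: rules are tried in order and, as soon as one of them matches a pair of records, the following rules are skipped for that pair. The two rules shown here are hypothetical examples, not the ones defined in the wizard.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;

public class MultiRuleSketch {
    public static void main(String[] args) {
        // Hypothetical rules, in the order they appear on the match rule bar.
        List<BiPredicate<String, String>> rules = Arrays.asList(
                (a, b) -> a.equalsIgnoreCase(b),                                   // rule 1: exact, ignore case
                (a, b) -> a.substring(0, 1).equalsIgnoreCase(b.substring(0, 1)));  // rule 2: same first letter

        String left = "Wilson";
        String right = "wilson";
        for (int i = 0; i < rules.size(); i++) {
            if (rules.get(i).test(left, right)) {
                System.out.println("matched by rule " + (i + 1));
                break; // records that match are not evaluated against the following rules
            }
        }
    }
}
```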
Matching data through multiple passes using Map/Reduce
components
This scenario applies only to subscription-based Talend Platform products with Big Data and Talend Data Fabric.
Note that
Talend
Map/Reduce components are available only to users who subscribed to Big
Data.
This scenario shows how to create a
Talend
Map/Reduce Job to match data by using Map/Reduce components. It generates
Map/Reduce code and runs right in Hadoop.
The Job in this scenario groups similar customer records by running through
two subsequent matching passes (tMatchGroup
components) and outputs the calculated matches in groups. Each pass provides its matches to
the pass that follows in order for the latter to add more matches identified with new rules
and blocking keys.
This Job is a duplication of the Standard data integration Job described in Matching customer data through multiple passes where standard components are replaced with Map/Reduce components.
You can use
Talend Studio
to automatically convert the standard Job in the previous section to a Map/Reduce Job.
This way, you do not need to redefine the settings of the components in the Job.
Before starting to replicate this scenario, ensure that you have appropriate
rights and permissions to access the Hadoop distribution to be used.
Converting the Job to a Big Data Batch Job
-
In the Repository tree view, right-click the Job you have created in
the earlier scenario to open its contextual menu and select Edit properties. The Edit properties dialog box is displayed. Note that the Job must be closed before you are able to make any changes in this dialog box. You can change the Job name as well as the other descriptive information about the Job from this dialog box. -
From the Job Type list, select
Big Data Batch. Then a Map/Reduce Job
using the same name appears under the Big Data
Batch sub-node of the Job
Design node.
Setting up the Job
-
Double-click the new Map/Reduce Job to open it in the workspace.
The Map/Reduce component Palette is opened.
-
Delete tMysqlInput in this scenario as it is
not a Map/Reduce component and use tRowGenerator in its place. Link it to tGenKey with a Row > Main
link. -
Double-click tRowGenerator to open its
editor. -
Define the schema you want to use to write data in Hadoop.
-
Click OK to validate your schema and close
the editor. -
Leave the settings of the other components as you defined initially in the
standard version of the Job.
Setting up Hadoop connection
-
Click Run to open its view and then click the
Hadoop Configuration tab to display its
view for configuring the Hadoop connection for this Job. -
From the Property type list,
select Built-in. If you have created the connection to be used in the Repository, select Repository and the Studio will reuse that set of connection information for this Job. -
In the Version area, select the
Hadoop distribution to be used and its version. -
If you use Google Cloud Dataproc, see Google Cloud Dataproc.
-
If you cannot
find the Cloudera version to be used from this drop-down list, you can add your distribution
via some dynamic distribution settings in the Studio. -
If you cannot find the distribution corresponding to yours in the list, select Custom so as to connect to a
Hadoop distribution not officially supported in the Studio. For a
step-by-step example about how to use this
Custom option, see Connecting to a custom Hadoop distribution.
-
-
In the Name node field, enter the location of
the master node, the NameNode, of the distribution to be used. For example,
hdfs://tal-qa113.talend.lan:8020. -
If you are using a MapR distribution, you can simply leave maprfs:/// as it is in this field; then the MapR
client will take care of the rest on the fly for creating the connection. The
MapR client must be properly installed. For further information about how to set
up a MapR client, see the following link in MapR’s documentation: http://doc.mapr.com/display/MapR/Setting+Up+the+Client -
If you are using WebHDFS, the location should be
webhdfs://masternode:portnumber; WebHDFS with SSL is not
supported yet.
-
-
In the Resource Manager field,
enter the location of the ResourceManager of your distribution. For example,
tal-qa114.talend.lan:8050. -
Then you can continue to set the following parameters depending on the
configuration of the Hadoop cluster to be used (if you leave the check
box of a parameter clear, then at runtime, the configuration about this
parameter in the Hadoop cluster to be used will be ignored):-
Select the Set resourcemanager
scheduler address check box and enter the Scheduler address in
the field that appears. -
Select the Set jobhistory
address check box and enter the location of the JobHistory
server of the Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server. -
Select the Set staging
directory check box and enter this directory defined in your
Hadoop cluster for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
such as yarn-site.xml or mapred-site.xml of your distribution. -
Select the Use datanode hostname check box to allow the
Job to access datanodes via their hostnames. This actually sets the dfs.client.use.datanode.hostname
property to true. When connecting to an
S3N filesystem, you must select this check box.
-
-
-
If you are accessing the Hadoop cluster running
with Kerberos security, select this check box, then enter the Kerberos
principal name for the NameNode in the field displayed. This enables you to use
your user name to authenticate against the credentials stored in Kerberos.
-
If this cluster is a MapR cluster of the version 5.0.0 or later, you can set the
MapR ticket authentication configuration in addition or as an alternative by following
the explanation in Connecting to a security-enabled MapR. Keep in mind that this configuration generates a new MapR security ticket for the username
defined in the Job in each execution. If you need to reuse an existing ticket issued for the
same username, leave both the Force MapR ticket
authentication check box and the Use Kerberos
authentication check box clear, and then MapR should be able to automatically
find that ticket on the fly.
In addition, since this component performs Map/Reduce computations, you
also need to authenticate the related services such as the Job history server and
the Resource manager or Jobtracker depending on your distribution in the
corresponding field. These principals can be found in the configuration files of
your distribution. For example, in a CDH4 distribution, the Resource manager
principal is set in the yarn-site.xml file and the Job history
principal in the mapred-site.xml file. If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. You need to enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored in the machine in which your Job actually
runs, for example, on a Talend
Jobserver. Note that the user that executes a keytab-enabled Job is not necessarily
the one a principal designates but must have the right to read the keytab file being
used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
situation, ensure that user1 has the right to read the keytab
file to be used. -
-
In the User name field, enter the login user
name for your distribution. If you leave it empty, the user name of the machine
hosting the Studio will be used. -
In the Temp folder field, enter the path in
HDFS to the folder where you store the temporary files generated during
Map/Reduce computations. -
Leave the default value of the Path separator in
server as it is, unless you have changed the separator used by your
Hadoop distribution's host machine for its PATH variable; in other words, unless that separator is not a colon (:). In that situation, you must change this value to the
one you are using in that host.
-
Leave the Clear temporary folder check box
selected, unless you want to keep those temporary files. -
Leave the Compress intermediate map output to reduce
network traffic check box selected, so as to spend less time transferring the mapper task partitions to the multiple reducers. However, if the data transfer in the Job is negligible, it is recommended to
clear this check box to deactivate the compression step, because this
compression consumes extra CPU resources. -
If you need to use custom Hadoop properties, complete the Hadoop properties table with the property or
properties to be customized. Then at runtime, these changes will override the
corresponding default properties used by the Studio for its Hadoop
engine.For further information about the properties required by Hadoop, see Apache’s
Hadoop documentation on http://hadoop.apache.org, or
the documentation of the Hadoop distribution you need to use. -
If the HDFS transparent encryption has been enabled in your cluster, select
the Setup HDFS encryption configurations check
box and in the HDFS encryption key provider field
that is displayed, enter the location of the KMS proxy.
For further information about the HDFS transparent encryption and its KMS proxy, see Transparent Encryption in HDFS.
-
You can tune the map and reduce computations by
selecting the Set memory check box to set proper memory allocations
for the computations to be performed by the Hadoop system. The memory parameters to be set are Map (in Mb), Reduce (in Mb) and ApplicationMaster (in Mb). These fields allow you to dynamically allocate memory to the map and the reduce computations and the ApplicationMaster of YARN. For further information about the Resource Manager, its scheduler and the ApplicationMaster, see YARN's documentation such as http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/. For further information about how to determine YARN and MapReduce memory configuration
settings, see the documentation of the distribution you are using, such as the following
link provided by Hortonworks: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html. -
If you are using Cloudera V5.5+, you can select the Use Cloudera Navigator check box to enable the Cloudera Navigator
of your distribution to trace your Job lineage to the component level, including the
schema changes between components.
With this option activated, you need to set the following parameters:
-
Username and Password: these are the credentials you use to connect to your Cloudera
Navigator. -
Cloudera Navigator URL: enter the location of the
Cloudera Navigator to be connected to. -
Cloudera Navigator Metadata URL: enter the location
of the Navigator Metadata. -
Activate the autocommit option: select this check box
to make Cloudera Navigator generate the lineage of the current Job at the end of the
execution of this Job.Since this option actually forces Cloudera Navigator to generate lineages of
all its available entities such as HDFS files and directories, Hive queries or Pig
scripts, it is not recommended for the production environment because it will slow the
Job. -
Kill the job if Cloudera Navigator fails: select this check
box to stop the execution of the Job when the connection to your Cloudera Navigator fails. Otherwise, leave it clear to allow your Job to continue to run.
-
Disable SSL validation: select this check box to
make your Job connect to Cloudera Navigator without the SSL validation process. This feature is meant to facilitate the testing of your Job but is not
recommended to be used in a production cluster.
-
-
If you are using Hortonworks Data Platform V2.4.0 onwards and you have
installed Atlas in your cluster, you can select the Use
Atlas check box to enable Job lineage to the component level, including the
schema changes between components.
With this option activated, you need to set the following parameters:
-
Atlas URL: enter the location of the Atlas to be
connected to. It is often http://name_of_your_atlas_node:port -
Die on error: select this check box to stop the Job
execution when Atlas-related issues occur, such as connection issues to Atlas. Otherwise, leave it clear to allow your Job to continue to run.
In the Username and Password fields, enter the authentication information for access to
Atlas. -
Executing the Job
Save your Job and press F6 to execute it. The first pass provides its matches to the pass that follows and more matches are identified with the rule and blocking key of the second pass. More customer records are grouped together after the second pass.