tMatchGroup

Creates groups of similar data records in any source data including large volumes
of data by using one or several match rules.

tMatchGroup compares columns in both
standard input data flows and in M/R input data flows by using matching methods and
groups similar encountered duplicates together.

Several tMatchGroup components can be used sequentially to match data
against different blocking keys. This will refine the groups received by
each of the tMatchGroup components through
creating different data partitions that overlap previous data blocks and so
on.

When a group is defined, the first processed record of each
group becomes the master record of the group. For every other record, the
distance from the master records is computed, and the record is then assigned
to the appropriate master record.

Depending on the Talend solution you
are using, this component can be used in one, some or all of the following Job
frameworks:

Standard: see tMatchGroup Standard properties.

The component in this framework is available when you have subscribed to one of
the Talend Platform products or Talend Data
Fabric.
MapReduce: see tMatchGroup MapReduce properties.

The component in this framework is available when you have subscribed to any Talend Platform product with Big Data or Talend Data
Fabric.

tMatchGroup Standard properties

These properties are used to configure tMatchGroup running in the Standard Job framework.

The Standard
tMatchGroup component belongs to the Data Quality family.

The component in this framework is available when you have subscribed to one of
the Talend Platform products or Talend Data
Fabric.

Basic settings

Schema and Edit schema	A schema is a row description, it defines the number of fields to be processed and passed on to the next component. The schema is either Built-in or stored remotely in the Repository. Click Sync columns to retrieve the schema from the previous component in the Job. The output schema of this component contains the following read-only fields: – GID: provides a group identifier of the data type String. Note: All Jobs with tMatchGroup that are migrated from older releases into your current studio may provide a group identifier of the data type Long. If you want to have a group identifier of the data type String, you must replace the tMatchGroup component in these Jobs with tMatchGroup from the studio Palette. – GRP_SIZE: counts the number of records in the group, computed only on the master record. – MASTER: identifies, by `true` or `false`, if the record used in the matching comparisons is a master record. There is only one master record per group. Each input record will be compared to the master record, if they match, the input record will be in the group. – SCORE: measures the distance between the input record and the master record according to the matching algorithm used. In case the tMatchGroup component is used to have multiple output flows, the score in this column decides to what output group the record should go. – GRP_QUALITY: provides the quality of similarities in the group by taking the minimal matching value. Only the master record has a quality score. – MERGE_INFO: this output column is available only when you have more than one tMatchGroup component in the Job. The column explains with true or false if the record is respectively a master record or not a master record in the first pass.
	Built-in: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.
	Repository: You have already created and stored the schema in the Repository. You can reuse it in other projects and job designs. Related topic: see Talend Studio User Guide.
Matching Algorithm	Select from the list the algorithm you want to use in the component: Simple VSR or T-Swoosh. It is important to have the same type of the matching algorithm selected in the basic settings of the component and defined in the configuration wizard. Otherwise the Job runs with default values for the parameters which are not compatible between the two algorithms. For further information about the two algorithms, see Talend Studio User Guide.
	Click the import icon to import a match rule from the Studio repository. It is important to import the same type of the rule selected in the basic settings of the component, otherwise the Job runs with default values for the parameters which are not compatible between the two algorithms. For further information about how to import rules, see Importing match rules from the studio repository
Key Definition	Input Key Attribute Select the column(s) from the input flow on which you want to apply a matching algorithm. Note: When you select a date column on which to apply an algorithm or a matching algorithm, you can decide what to compare in the date format. For example, if you want to only compare the year in the date, in the component schema set the type of the date column to Date and then enter “yyyy” in the Date Pattern field. The component then converts the date format to a string according to the pattern defined in the schema before starting a string comparison.
	Matching Function Select a matching algorithm from the list: Exact: matches each processed entry to all possible reference entries with exactly the same value. It returns 1 when the two strings exactly match, otherwise it returns 0. Exact – ignore case: matches each processed entry to all possible reference entries with exactly the same value while ignoring the value case. Soundex: matches processed entries according to a standard English phonetic algorithm. It indexes strings by sound, as pronounced in English, for example “Hello”: “H400”. Levenshtein (edit distance): calculates the minimum number of edits (insertion, deletion or substitution) required to transform one string into another. Using this algorithm in the tMatchGroup component, you do not need to specify a maximum distance. The component automatically calculates a matching percentage based on the distance. This matching score will be used for the global matching calculation, based on the weight you assign in the Confidence Weight field. Metaphone: Based on a phonetic algorithm for indexing entries by their pronunciation. It first loads the phonetics of all entries of the lookup reference and checks all entries of the main flow against the entries of the reference flow. Double Metaphone: a new version of the Metaphone phonetic algorithm, that produces more accurate results than the original algorithm. It can return both a primary and a secondary code for a string. This accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry. Soundex FR: matches processed entries according to a standard French phonetic algorithm. Jaro: matches processed entries according to spelling deviations. It counts the number of matched characters between two strings. The higher the distance is, the more similar the strings are. Jaro-Winkler: a variant of Jaro, but it gives more importance to the beginning of the string. 
Fingerprint key: matches entries after applying the following steps in sequence: remove leading and trailing whitespace, change all characters to their lowercase representation, remove all punctuation and control characters, split the string into whitespace-separated tokens, sort the tokens and remove duplicates, join the tokens back together, and normalize extended western characters to their ASCII representation, for example gödel to godel. Because the tokens are sorted, their original order does not matter. So, Cruise, Tom and Tom Cruise both end up with the fingerprint cruise tom and therefore end up in the same cluster. Normalizing extended characters reproduces the data entry mistakes made when entering extended characters with an ASCII-only keyboard. However, this procedure can also lead to false positives: for example, gödel and godél would both end up with godel as their fingerprint, but they are likely to be different names. So this might work less effectively for datasets where extended characters play a substantial differentiating role. q-grams: matches processed entries by dividing strings into letter blocks of length `q` in order to create a number of `q` length grams. The matching result is given as the number of q-gram matches over possible q-grams. Hamming: calculates the minimum number of substitutions required to transform one string into another string having the same length. For example, the Hamming distance between “masking” and “pairing” is 3. custom…: enables you to load an external matching algorithm from a Java library using the custom Matcher column. For further information about how to load an external Java library, see tLibraryLoad. For further information about how to create a custom matching algorithm, see Creating a custom matching algorithm. For a related scenario about how to use a custom matching algorithm, see Scenario 2: Using a custom matching algorithm to match entries.
	Custom Matcher When you select Custom as the matching type, enter the path pointing to the custom class (external matching algorithm) you need to use. This path is defined by yourself in the library file (.jar file) which you can import by using the tLibraryLoad component. For example, to use a MyDistance.class class stored in the directory org/talend/mydistance in a user-defined mydistance.jar library, the path to be entered is org.talend.mydistance.MyDistance.
	Threshold This column is displayed when you select to have T-Swoosh as the matching algorithm. Set a threshold between 0 and 1. 0 means that the similarity between values in the column is not measured. 1 means that you want each two compared values of the column to exactly match. Default value is 1.
	Tokenized measure Tokenization is the concept of splitting a string into words. Select the method to use to compute a tokenized measure for the selected algorithm: NO: no tokenization method is used on the string. With this option, “John Doe” and “Jon Doe” should match. Same place: splits the two strings by words into two lists, list1 and list2. Associates each element from list1 with the element which has the same position in list2. Using this method, “She is red and he is pink” and “Catherine is red and he is pink” should match. Same order: splits the two strings by words into two lists, list1 and list2, and assumes that list1 is shorter than list2. Tries to associate the elements from list1 with the elements in list2 taken in the same order. Using this method, “John Doe” and “John B. Doe” match. This method should be used only with strings that have only a few words, otherwise the number of possible combinations can be large. Any order: splits the two strings by words into two lists, list1 and list2, and assumes that list1 is shorter than list2. Tries to assign each word of list1 to a word of list2, in order to have the highest global similarity (with respect to the used similarity). Using this method, “John Doe” and “Doe John” match.
	Weight Set a numerical weight for each attribute (column) of the key definition. The values can be anything >= 0.
	Handle Null To handle null values, select from the list the null operator you want to use on the column: Null Match Null: a Null attribute only matches another Null attribute. Null Match None: a Null attribute never matches another attribute. Null Match All: a Null attribute matches any other value of an attribute. For example, if we have two columns, name and firstname where the name is never null, but the first name can be null. If we have two records: “Doe”, “John” “Doe”, “” Depending on the operator you choose, these two records may or may not match: Null Match Null: they do not match. Null Match None: they do not match. Null Match All: they match. And for the records: “Doe”, “” “Doe”, “” Null Match Null: they match. Null Match None: they do not match. Null Match All: they match.
Match Threshold	Enter the match probability. Two data records match when the probability is above the set value. You can enter a different match threshold for each match rule.
Blocking Selection	Input Column If required, select the column(s) from the input flow according to which you want to partition the processed data in blocks, this is usually referred to as “blocking”. Blocking reduces the number of pairs of records that needs to be examined. In blocking, input data is partitioned into exhaustive blocks designed to increase the proportion of matches observed while decreasing the number of pairs to compare. Comparisons are restricted to record pairs within each block. Using blocking column(s) is very useful when you are processing very big data.
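
The Fingerprint key steps listed under Matching Function can be sketched in a few lines of Python. This is only an illustration of the described procedure, not the component's actual implementation: the function name `fingerprint` is hypothetical, the use of Unicode categories to detect punctuation and control characters is an assumption, and the ASCII normalization is applied before tokenizing for simplicity.

```python
import unicodedata


def fingerprint(value: str) -> str:
    """Build a fingerprint key following the steps described above (sketch)."""
    # Remove leading/trailing whitespace and lowercase everything.
    text = value.strip().lower()
    # Decompose accented characters so that the accents become separate
    # combining marks, which are dropped below (gödel -> godel).
    decomposed = unicodedata.normalize("NFKD", text)
    kept = []
    for ch in decomposed:
        if ch.isspace():
            kept.append(" ")  # keep token separators
            continue
        category = unicodedata.category(ch)
        # Assumption: P* = punctuation, C* = control, M* = combining marks.
        if category[0] in ("P", "C", "M"):
            continue
        kept.append(ch)
    # Split into tokens, remove duplicates, sort, and join back together.
    tokens = "".join(kept).split()
    return " ".join(sorted(set(tokens)))


print(fingerprint("Cruise, Tom"))  # cruise tom
print(fingerprint("Tom Cruise"))   # cruise tom
```

Both inputs produce the same key, which is why they end up in the same cluster. The same mechanism explains the false positives mentioned above, since accents are discarded before comparison.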

Advanced settings

Store on disk	Select the Store on disk check box if you want to store processed data blocks on the disk to maximize system performance. Max buffer size: Type in the size of physical memory you want to allocate to processed data. Temporary data directory path: Set the location where the temporary file should be stored.
Multiple output	Select the Separate output check box to have three different output flows: –Uniques: when the group score (minimal distance computed in the record) is equal to 1, the record is listed in this flow. –Matches: when the group score (minimal distance computed in the record) is higher than the threshold you define in the Confident match threshold field, the record is listed in this flow. –Suspects: when the group score (minimal distance computed in the record) is below the threshold you define in the Confident match threshold field, the record is listed in this flow. Confident match threshold: set a numerical value between the current Match threshold and 1. Above this threshold, you can be confident in the quality of the group.
Multi-pass	Select this check box to enable a tMatchGroup component to receive data sets from another tMatchGroup that precedes it in the Job. This will refine the groups received by each of the tMatchGroup components through creating data partitions based on different blocking keys. For an example Job, see Scenario 2: Matching customer data through multiple passes With multi-pass matching, all master records are generated but intermediate master records are removed from the output flow. Only final master and original records are kept at the end. Propagate original values: This option is available only with the T-Swoosh algorithm. Select this check box to allow the original records from each pass (and not only the unmatched records) to also be considered in the second pass of matching, both against each other and against the survived masters. This helps to make sure that no matches are missed.
Sort the output data by GID	Select this check box to group the output data by the group identifier.
Output distance details	Select this check box to add an output column MATCHING_DISTANCES in the schema of the component. This column provides the distance between the input and master records in each group. Note: When you use two tMatchGroup components in a Job and you want to use the Output distance details option, you must select this check box in both components before you link them together. If the components are linked, select the check box in the second component in the Job flow first then in the first component, otherwise you may have an issue as there are two columns in the output schema with the same name. Selecting this option in only one tMatchGroup is not useful and may bring schema mismatch issues.
Display detailed labels	Select this check box to have in the output MATCHING_DISTANCES column not only the matching distance but also the names of the columns used as key attributes in the applied rule. For example, if you try to match on first name and last name fields, fname and lname, the output would be `fname:1.0|lname:0.97` when the check box is selected and `1.0|0.97` when it is not selected.
Deactivate matching computation when opening the wizard	Select this check box to open the Configuration Wizard without running the match rules defined in the wizard. This enables you to have a better experience with the component. Otherwise, the wizard may take some time to open.
tStatCatcher Statistics	Select this check box to collect log data at the component level. Note that this check box is not available in the Map/Reduce version of the component.
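
If a downstream step needs the individual distances, the MATCHING_DISTANCES value can be split on the pipe character. The sketch below assumes the two formats shown for Display detailed labels (`fname:1.0|lname:0.97` and `1.0|0.97`); the helper name `parse_matching_distances` is hypothetical and not part of the component.

```python
from typing import List, Optional, Tuple


def parse_matching_distances(value: str) -> List[Tuple[Optional[str], float]]:
    """Split a MATCHING_DISTANCES string into (label, distance) pairs.

    The label is None when the value carries no column names.
    """
    pairs = []
    for part in value.split("|"):
        if ":" in part:
            # Detailed labels: "column:distance"
            label, _, distance = part.rpartition(":")
            pairs.append((label, float(distance)))
        else:
            # Plain format: only the distance is present
            pairs.append((None, float(part)))
    return pairs


print(parse_matching_distances("fname:1.0|lname:0.97"))
print(parse_matching_distances("1.0|0.97"))
```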

Global Variables

Global Variables	ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box. A Flow variable functions during the execution of a component while an After variable functions after the execution of the component. To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it. For further information about variables, see Talend Studio User Guide.

Usage

Usage rule	This component is an intermediary step. It requires an input flow as well as an output flow.

Configuration wizard

The configuration wizard enables you to create different production environments,
Configurations, and their match rules.

You can also use
the configuration wizard to import match rules created and tested in the studio and use them
in your match Jobs. For further information, see Importing match rules from the studio repository.

You cannot open the configuration wizard unless you link the input component to the
tMatchGroup component.

To open the configuration wizard:

In the studio workspace, design your job and link the components together, for
example as below:
Double-click tMatchGroup; or right-click
it and from the contextual menu select Configuration
Wizard; or click Preview in the
basic settings view of
tMatchGroup.
In the popup that opens, click Skip Computation
if you want to open the Configuration Wizard without running the match rules defined
in it.

The configuration wizard is composed of three areas:

the Configuration view, where you can set the
match rules and the blocking column(s).
the matching chart, which presents the graphic matching result,
the matching table, which presents the details of the matching result.

The Limit field at the upper-left corner indicates the
maximum number of rows to be processed by the match rule(s) in the wizard. The default
maximum is 1000 rows.

Configuration view

From this view, you can edit the configuration of the tMatchGroup component or define different configurations in which to execute
the Job.

You can use these different configurations for testing purposes for example, but
you can only save one configuration from the wizard, the open configuration.

In each configuration, you can define the parameters to generate match rules with the VSR
or the T-Swoosh algorithm. The settings of the Configuration view differ slightly depending on whether you select Simple VSR or T-Swoosh in the
basic settings of the tMatchGroup component.

You can define survivorship rules, blocking key(s) and multiple conditions using several
match rules. You can also set different match intervals for each rule. The match results on
multiple conditions will list data records that meet any of the defined rules. When a
configuration has multiple conditions, the Job conducts an OR match operation. It evaluates
data records against the first rule and the records that match are not evaluated against the
other rules.

The parameters required to edit or create a match rule are:

The Key definition parameters.
The Match Threshold field.
A blocking key in the Blocking Selection
table (available only for rules with the VSR algorithm).

Defining a blocking key is not mandatory but advisable as it partitions data
in blocks to reduce the number of records that need to be examined. For further
information about the blocking key, see Importing match rules from the studio repository.
The Default Survivorship Rules parameters
(available only for rules with the T-Swoosh algorithm).

In the basic settings of the tMatchGroup
component, select Simple VSR from the Matching Algorithm list.

It is important to have the same type of the matching algorithm selected in the
basic settings of the component and defined in the configuration wizard. Otherwise
the Job runs with default values for the parameters which are not compatible between
the two algorithms.
In the basic settings of the tMatchGroup
component, click Preview to open the configuration
wizard.
Click the [+] button on the top right corner of
the Configuration view.

This creates, in a new tab, an exact copy of the last configuration.
Edit or set the parameters for the new configuration in the Key definition and Blocking
Selection tables.
If needed, define several match rules for the open configuration as
follows:
1. Click the [+] button on the match rule bar to
  create an exact copy of the last rule in a new tab.
2. Set the parameters for the new rule in the Key
  definition table and define its match interval.
3. Follow the steps above to create as many match rules for a
  configuration as needed. You can define a different match interval for
  each rule.
When a configuration has multiple conditions, the Job conducts an OR
match operation. It evaluates data records against the first rule and the
records that match are not evaluated against the second rule and so
on.
Click the Chart button at the top right corner of
the wizard to execute the Job in the open configuration.

The matching results are displayed in the matching chart and table.

Follow the steps above to create as many new configurations in the wizard as
needed.
To execute the Job in a specific configuration, open the configuration in the
wizard and click the Chart button.

The matching results are displayed in the matching chart and table.
At the bottom right corner of the wizard, click either:
- OK to save the open configuration.
  
  You can save only one configuration in the wizard.
- Cancel to close the wizard and keep the
  configuration saved initially in the wizard.

For an example of a match rule with the T-Swoosh algorithm, see Scenario 3: Using survivorship functions to merge two records and create a master record.

Matching chart

From the matching chart, you can have a global picture about the duplicates in the
analyzed data.

The Hide groups less than parameter, which is set to
2 by default, enables you to decide what groups to show in the result
chart. Usually you want to hide groups of small group size.

For example, the above matching chart indicates that:

48 items are analyzed and classified into 18 groups according to a given match
rule and after excluding items that are unique, by setting the Hide groups less than parameter to 2.
11 groups have 2 items each. In each group, the 2 items are duplicates of each
other.
3 groups have 3 items each. In each group, these items are duplicates of one
another.
3 groups have 4 items each. In each group, these items are duplicates of one
another.
One single group has 5 duplicate items.

Matching table

From the matching table, you can read details about the different duplicates.

This table indicates the matching details of items in each group and colors the groups in
accordance with their color in the matching chart.

You can decide what groups to show in this table by setting the Hide
groups of less than parameter. This parameter enables you to hide groups of small
group size. It is set to 2 by default.

The buttons under the table help you to navigate back and forth through pages.

Importing match rules from the studio repository

From the tMatchGroup configuration wizard, you can import
match keys from the match rules created and tested in the
Profiling perspective of Talend Studio. You can then use these imported matching keys in your match
Jobs.

The tMatchGroup component enables you to import from the
Studio repository match rules based on the VSR or the T-Swoosh algorithms.

The VSR algorithm takes a set of records as input and groups similar encountered
duplicates together according to defined match rules. It compares pairs of records and
assigns them to groups. The first processed record of each group is the master record of the
group. The VSR algorithm compares each record with the master of each group and uses the
computed distances, from master records, to decide to what group the record should go.

The T-Swoosh algorithm enables you to find duplicates and to define how
two similar records are merged to create a master record, using a survivorship function.
These new merged records are used to find new duplicates. The difference with the VSR
algorithm is that the master record is in general a new record that does not exist in
the list of input records.

From the configuration wizard, click the import icon in the top right corner.

The [Match Rule Selector] wizard opens listing
all match rules created in the studio and saved in the repository.
Select the match rule you want to import into the tMatchGroup component and use on your data.

A warning message displays in the wizard if the match rule you want to import is
defined on columns that do not exist in the input schema of tMatchGroup. You can define input columns later in the configuration
wizard.

It is important to have the same type of the matching algorithm selected in the
basic settings of the component and imported from the configuration wizard.
Otherwise the Job runs with default values for the parameters which are not
compatible between the two algorithms.
Select the Overwrite current Match Rule in the
analysis check box if you want to replace the rule in the
configuration wizard with the rule you import.

If you leave the box unselected, the match keys will be imported in a new match
rule tab without overwriting the current match rule in the wizard.
Click OK.

The matching key is imported from the match rule and listed as a new rule in the
configuration wizard.
Click in the Input Key Attribute and select from
the input data the column on which you want to apply the matching key.
In the Match threshold field, enter the match
probability threshold. Two data records match when the computed match score is above
this value.
In the Blocking Selection table, select the
column(s) from the input flow which you want to use as a blocking key.

Defining a blocking key is not mandatory but advisable. Using a blocking key
partitions data in blocks and so reduces the number of records that need to be
examined, as comparisons are restricted to record pairs within each block. Using
blocking key(s) is very useful when you are processing large data sets.

The Blocking Selection table in the component is
different from the Generation of Blocking Key table
in the match rule editor in the Profiling perspective.

The blocking column in tMatchGroup can come
from a tGenKey component (in which case it is called
T_GEN_KEY) or directly from the input schema (for instance, a
ZIP column). The Generation of Blocking Key table in the match rule editor, by contrast, defines
the parameters necessary to generate a blocking key, which makes it equivalent to the
tGenKey component. It generates a blocking column named
BLOCK_KEY that is used for blocking.
Click the Chart button in the top right corner of
the wizard to execute the Job using the imported match rule and show the matching
results in the wizard.

Matching two records

You can use the tMatchGroup component to detect duplicates
and define how to merge similar records to create a master record.

Creating a master record is an iterative process: each new master record can be used to
find new duplicates.

You can choose between two different algorithms to create master records:

Simple VSR Matcher
T-Swoosh

The main difference between the two algorithms is that T-Swoosh creates, for each master
record, a new record that does not exist in the list of input records.

Matching measures

To compare one attribute of two records, you can use any of the implemented matching
functions, such as Exact, Levenshtein and
Jaro-Winkler, or a custom matching algorithm you
created.
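
As a rough illustration of what a matching function does, the sketch below implements three of the simpler functions described in the basic settings: Exact, Exact – ignore case and Hamming. The function names are hypothetical and the code only mirrors the descriptions given in this document. It is not the component's own implementation.

```python
def exact(a: str, b: str) -> float:
    """Return 1 when the two strings match exactly, otherwise 0."""
    return 1.0 if a == b else 0.0


def exact_ignore_case(a: str, b: str) -> float:
    """Same as exact, but ignoring the case of the values."""
    return 1.0 if a.casefold() == b.casefold() else 0.0


def hamming_distance(a: str, b: str) -> int:
    """Count the substitutions needed to turn a into b (same length only)."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


print(exact("Doe", "doe"))                     # 0.0
print(exact_ignore_case("Doe", "doe"))         # 1.0
print(hamming_distance("masking", "pairing"))  # 3
```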

You can also compare two records on many attributes. For two records to match, the two
following conditions must hold:

When using the T-Swoosh algorithm, the score for each matching function in the
match rule must exceed the threshold, if one is specified. By default, the
threshold is set to 1. This means an exact match for most
matching functions, except for Exact – ignore case and
potentially any custom matching function.
The global score, computed as a weighted average of the scores of the different matching
functions, must exceed the match threshold. The score is equal to
Σ_i (w_i × s_i(r₁,r₂)) / Σ_i w_i, where w_i is the confidence
weight of the matching function i and s_i(r₁,r₂) is the score of the
matching function i over records r₁ and r₂.

In this example, the score for the Levenshtein metric on the
attribute country must exceed 0.7 and
the global score, with a confidence weight of 1 on each of
the two measures, must exceed 0.85.

This example shows the weighted average computation that yields the global score of
two similar records.
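
The two conditions can be sketched as follows. This is a minimal illustration, assuming that "exceed" means greater than or equal to (otherwise the default threshold of 1 could never be reached); the function names and the sample scores are hypothetical.

```python
from typing import Optional, Sequence


def global_score(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of the per-function scores: sum(w_i * s_i) / sum(w_i)."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    return sum(w * s for w, s in zip(weights, scores)) / total_weight


def records_match(
    scores: Sequence[float],
    weights: Sequence[float],
    match_threshold: float,
    thresholds: Optional[Sequence[Optional[float]]] = None,
) -> bool:
    """Apply the per-function thresholds (T-Swoosh only), then the match threshold."""
    if thresholds is not None:
        for score, threshold in zip(scores, thresholds):
            # Assumption: a score equal to the threshold is accepted.
            if threshold is not None and score < threshold:
                return False
    return global_score(scores, weights) >= match_threshold


# Hypothetical scores: 0.8 for Levenshtein on country, 1.0 for a second measure,
# a confidence weight of 1 on each, a threshold of 0.7 on country and a
# match threshold of 0.85.
print(global_score([0.8, 1.0], [1, 1]))
print(records_match([0.8, 1.0], [1, 1], 0.85, thresholds=[0.7, None]))
```

With these sample values the global score is about 0.9, so the pair matches. Lowering the country score to 0.6 would fail the per-function threshold regardless of the global score.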

Match rules

You can define different match rules for the same dataset.

Two records match if at least one of the match rules is satisfied. As soon as two records
match according to a given rule, the other rules are not checked.
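
The OR behavior across match rules amounts to a short-circuit evaluation. In the sketch below, each rule is modeled as a hypothetical function that takes two records and returns true or false; the helper name `first_satisfied_rule` and the sample rules are assumptions made for illustration.

```python
from typing import Callable, Dict, Optional, Sequence

Record = Dict[str, str]
Rule = Callable[[Record, Record], bool]


def first_satisfied_rule(rules: Sequence[Rule], r1: Record, r2: Record) -> Optional[int]:
    """Return the index of the first rule that matches, or None if none does.

    Rules after the first satisfied one are never evaluated.
    """
    for index, rule in enumerate(rules):
        if rule(r1, r2):
            return index
    return None


# Two hypothetical rules: same name ignoring case, or same ZIP code.
rules = [
    lambda a, b: a["name"].casefold() == b["name"].casefold(),
    lambda a, b: a["zip"] == b["zip"],
]
left = {"name": "John Doe", "zip": "75001"}
right = {"name": "JOHN DOE", "zip": "10001"}
print(first_satisfied_rule(rules, left, right))  # 0
```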

Matching several records

Blocking

To avoid doing a two-by-two comparison of all the input records, you can define one
or many blocking keys to split the input dataset into smaller datasets called
blocks.

In each block, the blocking keys must have the same value. Then, each block is processed
independently.
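
Conceptually, blocking is a group-by on the blocking key. The sketch below is an illustration only: the helper `partition_into_blocks` and the sample records are hypothetical, and in a Job the key would typically come from an input column or from tGenKey.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List


def partition_into_blocks(
    records: Iterable[dict], blocking_key: Callable[[dict], Hashable]
) -> Dict[Hashable, List[dict]]:
    """Group records so that all records in a block share the same key value."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    return dict(blocks)


records = [
    {"id": 1, "name": "John Doe", "zip": "75001"},
    {"id": 2, "name": "Donna Lewis", "zip": "10001"},
    {"id": 3, "name": "John B. Doe", "zip": "75001"},
]
blocks = partition_into_blocks(records, lambda r: r["zip"])
for key, block in blocks.items():
    # Matching would then run inside each block independently.
    print(key, [r["id"] for r in block])
```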

Using blocking keys reduces the time needed by the Simple VSR Matcher and the T-Swoosh
algorithms to process data. For example, if 100,000 records are split into 100 blocks of
1,000 records each, the number of comparisons is reduced by a factor of 100. This means
the algorithm runs around 100 times faster.

It is recommended to use the tGenKey component to generate
blocking keys and to visualize the statistics regarding the number of blocks. In a Job,
right-click the tGenKey component and select View Key
Profile in the contextual menu to visualize the distribution of the
number of blocks according to their size.

In this example, the average block size is around 40.

For the 13 blocks with 38 rows, there are 18,772 comparisons in these 13 blocks (13 × 38²).
If records are compared with four columns, this means there will be 75,088 string
comparisons in these 13 blocks (18,772 × 4).
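
The comparison counts quoted in this section can be checked with a few lines of arithmetic. The sketch counts n × n comparisons for a block of n records, which is the convention used in the figures above; the function and variable names are hypothetical.

```python
from typing import Iterable


def comparisons(block_sizes: Iterable[int]) -> int:
    """Total record comparisons, counting size * size per block."""
    return sum(size * size for size in block_sizes)


# 100,000 records in one block versus 100 blocks of 1,000 records each.
unblocked = comparisons([100_000])
blocked = comparisons([1_000] * 100)
reduction = unblocked // blocked

# Key profile excerpt: block size -> number of blocks of that size.
key_profile = {38: 13}
record_comparisons = sum(count * size * size for size, count in key_profile.items())
string_comparisons = record_comparisons * 4  # four compared columns

print(reduction)           # 100
print(record_comparisons)  # 18772
print(string_comparisons)  # 75088
```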

The Simple VSR Matcher algorithm

The Simple VSR Matcher algorithm compares each record within the same block with the
previous master records in the lookup table.

If a record does not match any of the previous master records, it is considered as a new
master record and added to the lookup table. This means that the first record of the
dataset is necessarily a master record.

When a record matches a master record, the Simple VSR Matcher algorithm does not
attempt to match it against the remaining master records, because the master records in the lookup
table are not similar to one another. So, once a record matches a master record, the chance of it matching
another master record is low.

This means a record can only exist in one group of records and be linked to one master
record.

For example, take the following set of records as input:

id	fullName
1	John Doe
2	Donna Lewis
3	John B. Doe
4	Louis Armstrong

The algorithm processes the input records as follows:

The algorithm takes record 1 and compares it with an empty set of records. Since
record 1 does not match any record, it is added to the lookup table.
The algorithm takes record 2 and compares it with record 1. Since it is not a match,
record 2 is added to the lookup table.
The algorithm takes record 3 and compares it with record 1 and record 2. Record 3
matches record 1. So, record 3 is added to the group of record 1.
The algorithm takes record 4 and compares it with record 1 and record 2 but not
with record 3, which is not a master record. Since it is not a match, record 4
is added to the lookup table.

The output will look like this:

id	fullName	GRP_ID	GRP_SIZE	MASTER	SCORE	GRP_QUALITY
1	John Doe	0	2	true	1.0	0.72
3	John B. Doe	0	0	false	0.72	0
2	Donna Lewis	1	1	true	1.0	1.0
4	Louis Armstrong	2	1	true	1.0	1.0
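
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the component's implementation: `difflib.SequenceMatcher` stands in for a real matching function such as Jaro-Winkler, the 0.8 threshold is arbitrary, and the `simple_vsr` helper is hypothetical. Because the similarity function differs, the scores will not equal the 0.72 shown in the table, but the grouping is the same.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in for a real matching function such as Jaro-Winkler.
    return SequenceMatcher(None, a, b).ratio()


def simple_vsr(records, threshold=0.8):
    """Group the (id, value) records of one block; return rows keyed by id."""
    masters = []  # the lookup table of master rows
    rows = {}
    for rec_id, value in records:
        for master in masters:
            score = similarity(value, master["value"])
            if score >= threshold:
                # The first matching master wins; other masters are not tried.
                rows[rec_id] = {"value": value, "GID": master["GID"],
                                "GRP_SIZE": 0, "MASTER": False,
                                "SCORE": score, "GRP_QUALITY": 0}
                master["GRP_SIZE"] += 1
                master["GRP_QUALITY"] = min(master["GRP_QUALITY"], score)
                break
        else:
            # No master matched: the record becomes a new master.
            master = {"value": value, "GID": len(masters), "GRP_SIZE": 1,
                      "MASTER": True, "SCORE": 1.0, "GRP_QUALITY": 1.0}
            masters.append(master)
            rows[rec_id] = master
    return rows


records = [(1, "John Doe"), (2, "Donna Lewis"),
           (3, "John B. Doe"), (4, "Louis Armstrong")]
rows = simple_vsr(records)
```

Record 4 is never compared with record 3 here, because only master rows are kept in the lookup table.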

The T-Swoosh algorithm

The T-Swoosh algorithm is based on the same idea as the Simple VSR Matcher algorithm,
but it creates a master record instead of considering existing records to be master
records.

To create master records, you can design survivorship rules to decide which attribute
will survive.

There are two types of survivorship rules:

The rules related to matching keys: each attribute used as a matching key can
have a specific survivorship rule.
The default rules: they are applied to all the attributes of the same data type
(Boolean, String, Date, Number).

If a column is a matching key, the rule related to matching keys specific to this column
is applied.

If the column is not a matching key, the default survivorship rule for this data type is
applied. If the default survivorship rule is not defined for the data type, the
Most common survivorship function is used.
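
The order in which a survivorship function is chosen can be sketched as follows. The function bodies, the `pick_survivorship` helper and the sample rule tables are assumptions made for this illustration; only the resolution order (matching-key rule, then default rule for the data type, then Most common) comes from the description above.

```python
from collections import Counter


def most_common(values):
    """Keep the value that occurs most often."""
    return Counter(values).most_common(1)[0][0]


def concatenate(values, separator=","):
    """Join all values with a separator."""
    return separator.join(str(v) for v in values)


def longest(values):
    """Keep the longest value."""
    return max(values, key=len)


def pick_survivorship(column, data_type, key_rules, default_rules):
    """Resolve the survivorship function for one column."""
    if column in key_rules:                 # 1. rule tied to a matching key
        return key_rules[column]
    # 2. default rule for the data type, 3. otherwise Most common
    return default_rules.get(data_type, most_common)


# Hypothetical configuration: one matching key, one default rule.
key_rules = {"first_name": concatenate}
default_rules = {"String": longest}
```

For example, with this configuration a Number column such as age falls through to `most_common`, because no default rule is defined for that data type.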

Each time two records are merged to create a new master record, this new master record is
added to the queue of records to be examined. The two records that are merged are
removed from the lookup table.

For example, take the following set of records as input:

id	fullName
1	John Doe
2	Donna Lewis
3	John B. Doe
4	Johnnie B. Doe

The survivorship rule uses the Concatenate function with
"," as a parameter to separate values.

At the beginning of the process, the queue contains all the input records and the lookup
is empty. To process the input records, the algorithm iterates until the queue is
empty:

The algorithm takes record 1 and compares it with an empty set of records. Since
record 1 does not match any record, it is added to the set of master records.
The queue contains now record 2, record 3 and record 4. The lookup contains
record 1.
The algorithm takes record 2 and compares it with record 1. Since record 2 does
not match any record, it is added to the set of master records. The queue
contains now record 3 and record 4. The lookup contains record 1 and record
2.
The algorithm takes record 3 and compares it with record 1. Record 3 matches
record 1. So, record 1 and record 3 are merged to create a new master record
called record 1,3. The queue contains now record 4 and record 1,3. The lookup
contains record 2.
The algorithm takes record 4 and compares it with record 2. Since it is not a
match, record 4 is added to the set of master records. The queue contains now
record 1,3. The lookup table contains record 2 and record 4.
The algorithm takes record 1,3 and compares it with record 2 and record 4. Record
1,3 matches record 4. So, record 1,3 and record 4 are merged to create a new
master record called record 1,3,4. Record 4 is removed from the lookup table.
Since record 1,3 was the result of a previous merge, it is removed from the
table. The queue now contains record 1,3,4. The lookup contains record 2.
The algorithm takes record 1,3,4 and compares it with record 2. Since it is not a
match, record 1,3,4 is added to the set of master records. The queue is now
empty. The lookup contains records 1,3,4 and record 2.

The output will look like this:

id	fullName	GRP_ID	GRP_SIZE	MASTER	SCORE	GRP_QUALITY
1,3,4	John Doe, John B. Doe, Johnnie B. Doe	0	3	true	1.0	0.72
1	John Doe	0	0	false	0.72	0
3	John B. Doe	0	0	false	0.72	0
4	Johnnie B. Doe	0	0	false	0.78	0
2	Donna Lewis	1	1	true	1.0	1.0
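
The queue and lookup mechanics can be sketched in Python using the four names that appear in the output table above. This is a simplified illustration, not the component's code: `difflib.SequenceMatcher` stands in for the real matching function, the 0.8 threshold is arbitrary, and the `t_swoosh`, `merge`, `is_match` and `candidates` helpers are hypothetical. A match is accepted if any value on one side (the merged value or one of its source values) matches any value on the other side.

```python
from collections import deque
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in for a real matching function such as Jaro-Winkler.
    return SequenceMatcher(None, a, b).ratio()


def candidates(record):
    """The merged value plus the values of the records it was built from."""
    return {record["value"]} | {value for _, value in record["sources"]}


def is_match(a, b, threshold):
    return any(similarity(x, y) >= threshold
               for x in candidates(a) for y in candidates(b))


def merge(a, b):
    """Concatenate survivorship: join source values with ', ' in id order."""
    sources = sorted(a["sources"] + b["sources"])
    return {"value": ", ".join(v for _, v in sources), "sources": sources}


def t_swoosh(records, threshold=0.8):
    queue = deque({"value": v, "sources": [(i, v)]} for i, v in records)
    lookup, iterations = [], 0
    while queue:
        iterations += 1
        current = queue.popleft()
        match = next((m for m in lookup if is_match(current, m, threshold)), None)
        if match is None:
            lookup.append(current)              # becomes a master record
        else:
            lookup.remove(match)                # merged records leave the lookup
            queue.append(merge(match, current)) # new master is examined again
    return lookup, iterations


records = [(1, "John Doe"), (2, "Donna Lewis"),
           (3, "John B. Doe"), (4, "Johnnie B. Doe")]
lookup, iterations = t_swoosh(records)
```

In this sketch the loop runs six times for four input records, because each merge puts a new master record back on the queue.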

The differences between the Simple VSR Matcher and the T-Swoosh algorithms

When processing the input data using the T-Swoosh algorithm, there
may be more iterations than the number of input records because a merged record may be
created on each iteration and added to the queue.

This is one of the main differences
between the Simple VSR Matcher and the T-Swoosh algorithms.

When comparing a record with a master record, the T-Swoosh algorithm makes more
comparisons per iteration than the Simple VSR Matcher algorithm:

When using the Simple VSR Matcher algorithm, the record from the queue is
only compared with the value of the master record.
When using the T-Swoosh algorithm, the record from the queue is compared with
the value of the master record and the value of each of the records used to
build this master record.

You can find an example of how to survive master
records using the T-Swoosh algorithm in Talend Help Center (https://help.talend.com).

In this example, the record “John Doe, John B. Doe” is compared with the
record “Johnnie B. Doe” on iteration 5. There is a match if at least one of the
three strings “John Doe, John B. Doe”, “John Doe” and “John B. Doe” matches
the string “Johnnie B. Doe”.

Multi-pass matching

You can design a Job with consecutive tMatchGroup components
to create data partitions based on different blocking keys.

For example, you want to find duplicates having either the same city or the same zip code
in a customer database. In this particular case, you can use two consecutive
tMatchGroup components to process the data partitions:

One tMatchGroup in which the column “city” is defined as a
blocking key.
One tMatchGroup in which the column “ZipCode” is defined as a
blocking key.

What is multi-pass matching?

The idea behind multi-pass matching is to reuse the master records defined in the
previous pass as the input of the current tMatchGroup
component. Multi-pass matching is more effective when the blocking keys are nearly
uncorrelated. For example, it is not useful to define the column “country” as a
blocking key and the column “city” as another blocking key, because all the
comparisons made with the blocking key “city” will also be made with the blocking key
“country”.

When using multi-pass matching with the VSR algorithm, only master records of size 1
– records that did not match any record – are compared with master records of any
size. There are no comparisons between two master records that are derived from at
least two children each.

An example of multi-pass matching

In the following example, the dataset contains four records. It is assumed that the
first tMatchGroup component has a blocking key on the column
“ZipCode”, and the second tMatchGroup component has a
blocking key on the column “city”. The attribute “name” is used as a matching
key.

id	name	city	ZipCode
1	John Doe	Nantes	44000
2	John B. Doe	Nantes
3	Jon Doe	Nantes	44000
4	John Doe	Nantes

After the first pass, records 1 and 3 are grouped, and records 2 and 4 are grouped.
In these groups, record 1 and record 2 are master records.

In the second tMatchGroup, only the master records from the
first pass, record 1 and record 2, are candidates for comparison. Since the group size
of each is strictly greater than 1, they are not compared with each other.
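
The rule that decides whether two master records are compared in a later VSR pass can be written as a one-line predicate. The function name and the dictionary below are hypothetical; the rule itself is the one stated above, that at least one of the two masters must have a group size of 1.

```python
def compared_in_next_pass(size_a, size_b):
    """VSR multi-pass rule: at least one of the two masters has group size 1."""
    return min(size_a, size_b) == 1


# Master id -> GRP_SIZE after the first pass in this example.
first_pass_masters = {1: 2, 2: 2}
compare_1_and_2 = compared_in_next_pass(first_pass_masters[1],
                                        first_pass_masters[2])
```

Here `compare_1_and_2` is false, which is why the two groups from the first pass stay separate in the VSR results.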

The following results are returned:

id	name	city	ZipCode	GID	GRP_SIZE	MASTER	SCORE	GRP_QUALITY
1	John Doe	Nantes	44000	0	2	true	1.0	0.875
3	Jon Doe	Nantes	44000	0	0	false	0.875	0
2	John B. Doe	Nantes		1	2	true	1.0	0.72
4	John Doe	Nantes		1	0	false	0.72	0

When running the T-Swoosh algorithm with the same parameters and the Most
common survivorship function, the following results are
returned:

id	name	city	ZipCode	GID	GRP_SIZE	MASTER	SCORE	GRP_QUALITY
1	John Doe	Nantes	44000	0	4	true	1.0	0.72
1	John Doe	Nantes	44000	0	0	true	0.875	0
3	Jon Doe	Nantes	44000	0	0	false	0.875	0
2	John B. Doe	Nantes		0	0	true	0.72	0
4	John Doe	Nantes		0	0	false	0.72	0

Scenario 1: Grouping output data in separate flows according to the minimal distance
computed in each record

This scenario applies only to a subscription-based Talend Platform solution or Talend Data Fabric.

This scenario describes a basic Job that compares columns in the input file using the
Jaro-Winkler matching method on the
lname and fname columns and the Jaro matching method on the address1
column. It then groups the output records in three output flows:

Uniques: lists the records whose group
score (minimal distance computed in the record) is equal to
1.
Matches: lists the records whose group
score (minimal distance computed in the record) is higher than the threshold
you define in the Confidence threshold
field.
Suspects: lists the records whose group
score (minimal distance computed in the record) is below the threshold you
define in the Confidence threshold
field.
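
The routing into the three flows can be pictured with a small function. This is only a sketch under assumptions: it treats a group of size 1 as unique, uses the group quality as the group score, and sends a score exactly equal to the threshold to Matches. The function name and the sample threshold are hypothetical.

```python
def output_flow(grp_size, grp_quality, confidence_threshold):
    """Pick the output flow for a group from its size and group score."""
    if grp_size == 1:
        return "Uniques"    # the record matched nothing else
    if grp_quality >= confidence_threshold:
        return "Matches"    # group score at or above the threshold (assumed tie rule)
    return "Suspects"       # matched, but with lower confidence


threshold = 0.9  # hypothetical Confidence threshold value
flows = [
    output_flow(1, 1.0, threshold),
    output_flow(3, 0.96, threshold),
    output_flow(2, 0.87, threshold),
]
```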

For another scenario that groups the output records in one single output flow, see
Scenario 2: Comparing columns and grouping in the output flow duplicate records that have the same functional key.

Setting up the Job

Drop the following components from the Palette onto the design workspace: tFileInputExcel, tMatchGroup and three tLogRows.
Connect tFileInputExcel to tMatchGroup using the Main
link.
Connect tMatchGroup to the three
tLogRow components using the Unique rows, Confident
groups and Uncertain groups
links.

Warning:

To be able to set three different output flows for the processed
records, you must first select the Separate
output check box in the Advanced
settings view of the tMatchGroup component. For further information, see the
section about configuring the tMatchGroup component.

Configuring the input component

The main input file contains eight columns: account_num,
lname, fname,
mi, address1,
city, state_province and
postal_code. The data in this input file has problems
such as duplication, names spelled differently or wrongly, and different information
for the same customer.

You can create the input file used in this scenario by executing the
c0 and c1 Jobs included in the
data quality demo project, TDQEEDEMOJAVA, which you can import from the login window
of your Talend Studio. For further information, see the Talend Studio User Guide.

In the Basic settings view of tFileInputExcel, fill in the File Name field by browsing to the input file
and set other properties in case they are not stored in the Repository.
Create the schema through the Edit Schema
button, if the schema is not already stored in the Repository. Remember to set the data type in the Type column.

Configuring the tMatchGroup component

Double-click tMatchGroup to display the
Basic settings view and define the
component properties.
From the Matching Algorithm list, select
Simple VSR.

In this scenario, the match rule is based on the VSR algorithm.
Click Sync columns to retrieve the schema
from the preceding component.
Click the Edit schema button to view the
input and output schema and do any modifications in the output schema, if
necessary.

In the output schema of this component there are a few standard output columns
that are read-only. For more information, see tMatchGroup Standard properties.
Click OK to close the dialog box.
Click Preview to open the configuration
wizard and define the component configuration and the match rule(s).

You can use the configuration wizard to import match rules created and tested
in the studio and stored in the repository, and use them in your match Jobs. For
further information, see Importing match rules from the studio repository.

It is important to import or define the same type of the rule selected in the
basic settings of the component, otherwise the Job runs with default values for
the parameters which are not compatible between the two algorithms.
Define the first match rule as the following:
- In the Key definition table, click
  the [+] button to add to the list the
  column(s) on which you want to do the matching operation,
  lname and fname.
  
  Note:
  
  When you select a date column on which to apply an algorithm or a matching algorithm,
  you can decide what to compare in the date format.
  
  For example, if you want to only compare the year in the date, in the component schema
  set the type of the date column to Date and then enter
  “yyyy” in the Date
  Pattern field. The component then converts the date format to a string
  according to the pattern defined in the schema before starting a string
  comparison.
- Click in the Matching Function column
  and select from the list Jaro-Winkler
  as the method to be used for the matching operation.
  
  If you select custom as a matching
  type, you must set in the Custom Matcher
  column the path pointing to the custom class (external
  matching algorithm) you need to use. You define this path in
  the library file (.jar file).
- From the Tokenized measure list,
  select not to use a tokenized distance for the selected
  algorithms.
- Click in the cell of the Weight
  column to set the numerical weights for the two columns used as key
  attributes.
- Click in the cell of the Handle Null
  column and select the null operator you want to use to handle null
  attributes in the columns. In this example, select Null Match None in order to have matching results where
  null values have minimal effect.
- Set the match probability in the Match
  Interval field.
Follow the same procedure in the above step to define the second match rule.

Set the address1 column as an input attribute and select
Jaro as the matching type. Select Null Match None as the null operator. And finally set
the match probability which can be different from the one set for the first
rule.
Set the Hide groups of less than parameter in
order to decide what groups to show in the result chart and matching table. This
parameter enables you to hide groups of small group size.
Click the Advanced settings tab and set the
advanced parameters for the tMatchGroup
component as the following:
- Select the Separate output check box.
  
  The component will have three separate output flows: Unique rows, Confident
  groups and Uncertain
  groups.
  
  If this check box is not selected, the tMatchGroup component will have only one output flow
  where it groups all output data. For an example scenario, see Scenario 2: Comparing columns and grouping in the output flow duplicate records that have the same functional key.
- Select the Sort the output data by
  GID check box to sort the output data by their group
  identifier.
- Select the Output distance details
  and Display detailed labels check
  boxes.
  
  The component will output the MATCHING_DISTANCES
  column. This column provides the distance between the input and the
  master columns giving also the names of the columns against which the
  records are matched.
- Select the Deactivate matching computation when
  opening the wizard check box if you do not want to run
  the match rules the next time you open the wizard.
Click the Chart button in the wizard to
execute the Job in the defined configuration and have the matching results
directly in the wizard.

The matching chart gives a global picture about the duplicates in the analyzed
data. The matching table indicates the details of items in each group and colors
the groups in accordance with their color in the matching chart.

The Job conducts an OR match operation on the records. It evaluates the
records against the first rule and the records that match are not evaluated
against the second rule. The MATCHING_DISTANCES
column allows you to understand which rule has been used on what records. In
the yellow data group for example, the Amole Sarah record
is matched according to the second rule that uses address1
as a key attribute, whereas the other records in the group are matched according
to the first rule which uses the lname and
fname as key attributes.
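
The OR evaluation of the two rules can be sketched as follows. This is an illustration under assumptions, not the component's implementation: `difflib.SequenceMatcher` stands in for Jaro-Winkler and Jaro, a rule's score is assumed to be the weighted average of its attribute similarities, and the sample address values and the `first_matching_rule` helper are invented for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in for the real matching functions (Jaro-Winkler, Jaro).
    return SequenceMatcher(None, a, b).ratio()


def rule_score(rule, record, master):
    """Weighted average of the attribute similarities (assumed scoring)."""
    total_weight = sum(weight for _, weight in rule["keys"])
    weighted = sum(weight * similarity(record[col], master[col])
                   for col, weight in rule["keys"])
    return weighted / total_weight


def first_matching_rule(rules, record, master):
    """Return (rule index, score) of the first rule that matches, else (None, 0.0)."""
    for index, rule in enumerate(rules):
        score = rule_score(rule, record, master)
        if score >= rule["match_interval"]:
            return index, score  # later rules are not evaluated
    return None, 0.0


rules = [
    {"keys": [("lname", 1), ("fname", 1)], "match_interval": 0.95},
    {"keys": [("address1", 1)], "match_interval": 0.85},
]
# Hypothetical records for illustration only.
master = {"lname": "Amole", "fname": "Sarah", "address1": "12 Main Street"}
same_name = {"lname": "Amole", "fname": "Sarah", "address1": "PO Box 7"}
same_address = {"lname": "Smith", "fname": "Robert", "address1": "12 Main Street"}
unrelated = {"lname": "Smith", "fname": "Robert", "address1": "99 Oak Avenue"}
```

A record that satisfies the first rule never reaches the second one, which is why each record in the output can be attributed to exactly one rule.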

You can set the Hide groups of less than
parameter in order to decide what groups to show in the matching chart and
table.

Finalizing the Job and executing it

Double-click each of the tLogRow
components to display the Basic
settings view and define the component properties.
Save your Job and press F6 to execute
it.

You can see that records are grouped together in three different groups.
Each record is listed in one of the three groups according to the value of
the group score which is the minimal distance computed in the group.

The identifier for each group, which is of String
data type, is listed in the GID column next to the
corresponding record. This identifier will be of the data type
Long for Jobs that are migrated from older
releases. To have the group identifier as String, you
must replace the tMatchGroup component in
the imported Job with tMatchGroup from the
studio Palette.

The number of records in each of the three output blocks is listed in the
GRP_SIZE column and computed only on the master
record. The MASTER column indicates with true or false
if the corresponding record is a master record or not a master record. The
SCORE column lists the calculated distance between
the input record and the master record according to the Jaro-Winkler and Jaro matching algorithms.

The Job evaluates the records against the first rule and the records that
match are not evaluated against the second rule.

All records whose group score is between the match interval,
0.95 or 0.85 depending on the
applied rule, and the confidence threshold defined in the advanced settings
of tMatchGroup are listed in the Suspects output flow.

All records whose group score is above one of the match probabilities are
listed in the Matches output flow.

All records that have a group size equal to 1 are listed in the Uniques output flow.

For another scenario that groups the output records in one single output flow
based on a generated functional key, see Scenario 2: Comparing columns and grouping in the output flow duplicate records that have the same functional key.

Scenario 2: Matching customer data through multiple passes

This scenario applies only to a subscription-based Talend Platform solution or Talend Data Fabric.

The Job in this scenario groups similar customer records by running through two
successive matching passes (tMatchGroup components) and
outputs the calculated matches in groups. Each pass provides its matches to the pass
that follows in order for the latter to add more matches identified with new rules and
blocking keys.

In this Job:

The tMysqlInput component connects to the
customer records to be processed.
Each of the tGenKey components defines a way
to partition data records. The first key partitions data into many groups and the
second key creates fewer groups that overlap the previous blocks depending on
the blocking key definition.
The tMap component renames the key generated
by the second tGenKey component.
The first tMatchGroup processes the
partitions defined by the first tGenKey, and
the second tMatchGroup processes those defined
by the second tGenKey.

Warning:

The two tMatchGroup components must
have the same schema.
The tLogRow component presents the matching
results after the two passes.

Setting up the Job

In this scenario, the main input schema is already stored in the Repository. For more information about storing schema
metadata in the repository, see the
Talend Studio User
Guide.

In the Repository tree view, expand
Metadata – DB
Connections where you have stored the main input schema and
drop the database table onto the design workspace. The input table used in
this scenario is called customer.

A dialog box is displayed with a list of components.
Select the relevant database component, tMysqlInput in this example, and then click OK.
Drop two tGenKey components, two
tMatchGroup components, a tMap and a tLogRow component from the Palette onto the design workspace.
Link the input component to the tGenKey
and tMap components using Main links.
In the two tMatchGroup components, select the
Output distance details check boxes in the
Advanced settings view of both components
before linking them together.

This will provide the MATCHING_DISTANCES column in the
output schema of each tMatchGroup.

If the two tMatchGroup components are already
linked to each other, you must select the Output distance
details check box in the second component in the Job flow first
otherwise you may have an issue.
Link the two tMatchGroup components and
the tLogRow component using Main links.
If needed, give the components specific labels to reflect their usage in
the Job.

For further information about how to label a component, see the
Talend Studio User Guide.

Configuring input data and key generation

Connecting to the input data

Double-click tMysqlInput to open its
Component view.

The property fields for tMysqlInput are
automatically filled in. If you do not define your input schema locally in
the repository, fill in the details manually after selecting Built-in in the Schema and Property Type
lists.

The input table used in this scenario is called
customer.
Modify the query in the Query box to
select only the columns you want to match:
account_num, lname,
fname, mi,
address1, city,
state_province and
postal_code.

Configuring the key generation for the first pass

Double-click the first tGenKey to open
the Component view.
Click the import icon to import blocking keys from the match rules created and
tested in the Profiling perspective of Talend Studio, and use them in your Job.
Otherwise, define the blocking key parameters as described in the steps below.
Under the Algorithm table, click the
[+] button to add two rows in the
table.
In the column column, click the newly added row and
select from the list the column you want to process using an algorithm. In this
example, select lname.
Do the same on the second row to select
postal_code.
In the pre-algorithm column, click the newly
added row and select from the list the pre-algorithm you want to apply to the
corresponding column.

In this example, select remove diacritical marks and
convert to upper case to remove any diacritical mark and convert
the fields of the lname column to upper
case.

This conversion does not change your raw data.
In the algorithm column, click the newly
added row and select from the list the algorithm you want to apply to the
corresponding column. In this example, select N first
characters of each word.

If you select the Show help check box, you
can display instructions on how to set algorithms/options parameters.
Do the same for the second row on the algorithm column to select first N
characters of the string.
Click in the Value column next to the
algorithm column and enter the value for
the selected algorithm, when needed.

In this scenario, enter 1 for both rows. The first letter
of each field in the corresponding columns will be used to generate the
key.

Make sure to set a value for each algorithm that needs one, otherwise you may
have a compilation error when you run the Job.
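
The effect of these settings on the generated key can be approximated in Python. This is a sketch under assumptions: it supposes that the key is the concatenation of the per-column results in row order, and the helper names and sample values are invented. It is not the tGenKey implementation.

```python
import unicodedata


def remove_diacritics_upper(text):
    """Pre-algorithm: strip diacritical marks, then convert to upper case."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.upper()


def n_first_chars_of_each_word(text, n):
    return "".join(word[:n] for word in text.split())


def first_n_chars(text, n):
    return text[:n]


def blocking_key(lname, postal_code, n=1):
    """Assumed key layout: lname part followed by postal_code part."""
    part1 = n_first_chars_of_each_word(remove_diacritics_upper(lname), n)
    part2 = first_n_chars(postal_code, n)
    return part1 + part2


# Hypothetical sample values.
key_a = blocking_key("Müller", "44000")
key_b = blocking_key("de la Cruz", "75001")
```

The input values themselves are left untouched; only the derived key uses the normalized form, which matches the statement above that the conversion does not change your raw data.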

Configuring the key generation for the second pass

Double-click the second tGenKey to open
the Component view.
In the Algorithm table, define the column
you want to use to partition data, account_num in this
component. Select the first N characters of the
string algorithm and set the value to 1
in the Value column.

Each of the two tGenKey components will
generate a read-only T_GEN_KEY column in
the output schema. You must rename one of the T_GEN_KEY columns to stop them from overwriting each
other.
Double-click the tMap component to open
its editor.
In the Schema editor, copy the columns
from the first table onto the second table and rename T_GEN_KEY to T_GEN_KEY1, for example.
In the top part of the editor, drop all columns from the input table to
the output table.
Click OK to save the data transformation and
close the editor.
In the tGenKey basic settings, click the Edit schema
button to verify that the two generated keys are named
differently in the output schema.

Configuring the two matching passes

Configuring the first pass

In the basic settings of the tMatchGroup
labelled pass1, select Simple VSR from the Matching
Algorithm list.

In this scenario, the match rule is based on the VSR algorithm.
Click the Preview button to display the
Configuration Wizard.
Click the import icon to import matching keys from the match rules created and
tested in the Profiling perspective of Talend Studio, and use them in your Job.
Otherwise, define the matching key parameters as described in the steps below.

It is important to import or define the same type of the rule selected in the
basic settings of the component, otherwise the Job runs with default values for
the parameters which are not compatible between the two algorithms.
In the Key definition table, click the
[+] button to add the column(s) on which
you want to do the matching operation, lname
in this scenario.

Note:

When you select a date column on which to apply an algorithm or a matching algorithm,
you can decide what to compare in the date format.

For example, if you want to only compare the year in the date, in the component schema
set the type of the date column to Date and then enter
“yyyy” in the Date
Pattern field. The component then converts the date format to a string
according to the pattern defined in the schema before starting a string
comparison.
Select the Jaro-Winkler algorithm in the Matching Function column.
From the Tokenized measure list, select
Any order.
Set Weight to 1 and in
the Handle Null column, select the null
operator you want to use to handle null attributes in the column, Null Match Null in this scenario.
Click the [+] button below the Blocking Selection table to add one row in the table
then click in the line and select from the list the column you want to use as a
blocking value, T_GEN_KEY in this example.

Using a blocking value reduces the number of pairs of records that need to be
examined. The input data is partitioned into exhaustive blocks based on the
functional key. This decreases the number of pairs to compare, as comparison
is restricted to record pairs within each block.
If required, click Edit schema to open the
schema editor and see the schema retrieved from the previous component in the
Job.
Click the Advanced settings tab and select
the Sort the output data by GID check box to
arrange the output data by their group IDs.
Select the Deactivate matching computation when opening
the wizard check box if you do not want to run the match rules
the next time you open the wizard.

Configuring the second pass

In the basic settings of the tMatchGroup
labelled pass2, select Simple VSR from the Matching
Algorithm list.

In this scenario, the match rule is based on the VSR algorithm.
Click the Preview button to display the
Configuration Wizard.

If this component does not have the same schema as the preceding component, a
warning icon appears. If so, click the Sync
columns button to retrieve the schema from the preceding one.
Once done, the warning icon disappears.
In the Key Definition table, click
the [+] button to add the column(s) on
which you want to do the matching operation, lname in this scenario.

Note:

When you select a date column on which to apply an algorithm or a matching algorithm,
you can decide what to compare in the date format.

For example, if you want to only compare the year in the date, in the component schema
set the type of the date column to Date and then enter
“yyyy” in the Date
Pattern field. The component then converts the date format to a string
according to the pattern defined in the schema before starting a string
comparison.
Select the Jaro-Winkler algorithm in the Matching Function column.
Set Weight to 1 and in
the Handle Null column, select the null
operator you want to use to handle null attributes in the column, Null Match Null in this scenario.
Click the [+] button below the Blocking Selection table to add one row in the table
then click in the line and select from the list the column you want to use as a
blocking value, T_GEN_KEY1 in this example.
Click the Advanced settings tab and select
the Multi-pass check box. This option enables
tMatchGroup to receive data sets from the
tMatchGroup that precedes it in the Job.
In the Advanced settings view, select the
Sort the output data by GID check box to
arrange the output data by their group IDs.
Select the Deactivate matching computation when opening
the wizard check box if you do not want to run the match rules
the next time you open the wizard.

Executing the Job and showing the results on the console

In order to show the match groups created after the first pass and compare
them with the groups created after the second pass, you must modify the Job as
follows:

Use a tReplicate component to
replicate the input flow you want to process as shown in the above
figure. Use a copy/paste operation to create the two parts of the
Job.
Keep only the first pass, one tMatchGroup component, in the upper part of the Job
and show the match results in a tLogRow component.
Use two passes in the lower part of the Job and show the final
match results in a tLogRow
component.

Double-click each of the tLogRow
components to open the Component view and
in the Mode area, select the Table (print values in cells of a table)
option.
Save your Job and press F6 to execute
it.

The results after the first pass read as follows:

The results after the second pass read as follows:

When you compare, for example, the customer name Wilson from the results of the two passes, you will find
that more customers using the last name Wilson are grouped together after the second pass.

Scenario 3: Using survivorship functions to merge two records and create a master
record

This scenario applies only to a subscription-based Talend Platform solution or Talend Data Fabric.

This scenario describes a basic Job that uses the T-Swoosh algorithm to find duplicates in
customer names. It compares values in the first_name column using
matching and survivorship methods, groups similar records together and creates single
representations of entities in each group by merging two similar records according to a
survivorship rule.

These master records are new records that do not exist in the input data.

You can use several matching passes with the T-Swoosh algorithm in a Job. Each pass
provides the survived masters for each group and the original records to the second pass.
However, intermediate master records are removed from the output flow. Only final master and
original records are kept at the end.

In this Job:

The tFileInputDelimited component provides the
customer records to be processed.
The tMatchGroup component processes data using
matching and survivorship methods.
The tLogRow component presents the matching and
survivorship results.

Setting up the Job

In this scenario, the main input schema is already stored in the Repository. For more information about storing schema metadata in the
repository, see the
Talend Studio User Guide.

In the Repository tree view, expand Metadata – File
delimited where you have stored the main input schema and drop
the file connection onto the design workspace. The file connection used in this
scenario is called Customer_File.

A dialog box is displayed with a list of components.
Select tFileInputDelimited and click
OK.
Drop a tMatchGroup and a tLogRow from the Palette onto the design workspace.
Link the components together using the Main
links.

Configuring the input component

The main input file contains the columns: id,
first_name and age. The data in this input
file has problems such as duplication, names spelled differently or wrongly, different
information for the same customer.

Double-click tFileInputDelimited to open its
Component view.

The property fields are automatically filled in. If you have not stored your
input schema in the Repository, select Built-in from the Schema and Property Type
lists and fill in the details manually.
If the schema is not already stored in the Repository, create it through the Edit Schema
button. Remember to set the data type in the Type column.

Creating the match rule with a matching key and a survivorship function

Setting the schema and selecting the matching algorithm

Double-click tMatchGroup to display the
Basic settings view and define the
component properties.
From the Matching Algorithm list, select
T-Swoosh.

In this scenario, the match rule is based on the T-Swoosh algorithm.
Click Sync columns to retrieve the schema
from the preceding component.
Click the Edit schema button to view the
input and output schema and do any modifications in the output schema, if
necessary.

The output schema of this component contains a few standard output columns
that are read-only. For more information, see tMatchGroup Standard properties.
In the Advanced settings view, leave the
parameters selected by default.

Defining the match rule

In the tMatchGroup basic settings, click
Preview to open the configuration wizard
and define the matching key and the survivorship function.

You can use the configuration wizard to import match rules created and tested
in the studio and stored in the repository, and use them in your match Jobs. For
further information, see Importing match rules from the studio repository.

Make sure that the matching algorithm selected in the basic settings of the
component is the same as the one defined in the configuration wizard.
Otherwise, the Job runs with default values for the parameters that are not
compatible between the two algorithms.
Define the match rule as the following:
- In the Key definition table, click
  the [+] button to add a line in the
  table. Click in the Input Key Attribute
  column and select the column on which you want to do the matching
  operation, first_name in this scenario.
- Click in the Matching Function column
  and select Soundex from the list. This
  method matches processed entries according to a standard English
  phonetic algorithm which indexes strings by sound, as pronounced in
  English.
- From the Tokenized measure list,
  select not to use a tokenized distance for the selected
  algorithm.
- Set the Threshold to 0.8 and the Confidence Weight to 1.
- Select Null Match None in the
  Handle Null column in order to have
  matching results where null values have minimal effect.
- Select Most common in the Survivorship Function column. This method
  validates the most frequent name value in each group of
  duplicates.
Define the survivorship rule as the following:
- In the Default Survivorship Rules
  table, click the [+] button to add a
  line in the table. Click in the Data
  Type column and select Number.
- Click in the Survivorship Function
  column and select Largest (for numbers)
  from the list. This method validates the largest numerical value in each
  group.
Set the Hide groups of less than parameter to
decide which groups to show in the result chart and matching table. This
parameter enables you to hide groups of small size.
Click the Chart button in the wizard to
execute the Job in the defined configuration and have the results directly in
the wizard.

The matching chart gives a global picture about the duplicates in the analyzed
data. The matching table indicates the details of the items in each group,
colors the groups in accordance with their color in the matching chart and
indicates with true the records which are
master records. The master record in each group is the result of merging two
similar records according to the phonetic algorithm and survivorship rule. The
master record is a new record that does not exist in the input data.
Click OK to close the wizard.

Finalizing the Job and executing it

Double-click the tLogRow component to display
its Basic settings view and define the
component properties.
Save your Job and press F6 to execute
it.

Similar records are grouped together. The number of records in each group is
listed in the GRP_SIZE column and computed only on the
master record. The MASTER column indicates with true or false if
the corresponding record is a master record or not a master record. The master
record is created by merging the record which best matches the phonetic
algorithms with the record which has the largest age value. The
SCORE column lists the calculated similarity between
the input record and the master record according to the Soundex matching algorithm.

The identifier for each group, which is of String data
type, is listed in the GID column next to the corresponding
record. This identifier will be of the data type Long for
Jobs that are migrated from older releases. To have the group identifier as
String, you must replace the tMatchGroup component in the imported Job with tMatchGroup from the studio Palette.
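
The grouping and merging described in this scenario can be sketched outside the Studio. The Python code below is only an illustration, not the code generated by tMatchGroup: it assumes that two records match when their first_name values share the same Soundex code (a simplification of the 0.8 threshold), and it builds one new master per group with the Most common name and the Largest age. The function names and the sample records are hypothetical.

```python
from collections import Counter, defaultdict

# Standard English Soundex letter-to-digit table.
_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code of a name, or "" if it has no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]


def merge_groups(records):
    """Group records by Soundex code and emit a merged master before each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[soundex(rec["first_name"])].append(rec)

    output = []
    for gid, members in enumerate(groups.values(), start=1):
        names = Counter(m["first_name"] for m in members)
        master = {
            "first_name": names.most_common(1)[0][0],  # survivorship: Most common
            "age": max(m["age"] for m in members),      # survivorship: Largest
            "GID": str(gid), "GRP_SIZE": len(members), "MASTER": True,
        }
        output.append(master)
        if len(members) > 1:
            # Original records are kept next to the new master record.
            output.extend({**m, "GID": str(gid), "GRP_SIZE": 0, "MASTER": False}
                          for m in members)
    return output


customers = [
    {"id": 1, "first_name": "Jon", "age": 30},
    {"id": 2, "first_name": "John", "age": 32},
    {"id": 3, "first_name": "John", "age": 31},
    {"id": 4, "first_name": "Mary", "age": 40},
]
result = merge_groups(customers)
```

In this sketch, the master of the first group carries the name John and the age 32, a combination of values taken from different input records, which is why the master is a new record. Setting GRP_SIZE to 0 on non-master rows is an assumption made for the illustration.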

tMatchGroup MapReduce properties

These properties are used to configure tMatchGroup running in the MapReduce Job framework.

The MapReduce
tMatchGroup component belongs to the Data Quality family.

The component in this framework is available when you have subscribed to any Talend Platform product with Big Data or Talend Data
Fabric.

Basic settings

Schema and Edit schema	A schema is a row description, it defines the number of fields to be processed and passed on to the next component. The schema is either Built-in or stored remotely in the Repository. Click Sync columns to retrieve the schema from the previous component in the Job. The output schema of this component contains the following read-only fields: GID: provides a group identifier of the data type String. Note: All Jobs with tMatchGroup that are migrated from older releases into your current studio may provide a group identifier of the data type Long. If you want to have a group identifier of the data type String, you must replace the tMatchGroup component in these Jobs with tMatchGroup from the studio Palette. GRP_SIZE: counts the number of records in the group, computed only on the master record. MASTER: identifies, by `true` or `false`, if the record used in the matching comparisons is a master record. There is only one master record per group. Each input record will be compared to the master record, if they match, the input record will be in the group. SCORE: measures the distance between the input record and the master record according to the matching algorithm used. In case the tMatchGroup component is used to have multiple output flows, the score in this column decides to what output group the record should go. GRP_QUALITY: provides the quality of similarities in the group by taking the minimal matching value. Only the master record has a quality score.
	Built-in: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.
	Repository: You have already created and stored the schema in the Repository. You can reuse it in other projects and job designs. Related topic: see Talend Studio User Guide.
Matching Algorithm	Select from the list the algorithm you want to use in the component: Simple VSR is the only matching algorithm you can use with the Map/Reduce version of the component. If you converted a standard Job using tMatchGroup with the T-Swoosh algorithm to a Map/Reduce Job, select Simple VSR from the list and save the converted Job before its execution. Otherwise, an error occurs.
Key Definition	Input Key Attribute Select the column(s) from the input flow on which you want to apply a matching algorithm. Note: When you select a date column on which to apply an algorithm or a matching algorithm, you can decide what to compare in the date format. For example, if you want to only compare the year in the date, in the component schema set the type of the date column to Date and then enter “yyyy” in the Date Pattern field. The component then converts the date format to a string according to the pattern defined in the schema before starting a string comparison.
	Matching Function Select a matching algorithm from the list: Exact: matches each processed entry to all possible reference entries with exactly the same value. It returns 1 when the two strings exactly match, otherwise it returns 0. Exact – ignore case: matches each processed entry to all possible reference entries with exactly the same value while ignoring the value case. Soundex: matches processed entries according to a standard English phonetic algorithm. It indexes strings by sound, as pronounced in English, for example “Hello”: “H400”. Levenshtein (edit distance): calculates the minimum number of edits (insertion, deletion or substitution) required to transform one string into another. Using this algorithm in the tMatchGroup component, you do not need to specify a maximum distance. The component automatically calculates a matching percentage based on the distance. This matching score will be used for the global matching calculation, based on the weight you assign in the Confidence Weight field. Metaphone: Based on a phonetic algorithm for indexing entries by their pronunciation. It first loads the phonetics of all entries of the lookup reference and checks all entries of the main flow against the entries of the reference flow. Double Metaphone: a new version of the Metaphone phonetic algorithm, that produces more accurate results than the original algorithm. It can return both a primary and a secondary code for a string. This accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry. Soundex FR: matches processed entries according to a standard French phonetic algorithm. Jaro: matches processed entries according to spelling deviations. It counts the number of matched characters between two strings. The higher the distance is, the more similar the strings are. Jaro-Winkler: a variant of Jaro, but it gives more importance to the beginning of the string. 
Fingerprint key: matches entries after doing the following sequential process: 1. remove leading and trailing whitespace, 2. change all characters to their lowercase representation, 3. remove all punctuation and control characters, 4. split the string into whitespace-separated tokens, 5. sort the tokens and remove duplicates, 6. join the tokens back together (because the string parts are sorted, the given order of tokens does not matter, so Cruise, Tom and Tom Cruise both end up with the fingerprint cruise tom and therefore end up in the same cluster), 7. normalize extended western characters to their ASCII representation, for example gödel to godel. This normalization reproduces data entry mistakes made when entering extended characters with an ASCII-only keyboard. However, it can also lead to false positives: for example, gödel and godél would both end up with godel as their fingerprint, but they are likely to be different names. So this method might work less effectively for datasets where extended characters play a substantial differentiating role. q-grams: matches processed entries by dividing strings into letter blocks of length `q` in order to create a number of `q` length grams. The matching result is given as the number of q-gram matches over possible q-grams. Hamming: calculates the minimum number of substitutions required to transform one string into another string having the same length. For example, the Hamming distance between “masking” and “pairing” is 3. custom…: enables you to load an external matching algorithm from a Java library using the Custom Matcher column. For further information about how to load an external Java library, see tLibraryLoad. For further information about how to create a custom matching algorithm, see Creating a custom matching algorithm. For a related scenario about how to use a custom matching algorithm, see Scenario 2: Using a custom matching algorithm to match entries.
	Custom Matcher When you select Custom as the matching type, enter the path pointing to the custom class (external matching algorithm) you need to use. This path is defined by yourself in the library file (.jar file). For example, to use a MyDistance.class class stored in the directory org/talend/mydistance in a user-defined mydistance.jar library, the path to be entered is org.talend.mydistance.MyDistance.
	Weight Set a numerical weight for each attribute (column) of the key definition. The values can be anything >= 0.
	Handle Null To handle null values, select from the list the null operator you want to use on the column: Null Match Null: a Null attribute only matches another Null attribute. Null Match None: a Null attribute never matches another attribute. Null Match All: a Null attribute matches any other value of an attribute. For example, suppose there are two columns, name and firstname, where name is never null but firstname can be null. Given the two records “Doe”, “John” and “Doe”, “”, whether they match depends on the operator you choose: Null Match Null: they do not match. Null Match None: they do not match. Null Match All: they match. For the records “Doe”, “” and “Doe”, “”: Null Match Null: they match. Null Match None: they do not match. Null Match All: they match.
Match Threshold	Enter the match probability. Two data records match when the probability is above the set value. You can enter a different match threshold for each match rule.
Blocking Selection	Input Column If required, select the column(s) from the input flow according to which you want to partition the processed data in blocks, this is usually referred to as “blocking”. Blocking reduces the number of pairs of records that needs to be examined. In blocking, input data is partitioned into exhaustive blocks designed to increase the proportion of matches observed while decreasing the number of pairs to compare. Comparisons are restricted to record pairs within each block. Using blocking column(s) is very useful when you are processing very big data.
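
Two of the matching functions listed in the table, Fingerprint key and Hamming, are described precisely enough to be sketched in a few lines. The Python code below follows the steps given above and is only an illustration of the descriptions, not the component's implementation; the function names are hypothetical, and stripping combining marks after Unicode decomposition is an assumed way of normalizing extended western characters.

```python
import string
import unicodedata


def fingerprint(value):
    """Build a fingerprint key by applying the listed steps in order."""
    text = value.strip().lower()
    # Remove punctuation and control characters.
    text = "".join(
        ch for ch in text
        if ch not in string.punctuation and unicodedata.category(ch) != "Cc"
    )
    # Split into tokens, drop duplicates, sort, and join back together.
    joined = " ".join(sorted(set(text.split())))
    # Normalize accented characters by dropping their combining marks.
    decomposed = unicodedata.normalize("NFKD", joined)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def hamming(a, b):
    """Count the positions at which two strings of equal length differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


same_cluster = fingerprint("Cruise, Tom") == fingerprint("Tom Cruise")
distance = hamming("masking", "pairing")
```

Here same_cluster is True because both names reduce to the fingerprint cruise tom, and distance is 3, as in the table. The false positive mentioned above also shows up: fingerprint("gödel") and fingerprint("godél") both return godel.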

Advanced settings

Store on disk	Select the Store on disk check box if you want to store processed data blocks on the disk to maximize system performance. Max buffer size: Type in the size of physical memory you want to allocate to processed data. Temporary data directory path: Set the location where the temporary file should be stored.
Multiple output	Select the Separate output check box to have three different output flows: – Uniques: when the group score (minimal distance computed in the record) is equal to 1, the record is listed in this flow. – Matches: when the group score (minimal distance computed in the record) is higher than the threshold you define in the Confident match threshold field, the record is listed in this flow. – Suspects: when the group score (minimal distance computed in the record) is below the threshold you define in the Confident match threshold field, the record is listed in this flow. Confident match threshold: set a numerical value between the current Match threshold and 1. Above this threshold, you can be confident in the quality of the group.
Multi-pass	Select this check box to enable a tMatchGroup component to receive data sets from another tMatchGroup that precedes it in the Job. This will refine the groups received by each of the tMatchGroup components through creating data partitions based on different blocking keys. For an example Job, see Scenario 2: Matching customer data through multiple passes.
Sort the output data by GID	Select this check box to group the output data by the group identifier.
Output distance details	Select this check box to add an output column MATCHING_DISTANCES in the schema of the component. This column provides the distance between the input and master records in each group. Note: When you use two tMatchGroup components in a Job and you want to use the Output distance details option, you must select this check box in both components before you link them together. If the components are linked, select the check box in the second component in the Job flow first then in the first component, otherwise you may have an issue as there are two columns in the output schema with the same name. Selecting this option in only one tMatchGroup is not useful and may bring schema mismatch issues.
Display detailed labels	Select this check box to have in the output MATCHING_DISTANCES column not only the matching distance but also the names of the columns used as key attributes in the applied rule. For example, if you try to match on first name and last name fields, lname and fname, the output would be `fname:1.0|lname:0.97` when the check box is selected and `1.0|0.97` when it is not selected.
tStatCatcher Statistics	Select this check box to collect log data at the component level. Note that this check box is not available in the Map/Reduce version of the component.
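
The two forms of the MATCHING_DISTANCES value described for Display detailed labels can be reproduced with a short helper. This Python sketch is hypothetical and only mirrors the example given in the table; it assumes that the distances are joined with a pipe character in the order of the key attributes.

```python
def format_distances(distances, detailed_labels):
    """Join (column name, distance) pairs, with or without the column names."""
    if detailed_labels:
        return "|".join(f"{name}:{value}" for name, value in distances)
    return "|".join(str(value) for _, value in distances)


pairs = [("fname", 1.0), ("lname", 0.97)]
with_labels = format_distances(pairs, detailed_labels=True)
without_labels = format_distances(pairs, detailed_labels=False)
```

With the sample pairs, with_labels is fname:1.0|lname:0.97 and without_labels is 1.0|0.97.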

Global Variables

Global Variables	ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box. A Flow variable functions during the execution of a component while an After variable functions after the execution of the component. To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it. For further information about variables, see Talend Studio User Guide.

Usage

Usage rule	In a Talend Map/Reduce Job, this component is used as an intermediate step and other components used along with it must be Map/Reduce components, too. They generate native Map/Reduce code that can be executed directly in Hadoop. You need to use the Hadoop Configuration tab in the Run view to define the connection to a given Hadoop distribution for the whole Job. For further information about a Talend Map/Reduce Job, see the sections describing how to create, convert and configure a Talend Map/Reduce Job of the Talend Open Studio for Big Data Getting Started Guide. For a scenario demonstrating a Map/Reduce Job using this component, see Scenario: Matching data through multiple passes using Map/Reduce components. Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs, and non Map/Reduce Jobs.

Working principle

This component implements the MapReduce model, based on the blocking keys defined in
the Blocking definition table of the Basic settings view.

This implementation proceeds as follows:

Splits the input rows in groups of a given size.
Implements a Map Class that creates a map between each key and a list of
records.
Shuffles the records to group those with the same key together.
Applies, on each key, the algorithm defined in the Key
definition table of the Basic
settings view.

Then, accordingly, this component reads the records, compares them with the
master records, groups the similar ones, and classifies each of the remaining
records as a master record.
Outputs the groups of similar records with their group IDs, group sizes,
matching distances and scores.
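
The steps above can be sketched in plain Python to show the flow of data. This is a single-process illustration under stated assumptions, not the Map/Reduce code generated by the Studio: it skips the initial split of the input rows, the similarity function is a stand-in from the Python standard library rather than one of the component's matching functions, and the GID format and the zero GRP_SIZE on non-master rows are hypothetical choices.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in matching function returning a value between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_group(records, blocking_key, match_key, threshold):
    # Map and shuffle: collect the records that share a blocking key.
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[blocking_key]].append(rec)

    output = []
    # Reduce: inside each block, compare every record with the masters so far.
    for block_id, block in blocks.items():
        groups = []  # each group is a list of rows; the first row is the master
        for rec in block:
            for group in groups:
                score = similarity(rec[match_key], group[0][match_key])
                if score >= threshold:
                    group.append({**rec, "MASTER": False, "SCORE": score})
                    break
            else:
                # No master is close enough: the record becomes a new master.
                groups.append([{**rec, "MASTER": True, "SCORE": 1.0}])
        for n, group in enumerate(groups, start=1):
            for row in group:
                row["GID"] = f"{block_id}-{n}"
                row["GRP_SIZE"] = len(group) if row["MASTER"] else 0
            output.extend(group)
    return output


rows = [
    {"zip": "75001", "name": "Jonathan"},
    {"zip": "75001", "name": "Jonathon"},
    {"zip": "75001", "name": "Mary"},
    {"zip": "10001", "name": "Jonathan"},
]
grouped = match_group(rows, blocking_key="zip", match_key="name", threshold=0.8)
```

Because comparisons never cross block boundaries, the two Jonathan records with different zip values end up in different groups, which shows both the benefit of blocking (fewer comparisons) and its cost (duplicates in different blocks are not detected in a single pass).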

Configuration wizard in Map/Reduce Jobs

The match results on multiple conditions will list data
records that meet any of the defined rules.

You cannot open
the configuration wizard unless you link an input component to the tMatchGroup component.

From the
configuration wizard in tMatchGroup, you can:

define multiple conditions using several match rules to group
data,
set different match intervals for each rule,
import match rules created and tested in the studio and stored in
the repository, and use them in your match Jobs. You can only import rules
configured with the VSR algorithm. For further information, see Importing match rules from the studio repository.
select a blocking key to partition data.

Click the [+] button on the match rule
bar.
Set the parameters for the new rule in the Key
definition table and define its match interval.
Repeat the above steps to create as many match rules as needed. You can define
a different match interval for each rule.

When you define multiple rules, the Job conducts an OR match operation. It
evaluates data records against the first rule and the records that match are not
evaluated against the second rule.
In the Blocking Selection table, select the
column(s) from the input flow which you want to use as a blocking key.

Defining a blocking key is not mandatory but is very useful when you are
processing big data sets. A blocking key partitions data in blocks and so
reduces the number of records that need to be examined. This key can come from a
tGenKey component (and would be called
T_GEN_KEY) or directly from the input schema.
At the bottom right corner of the wizard, click either:
- OK to save the current
  configuration.
- Cancel to close the wizard and keep
  the configuration saved initially in the wizard.
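
The OR evaluation of several match rules can be sketched as follows. This Python code is a hypothetical illustration: the rule structure, the function names and the use of a weighted average to combine the per-attribute scores are assumptions, and the fuzzy function is a stand-in for the matching functions of the component. Weights are assumed to sum to a value greater than 0.

```python
from difflib import SequenceMatcher


def exact(a, b):
    """Return 1 when the two strings match exactly, otherwise 0."""
    return 1.0 if a == b else 0.0


def fuzzy(a, b):
    """Stand-in for a fuzzy matching function."""
    return SequenceMatcher(None, a, b).ratio()


def rule_score(rule, rec_a, rec_b):
    """Combine the per-attribute scores of one rule as a weighted average."""
    total_weight = sum(weight for _, _, weight in rule["keys"])
    weighted = sum(fn(rec_a[col], rec_b[col]) * weight
                   for col, fn, weight in rule["keys"])
    return weighted / total_weight


def first_matching_rule(rules, rec_a, rec_b):
    """Evaluate the rules in order and stop at the first one that matches."""
    for rule in rules:
        if rule_score(rule, rec_a, rec_b) >= rule["interval"]:
            return rule["name"]
    return None


rules = [
    {"name": "rule1", "keys": [("email", exact, 1)], "interval": 1.0},
    {"name": "rule2", "keys": [("fname", fuzzy, 1), ("lname", fuzzy, 2)],
     "interval": 0.8},
]
a = {"email": "a@x.org", "fname": "Jon", "lname": "Smith"}
b = {"email": "a@x.org", "fname": "John", "lname": "Smyth"}
c = {"email": "b@x.org", "fname": "John", "lname": "Smith"}
d = {"email": "z@x.org", "fname": "Mary", "lname": "Jones"}
```

With these sample records, first_matching_rule(rules, a, b) returns rule1 and the second rule is never evaluated for that pair, first_matching_rule(rules, a, c) falls through to rule2, and first_matching_rule(rules, a, d) returns None.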

Scenario: Matching data through multiple passes using Map/Reduce components

This scenario applies only to a subscription-based Talend Platform solution with Big Data or Talend Data Fabric.

Note that
Talend
Map/Reduce components are available only to users
who subscribed to Big Data.

This scenario shows how to create a
Talend
Map/Reduce Job to match data by
using Map/Reduce components. The Job generates Map/Reduce code and runs directly in
Hadoop.

This Job is a duplication of the Standard data
integration Job described in Scenario 2: Matching customer data through multiple passes where standard components are replaced with Map/Reduce components.

You can use
Talend Studio
to automatically
convert the standard Job in the previous section to a Map/Reduce Job. This way, you do
not need to redefine the settings of the components in the Job.

Before starting to replicate this scenario, ensure that you have appropriate rights
and permissions to access the Hadoop distribution to be used.

Converting the Job

In the Repository tree view of the Integration perspective of Talend Studio,
right-click the Job you have created in the earlier scenario to open its
contextual menu and select Edit properties.

Then the [Edit properties] dialog box is
displayed. Note that the Job must be closed before you are able to make any
changes in this dialog box.

This dialog box looks like the image below:

Note that you can change the Job name as well as the other descriptive
information about the Job from this dialog box.
From the Job Type list, select Big Data Batch. Then a Map/Reduce Job using the same
name appears under the Big Data Batch sub-node
of the Job Design node.

Setting up the Job

Double-click the new Map/Reduce Job to open it in the workspace.

The Map/Reduce component Palette is opened.
Delete tMysqlInput in this scenario as it is
not a Map/Reduce component and use tRowGenerator in its place. Link it to tGenKey with a Row > Main
link.
Double-click tRowGenerator to open its
editor.
Define the schema you want to use to write data in Hadoop.
Click OK to validate your schema and close
the editor.
Leave the settings of the other components as you defined initially in the
standard version of the Job.

Setting up Hadoop connection

Click Run to open its view and then click the
Hadoop Configuration tab to display its
view for configuring the Hadoop connection for this Job.
From the Property type list,
select Built-in. If you have created the
connection to be used in Repository, then
select Repository and thus the Studio will
reuse that set of connection information for this Job.
In the Version area, select the
Hadoop distribution to be used and its version.

If you cannot find from the list the distribution corresponding to yours,
select Custom so as to connect to a
Hadoop distribution not officially supported in the Studio. For a
step-by-step example about how to use this Custom option, see Connecting to a custom Hadoop distribution.

If you use Google Cloud Dataproc, see Defining the Dataproc connection parameters for MapReduce Jobs.
Along with the evolution of Hadoop, please note the following changes:
- If you use Hortonworks Data
  Platform V2.2, the configuration files of your
  cluster might be using environment variables such as ${hdp.version}. If this is your situation,
  you need to set the mapreduce.application.framework.path property in the
  Hadoop properties table
  with the path value explicitly pointing to the MapReduce
  framework archive of your cluster. For
  example:
  
  mapreduce.application.framework.path=/hdp/apps/2.2.0.0-2041/mapreduce/mapreduce.tar.gz#mr-framework
- If you use Hortonworks Data
  Platform V2.0.0, the type of the operating
  system for running the distribution and a
  Talend
  Job must be the same, such as Windows or Linux.
  Otherwise, you have to use
  Talend
  Jobserver to execute the Job in the same type of
  operating system in which the Hortonworks Data Platform V2.0.0 distribution
  you are using is run.
In the Name node field, enter the location of
the master node, the NameNode, of the distribution to be used. For example,
hdfs://tal-qa113.talend.lan:8020.
- If you are using a MapR distribution, you can simply leave maprfs:/// as it is in this field; then the MapR
  client will take care of the rest on the fly for creating the connection. The
  MapR client must be properly installed. For further information about how to set
  up a MapR client, see the following link in MapR’s documentation: http://doc.mapr.com/display/MapR/Setting+Up+the+Client
- If you are using WebHDFS, the location should be
  webhdfs://masternode:portnumber; if this WebHDFS is secured
  with SSL, the scheme should be swebhdfs and you need to use
  a tLibraryLoad in the Job to load the library required by
  the secured WebHDFS.
In the Resource Manager field,
enter the location of the ResourceManager of your distribution. For example,
tal-qa114.talend.lan:8050.
- Then you can continue to set the following parameters depending on the
  configuration of the Hadoop cluster to be used (if you leave the check
  box of a parameter clear, then at runtime, the configuration about this
  parameter in the Hadoop cluster to be used will be ignored):
  - Select the Set resourcemanager
    scheduler address check box and enter the Scheduler address in
    the field that appears.
  - Select the Set jobhistory
    address check box and enter the location of the JobHistory
    server of the Hadoop cluster to be used. This allows the metrics information of
    the current Job to be stored in that JobHistory server.
  - Select the Set staging
    directory check box and enter this directory defined in your
    Hadoop cluster for temporary files created by running programs. Typically, this
    directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
    such as yarn-site.xml or mapred-site.xml of your distribution.
  - Select the Use datanode
    hostname check box to allow the Job to access datanodes via
    their hostnames. This actually sets the dfs.client.use.datanode.hostname property to true. When connecting to an S3N filesystem, you must select this check
    box.
If you are accessing a Hadoop cluster running
with Kerberos security, select the Use Kerberos authentication check box, then enter the Kerberos
principal name for the NameNode in the field displayed. This enables you to use
your user name to authenticate against the credentials stored in Kerberos.
- If this cluster is a MapR cluster of the version 4.0.1 or later, you can set the MapR
  ticket authentication configuration in addition or as an alternative by following the
  explanation in Connecting to a security-enabled MapR.
  
  Keep in mind that this configuration generates a new MapR security ticket for the username
  defined in the Job in each execution. If you need to reuse an existing ticket issued for the
  same username, leave both the Force MapR ticket
  authentication check box and the Use Kerberos
  authentication check box clear, and then MapR should be able to automatically
  find that ticket on the fly.
In addition, since this component performs Map/Reduce computations, you
also need to authenticate the related services such as the Job history server and
the Resource manager or Jobtracker depending on your distribution in the
corresponding field. These principals can be found in the configuration files of
your distribution. For example, in a CDH4 distribution, the Resource manager
principal is set in the yarn-site.xml file and the Job history
principal in the mapred-site.xml file.

If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. You need to enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored in the machine in which your Job actually
runs, for example, on a Talend
Jobserver.

Note that the user that executes a keytab-enabled Job is not necessarily
the one a principal designates but must have the right to read the keytab file being
used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this
situation, ensure that user1 has the right to read the keytab
file to be used.
In the User name field, enter the login user
name for your distribution. If you leave it empty, the user name of the machine
hosting the Studio will be used.
In the Temp folder field, enter the path in
HDFS to the folder where you store the temporary files generated during
Map/Reduce computations.
Leave the default value of the Path separator in
server as it is, unless you have changed the separator used by your
Hadoop distribution’s host machine for its PATH variable or in other words, that
separator is not a colon (:). In that situation, you must change this value to the
one you are using in that host.
Leave the Clear temporary folder check box
selected, unless you want to keep those temporary files.
Leave the Compress intermediate map output to reduce
network traffic check box selected, so as to reduce the time needed
to transfer the mapper task partitions to the multiple reducers.

However, if the data transfer in the Job is negligible, it is recommended to
clear this check box to deactivate the compression step, because this
compression consumes extra CPU resources.
If you need to use custom Hadoop properties, complete the Hadoop properties table with the property or
properties to be customized. Then at runtime, these changes will override the
corresponding default properties used by the Studio for its Hadoop
engine.

For further information about the properties required by Hadoop, see Apache’s
Hadoop documentation on http://hadoop.apache.org, or
the documentation of the Hadoop distribution you need to use.
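The override behavior of the Hadoop properties table can be pictured as a merge of two key/value maps in which the custom entries win. This is a minimal sketch of that idea, not the Studio's actual code. The function name is hypothetical and the default values shown are illustrative, since the real defaults depend on your distribution.

```python
def effective_properties(defaults, custom):
    """Merge two property maps; entries in `custom` override `defaults`."""
    merged = dict(defaults)   # copy, so the defaults are left untouched
    merged.update(custom)     # custom values replace matching keys
    return merged

if __name__ == "__main__":
    # Illustrative values only.
    defaults = {"mapreduce.job.reduces": "1", "dfs.replication": "3"}
    custom = {"mapreduce.job.reduces": "4"}
    print(effective_properties(defaults, custom))
```

Properties that are not listed in the table keep their default values.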
If the HDFS transparent encryption has been enabled in your cluster, select
the Setup HDFS encryption configurations check
box and in the HDFS encryption key provider field
that is displayed, enter the location of the KMS proxy.

For further information about the HDFS transparent encryption and its KMS proxy, see Transparent Encryption in HDFS.
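As an illustration of what a KMS location can look like, the sketch below builds a key provider URI in the kms:// form that Apache Hadoop's KMS documentation uses. It assumes that the HDFS encryption key provider field expects this form. The function name, host, and port are hypothetical, so check the values used by your cluster.

```python
def kms_provider_uri(host, port, scheme="http"):
    """Build a KMS key provider URI of the form kms://<scheme>@<host>:<port>/kms."""
    if scheme not in ("http", "https"):
        raise ValueError("scheme must be 'http' or 'https'")
    return f"kms://{scheme}@{host}:{port}/kms"

if __name__ == "__main__":
    # Hypothetical host and port.
    print(kms_provider_uri("kms.example.com", 9600))
```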
If the Hadoop distribution to be used is Hortonworks Data Platform V1.2 or Hortonworks
Data Platform V1.3, you need to set proper memory allocations for the map and reduce
computations to be performed by the Hadoop system.

In that situation, enter the values you need in the Mapred
job map memory mb and the Mapred job reduce memory
mb fields, respectively. By default, both values are 1000, which is normally appropriate for running the
computations.

If the distribution uses YARN, then the memory parameters to be set become Map (in Mb), Reduce (in Mb) and
ApplicationMaster (in Mb), accordingly. These fields
allow you to dynamically allocate memory to the map and the reduce computations and the
ApplicationMaster of YARN.

For further information about the Resource Manager, its scheduler and the
ApplicationMaster, see YARN’s documentation such as http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/.

For further information about how to determine YARN and MapReduce memory configuration
settings, see the documentation of the distribution you are using, such as the following
link provided by Hortonworks: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html.
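When choosing the memory values, distribution guides commonly suggest keeping the JVM heap of a task below the memory of its container. The sketch below applies an 80 percent ratio as an assumed rule of thumb. The ratio, the function name, and the example sizes are illustrative and should be checked against the documentation of your distribution.

```python
def jvm_heap_option(container_mb, heap_percent=80):
    """Return a -Xmx option sized as a percentage of the container memory."""
    if container_mb <= 0 or not 0 < heap_percent <= 100:
        raise ValueError("container_mb must be positive and heap_percent in 1..100")
    heap_mb = container_mb * heap_percent // 100  # integer arithmetic, rounds down
    return f"-Xmx{heap_mb}m"

if __name__ == "__main__":
    # Example container sizes in MB for the map and reduce computations.
    for name, size in (("map", 1024), ("reduce", 2048)):
        print(name, size, jvm_heap_option(size))
```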
If you are using Cloudera V5.5+, you can select the Use Cloudera Navigator check box to enable the Cloudera Navigator of your
distribution to trace your Job lineage to the component level, including the schema
changes between components.
With this option activated, you need to set the following parameters:
- Username and Password: these are the credentials you use to connect to your Cloudera
  Navigator.
- Cloudera Navigator URL: enter the location
  of the Cloudera Navigator to be connected to.
- Cloudera Navigator Metadata URL: enter the
  location of the Navigator Metadata.
- Activate the autocommit option: select this
  check box to make Cloudera Navigator generate the lineage of the current Job at the end
  of the execution of this Job.
  
  Since this option actually forces Cloudera Navigator to generate lineages of
  all its available entities such as HDFS files and directories, Hive queries or Pig
  scripts, it is not recommended for the production environment because it will slow the
  Job.
- Kill the job if Cloudera Navigator fails: select this check
  box to stop the execution of the Job when the connection to your Cloudera Navigator fails.
  
  Otherwise, leave it clear to allow your Job to continue to run.
- Disable SSL validation: select this check box to
  make your Job connect to Cloudera Navigator without the SSL validation
  process.
  
  This feature is meant to facilitate testing of your Job but is not
  recommended for use in a production cluster.
If you are using Hortonworks Data Platform V2.4.0 onwards and you have
installed Atlas in your cluster, you can select the Use
Atlas check box to enable Job lineage to the component level, including the
schema changes between components.
With this option activated, you need to set the following parameters:
- Atlas URL: enter the location of the Atlas instance
  to be connected to. It is often http://name_of_your_atlas_node:port.
- Die on error: select this check box to stop the Job
  execution when Atlas-related issues occur, such as connection issues to Atlas.
  
  Otherwise, leave it clear to allow your Job to continue to run.
In the Username field and the Password field, enter the authentication information for access
to Atlas.
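Before entering the Atlas URL, you can check that a value has the http://name_of_your_atlas_node:port shape mentioned above. The sketch below uses Python's standard urllib.parse module. The function name and the example host and port are hypothetical, and the check validates the shape of the URL only, not whether the service is reachable.

```python
from urllib.parse import urlsplit

def looks_like_service_url(url):
    """Return True if url has an http(s) scheme, a host name and an explicit port."""
    parts = urlsplit(url)
    try:
        port = parts.port          # raises ValueError if the port is not numeric
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname) and port is not None

if __name__ == "__main__":
    # Hypothetical node name and port.
    print(looks_like_service_url("http://atlas-node:21000"))
```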

Executing the Job

Save the Job and press F6 to execute
it.

Match results are displayed on the studio console.

Matches are calculated by running through the two passes in sequence. Each pass
provides its matches to the pass that follows, and more matches are
identified with the rule and blocking key of the second pass.

More customers with similar last names are grouped together after the
second pass.
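The way a second pass finds additional matches can be sketched outside the Studio. The example below is a simplified illustration, not the Simple VSR Matcher or T-Swoosh implementation. It assumes that the first record of each group is its master, that only masters are compared, and that comparisons happen only inside a block. The similarity function (difflib), the field names, the threshold, and the sample customers are all invented for the illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity in [0, 1]; a stand-in for a real matching function."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_pass(groups, block_key, match_field, threshold):
    """Run one matching pass over a list of groups (first record = master)."""
    # Partition the groups into blocks using the blocking key of the master.
    blocks = {}
    for group in groups:
        blocks.setdefault(block_key(group[0]), []).append(group)

    result = []
    for block in blocks.values():
        kept = []  # groups formed so far inside this block
        for group in block:
            for target in kept:
                score = similarity(group[0][match_field], target[0][match_field])
                if score >= threshold:
                    target.extend(group)  # the whole group joins the earlier master
                    break
            else:
                kept.append(list(group))  # no match: the group keeps its own master
        result.extend(kept)
    return result

if __name__ == "__main__":
    customers = [
        {"id": 1, "last": "Smith", "city": "Boston", "zip": "02101"},
        {"id": 2, "last": "Smyth", "city": "Boston", "zip": "02101"},
        {"id": 3, "last": "Smith", "city": "Bostn", "zip": "02101"},  # typo in city
        {"id": 4, "last": "Jones", "city": "Boston", "zip": "02101"},
    ]
    first = match_pass([[c] for c in customers], lambda r: r["city"], "last", 0.75)
    second = match_pass(first, lambda r: r["zip"], "last", 0.75)
    print([[r["id"] for r in g] for g in first])   # [[1, 2], [4], [3]]
    print([[r["id"] for r in g] for g in second])  # [[1, 2, 3], [4]]
```

In the first pass, record 3 sits alone in its block because of the misspelled city. The second pass uses a different blocking key, so its master is compared with the master of the first group and the two groups are merged.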

Source: Talend documentation, https://help.talend.com
