Warning
This component is available in the Palette of the studio only if you have subscribed to a Talend Platform product with Big Data.

Component family
Data Quality

This component is deprecated: tMatchGroup can now be used in both standard and Map/Reduce Jobs.
Function
tMatchGroupHadoop compares records from the input flow using the given matching rules and groups similar records together in HDFS. When several tMatchGroupHadoop components are used in a Job, each of them defines a matching pass. In defining a group, the first record processed of each group is taken as the master record of that group.

Purpose
This component helps find similar or duplicate records in any source data, including large volumes of data.
Basic settings

Property type
Either Built-in or Repository. Since version 5.6, both the Built-In mode and the Repository mode are available in any of the Talend solutions.

Schema and Edit schema
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. Since version 5.6, both the Built-In mode and the Repository mode are available in any of the Talend solutions. Click Sync columns to retrieve the schema from the previous component connected in the Job.
The output schema of this component contains the following read-only fields:
GID: represents the group identifier.
GRP_SIZE: counts the number of records in the group; it is computed only on the master record.
MASTER: identifies the record used as the master record of its group.
SCORE: measures the distance between the input record and the master record, according to the matching algorithm used.
Matching_Distances: it presents the matching distance between the input record and the master record for each processed column.

Built-in: You create and store the schema locally for this component only.

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs.
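For illustration, grouped output could carry the extra columns like this (a hypothetical sketch; the values are invented, not taken from an actual Job run, and follow the convention that GRP_SIZE is computed only on the master record):

```python
# Hypothetical sketch of the extra output columns produced by the component.
# Values are invented for illustration only.
output_rows = [
    {"lname": "Doe", "GID": "1", "GRP_SIZE": 2, "MASTER": True,  "SCORE": 1.0},
    {"lname": "Doh", "GID": "1", "GRP_SIZE": 0, "MASTER": False, "SCORE": 0.93},
    {"lname": "Ray", "GID": "2", "GRP_SIZE": 1, "MASTER": True,  "SCORE": 1.0},
]

# Each group has exactly one master record; SCORE measures the distance
# between a record and the master record of its group.
masters = {r["GID"] for r in output_rows if r["MASTER"]}
print(sorted(masters))  # ['1', '2']
```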
Link with a tMatchGroupHadoop
Select this check box if more than one tMatchGroupHadoop is used in the Job. From the Component List that appears, select the tMatchGroupHadoop component whose connection to the Hadoop system you want to reuse.

Link with a tGenKeyHadoop
Select this check box if you need to reuse a specific connection, that is, the connection to HDFS and the HDFS directory created by a tGenKeyHadoop component in the Job.

Use an existing connection
Select this check box and in the Component List click the relevant connection component to reuse the connection details you already defined.
Note: When a Job contains the parent Job and the child Job, Component List presents only the connection components in the same Job level.
Version
Note: Unavailable if you use an existing link.

Distribution
Select the cluster you are using from the drop-down list. The options in the list vary depending on the component you are using. In order to connect to a custom distribution, once selecting Custom, click the [...] button to display the dialog box in which you can define that distribution.

Hadoop version
Select the version of the Hadoop distribution you are using. The available options vary depending on the distribution you selected.

NameNode URI
Select this check box to indicate the location of the NameNode of the Hadoop cluster to be used. For further information about the Hadoop Map/Reduce framework, see the Map/Reduce tutorial in Apache's Hadoop documentation.

JobTracker URI
Select this check box to indicate the location of the Jobtracker service within the Hadoop cluster to be used. If you use YARN in your Hadoop cluster, such as Hortonworks Data Platform V2.0.0, specify the location of the Resource Manager instead of the Jobtracker. For further information about these parameters, see the documentation of the Hadoop distribution you are using. For further information about the Hadoop Map/Reduce framework, see the Map/Reduce tutorial in Apache's Hadoop documentation.

User name
Enter the Hadoop user authentication. For some Hadoop versions, you also need to enter the name of the group the user belongs to.
HDFS directory
Enter the HDFS directory where the data to be processed is. At runtime, this component will clean up any data in this directory. If you want to reuse the existing data in HDFS, select the Use existing HDFS file check box in the Advanced settings view.

Click the import icon to select a match rule from the Studio repository. When you click the import icon, a [Match Rule Selector] wizard opens to help you import match rules from the Studio repository and use them in your Job. You can only import rules created with the VSR algorithm. For further information, see Talend Studio User Guide.
Key Definition

Input Key Attribute
Select the column(s) from the input flow on which you want to apply the matching algorithm.
Note: When you select a date column on which to apply an algorithm or a matching algorithm, you can decide what to compare in the date format. For example, if you want to only compare the year in the date, in the component schema set the type of the date column to Date and then enter "yyyy" in the Date Pattern field. The component then converts the date format to a string according to the pattern defined in the schema before starting a string comparison.
Matching Function
Select the relevant matching algorithm from the list:
Exact Match: matches each processed entry to all possible reference entries with exactly the same value.
Exact - ignore case: matches each processed entry exactly while ignoring the case.
Soundex: matches processed entries according to a standard English phonetic algorithm.
Levenshtein (edit distance): calculates the minimum number of edits (insertion, deletion or substitution) required to transform one string into another.
Metaphone: based on a phonetic algorithm for indexing entries by their pronunciation.
Double Metaphone: a new version of the Metaphone phonetic algorithm that produces more accurate results.
Soundex FR: matches processed entries according to a French phonetic algorithm.
Jaro: matches processed entries according to spelling deviations.
Jaro-Winkler: a variant of Jaro giving more importance to the beginning of the string.
q-grams: matches processed entries by dividing strings into letter blocks of length q.
custom...: enables you to load an external matching algorithm from a Java library.
For further information about how to load an external Java library, how to create a custom matching algorithm and the related scenario about how to use a custom matching algorithm, see Talend Studio documentation.

Custom matcher
Type in the path pointing to the custom class (external matching algorithm) you need to use. For example, to use a MyDistance.class class stored in the directory org/talend/mydistance, type in org.talend.mydistance.MyDistance.

Weight
Set a numerical weight for each attribute (column) of the key definition.
Note: If you have more than one tMatchGroupHadoop component in your Job, the weights are set separately in each component.
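As a rough illustration of what an edit-distance algorithm computes (an independent sketch, not Talend's implementation), the Levenshtein distance and a similarity score derived from it can be written as:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] derived from the edit distance."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("Alexander", "Alexandre"))  # 2
print(round(similarity("Alexander", "Alexandre"), 2))  # 0.78
```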
Handle Null
To handle null values, select from the list the null operator you want to use on the column:
Null Match Null: a Null attribute only matches another Null attribute.
Null Match None: a Null attribute never matches another attribute.
Null Match All: a Null attribute matches any other value of an attribute.
For example, suppose we have two columns, name and firstname, where the name is never null but the firstname can be null.
If we have two records:
"Doe", "John"
"Doe", ""
Depending on the operator you choose, these two records may or may not match:
Null Match Null: they do not match.
Null Match None: they do not match.
Null Match All: they match.
And for the records:
"Doe", ""
"Doe", ""
Null Match Null: they match.
Null Match None: they do not match.
Null Match All: they match.
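The three operators above can be sketched as follows (a simplified model of the behavior described, not the component's actual code):

```python
# A None value stands for a null attribute.
def attrs_match(a, b, null_op="nullMatchNull"):
    if a is None or b is None:
        if null_op == "nullMatchAll":
            return True                      # null matches any value
        if null_op == "nullMatchNull":
            return a is None and b is None   # null matches only null
        return False                         # nullMatchNone: null matches nothing
    return a == b

# The firstname pairs from the example above:
print(attrs_match("John", None, "nullMatchAll"))   # True
print(attrs_match(None, None, "nullMatchNull"))    # True
print(attrs_match(None, None, "nullMatchNone"))    # False
```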
Blocking Definition

Input Column
If required, select the column(s), as the blocking key(s), from the input flow. Blocking reduces the number of pairs of records that needs to be examined: the input data is partitioned into exhaustive blocks based on the blocking key, and comparison is restricted to record pairs within each block.
Note: Using blocking column(s) is very useful when you are processing a big data set.
Advanced settings

Matching Algorithm
Select an algorithm from the list; only one is available for the time being:
Simple VSR Matcher: this algorithm is based on the Vector Space Retrieval model.

Match threshold
Enter the match probability. Two data records match when the probability is above the set value.

Sort the output data by GID
Select this check box to group the output data by the group identifier (GID).

Output distance details
Select this check box to fill the fixed output column Matching_Distances with the details of the matching distance computed for each processed column.
Hadoop Properties
Note: Unavailable if you use an existing link.

Property
Talend Studio uses a default configuration for its engine to perform operations in Hadoop. If you need to use a custom configuration, type in, in this table, the property or properties to be customized. For further information about the properties required by Hadoop and its related systems such as HDFS, see the documentation of the Hadoop distribution you are using.

Value
Type in the custom property values used to connect to the Hadoop cluster.
Keep data in Hadoop
Select this check box to keep the data processed by this component in HDFS. If you leave this check box unselected, the component processes the data in HDFS and outputs it into the data flow of the Job.

Use existing HDFS file
Select this check box to enable the component to directly process a file already existing in HDFS:
HDFS file URI: set the URI of the file to be processed.
Field delimiter: set the character used to separate the fields of the file.
If you leave this check box unselected, tMatchGroupHadoop receives the data flow to be processed from its preceding component.

tStatCatcher Statistics
Select this check box to collect log data at the component level.
Dynamic settings
Click the [+] button to add a row in the table and fill the Code field with a context variable to choose your connection dynamically. The Dynamic settings table is available only when the Use an existing connection check box is selected in the Basic settings view. For more information on Dynamic settings and context variables, see Talend Studio User Guide.

Global Variables
ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string.
A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.
To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.
For further information about variables, see Talend Studio User Guide.

Usage
This component can be a start or an intermediary step. It requires an output flow. It is ideally used alongside the tGenKey component or the tGenKeyHadoop component, which generates the blocking keys this component uses.

Limitation/prerequisite
You need to use Linux to execute the Job containing this component.
This component implements the MapReduce model, based on the blocking keys defined
in the Blocking definition table of the Basic settings view.

This implementation proceeds as follows:
- Splits the input rows into groups of a given size.
- Implements a Map class that creates a map between each key and a list of records.
- Shuffles the records to group those with the same key together.
- Applies, on each key, the algorithm defined in the Key definition table of the Basic settings view. Accordingly, this component reads the records, compares them with the master records, groups the similar ones, and marks each remaining record as a new master record.
- Outputs the groups of similar records with their group IDs, group sizes, matching distances and scores.
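The steps above can be sketched in plain Python rather than actual MapReduce code; the blocking-key and matching functions below are toy stand-ins for the algorithms configured in the component:

```python
from collections import defaultdict

# Toy stand-ins for the configured algorithms: the blocking key is the first
# letter of the name; two records match when their names are equal ignoring case.
def blocking_key(record):
    return record["lname"][:1].upper()

def matches(a, b):
    return a["lname"].lower() == b["lname"].lower()

def group_records(rows):
    # Map + shuffle: build a map from each blocking key to its records.
    blocks = defaultdict(list)
    for row in rows:
        blocks[blocking_key(row)].append(row)

    # Reduce: within each block, compare each record with the current masters;
    # an unmatched record becomes the master of a new group.
    gid = 0
    output = []
    for block in blocks.values():
        masters = []
        for row in block:
            for master in masters:
                if matches(master, row):
                    row["GID"] = master["GID"]
                    break
            else:
                gid += 1
                row["GID"] = gid
                masters.append(row)
            output.append(row)
    return output

rows = [{"lname": "Doe"}, {"lname": "DOE"}, {"lname": "Ray"}]
print([r["GID"] for r in group_records(rows)])  # [1, 1, 2]
```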
The Job in this scenario connects to the given Hadoop system, groups similar customer
records by running through two subsequent matching passes in HDFS and outputs the
calculated matches by groups. Each pass provides its matches to the pass that follows in
order for the latter to add more matches identified with new rules.

The components of interest are:
- tFixedFlowInput: it provides the customer records to be processed.
- two tGenKeyHadoop components: each defines a way to partition the records.
- two tMatchGroupHadoop components: they process each partition to group the records within and, once configured, sort the groups by their group IDs. The first tMatchGroupHadoop processes the partitions defined by the first tGenKeyHadoop and the second tMatchGroupHadoop processes those defined by the second tGenKeyHadoop. Each of them sets one pass to match the received records.
Warning
The two tMatchGroupHadoop components must have the same schema.
- two tLogRow components: they present the execution result of each tMatchGroupHadoop component.
To replicate this scenario, proceed as the following sections illustrate.
- Drop tFixedFlowInput, two tGenKeyHadoop components, two tMatchGroupHadoop components and two tLogRow components from the Palette onto the design workspace.
Note
A component used in the workspace can be labelled the way you need. In this scenario, the input component tFixedFlowInput is labelled incoming_customers. For further information about how to label a component, see Talend Studio User Guide.
- Right-click tFixedFlowInput to open its contextual menu and select the Row > Main link from this menu to connect this component to the first tGenKeyHadoop (labelled lname_postcode).
- Do the same to create the Main link from the first tGenKeyHadoop to the second tGenKeyHadoop (labelled lname_initial), then to the first tMatchGroupHadoop, to the first tLogRow, then to the second tMatchGroupHadoop and finally to the second tLogRow.
The components to be used in this scenario are all placed and linked. You then need to continue to configure them successively.
Setting up the input data
- Double-click tFixedFlowInput to open its Component view.
- Click the three-dot button next to Edit schema to open the schema editor.
- Click the plus button eight times to add eight rows. They are the eight columns of the schema of the input data.
- Rename these eight rows respectively. In this example, they are: account_num, lname, fname, mi, address1, city, state_province, postal_code.
- In the Type column, select the data types for the rows of interest. In this example, select Long for the account_num column.
- Click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.
- In the Mode area of the Basic settings view, select Use Inline Content (delimited file) to enter the input data of interest.
- In the Content field, enter the input data to be processed, or paste the sample data provided by the demo Job D4_hadoop_group_family_multipass that you could import along with the demo DQ project. For further information about how to import a project, see Talend Studio User Guide.
Configuring the key generation for the first pass
- Double-click the first tGenKeyHadoop (labelled lname_postcode) to open the Component view.
- Configure the connection to the HDFS you want to write and process the records in. The parameters to be set are the Distribution, the Hadoop version, the NameNode URI, the HDFS directory, the User name and the Jobtracker URI. In the HDFS directory you define, this component will create, at runtime, the folders storing separately the input records and the same records with their partition keys: the original ones in an in folder and the partitioned ones in an out folder, both under the same parent folder tGenKeyHadoop_1.
- Click the import icon to import blocking keys from the match rules created and tested in the Profiling perspective of Talend Studio and use them in your Job. Otherwise, define the blocking key parameters as described in the steps below.
- Under the Algorithm table, click the plus button to add two rows in this table.
- In the column column, click the newly added row and select, from the list, the column you want to process using an algorithm. In this example, select lname.
- Do the same on the second row to select postal_code.
- In the pre-algorithm column, click the newly added row and select, from the list, the pre-algorithm you want to apply to the corresponding column. In this example, select remove diacritical marks and upper case to remove any diacritical mark and convert the fields of the lname column to upper case before generating the code of this column.
Note
This conversion does not change your raw data.
- In the algorithm column, click the newly added row and select from the list the algorithm you want to apply to the corresponding column. In this example, select N first characters of each word.
- Do the same for the second row in the algorithm column to select first N characters of the string.
- Click in the Value column next to the algorithm column and enter the value for the selected algorithm, when needed. In this scenario, type in 1 for both of the rows, meaning that the first letter of each field in the corresponding columns will be used to generate the keys.
- Click Advanced settings to open its view.
- Select the Keep data in Hadoop check box in order to process the data in HDFS. In this situation, the customer records are not output into the data flow and the process runs faster.
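As a rough sketch of what the configured key generation produces (an approximation for illustration; the actual algorithms are implemented inside Talend), the first letter of each word of the cleaned-up lname plus the first character of postal_code gives the blocking key:

```python
import unicodedata

def remove_diacritics_upper(s: str) -> str:
    # "remove diacritical marks and upper case" pre-algorithm
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).upper()

def n_first_chars_of_each_word(s: str, n: int) -> str:
    # "N first characters of each word" algorithm
    return "".join(word[:n] for word in s.split())

def first_n_chars(s: str, n: int) -> str:
    # "first N characters of the string" algorithm
    return s[:n]

def gen_key(record):
    lname = remove_diacritics_upper(record["lname"])
    return n_first_chars_of_each_word(lname, 1) + first_n_chars(record["postal_code"], 1)

print(gen_key({"lname": "Müller", "postal_code": "75013"}))  # M7
```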
Configuring the key generation for the second pass
- Double-click the second tGenKeyHadoop (labelled lname_initial) to open the Component view.
- Select Link with a tGenKeyHadoop to reuse the connection to HDFS and the HDFS directory created by the first tGenKeyHadoop. This reuse enables this component to read the data processed by its preceding tGenKeyHadoop component natively in HDFS.
- Click the Edit schema button to verify the key column in the schema. You can find that the key columns of the two tGenKeyHadoop components have been automatically named to differentiate each other. In this scenario, they are T_GEN_KEY_postcode and T_GEN_KEY.
- Click the import icon to import blocking keys from the match rules created and tested in the Profiling perspective of Talend Studio and use them in your Job. Otherwise, define the blocking key parameters as described in the steps below.
- Under the Algorithm table, click the plus button to add one row in this table.
- In the column column, click the newly added row and select from the list the column you want to process using an algorithm. In this example, select account_num.
- In the algorithm column, click the newly added row and select from the list the algorithm you want to apply to the corresponding column. In this example, select first N characters of the string.
- Click in the Value column next to the algorithm column and enter the value for the selected algorithm, when needed. In this scenario, type in 1, meaning that the first letter of each field in the corresponding column will be used to generate the required keys.
- Click Advanced settings to open its view.
- Select the Keep data in Hadoop check box in order to process the data in HDFS.
You need to configure the two passes to group the input data with the help of the
two columns of generated keys.
Configuring the first pass
- Double-click the first tMatchGroupHadoop component (labelled pass1) to display the Component view.
- If required, click Sync columns, then click Edit schema to open the schema editor and see the schema retrieved from the previous component in the Job.
- Select Link with a tGenKeyHadoop to reuse the connection to HDFS and the HDFS directory created by its preceding tGenKeyHadoop. This reuse enables this component to read the data processed by that tGenKeyHadoop component natively in HDFS.
- Click the import icon to import matching keys from the match rules created and tested in the Profiling perspective of Talend Studio and use them in your Job. Otherwise, define the matching key parameters as described in the steps below.
- In the Key definition table, click the plus button to add to the list the columns on which you want to do the matching operation, lname in this scenario.
Note
When you select a date column on which to apply an algorithm or a matching algorithm, you can decide what to compare in the date format. For example, if you want to only compare the year in the date, in the component schema set the type of the date column to Date and then enter "yyyy" in the Date Pattern field. The component then converts the date format to a string according to the pattern defined in the schema before starting a string comparison.
- Click in the first and second cells of the Matching type column and select from the list the method(s) to be used for the matching operation, Jaro-Winkler in this example.
- Click in the cell of the Handle Null column and select the null operator you want to use to handle null attributes in the column.
- Click the plus button below the Blocking Definition table to add one row in the table, then click in the row and select from the list the column you want to use as a blocking value, T_GEN_KEY_postcode in this example. Using a blocking value reduces the number of pairs of records that needs to be examined. The input data is partitioned into exhaustive blocks based on the functional key. This decreases the number of pairs to compare, as comparison is restricted to record pairs within each block.
- Click Advanced settings to open its view and verify that the Keep data in Hadoop check box is cleared. This way, the processed customer records are output into the data flow and therefore become available to tLogRow.
- Select the Sort the output data by GID check box to arrange the output data by their group IDs.
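To illustrate how this pass decides matches, here is a sketch of the Jaro-Winkler measure compared against a match threshold (an independent implementation, not the component's own code, and the threshold value is arbitrary):

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1 = [False] * len(s1)
    m2 = [False] * len(s2)
    matched = 0
    # Count characters matching within the window.
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matched += 1
                break
    if matched == 0:
        return 0.0
    # Count transpositions among the matched characters.
    k = transpositions = 0
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions / 2
    return (matched / len(s1) + matched / len(s2) + (matched - t) / matched) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    # Boost the Jaro score by the length of the common prefix (up to 4 chars).
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

# Two records match in this pass when the score exceeds the match threshold.
MATCH_THRESHOLD = 0.85
print(jaro_winkler("ALEXANDER", "ALEXENDER") > MATCH_THRESHOLD)  # True
```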
Configuring the second pass
- Double-click the second tMatchGroupHadoop component (labelled pass2) to display the Component view.
- If this component does not have the same schema as the preceding component, a warning icon appears. In this situation, click the Sync columns button to retrieve the schema from the preceding one; once done, the warning icon disappears.
- Select the Link with a tMatchGroupHadoop check box to reuse the connection of its preceding tMatchGroupHadoop to the Hadoop system.
- Click the import icon to import matching keys from the match rules created and tested in the Profiling perspective of Talend Studio and use them in your Job. Otherwise, define the matching key parameters as described in the steps below.
- In the Key definition table, click the plus button to add to the list the columns on which you want to do the matching operation, lname in this scenario.
Note
When you select a date column on which to apply an algorithm or a matching algorithm, you can decide what to compare in the date format. For example, if you want to only compare the year in the date, in the component schema set the type of the date column to Date and then enter "yyyy" in the Date Pattern field. The component then converts the date format to a string according to the pattern defined in the schema before starting a string comparison.
- Click in the cell of the Matching type column and select from the list the method(s) to be used for the matching operation, Jaro-Winkler in this example.
- Click the plus button below the Blocking Definition table to add one row in the table, then click in the row and select from the list the column you want to use as a blocking value, T_GEN_KEY in this example. This way, the matching operation is performed only between the master records with the same key, in this example, the same initial character of the account numbers. Through this pass, the Job processes the matching groups provided by the previous tMatchGroupHadoop: it compares the master records, including those of single-record groups, with one another whenever they share the same generated key.
- Click Advanced settings to open its view and verify that the Keep data in Hadoop check box is cleared. This way, the processed customer records are output into the data flow and therefore become available to tLogRow.
- Select the Sort the output data by GID check box to arrange the output data by their group IDs.
Sorting the input records
- Double-click tSortRow to open its Component view.
- Under the Criteria table, click the plus button twice to add two rows.
- In the first row, select GID for the Schema column column, alpha for the Sort num or alpha column and asc for the Order asc or desc column. This means that the sorting is performed on the GID column of the input schema, in its ascending alphabetical order. The other columns are sorted accordingly.
- Do the same to select GRP_SIZE, num and desc accordingly in the second row.
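The sort configured above can be sketched as follows (hypothetical rows; GID is compared as text, GRP_SIZE as a number):

```python
# GID ascending as alphanumeric text, then GRP_SIZE descending as a number.
rows = [
    {"GID": "2", "GRP_SIZE": 1},
    {"GID": "1", "GRP_SIZE": 1},
    {"GID": "1", "GRP_SIZE": 3},
]
rows.sort(key=lambda r: (r["GID"], -r["GRP_SIZE"]))
print(rows)
```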
Then you can run this Job.
The tLogRow component is used to present the
execution result of the Job.
- If you want to configure the presentation mode, double-click the tLogRow component of interest to open its Component view and, in the Mode area, select the Table (print values in cells of a table) option.
- Press F6 to run this Job.
Once done, the Run view is opened automatically,
where you can check the execution result.
The result after the first pass reads as follows:

The result after the second pass reads as follows:

Note
For reasons of page space, the results are not presented in full.
When you compare, for example, the customer name Alexander between the results of the two passes, you will find that more customers with the last name Alexander are grouped together after the second pass:
- In the first pass, Jeremy Alexander, Bob Alexander and Maxine Alexander are not placed in the same group because the matching is performed only within each block defined in the T_GEN_KEY_postcode column, and they belong to different blocks, A9, A8 and A3 respectively.
- In the second pass, the matching uses the blocks defined in the T_GEN_KEY column. As all three customer names belong to block 2, they are grouped together after the distances between them are computed. In addition, you can read from the MASTER column that Jeremy Alexander is the master record of its group.