Warning
This component will be available in the Palette of
Talend Studio on the condition that you have subscribed to one of
the Talend Platform products.
Component family
Data Quality

Function
tMatchGroup compares columns in the input flow by using matching methods and groups similar records together. Several tMatchGroup components can be used in sequence to match data through multiple passes. In defining a group, the first processed record of each group is the master record of the group.

Purpose
This component helps you to create groups of similar data records using one or several match rules.
Basic settings

Schema and Edit schema
A schema is a row description; it defines the number of fields to be processed and passed on to the next component. Since version 5.6, both the Built-In mode and the Repository mode are available in any of the Talend solutions. Click Sync columns to retrieve the schema from the previous component in the Job.
The output schema of this component contains the following read-only fields:
GID: provides a group identifier of the data type String.
Note: All Jobs with tMatchGroup components that are migrated from older releases keep a group identifier of the data type Long. To have the group identifier as String, replace the tMatchGroup component in the migrated Job with tMatchGroup from the studio Palette.
GRP_SIZE: counts the number of records in the group, computed only on the master record.
MASTER: identifies, by true or false, the master record of each group. Each input record will be compared to the master record; if they match, the input record is added to the group.
SCORE: measures the distance between the input record and the master record according to the matching algorithm used. In case the tMatchGroup component uses several match rules, the score is the one computed by the rule against which the record matched.
GRP_QUALITY: provides the quality of similarities in the group, which is the minimal matching value.

Built-in: You create and store the schema locally for this component only.

Repository: You have already created the schema and stored it in the Repository, so you can reuse it in various projects and Job designs.

PREVIEW
This button opens a configuration wizard that enables you to define the match rule(s) and visualize the matching results.
Key Definition

Input Key Attribute
Select the column(s) from the input flow on which you want to apply a matching algorithm.
Note: When you select a date column on which to apply an algorithm or a matching algorithm, you can decide what to compare in the date format. For example, if you want to only compare the year in the date, in the component schema set the type of the date column to Date and then enter "yyyy" in the Date Pattern field. The component then converts the date format to a string according to the pattern defined in the schema before starting a string comparison.
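To make the date-pattern behavior concrete, here is a minimal sketch in plain Java (outside the Studio) of the conversion described in the note above; the pattern string "yyyy" comes from that example.

import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;

public class DatePatternDemo {
    public static void main(String[] args) {
        // The component converts the Date column to a string using the Date
        // Pattern defined in the schema, then runs a string comparison.
        SimpleDateFormat yearOnly = new SimpleDateFormat("yyyy");

        Date d1 = new GregorianCalendar(2012, Calendar.JANUARY, 1).getTime();
        Date d2 = new GregorianCalendar(2012, Calendar.DECEMBER, 31).getTime();

        // With pattern "yyyy", both dates render as "2012" and match exactly.
        System.out.println(yearOnly.format(d1).equals(yearOnly.format(d2))); // true
    }
}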
Matching Function
Select a matching algorithm from the list:
Exact: matches each processed entry to the reference exactly.
Exact – ignore case: matches each processed entry to the reference exactly, ignoring the case.
Soundex: matches processed entries according to a standard English phonetic algorithm.
Levenshtein (edit distance): calculates the minimum number of edits (insertion, deletion or substitution) required to transform one string into another.
Metaphone: based on a phonetic algorithm for indexing entries by their pronunciation.
Double Metaphone: a new version of the Metaphone phonetic algorithm that produces more accurate results.
Soundex FR: matches processed entries according to a French phonetic algorithm.
Jaro: matches processed entries according to spelling deviations.
Jaro-Winkler: a variant of Jaro giving more importance to the beginning of the string.
q-grams: matches processed entries by dividing strings into letter blocks of length q.
custom...: enables you to load an external matching algorithm from a Java library.
For further information about how to load an external Java library, how to create a custom matching algorithm, and for a related scenario about how to use a custom matching algorithm, see the relevant sections of the documentation.
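The Studio ships its own implementations of these algorithms. To build intuition about how they behave, you can experiment with equivalent implementations from Apache Commons (an assumption here: commons-text and commons-codec are on the classpath; these libraries are not required by the component itself).

import org.apache.commons.codec.language.Soundex;
import org.apache.commons.text.similarity.JaroWinklerSimilarity;
import org.apache.commons.text.similarity.LevenshteinDistance;

public class AlgorithmIntuition {
    public static void main(String[] args) {
        // Phonetic: encodes by pronunciation, so spelling variants collide.
        Soundex soundex = new Soundex();
        System.out.println(soundex.encode("Wilson"));  // W425
        System.out.println(soundex.encode("Willson")); // W425 -> same code

        // Edit distance: number of single-character edits between two strings.
        LevenshteinDistance lev = LevenshteinDistance.getDefaultInstance();
        System.out.println(lev.apply("John", "Jon")); // 1

        // Jaro-Winkler: similarity in [0,1], favoring common prefixes.
        JaroWinklerSimilarity jw = new JaroWinklerSimilarity();
        System.out.println(jw.apply("Martha", "Marhta")); // ~0.96
    }
}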
Custom Matcher
When you select Custom as the matching type, set in this column the path pointing to the custom class (external matching algorithm) you need to use. For example, to use a MyDistance.class class stored in the directory org/talend/mydistance in a user-defined mydistance.jar library, the path to enter is org.talend.mydistance.MyDistance.
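As a rough, hypothetical illustration of the idea only (the real class must implement the matcher interface required by the Talend data quality library; the class and method below are invented for this sketch), a custom matcher is a class packaged in a .jar that returns a similarity score for a pair of values:

package org.talend.mydistance;

// Hypothetical sketch of a custom matcher packaged in mydistance.jar and
// referenced in the Custom Matcher column as "org.talend.mydistance.MyDistance".
// Only the scoring idea is shown; the actual required interface is defined by
// the Talend data quality library.
public class MyDistance {

    // Returns a similarity score between 0 (no match) and 1 (exact match).
    public double getMatchingWeight(String record, String reference) {
        if (record == null || reference == null) {
            return 0d;
        }
        // Example rule: case-insensitive match on the first 3 characters only.
        int n = Math.min(3, Math.min(record.length(), reference.length()));
        return record.regionMatches(true, 0, reference, 0, n) ? 1d : 0d;
    }
}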
Weight
Set a numerical weight for each attribute (column) of the key definition.
Handle Null
To handle null values, select from the list the null operator you want to use on the column:
Null Match Null: a Null attribute only matches another Null attribute.
Null Match None: a Null attribute never matches another attribute.
Null Match All: a Null attribute matches any other value of an attribute.
For example, suppose we have two columns, name and firstname, where name never contains null values but firstname may.
If we have two records:
"Doe", "John"
"Doe", ""
Depending on the operator you choose, these two records may or may not match:
Null Match Null: they do not match.
Null Match None: they do not match.
Null Match All: they match.
And for the records:
"Doe", ""
"Doe", ""
Null Match Null: they match.
Null Match None: they do not match.
Null Match All: they match.
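The following minimal sketch codifies the three operators exactly as listed above; the helper treats an empty string as null, mirroring the "" values in the example records.

public class NullOperatorDemo {

    enum NullOperator { NULL_MATCH_NULL, NULL_MATCH_NONE, NULL_MATCH_ALL }

    // Decides how a pair of attributes compares when at least one is null,
    // mirroring the rules listed above ("" is treated as null for brevity).
    static boolean nullsMatch(String a, String b, NullOperator op) {
        boolean aNull = a == null || a.isEmpty();
        boolean bNull = b == null || b.isEmpty();
        switch (op) {
            case NULL_MATCH_NULL: return aNull && bNull; // null only matches null
            case NULL_MATCH_NONE: return false;          // null matches nothing
            case NULL_MATCH_ALL:  return true;           // null matches anything
            default: return false;
        }
    }

    public static void main(String[] args) {
        // ("Doe", "John") vs ("Doe", ""): the firstname pair is ("John", "").
        System.out.println(nullsMatch("John", "", NullOperator.NULL_MATCH_NULL)); // false
        System.out.println(nullsMatch("John", "", NullOperator.NULL_MATCH_ALL));  // true
        // ("Doe", "") vs ("Doe", ""): the firstname pair is ("", "").
        System.out.println(nullsMatch("", "", NullOperator.NULL_MATCH_NULL));     // true
        System.out.println(nullsMatch("", "", NullOperator.NULL_MATCH_NONE));     // false
    }
}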
Match Threshold
Enter the match probability. Two data records match when the computed match score is above this value. You can enter a different match threshold for each match rule.
Blocking Selection

Input Column
If required, select the column(s) from the input flow according to which you want to partition the processed data in blocks. Blocking reduces the number of pairs of records that need to be examined, as comparisons are restricted to record pairs within each block. Using blocking column(s) is very useful when you are processing a big data set.
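The sketch below (plain Java, not the component's code) shows why blocking helps: partitioning four records into two blocks on a postal-code column cuts six possible pairwise comparisons down to two.

import java.util.*;

public class BlockingDemo {
    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
                new String[]{"75001", "Doe"},    // {postal_code, lname}
                new String[]{"75001", "Doe"},
                new String[]{"69002", "Smith"},
                new String[]{"69002", "Smyth"});

        // Partition the records into blocks keyed on the blocking column.
        Map<String, List<String[]>> blocks = new HashMap<>();
        for (String[] r : records) {
            blocks.computeIfAbsent(r[0], k -> new ArrayList<>()).add(r);
        }

        // Pairs are compared only within each block: two blocks of two records
        // give 1 + 1 = 2 comparisons instead of 6 for the full data set.
        int comparisons = 0;
        for (List<String[]> block : blocks.values()) {
            comparisons += block.size() * (block.size() - 1) / 2;
        }
        System.out.println(comparisons); // 2
    }
}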
Advanced settings

Store on disk
Select the Store on disk check box if you want to store processed data blocks on the disk to maximize system performance.
Max buffer size: type in the size of physical memory you want to allocate to processed data.
Temporary data directory path: set the location where the temporary file should be stored.

Multiple output
Select the Separate output check box to have three separate output flows:
–Uniques: when the group score (minimal distance computed in the record) is equal to 1, the record is listed in this flow.
–Matches: when the group score is higher than the threshold you define in the Confident match threshold field, the record is listed in this flow.
–Suspects: when the group score is below the threshold you define in the Confident match threshold field, the record is listed in this flow.
Confident match threshold: set a numerical value between the current Match threshold and 1.
Multi-pass
Select this check box to enable a tMatchGroup component to receive data sets from another tMatchGroup component that precedes it in the Job, in order to refine the match groups through subsequent passes.
For an example Job, see Scenario 2: Matching customer data through multiple passes.

Sort the output data by GID
Select this check box to group the output data by the group identifier.

Output distance details
Select this check box to add an output column MATCHING_DISTANCES in the schema of the component. This column provides the distance between the input record and the master record in each group.
Note: When you use two tMatchGroup components linked to each other, select this check box in the second component in the Job flow first; otherwise you may have an issue.

Display detailed labels
Select this check box to have in the output MATCHING_DISTANCES column not only the computed distances but also the names of the columns used as key attributes.
For example, if you try to match on first name and last name columns, fname and lname, the output would read fname:1.0|lname:0.8 rather than 1.0|0.8.

tStatCatcher Statistics
Select this check box to collect log data at the component level.
Global Variables
ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string.
A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.
To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.
For further information about variables, see Talend Studio User Guide.
Usage
This component is an intermediary step. It requires an input flow and one or more output flows.

Usage in Map/Reduce Jobs
If you have subscribed to one of the Talend solutions with Big Data, you can also use this component in a Talend Map/Reduce Job to generate native Map/Reduce code. You need to use the Hadoop Configuration tab in the Run view to define the connection to a given Hadoop distribution for the whole Job. For further information about a Talend Map/Reduce Job, see the sections of the Talend Big Data Getting Started Guide describing how to create and configure a Talend Map/Reduce Job. For a scenario demonstrating a Map/Reduce Job using this component, see the scenario about matching data through multiple passes with Map/Reduce components below. Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs.

Limitation/prerequisite
n/a
The configuration wizard enables you to create different configurations
and their match rules. You can also
use the configuration wizard to import match rules created and tested in the studio and
stored in the repository, and use them in your match Jobs. For further information, see
Importing match rules from the studio repository.
You cannot open the configuration wizard unless you link the input component to the
tMatchGroup component.
To open the configuration wizard:
-
In the studio workspace, design your job and link the components together, for
example as below: -
Either:
-
Double-click tMatchGroup, or
-
Right-click it and from the contextual menu select Configuration Wizard, or
-
Click Preview in the basic settings view of tMatchGroup.
-
The configuration wizard is composed of three areas:
-
the Configuration view, where you can set
the match rules and the blocking column(s). -
the matching chart, which presents the graphical matching result,
-
the matching table, which presents the details of the matching
result.
From this view, you can edit the configuration of the tMatchGroup component or define different configurations in which to
execute the Job. You can use these different configurations for testing purposes, for
example, but you can save only one configuration from the wizard: the open
configuration.
In each configuration, you can define blocking key(s) and multiple conditions
using several match rules. You can also set different match intervals for each rule.
The match results on multiple conditions will list data records that meet any of the
defined rules. When a configuration has multiple conditions, the Job conducts an OR
match operation. It evaluates data records against the first rule and the records
that match are not evaluated against the other rules.
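A compact way to picture this OR operation (hypothetical helper names; not the component's source):

public class OrMatchDemo {

    // Stub standing in for the per-rule key definition (for example,
    // Jaro-Winkler on lname for rule 0, q-grams on address1 for rule 1).
    static double matchScore(String[] a, String[] b, int rule) {
        return a[rule].equalsIgnoreCase(b[rule]) ? 1d : 0d;
    }

    // OR match operation: a pair is accepted by the first rule whose score
    // reaches that rule's match interval; later rules are then skipped.
    static int matchingRule(String[] a, String[] b, double[] intervals) {
        for (int i = 0; i < intervals.length; i++) {
            if (matchScore(a, b, i) >= intervals[i]) {
                return i;
            }
        }
        return -1; // the pair matches no rule
    }

    public static void main(String[] args) {
        String[] r1 = {"Doe", "12 Main St"};
        String[] r2 = {"DOE", "14 Main St"};
        System.out.println(matchingRule(r1, r2, new double[]{0.95, 0.85})); // 0
    }
}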
The parameters required to edit or create a match rule are:
-
The Limit field indicates the maximum
number of rows to be processed by the match rule(s) in the wizard. The
default maximum row number is 1000. -
The Key definition parameters.
-
The Match Threshold field.
To create a new configuration and new match rules from the configuration wizard,
do the following:
-
Click the [+] button on the top right
corner of the Configuration view.This creates, in a new tab, an exact copy of the last
configuration. -
Edit or set the parameters for the new configuration in the Key definition and Blocking
Selection tables. -
If needed, define several match rules for the open configuration as follows:
-
Click the [+] button on the
match rule bar.This creates, in a new tab, an exact copy of the last
rule. -
Set the parameters for the new rule in the Key definition table and define its
match interval. -
Follow the steps above to create as many match rules for a
configuration as needed. You can define a different match
interval for each rule.When a configuration has multiple conditions, the Job conducts
an OR match operation. It evaluates data records against the
first rule and the records that match are not evaluated against
the second rule and so on.
-
-
Click the Chart button at the top right
corner of the wizard to execute the Job in the open configuration.The matching results are displayed in the matching chart and table.
Follow the steps above to create as many new configurations in the wizard
as needed. -
To execute the Job in a specific configuration, open the configuration in
the wizard and click the Chart button.The matching results are displayed in the matching chart and table.
-
At the bottom right corner of the wizard, click either:
-
OK to save the open
configuration.You can save only one configuration in the wizard.
-
Cancel to close the wizard and
keep the configuration saved initially in the wizard.
-
From the matching chart, you can have a global picture about the duplicates in the
analyzed data.
The Hide groups of less than parameter, which is set
to 2 by default, enables you to decide what groups to show in
the result chart. Usually you want to hide groups of small group size.
For example, the above matching chart indicates that:
-
48 items are analyzed and classified into 18 groups according to a
given match rule and after excluding items that are unique, by setting
the Hide groups less than parameter to
2. -
11 groups have 2 items each. In each group, the 2 items are duplicates
of each other. -
3 groups have 3 items each. In each group, these items are duplicates
of one another. -
3 groups have 4 items each. In each group, these items are duplicates
of one another. -
One single group has 5 duplicate items.
From the matching table, you can read details about the different
duplicates.
This table indicates the matching details of items in each group and colors the
groups in accordance with their color in the matching chart.
You can decide what groups to show in this table by setting the Hide groups of less than parameter. This parameter
enables you to hide groups of small group size. It is set to 2
by default.
The buttons under the table help you to navigate back and forth through
pages.
From the tMatchGroup configuration wizard, you
can import match keys from the match rules created and tested in the Profiling perspective of Talend Studio and stored in
the repository. You can then use these imported matching keys in your match
Jobs.
The tMatchGroup component is based on the VSR algorithm.
You cannot import match rules configured with the T-Swoosh algorithm. A warning
message displays in the wizard when you try to import rules with T-Swoosh.
The VSR algorithm takes a set of records as input and groups similar encountered
duplicates together according to defined match rules. It compares pairs of records
and assigns them to groups. The first processed record of each group is the master
record of the group. The VSR algorithm compares each record with the master of each
group and uses the computed distances, from master records, to decide to what group
the record should go.
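The following sketch restates that grouping logic in plain Java; similarity() is a stand-in for the configured match rule, not the component's actual implementation:

import java.util.*;

public class VsrSketch {

    // Stand-in for the configured match rule; returns a score in [0,1].
    static double similarity(String a, String b) {
        return a.equalsIgnoreCase(b) ? 1d : 0d;
    }

    // Groups records the way the VSR algorithm is described above: each record
    // is compared with the master (first record) of every existing group and
    // joins the best-matching group above the threshold, or founds a new group.
    static List<List<String>> group(List<String> records, double threshold) {
        List<List<String>> groups = new ArrayList<>();
        for (String record : records) {
            List<String> best = null;
            double bestScore = 0d;
            for (List<String> g : groups) {
                double score = similarity(record, g.get(0)); // compare to master
                if (score >= threshold && score > bestScore) {
                    best = g;
                    bestScore = score;
                }
            }
            if (best == null) {
                best = new ArrayList<>(); // the record becomes the master
                groups.add(best);
            }
            best.add(record);
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(group(Arrays.asList("Doe", "DOE", "Smith"), 0.9));
        // [[Doe, DOE], [Smith]]
    }
}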
To import match rules from the studio repository:
-
From the configuration wizard, click the icon on the top right corner.
The [Match Rule Selector] wizard opens
listing all match rules created in the studio and saved in the
repository. -
Select the match rule you want to import into the tMatchGroup component and use on your data.
Note
– A warning message displays in the wizard if the match rule you want
to import is defined on columns that do not exist in the input schema of
tMatchGroup. You can define input
columns later in the configuration wizard.
– A warning message displays in the wizard if you try to import match
rules configured with the T-Swoosh algorithm. For further information
about the T-Swoosh algorithm, see Talend Studio User
Guide. -
Select the Overwrite current Match Rule in the
analysis check box if you want to replace the rule in the
configuration wizard with the rule you import.If you leave the box unselected, the match keys will be imported in a new
match rule tab without overwriting the current match rule in the
wizard. -
Click OK.
The matching key is imported from the match rule and listed as a new rule
in the configuration wizard. -
Click in the Input Key Attribute and
select from the input data the column on which you want to apply the
matching key. -
In the Match threshold field, enter the
match probability threshold. Two data records match when the computed match
score is above this value. -
In the Blocking Selection table, select
the column(s) from the input flow which you want to use as a blocking
key.Defining a blocking key is not mandatory but advisable. Using a blocking
key partitions data in blocks and so reduces the number of records that need
to be examined, as comparisons are restricted to record pairs within each
block. Using blocking key(s) is very useful when you are processing big data
set.Note that the Blocking Selection table in
the component is different from the Generation of
Blocking Key table in the match rule editor in the Profiling perspective.The blocking column in tMatchGroup could
come from a tGenKey component (and would be
called T_GEN_KEY) or directly from the input schema (it
could be a ZIP column for instance). The
Generation of Blocking Key table in the
match rule editor, on the other hand, defines the parameters necessary to generate a blocking
key; it is equivalent to the tGenKey component and generates a blocking column
BLOCK_KEY used for blocking. -
Click the Chart button in the top right
corner of the wizard to execute the Job using the imported match rule and
show the matching results in the wizard.
Warning
The information in this section is only for users that have subscribed to one of
the Talend solutions with Big Data and is not applicable to
Talend Open Studio for Big Data users.
In a Talend Map/Reduce Job, tMatchGroup, as well as the whole Map/Reduce Job using it, generates
native Map/Reduce code. This section presents the specific settings in the configuration
wizard of tMatchGroup when it is used in that
situation. For further information about a Talend Map/Reduce Job, see the Talend Big Data Getting Started Guide.
You cannot open the configuration wizard unless you link an input component to the
tMatchGroup component.
From the configuration wizard in tMatchGroup, you can:
-
define multiple conditions using several match rules to group data,
-
set different match intervals for each rule,
-
import match rules created and tested in the studio and stored in the repository, and use
them in your match Jobs. You can only import rules configured with the VSR
algorithm. For further information, see Importing match rules from the studio repository. -
select a blocking key to partition data.
The match results on multiple conditions will list data records that meet any of the
defined rules.
To create match rules from the configuration wizard, do the following:
-
Click the [+] button on the match rule
bar. -
Set the parameters for the new rule in the Key
definition table and define its match interval. -
Repeat the above steps to create as many match rules as needed. You can define
a different match interval for each rule.When you define multiple rules, the Job conducts an OR match operation. It
evaluates data records against the first rule and the records that match are not
evaluated against the second rule. -
In the Blocking Selection table, select the
column(s) from the input flow which you want to use as a blocking key.Defining a blocking key is not mandatory but is very useful when you are
processing big data sets. A blocking key partitions data in blocks and so
reduces the number of records that need to be examined. This key can come from a
tGenKey component (and would be called
T_GEN_KEY) or directly from the input schema. -
At the bottom right corner of the wizard, click either:
-
OK to save the current
configuration. -
Cancel to close the wizard and keep
the configuration saved initially in the wizard.
-
This scenario describes a basic Job that compares columns in the input file using the
Jaro-Winkler matching method on the
lname and fname columns and the q-grams matching method on the address1
column. It then groups the output records in three output flows:
-
Uniques: lists the records whose group
score (minimal distance computed in the record) is equal to
1. -
Matches: lists the records whose group
score (minimal distance computed in the record) is higher than the threshold
you define in the Confident match threshold
field. -
Suspects: lists the records whose group
score (minimal distance computed in the record) is below the threshold you
define in the Confident match threshold
field.
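The routing of a record into these three flows can be sketched as follows (an illustration of the rules above, not the component's source; borderline scores are treated here exactly as the rules are worded):

public class OutputRouting {

    // Routes a group based on its group score (minimal distance computed in
    // the group) and the value of the Confident match threshold field.
    static String route(double groupScore, double confidentThreshold) {
        if (groupScore == 1d) {
            return "Uniques";
        }
        return groupScore > confidentThreshold ? "Matches" : "Suspects";
    }

    public static void main(String[] args) {
        System.out.println(route(1.0, 0.9));  // Uniques
        System.out.println(route(0.95, 0.9)); // Matches
        System.out.println(route(0.87, 0.9)); // Suspects
    }
}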
For another scenario that groups the output records in one single output flow, see
Scenario 2: Comparing columns and grouping in the output flow duplicate
records that have the same functional key.
-
Drop the following components from the Palette onto the design workspace: tFileInputExcel, tMatchGroup and three tLogRows.
-
Connect tFileInputExcel to tMatchGroup using the Main
link. -
Connect tMatchGroup to the three
tLogRow components using the Unique rows, Confident
groups and Uncertain groups
links.Warning
To be able to set three different output flows for the processed
records, you must first select the Separate
output check box in the Advanced
settings view of the tMatchGroup component. For further information, see the
section about configuring the tMatchGroup component.
The main input file contains eight columns: account_num,
lname, fname,
mi, address1,
city, state_province and
postal_code. The data in this input file has problems
such as duplication, names spelled differently or wrongly, different information
for the same customer.
You can create the input file used in this scenario by executing the
c0 and c1 Jobs included in the
data quality demo project, TDQEEDEMOJAVA, which you can import from the login window
of your Talend Studio. For
further information, see the Talend Studio User
Guide.
-
In the Basic settings view of tFileInputExcel, fill in the File Name field by browsing to the input file
and set other properties in case they are not stored in the Repository. -
Create the schema through the Edit Schema
button, if the schema is not already stored in the Repository. Remember to set the data type in the Type column.
-
Double-click tMatchGroup to display the
Basic settings view and define the
component properties. -
Click Sync columns to retrieve the schema
from the preceding component. -
Click the Edit schema button to view the
input and output schema and do any modifications in the output schema, if
necessary.In the output schema of this component there are a few standard output
columns that are read-only. For more information, see tMatchGroup properties. -
Click OK to close the dialog box.
-
Click Preview to open the configuration
wizard and define the component configuration and the match rule(s).You can use the configuration wizard to import match rules created and
tested in the studio and stored in the repository, and use them in your
match Jobs. For further information, see Importing match rules from the studio repository. -
Define the first match rule as follows:
-
In the Key definition table,
click the [+] button to add to the
list the column(s) on which you want to do the matching operation,
lname and
fname.Note
When you select a date column on which to apply an algorithm or a matching algorithm,
you can decide what to compare in the date format.For example, if you want to only compare the year in the date, in the component schema
set the type of the date column to Date and then enter
“yyyy” in the Date
Pattern field. The component then converts the date format to a string
according to the pattern defined in the schema before starting a string
comparison. -
Click in the cell of the Matching
type column and select from the list Jaro-Winkler as the method to be used for
the matching operation.If you select custom as a
matching type, you must set in the Custom
Matcher Class column the path pointing to the custom
class (external matching algorithm) you need to use. This path is
defined by yourself in the library file (.jar
file). -
Click in the cell of the Confidence
Weight column to set the numerical weights for the
two columns used as key attributes. -
Click in the cell of the Handle
Null column and select the null operator you want to
use to handle null attributes in the columns. In this example,
select Null Match None in order to
have matching results where null values have minimal effect. -
Set the match probability in the Match
Interval field.
-
-
Follow the same procedure in the above step to define the second match
rule.Set the address1 column as an input attribute and
select Jaro as the matching type. Select
Null Match None as the null operator.
And finally set the match probability which can be different from the one
set for the first rule. -
Set the Hide groups of less than
parameter in order to decide what groups to show in the result chart and
matching table. This parameter enables you to hide groups of small group
size. -
Click the Advanced settings tab and set
the advanced parameters for the tMatchGroup
component as follows:-
Select the Separate output check
box.The component will have three separate output flows: Unique rows, Confident groups and Uncertain
groups.If this check box is not selected, the tMatchGroup component will have only one output flow
where it groups all output data. For an example scenario, see Scenario 2: Comparing columns and grouping in the output flow duplicate
records that have the same functional key. -
Select the Sort the output data by
GID check box to sort the output data by their group
identifier. -
Select the Output distance
details and Display detailed
labels check boxes.The component will output the
MATCHING_DISTANCES column. This column
provides the distance between the input and the master columns
giving also the names of the columns against which the records are
matched.
-
-
Click the Chart button in the wizard to
execute the Job in the defined configuration and have the matching results
directly in the wizard.The matching chart gives a global picture about the duplicates in the
analyzed data. The matching table indicates the details of items in each
group and colors the groups in accordance with their color in the matching
chart.The Job conducts an OR match operation on the records. It evaluates the
records against the first rule and the records that match are not evaluated
against the second rule. The MATCHING_DISTANCES column allows you to understand which
rule has been used on what records. In the yellow data group for example,
the Amole Sarah record is matched according to the
second rule that uses address1 as a key attribute,
whereas the other records in the group are matched according to the first
rule which uses the lname and
fname as key attributes.You can set the Hide groups of less than
parameter in order to decide what groups to show in the matching chart and
table.
-
Double-click each of the tLogRow
components to display the Basic
settings view and define the component properties. -
Save your Job and press F6 to execute
it.You can see that records are grouped together in three different groups.
Each record is listed in one of the three groups according to the value of
the group score which is the minimal distance computed in the group.The identifier for each group, which is of String
data type, is listed in the GID column next to the
corresponding record. This identifier will be of the data type
Long for Jobs that are migrated from older
releases. To have the group identifier as String, you
must replace the tMatchGroup component in
the imported Job with tMatchGroup from the
studio Palette.The number of records in each of the three output blocks is listed in the
GRP_SIZE column and computed only on the master
record. The MASTER column indicates with true or false
whether the corresponding record is a master record. The
SCORE column lists the calculated distance between
the input record and the master record according to the Jaro-Winkler and Jaro matching algorithms.The Job evaluates the records against the first rule and the records that
match are not evaluated against the second rule.All records whose group score is between the match interval,
0.95 or 0.85 depending on the
applied rule, and the confidence threshold defined in the advanced settings
of tMatchGroup are listed in the Suspects output flow.All records whose group score is above one of the match probabilities are
listed in the Matches output flow.All records that have a group size equal to 1 are listed in the Uniques output flow.
For another scenario that groups the output records in one single output flow
based on a generated functional key, see Scenario 2: Comparing columns and grouping in the output flow duplicate
records that have the same functional key.
The Job in this scenario groups similar customer records by running through two
subsequent matching passes (tMatchGroup components) and
outputs the calculated matches in groups. Each pass provides its matches to the pass
that follows in order for the latter to add more matches identified with new rules and
blocking keys.
In this Job:
-
The tMysqlInput component connects to the
customer records to be processed. -
Each of the tGenKey components defines a way
to partition data records. The first key partitions data into many groups and the
second key creates fewer groups that overlap the previous blocks depending on
the blocking key definition. -
The tMap component renames the key generated
by the second tGenKey component. -
The first tMatchGroup processes the
partitions defined by the first tGenKey, and
the second tMatchGroup processes those defined
by the second tGenKey.Warning
The two tMatchGroup components must
have the same schema. -
The tLogRow component presents the matching
results after the two passes.
In this scenario, the main input schema is already stored in the Repository. For more information about storing schema
metadata in the repository, see the Talend Studio User
Guide.
-
In the Repository tree view, expand
Metadata – DB
Connections where you have stored the main input schema and
drop the database table onto the design workspace. The input table used in
this scenario is called customer.A dialog box is displayed with a list of components.
-
Select the relevant database component, tMysqlInput in this example, and then click OK.
-
Drop two tGenKey components, two
tMatchGroup components, a tMap component and a tLogRow component from the Palette onto the design workspace. -
Link the input component to the tGenKey
and tMap components using Main links. -
In the two tMatchGroup components, select
the Output distance details check boxes in
the Advanced settings view of both
components before linking them together.This will provide the MATCHING_DISTANCES column in
the output schema of each tMatchGroup.Note
If the two tMatchGroup components are
already linked to each other, you must select the Output distance details check box in the second
component in the Job flow first; otherwise you may have an issue. -
Link the two tMatchGroup components and
the tLogRow component using Main links. -
If needed, give the components specific labels to reflect their usage in
the Job.For further information about how to label a component, see Talend Studio
User Guide.
Connecting to the input data
-
Double-click tMysqlInput to open its
Component view.The property fields for tMysqlInput are
automatically filled in. If you do not define your input schema locally in
the repository, fill in the details manually after selecting Built-in in the Schema and Property Type
lists.The input table used in this scenario is called
customer. -
Modify the query in the Query box to
select only the columns you want to match:
account_name, lname,
fname, mi,
address1, city,
state_province and
postal_code.
Configuring the key generation for the first pass
-
Double-click the first tGenKey to open
the Component view. -
Click and import blocking keys from match rules created and
tested in the Profiling perspective
of Talend Studio and use them in your Job. Otherwise,
define the blocking key parameters as described in the below steps. -
Under the Algorithm table, click the
[+] button to add two rows in the
table. -
On the column column, click the newly
added row and select from the list the column you want to process using an
algorithm. In this example, select lname. -
Do the same on the second row to select
postal_code. -
On the pre-algorithm column, click the
newly added row and select from the list the pre-algorithm you want to apply
to the corresponding column.In this example, select remove diacritical marks and
convert to upper case to remove any diacritical marks and
convert the fields of the lname column
to upper case.Note
This conversion does not change your raw data.
-
On the algorithm column, click the newly
added row and select from the list the algorithm you want to apply to the
corresponding column. In this example, select N first
characters of each word. -
Do the same for the second row on the algorithm column to select first N
characters of the string. -
Click in the Value column next to the
algorithm column and enter the value
for the selected algorithm, when needed.In this scenario, enter 1 for both rows. The first letter of each
field in the corresponding columns will be used to generate the key.
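For intuition, the following plain-Java sketch approximates the key the two rows above produce (a simplification: it keeps only the first character of a single-word lname and of postal_code; this is not tGenKey's actual code):

import java.text.Normalizer;

public class GenKeySketch {

    // Mirrors the two Algorithm rows defined above: remove diacritical marks
    // and convert to upper case, then keep the first character of lname and
    // the first character of postal_code to build the blocking key.
    static String genKey(String lname, String postalCode) {
        String cleaned = Normalizer.normalize(lname, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "") // strip diacritical marks
                .toUpperCase();
        return cleaned.substring(0, 1) + postalCode.substring(0, 1);
    }

    public static void main(String[] args) {
        // Both spellings land in the same block, so they will be compared.
        System.out.println(genKey("Émond", "75001")); // E7
        System.out.println(genKey("Emond", "75002")); // E7
    }
}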
Configuring the key generation for the second pass
-
Double-click the second tGenKey to open
the Component view. -
In the Algorithm table, define the column
you want to use to partition data, account_num in this
component. Select the first N characters of the
string algorithm and set the value to 1
in the Value column.Each of the two tGenKey components will
generate a read_only T_GEN_KEY column in
the output schema. You must rename one of the T_GEN_KEY columns to stop them from overwriting each
other. -
Double-click the tMap component to open
its editor. -
In the Schema editor, copy the columns
from the first table onto the second table and rename T_GEN_KEY to T_GEN_KEY1, for example. -
In the top part of the editor, drop all columns from the input table to
the output table. -
Click OK to save the data transformation and
close the editor. -
In the tGenKey basic settings, click the button to verify that the two generated keys are named
differently in the output schema.
Configuring the first pass
-
Double-click the first tMatchGroup
labelled pass1 to display the Configuration Wizard. -
Click and import matching keys from the match rules created
and tested in the Profiling perspective
of Talend Studio and use them in your Job. Otherwise,
define the matching key parameters as described in the below steps. -
In the Key definition table, click the
[+] button to add the column(s) on
which you want to do the matching operation, lname in this scenario.Note
When you select a date column on which to apply an algorithm or a matching algorithm,
you can decide what to compare in the date format.For example, if you want to only compare the year in the date, in the component schema
set the type of the date column to Date and then enter
“yyyy” in the Date
Pattern field. The component then converts the date format to a string
according to the pattern defined in the schema before starting a string
comparison. -
Select the Jaro-Winkler algorithm in the Matching Function column.
-
Set Weight to 1 and
in the Handle Null column, select the null
operator you want to use to handle null attributes in the column, Null Match Null in this scenario. -
Click the [+] button below the Blocking Selection table to add one row in the
table then click in the line and select from the list the column you want to
use as a blocking value, T_GEN_KEY in
this example.Using a blocking value reduces the number of pairs of records that need
to be examined. The input data is partitioned into exhaustive blocks based
on the functional key. This will decrease the number of pairs to compare, as
comparison is restricted to record pairs within each block. -
If required, click Edit schema to open
the schema editor and see the schema retrieved from the previous component
in the Job. -
Click the Advanced settings tab and
select the Sort the output data by GID
check box to arrange the output data by their group IDs.
Configuring the second pass
-
Double-click the second tMatchGroup
component labelled pass2 to display the
Configuration Wizard.If this component does not have the same schema as the preceding
component, a warning icon appears. If so, click the Sync columns button to retrieve the schema from the
preceding one and once done, the warning icon disappears. -
In the Key Definition table, click
the [+] button to add the column(s) on
which you want to do the matching operation, lname in this scenario.Note
When you select a date column on which to apply an algorithm or a matching algorithm,
you can decide what to compare in the date format.For example, if you want to only compare the year in the date, in the component schema
set the type of the date column to Date and then enter
“yyyy” in the Date
Pattern field. The component then converts the date format to a string
according to the pattern defined in the schema before starting a string
comparison. -
Select the Jaro-Winkler algorithm in the Matching Function column.
-
Set Weight to 1 and
in the Handle Null column, select the null
operator you want to use to handle null attributes in the column, Null Match Null in this scenario. -
Click the [+] button below the Blocking Selection table to add one row in the table then
click in the line and select from the list the column you want to use as a
blocking value, T_GEN_KEY1 in this
example. -
Click the Advanced settings tab and
select the Multi-pass check box. This
option enables tMatchGroup to receive data
sets from the tMatchGroup that precedes it
in the Job. -
In the Advanced settings view, select the
Sort the output data by GID check box
to arrange the output data by their group IDs.
Executing the Job and showing the results on the console
In order to show the match groups created after the first pass and compare
them with the groups created after the second pass, you must modify the Job as
follows:
-
Use a tReplicate component to
replicate the input flow you want to process as shown in the above
figure. Use a copy/paste operation to create the two parts of the
Job. -
Keep only the first-pass tMatchGroup component in the upper part of the Job
and show the match results in a tLogRow component. -
Use two passes in the lower part of the Job and show the final
match results in a tLogRow
component.
-
Double-click each of the tLogRow
components to open the Component view and
in the Mode area, select the Table (print values in cells of a table)
option. -
Save your Job and press F6 to execute
it.The results after the first pass read as follows:
The results after the second pass read as follows:
When you compare, for example, the customer name Wilson from the results of the two passes, you will find
that more customers using the last name Wilson are grouped together after the second pass.
Note that Talend Map/Reduce components are available only to users
who have subscribed to one of the Talend solutions with Big Data.
This scenario shows how to create a Talend Map/Reduce Job to match data by
using Map/Reduce components. It generates Map/Reduce code and runs right in
Hadoop.
The Job in this scenario groups similar customer records by running through two
subsequent matching passes (tMatchGroup components) and
outputs the calculated matches in groups. Each pass provides its matches to the pass
that follows in order for the latter to add more matches identified with new rules and
blocking keys.
This Job is a duplication of the Standard data
integration Job described in Scenario 2: Matching customer data through multiple passes where standard components are replaced with Map/Reduce components.
You can use Talend Studio to automatically
convert the standard Job in the previous section to a Map/Reduce Job. This way, you do
not need to redefine the settings of the components in the Job.
Before starting to replicate this scenario, ensure that you have appropriate rights
and permissions to access the Hadoop distribution to be used.
-
In the Repository tree view of the Integration perspective of Talend Studio, right-click the
Job you have created in the earlier scenario to open its contextual menu and
select Edit properties.Then the [Edit properties] dialog box is
displayed. Note that the Job must be closed before you are able to make any
changes in this dialog box.This dialog box looks like the image below:
Note that you can change the Job name as well as the other descriptive
information about the Job from this dialog box. -
Click Convert to Map/Reduce Job. Then a
Map/Reduce Job using the same name appears under the Map/Reduce Jobs sub-node of the Job
Design node.
If you need to create this Map/Reduce Job from scratch, you have to right-click the
Job Design node or the Map/Reduce Jobs sub-node and select Create
Map/Reduce Job from the contextual menu. Then an empty Job is opened in
the workspace. For further information, see the section describing how to create a
Map/Reduce Job of the Talend Big Data Getting Started Guide.
-
Double-click the new Map/Reduce Job to open it in the workspace.
The Map/Reduce component Palette is
opened. A crossed-out component, if any, indicates that it does not have the
Map/Reduce version. -
Delete tMysqlInput in this scenario and
drop tRowGenerator from the Palette to the workspace. Link it to tGenKey with a Row >
Main link. -
Double-click tRowGenerator to open its
editor. -
Define the schema you want to use to write data in Hadoop.
-
Click OK to validate your schema and
close the editor. -
Leave the settings of the other components as you defined initially in the
standard version of the Job.
-
Click Run to open its view and then click the
Hadoop Configuration tab to display its
view for configuring the Hadoop connection for this Job.This view looks like the image below:
-
From the Property type list, select Built-in. If you have created the connection to be
used in Repository, then select Repository and thus the Studio will reuse that set of
connection information for this Job.For further information about how to create an Hadoop connection in
Repository, see the chapter describing the Hadoop
cluster node of the Talend Big Data Getting Started Guide. -
In the Version area, select the Hadoop
distribution to be used and its version. If you cannot find from the list the
distribution corresponding to yours, select Custom so as to connect to a Hadoop distribution not officially
supported in the Studio.For a step-by-step example about how to use this Custom option, see Connecting to a custom Hadoop distribution.
Along with the evolution of Hadoop, please note the
following changes:-
If you use Hortonworks Data Platform
V2.2, the configuration files of your cluster might be using
environment variables such as ${hdp.version}. If this is your situation, you need to set
the mapreduce.application.framework.path property in the
Hadoop properties table with the path
value explicitly pointing to the MapReduce framework archive of your
cluster. For
example: mapreduce.application.framework.path=/hdp/apps/2.2.0.0-2041/mapreduce/mapreduce.tar.gz#mr-framework -
If you use Hortonworks Data Platform
V2.0.0, the type of the operating system for running the
distribution and a Talend Job must be the same,
such as Windows or Linux. Otherwise, you have to use Talend Jobserver to execute the Job in the same
type of operating system in which the Hortonworks
Data Platform V2.0.0 distribution you are using is run. For
further information about Talend Jobserver, see
Talend
Installation and Upgrade Guide.
-
-
In the Name node field, enter the location of
the master node, the NameNode, of the distribution to be used. For example,
hdfs://tal-qa113.talend.lan:8020.If you are using a MapR distribution, you can simply leave maprfs:/// as it is in this field; then the MapR
client will take care of the rest on the fly for creating the connection. The
MapR client must be properly installed. For further information about how to set
up a MapR client, see the following link in MapR’s documentation: http://doc.mapr.com/display/MapR/Setting+Up+the+Client -
In the Job tracker field, enter the location
of the JobTracker of your distribution. For example, tal-qa114.talend.lan:8050.Note that the notion of Job in the term JobTracker designates the MR or the
MapReduce jobs described in Apache’s documentation on http://hadoop.apache.org/.If you use YARN in your Hadoop cluster such as Hortonworks Data Platform V2.0.0 or Cloudera CDH4.3 + (YARN mode), you need to specify the location
of the Resource Manager instead of the
Jobtracker. Then you can continue to set the following parameters depending on
the configuration of the Hadoop cluster to be used (if you leave the check box
of a parameter clear, then at runtime, the configuration about this parameter in
the Hadoop cluster to be used will be ignored):-
Select the Set resourcemanager scheduler
address check box and enter the Scheduler address in
the field that appears. -
Select the Set jobhistory address
check box and enter the location of the JobHistory server of the
Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server. -
Select the Set staging directory
check box and enter this directory defined in your Hadoop cluster
for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in the
configuration files such as yarn-site.xml or mapred-site.xml of your distribution. -
Select the Use datanode hostname
check box to allow the Job to access datanodes via their hostnames.
This actually sets the dfs.client.use.datanode.hostname property to
true. When connecting to an
S3N filesystem, you must select this check box.
-
-
If you are accessing the Hadoop cluster running with Kerberos security, select this check
box, then enter the Kerberos principal name for the NameNode in the field displayed. This
enables you to use your user name to authenticate against the credentials stored in
Kerberos.In addition, since this component performs Map/Reduce computations, you also need to
authenticate the related services such as the Job history server and the Resource manager or
Jobtracker depending on your distribution in the corresponding field. These principals can
be found in the configuration files of your distribution. For example, in a CDH4
distribution, the Resource manager principal is set in the yarn-site.xml file and the Job history principal in the mapred-site.xml file.If you need to use a Kerberos keytab file to log in, select Use a
keytab to authenticate. A keytab file contains pairs of Kerberos principals
and encrypted keys. You need to enter the principal to be used in the Principal field and the access path to the keytab file itself in the
Keytab field.Note that the user that executes a keytab-enabled Job is not necessarily the one a
principal designates but must have the right to read the keytab file being used. For
example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this situation, ensure that user1 has the right to read the keytab file to be used. -
In the User name field, enter the login user
name for your distribution. If you leave it empty, the user name of the machine
hosting the Studio will be used. -
In the Temp folder field, enter the path in
HDFS to the folder where you store the temporary files generated during
Map/Reduce computations. -
Leave the default value of the Path separator in server as
it is, unless you have changed the separator used by your Hadoop distribution’s host machine
for its PATH variable or in other words, that separator is not a colon (:). In that
situation, you must change this value to the one you are using in that host. -
Leave the Clear temporary folder check box
selected, unless you want to keep those temporary files. -
Leave the Compress intermediate map output to reduce
network traffic check box selected, so as to shorten the time needed
to transfer the mapper task partitions to the multiple reducers.However, if the data transfer in the Job is negligible, it is recommended to
clear this check box to deactivate the compression step, because this
compression consumes extra CPU resources. -
If you need to use custom Hadoop properties, complete the Hadoop properties table with the property or
properties to be customized. Then at runtime, these changes will override the
corresponding default properties used by the Studio for its Hadoop
engine.For further information about the properties required by Hadoop, see Apache’s
Hadoop documentation on http://hadoop.apache.org, or
the documentation of the Hadoop distribution you need to use. -
If the Hadoop distribution to be used is Hortonworks Data Platform V1.2 or Hortonworks
Data Platform V1.3, you need to set proper memory allocations for the map and reduce
computations to be performed by the Hadoop system.In that situation, you need to enter the values you need in the Mapred
job map memory mb and the Mapred job reduce memory
mb fields, respectively. By default, the values are both 1000, which is normally appropriate for running the
computations.If the distribution is YARN, then the memory parameters to be set become Map (in Mb), Reduce (in Mb) and
ApplicationMaster (in Mb), accordingly. These fields
allow you to dynamically allocate memory to the map and the reduce computations and the
ApplicationMaster of YARN.
For further information about this Hadoop
Configuration tab, see the section describing how to configure the Hadoop
connection for a Talend Map/Reduce Job of the Talend Big Data Getting Started Guide.
For further information about the Resource Manager, its scheduler and the
ApplicationMaster, see YARN’s documentation such as http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/.
For further information about how to determine YARN and MapReduce memory configuration
settings, see the documentation of the distribution you are using, such as the following
link provided by Hortonworks: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html.
-
Save the Job and press F6 to execute
it.Match results are displayed on the studio console.
Matches are calculated by running through the two passes. Each pass
provides its matches to the pass that follows and more matches are
identified with the rule and blocking key of the second pass.More customers with similar last names are grouped together after the
second pass.