tMatchGroup – Docs for ESB 5.x

tMatchGroup

tGroup_icon32_white.png

Warning

This component will be available in the Palette of
Talend Studio on the condition that you have subscribed to one of
the Talend Platform products.

tMatchGroup properties

Component family

Data Quality

 

Function

tMatchGroup compares columns in
both standard input data flows and in Map/Reduce input data flows by
using matching methods, and groups the similar records it encounters
together as duplicates.

Several tMatchGroup components
can be used sequentially to match data against different blocking
keys. Each pass refines the groups received from the previous
tMatchGroup component by creating data partitions that overlap the
previous data blocks.

In defining a group, the first processed record of each group is
the master record of the group. The distance between each remaining
record and the master records is computed, and each record is then
assigned to the group of the matching master record.

Purpose

This component helps you to create groups of similar data records
in any source data including large volumes of data by using one or
several match rules.

Basic settings

Schema and Edit
schema

A schema is a row description. It defines the number of fields to
be processed and passed on to the next component. The schema is
either Built-in or stored remotely
in the Repository.

Since version 5.6, both the Built-In mode and the Repository mode are
available in any of the Talend solutions.

Click Sync columns to retrieve
the schema from the previous component in the Job.

The output schema of this component contains the following
read-only fields:

GID: provides a group identifier of the data
type String.

Note

All Jobs with tMatchGroup
that are migrated from older releases into your current
studio may provide a group identifier of the data type
Long. If you want to have a group
identifier of the data type String, you
must replace the tMatchGroup component in these Jobs with
tMatchGroup from the
studio Palette.

GRP_SIZE: counts the number of records in the
group, computed only on the master record.

MASTER: identifies, by true or
false, if the record used in the matching
comparisons is a master record. There is only one master record per
group.

Each input record is compared to the master record; if they
match, the input record joins that group.

SCORE: measures the distance between the
input record and the master record according to the matching
algorithm used.

When the tMatchGroup component
is used with multiple output flows, the score in this column
decides to which output flow the record goes.

GRP_QUALITY: provides the quality of
similarities in the group by taking the minimal matching value. Only
the master record has a quality score.

 

 

Built-in: You create and store
the schema locally for this component only. Related topic: see
Talend Studio User
Guide
.

 

 

Repository: You have already
created and stored the schema in the Repository. You can reuse it in
other projects and job designs. Related topic: see Talend Studio User Guide.

PREVIEW

This button opens a configuration wizard that enables you to
define production environments and their match rules or to import
match rules from the studio repository. For further information, see
Configuration wizard.

Key Definition

Input Key Attribute

Select the column(s) from the input flow on which you want to
apply a matching algorithm.

Note

When you select a date column on which to apply a matching algorithm,
you can decide what part of the date format to compare.

For example, if you want to compare only the year in the date, set the
type of the date column to Date in the component schema and then enter
“yyyy” in the Date Pattern field. The component then converts the date
to a string according to the pattern defined in the schema before
starting a string comparison.
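
A minimal illustration of that conversion in plain Java (the dates below
are made up; SimpleDateFormat mirrors what the component does with the
pattern):

    import java.text.SimpleDateFormat;
    import java.util.GregorianCalendar;

    public class DatePatternDemo {
        public static void main(String[] args) {
            // With Date Pattern "yyyy", only the formatted year takes part in
            // the comparison: two dates of the same year become equal strings.
            SimpleDateFormat pattern = new SimpleDateFormat("yyyy");
            String a = pattern.format(new GregorianCalendar(2015, 2, 1).getTime());   // "2015"
            String b = pattern.format(new GregorianCalendar(2015, 10, 23).getTime()); // "2015"
            System.out.println(a.equals(b)); // true: the strings match
        }
    }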

 

 

Matching Function

Select a matching algorithm from the list:

Exact: matches each processed
entry to all possible reference entries with exactly the same value.
It returns 1 when the two strings exactly
match, otherwise it returns 0.

Exact – ignore case: matches each
processed entry to all possible reference entries with exactly the
same value while ignoring the value case.

Soundex: matches processed
entries according to a standard English phonetic algorithm. It
indexes strings by sound, as pronounced in English, for example
“Hello”: “H400”.

Levenshtein (edit distance):
calculates the minimum number of edits (insertion, deletion or
substitution) required to transform one string into another. Using
this algorithm in the tMatchGroup
component, you do not need to specify a maximum distance. The
component automatically calculates a matching percentage based on
the distance. This matching score will be used for the global
matching calculation, based on the weight you assign in the Confidence Weight field.
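
As a sketch of how an edit distance can become a matching percentage (the
component's exact normalization is not spelled out here; a common choice
divides by the length of the longer string):

    public class LevenshteinDemo {
        // Classic dynamic-programming edit distance.
        static int distance(String s, String t) {
            int[][] d = new int[s.length() + 1][t.length() + 1];
            for (int i = 0; i <= s.length(); i++) d[i][0] = i;
            for (int j = 0; j <= t.length(); j++) d[0][j] = j;
            for (int i = 1; i <= s.length(); i++)
                for (int j = 1; j <= t.length(); j++)
                    d[i][j] = Math.min(
                            Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                            d[i - 1][j - 1] + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1));
            return d[s.length()][t.length()];
        }

        public static void main(String[] args) {
            String a = "Jonson", b = "Johnson";
            int dist = distance(a, b); // 1: a single insertion
            // One common normalization; the studio's exact formula may differ.
            double score = 1.0 - (double) dist / Math.max(a.length(), b.length());
            System.out.println(dist + " -> " + score); // 1 -> ~0.857
        }
    }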

Metaphone: Based on a phonetic
algorithm for indexing entries by their pronunciation. It first
loads the phonetics of all entries of the lookup reference and
checks all entries of the main flow against the entries of the
reference flow.

Double Metaphone: a newer version
of the Metaphone phonetic algorithm that produces more accurate
results than the original algorithm. It can return both a primary
and a secondary code for a string. This accounts for some ambiguous
cases as well as for multiple variants of surnames with common
ancestry.

Soundex FR: matches processed
entries according to a standard French phonetic algorithm.

Jaro: matches processed entries
according to spelling deviations. It counts the number of matched
characters between two strings. The higher the returned value, the
more similar the strings are.

Jaro-Winkler: a variant of Jaro,
but it gives more importance to the beginning of the string.

q-grams: matches processed
entries by dividing strings into letter blocks of length
q in order to create a number of q
length grams. The matching result is given as the number of
q-gram matches over possible q-grams.
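
For instance, with q set to 3, “Hello” yields the grams Hel, ell and llo,
while “Hallo” yields Hal, all and llo, so only one gram is shared. A small
sketch (one simple ratio among several possible; real implementations may
also pad the strings or weight repeated grams):

    import java.util.HashSet;
    import java.util.Set;

    public class QGramsDemo {
        // Collect the letter blocks of length q contained in s.
        static Set<String> qgrams(String s, int q) {
            Set<String> grams = new HashSet<>();
            for (int i = 0; i + q <= s.length(); i++) grams.add(s.substring(i, i + q));
            return grams;
        }

        public static void main(String[] args) {
            Set<String> a = qgrams("Hello", 3); // [Hel, ell, llo]
            Set<String> b = qgrams("Hallo", 3); // [Hal, all, llo]
            Set<String> shared = new HashSet<>(a);
            shared.retainAll(b);                // [llo]
            double score = 2.0 * shared.size() / (a.size() + b.size());
            System.out.println(score);          // ~0.33
        }
    }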

custom…: enables you to load an
external matching algorithm from a Java library using the custom matcher class column.

For further information about how to load an external Java
library, see tLibraryLoad.

For further information about how to create a custom matching
algorithm, see Creating a custom matching algorithm.

For a related scenario about how to use a custom matching
algorithm, see Scenario 2: Using a custom matching algorithm to match entries.

 

Custom Matcher

When you select Custom as the
matching type, enter the path pointing to the custom class (external
matching algorithm) you need to use. This is the path you defined
within the library file (.jar
file).

For example, to use a MyDistance.class class
stored in the directory org/talend/mydistance
in a user-defined mydistance.jar library, the
path to be entered is
org.talend.mydistance.MyDistance.
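
As a sketch only, a custom matcher packaged as
org/talend/mydistance/MyDistance.class in mydistance.jar could look like
the class below. The method name and signature are illustrative
assumptions; the actual contract the class must fulfill is described in
Creating a custom matching algorithm.

    package org.talend.mydistance;

    // Hypothetical custom matcher: returns a similarity score between 0 and 1.
    public class MyDistance {
        public double getMatchingWeight(String record, String reference) {
            if (record == null || reference == null) return 0.0;
            // Example rule: score by the proportion of shared leading characters.
            int limit = Math.min(record.length(), reference.length());
            int same = 0;
            while (same < limit && record.charAt(same) == reference.charAt(same)) same++;
            int longest = Math.max(record.length(), reference.length());
            return longest == 0 ? 1.0 : (double) same / longest;
        }
    }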

 

 

Weight

Set a numerical weight for each attribute (column) of the key
definition. The values can be anything >= 0.

 

Handle Null

To handle null values, select from the list the null operator you
want to use on the column:

Null Match Null: a Null attribute
only matches another Null attribute.

Null Match None: a Null attribute
never matches another attribute.

Null Match All: a Null attribute
matches any other value of an attribute.

For example, suppose two columns, name and
firstname, where name is never null but
firstname can be null.

If we have two records:

“Doe”, “John”

“Doe”, “”

Depending on the operator you choose, these two records may or may
not match:

Null Match Null: they do not
match.

Null Match None: they do not
match.

Null Match All: they
match.

And for the records:

“Doe”, “”

“Doe”, “”

Null Match Null: they
match.

Null Match None: they do not
match.

Null Match All: they
match.
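
The three operators can be summarized in a short sketch (illustrative
Java, not the component's source):

    public class HandleNullDemo {
        enum NullOperator { NULL_MATCH_NULL, NULL_MATCH_NONE, NULL_MATCH_ALL }

        // Score used when at least one attribute is null; otherwise the
        // configured matching algorithm decides.
        static double nullScore(String a, String b, NullOperator op) {
            switch (op) {
                case NULL_MATCH_NULL: return (a == null && b == null) ? 1.0 : 0.0;
                case NULL_MATCH_NONE: return 0.0;
                default:              return 1.0; // NULL_MATCH_ALL
            }
        }

        public static void main(String[] args) {
            // "Doe","John" vs "Doe",null: only Null Match All lets the
            // firstname attributes count as matching.
            System.out.println(nullScore("John", null, NullOperator.NULL_MATCH_ALL));  // 1.0
            System.out.println(nullScore("John", null, NullOperator.NULL_MATCH_NULL)); // 0.0
        }
    }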

Match Threshold

Enter the match probability. Two data records match when the
probability is above the set value.

You can enter a different match threshold for each match
rule.
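
To make the interplay between the attribute weights and the match
threshold concrete, here is a sketch assuming the global score is the
weight-normalized sum of the per-attribute scores, which is the usual way
weighted matching scores are combined:

    public class WeightedScoreDemo {
        public static void main(String[] args) {
            // Per-attribute scores returned by the chosen algorithms.
            double[] scores  = {0.98, 0.80}; // e.g. lname, fname
            double[] weights = {2.0, 1.0};   // Confidence Weight per attribute

            double weighted = 0.0, total = 0.0;
            for (int i = 0; i < scores.length; i++) {
                weighted += scores[i] * weights[i];
                total += weights[i];
            }
            double global = weighted / total; // (1.96 + 0.80) / 3, roughly 0.92
            boolean match = global >= 0.85;   // Match Threshold = 0.85
            System.out.println(global + " -> " + match); // ~0.92 -> true
        }
    }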

Blocking Selection

Input Column

If required, select the column(s) from the input flow according to
which you want to partition the processed data into blocks; this is
usually referred to as “blocking”.

Blocking reduces the number of record pairs that need to be
examined. In blocking, input data is partitioned into exhaustive
blocks designed to increase the proportion of matches observed while
decreasing the number of pairs to compare. Comparisons are
restricted to record pairs within each block.

Using blocking column(s) is very useful when you are processing
very large data sets.
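
The sketch below shows why blocking cuts the comparison cost
(illustrative only, not the component's implementation): records are
bucketed by the blocking value, and pairwise comparison happens only
inside each bucket.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class BlockingDemo {
        public static void main(String[] args) {
            // record -> blocking value (for example a ZIP column or a tGenKey key)
            Map<String, String> records = new LinkedHashMap<>();
            records.put("John Doe", "92150");
            records.put("Jon Doe", "92150");
            records.put("Ann Smith", "75011");

            // Partition the records into blocks.
            Map<String, List<String>> blocks = new HashMap<>();
            for (Map.Entry<String, String> e : records.entrySet()) {
                if (!blocks.containsKey(e.getValue()))
                    blocks.put(e.getValue(), new ArrayList<String>());
                blocks.get(e.getValue()).add(e.getKey());
            }

            // Compare pairs only within each block: 1 pair here instead of 3.
            for (List<String> block : blocks.values())
                for (int i = 0; i < block.size(); i++)
                    for (int j = i + 1; j < block.size(); j++)
                        System.out.println("compare: " + block.get(i) + " / " + block.get(j));
        }
    }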

Advanced settings

Store on disk

Select the Store on disk check
box if you want to store processed data blocks on the disk to
maximize system performance.

Max buffer size: Type in the size
of physical memory you want to allocate to processed data.

Temporary data directory path:
Set the location where the temporary file should be stored.

 

Multiple output

Select the Separate output check
box to have three different output flows:

Uniques: when the group score
(minimal distance computed in the record) is equal to
1, the record is listed in this flow.

Matches: when the group score
(minimal distance computed in the record) is higher than the
threshold you define in the Confident match
threshold
field, the record is listed in this
flow.

Suspects: when the group score
(minimal distance computed in the record) is below the threshold you
define in the Confident match threshold
field, the record is listed in this flow.

Confident match threshold: set a
numerical value between the current Match
threshold
and 1. Above this
threshold, you can be confident in the quality of the group.

 

Multi-pass

Select this check box to enable a tMatchGroup component to receive data sets from
another tMatchGroup that precedes
it in the Job. This refines the groups received by each of the
tMatchGroup components by
creating data partitions based on different blocking keys.

For an example Job, see Scenario 2: Matching customer data through multiple passes

 

Sort the output data by GID

Select this check box to group the output data by the group
identifier.

 

Output distance details

Select this check box to add an output column
MATCHING_DISTANCES in the schema of the
component. This column provides the distance between the input and
master records in each group.

Note

When you use two tMatchGroup
components in a Job and you want to use the Output distance details option, you
must select this check box in both components before you link
them together. If the components are already linked, select the
check box in the second component of the Job flow first, and then
in the first component; otherwise two columns with the same name
end up in the output schema. Selecting this option in only one
tMatchGroup is not useful and may cause schema mismatch issues.

 

Display detailed labels

Select this check box to have in the output
MATCHING_DISTANCES column not only the
matching distance but also the names of the columns used as key
attributes in the applied rule.

For example, if you try to match on first name and last name
fields, lname and fname,
the output would be fname:1.0|lname:0.97 when the check
box is selected and 1.0|0.97 when it is not
selected.
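
If you need to consume this column downstream, for example in a tJavaRow,
a small parsing sketch for the detailed-label format shown above could
look like this:

    public class MatchingDistancesDemo {
        public static void main(String[] args) {
            // Example value of MATCHING_DISTANCES with detailed labels enabled.
            String details = "fname:1.0|lname:0.97";
            for (String part : details.split("\\|")) {
                String[] kv = part.split(":");
                System.out.println(kv[0] + " -> " + Double.parseDouble(kv[1]));
            }
        }
    }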

 

tStatCatcher Statistics

Select this check box to collect log data at the component level.
Note that this check box is not available in the Map/Reduce version
of the component.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl +
Space
to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio
User Guide.

Usage

This component is an intermediary step. It requires an input flow
as well as an output flow.

Usage in Map/Reduce Jobs

If you have subscribed to one of the Talend solutions with Big Data, you can also
use this component as a Map/Reduce component. In a Talend Map/Reduce Job, this
component is used as an intermediate step and other components used along with it must be
Map/Reduce components, too. They generate native Map/Reduce code that can be executed
directly in Hadoop.

You need to use the Hadoop Configuration tab in the
Run view to define the connection to a given Hadoop
distribution for the whole Job.

For further information about a Talend Map/Reduce Job, see the sections
describing how to create, convert and configure a Talend Map/Reduce Job of the
Talend Big Data Getting Started Guide.

For a scenario demonstrating a Map/Reduce Job using this
component, see Scenario 3: Matching data through multiple passes using Map/Reduce components.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents
only Standard Jobs, that is to say traditional, non-Map/Reduce Talend data
integration Jobs.

Limitation/prerequisite

n/a

Configuration wizard

The configuration wizard enables you to create different production environments,
Configurations, and their match rules. You can also
use the configuration wizard to import match rules created and tested in the studio and
stored in the repository, and use them in your match Jobs. For further information, see
Importing match rules from the studio repository.

You cannot open the configuration wizard unless you link the input component to the
tMatchGroup component.

To open the configuration wizard:

  1. In the studio workspace, design your Job and link the components together, for
    example as below:

    use_case_tmatchgroup.png
  2. Either:

    • Double-click tMatchGroup, or

    • Right-click it and from the contextual menu select Configuration Wizard, or

    • Click Preview in the basic settings view of tMatchGroup.

match_studio.png

The configuration wizard is composed of three areas:

  • the Configuration view, where you can set
    the match rules and the blocking column(s),

  • the matching chart, which presents the graphic matching result,

  • the matching table, which presents the details of the matching
    result.

Configuration view

From this view, you can edit the configuration of the tMatchGroup component or define different configurations in which to
execute the Job. You can use these different configurations for testing purposes, for
example, but you can save only one configuration from the wizard: the open
configuration.

In each configuration, you can define blocking key(s) and multiple conditions
using several match rules. You can also set different match intervals for each rule.
The match results on multiple conditions will list data records that meet any of the
defined rules. When a configuration has multiple conditions, the Job conducts an OR
match operation. It evaluates data records against the first rule and the records
that match are not evaluated against the other rules.

The parameters required to edit or create a match rule are:

  • The Limit field indicates the maximum
    number of rows to be processed by the match rule(s) in the wizard. The
    default maximum is 1000 rows.

  • The Key definition parameters.

  • The Match Threshold field.

To create a new configuration and new match rules from the configuration wizard,
do the following:

  1. Click the [+] button on the top right
    corner of the Configuration view.

    This creates, in a new tab, an exact copy of the last
    configuration.

    match_studio_new_config.png
  2. Edit or set the parameters for the new configuration in the Key definition and Blocking
    Selection
    tables.

  3. If needed, define several match rules for the open configuration as follows:

    • Click the [+] button on the
      match rule bar.

      This creates, in a new tab, an exact copy of the last
      rule.

    • Set the parameters for the new rule in the Key definition table and define its
      match interval.

    • Follow the steps above to create as many match rules for a
      configuration as needed. You can define a different match
      interval for each rule.

      When a configuration has multiple conditions, the Job conducts
      an OR match operation. It evaluates data records against the
      first rule and the records that match are not evaluated against
      the second rule and so on.

  4. Click the Chart button at the top right
    corner of the wizard to execute the Job in the open configuration.

    The matching results are displayed in the matching chart and table.

    Follow the steps above to create as many new configurations in the wizard
    as needed.

  5. To execute the Job in a specific configuration, open the configuration in
    the wizard and click the Chart button.

    The matching results are displayed in the matching chart and table.

  6. At the bottom right corner of the wizard, click either:

    • OK to save the open
      configuration.

      You can save only one configuration in the wizard.

    • Cancel to close the wizard and
      keep the configuration saved initially in the wizard.

Matching chart

From the matching chart, you can have a global picture about the duplicates in the
analyzed data.

The Hide groups of less than parameter, which is set
to 2 by default, enables you to decide what groups to show in
the result chart. Usually you want to hide groups of small size.

tmatchgroup-wizard_chart.png

For example, the above matching chart indicates that:

  • 48 items are analyzed and classified into 18 groups according to a
    given match rule, after excluding unique items by setting
    the Hide groups of less than parameter to
    2.

  • 11 groups have 2 items each. In each group, the 2 items are duplicates
    of each other.

  • 3 groups have 3 items each. In each group, these items are duplicates
    of one another.

  • 3 groups have 4 items each. In each group, these items are duplicates
    of one another.

  • One single group has 5 duplicate items.

Matching table

From the matching table, you can read details about the different
duplicates.

tmatchgroup-wizard_table.png

This table indicates the matching details of items in each group and colors the
groups in accordance with their color in the matching chart.

You can decide what groups to show in this table by setting the Hide groups of less than parameter. This parameter
enables you to hide groups of small group size. It is set to 2
by default.

The buttons under the table help you navigate back and forth through the
pages.

Importing match rules from the studio repository

From the tMatchGroup configuration wizard, you
can import match keys from the match rules created and tested in the Profiling perspective of Talend Studio and stored in
the repository. You can then use these imported matching keys in your match
Jobs.

The tMatchGroup component is based on the VSR algorithm.
You cannot import match rules configured with the T-Swoosh algorithm; a warning
message is displayed in the wizard when you try to import such rules.

The VSR algorithm takes a set of records as input and groups similar encountered
duplicates together according to defined match rules. It compares pairs of records
and assigns them to groups. The first processed record of each group is the master
record of the group. The VSR algorithm compares each record with the master of each
group and uses the computed distances from the master records to decide to which
group the record should go.
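
A sketch of that grouping loop (illustrative only; the actual VSR
implementation also applies attribute weights, null operators and
blocking):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class VsrSketch {
        // Stand-in for the configured match rule; returns a score in [0,1].
        static double similarity(String a, String b) {
            return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
        }

        public static void main(String[] args) {
            double threshold = 0.85;
            // master record -> its group
            Map<String, List<String>> groups = new LinkedHashMap<>();

            for (String record : Arrays.asList("Doe", "doe", "Smith")) {
                String bestMaster = null;
                double bestScore = 0.0;
                for (String master : groups.keySet()) {   // compare to each master
                    double s = similarity(record, master);
                    if (s > bestScore) { bestScore = s; bestMaster = master; }
                }
                if (bestMaster != null && bestScore >= threshold) {
                    groups.get(bestMaster).add(record);   // join the closest group
                } else {
                    List<String> group = new ArrayList<String>();
                    group.add(record);                    // first record becomes master
                    groups.put(record, group);
                }
            }
            System.out.println(groups); // {Doe=[Doe, doe], Smith=[Smith]}
        }
    }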

To import match rules from the studio repository:

  1. From the configuration wizard, click the match_rule_import_icon.png icon on the top right corner.

    The [Match Rule Selector] wizard opens
    listing all match rules created in the studio and saved in the
    repository.

    tmatchgroup-wizard_import_rule.png
  2. Select the match rule you want to import into the tMatchGroup component and use on your data.

    Note

    – A warning message is displayed in the wizard if the match rule you want
    to import is defined on columns that do not exist in the input schema of
    tMatchGroup. You can define input
    columns later in the configuration wizard.

    – A warning message is displayed in the wizard if you try to import match
    rules configured with the T-Swoosh algorithm. For further information
    about the T-Swoosh algorithm, see Talend Studio User
    Guide.

  3. Select the Overwrite current Match Rule in the
    analysis
    check box if you want to replace the rule in the
    configuration wizard with the rule you import.

    If you leave the box unselected, the match keys will be imported in a new
    match rule tab without overwriting the current match rule in the
    wizard.

  4. Click OK.

    The matching key is imported from the match rule and listed as a new rule
    in the configuration wizard.

    tmatchgroup-wizard_import_rule2.png
  5. Click in the Input Key Attribute and
    select from the input data the column on which you want to apply the
    matching key.

  6. In the Match threshold field, enter the
    match probability threshold. Two data records match when the computed match
    score is above this value.

  7. In the Blocking Selection table, select
    the column(s) from the input flow which you want to use as a blocking
    key.

    Defining a blocking key is not mandatory but advisable. Using a blocking
    key partitions data in blocks and so reduces the number of records that need
    to be examined, as comparisons are restricted to record pairs within each
    block. Using blocking key(s) is very useful when you are processing large
    data sets.

    Note that the Blocking Selection table in
    the component is different from the Generation of
    Blocking Key
    table in the match rule editor in the Profiling perspective.

    The blocking column in tMatchGroup can
    come from a tGenKey component (in which case it is
    called T_GEN_KEY) or directly from the input schema (it
    could be a ZIP column for instance). The
    Generation of Blocking Key table in the
    match rule editor, on the other hand, defines the parameters necessary to
    generate a blocking key; that table is equivalent to the tGenKey component and generates a blocking column
    BLOCK_KEY used for blocking.

  8. Click the Chart button in the top right
    corner of the wizard to execute the Job using the imported match rule and
    show the matching results in the wizard.

Configuration wizard in Map/Reduce Jobs

Warning

The information in this section is only for users that have subscribed to one of
the Talend solutions with Big Data and is not applicable to
Talend Open Studio for Big Data users.

In a Talend Map/Reduce Job, tMatchGroup, as well as the whole Map/Reduce Job using it, generates
native Map/Reduce code. This section presents the specific settings in the configuration
wizard of tMatchGroup when it is used in that
situation. For further information about a Talend Map/Reduce Job, see the Talend Big Data Getting Started Guide.

You cannot open the configuration wizard unless you link an input component to the
tMatchGroup component.

From the configuration wizard in tMatchGroup, you can:

  • define multiple conditions using several match rules to group data,

  • set different match intervals for each rule,

  • import match rules created and tested in the studio and stored in the repository, and use
    them in your match Jobs. You can only import rules configured with the VSR
    algorithm. For further information, see Importing match rules from the studio repository.

  • select a blocking key to partition data.

tmatchgroup-mr_wizard.png

The match results on multiple conditions will list data records that meet any of the
defined rules.

To create match rules from the configuration wizard, do the following:

  1. Click the [+] button on the match rule
    bar.

  2. Set the parameters for the new rule in the Key
    definition
    table and define its match interval.

  3. Repeat the above steps to create as many match rules as needed. You can define
    a different match interval for each rule.

    When you define multiple rules, the Job conducts an OR match operation. It
    evaluates data records against the first rule and the records that match are not
    evaluated against the second rule.

  4. In the Blocking Selection table, select the
    column(s) from the input flow which you want to use as a blocking key.

    Defining a blocking key is not mandatory but is very useful when you are
    processing big data sets. A blocking key partitions data in blocks and so
    reduces the number of records that need to be examined. This key can come from a
    tGenKey component (and would be called
    T_GEN_KEY) or directly from the input schema.

  5. At the bottom right corner of the wizard, click either:

    • OK to save the current
      configuration.

    • Cancel to close the wizard and keep
      the configuration saved initially in the wizard.

Scenario 1: Grouping output data in separate flows according to the minimal distance
computed in each record

This scenario describes a basic Job that compares columns in the input file using the
Jaro-Winkler matching method on the
lname and fname columns and the q-grams matching method on the address1
column. It then groups the output records in three output flows:

  • Uniques: lists the records whose group
    score (minimal distance computed in the record) is equal to
    1.

  • Matches: lists the records whose group
    score (minimal distance computed in the record) is higher than the threshold
    you define in the Confident match threshold
    field.

  • Suspects: lists the records whose group
    score (minimal distance computed in the record) is below the threshold you
    define in the Confident match threshold
    field.

use_case_tmatchgroup.png

For another scenario that groups the output records in one single output flow, see
Scenario 2: Comparing columns and grouping in the output flow duplicate
records that have the same functional key
.

Setting up the Job

  1. Drop the following components from the Palette onto the design workspace: tFileInputExcel, tMatchGroup and three tLogRow components.

  2. Connect tFileInputExcel to tMatchGroup using the Main
    link.

  3. Connect tMatchGroup to the three
    tLogRow components using the Unique rows, Confident
    groups
    and Uncertain groups
    links.

    Warning

    To be able to set three different output flows for the processed
    records, you must first select the Separate
    output
    check box in the Advanced
    settings
    view of the tMatchGroup component. For further information, see the
    section about configuring the tMatchGroup component.

Configuring the input component

The main input file contains eight columns: account_num,
lname, fname,
mi, address1,
city, state_province and
postal_code. The data in this input file has problems
such as duplication, names spelled differently or incorrectly, and
different information for the same customer.

use_case_tmatchgroup_input.png

You can create the input file used in this scenario by executing the
c0 and c1 Jobs included in the
data quality demo project, TDQEEDEMOJAVA, which you can import from the login window
of your Talend Studio. For
further information, see the Talend Studio User
Guide.

  1. In the Basic settings view of tFileInputExcel, fill in the File Name field by browsing to the input file
    and set other properties in case they are not stored in the Repository.

    use_case_tmatchgroup2.png
  2. Create the schema through the Edit Schema
    button, if the schema is not already stored in the Repository. Remember to set the data type in the Type column.

    use_case_tmatchgroup1.png

Configuring the tMatchGroup component

  1. Double-click tMatchGroup to display the
    Basic settings view and define the
    component properties.

    use_case_tmatchgroup3.png
  2. Click Sync columns to retrieve the schema
    from the preceding component.

  3. Click the Edit schema button to view the
    input and output schema and do any modifications in the output schema, if
    necessary.

    use_case_tmatchgroup4.png

    In the output schema of this component there are a few read-only standard
    output columns. For more information, see tMatchGroup properties.

  4. Click OK to close the dialog box.

  5. Click Preview to open the configuration
    wizard and define the component configuration and the match rule(s).

    match_studio.png

    You can use the configuration wizard to import match rules created and
    tested in the studio and stored in the repository, and use them in your
    match Jobs. For further information, see Importing match rules from the studio repository.

  6. Define the first match rule as follows:

    • In the Key definition table,
      click the [+] button to add to the
      list the column(s) on which you want to do the matching operation,
      lname and
      fname.

      Note

      When you select a date column on which to apply a matching algorithm,
      you can decide what part of the date format to compare.

      For example, if you want to compare only the year in the date, set the
      type of the date column to Date in the component schema and then enter
      “yyyy” in the Date Pattern field. The component then converts the date
      to a string according to the pattern defined in the schema before
      starting a string comparison.

    • Click in the cell of the Matching
      type
      column and select from the list Jaro-Winkler as the method to be used for
      the matching operation.

      If you select custom as a
      matching type, you must set in the Custom
      Matcher Class
      column the path pointing to the custom
      class (external matching algorithm) you need to use. This is the
      path you defined within the library file (.jar
      file).

    • Click in the cell of the Confidence
      Weight
      column to set the numerical weights for the
      two columns used as key attributes.

    • Click in the cell of the Handle
      Null
      column and select the null operator you want to
      use to handle null attributes in the columns. In this example,
      select Null Match None in order to
      have matching results where null values have minimal effect.

    • Set the match probability in the Match
      Interval
      field.

  7. Follow the same procedure in the above step to define the second match
    rule.

    Set the address1 column as an input attribute and
    select Jaro as the matching type. Select
    Null Match None as the null operator.
    Finally, set the match probability, which can differ from the one
    set for the first rule.

  8. Set the Hide groups of less than
    parameter in order to decide what groups to show in the result chart and
    matching table. This parameter enables you to hide groups of small group
    size.

  9. Click the Advanced settings tab and set
    the advanced parameters for the tMatchGroup
    component as follows:

    use_case_tmatchgroup5.png
    • Select the Separate output check
      box.

      The component will have three separate output flows: Unique rows, Confident groups and Uncertain
      groups
      .

      If this check box is not selected, the tMatchGroup component will have only one output flow
      where it groups all output data. For an example scenario, see Scenario 2: Comparing columns and grouping in the output flow duplicate
      records that have the same functional key
      .

    • Select the Sort the output data by
      GID
      check box to sort the output data by their group
      identifier.

    • Select the Output distance
      details
      and Display detailed
      labels
      check boxes.

      The component will output the
      MATCHING_DISTANCES column. This column
      provides the distance between the input and the master columns
      giving also the names of the columns against which the records are
      matched.

  10. Click the Chart button in the wizard to
    execute the Job in the defined configuration and have the matching results
    directly in the wizard.

    use_case_tmatchgroup9.png

    The matching chart gives a global picture about the duplicates in the
    analyzed data. The matching table indicates the details of items in each
    group and colors the groups in accordance with their color in the matching
    chart.

    The Job conducts an OR match operation on the records. It evaluates the
    records against the first rule and the records that match are not evaluated
    against the second rule. The MATCHING_DISTANCES column allows you to see which
    rule was applied to which records. In the yellow data group for example,
    the Amole Sarah record is matched according to the
    second rule that uses address1 as a key attribute,
    whereas the other records in the group are matched according to the first
    rule, which uses lname and
    fname as key attributes.

    You can set the Hide groups of less than
    parameter in order to decide what groups to show in the matching chart and
    table.

Finalizing the Job and executing it

  1. Double-click each of the tLogRow
    components to display the Basic
    settings
    view and define the component properties.

  2. Save your Job and press F6 to execute
    it.

    You can see that records are grouped together in three different groups.
    Each record is listed in one of the three groups according to the value of
    the group score, which is the minimal distance computed in the group.

    The identifier for each group, which is of String
    data type, is listed in the GID column next to the
    corresponding record. This identifier will be of the data type
    Long for Jobs that are migrated from older
    releases. To have the group identifier as String, you
    must replace the tMatchGroup component in
    the imported Job with tMatchGroup from the
    studio Palette.

    The number of records in each of the three output blocks is listed in the
    GRP_SIZE column and computed only on the master
    record. The MASTER column indicates with true or false
    whether the corresponding record is a master record. The
    SCORE column lists the calculated distance between
    the input record and the master record according to the Jaro-Winkler and Jaro matching algorithms.

    The Job evaluates the records against the first rule and the records that
    match are not evaluated against the second rule.

    All records whose group score is between the match interval,
    0.95 or 0.85 depending on the
    applied rule, and the confidence threshold defined in the advanced settings
    of tMatchGroup are listed in the Suspects output flow.

    use_case_tmatchgroup6.png

    All records whose group score is above one of the match probabilities are
    listed in the Matches output flow.

    use_case_tmatchgroup7.png

    All records that have a group size equal to 1 are listed in the Uniques output flow.

    use_case_tmatchgroup8.png

For another scenario that groups the output records in one single output flow
based on a generated functional key, see Scenario 2: Comparing columns and grouping in the output flow duplicate
records that have the same functional key
.

Scenario 2: Matching customer data through multiple passes

The Job in this scenario groups similar customer records by running through two
subsequent matching passes (tMatchGroup components) and
outputs the calculated matches in groups. Each pass provides its matches to the pass
that follows, so that the latter can add more matches identified with new rules and
blocking keys.

use_case-multi_pass_tmatchgroup.png

In this Job:

  • The tMysqlInput component connects to the
    customer records to be processed.

  • Each of the tGenKey components defines a way
    to partition data records. The first key partitions data into many groups and the
    second key creates fewer groups that overlap the previous blocks, depending on
    the blocking key definition.

  • The tMap component renames the key generated
    by the second tGenKey component.

  • The first tMatchGroup processes the
    partitions defined by the first tGenKey, and
    the second tMatchGroup processes those defined
    by the second tGenKey.

    Warning

    The two tMatchGroup components must
    have the same schema.

  • The tLogRow component presents the matching
    results after the two passes.

Setting up the Job

In this scenario, the main input schema is already stored in the Repository. For more information about storing schema
metadata in the repository, see the Talend Studio User
Guide
.

  1. In the Repository tree view, expand
    Metadata and then Db
    Connections
    , where you have stored the main input schema, and
    drop the database table onto the design workspace. The input table used in
    this scenario is called customer.

    A dialog box is displayed with a list of components.

  2. Select the relevant database component, tMysqlInput in this example, and then click OK.

  3. Drop two tGenKey components, two
    tMatchGroup components, a tMap component and a tLogRow component from the Palette onto the design workspace.

  4. Link the input component to the tGenKey
    and tMap components using Main links.

  5. In the two tMatchGroup components, select
    the Output distance details check boxes in
    the Advanced settings view of both
    components before linking them together.

    This will provide the MATCHING_DISTANCES column in
    the output schema of each tMatchGroup.

    Note

    If the two tMatchGroup components are
    already linked to each other, you must select the Output distance details check box in the second
    component in the Job flow first; otherwise you may have an issue.

  6. Link the two tMatchGroup components and
    the tLogRow component using Main links.

  7. If needed, give the components specific labels to reflect their usage in
    the Job.

    For further information about how to label a component, see Talend Studio
    User Guide.

Configuring input data and key generation

Connecting to the input data

  1. Double-click tMysqlInput to open its
    Component view.

    use_case-multi_pass_tmatchgroup2.png

    The property fields for tMysqlInput are
    automatically filled in. If your input schema is not stored in
    the repository, fill in the details manually after selecting Built-in in the Schema and Property Type
    lists.

    The input table used in this scenario is called
    customer.

  2. Modify the query in the Query box to
    select only the columns you want to match:
    account_num, lname,
    fname, mi,
    address1, city,
    state_province and
    postal_code.

Configuring the key generation for the first pass

  1. Double-click the first tGenKey to open
    the Component view.

    use_case-multi_pass_tmatchgroup3.png
  2. Click match_rule_import_icon.png and import blocking keys from match rules created and
    tested in the Profiling perspective
    of Talend Studio and use them in your Job. Otherwise,
    define the blocking key parameters as described in the below steps.

  3. Under the Algorithm table, click the
    [+] button to add two rows in the
    table.

  4. On the column column, click the newly
    added row and select from the list the column you want to process using an
    algorithm. In this example, select lname.

  5. Do the same on the second row to select
    postal_code.

  6. On the pre-algorithm column, click the
    newly added row and select from the list the pre-algorithm you want to apply
    to the corresponding column.

    In this example, select remove diacritical marks and
    convert to upper case
    to remove any diacritical marks and
    convert the fields of the lname column
    to upper case.

    Note

    This conversion does not change your raw data.

  7. On the algorithm column, click the newly
    added row and select from the list the algorithm you want to apply to the
    corresponding column. In this example, select N first
    characters of each word
    .

  8. Do the same for the second row on the algorithm column to select first N
    characters of the string
    .

  9. Click in the Value column next to the
    algorithm column and enter the value
    for the selected algorithm, when needed.

    In this scenario, enter 1 for both rows. The first letter of each
    field in the corresponding columns will be used to generate the key.

Configuring the key generation for the second pass

  1. Double-click the second tGenKey to open
    the Component view.

    use_case-multi_pass_tmatchgroup4.png
  2. In the Algorithm table, define the column
    you want to use to partition data, account_num in this
    component. Select the first N characters of the
    string
    algorithm and set the value to 1
    in the Value column.

    Each of the two tGenKey components will
    generate a read-only T_GEN_KEY column in
    the output schema. You must rename one of the T_GEN_KEY columns to stop them from overwriting each
    other.

  3. Double-click the tMap component to open
    its editor.

    use_case-multi_pass_tmatchgroup8.png
  4. In the Schema editor, copy the columns
    from the first table onto the second table and rename T_GEN_KEY to T_GEN_KEY1, for example.

  5. In the top part of the editor, drop all columns from the input table to
    the output table.

  6. Click OK to save the data transformation and
    close the editor.

  7. In the tGenKey basic settings, click the dotbutton.png button to verify that the two generated keys are named
    differently in the output schema.

    use_case-multi_pass_tmatchgroup5.png

Configuring the two matching passes

Configuring the first pass

  1. Double-click the first tMatchGroup
    labelled pass1 to display the Configuration Wizard.

    use_case-multi_pass_tmatchgroup6.png
  2. Click match_rule_import_icon.png and import matching keys from the match rules created
    and tested in the Profiling perspective
    of Talend Studio and use them in your Job. Otherwise,
    define the matching key parameters as described in the below steps.

  3. In the Key definition table, click the
    [+] button to add the column(s) on
    which you want to do the matching operation, lname in this scenario.

    Note

    When you select a date column on which to apply a matching algorithm,
    you can decide what part of the date format to compare.

    For example, if you want to compare only the year in the date, set the
    type of the date column to Date in the component schema and then enter
    “yyyy” in the Date Pattern field. The component then converts the date
    to a string according to the pattern defined in the schema before
    starting a string comparison.

  4. Select the Jaro-Winkler algorithm in the Matching Function column.

  5. Set Weight to 1 and
    in the Handle Null column, select the null
    operator you want to use to handle null attributes in the column, Null Match Null in this scenario.

  6. Click the [+] button below the Blocking Selection table to add one row in the
    table then click in the line and select from the list the column you want to
    use as a blocking value, T_GEN_KEY in
    this example.

    Using a blocking value reduces the number of record pairs that need
    to be examined. The input data is partitioned into exhaustive blocks based
    on the functional key. This decreases the number of pairs to compare, as
    comparison is restricted to record pairs within each block.

  7. If required, click Edit schema to open
    the schema editor and see the schema retrieved from the previous component
    in the Job.

    use_case-multi_pass_tmatchgroup9.png
  8. Click the Advanced settings tab and
    select the Sort the output data by GID
    check box to arrange the output data by their group IDs.

Configuring the second pass

  1. Double-click the second tMatchGroup
    component labelled pass2 to display the
    Configuration Wizard.

    If this component does not have the same schema as the preceding
    component, a warning icon appears. If so, click the Sync columns button to retrieve the schema from the
    preceding one; once done, the warning icon disappears.

  2. In the Key Definition table, click
    the [+] button to add the column(s) on
    which you want to do the matching operation, lname in this scenario.

    Note

    When you select a date column on which to apply a matching algorithm,
    you can decide what part of the date format to compare.

    For example, if you want to compare only the year in the date, set the
    type of the date column to Date in the component schema and then enter
    “yyyy” in the Date Pattern field. The component then converts the date
    to a string according to the pattern defined in the schema before
    starting a string comparison.

  3. Select the Jaro-Winkler algorithm in the Matching Function column.

  4. Set Weight to 1 and
    in the Handle Null column, select the null
    operator you want to use to handle null attributes in the column, Null Match Null in this scenario.

  5. Click the [+] button below the Blocking Selection table to add one row in the table then
    click in the line and select from the list the column you want to use as a
    blocking value, T_GEN_KEY1 in this
    example.

  6. Click the Advanced settings tab and
    select the Multi-pass check box. This
    option enables tMatchGroup to receive data
    sets from the tMatchGroup that precedes it
    in the Job.

  7. In the Advanced settings view, select the
    Sort the output data by GID check box
    to arrange the output data by their group IDs.

Executing the Job and showing the results on the console

In order to show the match groups created after the first pass and compare
them with the groups created after the second pass, you must modify the Job as
follows:

use_case-multi_pass_tmatchgroup7.png
  • Use a tReplicate component to
    replicate the input flow you want to process as shown in the above
    figure. Use a copy/paste operation to create the two parts of the
    Job.

  • Keep only the first pass (a tMatchGroup component) in the upper part of the Job
    and show the match results in a tLogRow component.

  • Use two passes in the lower part of the Job and show the final
    match results in a tLogRow
    component.

  1. Double-click each of the tLogRow
    components to open the Component view and
    in the Mode area, select the Table (print values in cells of a table)
    option.

  2. Save your Job and press F6 to execute
    it.

    The results after the first pass read as follows:

    use_case-multi_pass_tmatchgroup10.png

    The results after the second pass read as follows:

    use_case-multi_pass_tmatchgroup11.png

    When you compare, for example, the customer name Wilson from the results of the two passes, you will find
    that more customers using the last name Wilson are grouped together after the second pass.

Scenario 3: Matching data through multiple passes using Map/Reduce components

Note that Talend Map/Reduce components are available only to users
who have subscribed to one of the Talend solutions with Big Data.

This scenario shows how to create a Talend Map/Reduce Job to match data by
using Map/Reduce components. It generates Map/Reduce code and runs right in
Hadoop.

use_case-mr_tmatchgroup.png

The Job in this scenario groups similar customer records by running through two
subsequent matching passes (tMatchGroup components) and
outputs the calculated matches in groups. Each pass provides its matches to the pass
that follows, so that the latter can add more matches identified with new rules and
blocking keys.

This Job is a duplication of the Standard data
integration Job described in Scenario 2: Matching customer data through multiple passes where standard components are replaced with Map/Reduce components.

You can use Talend Studio to automatically
convert the standard Job in the previous section to a Map/Reduce Job. This way, you do
not need to redefine the settings of the components in the Job.

Before starting to replicate this scenario, ensure that you have appropriate rights
and permissions to access the Hadoop distribution to be used.

Converting the Job

  1. In the Repository tree view of the Integration perspective of Talend Studio, right-click the
    Job you have created in the earlier scenario to open its contextual menu and
    select Edit properties.

    Then the [Edit properties] dialog box is
    displayed. Note that the Job must be closed before you are able to make any
    changes in this dialog box.

    This dialog box looks like the image below:

    use_case-mr_convert_job-common.png

    Note that you can change the Job name as well as the other descriptive
    information about the Job from this dialog box.

  2. Click Convert to Map/Reduce Job. Then a
    Map/Reduce Job using the same name appears under the Map/Reduce Jobs sub-node of the Job
    Design
    node.

If you need to create this Map/Reduce Job from scratch, you have to right-click the
Job Design node or the Map/Reduce Jobs sub-node and select Create
Map/Reduce Job
from the contextual menu. Then an empty Job is opened in
the workspace. For further information, see the section describing how to create a
Map/Reduce Job of the Talend Big Data Getting Started Guide.

Setting up the Job

  1. Double-click the new Map/Reduce Job to open it in the workspace.

    The Map/Reduce component Palette is
    opened. A crossed-out component, if any, indicates that it does not have the
    Map/Reduce version.

    use_case-mr_tmatchgroup2.png
  2. Delete tMysqlInput in this scenario and
    drop tRowGenerator from the Palette to the workspace. Link it to tGenKey with a Row >
    Main
    link.

  3. Double-click tRowGenerator to open its
    editor.

    use_case-mr_tmatchgroup3.png
  4. Define the schema you want to use to write data in Hadoop.

  5. Click OK to validate your schema and
    close the editor.

  6. Leave the settings of the other components as you defined initially in the
    standard version of the Job.

Setting up Hadoop connection

  1. Click Run to open its view and then click the
    Hadoop Configuration tab to display its
    view for configuring the Hadoop connection for this Job.

    This view looks like the image below:

    use_case-hadoop_config-common.png
  2. From the Property type list, select Built-in. If you have created the connection to be
    used in Repository, then select Repository so that the Studio reuses that set of
    connection information for this Job.

    For further information about how to create a Hadoop connection in
    Repository, see the chapter describing the Hadoop
    cluster
    node of the Talend Big Data Getting Started Guide.

  3. In the Version area, select the Hadoop
    distribution to be used and its version. If you cannot find from the list the
    distribution corresponding to yours, select Custom so as to connect to a Hadoop distribution not officially
    supported in the Studio.

    For a step-by-step example about how to use this Custom option, see Connecting to a custom Hadoop distribution.

    Along with the evolution of Hadoop, please note the
    following changes:

    • If you use Hortonworks Data Platform
      V2.2
      , the configuration files of your cluster might be using
      environment variables such as ${hdp.version}. If this is your situation, you need to set
      the mapreduce.application.framework.path property in the
      Hadoop properties table with the path
      value explicitly pointing to the MapReduce framework archive of your
      cluster. For
      example:
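
      mapreduce.application.framework.path=/hdp/apps/2.2.0.0-2041/mapreduce/mapreduce.tar.gz#mr-framework

      (The version segment 2.2.0.0-2041 is an illustrative value; point the
      property to the MapReduce framework archive actually deployed on your
      cluster.)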

    • If you use Hortonworks Data Platform
      V2.0.0
      , the type of the operating system for running the
      distribution and a Talend Job must be the same,
      such as Windows or Linux. Otherwise, you have to use Talend Jobserver to execute the Job in the same
      type of operating system in which the Hortonworks
      Data Platform V2.0.0
      distribution you are using is run. For
      further information about Talend Jobserver, see
      Talend
      Installation and Upgrade Guide
      .

  4. In the Name node field, enter the location of
    the master node, the NameNode, of the distribution to be used. For example,
    hdfs://tal-qa113.talend.lan:8020.

    If you are using a MapR distribution, you can simply leave maprfs:/// as it is in this field; then the MapR
    client will take care of the rest on the fly for creating the connection. The
    MapR client must be properly installed. For further information about how to set
    up a MapR client, see the following link in MapR’s documentation: http://doc.mapr.com/display/MapR/Setting+Up+the+Client

  5. In the Job tracker field, enter the location
    of the JobTracker of your distribution. For example, tal-qa114.talend.lan:8050.

    Note that the word Job in the term JobTracker refers to the MapReduce
    jobs described in Apache’s documentation on http://hadoop.apache.org/.

    If you use YARN in your Hadoop cluster such as Hortonworks Data Platform V2.0.0 or Cloudera CDH4.3 + (YARN mode), you need to specify the location
    of the Resource Manager instead of the
    Jobtracker. Then you can continue to set the following parameters depending on
    the configuration of the Hadoop cluster to be used (if you leave the check box
    of a parameter clear, then at runtime, the configuration for this parameter in
    the Hadoop cluster to be used will be ignored):

    • Select the Set resourcemanager scheduler
      address
      check box and enter the Scheduler address in
      the field that appears.

    • Select the Set jobhistory address
      check box and enter the location of the JobHistory server of the
      Hadoop cluster to be used. This allows the metrics information of
      the current Job to be stored in that JobHistory server.

    • Select the Set staging directory
      check box and enter this directory defined in your Hadoop cluster
      for temporary files created by running programs. Typically, this
      directory can be found under the yarn.app.mapreduce.am.staging-dir property in the
      configuration files such as yarn-site.xml or mapred-site.xml of your distribution.

    • Select the Use datanode hostname
      check box to allow the Job to access datanodes via their hostnames.
      This actually sets the dfs.client.use.datanode.hostname property to
      true. When connecting to a
      S3N filesystem, you must select this check box.

  6. If you are accessing the Hadoop cluster running with Kerberos security, select the
    Use Kerberos authentication check box, then enter the Kerberos principal name for the
    NameNode in the field displayed. This
    enables you to use your user name to authenticate against the credentials stored in
    Kerberos.

    In addition, since this component performs Map/Reduce computations, you also need to
    authenticate the related services such as the Job history server and the Resource manager or
    Jobtracker depending on your distribution in the corresponding field. These principals can
    be found in the configuration files of your distribution. For example, in a CDH4
    distribution, the Resource manager principal is set in the yarn-site.xml file and the Job history principal in the mapred-site.xml file.

    If you need to use a Kerberos keytab file to log in, select Use a
    keytab to authenticate
    . A keytab file contains pairs of Kerberos principals
    and encrypted keys. You need to enter the principal to be used in the Principal field and the access path to the keytab file itself in the
    Keytab field.

    Note that the user that executes a keytab-enabled Job is not necessarily the one a
    principal designates but must have the right to read the keytab file being used. For
    example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this situation, ensure that user1 has the right to read the keytab file to be used.

  7. In the User name field, enter the login user
    name for your distribution. If you leave it empty, the user name of the machine
    hosting the Studio will be used.

  8. In the Temp folder field, enter the path in
    HDFS to the folder where you store the temporary files generated during
    Map/Reduce computations.

  9. Leave the default value of the Path separator in server as
    it is, unless you have changed the separator used by your Hadoop distribution’s host machine
    for its PATH variable or in other words, that separator is not a colon (:). In that
    situation, you must change this value to the one you are using in that host.

  10. Leave the Clear temporary folder check box
    selected, unless you want to keep those temporary files.

  11. Leave the Compress intermediate map output to reduce
    network traffic
    check box selected to shorten the time needed to
    transfer the mapper task partitions to the multiple reducers.

    However, if the data transfer in the Job is negligible, it is recommended to
    clear this check box to deactivate the compression step, because this
    compression consumes extra CPU resources.

  12. If you need to use custom Hadoop properties, complete the Hadoop properties table with the property or
    properties to be customized. Then at runtime, these changes will override the
    corresponding default properties used by the Studio for its Hadoop
    engine.

    For further information about the properties required by Hadoop, see Apache’s
    Hadoop documentation on http://hadoop.apache.org, or
    the documentation of the Hadoop distribution you need to use.

  13. If the Hadoop distribution to be used is Hortonworks Data Platform V1.2 or Hortonworks
    Data Platform V1.3, you need to set proper memory allocations for the map and reduce
    computations to be performed by the Hadoop system.

    In that situation, you need to enter the values you need in the Mapred
    job map memory mb
    and the Mapred job reduce memory
    mb
    fields, respectively. By default, the values are both 1000 which are normally appropriate for running the
    computations.

    If the distribution is YARN, then the memory parameters to be set become Map (in Mb), Reduce (in Mb) and
    ApplicationMaster (in Mb), accordingly. These fields
    allow you to dynamically allocate memory to the map and the reduce computations and the
    ApplicationMaster of YARN.

For further information about this Hadoop
Configuration
tab, see the section describing how to configure the Hadoop
connection for a Talend Map/Reduce Job of the Talend Big Data Getting Started Guide.

For further information about the Resource Manager, its scheduler and the
ApplicationMaster, see YARN’s documentation such as http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/.

For further information about how to determine YARN and MapReduce memory configuration
settings, see the documentation of the distribution you are using, such as the following
link provided by Hortonworks: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html.

Executing the Job

  • Save the Job and press F6 to execute
    it.

    Match results are displayed on the studio console.

    use_case-mr_tmatchgroup4.png

    Matches are calculated by running through the two passes. Each pass
    provides its matches to the pass that follows and more matches are
    identified with the rule and blocking key of the second pass.

    More customers with similar last names are grouped together after the
    second pass.

