tMatchGroup – Docs for ESB 5.x

tMatchGroup

tGroup_icon32_white.png

Warning

This component will be available in the Palette of
Talend Studio on the condition that you have subscribed to one of
the Talend Platform products.

tMatchGroup properties

Component family

Data Quality

 

Function

tMatchGroup compares columns in
both standard input data flows and in Map/Reduce input data flows by
using matching methods, and groups the similar records it encounters
together as duplicates.

Several tMatchGroup components
can be used sequentially to match data against different blocking
keys. Each pass refines the groups received from the previous
tMatchGroup component by creating data partitions that overlap the
previous data blocks.

In defining a group, the first processed record of each group is
the master record of the group. The distance between each remaining
record and the master records is computed, and each record is then
assigned to the group of the matching master record.

Purpose

This component helps you to create groups of similar data records
in any source data including large volumes of data by using one or
several match rules.

Basic settings

Schema and Edit
schema

A schema is a row description. It defines the number of fields to
be processed and passed on to the next component. The schema is
either Built-in or stored remotely
in the Repository.

Since version 5.6, both the Built-In mode and the Repository mode are
available in any of the Talend solutions.

Click Sync columns to retrieve
the schema from the previous component in the Job.

The output schema of this component contains the following
read-only fields:

GID: provides a group identifier of the data
type String.

Note

All Jobs with tMatchGroup
that are migrated from older releases into your current
studio may provide a group identifier of the data type
Long. If you want to have a group
identifier of the data type String, you
must replace the tMatchGroup component in these Jobs with
tMatchGroup from the
studio Palette.

GRP_SIZE: counts the number of records in the
group, computed only on the master record.

MASTER: identifies, by true or
false, if the record used in the matching
comparisons is a master record. There is only one master record per
group.

Each input record is compared to the master record; if they
match, the input record joins that group.

SCORE: measures the distance between the
input record and the master record according to the matching
algorithm used.

When the tMatchGroup component
is used with multiple output flows, the score in this column
decides to which output flow the record goes.

GRP_QUALITY: provides the quality of
similarities in the group by taking the minimal matching value. Only
the master record has a quality score.

 

 

Built-in: You create and store
the schema locally for this component only. Related topic: see
Talend Studio User
Guide
.

 

 

Repository: You have already
created and stored the schema in the Repository. You can reuse it in
other projects and job designs. Related topic: see Talend Studio User Guide.

PREVIEW

This button opens a configuration wizard that enables you to
define production environments and their match rules or to import
match rules from the studio repository. For further information, see
Configuration wizard.

Key Definition

Input Key Attribute

Select the column(s) from the input flow on which you want to
apply a matching algorithm.

Note

When you select a date column on which to apply a matching algorithm,
you can decide what part of the date format to compare.

For example, if you want to compare only the year in the date, set the
type of the date column to Date in the component schema and then enter
“yyyy” in the Date Pattern field. The component then converts the date
to a string according to the pattern defined in the schema before
starting a string comparison.
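
A minimal illustration of that conversion in plain Java (the dates below
are made up; SimpleDateFormat mirrors what the component does with the
pattern):

    import java.text.SimpleDateFormat;
    import java.util.GregorianCalendar;

    public class DatePatternDemo {
        public static void main(String[] args) {
            // With Date Pattern "yyyy", only the formatted year takes part in
            // the comparison: two dates of the same year become equal strings.
            SimpleDateFormat pattern = new SimpleDateFormat("yyyy");
            String a = pattern.format(new GregorianCalendar(2015, 2, 1).getTime());   // "2015"
            String b = pattern.format(new GregorianCalendar(2015, 10, 23).getTime()); // "2015"
            System.out.println(a.equals(b)); // true: the strings match
        }
    }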

 

 

Matching Function

Select a matching algorithm from the list:

Exact: matches each processed
entry to all possible reference entries with exactly the same value.
It returns 1 when the two strings exactly
match, otherwise it returns 0.

Exact – ignore case: matches each
processed entry to all possible reference entries with exactly the
same value while ignoring the value case.

Soundex: matches processed
entries according to a standard English phonetic algorithm. It
indexes strings by sound, as pronounced in English, for example
“Hello”: “H400”.

Levenshtein (edit distance):
calculates the minimum number of edits (insertion, deletion or
substitution) required to transform one string into another. Using
this algorithm in the tMatchGroup
component, you do not need to specify a maximum distance. The
component automatically calculates a matching percentage based on
the distance. This matching score will be used for the global
matching calculation, based on the weight you assign in the Confidence Weight field.
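
As a sketch of how an edit distance can become a matching percentage (the
component's exact normalization is not spelled out here; a common choice
divides by the length of the longer string):

    public class LevenshteinDemo {
        // Classic dynamic-programming edit distance.
        static int distance(String s, String t) {
            int[][] d = new int[s.length() + 1][t.length() + 1];
            for (int i = 0; i <= s.length(); i++) d[i][0] = i;
            for (int j = 0; j <= t.length(); j++) d[0][j] = j;
            for (int i = 1; i <= s.length(); i++)
                for (int j = 1; j <= t.length(); j++)
                    d[i][j] = Math.min(
                            Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                            d[i - 1][j - 1] + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1));
            return d[s.length()][t.length()];
        }

        public static void main(String[] args) {
            String a = "Jonson", b = "Johnson";
            int dist = distance(a, b); // 1: a single insertion
            // One common normalization; the studio's exact formula may differ.
            double score = 1.0 - (double) dist / Math.max(a.length(), b.length());
            System.out.println(dist + " -> " + score); // 1 -> ~0.857
        }
    }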

Metaphone: Based on a phonetic
algorithm for indexing entries by their pronunciation. It first
loads the phonetics of all entries of the lookup reference and
checks all entries of the main flow against the entries of the
reference flow.

Double Metaphone: a newer version
of the Metaphone phonetic algorithm that produces more accurate
results than the original algorithm. It can return both a primary
and a secondary code for a string. This accounts for some ambiguous
cases as well as for multiple variants of surnames with common
ancestry.

Soundex FR: matches processed
entries according to a standard French phonetic algorithm.

Jaro: matches processed entries
according to spelling deviations. It counts the number of matched
characters between two strings. The higher the returned value, the
more similar the strings are.

Jaro-Winkler: a variant of Jaro,
but it gives more importance to the beginning of the string.

q-grams: matches processed
entries by dividing strings into letter blocks of length
q in order to create a number of q
length grams. The matching result is given as the number of
q-gram matches over possible q-grams.
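
For instance, with q set to 3, “Hello” yields the grams Hel, ell and llo,
while “Hallo” yields Hal, all and llo, so only one gram is shared. A small
sketch (one simple ratio among several possible; real implementations may
also pad the strings or weight repeated grams):

    import java.util.HashSet;
    import java.util.Set;

    public class QGramsDemo {
        // Collect the letter blocks of length q contained in s.
        static Set<String> qgrams(String s, int q) {
            Set<String> grams = new HashSet<>();
            for (int i = 0; i + q <= s.length(); i++) grams.add(s.substring(i, i + q));
            return grams;
        }

        public static void main(String[] args) {
            Set<String> a = qgrams("Hello", 3); // [Hel, ell, llo]
            Set<String> b = qgrams("Hallo", 3); // [Hal, all, llo]
            Set<String> shared = new HashSet<>(a);
            shared.retainAll(b);                // [llo]
            double score = 2.0 * shared.size() / (a.size() + b.size());
            System.out.println(score);          // ~0.33
        }
    }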

custom…: enables you to load an
external matching algorithm from a Java library using the custom matcher class column.

For further information about how to load an external Java
library, see tLibraryLoad.

For further information about how to create a custom matching
algorithm, see Creating a custom matching algorithm.

For a related scenario about how to use a custom matching
algorithm, see Scenario 2: Using a custom matching algorithm to match entries.

 

Custom Matcher

When you select Custom as the
matching type, enter the path pointing to the custom class (external
matching algorithm) you need to use. This is the path you defined
within the library file (.jar
file).

For example, to use a MyDistance.class class
stored in the directory org/talend/mydistance
in a user-defined mydistance.jar library, the
path to be entered is
org.talend.mydistance.MyDistance.
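
As a sketch only, a custom matcher packaged as
org/talend/mydistance/MyDistance.class in mydistance.jar could look like
the class below. The method name and signature are illustrative
assumptions; the actual contract the class must fulfill is described in
Creating a custom matching algorithm.

    package org.talend.mydistance;

    // Hypothetical custom matcher: returns a similarity score between 0 and 1.
    public class MyDistance {
        public double getMatchingWeight(String record, String reference) {
            if (record == null || reference == null) return 0.0;
            // Example rule: score by the proportion of shared leading characters.
            int limit = Math.min(record.length(), reference.length());
            int same = 0;
            while (same < limit && record.charAt(same) == reference.charAt(same)) same++;
            int longest = Math.max(record.length(), reference.length());
            return longest == 0 ? 1.0 : (double) same / longest;
        }
    }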

 

 

Weight

Set a numerical weight for each attribute (column) of the key
definition. The values can be anything >= 0.

 

Handle Null

To handle null values, select from the list the null operator you
want to use on the column:

Null Match Null: a Null attribute
only matches another Null attribute.

Null Match None: a Null attribute
never matches another attribute.

Null Match All: a Null attribute
matches any other value of an attribute.

For example, suppose two columns, name and
firstname, where name is never null but
firstname can be null.

If we have two records:

“Doe”, “John”

“Doe”, “”

Depending on the operator you choose, these two records may or may
not match:

Null Match Null: they do not
match.

Null Match None: they do not
match.

Null Match All: they
match.

And for the records:

“Doe”, “”

“Doe”, “”

Null Match Null: they
match.

Null Match None: they do not
match.

Null Match All: they
match.
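
The three operators can be summarized in a short sketch (illustrative
Java, not the component's source):

    public class HandleNullDemo {
        enum NullOperator { NULL_MATCH_NULL, NULL_MATCH_NONE, NULL_MATCH_ALL }

        // Score used when at least one attribute is null; otherwise the
        // configured matching algorithm decides.
        static double nullScore(String a, String b, NullOperator op) {
            switch (op) {
                case NULL_MATCH_NULL: return (a == null && b == null) ? 1.0 : 0.0;
                case NULL_MATCH_NONE: return 0.0;
                default:              return 1.0; // NULL_MATCH_ALL
            }
        }

        public static void main(String[] args) {
            // "Doe","John" vs "Doe",null: only Null Match All lets the
            // firstname attributes count as matching.
            System.out.println(nullScore("John", null, NullOperator.NULL_MATCH_ALL));  // 1.0
            System.out.println(nullScore("John", null, NullOperator.NULL_MATCH_NULL)); // 0.0
        }
    }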

Match Threshold

Enter the match probability. Two data records match when the
probability is above the set value.

You can enter a different match threshold for each match
rule.
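
To make the interplay between the attribute weights and the match
threshold concrete, here is a sketch assuming the global score is the
weight-normalized sum of the per-attribute scores, which is the usual way
weighted matching scores are combined:

    public class WeightedScoreDemo {
        public static void main(String[] args) {
            // Per-attribute scores returned by the chosen algorithms.
            double[] scores  = {0.98, 0.80}; // e.g. lname, fname
            double[] weights = {2.0, 1.0};   // Confidence Weight per attribute

            double weighted = 0.0, total = 0.0;
            for (int i = 0; i < scores.length; i++) {
                weighted += scores[i] * weights[i];
                total += weights[i];
            }
            double global = weighted / total; // (1.96 + 0.80) / 3, roughly 0.92
            boolean match = global >= 0.85;   // Match Threshold = 0.85
            System.out.println(global + " -> " + match); // ~0.92 -> true
        }
    }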

Blocking Selection

Input Column

If required, select the column(s) from the input flow according to
which you want to partition the processed data into blocks; this is
usually referred to as “blocking”.

Blocking reduces the number of record pairs that need to be
examined. In blocking, input data is partitioned into exhaustive
blocks designed to increase the proportion of matches observed while
decreasing the number of pairs to compare. Comparisons are
restricted to record pairs within each block.

Using blocking column(s) is very useful when you are processing
very large data sets.
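
The sketch below shows why blocking cuts the comparison cost
(illustrative only, not the component's implementation): records are
bucketed by the blocking value, and pairwise comparison happens only
inside each bucket.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class BlockingDemo {
        public static void main(String[] args) {
            // record -> blocking value (for example a ZIP column or a tGenKey key)
            Map<String, String> records = new LinkedHashMap<>();
            records.put("John Doe", "92150");
            records.put("Jon Doe", "92150");
            records.put("Ann Smith", "75011");

            // Partition the records into blocks.
            Map<String, List<String>> blocks = new HashMap<>();
            for (Map.Entry<String, String> e : records.entrySet()) {
                if (!blocks.containsKey(e.getValue()))
                    blocks.put(e.getValue(), new ArrayList<String>());
                blocks.get(e.getValue()).add(e.getKey());
            }

            // Compare pairs only within each block: 1 pair here instead of 3.
            for (List<String> block : blocks.values())
                for (int i = 0; i < block.size(); i++)
                    for (int j = i + 1; j < block.size(); j++)
                        System.out.println("compare: " + block.get(i) + " / " + block.get(j));
        }
    }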

Advanced settings

Store on disk

Select the Store on disk check
box if you want to store processed data blocks on the disk to
maximize system performance.

Max buffer size: Type in the size
of physical memory you want to allocate to processed data.

Temporary data directory path:
Set the location where the temporary file should be stored.

 

Multiple output

Select the Separate output check
box to have three different output flows:

Uniques: when the group score
(minimal distance computed in the record) is equal to
1, the record is listed in this flow.

Matches: when the group score
(minimal distance computed in the record) is higher than the
threshold you define in the Confident match
threshold
field, the record is listed in this
flow.

Suspects: when the group score
(minimal distance computed in the record) is below the threshold you
define in the Confident match threshold
field, the record is listed in this flow.

Confident match threshold: set a
numerical value between the current Match
threshold
and 1. Above this
threshold, you can be confident in the quality of the group.

 

Multi-pass

Select this check box to enable a tMatchGroup component to receive data sets from
another tMatchGroup that precedes
it in the Job. This refines the groups received by each of the
tMatchGroup components by
creating data partitions based on different blocking keys.

For an example Job, see Scenario 2: Matching customer data through multiple passes

 

Sort the output data by GID

Select this check box to group the output data by the group
identifier.

 

Output distance details

Select this check box to add an output column
MATCHING_DISTANCES in the schema of the
component. This column provides the distance between the input and
master records in each group.

Note

When you use two tMatchGroup
components in a Job and you want to use the Output distance details option, you
must select this check box in both components before you link
them together. If the components are already linked, select the
check box in the second component of the Job flow first, and then
in the first component; otherwise two columns with the same name
end up in the output schema. Selecting this option in only one
tMatchGroup is not useful and may cause schema mismatch issues.

 

Display detailed labels

Select this check box to have in the output
MATCHING_DISTANCES column not only the
matching distance but also the names of the columns used as key
attributes in the applied rule.

For example, if you try to match on first name and last name
fields, lname and fname,
the output would be fname:1.0|lname:0.97 when the check
box is selected and 1.0|0.97 when it is not
selected.
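
If you need to consume this column downstream, for example in a tJavaRow,
a small parsing sketch for the detailed-label format shown above could
look like this:

    public class MatchingDistancesDemo {
        public static void main(String[] args) {
            // Example value of MATCHING_DISTANCES with detailed labels enabled.
            String details = "fname:1.0|lname:0.97";
            for (String part : details.split("\\|")) {
                String[] kv = part.split(":");
                System.out.println(kv[0] + " -> " + Double.parseDouble(kv[1]));
            }
        }
    }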

 

tStatCatcher Statistics

Select this check box to collect log data at the component level.
Note that this check box is not available in the Map/Reduce version
of the component.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl +
Space
to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio
User Guide.

Usage

This component is an intermediary step. It requires an input flow
as well as an output flow.

Usage in Map/Reduce Jobs

If you have subscribed to one of the Talend solutions with Big Data, you can also
use this component as a Map/Reduce component. In a Talend Map/Reduce Job, this
component is used as an intermediate step and other components used along with it must be
Map/Reduce components, too. They generate native Map/Reduce code that can be executed
directly in Hadoop.

You need to use the Hadoop Configuration tab in the
Run view to define the connection to a given Hadoop
distribution for the whole Job.

For further information about a Talend Map/Reduce Job, see the sections
describing how to create, convert and configure a Talend Map/Reduce Job of the
Talend Big Data Getting Started Guide.

For a scenario demonstrating a Map/Reduce Job using this
component, see Scenario 3: Matching data through multiple passes using Map/Reduce components.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents
only Standard Jobs, that is to say traditional, non-Map/Reduce Talend data
integration Jobs.

Limitation/prerequisite

n/a

Configuration wizard

The configuration wizard enables you to create different production environments,
Configurations, and their match rules. You can also
use the configuration wizard to import match rules created and tested in the studio and
stored in the repository, and use them in your match Jobs. For further information, see
Importing match rules from the studio repository.

You cannot open the configuration wizard unless you link the input component to the
tMatchGroup component.

To open the configuration wizard:

  1. In the studio workspace, design your Job and link the components together, for
    example as below:

    use_case_tmatchgroup.png
  2. Either:

    • Double-click tMatchGroup, or

    • Right-click it and from the contextual menu select Configuration Wizard, or

    • Click Preview in the basic settings view of tMatchGroup.

match_studio.png

The configuration wizard is composed of three areas:

  • the Configuration view, where you can set
    the match rules and the blocking column(s),

  • the matching chart, which presents the graphic matching result,

  • the matching table, which presents the details of the matching
    result.

Configuration view

From this view, you can edit the configuration of the tMatchGroup component or define different configurations in which to
execute the Job. You can use these different configurations for testing purposes, for
example, but you can save only one configuration from the wizard: the open
configuration.

In each configuration, you can define blocking key(s) and multiple conditions
using several match rules. You can also set different match intervals for each rule.
The match results on multiple conditions will list data records that meet any of the
defined rules. When a configuration has multiple conditions, the Job conducts an OR
match operation. It evaluates data records against the first rule and the records
that match are not evaluated against the other rules.

The parameters required to edit or create a match rule are:

  • The Limit field indicates the maximum
    number of rows to be processed by the match rule(s) in the wizard. The
    default maximum is 1000 rows.

  • The Key definition parameters.

  • The Match Threshold field.

To create a new configuration and new match rules from the configuration wizard,
do the following:

  1. Click the [+] button on the top right
    corner of the Configuration view.

    This creates, in a new tab, an exact copy of the last
    configuration.

    match_studio_new_config.png
  2. Edit or set the parameters for the new configuration in the Key definition and Blocking
    Selection
    tables.

  3. If needed, define several match rules for the open configuration as follows:

    • Click the [+] button on the
      match rule bar.

      This creates, in a new tab, an exact copy of the last
      rule.

    • Set the parameters for the new rule in the Key definition table and define its
      match interval.

    • Follow the steps above to create as many match rules for a
      configuration as needed. You can define a different match
      interval for each rule.

      When a configuration has multiple conditions, the Job conducts
      an OR match operation. It evaluates data records against the
      first rule and the records that match are not evaluated against
      the second rule and so on.

  4. Click the Chart button at the top right
    corner of the wizard to execute the Job in the open configuration.

    The matching results are displayed in the matching chart and table.

    Follow the steps above to create as many new configurations in the wizard
    as needed.

  5. To execute the Job in a specific configuration, open the configuration in
    the wizard and click the Chart button.

    The matching results are displayed in the matching chart and table.

  6. At the bottom right corner of the wizard, click either:

    • OK to save the open
      configuration.

      You can save only one configuration in the wizard.

    • Cancel to close the wizard and
      keep the configuration saved initially in the wizard.

Matching chart

From the matching chart, you can have a global picture about the duplicates in the
analyzed data.

The Hide groups of less than parameter, which is set
to 2 by default, enables you to decide what groups to show in
the result chart. Usually you want to hide groups of small size.

tmatchgroup-wizard_chart.png

For example, the above matching chart indicates that:

  • 48 items are analyzed and classified into 18 groups according to a
    given match rule, after excluding unique items by setting
    the Hide groups of less than parameter to
    2.

  • 11 groups have 2 items each. In each group, the 2 items are duplicates
    of each other.

  • 3 groups have 3 items each. In each group, these items are duplicates
    of one another.

  • 3 groups have 4 items each. In each group, these items are duplicates
    of one another.

  • One single group has 5 duplicate items.

Matching table

From the matching table, you can read details about the different
duplicates.

tmatchgroup-wizard_table.png

This table indicates the matching details of items in each group and colors the
groups in accordance with their color in the matching chart.

You can decide what groups to show in this table by setting the Hide groups of less than parameter. This parameter
enables you to hide groups of small group size. It is set to 2
by default.

The buttons under the table help you navigate back and forth through the
pages.

Importing match rules from the studio repository

From the tMatchGroup configuration wizard, you
can import match keys from the match rules created and tested in the Profiling perspective of Talend Studio and stored in
the repository. You can then use these imported matching keys in your match
Jobs.

The tMatchGroup component is based on the VSR algorithm.
You cannot import match rules configured with the T-Swoosh algorithm; a warning
message is displayed in the wizard when you try to import such rules.

The VSR algorithm takes a set of records as input and groups similar encountered
duplicates together according to defined match rules. It compares pairs of records
and assigns them to groups. The first processed record of each group is the master
record of the group. The VSR algorithm compares each record with the master of each
group and uses the computed distances from the master records to decide to which
group the record should go.
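
A sketch of that grouping loop (illustrative only; the actual VSR
implementation also applies attribute weights, null operators and
blocking):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class VsrSketch {
        // Stand-in for the configured match rule; returns a score in [0,1].
        static double similarity(String a, String b) {
            return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
        }

        public static void main(String[] args) {
            double threshold = 0.85;
            // master record -> its group
            Map<String, List<String>> groups = new LinkedHashMap<>();

            for (String record : Arrays.asList("Doe", "doe", "Smith")) {
                String bestMaster = null;
                double bestScore = 0.0;
                for (String master : groups.keySet()) {   // compare to each master
                    double s = similarity(record, master);
                    if (s > bestScore) { bestScore = s; bestMaster = master; }
                }
                if (bestMaster != null && bestScore >= threshold) {
                    groups.get(bestMaster).add(record);   // join the closest group
                } else {
                    List<String> group = new ArrayList<String>();
                    group.add(record);                    // first record becomes master
                    groups.put(record, group);
                }
            }
            System.out.println(groups); // {Doe=[Doe, doe], Smith=[Smith]}
        }
    }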

To import match rules from the studio repository:

  1. From the configuration wizard, click the match_rule_import_icon.png icon on the top right corner.

    The [Match Rule Selector] wizard opens
    listing all match rules created in the studio and saved in the
    repository.

    tmatchgroup-wizard_import_rule.png
  2. Select the match rule you want to import into the tMatchGroup component and use on your data.

    Note

    – A warning message is displayed in the wizard if the match rule you want
    to import is defined on columns that do not exist in the input schema of
    tMatchGroup. You can define input
    columns later in the configuration wizard.

    – A warning message is displayed in the wizard if you try to import match
    rules configured with the T-Swoosh algorithm. For further information
    about the T-Swoosh algorithm, see Talend Studio User
    Guide.

  3. Select the Overwrite current Match Rule in the
    analysis
    check box if you want to replace the rule in the
    configuration wizard with the rule you import.

    If you leave the box unselected, the match keys will be imported in a new
    match rule tab without overwriting the current match rule in the
    wizard.

  4. Click OK.

    The matching key is imported from the match rule and listed as a new rule
    in the configuration wizard.

    tmatchgroup-wizard_import_rule2.png
  5. Click in the Input Key Attribute and
    select from the input data the column on which you want to apply the
    matching key.

  6. In the Match threshold field, enter the
    match probability threshold. Two data records match when the computed match
    score is above this value.

  7. In the Blocking Selection table, select
    the column(s) from the input flow which you want to use as a blocking
    key.

    Defining a blocking key is not mandatory but advisable. Using a blocking
    key partitions data in blocks and so reduces the number of records that need
    to be examined, as comparisons are restricted to record pairs within each
    block. Using blocking key(s) is very useful when you are processing large
    data sets.

    Note that the Blocking Selection table in
    the component is different from the Generation of
    Blocking Key
    table in the match rule editor in the Profiling perspective.

    The blocking column in tMatchGroup can
    come from a tGenKey component (in which case it is
    called T_GEN_KEY) or directly from the input schema (it
    could be a ZIP column for instance). The
    Generation of Blocking Key table in the
    match rule editor, on the other hand, defines the parameters necessary to
    generate a blocking key; that table is equivalent to the tGenKey component and generates a blocking column
    BLOCK_KEY used for blocking.

  8. Click the Chart button in the top right
    corner of the wizard to execute the Job using the imported match rule and
    show the matching results in the wizard.

Configuration wizard in Map/Reduce Jobs

Warning

The information in this section is only for users that have subscribed to one of
the Talend solutions with Big Data and is not applicable to
Talend Open Studio for Big Data users.

In a Talend Map/Reduce Job, tMatchGroup, as well as the whole Map/Reduce Job using it, generates
native Map/Reduce code. This section presents the specific settings in the configuration
wizard of tMatchGroup when it is used in that
situation. For further information about a Talend Map/Reduce Job, see the Talend Big Data Getting Started Guide.

You cannot open the configuration wizard unless you link an input component to the
tMatchGroup component.

From the configuration wizard in tMatchGroup, you can:

  • define multiple conditions using several match rules to group data,

  • set different match intervals for each rule,

  • import match rules created and tested in the studio and stored in the repository, and use
    them in your match Jobs. You can only import rules configured with the VSR
    algorithm. For further information, see Importing match rules from the studio repository.

  • select a blocking key to partition data.

tmatchgroup-mr_wizard.png

The match results on multiple conditions will list data records that meet any of the
defined rules.

To create match rules from the configuration wizard, do the following:

  1. Click the [+] button on the match rule
    bar.

  2. Set the parameters for the new rule in the Key
    definition
    table and define its match interval.

  3. Repeat the above steps to create as many match rules as needed. You can define
    a different match interval for each rule.

    When you define multiple rules, the Job conducts an OR match operation. It
    evaluates data records against the first rule and the records that match are not
    evaluated against the second rule.

  4. In the Blocking Selection table, select the
    column(s) from the input flow which you want to use as a blocking key.

    Defining a blocking key is not mandatory but is very useful when you are
    processing big data sets. A blocking key partitions data in blocks and so
    reduces the number of records that need to be examined. This key can come from a
    tGenKey component (and would be called
    T_GEN_KEY) or directly from the input schema.

  5. At the bottom right corner of the wizard, click either:

    • OK to save the current
      configuration.

    • Cancel to close the wizard and keep
      the configuration saved initially in the wizard.

Scenario 1: Grouping output data in separate flows according to the minimal distance
computed in each record

This scenario describes a basic Job that compares columns in the input file using the
Jaro-Winkler matching method on the
lname and fname columns and the q-grams matching method on the address1
column. It then groups the output records in three output flows:

  • Uniques: lists the records whose group
    score (minimal distance computed in the record) is equal to
    1.

  • Matches: lists the records whose group
    score (minimal distance computed in the record) is higher than the threshold
    you define in the Confident match threshold
    field.

  • Suspects: lists the records whose group
    score (minimal distance computed in the record) is below the threshold you
    define in the Confident match threshold
    field.

use_case_tmatchgroup.png

For another scenario that groups the output records in one single output flow, see
Scenario 2: Comparing columns and grouping in the output flow duplicate
records that have the same functional key
.

Setting up the Job

  1. Drop the following components from the Palette onto the design workspace: tFileInputExcel, tMatchGroup and three tLogRow components.

  2. Connect tFileInputExcel to tMatchGroup using the Main
    link.

  3. Connect tMatchGroup to the three
    tLogRow components using the Unique rows, Confident
    groups
    and Uncertain groups
    links.

    Warning

    To be able to set three different output flows for the processed
    records, you must first select the Separate
    output
    check box in the Advanced
    settings
    view of the tMatchGroup component. For further information, see the
    section about configuring the tMatchGroup component.

Configuring the input component

The main input file contains eight columns: account_num,
lname, fname,
mi, address1,
city, state_province and
postal_code. The data in this input file has problems
such as duplication, names spelled differently or incorrectly, and
different information for the same customer.

use_case_tmatchgroup_input.png

You can create the input file used in this scenario by executing the
c0 and c1 Jobs included in the
data quality demo project, TDQEEDEMOJAVA, which you can import from the login window
of your Talend Studio. For
further information, see the Talend Studio User
Guide.

  1. In the Basic settings view of tFileInputExcel, fill in the File Name field by browsing to the input file
    and set other properties in case they are not stored in the Repository.

    use_case_tmatchgroup2.png
  2. Create the schema through the Edit Schema
    button, if the schema is not already stored in the Repository. Remember to set the data type in the Type column.

    use_case_tmatchgroup1.png

Configuring the tMatchGroup component

  1. Double-click tMatchGroup to display the
    Basic settings view and define the
    component properties.

    use_case_tmatchgroup3.png
  2. Click Sync columns to retrieve the schema
    from the preceding component.

  3. Click the Edit schema button to view the
    input and output schema and do any modifications in the output schema, if
    necessary.

    use_case_tmatchgroup4.png

    In the output schema of this component there are a few read-only standard
    output columns. For more information, see tMatchGroup properties.

  4. Click OK to close the dialog box.

  5. Click Preview to open the configuration
    wizard and define the component configuration and the match rule(s).

    match_studio.png

    You can use the configuration wizard to import match rules created and
    tested in the studio and stored in the repository, and use them in your
    match Jobs. For further information, see Importing match rules from the studio repository.

  6. Define the first match rule as follows:

    • In the Key definition table,
      click the [+] button to add to the
      list the column(s) on which you want to do the matching operation,
      lname and
      fname.

      Note

      When you select a date column on which to apply a matching algorithm,
      you can decide what part of the date format to compare.

      For example, if you want to compare only the year in the date, set the
      type of the date column to Date in the component schema and then enter
      “yyyy” in the Date Pattern field. The component then converts the date
      to a string according to the pattern defined in the schema before
      starting a string comparison.

    • Click in the cell of the Matching
      type
      column and select from the list Jaro-Winkler as the method to be used for
      the matching operation.

      If you select custom as a
      matching type, you must set in the Custom
      Matcher Class
      column the path pointing to the custom
      class (external matching algorithm) you need to use. This is the
      path you defined within the library file (.jar
      file).

    • Click in the cell of the Confidence
      Weight
      column to set the numerical weights for the
      two columns used as key attributes.

    • Click in the cell of the Handle
      Null
      column and select the null operator you want to
      use to handle null attributes in the columns. In this example,
      select Null Match None in order to
      have matching results where null values have minimal effect.

    • Set the match probability in the Match
      Interval
      field.

  7. Follow the same procedure in the above step to define the second match
    rule.

    Set the address1 column as an input attribute and
    select Jaro as the matching type. Select
    Null Match None as the null operator.
    Finally, set the match probability, which can differ from the one
    set for the first rule.

  8. Set the Hide groups of less than
    parameter in order to decide what groups to show in the result chart and
    matching table. This parameter enables you to hide groups of small group
    size.

  9. Click the Advanced settings tab and set
    the advanced parameters for the tMatchGroup
    component as follows:

    use_case_tmatchgroup5.png
    • Select the Separate output check
      box.

      The component will have three separate output flows: Unique rows, Confident groups and Uncertain
      groups
      .

      If this check box is not selected, the tMatchGroup component will have only one output flow
      where it groups all output data. For an example scenario, see Scenario 2: Comparing columns and grouping in the output flow duplicate
      records that have the same functional key
      .

    • Select the Sort the output data by
      GID
      check box to sort the output data by their group
      identifier.

    • Select the Output distance
      details
      and Display detailed
      labels
      check boxes.

      The component will output the
      MATCHING_DISTANCES column. This column
      provides the distance between the input and the master columns
      giving also the names of the columns against which the records are
      matched.

  10. Click the Chart button in the wizard to
    execute the Job in the defined configuration and have the matching results
    directly in the wizard.

    use_case_tmatchgroup9.png

    The matching chart gives a global picture about the duplicates in the
    analyzed data. The matching table indicates the details of items in each
    group and colors the groups in accordance with their color in the matching
    chart.

    The Job conducts an OR match operation on the records. It evaluates the
    records against the first rule and the records that match are not evaluated
    against the second rule. The MATCHING_DISTANCES column allows you to see which
    rule was applied to which records. In the yellow data group for example,
    the Amole Sarah record is matched according to the
    second rule that uses address1 as a key attribute,
    whereas the other records in the group are matched according to the first
    rule, which uses lname and
    fname as key attributes.

    You can set the Hide groups of less than
    parameter in order to decide what groups to show in the matching chart and
    table.

Finalizing the Job and executing it

  1. Double-click each of the tLogRow
    components to display the Basic
    settings
    view and define the component properties.

  2. Save your Job and press F6 to execute
    it.

    You can see that records are grouped together in three different groups.
    Each record is listed in one of the three groups according to the value of
    the group score, which is the minimal distance computed in the group.

    The identifier for each group, which is of String
    data type, is listed in the GID column next to the
    corresponding record. This identifier will be of the data type
    Long for Jobs that are migrated from older
    releases. To have the group identifier as String, you
    must replace the tMatchGroup component in
    the imported Job with tMatchGroup from the
    studio Palette.

    The number of records in each of the three output blocks is listed in the
    GRP_SIZE column and computed only on the master
    record. The MASTER column indicates with true or false
    whether the corresponding record is a master record. The
    SCORE column lists the calculated distance between
    the input record and the master record according to the Jaro-Winkler and Jaro matching algorithms.

    The Job evaluates the records against the first rule and the records that
    match are not evaluated against the second rule.

    All records whose group score is between the match interval,
    0.95 or 0.85 depending on the
    applied rule, and the confidence threshold defined in the advanced settings
    of tMatchGroup are listed in the Suspects output flow.

    use_case_tmatchgroup6.png

    All records whose group score is above one of the match probabilities are
    listed in the Matches output flow.

    use_case_tmatchgroup7.png

    All records that have a group size equal to 1 are listed in the Uniques output flow.

    use_case_tmatchgroup8.png

For another scenario that groups the output records in one single output flow
based on a generated functional key, see Scenario 2: Comparing columns and grouping in the output flow duplicate
records that have the same functional key
.

Scenario 2: Matching customer data through multiple passes

The Job in this scenario groups similar customer records by running through two
subsequent matching passes (tMatchGroup components) and
outputs the calculated matches in groups. Each pass provides its matches to the pass
that follows, so that the latter can add more matches identified with new rules and
blocking keys.

use_case-multi_pass_tmatchgroup.png

In this Job:

  • The tMysqlInput component connects to the
    customer records to be processed.

  • Each of the tGenKey components defines a way
    to partition data records. The first key partitions data into many groups and the
    second key creates fewer groups that overlap the previous blocks, depending on
    the blocking key definition.

  • The tMap component renames the key generated
    by the second tGenKey component.

  • The first tMatchGroup processes the
    partitions defined by the first tGenKey, and
    the second tMatchGroup processes those defined
    by the second tGenKey.

    Warning

    The two tMatchGroup components must
    have the same schema.

  • The tLogRow component presents the matching
    results after the two passes.

Setting up the Job

In this scenario, the main input schema is already stored in the Repository. For more information about storing schema
metadata in the repository, see the Talend Studio User
Guide
.

  1. In the Repository tree view, expand
    Metadata and then Db
    Connections
    , where you have stored the main input schema, and
    drop the database table onto the design workspace. The input table used in
    this scenario is called customer.

    A dialog box is displayed with a list of components.

  2. Select the relevant database component, tMysqlInput in this example, and then click OK.

  3. Drop two tGenKey components, two
    tMatchGroup components, a tMap component and a tLogRow component from the Palette onto the design workspace.

  4. Link the input component to the tGenKey
    and tMap components using Main links.

  5. In the two tMatchGroup components, select
    the Output distance details check boxes in
    the Advanced settings view of both
    components before linking them together.

    This will provide the MATCHING_DISTANCES column in
    the output schema of each tMatchGroup.

    Note

    If the two tMatchGroup components are
    already linked to each other, you must select the Output distance details check box in the second
    component in the Job flow first; otherwise you may have an issue.

  6. Link the two tMatchGroup components and
    the tLogRow component using Main links.

  7. If needed, give the components specific labels to reflect their usage in
    the Job.

    For further information about how to label a component, see Talend Studio
    User Guide.

Configuring input data and key generation

Connecting to the input data

  1. Double-click tMysqlInput to open its
    Component view.

    use_case-multi_pass_tmatchgroup2.png

    The property fields for tMysqlInput are
    automatically filled in. If your input schema is not stored in
    the repository, fill in the details manually after selecting Built-in in the Schema and Property Type
    lists.

    The input table used in this scenario is called
    customer.

  2. Modify the query in the Query box to
    select only the columns you want to match:
    account_num, lname,
    fname, mi,
    address1, city,
    state_province and
    postal_code.

Configuring the key generation for the first pass

  1. Double-click the first tGenKey to open
    the Component view.

    use_case-multi_pass_tmatchgroup3.png
  2. Click match_rule_import_icon.png and import blocking keys from match rules created and
    tested in the Profiling perspective
    of Talend Studio and use them in your Job. Otherwise,
    define the blocking key parameters as described in the below steps.

  3. Under the Algorithm table, click the
    [+] button to add two rows in the
    table.

  4. On the column column, click the newly
    added row and select from the list the column you want to process using an
    algorithm. In this example, select lname.

  5. Do the same on the second row to select
    postal_code.

  6. On the pre-algorithm column, click the
    newly added row and select from the list the pre-algorithm you want to apply
    to the corresponding column.

    In this example, select remove diacritical marks and
    convert to upper case
    to remove any diacritical marks and
    convert the fields of the lname column
    to upper case.

    Note

    This conversion does not change your raw data.

  7. On the algorithm column, click the newly
    added row and select from the list the algorithm you want to apply to the
    corresponding column. In this example, select N first
    characters of each word
    .

  8. Do the same for the second row on the algorithm column to select first N
    characters of the string
    .

  9. Click in the Value column next to the
    algorithm column and enter the value
    for the selected algorithm, when needed.

    In this scenario, enter 1 for both rows. The first letter of each
    field in the corresponding columns will be used to generate the key.

Configuring the key generation for the second pass

  1. Double-click the second tGenKey to open
    the Component view.

    use_case-multi_pass_tmatchgroup4.png
  2. In the Algorithm table, define the column
    you want to use to partition data, account_num in this
    component. Select the first N characters of the
    string
    algorithm and set the value to 1
    in the Value column.

    Each of the two tGenKey components will
    generate a read-only T_GEN_KEY column in
    the output schema. You must rename one of the T_GEN_KEY columns to stop them from overwriting each
    other.

  3. Double-click the tMap component to open
    its editor.

    use_case-multi_pass_tmatchgroup8.png
  4. In the Schema editor, copy the columns
    from the first table onto the second table and rename T_GEN_KEY to T_GEN_KEY1, for example.

  5. In the top part of the editor, drop all columns from the input table to
    the output table.

  6. Click OK to save the data transformation and
    close the editor.

  7. In the tGenKey basic settings, click the dotbutton.png button to verify that the two generated keys are named
    differently in the output schema.

    use_case-multi_pass_tmatchgroup5.png

Configuring the two matching passes

Configuring the first pass

  1. Double-click the first tMatchGroup
    labelled pass1 to display the Configuration Wizard.

    use_case-multi_pass_tmatchgroup6.png
  2. Click match_rule_import_icon.png and import matching keys from the match rules created
    and tested in the Profiling perspective
    of Talend Studio and use them in your Job. Otherwise,
    define the matching key parameters as described in the below steps.

  3. In the Key definition table, click the
    [+] button to add the column(s) on
    which you want to do the matching operation, lname in this scenario.

    Note

    When you select a date column on which to apply a matching algorithm,
    you can decide what part of the date format to compare.

    For example, if you want to compare only the year in the date, set the
    type of the date column to Date in the component schema and then enter
    “yyyy” in the Date Pattern field. The component then converts the date
    to a string according to the pattern defined in the schema before
    starting a string comparison.

  4. Select the Jaro-Winkler algorithm in the Matching Function column.

  5. Set Weight to 1 and
    in the Handle Null column, select the null
    operator you want to use to handle null attributes in the column, Null Match Null in this scenario.

  6. Click the [+] button below the Blocking Selection table to add one row in the
    table then click in the line and select from the list the column you want to
    use as a blocking value, T_GEN_KEY in
    this example.

    Using a blocking value reduces the number of record pairs that need
    to be examined. The input data is partitioned into exhaustive blocks based
    on the functional key. This decreases the number of pairs to compare, as
    comparison is restricted to record pairs within each block.

  7. If required, click Edit schema to open
    the schema editor and see the schema retrieved from the previous component
    in the Job.

    use_case-multi_pass_tmatchgroup9.png
  8. Click the Advanced settings tab and
    select the Sort the output data by GID
    check box to arrange the output data by their group IDs.

Configuring the second pass

  1. Double-click the second tMatchGroup
    component labelled pass2 to display the
    Configuration Wizard.

    If this component does not have the same schema as the preceding
    component, a warning icon appears. If so, click the Sync columns button to retrieve the schema from the
    preceding one; once done, the warning icon disappears.

  2. In the Key Definition table, click
    the [+] button to add the column(s) on
    which you want to do the matching operation, lname in this scenario.

    Note

    When you select a date column on which to apply a matching algorithm,
    you can decide what part of the date format to compare.

    For example, if you want to compare only the year in the date, set the
    type of the date column to Date in the component schema and then enter
    “yyyy” in the Date Pattern field. The component then converts the date
    to a string according to the pattern defined in the schema before
    starting a string comparison.

  3. Select the Jaro-Winkler algorithm in the Matching Function column.

  4. Set Weight to 1 and
    in the Handle Null column, select the null
    operator you want to use to handle null attributes in the column, Null Match Null in this scenario.

  5. Click the [+] button below the Blocking Selection table to add one row in the table then
    click in the line and select from the list the column you want to use as a
    blocking value, T_GEN_KEY1 in this
    example.

  6. Click the Advanced settings tab and
    select the Multi-pass check box. This
    option enables tMatchGroup to receive data
    sets from the tMatchGroup that precedes it
    in the Job.

  7. In the Advanced settings view, select the
    Sort the output data by GID check box
    to arrange the output data by their group IDs.

Executing the Job and showing the results on the console

In order to show the match groups created after the first pass and compare
them with the groups created after the second pass, you must modify the Job as
follows:

use_case-multi_pass_tmatchgroup7.png
  • Use a tReplicate component to
    replicate the input flow you want to process as shown in the above
    figure. Use a copy/paste operation to create the two parts of the
    Job.

  • Keep only the first pass (a tMatchGroup component) in the upper part of the Job
    and show the match results in a tLogRow component.

  • Use two passes in the lower part of the Job and show the final
    match results in a tLogRow
    component.

  1. Double-click each of the tLogRow
    components to open the Component view and
    in the Mode area, select the Table (print values in cells of a table)
    option.

  2. Save your Job and press F6 to execute
    it.

    The results after the first pass read as follows:

    use_case-multi_pass_tmatchgroup10.png

    The results after the second pass read as follows:

    use_case-multi_pass_tmatchgroup11.png

    When you compare, for example, the customer name Wilson from the results of the two passes, you will find
    that more customers using the last name Wilson are grouped together after the second pass.

Scenario 3: Matching data through multiple passes using Map/Reduce components

Note that Talend Map/Reduce components are available only to users
who have subscribed to one of the Talend solutions with Big Data.

This scenario shows how to create a Talend Map/Reduce Job to match data by
using Map/Reduce components. It generates Map/Reduce code and runs right in
Hadoop.

use_case-mr_tmatchgroup.png

The Job in this scenario groups similar customer records by running through two
subsequent matching passes (tMatchGroup components) and
outputs the calculated matches in groups. Each pass provides its matches to the pass
that follows, so that the latter can add more matches identified with new rules and
blocking keys.

This Job is a duplication of the Standard data
integration Job described in Scenario 2: Matching customer data through multiple passes where standard components are replaced with Map/Reduce components.

You can use Talend Studio to automatically
convert the standard Job in the previous section to a Map/Reduce Job. This way, you do
not need to redefine the settings of the components in the Job.

Before starting to replicate this scenario, ensure that you have appropriate rights
and permissions to access the Hadoop distribution to be used.

Converting the Job

  1. In the Repository tree view of the Integration perspective of Talend Studio, right-click the
    Job you have created in the earlier scenario to open its contextual menu and
    select Edit properties.

    Then the [Edit properties] dialog box is
    displayed. Note that the Job must be closed before you are able to make any
    changes in this dialog box.

    This dialog box looks like the image below:

    use_case-mr_convert_job-common.png

    Note that you can change the Job name as well as the other descriptive
    information about the Job from this dialog box.

  2. Click Convert to Map/Reduce Job. Then a
    Map/Reduce Job using the same name appears under the Map/Reduce Jobs sub-node of the Job
    Design
    node.

If you need to create this Map/Reduce Job from scratch, you have to right-click the
Job Design node or the Map/Reduce Jobs sub-node and select Create
Map/Reduce Job
from the contextual menu. Then an empty Job is opened in
the workspace. For further information, see the section describing how to create a
Map/Reduce Job of the Talend Big Data Getting Started Guide.

Setting up the Job

  1. Double-click the new Map/Reduce Job to open it in the workspace.

    The Map/Reduce component Palette is
    opened. A crossed-out component, if any, indicates that it does not have the
    Map/Reduce version.

    use_case-mr_tmatchgroup2.png
  2. Delete tMysqlInput in this scenario and
    drop tRowGenerator from the Palette to the workspace. Link it to tGenKey with a Row >
    Main
    link.

  3. Double-click tRowGenerator to open its
    editor.

    use_case-mr_tmatchgroup3.png
  4. Define the schema you want to use to write data in Hadoop.

  5. Click OK to validate your schema and
    close the editor.

  6. Leave the settings of the other components as you defined initially in the
    standard version of the Job.

Setting up Hadoop connection

  1. Click Run to open its view and then click the
    Hadoop Configuration tab to display its
    view for configuring the Hadoop connection for this Job.

    This view looks like the image below:

    use_case-hadoop_config-common.png
  2. From the Property type list, select Built-in. If you have created the connection to be
    used in Repository, then select Repository so that the Studio reuses that set of
    connection information for this Job.

    For further information about how to create a Hadoop connection in
    Repository, see the chapter describing the Hadoop
    cluster
    node of the Talend Big Data Getting Started Guide.

  3. In the Version area, select the Hadoop
    distribution to be used and its version. If you cannot find from the list the
    distribution corresponding to yours, select Custom so as to connect to a Hadoop distribution not officially
    supported in the Studio.

    For a step-by-step example about how to use this Custom option, see Connecting to a custom Hadoop distribution.

    Along with the evolution of Hadoop, please note the
    following changes:

    • If you use Hortonworks Data Platform
      V2.2
      , the configuration files of your cluster might be using
      environment variables such as ${hdp.version}. If this is your situation, you need to set
      the mapreduce.application.framework.path property in the
      Hadoop properties table with the path
      value explicitly pointing to the MapReduce framework archive of your
      cluster. For
      example:
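
      mapreduce.application.framework.path=/hdp/apps/2.2.0.0-2041/mapreduce/mapreduce.tar.gz#mr-framework

      (The version segment 2.2.0.0-2041 is an illustrative value; point the
      property to the MapReduce framework archive actually deployed on your
      cluster.)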

    • If you use Hortonworks Data Platform
      V2.0.0
      , the type of the operating system for running the
      distribution and a Talend Job must be the same,
      such as Windows or Linux. Otherwise, you have to use Talend Jobserver to execute the Job in the same
      type of operating system in which the Hortonworks
      Data Platform V2.0.0
      distribution you are using is run. For
      further information about Talend Jobserver, see
      Talend
      Installation and Upgrade Guide
      .

  4. In the Name node field, enter the location of
    the master node, the NameNode, of the distribution to be used. For example,
    hdfs://tal-qa113.talend.lan:8020.

    If you are using a MapR distribution, you can simply leave maprfs:/// as it is in this field; then the MapR
    client will take care of the rest on the fly for creating the connection. The
    MapR client must be properly installed. For further information about how to set
    up a MapR client, see the following link in MapR’s documentation: http://doc.mapr.com/display/MapR/Setting+Up+the+Client

  5. In the Job tracker field, enter the location
    of the JobTracker of your distribution. For example, tal-qa114.talend.lan:8050.

    Note that the word Job in the term JobTracker refers to the MapReduce
    jobs described in Apache’s documentation on http://hadoop.apache.org/.

    If you use YARN in your Hadoop cluster such as Hortonworks Data Platform V2.0.0 or Cloudera CDH4.3 + (YARN mode), you need to specify the location
    of the Resource Manager instead of the
    Jobtracker. Then you can continue to set the following parameters depending on
    the configuration of the Hadoop cluster to be used (if you leave the check box
    of a parameter clear, then at runtime, the configuration for this parameter in
    the Hadoop cluster to be used will be ignored):

    • Select the Set resourcemanager scheduler
      address
      check box and enter the Scheduler address in
      the field that appears.

    • Select the Set jobhistory address
      check box and enter the location of the JobHistory server of the
      Hadoop cluster to be used. This allows the metrics information of
      the current Job to be stored in that JobHistory server.

    • Select the Set staging directory
      check box and enter this directory defined in your Hadoop cluster
      for temporary files created by running programs. Typically, this
      directory can be found under the yarn.app.mapreduce.am.staging-dir property in the
      configuration files such as yarn-site.xml or mapred-site.xml of your distribution.

    • Select the Use datanode hostname
      check box to allow the Job to access datanodes via their hostnames.
      This actually sets the dfs.client.use.datanode.hostname property to
      true. When connecting to a
      S3N filesystem, you must select this check box.

  6. If you are accessing the Hadoop cluster running with Kerberos security, select the
    Use Kerberos authentication check box, then enter the Kerberos principal name for the
    NameNode in the field displayed. This
    enables you to use your user name to authenticate against the credentials stored in
    Kerberos.

    In addition, since this component performs Map/Reduce computations, you also need to
    authenticate the related services such as the Job history server and the Resource manager or
    Jobtracker depending on your distribution in the corresponding field. These principals can
    be found in the configuration files of your distribution. For example, in a CDH4
    distribution, the Resource manager principal is set in the yarn-site.xml file and the Job history principal in the mapred-site.xml file.

    If you need to use a Kerberos keytab file to log in, select Use a
    keytab to authenticate
    . A keytab file contains pairs of Kerberos principals
    and encrypted keys. You need to enter the principal to be used in the Principal field and the access path to the keytab file itself in the
    Keytab field.

    Note that the user that executes a keytab-enabled Job is not necessarily the one a
    principal designates but must have the right to read the keytab file being used. For
    example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this situation, ensure that user1 has the right to read the keytab file to be used.

  7. In the User name field, enter the login user
    name for your distribution. If you leave it empty, the user name of the machine
    hosting the Studio will be used.

  8. In the Temp folder field, enter the path in
    HDFS to the folder where you store the temporary files generated during
    Map/Reduce computations.

  9. Leave the default value of the Path separator in server as
    it is, unless you have changed the separator used by your Hadoop distribution’s host machine
    for its PATH variable or in other words, that separator is not a colon (:). In that
    situation, you must change this value to the one you are using in that host.

  10. Leave the Clear temporary folder check box
    selected, unless you want to keep those temporary files.

  11. Leave the Compress intermediate map output to reduce
    network traffic
    check box selected to shorten the time needed to
    transfer the mapper task partitions to the multiple reducers.

    However, if the data transfer in the Job is negligible, it is recommended to
    clear this check box to deactivate the compression step, because this
    compression consumes extra CPU resources.

  12. If you need to use custom Hadoop properties, complete the Hadoop properties table with the property or
    properties to be customized. Then at runtime, these changes will override the
    corresponding default properties used by the Studio for its Hadoop
    engine.

    For further information about the properties required by Hadoop, see Apache’s
    Hadoop documentation on http://hadoop.apache.org, or
    the documentation of the Hadoop distribution you need to use.

  13. If the Hadoop distribution to be used is Hortonworks Data Platform V1.2 or Hortonworks
    Data Platform V1.3, you need to set proper memory allocations for the map and reduce
    computations to be performed by the Hadoop system.

    In that situation, you need to enter the values you need in the Mapred
    job map memory mb
    and the Mapred job reduce memory
    mb
    fields, respectively. By default, the values are both 1000 which are normally appropriate for running the
    computations.

    If the distribution is YARN, then the memory parameters to be set become Map (in Mb), Reduce (in Mb) and
    ApplicationMaster (in Mb), accordingly. These fields
    allow you to dynamically allocate memory to the map and the reduce computations and the
    ApplicationMaster of YARN.

For further information about this Hadoop
Configuration
tab, see the section describing how to configure the Hadoop
connection for a Talend Map/Reduce Job of the Talend Big Data Getting Started Guide.

For further information about the Resource Manager, its scheduler and the
ApplicationMaster, see YARN’s documentation such as http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/.

For further information about how to determine YARN and MapReduce memory configuration
settings, see the documentation of the distribution you are using, such as the following
link provided by Hortonworks: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html.

Executing the Job

  • Save the Job and press F6 to execute
    it.

    Match results are displayed on the studio console.

    use_case-mr_tmatchgroup4.png

    Matches are calculated by running through the two passes. Each pass
    provides its matches to the pass that follows and more matches are
    identified with the rule and blocking key of the second pass.

    More customers with similar last names are grouped together after the
    second pass.

