August 17, 2023

tMatchGroupHadoop – Docs for ESB 5.x

tMatchGroupHadoop

tMatchGroupHadoop_icon32_white.png

Warning

This component will be available in the Palette of
the studio on the condition that you have subscribed to any Talend Platform product with Big Data.

tMatchGroupHadoop properties

Component family

Data Quality

This component is deprecated as tMatchGroup can be used now in both standard and
Map/Reduce Jobs. tMatchGroupHadoop
will continue to work in Jobs you import from older releases.

Function

tMatchGroupHadoop uses the given
matching rule(s) to compare columns of the data in HDFS and groups
the duplicates it encounters accordingly in the output flow.

When several tMatchGroupHadoop
components are used sequentially, the first one creates the groups
of similar records and each following component refines the groups
it receives from the preceding one.

In defining a group, the first record processed in each group
becomes the master record of the group; the distance of every other
record from the master records is computed, and each record is then
assigned to the appropriate master record accordingly. When refining
the given groups, a group containing a single record (a group size
of one) is compared with the other master records and merged into
one of the other groups according to the recomputed distances.
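This grouping principle can be sketched as follows. This is a minimal illustration, assuming a generic similarity function and match threshold; it is not the component's actual implementation:

```python
def group_records(records, similarity, threshold=0.85):
    """Group records the way the component describes: the first record of
    each group becomes its master; later records join the first master
    they match, otherwise they start a new group."""
    masters = []   # one master record per group
    groups = {}    # master index -> list of group members
    for rec in records:
        for i, master in enumerate(masters):
            if similarity(master, rec) >= threshold:
                groups[i].append(rec)
                break
        else:
            masters.append(rec)
            groups[len(masters) - 1] = [rec]
    return masters, groups
```

For instance, with a similarity function that compares first letters, "apple" and "avocado" fall into one group and "banana" starts a second group with itself as master.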

Purpose

This component helps find similar or duplicate records in any
source data of large volume.

Basic settings

Property type

Either Built-in or Repository.

Since version 5.6, both the Built-In mode and the Repository mode are
available in any of the Talend solutions.

 

Schema and Edit
schema

A schema is a row description: it defines the number of fields to
be processed and passed on to the next component. The schema is
either Built-in or stored remotely
in the Repository.

Since version 5.6, both the Built-In mode and the Repository mode are
available in any of the Talend solutions.

Click Sync columns to retrieve
the schema from the previous component in the Job.

The output schema of this component contains the following
fields:

GID: represents the group identifier.

GRP_SIZE: counts the number of records in the
group, computed only on the master record.

MASTER: identifies the record used in the
matching comparisons. There is only one master record per group.
Each input record is compared to a master record; if they match,
the input record is placed into the group.

SCORE: measures the distance between the
input record and the master record according to the matching
algorithm used.

Matching_Distances: presents
the distance computed between a record and its master record.
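For illustration, the sketch below builds these output fields for one group. The GID format (a UUID here) and the exact values are assumptions, not the component's actual implementation:

```python
import uuid

def emit_group(members, scores):
    """Illustrative shape of the component's output fields for one group:
    GRP_SIZE is set only on the master record (the first member)."""
    gid = str(uuid.uuid4())
    rows = []
    for i, (rec, score) in enumerate(zip(members, scores)):
        rows.append({
            "record": rec,
            "GID": gid,                              # group identifier
            "GRP_SIZE": len(members) if i == 0 else 0,
            "MASTER": i == 0,                        # one master per group
            "SCORE": score,                          # distance to the master
        })
    return rows
```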

 

 

Built-in: You create and store
the schema locally for this component only. Related topic: see
Talend Studio User
Guide
.

 

 

Repository: You have already
created and stored the schema in the Repository. You can reuse it in
other projects and job designs. Related topic: see Talend Studio User Guide.

 

Link with a tMatchGroupHadoop

Select this check box if more than one tMatchGroupHadoop is used in the Job. From the
Component list, select the
relevant tMatchGroupHadoop
component to reuse the Hadoop connection details you already
defined.

 

Link with a tGenKeyHadoop

Select this check box if you need to reuse a specific connection,
created by a tGenKeyHadoop, to the
HDFS file. From the tGenKeyHadoop
list
, select the relevant tGenKeyHadoop component to reuse the Hadoop
connection details you already defined.

 

Use an existing connection

Select this check box and, in the Component List, select the
HDFS connection component from which you want to reuse the connection details already
defined.

Note

When a Job contains both a parent Job and a child Job, the Component
list
presents only the connection components at the same Job
level.

Version

Note

Unavailable if you use an existing link.

Distribution

Select the cluster you are using from the drop-down list. The options in the list vary
depending on the component you are using. Among these options, the following ones require
specific configuration:

  • If available in this Distribution drop-down list, the
    Microsoft HD Insight option allows you to use a
    Microsoft HD Insight cluster. For this purpose, you need to configure the
    connections to the WebHCat service, the HD Insight service and the Windows Azure
    Storage service of that cluster in the areas that are displayed. A demonstration
    video about how to configure this connection is available in the following link:
    https://www.youtube.com/watch?v=A3QTT6VsNoM

  • The Custom option allows you to connect to a
    cluster different from any of the distributions given in this list, that is to
    say, to connect to a cluster not officially supported by Talend.

In order to connect to a custom distribution, once you have selected Custom, click the dotbutton.png button to display the dialog box in which you can,
alternatively:

  1. Select Import from existing version to import an
    officially supported distribution as base and then add other required jar files
    which the base distribution does not provide.

  2. Select Import from zip to import a custom
    distribution zip that, for example, you can download from http://www.talendforge.org/exchange/index.php.

    Note

    In this dialog box, the active check box must be kept selected so as to import
    the jar files pertinent to the connection to be created between the custom
    distribution and this component.

    For a step-by-step example about how to connect to a custom distribution and
    share this connection, see Connecting to a custom Hadoop distribution.

 

Hadoop version

Select the version of the Hadoop distribution you are using. The available options vary
depending on the component you are using. Along with the evolution of Hadoop, please note
the following changes:

  • If you use Hortonworks Data Platform V2.2, the
    configuration files of your cluster might be using environment variables such as
    ${hdp.version}. If this is your situation, you
    need to set the mapreduce.application.framework.path property in the Hadoop properties table of this component with the path value
    explicitly pointing to the MapReduce framework archive of your cluster.

  • If you use Hortonworks Data Platform V2.0.0, the
    type of the operating system for running the distribution and a Talend
    Job must be the same, such as Windows or Linux. Otherwise, you have to use Talend
    Jobserver to execute the Job in the same type of operating system in which the
    Hortonworks Data Platform V2.0.0 distribution you
    are using is run. For further information about Talend Jobserver, see
    Talend Installation
    and Upgrade Guide
    .

 

NameNode URI

Select this check box to indicate the location of the NameNode of the Hadoop cluster to be
used. The NameNode is the master node of a Hadoop cluster. For example, if you
have chosen a machine called masternode as the NameNode
of an Apache Hadoop distribution, then the location is hdfs://masternode:portnumber.

For further information about the Hadoop Map/Reduce framework, see the Map/Reduce tutorial
in Apache’s Hadoop documentation on http://hadoop.apache.org.

 

JobTracker URI

Select this check box to indicate the location of the JobTracker service within the Hadoop
cluster to be used. For example, if you have chosen a machine called machine1 as the JobTracker, set its location as machine1:portnumber. A JobTracker is the service that assigns
Map/Reduce tasks to specific nodes in a Hadoop cluster. Note that the notion of job in the
term JobTracker does not designate a Talend Job, but rather a Hadoop job,
described as an MR or MapReduce job in Apache’s Hadoop documentation on http://hadoop.apache.org.

If you use YARN in your Hadoop cluster such as Hortonworks Data
Platform V2.0.0
or Cloudera CDH4.3 + (YARN
mode)
, you need to specify the location of the Resource
Manager
instead of the Jobtracker. Then you can continue to set the following
parameters depending on the configuration of the Hadoop cluster to be used (if you leave the
check box of a parameter clear, then at runtime, the configuration of this parameter in
the Hadoop cluster to be used will be ignored):

  1. Select the Set resourcemanager scheduler
    address
    check box and enter the Scheduler address in the field
    that appears.

  2. Allocate proper memory volumes to the Map and
    the Reduce computations and the ApplicationMaster of YARN by selecting the Set memory check box in the Advanced settings view.

  3. Select the Set jobhistory address check box
    and enter the location of the JobHistory server of the Hadoop cluster to be
    used. This allows the metrics information of the current Job to be stored in
    that JobHistory server.

  4. Select the Set staging directory check box
    and enter the directory defined in your Hadoop cluster for temporary files
    created by running programs. Typically, this directory can be found under the
    yarn.app.mapreduce.am.staging-dir
    property in the configuration files such as yarn-site.xml or mapred-site.xml of your distribution.

  5. Select the Set Hadoop user check box and
    enter the user name under which you want to execute the Job. Since a file or a
    directory in Hadoop has its specific owner with appropriate read or write
    rights, this field allows you to execute the Job directly under the user name
    that has the appropriate rights to access the file or directory to be
    processed.

  6. Select the Use datanode hostname check box to
    allow the Job to access datanodes via their hostnames. This actually sets the
    dfs.client.use.datanode.hostname property
    to true.

For further information about these parameters, see the documentation or
contact the administrator of the Hadoop cluster to be used.

For further information about the Hadoop Map/Reduce framework, see the Map/Reduce tutorial
in Apache’s Hadoop documentation on http://hadoop.apache.org.

 

User name

Enter the user name for Hadoop authentication.

For some Hadoop versions, you also need to enter the name of the
supergroup to which the user belongs in the Group field that is displayed.

  HDFS directory

Enter the HDFS directory where the data to be processed is.

At runtime, this component will clean up any existing data in this
directory, write the input data into it and perform the
operations.

If you want to reuse the existing data in HDFS, use:

  • Link with a tGenKeyHadoop

  • Link with a
    tMatchGroupHadoop

  • Use existing HDFS file

 

match_rule_import_icon.png

Click the import icon to select a match rule from the Studio
repository.

When you click the import icon, a [Match
Rule Selector]
wizard is opened to help you import
match rules from the Studio repository and use them in your
Job.

You can only import rules created with the VSR algorithm. For
further information, see Importing match rules from the studio repository.
Key Definition

Input Key Attribute

Select the column(s) from the input flow on which you want to
apply a matching algorithm.

Note

When you select a date column on which to apply an algorithm or a matching algorithm,
you can decide what to compare in the date format.

For example, if you want to only compare the year in the date, set the type of
the date column to Date in the component schema and then enter
"yyyy" in the Date
Pattern
field. The component then converts the date format to a string
according to the pattern defined in the schema before starting a string
comparison.
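To illustrate, here is a sketch in Python, where strftime's "%Y" stands in for the Java-style "yyyy" pattern used in the schema:

```python
from datetime import date

def date_key(d, pattern="%Y"):
    # Convert the date to a string with the given pattern before comparing,
    # mirroring how the component turns a Date column into a string
    # according to the schema's Date Pattern prior to a string comparison.
    return d.strftime(pattern)

# Two dates from different days compare as equal when only the year is kept:
assert date_key(date(1998, 3, 1)) == date_key(date(1998, 11, 30))
```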

 

Matching Function

Select the relevant matching algorithm from the list:

Exact Match: matches each
processed entry to all possible reference entries with exactly the
same value. It returns 1 when the two strings
exactly match, and 0 otherwise.

Exact – ignore case: matches each
processed entry to all possible reference entries with exactly the
same value while ignoring the value case.

Soundex: matches processed
entries according to a standard English phonetic algorithm. It
indexes strings by sound, as pronounced in English, for example
“Hello”: “H400”.

Levenshtein (edit distance):
calculates the minimum number of edits (insertion, deletion or
substitution) required to transform one string into another. Using
this algorithm in the tMatchGroupHadoop component, you do not need to
specify a maximum distance. The component automatically calculates a
matching percentage based on the distance. This matching score will
be used for the global matching calculation, based on the weight you
assign in the Confidence Weight
field.
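For illustration, here is a textbook Levenshtein implementation together with one plausible way to derive a matching percentage from the distance; the component's exact normalisation formula is not documented here, so treat levenshtein_score as an assumption:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions or substitutions needed
    to turn string a into string b (classic dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def levenshtein_score(a, b):
    """One plausible matching percentage: 1.0 for identical strings,
    approaching 0.0 as the edit distance grows."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest
```

For example, levenshtein("kitten", "sitting") is 3 (two substitutions and one insertion).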

Metaphone: Based on a phonetic
algorithm for indexing entries by their pronunciation. It first
loads the phonetics of all entries of the lookup reference and
checks all entries of the main flow against the entries of the
reference flow.

Double Metaphone: a newer version
of the Metaphone phonetic algorithm, which produces more accurate
results than the original algorithm. It can return both a primary
and a secondary code for a string. This accounts for some ambiguous
cases as well as for multiple variants of surnames with common
ancestry.

Soundex FR: matches processed
entries according to a standard French phonetic algorithm.

Jaro: matches processed entries
according to spelling deviations. It counts the number of matched
characters between two strings. The higher the distance is, the more
similar the strings are.

Jaro-Winkler: a variant of Jaro,
but it gives more importance to the beginning of the string.

q-grams: matches processed
entries by dividing strings into letter blocks of length
q in order to create a number of q
length grams. The matching result is given as the number of
q-gram matches over possible q-grams.
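A minimal sketch of q-gram matching follows; the exact normalisation used by the component may differ, so the score formula here is an assumption:

```python
def qgrams(s, q=3):
    """Split a string into overlapping letter blocks of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def qgram_score(a, b, q=3):
    """Number of shared q-grams over the number of possible q-grams."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    longest = max(len(ga), len(gb))
    if longest == 0:
        return 1.0  # both strings shorter than q
    shared = sum(1 for g in ga if g in gb)
    return shared / longest
```

For example, "hello" yields the 3-grams "hel", "ell" and "llo".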

custom…: enables you to load an
external matching algorithm from a Java library. The custom matcher class column alongside is
activated when you select this option.

For further information about how to load an external Java
library, see tLibraryLoad.

For further information about how to create a custom matching
algorithm, see Creating a custom matching algorithm.

For the related scenario about how to use a custom matching
algorithm, see Scenario 2: Using a custom matching algorithm to match entries.

  Custom matcher

Type in the path pointing to the custom class (the external matching
algorithm) you need to use. This path corresponds to the location of
the class inside the library file (.jar file).

For example, to use a MyDistance.class class
stored in the directory org/talend/mydistance
in a user-defined mydistance.jar library, the
path to be entered is
org.talend.mydistance.MyDistance.

 

Weight

Set a numerical weight for each attribute (column) of the key
definition. The values can be anything >= 0.

Note

If you have more than one tMatchGroupHadoop component in your Job, the
distances and the matching scores may not be coherent when the
criteria you set to compute the distance are different in each
component.
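For illustration, here is a weighted combination of per-column matching scores of the kind these weights feed into. The exact formula the component uses is an assumption; a weighted average is shown:

```python
def global_score(scores, weights):
    """Combine per-column matching scores into one global score using the
    Confidence Weight of each key attribute (weighted average sketch)."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(s * w for s, w in zip(scores, weights)) / total
```

For example, scores [1.0, 0.5] with weights [3, 1] give a global score of 0.875.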

 

Handle Null

Handle Null

To handle null values, select from the list the null operator you
want to use on the column:

Null Match Null: a Null attribute
only matches another Null attribute.

Null Match None: a Null attribute
never matches another attribute.

Null Match All: a Null attribute
matches any other value of an attribute.

For example, suppose we have two columns, name and
firstname, where the name is never null but
the first name can be null.

If we have two records:

“Doe”, “John”

“Doe”, “”

Depending on the operator you choose, these two records may or may
not match:

Null Match Null: they do not
match.

Null Match None: they do not
match.

Null Match All: they
match.

And for the records:

“Doe”, “”

“Doe”, “”

Null Match Null: they
match.

Null Match None: they do not
match.

Null Match All: they
match.
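The semantics of the three operators can be sketched as follows; this is an illustration of the behaviour described above, not the component's code:

```python
def null_match(a, b, operator):
    """Apply one of the three null operators. Returns True or False when
    at least one value is null, or None when neither value is null and
    the regular matching algorithm should decide."""
    a_null = a in (None, "")
    b_null = b in (None, "")
    if not a_null and not b_null:
        return None                     # defer to the matching algorithm
    if operator == "Null Match Null":
        return a_null and b_null        # null only matches null
    if operator == "Null Match None":
        return False                    # null never matches
    if operator == "Null Match All":
        return True                     # null matches anything
    raise ValueError(operator)
```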

Blocking Definition

Input Column

If required, select the column(s) from the input flow to be used as
the blocking key(s), by which you want to partition the processed
data into blocks; this is usually referred to as “blocking”.

Blocking reduces the number of pairs of records that need to be
examined. In blocking, input data is partitioned into exhaustive
blocks designed to increase the proportion of matches observed while
decreasing the number of pairs to compare. Comparisons are
restricted to record pairs within each block.

Note

Using blocking column(s) is very useful when you are
processing very big data.
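The blocking idea can be sketched as follows; this is illustrative only, and blocking_key stands in for the functional key generated by a component such as tGenKeyHadoop:

```python
from collections import defaultdict

def block(records, blocking_key):
    """Partition records into blocks on a blocking key so comparisons
    are restricted to record pairs within each block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    return blocks

def pairs_saved(records, blocking_key):
    """How many candidate pairs blocking removes versus comparing all pairs."""
    n = len(records)
    all_pairs = n * (n - 1) // 2
    blocked = sum(len(b) * (len(b) - 1) // 2
                  for b in block(records, blocking_key).values())
    return all_pairs - blocked
```

For example, blocking three surnames on their first letter leaves only one pair to compare instead of three.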

  Advanced settings

Matching Algorithm

Select an algorithm from the list – only one is available for the
time being.

Simple VSR Matcher: This
algorithm is based on a Vector Space Retrieval method that specifies
how two records may match.

 

Match threshold

Enter the match probability. Two data records match when the
probability is above the set value.

 

Sort the output data by GID

Select this check box to group the output data by the group
ID.

 

Output distance details

Select this check box to fill the fixed output column
MATCHING_DISTANCES with the details of the
distance between each column. This check box becomes unavailable
when you select the Link with a
tMatchGroupHadoop
check box.

Hadoop Properties

Note

Unavailable if you use an existing link.

Property

Talend Studio uses a default configuration for its engine to perform
operations in a Hadoop distribution. If you need to use a custom configuration in a specific
situation, complete this table with the property or properties to be customized. Then at
runtime, the customized property or properties will override those default ones.

  • Note that if you are using the centrally stored metadata from the Repository, this table automatically inherits the
    properties defined in that metadata and becomes uneditable unless you change the
    Property type from Repository to Built-in.

For further information about the properties required by Hadoop and its related systems such
as HDFS and Hive, see the documentation of the Hadoop distribution you
are using or see Apache’s Hadoop documentation on http://hadoop.apache.org/docs and then select the version of the documentation you want.

  Value

Type in the custom property values used to connect to the Hadoop
of interest.

 

Keep data in Hadoop

Select this check box to keep the data processed by this component
in the HDFS file.

If you leave this check box unselected, the component processes the
data, then retrieves it from the HDFS file and outputs it into the
Job flow.

 

Use existing HDFS file

Select this check box to enable the component to directly process
the data in an HDFS file. When this check box is selected, this
component can act as a start component in your Job.

HDFS file URI: set the URI of the
HDFS file holding the data you want to process.

Field delimiter: set the
character used as a field delimiter in the HDFS file.

If you leave this check box unselected, tMatchGroupHadoop receives the data flow to be
processed and loads it into an HDFS file.

 

tStatCatcher Statistics

Select this check box to collect log data at the component
level.

Dynamic settings

Click the [+] button to add a row
in the table and fill the Code
field with a context variable to choose your HDFS connection
dynamically from multiple connections planned in your Job.

The Dynamic settings table is available only when the
Use an existing connection check box is selected in the
Basic settings view. Once a dynamic parameter is
defined, the Component List box in the Basic settings view becomes unusable.

For more information on Dynamic settings and context
variables, see Talend Studio User Guide.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl +
Space
to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio
User Guide.

Usage

This component can be a start or an intermediary step. It requires
an input flow when used as an intermediary step, and an output flow
in either case. It needs a connection to Hadoop to process large
volumes of data.

It is ideally used alongside the tGenKey component or the
tGenKeyHadoop
component in order to use the blocking
columns, provided by either of the components, to gain a better
performance. In practice, we recommend placing the most restrictive
blocking criteria in the first tGenKey or the first tGenKeyHadoop component in use.

Limitation/prerequisite

You need to use Linux to execute the Job containing this
component.

Working principle

This component implements the MapReduce model, based on the blocking keys defined
in the Blocking definition table of the Basic settings view.

mapreduce.png

This implementation proceeds as follows:

  1. Splits the input rows into groups of a given size.

  2. Implements a Map Class that creates a map between each key and a list of
    records.

  3. Shuffles the records to group those with the same key together.

  4. Applies, on each key, the algorithm defined in the Key definition table of the Basic
    settings
    view.

    Then, accordingly, this component reads the records, compares them with the
    master records, groups the similar ones, and makes each remaining record a
    master record.

  5. Outputs the groups of similar records with their group IDs, group sizes,
    matching distances and scores.
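The steps above can be sketched as a map/shuffle/reduce pipeline. This is an illustration of the working principle, not the component's actual code; the match callback stands in for the Key definition algorithms:

```python
from collections import defaultdict

def map_phase(rows, key_of):
    """Map: emit a (blocking key, record) pair for every input row."""
    return [(key_of(row), row) for row in rows]

def shuffle(pairs):
    """Shuffle: group the records sharing the same key together."""
    grouped = defaultdict(list)
    for key, row in pairs:
        grouped[key].append(row)
    return grouped

def reduce_phase(grouped, match):
    """Reduce: inside each block, apply the matching step on the records."""
    return {key: match(rows) for key, rows in grouped.items()}
```

For example, mapping three surnames on their first letter and reducing with len counts the records per block.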

Scenario: Running customer matching through multiple passes

The Job in this scenario connects to the given Hadoop system, groups similar customer
records by running through two subsequent matching passes in HDFS and outputs the
calculated matches by groups. Each pass provides its matches to the pass that follows in
order for the latter to add more matches identified with new rules.

tmatchgroupHadoop-scenario.png

The components of interest are:

  • tFixedFlowInput: it provides the customer
    records to be processed.

  • two tGenKeyHadoop components: Each defines a
    way to partition the records…

  • two tMatchGroupHadoop components: they
    process each partition to group the records within and once configured, sort the
    groups by their group IDs. The first tMatchGroupHadoop processes the partitions defined by the first
    tGenKeyHadoop and the second tMatchGroupHadoop processes those defined by the
    second tGenKeyHadoop. Each of them sets one pass
    to match the received records.

    Warning

    The two tMatchGroupHadoop components
    must have the same schema.

  • two tLogRow components: they present the
    execution result of each tMatchGroupHadoop
    component.

To replicate this scenario, proceed as the following sections illustrate.

Dropping and linking the components

  1. Drop tFixedFlowInput, two tGenKeyHadoop components, two tMatchGroupHadoop components and two tLogRow components from Palette onto the design workspace.

    Note

    A component used in the workspace can be labelled the way you need. In
    this scenario, the input tFixedFlowInput component is labelled
    incoming_customers. For further information about how to
    label a component, see Talend Studio
    User Guide.
  2. Right-click tFixedFlowInput to open its
    contextual menu and select the Row >
    Main link from this menu to connect
    this component to the first tGenKeyHadoop
    (labeled lname_postcode).

  3. Do the same to create the Main link from
    the first tGenKeyHadoop to the second
    tGenKeyHadoop (labeled lname_initial), to the first tMatchGroupHadoop, to the first tLogRow and then to the second tMatchGroupHadoop and finally to the second
    tLogRow.

The components to be used in this scenario are all placed and linked. You then
need to continue configuring them successively.

Configuring the input flow

Setting up the input data

  1. Double-click tFixedFlowInput to open its
    Component view.

    tmatchgroupHadoop-inputflow.png
  2. Click the three-dot button next to Edit
    schema
    to open the schema editor.

    tmatchgroupHadoop-input_schema.png
  3. Click the plus button eight times to add eight rows. They are the eight
    columns of the schema of the input data.

  4. Rename these eight rows respectively. In this example, they are: account_num, lname, fname, mi, address1, city, state_province, postal_code.

  5. In the Type column, select the data types
    for the rows of interest. In this example, select Long for the account_num
    column.

  6. Click OK to validate these changes and
    accept the propagation prompted by the pop-up dialog box.

  7. In the Mode area of the Basic settings view, select Use Inline Content (delimited file) to enter the input data
    of interest.

  8. In the Content field, enter the input
    data to be processed, or paste the sample data provided by the demo Job
    D4_hadoop_group_family_multipass that
    you could import along with the demo DQ project. For further information
    about how to import a project, see Talend Studio
    User Guide.

Configuring the key generation for the first pass

  1. Double-click the first tGenKeyHadoop
    (labelled lname_postcode) to open the
    Component view.

    tmatchgrouphadoop-genkey1.png
  2. Configure the connection to the HDFS you want to write and process the
    records in.

    The parameters to be set are the Distribution, the Hadoop
    version
    , the NameNode URI,
    the HDFS directory, User name and the Jobtracker
    URI
    .

    In the HDFS directory you define, this component will create, at runtime,
    the folders storing separately the input records and the same records with
    their partition keys: the original records in an in folder and the partitioned ones in an out folder, both under the same parent folder
    tGenKeyHadoop_1.
  3. Click match_rule_import_icon.png and import blocking keys from the match rules created
    and tested in the Profiling perspective
    of Talend Studio and use them in your Job. Otherwise,
    define the blocking key parameters as described in the steps below.

  4. Under the Algorithm table, click the plus
    button to add two rows in this table.

  5. On the column column, click the newly
    added row and select, from the list, the column you want to process using an
    algorithm. In this example, select lname.

  6. Do the same on the second row to select
    postal_code.

  7. On the pre-algorithm column, click the
    newly added row and select, from the list, the pre-algorithm you want to
    apply to the corresponding column. In this example, select remove diacritical marks and upper case to remove
    any diacritical marks and convert the fields of the lname column to upper case before generating the code of
    this column.

    Note

    This conversion does not change your raw data.

  8. On the algorithm column, click the newly
    added row and select from the list the algorithm you want to apply to the
    corresponding column. In this example, select N first
    characters of each word
    .

  9. Do the same for the second row on the algorithm column to select first N
    characters of the string
    .

  10. Click in the Value column next to the
    algorithm column and enter the value
    for the selected algorithm, when needed. In this scenario, type in
    1 for both of the rows, meaning that the first
    letter of each field in the corresponding columns will be used to generate
    the keys.

  11. Click Advanced settings to open its view.

    tmatchgroupHadoop-genkey1-advanced_settings.png
  12. Select the Keep data in Hadoop check box
    in order to process the data in HDFS. In this situation, the customer records
    are not output into the data flow and the process runs faster.

Configuring the key generation for the second pass

  1. Double-click the second tGenKeyHadoop
    (labelled lname_initial) to open the
    Component view.

    tmatchgrouphadoop-genkey2.png
  2. Select Link with a tGenKeyHadoop to reuse
    the connection to HDFS and the HDFS directory created by the first tGenKeyHadoop. This reuse enables this component
    to read the data processed by its preceding tGenKeyHadoop component natively in HDFS.

  3. Click the dotbutton.png button to verify the key column in the schema. You can
    see that the key columns of the two tGenKeyHadoop components have been named automatically to
    differentiate them from each other. In this scenario, they are T_GEN_KEY_postcode and T_GEN_KEY.

    tmatchgroupHadoop-genkeyschema.png
  4. Click match_rule_import_icon.png and import blocking keys from the match rules created
    and tested in the Profiling perspective
    of Talend Studio and use them in your Job. Otherwise,
    define the blocking key parameters as described in the steps below.

  5. Under the Algorithm table, click the plus
    button to add one row in this table.

  6. On the column column, click the newly
    added row and select from the list the column you want to process using an
    algorithm. In this example, select account_num.

  7. On the algorithm column, click the newly
    added row and select from the list the algorithm you want to apply to the
    corresponding column. In this example, select first N
    characters of the string
    .

  8. Click in the Value column next to the
    algorithm column and enter the value
    for the selected algorithm, when needed. In this scenario, type in
    1, meaning that the first letter of each field in
    the corresponding column will be used to generate the required keys.

  9. Click Advanced settings to open its
    view.

  10. Select the Keep data in Hadoop check box
    in order to process data in HDFS.

Configuring the two passes

You need to configure the two passes to group the input data with the help of the
two columns of generated keys.

Configuring the first pass

  1. Double-click the first tMatchGroupHadoop
    component (labelled pass1) to display the
    Component view.

    tmatchgrouphadoop-pass1.png
  2. If required, click Sync schema, then
    click Edit schema to open the schema editor
    and see the schema retrieved from the previous component in the Job.

    tmatchgroupHadoop-pass1_schema.png
  3. Select Link with a tGenKeyHadoop to reuse
    the connection to HDFS and the HDFS directory created by its preceding
    tGenKeyHadoop. This reuse enables this
    component to read the data processed by that tGenKeyHadoop component
    natively in HDFS.

  4. Click match_rule_import_icon.png and import matching keys from the match rules created
    and tested in the Profiling perspective
    of Talend Studio and use them in your Job. Otherwise,
    define the matching key parameters as described in the steps below.

  5. In the Key definition table, click plus_button.png to add to the list the columns on which you want to do
    the matching operation, lname in this
    scenario.

    Note

    When you select a date column on which to apply an algorithm or a matching algorithm,
    you can decide what to compare in the date format.

    For example, if you want to only compare the year in the date, set the
    type of the date column to Date in the component schema and then enter
    "yyyy" in the Date Pattern field. The component then converts the date
    format to a string according to the pattern defined in the schema before
    starting a string comparison.

  6. Click in the first and second cells of the Matching
    type
    column and select from the list the method(s) to be used
    for the matching operation, Jaro-Winkler in this
    example.

  7. Click in the cell of the Handle Null
    column and select the null operator you want to use to handle null
    attributes in the column.

  8. Click the plus button below the Blocking
    Definition
    table to add a row, then click in
    the row and select from the list the column you want to use as a blocking
    value, T_GEN_KEY_postcode in this
    example.

    Using a blocking value reduces the number of record pairs that need to be
    examined: the input data is partitioned into exhaustive blocks based on the
    functional key, and comparison is restricted to record pairs within each
    block.
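
    The blocking idea above can be sketched as follows. This is a simplified
    illustration of the principle, not the component's actual code; the record
    dictionaries and field names are invented for the example:

    ```python
    from collections import defaultdict
    from itertools import combinations

    def candidate_pairs(records, block_key):
        # Partition the records by the blocking key; candidate pairs are
        # generated only within each block, never across blocks.
        blocks = defaultdict(list)
        for rec in records:
            blocks[rec[block_key]].append(rec)
        for block in blocks.values():
            yield from combinations(block, 2)

    records = [
        {"lname": "Alexander", "T_GEN_KEY_postcode": "A9"},
        {"lname": "Alexandre", "T_GEN_KEY_postcode": "A9"},
        {"lname": "Alexander", "T_GEN_KEY_postcode": "A3"},
    ]
    # Only the two A9 records form a candidate pair: 1 pair instead of 3.
    print(len(list(candidate_pairs(records, "T_GEN_KEY_postcode"))))  # 1
    ```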

  9. Click Advanced settings to open its view
    and verify that the Keep data in Hadoop check
    box is cleared. This way, the processed customer records are output into the
    data flow and thus become available to tLogRow.

    tmatchgroupHadoop-pass1-advanced_settings.png
  10. Select the Sort the output data by GID
    check box to arrange the output data by their group IDs.

Configuring the second pass

  1. Double-click the second tMatchGroupHadoop
    component (labelled pass2) to display the
    Component view.

    tmatchgrouphadoop-pass2.png
  2. If this component does not have the same schema as the preceding
    component, a warning icon appears. In that case, click the Sync columns button to retrieve the schema from
    the preceding component; the warning icon then disappears.

  3. Select the Link with a tMatchGroupHadoop
    check box to reuse the connection of its preceding tMatchGroupHadoop to the Hadoop system.

  4. Click match_rule_import_icon.png to import matching keys from the match rules created
    and tested in the Profiling perspective
    of Talend Studio and use them in your Job. Otherwise,
    define the matching key parameters as described in the steps below.

  5. In the Key definition table, click the
    plus button to add to the list the columns on which you want to do the
    matching operation, lname in this
    scenario.

    Note

    When you select a date column on which to apply an algorithm or a matching algorithm,
    you can decide what to compare in the date format.

    For example, to compare only the year of the date, set the type of the date
    column to Date in the component schema and enter
    "yyyy" in the Date
    Pattern
    field. The component then converts the date to a string
    according to the pattern defined in the schema before starting a string
    comparison.

  6. Click in the cell of the Matching type
    column and select from the list the method(s) to be used for the matching
    operation, Jaro-Winkler in this
    example.

  7. Click plus_button.png below the Blocking
    Definition
    table to add a row, then click in
    the row and select from the list the column you want to use as a blocking
    value, T_GEN_KEY in this example. This
    way, the matching operation is performed only between the master records
    that share the same key, in this example, the same initial character of the
    account numbers.

    Through this pass, the Job processes the matching groups provided by the
    previous tMatchGroupHadoop: it selects the
    groups containing a single record and compares them with the other master
    records that share the same generated key.
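
    The refinement performed by the second pass can be sketched as follows. This is
    a simplified illustration of the merging logic, not Talend's actual
    implementation; in particular, it omits the blocking step (the real pass only
    compares records that share the same generated key), and the similarity
    function here is a toy stand-in:

    ```python
    def refine_groups(groups, similar, threshold=0.9):
        """Merge single-record groups into the best-matching larger group."""
        singletons = [g for g in groups if len(g["members"]) == 1]
        merged = [g for g in groups if len(g["members"]) > 1]
        for s in singletons:
            # Compare the singleton's master against the other masters.
            best = max(merged, key=lambda g: similar(s["master"], g["master"]),
                       default=None)
            if best is not None and similar(s["master"], best["master"]) >= threshold:
                best["members"].extend(s["members"])  # absorbed into that group
            else:
                merged.append(s)                      # stays as its own group
        return merged

    # Toy similarity for illustration: exact match on the last name token.
    def same_lname(a, b):
        return 1.0 if a.split()[-1] == b.split()[-1] else 0.0

    groups = [
        {"master": "Jeremy Alexander",
         "members": ["Jeremy Alexander", "J. Alexander"]},
        {"master": "Bob Alexander", "members": ["Bob Alexander"]},
        {"master": "Maxine Alexander", "members": ["Maxine Alexander"]},
    ]
    result = refine_groups(groups, same_lname)
    print(len(result), len(result[0]["members"]))  # 1 4
    ```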

  8. Click Advanced settings to open its view
    and verify that the Keep data in Hadoop check
    box is cleared. This way, the processed customer records are output into the
    data flow and thus become available to tLogRow.

  9. Select the Sort the output data by GID
    check box to arrange the output data by their group IDs.

Sorting the input records

  1. Double-click tSortRow to open its
    Component view.

    tmatchgrouphadoop-sortrow.png
  2. Under the Criteria table, click the plus
    button twice to add two rows.

  3. In the first row, select GID in the
    Schema column column, alpha in the Sort num or
    alpha
    column, and asc in
    the Order asc or desc column. This means
    that the sorting is performed on the GID
    column of the input schema in ascending alphabetical order, and the other
    columns are reordered accordingly.

  4. Do the same in the second row, selecting GRP_SIZE,
    num and desc respectively.
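
The two sort criteria above can be sketched in Python. This is an illustration of
the sort order only, not tSortRow's code; the rows and values are invented for the
example:

```python
rows = [
    {"GID": "g2", "GRP_SIZE": 1, "lname": "Smith"},
    {"GID": "g1", "GRP_SIZE": 3, "lname": "Alexander"},
    {"GID": "g1", "GRP_SIZE": 3, "lname": "Alexandre"},
]
# GID ascending alphabetically, then GRP_SIZE descending numerically;
# negating the numeric key reverses its order within one sort call.
rows.sort(key=lambda r: (r["GID"], -r["GRP_SIZE"]))
print([r["lname"] for r in rows])  # ['Alexander', 'Alexandre', 'Smith']
```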

Then you can run this Job.

The tLogRow component is used to present the
execution result of the Job.

  1. To configure the presentation mode, double-click the tLogRow component to open its
    Component view, then, in the Mode area, select the Table (print values in cells of a table) option.

  2. Press F6 to run this Job.

Once done, the Run view is opened automatically,
where you can check the execution result.

The result after the first pass reads as follows:

tmatchgrouphadoop_multipass1.png

The result after the second pass reads as follows:

tmatchgrouphadoop_multipass2.png

Note

Due to space constraints, the results are not shown in full.

Comparing, for example, the customer name Alexander across the results of the two passes, you will find that
more customers with the last name Alexander are
grouped together after the second pass:

  • In the first pass, Jeremy Alexander,
    Bob Alexander and Maxine Alexander are not placed in the
    same group because matching is performed only within each block defined
    in the T_GEN_KEY_postcode column, and
    they belong to different blocks, A9, A8 and A3 respectively.

  • In the second pass, matching is performed using the blocks defined
    in the T_GEN_KEY column. Since all
    three customer names belong to block 2,
    they are grouped together after the distances between them are computed. In
    addition, the MASTER column shows that Jeremy
    Alexander
    is the master record of its group.


Document retrieved from Talend: https://help.talend.com