August 17, 2023

tMatchGroupHadoop – Docs for ESB 5.x

tMatchGroupHadoop

tMatchGroupHadoop_icon32_white.png

Warning

This component will be available in the Palette of
the studio on the condition that you have subscribed to any Talend Platform product with Big Data.

tMatchGroupHadoop properties

Component family

Data Quality

This component is deprecated as tMatchGroup can be used now in both standard and
Map/Reduce Jobs. tMatchGroupHadoop
will continue to work in Jobs you import from older releases.

Function

tMatchGroupHadoop uses the given
matching rule(s) to compare columns of the data in HDFS and groups
the duplicates it encounters accordingly in the output flow.

When several tMatchGroupHadoop
components are used sequentially, the first one creates the groups
of similar records and each following component refines the groups
it receives from the preceding one.

In defining a group, the first record processed in each group
becomes the master record of the group; the distance of every other
record from the master records is computed, and each record is then
assigned to the appropriate master record accordingly. When refining
the given groups, a group containing a single record (a group size
of one) is compared with the other master records and merged into
one of the other groups according to the recomputed distances.
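This grouping principle can be sketched as follows. This is a minimal illustration, assuming a generic similarity function and match threshold; it is not the component's actual implementation:

```python
def group_records(records, similarity, threshold=0.85):
    """Group records the way the component describes: the first record of
    each group becomes its master; later records join the first master
    they match, otherwise they start a new group."""
    masters = []   # one master record per group
    groups = {}    # master index -> list of group members
    for rec in records:
        for i, master in enumerate(masters):
            if similarity(master, rec) >= threshold:
                groups[i].append(rec)
                break
        else:
            masters.append(rec)
            groups[len(masters) - 1] = [rec]
    return masters, groups
```

For instance, with a similarity function that compares first letters, "apple" and "avocado" fall into one group and "banana" starts a second group with itself as master.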

Purpose

This component helps find similar or duplicate records in any
source data of large volume.

Basic settings

Property type

Either Built-in or Repository.

Since version 5.6, both the Built-In mode and the Repository mode are
available in any of the Talend solutions.

 

Schema and Edit
schema

A schema is a row description: it defines the number of fields to
be processed and passed on to the next component. The schema is
either Built-in or stored remotely
in the Repository.

Since version 5.6, both the Built-In mode and the Repository mode are
available in any of the Talend solutions.

Click Sync columns to retrieve
the schema from the previous component in the Job.

The output schema of this component contains the following
fields:

GID: represents the group identifier.

GRP_SIZE: counts the number of records in the
group, computed only on the master record.

MASTER: identifies the record used in the
matching comparisons. There is only one master record per group.
Each input record is compared to a master record; if they match,
the input record is placed into the group.

SCORE: measures the distance between the
input record and the master record according to the matching
algorithm used.

Matching_Distances: presents
the distance computed between a record and its master record.
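For illustration, the sketch below builds these output fields for one group. The GID format (a UUID here) and the exact values are assumptions, not the component's actual implementation:

```python
import uuid

def emit_group(members, scores):
    """Illustrative shape of the component's output fields for one group:
    GRP_SIZE is set only on the master record (the first member)."""
    gid = str(uuid.uuid4())
    rows = []
    for i, (rec, score) in enumerate(zip(members, scores)):
        rows.append({
            "record": rec,
            "GID": gid,                              # group identifier
            "GRP_SIZE": len(members) if i == 0 else 0,
            "MASTER": i == 0,                        # one master per group
            "SCORE": score,                          # distance to the master
        })
    return rows
```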

 

 

Built-in: You create and store
the schema locally for this component only. Related topic: see
Talend Studio User
Guide
.

 

 

Repository: You have already
created and stored the schema in the Repository. You can reuse it in
other projects and job designs. Related topic: see Talend Studio User Guide.

 

Link with a tMatchGroupHadoop

Select this check box if more than one tMatchGroupHadoop is used in the Job. From the
Component list, select the
relevant tMatchGroupHadoop
component to reuse the Hadoop connection details you already
defined.

 

Link with a tGenKeyHadoop

Select this check box if you need to reuse a specific connection,
created by a tGenKeyHadoop, to the
HDFS file. From the tGenKeyHadoop
list
, select the relevant tGenKeyHadoop component to reuse the Hadoop
connection details you already defined.

 

Use an existing connection

Select this check box and, in the Component List, select the
HDFS connection component from which you want to reuse the connection details already
defined.

Note

When a Job contains both a parent Job and a child Job, the Component
list
presents only the connection components at the same Job
level.

Version

Note

Unavailable if you use an existing link.

Distribution

Select the cluster you are using from the drop-down list. The options in the list vary
depending on the component you are using. Among these options, the following ones require
specific configuration:

  • If available in this Distribution drop-down list, the
    Microsoft HD Insight option allows you to use a
    Microsoft HD Insight cluster. For this purpose, you need to configure the
    connections to the WebHCat service, the HD Insight service and the Windows Azure
    Storage service of that cluster in the areas that are displayed. A demonstration
    video about how to configure this connection is available in the following link:
    https://www.youtube.com/watch?v=A3QTT6VsNoM

  • The Custom option allows you to connect to a
    cluster different from any of the distributions given in this list, that is to
    say, to connect to a cluster not officially supported by Talend.

In order to connect to a custom distribution, once you have selected Custom, click the dotbutton.png button to display the dialog box in which you can,
alternatively:

  1. Select Import from existing version to import an
    officially supported distribution as base and then add other required jar files
    which the base distribution does not provide.

  2. Select Import from zip to import a custom
    distribution zip that, for example, you can download from http://www.talendforge.org/exchange/index.php.

    Note

    In this dialog box, the active check box must be kept selected so as to import
    the jar files pertinent to the connection to be created between the custom
    distribution and this component.

    For a step-by-step example about how to connect to a custom distribution and
    share this connection, see Connecting to a custom Hadoop distribution.

 

Hadoop version

Select the version of the Hadoop distribution you are using. The available options vary
depending on the component you are using. Along with the evolution of Hadoop, please note
the following changes:

  • If you use Hortonworks Data Platform V2.2, the
    configuration files of your cluster might be using environment variables such as
    ${hdp.version}. If this is your situation, you
    need to set the mapreduce.application.framework.path property in the Hadoop properties table of this component with the path value
    explicitly pointing to the MapReduce framework archive of your cluster.

  • If you use Hortonworks Data Platform V2.0.0, the
    type of the operating system for running the distribution and a Talend
    Job must be the same, such as Windows or Linux. Otherwise, you have to use Talend
    Jobserver to execute the Job in the same type of operating system in which the
    Hortonworks Data Platform V2.0.0 distribution you
    are using is run. For further information about Talend Jobserver, see
    Talend Installation
    and Upgrade Guide
    .

 

NameNode URI

Select this check box to indicate the location of the NameNode of the Hadoop cluster to be
used. The NameNode is the master node of a Hadoop cluster. For example, if you
have chosen a machine called masternode as the NameNode
of an Apache Hadoop distribution, then the location is hdfs://masternode:portnumber.

For further information about the Hadoop Map/Reduce framework, see the Map/Reduce tutorial
in Apache’s Hadoop documentation on http://hadoop.apache.org.

 

JobTracker URI

Select this check box to indicate the location of the JobTracker service within the Hadoop
cluster to be used. For example, if you have chosen a machine called machine1 as the JobTracker, set its location as machine1:portnumber. A JobTracker is the service that assigns
Map/Reduce tasks to specific nodes in a Hadoop cluster. Note that the notion of job in the
term JobTracker does not designate a Talend Job, but rather a Hadoop job,
described as an MR or MapReduce job in Apache’s Hadoop documentation on http://hadoop.apache.org.

If you use YARN in your Hadoop cluster such as Hortonworks Data
Platform V2.0.0
or Cloudera CDH4.3 + (YARN
mode)
, you need to specify the location of the Resource
Manager
instead of the Jobtracker. Then you can continue to set the following
parameters depending on the configuration of the Hadoop cluster to be used (if you leave the
check box of a parameter clear, then at runtime, the configuration of this parameter in
the Hadoop cluster to be used will be ignored):

  1. Select the Set resourcemanager scheduler
    address
    check box and enter the Scheduler address in the field
    that appears.

  2. Allocate proper memory volumes to the Map and
    the Reduce computations and the ApplicationMaster of YARN by selecting the Set memory check box in the Advanced settings view.

  3. Select the Set jobhistory address check box
    and enter the location of the JobHistory server of the Hadoop cluster to be
    used. This allows the metrics information of the current Job to be stored in
    that JobHistory server.

  4. Select the Set staging directory check box
    and enter the directory defined in your Hadoop cluster for temporary files
    created by running programs. Typically, this directory can be found under the
    yarn.app.mapreduce.am.staging-dir
    property in the configuration files such as yarn-site.xml or mapred-site.xml of your distribution.

  5. Select the Set Hadoop user check box and
    enter the user name under which you want to execute the Job. Since a file or a
    directory in Hadoop has its specific owner with appropriate read or write
    rights, this field allows you to execute the Job directly under the user name
    that has the appropriate rights to access the file or directory to be
    processed.

  6. Select the Use datanode hostname check box to
    allow the Job to access datanodes via their hostnames. This actually sets the
    dfs.client.use.datanode.hostname property
    to true.

For further information about these parameters, see the documentation or
contact the administrator of the Hadoop cluster to be used.

For further information about the Hadoop Map/Reduce framework, see the Map/Reduce tutorial
in Apache’s Hadoop documentation on http://hadoop.apache.org.

 

User name

Enter the user name for Hadoop authentication.

For some Hadoop versions, you also need to enter the name of the
supergroup to which the user belongs in the Group field that is displayed.

  HDFS directory

Enter the HDFS directory where the data to be processed is.

At runtime, this component will clean up any existing data in this
directory, write the input data into it and perform the
operations.

If you want to reuse the existing data in HDFS, use:

  • Link with a tGenKeyHadoop

  • Link with a
    tMatchGroupHadoop

  • Use existing HDFS file

 

match_rule_import_icon.png

Click the import icon to select a match rule from the Studio
repository.

When you click the import icon, a [Match
Rule Selector]
wizard is opened to help you import
match rules from the Studio repository and use them in your
Job.

You can only import rules created with the VSR algorithm. For
further information, see Importing match rules from the studio repository.
Key Definition

Input Key Attribute

Select the column(s) from the input flow on which you want to
apply a matching algorithm.

Note

When you select a date column on which to apply an algorithm or a matching algorithm,
you can decide what to compare in the date format.

For example, if you want to only compare the year in the date, set the type of
the date column to Date in the component schema and then enter
"yyyy" in the Date
Pattern
field. The component then converts the date format to a string
according to the pattern defined in the schema before starting a string
comparison.
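To illustrate, here is a sketch in Python, where strftime's "%Y" stands in for the Java-style "yyyy" pattern used in the schema:

```python
from datetime import date

def date_key(d, pattern="%Y"):
    # Convert the date to a string with the given pattern before comparing,
    # mirroring how the component turns a Date column into a string
    # according to the schema's Date Pattern prior to a string comparison.
    return d.strftime(pattern)

# Two dates from different days compare as equal when only the year is kept:
assert date_key(date(1998, 3, 1)) == date_key(date(1998, 11, 30))
```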

 

Matching Function

Select the relevant matching algorithm from the list:

Exact Match: matches each
processed entry to all possible reference entries with exactly the
same value. It returns 1 when the two strings
exactly match, and 0 otherwise.

Exact – ignore case: matches each
processed entry to all possible reference entries with exactly the
same value while ignoring the value case.

Soundex: matches processed
entries according to a standard English phonetic algorithm. It
indexes strings by sound, as pronounced in English, for example
“Hello”: “H400”.

Levenshtein (edit distance):
calculates the minimum number of edits (insertion, deletion or
substitution) required to transform one string into another. Using
this algorithm in the tMatchGroupHadoop component, you do not need to
specify a maximum distance. The component automatically calculates a
matching percentage based on the distance. This matching score will
be used for the global matching calculation, based on the weight you
assign in the Confidence Weight
field.
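For illustration, here is a textbook Levenshtein implementation together with one plausible way to derive a matching percentage from the distance; the component's exact normalisation formula is not documented here, so treat levenshtein_score as an assumption:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions or substitutions needed
    to turn string a into string b (classic dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def levenshtein_score(a, b):
    """One plausible matching percentage: 1.0 for identical strings,
    approaching 0.0 as the edit distance grows."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest
```

For example, levenshtein("kitten", "sitting") is 3 (two substitutions and one insertion).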

Metaphone: Based on a phonetic
algorithm for indexing entries by their pronunciation. It first
loads the phonetics of all entries of the lookup reference and
checks all entries of the main flow against the entries of the
reference flow.

Double Metaphone: a newer version
of the Metaphone phonetic algorithm, which produces more accurate
results than the original algorithm. It can return both a primary
and a secondary code for a string. This accounts for some ambiguous
cases as well as for multiple variants of surnames with common
ancestry.

Soundex FR: matches processed
entries according to a standard French phonetic algorithm.

Jaro: matches processed entries
according to spelling deviations. It counts the number of matched
characters between two strings. The higher the distance is, the more
similar the strings are.

Jaro-Winkler: a variant of Jaro,
but it gives more importance to the beginning of the string.

q-grams: matches processed
entries by dividing strings into letter blocks of length
q in order to create a number of q
length grams. The matching result is given as the number of
q-gram matches over possible q-grams.
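A minimal sketch of q-gram matching follows; the exact normalisation used by the component may differ, so the score formula here is an assumption:

```python
def qgrams(s, q=3):
    """Split a string into overlapping letter blocks of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def qgram_score(a, b, q=3):
    """Number of shared q-grams over the number of possible q-grams."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    longest = max(len(ga), len(gb))
    if longest == 0:
        return 1.0  # both strings shorter than q
    shared = sum(1 for g in ga if g in gb)
    return shared / longest
```

For example, "hello" yields the 3-grams "hel", "ell" and "llo".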

custom…: enables you to load an
external matching algorithm from a Java library. The custom matcher class column alongside is
activated when you select this option.

For further information about how to load an external Java
library, see tLibraryLoad.

For further information about how to create a custom matching
algorithm, see Creating a custom matching algorithm.

For the related scenario about how to use a custom matching
algorithm, see Scenario 2: Using a custom matching algorithm to match entries.

  Custom matcher

Type in the path pointing to the custom class (the external matching
algorithm) you need to use. This path corresponds to the location of
the class inside the library file (.jar file).

For example, to use a MyDistance.class class
stored in the directory org/talend/mydistance
in a user-defined mydistance.jar library, the
path to be entered is
org.talend.mydistance.MyDistance.

 

Weight

Set a numerical weight for each attribute (column) of the key
definition. The values can be anything >= 0.

Note

If you have more than one tMatchGroupHadoop component in your Job, the
distances and the matching scores may not be coherent when the
criteria you set to compute the distance are different in each
component.
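For illustration, here is a weighted combination of per-column matching scores of the kind these weights feed into. The exact formula the component uses is an assumption; a weighted average is shown:

```python
def global_score(scores, weights):
    """Combine per-column matching scores into one global score using the
    Confidence Weight of each key attribute (weighted average sketch)."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(s * w for s, w in zip(scores, weights)) / total
```

For example, scores [1.0, 0.5] with weights [3, 1] give a global score of 0.875.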

 

Handle Null

Handle Null

To handle null values, select from the list the null operator you
want to use on the column:

Null Match Null: a Null attribute
only matches another Null attribute.

Null Match None: a Null attribute
never matches another attribute.

Null Match All: a Null attribute
matches any other value of an attribute.

For example, suppose we have two columns, name and
firstname, where the name is never null but
the first name can be null.

If we have two records:

“Doe”, “John”

“Doe”, “”

Depending on the operator you choose, these two records may or may
not match:

Null Match Null: they do not
match.

Null Match None: they do not
match.

Null Match All: they
match.

And for the records:

“Doe”, “”

“Doe”, “”

Null Match Null: they
match.

Null Match None: they do not
match.

Null Match All: they
match.
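The semantics of the three operators can be sketched as follows; this is an illustration of the behaviour described above, not the component's code:

```python
def null_match(a, b, operator):
    """Apply one of the three null operators. Returns True or False when
    at least one value is null, or None when neither value is null and
    the regular matching algorithm should decide."""
    a_null = a in (None, "")
    b_null = b in (None, "")
    if not a_null and not b_null:
        return None                     # defer to the matching algorithm
    if operator == "Null Match Null":
        return a_null and b_null        # null only matches null
    if operator == "Null Match None":
        return False                    # null never matches
    if operator == "Null Match All":
        return True                     # null matches anything
    raise ValueError(operator)
```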

Blocking Definition

Input Column

If required, select the column(s) from the input flow to be used as
the blocking key(s), by which you want to partition the processed
data into blocks; this is usually referred to as “blocking”.

Blocking reduces the number of pairs of records that need to be
examined. In blocking, input data is partitioned into exhaustive
blocks designed to increase the proportion of matches observed while
decreasing the number of pairs to compare. Comparisons are
restricted to record pairs within each block.

Note

Using blocking column(s) is very useful when you are
processing very big data.
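The blocking idea can be sketched as follows; this is illustrative only, and blocking_key stands in for the functional key generated by a component such as tGenKeyHadoop:

```python
from collections import defaultdict

def block(records, blocking_key):
    """Partition records into blocks on a blocking key so comparisons
    are restricted to record pairs within each block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    return blocks

def pairs_saved(records, blocking_key):
    """How many candidate pairs blocking removes versus comparing all pairs."""
    n = len(records)
    all_pairs = n * (n - 1) // 2
    blocked = sum(len(b) * (len(b) - 1) // 2
                  for b in block(records, blocking_key).values())
    return all_pairs - blocked
```

For example, blocking three surnames on their first letter leaves only one pair to compare instead of three.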

  Advanced settings

Matching Algorithm

Select an algorithm from the list – only one is available for the
time being.

Simple VSR Matcher: This
algorithm is based on a Vector Space Retrieval method that specifies
how two records may match.

 

Match threshold

Enter the match probability. Two data records match when the
probability is above the set value.

 

Sort the output data by GID

Select this check box to group the output data by the group
ID.

 

Output distance details

Select this check box to fill the fixed output column
MATCHING_DISTANCES with the details of the
distance between each column. This check box becomes unavailable
when you select the Link with a
tMatchGroupHadoop
check box.

Hadoop Properties

Note

Unavailable if you use an existing link.

Property

Talend Studio uses a default configuration for its engine to perform
operations in a Hadoop distribution. If you need to use a custom configuration in a specific
situation, complete this table with the property or properties to be customized. Then at
runtime, the customized property or properties will override those default ones.

  • Note that if you are using the centrally stored metadata from the Repository, this table automatically inherits the
    properties defined in that metadata and becomes uneditable unless you change the
    Property type from Repository to Built-in.

For further information about the properties required by Hadoop and its related systems such
as HDFS and Hive, see the documentation of the Hadoop distribution you
are using or see Apache’s Hadoop documentation on http://hadoop.apache.org/docs and then select the version of the documentation you want.

  Value

Type in the custom property values used to connect to the Hadoop
of interest.

 

Keep data in Hadoop

Select this check box to keep the data processed by this component
in the HDFS file.

If you leave this check box unselected, the component processes the
data, then retrieves it from the HDFS file and outputs it into the
Job flow.

 

Use existing HDFS file

Select this check box to enable the component to directly process
the data in an HDFS file. When this check box is selected, this
component can act as a start component in your Job.

HDFS file URI: set the URI of the
HDFS file holding the data you want to process.

Field delimiter: set the
character used as a field delimiter in the HDFS file.

If you leave this check box unselected, tMatchGroupHadoop receives the data flow to be
processed and loads it into an HDFS file.

 

tStatCatcher Statistics

Select this check box to collect log data at the component
level.

Dynamic settings

Click the [+] button to add a row
in the table and fill the Code
field with a context variable to choose your HDFS connection
dynamically from multiple connections planned in your Job.

The Dynamic settings table is available only when the
Use an existing connection check box is selected in the
Basic settings view. Once a dynamic parameter is
defined, the Component List box in the Basic settings view becomes unusable.

For more information on Dynamic settings and context
variables, see Talend Studio User Guide.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl +
Space
to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio
User Guide.

Usage

This component can be a start or an intermediary step. It requires
an input flow when used as an intermediary step, and an output flow
in either case. It needs a connection to Hadoop to process large
volumes of data.

It is ideally used alongside the tGenKey component or the
tGenKeyHadoop
component in order to use the blocking
columns, provided by either of the components, to gain a better
performance. In practice, we recommend placing the most restrictive
blocking criteria in the first tGenKey or the first tGenKeyHadoop component in use.

Limitation/prerequisite

You need to use Linux to execute the Job containing this
component.

Working principle

This component implements the MapReduce model, based on the blocking keys defined
in the Blocking definition table of the Basic settings view.

mapreduce.png

This implementation proceeds as follows:

  1. Splits the input rows into groups of a given size.

  2. Implements a Map Class that creates a map between each key and a list of
    records.

  3. Shuffles the records to group those with the same key together.

  4. Applies, on each key, the algorithm defined in the Key definition table of the Basic
    settings
    view.

    Then, accordingly, this component reads the records, compares them with the
    master records, groups the similar ones, and makes each remaining record a
    master record.

  5. Outputs the groups of similar records with their group IDs, group sizes,
    matching distances and scores.
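The steps above can be sketched as a map/shuffle/reduce pipeline. This is an illustration of the working principle, not the component's actual code; the match callback stands in for the Key definition algorithms:

```python
from collections import defaultdict

def map_phase(rows, key_of):
    """Map: emit a (blocking key, record) pair for every input row."""
    return [(key_of(row), row) for row in rows]

def shuffle(pairs):
    """Shuffle: group the records sharing the same key together."""
    grouped = defaultdict(list)
    for key, row in pairs:
        grouped[key].append(row)
    return grouped

def reduce_phase(grouped, match):
    """Reduce: inside each block, apply the matching step on the records."""
    return {key: match(rows) for key, rows in grouped.items()}
```

For example, mapping three surnames on their first letter and reducing with len counts the records per block.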

Scenario: Running customer matching through multiple passes

The Job in this scenario connects to the given Hadoop system, groups similar customer
records by running through two subsequent matching passes in HDFS and outputs the
calculated matches by groups. Each pass provides its matches to the pass that follows in
order for the latter to add more matches identified with new rules.

tmatchgroupHadoop-scenario.png

The components of interest are:

  • tFixedFlowInput: it provides the customer
    records to be processed.

  • two tGenKeyHadoop components: Each defines a
    way to partition the records…

  • two tMatchGroupHadoop components: they
    process each partition to group the records within and once configured, sort the
    groups by their group IDs. The first tMatchGroupHadoop processes the partitions defined by the first
    tGenKeyHadoop and the second tMatchGroupHadoop processes those defined by the
    second tGenKeyHadoop. Each of them sets one pass
    to match the received records.

    Warning

    The two tMatchGroupHadoop components
    must have the same schema.

  • two tLogRow components: they present the
    execution result of each tMatchGroupHadoop
    component.

To replicate this scenario, proceed as the following sections illustrate.

Dropping and linking the components

  1. Drop tFixedFlowInput, two tGenKeyHadoop components, two tMatchGroupHadoop components and two tLogRow components from Palette onto the design workspace.

    Note

    A component used in the workspace can be labelled the way you need. In
    this scenario, the input tFixedFlowInput component is labelled
    incoming_customers. For further information about how to
    label a component, see Talend Studio
    User Guide.
  2. Right-click tFixedFlowInput to open its
    contextual menu and select the Row >
    Main link from this menu to connect
    this component to the first tGenKeyHadoop
    (labeled lname_postcode).

  3. Do the same to create the Main link from
    the first tGenKeyHadoop to the second
    tGenKeyHadoop (labeled lname_initial), to the first tMatchGroupHadoop, to the first tLogRow and then to the second tMatchGroupHadoop and finally to the second
    tLogRow.

The components to be used in this scenario are all placed and linked. You then
need to continue configuring them successively.

Configuring the input flow

Setting up the input data

  1. Double-click tFixedFlowInput to open its
    Component view.

    tmatchgroupHadoop-inputflow.png
  2. Click the three-dot button next to Edit
    schema
    to open the schema editor.

    tmatchgroupHadoop-input_schema.png
  3. Click the plus button eight times to add eight rows. They are the eight
    columns of the schema of the input data.

  4. Rename these eight rows respectively. In this example, they are: account_num, lname, fname, mi, address1, city, state_province, postal_code.

  5. In the Type column, select the data types
    for the rows of interest. In this example, select Long for the account_num
    column.

  6. Click OK to validate these changes and
    accept the propagation prompted by the pop-up dialog box.

  7. In the Mode area of the Basic settings view, select Use Inline Content (delimited file) to enter the input data
    of interest.

  8. In the Content field, enter the input
    data to be processed, or paste the sample data provided by the demo Job
    D4_hadoop_group_family_multipass that
    you could import along with the demo DQ project. For further information
    about how to import a project, see Talend Studio
    User Guide.

Configuring the key generation for the first pass

  1. Double-click the first tGenKeyHadoop
    (labelled lname_postcode) to open the
    Component view.

    tmatchgrouphadoop-genkey1.png
  2. Configure the connection to the HDFS you want to write and process the
    records in.

    The parameters to be set are the Distribution, the Hadoop
    version
    , the NameNode URI,
    the HDFS directory, User name and the Jobtracker
    URI
    .

    In the HDFS directory you define, this component will create, at runtime,
    the folders storing separately the input records and the same records with
    their partition keys: the original records in an in folder and the partitioned ones in an out folder, both under the same parent folder
    tGenKeyHadoop_1.
  3. Click match_rule_import_icon.png and import blocking keys from the match rules created
    and tested in the Profiling perspective
    of Talend Studio and use them in your Job. Otherwise,
    define the blocking key parameters as described in the steps below.

  4. Under the Algorithm table, click the plus
    button to add two rows in this table.

  5. On the column column, click the newly
    added row and select, from the list, the column you want to process using an
    algorithm. In this example, select lname.

  6. Do the same on the second row to select
    postal_code.

  7. On the pre-algorithm column, click the
    newly added row and select, from the list, the pre-algorithm you want to
    apply to the corresponding column. In this example, select remove diacritical marks and upper case to remove
    any diacritical marks and convert the fields of the lname column to upper case before generating the code of
    this column.

    Note

    This conversion does not change your raw data.

  8. On the algorithm column, click the newly
    added row and select from the list the algorithm you want to apply to the
    corresponding column. In this example, select N first
    characters of each word
    .

  9. Do the same for the second row on the algorithm column to select first N
    characters of the string
    .

  10. Click in the Value column next to the
    algorithm column and enter the value
    for the selected algorithm, when needed. In this scenario, type in
    1 for both of the rows, meaning that the first
    letter of each field in the corresponding columns will be used to generate
    the keys.

  11. Click Advanced settings to open its view.

    tmatchgroupHadoop-genkey1-advanced_settings.png
  12. Select the Keep data in Hadoop check box
    in order to process the data in HDFS. In this situation, the customer records
    are not output into the data flow and the process runs faster.

Configuring the key generation for the second pass

  1. Double-click the second tGenKeyHadoop
    (labelled lname_initial) to open the
    Component view.

    tmatchgrouphadoop-genkey2.png
  2. Select Link with a tGenKeyHadoop to reuse
    the connection to HDFS and the HDFS directory created by the first tGenKeyHadoop. This reuse enables this component
    to read the data processed by its preceding tGenKeyHadoop component natively in HDFS.

  3. Click the dotbutton.png button to verify the key column in the schema. You can
    see that the key columns of the two tGenKeyHadoop components have been named automatically to
    differentiate them from each other. In this scenario, they are T_GEN_KEY_postcode and T_GEN_KEY.

    tmatchgroupHadoop-genkeyschema.png
  4. Click match_rule_import_icon.png and import blocking keys from the match rules created
    and tested in the Profiling perspective
    of Talend Studio and use them in your Job. Otherwise,
    define the blocking key parameters as described in the steps below.

  5. Under the Algorithm table, click the plus
    button to add one row in this table.

  6. On the column column, click the newly
    added row and select from the list the column you want to process using an
    algorithm. In this example, select account_num.

  7. On the algorithm column, click the newly
    added row and select from the list the algorithm you want to apply to the
    corresponding column. In this example, select first N
    characters of the string
    .

  8. Click in the Value column next to the
    algorithm column and enter the value
    for the selected algorithm, when needed. In this scenario, type in
    1, meaning that the first letter of each field in
    the corresponding column will be used to generate the required keys.

  9. Click Advanced settings to open its
    view.

  10. Select the Keep data in Hadoop check box
    in order to process data in HDFS.

Configuring the two passes

You need to configure the two passes to group the input data with the help of the
two columns of generated keys.

Configuring the first pass

  1. Double-click the first tMatchGroupHadoop
    component (labelled pass1) to display the
    Component view.

    tmatchgrouphadoop-pass1.png
  2. If required, click Sync schema, then
    click Edit schema to open the schema editor
    and see the schema retrieved from the previous component in the Job.

    tmatchgroupHadoop-pass1_schema.png
  3. Select Link with a tGenKeyHadoop to reuse
    the connection to HDFS and the HDFS directory created by its preceding
    tGenKeyHadoop. This reuse enables this
    component to read the data processed by that tGenKeyHadoop component
    natively in HDFS.

  4. Click match_rule_import_icon.png and import matching keys from the match rules created
    and tested in the Profiling perspective
    of Talend Studio and use them in your Job. Otherwise,
    define the matching key parameters as described in the steps below.

  5. In the Key definition table, click plus_button.png to add to the list the columns on which you want to do
    the matching operation, lname in this
    scenario.

    Note

    When you select a date column on which to apply an algorithm or a matching algorithm,
    you can decide what to compare in the date format.

    For example, if you want to only compare the year in the date, set the
    type of the date column to Date in the component schema and then enter
    "yyyy" in the Date Pattern field. The component then converts the date
    format to a string according to the pattern defined in the schema before
    starting a string comparison.

  6. Click in the first and second cells of the Matching
    type
    column and select from the list the method(s) to be used
    for the matching operation, Jaro-Winkler in this
    example.

  7. Click in the cell of the Handle Null
    column and select the null operator you want to use to handle null
    attributes in the column.

  8. Click the plus button below the Blocking
    Definition
    table to add a row, then click in
    the row and select from the list the column you want to use as a blocking
    value, T_GEN_KEY_postcode in this
    example.

    Using a blocking value reduces the number of record pairs that need to be
    examined: the input data is partitioned into exhaustive blocks based on the
    functional key, and comparison is restricted to record pairs within each
    block.
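
    The blocking idea above can be sketched as follows. This is a simplified
    illustration of the principle, not the component's actual code; the record
    dictionaries and field names are invented for the example:

    ```python
    from collections import defaultdict
    from itertools import combinations

    def candidate_pairs(records, block_key):
        # Partition the records by the blocking key; candidate pairs are
        # generated only within each block, never across blocks.
        blocks = defaultdict(list)
        for rec in records:
            blocks[rec[block_key]].append(rec)
        for block in blocks.values():
            yield from combinations(block, 2)

    records = [
        {"lname": "Alexander", "T_GEN_KEY_postcode": "A9"},
        {"lname": "Alexandre", "T_GEN_KEY_postcode": "A9"},
        {"lname": "Alexander", "T_GEN_KEY_postcode": "A3"},
    ]
    # Only the two A9 records form a candidate pair: 1 pair instead of 3.
    print(len(list(candidate_pairs(records, "T_GEN_KEY_postcode"))))  # 1
    ```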

  9. Click Advanced settings to open its view
    and verify that the Keep data in Hadoop check
    box is cleared. This way, the processed customer records are output into the
    data flow and thus become available to tLogRow.

    tmatchgroupHadoop-pass1-advanced_settings.png
  10. Select the Sort the output data by GID
    check box to arrange the output data by their group IDs.

Configuring the second pass

  1. Double-click the second tMatchGroupHadoop
    component (labelled pass2) to display the
    Component view.

    tmatchgrouphadoop-pass2.png
  2. If this component does not have the same schema as the preceding
    component, a warning icon appears. In that case, click the Sync columns button to retrieve the schema from
    the preceding component; the warning icon then disappears.

  3. Select the Link with a tMatchGroupHadoop
    check box to reuse the connection of its preceding tMatchGroupHadoop to the Hadoop system.

  4. Click match_rule_import_icon.png to import matching keys from the match rules created
    and tested in the Profiling perspective
    of Talend Studio and use them in your Job. Otherwise,
    define the matching key parameters as described in the steps below.

  5. In the Key definition table, click the
    plus button to add to the list the columns on which you want to do the
    matching operation, lname in this
    scenario.

    Note

    When you select a date column on which to apply an algorithm or a matching algorithm,
    you can decide what to compare in the date format.

    For example, to compare only the year of the date, set the type of the date
    column to Date in the component schema and enter
    "yyyy" in the Date
    Pattern
    field. The component then converts the date to a string
    according to the pattern defined in the schema before starting a string
    comparison.

  6. Click in the cell of the Matching type
    column and select from the list the method(s) to be used for the matching
    operation, Jaro-Winkler in this
    example.

  7. Click plus_button.png below the Blocking
    Definition
    table to add a row, then click in
    the row and select from the list the column you want to use as a blocking
    value, T_GEN_KEY in this example. This
    way, the matching operation is performed only between the master records
    that share the same key, in this example, the same initial character of the
    account numbers.

    Through this pass, the Job processes the matching groups provided by the
    previous tMatchGroupHadoop: it selects the
    groups containing a single record and compares them with the other master
    records that share the same generated key.
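
    The refinement performed by the second pass can be sketched as follows. This is
    a simplified illustration of the merging logic, not Talend's actual
    implementation; in particular, it omits the blocking step (the real pass only
    compares records that share the same generated key), and the similarity
    function here is a toy stand-in:

    ```python
    def refine_groups(groups, similar, threshold=0.9):
        """Merge single-record groups into the best-matching larger group."""
        singletons = [g for g in groups if len(g["members"]) == 1]
        merged = [g for g in groups if len(g["members"]) > 1]
        for s in singletons:
            # Compare the singleton's master against the other masters.
            best = max(merged, key=lambda g: similar(s["master"], g["master"]),
                       default=None)
            if best is not None and similar(s["master"], best["master"]) >= threshold:
                best["members"].extend(s["members"])  # absorbed into that group
            else:
                merged.append(s)                      # stays as its own group
        return merged

    # Toy similarity for illustration: exact match on the last name token.
    def same_lname(a, b):
        return 1.0 if a.split()[-1] == b.split()[-1] else 0.0

    groups = [
        {"master": "Jeremy Alexander",
         "members": ["Jeremy Alexander", "J. Alexander"]},
        {"master": "Bob Alexander", "members": ["Bob Alexander"]},
        {"master": "Maxine Alexander", "members": ["Maxine Alexander"]},
    ]
    result = refine_groups(groups, same_lname)
    print(len(result), len(result[0]["members"]))  # 1 4
    ```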

  8. Click Advanced settings to open its view
    and verify that the Keep data in Hadoop check
    box is cleared. This way, the processed customer records are output into the
    data flow and thus become available to tLogRow.

  9. Select the Sort the output data by GID
    check box to arrange the output data by their group IDs.

Sorting the input records

  1. Double-click tSortRow to open its
    Component view.

    tmatchgrouphadoop-sortrow.png
  2. Under the Criteria table, click the plus
    button twice to add two rows.

  3. In the first row, select GID in the
    Schema column column, alpha in the Sort num or
    alpha
    column, and asc in
    the Order asc or desc column. This means
    that the sorting is performed on the GID
    column of the input schema in ascending alphabetical order, and the other
    columns are reordered accordingly.

  4. Do the same in the second row, selecting GRP_SIZE,
    num and desc respectively.
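
The two sort criteria above can be sketched in Python. This is an illustration of
the sort order only, not tSortRow's code; the rows and values are invented for the
example:

```python
rows = [
    {"GID": "g2", "GRP_SIZE": 1, "lname": "Smith"},
    {"GID": "g1", "GRP_SIZE": 3, "lname": "Alexander"},
    {"GID": "g1", "GRP_SIZE": 3, "lname": "Alexandre"},
]
# GID ascending alphabetically, then GRP_SIZE descending numerically;
# negating the numeric key reverses its order within one sort call.
rows.sort(key=lambda r: (r["GID"], -r["GRP_SIZE"]))
print([r["lname"] for r in rows])  # ['Alexander', 'Alexandre', 'Smith']
```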

Then you can run this Job.

The tLogRow component is used to present the
execution result of the Job.

  1. To configure the presentation mode, double-click the tLogRow component to open its
    Component view, then, in the Mode area, select the Table (print values in cells of a table) option.

  2. Press F6 to run this Job.

Once done, the Run view is opened automatically,
where you can check the execution result.

The result after the first pass reads as follows:

tmatchgrouphadoop_multipass1.png

The result after the second pass reads as follows:

tmatchgrouphadoop_multipass2.png

Note

Due to space constraints, the results are not shown in full.

Comparing, for example, the customer name Alexander across the results of the two passes, you will find that
more customers with the last name Alexander are
grouped together after the second pass:

  • In the first pass, Jeremy Alexander,
    Bob Alexander and Maxine Alexander are not placed in the
    same group because matching is performed only within each block defined
    in the T_GEN_KEY_postcode column, and
    they belong to different blocks, A9, A8 and A3 respectively.

  • In the second pass, matching is performed using the blocks defined
    in the T_GEN_KEY column. Since all
    three customer names belong to block 2,
    they are grouped together after the distances between them are computed. In
    addition, the MASTER column shows that Jeremy
    Alexander
    is the master record of its group.


Document retrieved from Talend: https://help.talend.com