tRecordMatching

Warning

This component will be available in the Palette of
Talend Studio on the condition that you have subscribed to one of
the Talend Platform products.

tRecordMatching properties

Component family	Data Quality
Function	tRecordMatching joins two tables by doing a fuzzy match on several columns using a wide variety of comparison algorithms. It compares columns from the main flow with reference columns from the lookup flow and according to the matching strategy you define, outputs the match data, the possible match data and the rejected data. On arranging your matching strategy, the user-defined matching scores are critical to determine the match level of the data of interest.
Purpose	Helps ensuring the data quality of any source data against a reference data source.
Basic settings	Schema and Edit schema	A schema is a row description, it defines the number of fields to be processed and passed on to the next component. The schema is either Built-in or stored remotely in the Repository. Since version 5.6, both the Built-In mode and the Repository mode are available in any of the Talend solutions.
		Built-in: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.
		Repository: You have already created and stored the schema in the Repository. You can reuse it in other projects and job designs. Related topic: see Talend Studio User Guide.
	Replace output column with lookup column if matches or possible matches	Select this check box to replace the output column with the lookup column in case of match or possible match values. Once this check box is selected, the Columns Mapping table appears.
	Columns Mapping	Complete this table to define the output columns to be replaced and the corresponding substitute lookup columns. The column to be complete are: – Output Column: select the column of interest from the drop-down list. On this list, two columns MATCHING_WEIGHT and MATCHING_DISTANCES, are defined by default; the other columns are available only when you have defined them in the schema editor of this component. For further information about these two default columns, see Scenario 1: Matching entries using the Q-grams and Levenshtein algorithms; – Lookup Column: select the column of interest from the drop-down list. This list is available when a lookup link has been defined to provide the corresponding lookup columns to this component.
		Click the import icon to select a match rule from the Studio repository. When you click the import icon, a [Match Rule Selector] wizard is opened to help you import match rules from the Studio repository and use them in your Job. You can only import rules created with the VSR algorithm. For further information, see Importing match rules from the studio repository
Key definition	Input Key Attribute	Select the column(s) from the main flow that needs to be checked against the reference (lookup) key column. Note When you select a date column on which to apply an algorithm or a matching algorithm, you can decide what to compare in the date format. For example, if you want to only compare the year in the date, in the component schema set the type of the date column to Date and then enter “yyyy” in the Date Pattern field. The component then converts the date format to a string according to the pattern defined in the schema before starting a string comparison.
	Lookup Key Attribute	Select the lookup key columns that you will use as a reference against which to compare the columns from the input flow.
	Matching Function	Select the relevant matching algorithm from the list: Exact Match: matches each processed entry to all possible reference entries with exactly the same value. Levenshtein: Based on the edit distance theory. It calculates the number of insertion, deletion or substitution required for an entry to match the reference entry. Metaphone: Based on a phonetic algorithm for indexing entries by their pronunciation. It first loads the phonetics of all entries of the lookup reference and checks all entries of the main flow against the entries of the reference flow. Double Metaphone: a new version of the Metaphone phonetic algorithm, that produces more accurate results than the original algorithm. It can return both a primary and a secondary code for a string. This accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry. Exact – ignore case: matches each processed entry to all possible reference entries with exactly the same value while ignoring the value case. Soundex: matches processed entries according to a standard English phonetic algorithm. Soundex FR: matches processed entries according to a standard French phonetic algorithm. Jaro: matches processed entries according to spelling deviations. q-grams: matches processed entries by dividing strings into letter blocks of length `q` in order to create a number of `q`length grams. The matching result is given as the number of q-gram matches over possible q-grams. custom…: enables you to load an external matching algorithm from a Java library. The Custom Matcher class column alongside is activated when you selected this option. For further information about how to load an external Java library, see tLibraryLoad.
	Custom Matcher	Type in the path pointing to the custom class (external matching algorithm) you need to use. This path is defined by yourself in the library file (.jar file). For example, to use a MyDistance.class class stored in the directory org/talend/mydistance in a user-defined mydistance.jar library, the path to be entered is org.talend.mydistance.MyDistance.
	Weight	Set a numerical weight for each attribute (column) of the key definition. The values can be anything >= 0.
	Handle Null	Handle Null To handle null values, select from the list the null operator you want to use on the column: Null Match Null: a Null attribute only matches another Null attribute. Null Match None: a Null attribute never matches another attribute. Null Match All: a Null attribute matches any other value of an attribute. For example, if we have two columns, name and firstname where the name is never null, but the first name can be null. If we have two records: “Doe”, “John” “Doe”, “” Depending on the operator you choose, these two records may or may not match: Null Match Null: they do not match. Null Match None: they do not match. Null Match All: they match. And for the records: “Doe”, “” “Doe”, “” Null Match Null: they match. Null Match None: they do not match. Null Match All: they match.
Blocking Selection	Input Column	If required, select the column(s) from the input flow according to which you want to partition the processed data in blocks, this is usually referred to as “blocking”. Blocking reduces the number of pairs of records that needs to be examined. In blocking, input data is partitioned into exhaustive blocks designed to increase the proportion of matches observed while decreasing the number of pairs to compare. Comparisons are restricted to record pairs within each block. Using blocking column(s) is very useful when you are processing very big data.
	Matching strategy	Select the matching output that best fulfils your needs. This option may be: All matches: this option means to output all of the matching and the possibly matching records. Best match: this option means to output only the record with the highest matching score. First match: this option means to output the first matching record; if no one is matching, it outputs the first possibly matching record. Last match: this option means to output the last matching record; if no one is matching, it outputs the last possibly matching record. When you arrange the matching strategy, the matching scores are critical. You need to define them in the Possible match interval fields of the Advanced settings view.
Advanced settings	Matching Algorithm	Select an algorithm from the list – only one is available for the time being. Simple VSR Matcher: This algorithm is based on a Vector Space Retrieval method that specifies how two records may match.
	Possible match interval	Enter a minimum and a maximum values: minimum: set the minimum record distance allowed to match the reference (0 <= minimum). maximum: set the maximum distance allowed to match the reference (maximum <=1). For example, if you set 0.5 as the minimum value and 0.9 as the maximum value, the scores equal or higher than 0.9 indicate match, the scores between 0.5 excluded and 0.9 excluded indicate possible match and the other scores indicate non match.
	Store on disk	Select this check box if you want to store processed data blocks on the disk to maximize system performance.
	tStatCatcher Statistics	Select this check box to collect log data at the component level.
Global Variables	NB_MATCH_LINE: the number of rows matching the comparison algorithm. This is an After variable and it returns an integer. NB_POSSIBLE_MATCH_LINE: the number of rows possibly matching the comparison algorithm. This is an After variable and it returns an integer. NB_NONE_MATCH_LINE: the number of rows not matching the comparison algorithm. This is an After variable and it returns an integer. ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box. A Flow variable functions during the execution of a component while an After variable functions after the execution of the component. To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it. For further information about variables, see Talend Studio User Guide.
Usage	This component is not startable and it requires two input components and one or more output components.
Limitation/prerequisite	n/a

Creating a custom matching algorithm

The tRecordMatching component enables you to use
a user-defined matching algorithm for obtaining the results you need.

A custom matching algorithm is written manually and stored in a
.jar file (Java archive). Talend provides an
example .jar file on the basis of which you are supposed to
develop your own file easily. To do this, proceed as follows:

In Eclipse, check out the test.mydistance project
from svn at:

http://talendforge.org/svn/top/trunk/test.mydistance/
In this project, navigate to the Java class named
MyDistance.Java: http://talendforge.org/svn/top/trunk/test.mydistance/src/main/java/org/talend/mydistance//.

Open this file that has the below code:

package org.talend.mydistance;

import org.talend.dataquality.record.linkage.attribute.AbstractAttributeMatcher;
import org.talend.dataquality.record.linkage.constant.AttributeMatcherType;

/**
 * @author scorreia
 * 
 * Example of Matching distance.
 */
public class MyDistance extends AbstractAttributeMatcher {

    /*
     * (non-Javadoc)
     * 
     * @see org.talend.dataquality.record.linkage.attribute.IAttributeMatcher#getMatchType()
     */
    @Override
    public AttributeMatcherType getMatchType() {
        // a custom implementation should return this type AttributeMatcherType.custom
        return AttributeMatcherType.CUSTOM;
    }

    /*
     * (non-Javadoc)
     * 
     * @see org.talend.dataquality.record.linkage.attribute.IAttributeMatcher#getMatching
     Weight(java.lang.String,
     * java.lang.String)
     */
    @Override
    public double getWeight(String arg0, String arg1) {
        // Here goes the custom implementation of the matching distance between the two given strings.
        // the algorithm should return a value between 0 and 1.

        // in this example, we consider that 2 strings match if their first 4 characters are identical
        // the arguments are not null (the check for nullity is done by the caller)
        final int max = 4;
        int nbIdenticalChar = Math.min(max, Math.min(arg0.length(), arg1.length()));
        for (int c = 0; c &lt; max; c++) {
            if (arg0.charAt(c) != arg1.charAt(c)) {
                nbIdenticalChar = c;
                break;
            }
        }
        return (max - nbIdenticalChar) / ((double) max);
    }

}

package org.talend.mydistance;

import org.talend.dataquality.record.linkage.attribute.AbstractAttributeMatcher;

import org.talend.dataquality.record.linkage.constant.AttributeMatcherType;

/**

* @author scorreia

* Example of Matching distance.

public class MyDistance extends AbstractAttributeMatcher {

* (non-Javadoc)

* @see org.talend.dataquality.record.linkage.attribute.IAttributeMatcher#getMatchType()

@Override

public AttributeMatcherType getMatchType() {

// a custom implementation should return this type AttributeMatcherType.custom

return AttributeMatcherType.CUSTOM;

}

* (non-Javadoc)

* @see org.talend.dataquality.record.linkage.attribute.IAttributeMatcher#getMatching

Weight(java.lang.String,

* java.lang.String)

@Override

public double getWeight(String arg0, String arg1) {

// Here goes the custom implementation of the matching distance between the two given strings.

// the algorithm should return a value between 0 and 1.

// in this example, we consider that 2 strings match if their first 4 characters are identical

// the arguments are not null (the check for nullity is done by the caller)

final int max = 4;

int nbIdenticalChar = Math.min(max, Math.min(arg0.length(), arg1.length()));

for (int c = 0; c < max; c++) {

if (arg0.charAt(c) != arg1.charAt(c)) {

nbIdenticalChar = c;

break;

}

return (max - nbIdenticalChar) / ((double) max);

}

In this file, type in the class name for the custom algorithm you are
creating in order to replace the default name. The default name is
MyDistance and you can find it in the line:
public class MyDistance implements IAttributeMatcher.

In the place where the default algorithm is in the file, type in the
algorithm you need to create to replace the default one. The default
algorithm reads as follows:

final int max = 4;
        int nbIdenticalChar = Math.min(max, Math.min(arg0.length(), arg1.length()));
        for (int c = 0; c &lt; max; c++) {
            if (arg0.charAt(c) != arg1.charAt(c)) {
                nbIdenticalChar = c;
                break;
            }
        }
        return (max - nbIdenticalChar) / ((double) max);

final int max = 4;

int nbIdenticalChar = Math.min(max, Math.min(arg0.length(), arg1.length()));

for (int c = 0; c < max; c++) {

if (arg0.charAt(c) != arg1.charAt(c)) {

nbIdenticalChar = c;

break;

}

return (max - nbIdenticalChar) / ((double) max);

Save your modifications.
Using Eclipse, export this new .jar file.

Then this user-defined algorithm is ready to be used by the tRecordMatching component.

Scenario 1: Matching entries using the Q-grams and Levenshtein algorithms

This scenario describes a six-component Job that aims at:

matching entries in the name column against the entries
in the reference input file by dividing strings into letter blocks of length q,
where q is 3, in order to create a number of q length grams. The matching result is given as the number of
q-gram matches over possible q-grams,
checking the edit distance between the entries in the
email column of an input file against those of the
reference input file.

The outputs of these two matching types are written in three output files: the first
for match values, the second for possible match values and the third for the values for
which there are no matches in the lookup file.

In this scenario, we have already stored the main and reference input schemas in the
Repository. For more information about storing schema metadata in the Repository, see
Talend Studio User Guide.

The main input table contains seven columns: code,
name, address,
zipcode, city, email
and col7. We want to carry the fuzzy match on two columns:
name and email.

Setting up the Job

In the Repository tree view, expand
Metadata – DB
Connections where you have stored the main input schemas and
drop the relevant file onto the design workspace.

The [Components] dialog box
appears.
Select tMysqlInput and click OK to drop the tMysqlInput component onto the workspace.

The input table used in this scenario is called
person. It holds several columns including the two
columns name and email we want to
do the fuzzy match on. The following capture shows the basic properties of
the main input component:
Do the same for the second input table you want to use as a reference,
customer in this scenario.

The following capture shows the basic properties of the reference input
component:
Drop the following components from the Palette onto the design workspace: tRecordMatching and three tLogRow.
Connect the main and reference input components to tRecordMatching using Main
links. The link between the reference input table and tRecordMatching displays as a Lookup link on the design workspace.
Connect tRecordMatching to the three
tLogRow components using the Matches, Possible
Matches and Non Matches
links.

Configuring the components

Double-click tRecordMatching to display
its Basic settings view and define its
properties.
Click the Edit schema button to open a
dialog box. Here you can define the data you want to pass to the output
components.

In this example we want to pass to the tRecordMatching component the name and
email columns from the first tMysqlInput component, and the ref_name
and ref_email columns from the
second tMysqlInput component.

The MATCHING_DISTANCE and the
MATCHING_WEIGHT columns in the output schema are
defined by default.

The MATCHING_WEIGHT column is always between
0 and 1. It is a global
distance between sets of columns (defined by the columns to be matched).

The MATCHING_DISTANCE column will print a distance
for each of the columns on which we use an algorithm. The results will be
separated by a vertical bar (pipe).

Click OK to close the dialog box and
proceed to the next step.
In the Key Definition area of the
Basic settings view of tRecordMatching, click the plus button to add two
columns to the list.
Select the input columns and the output columns you want to do the fuzzy
matching on from the Input key attribute
and Lookup key attribute lists
respectively.

In this example, select name and email
as input attributes and ref-name and
ref_email as lookup attributes.

Note

When you select a date column on which to apply an algorithm or a matching algorithm,
you can decide what to compare in the date format.

For example, if you want to only compare the year in the date, in the component schema
set the type of the date column to Date and then enter
“yyyy” in the Date
Pattern field. The component then converts the date format to a string
according to the pattern defined in the schema before starting a string
comparison.
Click in the Matching type column and
select from the list q-gram, the method to
be used on the first column to check the incoming data against the reference
data.
Set the matching type for the second column, Levenshtein in this example.

The minimum and maximum possible match values are defined in the Advanced settings view. You can change the
by-default values.
In the Confidence Weight column, set a
numerical weight for each of the columns used as key attributes.
Click in the cell of the Handle Null
column and select the null operator you want to use to handle null
attributes in the columns.
If required, click the plus button below the Blocking Selection table to add one or more lines in the
table and then click in the line and select from the list the column you
want to use as a blocking value.

Using a blocking value reduces the number of pairs of records that needs
to be examined. The input data is partitioned into exhaustive blocks based
on the blocking value. This will decrease the number of pairs to compare as
comparison is restricted to record pairs within each block. Check Scenario 2: Comparing columns and grouping in the output flow duplicate
records that have the same functional key for a use case of the blocking value.
Double-click the first tLogRow component
to display its Basic settings view, and
select Table in the Mode area to display the source file and the tRecordMatching results together to be able to
compare them.
Do the same for the other two tLogRow
components.

Executing the Job

Save your Job and press F6 to execute
it.

Three output tables are written on the console. The first shows the match
entries, the second show the possible match entries and the third shows the
non match entries according to the used matching method in the defined
columns.

The figure below illustrates extractions of the three output
tables.

The first table lists all the names and emails that could be matched with
identical entries in the reference table. Thus the matching distance and the
matching weight are equal to “1.0”.

The second table lists all the names and emails that have a possible match
in the reference table. The matching distance column prints the distances
for the name and email columns and
separate them by a vertical bar.

The third table lists all the names and emails that do not have match in
the reference table.

Scenario 2: Using a custom matching algorithm to match entries

In this scenario, reuse the previous Job to load and apply a user-defined matching
algorithm.

As a prerequisite, follow the steps described in Creating a custom matching algorithm to
manually write a custom algorithm and store it in a .jar file (Java
archive). The mydistance.jar file is used here to provide the
user-defined matching algorithm, MyDistance.class.

You will also need to use the tLibraryLoad component
to import the Java library into the Job.

Setting up the Job

On the previous Job, drop the tLibraryLoad component from the Palette to the Design workspace.
Delete the tLogRow components named
possible and none.
Connect the tLibraryLoad component to the
tMysqlInput
(person) component using a Trigger > On Subjob Ok link.

Configuring the components

Double-click tLibraryLoad to open its
Component view.
Click the […] button and browse to the
mydistance.jar file.
Click Windows>Show
view… to open the Modules
view.
In the Modules view, click and in the open dialog box, browse to the user-defined
mydistance.jar file created for this Job.
Click Open.

The user-defined .jar file is imported and listed in
the Modules view.

You will get an error message if you try to run the Job without installing
the external user-defined .jar file.
Double-click tRecordMatching to open its
Component view.
In the Key Definition table of this view,
click the name row in the Matching Type column and select custom… from the drop-down list.
In the Custom matcher class of this
name row, type in the path pointing to
MyDistance.class in the
mydistance.jar file. In this example, this path is
org.talend.mydistance.MyDistance.

Note

When you select a date column on which to apply an algorithm or a matching algorithm,
you can decide what to compare in the date format.

For example, if you want to only compare the year in the date, in the component schema
set the type of the date column to Date and then enter
“yyyy” in the Date
Pattern field. The component then converts the date format to a string
according to the pattern defined in the schema before starting a string
comparison.

Executing the Job

Press F6 to run this Job.

In the Run view, the matched entries are
identified and listed as follows:

Document get from Talend https://help.talend.com

Thank you for watching.

Docs 5.x

0 Comments

Inline Feedbacks

View all comments

tRecordMatching – Docs for ESB 5.x

tRecordMatching

Warning

tRecordMatching properties

Note

Creating a custom matching algorithm

Scenario 1: Matching entries using the Q-grams and Levenshtein algorithms

Setting up the Job

Configuring the components

Note

Executing the Job

Scenario 2: Using a custom matching algorithm to match entries

Setting up the Job

Configuring the components

Note

Executing the Job

My Website Links

Tags