tRecordMatching
Ensures the data quality of any source data against a reference data
source.
tRecordMatching joins two tables by doing a fuzzy
match on several columns using a wide variety of comparison algorithms. It compares
columns from the main flow with reference columns from the lookup flow and according to
the matching strategy you define, outputs the match data, the possible match data and
the rejected data. On arranging your matching strategy, the user-defined matching scores
are critical to determine the match level of the data of interest.
tRecordMatching Standard properties
These properties are used to configure tRecordMatching running in the Standard Job framework.
The Standard
tRecordMatching component belongs to the Data Quality family.
This component is available in Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.
Basic settings
Schema and Edit |
A schema is a row description, it defines the number of fields to be |
 |
Built-in: You create and store the |
 |
Repository: You have already created |
Replace output column with lookup column if matches or |
Select this check box to replace the output column with the lookup Once this check box is selected, the Columns |
Columns Mapping |
Complete this table to define the output columns to be replaced and – Output Column: select the column of – Lookup Column: select the column of |
|
Click the import icon to select a match rule from the Studio When you click the import icon, a Match Rule You can import rules created with the VSR algorithm. The T-Swoosh |
Input Key Attribute |
Select the column(s) from the main flow that needs to be checked Note: When you select a date column on which to apply an algorithm or a matching
algorithm, you can decide what to compare in the date format. For example, |
Lookup Key Attribute |
Select the lookup key columns that you will use as a reference against |
Matching Function |
Select the relevant matching algorithm from the list:
Exact Match: matches each processed
Levenshtein: Based on the edit
Metaphone: Based on a phonetic
Double Metaphone: a new version of
Exact – ignore case: matches each
Soundex: matches processed entries
Soundex FR: matches processed entries
Jaro: matches processed entries
q-grams: matches processed entries by
Hamming: calculates the minimum
custom…: enables you to load an For further information about how to load an external Java library, |
Custom Matcher |
Type in the path pointing to the custom class (external matching For further information, see Creating a custom matching algorithm. |
 |
Tokenized measure Tokenization is the concept of splitting a string into words. Select
NO: no tokenization method is used
Same place: splits the two strings by
Same order: splits the two strings by This method should be used only with strings which has a few words,
Any order: splits the two strings by Using this method, “John Doe” and “Doe John” match. |
Weight |
Set a numerical weight for each attribute (column) of the key |
Handle Null |
Handle Null To handle null values, select from the list the null operator you want
Null Match Null: a Null attribute
Null Match None: a Null attribute
Null Match All: a Null attribute For example, if we have two columns, name and If we have two records: “Doe”, “John” “Doe”, “” Depending on the operator you choose, these two records may or may not
Null Match Null: they do not
Null Match None: they do not Null Match All: they match. And for the records: “Doe”, “” “Doe”, “” Null Match Null: they match.
Null Match None: they do not Null Match All: they match. |
Input Column |
If required, select the column(s) from the input flow according to Blocking reduces the number of pairs of records that needs to be Using blocking column(s) is very useful when you are processing very |
Matching strategy |
Select the matching output that best fulfils your needs. This option
may be:
|
Advanced settings
Matching Algorithm |
Select an algorithm from the list. Only one algorithm is available for
Simple VSR: This algorithm is based For further information about how to import rules based on the VSR |
Possible match interval |
Enter a minimum and a maximum values:
minimum: set the minimum record For example, if you set 0.5 as the minimum value and 0.9 as the |
Store on disk |
Select this check box if you want to store processed data blocks on |
tStatCatcher Statistics |
Select this check box to collect log data at the component |
Global Variables
Global Variables |
NB_MATCH_LINE: the number of rows matching the comparison
NB_POSSIBLE_MATCH_LINE: the number of rows possibly
NB_NONE_MATCH_LINE: the number of rows not matching the
ERROR_MESSAGE: the error message generated by the A Flow variable functions during the execution of a component while an After variable To fill up a field or expression with a variable, press Ctrl + For further information about variables, see |
Usage
Usage rule |
This component is not startable and it requires two input components |
Creating a custom matching algorithm
The tRecordMatching component enables you to use
a user-defined matching algorithm for obtaining the results you need.
A custom matching algorithm is written manually and stored in a
.jar file (Java archive).
Talend
provides an
example .jar file on the basis of which you are supposed to
develop your own file easily.
-
In Eclipse, check out the test.mydistance project
from svn at: -
In this project, navigate to the Java class
named MyDistance.Java: https://github.com/Talend/tdq-studio-se/tree/master/sample/test.mydistance/src/main/java/org/talend/mydistance. -
Open this file that has the below code:
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849package org.talend.mydistance;import org.talend.dataquality.record.linkage.attribute.AbstractAttributeMatcher;import org.talend.dataquality.record.linkage.constant.AttributeMatcherType;/*** @author scorreia** Example of Matching distance.*/public class MyDistance extends AbstractAttributeMatcher {/** (non-Javadoc)** @see org.talend.dataquality.record.linkage.attribute.IAttributeMatcher#getMatchType()*/@Overridepublic AttributeMatcherType getMatchType() {// a custom implementation should return this type AttributeMatcherType.customreturn AttributeMatcherType.CUSTOM;}/** (non-Javadoc)** @see org.talend.dataquality.record.linkage.attribute.IAttributeMatcher#getMatchingWeight(java.lang.String,* java.lang.String)*/@Overridepublic double getWeight(String arg0, String arg1) {// Here goes the custom implementation of the matching distance between the two given strings.// the algorithm should return a value between 0 and 1.// in this example, we consider that 2 strings match if their first 4 characters are identical// the arguments are not null (the check for nullity is done by the caller)int MAX_CHAR = 4;final int max = Math.min(MAX_CHAR, Math.min(arg0.length(), arg1.length()));int nbIdenticalChar = 0;for (; nbIdenticalChar < max; nbIdenticalChar++) {if (arg0.charAt(nbIdenticalChar) != arg1.charAt(nbIdenticalChar)) {break;}}if (arg0.length() < MAX_CHAR && arg1.length() < MAX_CHAR) {MAX_CHAR = Math.max(arg0.length(), arg1.length());}return (nbIdenticalChar) / ((double) MAX_CHAR);}
-
In this file, type in the class name for the custom algorithm you are
creating in order to replace the default name. The default name is
MyDistance and you can find it in the line:
public class MyDistance implements
.
IAttributeMatcher -
In the place where the default algorithm is in the file, type in the
algorithm you need to create to replace the default one. The default
algorithm reads as follows:123456789101112int MAX_CHAR = 4;final int max = Math.min(MAX_CHAR, Math.min(arg0.length(), arg1.length()));int nbIdenticalChar = 0;for (; nbIdenticalChar < max; nbIdenticalChar++) {if (arg0.charAt(nbIdenticalChar) != arg1.charAt(nbIdenticalChar)) {break;}}if (arg0.length() < MAX_CHAR && arg1.length() < MAX_CHAR) {MAX_CHAR = Math.max(arg0.length(), arg1.length());}return (nbIdenticalChar) / ((double) MAX_CHAR); - Save your modifications.
- Using Eclipse, export this new .jar file.
Then this user-defined algorithm is ready to be used by the tRecordMatching component.
Matching entries using the Q-grams and Levenshtein
algorithms
This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.
This scenario describes a Job which uses a match rule based on the VSR
algorithm. The Job aims at:
-
matching entries in the name column against
the entries in the reference input file by dividing strings into letter blocks
of length q, where q is 3, in order to create a number of q length grams. The matching result is given as the number of q-gram
matches over possible q-grams, -
checking the edit distance between the entries in the email column of an input file against those of the
reference input file.
The outputs of these two matching types are written in three output files:
the first for match values, the second for possible match values and the third for the
values for which there are no matches in the lookup file.
In this scenario, we have already stored the main and reference input
schemas in the Repository. For more information about storing schema metadata in the
Repository, see
Talend Studio User Guide.
The main input table contains seven columns: code,
name, address, zipcode, city, email and col7. We want to carry the fuzzy match on two columns: name and email.
Setting up the Job
-
In the Repository tree view, expand
Metadata – DB
Connections where you have stored the main input schemas and
drop the relevant file onto the design workspace.The Components dialog box
appears. -
Select tMysqlInput and click OK to drop the tMysqlInput component onto the workspace.
The input table used in this scenario is called
person. It holds several columns including the two
columns name and email we want to
do the fuzzy match on. The following capture shows the basic properties of
the main input component: -
Do the same for the second input table you want to use as a reference,
customer in this scenario.The following capture shows the basic properties of the reference input component: - Drop the following components from the Palette onto the design workspace: tRecordMatching and three tLogRow.
-
Connect the main and reference input components to tRecordMatching using Main
links. The link between the reference input table and tRecordMatching displays as a Lookup link on the design workspace. -
Connect tRecordMatching to the three
tLogRow components using the Matches, Possible
Matches and Non Matches
links.
Configuring the components
-
Double-click tRecordMatching to display its
Basic settings view and define its
properties. -
Click the Edit schema button to open a dialog
box. Here you can define the data you want to pass to the output
components.In this example we want to pass to the tRecordMatching component the name and
email columns from the first tMysqlInput component, and the ref_name and
ref_
email columns from the second
tMysqlInput component.The MATCHING_DISTANCE and the
MATCHING_WEIGHT columns in the output schema are
defined by default.The MATCHING_WEIGHT column is always between
0 and 1. It is a global distance
between sets of columns (defined by the columns to be matched).The MATCHING_DISTANCE column will print a distance for
each of the columns on which we use an algorithm. The results will be separated
by a vertical bar (pipe).Click OK to close the dialog box and proceed
to the next step. -
In the Key Definition area of the Basic settings view of tRecordMatching, click the plus button to add two columns to the
list. -
Select the input columns and the output columns you want to do the fuzzy
matching on from the Input key attribute and
Lookup key attribute lists respectively.In this example, select name and email
as input attributes and ref-name and
ref_email as lookup attributes.Note: When you select a date column on which to apply an algorithm or a matching
algorithm, you can decide what to compare in the date format.For example,
if you want to only compare the year in the date, in the component schema set the type
of the date column to Date and then enter “yyyy” in the Date Pattern field.
The component then converts the date format to a string according to the pattern defined
in the schema before starting a string comparison. -
Click in the Matching type column and select
from the list q-gram, the method to be used on
the first column to check the incoming data against the reference data. -
Set the matching type for the second column, Levenshtein in this example.
The minimum and maximum possible match values are defined in the Advanced settings view. You can change the by-default
values. -
From the Tokenized measure list, select not
to use a tokenized distance for the selected algorithms. -
In the Weight column, set a numerical weight
for each of the columns used as key attributes. -
Click in the cell of the Handle Null column
and select the null operator you want to use to handle null attributes in the
columns. -
If required, click the plus button below the Blocking
Selection table to add one or more lines in the table and then
click in the line and select from the list the column you want to use as a
blocking value.Using a blocking value reduces the number of pairs of records that needs to be
examined. The input data is partitioned into exhaustive blocks based on the
blocking value. This will decrease the number of pairs to compare as comparison
is restricted to record pairs within each block. Check Comparing columns and grouping in the output flow duplicate records that have the same functional key for a use case of the blocking value. -
Click the Advanced settings tab to open the
corresponding view and make sure to select the Simple
VSR algorithm. -
Double-click the first tLogRow component to
display its Basic settings view, and select
Table in the Mode area to display the source file and the tRecordMatching results together to be able to
compare them. -
Do the same for the other two tLogRow
components.
Executing the Job
it.
entries, the second show the possible match entries and the third shows the
non match entries according to the used matching method in the defined
columns.
tables.

identical entries in the reference table. Thus the matching distance and the
matching weight are equal to “1.0”.
in the reference table. The matching distance column prints the distances
for the name and email columns and
separate them by a vertical bar.
the reference table.
Using a custom matching algorithm to match entries
This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.
In this scenario, reuse the previous Job to load and apply a user-defined VSR
matching algorithm.
As a prerequisite, follow the steps described in Creating a custom matching algorithm to manually write
a custom algorithm and store it in a .jar file (Java archive). The mydistance.jar file is used here to provide the user-defined matching
algorithm, MyDistance.class.
You will also need to use the tLibraryLoad component to import the Java library into the Job.

Creating a custom matching algorithm
The tRecordMatching component enables you to use
a user-defined matching algorithm for obtaining the results you need.
A custom matching algorithm is written manually and stored in a
.jar file (Java archive).
Talend
provides an
example .jar file on the basis of which you are supposed to
develop your own file easily.
-
In Eclipse, check out the test.mydistance project
from svn at: -
In this project, navigate to the Java class
named MyDistance.Java: https://github.com/Talend/tdq-studio-se/tree/master/sample/test.mydistance/src/main/java/org/talend/mydistance. -
Open this file that has the below code:
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849package org.talend.mydistance;import org.talend.dataquality.record.linkage.attribute.AbstractAttributeMatcher;import org.talend.dataquality.record.linkage.constant.AttributeMatcherType;/*** @author scorreia** Example of Matching distance.*/public class MyDistance extends AbstractAttributeMatcher {/** (non-Javadoc)** @see org.talend.dataquality.record.linkage.attribute.IAttributeMatcher#getMatchType()*/@Overridepublic AttributeMatcherType getMatchType() {// a custom implementation should return this type AttributeMatcherType.customreturn AttributeMatcherType.CUSTOM;}/** (non-Javadoc)** @see org.talend.dataquality.record.linkage.attribute.IAttributeMatcher#getMatchingWeight(java.lang.String,* java.lang.String)*/@Overridepublic double getWeight(String arg0, String arg1) {// Here goes the custom implementation of the matching distance between the two given strings.// the algorithm should return a value between 0 and 1.// in this example, we consider that 2 strings match if their first 4 characters are identical// the arguments are not null (the check for nullity is done by the caller)int MAX_CHAR = 4;final int max = Math.min(MAX_CHAR, Math.min(arg0.length(), arg1.length()));int nbIdenticalChar = 0;for (; nbIdenticalChar < max; nbIdenticalChar++) {if (arg0.charAt(nbIdenticalChar) != arg1.charAt(nbIdenticalChar)) {break;}}if (arg0.length() < MAX_CHAR && arg1.length() < MAX_CHAR) {MAX_CHAR = Math.max(arg0.length(), arg1.length());}return (nbIdenticalChar) / ((double) MAX_CHAR);}
-
In this file, type in the class name for the custom algorithm you are
creating in order to replace the default name. The default name is
MyDistance and you can find it in the line:
public class MyDistance implements
.
IAttributeMatcher -
In the place where the default algorithm is in the file, type in the
algorithm you need to create to replace the default one. The default
algorithm reads as follows:123456789101112int MAX_CHAR = 4;final int max = Math.min(MAX_CHAR, Math.min(arg0.length(), arg1.length()));int nbIdenticalChar = 0;for (; nbIdenticalChar < max; nbIdenticalChar++) {if (arg0.charAt(nbIdenticalChar) != arg1.charAt(nbIdenticalChar)) {break;}}if (arg0.length() < MAX_CHAR && arg1.length() < MAX_CHAR) {MAX_CHAR = Math.max(arg0.length(), arg1.length());}return (nbIdenticalChar) / ((double) MAX_CHAR); - Save your modifications.
- Using Eclipse, export this new .jar file.
Then this user-defined algorithm is ready to be used by the tRecordMatching component.
Setting up the Job
- On the previous Job, drop the tLibraryLoad component from the Palette to the Design workspace.
-
Delete the tLogRow components named
possible and none. -
Connect the tLibraryLoad component to the
tMysqlInput
(person) component using a Trigger > On Subjob Ok link.
Configuring the components
-
Double-click tLibraryLoad to open its
Component view. -
Click the […] button and browse to the
mydistance.jar file. -
Click Windows>Show
view… to open the Modules
view. -
In the Modules view, click
and in the open dialog box, browse to the user-defined
mydistance.jar file created for this Job. -
Click Open.
The user-defined .jar file is imported and listed in the
Modules view.You will get an error message if you try to run the Job without installing the
external user-defined .jar file. -
Double-click tRecordMatching to open its
Component view. -
In the Key Definition table of this view,
click the name row in the Matching
Type column and select custom…
from the drop-down list. -
In the Custom Matcher of this
name row, type in the path pointing to
MyDistance.class in the
mydistance.jar file. In this example, this path is
org.talend.mydistance.MyDistance. -
Click the Advanced settings tab to open the
corresponding view and make sure to select the Simple
VSR algorithm.
Executing the Job
identified and listed as follows:
