tRecordMatching

Ensures the data quality of any source data against a reference data
source.

tRecordMatching joins two tables by performing a fuzzy
match on several columns, using a wide variety of comparison algorithms. It
compares columns from the main flow with reference columns from the lookup
flow and, according to the matching strategy you define, outputs the matched
data, the possible matches and the rejected data. When defining your matching
strategy, the user-defined matching scores are critical for determining the
match level of the data of interest.

tRecordMatching Standard properties

These properties are used to configure tRecordMatching running in the Standard Job framework.

The Standard tRecordMatching component belongs to the Data Quality family.

This component is available in Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

Basic settings

Schema and Edit schema

A schema is a row description; it defines the number of fields to be
processed and passed on to the next component. The schema is either
Built-in or stored remotely in the
Repository.

 

Built-in: You create and store the
schema locally for this component only. Related topic: see the
Talend Studio User Guide.

 

Repository: You have already created
and stored the schema in the Repository. You can reuse it in other
projects and job designs. Related topic: see the
Talend Studio User Guide.

Replace output column with lookup column if matches or
possible matches

Select this check box to replace the output column with the lookup
column in case of match or possible match values.

Once this check box is selected, the Columns Mapping table appears.

Columns Mapping

Complete this table to define the output columns to be replaced and
the corresponding substitute lookup columns. The columns to complete
are:

Output Column: select the column of
interest from the drop-down list. On this list, two columns,
MATCHING_WEIGHT and
MATCHING_DISTANCES, are defined by default; the
other columns are available only when you have defined them in the
schema editor of this component. For further information about these two
default columns, see Matching entries using the Q-grams and Levenshtein algorithms.

Lookup Column: select the column of
interest from the drop-down list. This list is available when a lookup
link has been defined to provide the corresponding lookup columns to
this component.

tRecordMatching_1.png

Click the import icon to select a match rule from the Studio
repository.

When you click the import icon, a Match Rule
Selector wizard opens to help you import match rules
from the Studio repository and use them in your Job.

You can import rules created with the VSR algorithm. The T-Swoosh
algorithm does not work with this component. For further information
about importing VSR rules, see Importing match rules from the Studio repository.

Input Key Attribute

Select the column(s) from the main flow that needs to be checked
against the reference (lookup) key column.

Note: When you select a date column on which to apply a matching
algorithm, you can decide what to compare in the date format.

For example,
if you want to only compare the year in the date, in the component schema set the type
of the date column to Date and then enter “yyyy” in the Date Pattern field.
The component then converts the date format to a string according to the pattern defined
in the schema before starting a string comparison.
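For illustration, the conversion works like the following stand-alone
sketch (the component does this internally; the code only shows the
effect of the Date Pattern):

    import java.text.SimpleDateFormat;
    import java.util.Date;

    // Stand-alone illustration of the Date Pattern conversion described
    // above: with the pattern "yyyy", only the year survives, so two
    // dates from the same year compare as equal strings.
    public class DatePatternDemo {
        public static void main(String[] args) throws Exception {
            SimpleDateFormat input = new SimpleDateFormat("yyyy-MM-dd");
            Date d1 = input.parse("2001-03-15");
            Date d2 = input.parse("2001-11-02");

            SimpleDateFormat datePattern = new SimpleDateFormat("yyyy"); // the schema's Date Pattern
            String s1 = datePattern.format(d1); // "2001"
            String s2 = datePattern.format(d2); // "2001"
            System.out.println(s1.equals(s2));  // true: only the year is compared
        }
    }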

Lookup Key Attribute

Select the lookup key columns that you will use as a reference against
which to compare the columns from the input flow.

Matching Function

Select the relevant matching algorithm from the list:

Exact Match: matches each processed
entry to all possible reference entries with exactly the same
value.

Levenshtein: based on the edit
distance theory. It calculates the number of insertions, deletions or
substitutions required for an entry to match the reference entry (see
the sketch after this list).

Metaphone: Based on a phonetic
algorithm for indexing entries by their pronunciation. It first loads
the phonetics of all entries of the lookup reference and checks all
entries of the main flow against the entries of the reference
flow.

Double Metaphone: a new version of
the Metaphone phonetic algorithm, that produces more accurate results
than the original algorithm. It can return both a primary and a
secondary code for a string. This accounts for some ambiguous cases as
well as for multiple variants of surnames with common ancestry.

Exact – ignore case: matches each
processed entry to all possible reference entries with exactly the same
value while ignoring the value case.

Soundex: matches processed entries
according to a standard English phonetic algorithm.

Soundex FR: matches processed entries
according to a standard French phonetic algorithm.

Jaro: matches processed entries
according to spelling deviations.

q-grams: matches processed entries by
dividing strings into letter blocks of length q in order to
create a number of q-length grams. The matching result is
given as the number of q-gram matches over the number of possible
q-grams (see the sketch after this list).

Hamming: calculates the minimum
number of substitutions required to transform one string into another
string of the same length. For example, the Hamming distance between
“masking” and “pairing” is 3.

custom…: enables you to load an
external matching algorithm from a Java library. The Custom
Matcher column alongside is activated when you select this
option.

For further information about how to load an external Java library,
see tLibraryLoad.
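To make two of these measures concrete, the following self-contained
sketch computes a Levenshtein distance and a q-gram similarity. It is for
illustration only and is not Talend's internal implementation; in
particular, the normalization tRecordMatching applies to produce its
final scores may differ.

    import java.util.HashSet;
    import java.util.Set;

    // Illustrative sketch of two measures listed above. This is not
    // Talend's internal implementation; its score normalization may differ.
    public class SimilarityDemo {

        // Levenshtein: minimum number of insertions, deletions and
        // substitutions needed to turn one string into the other.
        static int levenshtein(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                                d[i][j - 1] + 1),  // insertion
                                       d[i - 1][j - 1] + cost);    // substitution
                }
            }
            return d[a.length()][b.length()];
        }

        // q-grams: letter blocks of length q extracted from a string.
        static Set<String> qgrams(String s, int q) {
            Set<String> grams = new HashSet<>();
            for (int i = 0; i + q <= s.length(); i++) grams.add(s.substring(i, i + q));
            return grams;
        }

        // Matching result: shared q-grams over possible q-grams.
        static double qgramSimilarity(String a, String b, int q) {
            Set<String> ga = qgrams(a, q), gb = qgrams(b, q);
            Set<String> shared = new HashSet<>(ga);
            shared.retainAll(gb);
            return (double) shared.size() / Math.max(ga.size(), gb.size());
        }

        public static void main(String[] args) {
            System.out.println(levenshtein("John", "Jon"));            // 1 edit
            System.out.println(qgramSimilarity("Smith", "Smiths", 3)); // 3 shared / 4 possible = 0.75
        }
    }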

Custom Matcher

Type in the path pointing to the custom class (external matching
algorithm) you need to use. You define this path yourself in the
library file (.jar file), which you can import by
using the tLibraryLoad component.

For further information, see Creating a custom matching algorithm.

 

Tokenized measure

Tokenization is the concept of splitting a string into words. Select
the method to use to compute a tokenized measure for the selected
algorithm:

NO: no tokenization method is used
on the string. With this option, “John Doe” and “Jon Doe” should
match.

Same place: splits the two strings by
words into two lists, list1 and list2, and associates each element of
list1 with the element at the same position in list2. Using this
method, “She is red and he is pink” and “Catherine is red and he is
pink” should match.

Same order: splits the two strings by
words into two lists, list1 and list2, and assumes that list1 is shorter
than list2. It tries to associate the elements of list1 with the
elements of list2 taken in the same order. Using this method, “John Doe”
and “John B. Doe” match.

This method should be used only with strings that have few words;
otherwise the number of possible combinations can be large.

Any order: splits the two strings by
words into two lists, list1 and list2, and assumes that list1 is shorter
than list2. It tries to assign each word of list1 to a word of list2 so
as to obtain the highest global similarity (with respect to the
similarity measure used).

Using this method, “John Doe” and “Doe John” match (see the sketch below).
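The following minimal sketch mimics the Same order method to show the
idea of a tokenized measure. It is not Talend's implementation;
wordSimilarity is a hypothetical stand-in for whichever matching
function you selected in the Key Definition table.

    // Illustrative sketch of the "Same order" tokenized measure.
    // Not Talend's implementation; wordSimilarity() is a hypothetical
    // stand-in for the matching function selected in Key Definition.
    public class TokenizedDemo {

        // Hypothetical per-word similarity; exact match for simplicity.
        static double wordSimilarity(String a, String b) {
            return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
        }

        // "Same order": align the shorter word list with the longer one
        // while preserving word order (a simple greedy alignment here).
        static double sameOrder(String s1, String s2) {
            String[] shorter = s1.split("\\s+"), longer = s2.split("\\s+");
            if (shorter.length > longer.length) {
                String[] t = shorter; shorter = longer; longer = t;
            }
            double total = 0;
            int j = 0;
            for (String w : shorter) {
                double best = 0;
                int bestJ = j;
                for (int k = j; k < longer.length; k++) { // only look forward: order is preserved
                    double sim = wordSimilarity(w, longer[k]);
                    if (sim > best) { best = sim; bestJ = k; }
                }
                total += best;
                j = bestJ + 1;
            }
            return total / longer.length; // normalize by the longer list
        }

        public static void main(String[] args) {
            System.out.println(sameOrder("John Doe", "John B. Doe")); // 2/3: "B." is unmatched
        }
    }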

Weight

Set a numerical weight for each attribute (column) of the key
definition. The values can be anything >= 0.

Handle Null

To handle null values, select from the list the null operator you want
to use on the column:

Null Match Null: a Null attribute
only matches another Null attribute.

Null Match None: a Null attribute
never matches another attribute.

Null Match All: a Null attribute
matches any other value of an attribute.

For example, suppose we have two columns, name and
firstname, where name is never null but
firstname can be null.

If we have two records:

“Doe”, “John”

“Doe”, “”

Depending on the operator you choose, these two records may or may not
match:

Null Match Null: they do not
match.

Null Match None: they do not
match.

Null Match All: they match.

And for the records:

“Doe”, “”

“Doe”, “”

Null Match Null: they match.

Null Match None: they do not
match.

Null Match All: they match.
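The behavior above boils down to a few lines. Here is a minimal sketch,
where nullSafeMatch is a hypothetical helper, not a Talend API; empty
strings are treated as null, as in the records above:

    // Illustrative summary of the three Handle Null operators.
    // nullSafeMatch() is a hypothetical helper, not a Talend API; empty
    // strings are treated as null here, as in the records above.
    public class NullHandlingDemo {
        enum NullOperator { NULL_MATCH_NULL, NULL_MATCH_NONE, NULL_MATCH_ALL }

        static boolean nullSafeMatch(String a, String b, NullOperator op) {
            boolean aNull = (a == null || a.isEmpty());
            boolean bNull = (b == null || b.isEmpty());
            if (!aNull && !bNull) {
                return a.equals(b); // stand-in for the selected matching function
            }
            switch (op) {
                case NULL_MATCH_NULL: return aNull && bNull; // null only matches null
                case NULL_MATCH_NONE: return false;          // null never matches
                default:              return true;           // NULL_MATCH_ALL
            }
        }

        public static void main(String[] args) {
            // "Doe","John" vs "Doe","": the firstname comparison drives the outcome.
            System.out.println(nullSafeMatch("John", "", NullOperator.NULL_MATCH_NULL)); // false
            System.out.println(nullSafeMatch("John", "", NullOperator.NULL_MATCH_ALL));  // true
            System.out.println(nullSafeMatch("", "", NullOperator.NULL_MATCH_NULL));     // true
        }
    }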

Input Column

If required, select the column(s) from the input flow according to
which you want to partition the processed data in blocks, this is
usually referred to as “blocking”.

Blocking reduces the number of record pairs that need to be
examined. In blocking, input data is partitioned into exhaustive blocks
designed to increase the proportion of matches observed while decreasing
the number of pairs to compare. Comparisons are restricted to record
pairs within each block.

Using blocking column(s) is very useful when you are processing a
large amount of data.
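As an illustration of the idea (not of the component's internals),
blocking amounts to grouping records by the blocking column before any
pairwise comparison:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch of blocking: records are grouped by a blocking
    // column (here, a zipcode) so that pairwise comparison only happens
    // inside each group. The data is hypothetical.
    public class BlockingDemo {
        public static void main(String[] args) {
            String[][] records = {
                { "John Doe", "75001" },
                { "Jon Doe",  "75001" },
                { "Jane Roe", "31000" } };

            // Partition the records into blocks keyed by the blocking column.
            Map<String, List<String>> blocks = new HashMap<>();
            for (String[] r : records) {
                blocks.computeIfAbsent(r[1], k -> new ArrayList<>()).add(r[0]);
            }

            // Only "John Doe" and "Jon Doe" share the "75001" block, so
            // "Jane Roe" is never compared against them.
            for (Map.Entry<String, List<String>> e : blocks.entrySet()) {
                System.out.println(e.getKey() + " -> " + e.getValue());
            }
        }
    }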

Matching strategy
Select the matching output that best fulfils your needs. This option
may be:

  • All matches: outputs all of the matching and possibly matching
    records.

  • Best match: outputs only the record with the highest matching
    score.

  • First match: outputs the first matching record; if no record
    matches, it outputs the first possibly matching record.

  • Last match: outputs the last matching record; if no record
    matches, it outputs the last possibly matching record.

    When you define the matching strategy, the matching scores are
    critical. You need to set them in the
    Possible match interval
    fields of the Advanced
    settings
    view. A sketch of the four strategies follows this list.
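The sketch below pictures the four strategies over a list of scored
candidates; the values and the 0.9 threshold are hypothetical (see
Possible match interval in the Advanced settings):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Illustrative sketch of the four matching strategies applied to a
    // list of scored candidates, in flow order. The scores and the 0.9
    // match threshold are hypothetical.
    public class StrategyDemo {
        public static void main(String[] args) {
            String[] values = { "John Doe", "Jon Doe", "Johan Doe" };
            double[] scores = { 0.95,       0.92,      0.70 }; // 0.70: possible match only
            double matchThreshold = 0.9;

            List<String> matches = new ArrayList<>();
            String best = null;
            double bestScore = -1;
            for (int i = 0; i < values.length; i++) {
                if (scores[i] >= matchThreshold) matches.add(values[i]);
                if (scores[i] > bestScore) { bestScore = scores[i]; best = values[i]; }
            }

            System.out.println("All matches:  " + Arrays.asList(values)); // matches and possible matches
            System.out.println("Best match:   " + best);                  // highest score
            System.out.println("First match:  " + matches.get(0));        // first in flow order
            System.out.println("Last match:   " + matches.get(matches.size() - 1));
        }
    }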

Advanced settings

Matching Algorithm

Select an algorithm from the list. Only one algorithm is available
for the time being:

Simple VSR: This algorithm is based
on a Vector Space Retrieval method that specifies how two records may
match.

For further information about how to import rules based on the VSR
algorithm, see Importing match rules from the studio repository.

Possible match interval

Enter a minimum and a maximum value:

minimum: set the minimum record distance allowed to match the reference
(0 <= minimum).

maximum: set the maximum distance allowed to match the reference
(maximum <= 1).

For example, if you set 0.5 as the minimum value and 0.9 as the
maximum value, scores equal to or higher than 0.9 indicate a match,
scores strictly between 0.5 and 0.9 indicate a possible match, and
all other scores indicate no match.
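A minimal sketch of this classification, using the thresholds from the
example above:

    // Illustrative classification of a matching score against the
    // Possible match interval. Mirrors the example: min = 0.5, max = 0.9.
    public class IntervalDemo {
        enum Outcome { MATCH, POSSIBLE_MATCH, NON_MATCH }

        static Outcome classify(double score, double min, double max) {
            if (score >= max) return Outcome.MATCH;
            if (score > min) return Outcome.POSSIBLE_MATCH; // min excluded, max excluded
            return Outcome.NON_MATCH;
        }

        public static void main(String[] args) {
            System.out.println(classify(0.95, 0.5, 0.9)); // MATCH
            System.out.println(classify(0.70, 0.5, 0.9)); // POSSIBLE_MATCH
            System.out.println(classify(0.40, 0.5, 0.9)); // NON_MATCH
        }
    }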

Store on disk

Select this check box if you want to store processed data blocks on
the disk to maximize system performance.

tStatCatcher Statistics

Select this check box to collect log data at the component
level.

Global Variables

NB_MATCH_LINE: the number of rows matching the comparison
algorithm. This is an After variable and it returns an integer.

NB_POSSIBLE_MATCH_LINE: the number of rows possibly
matching the comparison algorithm. This is an After variable and it returns an
integer.

NB_NONE_MATCH_LINE: the number of rows not matching the
comparison algorithm. This is an After variable and it returns an integer.

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component, while an
After variable functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl +
Space to access the variable list and choose the variable to use from
it.

For further information about variables, see the
Talend Studio User Guide.
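For example, the After variables can be read from the globalMap in a
tJava component linked after the matching subJob; this sketch assumes
the component is labeled tRecordMatching_1:

    // Code for a tJava component linked after the matching subJob with
    // an OnSubjobOk trigger. Assumes the component in the Job is named
    // tRecordMatching_1; adjust the prefix to your component's label.
    Integer matches = (Integer) globalMap.get("tRecordMatching_1_NB_MATCH_LINE");
    Integer possible = (Integer) globalMap.get("tRecordMatching_1_NB_POSSIBLE_MATCH_LINE");
    Integer nonMatches = (Integer) globalMap.get("tRecordMatching_1_NB_NONE_MATCH_LINE");
    System.out.println("match: " + matches
        + ", possible: " + possible
        + ", non-match: " + nonMatches);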

Usage

Usage rule

This component is not startable and it requires two input components
and one or more output components.

Creating a custom matching algorithm

The tRecordMatching component enables you to use
a user-defined matching algorithm for obtaining the results you need.

A custom matching algorithm is written manually and stored in a
.jar file (Java archive).
Talend provides an example .jar file on which you can easily base
your own.

  1. In Eclipse, check out the test.mydistance sample project.
  2. In this project, navigate to the Java class named MyDistance.java:
    https://github.com/Talend/tdq-studio-se/tree/master/sample/test.mydistance/src/main/java/org/talend/mydistance.
  3. Open this file.
  4. In this file, type in the class name for the custom algorithm you are
    creating in order to replace the default name. The default name is
    MyDistance and you can find it in the line: public class MyDistance
    implements IAttributeMatcher.
  5. In the place where the default algorithm is in the file, type in the
    algorithm you need to create to replace the default one (a sketch of the
    class structure is shown after these steps).

  6. Save your modifications.
  7. Using Eclipse, export this new .jar file.

Then this user-defined algorithm is ready to be used by the tRecordMatching component.
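For reference, the shape of such a class is sketched below. This is a
reconstruction for illustration only, not the verbatim sample: the
method shown for the IAttributeMatcher interface is an assumption, so
copy the real signatures and imports from MyDistance.java in the sample
project.

    package org.talend.mydistance;

    // Reconstruction for illustration only. The getMatchingWeight()
    // signature is assumed; copy the real interface import and method
    // signatures from MyDistance.java in the test.mydistance project.
    public class MyDistance implements IAttributeMatcher {

        // Assumed contract: return a similarity score between 0.0
        // (no match) and 1.0 (exact match) for two attribute values.
        public double getMatchingWeight(String str1, String str2) {
            // Replace this placeholder logic with your own algorithm.
            if (str1 == null || str2 == null) {
                return 0.0;
            }
            return str1.equalsIgnoreCase(str2) ? 1.0 : 0.0;
        }
    }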

Matching entries using the Q-grams and Levenshtein
algorithms

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

This scenario describes a Job which uses a match rule based on the VSR
algorithm. The Job aims at:

  • matching entries in the name column against
    the entries in the reference input file by dividing strings into letter
    blocks of length q, where q is 3, in order to create a number of q-length
    grams. The matching result is given as the number of q-gram matches over
    the number of possible q-grams,

  • checking the edit distance between the entries in the email column of an input file against those of the
    reference input file.

The outputs of these two matching types are written in three output files:
the first for match values, the second for possible match values and the third for the
values for which there are no matches in the lookup file.

In this scenario, we have already stored the main and reference input
schemas in the Repository. For more information about storing schema
metadata in the Repository, see the Talend Studio User Guide.

The main input table contains seven columns: code,
name, address, zipcode, city, email and col7. We want to carry out the
fuzzy match on two columns: name and email.

Setting up the Job

  1. In the Repository tree view, expand Metadata > Db Connections
    where you have stored the main input schemas and
    drop the relevant file onto the design workspace.

    The Components dialog box
    appears.
    tRecordMatching_2.png

  2. Select tMysqlInput and click OK to drop the tMysqlInput component onto the workspace.

    The input table used in this scenario is called
    person. It holds several columns including the two
    columns name and email we want to
    do the fuzzy match on. The following capture shows the basic properties of
    the main input component:
    tRecordMatching_3.png

  3. Do the same for the second input table you want to use as a reference,
    customer in this scenario.

    The following capture shows the basic properties of the reference input component:
    tRecordMatching_4.png

  4. Drop the following components from the Palette onto the design workspace: tRecordMatching and three tLogRow components.
  5. Connect the main and reference input components to tRecordMatching using Main
    links. The link between the reference input table and tRecordMatching displays as a Lookup link on the design workspace.
  6. Connect tRecordMatching to the three
    tLogRow components using the Matches, Possible
    Matches
    and Non Matches
    links.

    tRecordMatching_5.png

Configuring the components

  1. Double-click tRecordMatching to display its
    Basic settings view and define its
    properties.

    tRecordMatching_6.png

  2. Click the Edit schema button to open a dialog
    box. Here you can define the data you want to pass to the output
    components.

    tRecordMatching_7.png

    In this example we want to pass to the tRecordMatching component the name and
    email columns from the first tMysqlInput component, and the ref_name and
    ref_email columns from the second tMysqlInput component.
    The MATCHING_DISTANCE and the MATCHING_WEIGHT columns in the output
    schema are defined by default.
    The MATCHING_WEIGHT column is always between
    0 and 1. It is a global distance
    between the sets of columns defined as matching keys.
    The MATCHING_DISTANCE column prints a distance for
    each of the columns on which an algorithm is used. The results are
    separated by a vertical bar (pipe).
    Click OK to close the dialog box and proceed
    to the next step.
  3. In the Key Definition area of the Basic settings view of tRecordMatching, click the plus button to add two columns to the
    list.
  4. Select the input columns and the lookup columns you want to do the fuzzy
    matching on from the Input key attribute and
    Lookup key attribute lists respectively.

    In this example, select name and email
    as input attributes and ref_name and
    ref_email as lookup attributes.
    Note: When you select a date column on which to apply a matching
    algorithm, you can decide what to compare in the date format.

    For example,
    if you want to only compare the year in the date, in the component schema set the type
    of the date column to Date and then enter “yyyy” in the Date Pattern field.
    The component then converts the date format to a string according to the pattern defined
    in the schema before starting a string comparison.

  5. Click in the Matching type column and select
    from the list q-gram, the method to be used on
    the first column to check the incoming data against the reference data.
  6. Set the matching type for the second column, Levenshtein in this example.

    The minimum and maximum possible match values are defined in the Advanced settings view. You can change the default
    values.
  7. From the Tokenized measure list, select NO so that
    no tokenized distance is used for the selected algorithms.
  8. In the Weight column, set a numerical weight
    for each of the columns used as key attributes.
  9. Click in the cell of the Handle Null column
    and select the null operator you want to use to handle null attributes in the
    columns.
  10. If required, click the plus button below the Blocking
    Selection
    table to add one or more lines in the table and then
    click in the line and select from the list the column you want to use as a
    blocking value.

    Using a blocking value reduces the number of record pairs that need to be
    examined. The input data is partitioned into exhaustive blocks based on the
    blocking value. This decreases the number of pairs to compare, as comparison
    is restricted to record pairs within each block. See Comparing columns and grouping in the output flow duplicate records that have the same functional key for a use case of the blocking value.
  11. Click the Advanced settings tab to open the
    corresponding view and make sure to select the Simple
    VSR
    algorithm.
  12. Double-click the first tLogRow component to
    display its Basic settings view, and select
    Table in the Mode area to display the source file and the tRecordMatching results together to be able to
    compare them.

    tRecordMatching_8.png

  13. Do the same for the other two tLogRow
    components.

Executing the Job

Save your Job and press F6 to execute
it.

Three output tables are written to the console. The first shows the
match entries, the second shows the possible match entries and the third
shows the non-match entries, according to the matching method used on
the defined columns.
The figure below illustrates extractions of the three output
tables.
tRecordMatching_9.png

The first table lists all the names and emails that could be matched with
identical entries in the reference table; thus the matching distance and
the matching weight are equal to “1.0”.
The second table lists all the names and emails that have a possible match
in the reference table. The matching distance column prints the distances
for the name and email columns and
separates them by a vertical bar, which downstream components can split
as in the sketch below.
The third table lists all the names and emails that have no match in
the reference table.
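Because the distances are pipe-separated, a downstream component can
split them per column. A minimal sketch, assuming the two matching keys
of this scenario and a hypothetical output value:

    // Illustrative parsing of a pipe-separated MATCHING_DISTANCE value,
    // assuming two matching keys (name, email) as in this scenario.
    public class DistanceParseDemo {
        public static void main(String[] args) {
            String matchingDistance = "0.8|1.0"; // hypothetical value from the output flow
            String[] parts = matchingDistance.split("\\|"); // the pipe must be escaped in a regex
            double nameDistance = Double.parseDouble(parts[0]);  // distance on the name column
            double emailDistance = Double.parseDouble(parts[1]); // distance on the email column
            System.out.println("name: " + nameDistance + ", email: " + emailDistance);
        }
    }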

Using a custom matching algorithm to match entries

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

In this scenario, reuse the previous Job to load and apply a user-defined VSR
matching algorithm.

As a prerequisite, follow the steps described in Creating a custom matching algorithm to manually write
a custom algorithm and store it in a .jar file (Java archive). The mydistance.jar file is used here to provide the user-defined matching
algorithm, MyDistance.class.

You will also need to use the tLibraryLoad component to import the Java library into the Job.

tRecordMatching_10.png

Setting up the Job

  1. On the previous Job, drop the tLibraryLoad component from the Palette to the Design workspace.
  2. Delete the tLogRow components named
    possible and none.
  3. Connect the tLibraryLoad component to the
    tMysqlInput
    (person) component using a Trigger > On Subjob Ok link.

Configuring the components

  1. Double-click tLibraryLoad to open its
    Component view.

    tRecordMatching_11.png

  2. Click the […] button and browse to the
    mydistance.jar file.
  3. Click Window > Show View… to open the
    Modules view.
  4. In the Modules view, click

    tRecordMatching_12.png

    and in the dialog box that opens, browse to the user-defined
    mydistance.jar file created for this Job.

  5. Click Open.

    The user-defined .jar file is imported and listed in the
    Modules view.
    You will get an error message if you try to run the Job without installing the
    external user-defined .jar file.
  6. Double-click tRecordMatching to open its
    Component view.

    tRecordMatching_13.png

  7. In the Key Definition table of this view,
    click the name row in the Matching
    Type
    column and select custom…
    from the drop-down list.
  8. In the Custom Matcher column of this
    name row, type in the path pointing to
    MyDistance.class in the
    mydistance.jar file. In this example, this path is
    org.talend.mydistance.MyDistance.
  9. Click the Advanced settings tab to open the
    corresponding view and make sure to select the Simple
    VSR
    algorithm.

Executing the Job

Press F6 to run this Job.

In the Run view, the matched entries are
identified and listed as follows:
tRecordMatching_14.png

