August 16, 2023

tMatchModel – Docs for ESB 6.x

tMatchModel

Generates the matching model used by the tMatchPredict component to automatically predict the labels for the suspect pairs and to group the records which match the label(s) set in the component properties.

For further information about the
tMatchPairing and tMatchPredict
components, see the tMatchPairing and
tMatchPredict documentation on Talend Help Center (https://help.talend.com).

tMatchModel reads the sample of suspect pairs output by tMatchPairing after you label the second element in each pair, analyzes the data using the Random Forest algorithm, and generates a matching model.

You can use the sample suspect records labeled in a Grouping campaign defined on the Talend Data Stewardship server with tMatchModel.

For further information about Grouping campaigns, see the Talend Data Stewardship documentation on Talend Help Center (https://help.talend.com).

This component can run only with the following Hadoop distributions, with Spark 1.6 or 2.0:

  • Spark 1.6: CDH5.7, CDH5.8, HDP2.4.0, HDP2.5.0, MapR5.2.0,
    EMR4.5.0, EMR4.6.0.

  • Spark 2.0: EMR5.0.0.

tMatchModel properties for Apache Spark Batch

These properties are used to configure tMatchModel running in the Spark Batch Job framework.

The Spark Batch
tMatchModel component belongs to the Data Quality family.

The component in this framework is available when you have subscribed to any Talend Platform product with Big Data or Talend Data
Fabric.

Basic settings

Define a storage configuration component

Select the configuration component to be used to provide the configuration information for the connection to the target file system, such as HDFS.

If you leave this check box cleared, the target file system is the local system.

The configuration component to be used must be present in the same Job. For
example, if you have dropped a tHDFSConfiguration component in the Job, you can select it to write
the result in a given HDFS system.

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to
be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Sync columns to retrieve the schema from
the previous component connected in the Job.

Click Edit schema to make changes to the schema.
If the current schema is of the Repository type, three
options are available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this
    option to change the schema to Built-in for
    local changes.

  • Update repository connection: choose this
    option to change the schema stored in the repository and decide whether to propagate
    the changes to all the Jobs upon completion. If you just want to propagate the
    changes to the current Job, you can select No
    upon completion and choose this schema metadata again in the [Repository Content] window.

 

Built-In: You create and store the schema locally for this component only. Related topic: see the Talend Studio User Guide.

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see the Talend Studio User Guide.

Matching key

Select the columns on which you want to base the match
computation.

Matching label column

Select the column from the input flow which holds the label you set manually on the suspect pairs of records.

If you select the Integration with Data Stewardship check box, this list does not appear. In this case, the matching label column is the TDS_ARBITRATION_LEVEL column, which holds the label(s) you set on the suspect pairs of records using Talend Data Stewardship.

Matching model location

Select the Save the model on file system check box and, in the Folder field, set the path to the local folder where you want to generate the matching files.

If you want to store the model in a specific file system, for example S3 or HDFS, you must use the corresponding component in the Job and select the Define a storage configuration component check box in the component basic settings.

The button for browsing does not work with the Spark Local mode; if you are using the Spark Yarn or the Spark Standalone mode,
ensure that you have properly configured the connection in a configuration component in
the same Job, such as tHDFSConfiguration.

Integration with Data Stewardship

Select this check box to set the connection parameters to the Talend Data Stewardship server.

If you select this check box, tMatchModel uses the sample suspect records labeled in a Grouping campaign defined on the Talend Data Stewardship server, which means this component can be used as a standalone component.

Data Stewardship Configuration

  • URL:

    Enter the address to access the Talend Data Stewardship server suffixed
    with /data-stewardship/, for example
    http://<server_address>:19999/data-stewardship/.

  • Username and
    Password:

    Enter the authentication information to the Talend Data Stewardship server.

  • Campaign Name:

    A read-only field which shows the campaign name once the campaign is selected.

    Click Find a Campaign to open a dialog box which lists the Grouping campaigns on the server for which you are the campaign owner or to which you have access rights.

    Click the refresh button to retrieve the campaign details from the Talend Data Stewardship server.

  • Campaign Label:

    A read-only field which shows the campaign label once the campaign is selected.

Advanced settings

Max token number for phonetic comparison

Set the maximum number of tokens to be used in the phonetic comparison.

When the number of tokens exceeds what has been defined in this field, no phonetic comparison is done on the string.

Random Forest hyper-parameters tuning

Number of trees range: Enter a range for the number of decision trees you want to build. Each decision tree is trained independently using a random sample of features.

Increasing this range can improve the accuracy by decreasing the variance in predictions, but it also increases the training time.

Maximum tree-depth range: Enter a
range for the decision tree depth at which the training should stop
adding new nodes. New nodes represent further tests on features on
internal nodes and possible class labels held by leaf nodes.

Generally speaking, a deeper decision tree is more expressive and thus potentially more accurate in predictions, but it is also more resource-consuming and prone to overfitting.

Cross-validation parameters

Number of folds: Enter the number of bins which are used as separate training and test datasets.

Evaluation metric type: Select a type from the list. For further information, see Precision and recall.
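The search over the hyper-parameter ranges, evaluated by k-fold cross-validation, can be pictured as a grid search. The sketch below is a plain-Python illustration of that idea only; the function names, the fold layout, and the scoring callback are assumptions for illustration, not the component's actual internals.

```python
from itertools import product

def k_fold_indices(n_rows, n_folds):
    """Split row indices into n_folds bins, each used once as a test set."""
    fold_size = n_rows // n_folds
    folds = []
    for f in range(n_folds):
        start = f * fold_size
        end = start + fold_size if f < n_folds - 1 else n_rows
        folds.append(list(range(start, end)))
    return folds

def cross_validate(score_fn, n_rows, n_folds, params):
    """Average the score of one parameter set over each held-out fold."""
    folds = k_fold_indices(n_rows, n_folds)
    scores = []
    for f, test_idx in enumerate(folds):
        train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
        scores.append(score_fn(train_idx, test_idx, params))
    return sum(scores) / n_folds

def search_best_model(score_fn, n_rows, n_folds, tree_range, depth_range):
    """Evaluate every (number of trees, tree depth) pair in the given
    ranges and keep the parameters with the best cross-validation score."""
    best_params, best_score = None, float("-inf")
    for num_trees, max_depth in product(tree_range, depth_range):
        params = {"numTrees": num_trees, "maxDepth": max_depth}
        score = cross_validate(score_fn, n_rows, n_folds, params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

The model kept at the end is the one whose averaged score over the folds is highest, which mirrors the "keeps the best matching model which comes out as the result of cross validation" behavior described for the scenarios below.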

Random Forest parameters

Subsampling rate: Enter a numeric value to indicate the fraction of the input dataset used for training each tree in the forest. The default value 1.0 is recommended, meaning that the whole dataset is used to train each tree.

Subset Strategy: Select the strategy that determines how many features should be considered on each internal node in order to appropriately split this internal node (actually the training set or subset of features on this node) into smaller subsets. These subsets are used to build child nodes.

Each strategy takes a different number of features into account to find the optimal point among these features for the split. This point could be, for example, the value 35 of the feature age.

  • auto: This strategy is based on the number of trees you have set in the Number of trees range field. This is the default strategy.

    If the number of trees is 1, the strategy is actually all; if this number is greater than 1, the strategy is sqrt.

  • all: The total number
    of features is considered for split.

  • sqrt: The number of
    features to be considered is the square root of the total
    number of features.

  • log2: The number of
    features to be considered is the result of log2(M),
    in which M is the total number of
    features.
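As a rough sketch, the number of features each strategy examines per split can be computed as follows. The rounding choices here are illustrative assumptions; the exact rule is internal to the Random Forest implementation.

```python
import math

def features_per_split(strategy, total_features, num_trees=1):
    """Number of features examined at each internal node for a given
    subset strategy, following the rules described above."""
    if strategy == "auto":
        # auto resolves to all for a single tree, sqrt for a forest
        strategy = "all" if num_trees == 1 else "sqrt"
    if strategy == "all":
        return total_features
    if strategy == "sqrt":
        return max(1, round(math.sqrt(total_features)))
    if strategy == "log2":
        return max(1, round(math.log2(total_features)))
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, with 16 features, sqrt and log2 both examine 4 features per split, while all examines all 16.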

Max Bins

Enter a numeric value to indicate the maximum number of bins used for splitting features.

Continuous features are automatically transformed into ordered discrete features.
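To picture what this transformation does, here is a minimal sketch that maps continuous values to at most max_bins ordered bins. It assumes simple equal-width binning for illustration; the actual implementation chooses candidate split thresholds from the data.

```python
def bin_continuous(values, max_bins):
    """Map each continuous value to one of at most max_bins ordered,
    discrete bins (equal-width binning, for illustration only)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / max_bins
    # Clamp the maximum value into the last bin
    return [min(int((v - lo) / width), max_bins - 1) for v in values]
```

A larger max_bins allows finer-grained split decisions on continuous features at the cost of more computation and memory.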

Min Info gain

Enter the minimum information gain expected from splitting a parent node into child nodes. When the information gain of a split is less than this minimum, the split is stopped.

The default value of the minimum information gain is 0.0: when no further information is obtained by splitting a given node, the splitting can be stopped.

For further information about how the information gain is calculated, see Impurity and Information gain from the Spark documentation.
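The information gain of a candidate split is the parent node's impurity minus the size-weighted impurity of its children. A minimal sketch using entropy as the impurity measure (the label values and helper names are illustrative):

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def info_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two
    child nodes produced by a candidate split."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted
```

A split that separates the labels perfectly yields the maximum gain; a split whose gain falls below Min Info gain would not be performed.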

Min instance per Node

Enter the minimum number of training instances a node should have to make it valid for
further splitting.

The default value is 1, which means when a node has
only 1 row of training data, it stops splitting.

Impurity

Select the measure used to select the best split from
each set of splits.

  • gini: measures how often an element could be incorrectly labelled in a split.

  • entropy: measures how unpredictable the information in each split is.

For further information about how each of the measures is calculated, see Impurity measures from the Spark documentation.
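The two measures can be sketched as small functions over a node's label list (a hedged illustration; the production implementation works on label counts rather than raw label lists):

```python
import math

def gini(labels):
    """Gini impurity: the chance of mislabelling a random element if it
    were labelled according to the node's own label distribution."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def entropy_impurity(labels):
    """Entropy impurity: how unpredictable the labels in the node are."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

Both measures are 0 for a pure node and maximal for an evenly mixed one: a 50/50 binary node scores 0.5 with gini and 1.0 with entropy.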

Set a random seed

Enter the random seed number to be used for bootstrapping and choosing feature subsets.

Data Stewardship Configuration

This field only appears if you selected the Integration with Data Stewardship check box in the Basic settings.

Batch Size: Specify the number of records to be processed in each batch.

Do not change the default value unless you are facing performance issues. Increasing the batch size can improve performance, but setting too high a value could cause Job failures.

Usage

Usage rule

This component is used as an intermediate step.

This component, along with the Spark Batch component Palette it belongs to, appears only
when you are creating a Spark Batch Job.

Spark Batch Connection

You need to use the Spark Configuration tab in
the Run view to define the connection to a given
Spark cluster for the whole Job. In addition, since the Job expects its dependent jar
files for execution, you must specify the directory in the file system to which these
jar files are transferred so that Spark can access these files:

  • Yarn mode: when using Google
    Dataproc, specify a bucket in the Google Storage staging
    bucket
    field in the Spark
    configuration
    tab; when using other distributions, use a
    tHDFSConfiguration
    component to specify the directory.

  • Standalone mode: you need to choose
    the configuration component depending on the file system you are using, such
    as tHDFSConfiguration
    or tS3Configuration.

This connection is effective on a per-Job basis.

Scenario 1: Generating a matching model from a Grouping campaign

This scenario applies only to a subscription-based Talend Platform solution with Big Data or Talend Data Fabric.

tMatchModel reads the sample of suspect pairs computed on a list of duplicate childhood education centers and labeled by data stewards in Talend Data Stewardship. It generates several matching models, searches for the best combination of learning parameters, and keeps the matching model which performs best in cross-validation.

Prerequisites:

Setting up the Job

  1. Drop the tMatchModel component from the
    Palette onto the design workspace.
  2. Check that you have defined the connection to the Spark cluster
    in the Run > Spark
    Configuration
    view as described in Selecting the Spark mode.

Generating the matching model

  1. Double-click tMatchModel to display the
    Basic settings view and define the component
    properties.

    stewardship_job_tmatchmodel.png

  2. In the Matching Key table, click the
    [+] button to add rows in the table and select the
    columns on which you want to base the match computation.

    The Original_Id column is ignored in the computation
    of the matching model.
  3. Select the Save the model on file system check box and
    in the Folder field, set the path to the local folder
    where you want to generate the matching model file.
  4. Select the Integration with Data Stewardship check box
    and set the connection parameters to the Talend Data Stewardship
    server.


    1. In the URL field, enter the address of
      the server suffixed with /data-stewardship/, for example http://localhost:19999/data-stewardship/.

    2. Enter your login information to the server in the
      Username and Password
      fields.

      To enter your password, click the […] button next to the Password field, enter your password between double
      quotes in the dialog box that opens and click OK.

    3. Click Find a campaign to open a dialog
      box which lists the campaigns defined on the server and for which you are the owner or
      you have the access rights.

    4. Select the campaign from which to read the grouping tasks,
      Site deduplication in this example, and click
      OK.
  5. Click Advanced settings and set the following parameters:

    1. Set the maximum number of tokens to be used in the phonetic
      comparison in the corresponding field.
    2. In the Random Forest hyper parameters tuning
      field, enter the ranges for the number of decision trees you want to
      build and their depth.

      These parameters are important for the accuracy of the
      model.
    3. Leave the other default parameters unchanged.
  6. Press F6 to execute the
    Job and generate the matching model in the output folder.

You can now use this model with the tMatchPredict component to
label all the duplicates computed by tMatchPairing.

For further information, see the online publication about
labeling suspect pairs on Talend Help Center (https://help.talend.com).

Scenario 2: Generating a matching model

This scenario applies only to a subscription-based Talend Platform solution with Big Data or Talend Data Fabric.

The tMatchModel component reads the
suspect sample pairs generated by the tMatchPairing component and manually labeled by you.

For further information, see the
tMatchPairing documentation on Talend Help Center (https://help.talend.com).

The tMatchModel component generates several matching models, automatically searches for the best combination of learning parameters, and keeps the matching model which performs best in cross-validation.

The use case described here uses the following components:

  • A tFileInputDelimited component reads the
    source file, which contains the suspect data pairs generated by tMatchPairing.

  • A tMatchModel component
    generates the features from the suspect records, implements the Random Forest
    algorithm and creates a classification model.

Setting up the Job

  • You have generated the suspect data pairs by using the
    tMatchPairing component.

  • You added a label next to the second record in each suspect pair to
    indicate whether it is a duplicate, not a duplicate, or a possible
    duplicate:

    The labels used in this example are YES or
    NO, but you can use any labels you like, and more
    than two.

You can find an example of how to compute suspect pairs and
suspect sample from source data on Talend Help Center (https://help.talend.com).

  1. Drop the following components from the Palette onto the design workspace:
    tFileInputDelimited and tMatchModel.
  2. Connect the components together using the Row > Main link.
  3. Check that you have defined the connection to the Spark cluster in the Run > Spark Configuration view. For more information about selecting the Spark mode, see
    the documentation on Talend Help Center (https://help.talend.com).
use_case-tmatchmodel.png

Configuring the input component

  1. Double-click tFileInputDelimited to open
    its Basic settings view in the Component tab.

    The input data to be used with
    tMatchModel is the suspect data pairs generated
    by tMatchPairing. For an example of how to compute
    suspect pairs and suspect sample from source data, see the documentation in
    Talend Help Center (https://help.talend.com).

  2. Click the […] button next to Edit
    schema
    to open a dialog box and add columns to the input schema:
    Original_Id, Source,
    Site_name and Address,
    PAIR_ID, SCORE and
    LABEL.

    The input schema is the same as the one used with the sample of suspect
    pairs generated by the tMatchPairing component. The LABEL
    column holds the label you manually set on every second record in a
    pair.

    For readability purposes, you can ignore the columns PAIR_ID and SCORE.

  3. Click OK in the dialog box and accept to propagate the
    changes when prompted.
  4. In the Folder/File field, set the path to the input
    file.
  5. Set the row and field separators in the corresponding fields and the header and
    footer, if any.

    In this example, the number of rows in the Header field
    is set to 1.

Generating the matching model

  1. Double-click tMatchModel to display the
    Basic settings view and define the component
    properties.

    use_case-tmatchmodel3.png

  2. In the Matching Key table, click the
    [+] button to add rows in the table and select the
    columns on which you want to base the match computation.

    The Original_Id column is ignored in the computation
    of the matching model.
  3. From the Matching label column list, select the
    column which holds the labels you added to the suspect records.
  4. Select the Save the model on file system check box and
    in the Folder field, set the path to the local folder
    where you want to generate the matching model file.
  5. Click Advanced settings and set the following parameters:

    1. Set the maximum number of tokens to be used in the phonetic
      comparison in the corresponding field.
    2. In the Random Forest hyper parameters tuning
      field, enter the ranges for the number of decision trees you want to
      build and their depth.

      These parameters are important for the accuracy of the
      model.
    3. Leave the other default parameters unchanged.
  6. Press F6 to execute the
    Job and generate the matching model in the output folder.

You can now use this model with the tMatchPredict component to
label all the duplicates computed by tMatchPairing.

For further information, see the online publication about
labeling suspect pairs on Talend Help Center (https://help.talend.com).


Document retrieved from Talend Help Center (https://help.talend.com).