August 16, 2023

tMatchModel – Docs for ESB 6.x

tMatchModel

Generates the matching model used by the tMatchPredict component to automatically predict the labels for the suspect pairs and to group the records which match the label(s) set in the component properties.

For further information about the
tMatchPairing and tMatchPredict
components, see the tMatchPairing and
tMatchPredict documentation on Talend Help Center (https://help.talend.com).

tMatchModel reads the sample of suspect pairs output by tMatchPairing after you label the second element in each pair, analyzes the data using the Random Forest algorithm, and generates a matching model.

You can use the sample suspect records labeled in a Grouping campaign defined on the Talend Data Stewardship server with tMatchModel.

For further information about Grouping campaigns, see the Talend Data Stewardship documentation on Talend Help Center (https://help.talend.com).

This component can run only with the following Hadoop distributions, with Spark 1.6 or 2.0:

  • Spark 1.6: CDH5.7, CDH5.8, HDP2.4.0, HDP2.5.0, MapR5.2.0,
    EMR4.5.0, EMR4.6.0.

  • Spark 2.0: EMR5.0.0.

tMatchModel properties for Apache Spark Batch

These properties are used to configure tMatchModel running in the Spark Batch Job framework.

The Spark Batch
tMatchModel component belongs to the Data Quality family.

The component in this framework is available when you have subscribed to any Talend Platform product with Big Data or Talend Data
Fabric.

Basic settings

Define a storage configuration component

Select the configuration component to be used to provide the configuration information for the connection to the target file system, such as HDFS.

If you leave this check box cleared, the target file system is the local system.

The configuration component to be used must be present in the same Job. For
example, if you have dropped a tHDFSConfiguration component in the Job, you can select it to write
the result in a given HDFS system.

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to
be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Sync columns to retrieve the schema from
the previous component connected in the Job.

Click Edit schema to make changes to the schema.
If the current schema is of the Repository type, three
options are available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this
    option to change the schema to Built-in for
    local changes.

  • Update repository connection: choose this
    option to change the schema stored in the repository and decide whether to propagate
    the changes to all the Jobs upon completion. If you just want to propagate the
    changes to the current Job, you can select No
    upon completion and choose this schema metadata again in the [Repository Content] window.

 

Built-In: You create and store the schema locally for this component only. Related topic: see the Talend Studio User Guide.

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see the Talend Studio User Guide.

Matching key

Select the columns on which you want to base the match
computation.

Matching label column

Select the column from the input flow which holds the label you set manually on the suspect pairs of records.

If you select the Integration with Data Stewardship check box, this list does not appear. In this case, the matching label column is the TDS_ARBITRATION_LEVEL column, which holds the label(s) you set on the suspect pairs of records using Talend Data Stewardship.

Matching model location

Select the Save the model on file system check box and, in the Folder field, set the path to the local folder where you want to generate the matching files.

If you want to store the model in a specific file system, for example S3 or HDFS, you must use the corresponding component in the Job and select the Define a storage configuration component check box in the component basic settings.

The button for browsing does not work with the Spark Local mode; if you are using the Spark Yarn or the Spark Standalone mode,
ensure that you have properly configured the connection in a configuration component in
the same Job, such as tHDFSConfiguration.

Integration with Data Stewardship

Select this check box to set the connection parameters to the Talend Data Stewardship server.

If you select this check box, tMatchModel uses the sample suspect records labeled in a Grouping campaign defined on the Talend Data Stewardship server, which means this component can be used as a standalone component.

Data Stewardship Configuration

  • URL:

    Enter the address to access the Talend Data Stewardship server suffixed
    with /data-stewardship/, for example
    http://<server_address>:19999/data-stewardship/.

  • Username and
    Password:

    Enter the authentication information to the Talend Data Stewardship server.

  • Campaign Name:

    A read-only field which shows the campaign name once the campaign is selected.

    Click Find a Campaign to open a dialog box which lists the Grouping campaigns on the server for which you are the campaign owner or to which you have access rights.

    Click the refresh button to retrieve the campaign details from the Talend Data Stewardship server.

  • Campaign Label:

    A read-only field which shows the campaign label once the campaign is selected.

Advanced settings

Max token number for phonetic comparison

Set the maximum number of tokens to be used in the phonetic comparison.

When the number of tokens exceeds what has been defined in this field, no phonetic comparison is done on the string.

Random Forest hyper-parameters tuning

Number of trees range: Enter a range for the number of decision trees you want to build. Each decision tree is trained independently using a random sample of features.

Increasing this range can improve the accuracy by decreasing the variance in predictions, but it also increases the training time.

Maximum tree-depth range: Enter a
range for the decision tree depth at which the training should stop
adding new nodes. New nodes represent further tests on features on
internal nodes and possible class labels held by leaf nodes.

Generally speaking, a deeper decision tree is more expressive and thus potentially more accurate in predictions, but it is also more resource-consuming and prone to overfitting.

Cross-validation parameters

Number of folds: Enter the number of bins which are used as separate training and test datasets.

Evaluation metric type: Select a type from the list. For further information, see Precision and recall.
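The search over the hyper-parameter ranges, evaluated by k-fold cross-validation, can be pictured as a grid search. The sketch below is a plain-Python illustration of that idea only; the function names, the fold layout, and the scoring callback are assumptions for illustration, not the component's actual internals.

```python
from itertools import product

def k_fold_indices(n_rows, n_folds):
    """Split row indices into n_folds bins, each used once as a test set."""
    fold_size = n_rows // n_folds
    folds = []
    for f in range(n_folds):
        start = f * fold_size
        end = start + fold_size if f < n_folds - 1 else n_rows
        folds.append(list(range(start, end)))
    return folds

def cross_validate(score_fn, n_rows, n_folds, params):
    """Average the score of one parameter set over each held-out fold."""
    folds = k_fold_indices(n_rows, n_folds)
    scores = []
    for f, test_idx in enumerate(folds):
        train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
        scores.append(score_fn(train_idx, test_idx, params))
    return sum(scores) / n_folds

def search_best_model(score_fn, n_rows, n_folds, tree_range, depth_range):
    """Evaluate every (number of trees, tree depth) pair in the given
    ranges and keep the parameters with the best cross-validation score."""
    best_params, best_score = None, float("-inf")
    for num_trees, max_depth in product(tree_range, depth_range):
        params = {"numTrees": num_trees, "maxDepth": max_depth}
        score = cross_validate(score_fn, n_rows, n_folds, params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

The model kept at the end is the one whose averaged score over the folds is highest, which mirrors the "keeps the best matching model which comes out as the result of cross validation" behavior described for the scenarios below.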

Random Forest parameters

Subsampling rate: Enter a numeric value to indicate the fraction of the input dataset used for training each tree in the forest. The default value 1.0 is recommended, meaning that the whole dataset is used to train each tree.

Subset Strategy: Select the strategy that determines how many features should be considered on each internal node in order to appropriately split this internal node (actually the training set or subset of features on this node) into smaller subsets. These subsets are used to build child nodes.

Each strategy takes a different number of features into account to find the optimal point among these features for the split. This point could be, for example, the value 35 of the feature age.

  • auto: This strategy is based on the number of trees you have set in the Number of trees range field. This is the default strategy.

    If the number of trees is 1, the strategy is actually all; if this number is greater than 1, the strategy is sqrt.

  • all: The total number
    of features is considered for split.

  • sqrt: The number of
    features to be considered is the square root of the total
    number of features.

  • log2: The number of
    features to be considered is the result of log2(M),
    in which M is the total number of
    features.
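As a rough sketch, the number of features each strategy examines per split can be computed as follows. The rounding choices here are illustrative assumptions; the exact rule is internal to the Random Forest implementation.

```python
import math

def features_per_split(strategy, total_features, num_trees=1):
    """Number of features examined at each internal node for a given
    subset strategy, following the rules described above."""
    if strategy == "auto":
        # auto resolves to all for a single tree, sqrt for a forest
        strategy = "all" if num_trees == 1 else "sqrt"
    if strategy == "all":
        return total_features
    if strategy == "sqrt":
        return max(1, round(math.sqrt(total_features)))
    if strategy == "log2":
        return max(1, round(math.log2(total_features)))
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, with 16 features, sqrt and log2 both examine 4 features per split, while all examines all 16.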

Max Bins

Enter a numeric value to indicate the maximum number of bins used for splitting features.

Continuous features are automatically transformed into ordered discrete features.
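To picture what this transformation does, here is a minimal sketch that maps continuous values to at most max_bins ordered bins. It assumes simple equal-width binning for illustration; the actual implementation chooses candidate split thresholds from the data.

```python
def bin_continuous(values, max_bins):
    """Map each continuous value to one of at most max_bins ordered,
    discrete bins (equal-width binning, for illustration only)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / max_bins
    # Clamp the maximum value into the last bin
    return [min(int((v - lo) / width), max_bins - 1) for v in values]
```

A larger max_bins allows finer-grained split decisions on continuous features at the cost of more computation and memory.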

Min Info gain

Enter the minimum information gain expected from splitting a parent node into child nodes. When the information gain of a split is less than this minimum, the split is stopped.

The default value of the minimum information gain is 0.0: when no further information is obtained by splitting a given node, the splitting can be stopped.

For further information about how the information gain is calculated, see Impurity and Information gain from the Spark documentation.
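The information gain of a candidate split is the parent node's impurity minus the size-weighted impurity of its children. A minimal sketch using entropy as the impurity measure (the label values and helper names are illustrative):

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def info_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two
    child nodes produced by a candidate split."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted
```

A split that separates the labels perfectly yields the maximum gain; a split whose gain falls below Min Info gain would not be performed.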

Min instance per Node

Enter the minimum number of training instances a node should have to make it valid for
further splitting.

The default value is 1, which means when a node has
only 1 row of training data, it stops splitting.

Impurity

Select the measure used to select the best split from
each set of splits.

  • gini: measures how often an element could be incorrectly labelled in a split.

  • entropy: measures how unpredictable the information in each split is.

For further information about how each of the measures is calculated, see Impurity measures from the Spark documentation.
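The two measures can be sketched as small functions over a node's label list (a hedged illustration; the production implementation works on label counts rather than raw label lists):

```python
import math

def gini(labels):
    """Gini impurity: the chance of mislabelling a random element if it
    were labelled according to the node's own label distribution."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def entropy_impurity(labels):
    """Entropy impurity: how unpredictable the labels in the node are."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

Both measures are 0 for a pure node and maximal for an evenly mixed one: a 50/50 binary node scores 0.5 with gini and 1.0 with entropy.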

Set a random seed

Enter the random seed number to be used for bootstrapping and choosing feature subsets.

Data Stewardship Configuration

This field only appears if you selected the Integration with Data Stewardship check box in the Basic settings.

Batch Size: Specify the number of records to be processed in each batch.

Do not change the default value unless you are facing performance issues. Increasing the batch size can improve performance, but setting too high a value could cause Job failures.

Usage

Usage rule

This component is used as an intermediate step.

This component, along with the Spark Batch component Palette it belongs to, appears only
when you are creating a Spark Batch Job.

Spark Batch Connection

You need to use the Spark Configuration tab in
the Run view to define the connection to a given
Spark cluster for the whole Job. In addition, since the Job expects its dependent jar
files for execution, you must specify the directory in the file system to which these
jar files are transferred so that Spark can access these files:

  • Yarn mode: when using Google
    Dataproc, specify a bucket in the Google Storage staging
    bucket
    field in the Spark
    configuration
    tab; when using other distributions, use a
    tHDFSConfiguration
    component to specify the directory.

  • Standalone mode: you need to choose
    the configuration component depending on the file system you are using, such
    as tHDFSConfiguration
    or tS3Configuration.

This connection is effective on a per-Job basis.

Scenario 1: Generating a matching model from a Grouping campaign

This scenario applies only to a subscription-based Talend Platform solution with Big Data or Talend Data Fabric.

tMatchModel reads the sample of suspect pairs computed on a list of duplicate childhood education centers and labeled by data stewards in Talend Data Stewardship. It generates several matching models, searches for the best combination of learning parameters, and keeps the matching model which performs best in cross-validation.

Prerequisites:

Setting up the Job

  1. Drop the tMatchModel component from the
    Palette onto the design workspace.
  2. Check that you have defined the connection to the Spark cluster
    in the Run > Spark
    Configuration
    view as described in Selecting the Spark mode.

Generating the matching model

  1. Double-click tMatchModel to display the
    Basic settings view and define the component
    properties.

    stewardship_job_tmatchmodel.png

  2. In the Matching Key table, click the
    [+] button to add rows in the table and select the
    columns on which you want to base the match computation.

    The Original_Id column is ignored in the computation
    of the matching model.
  3. Select the Save the model on file system check box and
    in the Folder field, set the path to the local folder
    where you want to generate the matching model file.
  4. Select the Integration with Data Stewardship check box
    and set the connection parameters to the Talend Data Stewardship
    server.


    1. In the URL field, enter the address of
      the server suffixed with /data-stewardship/, for example http://localhost:19999/data-stewardship/.

    2. Enter your login information to the server in the
      Username and Password
      fields.

      To enter your password, click the […] button next to the Password field, enter your password between double
      quotes in the dialog box that opens and click OK.

    3. Click Find a campaign to open a dialog
      box which lists the campaigns defined on the server and for which you are the owner or
      you have the access rights.

    4. Select the campaign from which to read the grouping tasks,
      Site deduplication in this example, and click
      OK.
  5. Click Advanced settings and set the following parameters:

    1. Set the maximum number of tokens to be used in the phonetic
      comparison in the corresponding field.
    2. In the Random Forest hyper parameters tuning
      field, enter the ranges for the number of decision trees you want to
      build and their depth.

      These parameters are important for the accuracy of the
      model.
    3. Leave the other default parameters unchanged.
  6. Press F6 to execute the
    Job and generate the matching model in the output folder.

You can now use this model with the tMatchPredict component to
label all the duplicates computed by tMatchPairing.

For further information, see the online publication about
labeling suspect pairs on Talend Help Center (https://help.talend.com).

Scenario 2: Generating a matching model

This scenario applies only to a subscription-based Talend Platform solution with Big Data or Talend Data Fabric.

The tMatchModel component reads the
suspect sample pairs generated by the tMatchPairing component and manually labeled by you.

For further information, see the
tMatchPairing documentation on Talend Help Center (https://help.talend.com).

The tMatchModel component generates several matching models, automatically searches for the best combination of learning parameters, and keeps the matching model which performs best in cross-validation.

The use case described here uses the following components:

  • A tFileInputDelimited component reads the
    source file, which contains the suspect data pairs generated by tMatchPairing.

  • A tMatchModel component
    generates the features from the suspect records, implements the Random Forest
    algorithm and creates a classification model.

Setting up the Job

  • You have generated the suspect data pairs by using the
    tMatchPairing component.

  • You added a label next to the second record in each suspect pair to
    indicate whether it is a duplicate, not a duplicate, or a possible
    duplicate:

    The labels used in this example are YES or
    NO, but you can use any labels you like, and more
    than two.

You can find an example of how to compute suspect pairs and
suspect sample from source data on Talend Help Center (https://help.talend.com).

  1. Drop the following components from the Palette onto the design workspace:
    tFileInputDelimited and tMatchModel.
  2. Connect the components together using the Row > Main link.
  3. Check that you have defined the connection to the Spark cluster in the Run > Spark Configuration view. For more information about selecting the Spark mode, see
    the documentation on Talend Help Center (https://help.talend.com).
use_case-tmatchmodel.png

Configuring the input component

  1. Double-click tFileInputDelimited to open
    its Basic settings view in the Component tab.

    The input data to be used with
    tMatchModel is the suspect data pairs generated
    by tMatchPairing. For an example of how to compute
    suspect pairs and suspect sample from source data, see the documentation in
    Talend Help Center (https://help.talend.com).

  2. Click the […] button next to Edit
    schema
    to open a dialog box and add columns to the input schema:
    Original_Id, Source,
    Site_name and Address,
    PAIR_ID, SCORE and
    LABEL.

    The input schema is the same as the one used with the sample of suspect
    pairs generated by the tMatchPairing component. The LABEL
    column holds the label you manually set on every second record in a
    pair.

    For readability purposes, you can ignore the columns PAIR_ID and SCORE.

  3. Click OK in the dialog box and accept to propagate the
    changes when prompted.
  4. In the Folder/File field, set the path to the input
    file.
  5. Set the row and field separators in the corresponding fields and the header and
    footer, if any.

    In this example, the number of rows in the Header field
    is set to 1.

Generating the matching model

  1. Double-click tMatchModel to display the
    Basic settings view and define the component
    properties.

    use_case-tmatchmodel3.png

  2. In the Matching Key table, click the
    [+] button to add rows in the table and select the
    columns on which you want to base the match computation.

    The Original_Id column is ignored in the computation
    of the matching model.
  3. From the Matching label column list, select the
    column which holds the labels you added to the suspect records.
  4. Select the Save the model on file system check box and
    in the Folder field, set the path to the local folder
    where you want to generate the matching model file.
  5. Click Advanced settings and set the following parameters:

    1. Set the maximum number of tokens to be used in the phonetic
      comparison in the corresponding field.
    2. In the Random Forest hyper parameters tuning
      field, enter the ranges for the number of decision trees you want to
      build and their depth.

      These parameters are important for the accuracy of the
      model.
    3. Leave the other default parameters unchanged.
  6. Press F6 to execute the
    Job and generate the matching model in the output folder.

You can now use this model with the tMatchPredict component to
label all the duplicates computed by tMatchPairing.

For further information, see the online publication about
labeling suspect pairs on Talend Help Center (https://help.talend.com).


Document retrieved from Talend Help Center (https://help.talend.com).