tMatchPredict – Docs for ESB 7.x

tMatchPredict

Labels suspect records automatically and groups suspect records which match
the label(s) set in the component
properties.

tMatchPredict labels suspect pairs based on
the pairing and matching models generated by the tMatchPairing and
tMatchModel components.

If the input data is new and has not
been paired previously, you can define the input data as “unpaired” and set the path to the
pairing model folder to separate the exact duplicates from unique
records.

tMatchPredict can also output unique records, exact
duplicates and suspect duplicates from a new data set.

This component runs with Apache
Spark 1.6.0 and later versions.

tMatchPredict properties for Apache Spark Batch

These properties are used to configure tMatchPredict running in the Spark Batch Job framework.

The Spark Batch
tMatchPredict component belongs to the Data Quality family.

This component is available in Talend Platform products with Big Data and
in Talend Data Fabric.

Basic settings

Define a storage configuration
component

Select the configuration component to be used to provide the configuration
information for the connection to the target file system such as HDFS.

If you leave this check box clear, the target file system is the local
system.

The configuration component to be used must be present in the same Job.
For example, if you have dropped a tHDFSConfiguration component in the Job, you can select it to write
the result in a given HDFS system.

Schema and Edit Schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

Click Sync columns to retrieve the schema from the previous component connected in the Job.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this
    option to view the schema only.

  • Change to built-in property:
    choose this option to change the schema to Built-in for local changes.

  • Update repository connection:
    choose this option to change the schema stored in the repository and decide whether
    to propagate the changes to all the Jobs upon completion. If you just want to
    propagate the changes to the current Job, you can select No upon completion and choose this schema metadata
    again in the Repository Content
    window.

The output schema of this component has read-only columns
in its output links:

LABEL: used only with the Suspect duplicates output link. It holds the
prediction labels.

COUNT: used only with the Exact duplicates output link. It holds the
number of exact duplicates.

GROUPID: used only with the Suspect duplicates output link. It holds the
group identifiers.

CONFIDENCE_SCORE: indicates the confidence score of a
prediction for a pair or cluster. If you set a Clustering classes label, a confidence
score is computed for each pair in the cluster. The confidence score in
the output column is the lowest one.
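
For illustration only (hypothetical values), a pair written to the Suspect
duplicates link could carry LABEL = YES, GROUPID = 0 and
CONFIDENCE_SCORE = 0.92, while a record written to the Exact duplicates link
could carry COUNT = 3.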

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Pairing

From the Input type list, select:

paired: to use as input the
suspect duplicates generated by the tMatchPairing component.

unpaired: to use as input new
data set which has not been paired by tMatchPairing.

Pairing model folder: (available only with the unpaired input type) Set the
path to the folder which has the model files generated by the tMatchPairing
component.
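
For example (hypothetical path), if the tMatchPairing Job wrote its model
files to HDFS under /user/talend/models/pairing, set this field to that same
folder.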

The button for browsing does not work with the Spark Local mode; if you are
using any of the other Spark Yarn modes that the Studio supports with your
distribution, ensure that you have properly configured the connection in a
configuration component in the same Job, such as tHDFSConfiguration. Use the
configuration component that corresponds to the file system to be used.

For further information, see tMatchPairing.

Matching

Matching model location: Select from the list where to get the model file
generated by the classification Job with the tMatchModel component:

from file system: Set the path to the folder where the model file is
generated by the classification component. For further information, see
tMatchModel.

from current Job: Set the name of the model file generated by the
classification component. You can use this option only if the classification
Job with the tMatchModel component is integrated in the Job with the
tMatchPredict component.

Matching model folder: Set the
path to the folder which has the model files generated by the tMatchModel component.

The button for browsing does not work with the Spark Local mode; if you are
using any of the other Spark Yarn modes that the Studio supports with your
distribution, ensure that you have properly configured the connection in a
configuration component in the same Job, such as tHDFSConfiguration. Use the
configuration component that corresponds to the file system to be used.

For further information, see tMatchModel.

Clustering classes

In the table, add one or more labels used
on the sample suspects generated by tMatchPairing.

The component then groups suspect records
which match the label(s) set in the table.

If you labeled a sample of suspect records using Talend Data Stewardship, add the answer(s) defined in the
Grouping campaign to the table.

The field is case-sensitive.

Advanced settings

Set Checkpoint Interval

Set the frequency of checkpoints. It is recommended to leave the default
value (2).

Before setting a value for this parameter, activate
checkpointing and set the checkpoint directory in the Spark Configuration tab of the
Run view.

For further information about checkpointing the
activities of your Apache Spark Job, see the documentation on Talend Help Center (https://help.talend.com).

Use Timestamp format for Date type

Select the check box to output dates, hours, minutes and seconds contained in your
Date-type data. If you clear this check box, only years, months and days are
output.

The format used by Deltalake is yyyy-MM-dd HH:mm:ss.
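
As an illustrative example, with the check box selected a Date value is
output as 2016-07-21 11:33:54; with the check box cleared, the same value is
output as 2016-07-21.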

Usage

Usage rule

This component is used as an intermediate step.

This component, along with the Spark Batch component Palette it belongs to,
appears only when you are creating a Spark Batch Job.

Note that in this documentation, unless otherwise explicitly stated, a
scenario presents only Standard Jobs, that is to say traditional Talend data
integration Jobs.

Spark Batch Connection

In the Spark Configuration tab in the Run view, define the connection to a
given Spark cluster for the whole Job. In addition, since the Job expects its
dependent jar files for execution, you must specify the directory in the file
system to which these jar files are transferred so that Spark can access
these files:

  • Yarn mode (Yarn client or Yarn cluster):

    • When using Google Dataproc, specify a bucket in the Google Storage
      staging bucket field in the Spark configuration tab.

    • When using HDInsight, specify the blob to be used for Job deployment
      in the Windows Azure Storage configuration area in the Spark
      configuration tab.

    • When using Altus, specify the S3 bucket or the Azure Data Lake Storage
      for Job deployment in the Spark configuration tab.
    • When using Qubole, add a
      tS3Configuration to your Job to write
      your actual business data in the S3 system with Qubole. Without
      tS3Configuration, this business data is
      written in the Qubole HDFS system and destroyed once you shut
      down your cluster.
    • When using on-premise
      distributions, use the configuration component corresponding
      to the file system your cluster is using. Typically, this
      system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the
    configuration component corresponding to the file system your cluster is
    using, such as tHDFSConfiguration or
    tS3Configuration.

    If you are using Databricks without any configuration component present
    in your Job, your business data is written directly in DBFS (Databricks
    Filesystem).

This connection is effective on a per-Job basis.

Labeling suspect pairs with assigned labels

This scenario applies only to subscription-based Talend Platform products with Big Data and Talend Data Fabric.

For further information about the two workflows used when
matching with Spark, see the documentation on Talend Help Center (https://help.talend.com).

The use case described here uses:

  • a tFileInputDelimited
    component to read the input suspect pairs generated by tMatchPairing;

  • a tMatchPredict component to label suspect records automatically and
    group together the suspect records which match the label set in the
    component properties; and

  • a tFileOutputDelimited component to output the labeled duplicate
    records and the groups created on the suspect records which match the label set
    in tMatchPredict properties.

Setting up the Job

  1. Drop the following components from the Palette onto the design workspace: tFileInputDelimited, tMatchPredict and
    tFileOutputDelimited.
  2. Connect tFileInputDelimited to tMatchPredict using the Main link.
  3. Connect tMatchPredict to
    tFileOutputDelimited using the Suspect duplicates link.
  4. Check that you have defined the connection to the Spark cluster and activated
    checkpointing in the Run > Spark Configuration view. For more information about selecting the Spark mode, see
    the documentation on Talend Help Center (https://help.talend.com).
tMatchPredict_1.png

Selecting the Spark mode

Depending on the Spark cluster to be used, select a Spark mode for your Job.

The Spark documentation provides an exhaustive list of Spark properties and
their default values at Spark Configuration. A Spark Job designed in the Studio uses
this default configuration except for the properties you explicitly defined in the
Spark Configuration tab or the components
used in your Job.

  1. Click Run to open its view and then click the
    Spark Configuration tab to display its view
    for configuring the Spark connection.
  2. Select the Use local mode check box to test your Job locally.

    In the local mode, the Studio builds the Spark environment within itself
    on the fly in order to run the Job. Each processor of the local machine
    is used as a Spark worker to perform the computations.

    In this mode, your local file system is used; therefore, deactivate the
    configuration components such as tS3Configuration or
    tHDFSConfiguration that provide connection
    information to a remote file system, if you have placed these components
    in your Job.

    You can launch
    your Job without any further configuration.

  3. Clear the Use local mode check box to display the
    list of the available Hadoop distributions and from this list, select
    the distribution corresponding to your Spark cluster to be used.

    This distribution could be:

    • Databricks

    • Qubole

    • Amazon EMR

      For this distribution, Talend supports:

      • Yarn client

      • Yarn cluster

    • Cloudera

      For this distribution, Talend supports:

      • Standalone

      • Yarn client

      • Yarn cluster

    • Google Cloud
      Dataproc

      For this distribution, Talend supports:

      • Yarn client

    • Hortonworks

      For this distribution, Talend supports:

      • Yarn client

      • Yarn cluster

    • MapR

      For this distribution, Talend supports:

      • Standalone

      • Yarn client

      • Yarn cluster

    • Microsoft HD
      Insight

      For this distribution, Talend supports:

      • Yarn cluster

    • Cloudera Altus

      For this distribution, Talend supports:

      • Yarn cluster

        Your Altus cluster should run on the following Cloud
        providers:

        • Azure

          The support for Altus on Azure is a technical
          preview feature.

        • AWS

    As a Job relies on Avro to move data among its components, it is recommended to set your
    cluster to use Kryo to handle the Avro types. This not only helps avoid
    a known Avro issue but also brings inherent performance gains. The Spark
    property to be set in your cluster is:
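
      spark.serializer org.apache.spark.serializer.KryoSerializer

    This is the standard Spark property for switching the serializer to Kryo
    (shown here in spark-defaults.conf form); how you apply it depends on
    your cluster setup.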

    If you cannot find the distribution corresponding to yours from this
    drop-down list, this means the distribution you want to connect to is not
    officially supported by Talend. In this situation, you can select
    Custom, then select the Spark version of the cluster to be connected and
    click the [+] button to display the dialog box in which you can
    alternatively:

    1. Select Import from existing version to import an officially supported
      distribution as base and then add other required jar files which the
      base distribution does not provide.

    2. Select Import from zip to import the configuration zip for the custom
      distribution to be used. This zip file should contain the libraries of
      the different Hadoop/Spark elements and the index file of these
      libraries.

      In Talend Exchange, members of the Talend community have shared some
      ready-for-use configuration zip files which you can download from the
      Hadoop configuration list and use directly in your connection. However,
      because of the ongoing evolution of the different Hadoop-related
      projects, you might not be able to find the configuration zip
      corresponding to your distribution from this list; it is then
      recommended to use the Import from existing version option to take an
      existing distribution as base and add the jars required by your
      distribution.

      Note that custom versions are not officially supported by Talend.
      Talend and its community provide you with the opportunity to connect to
      custom versions from the Studio but cannot guarantee that the
      configuration of whichever version you choose will be easy. As such,
      you should only attempt to set up such a connection if you have
      sufficient Hadoop and Spark experience to handle any issues on your
      own.

    For a step-by-step example about how to connect to a custom
    distribution and share this connection, see Hortonworks.

Configuring the connection to the file system to be used by Spark

Skip this section if you are using Google Dataproc or HDInsight, as for these two
distributions, this connection is configured in the Spark
configuration
tab.

  1. Double-click tHDFSConfiguration to open its Component view.

    Spark uses this component to connect to the HDFS system to which the jar
    files dependent on the Job are transferred.

  2. If you have defined the HDFS connection metadata under the Hadoop
    cluster node in Repository, select Repository from the Property type
    drop-down list and then click the […] button to select the HDFS
    connection you have defined from the Repository content wizard.

    For further information about setting up a reusable
    HDFS connection, search for centralizing HDFS metadata on Talend Help Center
    (https://help.talend.com).

    If you complete this step, you can skip the following steps about configuring
    tHDFSConfiguration because all the required fields
    should have been filled automatically.

  3. In the Version area, select
    the Hadoop distribution you need to connect to and its version.
  4. In the NameNode URI field,
    enter the location of the machine hosting the NameNode service of the cluster.
    If you are using WebHDFS, the location should be
    webhdfs://masternode:portnumber; WebHDFS with SSL is not
    supported yet.
  5. In the Username field, enter the authentication information used to
    connect to the HDFS system to be used. Note that the user name must be
    the same as the one you have put in the Spark configuration tab.

Configuring the input component

  1. Double-click tFileInputDelimited to open its Basic settings view in the Component tab.

    tMatchPredict_2.png

    The input data to be used with
    tMatchPredict is the suspect data pairs
    generated by tMatchPairing. You can find examples
    of how to compute suspect pairs and suspect sample from source data on
    Talend Help Center (https://help.talend.com).

  2. Click the […] button next to Edit schema to open a dialog box and add
    columns to the input schema: Original_Id, Source, Site_name, Address,
    PAIR_ID and SCORE.

    SCORE is a Double-typed column. The other ones are String-typed columns.
    An illustrative sample of such input rows is shown after this list.

  3. In the Folder/File
    field, set the path to the input file.
  4. Set the row and field separators in the corresponding fields,
    and limit the header to 1.
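
For illustration only, with a semicolon as field separator, the input file
could start with the following hypothetical rows; the two records of a
suspect pair generated by tMatchPairing share the same PAIR_ID:

Original_Id;Source;Site_name;Address;PAIR_ID;SCORE
1;crm;Talend;9 rue Pages Suresnes;1;0.9723
8;mdm;Talend SA;9 rue Pages 92150 Suresnes;1;0.9723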

Applying the matching model on the data set

  1. Double-click tMatchPredict to display the Basic settings view and define
    the component properties.

    tMatchPredict_3.png

  2. Click Sync columns to
    retrieve the schema defined in the input component.
  3. From the Input type
    list, select paired as the input data is
    already paired with tMatchPairing.
  4. From the Matching model location list, select from file system and then
    set the path to the matching model in the folder field.
  5. In the Clustering classes table, add one or more of the labels you used
    on the sample suspects generated by tMatchPairing, YES in this example.

    The labels were set manually or through Talend Data Stewardship. If
    you labeled the sample of suspect records using Talend Data Stewardship, add the answer(s) defined in the
    Grouping campaign to the table.

    The tMatchPredict component will group suspect records
    which match the YES label.

Configuring the output components to write the labeled suspect
pairs

  1. Double-click the first tFileOutputDelimited component to
    display the Basic settings view and define the component
    properties.

    You have already accepted to propagate the schema to the output components
    when you defined the input component.
  2. Clear the Define a storage configuration component check
    box to use the local system as your target file system.
  3. In the Folder field, set the path to the folder which
    will hold the output data.
  4. From the Action list, select the operation for writing
    data:

    • Select Create when you run the Job for the first
      time.
    • Select Overwrite to replace the file every time
      you run the Job.
  5. Set the row and field separators in the corresponding fields.
  6. Select the Merge results to single file check box, and in the Merge file
    path field, set the path to the output file for the labeled suspect
    pairs.

Executing the Job to label suspect pairs with assigned labels

Press F6 to execute the Job.

tMatchPredict labels the suspect pairs, groups the suspect
records which match the YES label and writes all the suspect
pairs in the output file.

The suspect records which match the YES label belong to groups because
tMatchPredict was configured to group records which match this clustering
class.

tMatchPredict_4.png

The records labeled with the
NO label do not belong to any group.

You can now create a single representation of each group of duplicates and
merge these representations with the unique rows computed by tMatchPairing.

You can find an example of how to create a clean and
deduplicated dataset on Talend Help Center (https://help.talend.com).

