
tKMeansModel

Analyzes incoming datasets by applying the K-Means algorithm.

This component analyzes feature vectors usually pre-processed by tModelEncoder to generate a clustering model that is used by
tPredict to cluster given elements.

It generates a clustering model from this analysis and
writes this model either to memory or to a given file system.

tKMeansModel properties for Apache Spark Batch

These properties are used to configure tKMeansModel running in the Spark Batch Job framework.

The Spark Batch
tKMeansModel component belongs to the Machine Learning family.

This component is available in Talend Platform products with Big Data and
in Talend Data Fabric.

Basic settings

Vector to process

Select the input column used to provide feature vectors. Very often,
this column is the output of the feature engineering computations
performed by tModelEncoder.

Save the model on file
system

Select this check box to store the model in a given file system. Otherwise, the model is
stored in memory. The button for browsing does not work with the Spark Local mode; if you are using the Spark Yarn
or the Spark Standalone mode, ensure that you have properly
configured the connection in a configuration component in the same Job, such as tHDFSConfiguration.

Number of clusters (K)

Enter the number of clusters into which you want tKMeansModel to group data.

In general, a larger number of clusters can decrease errors in
predictions but increases the risk of overfitting. Therefore, it is
recommended to enter a reasonable number based on how many potential
clusters you think the data to be processed might contain, for example
by observing the data.

Set distance threshold of the convergence
(Epsilon)

Select this check box and in the Epsilon field that is displayed, enter the convergence
distance you want to use. The model training is considered accomplished
once all of the cluster centers move less than this distance.

If you leave this check box clear, the default convergence distance
0.0001 is used.

Set the maximum number of
runs

Select this check box and in the Maximum number
of runs
field that is displayed, enter the number of
iterations you want the Job to perform to train the model.

If you leave this check box clear, the default value 20 is used.

Set the number of parallelized
runs

Select this check box and in the Number of
parallelized runs
field that is displayed, enter the number
of iterations you want the Job to run in parallel.

If you leave this check box clear, the default value 1 is used. This actually means that the
iterations will be run in succession.

Note that this parameter helps you optimize the use of your resources
for the computations but does not impact the prediction performance of
the model.

Initialization function

Select the mode to be used to select the points as initial cluster
centers.

  • Random: the points are
    selected randomly. In general, this mode is used for simple
    datasets.

  • K-Means||: this mode is
    known as Scalable K-Means++, a parallel algorithm that can
    obtain a nearly optimal initialization result. This is also
    the default initialization mode.

    For further information about this mode, see Scalable K-Means++.

Set the number of steps for the
initialization

Select this check box and in the Steps field that is displayed, enter the number of
initialization rounds to be run for the optimal initialization
result.

If you leave this check box clear, the default value 5 is used. 5
rounds are almost always enough for the K-Means|| mode to obtain the optimal result.

Define the random seed

Select this check box and in the Seed
field that is displayed, enter the seed to be used for the
initialization of the cluster centers.
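
For reference, these Basic settings map closely onto the parameters of Spark MLlib's
K-Means implementation (the defaults above match the MLlib defaults). The sketch below
is only an illustration of that mapping, not the code the Studio generates; it assumes
the RDD-based org.apache.spark.mllib.clustering.KMeans API, whose parameter names line
up with these settings, and uses placeholder values and paths.

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // The "Vector to process" column, typically produced by tModelEncoder.
    val features: RDD[Vector] = ???   // placeholder for the input feature vectors

    val kmeans = new KMeans()
      .setK(6)                                        // Number of clusters (K)
      .setEpsilon(0.0001)                             // Set distance threshold of the convergence (Epsilon)
      .setMaxIterations(20)                           // Set the maximum number of runs
      .setRuns(1)                                     // Set the number of parallelized runs (ignored in recent Spark versions)
      .setInitializationMode(KMeans.K_MEANS_PARALLEL) // Initialization function: KMeans.RANDOM or KMeans.K_MEANS_PARALLEL
      .setInitializationSteps(5)                      // Set the number of steps for the initialization
      .setSeed(12345L)                                // Define the random seed

    val model = kmeans.run(features)
    // Save the model on file system (the path is a placeholder):
    // model.save(sc, "hdfs://namenode_host:8020/user/talend/kmeans/model")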

Advanced settings

Display the centers after the
processing

Select this check box to output the vectors of the cluster centers
into the console of the Run
view.

This feature is often useful when you need to understand how the
cluster centers move in the process of training your K-Means
model.
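
In plain Spark terms, this option amounts to printing the cluster-center vectors of the
trained model. A minimal sketch, continuing the hypothetical model variable from the
sketch in the Basic settings above:

    // Print the coordinates of each cluster center, for example to compare training runs.
    model.clusterCenters.zipWithIndex.foreach { case (center, i) =>
      println(s"Cluster $i center: $center")
    }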

Usage

Usage rule

This component is used as an end component and requires an input link.

You can accelerate the training process by adjusting the stopping conditions such as the
maximum number of runs or the convergence distance, but note that
training that stops too early could impact the performance of the model.

Model evaluation

The parameters you need to set are free parameters and so their values may be provided by
previous experiments, empirical guesses or the like. They do not have any optimal values
applicable for all datasets.

Therefore, you need to train the clustering model you are generating with different sets
of parameter values until you obtain the best evaluation result. Note that you need
to write the evaluation code yourself to rank your model with scores.
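
For example, one common way (though not the only one) to score a K-Means model is the
within-set sum of squared errors (WSSSE) that Spark MLlib exposes as computeCost. The
following sketch continues the hypothetical model from the Basic settings above and
assumes a test RDD of feature vectors; it is an illustration, not evaluation code
provided by the component.

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // The lower the cost (WSSSE), the tighter the clusters. Comparing this score across
    // Jobs run with different K, Epsilon, etc. helps you pick the best model.
    val testSet: RDD[Vector] = ???   // placeholder for the reference/test feature vectors
    val wssse = model.computeCost(testSet)
    println(s"Within-set sum of squared errors: $wssse")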

Modeling the accident-prone areas in a city

This scenario applies only to subscription-based Talend products with Big Data.

In this scenario, the tKMeansModel component is used to
analyze a set of sample geographical data about the destination of ambulances in a city in
order to model the accident-prone areas.

A model like this can be employed to help determine the optimal locations for building hospitals.

tKMeansModel_1.png

You can download this sample data from here. It consists of pairs of latitudes and longitudes.

The sample data was randomly and automatically generated for demonstration purposes only
and in any case it does not reflect the situation of these areas in the real world.

Prerequisite:

  • The Spark version to be used is 1.4 or later.

  • The sample data is stored in your Hadoop file system and you have proper rights
    and permissions to at least read it.

  • Your Hadoop cluster is properly installed and is running.

If you are not sure about these requirements, ask the administrator of your
Hadoop system.

The components to be used are:

  • tFileInputDelimited: it loads the sample data into
    the data flow of the Job.

  • tReplicate: it replicates the sample data and
    caches the replication.

  • tKMeansModel: it analyzes the data to train the
    model and writes the model to HDFS.

  • tModelEncoder: it pre-processes the data to prepare
    proper feature vectors to be used by tKMeansModel.

  • tPredict: it applies the KMeans model on the
    replication of the sample data. In real-world practice, this data should be a set of
    reference data used to test the model accuracy.

  • tFileOutputDelimited: it writes the result of the
    prediction to HDFS.

  • tHDFSConfiguration: this component is used by Spark to connect to
    the HDFS system where the jar files dependent on the Job are transferred.

    In the Spark
    Configuration
    tab in the Run
    view, define the connection to a given Spark cluster for the whole Job. In
    addition, since the Job expects its dependent jar files for execution, you must
    specify the directory in the file system to which these jar files are
    transferred so that Spark can access these files:

    • Yarn mode (Yarn client or Yarn cluster):

      • When using Google Dataproc, specify a bucket in the
        Google Storage staging bucket
        field in the Spark configuration
        tab.

      • When using HDInsight, specify the blob to be used for Job
        deployment in the Windows Azure Storage
        configuration
        area in the Spark
        configuration
        tab.

      • When using Altus, specify the S3 bucket or the Azure
        Data Lake Storage for Job deployment in the Spark
        configuration
        tab.
      • When using Qubole, add a
        tS3Configuration to your Job to write
        your actual business data in the S3 system with Qubole. Without
        tS3Configuration, this business data is
        written in the Qubole HDFS system and destroyed once you shut
        down your cluster.
      • When using on-premise
        distributions, use the configuration component corresponding
        to the file system your cluster is using. Typically, this
        system is HDFS and so use tHDFSConfiguration.

    • Standalone mode: use the
      configuration component corresponding to the file system your cluster is
      using, such as tHDFSConfiguration or
      tS3Configuration.

      If you are using Databricks without any configuration component present
      in your Job, your business data is written directly in DBFS (Databricks
      Filesystem).

Arranging data flow for the KMeans Job

  1. In the
    Integration
    perspective of the Studio, create an empty Job from the Job Designs node in the Repository tree view.

    For further information about how to create a Job, see the
    Talend Open Studio for Big Data Getting Started Guide.
  2. In the workspace, enter the name of the component to be used and select this component from the list that appears.
  3. Connect tFileInputDelimited to tReplicate using the Row >
    Main
    link.
  4. Do the same to connect tReplicate to tModelEncoder and then tModelEncoder to tKMeansModel.
  5. Repeat the operations to connect tReplicate to tPredict and then tPredict to tFileOutputDelimited.
  6. Leave tHDFSConfiguration as it is.

Configuring the connection to the file system to be used by Spark

Skip this section if you are using Google Dataproc or HDInsight, as for these two
distributions, this connection is configured in the Spark
configuration
tab.

  1. Double-click tHDFSConfiguration to open its Component view.

    Spark uses this component to connect to the HDFS system to which the jar
    files dependent on the Job are transferred.

  2. If you have defined the HDFS connection metadata under the Hadoop
    cluster
    node in Repository, select
    Repository from the Property
    type
    drop-down list and then click the
    […] button to select the HDFS connection you have
    defined from the Repository content wizard.

    For further information about setting up a reusable
    HDFS connection, search for centralizing HDFS metadata on Talend Help Center
    (https://help.talend.com).

    If you complete this step, you can skip the following steps about configuring
    tHDFSConfiguration because all the required fields
    should have been filled automatically.

  3. In the Version area, select
    the Hadoop distribution you need to connect to and its version.
  4. In the NameNode URI field,
    enter the location of the machine hosting the NameNode service of the cluster.
    If you are using WebHDFS, the location should be
    webhdfs://masternode:portnumber; WebHDFS with SSL is not
    supported yet.
  5. In the Username field, enter
    the authentication information used to connect to the HDFS system to be used.
    Note that the user name must be the same as the one you have entered in the Spark configuration tab.

Reading and caching the sample data

  1. Double-click the tFileInputDelimited component to
    open its Component view.
  2. Click the […] button next to Edit schema and in the pop-up schema dialog box, define the
    schema by adding two columns latitude and
    longitude of Double type.

    tKMeansModel_2.png

  3. Click OK to validate these changes and accept the
    propagation prompted by the pop-up dialog box.
  4. Select the Define a storage configuration
    component
    check box and select the tHDFSConfiguration component to be used.

    tFileInputDelimited uses this configuration
    to access the sample data to be used as training set.
  5. In the Folder/File field, enter the directory
    where the training set is stored.
  6. Double-click the tReplicate component to open its
    Component view.
  7. Select the Cache replicated RDD check box and from
    the Storage level drop-down list, select
    Memory only. This way, this sample data is
    replicated and stored in memory for use as test set.
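
A hedged sketch of what this part of the Job does in plain Spark: read the
latitude/longitude pairs from HDFS with an explicit schema and cache them in memory.
It uses the DataFrame API of Spark 2.x for readability; the NameNode URI, path and
field separator are assumptions to adapt to your own cluster.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("kmeans_accident_areas").getOrCreate()

    // Same schema as defined in tFileInputDelimited: latitude and longitude of Double type.
    val schema = StructType(Seq(
      StructField("latitude", DoubleType),
      StructField("longitude", DoubleType)
    ))

    val samples = spark.read
      .schema(schema)
      .option("sep", ";")   // assumed field separator
      .csv("hdfs://namenode_host:8020/user/talend/kmeans/sample_data")

    // Equivalent of tReplicate with "Cache replicated RDD" and Storage level set to Memory only.
    samples.persist(StorageLevel.MEMORY_ONLY)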

Preparing features for KMeans

  1. Double-click the tModelEncoder component to open its Component view.

    tKMeansModel_3.png

  2. Click the […] button
    next to Edit schema and on the tModelEncoder side of the pop-up schema dialog box,
    define the schema by adding one column named map of
    Vector type.
  3. Click OK to validate these changes and accept
    the propagation prompted by the pop-up dialog box.
  4. In the Transformations
    table, add one row by clicking the [+]
    button and then proceed as follows:

    1. In the Output column column, select the column
      that carries the features. In this scenario, it is
      map.
    2. In the Transformation column, select the
      algorithm to be used for the transformation. In this scenario, it is
      Vector assembler.
    3. In the Parameters column, enter the parameters
      you want to customize for use in the Vector assembler algorithm. In this
      scenario, enter
      inputCols=latitude,longitude.
    In this transformation, tModelEncoder combines the latitude and longitude columns into one single
    feature vector column.
  5. Double-click tKMeansModel to open its Component view.

    tKMeansModel_4.png

  6. Select the Define a storage configuration
    component
    check box and select the tHDFSConfiguration component to be used.
  7. From the Vector to
    process
    list, select the column that provides the feature vectors to
    be analyzed. In this scenario, it is map, which combines
    all features.
  8. Select the Save the model on file system
    check box and in the HDFS folder field that
    is displayed, enter the directory you want to use to store the generated
    model.
  9. In the Number of clusters (K)
    field, enter the number of clusters into which you want tKMeansModel to group the data. You need to try different numbers by running the
    current Job to create the clustering model several times; after comparing the
    evaluation results of every model created on each run, you can decide the number
    you need to use. For example, enter 6.

    You need to write the evaluation code yourself.
  10. From the Initialization function drop-down list, select Random. In general, this mode is used for simple datasets.
  11. Leave the other parameters as they are.
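
The following sketch, using the DataFrame-based spark.ml API of Spark 2.x, illustrates
roughly what tModelEncoder and tKMeansModel do in this section; it continues the
samples DataFrame from the previous sketch, and the paths and parameter values mirror
this scenario but remain placeholders rather than the Studio-generated code.

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.feature.VectorAssembler

    // Vector assembler: combine the latitude and longitude columns into the feature column "map".
    val assembler = new VectorAssembler()
      .setInputCols(Array("latitude", "longitude"))
      .setOutputCol("map")
    val trainingSet = assembler.transform(samples)

    // K-Means configured as in the component: features from "map", K = 6, Random initialization.
    val kmeans = new KMeans()
      .setFeaturesCol("map")
      .setK(6)
      .setInitMode("random")
    val kmeansModel = kmeans.fit(trainingSet)

    // Save the model on file system (HDFS folder) so that the prediction step can reload it.
    kmeansModel.write.overwrite().save("hdfs://namenode_host:8020/user/talend/kmeans/model")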

Testing the KMeans model

  1. Double-click tPredict to open its
    Component view.

    tKMeansModel_5.png

  2. Select the Define a storage configuration
    component
    check box and select the tHDFSConfiguration component to be used.
  3. From the Model type drop-down list, select
    Kmeans model.
  4. Select the Model on filesystem radio button and enter the
    directory in which the KMeans model is stored.

    In this case, the tPredict component contains a
    read-only column called label in which the
    model provides the labels of the clusters.
  5. Double-click tFileOutputDelimited to open its
    Component view.

    tKMeansModel_6.png

  6. Select the Define a storage configuration
    component
    check box and select the tHDFSConfiguration component to be used.
  7. In the Folder field, browse to the location
    in HDFS in which you want to store the prediction result.
  8. From the Action drop-down list, select Overwrite; if the target folder does not exist yet, select Create instead.
  9. Select the Merge result to single file check
    box and then the Remove source dir check
    box.
  10. In the Merge file path field, browse to the
    location in HDFS in which you want to store the merged prediction result.
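
A hedged sketch of this prediction step in plain Spark, continuing the previous sketches:
reload the saved model, apply it to the cached sample data (the test set) and write the
result back to HDFS, which is roughly what tPredict and tFileOutputDelimited do here.
The paths and separator are placeholders; the label column name comes from tPredict as
described above.

    import org.apache.spark.ml.clustering.KMeansModel

    val reloaded = KMeansModel.load("hdfs://namenode_host:8020/user/talend/kmeans/model")

    // tPredict exposes the cluster assigned to each row in a read-only column called "label".
    val predictions = reloaded
      .transform(assembler.transform(samples))        // same Vector assembler as in the previous sketch
      .withColumnRenamed("prediction", "label")

    predictions
      .select("latitude", "longitude", "label")
      .coalesce(1)                                    // comparable to "Merge result to single file"
      .write.mode("overwrite")                        // Action: Overwrite
      .option("sep", ";")
      .csv("hdfs://namenode_host:8020/user/talend/kmeans/prediction")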

Selecting the Spark mode

Depending on the Spark cluster to be used, select a Spark mode for your Job.

The Spark documentation provides an exhaustive list of Spark properties and
their default values at Spark Configuration. A Spark Job designed in the Studio uses
this default configuration except for the properties you explicitly defined in the
Spark Configuration tab or the components
used in your Job.

  1. Click Run to open its view and then click the
    Spark Configuration tab to display its view
    for configuring the Spark connection.
  2. Select the Use local mode check box to test your Job locally.

    In the local mode, the Studio builds the Spark environment within itself on the fly in order to
    run the Job. Each processor of the local machine is used as a Spark
    worker to perform the computations.

    In this mode, your local file system is used; therefore, deactivate the
    configuration components such as tS3Configuration or
    tHDFSConfiguration that provide connection
    information to a remote file system, if you have placed these components
    in your Job.

    You can launch
    your Job without any further configuration.

  3. Clear the Use local mode check box to display the
    list of the available Hadoop distributions and from this list, select
    the distribution corresponding to your Spark cluster to be used.

    This distribution could be:

    • Databricks

    • Qubole

    • Amazon EMR

      For this distribution, Talend supports:

      • Yarn client

      • Yarn cluster

    • Cloudera

      For this distribution, Talend supports:

      • Standalone

      • Yarn client

      • Yarn cluster

    • Google Cloud
      Dataproc

      For this distribution, Talend supports:

      • Yarn client

    • Hortonworks

      For this distribution, Talend supports:

      • Yarn client

      • Yarn cluster

    • MapR

      For this distribution, Talend supports:

      • Standalone

      • Yarn client

      • Yarn cluster

    • Microsoft HDInsight

      For this distribution, Talend supports:

      • Yarn cluster

    • Cloudera Altus

      For this distribution, Talend supports:

      • Yarn cluster

        Your Altus cluster should run on the following Cloud
        providers:

        • Azure

          The support for Altus on Azure is a technical
          preview feature.

        • AWS

    As a Job relies on Avro to move data among its components, it is recommended to set your
    cluster to use Kryo to handle the Avro types. This not only helps avoid
    this known Avro issue but also
    brings inherent performance gains. The Spark property to be set in your
    cluster is spark.serializer, with the value
    org.apache.spark.serializer.KryoSerializer (a short sketch of setting it
    follows this procedure).

    If you cannot find the distribution corresponding to yours from this
    drop-down list, this means the distribution you want to connect to is not officially
    supported by Talend. In this situation, you can select
    Custom, then select the Spark version
    of the cluster to be connected and click the
    [+] button to display the dialog box in which you can
    alternatively:

    1. Select Import from existing
      version
      to import an officially supported distribution as base
      and then add other required jar files which the base distribution does not
      provide.

    2. Select Import from zip to
      import the configuration zip for the custom distribution to be used. This zip
      file should contain the libraries of the different Hadoop/Spark elements and the
      index file of these libraries.

      In Talend Exchange, members of the
      Talend community have shared some ready-for-use configuration zip files
      which you can download from this Hadoop configuration
      list and use directly in your connection. However, because of
      the ongoing evolution of the different Hadoop-related projects, you might not be
      able to find the configuration zip corresponding to your distribution from this
      list; in that case, it is recommended to use the Import from
      existing version option to take an existing distribution as base
      and add the jars required by your distribution.

      Note that custom versions are not officially supported by
      Talend. Talend and its community provide you with the opportunity to connect to
      custom versions from the Studio but cannot guarantee that the configuration of
      whichever version you choose will be easy. As such, you should only attempt to
      set up such a connection if you have sufficient Hadoop and Spark experience to
      handle any issues on your own.

    For a step-by-step example about how to connect to a custom
    distribution and share this connection, see Hortonworks.
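
As mentioned above for the Kryo recommendation, the sketch below shows one way to set
that serializer property programmatically with SparkConf; in a Talend Job you would
typically set it instead in your cluster's spark-defaults.conf or as an advanced Spark
property of the Job, so this is only an illustration.

    import org.apache.spark.SparkConf

    // Make Spark serialize data (including Avro types) with Kryo.
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")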

Executing the Job

  1. Press Ctrl + S to save the Job.
  2. Press F6 to run the Job.

    The merged prediction result is stored in HDFS and you can evaluate this
    result using your own evaluation process. Then run this Job several more times with
    different K-Means parameters in order to obtain the optimal model.

The following image shows an example of the predicted clusters. This visualization is
produced via a Python script. You can download this script from here; remember to adapt the path in the script so that it can access the
prediction result on your own machine.

tKMeansModel_7.png

