
tRandomForestModel

Analyzes feature vectors, usually pre-processed by tModelEncoder, to
generate a classifier model that is used by tPredict to classify
given elements.

tRandomForestModel analyzes incoming
datasets by applying the Random Forest algorithm.

It generates a classification model out of this analysis and writes
this model either in memory or in a given file system.

In local mode, Apache Spark 1.3.0, 1.4.0, 1.5.0, 1.6.0, 2.0.0, 2.3.0 and 2.4.0 are
supported.

tRandomForestModel properties for Apache Spark Batch

These properties are used to configure tRandomForestModel running in the Spark Batch Job framework.

The Spark Batch
tRandomForestModel component belongs to the Machine Learning family.

This component is available in Talend Platform products with Big Data and
in Talend Data Fabric.

Basic settings

Label column

Select the input column used to provide classification labels. The records of this column
are used as the class names (Target in terms of classification) of the elements to be
classified.

Features column

Select the input column used to provide features. Very often, this column is the output of
the feature engineering computations performed by tModelEncoder.

Save the model on file
system

Select this check box to store the model in a given file system. Otherwise, the model is
stored in memory. The button for browsing does not work with the Spark Local mode; if you are using the Spark Yarn
or the Spark Standalone mode, ensure that you have properly
configured the connection in a configuration component in the same Job, such as tHDFSConfiguration.

Number of trees in the forest

Enter the number of decision trees you want tRandomForestModel to build.

Each decision tree is trained independently using a random sample of
features.

Increasing this number can improve the accuracy by decreasing the
variance in predictions, but will increase the training time.

Maximum depth of each tree in the
forest

Enter the decision tree depth at which the training should stop adding new nodes. New
nodes represent further tests on features on internal nodes and possible class labels held
by leaf nodes.

For a tree of depth n, the number of internal nodes is
2^n – 1. For example, depth
1 means 1 internal
node plus 2 leaf nodes.

Generally speaking, a deeper decision tree is more expressive and thus potentially more
accurate in predictions, but it is also more resource consuming and prone to
overfitting.

Advanced settings

Subsampling rate

Enter the numeric value to indicate the fraction of the
input dataset used for training each tree in the forest. The default
value 1.0 is recommended, meaning that the whole dataset is used to
train each tree.

Subset strategy

Select the strategy that determines how many features should be
considered on each internal node in order to appropriately split this
internal node (actually the training set or subset of a feature on this
node) into smaller subsets. These subsets are used to build child
nodes.

Each strategy takes a different number of features into
account to find the optimal point among these features for the split. This
point could be, for example, the value 35 of the
feature age.

  • auto: this strategy
    is based on the number of trees you have set in the
    Number of trees in the
    forest
    field in the Basic settings view. This is
    the default strategy to be used.

    If the number of trees is 1, the strategy is actually all; if this number is
    greater than 1, the strategy is
    sqrt.

  • all: the total number
    of features is considered for split.

  • sqrt: the number of
    features to be considered is the square root of the total
    number of features.

  • log2: the number of
    features to be considered is the result of log2(M),
    in which M is the total number of
    features.

Set Checkpoint Interval

Set the frequency of checkpoints. It is recommended to leave the default
value (10).

Before setting a value for this parameter, activate checkpointing and set
the checkpoint directory in the Spark
Configuration
tab of the Run
view.

For further information about checkpointing the
activities of your Apache Spark Job, see the documentation on Talend Help Center (https://help.talend.com).

Max bins

Enter the numeric value to indicate the maximum number of bins used for splitting
features.

The continuous features are automatically transformed to ordered discrete features.

Min info gain

Enter the minimum information gain expected when splitting a parent node into
child nodes. When the information gain of a split is less than this minimum,
the split is not performed.

The default value is 0.0, meaning that splitting a node stops once no
further information can be gained from the split.

For further information about how the information gain is calculated, see Impurity and Information gain from the Spark documentation.

Min instances per node

Enter the minimum number of training instances a node should have to make it valid for
further splitting.

The default value is 1, which means when a node has
only 1 row of training data, it stops splitting.

Impurity

Select the measure used to select the best split from
each set of splits.

  • gini: it is about how often an element could be
    incorrectly labelled in a split.

  • entropy: it is about how unpredictable the information in
    each split is.

For further information about how each of the measures is calculated, see Impurity measures from the Spark documentation.

Set a random seed

Enter the random seed number to be used for bootstrapping and choosing feature
subsets.
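
For reference only, the settings above map roughly to the parameters of Spark ML's RandomForestClassifier. The sketch below is illustrative and is not the code that the Studio generates; the column names are the ones used later in this documentation and the numeric values are example values.

    import org.apache.spark.ml.classification.RandomForestClassifier;

    // Illustrative mapping of the tRandomForestModel settings to Spark ML
    // parameters (not generated Talend code). Note that Spark ML expects a
    // numeric (indexed) label column, whereas tRandomForestModel accepts
    // string labels and handles the indexing itself.
    static RandomForestClassifier buildClassifier() {
        return new RandomForestClassifier()
                .setLabelCol("label")               // Label column
                .setFeaturesCol("features_vect")    // Features column
                .setNumTrees(20)                    // Number of trees in the forest
                .setMaxDepth(5)                     // Maximum depth of each tree
                .setSubsamplingRate(1.0)            // Subsampling rate
                .setFeatureSubsetStrategy("auto")   // Subset strategy
                .setCheckpointInterval(10)          // Set Checkpoint Interval
                .setMaxBins(32)                     // Max bins
                .setMinInfoGain(0.0)                // Min info gain
                .setMinInstancesPerNode(1)          // Min instances per node
                .setImpurity("gini")                // Impurity
                .setSeed(12345L);                   // Set a random seed
    }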

Usage

Usage rule

This component is used as an end component and requires an input link.

You can accelerate the training process by adjusting the stopping
conditions such as the maximum depth of each decision tree, the maximum
number of bins for splitting or the minimum information gain,
but note that training that stops too early can degrade the model's
performance.

Model evaluation

The parameters you need to set are free parameters and so their values may be provided by
previous experiments, empirical guesses or the like. They do not have any optimal values
applicable for all datasets.

Therefore, you need to train the classifier model you are generating with different sets
of parameter values until you can obtain the best confusion matrix. But note that you need
to write the evaluation code yourself to rank your model with scores.

You need to select the scores to be used depending on the algorithm you want to use to
train your classifier model. This allows you to build the most relevant confusion
matrix.

For examples about how the confusion matrix is used in a
Talend
Job for classification, see Creating a classification model to filter spam.

For a general explanation about confusion matrix, see https://en.wikipedia.org/wiki/Confusion_matrix from Wikipedia.
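
As a hedged illustration of such evaluation code (the DataFrame and the prediction and label column names below are assumptions, not part of this component), a confusion matrix and an accuracy score can be derived with Spark's MulticlassMetrics:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.mllib.evaluation.MulticlassMetrics;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import scala.Tuple2;

    // Sketch only: build a confusion matrix from a DataFrame holding numeric
    // "prediction" and "label" columns (hypothetical names).
    static void printConfusionMatrix(Dataset<Row> predictions) {
        JavaRDD<Tuple2<Object, Object>> predictionAndLabel = predictions.toJavaRDD()
                .map(row -> new Tuple2<>(
                        (Object) row.getDouble(row.fieldIndex("prediction")),
                        (Object) row.getDouble(row.fieldIndex("label"))));
        MulticlassMetrics metrics = new MulticlassMetrics(predictionAndLabel.rdd());
        System.out.println("Confusion matrix:\n" + metrics.confusionMatrix());
        System.out.println("Accuracy: " + metrics.accuracy());
    }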

Spark Connection

In the Spark
Configuration
tab in the Run
view, define the connection to a given Spark cluster for the whole Job. In
addition, since the Job expects its dependent jar files for execution, you must
specify the directory in the file system to which these jar files are
transferred so that Spark can access these files:

  • Yarn mode (Yarn client or Yarn cluster):

    • When using Google Dataproc, specify a bucket in the
      Google Storage staging bucket
      field in the Spark configuration
      tab.

    • When using HDInsight, specify the blob to be used for Job
      deployment in the Windows Azure Storage
      configuration
      area in the Spark
      configuration
      tab.

    • When using Altus, specify the S3 bucket or the Azure
      Data Lake Storage for Job deployment in the Spark
      configuration
      tab.
    • When using Qubole, add a
      tS3Configuration to your Job to write
      your actual business data in the S3 system with Qubole. Without
      tS3Configuration, this business data is
      written in the Qubole HDFS system and destroyed once you shut
      down your cluster.
    • When using on-premise
      distributions, use the configuration component corresponding
      to the file system your cluster is using. Typically, this
      system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the
    configuration component corresponding to the file system your cluster is
    using, such as tHDFSConfiguration or
    tS3Configuration.

    If you are using Databricks without any configuration component present
    in your Job, your business data is written directly in DBFS (Databricks
    Filesystem).

This connection is effective on a per-Job basis.

Creating a classification model to filter spam

This scenario applies only to subscription-based Talend Platform products with Big Data and Talend Data Fabric.

In this scenario, you create Spark Batch Jobs. The key components to be used are as follows:

  • tModelEncoder: several tModelEncoder components are used to transform given SMS text messages
    into feature sets.

  • tRandomForestModel: it analyzes the features
    incoming from tModelEncoder to build a
    classification model that understands what a junk message or a normal message could
    look like.

  • tClassify: in a new Job, it applies this
    classification model to process a new set of SMS text messages to classify the spam
    and the normal messages. In this scenario, the result of this classification is used
    to evaluate the accuracy of the model, since the classification of the messages
    processed by tClassify is already known and
    explicitly marked.

  • tHDFSConfiguration: this component is used by Spark to connect to
    the HDFS system where the jar files dependent on the Job are transferred.

    In the Spark Configuration tab in the Run view, define the connection to a
    given Spark cluster for the whole Job, as described above in the Spark
    Connection section of the component properties.

Prerequisites:

  • Two sets of SMS text messages: one is used to train classification models
    and the other is used to evaluate the created models. You can download the
    train set from trainingSet.zip and the test
    set from testSet.zip.


    Talend
    created these two sets out of the dataset
    downloadable from https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection, by
    using this
    dataSet_preparation
    Job to add 3
    feature columns (number of currency symbols, number of numeric values and
    number of exclamation marks) to the raw dataset and proportionally split the
    dataset.

    [Example junk and normal messages are shown as images in the original
    documentation.]

    Note that the new features added to the raw dataset were discovered as the
    result of the observation of the junk messages used specifically in this
    scenario (these junk messages often contain prices and/or exclamation marks)
    and so cannot be generalized for whatever junk messages you want to analyze.
    In addition, the dataset was randomly split into two sets and used as is but
    in a real-world practice, you can continue to preprocess them using many
    different methods such as dataset balancing in order to better train your
    classification model.

  • The two sets must be stored in the machine where the Job is going to be
    executed, for example in the HDFS system of your Yarn cluster if you use the
    Spark Yarn client mode to run
    Talend
    Spark Jobs, and you have appropriate rights and
    permissions to read data from and write data in this system.

    In this scenario, the Spark Yarn client
    will be used and the datasets are stored in the associated HDFS
    system.

  • The Spark cluster to be used must be properly set up and running.

Creating a classification model using Random Forest

tRandomForestModel_1.png

Arranging the data
flow

  1. In the
    Integration
    perspective of the Studio, create an empty
    Spark Batch Job, named rf_model_creation
    for example, from the Job Designs node in
    the Repository tree view.

    For further information about how to create a Spark Batch Job, see the Getting Started Guide of the Studio.
  2. In the workspace, enter the name of the component to be used and select this component from the list that appears. In this scenario, the components are tHDFSConfiguration, tFileInputDelimited, tRandomForestModel, and 4 tModelEncoder components.

    It is recommended to give the 4 tModelEncoder components
    different labels so that you can easily recognize the task each of them is
    used to complete. In this scenario, they are labelled Tokenize, tf,
    tf_idf and features_assembler, respectively.
  3. Except tHDFSConfiguration, connect the other
    components using the Row > Main link as displayed
    in the image above.

Configuring the connection to the file system to be used by Spark

Skip this section if you are using Google Dataproc or HDInsight, as for these two
distributions, this connection is configured in the Spark
configuration
tab.

  1. Double-click tHDFSConfiguration to open its Component view.

    Spark uses this component to connect to the HDFS system to which the jar
    files dependent on the Job are transferred.

  2. If you have defined the HDFS connection metadata under the Hadoop
    cluster
    node in Repository, select
    Repository from the Property
    type
    drop-down list and then click the
    […] button to select the HDFS connection you have
    defined from the Repository content wizard.

    For further information about setting up a reusable
    HDFS connection, search for centralizing HDFS metadata on Talend Help Center
    (https://help.talend.com).

    If you complete this step, you can skip the following steps about configuring
    tHDFSConfiguration because all the required fields
    should have been filled automatically.

  3. In the Version area, select
    the Hadoop distribution you need to connect to and its version.
  4. In the NameNode URI field,
    enter the location of the machine hosting the NameNode service of the cluster.
    If you are using WebHDFS, the location should be
    webhdfs://masternode:portnumber; WebHDFS with SSL is not
    supported yet.
  5. In the Username field, enter
    the authentication information used to connect to the HDFS system to be used.
    Note that the user name must be the same as the one you have put in the Spark configuration tab.

Reading the training
set

  1. Double-click tFileInputDelimited to open its
    Component view.

    tRandomForestModel_2.png

  2. Select the Define a storage configuration component check box
    and select the tHDFSConfiguration component
    to be used.

    tFileInputDelimited uses this
    configuration to access the training set to be used.
  3. Click the […] button next to Edit
    schema
    to open the schema editor.
  4. Click the [+] button five times to add five rows and in the
    Column column, rename them to label, sms_contents, num_currency,
    num_numeric and num_exclamation, respectively.

    tRandomForestModel_3.png

    The label and the sms_contents columns carry the raw data, which is composed of
    the SMS text messages in the sms_contents
    column and the labels indicating whether a message is spam in the label column.
    The other columns are used to carry the features added to the raw datasets
    as explained previously in this scenario. These three features are the
    number of currency symbols, the number of numeric values and the number of
    exclamation marks found in each SMS message. A Spark-side sketch of this
    schema is shown after these steps.
  5. In the Type column, select Integer for the num_currency, num_numeric and
    num_exclamation columns.
  6. Click OK to validate these changes.
  7. In the Folder/File field, enter the
    directory where the training set to be used is stored.
  8. In the Field separator field, enter
    , which is the separator used by the
    datasets you can download for use in this scenario.
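
For reference, the sketch below shows how the same five-column layout could be read with plain Spark SQL. It is not the code generated by the Studio; the SparkSession, the file path and the field separator are placeholders to adapt to your environment, and the DDL-style schema string requires Spark 2.3 or later.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    // Sketch only: read the training set with the schema defined above.
    static Dataset<Row> readTrainingSet(SparkSession spark, String path, String sep) {
        return spark.read()
                .option("sep", sep)   // same field separator as in tFileInputDelimited
                .schema("label STRING, sms_contents STRING, "
                      + "num_currency INT, num_numeric INT, num_exclamation INT")
                .csv(path);
    }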

Transforming SMS text messages to feature vectors using tModelEncoder

This step is meant to implement the feature engineering process.

Transforming messages to words
  1. Double-click the tModelEncoder component labelled Tokenize to
    open its Component view. This component
    tokenizes the SMS messages into words.

    tRandomForestModel_4.png

  2. Click the Sync columns button to retrieve the schema from the
    preceding one.
  3. Click the […] button next to Edit
    schema
    to open the schema editor.
  4. On the output side, click the [+] button to add one row and in the Column column, rename it to
    sms_tokenizer_words. This column is used to carry the
    tokenized messages.

    tRandomForestModel_5.png

  5. In the Type column,
    select Object for this
    sms_tokenizer_words row.
  6. Click OK to validate these changes.
  7. In the Transformations
    table, add one row by clicking the [+]
    button and then proceed as follows:

    1. In the Input column column, select the column
      that provides data to be transformed to features. In this scenario, it
      is sms_contents.
    2. In the Output column column, select the column
      that carries the features. In this scenario, it is
      sms_tokenizer_words.
    3. In the Transformation column, select the
      algorithm to be used for the transformation. In this scenario, it is
      Regex tokenizer.
    4. In the Parameters column, enter the parameters
      you want to customize for use in the algorithm you have selected. In
      this scenario, enter
      pattern=\W;minTokenLength=3.

Using this transformation, tModelEncoder
splits each input message on non-word characters (the pattern \W), keeps only the words that contain at least 3
letters and puts the result of the transformation in the sms_tokenizer_words column. Thus currency symbols, numeric values,
punctuation marks and words such as a, an
or to are excluded from this column.
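
As an illustration only (tModelEncoder generates its own code), the equivalent Spark ML call for this Regex tokenizer transformation would look roughly as follows:

    import org.apache.spark.ml.feature.RegexTokenizer;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Sketch only: Regex tokenizer with pattern=\W and minTokenLength=3.
    static Dataset<Row> tokenize(Dataset<Row> trainingSet) {
        RegexTokenizer tokenizer = new RegexTokenizer()
                .setInputCol("sms_contents")
                .setOutputCol("sms_tokenizer_words")
                .setPattern("\\W")         // split on non-word characters
                .setMinTokenLength(3);     // keep tokens of at least 3 characters
        return tokenizer.transform(trainingSet);
    }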

Calculating the weight of a word in each message
  1. Double-click the tModelEncoder component
    labelled tf to open its Component view.

    tRandomForestModel_6.png

  2. Repeat the operations described previously over the tModelEncoder labelled Tokenize to add the sms_tf_vect column of the Vector type to the output schema and define the
    transformation as displayed in the image above.

    tRandomForestModel_7.png

    In this transformation, tModelEncoder
    uses HashingTF to convert the already
    tokenized SMS messages into fixed-length (15 in this scenario) feature vectors to reflect the importance
    of a word in each SMS message.
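
    A rough Spark ML equivalent of this HashingTF transformation (illustrative only, not the generated code; 15 is the fixed vector length used in this scenario):

      import org.apache.spark.ml.feature.HashingTF;
      import org.apache.spark.sql.Dataset;
      import org.apache.spark.sql.Row;

      // Sketch only: term-frequency vectors of fixed length 15.
      static Dataset<Row> addTermFrequencies(Dataset<Row> tokenized) {
          HashingTF hashingTF = new HashingTF()
                  .setInputCol("sms_tokenizer_words")
                  .setOutputCol("sms_tf_vect")
                  .setNumFeatures(15);
          return hashingTF.transform(tokenized);
      }
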
Downplaying the weight of the irrelevant words in each message
  1. Double-click the tModelEncoder component labelled tf_idf to open its Component view. In this process, tModelEncoder reduces the weight of the words that appear very
    often but in too many messages, because such a word often brings no
    meaningful information for text analysis, such as the word the.

    tRandomForestModel_8.png

  2. Repeat the operations described previously over the tModelEncoder labelled Tokenize to add the sms_tf_idf_vect column of the Vector type to the output schema and define the
    transformation as displayed in the image above.

    tRandomForestModel_9.png

    In this transformation, tModelEncoder uses
    Inverse Document Frequency to downplay the
    weight of the words that appear in 5 or more messages.
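
    A rough Spark ML equivalent of this Inverse Document Frequency transformation (illustrative only, not the generated code; the minimum document frequency of 5 is an assumption based on the description above):

      import org.apache.spark.ml.feature.IDF;
      import org.apache.spark.ml.feature.IDFModel;
      import org.apache.spark.sql.Dataset;
      import org.apache.spark.sql.Row;

      // Sketch only: IDF re-weights the term-frequency vectors so that very
      // common words get lower weights; minDocFreq = 5 is assumed here.
      static Dataset<Row> addTfIdf(Dataset<Row> tf) {
          IDF idf = new IDF()
                  .setInputCol("sms_tf_vect")
                  .setOutputCol("sms_tf_idf_vect")
                  .setMinDocFreq(5);
          IDFModel idfModel = idf.fit(tf);
          return idfModel.transform(tf);
      }
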
Combining feature vectors
  1. Double-click the tModelEncoder component labelled features_assembler to open its Component view.

    tRandomForestModel_10.png

  2. Repeat the operations described previously over the tModelEncoder labelled Tokenize to add the features_vect column of the Vector type to the output schema and define the
    transformation as displayed in the image above.

    Note that the parameter to be put in the Parameters column is inputCols=sms_tf_idf_vect,num_currency,num_numeric,num_exclamation.
    tRandomForestModel_11.png

    In this transformation, tModelEncoder
    combines all feature vectors into one single feature column.
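
    A rough Spark ML equivalent of this assembler transformation (illustrative only, not the generated code), using the inputCols parameter shown above:

      import org.apache.spark.ml.feature.VectorAssembler;
      import org.apache.spark.sql.Dataset;
      import org.apache.spark.sql.Row;

      // Sketch only: combine the TF-IDF vector and the three numeric features
      // into the single features_vect column.
      static Dataset<Row> assembleFeatures(Dataset<Row> tfIdf) {
          VectorAssembler assembler = new VectorAssembler()
                  .setInputCols(new String[] {
                          "sms_tf_idf_vect", "num_currency", "num_numeric", "num_exclamation"})
                  .setOutputCol("features_vect");
          return assembler.transform(tfIdf);
      }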

Training the model using Random Forest

  1. Double-click tRandomForestModel to open its
    Component view.

    tRandomForestModel_12.png

  2. From the Label column list, select the
    column that provides the classes to be used for classification. In this
    scenario, it is label, which contains two
    class names: spam for junk messages and
    ham for normal messages.
  3. From the Features column list, select the
    column that provides the feature vectors to be analyzed. In this scenario,
    it is features_vect, which combines all
    features.
  4. Select the Save the model on file system
    check box and in the HDFS folder field that
    is displayed, enter the directory you want to use to store the generated
    model.
  5. In the Number of trees in the forest
    field, enter the number of decision trees you want tRandomForestModel to build. You need to try different numbers,
    running the current Job several times to create the classification model;
    after comparing the evaluation results of every model created on each run,
    you can decide the number you need to use. In this scenario, enter 20 (a
    sketch of the equivalent Spark training call follows these steps).

    An evaluation Job will be presented in one of the following
    sections.
  6. Leave the other parameters as is.
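
For reference only (the Studio generates its own code), the equivalent Spark ML training and saving calls would look roughly as follows. The HDFS path is a placeholder, and the string labels are first indexed because Spark ML expects a numeric label column, which tRandomForestModel handles internally:

    import java.io.IOException;
    import org.apache.spark.ml.classification.RandomForestClassificationModel;
    import org.apache.spark.ml.classification.RandomForestClassifier;
    import org.apache.spark.ml.feature.StringIndexer;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Sketch only: train a 20-tree Random Forest on the assembled features
    // and save the resulting model to a file system.
    static void trainAndSave(Dataset<Row> assembled, String modelPath) throws IOException {
        Dataset<Row> indexed = new StringIndexer()
                .setInputCol("label")
                .setOutputCol("label_idx")      // spam/ham mapped to 0.0/1.0
                .fit(assembled)
                .transform(assembled);

        RandomForestClassificationModel model = new RandomForestClassifier()
                .setLabelCol("label_idx")
                .setFeaturesCol("features_vect")
                .setNumTrees(20)
                .fit(indexed);

        model.write().overwrite().save(modelPath);   // for example, an HDFS folder
    }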

Selecting the Spark mode

Depending on the Spark cluster to be used, select a Spark mode for your Job.

The Spark documentation provides an exhaustive list of Spark properties and
their default values at Spark Configuration. A Spark Job designed in the Studio uses
this default configuration except for the properties you explicitly defined in the
Spark Configuration tab or the components
used in your Job.

  1. Click Run to open its view and then click the
    Spark Configuration tab to display its view
    for configuring the Spark connection.
  2. Select the Use local mode check box to test your Job locally.

    In the local mode, the Studio builds the Spark environment in itself on the fly in order to
    run the Job in it. Each processor of the local machine is used as a Spark
    worker to perform the computations.

    In this mode, your local file system is used; therefore, deactivate the
    configuration components such as tS3Configuration or
    tHDFSConfiguration that provide connection
    information to a remote file system, if you have placed these components
    in your Job.

    You can launch
    your Job without any further configuration.

  3. Clear the Use local mode check box to display the
    list of the available Hadoop distributions and from this list, select
    the distribution corresponding to your Spark cluster to be used.

    This distribution could be:

    • Databricks

    • Qubole

    • Amazon EMR

      For this distribution, Talend supports:

      • Yarn client

      • Yarn cluster

    • Cloudera

      For this distribution, Talend supports:

      • Standalone

      • Yarn client

      • Yarn cluster

    • Google Cloud
      Dataproc

      For this distribution, Talend supports:

      • Yarn client

    • Hortonworks

      For this distribution, Talend supports:

      • Yarn client

      • Yarn cluster

    • MapR

      For this distribution, Talend supports:

      • Standalone

      • Yarn client

      • Yarn cluster

    • Microsoft HD
      Insight

      For this distribution, Talend supports:

      • Yarn cluster

    • Cloudera Altus

      For this distribution, Talend supports:

      • Yarn cluster

        Your Altus cluster should run on the following Cloud
        providers:

        • Azure

          The support for Altus on Azure is a technical
          preview feature.

        • AWS

    As a Job relies on Avro to move data among its components, it is recommended to set your
    cluster to use Kryo to handle the Avro types. This not only helps avoid
    a known Avro issue but also
    brings inherent performance gains. The Spark property to be set in your
    cluster is:
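
      spark.serializer=org.apache.spark.serializer.KryoSerializer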

    If you cannot find the distribution corresponding to yours from this
    drop-down list, this means the distribution you want to connect to is not officially
    supported by Talend. In this situation, you can select
    Custom, then select the Spark
    version of the cluster to be connected and click the
    [+] button to display the dialog box in which you can
    alternatively:

    1. Select Import from existing
      version
      to import an officially supported distribution as base
      and then add other required jar files which the base distribution does not
      provide.

    2. Select Import from zip to
      import the configuration zip for the custom distribution to be used. This zip
      file should contain the libraries of the different Hadoop/Spark elements and the
      index file of these libraries.

      In Talend Exchange, members of the Talend
      community have shared some ready-for-use configuration zip files
      which you can download from this Hadoop configuration
      list and directly use them in your connection accordingly. However, because of
      the ongoing evolution of the different Hadoop-related projects, you might not be
      able to find the configuration zip corresponding to your distribution from this
      list; it is then recommended to use the Import from
      existing version option to take an existing distribution as base
      to add the jars required by your distribution.

      Note that custom versions are not officially supported by
      Talend. Talend and its community provide you with the opportunity to connect to
      custom versions from the Studio but cannot guarantee that the configuration of
      whichever version you choose will be easy. As such, you should only attempt to
      set up such a connection if you have sufficient Hadoop and Spark experience to
      handle any issues on your own.

    For a step-by-step example about how to connect to a custom
    distribution and share this connection, see Hortonworks.

Executing the Job to create the classification model

Then you can run this Job.

Press F6 to run this Job.

Once done, the model file is created in the directory you have specified in tRandomForestModel.

Evaluating the classification model

tRandomForestModel_13.png

Linking the components

  1. In the
    Integration
    perspective of the
    Studio, create another empty Spark Batch Job, named classify_and_evaluation for example, from the Job Designs node in the Repository tree view.
  2. In the workspace, enter the name of the component to be used and select
    this component from the list that appears. In this Job, the components are
    tHDFSConfiguration, tFileInputDelimited, tClassify,
    tReplicate, tJava, tFilterColumns and
    tLogRow.
  3. Except tHDFSConfiguration, connect them
    using the Row > Main link as is displayed
    in the image above.
  4. Double-click tHDFSConfiguration to open
    its Component view and configure it as
    explained previously in this scenario.

Loading the test set into the Job

  1. Double-click tFileInputDelimited to open its
    Component view.

    tRandomForestModel_14.png

  2. Select the Define a storage configuration component check box
    and select the tHDFSConfiguration component
    to be used.

    tFileInputDelimited uses this
    configuration to access the test set to be used.
  3. Click the […] button next to Edit
    schema
    to open the schema editor.
  4. Click the [+] button five times to add five rows and in the
    Column column, rename them to reallabel, sms_contents, num_currency,
    num_numeric and num_exclamation, respectively.

    tRandomForestModel_15.png

    The reallabel and the sms_contents columns carry the raw data, which is
    composed of the SMS text messages in the sms_contents column and the labels indicating whether a message
    is spam in the reallabel column.
    The other columns are used to carry the features added to the raw datasets
    as explained previously in this scenario. They contain the number of
    currency symbols, the number of numeric values and the number of exclamation
    marks found in each SMS message.
  5. In the Type column, select Integer for the num_currency, num_numeric and
    num_exclamation columns.
  6. Click OK to validate these changes.
  7. In the Folder/File field, enter the
    directory where the test set to be used is stored.
  8. In the Field separator field, enter
    , which is the separator used by the
    datasets you can download for use in this scenario.

Applying the classification model

  1. Double-click tClassify to open its
    Component view.

    tRandomForestModel_16.png

  2. Select the Model on filesystem radio button and enter the
    directory in which the classification model to be used is stored.

    The tClassify component contains a read-only column called
    label in which the model provides the
    classes to be used in the classification process, while the reallabel column retrieved from the input schema
    contains the classes to which each message actually belongs. The model will
    be evaluated by comparing the actual label of each message with the label
    the model determines.
    tRandomForestModel_17.png

Replicating the classification result

  1. Double-click tReplicate to open its
    Component view.

    tRandomForestModel_18.png

  2. Leave the default configuration as is.

Filtering the classification result

  1. Double-click tFilterColumns to open its
    Component view.
  2. Click the […] button next to Edit
    schema
    to open the schema editor.
  3. On the output side, click the [+] button three times to add
    three rows and in the Column column, rename
    them to reallabel, label and sms_contents, respectively. They receive data from the input
    columns that use the same names.

    tRandomForestModel_19.png

  4. Click OK to validate these changes and accept the
    propagation prompted by the pop-up dialog box.

Writing the evaluation program in tJava

  1. Double-click tJava to open its
    Component view.
  2. Click Sync columns to ensure that
    tJava retrieves the replicated schema of
    tClassify.
  3. Click the Advanced settings tab to open its view.

    tRandomForestModel_20.png

  4. In the Classes field, enter code to
    define the Java classes to be used to verify whether the predicted class
    labels match the actual class labels (spam for junk messages and ham for normal messages). In this scenario, row7 is the ID of the connection between
    tClassify and tReplicate and carries the classification result to be sent
    to its following components and row7Struct is the Java class of the RDD for the
    classification result. In your code, you need to replace row7, whether it is used alone or within
    row7Struct, with the corresponding
    connection ID used in your Job.

    Column names such as reallabel or
    label were defined in the previous step
    when configuring different components. If you named them differently, you
    need to keep them consistent for use in your code.
  5. Click the Basic settings tab to open its
    view and in the Code field, enter the code
    to be used to compute the accuracy score and the Matthews Correlation
    Coefficient (MCC) of the classification model.

    For a general explanation of the Matthews Correlation Coefficient, see https://en.wikipedia.org/wiki/Matthews_correlation_coefficient from Wikipedia.
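
The exact code depends on your connection IDs and column names. The following hedged sketch (not the code shipped with the original scenario) shows how the accuracy and the MCC can be computed by counting the four confusion-matrix cells over the classified rows, with spam treated as the positive class:

    import java.util.List;
    import org.apache.spark.sql.Row;

    // Sketch only: compare the actual class (reallabel) with the predicted
    // class (label) and compute the accuracy and the MCC.
    static void evaluate(List<Row> classifiedRows) {    // placeholder for your records
        long tp = 0, tn = 0, fp = 0, fn = 0;            // spam = positive class
        for (Row row : classifiedRows) {
            String real = row.getString(row.fieldIndex("reallabel"));
            String predicted = row.getString(row.fieldIndex("label"));
            if ("spam".equals(real) && "spam".equals(predicted)) tp++;
            else if ("ham".equals(real) && "ham".equals(predicted)) tn++;
            else if ("ham".equals(real) && "spam".equals(predicted)) fp++;
            else fn++;                                   // spam predicted as ham
        }
        double accuracy = (double) (tp + tn) / (tp + tn + fp + fn);
        double mcc = (tp * tn - fp * fn)
                / Math.sqrt((double) (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
        System.out.println("Accuracy = " + accuracy);
        System.out.println("MCC      = " + mcc);
    }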

Configuring Spark connection

Repeat the operations described above in the section that addresses the same subject.

Executing the Job

Then you can run this Job.

  1. The tLogRow component is used to present the execution
    result of the
    Job.

    If you want to configure the presentation mode on its Component view, double-click the tLogRow component to open the Component view and in the Mode area, select the Table (print
    values in cells of a table)
    radio button.
  2. If you need to display only the error-level information of Log4j logging in the console of the
    Run view, click Run to open its view and then click the Advanced settings tab.
  3. Select the log4jLevel check box from its view and select
    Error from the list.
  4. Press F6 to run this Job.

In the console of the Run view, you can read the classification result along with the actual labels:

tRandomForestModel_21.png

You can also read the computed scores in the same console:

tRandomForestModel_22.png

The scores show that the model is of good quality. You can still enhance the model
by continuing to tune the parameters used in tRandomForestModel and running the model-creation Job with new
parameters to obtain, and then evaluate, new versions of the model.


Source: Talend documentation (https://help.talend.com).