July 30, 2023

tNaiveBayesModel – Docs for ESB 7.x

tNaiveBayesModel

Generates a classifier model that is used by tPredict to
classify given elements.

tNaiveBayesModel analyzes incoming datasets by applying Bayes' theorem
with the (naive) assumption that the analyzed features of an element are independent of each
other.

It generates a classification model from this analysis and writes
this model to a given file system in the PMML (Predictive Model Markup
Language) format.
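The independence assumption can be illustrated with a minimal sketch in plain Python (this is a generic naive Bayes computation with made-up numbers, not the Spark implementation used by tNaiveBayesModel):

```python
# Naive Bayes: P(class | features) is proportional to
# P(class) * product of P(feature_i | class), because the features
# are (naively) assumed independent given the class.
from math import prod

# Hypothetical training statistics for a two-class problem.
priors = {"gold": 0.3, "platinum": 0.7}
likelihoods = {
    "gold":     {"age<=35": 0.6, "gender=male": 0.5},
    "platinum": {"age<=35": 0.2, "gender=male": 0.5},
}

def classify(features):
    """Return the class with the highest (unnormalized) posterior score."""
    scores = {
        c: priors[c] * prod(likelihoods[c][f] for f in features)
        for c in priors
    }
    return max(scores, key=scores.get)

print(classify(["age<=35", "gender=male"]))
# prints "gold": 0.3 * 0.6 * 0.5 = 0.09 beats 0.7 * 0.2 * 0.5 = 0.07
```

Note that the class with the higher prior does not necessarily win; the per-feature likelihoods can outweigh it, as in this example.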

In local mode, Apache Spark 1.3.0, 1.4.0, 1.5.0, 1.6.0, 2.0.0, 2.3.0 and 2.4.0 are
supported.

tNaiveBayesModel properties for Apache Spark Batch

These properties are used to configure tNaiveBayesModel running in the Spark Batch Job framework.

The Spark Batch
tNaiveBayesModel component belongs to the Machine Learning family.

This component is available in Talend Platform products with Big Data and
in Talend Data Fabric.

Basic settings

Define a storage configuration
component

Select the configuration component to be used to provide the configuration
information for the connection to the target file system such as HDFS.

If you leave this check box clear, the target file system is the local
system.

The configuration component to be used must be present in the same Job.
For example, if you have dropped a tHDFSConfiguration component in the Job, you can select it to write
the result in a given HDFS system.

Spark version

Select the Spark version you are using.

For Spark V1.4 onwards, the parameters to be set are:

  • Save the model on file system:

    Select this check box to store the model in a given file system. Otherwise, the model is
    stored in memory. The button for browsing does not work with the Spark Local mode; if you are using the Spark Yarn
    or the Spark Standalone mode, ensure that you have properly
    configured the connection in a configuration component in the same Job, such as tHDFSConfiguration.

  • Label column:

    Select the input column used to provide classification labels. The records of this column
    are used as the class names (Target in terms of classification) of the elements to be
    classified.

  • Feature column:

    Select the input column used to provide features. Very often, this column is the output of
    the feature engineering computations performed by tModelEncoder.

For Spark 1.3, see the parameters explained in the following rows of
this table.

Column type

Complete this table to define the feature type of each input column in
order to compute the classifier model.

  • Column: this column lists
    the input columns automatically retrieved from the input
    schema.

  • Usage: select the type of
    the feature that the records from each input column
    represent.

    For example, people’s ages represent a continuous
    feature, while people’s genders represent a categorical feature (also
    called a discrete feature).

    If you select Label for
    an input column, the records of this column are used as the
    class names (Target in terms of classification) of the
    elements to be classified; if you need to ignore a column in
    the model computation, select Unused.

  • Bin edges: this column is
    activated only when the input column represents the
    continuous feature. It allows you to discretize the
    continuous data into bins, that is to say, to partition the
    continuous data into half-open segments by putting boundary
    values within double quotation marks.

    For example, if you enter “18;35” for a column holding people’s ages,
    these ages are grouped into three segments: ages less than or
    equal to 18 go into one segment, ages greater than
    18 and less than or equal to 35 into another, and
    ages greater than 35 into the
    third segment.

  • Categories: this column
    is activated only when the input column represents the
    categorical feature. You need to enter the name of each
    category to be used and separate them using a semicolon (;),
    for example, “male;female”.

    Note that the categories you enter must exist in the input
    column.

  • Class name: this column
    is activated only when the Label
    option has been selected in the Usage column. You need to enter
    the names of the classes used in the classification and
    separate them using a semicolon (;), for example, “platinum-level customer;gold-level
    customer”.
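The bin-edge behavior described above can be sketched in plain Python (a hypothetical helper, not Talend code): edges such as "18;35" partition a continuous value into half-open segments.

```python
import bisect

def bin_index(value, edges_str):
    """Map a continuous value to a bin index, given 'Bin edges' such as "18;35".

    Edges e1;...;en define n+1 half-open segments:
    (-inf, e1], (e1, e2], ..., (en, +inf).
    """
    edges = [float(e) for e in edges_str.split(";")]
    # bisect_left places a value equal to an edge in the lower segment,
    # matching the "less than or equal to the edge" rule.
    return bisect.bisect_left(edges, value)

print(bin_index(18, "18;35"))  # prints 0 (first segment: <= 18)
print(bin_index(25, "18;35"))  # prints 1 (second segment: > 18 and <= 35)
print(bin_index(40, "18;35"))  # prints 2 (third segment: > 35)
```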

Training percentage

Enter the percentage (expressed as a decimal, for example 0.8) of the input data
to be used to train the classifier model. The rest of the data is used
to test the model.
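As a sketch of what a training percentage such as 0.8 means, the following plain-Python helper (hypothetical, not the Spark implementation) splits a dataset into training and test subsets:

```python
import random

def split(rows, training_percentage, seed=42):
    """Split rows into (train, test) using a decimal percentage such as 0.8."""
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = rows[:]                 # copy so the input list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * training_percentage)
    return shuffled[:cut], shuffled[cut:]

train, test = split(list(range(100)), 0.8)
print(len(train), len(test))  # prints "80 20"
```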

PMML model path

Enter the directory in which you need to store the generated
classifier model in the file system to be used.

The button for browsing does not work with the Spark
Local mode; if you are
using other Spark modes that the Studio supports with your distribution, such as Spark Yarn, ensure that you have properly
configured the connection in a configuration component in the same Job, such as
tHDFSConfiguration. Use the
configuration component that corresponds to the file system to be used.

For further information about the PMML format used by the Naive Bayes
model, see http://www.dmg.org/v4-2-1/NaiveBayes.html.

Parquet model name

Enter the name you need to use for the classifier model.

Usage

Usage rule

This component is used as an end component and requires an input link.

Model evaluation

The parameters you need to set are free parameters and so their values
may be provided by previous experiments, empirical guesses or the like.
They do not have any optimal values applicable for all datasets.

Therefore, you need to train the classifier model you are generating
with different sets of parameter values until you can obtain the best
Accuracy (ACC) score and the optimal Precision, Recall and F1-measure
scores for each class:

  • The Accuracy score varies from 0 to 1 and indicates how
    accurate a classification is. The closer an Accuracy score
    is to 1, the more accurate the corresponding
    classification is.

  • The Precision score, also varying from 0 to 1, indicates
    how relevant the elements selected by the classification are
    to a given class.

  • The Recall score, also varying from 0 to 1, indicates how
    many relevant elements are selected.

  • The F1-measure score is the harmonic mean of the Precision
    score and the Recall score.
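The four scores can be computed from the true and predicted labels as follows. This is a generic sketch in plain Python (not the code Talend uses) with a toy two-class example:

```python
def scores(y_true, y_pred, positive):
    """Per-class Precision, Recall, F1 and overall Accuracy from label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Toy example: one "b" was predicted where the true class was "a".
acc, p, r, f1 = scores(["a", "a", "b", "b"], ["a", "b", "b", "b"], positive="b")
print(acc, r)  # prints "0.75 1.0"; precision is 2/3 and F1 is 0.8
```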

Scores

These scores can be output to the console of the Run view
when you execute the Job, provided that you have added the relevant DataScience Logger code to the Log4j view in the Project Settings dialog
box.

These scores are output along with the other Log4j INFO-level information. To
prevent irrelevant information from being output, you can, for example, change the Log4j level
of that information to WARN, but note that you need to keep the DataScience Logger code at the INFO level.

If you are using a subscription-based version of the Studio, the activity of this
component can be logged using the log4j feature. For more
information on this feature, see the Talend Studio User Guide.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.

Related scenarios

No scenario is available for the Spark Batch version of this component
yet.


Document retrieved from Talend: https://help.talend.com