tSparkALSModel – Docs for ESB 5.x

tSparkALSModel

Warning

This component is available in the Palette of the Studio only if you have
subscribed to a Talend Platform product with Big Data.

tSparkALSModel properties

Component family

Big Data / Machine Learning

 

Function

tSparkALSModel leverages Spark to
process large amounts of data about users’ preferences for
given products.

It receives this data from the preceding Spark
component and performs ALS (Alternating Least Squares) computations
on it in order to generate a fine-tuned
product recommender model, which can be written to a given file system
by the tSparkStore
component.
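Conceptually, the computation this component wraps corresponds to MLlib's ALS implementation. The following Scala sketch shows what the equivalent plain-Spark call could look like, assuming the preferences arrive as an RDD[Rating]; the names and parameter values are illustrative, not the component's generated code:

  import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
  import org.apache.spark.rdd.RDD

  // Illustrative sketch: ratings holds the (user, product, rating)
  // triples received from the preceding Spark component.
  def trainRecommender(ratings: RDD[Rating]): MatrixFactorizationModel = {
    val rank = 10          // Number of latent factors
    val numIterations = 20 // Number of iterations
    val lambda = 10.0      // Regularization factor
    // ALS.train returns a MatrixFactorizationModel holding the user and
    // product factor matrices that make up the recommender model.
    ALS.train(ratings, rank, numIterations, lambda)
  }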

Purpose

Based on given user-product interaction data, tSparkALSModel generates a
matrix associating users, ratings and products, which is used by tSparkRecommend to estimate these users’
preferences.

Basic settings

Schema and Edit schema

A schema is a row description. It defines the number of fields to be processed and passed on
to the next component. The schema is either Built-In or
stored remotely in the Repository.

Note that the output schema of tSparkALSModel is read-only. It defines the factors
used to create the recommender matrix:

  • Flag: the Character
    values (P or
    U) in this column
    indicate whether an entry is about a user or a
    product.

  • ID: the numerical
    values in this column are the user IDs or the product
    IDs incoming from the source data.

  • FEATURE_VALUES: the
    numerical values in this column are the measured
    heuristic values that reflect the features of each
    product or each user.
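For illustration of what these three columns hold: a trained MLlib MatrixFactorizationModel exposes its factors as the userFeatures and productFeatures pair RDDs, which could be flattened into such rows along the following lines. This is a sketch of the idea only, not the component's actual code; toRows is an illustrative helper name:

  import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

  // Illustrative sketch: flatten both factor matrices into
  // (Flag, ID, FEATURE_VALUES) rows.
  def toRows(model: MatrixFactorizationModel) = {
    val users = model.userFeatures.map { case (id, factors) =>
      ("U", id, factors.mkString("_")) // U flags a user entry
    }
    val products = model.productFeatures.map { case (id, factors) =>
      ("P", id, factors.mkString("_")) // P flags a product entry
    }
    users.union(products)
  }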

 

Feature table

Complete this table to map the input columns with the three
factors required to compute the recommender model.

  • Input column: select
    the input column to be used from the drop-down list.
    The selected columns must contain the user IDs, the
    product IDs and the ratings, and their data must be
    numerical values.

  • Feature type: select
    the factor that each selected input column needs to be
    mapped with. The three factors are User_ID, Product_ID and Rating.

This map allows tSparkALSModel to
read the right type of data for each required factor.

Training percentage

Enter the percentage (expressed in decimal form, for example 0.7) of the input
data to be used to train the recommender model. The rest of the data
is used to test the model.
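In plain Spark, this split could be performed with RDD.randomSplit. A minimal sketch, continuing the sketch under Function (where ratings is the incoming RDD[Rating]) and assuming a training percentage of 0.7:

  // Illustrative sketch: 70% of the ratings train the model, 30% test it.
  val Array(training, test) = ratings.randomSplit(Array(0.7, 0.3), seed = 11L)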

 

Number of latent factors

Enter the number of latent factors with which each user or
product feature is measured.

 

Number of iterations

Enter the number of iterations you want the Job to perform to
train the model.

According to the Spark community, this number should by default be
smaller than 30 in order to avoid stack overflow issues; in
practice, the convergent score (RMSE score) can often be obtained
before you need a number beyond 30.

However, if you need to perform more than 30 iterations, you must
increase the stack size used to run the Job. To do this, add
the -Xss argument, for example -Xss2048k, to the JVM Settings
table in the Advanced settings tab of the Run view. For further
information about the JVM Settings table, see Talend Studio User
Guide.

 

Regularization factor

Enter the regularization number you want to use to avoid
overfitting.
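Taken together, the Basic settings map onto an MLlib training and evaluation pass roughly as follows. This is an illustrative sketch continuing the training/test split above, not the component's generated code:

  import org.apache.spark.mllib.recommendation.{ALS, Rating}
  import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark versions

  // Train with the values entered in Basic settings.
  val model = ALS.train(training, 10 /* latent factors */,
                        20 /* iterations */, 10.0 /* regularization */)

  // Predict the held-out ratings and compute the RMSE against the real ones.
  val predictions = model
    .predict(test.map { case Rating(u, p, _) => (u, p) })
    .map { case Rating(u, p, r) => ((u, p), r) }
  val rmse = math.sqrt(
    test.map { case Rating(u, p, r) => ((u, p), r) }
      .join(predictions)
      .map { case (_, (actual, predicted)) => math.pow(actual - predicted, 2) }
      .mean())

The lower the RMSE, the better the model fits the held-out test data.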

Advanced settings

tStatCatcher Statistics

Select this check box to collect log data at the component
level.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl +
Space
to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio
User Guide.

Usage

This component is placed in the middle of the Spark process as an
intermediate component.

Note that the parameters you need to set are free parameters, so
their values may come from previous experiments, empirical
guesses or the like. They do not have optimal values applicable
to all datasets. Therefore, you need to train the model you are
generating with different sets of parameter values until you
obtain the minimum RMSE score. This score is output to the
console of the Run view each time a
Job execution is done.
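Outside the Studio, trying different parameter sets amounts to a simple sweep; in a Talend Job you instead vary the values in Basic settings and re-run. An illustrative sketch, continuing the sketches above, where rmseOf is a hypothetical helper wrapping the RMSE computation sketched under Basic settings:

  // Illustrative parameter sweep: keep the combination with the lowest RMSE.
  val candidates = for {
    rank   <- Seq(5, 10, 20)
    iters  <- Seq(10, 20)
    lambda <- Seq(0.01, 0.1, 1.0, 10.0)
  } yield (rank, iters, lambda)

  val (bestParams, bestRmse) = candidates.map { case (rank, iters, lambda) =>
    val model = ALS.train(training, rank, iters, lambda)
    ((rank, iters, lambda), rmseOf(model, test)) // rmseOf: hypothetical helper
  }.minBy(_._2)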

MLlib installation

Spark’s machine learning library, MLlib, uses the gfortran runtime library and for this
reason, you need to ensure that this library is already present in
every node of the Spark cluster to be used.

For further information about MLlib and this library, see Spark’s
related documentation.

Limitation

This component does not support the Spark streaming mode.

It is strongly recommended to use this component in a Spark-only Job, that is to say, to
design and run a Spark Job separately from non-Spark components or Jobs. For example, it
is not recommended to use the tRunJob component to
coordinate a Spark Job and a non-Spark Job, or to use the tHDFSPut component along with the Spark components in the same Job.

Scenario: creating a user-movie recommender model

In this scenario, a four-component Job is created to analyze existing data about
users’ preference for movies to generate the recommender model that is used to recommend
movies to these users.


use_case-tsparkalsmodel1.png

Linking the components

  1. In the Integration perspective
    of the Studio, create an empty Job from the Job
    Designs
    node in the Repository tree view.

    For further information about how to create a Job, see Talend Studio User Guide.

  2. In the workspace, enter the name of the component to be used and select
    this component from the list that opens. In this scenario, the components
    are tSparkConnection, tSparkLoad, tSparkALSModel
    and tSparkStore.

  3. Connect tSparkConnection to tSparkLoad using the Trigger > On Subjob Ok link.

  4. Connect the other components using the Row > Spark
    combine
    link.

Configuring the connection to Spark

  1. Double-click tSparkConnection to open its
    Component view.

    use_case-tsparkalsmodel2.png
  2. From the Spark mode list, select the mode
    that fits the Spark system you need to use. In this scenario, select
    Standalone since the Spark system to be
    used is installed in a standalone Hadoop cluster.

    For further details about the different Spark modes available in this
    component, see tSparkConnection. You can as well read
    Apache’s documentation about Spark or the documentation of the Hadoop
    distribution you are using for more relevant details.

  3. In the Distribution and the Version lists, select the options that
    correspond to the Hadoop cluster to be used.

    If the distribution you need to use is not yet officially supported by the
    Spark components, you need to select Custom
    and use the […] button that is displayed
    to set up the related configuration. For further information about how to
    configure a custom Hadoop connection, see Connecting to a custom Hadoop distribution.

  4. In the Spark host field, enter the URI of
    the Spark master node.

  5. In the Spark home field, enter the path
    to the Spark executables and libraries in the Hadoop cluster to be
    used.

  6. Select the Define the driver hostname or IP
    address
    check box and in the field that is displayed, enter
    the IP address of the machine in which the Job is to be run.

    This value is actually the value of the spark.driver.host
    property. For further information about this property, see Apache’s
    documentation about Spark.

Reading the existing user preference data

  1. Double-click tSparkLoad to open its
    Component view.

    use_case-tsparkalsmodel3.png
  2. From the Spark connection list, select
    the connection to be used.

  3. In the Input file field, enter the path
    pointing to the file to be read. In this scenario, the name of the sample
    data about user preferences for movies is ratings.dat. This file is available in the recommendation_with_spark folder under the
    Documentation node of the Repository tree view in the Integration perspective.

  4. From the Type list, select the type of
    the data to be read. In this scenario, select Text File.

  5. In the Field separator field, enter the
    separator used by the source data to separate its fields. In this example,
    it is two colons (::). A sketch of how such lines could be parsed in plain
    Spark is given after this list.

  6. Click the [+] button next to Edit schema to open the schema editor.

  7. Click the [+] button four times to add
    four rows and rename them to UserID,
    MovieID, Rating and Timestamp,
    respectively. These columns correspond to the schema of the source
    data.

    use_case-tsparkalsmodel4.png
  8. Click OK to validate these changes and
    accept the propagation prompted by the pop-up dialog box.

  9. In the Storage source field, select the
    type of the source data to be processed. In this scenario, select HDFS since this data has been uploaded to the
    HDFS system of the Hadoop cluster where the Spark to be used is installed.
    You can use tHDFSPut to put the source data
    into the HDFS system. For further information, see tHDFSPut.

  10. In the Namenode URI field, enter the
    location of the Namenode of the Hadoop cluster to be used.
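For reference, parsing such UserID::MovieID::Rating::Timestamp lines in plain Spark could look like the following sketch; the HDFS path is illustrative, not the actual location used in this scenario:

  import org.apache.spark.mllib.recommendation.Rating

  // Illustrative sketch: split each line on "::" and drop the timestamp.
  val ratings = sc.textFile("hdfs://namenode:8020/path/to/ratings.dat")
    .map { line =>
      val Array(userId, movieId, rating, _) = line.split("::")
      Rating(userId.toInt, movieId.toInt, rating.toDouble)
    }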

Training the recommender model

  1. Double-click tSparkALSModel to open its
    Component view.

    use_case-tsparkalsmodel5.png
  2. In the Feature table, add three rows by
    clicking the [+] button three times and
    configure the columns of this table:

    • Input column: select the
      columns you need to use from the input schema.

    • Feature type: select the
      feature type of the data stored in each input column.

    This table allows you to indicate which feature an input column is
    about.

  3. Give experimental values to the other parameters present in this Basic settings view.

    • In the Training percentage
      field, enter the percentage of source data to be used to train
      the model. The rest of the data is used to test the model. In
      this scenario, put 0.7.

    • In the Number of latent
      factors
      field, enter the number of factors you
      need to use to measure each user or movie feature, for example,
      10.

    • In the Number of iterations
      field, enter the number of iterations you need tSparkALSModel to perform to train
      the model. For example, enter 20.

    • In the Regularization factor
      field, enter the regularization number (of the type Double) to
      prevent overfitting, for example, 10.0.

    You need to try different sets of values for these parameters across
    executions of the Job until the RMSE score shown in the console of the
    Run view is small enough for your needs.

    Note that you need to enter only the values themselves, without
    double quotation marks around them.

  4. Right-click tSparkStore to open the contextual menu and select
    Deactivate tSparkStore_1. Note that this tSparkStore_1 is the ID of the
    tSparkStore component used in this scenario. It might be
    another value in your Job.

  5. Press F6 to run this Job to verify the
    RMSE value.

    In the Run view of this Job, the RMSE
    value of this model is output to the console.

    use_case-tsparkalsmodel8.png
  6. Repeat the execution with different parameter values in the Basic settings tab until you find an RMSE score
    small enough for your needs.

Writing the model files to HDFS

  1. Right-click tSparkStore to open the contextual menu and select
    Activate tSparkStore_1 to make this component active again in this
    Job.

  2. Double-click tSparkStore to open its
    Component view.

    use_case-tsparkalsmodel6.png
  3. From the Spark connection list, select
    the connection to be used.

  4. Click the Sync column button to ensure
    that tSparkStore retrieves the entire
    schema of the recommender model from tSparkALSModel.

  5. In the Storage area, select HDFS from the Storage
    target
    list and in the Namenode
    URI
    field, enter the location of the Namenode of the cluster
    to be used.

  6. In the Result folder URI field, enter the
    directory in HDFS in which you need to write the recommender model
    files.

    If you need to store the models of each experimental execution of the Job,
    you must enter a different directory for each model every time you run
    the Job.

  7. In the Field separator field, enter the
    separator you want to use to separate the fields in the recommender model.
    In this example, it is a semicolon (;).

Executing the Job

Then you can press F6 to run this Job.

Once done, you can find the recommender model files in the HDFS system. The
following image shows a part of the model’s contents.

use_case-tsparkalsmodel7.png

The columns are separated by semicolons (;). The letter U in the first column means that these rows are about users’
features, the second column stores the users’ IDs, and the third column stores the
numeric values of the latent factors that describe each user’s feature. As presented
previously in this scenario, 10 latent factors are used for each feature, and from
this image you can read that these factors are separated by underscores (_).
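A line of these files can be read back using the separators of this scenario; a minimal sketch, where parseModelLine is an illustrative helper name, not part of the product:

  // Illustrative sketch: parse a line such as "U;42;0.31_0.07_..." into
  // a (flag, id, factors) triple using this scenario's separators.
  def parseModelLine(line: String): (String, Int, Array[Double]) = {
    val Array(flag, id, features) = line.split(";")
    (flag, id.toInt, features.split("_").map(_.toDouble))
  }

With the factor vectors recovered, a user's preference for a product is essentially estimated as the dot product of the corresponding user and product factor vectors.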

Note that these model files contain only the underlying data of the model eventually used
by tSparkRecommend. At runtime, the tSparkRecommend component reads these files to
automatically build the actual user-product matrix it uses to recommend
products. For further information about the scenario using these model files, see
Scenario: recommending movies to users.

