Warning
This component is available in the Palette of the Studio only if you have subscribed to a Talend Platform product with Big Data.
Component family | Big Data / Machine Learning
Function | tSparkALSModel leverages Spark to apply the ALS (Alternating Least Squares) algorithm to user-product interaction data such as ratings. It receives this kind of information from its preceding Spark component.
Purpose | Based on given user-product interactive data, tSparkALSModel generates an ALS-based recommender model that can be used to recommend products to users.
Basic settings

Schema and Edit schema | A schema is a row description. It defines the number of fields to be processed and passed on to the next component. Note that the output schema of tSparkALSModel is read-only: it defines the factors that describe each user and each product in the generated model.
Feature table | Complete this table to map the input columns with the three feature types the ALS algorithm requires: the user ID, the product ID and the rating. This map allows tSparkALSModel to identify which feature each input column carries.
Training percentage | Enter the percentage (expressed in the decimal form) of the input data to be used to train the model. The rest of the data is used to test the model.
Number of latent factors | Enter the number of the latent factors with which each user or product feature is described in the model.
Number of iterations | Enter the number of iterations you want the Job to perform to train the model. According to Spark's community, ALS typically converges to a reasonable solution in 20 iterations or fewer. However, if you need to perform more than 30 iterations, you must adapt your Spark configuration accordingly; see Spark's documentation about ALS for details.
Regularization factor | Enter the regularization number you want to use to avoid overfitting the model to the training data.
Advanced settings

tStatCatcher Statistics | Select this check box to collect log data at the component level.
Global Variables | ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string.
A Flow variable functions during the execution of a component, while an After variable functions after the execution of the component.
To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.
For further information about variables, see Talend Studio User Guide.
Usage | This component is placed in the middle of the Spark process as an intermediate step. Note that the parameters you need to set are free parameters with no values that are optimal for every dataset: you need to experiment with different sets of values until the evaluation score of the model satisfies you.
MLlib installation | Spark's machine learning library, MLlib, uses the gfortran runtime library; for this reason, ensure that this library is already installed in every node of the Spark cluster to be used. For further information about MLlib and this library, see Spark's related documentation.
Limitation | This component does not support the Spark streaming mode. It is strongly recommended to use this component in a Spark-only Job, that is to say, a Job composed only of the Spark family components.
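For reference only: these settings correspond to the parameters of Spark MLlib's ALS implementation. The following sketch is not the code the Studio generates; it simply illustrates, under the assumption of an existing RDD of Rating records built from the mapped input columns, how the training percentage, the number of latent factors, the number of iterations and the regularization factor would be passed to MLlib:

    import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
    import org.apache.spark.rdd.RDD

    // `ratings` is assumed to hold Rating(user, product, rating) records
    // built from the columns mapped in the Feature table.
    def buildModel(ratings: RDD[Rating]): MatrixFactorizationModel = {
      // Training percentage 0.7: 70% of the input trains the model,
      // the remaining 30% is held out to test it.
      val Array(training, test) = ratings.randomSplit(Array(0.7, 0.3))

      // Arguments: training data, number of latent factors (rank),
      // number of iterations, regularization factor (lambda).
      ALS.train(training, 10, 20, 10.0)
    }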
In this scenario, a four-component Job is created to analyze existing data about users' preferences for movies and to generate the recommender model that is used to recommend movies to these users.
-
In the Integration perspective of the Studio, create an empty Job from the Job Designs node in the Repository tree view. For further information about how to create a Job, see Talend Studio User Guide.
-
In the workspace, enter the name of the component to be used and select this component from the list that opens. In this scenario, the components are tSparkConnection, tSparkLoad, tSparkALSModel and tSparkStore.
-
Connect tSparkConnection to tSparkLoad using the Trigger > On Subjob Ok link.
-
Connect the other components using the Row > Spark
combine link.
-
Double-click tSparkConnection to open its Component view.
-
From the Spark mode list, select the mode that fits the Spark system you need to use. In this scenario, select Standalone since the Spark system to be used is installed in a standalone Hadoop cluster. For further details about the different Spark modes available in this component, see tSparkConnection. You can also read Apache's documentation about Spark or the documentation of the Hadoop distribution you are using for more relevant details.
-
In the Distribution and the Version lists, select the options that correspond to the Hadoop cluster to be used. If the distribution you need to use is not yet officially supported by the Spark components, select Custom and use the […] button that is displayed to set up the related configuration. For further information about how to configure a custom Hadoop connection, see Connecting to a custom Hadoop distribution.
-
In the Spark host field, enter the URI of the Spark master node.
-
In the Spark home field, enter the path to the Spark executables and libraries in the Hadoop cluster to be used.
-
Select the Define the driver hostname or IP address check box and, in the field that is displayed, enter the IP address of the machine in which the Job is to be run. This value is actually the value of the spark.driver.host property. For further information about this property, see Apache's documentation about Spark. The sketch after this procedure shows how these settings map to plain Spark properties.
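For reference only, here is a minimal sketch of these connection settings expressed as plain Spark configuration code, not the code the Studio generates. The application name, master URI, Spark home path and driver IP below are hypothetical placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // All values below are placeholders; use those of your own cluster.
    val conf = new SparkConf()
      .setAppName("MovieRecommendationModel")             // hypothetical Job name
      .setMaster("spark://spark-master.example.com:7077") // Spark host: URI of the Spark master node
      .setSparkHome("/usr/lib/spark")                     // Spark home: executables and libraries
      .set("spark.driver.host", "192.0.2.10")             // IP address of the machine running the Job

    val sc = new SparkContext(conf)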
-
Double-click tSparkLoad to open its Component view.
-
From the Spark connection list, select the connection to be used.
-
In the Input file field, enter the path pointing to the file to be read. In this scenario, the name of the sample data about user preferences for movies is ratings.dat. This file is available in the recommendation_with_spark folder under the Documentation node of the Repository tree view in the Integration perspective.
-
From the Type list, select the type of the data to be read. In this scenario, select Text File.
-
In the Field separator field, enter the separator used by the source data to separate its fields. In this example, it is two colons (::).
-
Click the [+] button next to Edit schema to open the schema editor.
-
Click the [+] button four times to add four rows and rename them to UserID, MovieID, Rating and Timestamp, respectively. These columns correspond to the schema of the source data.
-
Click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.
-
In the Storage source field, select the type of the source data to be processed. In this scenario, select HDFS since this data has been uploaded to the HDFS system of the Hadoop cluster where the Spark system to be used is installed. You can use tHDFSPut to put the source data to the HDFS system. For further information, see tHDFSPut.
-
In the Namenode URI field, enter the location of the Namenode of the Hadoop cluster to be used. A sketch of the equivalent parsing logic follows this procedure.
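To make the schema concrete, each line of ratings.dat holds the four fields above separated by two colons, in the form UserID::MovieID::Rating::Timestamp. A minimal sketch of the equivalent parsing step, assuming the SparkContext sc from the earlier sketch and a hypothetical Namenode URI and file path:

    import org.apache.spark.mllib.recommendation.Rating

    // The Namenode URI and path are placeholders; adjust them to your cluster.
    val lines = sc.textFile("hdfs://namenode.example.com:8020/user/talend/ratings.dat")

    // Each line looks like "UserID::MovieID::Rating::Timestamp".
    val ratings = lines.map { line =>
      val fields = line.split("::")
      // Only the first three columns feed the ALS algorithm;
      // the Timestamp column (fields(3)) is not needed to train the model.
      Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
    }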
-
Double-click tSparkALSModel to open its Component view.
-
In the Feature table, add three rows by clicking the [+] button three times and configure the columns of this table:
-
Input column: select the columns you need to use from the input schema.
-
Feature type: select the feature type of the data stored in each input column.
This table allows you to indicate which feature each input column is about.
-
Give experimental values to the other parameters present in this Basic settings view:
-
In the Training percentage field, enter the percentage of the source data to be used to train the model. The rest of the data is used to test the model. In this scenario, enter 0.7.
-
In the Number of latent factors field, enter the number of factors you need to use to measure each user or movie feature, for example, 10.
-
In the Number of iterations field, enter the number of iterations you need tSparkALSModel to perform to train the model. For example, enter 20.
-
In the Regularization factor field, enter the regularization number (of the type Double) to prevent overfitting, for example, 10.0.
You need to try different sets of values for these parameters over successive executions of the Job until the console of the Run view shows an RMSE score small enough to your satisfaction. Note that you need to enter the values themselves only, without adding double quotation marks around them.
-
Right-click tSparkStore to open the contextual menu and select Deactivate tSparkStore_1. Note that tSparkStore_1 is the ID of the tSparkStore component used in this scenario; it might be another value in your Job.
-
Press F6 to run this Job and verify the RMSE value. In the Run view of this Job, the RMSE value of this model is output to the console.
-
Repeat the execution with different parameter values in the Basic settings view until you find an RMSE score small enough to your satisfaction. For reference, the sketch after this step shows how such an RMSE score can be computed on the test split.
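The RMSE (root mean squared error) score compares the ratings the model predicts with the actual ratings of the held-out test split. A minimal sketch, assuming the model and the test split produced by the earlier sketches:

    import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}
    import org.apache.spark.rdd.RDD

    // `model` is the trained MatrixFactorizationModel and `test` the 30%
    // of the data that was held out of the training.
    def rmse(model: MatrixFactorizationModel, test: RDD[Rating]): Double = {
      // Predict a rating for every (user, product) pair of the test split.
      val predictions = model
        .predict(test.map(r => (r.user, r.product)))
        .map(p => ((p.user, p.product), p.rating))

      // Pair each actual rating with its prediction, then average the squared errors.
      val squaredErrors = test
        .map(r => ((r.user, r.product), r.rating))
        .join(predictions)
        .map { case (_, (actual, predicted)) => math.pow(actual - predicted, 2) }

      math.sqrt(squaredErrors.mean())
    }

The smaller the returned value, the closer the model's predictions are to the actual ratings.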
-
Right-click tSparkStore to open the contextual menu and select Activate tSparkStore_1 to make this component active again in this Job.
-
Double-click tSparkStore to open its Component view.
-
From the Spark connection list, select the connection to be used.
-
Click the Sync columns button to ensure that tSparkStore retrieves the entire schema of the recommender model from tSparkALSModel.
-
In the Storage area, select HDFS from the Storage target list and, in the Namenode URI field, enter the location of the Namenode of the cluster to be used.
-
In the Result folder URI field, enter the directory in HDFS in which you need to write the recommender model files. If you need to store the model of each experimental execution of the Job, you must enter a different directory for each model every time you run the Job.
-
In the Field separator field, enter the separator you want to use to separate the fields in the recommender model. In this example, it is a semicolon (;).
Then you can press F6 to run this Job.
Once done, you can find the recommender model files in the HDFS system. The following image shows a part of the model's contents.
The columns are separated by semicolons (;). The letter U in the first column means that these rows are about users' features, the second column stores the users' IDs and the third column stores the numeric values of the latent factors that describe each user's feature. As presented previously in this scenario, 10 latent factors are used for each feature, and from this image, you can read that these factors are separated by underscores (_).
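Put differently, each user row of the model files can be split back into its three columns. A short sketch with a hypothetical sample row, purely to illustrate the layout described above:

    // Hypothetical model-file row: type flag, user ID, 10 latent factors.
    val row = "U;42;0.81_0.30_0.15_0.92_0.44_0.06_0.77_0.21_0.58_0.39"

    // The fields are separated by the semicolon chosen in tSparkStore.
    val Array(kind, id, factors) = row.split(";")

    // The latent factors themselves are separated by underscores.
    val featureVector: Array[Double] = factors.split("_").map(_.toDouble)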
Note that these model files are only the underlying data of the model eventually used by tSparkRecommend. At runtime, the tSparkRecommend component reads these files to automatically build the actual user-product interactive matrix it uses to recommend products. For further information about the scenario using these model files, see Scenario: recommending movies to users.