Warning
This component will be available in the Palette of
the studio on the condition that you have subscribed to any Talend Platform product with Big Data.
Component family
Big Data / Machine Learning

Function
tSparkRecommend uses a given recommender model to compute product recommendations for the users it receives in its input flow.
Purpose
Based on the user-product recommender model generated by tSparkALSModel, tSparkRecommend recommends products to users known to this model.
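As background, the model tSparkALSModel produces is a matrix-factorization (ALS-style) model. The following plain-Python sketch (not Talend or Spark code; the names and factor values are hypothetical) illustrates how such a model scores a user-product pair, namely as the dot product of the two latent factor vectors:

```python
# Illustrative sketch only: an ALS-style model stores one latent factor
# vector per user and per product; the predicted rating for a
# (user, product) pair is the dot product of the two vectors.

# Hypothetical factor tables; a real model learns these from rating data.
user_factors = {"u1": [0.9, 0.1], "u2": [0.2, 0.8]}
product_factors = {"p1": [1.0, 0.0], "p2": [0.0, 1.0]}

def predict_rating(user_id, product_id):
    """Predicted preference of a known user for a product."""
    u = user_factors[user_id]
    p = product_factors[product_id]
    return sum(a * b for a, b in zip(u, p))
```

This is only the scoring step; the component applies it across the model's product set to rank candidates per user.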
Schema and Edit schema
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. Click Edit schema to make changes to the schema.
Note that apart from the columns you can edit by yourself, this component provides three read-only columns (UserID, ProductID and Rating) that receive the corresponding data from the recommender model.
Number of top recommended products
Enter the number of the most relevant products to be recommended to each user. Note that this is a numeric value, so you cannot use the double quotation marks around it.
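To illustrate what this setting controls, here is a minimal plain-Python sketch (not Talend code; the product IDs and scores are invented) of picking the top N products from one user's predicted ratings:

```python
import heapq

def top_n_products(scores, n):
    """Return the n product IDs with the highest predicted ratings.

    scores: dict mapping product ID -> predicted rating for one user.
    """
    return heapq.nlargest(n, scores, key=scores.get)

# Example: recommend the 2 best-scored products.
scores = {"p1": 0.9, "p2": 0.3, "p3": 0.7}
top_n_products(scores, 2)  # → ['p1', 'p3']
```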
|
Input model path
Enter the directory where the recommender model to be used is stored.
|
File separator
Enter the field separator used by the recommender model files.
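As a sketch of what this separator governs, the following plain Python (not Talend code; the record layout is assumed for illustration, based on the model's UserID, ProductID and Rating fields) splits one model record into its fields:

```python
def parse_model_record(line, separator=","):
    """Split one recommender-model record into (user, product, rating).

    The three-field layout is assumed here for illustration only.
    """
    user_id, product_id, rating = line.rstrip("\n").split(separator)
    return user_id, product_id, float(rating)

parse_model_record("1,100,4.5")                  # → ('1', '100', 4.5)
parse_model_record("12::456::4.5", separator="::")  # → ('12', '456', 4.5)
```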
|
Select the User Identity column
Select the column that carries the user ID data from the input schema.
|
Advanced settings

tStatCatcher Statistics
Select this check box to collect log data at the component level.
Global Variables
ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string.
A Flow variable functions during the execution of a component, while an After variable functions after the execution of the component.
To fill in a field or expression with a variable, press Ctrl + Space to access the variable list.
For further information about variables, see Talend Studio User Guide.
|
Usage
This component is placed in the middle of the Spark process as an intermediate step. Note that the user IDs processed by this component must be known to the recommender model being used.
|
MLlib installation
Spark's machine learning library, MLlib, uses the gfortran runtime library; for this reason, ensure that this library is present in every node of the Spark cluster to be used. For further information about MLlib and this library, see Spark's related documentation.
|
Limitation
This component does not support the Spark streaming mode. It is strongly recommended to use this component in a Spark-only Job, that is to say, to use only Spark components in the same Job.
In this scenario, an eight-component Job is created to leverage a given user-movie
interactive model to recommend movies to users recognized by this model.
-
In the Integration perspective
of the Studio, create an empty Job from the Job
Designs node in the Repository tree view. For further information about how to create a Job, see Talend Studio User Guide.
-
In the workspace, enter the name of the component to be used and select
this component from the list that opens. In this scenario, the components
are tSparkConnection, two tSparkLoad components (labelled LoadUser and LoadMovie, respectively), tSparkRecommend, tSparkJoin,
tSparkFilterColumns, tSparkSort and tSparkLog. -
Connect tSparkConnection to the tSparkLoad component that is labelled LoadUser using the Trigger > On Subjob Ok link.
-
Connect the other components using the Row > Spark
combine link in the order in which these components are
listed above. Among them, the tSparkLoad
component that is labelled LoadMovie is
connected to the tSparkJoin
component.
-
Double-click tSparkConnection to open its
Component view. -
From the Spark mode list, select the mode
that fits the Spark system you need to use. In this scenario, select
Standalone since the Spark system to be
used is installed in a standalone Hadoop cluster. For further details about the different Spark modes available in this
component, see tSparkConnection. You can also read
Apache’s documentation about Spark or the documentation of the Hadoop
distribution you are using for more relevant details. -
In the Distribution and the Version lists, select the options that
correspond to the Hadoop cluster to be used. If the distribution you need to use is not yet officially supported by the
Spark components, you need to select Custom
and use the […] button that is displayed
to set up the related configuration. For further information about how to
configure a custom Hadoop connection, see Connecting to a custom Hadoop distribution. -
In the Spark host field, enter the URI of
the Spark master node. -
In the Spark home field, enter the path
to the Spark executables and libraries in the Hadoop cluster to be
used. -
Select the Define the driver hostname or IP
address check box and in the field that is displayed, enter
the IP address of the machine in which the Job is to be run. This value is actually the value of the
spark.driver.host
property. For further information about this property, see Apache’s
documentation about Spark.
-
Double-click the tSparkLoad component
labelled LoadUser to open its Component view. -
From the Spark connection list, select
the connection to be used. -
In the Input file field, enter the path
pointing to the file to be read. This file contains the users for whom this
Job is used to recommend movies. In this scenario, the file is 50_users.dat. This file is available in the
recommendation_with_spark folder
under the Documentation node of the
Repository tree view in the Integration perspective. -
From the Type list, select the type of
the data to be read. In this scenario, select Text
File. -
In the Field separator field, enter the
separator used by the source data to separate its fields. In this example,
it is a semicolon (;). -
Click the [+] button next to Edit schema to open the schema editor.
-
Click the [+] button five times to add
five rows and rename them to UserID,
FullName, Age, Occupation and
ZipCode, respectively. These columns
correspond to the schema of the source data. -
Click OK to validate these changes and
accept the propagation prompted by the pop-up dialog box. -
In the Storage source field, select the
type of the source data to be processed. In this scenario, select HDFS since this data has been uploaded to the
HDFS system of the Hadoop cluster where the Spark to be used is installed.
You can use tHDFSPut to put the source data
to the HDFS system. For further information, see tHDFSPut. -
In the Namenode URI field, enter the
location of the Namenode of the Hadoop cluster to be used.
-
Double-click the tSparkLoad component
labelled LoadMovie to open its Component view. -
From the Spark connection list, select
the connection to be used. -
In the Input file field, enter the path
pointing to the file to be read. This file contains the movies ready to be
recommended to the users loaded by the other tSparkLoad component. In this scenario, the file is
movie.dat. This file is available in
the recommendation_with_spark folder
under the Documentation node of the
Repository tree view in the Integration perspective. -
From the Type list, select the type of
the data to be read. In this scenario, select Text
File. -
In the Field separator field, enter the
separator used by the source data to separate its fields. In this example,
it is two colons (::). -
Click the [+] button next to Edit schema to open the schema editor.
-
Click the [+] button three times to add three rows and
rename them to MovieID, MovieTitle and MovieGenres, respectively. These columns correspond to the
schema of the source movie data. -
Click OK to validate these changes and
accept the propagation prompted by the pop-up dialog box. -
In the Storage source field, select the
type of the source data to be processed. In this scenario, select HDFS since this data has been uploaded to the
HDFS system of the Hadoop cluster where the Spark to be used is installed.
You can use tHDFSPut to put the source data
to the HDFS system. For further information, see tHDFSPut. -
In the Namenode URI field, enter the
location of the Namenode of the Hadoop cluster to be used.
-
Double-click tSparkRecommend to open its
Component view. This component uses the given user-movie recommender model to match the 50
users in question with the user records along with their features recognized
by this model. -
Click the […] button next to Edit schema to open the schema editor. The schema
of tSparkRecommend already contains three
read-only columns: UserID, ProductID and Rating, which are used to receive the corresponding data
from the user-movie recommender model to be used. -
On the input side (left) of the schema editor, select the columns you need to output from
tSparkRecommend and replicate them to
the output side (right) by clicking the button. The columns to be replicated are FullName, Age, Occupation and
ZipCode. -
Click OK to validate these changes and
accept the propagation prompted by the pop-up dialog box. -
In the Input model field, enter the directory where the
recommender model files are stored. These files must be stored in the same
HDFS system you have defined in the LoadUser
tSparkLoad component. These files are used
to build the actual user-movie interactive matrix on the fly. For further information about how to create these model files, see Scenario: creating a user-movie recommender model.
-
In the Field separator field, enter the
separator used by the model to separate its fields. In this scenario, it is
a comma (,). -
From the Select the user identity column
list, select the column of the input schema to provide the users’ identities
to the User_ID column of the output schema.
In this scenario, this input column is UserID. This allows tSparkRecommend to match the 50 users to the users already
recognized by the model. -
In the Number of recommendations field,
enter the number of movies to be recommended. For example, 5 means that each user will receive five
recommended movies. Note that you need to enter the recommendation number itself only and do
not add the double quotation marks around it.
-
Double-click tSparkJoin to open its
Component view. This component matches movies to be recommended with the movies along with
their features recognized by the recommender model to be used. -
Click the […] button next to Edit schema to open the schema editor. As
explained in the properties table of tSparkJoin, the output schema must contain the columns of
the entire input schemas in their original order. Therefore, by using the button, you replicate all of the columns of the two
input flows into the output schema, with the lookup columns appended to the main
ones. The final schema editor should look like the following image:
-
From the Spark connection list, select
the connection to be used. -
In the Join key table, add one row by clicking the
[+] button under this table and then
select Product_ID in the Input column and row3.MovieID in the Lookup
column. Note that this row3 is the ID of the
lookup link in this example and might be another value in your Job. -
From the Join mode list, select inner-join.
-
Double-click tSparkFilterColumns to open
its Component view. -
Click the […] button next to Edit schema to open the schema editor, in which
you select the columns to be output to the next component. -
On the input side, select each column to be used and click the button to replicate it into the output side.
The final output schema should look like the image above.
-
Double-click tSparkSort to open its
Component view. -
Click the Sync columns button to ensure that the entire
input schema from tSparkFilterColumns is
retrieved by this component. -
In the Sort key table, add one row by
clicking the [+] button under this table
and select User_ID in the Column column, num in the Sort num or
alpha column and ASC in the
Order column.
The tSparkLog component is used to output the
execution result in the Job console. You can replace it with tSparkStore to write the data into a given HDFS
system.
This tSparkLog component does not require any
configuration in its Basic settings view.
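The processing chain of this scenario can be sketched in plain Python (not Talend or Spark code; the sample users, movies and predicted ratings are invented for illustration): recommend the top movies per user, inner-join in the movie titles, keep only the columns of interest and sort by user ID:

```python
import heapq

# Hypothetical inputs standing in for the HDFS files of the scenario.
users = {"1": "Alice", "2": "Bob"}                            # UserID -> FullName
movies = {"m1": "Movie A", "m2": "Movie B", "m3": "Movie C"}  # MovieID -> MovieTitle
predicted = {                                                 # UserID -> {MovieID: rating}
    "1": {"m1": 4.5, "m2": 3.0, "m3": 4.0},
    "2": {"m1": 2.0, "m2": 4.8, "m3": 1.0},
}

def recommend(n):
    rows = []
    for user_id, scores in predicted.items():
        # tSparkRecommend: top-n movies per user by predicted rating.
        for movie_id in heapq.nlargest(n, scores, key=scores.get):
            # tSparkJoin (inner join) + tSparkFilterColumns: keep only
            # user ID, full name and the resolved movie title.
            if movie_id in movies:
                rows.append((user_id, users[user_id], movies[movie_id]))
    # tSparkSort: ascending numeric order on the user ID.
    rows.sort(key=lambda row: int(row[0]))
    return rows
```

For example, `recommend(2)` yields two titled recommendations for each of the two users, sorted by user ID, which mirrors what tSparkLog prints to the console in this Job.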