
tSparkRecommend – Docs for ESB 5.x

tSparkRecommend

tsparkrecommend_icon32_white.png

Warning

This component is available in the Palette of
Talend Studio only if you have subscribed to one of the Talend Platform products with Big Data.

tSparkRecommend properties

Component family

Big Data / Machine Learning

 

Function

tSparkRecommend uses a given
recommender model to analyse large volumes of user data received
from its preceding Spark component, in order to estimate the preferences
of these users.

Purpose

Based on the user-product recommender model generated by tSparkALSModel, tSparkRecommend recommends products to the users known
to this model.

 

Schema and Edit schema

A schema is a row description. It defines the number of fields to be processed and passed on
to the next component. The schema is either Built-In or
stored remotely in the Repository.

Click Edit schema to make changes to the schema. If the
current schema is of the Repository type, three options are
available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this option
    to change the schema to Built-in for local
    changes.

  • Update repository connection: choose this option to change
    the schema stored in the repository and decide whether to propagate the changes to
    all the Jobs upon completion. If you just want to propagate the changes to the
    current Job, you can select No upon completion and
    choose this schema metadata again in the [Repository Content] window.

Note that apart from the columns you can edit yourself, the
User_ID, Product_ID and Rating columns are read-only and are used to receive the
data from the recommender model being used.

 

Number of top recommended

Enter the number of top recommended products to be
output.

Note that this is a numeric value, so you cannot put double
quotation marks around it; for example, enter 5, not "5".

Input model path

Enter the directory where the recommender model to be used is
stored. This directory must exist on the machine where the Job is
run.

 

File separator

Enter the file separator you want to use.

 

Select the User Identity column

Select the column that carries the user ID data from the input
columns.

Advanced settings

tStatCatcher Statistics

Select this check box to collect log data at the component
level.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space
to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio
User Guide.

Usage

This component is placed in the middle of the Spark process as an
intermediate component.

Note that the user IDs processed by this component must be known
to the recommender model to be used.
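
Under the hood, the recommendation step corresponds to Spark MLlib's matrix-factorization recommender. The following Java sketch is purely illustrative of that underlying API, not of the component's generated code: it assumes a MatrixFactorizationModel saved with Spark 1.3 or later, and the HDFS path and user ID are placeholders. The Talend component itself reads its own separator-delimited model files instead.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
    import org.apache.spark.mllib.recommendation.Rating;

    public class RecommendSketch {
        public static void main(String[] args) {
            JavaSparkContext jsc =
                    new JavaSparkContext(new SparkConf().setAppName("RecommendSketch"));

            // Load a previously trained ALS model; the HDFS path is a placeholder.
            MatrixFactorizationModel model = MatrixFactorizationModel.load(
                    jsc.sc(), "hdfs://namenode:8020/models/recommender");

            // Ask for the top 5 products for one user ID known to the model;
            // unknown user IDs have no feature vector and cannot be scored.
            for (Rating r : model.recommendProducts(42, 5)) {
                System.out.println(r.user() + " -> " + r.product() + " : " + r.rating());
            }
            jsc.stop();
        }
    }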

MLlib installation

Spark’s machine learning library, MLlib, uses the gfortran runtime library; for this
reason, you need to ensure that this library is already present on
every node of the Spark cluster to be used.

For further information about MLlib and this library, see Spark’s
related documentation.

Limitation

This component does not support the Spark streaming mode.

It is strongly recommended to use this component in a Spark-only Job, that is to say, to
design and run a Spark Job separately from non-Spark components or Jobs. For example, it
is not recommended to use the tRunJob component to
coordinate a Spark Job and a non-Spark Job, or to use the tHDFSPut component along with the Spark components in the same Job.

Scenario: recommending movies to users

In this scenario, an eight-component Job is created to leverage a given user-movie
recommender model to recommend movies to the users recognized by this model.

use_case-tsparkrecommend1.png

Linking the components

  1. In the Integration perspective
    of the Studio, create an empty Job from the Job
    Designs
    node in the Repository tree view.

    For further information about how to create a Job, see Talend Studio User Guide.

  2. In the workspace, enter the name of the component to be used and select
    this component from the list that opens. In this scenario, the components
    are tSparkConnection, two tSparkLoad components (labelled LoadUser and
    LoadMovie, respectively), tSparkRecommend, tSparkJoin, tSparkFilterColumns,
    tSparkSort and tSparkLog.

  3. Connect tSparkConnection to the tSparkLoad component that is labelled LoadUser using the Trigger > On Subjob Ok link.

  4. Connect the other components using the Row > Spark combine link in the
    order in which these components are listed above. Among them, the
    tSparkLoad component that is labelled LoadMovie is connected to the
    tSparkJoin component.

Configuring the connection to Spark

  1. Double-click tSparkConnection to open its
    Component view.

    use_case-tsparkrecommend2.png
  2. From the Spark mode list, select the mode
    that fits the Spark system you need to use. In this scenario, select
    Standalone since the Spark system to be
    used is installed in a standalone Hadoop cluster.

    For further details about the different Spark modes available in this
    component, see tSparkConnection. You can as well read
    Apache’s documentation about Spark or the documentation of the Hadoop
    distribution you are using for more relevant details.

  3. In the Distribution and the Version lists, select the options that
    correspond to the Hadoop cluster to be used.

    If the distribution you need to use is not yet officially supported by the
    Spark components, you need to select Custom
    and use the […] button that is displayed
    to set up the related configuration. For further information about how to
    configure a custom Hadoop connection, see Connecting to a custom Hadoop distribution.

  4. In the Spark host field, enter the URI of
    the Spark master node.

  5. In the Spark home field, enter the path
    to the Spark executables and libraries in the Hadoop cluster to be
    used.

  6. Select the Define the driver hostname or IP address check box and, in
    the field that is displayed, enter the IP address of the machine in which
    the Job is to be run.

    This value is actually the value of the spark.driver.host
    property. For further information about this property, see Apache’s
    documentation about Spark.
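
At the Spark level, these Standalone settings amount to a few SparkConf properties, as in the minimal Java fragment below; the application name, master URI, Spark home path and driver IP are all placeholders for your own cluster.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // Roughly what the tSparkConnection settings above configure.
    SparkConf conf = new SparkConf()
            .setAppName("tSparkRecommendScenario")        // placeholder name
            .setMaster("spark://spark-master:7077")       // Spark host (master URI)
            .setSparkHome("/usr/lib/spark")               // Spark home on the cluster
            .set("spark.driver.host", "192.168.1.100");   // driver hostname or IP
    JavaSparkContext jsc = new JavaSparkContext(conf);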

Reading the user list

  1. Double-click the tSparkLoad component
    labelled LoadUser to open its Component view.

    use_case-tsparkrecommend3.png
  2. From the Spark connection list, select
    the connection to be used.

  3. In the Input file field, enter the path
    pointing to the file to be read. This file contains the users for whom this
    Job is used to recommend movies. In this scenario, the file is 50_users.dat. This file is available in the
    recommendation_with_spark folder
    under the Documentation node of the
    Repository tree view in the Integration perspective.

  4. From the Type list, select the type of
    the data to be read. In this scenario, select Text File.

  5. In the Field separator field, enter the
    separator used by the source data to separate its fields. In this example,
    it is a semicolon (;).

  6. Click the [+] button next to Edit schema to open the schema editor.

  7. Click the [+] button five times to add
    five rows and rename them to UserID,
    FullName, Age, Occupation and
    ZipCode, respectively. These columns
    correspond to the schema of the source data.

    use_case-tsparkrecommend4.png
  8. Click OK to validate these changes and
    accept the propagation prompted by the pop-up dialog box.

  9. In the Storage source field, select the
    type of the source data to be processed. In this scenario, select HDFS since this data has been uploaded to the
    HDFS system of the Hadoop cluster where the Spark to be used is installed.
    You can use tHDFSPut to put the source data
    into the HDFS system. For further information, see tHDFSPut.

  10. In the Namenode URI field, enter the
    location of the Namenode of the Hadoop cluster to be used.
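
In Spark terms, this tSparkLoad step boils down to reading a text file from HDFS and splitting each line on the field separator. A minimal Java fragment, assuming the jsc context from the previous sketch; the namenode URI and file path are placeholders.

    import org.apache.spark.api.java.JavaRDD;

    // Read the user file from HDFS and split each line on ";" into the
    // UserID, FullName, Age, Occupation and ZipCode fields.
    JavaRDD<String[]> users = jsc
            .textFile("hdfs://namenode:8020/user/talend/50_users.dat")
            .map(line -> line.split(";"));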

Reading the movies to be recommended

  1. Double-click the tSparkLoad component
    labelled LoadMovie to open its Component view.

    use_case-tsparkrecommend5.png
  2. From the Spark connection list, select
    the connection to be used.

  3. In the Input file field, enter the path
    pointing to the file to be read. This file contains the movies ready to be
    recommended to the users loaded by the other tSparkLoad component. In this scenario, the file is
    movie.dat. This file is available in
    the recommendation_with_spark folder
    under the Documentation node of the
    Repository tree view in the Integration perspective.

  4. From the Type list, select the type of
    the data to be read. In this scenario, select Text File.

  5. In the Field separator field, enter the
    separator used by the source data to separate its fields. In this example,
    it is two colons (::).

  6. Click the [+] button next to Edit schema to open the schema editor.

  7. Click the [+] button three times to add three rows and
    rename them to MovieID, MovieTitle and MovieGenres, respectively. These columns correspond to the
    schema of the source movie data.

    use_case-tsparkrecommend6.png
  8. Click OK to validate these changes and
    accept the propagation prompted by the pop-up dialog box.

  9. In the Storage source field, select the
    type of the source data to be processed. In this scenario, select HDFS since this data has been uploaded to the
    HDFS system of the Hadoop cluster where the Spark to be used is installed.
    You can use tHDFSPut to put the source data
    into the HDFS system. For further information, see tHDFSPut.

  10. In the Namenode URI field, enter the
    location of the Namenode of the Hadoop cluster to be used.

Loading the recommender model

  1. Double-click tSparkRecommend to open its
    Component view.

    This component uses the given user-movie recommender model to match the 50
    users in question with the user records, along with their features, recognized
    by this model.

    use_case-tsparkrecommend7.png
  2. Click the […] button next to Edit schema to open the schema editor. The schema
    of tSparkRecommend already contains three
    read-only columns: User_ID, Product_ID and Rating, which are used to receive the corresponding data
    from the user-movie recommender model to be used.

  3. In the input side (left) of the schema editor, select the columns you need to output from
    tSparkRecommend and replicate them to
    the output side (right) by clicking the Schema_icon_RightArrow.png button. The columns to be replicated are FullName, Age, Occupation and
    ZipCode.

    use_case-tsparkrecommend8.png
  4. Click OK to validate these changes and
    accept the propagation prompted by the pop-up dialog box.

  5. In the Input model field, enter the directory where the
    recommender model files are stored. These files must be stored in the same
    HDFS system you have defined in the LoadUser
    tSparkLoad component. These files are used
    to build the actual user-movie interaction matrix on the fly.

    For further information about how to create these model files, see Scenario: creating a user-movie recommender model.

  6. In the Field separator field, enter the
    separator used by the model to separate its fields. In this scenario, it is
    a comma (,).

  7. From the Select the user identity column
    list, select the column of the input schema to provide the users’ identities
    to the User_ID column of the output schema.
    In this scenario, this input column is UserID. This allows tSparkRecommend to match the 50 users to the users already
    recognized by the model.

  8. In the Number of recommendations field,
    enter the number of movies to be recommended. For example, 5 means that each user will receive five
    recommended movies.

    Note that you need to enter only the number itself, without double
    quotation marks around it.
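
Conceptually, the component then asks the model for each incoming user's top recommendations. A hedged Java fragment continuing the previous sketches (the users RDD and the loaded model); since MLlib's recommendProducts wraps RDDs of feature vectors, it is called on the driver here, which is fine for 50 users.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.mllib.recommendation.Rating;

    // Field 0 of each user row is UserID. For each user known to the model,
    // collect the top 5 recommendations; each Rating carries the user ID,
    // product ID and predicted rating that feed the read-only User_ID,
    // Product_ID and Rating output columns.
    List<Rating> recommendations = new ArrayList<>();
    for (String[] fields : users.collect()) {
        int userId = Integer.parseInt(fields[0]);
        for (Rating r : model.recommendProducts(userId, 5)) {
            recommendations.add(r);
        }
    }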

Joining movie data to the main flow

  1. Double-click tSparkJoin to open its
    Component view.

    This component matches the movies to be recommended with the movie records,
    along with their features, recognized by the recommender model to be used.

    use_case-tsparkrecommend9.png
  2. Click the […] button next to Edit schema to open the schema editor. As
    explained in the properties table of tSparkJoin, the output schema must
    contain the columns of all the input schemas in their original order.
    Therefore, by using the Schema_icon_RightArrow.png button, you replicate
    all of the columns of the two input flows into the output schema, with the
    lookup columns appended after the main ones.

    The final schema editor should look like the following image:

    use_case-tsparkrecommend10.png

  3. From the Spark connection list, select
    the connection to be used.

  4. In the Join key table, add one row by clicking the
    [+] button under this table and then
    select Product_ID in the Input column and row3.MovieID in the Lookup
    column.

    Note that this row3 is the ID of the
    lookup link in this example and might be another value in your Job.

  5. From the Join mode list, select inner-join.
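
An inner join like this one can be pictured as keying both flows and joining on the key: any record whose key has no match on the other side is dropped. A hedged Java fragment with illustrative variables (mainFlow for the main input with Product_ID at field index 1, movies for the lookup input with MovieID at field index 0):

    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    // Key the main flow by Product_ID and the lookup flow by MovieID,
    // then inner-join the two pair RDDs.
    JavaPairRDD<Integer, String[]> byProduct =
            mainFlow.mapToPair(f -> new Tuple2<>(Integer.parseInt(f[1]), f));
    JavaPairRDD<Integer, String[]> byMovie =
            movies.mapToPair(f -> new Tuple2<>(Integer.parseInt(f[0]), f));
    JavaPairRDD<Integer, Tuple2<String[], String[]>> joined = byProduct.join(byMovie);
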

Selecting the columns to be eventually output

  1. Double-click tSparkFilterColumns to open
    its Component view.

    use_case-tsparkrecommend11.png
  2. Click the […] button next to Edit schema to open the schema editor, in which
    you select the columns to be output to the next component.

    use_case-tsparkrecommend12.png
  3. On the input side, select each column to be used and click the Schema_icon_RightArrow.png button to replicate it into the output side.

    The final output schema should look like the image above.

Sorting the records to be output

  1. Double-click tSparkSort to open its
    Component view.

    use_case-tsparkrecommend13.png
  2. Click the Sync columns button to ensure that the entire
    input schema from tSparkFilterColumns is
    retrieved by this component.

  3. In the Sort key table, add one row by clicking the [+] button under this
    table, and select User_ID in the Column column, num in the Sort num or
    alpha column and ASC in the Order column.
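
This sort corresponds to an ascending numeric sortBy on the User_ID field. A minimal Java fragment with an illustrative records RDD (User_ID assumed at field index 0); the partition count (4) is arbitrary.

    import org.apache.spark.api.java.JavaRDD;

    // Ascending ("ASC"), numeric ("num") sort on the User_ID field.
    JavaRDD<String[]> sorted =
            records.sortBy(f -> Integer.parseInt(f[0]), true, 4);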

Writing the sorted data

The tSparkLog component is used to output the
execution result in the Job console. You can replace it with tSparkStore to write the data into a given HDFS
system.

This tSparkLog component does not require any
configuration in its Basic settings view.

Executing the Job

Then you can press F6 to run this Job.

Once done, the Run view is opened automatically,
where you can check the execution result.

use_case-tsparkrecommend14.png

