
tSparkRecommend – Docs for ESB 5.x

tSparkRecommend

tsparkrecommend_icon32_white.png

Warning

This component is available in the Palette of
Talend Studio only if you have subscribed to one of the Talend Platform products with Big Data.

tSparkRecommend properties

Component family

Big Data / Machine Learning

 

Function

tSparkRecommend uses a given
recommender model to analyse large volumes of user data received
from its preceding Spark component, in order to estimate the preferences
of these users.

Purpose

Based on the user-product recommender model generated by tSparkALSModel, tSparkRecommend recommends products to the users known
to this model.

 

Schema and Edit schema

A schema is a row description. It defines the number of fields to be processed and passed on
to the next component. The schema is either Built-In or
stored remotely in the Repository.

Click Edit schema to make changes to the schema. If the
current schema is of the Repository type, three options are
available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this option
    to change the schema to Built-in for local
    changes.

  • Update repository connection: choose this option to change
    the schema stored in the repository and decide whether to propagate the changes to
    all the Jobs upon completion. If you just want to propagate the changes to the
    current Job, you can select No upon completion and
    choose this schema metadata again in the [Repository Content] window.

Note that apart from the columns you can edit yourself, the
User_ID, Product_ID and Rating columns are read-only and are used to receive the
data from the recommender model being used.

 

Number of top recommended

Enter the number of top recommended products to be
output.

Note that this is a numeric value, so you cannot put double
quotation marks around it; for example, enter 5, not "5".

Input model path

Enter the directory where the recommender model to be used is
stored. This directory must exist on the machine where the Job is
run.

 

File separator

Enter the file separator you want to use.

 

Select the User Identity column

Select the column that carries the user ID data from the input
columns.

Advanced settings

tStatCatcher Statistics

Select this check box to collect log data at the component
level.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space
to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio
User Guide.

Usage

This component is placed in the middle of the Spark process as an
intermediate component.

Note that the user IDs processed by this component must be known
to the recommender model to be used.
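
Under the hood, the recommendation step corresponds to Spark MLlib's matrix-factorization recommender. The following Java sketch is purely illustrative of that underlying API, not of the component's generated code: it assumes a MatrixFactorizationModel saved with Spark 1.3 or later, and the HDFS path and user ID are placeholders. The Talend component itself reads its own separator-delimited model files instead.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
    import org.apache.spark.mllib.recommendation.Rating;

    public class RecommendSketch {
        public static void main(String[] args) {
            JavaSparkContext jsc =
                    new JavaSparkContext(new SparkConf().setAppName("RecommendSketch"));

            // Load a previously trained ALS model; the HDFS path is a placeholder.
            MatrixFactorizationModel model = MatrixFactorizationModel.load(
                    jsc.sc(), "hdfs://namenode:8020/models/recommender");

            // Ask for the top 5 products for one user ID known to the model;
            // unknown user IDs have no feature vector and cannot be scored.
            for (Rating r : model.recommendProducts(42, 5)) {
                System.out.println(r.user() + " -> " + r.product() + " : " + r.rating());
            }
            jsc.stop();
        }
    }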

MLlib installation

Spark’s machine learning library, MLlib, uses the gfortran runtime library; for this
reason, you need to ensure that this library is already present on
every node of the Spark cluster to be used.

For further information about MLlib and this library, see Spark’s
related documentation.

Limitation

This component does not support the Spark streaming mode.

It is strongly recommended to use this component in a Spark-only Job, that is to say, to
design and run a Spark Job separately from non-Spark components or Jobs. For example, it
is not recommended to use the tRunJob component to
coordinate a Spark Job and a non-Spark Job, or to use the tHDFSPut component along with the Spark components in the same Job.

Scenario: recommending movies to users

In this scenario, an eight-component Job is created to leverage a given user-movie
recommender model to recommend movies to the users recognized by this model.

use_case-tsparkrecommend1.png

Linking the components

  1. In the Integration perspective
    of the Studio, create an empty Job from the Job
    Designs
    node in the Repository tree view.

    For further information about how to create a Job, see Talend Studio User Guide.

  2. In the workspace, enter the name of the component to be used and select
    this component from the list that opens. In this scenario, the components
    are tSparkConnection, two tSparkLoad components (labelled LoadUser and
    LoadMovie, respectively), tSparkRecommend, tSparkJoin, tSparkFilterColumns,
    tSparkSort and tSparkLog.

  3. Connect tSparkConnection to the tSparkLoad component that is labelled LoadUser using the Trigger > On Subjob Ok link.

  4. Connect the other components using the Row > Spark combine link in the
    order in which these components are listed above. Among them, the
    tSparkLoad component that is labelled LoadMovie is connected to the
    tSparkJoin component.

Configuring the connection to Spark

  1. Double-click tSparkConnection to open its
    Component view.

    use_case-tsparkrecommend2.png
  2. From the Spark mode list, select the mode
    that fits the Spark system you need to use. In this scenario, select
    Standalone since the Spark system to be
    used is installed in a standalone Hadoop cluster.

    For further details about the different Spark modes available in this
    component, see tSparkConnection. You can as well read
    Apache’s documentation about Spark or the documentation of the Hadoop
    distribution you are using for more relevant details.

  3. In the Distribution and the Version lists, select the options that
    correspond to the Hadoop cluster to be used.

    If the distribution you need to use is not yet officially supported by the
    Spark components, you need to select Custom
    and use the […] button that is displayed
    to set up the related configuration. For further information about how to
    configure a custom Hadoop connection, see Connecting to a custom Hadoop distribution.

  4. In the Spark host field, enter the URI of
    the Spark master node.

  5. In the Spark home field, enter the path
    to the Spark executables and libraries in the Hadoop cluster to be
    used.

  6. Select the Define the driver hostname or IP address check box and, in
    the field that is displayed, enter the IP address of the machine in which
    the Job is to be run.

    This value is actually the value of the spark.driver.host
    property. For further information about this property, see Apache’s
    documentation about Spark.
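
At the Spark level, these Standalone settings amount to a few SparkConf properties, as in the minimal Java fragment below; the application name, master URI, Spark home path and driver IP are all placeholders for your own cluster.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // Roughly what the tSparkConnection settings above configure.
    SparkConf conf = new SparkConf()
            .setAppName("tSparkRecommendScenario")        // placeholder name
            .setMaster("spark://spark-master:7077")       // Spark host (master URI)
            .setSparkHome("/usr/lib/spark")               // Spark home on the cluster
            .set("spark.driver.host", "192.168.1.100");   // driver hostname or IP
    JavaSparkContext jsc = new JavaSparkContext(conf);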

Reading the user list

  1. Double-click the tSparkLoad component
    labelled LoadUser to open its Component view.

    use_case-tsparkrecommend3.png
  2. From the Spark connection list, select
    the connection to be used.

  3. In the Input file field, enter the path
    pointing to the file to be read. This file contains the users for whom this
    Job is used to recommend movies. In this scenario, the file is 50_users.dat. This file is available in the
    recommendation_with_spark folder
    under the Documentation node of the
    Repository tree view in the Integration perspective.

  4. From the Type list, select the type of
    the data to be read. In this scenario, select Text File.

  5. In the Field separator field, enter the
    separator used by the source data to separate its fields. In this example,
    it is a semicolon (;).

  6. Click the [+] button next to Edit schema to open the schema editor.

  7. Click the [+] button five times to add
    five rows and rename them to UserID,
    FullName, Age, Occupation and
    ZipCode, respectively. These columns
    correspond to the schema of the source data.

    use_case-tsparkrecommend4.png
  8. Click OK to validate these changes and
    accept the propagation prompted by the pop-up dialog box.

  9. In the Storage source field, select the
    type of the source data to be processed. In this scenario, select HDFS since this data has been uploaded to the
    HDFS system of the Hadoop cluster where the Spark to be used is installed.
    You can use tHDFSPut to put the source data
    into the HDFS system. For further information, see tHDFSPut.

  10. In the Namenode URI field, enter the
    location of the Namenode of the Hadoop cluster to be used.
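
In Spark terms, this tSparkLoad step boils down to reading a text file from HDFS and splitting each line on the field separator. A minimal Java fragment, assuming the jsc context from the previous sketch; the namenode URI and file path are placeholders.

    import org.apache.spark.api.java.JavaRDD;

    // Read the user file from HDFS and split each line on ";" into the
    // UserID, FullName, Age, Occupation and ZipCode fields.
    JavaRDD<String[]> users = jsc
            .textFile("hdfs://namenode:8020/user/talend/50_users.dat")
            .map(line -> line.split(";"));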

Reading the movies to be recommended

  1. Double-click the tSparkLoad component
    labelled LoadMovie to open its Component view.

    use_case-tsparkrecommend5.png
  2. From the Spark connection list, select
    the connection to be used.

  3. In the Input file field, enter the path
    pointing to the file to be read. This file contains the movies ready to be
    recommended to the users loaded by the other tSparkLoad component. In this scenario, the file is
    movie.dat. This file is available in
    the recommendation_with_spark folder
    under the Documentation node of the
    Repository tree view in the Integration perspective.

  4. From the Type list, select the type of
    the data to be read. In this scenario, select Text File.

  5. In the Field separator field, enter the
    separator used by the source data to separate its fields. In this example,
    it is two colons (::).

  6. Click the [+] button next to Edit schema to open the schema editor.

  7. Click the [+] button three times to add three rows and
    rename them to MovieID, MovieTitle and MovieGenres, respectively. These columns correspond to the
    schema of the source movie data.

    use_case-tsparkrecommend6.png
  8. Click OK to validate these changes and
    accept the propagation prompted by the pop-up dialog box.

  9. In the Storage source field, select the
    type of the source data to be processed. In this scenario, select HDFS since this data has been uploaded to the
    HDFS system of the Hadoop cluster where the Spark to be used is installed.
    You can use tHDFSPut to put the source data
    into the HDFS system. For further information, see tHDFSPut.

  10. In the Namenode URI field, enter the
    location of the Namenode of the Hadoop cluster to be used.

Loading the recommender model

  1. Double-click tSparkRecommend to open its
    Component view.

    This component uses the given user-movie recommender model to match the 50
    users in question with the user records, along with their features, recognized
    by this model.

    use_case-tsparkrecommend7.png
  2. Click the […] button next to Edit schema to open the schema editor. The schema
    of tSparkRecommend already contains three
    read-only columns: User_ID, Product_ID and Rating, which are used to receive the corresponding data
    from the user-movie recommender model to be used.

  3. In the input side (left) of the schema editor, select the columns you need to output from
    tSparkRecommend and replicate them to
    the output side (right) by clicking the Schema_icon_RightArrow.png button. The columns to be replicated are FullName, Age, Occupation and
    ZipCode.

    use_case-tsparkrecommend8.png
  4. Click OK to validate these changes and
    accept the propagation prompted by the pop-up dialog box.

  5. In the Input model field, enter the directory where the
    recommender model files are stored. These files must be stored in the same
    HDFS system you have defined in the LoadUser
    tSparkLoad component. These files are used
    to build the actual user-movie interaction matrix on the fly.

    For further information about how to create these model files, see Scenario: creating a user-movie recommender model.

  6. In the Field separator field, enter the
    separator used by the model to separate its fields. In this scenario, it is
    a comma (,).

  7. From the Select the user identity column
    list, select the column of the input schema to provide the users’ identities
    to the User_ID column of the output schema.
    In this scenario, this input column is UserID. This allows tSparkRecommend to match the 50 users to the users already
    recognized by the model.

  8. In the Number of recommendations field,
    enter the number of movies to be recommended. For example, 5 means that each user will receive five
    recommended movies.

    Note that you need to enter only the number itself, without double
    quotation marks around it.
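
Conceptually, the component then asks the model for each incoming user's top recommendations. A hedged Java fragment continuing the previous sketches (the users RDD and the loaded model); since MLlib's recommendProducts wraps RDDs of feature vectors, it is called on the driver here, which is fine for 50 users.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.mllib.recommendation.Rating;

    // Field 0 of each user row is UserID. For each user known to the model,
    // collect the top 5 recommendations; each Rating carries the user ID,
    // product ID and predicted rating that feed the read-only User_ID,
    // Product_ID and Rating output columns.
    List<Rating> recommendations = new ArrayList<>();
    for (String[] fields : users.collect()) {
        int userId = Integer.parseInt(fields[0]);
        for (Rating r : model.recommendProducts(userId, 5)) {
            recommendations.add(r);
        }
    }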

Joining movie data to the main flow

  1. Double-click tSparkJoin to open its
    Component view.

    This component matches the movies to be recommended with the movie records,
    along with their features, recognized by the recommender model to be used.

    use_case-tsparkrecommend9.png
  2. Click the […] button next to Edit schema to open the schema editor. As
    explained in the properties table of tSparkJoin, the output schema must
    contain the columns of all the input schemas in their original order.
    Therefore, by using the Schema_icon_RightArrow.png button, you replicate
    all of the columns of the two input flows into the output schema, with the
    lookup columns appended after the main ones.

    The final schema editor should look like the following image:

    use_case-tsparkrecommend10.png

  3. From the Spark connection list, select
    the connection to be used.

  4. In the Join key table, add one row by clicking the
    [+] button under this table and then
    select Product_ID in the Input column and row3.MovieID in the Lookup
    column.

    Note that this row3 is the ID of the
    lookup link in this example and might be another value in your Job.

  5. From the Join mode list, select inner-join.
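
An inner join like this one can be pictured as keying both flows and joining on the key: any record whose key has no match on the other side is dropped. A hedged Java fragment with illustrative variables (mainFlow for the main input with Product_ID at field index 1, movies for the lookup input with MovieID at field index 0):

    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    // Key the main flow by Product_ID and the lookup flow by MovieID,
    // then inner-join the two pair RDDs.
    JavaPairRDD<Integer, String[]> byProduct =
            mainFlow.mapToPair(f -> new Tuple2<>(Integer.parseInt(f[1]), f));
    JavaPairRDD<Integer, String[]> byMovie =
            movies.mapToPair(f -> new Tuple2<>(Integer.parseInt(f[0]), f));
    JavaPairRDD<Integer, Tuple2<String[], String[]>> joined = byProduct.join(byMovie);
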

Selecting the columns to be eventually output

  1. Double-click tSparkFilterColumns to open
    its Component view.

    use_case-tsparkrecommend11.png
  2. Click the […] button next to Edit schema to open the schema editor, in which
    you select the columns to be output to the next component.

    use_case-tsparkrecommend12.png
  3. On the input side, select each column to be used and click the Schema_icon_RightArrow.png button to replicate it into the output side.

    The final output schema should look like the image above.

Sorting the records to be output

  1. Double-click tSparkSort to open its
    Component view.

    use_case-tsparkrecommend13.png
  2. Click the Sync columns button to ensure that the entire
    input schema from tSparkFilterColumns is
    retrieved by this component.

  3. In the Sort key table, add one row by clicking the [+] button under this
    table, and select User_ID in the Column column, num in the Sort num or
    alpha column and ASC in the Order column.
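
This sort corresponds to an ascending numeric sortBy on the User_ID field. A minimal Java fragment with an illustrative records RDD (User_ID assumed at field index 0); the partition count (4) is arbitrary.

    import org.apache.spark.api.java.JavaRDD;

    // Ascending ("ASC"), numeric ("num") sort on the User_ID field.
    JavaRDD<String[]> sorted =
            records.sortBy(f -> Integer.parseInt(f[0]), true, 4);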

Writing the sorted data

The tSparkLog component is used to output the
execution result in the Job console. You can replace it with tSparkStore to write the data into a given HDFS
system.

This tSparkLog component does not require any
configuration in its Basic settings view.

Executing the Job

Then you can press F6 to run this Job.

Once done, the Run view is opened automatically,
where you can check the execution result.

use_case-tsparkrecommend14.png

