
tSparkLoad – Docs for ESB 5.x

tSparkLoad


Warning

This component will be available in the Palette of Talend Studio on the condition that you have subscribed to one of the Talend solutions with Big Data.

tSparkLoad properties

Component family

Big Data / Spark

 

Function

tSparkLoad uses the Spark
connection created by a given tSparkConnection component and reads the data to be
processed by Spark from a specific source, such as an HDFS system or
a Twitter feed.

Purpose

tSparkLoad loads the data to be
handled into the Spark process you are designing.

Basic settings

Spark connection

Select the Spark connection component to be used from the drop-down list in order to reuse
the connection created by that component.

 

Schema and Edit Schema

A schema is a row description. It defines the number of fields to be processed and passed on
to the next component. The schema is either Built-In or
stored remotely in the Repository.

Click Edit schema to make changes to the schema. If the
current schema is of the Repository type, three options are
available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this option
    to change the schema to Built-in for local
    changes.

  • Update repository connection: choose this option to change
    the schema stored in the repository and decide whether to propagate the changes to
    all the Jobs upon completion. If you just want to propagate the changes to the
    current Job, you can select No upon completion and
    choose this schema metadata again in the [Repository Content] window.

 

Storage source

Select the type of the source system storing the data to be
loaded.

  • HDFS: the data to be
    read is stored in an HDFS system. You need to provide
    the URI of the NameNode service of this HDFS system in
    the NameNode field that
    is displayed.

  • Twitter feed: the
    data to be read comes from a Twitter feed. This option
    requires the Spark streaming mode and so is available
    only when you have selected the Execute this Job as a streaming
    application check box in the tSparkConnection component to
    be used. For further information about this connection
    component, see tSparkConnection.

    Note that at least two cores are required in the
    machine or the cluster that runs the Twitter
    streaming.

  • Custom: the data to
    be read is stored in a system that is not yet officially
    supported by the Spark components. In this situation,
    you need to provide the path to the data in this system
    using the protocol recognized by the system.

    Note that the connection to this custom distribution
    should be configured in the tSparkConnection component.

 

Input file

Enter the location of the data to be read in the source system.
Along with this location parameter, you need to set the following
parameters about the source data:

  • Type: select the type
    of the source data.

  • Field separator:
    enter the separator of the fields of the source
    data.

This field is not available for the Twitter feed source type.
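For the HDFS source, reading a delimited file with Spark's Java API boils down
to loading the file and splitting each line on the field separator. The
following is a hedged sketch, not the code the Studio generates; the NameNode
URI, path and semicolon separator are placeholder values.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class HdfsLoadSketch {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("tSparkLoad-hdfs-sketch"));

            // Input file: the location of the data in the source system.
            // Field separator: a semicolon in this hypothetical example.
            JavaRDD<String[]> rows = sc
                    .textFile("hdfs://namenode:8020/user/talend/input.csv")
                    .map(line -> line.split(";"));

            System.out.println("Rows read: " + rows.count());
            sc.stop();
        }
    }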

Twitter configuration

Consumer key and Consumer secret

Enter the consumer authentication to the Twitter account to be
used. For further information about how to obtain this information,
see Twitter’s documentation about App development.

To enter the consumer secret, click the […] button next
to the Consumer secret field, and then in the pop-up dialog box enter the consumer secret
between double quotes and click OK to save the
settings.

These fields are displayed only when you have selected Twitter feed from the Storage source list.

 

Access token and Secret token

Enter the access token and the token secret obtained from Twitter
to make authorized calls to Twitter’s API.

To enter the secret token, click the […] button next to
the Secret token field, and then in the pop-up dialog box enter the secret token between
double quotes and click OK to save the settings.

These fields are displayed only when you have selected Twitter feed from the Storage source list.
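Spark's Twitter receiver is built on twitter4j, which can also pick up these
four credentials from JVM system properties. A minimal sketch under that
assumption, with placeholder values throughout:

    public class TwitterAuthSketch {
        public static void configureOAuth() {
            // Placeholder credentials; use the values obtained from your Twitter app.
            System.setProperty("twitter4j.oauth.consumerKey", "YOUR_CONSUMER_KEY");
            System.setProperty("twitter4j.oauth.consumerSecret", "YOUR_CONSUMER_SECRET");
            System.setProperty("twitter4j.oauth.accessToken", "YOUR_ACCESS_TOKEN");
            System.setProperty("twitter4j.oauth.accessTokenSecret", "YOUR_ACCESS_TOKEN_SECRET");
        }
    }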

 

Filters

Enter the phrases on which you want to perform a filter so that
the Job selects the tweets using these phrases.

You need to use the comma (,) to separate the phrases. The comma is
treated as an OR operator, while the space between two words
of a phrase is treated as an AND operator.

For example, if you enter Talend Spark,components, the Job will select
tweets such as:

  • Talend's Spark components are awesome.

  • These components are easy to use.

This field is displayed only when you have selected Twitter feed from the Storage source list.
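To make the semantics concrete: the comma splits the entry into independent
track phrases that are OR-ed together, while the words inside one phrase are
AND-ed. A small sketch using the example above:

    public class FilterSplitSketch {
        public static void main(String[] args) {
            // "Talend Spark,components" yields two track phrases:
            //   "Talend Spark" -> a tweet must contain both "Talend" AND "Spark",
            //   "components"   -> OR the tweet simply contains "components".
            String[] filters = "Talend Spark,components".split(",");
            for (String phrase : filters) {
                System.out.println(phrase);
            }
        }
    }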

Data mapping

Complete this table to map the tweet-related information to the
schema you have defined for the data flow to be processed. This
table has two columns:

  • Column: this column is
    automatically filled with the schema you have
    defined.

  • Property: select the type
    of tweet-related information you want to retrieve from the
    drop-down list.

This table is displayed only when you have selected Twitter feed from the Storage source list.
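Each Property in this table corresponds to a getter on the twitter4j Status
object that represents a received tweet. The getters below are twitter4j's
public API; that the component reads them this way is an assumption for
illustration:

    import twitter4j.HashtagEntity;
    import twitter4j.Status;

    public class TweetMappingSketch {
        static void inspect(Status status) {
            long id = status.getId();                            // Property: Id
            String username = status.getUser().getScreenName();  // Property: Username
            HashtagEntity[] tags = status.getHashtagEntities();  // Property: Hashtags
            System.out.println(id + " @" + username + " carries " + tags.length + " hashtags");
        }
    }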

Advanced settings

tStatCatcher Statistics

Select this check box to collect log data at the component
level.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space
to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio
User Guide.
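For example, in a tJava component placed after this one you could read the
After variable from the Job's globalMap, which is in scope in Talend-generated
code; the label tSparkLoad_1 below is a placeholder for your actual component
name:

    // Read the After variable once the component has finished executing.
    String error = (String) globalMap.get("tSparkLoad_1_ERROR_MESSAGE");
    if (error != null) {
        System.err.println("tSparkLoad reported: " + error);
    }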

Usage

This component is the start component of a Spark process.

Limitations

It is strongly recommended to use this component in a Spark-only Job, that is to say, to
design and run a Spark Job separately from non-Spark components or Jobs. For example, it
is not recommended to use the tRunJob component to
coordinate a Spark Job and a non-Spark Job, or to use the tHDFSPut component along with the Spark components in the same Job.

Scenario: streaming Twitter feed of a given Twitter account

In this scenario, a five-component Job is created to leverage Apache's Spark system to
filter the tweets from a given Twitter account.

use_case-tsparkload1.png

Linking the components

  1. In the Integration perspective
    of the Studio, create an empty Job from the Job
    Designs
    node in the Repository tree view.

    For further information about how to create a Job, see Talend Studio User Guide.

  2. In the workspace, enter the name of the component to be used and select
    this component from the list that opens. In this scenario, the components
    are tSparkConnection, tSparkLoad, tSparkNormalize, tSparkFilterRow and tSparkLog.

  3. Connect tSparkConnection to tSparkLoad using the Trigger > On Subjob Ok link.

  4. Connect the other components using the Row > Spark combine link.

Configuring the connection to Spark

  1. Double-click tSparkConnection to open its
    Component view.

    use_case-tsparkload2.png
  2. From the Spark mode list, select the mode
    that fits the Spark system you need to use. In this scenario, select
    Standalone since the Spark system to be
    used is installed in a standalone Hadoop cluster. Since this Job needs to
    read contents from a given Twitter account, ensure that the cluster you are
    connecting to has access to the Internet.

    For further details about the different Spark modes available in this
    component, see tSparkConnection. You can also read
    Apache's documentation about Spark or the documentation of the Hadoop
    distribution you are using for more relevant details.

  3. In the Distribution and the Version lists, select the options that
    correspond to the Hadoop cluster to be used.

    If the distribution you need to use is not yet officially supported by the
    Spark components, you need to select Custom
    and use the […] button that is displayed
    to set up the related configuration. For further information about how to
    configure a custom Hadoop connection, see Connecting to a custom Hadoop distribution.

  4. In the Spark host field, enter the URI of
    the Spark master node.

  5. In the Spark home field, enter the path
    to the Spark executables and libraries in the Hadoop cluster to be used.
    This path is the value of the SPARK_HOME variable.

  6. Select the Define the driver hostname or IP address
    check box and, in the field that is displayed, enter
    the IP address of the machine in which the Job is to be run.

    This value is actually the value of the spark.driver.host
    property. For further information about this property, see Apache’s
    documentation about Spark.

  7. Select the Execute this Job as a streaming application
    check box to switch the Job to the streaming
    mode. Then the Batch size field and the
    Define a streaming timeout check box
    are displayed to allow you to configure this mode.

    As explained previously, at least two cores are required in the cluster to
    be used to run the streaming mode.

  8. In the Batch size field, enter the time
    interval at the end of which you want to read the new ingestion of data from
    the Twitter account to be used. In this scenario, enter 1000, meaning 1000 milliseconds.

    You need to use an appropriate time interval to keep up with the ingestion
    rate of the data streams, which can depend on a number of variables. For
    further information, see Apache’s Spark documentation about streaming
    programming.

  9. In the Define a streaming timeout field,
    enter the time frame at the end of which the Job stops running. For example,
    enter 60000, meaning 60000 milliseconds.
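Taken together, the settings above roughly correspond to the following
Spark-side boilerplate. This is a hedged sketch of the equivalent streaming
setup, not the code the Studio generates; the master URL and driver IP are
placeholders.

    import java.util.LinkedList;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class SparkConnectionSketch {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf()
                    .setAppName("twitter-streaming-sketch")
                    .setMaster("spark://spark-master:7077")    // Spark host (Standalone mode)
                    .set("spark.driver.host", "192.168.1.10"); // driver hostname or IP address

            // Batch size: read the new ingestion of data every 1000 milliseconds.
            JavaStreamingContext jssc =
                    new JavaStreamingContext(conf, Durations.milliseconds(1000));

            // Placeholder flow so the skeleton is runnable; the real Job builds
            // its load/normalize/filter/log flow here instead.
            jssc.queueStream(new LinkedList<JavaRDD<String>>()).print();

            jssc.start();
            // Streaming timeout: stop the Job after 60000 milliseconds.
            jssc.awaitTerminationOrTimeout(60000);
            jssc.stop();
        }
    }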

Loading the Twitter stream

  1. Double-click tSparkLoad to open its
    Component view.

    use_case-tsparkload3.png
  2. From the Spark connection list, select
    the connection to be used.

  3. Click the […] button next to Edit schema to open the schema editor.

  4. Click the [+] button three times to add
    three rows and rename them to id,
    username and hashtag, respectively.

    use_case-tsparkload4.png
  5. Click OK to validate these changes and
    accept the propagation prompted by the pop-up dialog box.

  6. In the Storage source list, select the
    type of the source data to be processed. In this scenario, select Twitter feed. Then the Twitter configuration area is displayed.

  7. In the following fields, enter the authentication information for the
    Twitter account to be accessed: Consumer key, Consumer secret,
    Access token and Secret token. You need to obtain this group of information
    from the Twitter side.

  8. In the Filters field, enter the phrases
    to filter the tweets you want to select. In this scenario, enter hadoop,talend,spark,streaming between double
    quotation marks.

    This filter selects the tweets using any of the phrases put in this
    field.

  9. In the Data mapping table, the columns
    you defined in the schema have been automatically added to the Column column; then in the Property column, you need to select which property of the
    Twitter data you want each column to receive.

    In this scenario, select Id for the
    id column, Username for username and
    Hashtags for hashtag.
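On the Spark side, this configuration maps onto the spark-streaming-twitter
receiver. A hedged sketch, reusing the jssc context from the connection sketch
earlier and the twitter4j.oauth.* system properties shown in the properties
section:

    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.twitter.TwitterUtils;
    import twitter4j.Status;

    // Filters: each comma-separated phrase becomes an independent track phrase.
    String[] filters = { "hadoop", "talend", "spark", "streaming" };

    // The receiver reads its credentials from the twitter4j.oauth.* properties.
    JavaReceiverInputDStream<Status> tweets = TwitterUtils.createStream(jssc, filters);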

Normalizing the Twitter data

  1. Double-click tSparkNormalize to open its
    Component view.

    use_case-tsparkload5.png
  2. From the Column to normalize list, select
    the column the normalization is based on. In this scenario, it is hashtag.

  3. In the Item separator field, enter the
    separator you need to use to normalize the hashtags such that each row
    contains one hashtag. In this example, enter the comma (,) because the hashtags retrieved in each batch
    are automatically separated by the comma.
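In plain Spark terms, this normalization is a flatMap that splits the hashtag
field on the comma so that each output record carries a single hashtag. A
sketch, assuming a hypothetical JavaDStream<String> named hashtags that holds
the comma-separated values (Spark 1.x flatMap signature, which returns an
Iterable):

    import java.util.Arrays;
    import org.apache.spark.streaming.api.java.JavaDStream;

    // "bigdata,talend,spark" becomes three records: "bigdata", "talend", "spark".
    JavaDStream<String> normalized = hashtags.flatMap(
            value -> Arrays.asList(value.split(",")));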

Removing empty hashtag rows

  1. Double-click tSparkFilterRow to open its
    Component view.

    use_case-tsparkload6.png
  2. In the Filter configuration table, click
    the [+] button once to add one row and then
    configure this row as follows:

    • Logical: leave the default
      logical operator as is because it is ignored in the
      execution.

    • Column: select the column
      which the filtering is based on. It is hashtag in this scenario.

    • Operator: select the function
      you need to use to filter. It is Not
      equal
      in this example.

    • Value: enter the value used
      as the criteria of the filtering function. In this scenario,
      enter a pair of double quotation marks, meaning the value is
      empty.

    This filter allows you to select the rows in which the hashtag
    is of the String type and is not empty.
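Expressed directly in Spark, this step is a plain filter that keeps only the
non-empty hashtag values; continuing the hypothetical normalized stream from
the previous sketch:

    // Not equal "": drop the rows whose hashtag value is an empty string.
    JavaDStream<String> nonEmpty = normalized.filter(tag -> !tag.isEmpty());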

Writing the filtered data

The tSparkLog component is used to output the
execution result in the Job console. You can replace it with tSparkStore to write the data into a given HDFS system for
analytical use.

This tSparkLog component does not require any
configuration in its Basic settings view.

Executing the Job

Then you can press F6 to run this Job.

Once done, the Run view is opened automatically,
where you can check the execution result.

use_case-tsparkload7.png

You can read that based on the filter you have put in tSparkLoad, the user NoSQL is
selected along with its Twitter user ID and the hashtags used in its tweets that
meet the filtering condition.

This Job runs continuously until the end of the time window you have defined in
the Define a streaming timeout field.

