
tSparkLoad – Docs for ESB 5.x

tSparkLoad


Warning

This component will be available in the Palette of Talend Studio on the condition that you have subscribed to one of the Talend solutions with Big Data.

tSparkLoad properties

Component family

Big Data / Spark

 

Function

tSparkLoad uses the Spark
connection created by a given tSparkConnection component and reads the data to be
processed by Spark from a specific source, such as an HDFS system or
a Twitter feed.

Purpose

tSparkLoad loads the data to be
handled into the Spark process you are designing.

Basic settings

Spark connection

Select the Spark connection component to be used from the drop-down list in order to reuse
the connection created by that component.

 

Schema and Edit Schema

A schema is a row description. It defines the number of fields to be processed and passed on
to the next component. The schema is either Built-In or
stored remotely in the Repository.

Click Edit schema to make changes to the schema. If the
current schema is of the Repository type, three options are
available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this option
    to change the schema to Built-in for local
    changes.

  • Update repository connection: choose this option to change
    the schema stored in the repository and decide whether to propagate the changes to
    all the Jobs upon completion. If you just want to propagate the changes to the
    current Job, you can select No upon completion and
    choose this schema metadata again in the [Repository Content] window.

 

Storage source

Select the type of the source system storing the data to be
loaded.

  • HDFS: the data to be
    read is stored in an HDFS system. You need to provide
    the URI of the NameNode service of this HDFS system in
    the NameNode field that
    is displayed.

  • Twitter feed: the
    data to be read comes from a Twitter feed. This option
    requires the Spark streaming mode and so is available
    only when you have selected the Execute this Job as a streaming
    application check box in the tSparkConnection component to
    be used. For further information about this connection
    component, see tSparkConnection.

    Note that at least two cores are required in the
    machine or the cluster that runs the Twitter
    streaming.

  • Custom: the data to
    be read is stored in a system that is not yet officially
    supported by the Spark components. In this situation,
    you need to provide the path to the data in this system
    using the protocol recognized by the system.

    Note that the connection to this custom distribution
    should be configured in the tSparkConnection component.

 

Input file

Enter the location of the data to be read in the source system.
Along with this location parameter, you need to set the following
parameters about the source data:

  • Type: select the type
    of the source data.

  • Field separator:
    enter the separator of the fields of the source
    data.

This field is not available for the Twitter feed source type.
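For the HDFS source, reading a delimited file with Spark's Java API boils down
to loading the file and splitting each line on the field separator. The
following is a hedged sketch, not the code the Studio generates; the NameNode
URI, path and semicolon separator are placeholder values.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class HdfsLoadSketch {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("tSparkLoad-hdfs-sketch"));

            // Input file: the location of the data in the source system.
            // Field separator: a semicolon in this hypothetical example.
            JavaRDD<String[]> rows = sc
                    .textFile("hdfs://namenode:8020/user/talend/input.csv")
                    .map(line -> line.split(";"));

            System.out.println("Rows read: " + rows.count());
            sc.stop();
        }
    }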

Twitter configuration

Consumer key and Consumer secret

Enter the consumer authentication to the Twitter account to be
used. For further information about how to obtain this information,
see Twitter’s documentation about App development.

To enter the consumer secret, click the […] button next
to the Consumer secret field, and then in the pop-up dialog box enter the consumer secret
between double quotes and click OK to save the
settings.

These fields are displayed only when you have selected Twitter feed from the Storage source list.

 

Access token and Secret token

Enter the access token and the token secret obtained from Twitter
to make authorized calls to Twitter’s API.

To enter the secret token, click the […] button next to
the Secret token field, and then in the pop-up dialog box enter the secret token between
double quotes and click OK to save the settings.

These fields are displayed only when you have selected Twitter feed from the Storage source list.
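Spark's Twitter receiver is built on twitter4j, which can also pick up these
four credentials from JVM system properties. A minimal sketch under that
assumption, with placeholder values throughout:

    public class TwitterAuthSketch {
        public static void configureOAuth() {
            // Placeholder credentials; use the values obtained from your Twitter app.
            System.setProperty("twitter4j.oauth.consumerKey", "YOUR_CONSUMER_KEY");
            System.setProperty("twitter4j.oauth.consumerSecret", "YOUR_CONSUMER_SECRET");
            System.setProperty("twitter4j.oauth.accessToken", "YOUR_ACCESS_TOKEN");
            System.setProperty("twitter4j.oauth.accessTokenSecret", "YOUR_ACCESS_TOKEN_SECRET");
        }
    }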

 

Filters

Enter the phrases on which you want to perform a filter so that
the Job selects the tweets using these phrases.

You need to use the comma (,) to separate the phrases. The comma is
treated as an OR operator, while the space between two words
of a phrase is treated as an AND operator.

For example, if you enter Talend Spark,components, the Job will select
tweets such as:

  • Talend's Spark components are awesome.

  • These components are easy to use.

This field is displayed only when you have selected Twitter feed from the Storage source list.
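To make the semantics concrete: the comma splits the entry into independent
track phrases that are OR-ed together, while the words inside one phrase are
AND-ed. A small sketch using the example above:

    public class FilterSplitSketch {
        public static void main(String[] args) {
            // "Talend Spark,components" yields two track phrases:
            //   "Talend Spark" -> a tweet must contain both "Talend" AND "Spark",
            //   "components"   -> OR the tweet simply contains "components".
            String[] filters = "Talend Spark,components".split(",");
            for (String phrase : filters) {
                System.out.println(phrase);
            }
        }
    }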

Data mapping

Complete this table to map the tweet-related information to the
schema you have defined for the data flow to be processed. This
table has two columns:

  • Column: this column is
    automatically filled with the schema you have
    defined.

  • Property: select the type
    of tweet-related information you want to retrieve from the
    drop-down list.

This table is displayed only when you have selected Twitter feed from the Storage source list.
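Each Property in this table corresponds to a getter on the twitter4j Status
object that represents a received tweet. The getters below are twitter4j's
public API; that the component reads them this way is an assumption for
illustration:

    import twitter4j.HashtagEntity;
    import twitter4j.Status;

    public class TweetMappingSketch {
        static void inspect(Status status) {
            long id = status.getId();                            // Property: Id
            String username = status.getUser().getScreenName();  // Property: Username
            HashtagEntity[] tags = status.getHashtagEntities();  // Property: Hashtags
            System.out.println(id + " @" + username + " carries " + tags.length + " hashtags");
        }
    }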

Advanced settings

tStatCatcher Statistics

Select this check box to collect log data at the component
level.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space
to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio
User Guide.
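For example, in a tJava component placed after this one you could read the
After variable from the Job's globalMap, which is in scope in Talend-generated
code; the label tSparkLoad_1 below is a placeholder for your actual component
name:

    // Read the After variable once the component has finished executing.
    String error = (String) globalMap.get("tSparkLoad_1_ERROR_MESSAGE");
    if (error != null) {
        System.err.println("tSparkLoad reported: " + error);
    }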

Usage

This component is the start component of a Spark process.

Limitations

It is strongly recommended to use this component in a Spark-only Job, that is to say, to
design and run a Spark Job separately from non-Spark components or Jobs. For example, it
is not recommended to use the tRunJob component to
coordinate a Spark Job and a non-Spark Job, or to use the tHDFSPut component along with the Spark components in the same Job.

Scenario: streaming Twitter feed of a given Twitter account

In this scenario, a five-component Job is created to leverage Apache's Spark system to
filter the tweets from a given Twitter account.

use_case-tsparkload1.png

Linking the components

  1. In the Integration perspective
    of the Studio, create an empty Job from the Job
    Designs
    node in the Repository tree view.

    For further information about how to create a Job, see Talend Studio User Guide.

  2. In the workspace, enter the name of the component to be used and select
    this component from the list that opens. In this scenario, the components
    are tSparkConnection, tSparkLoad, tSparkNormalize, tSparkFilterRow and tSparkLog.

  3. Connect tSparkConnection to tSparkLoad using the Trigger > On Subjob Ok link.

  4. Connect the other components using the Row > Spark combine link.

Configuring the connection to Spark

  1. Double-click tSparkConnection to open its
    Component view.

    use_case-tsparkload2.png
  2. From the Spark mode list, select the mode
    that fits the Spark system you need to use. In this scenario, select
    Standalone since the Spark system to be
    used is installed in a standalone Hadoop cluster. Since this Job needs to
    read contents from a given Twitter account, ensure that the cluster you are
    connecting to has access to the Internet.

    For further details about the different Spark modes available in this
    component, see tSparkConnection. You can also read
    Apache's documentation about Spark or the documentation of the Hadoop
    distribution you are using for more relevant details.

  3. In the Distribution and the Version lists, select the options that
    correspond to the Hadoop cluster to be used.

    If the distribution you need to use is not yet officially supported by the
    Spark components, you need to select Custom
    and use the […] button that is displayed
    to set up the related configuration. For further information about how to
    configure a custom Hadoop connection, see Connecting to a custom Hadoop distribution.

  4. In the Spark host field, enter the URI of
    the Spark master node.

  5. In the Spark home field, enter the path
    to the Spark executables and libraries in the Hadoop cluster to be used.
    This path is the value of the SPARK_HOME variable.

  6. Select the Define the driver hostname or IP address
    check box and, in the field that is displayed, enter
    the IP address of the machine in which the Job is to be run.

    This value is actually the value of the spark.driver.host
    property. For further information about this property, see Apache’s
    documentation about Spark.

  7. Select the Execute this Job as a streaming application
    check box to switch the Job to the streaming
    mode. Then the Batch size field and the
    Define a streaming timeout check box
    are displayed to allow you to configure this mode.

    As explained previously, at least two cores are required in the cluster to
    be used to run the streaming mode.

  8. In the Batch size field, enter the time
    interval at the end of which you want to read the new ingestion of data from
    the Twitter account to be used. In this scenario, enter 1000, meaning 1000 milliseconds.

    You need to use an appropriate time interval to keep up with the ingestion
    rate of the data streams, which can depend on a number of variables. For
    further information, see Apache’s Spark documentation about streaming
    programming.

  9. In the Define a streaming timeout field,
    enter the time frame at the end of which the Job stops running. For example,
    enter 60000, meaning 60000 milliseconds.
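Taken together, the settings above roughly correspond to the following
Spark-side boilerplate. This is a hedged sketch of the equivalent streaming
setup, not the code the Studio generates; the master URL and driver IP are
placeholders.

    import java.util.LinkedList;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class SparkConnectionSketch {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf()
                    .setAppName("twitter-streaming-sketch")
                    .setMaster("spark://spark-master:7077")    // Spark host (Standalone mode)
                    .set("spark.driver.host", "192.168.1.10"); // driver hostname or IP address

            // Batch size: read the new ingestion of data every 1000 milliseconds.
            JavaStreamingContext jssc =
                    new JavaStreamingContext(conf, Durations.milliseconds(1000));

            // Placeholder flow so the skeleton is runnable; the real Job builds
            // its load/normalize/filter/log flow here instead.
            jssc.queueStream(new LinkedList<JavaRDD<String>>()).print();

            jssc.start();
            // Streaming timeout: stop the Job after 60000 milliseconds.
            jssc.awaitTerminationOrTimeout(60000);
            jssc.stop();
        }
    }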

Loading the Twitter stream

  1. Double-click tSparkLoad to open its
    Component view.

    use_case-tsparkload3.png
  2. From the Spark connection list, select
    the connection to be used.

  3. Click the […] button next to Edit schema to open the schema editor.

  4. Click the [+] button three times to add
    three rows and rename them to id,
    username and hashtag, respectively.

    use_case-tsparkload4.png
  5. Click OK to validate these changes and
    accept the propagation prompted by the pop-up dialog box.

  6. In the Storage source list, select the
    type of the source data to be processed. In this scenario, select Twitter feed. Then the Twitter configuration area is displayed.

  7. In the following fields, enter the authentication information for the
    Twitter account to be accessed: Consumer key, Consumer secret,
    Access token and Secret token. You need to obtain this group of information
    from the Twitter side.

  8. In the Filters field, enter the phrases
    to filter the tweets you want to select. In this scenario, enter hadoop,talend,spark,streaming between double
    quotation marks.

    This filter selects the tweets using any of the phrases put in this
    field.

  9. In the Data mapping table, the columns
    you defined in the schema have been automatically added to the Column column; then in the Property column, you need to select which property of the
    Twitter data you want each column to receive.

    In this scenario, select Id for the
    id column, Username for username and
    Hashtags for hashtag.
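On the Spark side, this configuration maps onto the spark-streaming-twitter
receiver. A hedged sketch, reusing the jssc context from the connection sketch
earlier and the twitter4j.oauth.* system properties shown in the properties
section:

    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.twitter.TwitterUtils;
    import twitter4j.Status;

    // Filters: each comma-separated phrase becomes an independent track phrase.
    String[] filters = { "hadoop", "talend", "spark", "streaming" };

    // The receiver reads its credentials from the twitter4j.oauth.* properties.
    JavaReceiverInputDStream<Status> tweets = TwitterUtils.createStream(jssc, filters);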

Normalizing the Twitter data

  1. Double-click tSparkNormalize to open its
    Component view.

    use_case-tsparkload5.png
  2. From the Column to normalize list, select
    the column the normalization is based on. In this scenario, it is hashtag.

  3. In the Item separator field, enter the
    separator you need to use to normalize the hashtags such that each row
    contains one hashtag. In this example, enter the comma (,) because the hashtags retrieved in each batch
    are automatically separated by the comma.
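In plain Spark terms, this normalization is a flatMap that splits the hashtag
field on the comma so that each output record carries a single hashtag. A
sketch, assuming a hypothetical JavaDStream<String> named hashtags that holds
the comma-separated values (Spark 1.x flatMap signature, which returns an
Iterable):

    import java.util.Arrays;
    import org.apache.spark.streaming.api.java.JavaDStream;

    // "bigdata,talend,spark" becomes three records: "bigdata", "talend", "spark".
    JavaDStream<String> normalized = hashtags.flatMap(
            value -> Arrays.asList(value.split(",")));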

Removing empty hashtag rows

  1. Double-click tSparkFilterRow to open its
    Component view.

    use_case-tsparkload6.png
  2. In the Filter configuration table, click
    the [+] button once to add one row and then
    configure this row as follows:

    • Logical: leave the default
      logical operator as is because it is ignored in the
      execution.

    • Column: select the column
      which the filtering is based on. It is hashtag in this scenario.

    • Operator: select the function
      you need to use to filter. It is Not
      equal
      in this example.

    • Value: enter the value used
      as the criteria of the filtering function. In this scenario,
      enter a pair of double quotation marks, meaning the value is
      empty.

    This filter allows you to select the rows in which the hashtag
    is of the String type and is not empty.
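Expressed directly in Spark, this step is a plain filter that keeps only the
non-empty hashtag values; continuing the hypothetical normalized stream from
the previous sketch:

    // Not equal "": drop the rows whose hashtag value is an empty string.
    JavaDStream<String> nonEmpty = normalized.filter(tag -> !tag.isEmpty());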

Writing the filtered data

The tSparkLog component is used to output the
execution result in the Job console. You can replace it with tSparkStore to write the data into a given HDFS system for
analytical use.

This tSparkLog component does not require any
configuration in its Basic settings view.

Executing the Job

Then you can press F6 to run this Job.

Once done, the Run view is opened automatically,
where you can check the execution result.

use_case-tsparkload7.png

You can read that based on the filter you have put in tSparkLoad, the user NoSQL is
selected along with its Twitter user ID and the hashtags used in its tweets that
meet the filtering condition.

This Job runs continuously until the end of the time window you have defined in
the Define a streaming timeout field.

