
Warning
This component will be available in the Palette of Talend Studio on the condition that you have subscribed to one of the Talend solutions with Big Data.
Component family | Big Data / Spark
Function | tSparkLoad uses the Spark engine to read the data to be processed from a given storage source.
Purpose | tSparkLoad loads the data to be processed by the subsequent Spark components.
Basic settings
Spark connection | Select the Spark connection component to be used from the drop-down list in order to reuse the connection configuration it defines.
Schema and Edit schema | A schema is a row description. It defines the number of fields to be processed and passed on to the next component. Click Edit schema to make changes to the schema.
Storage source | Select the type of the source system storing the data to be processed.
Input file | Enter the location of the data to be read in the source system. This field is not available for the Twitter feed source.
Twitter configuration
Consumer key and Consumer secret | Enter the consumer authentication information for the Twitter account to be accessed. To enter the consumer secret, click the […] button next to the Consumer secret field. These fields are displayed only when you have selected Twitter feed from the Storage source list.
Access token and Secret | Enter the access token and the token secret obtained from Twitter. To enter the secret token, click the […] button next to the Secret field. These fields are displayed only when you have selected Twitter feed from the Storage source list.
Filters | Enter the phrases on which you want to perform a filter so that only the tweets containing those phrases are selected. You need to use the comma (,) to separate each phrase. This comma is interpreted as a logical OR. For example, if you enter Talend,Spark, the tweets containing Talend or Spark are selected. This field is displayed only when you have selected Twitter feed from the Storage source list.
Data mapping | Complete this table to map the tweet related information to the schema columns you have defined. This table is displayed only when you have selected Twitter feed from the Storage source list.
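For reference, the Twitter configuration and Filters settings above map naturally onto the open-source spark-streaming-twitter integration. The following minimal Java sketch is hypothetical (placeholder credentials, illustrative master URI and class name); it is not the code that Talend Studio generates:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.twitter.TwitterUtils;
import twitter4j.auth.OAuthAuthorization;
import twitter4j.conf.ConfigurationBuilder;

public class TwitterFeedSketch {
    public static void main(String[] args) throws InterruptedException {
        // Consumer key/secret and access token/secret, as entered in the
        // Twitter configuration fields (placeholder values).
        ConfigurationBuilder cb = new ConfigurationBuilder()
                .setOAuthConsumerKey("<consumer key>")
                .setOAuthConsumerSecret("<consumer secret>")
                .setOAuthAccessToken("<access token>")
                .setOAuthAccessTokenSecret("<token secret>");

        // Filters: comma-separated phrases; a tweet containing ANY of the
        // phrases is selected, so the comma acts as a logical OR.
        String[] filters = {"hadoop", "talend", "spark", "streaming"};

        SparkConf conf = new SparkConf()
                .setAppName("twitter_feed")
                .setMaster("spark://master:7077"); // illustrative master URI
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.milliseconds(1000));

        // Each micro-batch delivers the matching tweets as twitter4j Status objects.
        TwitterUtils.createStream(jssc, new OAuthAuthorization(cb.build()), filters).print();

        jssc.start();
        jssc.awaitTermination();
    }
}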
Advanced settings
tStatCatcher Statistics | Select this check box to collect log data at the component level.
Global Variables | ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. A Flow variable functions during the execution of a component while an After variable functions after the execution of the component. To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it. For further information about variables, see Talend Studio User Guide.
Usage | This component is the start component of a Spark process.
Limitations | It is strongly recommended to use this component in a Spark-only Job, that is to say, to design the Job using Spark components only.
In this scenario, a five-component Job is created to leverage Apache’s Spark system to
sort out the tweets from a given Twitter account.

-
In the Integration perspective of the Studio, create an empty Job from the Job Designs node in the Repository tree view. For further information about how to create a Job, see Talend Studio User Guide.
-
In the workspace, enter the name of the component to be used and select this component from the list that opens. In this scenario, the components are tSparkConnection, tSparkLoad, tSparkNormalize, tSparkFilterRow and tSparkLog.
-
Connect tSparkConnection to tSparkLoad using the Trigger > On Subjob Ok link.
-
Connect the other components using the Row > Spark
combine link.
-
Double-click tSparkConnection to open its Component view.
-
From the Spark mode list, select the mode that fits the Spark system you need to use. In this scenario, select Standalone since the Spark system to be used is installed in a standalone Hadoop cluster. Since this Job needs to read contents from a given Twitter account, ensure that the cluster you are connecting to has access to the Internet.
For further details about the different Spark modes available in this component, see tSparkConnection. You can also read Apache's documentation about Spark or the documentation of the Hadoop distribution you are using for more relevant details.
-
In the Distribution and the Version lists, select the options that correspond to the Hadoop cluster to be used.
If the distribution you need to use is not yet officially supported by the Spark components, you need to select Custom and use the […] button that is displayed to set up the related configuration. For further information about how to configure a custom Hadoop connection, see Connecting to a custom Hadoop distribution.
-
In the Spark host field, enter the URI of the Spark master node.
-
In the Spark home field, enter the path to the Spark executables and libraries in the Hadoop cluster to be used. This path is the value of the SPARK_HOME variable.
-
Select the Define the driver hostname or IP address check box and in the field that is displayed, enter the IP address of the machine on which the Job is to be run.
This value is actually the value of the spark.driver.host property. For further information about this property, see Apache's documentation about Spark.
-
Select the Execute this Job as a streaming application check box to switch the Job to the streaming mode. Then the Batch size field and the Define a streaming timeout check box are displayed to allow you to configure this mode.
As explained previously, at least two cores are required in the cluster to be used to run the streaming mode.
-
In the Batch size field, enter the time interval at the end of which you want to read the newly ingested data from the Twitter account to be used. In this scenario, enter 1000, meaning 1000 milliseconds.
You need to use an appropriate time interval to keep up with the ingestion rate of the data streams, which can depend on a number of variables. For further information, see Apache's Spark documentation about streaming programming.
-
In the Define a streaming timeout field, enter the time frame at the end of which the Job stops running. For example, enter 60000, meaning 60000 milliseconds. A sketch of how these two values map to Spark Streaming calls follows this step.
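In raw Spark Streaming terms, the Batch size corresponds to the batch interval of the StreamingContext and the streaming timeout to a bounded wait on termination. The following sketch is hypothetical (illustrative class name and master URI), not the generated code:

import java.util.LinkedList;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class BatchAndTimeoutSketch {
    public static void main(String[] args) throws InterruptedException {
        // At least two cores are required in the cluster to run the streaming mode.
        SparkConf conf = new SparkConf()
                .setAppName("batch_and_timeout")
                .setMaster("spark://master:7077"); // illustrative master URI

        // Batch size 1000: a new micro-batch is processed every 1000 ms.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.milliseconds(1000));

        // Placeholder output operation so that start() is valid; the real
        // stream definition is shown in the other sketches.
        jssc.queueStream(new LinkedList<>()).print();

        jssc.start();

        // Streaming timeout 60000: the Job stops running after 60000 ms.
        jssc.awaitTerminationOrTimeout(60000);
        jssc.stop();
    }
}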
-
Double-click tSparkLoad to open its Component view.
-
From the Spark connection list, select the connection to be used.
-
Click the […] button next to Edit schema to open the schema editor.
-
Click the [+] button three times to add three rows and rename them to id, username and hashtag, respectively.
-
Click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.
-
From the Storage source list, select the type of the source data to be processed. In this scenario, select Twitter feed. Then the Twitter configuration area is displayed.
-
In the following fields, enter the authentication information for the Twitter account to be accessed: Consumer key, Consumer secret, Access token, Secret token. You need to obtain this information from the Twitter side.
-
In the Filters field, enter the phrases to filter the tweets you want to select. In this scenario, enter hadoop,talend,spark,streaming between double quotation marks.
This filter selects the tweets using any of the phrases put in this field.
-
In the Data mapping table, the columns you defined in the schema have been automatically added to the Column column; then in the Properties column, you need to select which property of Twitter data you want each column to receive.
In this scenario, select Id for the id column, Username for username and Hashtags for hashtag. A rough hand-written equivalent of this mapping is sketched after this step.
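Assuming the tweets stream of twitter4j Status objects from the sketch after the Basic settings table, the Data mapping above roughly corresponds to the following fragment (rows is a hypothetical name; the complete program appears after the scenario):

// Requires: java.util.Arrays, java.util.stream.Collectors,
// org.apache.spark.streaming.api.java.JavaDStream, twitter4j.HashtagEntity.
// id <- Id (the Twitter user ID), username <- Username, hashtag <- Hashtags.
// The hashtags of each tweet are joined into one comma-separated field here
// and split again by the normalization step that follows.
JavaDStream<String[]> rows = tweets.map(status -> new String[] {
        String.valueOf(status.getUser().getId()),
        status.getUser().getScreenName(),
        Arrays.stream(status.getHashtagEntities())
              .map(HashtagEntity::getText)
              .collect(Collectors.joining(","))});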
-
Double-click tSparkNormalize to open its Component view.
-
From the Column to normalize list, select the column on which the normalization is based. In this scenario, it is hashtag.
-
In the Item separator field, enter the separator you need to use to normalize the hashtags such that each row contains one hashtag. In this example, enter the comma (,) because the hashtags retrieved in each batch are automatically separated by the comma. This normalization is sketched in Spark terms right after this step.
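In Spark terms, this normalization is a flatMap that splits the comma-separated hashtag field into one row per hashtag. This fragment continues the previous one (normalized is a hypothetical name; on Spark 1.x, flatMap expects an Iterable rather than an Iterator):

// {id, username, "tag1,tag2"} becomes {id, username, "tag1"} and {id, username, "tag2"}.
JavaDStream<String[]> normalized = rows.flatMap(r ->
        Arrays.stream(r[2].split(","))
              .map(tag -> new String[] {r[0], r[1], tag})
              .iterator());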
-
Double-click tSparkFilterRow to open its Component view.
-
In the Filter configuration table, click the [+] button once to add one row and then configure this row as follows:
-
Logical: leave the default logical operator as is because it is ignored in the execution.
-
Column: select the column on which the filtering is based. It is hashtag in this scenario.
-
Operator: select the function you need to use to filter. It is Not equal in this example.
-
Value: enter the value used as the criteria of the filtering function. In this scenario, enter a pair of double quotation marks, meaning the value is empty.
This filter allows you to select the rows in which the hashtag is of the String type and is not empty.
-
The tSparkLog component is used to output the execution result in the Job console. You can replace it with tSparkStore to write the data into a given HDFS system for analytical use.
This tSparkLog component does not require any configuration in its Basic settings view.
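To relate the finished Job to hand-written Spark code, the sketch below consolidates the fragments above into one hypothetical, self-contained program (placeholder credentials, illustrative master URI and class name; the code Talend actually generates differs):

import java.util.Arrays;
import java.util.stream.Collectors;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.twitter.TwitterUtils;

import twitter4j.HashtagEntity;
import twitter4j.auth.OAuthAuthorization;
import twitter4j.conf.ConfigurationBuilder;

public class TwitterHashtagPipeline {
    public static void main(String[] args) throws InterruptedException {
        // tSparkConnection: standalone master, streaming mode, 1000 ms batch size.
        SparkConf conf = new SparkConf()
                .setAppName("twitter_hashtags")
                .setMaster("spark://master:7077"); // illustrative master URI
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.milliseconds(1000));

        // tSparkLoad: Twitter feed with OAuth credentials (placeholders) and
        // the Filters phrases; the comma acts as a logical OR.
        ConfigurationBuilder cb = new ConfigurationBuilder()
                .setOAuthConsumerKey("<consumer key>")
                .setOAuthConsumerSecret("<consumer secret>")
                .setOAuthAccessToken("<access token>")
                .setOAuthAccessTokenSecret("<token secret>");
        String[] filters = {"hadoop", "talend", "spark", "streaming"};

        // Data mapping: id <- Id, username <- Username, hashtag <- Hashtags.
        JavaDStream<String[]> rows = TwitterUtils
                .createStream(jssc, new OAuthAuthorization(cb.build()), filters)
                .map(status -> new String[] {
                        String.valueOf(status.getUser().getId()),
                        status.getUser().getScreenName(),
                        Arrays.stream(status.getHashtagEntities())
                              .map(HashtagEntity::getText)
                              .collect(Collectors.joining(","))});

        // tSparkNormalize: one row per hashtag (item separator ","), then
        // tSparkFilterRow: keep rows whose hashtag is "Not equal" to "".
        // Note: on Spark 1.x, flatMap expects an Iterable; return a List there instead.
        rows.flatMap(r -> Arrays.stream(r[2].split(","))
                                .map(tag -> new String[] {r[0], r[1], tag})
                                .iterator())
            .filter(r -> !r[2].isEmpty())
            .map(Arrays::toString)
            .print(); // tSparkLog: write each row to the console

        jssc.start();
        // Streaming timeout: stop the Job after 60000 ms.
        jssc.awaitTerminationOrTimeout(60000);
        jssc.stop();
    }
}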
Then you can press F6 to run this Job.
Once done, the Run view is opened automatically,
where you can check the execution result.

You can see that, based on the filter you have defined in tSparkLoad, the user NoSQL is selected along with its Twitter user ID and the hashtags used in its tweets that meet the filtering condition.
This Job runs continuously until the end of the time window you have defined in
the Define a streaming timeout field.