
tDataprepRun

Applies a preparation made using Talend Data Preparation in a standard Data Integration
Job.

tDataprepRun fetches a preparation made
using Talend Data Preparation and applies it to
a set of data.

Depending on the Talend solution you
are using, this component can be used in one, some or all of the following Job
frameworks: Standard, Spark Batch, and Spark Streaming.

tDataprepRun Standard properties

These properties are used to configure tDataprepRun running in
the Standard Job framework.

The Standard
tDataprepRun component belongs to the
Talend Data Preparation family.

The component in this framework is available when you have subscribed to one of
the Talend Platform products or Talend Data
Fabric.

Basic settings

URL

Type the URL to the Talend Data Preparation web
application, between double quotes.

Username

Type the email address that you use to log in to the Talend Data Preparation web application, between double quotes.

Note: If you are using Talend Data Preparation
Cloud, you must use your Talend Integration Cloud login instead.

Password

Click the […] button and type your user password
for the Talend Data Preparation web application, between
double quotes.

When using the default preparation selection properties:

Preparation

To complete the Preparation field, click one of the
following:

  • Choose an existing
    preparation
    to select from a list of the preparations that were
    previously created in Talend Data Preparation.

  • Or create a new one to
    create a new preparation based on your input data.

[Edit preparation button]

Click this button to edit the preparation in Talend Data Preparation that corresponds to the ID defined in the
Preparation field.

Version

If you have created several versions of your preparation, you
can choose which one you want to use in the Job. To complete the
Version field, click Choose a Version to
select from the list of existing versions, including the current version of the
preparation.

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to
be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Edit schema to make changes to the schema.
If the current schema is of the Repository type, three
options are available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this
    option to change the schema to Built-in for
    local changes.

  • Update repository connection: choose this
    option to change the schema stored in the repository and decide whether to propagate
    the changes to all the Jobs upon completion. If you just want to propagate the
    changes to the current Job, you can select No
    upon completion and choose this schema metadata again in the [Repository Content] window.

Click Sync columns to retrieve the schema from
the previous component connected in the Job.

Fetch Schema

Click this button to retrieve the schema from the preparation defined in the
Preparation field.

When using the Dynamic preparation selection:

Dynamic preparation selection

Select this checkbox to define a preparation path and version using context
variables. The preparation will be dynamically selected at runtime.

Preparation path

Use a context variable to define a preparation path. Paths with or without the
initial / are supported.

Preparation version

Use a context variable to define the version of the preparation to use.
Preparation versions are referenced by their number. For example, to execute
version #2 of a preparation, the expected value is 2.
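
For illustration only, here is a minimal Java sketch of these two documented
behaviors: the leading / of a preparation path is optional, and versions are
referenced by their number. This is not the component's actual code, and the
path and version values are placeholders.

    public class PreparationSelection {
        // "/customers/my_preparation" and "customers/my_preparation"
        // designate the same preparation: the leading "/" is optional.
        static String normalizePath(String path) {
            return path.startsWith("/") ? path.substring(1) : path;
        }

        public static void main(String[] args) {
            System.out.println(normalizePath("/customers/my_preparation"));
            System.out.println(normalizePath("customers/my_preparation"));
            // To run version #2 of a preparation, the expected value is "2".
            String version = "2";
            System.out.println("version " + version);
        }
    }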

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to
be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Edit schema to make changes to the schema.
If the current schema is of the Repository type, three
options are available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this
    option to change the schema to Built-in for
    local changes.

  • Update repository connection: choose this
    option to change the schema stored in the repository and decide whether to propagate
    the changes to all the Jobs upon completion. If you just want to propagate the
    changes to the current Job, you can select No
    upon completion and choose this schema metadata again in the [Repository Content] window.

Click Sync columns to retrieve the schema from
the previous component connected in the Job.

Fetch Schema

Click this button to dynamically retrieve the schema from the preparations
defined by the context variable in the Preparation path field. If
the fetch is successful, any previously configured schema will be overwritten. If the
fetch fails, the current schema is kept.

Advanced settings

Limit Preview

Specify the number of rows to which you want to limit the
preview.

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at the Job
level as well as at each component level.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space to access
the variable list and choose the variable to use from it.

For further information about variables, see the Talend Studio User Guide.
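
For example, once the component has run, the ERROR_MESSAGE variable can be read
from the globalMap, typically in a tJava component placed downstream; the
component name tDataprepRun_1 below is illustrative.

    // As typed in a tJava component, where globalMap is provided by the
    // Talend runtime; "tDataprepRun_1" is the illustrative component name.
    String errorMessage = (String) globalMap.get("tDataprepRun_1_ERROR_MESSAGE");
    if (errorMessage != null) {
        System.err.println("tDataprepRun failed: " + errorMessage);
    }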

Usage

Usage rule

This component is an intermediary step. It requires an input flow as well as an
output.

Limitations

  • If the dataset is updated after the tDataprepRun component has been configured, the
    schema needs to be fetched again.

  • If a context variable was used in the URL of the
    dataset, you cannot use the [Edit preparation button] to edit the preparation
    directly in Talend Data Preparation.

Using the tDataprepRun component to prepare data from a
database in a Talend Job

This scenario applies only to a subscription-based Talend solution.

The tDataprepRun component allows you to reuse an existing
preparation made in Talend Data Preparation,
directly in a data integration Job. In other words, you can operationalize the process of
applying a preparation to input data with the same model.

The following scenario creates a simple Job
that:

  • retrieves a table from a MySQL database that holds some employee-related
    data,
  • applies an existing preparation on this data,
  • outputs the prepared data into an Excel file.
Use_Case_tDataprepRun_1.png

This assumes that a preparation has been created beforehand, on a dataset
with the same schema as your input data for the Job. In this case, the existing preparation is
called datapreprun_scenario. This simple preparation puts the employees'
last names into upper case and isolates the employees with a salary greater than $1,500.

Use_Case_tDataprepRun_2.png

Adding and linking the components

  1. In the Integration perspective of the Studio, create an empty
    Standard Job from the Job Designs node in
    the Repository tree view.
  2. Drop the following components from the Palette onto the design workspace: tMysqlInput, tDataprepRun and
    tFileOutputExcel.
  3. Connect the three components using Row > Main links.

Configuring the components

Retrieving the data from the database

  1. In the design workspace, select tMysqlInput
    and click the Component tab to define its
    basic settings.

    Use_Case_tDataprepRun_3.png

  2. In the Property Type list, select Built-in to set the database connection details
    manually.
  3. In the DB Version list, select the version of
    MySQL you are using, MySQL 5 in this
    example.
  4. In the Host, Port, Database, Username and Password
    fields, enter the MySQL connection details and the user authentication data for
    the database, between double quotes.
  5. In the Table Name field, type the name of the
    table to be read, between double quotes.
  6. In the Query field,
    enter your database query between double quotes. In this example, the query is
    select * from employees to retrieve all of the
    information from the table employees, in the
    test database.
  7. Click Guess schema to automatically retrieve
    the schema from the database table, or enter the schema manually by clicking the
    […] button next to Edit schema.

    Make sure that the schema of the tMysqlInput component matches the schema expected by the
    tDataprepRun component. In other
    words, the input schema must be the same as the dataset upon which the
    preparation was made in the first place.

Accessing the preparation from Talend Data Preparation

  1. In the design workspace, select tDataprepRun
    and click the Component tab to define its
    basic settings.

    Use_Case_tDataprepRun_4.png

  2. In the URL field, type
    the URL of the Talend Data Preparation web application, between double quotes. Port 9999 is the default port for Talend Data Preparation. Illustrative
    connection values are shown after this list.
  3. In the Username and
    Password fields, enter your Talend Data Preparation connection
    information, between double quotes.

    Note: If you are using Talend Data Preparation Cloud, you need to use your Talend Integration Cloud login instead of your
    Talend Data Preparation email.

  4. Click Choose an existing
    preparation
    to display a list of the preparations available in
    Talend Data Preparation, and select
    datapreprun_scenario.

    Use_Case_tDataprepRun_5.png

    This scenario assumes that a preparation with a compatible schema has been
    created beforehand.

  5. Click Fetch Schema to retrieve the schema of
    the preparation.

    The output schema of the tDataprepRun component now
    reflects the changes made with each preparation step. The schema takes into
    account columns that were added or removed for example.
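
As a reference for step 2 and step 3 above, the connection fields could look as
follows; the host and credentials are placeholders, and 9999 is the default
Talend Data Preparation port:

    URL:      "http://dataprep.example.com:9999"
    Username: "jdoe@company.com"
    Password: "my_password"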

Outputting the preparation into an Excel file

  1. In the design workspace, select tFileOutputExcel and click the Component tab to define its basic settings.

    Use_Case_tDataprepRun_6.png

  2. In the File Name field, enter the location
    where you want to save the result of the preparation.
  3. Click Sync columns to
    retrieve the new output schema, inherited from the tDataprepRun
    component.

Saving and executing the Job

  1. Save your Job and press F6 to execute it.
  2. You can now open the Excel file containing the result of the preparation
    applied on your data from the MySQL database.

Using the tDataprepRun component to dynamically select a
preparation at runtime according to the input

This scenario applies only to a subscription-based Talend solution.

The tDataprepRun component allows you to reuse an existing preparation
made in Talend Data Preparation, directly in a data
integration, Spark Batch or Spark Streaming Job. In other words, you can operationalize the
process of applying a preparation to input data with the same model.

By default, the tDataprepRun component retrieves
preparations using their technical id. However, the dynamic preparation selection feature
allows you to call a preparation via its path in Talend Data Preparation. Through the use of the Dynamic
preparation selection
check box and some variables, it is then possible to
dynamically select a preparation at runtime, according to runtime data or metadata.

If you wanted to operationalize preparations in a Talend Job using the regular preparation selection
properties, you would actually need several Jobs: one for each preparation to apply on a
specific dataset. By retrieving the correct preparation according to the input file name, you
will be able to dynamically run more than one preparation on your source data, in a single
Job.

The following scenario creates a Job that:

  • Scans the content of a folder containing several datasets
  • Creates a dynamic path to your CSV files
  • Dynamically retrieves the preparations according to the input file name and
    applies them on the data
  • Outputs the prepared data into a Redshift database
Use_Case_tDataprepRun_dynamic_prep_1.png

In this example, .csv datasets with data from two of your
clients are locally stored in a folder called customers_files. Each of
your clients' datasets has its own naming convention and is stored in a dedicated sub
folder. All the datasets in the customers_files folder have an identical
schema, or data model.

Use_Case_tDataprepRun_dynamic_prep_2.png

A customers folder has also been created in Talend Data Preparation, containing two preparations. These two
distinct preparations are each aimed at cleaning data from your two different customers.

Use_Case_tDataprepRun_dynamic_prep_3.png

The purpose of customer_1_preparation for example is to isolate a
certain type of email addresses, while customer_2_preparation aims at
cleansing invalid values and formatting the data. In this example, the preparation names are
based on the two sub folder names customer_1 and
customer_2, with _preparation as a suffix.

Use_Case_tDataprepRun_dynamic_prep_4.png

Just like the input schema that all four datasets have in common, all of your output data must
also share the same model. For this reason, you cannot have one preparation that modifies the
schema by adding a column for example, while the other does not.

By following this scenario, a single Job will allow you to use the appropriate preparation,
depending on whether the dataset extracted from the local customers_files
folder belongs to customer 1 or customer 2.

Designing the Job

  1. In the Integration perspective of the Studio, create an
    empty Standard, Spark Batch or Spark Streaming Job from the Job
    Designs
    node in the Repository tree
    view.
  2. Drop the following components from the palette onto the
    design workspace: two tFileList, a
    tFileInputDelimited, a tDataprepRun and a
    tRedshiftOutput.
  3. Connect the two tFileList and the
    tFileInputDelimited components using Row > Iterate links.
  4. Connect the tFileInputDelimited,
    tDataprepRun and tRedshiftOutput
    components using Row > Main links.

Configuring the components

Reading the input files from your local folder

  1. In the design workspace, select
    tFileList_1 and click the
    Component tab to define its basic settings.

    This first tFileList will read the
    customers_files folder, and retrieve the path of the
    two sub folders so that they can be reused later.

    Use_Case_tDataprepRun_dynamic_prep_5.png

  2. In the Directory field, enter the path to the
    customers_files folder, containing the customers datasets,
    in their respective sub folders.
  3. Click the + button in the
    Filemask table to add a new line and rename it
    *, between double quotes.
  4. In the design workspace, select tFileList_2 and click
    the Component tab to define its basic settings.

    This second tFileList will read the four
    .csv datasets contained in the two sub folders and
    retrieve their file paths.

    Use_Case_tDataprepRun_dynamic_prep_6.png

  5. To fill the Directory field with the expression that
    will dynamically retrieve the input files paths, drag it from the
    tFileList_1 list of expressions in the
    Outline panel.

    Use_Case_tDataprepRun_dynamic_prep_7.png

  6. Check the Includes subdirectories check box.
  7. Click the + button in the
    Filemask table to add a new line and rename it
    *.csv, between double quotes.
  8. In the design workspace, select the tFileInputDelimited and
    click the Component tab to define its basic
    settings.

    Use_Case_tDataprepRun_dynamic_prep_8.png

  9. To fill the File name/Stream field with the expression
    that will dynamically retrieve the input files paths, drag it from the
    tFileList_2 list of expressions in the
    Outline panel.

    Use_Case_tDataprepRun_dynamic_prep_9.png

  10. Enter the Row Separator and Field
    Separator that correspond to your datasets, between double
    quotes (illustrative values are shown after this list).
  11. Click the Edit schema button to define the columns of
    the source datasets and their data type.

    The schema is the same for all the datasets from the
    customers_files folder. Make sure that this schema
    matches the schema expected by the tDataprepRun
    component. In other words, the input schema must be the same as the datasets
    upon which the preparations were made in the first place.
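
As a reference for step 10, typical separator values, assuming standard
semicolon-delimited CSV files (both values are placeholders to adapt to your
own files):

    Row Separator:   "\n"
    Field Separator: ";"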

Dynamically selecting the preparation from Talend Data Preparation

  1. In the design workspace, select tDataprepRun and click the
    Component tab to define its basic settings.

    Use_Case_tDataprepRun_dynamic_prep_10.png

  2. In the URL field, type the URL of the Talend Data Preparation web application, between
    double quotes. Port 9999 is the default port for Talend Data Preparation.
  3. In the Username and Password
    fields, enter your Talend Data Preparation connection information, between
    double quotes.

    Note: If you are using Talend Data Preparation Cloud, you need to use your Talend Integration Cloud login instead of
    your Talend Data Preparation
    email.

  4. Select the Dynamic preparation selection check box to
    dynamically define the preparations with their paths in Talend Data Preparation rather than their technical
    ids.
  5. In the Preparation path field, enter
    "customers/"+((String)globalMap.get("tFileList_2_CURRENT_FILE"))+"_preparation".

    This expression is made of three distinct parts. In the path,
    customers is the folder in Talend Data Preparation where the preparations are
    kept. As for the preparation names, because they are partly reused from the
    local sub folder names, you will use this expression to retrieve those sub
    folder names from the tFileList_1 iteration and attach the
    _preparation suffix (see the sketch after this list).

  6. In the Preparation version field, type
    current between double quotes, in order to use the
    current version of the preparation.
  7. Click Sync columns to retrieve the schema of the
    previous component.
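
To make the dynamic selection more concrete, here is a minimal runnable sketch
of what the Preparation path expression evaluates to at runtime. In a real Job,
globalMap is provided by Talend and tFileList_2_CURRENT_FILE is filled at each
iteration; the value customer_1 below is illustrative.

    import java.util.HashMap;
    import java.util.Map;

    public class DynamicPreparationPath {
        public static void main(String[] args) {
            // Simulates Talend's globalMap for the sake of the example.
            Map<String, Object> globalMap = new HashMap<>();
            globalMap.put("tFileList_2_CURRENT_FILE", "customer_1");

            // The exact expression entered in the Preparation path field:
            String path = "customers/"
                    + ((String) globalMap.get("tFileList_2_CURRENT_FILE"))
                    + "_preparation";

            System.out.println(path); // customers/customer_1_preparation
        }
    }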

Outputting the result of the preparation into a database

  1. In the design workspace, select tRedshiftOutput and click the
    Component tab to define its basic settings.

    Use_Case_tDataprepRun_dynamic_prep_11.png

  2. In the Property Type list, select
    Built-in to set the database connection details
    manually.
  3. In the Host, Port,
    Database, Schema,
    Username and Password fields, enter the
    Redshift connection details and the user authentication data for the database,
    between double quotes.
  4. To fill the Table field with the expression that will
    dynamically reuse the input file name as the table name, drag it from the
    tFileList_2 list of expressions in the
    Outline panel.

    Use_Case_tDataprepRun_dynamic_prep_12.png

  5. From the Action on table drop-down list, select
    Create table if does not exist.
  6. From the Action on data drop-down list, select
    Insert.
  7. Click Sync columns to retrieve the new output schema,
    inherited from the tDataprepRun component.

Saving and executing the Job

  1. Save your Job.
  2. Press F6 to execute it.

The four datasets have been dynamically prepared with a single Job, using the
preparation dedicated to each of your customers, and the output of those four
preparations has been sent to a Redshift database, in four new tables.

Using the tDataprepRun component to promote a Job
leveraging a preparation across environments

This scenario applies only to a subscription-based Talend solution.

The tDataprepRun component allows you to reuse an existing preparation made in Talend Data Preparation, directly in a data integration, Spark Batch
or Spark Streaming Job. In other words, you can operationalize the process of applying a
preparation to input data with the same model.

A good practice when using Talend Data Preparation is to set up at least two environments to
work with: a development one and a production one, for example. When a preparation is ready on
the development environment, you can use the Import/Export Preparation
feature to promote it to the production environment, which has a different URL. For more
information, see Promoting a preparation across environments.

Use_Case_tDataprepRun_dynamic_env_1.png

Following this logic, you will likely find yourself with a preparation that has the same name
on different environments. However, preparations are not actually identified by their
name, but rather by a technical id, such as
prepid=faf4fe3e-3cec-4550-ae0b-f1ce108f83d5. As a consequence, what you
really have is two distinct preparations, each with its specific id.

If you wanted to operationalize this recipe in a Talend Job using the regular preparation selection
properties, you would actually need two Jobs: one for the preparation on the development
environment, with a specific URL and id, and a second one for the production environment, with
different parameters.

Through the use of the Dynamic preparation selection checkbox and some
context variables, you will be able to use a single Job to run your preparation, regardless of
the environment. Indeed, the dynamic preparation selection relies on the preparation path in
Talend Data Preparation, and not on the preparation id.

You will be able to use a single Job definition to later deploy on your development or
production environment.

The following scenario creates a simple Job that:

  • Receives data from a local CSV file containing customers data
  • Dynamically retrieves an existing preparation based on its path and
    environment
  • Applies the preparation on the input data
  • Outputs the prepared data into a MySQL database.
Use_Case_tDataprepRun_dynamic_env_2.png

In this example, the customers_leads preparation has
been created beforehand in Talend Data Preparation. This
simple preparation was created on a dataset that has the same schema as the CSV file used as
input for this Job, and its purpose is to remove invalid values from your customers data.

Use_Case_tDataprepRun_dynamic_env_3.png

Designing the Job

  1. In the Integration perspective of Talend Studio, create an empty Standard Job
    from the Job Designs node in the
    Repository tree view.
  2. Drop the following components from the Palette onto the design workspace: tFileInputDelimited, tDataprepRun and tMysqlOutput.
  3. Connect the three components using Row > Main links.

Creating contexts for your different environments

For your Job to be usable in different situations, you will create two distinct
contexts.

Talend Studio allows you to create a set of
variables that you can regroup in a context. In this example, you will create a context
called Development and another one called
Production. Each of these contexts will contain the variables to
be used for the tDataprepRun configuration, according to the
target environment.

  1. Click the Contexts tab of your
    Job.

    Use_Case_tDataprepRun_dynamic_env_4.png

  2. Click the + button on the top right of the variables
    table to create a new context.
  3. Select the Default context and click
    Edit… to rename it
    Development.
  4. Click New… to create a new context called
    Production.
  5. Click the + button on the bottom left of the variables
    table to create a new variable called
    parameter_URL.

    The parameter_ prefix is mandatory, in order for the
    variable to be retrievable in Talend Integration Cloud, if you use the Cloud
    version of Talend Data Preparation.

  6. On the newly created parameter_URL line, enter the URLs of your
    development and production instances of Talend Data Preparation in the corresponding
    columns.
  7. Repeat the last two steps in order to create the
    parameter_USERNAME and
    parameter_PASSWORD variables, that will store your
    Talend Data Preparation credentials depending
    on the environment.

These two different contexts will be available at execution time, when deploying your
Job.
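
As an illustration of how one Job definition resolves different values per
context, here is a minimal sketch. In a real Job, Talend generates the context
object from the Contexts tab; the maps below only simulate the two contexts,
and the URLs are placeholders.

    import java.util.HashMap;
    import java.util.Map;

    public class ContextSelection {
        public static void main(String[] args) {
            Map<String, String> development = new HashMap<>();
            development.put("parameter_URL", "http://dataprep-dev.example.com:9999");

            Map<String, String> production = new HashMap<>();
            production.put("parameter_URL", "http://dataprep-prod.example.com:9999");

            // The context is chosen at execution time, when deploying the Job.
            String activeContext = "Production";
            Map<String, String> context =
                    "Production".equals(activeContext) ? production : development;

            System.out.println(context.get("parameter_URL"));
        }
    }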

Configuring the components

Retrieving the data from a CSV file

The input data retrieved by the component must have the same model as the dataset on
which the preparation to operationalize was created.

  1. In the design workspace, select the
    tFileInputDelimited component, and click the
    Component tab to define its basic settings.

    Use_Case_tDataprepRun_dynamic_env_5.png

    Fields marked with a * are mandatory.

  2. In the Property Type list, select
    Built-in.
  3. In the File name/Stream field, enter the path to the
    file containing the input data on which you want to apply the preparation.
  4. Define the characters to be used as Row Separator and
    Field Separator.
  5. Click the Edit schema button to define the columns of the
    source dataset and their data type.

    Make sure that the schema of the tFileInputDelimited
    component matches the schema expected by the
    tDataprepRun component. In other words, the input
    schema must be the same as the dataset upon which the preparation was made
    in the first place.

Dynamically selecting the preparation from Talend Data Preparation

  1. In the design workspace, select tDataprepRun
    and click the Component tab to define its
    basic settings.

    Use_Case_tDataprepRun_dynamic_env_6.png

  2. In the URL field, enter
    context.parameter_URL to reuse one of the values
    previously set for the URL context variable.
  3. In the Username and
    Password fields, respectively enter
    context.parameter_USERNAME and
    context.parameter_PASSWORD to reuse one of the values
    previously set for the USERNAME and PASSWORD
    context variables.
  4. Select the Dynamic preparation selection checkbox to
    define a preparation with its path in Talend Data Preparation rather than its technical
    id.
  5. In the Preparation path field, enter the path to the
    customers_leads preparation you want to apply on the
    .csv file.

    Your preparation must have the same path on your Talend Data Preparation development environment
    and production environment.

  6. In the Preparation version field, type
    current between double quotes, in order to use the
    current version of the preparation.
  7. Click Fetch Schema to retrieve the schema of the
    preparation.

Outputting the result of the preparation into a database

The result of the preparation will be exported to a MySQL database.

  1. In the design workspace, select tMysqlOutput and click the
    Component tab to define its basic
    settings.

    Use_Case_tDataprepRun_dynamic_env_7.png

  2. In the Property Type list, select Built-in to set the
    database connection details manually.
  3. In the DB Version list, select the version of MySQL you are
    using, MySQL 5 in this example.
  4. In the Host, Port,
    Database, Username and
    Password fields, enter the MySQL connection details and the
    user authentication data for the database, between double quotes.
  5. In the Table Name field, type the name of the table to
    write to, between double quotes.
  6. From the Action on table drop-down list, select
    Create table if does not exist.
  7. From the Action on data drop-down list, select
    Insert.
  8. Click Sync columns to retrieve the new output schema,
    inherited from the tDataprepRun component.

Deploying the Job to production

Now that the Job has been designed, you can deploy it to production and use it to
operationalize your preparation on real data.

  1. Save your Job.
  2. To promote your Job to production, export it to Talend Administration Center.
  3. Create an execution task with the production context variables and configure
    your preferred schedule.

    For more information on how to create an execution
    task with the appropriate context variables, see the Working with Job
    execution tasks section of the Talend Administration Center User Guide on https://help.talend.com.

tDataprepRun properties for Apache Spark Batch

These properties are used to configure tDataprepRun running in
the Spark Batch Job framework.

The Spark Batch
tDataprepRun component belongs to the
Talend Data Preparation family.

The component in this framework is available only if you have subscribed to one
of the Talend solutions with Big Data.

Basic settings

URL

Type the URL to the Talend Data Preparation web
application, between double quotes.

Email

Type the email address that you use to log in to the Talend Data Preparation web application, between double quotes.

Note: If you are using Talend Data Preparation
Cloud, you must use your Talend Integration Cloud login instead.

Password

Click the […] button and type your user password
for the Talend Data Preparation web application, between
double quotes.

When using the default preparation selection properties:

Preparation

To complete the Preparation field, click Choose an existing preparation to select from a list of the
preparations that were previously created in Talend Data Preparation.

[Edit preparation button]

Click this button to edit the preparation in Talend Data Preparation that corresponds to the ID defined in the
Preparation field.

Version

If you have created several versions of your preparation, you
can choose which one you want to use in the Job. To complete the
Version field, click Choose a Version to
select from the list of existing versions, including the current version of the
preparation.

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to
be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Edit schema to make changes to the schema.
If the current schema is of the Repository type, three
options are available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this
    option to change the schema to Built-in for
    local changes.

  • Update repository connection: choose this
    option to change the schema stored in the repository and decide whether to propagate
    the changes to all the Jobs upon completion. If you just want to propagate the
    changes to the current Job, you can select No
    upon completion and choose this schema metadata again in the [Repository Content] window.

Click Sync columns to retrieve the schema from
the previous component connected in the Job.

Fetch Schema

Click this button to retrieve the schema from the preparation defined in the
Preparation field.

When using the Dynamic preparation selection:

Dynamic preparation selection

Select this checkbox to define a preparation path and version using context
variables. The preparation will be dynamically selected at runtime.

Preparation path

Use a context variable to define a preparation path. Paths with or without the
initial / are supported.

Preparation version

Use a context variable to define the version of the preparation to use.
Preparation versions are referenced by their number. For example, to execute
version #2 of a preparation, the expected value is 2.

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to
be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Edit schema to make changes to the schema.
If the current schema is of the Repository type, three
options are available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this
    option to change the schema to Built-in for
    local changes.

  • Update repository connection: choose this
    option to change the schema stored in the repository and decide whether to propagate
    the changes to all the Jobs upon completion. If you just want to propagate the
    changes to the current Job, you can select No
    upon completion and choose this schema metadata again in the [Repository Content] window.

Click Sync columns to retrieve the schema from
the previous component connected in the Job.

Fetch Schema

Click this button to dynamically retrieve the schema from the preparations
defined by the context variable in the Preparation path field. If
the fetch is successful, any previously configured schema will be overwritten. If the
fetch fails, the current schema is kept.

Advanced settings

Encoding

Select an encoding mode from this list. You can select Custom from the list to enter an encoding method in the field that
appears.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space to access
the variable list and choose the variable to use from it.

For further information about variables, see the Talend Studio User Guide.

Usage

Usage rule

This component is an intermediary step. It requires an input flow as well as an
output.

Limitations

  • If the dataset is updated after the tDataprepRun component has been configured, the
    schema needs to be fetched again.

  • If a context variable was used in the URL of the
    dataset, you cannot use the [Edit preparation button] to edit the preparation
    directly in Talend Data Preparation.

  • The Make as header and Delete
    row
    functions, as well as any modification of a
    single cell, are ignored by the tDataprepRun
    component. These functions only affect a single row or cell and
    are thus not compatible with a Big Data context. In the list of
    existing preparations to choose from, a warning is displayed
    next to preparations that include incompatible actions.

Using the tDataprepRun component to apply a preparation
to a data sample in an Apache Spark Batch Job

This scenario applies only to a subscription-based Talend solution with Big Data.

The tDataprepRun component allows you to reuse an existing
preparation made in Talend Data Preparation,
directly in a Big Data Job. In other words, you can operationalize the process of applying a
preparation to input data with the same model.

The following scenario creates a simple Job that:

  • reads a small sample of customer data,
  • applies an existing preparation on this data,
  • shows the result of the execution in the console.
Use_Case_tDataprepRun_spark_batch_1.png

This assumes that a preparation has been created beforehand, on a dataset with
the same schema as your input data for the Job. In this case, the existing preparation is
called datapreprun_spark. This simple preparation puts the customers' last
names into upper case and applies a filter to isolate the customers from California, Texas and
Florida.

Use_Case_tDataprepRun_spark_batch_2.png
The sample data reads as follows:

Note: The sample data is created for demonstration purposes only.

Prerequisite: ensure that the Spark
cluster has been properly installed and is running.

Adding and linking the components

  1. In the Integration perspective of the Studio, create an empty
    Spark Batch Job from the Job Designs node
    in the Repository tree view.

    For further information about how to create a Spark Batch Job, see
    Talend Big Data Getting Started Guide.

  2. Drop the following components from the Palette onto the design workspace:
    tHDFSConfiguration, tFixedFlowInput, tDataprepRun and tLogRow.
  3. Connect the last three components using Row > Main links.

Selecting the Spark mode

Depending on the Spark cluster to be used, select a Spark mode for your Job.
  1. Click Run to open its view and then click the
    Spark Configuration tab to display its view
    for configuring the Spark connection.
  2. Select the Use local mode check box to test your Job locally.

    In the local mode, the Studio builds the Spark environment in itself on the fly in order to
    run the Job in it. Each processor of the local machine is used as a Spark
    worker to perform the computations.

    In this mode, your local file system is used; therefore, deactivate the
    configuration components such as tS3Configuration or
    tHDFSConfiguration that provide connection
    information to a remote file system, if you have placed these components
    in your Job.

    You can launch
    your Job without any further configuration.

  3. Clear the Use local mode check box to display the
    list of the available Hadoop distributions and from this list, select
    the distribution corresponding to your Spark cluster to be used.

    If you cannot find the distribution corresponding to yours from this
    drop-down list, this means the distribution you want to connect to is not officially
    supported by Talend. In this situation, you can select Custom, then select the Spark
    version of the cluster to be connected and click the
    [+] button to display the dialog box in which you can
    alternatively:

    1. Select Import from existing
      version
      to import an officially supported distribution as base
      and then add other required jar files which the base distribution does not
      provide.

    2. Select Import from zip to
      import the configuration zip for the custom distribution to be used. This zip
      file should contain the libraries of the different Hadoop/Spark elements and the
      index file of these libraries.

      In Talend Exchange, members of the Talend
      community have shared some ready-for-use configuration zip files
      which you can download from this Hadoop configuration
      list and directly use them in your connection accordingly. However, because of
      the ongoing evolution of the different Hadoop-related projects, you might not be
      able to find the configuration zip corresponding to your distribution from this
      list; then it is recommended to use the Import from
      existing version option to take an existing distribution as base
      to add the jars required by your distribution.

      Note that custom versions are not officially supported by Talend.
      Talend and its community provide you with the opportunity to connect to
      custom versions from the Studio but cannot guarantee that the configuration of
      whichever version you choose will be easy. As such, you should only attempt to
      set up such a connection if you have sufficient Hadoop and Spark experience to
      handle any issues on your own.

    For a step-by-step example about how to connect to a custom
    distribution and share this connection, see Connecting to a custom Hadoop distribution.

Configuring the connection to the file system to be used by Spark

  1. Double-click tHDFSConfiguration to open its
    Component view. Note that tHDFSConfiguration is used because the Spark Yarn client mode is used to run Spark Jobs in this scenario.

    Spark uses this component to connect to the HDFS system to which the jar
    files dependent on the Job are transferred.

  2. In the Version area, select the Hadoop distribution
    you need to connect to and its version.
  3. In the NameNode URI field, enter the location of the
    machine hosting the NameNode service of the cluster. If you are using WebHDFS, the location should be
    webhdfs://masternode:portnumber; if this WebHDFS is secured
    with SSL, the scheme should be swebhdfs and you need to use
    a tLibraryLoad in the Job to load the library required by
    the secured WebHDFS. Illustrative URI values are shown after this list.
  4. In the Username field, enter the
    authentication information used to connect to the HDFS system to be used. Note
    that the user name must be the same as you have put in the Spark configuration tab.
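
As a reference for step 3, NameNode URI values could look as follows; the host
names and ports are placeholders to adapt to your cluster:

    Plain HDFS:       "hdfs://namenode:8020"
    WebHDFS:          "webhdfs://masternode:50070"
    WebHDFS over SSL: "swebhdfs://masternode:50470" (requires tLibraryLoad)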

Configuring the input data and the preparation

Loading the sample data

  1. In the design workspace, select the
    tFixedFlowInput component and click the Component tab to define its basic settings.

    Use_Case_tDataprepRun_spark_batch_3.png

  2. Click the […] button next to Edit schema to open the schema editor.
  3. Click the [+] button to add the schema
    columns as shown in this image.

    Use_Case_tDataprepRun_spark_batch_4.png

    This schema is the same as the dataset originally used to create the
    datapreprun_spark preparation in
    Talend Data Preparation.

  4. Click OK to validate these changes and accept
    the propagation prompted by the pop-up dialog box.
  5. In the Mode area, select the Use Inline Content radio button and paste the
    above-mentioned sample data about customers into the Content field that is displayed.
  6. In the Field separator
    field, enter a semicolon (;).

Accessing the preparation from Talend Data Preparation

  1. In the design workspace, select tDataprepRun
    and click the Component tab to define its
    basic settings.

    Use_Case_tDataprepRun_spark_batch_5.png

  2. In the URL field, type
    the URL of the Talend Data Preparation web application, between double quotes. Port 9999 is the default port for Talend Data Preparation.
  3. In the Username and
    Password fields, enter your Talend Data Preparation connection
    information, between double quotes.

    Note: If you are using Talend Data Preparation Cloud, you must use your Talend Integration Cloud login instead of your
    Talend Data Preparation email.

  4. Click Choose an existing
    preparation
    to display a list of the preparations available in
    Talend Data Preparation, and select
    datapreprun_spark.

    Use_Case_tDataprepRun_spark_batch_6.png

    This scenario assumes that a preparation with a compatible schema has been
    created beforehand.

  5. Click Fetch Schema to retrieve the schema of
    the preparation.

    The output schema of the tDataprepRun component now
    reflects the changes made with each preparation step. The schema takes into
    account columns that were added or removed for example.

Executing the Job

The tLogRow component is used to present the
execution result of the Job.

  1. In the design workspace, select the tLogRow
    component and click the Component tab to define
    its basic settings.
  2. Select the Table radio button to present the
    result in a table.
  3. Save your Job and press F6 to execute it.
  4. You can now check the execution result in the console of the Run view.

    Use_Case_tDataprepRun_spark_batch_7.png

The preparation made in Talend Data Preparation has been applied to the
sample data and only the customers from California, Florida and Texas remain.

For the sake of this example, we used a small data sample, but the Spark
Batch version of the tDataprepRun component can be used with high
volumes of data.

tDataprepRun properties for Apache Spark Streaming

These properties are used to configure tDataprepRun running in
the Spark Streaming Job framework.

The Spark Streaming
tDataprepRun component belongs to
the Talend Data Preparation family.

The component in this framework is available only if you have subscribed to Talend Real-time Big Data Platform or Talend Data
Fabric.

Basic settings

URL

Type the URL to the Talend Data Preparation web
application, between double quotes.

Email

Type the email address that you use to log in to the Talend Data Preparation web application, between double quotes.

Note: If you are using Talend Data Preparation
Cloud, you must use your Talend Integration Cloud login instead.

Password

Click the […] button and type your user password
for the Talend Data Preparation web application, between
double quotes.

When using the default preparation selection properties:

Preparation

To complete the Preparation field, click Choose an existing preparation to select from a list of the
preparations that were previously created in Talend Data Preparation.

[Edit preparation button]

Click this button to edit the preparation in Talend Data Preparation that corresponds to the ID defined in the
Preparation field.

Version

If you have created several versions of your preparation, you
can choose which one you want to use in the Job. To complete the
Version field, click Choose a Version to
select from the list of existing versions, including the current version of the
preparation.

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to
be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Edit schema to make changes to the schema.
If the current schema is of the Repository type, three
options are available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this
    option to change the schema to Built-in for
    local changes.

  • Update repository connection: choose this
    option to change the schema stored in the repository and decide whether to propagate
    the changes to all the Jobs upon completion. If you just want to propagate the
    changes to the current Job, you can select No
    upon completion and choose this schema metadata again in the [Repository Content] window.

Click Sync columns to retrieve the schema from
the previous component connected in the Job.

Fetch Schema

Click this button to retrieve the schema from the preparation defined in the
Preparation field.

When using the Dynamic preparation selection:

Dynamic preparation selection

Select this checkbox to define a preparation path and version using context
variables. The preparation will be dynamically selected at runtime.

Preparation path

Use a context variable to define a preparation path. Paths with or without the
initial / are supported.

Preparation version

Use a context variable to define the version of the preparation to use.
Preparation versions are referenced by their number. For example, to execute
version #2 of a preparation, the expected value is 2.

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to
be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Edit schema to make changes to the schema.
If the current schema is of the Repository type, three
options are available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this
    option to change the schema to Built-in for
    local changes.

  • Update repository connection: choose this
    option to change the schema stored in the repository and decide whether to propagate
    the changes to all the Jobs upon completion. If you just want to propagate the
    changes to the current Job, you can select No
    upon completion and choose this schema metadata again in the [Repository Content] window.

Click Sync columns to retrieve the schema from
the previous component connected in the Job.

Fetch Schema

Click this button to dynamically retrieve the schema from the preparations
defined by the context variable in the Preparation path field. If
the fetch is successful, any previously configured schema will be overwritten. If the
fetch fails, the current schema is kept.

Advanced settings

Encoding

Select an encoding mode from this list. You can select Custom from the list to enter an encoding method in the field that
appears.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space to access
the variable list and choose the variable to use from it.

For further information about variables, see the Talend Studio User Guide.

Usage

Usage rule

This component is an intermediary step. It requires an input flow as well as an
output.

Limitations

  • If the dataset is updated after the
    tDataprepRun component has been configured,
    the schema needs to be fetched again.

  • If a context variable was used in the URL of the
    dataset, you cannot use the [Edit preparation button] to edit the preparation
    directly in Talend Data Preparation.

  • The Make as header and Delete
    row
    functions, as well as any modification of a
    single cell, are ignored by the tDataprepRun
    component. These functions only affect a single row or cell and
    are thus not compatible with a Big Data context. In the list of
    existing preparations to choose from, a warning is displayed
    next to preparations that include incompatible actions.

Using the tDataprepRun component to apply a preparation to a data
sample in an Apache Spark Streaming Job

This scenario applies only to Talend Real-time Big Data Platform or Talend Data Fabric.

The tDataprepRun component allows you to reuse an existing
preparation made in Talend Data Preparation,
directly in a Big Data Job. In other words, you can operationalize the process of applying a
preparation to input data with the same model.

The following scenario creates a simple Job that:

  • reads a small sample of customer data,
  • applies an existing preparation on this data,
  • shows the result of the execution in the console.
Use_Case_tDataprepRun_spark_batch_1.png

This assumes that a preparation has been created beforehand, on a dataset with the same
schema as your input data for the Job. In this case, the existing preparation is called
datapreprun_spark. This simple preparation puts the customers' last names
into upper case and applies a filter to isolate the customers from California, Texas and
Florida.

Use_Case_tDataprepRun_spark_batch_2.png
The sample data reads as follows:

Note: The sample data is created for demonstration purposes only.

Prerequisite: ensure that the Spark
cluster has been properly installed and is running.

Adding and linking the components

  1. In the Integration perspective of the Studio, create an empty
    Spark Streaming Job from the Job Designs node in the
    Repository tree view.

    For further information about how to create a Spark Streaming Job, see
    Talend Real-Time Big Data Getting Started Guide.

  2. Drop the following components from the Palette onto the
    design workspace: tHDFSConfiguration,
    tFixedFlowInput, tDataprepRun and tLogRow.
  3. Connect the last three components using Row > Main links.

Selecting the Spark mode

Depending on the Spark cluster to be used, select a Spark mode for your Job.
  1. Click Run to open its view and then click the
    Spark Configuration tab to display its view
    for configuring the Spark connection.
  2. Select the Use local mode check box to test your Job locally.

    In the local mode, the Studio builds the Spark environment in itself on the fly in order to
    run the Job in it. Each processor of the local machine is used as a Spark
    worker to perform the computations.

    In this mode, your local file system is used; therefore, deactivate the
    configuration components such as tS3Configuration or
    tHDFSConfiguration that provide connection
    information to a remote file system, if you have placed these components
    in your Job.

    You can launch
    your Job without any further configuration.

  3. Clear the Use local mode check box to display the
    list of the available Hadoop distributions and from this list, select
    the distribution corresponding to your Spark cluster to be used.

    If you cannot find the distribution corresponding to yours from this
    drop-down list, this means the distribution you want to connect to is not officially
    supported by Talend. In this situation, you can select Custom, then select the Spark
    version of the cluster to be connected and click the
    [+] button to display the dialog box in which you can
    alternatively:

    1. Select Import from existing
      version
      to import an officially supported distribution as base
      and then add other required jar files which the base distribution does not
      provide.

    2. Select Import from zip to
      import the configuration zip for the custom distribution to be used. This zip
      file should contain the libraries of the different Hadoop/Spark elements and the
      index file of these libraries.

      In Talend Exchange, members of the Talend
      community have shared some ready-for-use configuration zip files
      which you can download from this Hadoop configuration
      list and directly use them in your connection accordingly. However, because of
      the ongoing evolution of the different Hadoop-related projects, you might not be
      able to find the configuration zip corresponding to your distribution from this
      list; then it is recommended to use the Import from
      existing version option to take an existing distribution as base
      to add the jars required by your distribution.

      Note that custom versions are not officially supported by Talend.
      Talend and its community provide you with the opportunity to connect to
      custom versions from the Studio but cannot guarantee that the configuration of
      whichever version you choose will be easy. As such, you should only attempt to
      set up such a connection if you have sufficient Hadoop and Spark experience to
      handle any issues on your own.

    For a step-by-step example about how to connect to a custom
    distribution and share this connection, see Connecting to a custom Hadoop distribution.

Configuring a Spark stream for your Apache Spark streaming Job

Define how often your Spark Job creates and processes micro batches.
  1. In the Batch size field, enter the time
    interval at the end of which the Job reviews the source data to identify changes and
    processes the new micro batches (illustrative values are shown after this list).
  2. If need be, select the Define a streaming
    timeout
    check box and in the field that is displayed, enter the time frame
    at the end of which the streaming Job automatically stops running.
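
For illustration only, assuming both fields take values in milliseconds:

    Batch size:        2000   (a new micro batch is created every 2 seconds)
    Streaming timeout: 300000 (the Job stops itself after 5 minutes)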

Configuring the connection to the file system to be used by Spark

  1. Double-click tHDFSConfiguration to open its
    Component view. Note that tHDFSConfiguration is used because the Spark Yarn client mode is used to run Spark Jobs in this scenario.

    Spark uses this component to connect to the HDFS system to which the jar
    files dependent on the Job are transferred.

  2. In the Version area, select the Hadoop distribution
    you need to connect to and its version.
  3. In the NameNode URI field, enter the location of the
    machine hosting the NameNode service of the cluster. If you are using WebHDFS, the location should be
    webhdfs://masternode:portnumber; if this WebHDFS is secured
    with SSL, the scheme should be swebhdfs and you need to use
    a tLibraryLoad in the Job to load the library required by
    the secured WebHDFS.
  4. In the Username field, enter the
    authentication information used to connect to the HDFS system to be used. Note
    that the user name must be the same as you have put in the Spark configuration tab.

Configuring the input data and the preparation

Loading the sample data

  1. In the design workspace, select the tFixedFlowInput
    component and click the Component tab to define its basic
    settings.

    Use_Case_tDataprepRun_spark_stream_3.png

  2. Click the […] button next to Edit schema to open the schema editor.
  3. Click the [+] button to add the schema
    columns as shown in this image.

    Use_Case_tDataprepRun_spark_batch_4.png

    This schema is the same as the dataset originally used to create the
    datapreprun_spark preparation in
    Talend Data Preparation.

  4. Click OK to validate these changes and accept
    the propagation prompted by the pop-up dialog box.
  5. In the Streaming area, enter your preferred value for
    the Input repetition interval (ms) field. In this
    example, the default value, 5000 is used.
  6. In the Mode area, select the Use Inline Content radio button and paste the
    above-mentioned sample data about customers into the Content field that is displayed.
  7. In the Field separator
    field, enter a semicolon (;).

Accessing the preparation from Talend Data Preparation

  1. In the design workspace, select tDataprepRun and click
    the Component tab to define its basic settings.

    Use_Case_tDataprepRun_spark_batch_5.png

  2. In the URL field, type
    the URL of the Talend Data Preparation web application, between double quotes. Port 9999 is the default port for Talend Data Preparation.
  3. In the Username and
    Password fields, enter your Talend Data Preparation connection
    information, between double quotes.

    Note: If you are using Talend Data Preparation Cloud, you must use your Talend Integration Cloud login instead of your
    Talend Data Preparation email.

  4. Click Choose an existing
    preparation
    to display a list of the preparations available in
    Talend Data Preparation, and select
    datapreprun_spark.

    Use_Case_tDataprepRun_spark_batch_6.png

    This scenario assumes that a preparation with a compatible schema has been
    created beforehand.

  5. Click Fetch Schema to retrieve the schema of
    the preparation.

    The output schema of the tDataprepRun component now
    reflects the changes made with each preparation step. The schema takes into
    account columns that were added or removed for example.

Executing the Job

The tLogRow component is used to present the
execution result of the Job.

  1. In the design workspace, select the tLogRow
    component and click the Component tab to define
    its basic settings.
  2. Select the Table radio button to present the
    result in a table.
  3. Save your Job and press F6 to execute it.
  4. You can now check the execution result in the console of the Run view.

    Use_Case_tDataprepRun_spark_stream_7.png

The preparation made in Talend Data Preparation has been applied to the
sample data and only the customers from California, Florida and Texas remain.

For the sake of this example, we used a small data sample, but the Spark
Streaming version of the tDataprepRun component can be used with high
volumes of data.

