
tDataprepRun

Applies a preparation made using Talend Data Preparation in a standard Data Integration
Job.

tDataprepRun fetches a preparation made
using Talend Data Preparation and applies it to
a set of data.

Depending on the Talend solution you
are using, this component can be used in one, some or all of the following Job
frameworks: Standard, Spark Batch, and Spark Streaming.

tDataprepRun Standard properties

These properties are used to configure tDataprepRun running in
the Standard Job framework.

The Standard
tDataprepRun component belongs to the
Talend Data Preparation family.

The component in this framework is available when you have subscribed to one of
the Talend Platform products or Talend Data
Fabric.

Basic settings

URL

Type the URL to the Talend Data Preparation web
application, between double quotes.

Username

Type the email address that you use to log in to the Talend Data Preparation web application, between double quotes.

Note: If you are using Talend Data Preparation
Cloud, you must use your Talend Integration Cloud login instead.

Password

Click the […] button and type your user password
for the Talend Data Preparation web application, between
double quotes.

When using the default preparation selection properties:

Preparation

To complete the Preparation field, click one of the
following:

  • Choose an existing
    preparation
    to select from a list of the preparations that were
    previously created in Talend Data Preparation.

  • Or create a new one to
    create a new preparation based on your input data.

[Edit preparation button]

Click this button to edit the preparation in Talend Data Preparation that corresponds to the ID defined in the
Preparation field.

Version

If you have created several versions of your preparation, you
can choose which one you want to use in the Job. To complete the
Version field, click Choose a Version to
select from the list of existing versions, including the current version of the
preparation.

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to
be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Edit schema to make changes to the schema.
If the current schema is of the Repository type, three
options are available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this
    option to change the schema to Built-in for
    local changes.

  • Update repository connection: choose this
    option to change the schema stored in the repository and decide whether to propagate
    the changes to all the Jobs upon completion. If you just want to propagate the
    changes to the current Job, you can select No
    upon completion and choose this schema metadata again in the [Repository Content] window.

Click Sync columns to retrieve the schema from
the previous component connected in the Job.

Fetch Schema

Click this button to retrieve the schema from the preparation defined in the
Preparation field.

When using the Dynamic preparation selection:

Dynamic preparation selection

Select this checkbox to define a preparation path and version using context
variables. The preparation will be dynamically selected at runtime.

Preparation path

Use a context variable to define a preparation path. Paths with or without the
initial / are supported.

Preparation version

Use a context variable to define the version of the preparation to use.
Preparation versions are referenced by their number. For example, to execute
version #2 of a preparation, the expected value is 2.
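
For illustration only, here is a minimal Java sketch of these two documented
behaviors: the leading / of a preparation path is optional, and versions are
referenced by their number. This is not the component's actual code, and the
path and version values are placeholders.

    public class PreparationSelection {
        // "/customers/my_preparation" and "customers/my_preparation"
        // designate the same preparation: the leading "/" is optional.
        static String normalizePath(String path) {
            return path.startsWith("/") ? path.substring(1) : path;
        }

        public static void main(String[] args) {
            System.out.println(normalizePath("/customers/my_preparation"));
            System.out.println(normalizePath("customers/my_preparation"));
            // To run version #2 of a preparation, the expected value is "2".
            String version = "2";
            System.out.println("version " + version);
        }
    }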

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to
be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Edit schema to make changes to the schema.
If the current schema is of the Repository type, three
options are available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this
    option to change the schema to Built-in for
    local changes.

  • Update repository connection: choose this
    option to change the schema stored in the repository and decide whether to propagate
    the changes to all the Jobs upon completion. If you just want to propagate the
    changes to the current Job, you can select No
    upon completion and choose this schema metadata again in the [Repository Content] window.

Click Sync columns to retrieve the schema from
the previous component connected in the Job.

Fetch Schema

Click this button to dynamically retrieve the schema from the preparations
defined by the context variable in the Preparation path field. If
the fetch is successful, any previously configured schema will be overwritten. If the
fetch fails, the current schema is kept.

Advanced settings

Limit Preview

Specify the number of rows to which you want to limit the
preview.

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at the Job
level as well as at each component level.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space to access
the variable list and choose the variable to use from it.

For further information about variables, see the Talend Studio User Guide.
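
For example, once the component has run, the ERROR_MESSAGE variable can be read
from the globalMap, typically in a tJava component placed downstream; the
component name tDataprepRun_1 below is illustrative.

    // As typed in a tJava component, where globalMap is provided by the
    // Talend runtime; "tDataprepRun_1" is the illustrative component name.
    String errorMessage = (String) globalMap.get("tDataprepRun_1_ERROR_MESSAGE");
    if (errorMessage != null) {
        System.err.println("tDataprepRun failed: " + errorMessage);
    }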

Usage

Usage rule

This component is an intermediary step. It requires an input flow as well as an
output.

Limitations

  • If the dataset is updated after the tDataprepRun component has been configured, the
    schema needs to be fetched again.

  • If a context variable was used in the URL of the
    dataset, you cannot use the [Edit preparation button] to edit the preparation
    directly in Talend Data Preparation.

Using the tDataprepRun component to prepare data from a
database in a Talend Job

This scenario applies only to a subscription-based Talend solution.

The tDataprepRun component allows you to reuse an existing
preparation made in Talend Data Preparation,
directly in a data integration Job. In other words, you can operationalize the process of
applying a preparation to input data with the same model.

The following scenario creates a simple Job
that:

  • retrieves a table from a MySQL database that holds some employee-related
    data,
  • applies an existing preparation on this data,
  • outputs the prepared data into an Excel file.
Use_Case_tDataprepRun_1.png

This assumes that a preparation has been created beforehand, on a dataset
with the same schema as your input data for the Job. In this case, the existing preparation is
called datapreprun_scenario. This simple preparation puts the employees'
last names into upper case and isolates the employees with a salary greater than $1,500.

Use_Case_tDataprepRun_2.png

Adding and linking the components

  1. In the Integration perspective of the Studio, create an empty
    Standard Job from the Job Designs node in
    the Repository tree view.
  2. Drop the following components from the Palette onto the design workspace: tMysqlInput, tDataprepRun and
    tFileOutputExcel.
  3. Connect the three components using Row > Main links.

Configuring the components

Retrieving the data from the database

  1. In the design workspace, select tMysqlInput
    and click the Component tab to define its
    basic settings.

    Use_Case_tDataprepRun_3.png

  2. In the Property Type list, select Built-in to set the database connection details
    manually.
  3. In the DB Version list, select the version of
    MySQL you are using, MySQL 5 in this
    example.
  4. In the Host, Port, Database, Username and Password
    fields, enter the MySQL connection details and the user authentication data for
    the database, between double quotes.
  5. In the Table Name field, type the name of the
    table to be read, between double quotes.
  6. In the Query field,
    enter your database query between double quotes. In this example, the query is
    select * from employees to retrieve all of the
    information from the table employees, in the
    test database.
  7. Click Guess schema to automatically retrieve
    the schema from the database table, or enter the schema manually by clicking the
    […] button next to Edit schema.

    Make sure that the schema of the tMysqlInput component matches the schema expected by the
    tDataprepRun component. In other
    words, the input schema must be the same as the dataset upon which the
    preparation was made in the first place.

Accessing the preparation from Talend Data Preparation

  1. In the design workspace, select tDataprepRun
    and click the Component tab to define its
    basic settings.

    Use_Case_tDataprepRun_4.png

  2. In the URL field, type
    the URL of the Talend Data Preparation web application, between double quotes. Port 9999 is the default port for Talend Data Preparation. Illustrative
    connection values are shown after this list.
  3. In the Username and
    Password fields, enter your Talend Data Preparation connection
    information, between double quotes.

    Note: If you are using Talend Data Preparation Cloud, you need to use your Talend Integration Cloud login instead of your
    Talend Data Preparation email.

  4. Click Choose an existing
    preparation
    to display a list of the preparations available in
    Talend Data Preparation, and select
    datapreprun_scenario.

    Use_Case_tDataprepRun_5.png

    This scenario assumes that a preparation with a compatible schema has been
    created beforehand.

  5. Click Fetch Schema to retrieve the schema of
    the preparation.

    The output schema of the tDataprepRun component now
    reflects the changes made with each preparation step. The schema takes into
    account columns that were added or removed for example.
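
As a reference for step 2 and step 3 above, the connection fields could look as
follows; the host and credentials are placeholders, and 9999 is the default
Talend Data Preparation port:

    URL:      "http://dataprep.example.com:9999"
    Username: "jdoe@company.com"
    Password: "my_password"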

Outputting the preparation into an Excel file

  1. In the design workspace, select tFileOutputExcel and click the Component tab to define its basic settings.

    Use_Case_tDataprepRun_6.png

  2. In the File Name field, enter the location
    where you want to save the result of the preparation.
  3. Click Sync columns to
    retrieve the new output schema, inherited from the tDataprepRun
    component.

Saving and executing the Job

  1. Save your Job and press F6 to execute it.
  2. You can now open the Excel file containing the result of the preparation
    applied on your data from the MySQL database.

Using the tDataprepRun component to dynamically select a
preparation at runtime according to the input

This scenario applies only to a subscription-based Talend solution.

The tDataprepRun component allows you to reuse an existing preparation
made in Talend Data Preparation, directly in a data
integration, Spark Batch or Spark Streaming Job. In other words, you can operationalize the
process of applying a preparation to input data with the same model.

By default, the tDataprepRun component retrieves
preparations using their technical id. However, the dynamic preparation selection feature
allows you to call a preparation via its path in Talend Data Preparation. Through the use of the Dynamic
preparation selection
check box and some variables, it is then possible to
dynamically select a preparation at runtime, according to runtime data or metadata.

If you wanted to operationalize preparations in a Talend Job using the regular preparation selection
properties, you would actually need several Jobs: one for each preparation to apply on a
specific dataset. By retrieving the correct preparation according to the input file name, you
will be able to dynamically run more than one preparation on your source data, in a single
Job.

The following scenario creates a Job that:

  • Scans the content of a folder containing several datasets
  • Creates a dynamic path to your CSV files
  • Dynamically retrieves the preparations according to the input file name and
    applies them on the data
  • Outputs the prepared data into a Redshift database
Use_Case_tDataprepRun_dynamic_prep_1.png

In this example, .csv datasets with data from two of your
clients are locally stored in a folder called customers_files. Each of
your clients' datasets has its own naming convention and is stored in a dedicated sub
folder. All the datasets in the customers_files folder have an identical
schema, or data model.

Use_Case_tDataprepRun_dynamic_prep_2.png

A customers folder has also been created in Talend Data Preparation, containing two preparations. These two
distinct preparations are each aimed at cleaning data from your two different customers.

Use_Case_tDataprepRun_dynamic_prep_3.png

The purpose of customer_1_preparation for example is to isolate a
certain type of email addresses, while customer_2_preparation aims at
cleansing invalid values and formatting the data. In this example, the preparation names are
based on the two sub folder names customer_1 and
customer_2, with _preparation as a suffix.

Use_Case_tDataprepRun_dynamic_prep_4.png

Just like the input schema that all four datasets have in common, all of your output data must
also share the same model. For this reason, you cannot have one preparation that modifies the
schema by adding a column for example, while the other does not.

By following this scenario, a single Job will allow you to use the appropriate preparation,
depending on whether the dataset extracted from the local customers_files
folder belongs to customer 1 or customer 2.

Designing the Job

  1. In the Integration perspective of the Studio, create an
    empty Standard, Spark Batch or Spark Streaming Job from the Job
    Designs
    node in the Repository tree
    view.
  2. Drop the following components from the palette onto the
    design workspace: two tFileList, a
    tFileInputDelimited, a tDataprepRun and a
    tRedshiftOutput.
  3. Connect the two tFileList and the
    tFileInputDelimited components using Row > Iterate links.
  4. Connect the tFileInputDelimited,
    tDataprepRun and tRedshiftOutput
    components using Row > Main links.

Configuring the components

Reading the input files from your local folder

  1. In the design workspace, select
    tFileList_1 and click the
    Component tab to define its basic settings.

    This first tFileList will read the
    customers_files folder, and retrieve the path of the
    two sub folders so that they can be reused later.

    Use_Case_tDataprepRun_dynamic_prep_5.png

  2. In the Directory field, enter the path to the
    customers_files folder, containing the customers datasets,
    in their respective sub folders.
  3. Click the + button in the
    Filemask table to add a new line and rename it
    *, between double quotes.
  4. In the design workspace, select tFileList_2 and click
    the Component tab to define its basic settings.

    This second tFileList will read the four
    .csv datasets contained in the two sub folders and
    retrieve their file paths.

    Use_Case_tDataprepRun_dynamic_prep_6.png

  5. To fill the Directory field with the expression that
    will dynamically retrieve the input files paths, drag it from the
    tFileList_1 list of expressions in the
    Outline panel.

    Use_Case_tDataprepRun_dynamic_prep_7.png

  6. Check the Includes subdirectories check box.
  7. Click the + button in the
    Filemask table to add a new line and rename it
    *.csv, between double quotes.
  8. In the design workspace, select the tFileInputDelimited and
    click the Component tab to define its basic
    settings.

    Use_Case_tDataprepRun_dynamic_prep_8.png

  9. To fill the File name/Stream field with the expression
    that will dynamically retrieve the input files paths, drag it from the
    tFileList_2 list of expressions in the
    Outline panel.

    Use_Case_tDataprepRun_dynamic_prep_9.png

  10. Enter the Row Separator and Field
    Separator that correspond to your datasets, between double
    quotes (illustrative values are shown after this list).
  11. Click the Edit schema button to define the columns of
    the source datasets and their data type.

    The schema is the same for all the datasets from the
    customers_files folder. Make sure that this schema
    matches the schema expected by the tDataprepRun
    component. In other words, the input schema must be the same as the datasets
    upon which the preparations were made in the first place.
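
As a reference for step 10, typical separator values, assuming standard
semicolon-delimited CSV files (both values are placeholders to adapt to your
own files):

    Row Separator:   "\n"
    Field Separator: ";"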

Dynamically selecting the preparation from Talend Data Preparation

  1. In the design workspace, select tDataprepRun and click the
    Component tab to define its basic settings.

    Use_Case_tDataprepRun_dynamic_prep_10.png

  2. In the URL field, type the URL of the Talend Data Preparation web application, between
    double quotes. Port 9999 is the default port for Talend Data Preparation.
  3. In the Username and Password
    fields, enter your Talend Data Preparation connection information, between
    double quotes.

    Note: If you are using Talend Data Preparation Cloud, you need to use your Talend Integration Cloud login instead of
    your Talend Data Preparation
    email.

  4. Select the Dynamic preparation selection check box to
    dynamically define the preparations with their paths in Talend Data Preparation rather than their technical
    ids.
  5. In the Preparation path field, enter
    "customers/"+((String)globalMap.get("tFileList_2_CURRENT_FILE"))+"_preparation".

    This expression is made of three distinct parts. In the path,
    customers is the folder in Talend Data Preparation where the preparations are
    kept. As for the preparation names, because they are partly reused from the
    local sub folder names, you will use this expression to retrieve those sub
    folder names from the tFileList_1 iteration and attach the
    _preparation suffix (see the sketch after this list).

  6. In the Preparation version field, type
    current between double quotes, in order to use the
    current version of the preparation.
  7. Click Sync columns to retrieve the schema of the
    previous component.
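
To make the dynamic selection more concrete, here is a minimal runnable sketch
of what the Preparation path expression evaluates to at runtime. In a real Job,
globalMap is provided by Talend and tFileList_2_CURRENT_FILE is filled at each
iteration; the value customer_1 below is illustrative.

    import java.util.HashMap;
    import java.util.Map;

    public class DynamicPreparationPath {
        public static void main(String[] args) {
            // Simulates Talend's globalMap for the sake of the example.
            Map<String, Object> globalMap = new HashMap<>();
            globalMap.put("tFileList_2_CURRENT_FILE", "customer_1");

            // The exact expression entered in the Preparation path field:
            String path = "customers/"
                    + ((String) globalMap.get("tFileList_2_CURRENT_FILE"))
                    + "_preparation";

            System.out.println(path); // customers/customer_1_preparation
        }
    }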

Outputting the result of the preparation into a database

  1. In the design workspace, select tRedshiftOutput and click the
    Component tab to define its basic settings.

    Use_Case_tDataprepRun_dynamic_prep_11.png

  2. In the Property Type list, select
    Built-in to set the database connection details
    manually.
  3. In the Host, Port,
    Database, Schema,
    Username and Password fields, enter the
    Redshift connection details and the user authentication data for the database,
    between double quotes.
  4. To fill the Table field with the expression that will
    dynamically reuse the input file name as the table name, drag it from the
    tFileList_2 list of expressions in the
    Outline panel.

    Use_Case_tDataprepRun_dynamic_prep_12.png

  5. From the Action on table drop-down list, select
    Create table if does not exist.
  6. From the Action on data drop-down list, select
    Insert.
  7. Click Sync columns to retrieve the new output schema,
    inherited from the tDataprepRun component.

Saving and executing the Job

  1. Save your Job.
  2. Press F6 to execute it.

The four datasets have been dynamically prepared with a single Job, using the
preparation dedicated to each of your customers, and the output of those four
preparations has been sent to a Redshift database, in four new tables.

Using the tDataprepRun component to promote a Job
leveraging a preparation across environments

This scenario applies only to a subscription-based Talend solution.

The tDataprepRun component allows you to reuse an existing preparation made in Talend Data Preparation, directly in a data integration, Spark Batch
or Spark Streaming Job. In other words, you can operationalize the process of applying a
preparation to input data with the same model.

A good practice when using Talend Data Preparation is to set up at least two environments to
work with: a development one and a production one, for example. When a preparation is ready on
the development environment, you can use the Import/Export Preparation
feature to promote it to the production environment, which has a different URL. For more
information, see Promoting a preparation across environments.

Use_Case_tDataprepRun_dynamic_env_1.png

Following this logic, you will likely find yourself with a preparation that has the same name
on different environments. However, preparations are not actually identified by their
name, but rather by a technical id, such as
prepid=faf4fe3e-3cec-4550-ae0b-f1ce108f83d5. As a consequence, what you
really have is two distinct preparations, each with its specific id.

If you wanted to operationalize this recipe in a Talend Job using the regular preparation selection
properties, you would actually need two Jobs: one for the preparation on the development
environment, with a specific URL and id, and a second one for the production environment, with
different parameters.

Through the use of the Dynamic preparation selection checkbox and some
context variables, you will be able to use a single Job to run your preparation, regardless of
the environment. Indeed, the dynamic preparation selection relies on the preparation path in
Talend Data Preparation, and not on the preparation id.

You will be able to use a single Job definition to later deploy on your development or
production environment.

The following scenario creates a simple Job that:

  • Receives data from a local CSV file containing customers data
  • Dynamically retrieves an existing preparation based on its path and
    environment
  • Applies the preparation on the input data
  • Outputs the prepared data into a MySQL database.
Use_Case_tDataprepRun_dynamic_env_2.png

In this example, the customers_leads preparation has
been created beforehand in Talend Data Preparation. This
simple preparation was created on a dataset that has the same schema as the CSV file used as
input for this Job, and its purpose is to remove invalid values from your customers data.

Use_Case_tDataprepRun_dynamic_env_3.png

Designing the Job

  1. In the Integration perspective of Talend Studio, create an empty Standard Job
    from the Job Designs node in the
    Repository tree view.
  2. Drop the following components from the Palette onto the design workspace: tFileInputDelimited, tDataprepRun and tMysqlOutput.
  3. Connect the three components using Row > Main links.

Creating contexts for your different environments

For your Job to be usable in different situations, you will create two distinct
contexts.

Talend Studio allows you to create a set of
variables that you can regroup in a context. In this example, you will create a context
called Development and another one called
Production. Each of these contexts will contain the variables to
be used for the tDataprepRun configuration, according to the
target environment.

  1. Click the Contexts tab of your
    Job.

    Use_Case_tDataprepRun_dynamic_env_4.png

  2. Click the + button on the top right of the variables
    table to create a new context.
  3. Select the Default context and click
    Edit… to rename it
    Development.
  4. Click New… to create a new context called
    Production.
  5. Click the + button on the bottom left of the variables
    table to create a new variable called
    parameter_URL.

    The parameter_ prefix is mandatory, in order for the
    variable to be retrievable in Talend Integration Cloud, if you use the Cloud
    version of Talend Data Preparation.

  6. On the newly created parameter_URL line, enter the URLs of your
    development and production instances of Talend Data Preparation in the corresponding
    columns.
  7. Repeat the last two steps in order to create the
    parameter_USERNAME and
    parameter_PASSWORD variables, that will store your
    Talend Data Preparation credentials depending
    on the environment.

These two different contexts will be available at execution time, when deploying your
Job.
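
As an illustration of how one Job definition resolves different values per
context, here is a minimal sketch. In a real Job, Talend generates the context
object from the Contexts tab; the maps below only simulate the two contexts,
and the URLs are placeholders.

    import java.util.HashMap;
    import java.util.Map;

    public class ContextSelection {
        public static void main(String[] args) {
            Map<String, String> development = new HashMap<>();
            development.put("parameter_URL", "http://dataprep-dev.example.com:9999");

            Map<String, String> production = new HashMap<>();
            production.put("parameter_URL", "http://dataprep-prod.example.com:9999");

            // The context is chosen at execution time, when deploying the Job.
            String activeContext = "Production";
            Map<String, String> context =
                    "Production".equals(activeContext) ? production : development;

            System.out.println(context.get("parameter_URL"));
        }
    }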

Configuring the components

Retrieving the data from a CSV file

The input data retrieved by the component must have the same model as the dataset on
which the preparation to operationalize was created.

  1. In the design workspace, select the
    tFileInputDelimited component, and click the
    Component tab to define its basic settings.

    Use_Case_tDataprepRun_dynamic_env_5.png

    Fields marked with a * are mandatory.

  2. In the Property Type list, select
    Built-in.
  3. In the File name/Stream field, enter the path to the
    file containing the input data on which you want to apply the preparation.
  4. Define the characters to be used as Row Separator and
    Field Separator.
  5. Click the Edit schema button to define the columns of the
    source dataset and their data type.

    Make sure that the schema of the tFileInputDelimited
    component matches the schema expected by the
    tDataprepRun component. In other words, the input
    schema must be the same as the dataset upon which the preparation was made
    in the first place.

Dynamically selecting the preparation from Talend Data Preparation

  1. In the design workspace, select tDataprepRun
    and click the Component tab to define its
    basic settings.

    Use_Case_tDataprepRun_dynamic_env_6.png

  2. In the URL field, enter
    context.parameter_URL to reuse one of the values
    previously set for the URL context variable.
  3. In the Username and
    Password fields, respectively enter
    context.parameter_USERNAME and
    context.parameter_PASSWORD to reuse one of the values
    previously set for the USERNAME and PASSWORD
    context variables.
  4. Select the Dynamic preparation selection checkbox to
    define a preparation with its path in Talend Data Preparation rather than its technical
    id.
  5. In the Preparation path field, enter the path to the
    customers_leads preparation you want to apply on the
    .csv file.

    Your preparation must have the same path on your Talend Data Preparation development environment
    and production environment.

  6. In the Preparation version field, type
    current between double quotes, in order to use the
    current version of the preparation.
  7. Click Fetch Schema to retrieve the schema of the
    preparation.

Outputting the result of the preparation into a database

The result of the preparation will be exported to a MySQL database.

  1. In the design workspace, select tMysqlOutput and click the
    Component tab to define its basic
    settings.

    Use_Case_tDataprepRun_dynamic_env_7.png

  2. In the Property Type list, select Built-in to set the
    database connection details manually.
  3. In the DB Version list, select the version of MySQL you are
    using, MySQL 5 in this example.
  4. In the Host, Port,
    Database, Username and
    Password fields, enter the MySQL connection details and the
    user authentication data for the database, between double quotes.
  5. In the Table Name field, type the name of the table to
    write to, between double quotes.
  6. From the Action on table drop-down list, select
    Create table if does not exist.
  7. From the Action on data drop-down list, select
    Insert.
  8. Click Sync columns to retrieve the new output schema,
    inherited from the tDataprepRun component.

Deploying the Job to production

Now that the Job has been designed, you can deploy it to production and use it to
operationalize your preparation on real data.

  1. Save your Job.
  2. To promote your Job to production, export it to Talend Administration Center.
  3. Create an execution task with the production context variables and configure
    your preferred schedule.

    For more information on how to create an execution
    task with the appropriate context variables, see the Working with Job
    execution tasks section of the Talend Administration Center User Guide on https://help.talend.com.

tDataprepRun properties for Apache Spark Batch

These properties are used to configure tDataprepRun running in
the Spark Batch Job framework.

The Spark Batch
tDataprepRun component belongs to the
Talend Data Preparation family.

The component in this framework is available only if you have subscribed to one
of the Talend solutions with Big Data.

Basic settings

URL

Type the URL to the Talend Data Preparation web
application, between double quotes.

Email

Type the email address that you use to log in to the Talend Data Preparation web application, between double quotes.

Note: If you are using Talend Data Preparation
Cloud, you must use your Talend Integration Cloud login instead.

Password

Click the […] button and type your user password
for the Talend Data Preparation web application, between
double quotes.

When using the default preparation selection properties:

Preparation

To complete the Preparation field, click Choose an existing preparation to select from a list of the
preparations that were previously created in Talend Data Preparation.

[Edit preparation button]

Click this button to edit the preparation in Talend Data Preparation that corresponds to the ID defined in the
Preparation field.

Version

If you have created several versions of your preparation, you
can choose which one you want to use in the Job. To complete the
Version field, click Choose a Version to
select from the list of existing versions, including the current version of the
preparation.

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to
be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Edit schema to make changes to the schema.
If the current schema is of the Repository type, three
options are available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this
    option to change the schema to Built-in for
    local changes.

  • Update repository connection: choose this
    option to change the schema stored in the repository and decide whether to propagate
    the changes to all the Jobs upon completion. If you just want to propagate the
    changes to the current Job, you can select No
    upon completion and choose this schema metadata again in the [Repository Content] window.

Click Sync columns to retrieve the schema from
the previous component connected in the Job.

Fetch Schema

Click this button to retrieve the schema from the preparation defined in the
Preparation field.

When using the Dynamic preparation selection:

Dynamic preparation selection

Select this checkbox to define a preparation path and version using context
variables. The preparation will be dynamically selected at runtime.

Preparation path

Use a context variable to define a preparation path. Paths with or without the
initial / are supported.

Preparation version

Use a context variable to define the version of the preparation to use.
Preparation versions are referenced by their number. For example, to execute
version #2 of a preparation, the expected value is 2.

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to
be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Edit schema to make changes to the schema.
If the current schema is of the Repository type, three
options are available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this
    option to change the schema to Built-in for
    local changes.

  • Update repository connection: choose this
    option to change the schema stored in the repository and decide whether to propagate
    the changes to all the Jobs upon completion. If you just want to propagate the
    changes to the current Job, you can select No
    upon completion and choose this schema metadata again in the [Repository Content] window.

Click Sync columns to retrieve the schema from
the previous component connected in the Job.

Fetch Schema

Click this button to dynamically retrieve the schema from the preparations
defined by the context variable in the Preparation path field. If
the fetch is successful, any previously configured schema will be overwritten. If the
fetch fails, the current schema is kept.

Advanced settings

Encoding

Select an encoding mode from this list. You can select Custom from the list to enter an encoding method in the field that
appears.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space to access
the variable list and choose the variable to use from it.

For further information about variables, see the Talend Studio User Guide.

Usage

Usage rule

This component is an intermediary step. It requires an input flow as well as an
output.

Limitations

  • If the dataset is updated after the tDataprepRun component has been configured, the
    schema needs to be fetched again.

  • If a context variable was used in the URL of the
    dataset, you cannot use the [Edit preparation button] to edit the preparation
    directly in Talend Data Preparation.

  • The Make as header and Delete
    row
    functions, as well as any modification of a
    single cell, are ignored by the tDataprepRun
    component. These functions only affect a single row or cell and
    are thus not compatible with a Big Data context. In the list of
    existing preparations to choose from, a warning is displayed
    next to preparations that include incompatible actions.

Using the tDataprepRun component to apply a preparation
to a data sample in an Apache Spark Batch Job

This scenario applies only to a subscription-based Talend solution with Big Data.

The tDataprepRun component allows you to reuse an existing
preparation made in Talend Data Preparation,
directly in a Big Data Job. In other words, you can operationalize the process of applying a
preparation to input data with the same model.

The following scenario creates a simple Job that:

  • reads a small sample of customer data,
  • applies an existing preparation on this data,
  • shows the result of the execution in the console.
Use_Case_tDataprepRun_spark_batch_1.png

This assumes that a preparation has been created beforehand, on a dataset with
the same schema as your input data for the Job. In this case, the existing preparation is
called datapreprun_spark. This simple preparation puts the customers' last
names into upper case and applies a filter to isolate the customers from California, Texas and
Florida.

Use_Case_tDataprepRun_spark_batch_2.png
The sample data reads as follows:

Note: The sample data is created for demonstration purposes only.

Prerequisite: ensure that the Spark
cluster has been properly installed and is running.

Adding and linking the components

  1. In the Integration perspective of the Studio, create an empty
    Spark Batch Job from the Job Designs node
    in the Repository tree view.

    For further information about how to create a Spark Batch Job, see
    Talend Big Data Getting Started Guide.

  2. Drop the following components from the Palette onto the design workspace:
    tHDFSConfiguration, tFixedFlowInput, tDataprepRun and tLogRow.
  3. Connect the last three components using Row > Main links.

Selecting the Spark mode

Depending on the Spark cluster to be used, select a Spark mode for your Job.
  1. Click Run to open its view and then click the
    Spark Configuration tab to display its view
    for configuring the Spark connection.
  2. Select the Use local mode check box to test your Job locally.

    In the local mode, the Studio builds the Spark environment in itself on the fly in order to
    run the Job in it. Each processor of the local machine is used as a Spark
    worker to perform the computations.

    In this mode, your local file system is used; therefore, deactivate the
    configuration components such as tS3Configuration or
    tHDFSConfiguration that provide connection
    information to a remote file system, if you have placed these components
    in your Job.

    You can launch
    your Job without any further configuration.

  3. Clear the Use local mode check box to display the
    list of the available Hadoop distributions and from this list, select
    the distribution corresponding to your Spark cluster to be used.

    If you cannot find the distribution corresponding to yours from this
    drop-down list, this means the distribution you want to connect to is not officially
    supported by Talend. In this situation, you can select Custom, then select the Spark
    version of the cluster to be connected and click the
    [+] button to display the dialog box in which you can
    alternatively:

    1. Select Import from existing
      version
      to import an officially supported distribution as base
      and then add other required jar files which the base distribution does not
      provide.

    2. Select Import from zip to
      import the configuration zip for the custom distribution to be used. This zip
      file should contain the libraries of the different Hadoop/Spark elements and the
      index file of these libraries.

      In Talend Exchange, members of the Talend
      community have shared some ready-for-use configuration zip files
      which you can download from this Hadoop configuration
      list and directly use them in your connection accordingly. However, because of
      the ongoing evolution of the different Hadoop-related projects, you might not be
      able to find the configuration zip corresponding to your distribution from this
      list; then it is recommended to use the Import from
      existing version option to take an existing distribution as base
      to add the jars required by your distribution.

      Note that custom versions are not officially supported by Talend.
      Talend and its community provide you with the opportunity to connect to
      custom versions from the Studio but cannot guarantee that the configuration of
      whichever version you choose will be easy. As such, you should only attempt to
      set up such a connection if you have sufficient Hadoop and Spark experience to
      handle any issues on your own.

    For a step-by-step example about how to connect to a custom
    distribution and share this connection, see Connecting to a custom Hadoop distribution.

Configuring the connection to the file system to be used by Spark

  1. Double-click tHDFSConfiguration to open its
    Component view. Note that tHDFSConfiguration is used because the Spark Yarn client mode is used to run Spark Jobs in this scenario.

    Spark uses this component to connect to the HDFS system to which the jar
    files dependent on the Job are transferred.

  2. In the Version area, select the Hadoop distribution
    you need to connect to and its version.
  3. In the NameNode URI field, enter the location of the
    machine hosting the NameNode service of the cluster. If you are using WebHDFS, the location should be
    webhdfs://masternode:portnumber; if this WebHDFS is secured
    with SSL, the scheme should be swebhdfs and you need to use
    a tLibraryLoad in the Job to load the library required by
    the secured WebHDFS. Illustrative URI values are shown after this list.
  4. In the Username field, enter the
    authentication information used to connect to the HDFS system to be used. Note
    that the user name must be the same as you have put in the Spark configuration tab.
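
As a reference for step 3, NameNode URI values could look as follows; the host
names and ports are placeholders to adapt to your cluster:

    Plain HDFS:       "hdfs://namenode:8020"
    WebHDFS:          "webhdfs://masternode:50070"
    WebHDFS over SSL: "swebhdfs://masternode:50470" (requires tLibraryLoad)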

Configuring the input data and the preparation

Loading the sample data

  1. In the design workspace, select the
    tFixedFlowInput component and click the Component tab to define its basic settings.

    Use_Case_tDataprepRun_spark_batch_3.png

  2. Click the […] button next to Edit schema to open the schema editor.
  3. Click the [+] button to add the schema
    columns as shown in this image.

    Use_Case_tDataprepRun_spark_batch_4.png

    This schema is the same as the dataset originally used to create the
    datapreprun_spark preparation in
    Talend Data Preparation.

  4. Click OK to validate these changes and accept
    the propagation prompted by the pop-up dialog box.
  5. In the Mode area, select the Use Inline Content radio button and paste the
    above-mentioned sample data about customers into the Content field that is displayed.
  6. In the Field separator
    field, enter a semicolon (;).

Accessing the preparation from Talend Data Preparation

  1. In the design workspace, select tDataprepRun
    and click the Component tab to define its
    basic settings.

    Use_Case_tDataprepRun_spark_batch_5.png

  2. In the URL field, type
    the URL of the Talend Data Preparation web application, between double quotes. Port 9999 is the default port for Talend Data Preparation.
  3. In the Username and
    Password fields, enter your Talend Data Preparation connection
    information, between double quotes.

    Note: If you are using Talend Data Preparation Cloud, you must use your Talend Integration Cloud login instead of your
    Talend Data Preparation email.

  4. Click Choose an existing
    preparation
    to display a list of the preparations available in
    Talend Data Preparation, and select
    datapreprun_spark.

    Use_Case_tDataprepRun_spark_batch_6.png

    This scenario assumes that a preparation with a compatible schema has been
    created beforehand.

  5. Click Fetch Schema to retrieve the schema of
    the preparation.

    The output schema of the tDataprepRun component now
    reflects the changes made with each preparation step. The schema takes into
    account columns that were added or removed for example.

Executing the Job

The tLogRow component is used to present the
execution result of the Job.

  1. In the design workspace, select the tLogRow
    component and click the Component tab to define
    its basic settings.
  2. Select the Table radio button to present the
    result in a table.
  3. Save your Job and press F6 to execute it.
  4. You can now check the execution result in the console of the Run view.

    Use_Case_tDataprepRun_spark_batch_7.png

The preparation made in Talend Data Preparation has been applied to the
sample data and only the customers from California, Florida and Texas remain.

For the sake of this example, we used a small data sample, but the Spark
Batch version of the tDataprepRun component can be used with high
volumes of data.

tDataprepRun properties for Apache Spark Streaming

These properties are used to configure tDataprepRun running in
the Spark Streaming Job framework.

The Spark Streaming
tDataprepRun component belongs to
the Talend Data Preparation family.

The component in this framework is available only if you have subscribed to Talend Real-time Big Data Platform or Talend Data
Fabric.

Basic settings

URL

Type the URL to the Talend Data Preparation web
application, between double quotes.

Email

Type the email address that you use to log in to the Talend Data Preparation web application, between double quotes.

Note: If you are using Talend Data Preparation
Cloud, you must use your Talend Integration Cloud login instead.

Password

Click the […] button and type your user password
for the Talend Data Preparation web application, between
double quotes.

When using the default preparation selection properties:

Preparation

To complete the Preparation field, click Choose an existing preparation to select from a list of the
preparations that were previously created in Talend Data Preparation.

[Edit preparation button]

Click this button to edit the preparation in Talend Data Preparation that corresponds to the ID defined in the
Preparation field.

Version

If you have created several versions of your preparation, you
can choose which one you want to use in the Job. To complete the
Version field, click Choose a Version to
select from the list of existing versions, including the current version of the
preparation.

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to
be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Edit schema to make changes to the schema.
If the current schema is of the Repository type, three
options are available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this
    option to change the schema to Built-in for
    local changes.

  • Update repository connection: choose this
    option to change the schema stored in the repository and decide whether to propagate
    the changes to all the Jobs upon completion. If you just want to propagate the
    changes to the current Job, you can select No
    upon completion and choose this schema metadata again in the [Repository Content] window.

Click Sync columns to retrieve the schema from
the previous component connected in the Job.

Fetch Schema

Click this button to retrieve the schema from the preparation defined in the
Preparation field.

When using the Dynamic preparation selection:

Dynamic preparation selection

Select this checkbox to define a preparation path and version using context
variables. The preparation will be dynamically selected at runtime.

Preparation path

Use a context variable to define a preparation path. Paths with or without the
initial / are supported.

Preparation version

Use a context variable to define the version of the preparation to use.
Preparation versions are referenced by their number. For example, to execute
version #2 of a preparation, the expected value is 2.

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to
be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Edit schema to make changes to the schema.
If the current schema is of the Repository type, three
options are available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this
    option to change the schema to Built-in for
    local changes.

  • Update repository connection: choose this
    option to change the schema stored in the repository and decide whether to propagate
    the changes to all the Jobs upon completion. If you just want to propagate the
    changes to the current Job, you can select No
    upon completion and choose this schema metadata again in the [Repository Content] window.

Click Sync columns to retrieve the schema from
the previous component connected in the Job.

Fetch Schema

Click this button to dynamically retrieve the schema from the preparations
defined by the context variable in the Preparation path field. If
the fetch is successful, any previously configured schema will be overwritten. If the
fetch fails, the current schema is kept.

Advanced settings

Encoding

Select an encoding mode from this list. You can select Custom from the list to enter an encoding method in the field that
appears.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space to access
the variable list and choose the variable to use from it.

For further information about variables, see the Talend Studio User Guide.

Usage

Usage rule

This component is an intermediary step. It requires an input flow as well as an
output.

Limitations

  • If the dataset is updated after the
    tDataprepRun component has been configured,
    the schema needs to be fetched again.

  • If a context variable was used in the URL of the
    dataset, you cannot use the [Edit preparation button] to edit the preparation
    directly in Talend Data Preparation.

  • The Make as header and Delete
    row
    functions, as well as any modification of a
    single cell, are ignored by the tDataprepRun
    component. These functions only affect a single row or cell and
    are thus not compatible with a Big Data context. In the list of
    existing preparations to choose from, a warning is displayed
    next to preparations that include incompatible actions.

Using the tDataprepRun component to apply a preparation to a data
sample in an Apache Spark Streaming Job

This scenario applies only to Talend Real-time Big Data Platform or Talend Data Fabric.

The tDataprepRun component allows you to reuse an existing
preparation made in Talend Data Preparation,
directly in a Big Data Job. In other words, you can operationalize the process of applying a
preparation to input data with the same model.

The following scenario creates a simple Job that:

  • reads a small sample of customer data,
  • applies an existing preparation on this data,
  • shows the result of the execution in the console.
Use_Case_tDataprepRun_spark_batch_1.png

This assumes that a preparation has been created beforehand, on a dataset with the same
schema as your input data for the Job. In this case, the existing preparation is called
datapreprun_spark. This simple preparation puts the customers' last names
into upper case and applies a filter to isolate the customers from California, Texas and
Florida.

Use_Case_tDataprepRun_spark_batch_2.png
The sample data reads as follows:

Note: The sample data is created for demonstration purposes only.

Prerequisite: ensure that the Spark
cluster has been properly installed and is running.

Adding and linking the components

  1. In the Integration perspective of the Studio, create an empty
    Spark Streaming Job from the Job Designs node in the
    Repository tree view.

    For further information about how to create a Spark Streaming Job, see
    Talend Real-Time Big Data Getting Started Guide.

  2. Drop the following components from the Palette onto the
    design workspace: tHDFSConfiguration,
    tFixedFlowInput, tDataprepRun and tLogRow.
  3. Connect the last three components using Row > Main links.

Selecting the Spark mode

Depending on the Spark cluster to be used, select a Spark mode for your Job.
  1. Click Run to open its view and then click the
    Spark Configuration tab to display its view
    for configuring the Spark connection.
  2. Select the Use local mode check box to test your Job locally.

    In the local mode, the Studio builds the Spark environment in itself on the fly in order to
    run the Job in it. Each processor of the local machine is used as a Spark
    worker to perform the computations.

    In this mode, your local file system is used; therefore, deactivate the
    configuration components such as tS3Configuration or
    tHDFSConfiguration that provide connection
    information to a remote file system, if you have placed these components
    in your Job.

    You can launch
    your Job without any further configuration.

  3. Clear the Use local mode check box to display the
    list of the available Hadoop distributions and from this list, select
    the distribution corresponding to your Spark cluster to be used.

    If you cannot find the distribution corresponding to yours from this
    drop-down list, this means the distribution you want to connect to is not officially
    supported by Talend. In this situation, you can select Custom, then select the Spark
    version of the cluster to be connected and click the
    [+] button to display the dialog box in which you can
    alternatively:

    1. Select Import from existing
      version
      to import an officially supported distribution as base
      and then add other required jar files which the base distribution does not
      provide.

    2. Select Import from zip to
      import the configuration zip for the custom distribution to be used. This zip
      file should contain the libraries of the different Hadoop/Spark elements and the
      index file of these libraries.

      In Talend Exchange, members of the Talend
      community have shared some ready-for-use configuration zip files
      which you can download from this Hadoop configuration
      list and directly use them in your connection accordingly. However, because of
      the ongoing evolution of the different Hadoop-related projects, you might not be
      able to find the configuration zip corresponding to your distribution from this
      list; then it is recommended to use the Import from
      existing version option to take an existing distribution as base
      to add the jars required by your distribution.

      Note that custom versions are not officially supported by Talend.
      Talend and its community provide you with the opportunity to connect to
      custom versions from the Studio but cannot guarantee that the configuration of
      whichever version you choose will be easy. As such, you should only attempt to
      set up such a connection if you have sufficient Hadoop and Spark experience to
      handle any issues on your own.

    For a step-by-step example about how to connect to a custom
    distribution and share this connection, see Connecting to a custom Hadoop distribution.

Configuring a Spark stream for your Apache Spark streaming Job

Define how often your Spark Job creates and processes micro batches.
  1. In the Batch size field, enter the time
    interval at the end of which the Job reviews the source data to identify changes and
    processes the new micro batches (illustrative values are shown after this list).
  2. If need be, select the Define a streaming
    timeout
    check box and in the field that is displayed, enter the time frame
    at the end of which the streaming Job automatically stops running.
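
For illustration only, assuming both fields take values in milliseconds:

    Batch size:        2000   (a new micro batch is created every 2 seconds)
    Streaming timeout: 300000 (the Job stops itself after 5 minutes)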

Configuring the connection to the file system to be used by Spark

  1. Double-click tHDFSConfiguration to open its
    Component view. Note that tHDFSConfiguration is used because the Spark Yarn client mode is used to run Spark Jobs in this scenario.

    Spark uses this component to connect to the HDFS system to which the jar
    files dependent on the Job are transferred.

  2. In the Version area, select the Hadoop distribution
    you need to connect to and its version.
  3. In the NameNode URI field, enter the location of the
    machine hosting the NameNode service of the cluster. If you are using WebHDFS, the location should be
    webhdfs://masternode:portnumber; if this WebHDFS is secured
    with SSL, the scheme should be swebhdfs and you need to use
    a tLibraryLoad in the Job to load the library required by
    the secured WebHDFS.
  4. In the Username field, enter the
    authentication information used to connect to the HDFS system to be used. Note
    that the user name must be the same as you have put in the Spark configuration tab.

Configuring the input data and the preparation

Loading the sample data

  1. In the design workspace, select the tFixedFlowInput
    component and click the Component tab to define its basic
    settings.

    Use_Case_tDataprepRun_spark_stream_3.png

  2. Click the […] button next to Edit schema to open the schema editor.
  3. Click the [+] button to add the schema
    columns as shown in this image.

    Use_Case_tDataprepRun_spark_batch_4.png

    This schema is the same as the dataset originally used to create the
    datapreprun_spark preparation in
    Talend Data Preparation.

  4. Click OK to validate these changes and accept
    the propagation prompted by the pop-up dialog box.
  5. In the Streaming area, enter your preferred value for
    the Input repetition interval (ms) field. In this
    example, the default value, 5000 is used.
  6. In the Mode area, select the Use Inline Content radio button and paste the
    above-mentioned sample data about customers into the Content field that is displayed.
  7. In the Field separator
    field, enter a semicolon (;).

Accessing the preparation from Talend Data Preparation

  1. In the design workspace, select tDataprepRun and click
    the Component tab to define its basic settings.

    Use_Case_tDataprepRun_spark_batch_5.png

  2. In the URL field, type
    the URL of the Talend Data Preparation web application, between double quotes. Port 9999 is the default port for Talend Data Preparation.
  3. In the Username and
    Password fields, enter your Talend Data Preparation connection
    information, between double quotes.

    Note: If you are using Talend Data Preparation Cloud, you must use your Talend Integration Cloud login instead of your
    Talend Data Preparation email.

  4. Click Choose an existing
    preparation
    to display a list of the preparations available in
    Talend Data Preparation, and select
    datapreprun_spark.

    Use_Case_tDataprepRun_spark_batch_6.png

    This scenario assumes that a preparation with a compatible schema has been
    created beforehand.

  5. Click Fetch Schema to retrieve the schema of
    the preparation.

    The output schema of the tDataprepRun component now
    reflects the changes made with each preparation step. The schema takes into
    account columns that were added or removed for example.

Executing the Job

The tLogRow component is used to present the
execution result of the Job.

  1. In the design workspace, select the tLogRow
    component and click the Component tab to define
    its basic settings.
  2. Select the Table radio button to present the
    result in a table.
  3. Save your Job and press F6 to execute it.
  4. You can now check the execution result in the console of the Run view.

    Use_Case_tDataprepRun_spark_stream_7.png

The preparation made in Talend Data Preparation has been applied to the
sample data and only the customers from California, Florida and Texas remain.

For the sake of this example, we used a small data sample, but the Spark
Streaming version of the tDataprepRun component can be used with high
volumes of data.

