Using the tDataprepRun component to dynamically select a
preparation at runtime according to the input
This scenario applies only to a subscription-based Talend solution.
The tDataprepRun component allows you to reuse an existing preparation
made in Talend Data Preparation, directly in a data
integration, Spark Batch, or Spark Streaming Job. In other words, you can operationalize the
process of applying a preparation to input data that shares the same model.
By default, the tDataprepRun component retrieves
preparations using their technical ID. However, the dynamic preparation selection feature
allows you to call a preparation via its path in Talend Data Preparation. With the Dynamic
preparation selection check box and a few variables, you can then
dynamically select a preparation at runtime, according to runtime data or metadata.
If you wanted to operationalize preparations in a Talend Job using the regular preparation selection
properties, you would need several Jobs: one for each preparation to apply to a
specific dataset. By retrieving the correct preparation according to the input file name, you
can dynamically run more than one preparation on your source data, in a single
Job.
The following scenario creates a Job that:
- Scans the content of a folder containing several datasets
- Creates a dynamic path to your CSV files
- Dynamically retrieves the preparations according to the input file name and applies them to the data
- Outputs the prepared data into a Redshift database
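Because Talend component fields accept Java expressions, the dynamic preparation path in the steps above can be derived from the file currently being iterated. The following is a minimal sketch of that naming logic in plain Java; the method name preparationPathFor and the /customers base path are assumptions that follow this scenario's folder and naming convention, not part of the tDataprepRun API:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class PreparationPathBuilder {
    // Derives the preparation path from the dataset's parent folder name,
    // following this scenario's convention: <subfolder>_preparation
    // inside the customers folder of Talend Data Preparation.
    static String preparationPathFor(String datasetFile) {
        Path file = Paths.get(datasetFile);
        String subFolder = file.getParent().getFileName().toString(); // e.g. "customer_1"
        return "/customers/" + subFolder + "_preparation";
    }

    public static void main(String[] args) {
        // prints /customers/customer_1_preparation
        System.out.println(preparationPathFor("customers_files/customer_1/data.csv"));
    }
}
```

In the actual Job, the same expression would typically be built from the current file variable exposed by the file-listing component, rather than from a hard-coded string.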
In this example, .csv datasets with data from two of your
clients are stored locally in a folder called customers_files. Each
client's datasets follow a specific naming convention and are stored in a dedicated
subfolder. All the datasets in the customers_files folder share an identical
schema, or data model.
A customers folder has also been created in Talend Data Preparation, containing two preparations. These two
distinct preparations are each aimed at cleaning data from one of your two customers.
The purpose of customer_1_preparation, for example, is to isolate a
certain type of email address, while customer_2_preparation aims at
cleansing invalid values and formatting the data. In this example, the preparation names are
based on the two subfolder names, customer_1 and
customer_2, with _preparation as a suffix.
Just like the input schema that all four datasets have in common, all of your output data must
also share the same model. For this reason, you cannot have one preparation that modifies the
schema, by adding a column for example, while the other does not.
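Since all four datasets must share the same data model, a quick pre-check that compares the header line of every CSV file can catch schema drift before the Job runs. Below is a minimal standalone sketch, assuming comma-separated files whose first line is the header; the class and method names are illustrative, not part of Talend:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class SchemaCheck {
    // Returns true when every .csv file under the folder (recursively)
    // starts with the same header line, i.e. the datasets share one schema.
    static boolean sameHeader(Path folder) throws IOException {
        try (Stream<Path> files = Files.walk(folder)) {
            long distinctHeaders = files
                .filter(p -> p.toString().endsWith(".csv"))
                .map(p -> {
                    try (Stream<String> lines = Files.lines(p)) {
                        return lines.findFirst().orElse("");
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                })
                .distinct()
                .count();
            return distinctHeaders <= 1; // no files, or all headers identical
        }
    }
}
```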
By following this scenario, a single Job allows you to apply the appropriate preparation,
depending on whether the dataset extracted from the local customers_files
folder belongs to customer 1 or customer 2.