tFSFilterColumns

Warning

This component will be available in the Palette of the studio on the condition that you have
subscribed to the relevant edition of one of the Talend solutions
with Big Data.

tFSFilterColumns Properties

Component family	FileScale	Note that this component is deprecated.
Function	tFSFilterColumns makes specified changes to the defined table based on column name mapping. This component has real-time capabilities for filtering large scale files. To optimize performance, the component usually sorts data before processing it.
Purpose	Helps homogenizing schemas in a file either by removing unwanted columns or adding new ones.
Basic settings	Schema type and Edit Schema	A schema is a row description. It defines the number of fields to be processed and passed on to the next component. The schema is either Built-in or stored remotely in the Repository. Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available: View schema: choose this option to view the schema only. Change to built-in property: choose this option to change the schema to Built-in for local changes. Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the [Repository Content] window.
		Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and job flowcharts. Related topic: see Talend Studio User Guide.
		Built-in: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.
	Property type	Either Built-in or Repository. Since version 5.6, both the Built-In mode and the Repository mode are available in any of the Talend solutions.
		Built-in: No property data stored centrally.
		Repository: Select the repository file where Properties are stored. The fields that follow are pre-filled using the fetched data.
	Input File Name	Name of the file holding the data you want to filter.
	Output File Name	Name of the file where you want to write the filtered data.
	Record separator (char)	Character, string or regular expression to separate records (lines).
	Field separator (char)	Character, string or regular expression to separate fields in a record.
	Header	Number of records to be skipped in the beginning of the file.
	Footer	Number of records to be skipped at the end of the file.
	Column configuration	Column: List of the columns to be filtered. Keep it: Select the check box next to the column name you want to maintain in the file. If you clear the check box, the column will be deleted.
Advanced settings	Generate FSLang File	Select this check box to generate the FSLang file corresponding to your Job and click the three-dot button next to the FSLang File Name field to specify its path and its name.
	Assign FileScale Path	Select this check box and then click the three-dot button next to the FileScale Path field to select the FileScale program executable file required to execute the component.
	Specify Number of Process Child	Select this check box and enter the number of child processes to use for carrying out the aggregation.
	Sort results	Select this check box to sort the results.
	Custom FileScale Parameter (separated by,)	Enter the parameters for any specific operation you want to add to the FileScale executable call.
	tStatCatcher Statistics	Select this check box to gather the job processing metadata at a job level as well as at each component level.
Global Variables	ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box. A Flow variable functions during the execution of a component while an After variable functions after the execution of the component. To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it. For further information about variables, see Talend Studio User Guide.
Usage	This component handles files therefore it does not require input and output data flows. It is used to filter columns in large scale files.
Limitation	Limitation is imposed by limits of physical memory and CPU architectures. For example, total length of processed files cannot exceed file system limit for LargeFile support (maximum value of 64 signed bits).

Scenario: Combining filtering and sorting processes in a large scale file

Warning

Make sure that you have unzipped and saved locally the FileScale
executable file delivered by

Talend

. You must define the path of this executable file in the
Advanced settings
view of each of the FileScale components.

This scenario describes a Job that uses several FileScale components which are
connected with the FS Combine link. Using this link,
the different FileScale components will process data simultaneously and thus optimize
performance. This scenario uses the following FileScale components:

tFSFilterColumns: to filter columns in a table based on column name mapping,
tFSFilterRows: to define a row filter on the table to keep only the rows that have
a value greater than or equal to 5000.0,
tFSUnique: to recuperate only one occurrence
of the name record, in case duplicates are present,
tFSSort: to sort data in alphabetical order
based on the name column.

To design such a scenario, complete the following:

Drop the following components from the Palette to the design workspace: tRowGenerator, tFileOutputDelimited, tFSFilterColumns, tFSFilterRows, tFSUnique and
tFSSort.

Connect tRowGenerator first to tFileOutputDelimited using a Row > Main link and then to tFSFilterColumns using an OnSubjobOk link.
Connect all the FileScale components together using the Row > FS Combine link.

Note

As the FS Combine link joins all data processes
designed in the Job and executes them simultaneously, the sequence of the FileScale
components is of no importance. You can order them the way you like.

Warning

In order for such scenario to execute correctly, you must define the same
input and output files in all the FileScale components.

In this scenario, the tRowGenerator component will
generate data according to the schema you define in the component Basic settings view and send it to the input file. If data generation is
errorless, data in the input file will be processed simultaneously by the four FileScale
components and thus optimize performance.

Click tRowGenerator to display its Basic settings view and define the component
properties.
Click the three-dot button next to RowGenerator
Editor to open the component editor where you can define your
schema.

In the upper half of the editor, click the plus button to add the columns you
want to write in the input file.
Define the schema and set the parameters of the columns.

In this scenario, the input file contains nine columns:
id, name,
dateOfBirth, amount and five other
columns that hold extra information.
If required, click the Preview tab in the
lower half of the editor to display the corresponding view and then click the
View button to display a sample of the
generated data.
Click OK to validate your schema and close
the tRowGenerator editor.
Click tFileOutputDelimited to display its
Basic settings view and define the
component properties.

Set the tFileOutputDelimited
properties.

For this scenario, we have defined two context variables:
filename for the input file name and
folder for the input file path. For more information about
context variables, check Talend Studio User
Guide.

Put your pointer in the File Name field and
then press Ctrl + Space and select the context
variable for the file path (context.folder).
Enter +"/"+ to separate the file path from the file name and then
press Ctrl+Space to display the context
variable list.
Select the context variable for the file name
(context.filename).
If required, click the Edit schema button to
view the schema you defined in the editor or modify it.
Click tFSFilterColumns to open its Basic settings view and define the component
properties.

Set schema and property type to Built-In.
Click the Edit schema button to display a
dialog box. Here you can define your column schema. This schema must corresponds
to the input file schema.
Click OK to close the schema dialog
box.

The defined column schema displays in the Column
configuration table.
Set the input file path and name using the variable contexts you define
earlier by pressing Ctrl + Space and selecting
the variables from the list. Separate the two variables by
+"/"+.
Click the three-dot button next to the Output File
Name and browse to the output file where you want to write the
processed data.

In this example, we have defined a context variable for the output file the
same way we did for the input file.
Define the record and field separators and then the header and footer of the
file, if any.
In the Column configuration table, select the
check boxes of the columns you want to write in the output file, id,
name, dateOfBirth and amount in this
scenario.
Click the Advanced settings tab to display
the advanced settings view.

Select the Generate FSLang File check box to
display the FSLang File Name field and then
browse to where you want to generate the FSLang file corresponding to your Job.

In this example, we have defined a context variable for the FSLang output
file as we did for the input and output files.
Select the Assign FileScale Path check box to
display the FileScale Path field and then
browse to the FileScale executable file delivered by Talend.
Click tFSFilterRows to display the Basic settings view and define the component
properties.

Click the Edit schema button to display the
schema dialog box. Here you can define your column schema. This schema must
corresponds to the input file schema.

From the panel to the right, delete the output column that do not map to the
column filter you defined in tFSFilterColumns,
all extraInfo columns in this scenario.
Click OK to close the schema dialog
box.
Define the components properties as you did with
tFSFilterColumns. Use the same input and output files.
Select a logical operator from the corresponding list, And in this scenario.
In the Conditions table, click the plus
button to add a line to the table.
Click in the line and then select the column on which you want to base your
row filter.

In this scenario, we want write out only the rows where the
amount value is greater than or equal to 5000.0.
Click the Advanced settings tab to display
the advanced settings view.
Select the Assign FileScale Path check box to
display the FileScale Path field and then
browse to the FileScale executable file delivered by Talend.
Click tFSUnique to display the Basic settings view and define the component
properties.

Define the components properties as you did with
tFSFilterRows. Use the same input and output files.
In the Unique key table, select the check box
of the column you want to use as a key attribute.

In this scenario, we want to recuperate only one occurrence of each name in
the name column.
If you want to take into account upper and lower cases, select the
corresponding Case Sensitive check box.
Click the Advanced settings tab to display
the advanced settings view.
Select the Assign FileScale Path check box to
display the FileScale Path field and then
browse to the FileScale executable file delivered by Talend.
Click tFSSort to display its Basic settings view and define the component
properties.

Define the components properties as you did with
tFSUnique. Use the same input and output files.
In the Criteria table, click the plus button
to add a line to the table.
Click in the Schema column and then select
the column by which you want to sort data.
Click in the Sort type column and select the
sort type.
Click in the Order type column and select the
order type.

In this scenario, you want to output data alphabetically in ascending order based on
the name column.

Click the Advanced settings tab to display
the advanced settings view.
Select the Assign FileScale Path check box to
display the FileScale Path field and then
browse to the FileScale executable file delivered by Talend.
Save your job and press F6 to execute it.

A progress bar displays in the design workspace to show the completed percentage of
the simultaneous operations. This progress bar will make it evident how the huge input
data is processed at a very high speed.

When the percentage progress bar reaches 100%, the data is written in the output file.
A simple comparison between the input data and the processed data will show us
that:

The output file has only the id, name, dateOfBirth and
amount columns. Only the rows where the
amount value is greater than or equal to 5000.0 are listed.
There is only one occurrence of each of the names and data records are written in an
alphabetical ascending order.

The below two captures show a sample of the large scale data before and after
processing.

Input data:

Output data:

Document get from Talend https://help.talend.com

Thank you for watching.

Docs 5.x

0 Comments

Inline Feedbacks

View all comments

tFSFilterColumns – Docs for ESB 5.x

tFSFilterColumns

Warning

tFSFilterColumns Properties

Scenario: Combining filtering and sorting processes in a large scale file

Warning

Note

Warning

My Website Links

Tags