August 17, 2023

tFSFilterColumns – Docs for ESB 5.x

tFSFilterColumns

tFSFilterColumns_icon32.png

Warning

This component will be available in the Palette of the studio on the condition that you have
subscribed to the relevant edition of one of the Talend solutions
with Big Data.

tFSFilterColumns Properties

Component family

FileScale

Note that this component is deprecated.

Function

tFSFilterColumns makes specified
changes to the defined table based on column name mapping. This
component has real-time capabilities for filtering large scale
files. To optimize performance, the component usually sorts data
before processing it.

Purpose

Helps homogenizing schemas in a file either by removing unwanted
columns or adding new ones.

Basic settings

Schema type and Edit
Schema

A schema is a row description. It defines the number of fields to
be processed and passed on to the next component. The schema is
either Built-in or stored remotely
in the Repository.

Click Edit schema to make changes to the schema. If the
current schema is of the Repository type, three options are
available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this option
    to change the schema to Built-in for local
    changes.

  • Update repository connection: choose this option to change
    the schema stored in the repository and decide whether to propagate the changes to
    all the Jobs upon completion. If you just want to propagate the changes to the
    current Job, you can select No upon completion and
    choose this schema metadata again in the [Repository
    Content]
    window.

 

 

Repository: You have already
created the schema and stored it in the Repository. You can reuse it
in various projects and job flowcharts. Related topic: see
Talend Studio
User Guide.

 

 

Built-in: You create and store
the schema locally for this component only. Related topic: see
Talend Studio User
Guide
.

 

Property type

Either Built-in or Repository.

Since version 5.6, both the Built-In mode and the Repository mode are
available in any of the Talend solutions.

 

 

Built-in: No property data stored
centrally.

 

 

Repository: Select the repository
file where Properties are stored. The fields that follow are
pre-filled using the fetched data.

 

Input File Name

Name of the file holding the data you want to filter.

 

Output File Name

Name of the file where you want to write the filtered data.

 

Record separator (char)

Character, string or regular expression to separate records
(lines).

 

Field separator (char)

Character, string or regular expression to separate fields in a
record.

 

Header

Number of records to be skipped in the beginning of the
file.

 

Footer

Number of records to be skipped at the end of the file.

 

Column configuration

Column: List of the columns to be
filtered.

Keep it: Select the check box next
to the column name you want to maintain in the file. If you clear
the check box, the column will be deleted.

Advanced settings

Generate FSLang File

Select this check box to generate the FSLang file corresponding to
your Job and click the three-dot button next to the FSLang File Name field to specify its
path and its name.

 

Assign FileScale Path

Select this check box and then click the three-dot button next to
the FileScale Path field to select
the FileScale program executable file required to execute the
component.

 

Specify Number of Process Child

Select this check box and enter the number of child processes to
use for carrying out the aggregation.

 

Sort results

Select this check box to sort the results.

 

Custom FileScale Parameter (separated by,)

Enter the parameters for any specific operation you want to add to
the FileScale executable call.

 

tStatCatcher Statistics

Select this check box to gather the job processing metadata at a
job level as well as at each component level.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl +
Space
to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio
User Guide.

Usage

This component handles files therefore it does not require input
and output data flows. It is used to filter columns in large scale
files.

Limitation

Limitation is imposed by limits of physical memory and CPU
architectures. For example, total length of processed files cannot
exceed file system limit for LargeFile support (maximum value of 64
signed bits).

Scenario: Combining filtering and sorting processes in a large scale file

Warning

Make sure that you have unzipped and saved locally the FileScale
executable file delivered by


Talend

. You must define the path of this executable file in the
Advanced settings
view of each of the FileScale components.

This scenario describes a Job that uses several FileScale components which are
connected with the FS Combine link. Using this link,
the different FileScale components will process data simultaneously and thus optimize
performance. This scenario uses the following FileScale components:

  • tFSFilterColumns: to filter columns in a table based on column name mapping,

  • tFSFilterRows: to define a row filter on the table to keep only the rows that have
    a value greater than or equal to 5000.0,

  • tFSUnique: to recuperate only one occurrence
    of the name record, in case duplicates are present,

  • tFSSort: to sort data in alphabetical order
    based on the name column.

To design such a scenario, complete the following:

  • Drop the following components from the Palette to the design workspace: tRowGenerator, tFileOutputDelimited, tFSFilterColumns, tFSFilterRows, tFSUnique and
    tFSSort.

Use_Case_FSFilterColumns.png
  • Connect tRowGenerator first to tFileOutputDelimited using a Row > Main link and then to tFSFilterColumns using an OnSubjobOk link.

  • Connect all the FileScale components together using the Row > FS Combine link.

Note

As the FS Combine link joins all data processes
designed in the Job and executes them simultaneously, the sequence of the FileScale
components is of no importance. You can order them the way you like.

Warning

In order for such scenario to execute correctly, you must define the same
input and output files in all the FileScale components.

In this scenario, the tRowGenerator component will
generate data according to the schema you define in the component Basic settings view and send it to the input file. If data generation is
errorless, data in the input file will be processed simultaneously by the four FileScale
components and thus optimize performance.

  • Click tRowGenerator to display its Basic settings view and define the component
    properties.

  • Click the three-dot button next to RowGenerator
    Editor
    to open the component editor where you can define your
    schema.

Use_Case_FSFilterColumns1.png
  • In the upper half of the editor, click the plus button to add the columns you
    want to write in the input file.

  • Define the schema and set the parameters of the columns.

    In this scenario, the input file contains nine columns:
    id, name,
    dateOfBirth, amount and five other
    columns that hold extra information.

  • If required, click the Preview tab in the
    lower half of the editor to display the corresponding view and then click the
    View button to display a sample of the
    generated data.

  • Click OK to validate your schema and close
    the tRowGenerator editor.

  • Click tFileOutputDelimited to display its
    Basic settings view and define the
    component properties.

Use_Case_FSFilterColumns2.png
  • Set the tFileOutputDelimited
    properties.

For this scenario, we have defined two context variables:
filename for the input file name and
folder for the input file path. For more information about
context variables, check Talend Studio User
Guide
.

  • Put your pointer in the File Name field and
    then press Ctrl + Space and select the context
    variable for the file path (context.folder).

  • Enter +"/"+ to separate the file path from the file name and then
    press Ctrl+Space to display the context
    variable list.

  • Select the context variable for the file name
    (context.filename).

  • If required, click the Edit schema button to
    view the schema you defined in the editor or modify it.

  • Click tFSFilterColumns to open its Basic settings view and define the component
    properties.

Use_Case_FSFilterColumns3.png
  • Set schema and property type to Built-In.

  • Click the Edit schema button to display a
    dialog box. Here you can define your column schema. This schema must corresponds
    to the input file schema.

  • Click OK to close the schema dialog
    box.

    The defined column schema displays in the Column
    configuration
    table.

  • Set the input file path and name using the variable contexts you define
    earlier by pressing Ctrl + Space and selecting
    the variables from the list. Separate the two variables by
    +"/"+.

  • Click the three-dot button next to the Output File
    Name
    and browse to the output file where you want to write the
    processed data.

    In this example, we have defined a context variable for the output file the
    same way we did for the input file.

  • Define the record and field separators and then the header and footer of the
    file, if any.

  • In the Column configuration table, select the
    check boxes of the columns you want to write in the output file, id,
    name, dateOfBirth
    and amount in this
    scenario.

  • Click the Advanced settings tab to display
    the advanced settings view.

Use_Case_FSFilterColumns4.png
  • Select the Generate FSLang File check box to
    display the FSLang File Name field and then
    browse to where you want to generate the FSLang file corresponding to your Job.

    In this example, we have defined a context variable for the FSLang output
    file as we did for the input and output files.

  • Select the Assign FileScale Path check box to
    display the FileScale Path field and then
    browse to the FileScale executable file delivered by Talend.

  • Click tFSFilterRows to display the Basic settings view and define the component
    properties.

Use_Case_FSFilterColumns5.png
  • Click the Edit schema button to display the
    schema dialog box. Here you can define your column schema. This schema must
    corresponds to the input file schema.

Use_Case_FSFilterColumns7.png
  • From the panel to the right, delete the output column that do not map to the
    column filter you defined in tFSFilterColumns,
    all extraInfo columns in this scenario.

  • Click OK to close the schema dialog
    box.

  • Define the components properties as you did with
    tFSFilterColumns
    . Use the same input and output files.

  • Select a logical operator from the corresponding list, And in this scenario.

  • In the Conditions table, click the plus
    button to add a line to the table.

  • Click in the line and then select the column on which you want to base your
    row filter.

    In this scenario, we want write out only the rows where the
    amount value is greater than or equal to 5000.0.

  • Click the Advanced settings tab to display
    the advanced settings view.

  • Select the Assign FileScale Path check box to
    display the FileScale Path field and then
    browse to the FileScale executable file delivered by Talend.

  • Click tFSUnique to display the Basic settings view and define the component
    properties.

Use_Case_FSFilterColumns6.png
  • Define the components properties as you did with
    tFSFilterRows
    . Use the same input and output files.

  • In the Unique key table, select the check box
    of the column you want to use as a key attribute.

    In this scenario, we want to recuperate only one occurrence of each name in
    the name column.

  • If you want to take into account upper and lower cases, select the
    corresponding Case Sensitive check box.

  • Click the Advanced settings tab to display
    the advanced settings view.

  • Select the Assign FileScale Path check box to
    display the FileScale Path field and then
    browse to the FileScale executable file delivered by Talend.

  • Click tFSSort to display its Basic settings view and define the component
    properties.

Use_Case_FSFilterColumns8.png
  • Define the components properties as you did with
    tFSUnique
    . Use the same input and output files.

  • In the Criteria table, click the plus button
    to add a line to the table.

  • Click in the Schema column and then select
    the column by which you want to sort data.

  • Click in the Sort type column and select the
    sort type.

  • Click in the Order type column and select the
    order type.

In this scenario, you want to output data alphabetically in ascending order based on
the name column.

  • Click the Advanced settings tab to display
    the advanced settings view.

  • Select the Assign FileScale Path check box to
    display the FileScale Path field and then
    browse to the FileScale executable file delivered by Talend.

  • Save your job and press F6 to execute it.

A progress bar displays in the design workspace to show the completed percentage of
the simultaneous operations. This progress bar will make it evident how the huge input
data is processed at a very high speed.

When the percentage progress bar reaches 100%, the data is written in the output file.
A simple comparison between the input data and the processed data will show us
that:

The output file has only the id, name, dateOfBirth and
amount columns. Only the rows where the
amount value is greater than or equal to 5000.0 are listed.
There is only one occurrence of each of the names and data records are written in an
alphabetical ascending order.

The below two captures show a sample of the large scale data before and after
processing.

Input data:

Use_Case_FSFilterColumns10.png

Output data:

Use_Case_FSFilterColumns9.png

Document get from Talend https://help.talend.com
Thank you for watching.
Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x