August 17, 2023

tFSAggregate – Docs for ESB 5.x

tFSAggregate


Warning

This component is available in the Palette of the Studio only if you have
subscribed to the relevant edition of one of the Talend solutions
with Big Data.

tFSAggregate properties

Component family

FileScale

Note that this component is deprecated.

Function

tFSAggregate performs an
aggregation based on one or more columns. For each output line, an
aggregation key is provided, as well as the result of the
corresponding aggregation operation (count, avg or sum).

This component has high speed capabilities for aggregating large
scale files.

Purpose

tFSAggregate helps to establish
metrics and statistics based on values or calculations from one or
several large scale files.

Basic settings

Schema type and Edit Schema

A schema is a row description: it defines the number of fields to
be processed and passed on to the next component. The schema is
either Built-in or stored remotely in the Repository.

Click Edit schema to make changes to the schema. If the
current schema is of the Repository type, three options are
available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this option
    to change the schema to Built-in for local
    changes.

  • Update repository connection: choose this option to change
    the schema stored in the repository and decide whether to propagate the changes to
    all the Jobs upon completion. If you just want to propagate the changes to the
    current Job, you can select No upon completion and
    choose this schema metadata again in the [Repository Content] window.

 

 

Repository: You have already
created the schema and stored it in the Repository. You can reuse it
in various projects and job flowcharts. Related topic: see
Talend Studio
User Guide.

 

 

Built-in: You create and store
the schema locally for this component only. Related topic: see
Talend Studio
User Guide.

 

Property type

Either Built-in or Repository.

Since version 5.6, both the Built-In mode and the Repository mode are
available in any of the Talend solutions.

 

 

Built-in: No property data stored
centrally.

 

 

Repository: Select the repository
file where Properties are stored. The fields that follow are
pre-filled using the fetched data.

 

Input File Name

Name of the file holding the data you want to collect.

 

Output File Name

Name of the file where you want to write the collected
data.

 

Record separator (char)

Character, string or regular expression to separate records
(lines).

 

Field separator (char)

Character, string or regular expression to separate fields in a
record.

 

Header

Number of records to be skipped at the beginning of the
file.

 

Footer

Number of records to be skipped at the end of the file.
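
The four parsing settings above (record separator, field separator, header and footer) can be pictured with a small Python sketch. It only illustrates the documented behavior and is not FileScale's implementation: the function name split_records, the sample data and the use of plain-string separators are assumptions made for this example.

```python
def split_records(text, record_sep="\n", field_sep=";", header=0, footer=0):
    """Split raw text into records and fields, skipping header and footer records."""
    # Drop empty records, such as the one left by a final record separator.
    records = [r for r in text.split(record_sep) if r != ""]
    # Skip `header` records at the start and `footer` records at the end.
    end = len(records) - footer
    body = records[header:end]
    return [record.split(field_sep) for record in body]


# Made-up sample: one header record, two data records, one footer record.
sample = "id;idState\n1;CA\n2;NY\nend of file\n"
rows = split_records(sample, header=1, footer=1)
print(rows)  # [['1', 'CA'], ['2', 'NY']]
```

With header=1 and footer=1, only the two data records remain, each split into its fields.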

 

Group by

Column: List of the columns of
the input file.

Key Attribute: Select the check box
next to the name of the column(s) you want to use to regroup
data.

 

Operations

Additional Output Column: Enter
the name you want to use for the column containing the results of
the aggregation operation.

 

 

Function: Select the type of the
aggregation operation to perform:

count: calculates the number of
lines,

avg: calculates the average,

sum: calculates the sum.

 

 

Input Column: Select the input
column from which the values are collected for the aggregation
operation.
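
As a rough illustration of how one row of the Operations table maps to a result, here is a minimal Python sketch. The names FUNCTIONS and apply_operations and the sample rows are hypothetical; the actual computation is carried out by the FileScale executable.

```python
# The three aggregation functions listed above, applied to a list of numbers.
FUNCTIONS = {
    "count": lambda values: len(values),
    "sum": lambda values: sum(values),
    "avg": lambda values: sum(values) / len(values),
}


def apply_operations(group_rows, operations):
    """Compute one value per (output column, function, input column) triple."""
    result = {}
    for output_column, function, input_column in operations:
        values = [row[input_column] for row in group_rows]
        result[output_column] = FUNCTIONS[function](values)
    return result


# One group of rows that share the same aggregation key (made-up data).
group = [{"id": 1, "Sum1": 10.0}, {"id": 2, "Sum1": 30.0}]
operations = [("count", "count", "id"), ("total", "sum", "Sum1"), ("avg", "avg", "Sum1")]
print(apply_operations(group, operations))
# {'count': 2, 'total': 40.0, 'avg': 20.0}
```

Each triple mirrors one line of the table: the Additional Output Column, the Function and the Input Column.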

Advanced settings

Generate FSLang File

Select this check box to generate the FSLang file corresponding to
your Job and click the three-dot button next to the FSLang File Name field to specify its
path and its name.

 

Assign FileScale Path

Select this check box and then click the three-dot button next to
the FileScale Path field to select
the FileScale executable file required to execute the
component.

 

Specify Number of Process Child

Select this check box and enter the number of child processes to
use for carrying out the aggregation.

 

Sort results

Select this check box to sort the results.

 

Custom FileScale Parameter (separated by ,)

Enter the parameters for any specific operation you want to add to
the FileScale executable call.

 

tStatCatcher Statistics

Select this check box to gather the job processing metadata at a
job level as well as at each component level.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space
to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio
User Guide.

Usage

This component handles files directly, so it does not require input
and output data flows. It is used to aggregate data from large scale
files.

Limitation

Limitations are imposed by the limits of physical memory and of the Central
Processing Unit architecture. For example, the total length of
processed files cannot exceed the file system limit for LargeFile
support (the maximum value of a signed 64-bit integer).
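
For reference, the largest value a signed 64-bit integer can hold is 2^63 - 1. The short Python check below computes it. The constant name is an assumption made for this example, and the unit of the limit (presumably bytes) is not stated in this document.

```python
# Largest value representable as a signed 64-bit integer.
MAX_SIGNED_64 = 2**63 - 1
print(MAX_SIGNED_64)  # 9223372036854775807
```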

Scenario: Aggregating data from a large file

Warning

Make sure that you have unzipped the FileScale executable file delivered by
Talend and saved it locally.
You must define the path of this executable file in the
Advanced settings view of tFSAggregate.

This scenario describes a Job that uses the tFSAggregate
component to aggregate a very large set of customer data at high speed. The data is
grouped by the State in which each customer is based, and the Job calculates the number
of customers and the average of their incomes for each State.

In this scenario, we have already stored the input schemas of the large input file in
the Repository. For more information about storing schema metadata in the Repository,
see Talend Studio
User Guide.

The input file contains nine columns: id,
CustomerName, CustomerAddress,
idState, id2,
RegTime, RegisterTime,
Sum1 and Sum2.

  • In the Repository tree view, expand Metadata and the file node where you have stored the
    input schemas and drop the relevant metadata onto the design workspace.

    The [Components] dialog box displays.

  • Select tFSAggregate from the list and click
    OK to close the dialog box.

    The tFSAggregate component displays in the
    design workspace.

  • Double-click tFSAggregate to display its
    Basic settings view.


All tFSAggregate property fields are filled in automatically.
If you did not store your input schemas in the Repository, fill in
the details manually after selecting Built-in in the
Schema Type and Property Type fields.

  • In the Output File Name field, browse to the
    output file you want to write the aggregated data in.

  • In the Group by table, select the check
    box(es) next to the column name(s) you want to use to regroup the data. You can
    select multiple columns as the aggregation set if you want to regroup data based on
    multiple criteria. For this scenario, we want to use the
    idState column to regroup the data.

  • In the Operations table, click the plus
    button to add two columns that will hold the results of the aggregation
    operation.

  • In the first line of the Additional Output Column
    list, enter a name for the first additional output column,
    count in this scenario.

  • Click in the first line of the Function list
    and select the aggregation operation you want to perform, count in this scenario.

  • Click in the first line of the Input Column
    list and select the column from which the input values are to be taken,
    id in this scenario.

Thus, the column count will be added to the output file and will
contain the number of customer id values counted for each
State.

  • In the second line of the Additional Output Column
    list, enter a name for the second additional output
    column, avg in this scenario.

  • Click in the second line of the Function list
    and select the aggregation operation you want to perform, avg in this scenario.

  • Click in the second line of the Input Column
    list and select the column from which the input values are to be taken,
    Sum1 in this scenario.

Thus, the column avg will be added to the output file and will
contain the average of the Sum1 values of the customers in each State.

  • Click the Advanced settings tab, select the Assign
    FileScale Path check box to display the FileScale Path field,
    and then browse to the executable file delivered by Talend.

  • Save your Job and press F6 to execute it.


A progress bar displays below the tFSAggregate
component in the design workspace and shows the percentage of the aggregation
operation completed so far. It lets you follow how quickly the large input data
is aggregated.

When the progress bar reaches 100%, the data has been regrouped and the results
written to the two newly defined output columns.
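
To make the expected result concrete, the following Python sketch reproduces the same grouping on a few made-up records. It only mimics the logic of this scenario (group by idState, count the id values, average Sum1) and says nothing about how FileScale works internally. The sample values, the ; separator, the reduced three-column layout and the function name aggregate_by_state are all assumptions.

```python
from collections import defaultdict


def aggregate_by_state(lines, field_sep=";"):
    """Group records by idState and return (idState, count, avg) tuples."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        # Hypothetical reduced layout: id, idState, Sum1.
        _id, id_state, sum1 = line.split(field_sep)
        counts[id_state] += 1          # number of id values per State
        sums[id_state] += float(sum1)  # running total used for the average
    # Keys are sorted only to make the output deterministic.
    return [(s, counts[s], sums[s] / counts[s]) for s in sorted(counts)]


records = ["1;CA;100.0", "2;NY;50.0", "3;CA;300.0"]
for state, count, avg in aggregate_by_state(records):
    print(f"{state};{count};{avg}")
# CA;2;200.0
# NY;1;50.0
```

Each output line carries the aggregation key followed by the two additional columns, count and avg.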


Document sourced from Talend: https://help.talend.com