tAggregateRow

Receives a flow and aggregates it based on one or more columns.

Each output row provides the aggregation key and the relevant
result of the set operations (min, max, sum, and so on).

tAggregateRow helps to provide a set of metrics based on values or
calculations.

Depending on the Talend
product you are using, this component can be used in one, some or all of the following
Job frameworks: Standard, MapReduce (deprecated), Spark Batch, Spark Streaming, and Storm (deprecated).

tAggregateRow Standard properties

These properties are used to configure tAggregateRow running
in the Standard Job framework.

The Standard
tAggregateRow component belongs to the Processing
family.

The component in this framework is available in all Talend
products
.

Basic settings

Schema and Edit Schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

Click Edit
schema
to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this
    option to view the schema only.

  • Change to built-in property:
    choose this option to change the schema to Built-in for local changes.

  • Update repository connection:
    choose this option to change the schema stored in the repository and decide whether
    to propagate the changes to all the Jobs upon completion. If you just want to
    propagate the changes to the current Job, you can select No upon completion and choose this schema metadata
    again in the Repository Content
    window.

This
component offers the advantage of the dynamic schema feature. This allows you to
retrieve unknown columns from source files or to copy batches of columns from a source
without mapping each column individually. For further information about dynamic schemas,
see the Talend Studio User Guide.

This
dynamic schema feature is designed for retrieving unknown columns of a
table and is recommended for that purpose only; it is not recommended for
creating tables.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Group by

Define the aggregation sets, the values of which will be
used for calculations.

 

Output Column: Select the column
label in the list offered based on the schema structure you defined. You
can add as many output columns as you wish to make more precise
aggregations.

Example: select Country to calculate an average of values for
each country in a list, or select Country and Region if you want to
compare one country's regions with another country's regions.

 

Input Column: Match the input
column label with your output columns, in case the output label of the
aggregation set needs to be different.

Operations

Select the type of operation along with the value to use
for the calculation and the output field.

 

Output Column: Select the
destination field in the list.

 

Function: Select the operator among:

  • count: calculates the number of rows

  • min: selects the minimum value

  • max: selects the maximum value

  • avg: calculates the average

  • sum: calculates the sum

  • first: returns the first value

  • last: returns the last value

  • list: lists values of an aggregation by multiple keys.

  • list (object): lists Java values of an aggregation by multiple keys

  • count (distinct): counts the number of distinct rows

  • standard deviation: calculates the
    variability of a set of values.

  • union (geometry): makes the union of a set of Geometry objects

  • population standard deviation: calculates the
    spread of a data distribution. Use this function if the data to be
    calculated is considered a population on its own. This calculation
    supports 39 decimal places.

  • sample standard deviation: calculates the spread of a data
    distribution. Use this function if the data to be calculated is
    considered a sample from a larger population. This calculation
    supports 39 decimal places.

 

Input column: Select the input
column from which the values are taken to be aggregated.

 

Ignore null values: Select the
check boxes corresponding to the names of the columns for which you want
the NULL value to be ignored.
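
Conceptually, the Group by and Operations settings describe a plain group-by
aggregation: one output row per key, carrying the results of the selected functions.
The following minimal Java sketch (a generic illustration only, with hypothetical
column names; it is not the code generated by the component) shows the same semantics
for sum, avg, min, max and count on a value column grouped by a key column:

    import java.util.*;
    import java.util.stream.*;

    public class AggregateSketch {
        public static void main(String[] args) {
            // Hypothetical input rows: { grouping column, value column }.
            List<Object[]> input = Arrays.asList(
                new Object[]{"FR", 10.0}, new Object[]{"FR", 30.0}, new Object[]{"US", 5.0});

            // "Group by" = first column; "Operations" = sum, avg, min, max, count on the second.
            Map<Object, DoubleSummaryStatistics> stats = input.stream()
                .collect(Collectors.groupingBy(r -> r[0],
                        Collectors.summarizingDouble(r -> (Double) r[1])));

            // One output row per aggregation key, as tAggregateRow produces.
            stats.forEach((key, s) -> System.out.printf(
                "%s sum=%.1f avg=%.1f min=%.1f max=%.1f count=%d%n",
                key, s.getSum(), s.getAverage(), s.getMin(), s.getMax(), s.getCount()));
        }
    }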

Advanced settings

Delimiter (only for list
operation)

Enter the delimiter you want to use to separate the values
returned by the list operation.

Use financial precision, this is the max
precision for “sum” and “avg” operations, checked option heaps more
memory and slower than unchecked.

Select this check box to use financial precision. This
is a maximum precision, but it consumes more memory and slows down
processing.

Warning:

We advise you to use the BigDecimal type for the
output in order to obtain precise results.
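
As an illustration of why BigDecimal is recommended for the output, the short generic
Java sketch below (not Talend-generated code) compares a double sum with a BigDecimal
sum for a value that has no exact binary representation:

    import java.math.BigDecimal;

    public class PrecisionSketch {
        public static void main(String[] args) {
            // Summing 0.1 ten times with double accumulates binary rounding error.
            double d = 0.0;
            for (int i = 0; i < 10; i++) d += 0.1;
            System.out.println(d);          // 0.9999999999999999

            // The same sum with BigDecimal stays exact.
            BigDecimal b = BigDecimal.ZERO;
            for (int i = 0; i < 10; i++) b = b.add(new BigDecimal("0.1"));
            System.out.println(b);          // 1.0
        }
    }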

Check type overflow (slower)

Checks the type of data to ensure that the Job doesn’t
crash.

Check ULP (Unit in the Last Place), ensure
that a value will be incremented or decremented correctly, only
float and double types. (slower)

Select this check box to ensure the most precise results
possible for the Float and Double types.
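
The ULP check addresses the spacing between representable float/double values: when a
value much smaller than one ULP of the running total is added, the increment can be
silently lost. A minimal generic Java illustration (not Talend-generated code):

    public class UlpSketch {
        public static void main(String[] args) {
            double big = 1.0e16;
            // Spacing between representable doubles around 1.0e16 is 2.0.
            System.out.println(Math.ulp(big));
            // Adding 1.0 is below that spacing, so the sum is unchanged: prints true.
            System.out.println(big + 1.0 == big);
        }
    }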

tStatCatcher Statistics

Select this check box to collect log data at the component level. Note that this
check box is not available in the Map/Reduce version of the component.

Global Variables

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl +
Space
to access the variable list and choose the variable to use from it.

For further information about variables, see the Talend Studio User Guide.

Usage

Usage rule

This component handles a flow of data, so it requires
input and output components and is defined as an intermediary step.
tAggregateRow is usually combined
with the tSortRow component.

Aggregating values and sorting data

This example shows you how to use Talend components to aggregate the
students’ comprehensive scores and then sort the aggregated scores based on the student
names.

Creating a Job for aggregating and sorting data

Create a Job that aggregates the students’ comprehensive scores using the
tAggregateRow component, sorts the aggregated data
using the tSortRow component, and finally displays the
aggregated and sorted data on the console.

tAggregateRow_1.png

  1. Create a new Job and add a tFixedFlowInput component, a tAggregateRow component, a tSortRow
    component, and a tLogRow component by typing their
    names in the design workspace or dropping them from the Palette.
  2. Link the tFixedFlowInput component to
    the tAggregateRow component using a Row > Main
    connection.
  3. Do the same to link the tAggregateRow
    component to the tSortRow component, and the
    tSortRow component to the tLogRow component.

Configuring the Job for aggregating and sorting data

Configure the Job to aggregate the students’ comprehensive scores
using the tAggregateRow component and then sort
the aggregated data using the tSortRow
component.

  1. Double-click the tFixedFlowInput
    component to open its Basic settings view.
  2. Click the

    tAggregateRow_2.png

    button next to Edit schema to
    open the schema dialog box and define the schema by adding two columns, name of String type and score of Double type. When done, click OK to save the changes and close the schema dialog box.

  3. In the Mode area, select Use Inline Content (delimited file) and, in the Content field displayed, enter the input data
    for the name and score columns (an illustrative sample is shown after this procedure).

  4. Double-click the tAggregateRow
    component to open its Basic settings view.

    tAggregateRow_3.png

  5. Click the

    tAggregateRow_2.png

    button next to Edit schema to
    open the schema dialog box and define the schema by adding five columns, name of String type, and sum, average, max, and min of
    Double type.

    tAggregateRow_5.png

    When done, click OK to save the changes and close
    the schema dialog box.
  6. Add one row in the Group by table by
    clicking the

    tAggregateRow_6.png

    button below it, and select name from both the Output column
    and Input column position column fields to group
    the input data by the name column.

  7. Add four rows in the Operations table
    and define the operations to be carried out. In this example, the operations are
    sum, average, max, and min. Then select score from all four Input column
    position
    column fields to aggregate the input data based on
    it.
  8. Double-click the tSortRow component to
    open its Basic settings view.

    tAggregateRow_7.png

  9. Add one row in the Criteria table and
    specify the column based on which the sort operation is performed. In this example,
    it is the name column. Then select alpha from the sort num or
    alpha?
    column field and asc from
    the Order asc or desc? column field to sort the
    aggregated data in ascending alphabetical order.
  10. Double-click the tLogRow component to
    open its Basic settings view, and then select
    Table (print values in cells of a table) in the
    Mode area for better readability of the
    result.
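
An illustrative Content value for step 3 (assuming the default semicolon field
separator of tFixedFlowInput and the name/score schema defined in step 2; the
names and scores below are made up for this example) could look like:

    James;85.5
    James;90.0
    Maria;78.0
    Maria;92.5
    Tom;88.0
    Tom;73.5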

Executing the Job to aggregate and sort data

After setting up the Job and configuring the components used in the
Job for aggregating and sorting data, you can then execute the Job and verify the Job
execution result.

  1. Press Ctrl + S to save the Job.
  2. Press F6 to execute the Job.

    tAggregateRow_8.png

As shown above, the students’ comprehensive scores are
aggregated and then sorted in ascending alphabetical order based on the student
names.

Aggregating values based on dynamic schema

Here is an example of using the tAggregateRow component to aggregate some task assignment data in a CSV file
based on a dynamic schema column.

This scenario applies only to subscription-based Talend products.

Creating a Job for aggregating values based on dynamic schema

Create a Job to aggregate some task assignment data in a CSV file
based on a dynamic schema column using the tAggregateRow component, then display the aggregated data on the console and
write it into an output CSV file.

tAggregateRow_9.png

  1. Create a new Job and add a tFileInputDelimited component, a tAggregateRow component, a tLogRow
    component, and a tFileOutputDelimited component by
    typing their names in the design workspace or dropping them from the Palette.
  2. Link the tFileInputDelimited component
    to the tAggregateRow component using a Row > Main
    connection.
  3. Do the same to link the tAggregateRow
    component to the tLogRow component, and the
    tLogRow component to the tFileOutputDelimited component.

Configuring the Job for aggregating values based on dynamic schema

Configure the Job to aggregate some task assignment data in a CSV
file based on a dynamic schema column using the tAggregateRow component.

Then this Job displays the aggregated data on the console
using the tLogRow component and writes it into an
output CSV file using the tFileOutputDelimited
component.

  1. Double-click the tFileInputDelimited
    component to open its Basic settings view.
  2. In the File name/Stream field, specify
    the path to the CSV file that holds the task assignment data, D:/tasks.csv in this example (an illustrative sample of the file content is shown after this procedure).

  3. In the Header field, enter the number
    of rows to be skipped in the beginning of the file, 1 in this example.

    Note that the dynamic schema feature is only supported in the Built-In mode and requires the input file to have a
    header row.
  4. Click the

    tAggregateRow_2.png

    button next to Edit schema to
    open the schema dialog box and define the schema by adding two columns, task of String type and other of Dynamic type. When done, click OK to save the changes and close the schema dialog box.

    Note that the dynamic column must be defined in the last row of the schema. For
    more information about dynamic schema, see the Talend Studio User Guide.
  5. Double-click the tAggregateRow
    component, and on its Basic settings view, click
    the Sync columns button to retrieve the schema from
    the preceding component.

    tAggregateRow_11.png

  6. Add one row in the Group by table by
    clicking the

    tAggregateRow_6.png

    button below it, and select other from both the Output column
    and Input column position column fields to group
    the input data by the other dynamic
    column.

    Note that the dynamic column aggregation can be carried out only for the grouping
    operation.
  7. Add one row in the Operations table
    and define the operation to be carried out. In this example, the operation function
    is list. Then select task from both the Output column
    and Input column position column fields to list the
    entries in the task column in the grouping result.
  8. Double-click the tLogRow component to
    open its Basic settings view, and then select
    Table (print values in cells of a table) in the
    Mode area for better readability of the
    result.
  9. Double-click the tFileOutputDelimited
    component to open its Basic settings view, and in
    the File Name field, specify the path to the CSV
    file into which the aggregated data will be written, D:/tasks_aggregated.csv in this example.
  10. Select the Include Header check box to
    include the header of each column in the CSV file.
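
An illustrative D:/tasks.csv for step 2 (assuming a semicolon-delimited file whose
first column is task and whose remaining columns, for example an assignee column, are
captured by the other dynamic column; the values below are made up for this example)
could look like:

    task;assignee
    Write specification;Alice
    Review specification;Bob
    Implement component;Alice
    Test component;Bob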

Executing the Job to aggregate values based on dynamic schema

After setting up the Job and configuring the components used in the
Job for aggregating the task assignment data based on a dynamic schema column, you can then
execute the Job and verify the Job execution result.

  1. Press Ctrl + S to save the Job.
  2. Press F6 to execute the Job.

    tAggregateRow_13.png

As shown above, the task assignment data is aggregated based on
the other dynamic column, and the aggregated data
is displayed on the console and written into the output CSV file.
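
Conceptually, the list operation used in this scenario groups rows by a key and
concatenates the listed values with the configured delimiter. The following minimal
generic Java sketch (hypothetical data, not Talend-generated code) shows the same
idea:

    import java.util.*;
    import java.util.stream.*;

    public class ListAggregationSketch {
        public static void main(String[] args) {
            // Each entry: { grouping value, task name }.
            List<String[]> rows = Arrays.asList(
                new String[]{"Alice", "Write specification"},
                new String[]{"Bob", "Review specification"},
                new String[]{"Alice", "Implement component"});

            // Group by the first column and list the task values,
            // joined with the delimiter configured in Advanced settings.
            Map<String, String> aggregated = rows.stream()
                .collect(Collectors.groupingBy(r -> r[0],
                        Collectors.mapping(r -> r[1], Collectors.joining(","))));

            aggregated.forEach((k, v) -> System.out.println(k + " -> " + v));
        }
    }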

tAggregateRow MapReduce properties (deprecated)

These properties are used to configure tAggregateRow running in the MapReduce Job framework.

The MapReduce
tAggregateRow component belongs to the Processing family.

The component in this framework is available in all subscription-based Talend products with Big Data
and Talend Data Fabric.

The MapReduce framework is deprecated from Talend 7.3 onwards. Use Talend Jobs for Apache Spark to accomplish your integration tasks.

Basic settings

Schema and Edit
Schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

Click Edit
schema
to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this
    option to view the schema only.

  • Change to built-in property:
    choose this option to change the schema to Built-in for local changes.

  • Update repository connection:
    choose this option to change the schema stored in the repository and decide whether
    to propagate the changes to all the Jobs upon completion. If you just want to
    propagate the changes to the current Job, you can select No upon completion and choose this schema metadata
    again in the Repository Content
    window.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Group by

Define the aggregation sets, the values of which will be used for
calculations.

 

Output Column: Select the column
label in the list offered based on the schema structure you defined.
You can add as many output columns as you wish to make more precise
aggregations.

Example: select Country to calculate an average of values for each
country in a list, or select Country and Region if you want to
compare one country's regions with another country's regions.

 

Input Column: Match the input
column label with your output columns, in case the output label of
the aggregation set needs to be different.

Operations

Select the type of operation along with the value to use for the
calculation and the output field.

 

Output Column: Select the
destination field in the list.

 

Function: Select the operator among:

  • count: calculates the number of rows

  • min: selects the minimum value

  • max: selects the maximum value

  • avg: calculates the average

  • sum: calculates the sum

  • list: lists values of an aggregation by multiple keys.

  • list (object): lists Java values of an aggregation by multiple keys

  • count (distinct): counts the number of distinct rows

  • standard deviation: calculates the
    variability of a set of values.

  • union (geometry): makes the union of a set of Geometry objects

Some functions that are available in a traditional ETL Job, such as first or last, are not available in MapReduce Jobs because these functions do not make sense in a distributed environment.

 

Input column: Select the input
column from which the values are taken to be aggregated.

 

Ignore null values: Select the
check boxes corresponding to the names of the columns for which you
want the NULL value to be ignored.

Advanced settings

Delimiter (only for list operation)

Enter the delimiter you want to use to separate the values
returned by the list operation.

Use financial precision, this is the max precision for
“sum” and “avg” operations, checked option heaps more memory and
slower than unchecked.

Select this check box to use financial precision. This is a maximum
precision, but it consumes more memory and slows down processing.

Warning:

We advise you to use the BigDecimal type for the
output in order to obtain precise results.

Check type overflow (slower)

Checks the type of data to ensure that the Job doesn’t
crash.

Check ULP (Unit in the Last Place), ensure that a value
will be incremented or decremented correctly, only float and
double types. (slower)

Select this check box to ensure the most precise results possible
for the Float and Double types.

Global Variables

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl +
Space
to access the variable list and choose the variable to use from it.

For further information about variables, see the Talend Studio User Guide.

Usage

Usage rule

In a Talend Map/Reduce Job, this component is used as an intermediate
step and the other components used along with it must be Map/Reduce components, too. They
generate native Map/Reduce code that can be executed directly in Hadoop.

For further information about a Talend Map/Reduce Job, see the sections
describing how to create, convert and configure a Talend Map/Reduce Job in the
Talend Open Studio for Big Data Getting Started Guide.

Note that in this documentation, unless otherwise
explicitly stated, a scenario presents only Standard Jobs,
that is to say traditional Talend data integration Jobs, and not Map/Reduce Jobs.

Related scenarios

No scenario is available for the Map/Reduce version of this component yet.

tAggregateRow properties for Apache Spark Batch

These properties are used to configure tAggregateRow running in the Spark Batch Job framework.

The Spark Batch
tAggregateRow component belongs to the Processing family.

The component in this framework is available in all subscription-based Talend products with Big Data
and Talend Data Fabric.

Basic settings

Schema and Edit
Schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

Click Edit
schema
to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this
    option to view the schema only.

  • Change to built-in property:
    choose this option to change the schema to Built-in for local changes.

  • Update repository connection:
    choose this option to change the schema stored in the repository and decide whether
    to propagate the changes to all the Jobs upon completion. If you just want to
    propagate the changes to the current Job, you can select No upon completion and choose this schema metadata
    again in the Repository Content
    window.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Group by

Define the aggregation sets, the values of which will be used for
calculations.

 

Output Column: Select the column
label in the list offered based on the schema structure you defined. You
can add as many output columns as you wish to make more precise
aggregations.

Example: select Country to calculate an average of values for each country
in a list, or select Country and Region if you want to compare one
country's regions with another country's regions.

 

Input Column: Match the input column
label with your output columns, in case the output label of the
aggregation set needs to be different.

Operations

Select the type of operation along with the value to use for the
calculation and the output field.

 

Output Column: Select the destination
field in the list.

 

Function: Select the operator among:

  • count: calculates the number of rows

  • count (distinct): counts the number of distinct rows

  • min: selects the minimum value

  • max: selects the maximum value

  • avg: calculates the average

  • sum: calculates the sum

  • population standard deviation: calculates the
    spread of a data distribution. Use this function if the data to be
    calculated is considered a population on its own. This calculation
    supports 39 decimal places.

  • sample standard deviation: calculates the spread of a data
    distribution. Use this function if the data to be calculated is
    considered a sample from a larger population. This calculation
    supports 39 decimal places.

Some functions that are available in a traditional ETL Job, such as first or last, are not available in Spark Jobs because these functions do not make sense in a distributed environment.

 

Input column: Select the input column
from which the values are taken to be aggregated.

 

Ignore null values: Select the check
boxes corresponding to the names of the columns for which you want the
NULL value to be ignored.

Advanced settings

Use financial precision, this is the max precision for “sum”
and “avg” operations, checked option heaps more memory and slower
than unchecked.

Select this check box to use financial precision. This is a maximum
precision, but it consumes more memory and slows down processing.

Warning:

We advise you to use the BigDecimal type for the output in
order to obtain precise results.

Check type overflow (slower)

Checks the type of data to ensure that the Job doesn’t crash.

Check ULP (Unit in the Last Place), ensure that a value will
be incremented or decremented correctly, only float and double
types. (slower)

Select this check box to ensure the most precise results possible for
the Float and Double types.

Usage

Usage rule

This component is used as an intermediate step.

This component, along with the Spark Batch component Palette it belongs to,
appears only when you are creating a Spark Batch Job.

Note that in this documentation, unless otherwise explicitly stated, a
scenario presents only Standard Jobs, that is to
say traditional Talend data integration Jobs.

Spark Connection

In the Spark
Configuration
tab in the Run
view, define the connection to a given Spark cluster for the whole Job. In
addition, since the Job expects its dependent jar files for execution, you must
specify the directory in the file system to which these jar files are
transferred so that Spark can access these files:

  • Yarn mode (Yarn client or Yarn cluster):

    • When using Google Dataproc, specify a bucket in the
      Google Storage staging bucket
      field in the Spark configuration
      tab.

    • When using HDInsight, specify the blob to be used for Job
      deployment in the Windows Azure Storage
      configuration
      area in the Spark
      configuration
      tab.

    • When using Altus, specify the S3 bucket or the Azure
      Data Lake Storage for Job deployment in the Spark
      configuration
      tab.
    • When using Qubole, add a
      tS3Configuration to your Job to write
      your actual business data in the S3 system with Qubole. Without
      tS3Configuration, this business data is
      written in the Qubole HDFS system and destroyed once you shut
      down your cluster.
    • When using on-premise
      distributions, use the configuration component corresponding
      to the file system your cluster is using. Typically, this
      system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the
    configuration component corresponding to the file system your cluster is
    using, such as tHDFSConfiguration or
    tS3Configuration.

    If you are using Databricks without any configuration component present
    in your Job, your business data is written directly in DBFS (Databricks
    Filesystem).

This connection is effective on a per-Job basis.

Related scenarios

tAggregateRow properties for Apache Spark Streaming

These properties are used to configure tAggregateRow running in the Spark Streaming Job framework.

The Spark Streaming
tAggregateRow component belongs to the Processing family.

This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.

Basic settings

Schema and Edit
Schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

Click Edit
schema
to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this
    option to view the schema only.

  • Change to built-in property:
    choose this option to change the schema to Built-in for local changes.

  • Update repository connection:
    choose this option to change the schema stored in the repository and decide whether
    to propagate the changes to all the Jobs upon completion. If you just want to
    propagate the changes to the current Job, you can select No upon completion and choose this schema metadata
    again in the Repository Content
    window.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Group by

Define the aggregation sets, the values of which will be used for
calculations.

 

Output Column: Select the column
label in the list offered based on the schema structure you defined. You
can add as many output columns as you wish to make more precise
aggregations.

Example: select Country to calculate an average of values for each country
in a list, or select Country and Region if you want to compare one
country's regions with another country's regions.

 

Input Column: Match the input column
label with your output columns, in case the output label of the
aggregation set needs to be different.

Operations

Select the type of operation along with the value to use for the
calculation and the output field.

 

Output Column: Select the destination
field in the list.

 

Function: Select the operator among:

  • count: calculates the number of rows

  • count (distinct): counts the number of distinct rows

  • min: selects the minimum value

  • max: selects the maximum value

  • avg: calculates the average

  • sum: calculates the sum

  • standard deviation: calculates the
    variability of a set of values.

Some functions that are available in a traditional ETL Job, such as first or last, are not available in Spark Jobs because these functions do not make sense in a distributed environment.

 

Input column: Select the input column
from which the values are taken to be aggregated.

 

Ignore null values: Select the check
boxes corresponding to the names of the columns for which you want the
NULL value to be ignored.

Advanced settings

Use financial precision, this is the max precision for “sum”
and “avg” operations, checked option heaps more memory and slower
than unchecked.

Select this check box to use financial precision. This is a maximum
precision, but it consumes more memory and slows down processing.

Warning:

We advise you to use the BigDecimal type for the output in
order to obtain precise results.

Check type overflow (slower)

Checks the type of data to ensure that the Job doesn’t crash.

Check ULP (Unit in the Last Place), ensure that a value will
be incremented or decremented correctly, only float and double
types. (slower)

Select this check box to ensure the most precise results possible for
the Float and Double types.

Usage

Usage rule

This component is used as an intermediate step.

This component, along with the Spark Streaming component Palette it belongs to, appears
only when you are creating a Spark Streaming Job.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents
only Standard Jobs, that is to say traditional Talend data integration Jobs.

Spark Connection

In the Spark
Configuration
tab in the Run
view, define the connection to a given Spark cluster for the whole Job. In
addition, since the Job expects its dependent jar files for execution, you must
specify the directory in the file system to which these jar files are
transferred so that Spark can access these files:

  • Yarn mode (Yarn client or Yarn cluster):

    • When using Google Dataproc, specify a bucket in the
      Google Storage staging bucket
      field in the Spark configuration
      tab.

    • When using HDInsight, specify the blob to be used for Job
      deployment in the Windows Azure Storage
      configuration
      area in the Spark
      configuration
      tab.

    • When using Altus, specify the S3 bucket or the Azure
      Data Lake Storage for Job deployment in the Spark
      configuration
      tab.
    • When using Qubole, add a
      tS3Configuration to your Job to write
      your actual business data in the S3 system with Qubole. Without
      tS3Configuration, this business data is
      written in the Qubole HDFS system and destroyed once you shut
      down your cluster.
    • When using on-premise
      distributions, use the configuration component corresponding
      to the file system your cluster is using. Typically, this
      system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the
    configuration component corresponding to the file system your cluster is
    using, such as tHDFSConfiguration or
    tS3Configuration.

    If you are using Databricks without any configuration component present
    in your Job, your business data is written directly in DBFS (Databricks
    Filesystem).

This connection is effective on a per-Job basis.

Related scenarios

For a related scenario, see Analyzing a Twitter flow in near real-time.

tAggregateRow Storm properties (deprecated)

These properties are used to configure tAggregateRow running in the Storm Job framework.

The Storm
tAggregateRow component belongs to the Processing family.

This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.

The Storm framework is deprecated from Talend 7.1 onwards. Use Talend Jobs for Apache Spark Streaming to accomplish your Streaming related tasks.

Basic settings

Schema and Edit
Schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

Click Edit
schema
to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this
    option to view the schema only.

  • Change to built-in property:
    choose this option to change the schema to Built-in for local changes.

  • Update repository connection:
    choose this option to change the schema stored in the repository and decide whether
    to propagate the changes to all the Jobs upon completion. If you just want to
    propagate the changes to the current Job, you can select No upon completion and choose this schema metadata
    again in the Repository Content
    window.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Group by

Define the aggregation sets, the values of which will be used for
calculations.

 

Output Column: Select the column
label in the list offered based on the schema structure you defined.
You can add as many output columns as you wish to make more precise
aggregations.

Example: select Country to calculate an average of values for each
country in a list, or select Country and Region if you want to
compare one country's regions with another country's regions.

 

Input Column: Match the input
column label with your output columns, in case the output label of
the aggregation set needs to be different.

Operations

Select the type of operation along with the value to use for the
calculation and the output field.

 

Output Column: Select the
destination field in the list.

 

Function: Select the operator
among: count, min, max, avg, sum, first, last, list, list(objects),
count(distinct), standard deviation.

 

Input column: Select the input
column from which the values are taken to be aggregated.

 

Ignore null values: Select the
check boxes corresponding to the names of the columns for which you
want the NULL value to be ignored.

Advanced settings

Delimiter (only for list operation)

Enter the delimiter you want to use to separate the values
returned by the list operation.

Use financial precision, this is the max precision for
“sum” and “avg” operations, checked option heaps more memory and
slower than unchecked.

Select this check box to use financial precision. This is a maximum
precision, but it consumes more memory and slows down processing.

Warning:

We advise you to use the BigDecimal type for the
output in order to obtain precise results.

Check type overflow (slower)

Checks the type of data to ensure that the Job doesn’t
crash.

Check ULP (Unit in the Last Place), ensure that a value
will be incremented or decremented correctly, only float and
double types. (slower)

Select this check box to ensure the most precise results possible
for the Float and Double types.

Usage

Usage rule

If you have subscribed to one of the Talend solutions with Big Data, you can also
use this component as a Storm component. In a Talend Storm Job, this component is used as
an intermediate step and the other components used along with it must be Storm components, too.
They generate native Storm code that can be executed directly in a Storm system.
The Storm version does not support the use of global variables.

You need to use the Storm Configuration tab in the
Run view to define the connection to a given Storm
system for the whole Job.

This connection is effective on a per-Job basis.

For further information about a Talend Storm Job, see the sections
describing how to create and configure a Talend Storm Job in the
Talend Open Studio for Big Data Getting Started Guide.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents
only Standard Jobs, that is to say traditional Talend data integration Jobs.

Related scenarios

No scenario is available for the Storm version of this component
yet.

