tAggregateRow

Receives a flow and aggregates it based on one or more columns.

Each output row provides the aggregation key and the relevant
result of the set operations (min, max, sum, and so on).

tAggregateRow helps to provide a set of metrics based on values or
calculations.

Depending on the Talend
product you are using, this component can be used in one, some or all of the following
Job frameworks: Standard, MapReduce (deprecated), Spark Batch, Spark Streaming, and Storm (deprecated).

tAggregateRow Standard properties

These properties are used to configure tAggregateRow running
in the Standard Job framework.

The Standard
tAggregateRow component belongs to the Processing
family.

The component in this framework is available in all Talend
products
.

Basic settings

Schema and Edit Schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

Click Edit
schema
to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this
    option to view the schema only.

  • Change to built-in property:
    choose this option to change the schema to Built-in for local changes.

  • Update repository connection:
    choose this option to change the schema stored in the repository and decide whether
    to propagate the changes to all the Jobs upon completion. If you just want to
    propagate the changes to the current Job, you can select No upon completion and choose this schema metadata
    again in the Repository Content
    window.

This
component offers the advantage of the dynamic schema feature. This allows you to
retrieve unknown columns from source files or to copy batches of columns from a source
without mapping each column individually. For further information about dynamic schemas,
see the Talend Studio User Guide.

This
dynamic schema feature is designed for retrieving unknown columns of a
table and is recommended for that purpose only; it is not recommended for
creating tables.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Group by

Define the aggregation sets, the values of which will be
used for calculations.

 

Output Column: Select the column
label in the list offered based on the schema structure you defined. You
can add as many output columns as you wish to make more precise
aggregations.

Example: select Country to calculate an average of values for
each country in a list, or select Country and Region if you want to
compare one country's regions with another country's regions.

 

Input Column: Match the input
column label with your output columns, in case the output label of the
aggregation set needs to be different.

Operations

Select the type of operation along with the value to use
for the calculation and the output field.

 

Output Column: Select the
destination field in the list.

 

Function: Select the operator among:

  • count: calculates the number of rows

  • min: selects the minimum value

  • max: selects the maximum value

  • avg: calculates the average

  • sum: calculates the sum

  • first: returns the first value

  • last: returns the last value

  • list: lists values of an aggregation by multiple keys.

  • list (object): lists Java values of an aggregation by multiple keys

  • count (distinct): counts the number of distinct rows

  • standard deviation: calculates the
    variability of a set of values.

  • union (geometry): makes the union of a set of Geometry objects

  • population standard deviation: calculates the
    spread of a data distribution. Use this function if the data to be
    calculated is considered a population on its own. This calculation
    supports 39 decimal places.

  • sample standard deviation: calculates the spread of a data
    distribution. Use this function if the data to be calculated is
    considered a sample from a larger population. This calculation
    supports 39 decimal places.

 

Input column: Select the input
column from which the values are taken to be aggregated.

 

Ignore null values: Select the
check boxes corresponding to the names of the columns for which you want
the NULL value to be ignored.
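
Conceptually, the Group by and Operations settings describe a plain group-by
aggregation: one output row per key, carrying the results of the selected functions.
The following minimal Java sketch (a generic illustration only, with hypothetical
column names; it is not the code generated by the component) shows the same semantics
for sum, avg, min, max and count on a value column grouped by a key column:

    import java.util.*;
    import java.util.stream.*;

    public class AggregateSketch {
        public static void main(String[] args) {
            // Hypothetical input rows: { grouping column, value column }.
            List<Object[]> input = Arrays.asList(
                new Object[]{"FR", 10.0}, new Object[]{"FR", 30.0}, new Object[]{"US", 5.0});

            // "Group by" = first column; "Operations" = sum, avg, min, max, count on the second.
            Map<Object, DoubleSummaryStatistics> stats = input.stream()
                .collect(Collectors.groupingBy(r -> r[0],
                        Collectors.summarizingDouble(r -> (Double) r[1])));

            // One output row per aggregation key, as tAggregateRow produces.
            stats.forEach((key, s) -> System.out.printf(
                "%s sum=%.1f avg=%.1f min=%.1f max=%.1f count=%d%n",
                key, s.getSum(), s.getAverage(), s.getMin(), s.getMax(), s.getCount()));
        }
    }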

Advanced settings

Delimiter (only for list
operation)

Enter the delimiter you want to use to separate the values
returned by the list operation.

Use financial precision, this is the max
precision for “sum” and “avg” operations, checked option heaps more
memory and slower than unchecked.

Select this check box to use financial precision. This
is a maximum precision, but it consumes more memory and slows down
processing.

Warning:

We advise you to use the BigDecimal type for the
output in order to obtain precise results.
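
As an illustration of why BigDecimal is recommended for the output, the short generic
Java sketch below (not Talend-generated code) compares a double sum with a BigDecimal
sum for a value that has no exact binary representation:

    import java.math.BigDecimal;

    public class PrecisionSketch {
        public static void main(String[] args) {
            // Summing 0.1 ten times with double accumulates binary rounding error.
            double d = 0.0;
            for (int i = 0; i < 10; i++) d += 0.1;
            System.out.println(d);          // 0.9999999999999999

            // The same sum with BigDecimal stays exact.
            BigDecimal b = BigDecimal.ZERO;
            for (int i = 0; i < 10; i++) b = b.add(new BigDecimal("0.1"));
            System.out.println(b);          // 1.0
        }
    }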

Check type overflow (slower)

Checks the type of data to ensure that the Job doesn’t
crash.

Check ULP (Unit in the Last Place), ensure
that a value will be incremented or decremented correctly, only
float and double types. (slower)

Select this check box to ensure the most precise results
possible for the Float and Double types.
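
The ULP check addresses the spacing between representable float/double values: when a
value much smaller than one ULP of the running total is added, the increment can be
silently lost. A minimal generic Java illustration (not Talend-generated code):

    public class UlpSketch {
        public static void main(String[] args) {
            double big = 1.0e16;
            // Spacing between representable doubles around 1.0e16 is 2.0.
            System.out.println(Math.ulp(big));
            // Adding 1.0 is below that spacing, so the sum is unchanged: prints true.
            System.out.println(big + 1.0 == big);
        }
    }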

tStatCatcher Statistics

Select this check box to collect log data at the component level. Note that this
check box is not available in the Map/Reduce version of the component.

Global Variables

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl +
Space
to access the variable list and choose the variable to use from it.

For further information about variables, see the Talend Studio User Guide.

Usage

Usage rule

This component handles a flow of data, so it requires
input and output components and is defined as an intermediary step.
tAggregateRow is usually combined
with the tSortRow component.

Aggregating values and sorting data

This example shows you how to use Talend components to aggregate the
students’ comprehensive scores and then sort the aggregated scores based on the student
names.

Creating a Job for aggregating and sorting data

Create a Job that aggregates the students’ comprehensive scores using the
tAggregateRow component, sorts the aggregated data
using the tSortRow component, and finally displays the
aggregated and sorted data on the console.

tAggregateRow_1.png

  1. Create a new Job and add a tFixedFlowInput component, a tAggregateRow component, a tSortRow
    component, and a tLogRow component by typing their
    names in the design workspace or dropping them from the Palette.
  2. Link the tFixedFlowInput component to
    the tAggregateRow component using a Row > Main
    connection.
  3. Do the same to link the tAggregateRow
    component to the tSortRow component, and the
    tSortRow component to the tLogRow component.

Configuring the Job for aggregating and sorting data

Configure the Job to aggregate the students’ comprehensive scores
using the tAggregateRow component and then sort
the aggregated data using the tSortRow
component.

  1. Double-click the tFixedFlowInput
    component to open its Basic settings view.
  2. Click the

    tAggregateRow_2.png

    button next to Edit schema to
    open the schema dialog box and define the schema by adding two columns, name of String type and score of Double type. When done, click OK to save the changes and close the schema dialog box.

  3. In the Mode area, select Use Inline Content (delimited file) and, in the Content field displayed, enter the input data
    for the name and score columns (an illustrative sample is shown after this procedure).

  4. Double-click the tAggregateRow
    component to open its Basic settings view.

    tAggregateRow_3.png

  5. Click the

    tAggregateRow_2.png

    button next to Edit schema to
    open the schema dialog box and define the schema by adding five columns, name of String type, and sum, average, max, and min of
    Double type.

    tAggregateRow_5.png

    When done, click OK to save the changes and close
    the schema dialog box.
  6. Add one row in the Group by table by
    clicking the

    tAggregateRow_6.png

    button below it, and select name from both the Output column
    and Input column position column fields to group
    the input data by the name column.

  7. Add four rows in the Operations table
    and define the operations to be carried out. In this example, the operations are
    sum, average, max, and min. Then select score from all four Input column
    position
    column fields to aggregate the input data based on
    it.
  8. Double-click the tSortRow component to
    open its Basic settings view.

    tAggregateRow_7.png

  9. Add one row in the Criteria table and
    specify the column based on which the sort operation is performed. In this example,
    it is the name column. Then select alpha from the sort num or
    alpha?
    column field and asc from
    the Order asc or desc? column field to sort the
    aggregated data in ascending alphabetical order.
  10. Double-click the tLogRow component to
    open its Basic settings view, and then select
    Table (print values in cells of a table) in the
    Mode area for better readability of the
    result.
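
An illustrative Content value for step 3 (assuming the default semicolon field
separator of tFixedFlowInput and the name/score schema defined in step 2; the
names and scores below are made up for this example) could look like:

    James;85.5
    James;90.0
    Maria;78.0
    Maria;92.5
    Tom;88.0
    Tom;73.5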

Executing the Job to aggregate and sort data

After setting up the Job and configuring the components used in the
Job for aggregating and sorting data, you can then execute the Job and verify the Job
execution result.

  1. Press Ctrl + S to save the Job.
  2. Press F6 to execute the Job.

    tAggregateRow_8.png

As shown above, the students’ comprehensive scores are
aggregated and then sorted in ascending alphabetical order based on the student
names.

Aggregating values based on dynamic schema

Here is an example of using the tAggregateRow component to aggregate some task assignment data in a CSV file
based on a dynamic schema column.

This scenario applies only to subscription-based Talend products.

Creating a Job for aggregating values based on dynamic schema

Create a Job to aggregate some task assignment data in a CSV file
based on a dynamic schema column using the tAggregateRow component, then display the aggregated data on the console and
write it into an output CSV file.

tAggregateRow_9.png

  1. Create a new Job and add a tFileInputDelimited component, a tAggregateRow component, a tLogRow
    component, and a tFileOutputDelimited component by
    typing their names in the design workspace or dropping them from the Palette.
  2. Link the tFileInputDelimited component
    to the tAggregateRow component using a Row > Main
    connection.
  3. Do the same to link the tAggregateRow
    component to the tLogRow component, and the
    tLogRow component to the tFileOutputDelimited component.

Configuring the Job for aggregating values based on dynamic schema

Configure the Job to aggregate some task assignment data in a CSV
file based on a dynamic schema column using the tAggregateRow component.

Then this Job displays the aggregated data on the console
using the tLogRow component and writes it into an
output CSV file using the tFileOutputDelimited
component.

  1. Double-click the tFileInputDelimited
    component to open its Basic settings view.
  2. In the File name/Stream field, specify
    the path to the CSV file that holds the task assignment data, D:/tasks.csv in this example (an illustrative sample of the file content is shown after this procedure).

  3. In the Header field, enter the number
    of rows to be skipped in the beginning of the file, 1 in this example.

    Note that the dynamic schema feature is only supported in the Built-In mode and requires the input file to have a
    header row.
  4. Click the

    tAggregateRow_2.png

    button next to Edit schema to
    open the schema dialog box and define the schema by adding two columns, task of String type and other of Dynamic type. When done, click OK to save the changes and close the schema dialog box.

    Note that the dynamic column must be defined in the last row of the schema. For
    more information about dynamic schema, see the Talend Studio User Guide.
  5. Double-click the tAggregateRow
    component, and on its Basic settings view, click
    the Sync columns button to retrieve the schema from
    the preceding component.

    tAggregateRow_11.png

  6. Add one row in the Group by table by
    clicking the

    tAggregateRow_6.png

    button below it, and select other from both the Output column
    and Input column position column fields to group
    the input data by the other dynamic
    column.

    Note that the dynamic column aggregation can be carried out only for the grouping
    operation.
  7. Add one row in the Operations table
    and define the operation to be carried out. In this example, the operation function
    is list. Then select task from both the Output column
    and Input column position column fields to list the
    entries in the task column in the grouping result.
  8. Double-click the tLogRow component to
    open its Basic settings view, and then select
    Table (print values in cells of a table) in the
    Mode area for better readability of the
    result.
  9. Double-click the tFileOutputDelimited
    component to open its Basic settings view, and in
    the File Name field, specify the path to the CSV
    file into which the aggregated data will be written, D:/tasks_aggregated.csv in this example.
  10. Select the Include Header check box to
    include the header of each column in the CSV file.
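
An illustrative D:/tasks.csv for step 2 (assuming a semicolon-delimited file whose
first column is task and whose remaining columns, for example an assignee column, are
captured by the other dynamic column; the values below are made up for this example)
could look like:

    task;assignee
    Write specification;Alice
    Review specification;Bob
    Implement component;Alice
    Test component;Bob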

Executing the Job to aggregate values based on dynamic schema

After setting up the Job and configuring the components used in the
Job for aggregating the task assignment data based on a dynamic schema column, you can then
execute the Job and verify the Job execution result.

  1. Press Ctrl + S to save the Job.
  2. Press F6 to execute the Job.

    tAggregateRow_13.png

As shown above, the task assignment data is aggregated based on
the other dynamic column, and the aggregated data
is displayed on the console and written into the output CSV file.
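
Conceptually, the list operation used in this scenario groups rows by a key and
concatenates the listed values with the configured delimiter. The following minimal
generic Java sketch (hypothetical data, not Talend-generated code) shows the same
idea:

    import java.util.*;
    import java.util.stream.*;

    public class ListAggregationSketch {
        public static void main(String[] args) {
            // Each entry: { grouping value, task name }.
            List<String[]> rows = Arrays.asList(
                new String[]{"Alice", "Write specification"},
                new String[]{"Bob", "Review specification"},
                new String[]{"Alice", "Implement component"});

            // Group by the first column and list the task values,
            // joined with the delimiter configured in Advanced settings.
            Map<String, String> aggregated = rows.stream()
                .collect(Collectors.groupingBy(r -> r[0],
                        Collectors.mapping(r -> r[1], Collectors.joining(","))));

            aggregated.forEach((k, v) -> System.out.println(k + " -> " + v));
        }
    }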

tAggregateRow MapReduce properties (deprecated)

These properties are used to configure tAggregateRow running in the MapReduce Job framework.

The MapReduce
tAggregateRow component belongs to the Processing family.

The component in this framework is available in all subscription-based Talend products with Big Data
and Talend Data Fabric.

The MapReduce framework is deprecated from Talend 7.3 onwards. Use Talend Jobs for Apache Spark to accomplish your integration tasks.

Basic settings

Schema and Edit
Schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

Click Edit
schema
to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this
    option to view the schema only.

  • Change to built-in property:
    choose this option to change the schema to Built-in for local changes.

  • Update repository connection:
    choose this option to change the schema stored in the repository and decide whether
    to propagate the changes to all the Jobs upon completion. If you just want to
    propagate the changes to the current Job, you can select No upon completion and choose this schema metadata
    again in the Repository Content
    window.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Group by

Define the aggregation sets, the values of which will be used for
calculations.

 

Output Column: Select the column
label in the list offered based on the schema structure you defined.
You can add as many output columns as you wish to make more precise
aggregations.

Example: select Country to calculate an average of values for each
country in a list, or select Country and Region if you want to
compare one country's regions with another country's regions.

 

Input Column: Match the input
column label with your output columns, in case the output label of
the aggregation set needs to be different.

Operations

Select the type of operation along with the value to use for the
calculation and the output field.

 

Output Column: Select the
destination field in the list.

 

Function: Select the operator among:

  • count: calculates the number of rows

  • min: selects the minimum value

  • max: selects the maximum value

  • avg: calculates the average

  • sum: calculates the sum

  • list: lists values of an aggregation by multiple keys.

  • list (object): lists Java values of an aggregation by multiple keys

  • count (distinct): counts the number of distinct rows

  • standard deviation: calculates the
    variability of a set of values.

  • union (geometry): makes the union of a set of Geometry objects

Some functions that are available in a traditional ETL Job, such as first or last, are not available in MapReduce Jobs because these functions do not make sense in a distributed environment.

 

Input column: Select the input
column from which the values are taken to be aggregated.

 

Ignore null values: Select the
check boxes corresponding to the names of the columns for which you
want the NULL value to be ignored.

Advanced settings

Delimiter (only for list operation)

Enter the delimiter you want to use to separate the values
returned by the list operation.

Use financial precision, this is the max precision for
“sum” and “avg” operations, checked option heaps more memory and
slower than unchecked.

Select this check box to use financial precision. This is a maximum
precision, but it consumes more memory and slows down processing.

Warning:

We advise you to use the BigDecimal type for the
output in order to obtain precise results.

Check type overflow (slower)

Checks the type of data to ensure that the Job doesn’t
crash.

Check ULP (Unit in the Last Place), ensure that a value
will be incremented or decremented correctly, only float and
double types. (slower)

Select this check box to ensure the most precise results possible
for the Float and Double types.

Global Variables

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl +
Space
to access the variable list and choose the variable to use from it.

For further information about variables, see the Talend Studio User Guide.

Usage

Usage rule

In a Talend Map/Reduce Job, this component is used as an intermediate
step and the other components used along with it must be Map/Reduce components, too. They
generate native Map/Reduce code that can be executed directly in Hadoop.

For further information about a Talend Map/Reduce Job, see the sections
describing how to create, convert and configure a Talend Map/Reduce Job in the
Talend Open Studio for Big Data Getting Started Guide.

Note that in this documentation, unless otherwise
explicitly stated, a scenario presents only Standard Jobs,
that is to say traditional Talend data integration Jobs, and not Map/Reduce Jobs.

Related scenarios

No scenario is available for the Map/Reduce version of this component yet.

tAggregateRow properties for Apache Spark Batch

These properties are used to configure tAggregateRow running in the Spark Batch Job framework.

The Spark Batch
tAggregateRow component belongs to the Processing family.

The component in this framework is available in all subscription-based Talend products with Big Data
and Talend Data Fabric.

Basic settings

Schema and Edit
Schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

Click Edit
schema
to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this
    option to view the schema only.

  • Change to built-in property:
    choose this option to change the schema to Built-in for local changes.

  • Update repository connection:
    choose this option to change the schema stored in the repository and decide whether
    to propagate the changes to all the Jobs upon completion. If you just want to
    propagate the changes to the current Job, you can select No upon completion and choose this schema metadata
    again in the Repository Content
    window.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Group by

Define the aggregation sets, the values of which will be used for
calculations.

 

Output Column: Select the column
label in the list offered based on the schema structure you defined. You
can add as many output columns as you wish to make more precise
aggregations.

Example: select Country to calculate an average of values for each country
in a list, or select Country and Region if you want to compare one
country's regions with another country's regions.

 

Input Column: Match the input column
label with your output columns, in case the output label of the
aggregation set needs to be different.

Operations

Select the type of operation along with the value to use for the
calculation and the output field.

 

Output Column: Select the destination
field in the list.

 

Function: Select the operator among:

  • count: calculates the number of rows

  • count (distinct): counts the number of distinct rows

  • min: selects the minimum value

  • max: selects the maximum value

  • avg: calculates the average

  • sum: calculates the sum

  • population standard deviation: calculates the
    spread of a data distribution. Use this function if the data to be
    calculated is considered a population on its own. This calculation
    supports 39 decimal places.

  • sample standard deviation: calculates the spread of a data
    distribution. Use this function if the data to be calculated is
    considered a sample from a larger population. This calculation
    supports 39 decimal places.

Some functions that are available in a traditional ETL Job, such as first or last, are not available in Spark Jobs because these functions do not make sense in a distributed environment.

 

Input column: Select the input column
from which the values are taken to be aggregated.

 

Ignore null values: Select the check
boxes corresponding to the names of the columns for which you want the
NULL value to be ignored.

Advanced settings

Use financial precision, this is the max precision for “sum”
and “avg” operations, checked option heaps more memory and slower
than unchecked.

Select this check box to use financial precision. This is a maximum
precision, but it consumes more memory and slows down processing.

Warning:

We advise you to use the BigDecimal type for the output in
order to obtain precise results.

Check type overflow (slower)

Checks the type of data to ensure that the Job doesn’t crash.

Check ULP (Unit in the Last Place), ensure that a value will
be incremented or decremented correctly, only float and double
types. (slower)

Select this check box to ensure the most precise results possible for
the Float and Double types.

Usage

Usage rule

This component is used as an intermediate step.

This component, along with the Spark Batch component Palette it belongs to,
appears only when you are creating a Spark Batch Job.

Note that in this documentation, unless otherwise explicitly stated, a
scenario presents only Standard Jobs, that is to
say traditional Talend data integration Jobs.

Spark Connection

In the Spark
Configuration
tab in the Run
view, define the connection to a given Spark cluster for the whole Job. In
addition, since the Job expects its dependent jar files for execution, you must
specify the directory in the file system to which these jar files are
transferred so that Spark can access these files:

  • Yarn mode (Yarn client or Yarn cluster):

    • When using Google Dataproc, specify a bucket in the
      Google Storage staging bucket
      field in the Spark configuration
      tab.

    • When using HDInsight, specify the blob to be used for Job
      deployment in the Windows Azure Storage
      configuration
      area in the Spark
      configuration
      tab.

    • When using Altus, specify the S3 bucket or the Azure
      Data Lake Storage for Job deployment in the Spark
      configuration
      tab.
    • When using Qubole, add a
      tS3Configuration to your Job to write
      your actual business data in the S3 system with Qubole. Without
      tS3Configuration, this business data is
      written in the Qubole HDFS system and destroyed once you shut
      down your cluster.
    • When using on-premise
      distributions, use the configuration component corresponding
      to the file system your cluster is using. Typically, this
      system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the
    configuration component corresponding to the file system your cluster is
    using, such as tHDFSConfiguration or
    tS3Configuration.

    If you are using Databricks without any configuration component present
    in your Job, your business data is written directly in DBFS (Databricks
    Filesystem).

This connection is effective on a per-Job basis.

Related scenarios

tAggregateRow properties for Apache Spark Streaming

These properties are used to configure tAggregateRow running in the Spark Streaming Job framework.

The Spark Streaming
tAggregateRow component belongs to the Processing family.

This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.

Basic settings

Schema and Edit
Schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

Click Edit
schema
to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this
    option to view the schema only.

  • Change to built-in property:
    choose this option to change the schema to Built-in for local changes.

  • Update repository connection:
    choose this option to change the schema stored in the repository and decide whether
    to propagate the changes to all the Jobs upon completion. If you just want to
    propagate the changes to the current Job, you can select No upon completion and choose this schema metadata
    again in the Repository Content
    window.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Group by

Define the aggregation sets, the values of which will be used for
calculations.

 

Output Column: Select the column
label in the list offered based on the schema structure you defined. You
can add as many output columns as you wish to make more precise
aggregations.

Example: select Country to calculate an average of values for each country
in a list, or select Country and Region if you want to compare one
country's regions with another country's regions.

 

Input Column: Match the input column
label with your output columns, in case the output label of the
aggregation set needs to be different.

Operations

Select the type of operation along with the value to use for the
calculation and the output field.

 

Output Column: Select the destination
field in the list.

 

Function: Select the operator among:

  • count: calculates the number of rows

  • count (distinct): counts the number of distinct rows

  • min: selects the minimum value

  • max: selects the maximum value

  • avg: calculates the average

  • sum: calculates the sum

  • standard deviation: calculates the
    variability of a set of values.

Some functions that are available in a traditional ETL Job, such as first or last, are not available in Spark Jobs because these functions do not make sense in a distributed environment.

 

Input column: Select the input column
from which the values are taken to be aggregated.

 

Ignore null values: Select the check
boxes corresponding to the names of the columns for which you want the
NULL value to be ignored.

Advanced settings

Use financial precision, this is the max precision for “sum”
and “avg” operations, checked option heaps more memory and slower
than unchecked.

Select this check box to use financial precision. This is a maximum
precision, but it consumes more memory and slows down processing.

Warning:

We advise you to use the BigDecimal type for the output in
order to obtain precise results.

Check type overflow (slower)

Checks the type of data to ensure that the Job doesn’t crash.

Check ULP (Unit in the Last Place), ensure that a value will
be incremented or decremented correctly, only float and double
types. (slower)

Select this check box to ensure the most precise results possible for
the Float and Double types.

Usage

Usage rule

This component is used as an intermediate step.

This component, along with the Spark Streaming component Palette it belongs to, appears
only when you are creating a Spark Streaming Job.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents
only Standard Jobs, that is to say traditional Talend data integration Jobs.

Spark Connection

In the Spark
Configuration
tab in the Run
view, define the connection to a given Spark cluster for the whole Job. In
addition, since the Job expects its dependent jar files for execution, you must
specify the directory in the file system to which these jar files are
transferred so that Spark can access these files:

  • Yarn mode (Yarn client or Yarn cluster):

    • When using Google Dataproc, specify a bucket in the
      Google Storage staging bucket
      field in the Spark configuration
      tab.

    • When using HDInsight, specify the blob to be used for Job
      deployment in the Windows Azure Storage
      configuration
      area in the Spark
      configuration
      tab.

    • When using Altus, specify the S3 bucket or the Azure
      Data Lake Storage for Job deployment in the Spark
      configuration
      tab.
    • When using Qubole, add a
      tS3Configuration to your Job to write
      your actual business data in the S3 system with Qubole. Without
      tS3Configuration, this business data is
      written in the Qubole HDFS system and destroyed once you shut
      down your cluster.
    • When using on-premise
      distributions, use the configuration component corresponding
      to the file system your cluster is using. Typically, this
      system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the
    configuration component corresponding to the file system your cluster is
    using, such as tHDFSConfiguration or
    tS3Configuration.

    If you are using Databricks without any configuration component present
    in your Job, your business data is written directly in DBFS (Databricks
    Filesystem).

This connection is effective on a per-Job basis.

Related scenarios

For a related scenario, see Analyzing a Twitter flow in near real-time.

tAggregateRow Storm properties (deprecated)

These properties are used to configure tAggregateRow running in the Storm Job framework.

The Storm
tAggregateRow component belongs to the Processing family.

This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.

The Storm framework is deprecated from Talend 7.1 onwards. Use Talend Jobs for Apache Spark Streaming to accomplish your Streaming related tasks.

Basic settings

Schema and Edit
Schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

Click Edit
schema
to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this
    option to view the schema only.

  • Change to built-in property:
    choose this option to change the schema to Built-in for local changes.

  • Update repository connection:
    choose this option to change the schema stored in the repository and decide whether
    to propagate the changes to all the Jobs upon completion. If you just want to
    propagate the changes to the current Job, you can select No upon completion and choose this schema metadata
    again in the Repository Content
    window.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Group by

Define the aggregation sets, the values of which will be used for
calculations.

 

Output Column: Select the column
label in the list offered based on the schema structure you defined.
You can add as many output columns as you wish to make more precise
aggregations.

Example: select Country to calculate an average of values for each
country in a list, or select Country and Region if you want to
compare one country's regions with another country's regions.

 

Input Column: Match the input
column label with your output columns, in case the output label of
the aggregation set needs to be different.

Operations

Select the type of operation along with the value to use for the
calculation and the output field.

 

Output Column: Select the
destination field in the list.

 

Function: Select the operator
among: count, min, max, avg, sum, first, last, list, list(objects),
count(distinct), standard deviation.

 

Input column: Select the input
column from which the values are taken to be aggregated.

 

Ignore null values: Select the
check boxes corresponding to the names of the columns for which you
want the NULL value to be ignored.

Advanced settings

Delimiter (only for list operation)

Enter the delimiter you want to use to separate the values
returned by the list operation.

Use financial precision, this is the max precision for
“sum” and “avg” operations, checked option heaps more memory and
slower than unchecked.

Select this check box to use financial precision. This is a maximum
precision, but it consumes more memory and slows down processing.

Warning:

We advise you to use the BigDecimal type for the
output in order to obtain precise results.

Check type overflow (slower)

Checks the type of data to ensure that the Job doesn’t
crash.

Check ULP (Unit in the Last Place), ensure that a value
will be incremented or decremented correctly, only float and
double types. (slower)

Select this check box to ensure the most precise results possible
for the Float and Double types.

Usage

Usage rule

If you have subscribed to one of the Talend solutions with Big Data, you can also
use this component as a Storm component. In a Talend Storm Job, this component is used as
an intermediate step and the other components used along with it must be Storm components, too.
They generate native Storm code that can be executed directly in a Storm system.
The Storm version does not support the use of global variables.

You need to use the Storm Configuration tab in the
Run view to define the connection to a given Storm
system for the whole Job.

This connection is effective on a per-Job basis.

For further information about a Talend Storm Job, see the sections
describing how to create and configure a Talend Storm Job in the
Talend Open Studio for Big Data Getting Started Guide.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents
only Standard Jobs, that is to say traditional Talend data integration Jobs.

Related scenarios

No scenario is available for the Storm version of this component
yet.

