tUniqRow

Ensures data quality of input or output flow in a Job.

Compares entries and sorts out duplicate entries from the input flow.

Depending on the Talend solution you
are using, this component can be used in one, some or all of the following Job
frameworks:

Standard: see tUniqRow Standard properties.

The component in this framework is generally available.
MapReduce: see tUniqRow MapReduce properties.

The component in this framework is available only if you have subscribed to one
of the
Talend
solutions with Big Data.
Spark Batch: see tUniqRow properties for Apache Spark Batch.

The component in this framework is available only if you have subscribed to one
of the
Talend
solutions with Big Data.
Spark Streaming: see tUniqRow properties for Apache Spark Streaming.

The component in this framework is available only if you have subscribed to Talend Real-time Big Data Platform or Talend Data
Fabric.

tUniqRow Standard properties

These properties are used to configure tUniqRow running in the Standard Job framework.

The Standard
tUniqRow component belongs to the Data Quality family.

The component in this framework is generally available.

Basic settings

Schema and Edit schema	A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository. Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available: View schema: choose this option to view the schema only. Change to built-in property: choose this option to change the schema to Built-in for local changes. Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the [Repository Content] window. This component offers the advantage of the dynamic schema feature. This allows you to retrieve unknown columns from source files or to copy batches of columns from a source without mapping each column individually. For further information about dynamic schemas, see Talend Studio User Guide. This dynamic schema feature is designed for the purpose of retrieving unknown columns of a table and is recommended to be used for this purpose only; it is not recommended for the use of creating tables.
	Built-In: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.
	Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see Talend Studio User Guide.
Unique key	In this area, select one or more columns to carry out deduplication on the particular column(s) – Select the Key attribute check box to carry out deduplication on all the columns – Select the Case sensitive check box to differentiate upper case and lower case

Advanced settings

Only once each duplicated key	Select this check box if you want to have only the first duplicated entry in the column(s) defined as key(s) sent to the output flow for duplicates.
Use of disk (suitable for processing large row set)	Select this check box to enable generating temporary files on the hard disk when processing a large amount of data. This helps to prevent Job execution failure caused by memory overflow. With this check box selected, you need also to define: – Buffer size in memory: Select the number of rows that can be buffered in the memory before a temporary file is to be generated on the hard disk. – Directory for temp files: Set the location where the temporary files should be stored. Warning: Make sure that you specify an existing directory for temporary files; otherwise your Job execution will fail.
Ignore trailing zeros for BigDecimal	Select this check box to ignore trailing zeros for BigDecimal data.
tStatCatcher Statistics	Select this check box to gather the job processing metadata at a job level as well as at each component level.

Global Variables

Global Variables	NB_UNIQUES: the number of unique rows. This is an After variable and it returns an integer. NB_DUPLICATES: the number of duplicate rows. This is an After variable and it returns an integer. ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box. A Flow variable functions during the execution of a component while an After variable functions after the execution of the component. To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it. For further information about variables, see Talend Studio User Guide.

NB_UNIQUES: the number of unique rows. This is an After
variable and it returns an integer.

NB_DUPLICATES: the number of duplicate rows. This is an
After variable and it returns an integer.

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl +
Space to access the variable list and choose the variable to use from it.

For further information about variables, see
Talend Studio

User Guide.

Usage

Usage rule	This component handles flow of data therefore it requires input and output, hence is defined as an intermediary step.

Scenario 1: Deduplicating entries

In this five-component Job, we will sort entries on an input name list, find out
duplicated names, and display the unique names and the duplicated names on the Run console.

Setting up the Job

Drop a tFileInputDelimited, a tSortRow, a tUniqRow, and two tLogRow
components from the Palette to the design
workspace, and name the components as shown above.
Connect the tFileInputDelimited
component, the tSortRow component, and the
tUniqRow component using Row > Main
connections.
Connect the tUniqRow component and the
first tLogRow component using a Main > Uniques connection.
Connect the tUniqRow component and the
second tLogRow component using a Main > Duplicates connection.

Configuring the components

Double-click the tFileInputDelimited
component to display its Basic settings
view.
Click the […] button next to the
File Name field to browse to your input
file.
Define the header and footer rows. In this use case, the first row of the
input file is the header row.
Click Edit schema to define the schema
for this component. In this use case, the input file has five columns:
Id, FirstName,
LastName, Age, and
City. Then click OK to propagate the schema and close the schema
editor.
Double-click the tSortRow component to
display its Basic settings view.
To rearrange the entries in the alphabetic order of the names, add two
rows in the Criteria table by clicking the
plus button, select the FirstName and
LastName columns under Schema
column, select alpha as the sorting
type, and select the sorting order.
Double-click the tUniqRow component to
display its Basic settings view.
In the Unique key area, select the
columns on which you want deduplication to be carried out. In this use case,
you will sort out duplicated names.
In the Basic settings view of each of the
tLogRow components, select the
Table option to view the Job execution
result in table mode.

Saving and executing the Job

Press Ctrl+S to save your Job.
Run the Job by pressing F6 or clicking
the Run button on the Run tab.

The unique names and duplicated names are displayed in different tables on
the Run console.

Scenario 2: Deduplicating entries based on dynamic schema

This scenario applies only to a subscription-based Talend Platform solution or Talend Data Fabric.

In this use case, we will use a Job similar to the one in the scenario described
earlier to deduplicate the input entries about several families, so that only one person
per family stays on the name list. As all the components in this Job support the dynamic
schema feature, we will leverage this feature to save the time of configuring individual
columns of the schemas.

Setting up the Job

Drop these components from the Palette to
the design workspace: tFileInputDelimited,
tExtractDynamicFields, tUniqRow, tFileOutputDelimited, and tLogRow, and name the components as shown above to better
identify their roles in the Job.
Connect the component labelled People,
the component labelled Split_Column, and
the component labelled Deduplicate using
Row > Main connections.
Connect the component labelled Deduplicate and the component labelled Unique_Families using a Main > Uniques
connection.
Connect the component labelled Deduplicate and the component labelled Duplicated_Families using a Main > Duplicates connection.

Configuring the components

Double-click the component labelled People to display its Basic
settings view.

Warning:

The dynamic schema feature is only supported in Built-In mode and requires the input file
to have a header row.
Click the […] button next to the
File Name/Stream field to browse to
your input file.
Define the header and footer rows. In this use case, the first row of the
input file is the header row.
Click Edit schema to define the schema
for this component.

In this use case, the input file has five columns:
FirstName, LastName,
HouseNo, Street,
and City. However, as we can leverage the advantage of
the dynamic schema feature, we simply define one dynamic column in the
schema, Dyna in this example.

To do so :
1. Add a new line by clicking the [+] button.
2. Type Dyna in the Column field.
3. Select Dynamic from the Type list.
4. Then, click OK to propagate the
  schema and close the [Schema]
  dialog box.
Double-click the component labelled Split_Column to display its Basic
settings view.

We will use this component to split the dynamic column of the input schema
into two columns, one for the first name and the other for the family
related information. To do so:
1. Click Edit schema to open the
  [Schema] dialog box.
2. In the output panel, click the [+] button to add two columns for the output schema,
  and name them FirstName and
  FamilyInfo
  respectively.
3. Select String from the Type list for the FirstName column to extract this column from the
  input schema to carry the first name of each person on the name
  list.
4. Select Dynamic from the Type list for the FamilyInfo column so that this column will carry the
  rest information of each person on the name list: the last name,
  house number, street and city, which all together will identify a
  family.
5. Then, click OK to propagate the
  schema and close the [Schema]
  dialog box.
Double-click the component labelled Deduplicate to display its Basic
settings view.
In the Unique key area, select the
Key attribute check box for the
FamilyInfo column to carry out
deduplication on the family information.
In the Basic settings view of the
tFileOutputDelimited component, which
is labelled Deduplicated_Families, define
the output file path, select the Include
header check box, and leave the other settings as they
are.
In the Basic settings view of the
tLogRow component, which is labelled
Duplicated_Families, select the
Table option to view the Job execution
result in table mode.

Saving and executing the Job

Press Ctrl+S to save your Job.
Run the Job by pressing F6 or clicking
the Run button on the Run tab.

The information of duplicated families is displayed on the Run console, and only one person per family stays
on the name list in the output file.