August 15, 2023

Scenario: Grouping customer numerical data into clusters on HDFS (deprecated) – Docs for ESB 6.x

Scenario: Grouping customer numerical data into clusters on HDFS

This scenario applies only to a subscription-based Talend solution with Big data.

This scenario is inspired by a research paper on model-based clustering. Its data can
be found at Wholesale customers Data Set. The research paper is available at Enhancing
the selection of a model-based clustering with external categorical. This scenario is
included in the Data Quality project you can import into your Talend Studio.
For further information, see the Talend Studio User

The Job in this scenario connects to a given Hadoop distributed file system (HDFS),
groups customers of a “wholesale distributor” into two clusters using the algorithms in
tMahoutClustering and outputs data on a given

The data set has 440 samples that refer to clients of a wholesale distributor. It
includes the annual spending in monetary units on diverse product categories like fresh
and grocery products or milk.

The customers in the data set come from different channels, Horeca
(Hotel/Restaurant/Cafe) or Retail (sale of goods in small quantities), and from
different regions (Lisbon/Oporto/other).
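
To make the record layout concrete, here is a minimal sketch of how one row of such a data set could be decoded. The numeric codes for channel and region, the column names, and the helper parse_customer are assumptions based on the public description of the data set, not something this page states; the sample values are made up.

```python
from dataclasses import dataclass

# Assumed codes (not given on this page): channel and region are stored as integers.
CHANNELS = {1: "Horeca", 2: "Retail"}
REGIONS = {1: "Lisbon", 2: "Oporto", 3: "Other"}


@dataclass
class Customer:
    channel: str
    region: str
    spending: dict  # product category -> annual spending in monetary units


def parse_customer(line, header):
    """Turn one CSV line into a Customer, decoding the channel and region codes."""
    values = [int(v) for v in line.strip().split(",")]
    row = dict(zip(header, values))
    return Customer(
        channel=CHANNELS[row.pop("Channel")],
        region=REGIONS[row.pop("Region")],
        spending=row,  # whatever remains are the numerical spending columns
    )


# Hypothetical header and made-up values, for illustration only.
header = ["Channel", "Region", "Fresh", "Milk", "Grocery"]
example = parse_customer("2,3,100,200,300", header)
```

Only the spending columns are numerical measurements; channel and region are categorical, which is why the Job uses them afterwards to describe the clusters rather than to compute them.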


This Job uses:

  • tMahoutClustering to compute the clusters for
    the input data set.

  • two tAggregateRow components to count the
    number of clients in both clusters based on the region and
    channel columns.

  • three tMap components to map the channel and
    region input flows into two separate output flows. The components are also used
    to map the single clusterID column received from tMahoutClustering to a two-column
    data flow that feeds the region and the channel clusters.

  • two tHDFSOutput components to write data to
    HDFS in two output files.
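
The data flow described by the components above can be sketched outside Talend as follows. This is an illustration under stated assumptions, not the Job itself: plain k-means is swapped in for whichever algorithm is selected in tMahoutClustering, the rows are made up, and the function and variable names are hypothetical. The two counters play the role of the two tAggregateRow components.

```python
from collections import Counter


def kmeans(points, k=2, iterations=10):
    """Plain k-means (a stand-in algorithm); returns one cluster ID per point."""
    centers = [list(p) for p in points[:k]]  # deterministic start: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up rows: (channel, region, fresh, milk, grocery)
rows = [
    ("Horeca", "Lisbon", 12000, 1000, 2000),
    ("Retail", "Other", 3000, 9000, 14000),
    ("Horeca", "Other", 15000, 1200, 1800),
    ("Retail", "Oporto", 2500, 8000, 12000),
]

# Cluster on the numerical columns only, as the scenario title suggests.
cluster_ids = kmeans([r[2:] for r in rows], k=2)

# Count clients per (clusterID, region) and per (clusterID, channel).
by_region = Counter((cid, r[1]) for cid, r in zip(cluster_ids, rows))
by_channel = Counter((cid, r[0]) for cid, r in zip(cluster_ids, rows))
```

In the Job, each of the two counts would be written to its own file on HDFS; here they simply stay in memory as by_region and by_channel.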

Prerequisites: Before you can use the tMahoutClustering component, you must have a
functional Hadoop system.

Document get from Talend