Scenario: Grouping customer numerical data into clusters on HDFS
(deprecated)
This scenario applies only to a subscription-based Talend solution with Big data.
The scenario is inspired from a research paper on model-based clustering. Its data can
be found at Wholesale
customers Data Set. The research paper is available at Enhancing
the selection of a model-based clustering with external categorical
variables. This scenario is included in the Data Quality
Demos project you can import into your
Talend Studio
.
For further information, see the
Talend Studio User
Guide.
The Job in this scenario connects to a given Hadoop distributed file system (HDFS),
groups customers of a “wholesale distributor” into two clusters using the algorithms in
tMahoutClustering and outputs data on a given
HDFS.
The data set has 440 samples that refer to clients of a wholesale distributor. It
includes the annual spending in monetary units on diverse product categories like fresh
and grocery products or milk.
The data set refers to customers from different channels – Horeca
(Hotel/Restaurant/Cafe) or Retail (sale of goods in small quantities) channel, and from
different regions (Lisbon/Oporto/other).
This Job uses:
-
tMahoutClustering to compute the clusters for
the input data set. -
two tAggregateRow components to count the
number of clients in both clusters based on the region and
channel columns. -
three tMap components to map the channel and
region input flows into two separate output flows. The components are also used
to map the single clusterID column received from tMahoutClustering to two-column data flow that feed
the region and the channel clusters. -
two tHDFSOutput components to write data to
HDFS in two output files.
Prerequisites: Before being able to use the tMahoutClustering component, you must have a functional
Hadoop system.