tPigCoGroup

Warning

This component will be available in the Palette of
Talend Studio on the condition that you have subscribed to one of
the Talend
solutions with Big Data.

tPigCoGroup Properties

Component family

Big Data / Pig

 

Function

This component groups data coming from as many preceding Pig components as needed and aggregates the grouped data using given functions before sending the data to the next Pig component.

Purpose

The tPigCoGroup component performs the Pig COGROUP operation to group and aggregate data coming from multiple Pig flows.

Basic settings

Schema and Edit
Schema

A schema is a row description. It defines the number of fields to be processed and passed on
to the next component. The schema is either Built-In or
stored remotely in the Repository.

Click Edit schema to make changes to the schema. If the
current schema is of the Repository type, three options are
available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this option
    to change the schema to Built-in for local
    changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the [Repository Content] window.

 

 

Built-In: You create and store the schema locally for this
component only. Related topic: see Talend Studio
User Guide.

 

 

Repository: You have already created the schema and
stored it in the Repository. You can reuse it in various projects and Job designs. Related
topic: see Talend Studio User Guide.

 

Group by

Click the [+] button to add one or more columns of the input flows to this Group by table, so as to set these columns as the grouping condition.

 

Output mapping

This table is automatically filled with the output schema you have
defined using the Schema field.
Then complete this table to configure how the grouped data is
aggregated in the output flow:

Function: select the function you
need to use to aggregate a given column.

Source schema: select the input
flow from which you aggregate the data.

Expression: select the column to be aggregated and, if needed, edit the expression.
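
Conceptually, this configuration maps to a Pig Latin COGROUP followed by a FOREACH … GENERATE statement, as in the minimal sketch below. The relation and column names (A, B, k1, k2, v, name) are illustrative placeholders, not the code the component actually generates:

    C = COGROUP A BY k1, B BY k2;
    R = FOREACH C GENERATE group,              -- the grouping key
                           AVG(A.v) AS avg_v,  -- Function: AVG
                           COUNT(A.v) AS cnt,  -- Function: COUNT
                           B.name;             -- Function: EMPTY (records kept as is)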

Advanced settings

Group optimization

Select the Pig algorithm depending on the situation of the input
data and the loader you are using to optimize the COGROUP operation.

For further information, see Apache’s documentation about
Pig.

 

Use partitioner

Select this check box to call a Hadoop partitioner in order to partition records and
return the reduce task or tasks that each record should go
to.

Note that this partitioner class must be registered in the Register jar table provided
by the tPigLoad component that
starts the current Pig process.

Increase parallelism

Select this check box to set the number of reduce tasks for the
MapReduce Jobs.
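
In Pig Latin terms, these Advanced settings correspond to optional clauses of the COGROUP statement. A minimal sketch follows, with each option shown in its own statement since the optimization algorithms and the partitioner apply in different situations; the partitioner class org.example.MyPartitioner is a hypothetical placeholder (it must extend org.apache.hadoop.mapreduce.Partitioner and be registered in the Register jar table of tPigLoad):

    C1 = COGROUP A BY k1, B BY k2 USING 'merge';   -- Group optimization algorithm
    C2 = COGROUP A BY k1, B BY k2
         PARTITION BY org.example.MyPartitioner    -- Use partitioner (hypothetical class)
         PARALLEL 8;                               -- Increase parallelism: 8 reduce tasks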

 

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at the
Job level as well as at each component level.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio
User Guide.

Usage

This component is commonly used as an intermediate step, together with an input component and an output component.

Prerequisites

The Hadoop distribution must be properly installed, so as to guarantee the interaction with Talend Studio. The following list presents MapR-related information as an example.

  • Ensure that you have installed the MapR client in the machine where the Studio is, and added the MapR client library to the PATH variable of that machine. According to MapR’s documentation, the library or libraries of a MapR client corresponding to each OS version can be found under MAPR_INSTALL\hadoop\hadoop-VERSION\lib\native. For example, the library for Windows is lib\native\MapRClient.dll in the MapR client jar file. For further information, see the following link from MapR: http://www.mapr.com/blog/basic-notes-on-configuring-eclipse-as-a-hadoop-development-environment-for-mapr.

    Without adding the specified library or libraries, you may encounter the following error: no MapRClient in java.library.path.

  • Set the -Djava.library.path argument, for example, in the Job Run VM arguments area of the Run/Debug view in the [Preferences] dialog box. This argument provides the Studio with the path to the native library of that MapR client. This allows subscription-based users to make full use of the Data viewer to view, locally in the Studio, the data stored in MapR. For further information about how to set this argument, see the section describing how to view data in the Talend Big Data Getting Started Guide.

For further information about how to install a Hadoop distribution, see the manuals
corresponding to the Hadoop distribution you are using.

Limitation

Knowledge of Pig scripts is required.

Scenario: aggregating data from two relations using COGROUP

In this scenario, a four-component Job is designed to aggregate two relations on top of a given Hadoop cluster.

use_case-tpigcogroup1.png

The two relations used in this scenario consist of the following sample data:

  1. The first relation is composed of three columns that read owner, pet and age (of the owners).

  2. The second relation provides a list of students’ names alongside their friends, some of whom are pet owners displayed in the first relation. The schema of this relation therefore contains two columns: student and friend.

Before replicating this scenario, you need to write the sample data into the HDFS system of
the Hadoop cluster to be used. To do this, you can use tHDFSOutput. For further information about this component, see tHDFSOutput.

The data used in this scenario is inspired by the examples that Pig’s documentation uses to explain the GROUP and COGROUP operators. For related information, see Apache’s documentation for Pig.
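
The exact sample records are not reproduced here. The following semicolon-separated rows are illustrative only, chosen to be consistent with the result shown at the end of the scenario; any records beyond Alice’s are hypothetical.

The owner-pet relation (owner;pet;age):

    Alice;cat;17
    Alice;goldfish;17
    Alice;turtle;17

The student-friend relation (student;friend):

    Mark;Alice
    Cindy;Alice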

Linking the components

  1. In the Integration perspective of the Studio, create
    an empty Job from the Job Designs node in
    the Repository tree view.

    For further information about how to create a Job, see Talend Studio User Guide.

  2. In the workspace, enter the name of the component to be used and select this component
    from the list that appears. In this scenario, the components are two
    tPigLoad components, a tPigCoGroup component and a tPigStoreResult component. One of the two tPigLoad components is used as the main loading
    component to connect to the Hadoop cluster to be used.

  3. Connect the main tPigLoad component to tPigCoGroup using the Row > Main link.

  4. Do the same to connect the second tPigLoad component to tPigCoGroup. The Lookup
    label appears over this link.

  5. Repeat the operation to connect tPigCoGroup to tPigStoreResult.

Reading data into the Pig flow

Reading the owner-pet sample data

  1. Double-click the main tPigLoad component to open its
    Component view.

    use_case-tpigcogroup2.png
  2. Click the […] button next to Edit schema to open the schema editor, and click the [+] button three times to add three rows.

  3. In the Column column, rename the new rows to owner, pet
    and age, respectively, and in the
    Type column of the age row, select Integer.

  4. Click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.

  5. In the Mode area, select Map/Reduce to use the remote Hadoop cluster.

  6. In the Distribution and the Version lists, select the Hadoop distribution you are using. In this example, HortonWorks Data Platform V2.1.0 (Baikal) is selected.

  7. In the Load function list, select PigStorage. Then, the corresponding parameters to be set
    appear.

  8. In the NameNode URI and the Resource manager fields, enter the locations of
    those services, respectively.

  9. Select the Set Resourcemanager scheduler address check box
    and enter the URI of this service in the field that is displayed. This
    allows you to use the Scheduler service defined in the Hadoop cluster to be
    used. If this service is not defined in your cluster, you can ignore this
    step.

  10. In the User name field, enter the name of a user having the appropriate rights to write data in the cluster. In this example, it is hdfs.

  11. In the Input file URI field, enter the path pointing to the
    relation you need to read data from. As explained previously, the relation
    to be read here is the one containing the owner and pet sample data.

  12. In the Field separator field, enter the separator of the data to be read. In this example, it is a semicolon (;). An equivalent Pig Latin LOAD statement is sketched below.
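
With this configuration, the main tPigLoad component roughly expresses the following Pig Latin (a sketch; the HDFS path and the relation name A are illustrative):

    A = LOAD '/user/hdfs/input/owner_pet' USING PigStorage(';')
        AS (owner:chararray, pet:chararray, age:int);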

Loading the student-friend sample data

  1. Double-click the second tPigLoad component to open its Component view.

    use_case-tpigcogroup4.png
  2. Click the […] button next to Edit schema to open the schema editor.

  3. Click the [+] button twice to add two rows and in the
    Column column, rename them to student and friend, respectively.

    use_case-tpigcogroup5.png
  4. Click OK to validate these changes and accept the
    propagation prompted by the pop-up dialog box.

  5. In the Mode area, select Map/Reduce.

    This component reuses the Hadoop connection you have configured in
    that main tPigLoad component. Therefore, the Distribution
    and the Version fields have been
    automatically filled with the values from that main loading component.

  6. In the Load function field, select the
    PigStorage function to read the source
    data.

  7. In the Input file URI field, enter the directory where the source data is stored. As explained previously, this data is from the second relation containing the student and friend sample data. The equivalent LOAD statement is sketched below.
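
Similarly, the second tPigLoad roughly expresses the following (a sketch; the path and the relation name B are illustrative, and the semicolon separator is assumed to match the first relation):

    B = LOAD '/user/hdfs/input/student_friend' USING PigStorage(';')
        AS (student:chararray, friend:chararray);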

Aggregating the relations

  1. Double-click tPigCoGroup to open its
    Component view.

    use_case-tpigcogroup6.png
  2. Click the […] button next to Edit schema to open the schema editor.

  3. Click the [+] button five times to add five rows and in the
    Column column, rename them to owner_friend, age, pet_number,
    pet and student, respectively.

  4. In the Type column of the age row, select Integer.

  5. Click OK to validate these changes and accept the
    propagation prompted by the pop-up dialog box.

  6. In the Group by table, click the [+] button once to add one row.

  7. Then set the grouping condition in this Group by table to aggregate the two input relations. In each column representing an input relation, click the newly added row and select the column to be used to compose the grouping condition. In this scenario, the owner column from the owner-pet relation and the friend column from the student-friend relation are selected because they have common records. Based on these columns, the two relations are aggregated into bags.

    The bags for the record Alice might read as follows (an illustrative reconstruction based on the sample data):
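
        (Alice, {(Alice,cat,17),(Alice,goldfish,17),(Alice,turtle,17)}, {(Mark,Alice),(Cindy,Alice)})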

  8. In the Output mapping table, the output schema you defined
    previously has been automatically fed into the Column column. You need to complete this table to define how
    the grouped bags are aggregated into the schema of the output relation. The
    following list provides more details about how this aggregation is
    configured for this scenario:

    • The owner_friend column is used to receive the literal records coming from the columns that are used as the grouping condition. For this reason, select the EMPTY function from the Function drop-down list so that the incoming records stay as is. Then select row1 from the Source schema list and owner from the Expression list to read the records from the corresponding input column. You could equally select row2 and friend: the records received are the same, because the owner column and the friend column are joined when they are used as the grouping condition.

      Note that the label row1 is the ID of the input link and thus may be different in your scenario.

    • The age column is used to receive the age data. As shown in the example bags in the previous step, the age of an owner appears repeatedly in one of the bags after the grouping. You can select the AVG function from the Function list to average the repeated values so that the age appears only once in the final result. Then select row1 from the Source schema list and age from the Expression list.

    • The pet_number column is
      used to receive how many pets an owner has. Select the COUNT function from the Function list to perform this
      calculation. Then select row1
      from the Source schema list and
      pet from the Expression list.

    • The pet column and the student column are used to receive the grouped records from the input pet and student columns, respectively. Select EMPTY for both of them; then, from the Source schema list of each, select the corresponding input schema and, from the Expression list, the corresponding column. A Pig Latin sketch of the whole configuration follows this list.
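
Put together, this tPigCoGroup configuration roughly expresses the following Pig Latin (a sketch, assuming the relations loaded by the two tPigLoad components are named A for row1 and B for row2):

    C = COGROUP A BY owner, B BY friend;
    R = FOREACH C GENERATE group         AS owner_friend,  -- EMPTY on the grouping key
                           AVG(A.age)    AS age,           -- average of the repeated ages
                           COUNT(A.pet)  AS pet_number,    -- number of pets per owner
                           A.pet         AS pet,           -- EMPTY: grouped pet records
                           B.student     AS student;       -- EMPTY: grouped student records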

Writing the aggregated data

  1. Double-click tPigStoreResult to open its
    Component view.

    use_case-tpigcogroup7.png
  2. If this component does not have the same schema as the preceding component, a warning icon appears. In this case, click the Sync columns button to retrieve the schema from the preceding component; once done, the warning icon disappears.

  3. In the Result folder URI field, enter the path in HDFS
    pointing to the location you want to write the result in.

  4. Select the Remove result directory if exists check
    box.

  5. From the Store function list, select PigStorage.

  6. In the Field separator field, enter the separator you want to use. In this scenario, enter a comma (,). An equivalent STORE statement is sketched below.
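
This corresponds roughly to the following Pig Latin (the output path is illustrative):

    STORE R INTO '/user/hdfs/output/cogroup_result' USING PigStorage(',');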

Executing the Job

Then you can press F6 to run this Job.

Once done, check the result from the HDFS system you are using.

use_case-tpigcogroup8.png

You can read, for example, that the pet owner Alice is 17 years old, has 3 pets (a cat, a goldfish and a turtle), and that two of her friends are Mark and Cindy.
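
With the comma separator configured above, Alice’s output record might read like the following line (an illustrative rendering; the exact formatting of the bags depends on the store function):

    Alice,17,3,{(cat),(goldfish),(turtle)},{(Mark),(Cindy)}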

