August 17, 2023

tFSPartitionFile – Docs for ESB 5.x

tFSPartitionFile


Warning

This component will be available in the Palette of the studio on the condition that you have
subscribed to the relevant edition of one of the Talend solutions
with Big Data.

tFSPartitionFile Properties

Component family

FileScale

Note that this component is deprecated.

Function

tFSPartitionFile enables you to
partition mass data from an input file based on the hash or
round-robin partitioning method before writing it to an output file.
This facilitates the management of very large tables.

This component has real-time capabilities for partitioning large
scale files. To optimize performance, the component usually sorts
data before processing it.

Purpose

Helps partition mass data before writing it to an output
file.

Basic settings

Schema type and Edit
Schema

A schema is a row description. It defines the number of fields to be processed and
passed on to the next component. The schema is either Built-in or stored remotely in the
Repository.

Click Edit schema to make changes to the schema. If the
current schema is of the Repository type, three options are
available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this option
    to change the schema to Built-in for local
    changes.

  • Update repository connection: choose this option to change
    the schema stored in the repository and decide whether to propagate the changes to
    all the Jobs upon completion. If you just want to propagate the changes to the
    current Job, you can select No upon completion and
    choose this schema metadata again in the [Repository Content] window.

 

 

Repository: You have already
created the schema and stored it in the Repository. You can reuse it
in various projects and Job flowcharts. Related topic: see Talend Studio User Guide.

 

 

Built-in: You create and store
the schema locally for this component only. Related topic: see
Talend Studio User Guide.

 

Property type

Either Built-in or Repository.

Since version 5.6, both the Built-In mode and the Repository mode are
available in any of the Talend solutions.

 

 

Built-in: No property data stored
centrally.

 

 

Repository: Select the repository
file where Properties are stored. The fields that follow are
pre-filled using the fetched data.

 

Input File Name

Name of the file holding the data you want to partition.

 

Output File Name

Name of the file where you want to write the partitioned
data.

Note

The generated set of the output files will be postfixed with
an auto-increment number depending on the number of partitions
you define.

 

Record separator (char)

Character, string or regular expression to separate records
(lines).

 

Field separator (char)

Character, string or regular expression to separate fields in a
record.

 

Header

Number of records to be skipped at the beginning of the
file.

 

Footer

Number of records to be skipped at the end of the file.
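
The effect of the separator, Header and Footer settings can be illustrated with a minimal sketch. It is not FileScale's implementation: the function name `read_records` and the sample data are hypothetical, and the sketch treats separators as literal strings only, whereas the component also accepts regular expressions.

```python
def read_records(text, record_sep="\n", field_sep=";", header=0, footer=0):
    """Split raw text into records and fields, skipping header and footer records."""
    records = text.split(record_sep)
    # A trailing record separator leaves one empty record at the end; drop it.
    if records and records[-1] == "":
        records.pop()
    # Guard against a footer count larger than the remaining records.
    end = max(header, len(records) - footer)
    return [record.split(field_sep) for record in records[header:end]]


# Hypothetical sample: one header record, two data records, one footer record.
sample = "id;name\n1;Ada\n2;Grace\ntotal;2\n"
rows = read_records(sample, header=1, footer=1)
print(rows)  # [['1', 'Ada'], ['2', 'Grace']]
```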

 

Number of partitions

Number of partitions to split the file into.

 

Partition

Select from the list the partition method you want to use:

Round-robin: The records are
assigned to the partitions in turn, so that each partition contains a
more or less equal number of rows and load balancing is achieved.
Because there is no partition key, the partition a row is written to
does not depend on the content of the row.
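
As a rough illustration of round-robin assignment, the sketch below deals records out to the partitions one at a time. The function name `round_robin_partition` is hypothetical, and this is only the general technique, not necessarily how the FileScale executable orders its output.

```python
def round_robin_partition(records, num_partitions):
    """Assign record i to partition i modulo num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for index, record in enumerate(records):
        partitions[index % num_partitions].append(record)
    return partitions


# Ten records over three partitions: sizes differ by at most one.
parts = round_robin_partition(list(range(10)), 3)
print([len(p) for p in parts])  # [4, 3, 3]
```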

Hash: The records are hashed into
partitions based on the value of a key column or columns selected
from the file schema.

Select the Partition Key check
box that corresponds to the column(s) you want to use as a base to
partition data.
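
Hash partitioning on one or more key columns can be sketched as follows. The hash function used by FileScale is not documented here, so `zlib.crc32` is an assumption chosen only because it is stable between runs. The function name `hash_partition` and the sample records are hypothetical.

```python
import zlib


def hash_partition(records, key_columns, num_partitions):
    """Place each record in the partition selected by a hash of its key columns."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        # Join the key values with a separator unlikely to occur in the data.
        key = "\x1f".join(str(record[column]) for column in key_columns)
        # crc32 is an assumed stand-in for the real (undocumented) hash function.
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append(record)
    return partitions


people = [
    {"id": 1, "firstname": "Ada", "city": "London"},
    {"id": 2, "firstname": "Grace", "city": "New York"},
    {"id": 1, "firstname": "Ada", "city": "Paris"},
]
parts = hash_partition(people, ["id", "firstname"], 6)
```

Records that share the same key values always fall into the same partition, which is the property that distinguishes this method from round-robin. Partition sizes, on the other hand, depend on how the key values are distributed.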

Advanced settings

Generate FSLang File

Select this check box to generate the FSLang file corresponding to
your Job and click the three-dot button next to the FSLang File Name field to specify its
path and its name.

 

Assign FileScale Path

Select this check box and then click the three-dot button next to
the FileScale Path field to select
the FileScale program executable file required to execute the
component.

 

Specify Number of Process Child

Select this check box and enter the number of child processes to
use for carrying out the operation.

 

Sort results

Select this check box to sort the results.

 

Custom FileScale Parameter (separated by,)

Enter the parameters for any specific operation you want to add to
the FileScale executable call.

 

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at a
Job level as well as at each component level.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space
to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio
User Guide.

Usage

This component handles files directly; therefore, it does not require
input and output data flows. It is used to partition data in large-scale
files.

Limitation

The limitation is imposed by the limits of physical memory and CPU
architecture. For example, the total length of the processed files cannot
exceed the file system limit for LargeFile support (the maximum value of a
signed 64-bit integer).
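
For reference, the largest signed 64-bit value can be computed directly. The sketch assumes that the limit is counted in bytes, which the text above does not state explicitly. The helper name `fits_large_file_limit` is hypothetical.

```python
# Largest value representable as a signed 64-bit integer.
MAX_SIGNED_64 = 2**63 - 1
print(MAX_SIGNED_64)  # 9223372036854775807


def fits_large_file_limit(size_in_bytes):
    """Return True if a file size (assumed to be in bytes) is within the limit."""
    return 0 <= size_in_bytes <= MAX_SIGNED_64
```

If the unit is indeed bytes, this is just under 8 EiB, so in practice physical memory and disk capacity are reached long before this bound.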

Scenario: Partitioning mass data based on the hash method before writing it to an
output file

Warning

Make sure that you have unzipped and saved locally the FileScale
executable file delivered by Talend. You must define the path of this
executable file in the Advanced settings view of tFSPartitionFile.

This scenario describes a Job that uses the tFSPartitionFile
component to partition a very large data set at high speed, using the hash
method with two of the input columns as partition keys.

  • Drop the following components from the Palette to the design workspace: tRowGenerator, tFileOutputDelimited and tFSPartitionFile.

Use_Case_FSPartitionFile.png
  • Connect tRowGenerator first to tFileOutputDelimited using a Row > Main link and then to tFSPartitionFile using an OnSubjobOk link.

In this scenario, the tRowGenerator component
generates data according to the schema you define in the component Basic settings view and sends it to the input file. If the data is
generated without error, the tFSPartitionFile component
partitions the data into six subsets according to the defined partition method.

  • Click tRowGenerator to display its Basic settings view and define the component
    properties.

  • Click the three-dot button next to RowGenerator Editor to open the
    component editor where you can define your schema.

Use_Case_FSPartitionFile1.png
  • In the upper half of the editor, click the plus button to add the columns you
    want to write in the input file.

  • Define the schema and set the parameters of the columns.

    In this scenario, the input file contains five columns:
    id, firstname,
    lastname, city and
    age.

Warning

Make sure to define the length of your columns. Otherwise, an error
message will display when executing your Job.
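
Outside the studio, a comparable test file with these five columns could be produced with a short script. This is only a sketch of the idea, not what tRowGenerator does internally: the value lists, the `MAX_LENGTH` column lengths, and the function names are all hypothetical.

```python
import random

COLUMNS = ["id", "firstname", "lastname", "city", "age"]
# Hypothetical column lengths; in the studio these are set in the schema editor.
MAX_LENGTH = {"id": 6, "firstname": 10, "lastname": 10, "city": 12, "age": 3}
FIRSTNAMES = ["Ada", "Grace", "Alan", "Edsger"]
LASTNAMES = ["Lovelace", "Hopper", "Turing", "Dijkstra"]
CITIES = ["London", "New York", "Paris", "Rotterdam"]


def generate_rows(count, seed=0):
    """Generate sample rows, truncating each value to its declared length."""
    rng = random.Random(seed)
    rows = []
    for number in range(1, count + 1):
        row = {
            "id": str(number),
            "firstname": rng.choice(FIRSTNAMES),
            "lastname": rng.choice(LASTNAMES),
            "city": rng.choice(CITIES),
            "age": str(rng.randint(18, 90)),
        }
        rows.append({c: row[c][:MAX_LENGTH[c]] for c in COLUMNS})
    return rows


def to_delimited(rows, field_sep=";", record_sep="\n"):
    """Serialize rows as delimited text, one record per row."""
    return "".join(
        field_sep.join(row[c] for c in COLUMNS) + record_sep for row in rows
    )


print(to_delimited(generate_rows(3)))
```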

  • If required, click the Preview tab in the
    lower half of the editor to display the corresponding view and then click the
    View button to display a sample of the
    generated data.

  • Click OK to validate your schema and close
    the tRowGenerator editor.

  • Click tFileOutputDelimited to display its
    Basic settings view and define the
    component properties.

Use_Case_FSPartitionFile2.png
  • Set the tFileOutputDelimited
    properties.

For this scenario, we want to define a context variable for the input file path. You
can create context variables in different ways. For more information about how to create
and use context variables, see Talend Studio User Guide. In
this example, we want to define the context variable directly from the component
view.

  • Place your pointer in the field that you want to parameterize, File Name in this example, and then press F5.

    A dialog box displays.

Use_Case_FSPartitionFile3.png
  • Give a name to this new variable and select its type from the Type list.

  • In the Default value field, type in the
    context value you want to use for the input file path.

  • Click Finish to validate your changes and
    close the dialog box.

    The newly created variable is displayed in the File Name field and in the
    Contexts view.

  • In the Basic settings view of tFileOutputDelimited, click the Edit schema button to display the schema you defined in the
    editor and modify it if required.

  • Click OK to close the dialog box.

  • Click tFSPartitionFile to open its Basic settings view and define the component
    properties.

Use_Case_FSPartitionFile4.png
  • Set schema and property type to Built-In.

  • Click the Edit schema button to display a
    dialog box. Here you can define your column schema. This schema must correspond
    to the input file schema.

  • Click OK to close the schema dialog
    box.

    The defined column schema displays in the Partition configuration table.

  • Set the input and output file names using the context variable you defined
    earlier by pressing Ctrl + Space and selecting
    the variable from the list.

  • Define the record and field separators and then the header and footer of the
    file, if any.

  • In the Number of partitions field, enter the
    number of data subsets you want to create, six in this scenario.

  • From the Partition list, select the partition
    method you want to use, Hash in this
    scenario.

  • In the Partition configuration table, select
    the check boxes that correspond to the input columns you want to use as
    partition keys, id and firstname in
    this scenario.

  • Click the Advanced settings tab to display
    the advanced settings view.

Use_Case_FSPartitionFile5.png
  • Select the Assign FileScale Path check box to
    display the FileScale Path field and then
    browse to the executable file delivered by Talend.

  • Save your Job and press F6 to execute it.

A progress bar displays below the tFSPartitionFile
component in the design workspace to show the completed percentage of the operation.
This progress bar shows how quickly the large input data is being partitioned.

When the progress bar reaches 100%, the data is partitioned into six
subsets as defined in the component settings and written to the defined output files, as
shown in the capture below.

Use_Case_FSPartitionFile6.png

The generated set of the output files is postfixed with an auto-increment number
depending on the number of partitions you define, six in this scenario.
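
The resulting set of files can be pictured with the sketch below, which writes one file per partition. The exact suffix format and the starting number are not given in this document, so the `_<n>` suffix starting at 1 is an assumption made only for illustration. The function name `write_partitions` and the sample rows are hypothetical.

```python
import os
import tempfile


def write_partitions(partitions, output_path, field_sep=";", record_sep="\n"):
    """Write each partition to its own file and return the file paths.

    The '_<n>' suffix starting at 1 is an assumed naming scheme.
    """
    written = []
    for number, rows in enumerate(partitions, start=1):
        path = f"{output_path}_{number}"
        with open(path, "w", encoding="utf-8", newline="") as handle:
            for row in rows:
                handle.write(field_sep.join(row) + record_sep)
        written.append(path)
    return written


# Six partitions, as in this scenario; most are left empty for brevity.
with tempfile.TemporaryDirectory() as folder:
    partitions = [[["1", "Ada"]], [["2", "Grace"]], [], [], [], []]
    paths = write_partitions(partitions, os.path.join(folder, "out.csv"))
    names = [os.path.basename(path) for path in paths]
print(names)
```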


Document sourced from Talend: https://help.talend.com