tFSPartitionFile

Warning

This component will be available in the Palette of the studio on the condition that you have
subscribed to the relevant edition of one of the Talend solutions
with Big Data.

tFSPartitionFile Properties

Component family	FileScale	Note that this component is deprecated.
Function	tFSPartitionFile enables you to partition mass data from an input file based on the hash or round-robin partitioning method before writing it to an output file. This facilitates the management of very large tables. This component has real-time capabilities for partitioning large-scale files. To optimize performance, the component usually sorts data before processing it.
Purpose	Helps partitioning mass data before writing it in an output file.
Basic settings	Schema type and Edit Schema	A schema is a row description: it defines the number of fields to be processed and passed on to the next component. The schema is either Built-in or stored remotely in the Repository. Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available: View schema: choose this option to view the schema only. Change to built-in property: choose this option to change the schema to Built-in for local changes. Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you only want to propagate the changes to the current Job, select No upon completion and choose this schema metadata again in the [Repository Content] window.
		Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job flowcharts. Related topic: see Talend Studio User Guide.
		Built-in: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.
	Property type	Either Built-in or Repository. Since version 5.6, both the Built-In mode and the Repository mode are available in any of the Talend solutions.
		Built-in: No property data stored centrally.
		Repository: Select the repository file where Properties are stored. The fields that follow are pre-filled using the fetched data.
	Input File Name	Name of the file holding the data you want to partition.
	Output File Name	Name of the file where you want to write the partitioned data. Note: The names of the generated output files are suffixed with an auto-incremented number that depends on the number of partitions you define.
	Record separator (char)	Character, string or regular expression to separate records (lines).
	Field separator (char)	Character, string or regular expression to separate fields in a record.
	Header	Number of records to be skipped in the beginning of the file.
	Footer	Number of records to be skipped at the end of the file.
	Number of partitions	Number of partitions to split the input file into.
	Partition	Select from the list the partition method you want to use: Round-robin: The records are distributed across the partitions in turn, so that each partition contains a more or less equal number of rows and load balancing is achieved. Because there is no partition key, the partition a row lands in does not depend on its content. Hash: The records are hashed into partitions based on the value of a key column or columns selected from the file schema. Select the Partition Key check box that corresponds to the column(s) you want to use as a basis for partitioning the data.
Advanced settings	Generate FSLang File	Select this check box to generate the FSLang file corresponding to your Job and click the three-dot button next to the FSLang File Name field to specify its path and its name.
	Assign FileScale Path	Select this check box and then click the three-dot button next to the FileScale Path field to select the FileScale program executable file required to execute the component.
	Specify Number of Process Child	Select this check box and enter the number of child processes to use for carrying out the operation.
	Sort results	Select this check box to sort the results.
	Custom FileScale Parameter (separated by,)	Enter the parameters for any specific operation you want to add to the FileScale executable call.
	tStatCatcher Statistics	Select this check box to gather the Job processing metadata at a Job level as well as at each component level.
Global Variables	ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box. A Flow variable functions during the execution of a component while an After variable functions after the execution of the component. To fill in a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it. For further information about variables, see Talend Studio User Guide.
Usage	This component handles files directly, so it does not require input and output data flows. It is used to partition data in large-scale files.
Limitation	Limitations are imposed by physical memory and CPU architecture. For example, the total length of the processed files cannot exceed the file system limit for LargeFile support (the maximum value of a signed 64-bit integer).
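
The two options of the Partition setting described above can be illustrated with a minimal Python sketch. This is not FileScale code: the function names are hypothetical, and CRC32 is only a deterministic stand-in, since the hash function the component actually uses is not stated in this documentation.

```python
import zlib
from typing import Sequence


def round_robin_partition(record_index: int, num_partitions: int) -> int:
    """Assign records to partitions in turn: 0, 1, ..., n-1, 0, 1, ..."""
    return record_index % num_partitions


def hash_partition(key_values: Sequence[str], num_partitions: int) -> int:
    """Assign a record to a partition from the hash of its key column values.

    CRC32 is an assumption made for illustration only.
    """
    joined = "\x1f".join(key_values).encode("utf-8")
    return zlib.crc32(joined) % num_partitions


# Hypothetical records: the first two fields play the role of partition keys.
records = [
    ["1", "Ada", "Paris"],
    ["2", "Bob", "Lyon"],
    ["1", "Ada", "Nice"],
    ["3", "Cy", "Paris"],
]

# Round-robin depends only on the position of the record in the file.
rr = [round_robin_partition(i, 3) for i, _ in enumerate(records)]

# Hash depends only on the key values, so equal keys share a partition.
hashed = [hash_partition(r[:2], 3) for r in records]
```

With round-robin, partition sizes differ by at most one record. With hash, records that share the same key values always end up in the same partition, but partition sizes depend on how the key values are distributed.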

Scenario: Partitioning mass data based on the hash method before writing it to an output file

Warning

Make sure that you have unzipped and saved locally the FileScale executable file delivered by Talend. You must define the path of this executable file in the Advanced settings view of tFSPartitionFile.

This scenario describes a Job that uses the tFSPartitionFile
component to partition a very large data set at high speed, using the hash
method with two of the input columns as partition keys.

Drop the following components from the Palette to the design workspace: tRowGenerator, tFileOutputDelimited and tFSPartitionFile.

Connect tRowGenerator first to tFileOutputDelimited using a Row > Main link and then to tFSPartitionFile using an OnSubjobOk link.

In this scenario, the tRowGenerator component generates data according to
the schema you define in its Basic settings view and sends it to the input file.
If the data is generated without errors, the tFSPartitionFile component
partitions the data into six subsets according to the defined partition method.
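
To experiment outside the studio, the first subjob can be approximated by a short Python sketch that writes delimited rows with the five columns used in this scenario (id, firstname, lastname, city and age). The sample values, the `;` field separator and the function names are assumptions chosen for illustration; they are not what tRowGenerator produces.

```python
import csv
import io
import random

COLUMNS = ["id", "firstname", "lastname", "city", "age"]


def generate_rows(count: int, seed: int = 0) -> list:
    """Build `count` rows of made-up values, one value per column in COLUMNS."""
    rng = random.Random(seed)
    firstnames = ["Ada", "Bob", "Cleo", "Dan"]
    lastnames = ["Martin", "Durand", "Petit"]
    cities = ["Paris", "Lyon", "Nantes"]
    return [
        [
            str(i),
            rng.choice(firstnames),
            rng.choice(lastnames),
            rng.choice(cities),
            str(rng.randint(18, 90)),
        ]
        for i in range(1, count + 1)
    ]


def write_delimited(rows, stream, field_separator=";"):
    """Write one record per line, fields joined by `field_separator`."""
    writer = csv.writer(stream, delimiter=field_separator, lineterminator="\n")
    writer.writerows(rows)


# Write to an in-memory buffer; pass an open file object to write to disk.
buffer = io.StringIO()
write_delimited(generate_rows(5), buffer)
```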

Click tRowGenerator to display its Basic settings view and define the component
properties.
Click the three-dot button next to RowGenerator
Editor to open the component editor where you can define your
schema.

In the upper half of the editor, click the plus button to add the columns you
want to write in the input file.
Define the schema and set the parameters of the columns.

In this scenario, the input file contains five columns:
id, firstname,
lastname, city and
age.

Warning

Make sure to define the length of your columns. Otherwise, an error
message is displayed when you execute your Job.

If required, click the Preview tab in the
lower half of the editor to display the corresponding view and then click the
View button to display a sample of the
generated data.
Click OK to validate your schema and close
the tRowGenerator editor.
Click tFileOutputDelimited to display its
Basic settings view and define the
component properties.

Set the tFileOutputDelimited
properties.

For this scenario, we want to define a context variable for the input file path. You
can create context variables in different ways. For more information about how to create
and use context variables, see Talend Studio User Guide. In
this example, we define the context variable directly from the component
view.

Place your pointer in the field that you want to parameterize, File Name in this example, and then press F5.

A dialog box displays.

Give a name to this new variable and select its type from the Type list.
In the Default value field, type in the
context value you want to use for the input file path.
Click Finish to validate your changes and
close the dialog box.

The newly created variable is displayed in the File
Name field and in the Contexts
view.
In the Basic settings view of tFileOutputDelimited, click the Edit schema button to display the schema you defined in the
editor and modify it if required.
Click OK to close the dialog box.
Click tFSPartitionFile to open its Basic settings view and define the component
properties.

Set schema and property type to Built-In.
Click the Edit schema button to display a
dialog box where you can define your column schema. This schema must correspond
to the input file schema.
Click OK to close the schema dialog
box.

The defined column schema displays in the Partition
configuration table.
Set the input and output file names using the context variable you defined
earlier by pressing Ctrl + Space and selecting
the variable from the list.
Define the record and field separators and then the header and footer of the
file, if any.
In the Number of partitions field, enter the
number of data subsets you want to create, six in this scenario.
From the Partition list, select the partition
method you want to use, Hash in this
scenario.
In the Partition configuration table, select
the check boxes that correspond to the input columns you want to use as
partition keys, id and firstname in
this scenario.
Click the Advanced settings tab to display
the advanced settings view.

Select the Assign FileScale Path check box to
display the FileScale Path field and then
browse to the executable file delivered by Talend.
Save your Job and press F6 to execute it.

A progress bar displays below the tFSPartitionFile
component in the design workspace to show the completed percentage of the operation
while the input data is being partitioned.

When the progress bar reaches 100%, the data is partitioned into six
subsets as defined in the component settings and written to the defined output files as
shown in the capture below.

The generated set of the output files is postfixed with an auto-increment number
depending on the number of partitions you define, six in this scenario.
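
The effect of this scenario can be sketched in Python as follows. It is an approximation under stated assumptions, not the FileScale implementation: the function `partition_file`, the `_<number>` output naming, the `;` separator and the CRC32 hash are all hypothetical choices, because this documentation does not specify the exact suffix format or hash function.

```python
import os
import tempfile
import zlib


def partition_file(input_path, output_path, num_partitions, key_indexes,
                   field_separator=";", header=0, footer=0):
    """Split a delimited file into `num_partitions` files by hashing key columns.

    Assumed for illustration: output names are output_path + "_" + number,
    and the hash is CRC32 over the joined key values.
    """
    with open(input_path, encoding="utf-8") as src:
        lines = src.read().splitlines()
    # Skip `header` records at the start and `footer` records at the end.
    body = lines[header:len(lines) - footer]

    output_paths = [f"{output_path}_{n}" for n in range(num_partitions)]
    outputs = [open(p, "w", encoding="utf-8") for p in output_paths]
    try:
        for line in body:
            fields = line.split(field_separator)
            key = "\x1f".join(fields[i] for i in key_indexes)
            target = zlib.crc32(key.encode("utf-8")) % num_partitions
            outputs[target].write(line + "\n")
    finally:
        for out in outputs:
            out.close()
    return output_paths


# Usage: six partitions keyed on id and firstname (columns 0 and 1).
with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "input.csv")
    with open(source, "w", encoding="utf-8") as f:
        f.write("id;firstname;lastname;city;age\n")  # one header record
        for i in range(60):
            f.write(f"{i};name{i % 7};last{i};city{i % 3};{20 + i % 50}\n")

    paths = partition_file(source, os.path.join(workdir, "output.csv"),
                           6, key_indexes=(0, 1), header=1)
    counts = []
    for p in paths:
        with open(p, encoding="utf-8") as f:
            counts.append(sum(1 for _ in f))
```

Every input record is written to exactly one of the six output files, so the per-file counts add up to the number of records in the input once the header is skipped.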

Document get from Talend https://help.talend.com
