August 17, 2023

tMDMBulkLoad – Docs for ESB 5.x

tMDMBulkLoad

tMDMBulkLoad.png

tMDMBulkLoad properties

Component family

Talend MDM

 

Function

tMDMBulkLoad writes XML-structured
master data into the MDM hub in bulk mode.

Purpose

This component uses bulk mode to write data so that large batches of
data, or highly complex data, can be uploaded quickly to the MDM
server.

Basic settings

Schema and Edit schema

A schema is a row description. It defines the number of fields
that will be processed and passed on to the next component. The
schema is either built-in or stored remotely in the Repository.

Since version 5.6, both the Built-In mode and the Repository mode are
available in any of the Talend solutions.

Click Edit schema to make changes to the schema. If the
current schema is of the Repository type, three options are
available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this option
    to change the schema to Built-in for local
    changes.

  • Update repository connection: choose this option to change
    the schema stored in the repository and decide whether to propagate the changes to
    all the Jobs upon completion. If you just want to propagate the changes to the
    current Job, you can select No upon completion and
    choose this schema metadata again in the [Repository Content] window.

Click Sync columns to collect the
schema from the previous component.

 

 

Built-in: You create the schema
and store it locally for this component only. Related topic: see
Talend Studio User Guide.

 

 

Repository: You have already
created the schema and stored it in the Repository. You can reuse it
in various projects and Job designs. Related topic: see
Talend Studio User Guide.

 

XML field

Select the name of the column in which you want to write the XML
data.

 

URL

Type in the URL required to access the MDM server.

 

Username and Password

Type in the user authentication data for the MDM server.

To enter the password, click the […] button next to the
password field, and then in the pop-up dialog box enter the password between double quotes
and click OK to save the settings.

 

Version (deprecated)

Type in the name of the Version of master data you want to connect
to, for which you have the required user rights.

Leave this field empty if you want to display the default Version of master
data.

 

Data Model

Type in the name of the data model against which the data to be
written is validated.

 

Data Container

Type in the name of the data container where you want to write the
master data.

 

Entity

Type in the name of the entity that holds the data record(s) you
want to write.

Type

Select Master or Staging to specify the database on which
the action should be performed.

 

Validate

Select this check box to validate the data you want to write onto
the MDM server against validation rules defined for the current data
model.

Note that for the PROVISIONING Data Container, validation checks
will always be performed on incoming records, regardless of whether
or not this check box is selected.

For more information on how to set the validation rules, see
Talend Studio User Guide.

Warning

If you need faster loading performance, do not select
this check box.

 

Generate ID

Select this check box to generate an ID number for all of the data
written.

Warning

If you need faster loading performance, do not select
this check box.

 

Commit size

Type in the row count of each batch to be written onto the MDM
server.

Advanced settings

tStatCatcher Statistics

Select this check box to gather the processing metadata at the Job
level as well as at each component level.

Connections

Outgoing links (from this component to another):

Row: Main

Trigger: Run if, On Component Ok,
On Component Error, On Subjob Ok, On Subjob Error

Incoming links (from one component to this one):

Row: Main

Trigger: Run if, On Component Ok,
On Component Error, On Subjob Ok, On Subjob Error

For further information regarding connections, see
Talend Studio User Guide.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space
to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio
User Guide.

Usage

This component always needs an incoming link that provides XML-structured
data. If your data is not yet in XML structure, you need to use a
component such as tWriteXMLField to transform it into XML.
For further information about tWriteXMLField, see tWriteXMLField.

Enhancing the MDM bulk data load

The information below concerns only MDM used with eXist (deprecated).

Because XML parsing consumes a lot of CPU and memory, it does not scale well
to large datasets. The scenario described in Scenario: Loading records into a business entity, which shows
how to use the tMDMBulkLoad component, is therefore
limited: for the time being at least, it cannot handle large datasets.

An alternative scenario, in which the dataset file is processed in several bulk load
iterations, can be designed as follows:

MDM_Data_Import.jpg

In such a scenario, the tMDMBulkLoad component
waits for XML data as an input. You must manually format this incoming data to match
the entity schema defined in the MDM perspective of Talend Studio. Most of the time, the data you want to import is
in a flat “format”, and you have to transform it into XML.

As XML parsing is memory consuming, you can work around this problem by splitting
your source file into several files using the tAdvancedFileOutputXML component. To do this, select the
Split output in several files option in the
Advanced settings view of the component and then
set the number of rows in each output file, for example through a context variable
(context.chunkSize).
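
The effect of this split can be sketched in plain Java. This is only an illustration of the chunking idea, not what tAdvancedFileOutputXML does internally: the class and method names are hypothetical, the integer rows are made-up sample data, and chunkSize stands in for the context.chunkSize variable.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkSplitSketch {

    // Splits the rows into consecutive chunks of at most chunkSize rows,
    // mirroring the idea of writing one output file per chunk.
    static <T> List<List<T>> split(List<T> rows, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, rows.size());
            chunks.add(new ArrayList<>(rows.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> rows = List.of(1, 2, 3, 4, 5); // made-up sample rows
        int chunkSize = 2; // hypothetical value of context.chunkSize
        List<List<Integer>> chunks = split(rows, chunkSize);
        for (int i = 0; i < chunks.size(); i++) {
            System.out.println("file " + i + ": " + chunks.get(i));
        }
    }
}
```

The last chunk may hold fewer rows than chunkSize when the row count is not an exact multiple of it.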

MDM_Data_Import1.png

The XML schema you define in the XML editor of this component must exactly
match the business entity defined in the MDM perspective
of Talend Studio. The XML schema in the editor must
have a single <root> element that contains all the other
elements, so that you can loop on each of these elements. Set the file path
to a location in a temporary folder.

Use a tFileList component to read all the XML
files that have just been created. This component enables you to parallelize the
process. Connect it to a tFileInputXML component
using the Iterate link.

MDM_Data_Import2.png

Note

For the Iterate link, it is recommended that
you set as many threads as the number of physical cores of the computer. You
can obtain the processor count using
Runtime.getRuntime().availableProcessors().
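
As a minimal sketch, the call can be wrapped as shown below. The class and method names are hypothetical. Note that availableProcessors() reports the processors available to the Java virtual machine, which on machines with hyper-threading can be higher than the number of physical cores, so treat the result as an upper bound to tune from.

```java
public class ThreadCountSketch {

    // Returns the number of processors available to the JVM, never less than 1.
    static int iterateThreads() {
        return Math.max(1, Runtime.getRuntime().availableProcessors());
    }

    public static void main(String[] args) {
        System.out.println("Threads for the Iterate link: " + iterateThreads());
    }
}
```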

The tFileInputXML component will read the data
from the XML files you have created, by defining a loop on the elements, and getting
all the nodes that are already formatted as XML. You must then select the Get Nodes check box.

MDM_Data_Import3.png

Finally, you must set up the tMDMBulkLoad
component as follows:

MDM_Data_Import4.png

Note

Ensure that you set the commit size to the same value you defined in the
tAdvancedFileOutputXML component, that is, the value of the
context.chunkSize context variable.

In such a scenario, the tFileDelete component
deletes all the temporary data at the end of the Job.

Tips for better-performance bulk commit with Talend XML database

The tables below help you decide what batch size to use in the tMDMBulkLoad component and show the expected bulk load time
for each choice.

Note

The information below concerns only MDM used with Talend XML database.

The table below lists recommended batch size according to the database size and
the corresponding expected bulk load time.

Database size          Optimal batch size     Expected bulk
(in XML documents)     (in XML documents)     load time

5 000 000              20 000                 42 min
10 000 000             40 000                 1 h 30 min
15 000 000             59 000                 2 h 20 min
50 000 000             200 000                10 h

Note

Batch size depends only on database size and not on the size of the XML
document.
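
The recommended batch sizes can be kept in a small lookup, sketched below. The four tiers are copied from the table above. The class name, the method name and the rule for sizes that fall between or beyond the listed tiers (take the next listed tier, or the largest one) are assumptions made for illustration, not a documented Talend rule.

```java
import java.util.Map;
import java.util.TreeMap;

public class BatchSizeSketch {

    // Database size (in XML documents) -> batch size, copied from the table above.
    static final TreeMap<Long, Integer> RECOMMENDED = new TreeMap<>();
    static {
        RECOMMENDED.put(5_000_000L, 20_000);
        RECOMMENDED.put(10_000_000L, 40_000);
        RECOMMENDED.put(15_000_000L, 59_000);
        RECOMMENDED.put(50_000_000L, 200_000);
    }

    // Assumption: use the first listed tier that is at least as large as the
    // expected database size; beyond the largest tier, reuse the largest tier.
    static int batchSizeFor(long expectedDocuments) {
        Map.Entry<Long, Integer> tier = RECOMMENDED.ceilingEntry(expectedDocuments);
        if (tier == null) {
            tier = RECOMMENDED.lastEntry();
        }
        return tier.getValue();
    }

    public static void main(String[] args) {
        System.out.println(batchSizeFor(10_000_000L));
        System.out.println(batchSizeFor(12_000_000L));
    }
}
```

The returned value is what you would type in the Commit size field, or assign to a context variable such as context.chunkSize.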

The above bulk load times were observed on the following environment:

  • Intel Xeon X3430 @ 2.40 GHz

  • 8 GB RAM

  • Windows Server 2008 64-bit SP1

The prerequisites for an optimal performance of bulk load on an MDM Server are
(in ascending order of importance):

  • I/O speed

  • CPU speed

  • Amount of RAM

For optimal performance, perform the MDM bulk load with 2 threads (each
thread using the optimal batch size) and without XML validation. The Validate check box in the Basic settings view of the
tMDMBulkLoad component must be cleared; for
further information, see tMDMBulkLoad properties.

For disk usage, it is recommended that the available space be three times the database
size once the bulk load is completed. For example, for XML documents of 1.5 KB, the
database size after bulk load is:

Database size          Database size
(in XML documents)     on disk

5 000 000              10 GB
10 000 000             20 GB
15 000 000             30 GB
50 000 000             100 GB

For example, for a 5 million document database and 1.5 KB XML documents, we
recommend a disk with 30 GB of available space.
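
The rule of thumb is simple arithmetic, sketched below for the four database sizes listed above. The class and method names are hypothetical.

```java
public class DiskSpaceSketch {

    // Rule from the text: the available disk space should be three times
    // the size of the database on disk after the bulk load.
    static long recommendedFreeSpaceGb(long databaseSizeOnDiskGb) {
        return 3 * databaseSizeOnDiskGb;
    }

    public static void main(String[] args) {
        long[] databaseSizesGb = {10, 20, 30, 100}; // sizes on disk from the table
        for (long sizeGb : databaseSizesGb) {
            System.out.println(sizeGb + " GB database: "
                    + recommendedFreeSpaceGb(sizeGb) + " GB of available space");
        }
    }
}
```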

Scenario: Loading records into a business entity

This scenario describes a Job that loads records into the ProductFamily
business entity defined by a specific data model in the MDM hub.

Prerequisites of this Job:

  • The Product data container: this data container is used
    to separate the product master data domain from the other master data
    domains.

  • The Product data model: this data model is used to define
    the attributes, validation rules, user access rights and relationships of the
    entities of interest. Thus it defines the attributes of the
    ProductFamily business entity.

  • The ProductFamily business entity: this business entity
    contains the Id and Name attributes, both defined by the
    Product data model.

For further information about how to create a data container, a data model, and a
business entity along with its attributes, see Talend Studio User Guide.

The Job in this scenario uses three components.

Use_Case_MDMBulkLoad1.png
  • tFixedFlowInput: this component generates the
    records to be loaded into the ProductFamily business
    entity. In a real case, the records to be loaded are often voluminous and
    stored in a specific file. To keep this scenario easy to replicate,
    this Job uses tFixedFlowInput to
    generate four sample records.

  • tWriteXMLField: this component transforms the
    incoming data into XML structure.

  • tMDMBulkLoad: this component writes the
    incoming data into the ProductFamily business entity in
    bulk mode, generating an ID value for each record.

Warning

For the time being, tWriteXMLField has some limitations when used with very large
datasets. An alternative scenario can be used to enhance the MDM bulk data load. For
further information, see Enhancing the MDM bulk data load.

To replicate this scenario, proceed as follows:

  • Drop tFixedFlowInput, tWriteXMLField and tMDMBulkLoad
    onto the design workspace.

  • Right-click tFixedFlowInput to open its
    contextual menu.

  • Select Row > Main to connect tFixedFlowInput to the following component using
    a Main link.

  • Do the same to link the other components.

  • Double-click tFixedFlowInput to open its
    Basic settings view.

Use_Case_MDMBulkLoad2.png
  • Click the three-dot button next to Edit schema
    to open the schema editor.

Use_Case_MDMBulkLoad3.png
  • In the schema editor, click the plus button to add one row.

  • In the schema editor, click the new row and type in the new name:
    family.

  • Click OK.

  • In the Mode area of the Basic settings view, select the
    Use inline table option.

  • Under the inline table, click the plus button four times to add four rows in
    the table.

  • In the inline table, click each of the added rows and type in their names
    between the quotation marks: Shirts, Hats, Pets, Mugs.

  • Double-click tWriteXMLField to open its
    Basic settings view.

Use_Case_MDMBulkLoad4.png
  • Click the three-dot button next to the Edit schema
    field to open the schema editor where you can add a row by
    clicking the plus button.

Use_Case_MDMBulkLoad5.png
  • Click the newly added row in the right view of the schema editor and type in
    the name of the output column where you want to write the XML content. In this
    example, type in xmlRecord.

  • Click OK to validate this output schema and
    close the schema editor.

  • In the dialog box that pops up, click OK to
    propagate this schema to the following component.

  • On the Basic settings view, click the
    three-dot button next to Configure Xml Tree to
    open the interface that helps to create the XML structure.

Use_Case_MDMBulkLoad6.png
  • In the Link target area, click
    rootTag and rename it as
    ProductFamily, which is the name of the business entity
    used in this scenario.

  • In the Linker source area, drag
    family and drop it onto ProductFamily in the
    Link target area.

    A dialog box displays asking what type of operation you want to do.

  • Select Create as sub-element of target node
    to create a sub-element of the ProductFamily node. Then the
    family element appears under the
    ProductFamily node.

  • In the Link target area, click the family node and rename it as Name, which is one of the attributes of the ProductFamily business entity.

  • Right-click the Name node and select from the contextual
    menu Set As Loop Element.

  • Click OK to validate the XML structure you
    defined.

  • Double-click tMDMBulkLoad to open its
    Basic settings view.

Use_Case_MDMBulkLoad7.png
  • Click the XML Field field and select
    xmlRecord from the drop-down list.

  • In the URL field, enter the bulk loader URL,
    between quotes: for example,
    http://localhost:8080/datamanager/loadServlet.

  • In the Username and Password fields, enter your login and password to connect to the
    MDM server.

  • In the Data Model and the Data Container fields, enter the names corresponding
    to the data model and the data container you need to use. Both are
    Product for this scenario.

  • In the Entity field, enter the name of the
    business entity into which the records are to be loaded. In this example, type in
    ProductFamily.

  • Select the Generate ID check box in order to
    generate ID values for the records to be loaded.

  • In the Commit size field, type in the batch
    size to be written into the MDM hub in bulk mode.

    For more information about the batch size, see
    Tips for better-performance bulk commit with Talend XML database.

  • Press F6 to run the Job.

  • Log in to your Talend MDM Web User Interface to check the newly
    added records for the ProductFamily business entity.

Use_Case_MDMBulkLoad8.png
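
To make the XML structure configured in this scenario concrete, the following sketch builds one XML record per sample row, with a ProductFamily root element and a Name sub-element. It is a plain Java illustration with hypothetical class and method names, not the code that tWriteXMLField generates, and the exact serialization produced by the component (whitespace, XML declaration) may differ. No Id element is written because the Generate ID check box is selected in this scenario.

```java
import java.util.List;

public class ProductFamilyXmlSketch {

    // Escapes the characters that cannot appear as-is in XML element text.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Builds one record matching the tree defined in the scenario:
    // a ProductFamily root with a single Name sub-element.
    static String toRecord(String family) {
        return "<ProductFamily><Name>" + escape(family) + "</Name></ProductFamily>";
    }

    public static void main(String[] args) {
        // The four sample values entered in the inline table of tFixedFlowInput.
        List<String> families = List.of("Shirts", "Hats", "Pets", "Mugs");
        for (String family : families) {
            System.out.println(toRecord(family));
        }
    }
}
```

Each printed line corresponds to the content of the xmlRecord column for one row.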

Document sourced from Talend: https://help.talend.com