Component family |
Talend |
|
Function |
tMDMBulkLoad writes XML |
|
Purpose |
This component uses bulk mode to write data so that big batches of |
|
Basic settings |
Schema and Edit |
A schema is a row description, it defines the number of fields
Since version 5.6, both the Built-In mode and the Repository mode are
Click Edit schema to make changes to the schema. If the
Click Sync columns to collect the |
|
|
Built-in: You create the schema |
|
|
Repository: You have already |
|
XML field |
Select the name of the column in which you want to write the XML |
|
URL |
Type in the URL required to access the MDM server. |
|
Username and |
Type in the user authentication data for the MDM server.
To enter the password, click the […] button next to the |
|
Version (deprecated) |
Type in the name of the Version of master data you want to connect
Leave this field empty if you want to display the default Version of master |
|
Data Model |
Type in the name of the data model against which the data to be |
|
Data Container |
Type in the name of the data container where you want to write the |
|
Entity |
Type in the name of the entity that holds the data record(s) you |
Type |
Select Master or Staging to specify the database on which |
|
|
Validate |
Select this check box to validate the data you want to write onto
Note that for the PROVISIONING Data Container, validation checks
For more information on how to set the validation rules, see
Warning
If you need faster loading performance, do not select |
|
Generate ID |
Select this check box to generate an ID number for all of the data
Warning
If you need faster loading performance, do not select |
|
Commit size |
Type in the row count of each batch to be written onto the MDM |
Advanced settings |
tStatCatcher Statistics |
Select this check box to gather the processing metadata at the Job |
Connections |
Outgoing links (from this component to another):
Row: Main,
Trigger: Run if; On Component Ok;
Incoming links (from one component to this one):
Row: Main
Trigger: Run if, On Component Ok,
For further information regarding connections, see |
|
Global Variables |
ERROR_MESSAGE: the error message generated by the
A Flow variable functions during the execution of a component while an After variable
To fill up a field or expression with a variable, press Ctrl +
For further information about variables, see Talend Studio |
|
Usage |
This component always needs an incoming link to offer XML |
The information below concerns only MDM used with eXist (deprecated).
As XML parsing is a CPU- and memory-intensive process, it does not cope well
with large datasets. Scenario: Loading records into a business entity, which shows
how to use the tMDMBulkLoad component, has some
limitations because it cannot handle large datasets, for the time being at least.
An alternative scenario, in which you process the dataset file in several bulk load
iterations, can be designed as follows:
In such a scenario, the tMDMBulkLoad component
expects XML data as its input. You must manually format this incoming data to match
the entity schema defined in the MDM perspective of Talend Studio. Most of the time, the data you want to import is
in a flat “format”, and you have to transform it into XML.
As XML parsing is memory consuming, you can work around this problem by splitting
your source file into several files using the tAdvancedFileOutputXML component. To do this, select the
Split output in several files option in the
Advanced settings view of the component and then
set the number of rows in each output file, for example through a context variable
(context.chunkSize).
The XML schema you define in the XML editor of this component must be an
exact match of the business entity defined in the MDM perspective
of Talend Studio. The XML schema in the editor must
contain a single <root>
element that wraps all the other
elements, so that you can loop on each of them. The output file path should
point to a temporary folder.
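The splitting step described above can be pictured with a minimal Java sketch. It is not what tAdvancedFileOutputXML generates internally; it only illustrates the idea of cutting a record list into chunks and wrapping each chunk in a single root element. The class and method names are hypothetical, the chunkSize value stands in for context.chunkSize, and the ProductFamily and Name elements are borrowed from the scenario in this document as an assumed entity structure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration: split records into chunks, one XML document per chunk.
final class ChunkedXmlSketch {

    // Splits the records into consecutive chunks of at most chunkSize items.
    static List<List<String>> split(List<String> records, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<String>> chunks = new ArrayList<>();
        for (int start = 0; start < records.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, records.size());
            chunks.add(new ArrayList<>(records.subList(start, end)));
        }
        return chunks;
    }

    // Escapes the characters that cannot appear as-is in XML text content.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Wraps one chunk in a single <root> element, one entity element per record.
    static String toXml(List<String> chunk) {
        StringBuilder xml = new StringBuilder("<root>");
        for (String name : chunk) {
            xml.append("<ProductFamily><Name>")
               .append(escape(name))
               .append("</Name></ProductFamily>");
        }
        return xml.append("</root>").toString();
    }

    public static void main(String[] args) {
        int chunkSize = 3; // assumption: stands in for context.chunkSize
        List<String> records = Arrays.asList("Shirts", "Hats", "Pets", "Mugs");
        for (List<String> chunk : split(records, chunkSize)) {
            System.out.println(toXml(chunk));
        }
    }
}
```

In a real Job each chunk would be written to its own file in the temporary folder rather than printed.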
Use a tFileList component to read all the XML
files that have just been created. This component enables you to parallelize the
process. Connect it to a tFileInputXML component
using the Iterate link.
Note
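For the Iterate link, it is recommended that
you set as many threads as the number of physical cores of the computer. A
practical starting value is
Runtime.getRuntime().availableProcessors(), which returns the number of
processors available to the JVM (logical processors, which can exceed the
physical core count when hyper-threading is enabled).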
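As a rough illustration of what parallel iteration over the chunk files amounts to, the following Java sketch sizes a thread pool from Runtime.getRuntime().availableProcessors() and hands one task per file to it. It is only a sketch: the class name, the method name and the file names are hypothetical, and the task body is a placeholder for reading one file and bulk loading it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of iterating over chunk files with several threads.
final class ParallelIterateSketch {

    // Runs one task per file on a fixed-size pool and returns how many tasks ran.
    static int processAll(List<String> files, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, threads));
        AtomicInteger processed = new AtomicInteger();
        for (String file : files) {
            // Placeholder for "read this chunk file and bulk load its content".
            pool.execute(() -> processed.incrementAndGet());
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("file processing timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return processed.get();
    }

    public static void main(String[] args) {
        // Number of processors visible to the JVM, used as the thread count.
        int threads = Runtime.getRuntime().availableProcessors();
        List<String> files = Arrays.asList("chunk_0.xml", "chunk_1.xml", "chunk_2.xml");
        System.out.println(processAll(files, threads) + " files processed");
    }
}
```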
The tFileInputXML component will read the data
from the XML files you have created, by defining a loop on the elements, and getting
all the nodes that are already formatted as XML. You must then select the Get Nodes check box.
Finally, you must set up the tMDMBulkLoad
component as follows:
Note
Ensure that you set the commit size to the same value you defined in the
tAdvancedFileOutputXML component, that is, the value of the
context.chunkSize context variable.
The tFileDelete component in such a scenario
deletes all the temporary data at the end of the Job.
The tables below help you decide what batch size to use in the tMDMBulkLoad component and show the expected bulk load time
for each choice.
Note
The information below concerns only MDM used with Talend XML database.
The table below lists recommended batch size according to the database size and
the corresponding expected bulk load time.
Database size (in XML documents) | Optimal batch size (in XML documents) | Expected bulk load time
5 000 000 | 20 000 | 42 min
10 000 000 | 40 000 | 1 h 30 min
15 000 000 | 59 000 | 2 h 20 min
50 000 000 | 200 000 | 10 h
Note
Batch size depends only on database size and not on the size of the XML
document.
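The table can be turned into a small lookup, sketched below in Java. The figures are copied from the table; the selection rule is an assumption made for this example only (pick the row with the smallest listed database size that is at least the target size, and fall back to the largest row beyond that). The class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical lookup over the batch size table above.
final class BatchSizeLookupSketch {

    // Database size (in XML documents) -> optimal batch size, as listed in the table.
    private static final TreeMap<Long, Integer> TABLE = new TreeMap<>();
    static {
        TABLE.put(5_000_000L, 20_000);
        TABLE.put(10_000_000L, 40_000);
        TABLE.put(15_000_000L, 59_000);
        TABLE.put(50_000_000L, 200_000);
    }

    // Assumed rule: use the smallest listed size >= databaseSize, else the last row.
    static int batchSizeFor(long databaseSize) {
        Map.Entry<Long, Integer> row = TABLE.ceilingEntry(databaseSize);
        if (row == null) {
            row = TABLE.lastEntry();
        }
        return row.getValue();
    }

    public static void main(String[] args) {
        System.out.println(batchSizeFor(10_000_000L));
    }
}
```

For sizes between two rows you may prefer to benchmark a few values rather than rely on this rounding rule.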
The above bulk load times were observed in the following environment:
- Intel Xeon X3430 @ 2.40 GHz
- 8 GB RAM
- Windows Server 2008 64-bit SP1
The prerequisites for optimal bulk load performance on an MDM Server are
(in ascending order of importance):
- I/O speed
- CPU speed
- Amount of RAM
For optimal performance, perform the MDM bulk load with two threads (each
thread using the optimal batch size) and without XML validation: the Validate check box in the Basic settings view of the
tMDMBulkLoad component must be cleared. For
further information, see tMDMBulkLoad properties.
For disk usage, it is recommended that the available space be three times the database
size after the bulk load is completed. For example, for XML documents of 1.5 KB, the
database size after bulk load is:
Database size (in XML documents) | Database size on disk
5 000 000 | 10 GB
10 000 000 | 20 GB
15 000 000 | 30 GB
50 000 000 | 100 GB
For example, for a 5 million document database and 1.5 KB XML documents, we
recommend a disk with 30 GB of available space.
This scenario describes a Job that loads records into the ProductFamily
business entity defined by a specific data model in the MDM hub.
Prerequisites of this Job:
- The Product data container: this data container is used
  to separate the product master data domain from the other master data
  domains.
- The Product data model: this data model is used to define
  the attributes, validation rules, user access rights and relationships of the
  entities of interest. Thus it defines the attributes of the
  ProductFamily business entity.
- The ProductFamily business entity: this business entity
  contains the Id and Name attributes, both defined by the
  Product data model.
For further information about how to create a data container, a data model, and a
business entity along with its attributes, see Talend Studio User
Guide.
The Job in this scenario uses three components.
- tFixedFlowInput: this component generates the
  records to be loaded into the ProductFamily business
  entity. In real use, the records to be loaded are often voluminous and
  stored in a specific file; to keep this scenario easy to replicate,
  this Job uses tFixedFlowInput to
  generate four sample records.
- tWriteXMLField: this component transforms the
  incoming data into an XML structure.
- tMDMBulkLoad: this component writes the
  incoming data into the ProductFamily business entity in
  bulk mode, generating an ID value for each record.
Warning
For the time being, tWriteXMLField has some limitations when used with very large
datasets. Another scenario is possible to enhance the MDM bulk data load. For
further information, see Enhancing the MDM bulk data load.
To replicate this scenario, proceed as follows:
- Drop tFixedFlowInput, tWriteXMLField and tMDMBulkLoad
  onto the design workspace.
- Right-click tFixedFlowInput to open its
  contextual menu.
- Select Row > Main to connect tFixedFlowInput to the following component using
  a Main link.
- Do the same to link the other components.
- Double-click tFixedFlowInput to open its
  Basic settings view.
- Click the three-dot button next to Edit
  schema to open the schema editor.
- In the schema editor, click the plus button to add one row.
- In the schema editor, click the new row and type in the new name:
  family.
- Click OK.
- In the Mode area of the Basic settings view, select the Use inline
  table option.
- Under the inline table, click the plus button four times to add four rows in
  the table.
- In the inline table, click each of the added rows and type in their names
  between the quotation marks: Shirts,
  Hats, Pets,
  Mugs.
- Double-click tWriteXMLField to open its
  Basic settings view.
- Click the three-dot button next to the Edit
  schema field to open the schema editor, where you can add a row by
  clicking the plus button.
- Click the newly added row in the right view of the schema editor and type in
  the name of the output column where you want to write the XML content. In this
  example, type in xmlRecord.
- Click OK to validate this output schema and
  close the schema editor.
- In the dialog box that pops up, click OK to
  propagate this schema to the following component.
- On the Basic settings view, click the
  three-dot button next to Configure Xml Tree to
  open the interface that helps to create the XML structure.
- In the Link target area, click
  rootTag and rename it as
  ProductFamily, which is the name of the business entity
  used in this scenario.
- In the Linker source area, drop
  family to ProductFamily in the
  Link target area. A dialog box displays asking what type of operation you want to do.
- Select Create as sub-element of target node
  to create a sub-element of the ProductFamily node. Then the
  family element appears under the
  ProductFamily node.
- In the Link target area, click the family node and rename it as Name, which is one of the attributes of the ProductFamily business entity.
- Right-click the Name node and select Set As Loop Element from the contextual
  menu.
- Click OK to validate the XML structure you
  defined.
- Double-click tMDMBulkLoad to open its
  Basic settings view.
- Click the XML Field field and select
  xmlRecord from the drop-down list.
- In the URL field, enter the bulk loader URL,
  between quotes: for example,
  http://localhost:8080/datamanager/loadServlet.
- In the Username and Password fields, enter your login and password to connect to the
  MDM server.
- In the Data Model and the Data Container fields, enter the names corresponding
  to the data model and the data container you need to use. Both are
  Product for this scenario.
- In the Entity field, enter the name of the
  business entity into which the records are to be loaded. In this example, type in
  ProductFamily.
- Select the Generate ID check box in order to
  generate ID values for the records to be loaded.
- In the Commit size field, type in the batch
  size to be written into the MDM hub in bulk mode. For more information about the batch size, see
  Tips for better-performance bulk commit with Talend XML database.
- Press F6 to run the Job.
- Log into your Talend MDM Web User Interface to check the newly
  added records for the ProductFamily business entity.
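To picture the XML that the xmlRecord column is meant to carry for each sample row, the following Java sketch builds one ProductFamily element with a Name child per family value, using the standard JAXP DOM and Transformer APIs. It is an illustration only, assuming one XML document per input row; it is not the code that tWriteXMLField generates, and the class and method names are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical illustration of the XML record built for each input row.
final class ProductFamilyRecordSketch {

    // Builds a ProductFamily element with a single Name child and serializes it.
    static String toXmlRecord(String family) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element entity = doc.createElement("ProductFamily");
            Element name = doc.createElement("Name");
            name.setTextContent(family);
            entity.appendChild(name);
            doc.appendChild(entity);

            // Serialize without the XML declaration so only the record remains.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not build the XML record", e);
        }
    }

    public static void main(String[] args) {
        // The four sample values entered in the inline table of tFixedFlowInput.
        for (String family : new String[] {"Shirts", "Hats", "Pets", "Mugs"}) {
            System.out.println(toXmlRecord(family));
        }
    }
}
```

The Id attribute is left out on purpose, since the scenario relies on the Generate ID check box to produce it.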