Defining data lineage with Atlas
If you are using Hortonworks Data Platform V2.4 onwards to run your MapReduce or Spark
Batch Jobs and Apache Atlas has been installed in your Hortonworks cluster, you can
use Atlas to trace the lineage of a given data flow and discover how this data was
generated by a Job.
This lineage includes the components used in the Job and the schema changes between those components.
If you are using Hortonworks Data Platform V2.4, the Studio supports Atlas 0.5
only; if you are using Hortonworks Data Platform V2.5, the Studio supports Atlas 0.7
only.
With this option activated, you need to set the following parameters:

- Atlas URL: enter the location of the Atlas server to be connected to. It is often http://name_of_your_atlas_node:port.
- Username and Password: enter the authentication information for access to Atlas.
- Set Atlas configuration folder: if your Atlas cluster contains custom properties such as SSL or read timeout, select this check box and, in the displayed field, enter a directory on your local machine, then place the atlas-application.properties file of your Atlas in this directory. This way, your Job is enabled to use these custom properties. You need to ask the administrator of your cluster for this configuration file. For further information about this file, see the Client Configs section in the Atlas configuration documentation.
- Die on error: select this check box to stop the Job execution when Atlas-related issues occur, such as connection issues to Atlas. Otherwise, clear it to allow your Job to continue to run.
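As an illustration of the kind of custom properties such a file can carry, the sketch below shows what an atlas-application.properties file might contain. The exact contents depend on your cluster, so the property names and values here are examples only; always use the file provided by your cluster administrator.

```properties
# Example only -- obtain the real file from your cluster administrator.
# Atlas REST endpoint (adjust host and port for your cluster)
atlas.rest.address=http://name_of_your_atlas_node:21000
# Client-side timeouts in milliseconds (read timeout customization)
atlas.client.connectTimeoutMSecs=60000
atlas.client.readTimeoutMSecs=60000
# Enable TLS if your Atlas endpoint is served over HTTPS
atlas.enableTLS=true
```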
When you run this Job, the lineage is automatically generated in Atlas.
Once the execution of the Job is done, perform a search in Atlas for the lineage information written by this Job and read the lineage there. This lineage covers:
- the Job itself
- the components in the Job that use data schemas, such as tRowGenerator or tSortRow. The connection or configuration components, such as tHDFSConfiguration, are not taken into account since these components do not use schemas.
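Besides browsing the Atlas web UI, you can also query the lineage information programmatically. The sketch below builds a DSL search URL against the Atlas REST API of the Atlas 0.7 era; the entity type (Process) and the Job name are assumptions, so verify them in your own Atlas instance first.

```python
from urllib.parse import urlencode

# Assumed values: replace with your Atlas host/port and your Job's name.
atlas_url = "http://name_of_your_atlas_node:21000"
job_name = "my_job_name"  # hypothetical Job name, for illustration

# DSL search for a Process entity by name. The exact entity type under
# which a Job is registered may differ in your Atlas version, so check
# it in the Atlas UI before relying on this query.
params = urlencode({"query": f"from Process where name = '{job_name}'"})
search_url = f"{atlas_url}/api/atlas/discovery/search/dsl?{params}"

# An HTTP GET on search_url, authenticated with the same username and
# password configured above (HTTP basic auth), returns matching
# entities as JSON; their GUIDs can then be used to fetch lineage.
print(search_url)
```

This only constructs the request URL; issuing the actual GET and walking the returned lineage graph is left to your HTTP client of choice.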