Defining data lineage with Cloudera Navigator
If you are using Cloudera V5.5+ to run your MapReduce or Apache Spark Batch Jobs, you can use Cloudera Navigator to trace the lineage of a given data flow and discover how that data flow was generated by a Job.
This lineage includes the components used in this Job and the schema changes between the components.
With this option activated, you need to set the following parameters:
- Username and Password: enter the credentials you use to connect to your Cloudera Navigator.
- Cloudera Navigator URL: enter the location of the Cloudera Navigator to be connected to.
- Cloudera Navigator Metadata URL: enter the location of the Navigator Metadata.
- Activate the autocommit option: select this check box to make Cloudera Navigator generate the lineage of the current Job at the end of the execution of this Job. Since this option actually forces Cloudera Navigator to generate lineages of all its available entities, such as HDFS files and directories, Hive queries, or Pig scripts, it is not recommended for the production environment because it will slow the Job.
- Kill the job if Cloudera Navigator fails: select this check box to stop the execution of the Job when the connection to your Cloudera Navigator fails. Otherwise, leave it clear to allow your Job to continue to run.
- Disable SSL validation: select this check box to make your Job connect to Cloudera Navigator without the SSL validation process. This feature is meant to facilitate the test of your Job but is not recommended to be used in a production cluster.
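As a rough illustration of how these parameters relate to each other, the sketch below collects them in a small Python structure and flags the two options that are not recommended for production. The class name, field names, host names, and credentials are all hypothetical placeholders chosen for this example; they are not part of the product or of any Cloudera API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NavigatorSettings:
    """Hypothetical container mirroring the parameters listed above."""
    username: str
    password: str
    navigator_url: str           # "Cloudera Navigator URL"
    metadata_url: str            # "Cloudera Navigator Metadata URL"
    autocommit: bool = False     # "Activate the autocommit option"
    kill_job_on_failure: bool = False
    disable_ssl_validation: bool = False

    def production_warnings(self) -> List[str]:
        """Return a note for each option the text advises against in production."""
        warnings = []
        if self.autocommit:
            warnings.append("autocommit slows the Job; not recommended in production")
        if self.disable_ssl_validation:
            warnings.append("SSL validation is disabled; intended for testing only")
        return warnings


# Placeholder values, assumed for illustration only.
settings = NavigatorSettings(
    username="nav_user",
    password="nav_password",
    navigator_url="http://navigator.example.com:7187",
    metadata_url="http://navigator.example.com:7187/api",
    autocommit=True,
)
for warning in settings.production_warnings():
    print(warning)
```

A check of this kind could be run before deploying a Job so that test-only options are not carried into a production cluster unnoticed.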
When you run this Job, the lineage will be automatically generated in Cloudera Navigator.
When the execution of the Job is done, search Cloudera Navigator for the data written by this Job to see the lineage of this data.
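The search can also be scripted rather than done in the Navigator user interface. The following is a minimal sketch, assuming the Navigator Metadata Server exposes a REST `entities` endpoint with a `query` parameter and accepts HTTP Basic authentication. The host, the API version `v9`, the `originalName` search field, the output name `customers_out`, and the credentials are assumptions made for this example, so check them against your own Navigator installation. The code only builds the request and does not send it.

```python
import base64
import urllib.parse
import urllib.request


def build_entity_search(metadata_url, api_version, query, username, password):
    """Build (but do not send) a search request for Navigator entities."""
    params = urllib.parse.urlencode({"query": query, "limit": 10})
    url = f"{metadata_url.rstrip('/')}/{api_version}/entities?{params}"
    # HTTP Basic authentication: base64-encoded "username:password".
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


# Placeholder values, assumed for illustration only.
request = build_entity_search(
    metadata_url="http://navigator.example.com:7187/api",
    api_version="v9",
    query="originalName:customers_out",
    username="nav_user",
    password="nav_password",
)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. The entities returned could then be looked up in Cloudera Navigator to view their lineage.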