July 30, 2023

Defining data lineage with Cloudera Navigator – Docs for ESB 7.x

If you run your MapReduce or Apache Spark Batch Jobs on Cloudera V5.5 or later, you can use Cloudera Navigator to trace the lineage of a given data flow and discover how a Job generated it.

This lineage includes the components used in this Job and the schema changes between the components.

In the configuration view of the Run tab (the Hadoop configuration view for a MapReduce Job, or the Spark configuration view for a Spark Batch Job), select the Use Cloudera Navigator check box.

With this option activated, you need to set the following parameters:

  • Username and Password: enter the credentials you use to connect to your Cloudera
    Navigator.

  • Cloudera Navigator URL: enter the location of the
    Cloudera Navigator to connect to.

  • Cloudera Navigator Metadata URL: enter the location
    of the Navigator Metadata.

  • Activate the autocommit option: select this check box
    to make Cloudera Navigator generate the lineage of the current Job at the end of the
    execution of this Job.

    Because this option forces Cloudera Navigator to generate lineages for
    all of its available entities, such as HDFS files and directories, Hive queries or Pig
    scripts, it slows the Job and is not recommended in a production environment.

  • Kill the job if Cloudera Navigator fails: select this check
    box to stop the execution of the Job when the connection to your Cloudera Navigator fails.

    Otherwise, leave it clear to allow your Job to continue to run.

  • Disable SSL validation: select this check box to
    make your Job connect to Cloudera Navigator without the SSL validation
    process.

    This feature is meant to make testing your Job easier and is not
    recommended in a production cluster.
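
The parameters above can be summarized in a small pre-flight check. The following is only a minimal sketch in Python, not part of the product: the `NavigatorSettings` class, the `review_settings` function and the `production` flag are hypothetical names introduced for illustration, and the warnings simply restate the recommendations given in the list.

```python
from dataclasses import dataclass
from typing import List
from urllib.parse import urlparse


@dataclass
class NavigatorSettings:
    """Hypothetical container mirroring the fields and check boxes listed above."""
    username: str
    password: str
    navigator_url: str                    # Cloudera Navigator URL
    metadata_url: str                     # Cloudera Navigator Metadata URL
    autocommit: bool = False              # Activate the autocommit option
    kill_job_on_failure: bool = False     # Kill the job if Cloudera Navigator fails
    disable_ssl_validation: bool = False  # Disable SSL validation


def review_settings(settings: NavigatorSettings, production: bool) -> List[str]:
    """Return warnings for the given settings; an empty list means nothing was flagged."""
    warnings = []
    urls = (
        ("Cloudera Navigator URL", settings.navigator_url),
        ("Cloudera Navigator Metadata URL", settings.metadata_url),
    )
    for label, url in urls:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            warnings.append(f"{label} is not a valid http(s) URL: {url!r}")
    if not settings.username or not settings.password:
        warnings.append("Username and Password must both be set")
    # The two options below are described above as unsuitable for production.
    if production and settings.autocommit:
        warnings.append("autocommit slows the Job; not recommended in production")
    if production and settings.disable_ssl_validation:
        warnings.append("SSL validation is disabled; not recommended in production")
    return warnings
```

Calling `review_settings(settings, production=True)` before promoting a Job is one way to catch test-only options that were left selected.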

When you run this Job, the lineage is automatically generated in Cloudera Navigator.

When the Job has finished, search Cloudera Navigator for the data written by this Job to see the lineage of this data.
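
The search can also be prepared from a script rather than the Navigator web interface. The sketch below only builds a search URL and sends nothing. It assumes that your Navigator instance exposes a REST entity search at `/api/<version>/entities` with a `query` parameter; the API version `v9`, the `originalName` search field, the host name and the helper name `build_entity_search_url` are all assumptions to check against the API documentation of your own Navigator version.

```python
from urllib.parse import urlencode


def build_entity_search_url(navigator_url: str, query: str,
                            api_version: str = "v9", limit: int = 10) -> str:
    """Build a search URL for entities known to Cloudera Navigator.

    The path layout and the api_version default are assumptions, not values
    taken from this page; adjust them to match your Navigator installation.
    """
    base = navigator_url.rstrip("/")
    # urlencode percent-encodes reserved characters such as ":" in the query.
    params = urlencode({"query": query, "limit": limit})
    return f"{base}/api/{api_version}/entities?{params}"


# Hypothetical example: look up an output named "customers_out".
url = build_entity_search_url("https://nav.example.com:7187/",
                              "originalName:customers_out")
print(url)
```

The resulting URL can then be requested with any HTTP client, using the same Username and Password that are set in the Job configuration.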


Document sourced from Talend: https://help.talend.com