August 15, 2023

tCassandraInput – Docs for ESB 6.x

tCassandraInput

Extracts the desired data from a standard or super column family of a Cassandra
keyspace so that you can apply changes to the data.

tCassandraInput allows
you to read data from a Cassandra keyspace and send the data into the Talend flow.

Depending on the Talend solution you
are using, this component can be used in one, some or all of the following Job
frameworks: Standard and Spark Batch.

Mapping tables between Cassandra type and Talend data type

The first of the following two tables presents the mapping relationships between
Cassandra types, using Cassandra's new API, Datastax, and Talend data types.

Cassandra 2.0 or later versions

Cassandra Type      Talend Data Type
------------------  ---------------------
Ascii               String; Character
BigInt              Long
Blob                Byte[]
Boolean             Boolean
Counter             Long
Inet                Object
Int                 Integer; Short; Byte
List                List
Map                 Object
Set                 Object
Text                String; Character
Timestamp           Date
UUID                String
TimeUUID            String
VarChar             String; Character
VarInt              Object
Float               Float
Double              Double
Decimal             BigDecimal
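For quick reference outside the Studio, the Datastax-API mapping above can be expressed as a simple lookup table. This is an illustrative sketch only, not a Talend API; the first entry in each list is the default Talend type offered by the Studio.

```python
# Sketch of the Cassandra (Datastax API) -> Talend data type mapping above.
# Purely illustrative: Talend itself resolves these types in generated Java code.
CASSANDRA_TO_TALEND = {
    "Ascii":     ["String", "Character"],
    "BigInt":    ["Long"],
    "Blob":      ["Byte[]"],
    "Boolean":   ["Boolean"],
    "Counter":   ["Long"],
    "Inet":      ["Object"],
    "Int":       ["Integer", "Short", "Byte"],
    "List":      ["List"],
    "Map":       ["Object"],
    "Set":       ["Object"],
    "Text":      ["String", "Character"],
    "Timestamp": ["Date"],
    "UUID":      ["String"],
    "TimeUUID":  ["String"],
    "VarChar":   ["String", "Character"],
    "VarInt":    ["Object"],
    "Float":     ["Float"],
    "Double":    ["Double"],
    "Decimal":   ["BigDecimal"],
}

def default_talend_type(cassandra_type):
    """Return the default Talend type for a given Cassandra type."""
    return CASSANDRA_TO_TALEND[cassandra_type][0]
```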

Cassandra Hector API (for Cassandra versions older than 2.0)

The following table presents the mapping relationships between Cassandra types, using the Hector API, and Talend data types.

Cassandra Type      Talend Data Type
------------------  ----------------
BytesType           byte[]
AsciiType           String
UTF8Type            String
IntegerType         Object
Int32Type           Integer
LongType            Long
UUIDType            String
TimeUUIDType        String
DateType            Date
BooleanType         Boolean
FloatType           Float
DoubleType          Double
DecimalType         BigDecimal

tCassandraInput Standard properties

These properties are used to configure tCassandraInput running in the Standard Job framework.

The Standard
tCassandraInput component belongs to the Big Data and the Databases families.

The component in this framework is available when you are using one of the Talend solutions with Big Data.

Basic settings

Property type

Either Built-In or Repository.

Built-In: No property data stored centrally.

Repository: Select the repository file where the
properties are stored.

Use existing connection

Select this check box and in the Component
List
click the relevant connection component to reuse the connection
details you already defined.

DB Version

Select the Cassandra version you are using.

API type

This drop-down list is displayed only when you have selected the 2.0 version of Cassandra
from the DB version list. From this API type list, you can either select Datastax to use CQL 3 (Cassandra Query Language) with Cassandra, or select
Hector to use CQL 2.

Note that the Hector API is deprecated for Cassandra 2.0 and later versions, but it
remains available in the Studio so that you can choose which version of the query
language to use with Cassandra 2.0.

Because the CQL commands differ between the two APIs, the parameters to be set in the Basic settings view vary accordingly.

Host

Hostname or IP address of the Cassandra server.

Port

Listening port number of the Cassandra server.

Required authentication

Select this check box to provide credentials for the Cassandra
authentication.

This check box appears only if you do not select the Use existing connection check box.

Username

Fill in this field with the username for the Cassandra
authentication.

Password

Fill in this field with the password for the Cassandra
authentication.

To enter the password, click the […] button next to the
password field, and then in the pop-up dialog box enter the password between double quotes
and click OK to save the settings.

Keyspace

Type in the name of the keyspace from which you want to read data.

Column family

Type in the name of the column family from which you want to read data.

Schema and
Edit Schema

A schema is a row description. It defines the number of fields (columns) to
be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Edit schema to make changes to the schema.
If the current schema is of the Repository type, three
options are available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this
    option to change the schema to Built-in for
    local changes.

  • Update repository connection: choose this
    option to change the schema stored in the repository and decide whether to propagate
    the changes to all the Jobs upon completion. If you just want to propagate the
    changes to the current Job, you can select No
    upon completion and choose this schema metadata again in the [Repository Content] window.

Query

Enter the query statements to be used to read data from the Cassandra
database.

By default, the query is not case-sensitive. This means that at runtime, the column names
you put in the query are always taken in lower case. If you need to make the query
case-sensitive, put the column names in double quotation marks.

The […] button next to this field
allows you to generate the sample code that shows what the pre-defined
variables are for the data to be read and how these variables can be
used.

This feature is available only for the Datastax API of Cassandra 2.0 or a later version.
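The case-sensitivity rule above follows standard CQL behavior: unquoted identifiers are lower-cased, while double-quoted identifiers keep their exact case. The sketch below illustrates this rule; it is not Talend code, and note that in the Studio the Query field holds a Java string, so embedded double quotes must themselves be escaped there.

```python
def cql_identifier(name, case_sensitive=False):
    """Quote a CQL identifier; Cassandra lower-cases unquoted identifiers."""
    if case_sensitive:
        return '"%s"' % name   # "ManagerID" keeps its exact case
    return name.lower()        # ManagerID is treated as managerid

def build_select(columns, table, case_sensitive=False):
    """Build a SELECT statement, quoting column names if requested."""
    cols = ", ".join(cql_identifier(c, case_sensitive) for c in columns)
    return "SELECT %s FROM %s" % (cols, table)
```

For example, `build_select(["ManagerID"], "Employee_Info")` yields `SELECT managerid FROM Employee_Info`, whereas passing `case_sensitive=True` yields `SELECT "ManagerID" FROM Employee_Info`.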

Column family type

Standard: Column family is of
standard type.

Super: Column family is of super
type.

Include key in output
columns

Select this check box to include the key of the column family in
output columns.

  • Key column: select the key
    column from the list.

Row key type

Select the appropriate Talend data type for the row key from the list.

Row key Cassandra type

Select the corresponding Cassandra type for the row key from the
list.

Warning:

The value of the Default option
varies with the selected row key type. For example, if you select
String from the Row key type list, the value of the
Default option will be
UTF8.

For more information about the mapping between Cassandra types and Talend data types, see Mapping tables between Cassandra type and Talend data type.

Include super key in output
columns

Select this check box to include the super key of the column family in
output columns.

  • Super key column: select the
    desired super key column from the list.

This check box appears only if you select Super from the Column family
type
drop-down list.

Super column type

Select the type of the super column from the list.

Super column Cassandra type

Select the corresponding Cassandra type for the super column from the
list.

For more information about the mapping table between Cassandra type
and Talend data type, see Mapping tables between Cassandra type and Talend data type.

Specify row keys

Select this check box to specify the row keys of the column family
directly.

Row Keys

Type in the specific row keys of the column family in the correct
format depending on the row key type.

This field appears only if you select the Specify row keys check box.

Key start

Type in the start row key of the correct data type.

Key end

Type in the end row key of the correct data type.

Key limit

Type in the number of rows to be read between the start row key and
the end row key.
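The interplay of Key start, Key end and Key limit can be pictured with the following sketch. It is illustrative only: the actual traversal order of row keys in Cassandra depends on the configured partitioner, and the function below simply assumes sorted order.

```python
def slice_rows(row_keys, key_start, key_end, key_limit):
    """Return at most key_limit keys between key_start and key_end
    (inclusive), assuming keys are traversed in sorted order."""
    selected = [k for k in sorted(row_keys) if key_start <= k <= key_end]
    return selected[:key_limit]

# e.g. slice_rows([5, 1, 3, 2, 4], key_start=2, key_end=5, key_limit=2) -> [2, 3]
```

The Columns range start, Columns range end and Columns range limit fields described below apply the same logic to column names instead of row keys.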

Specify columns

Select this check box to specify the column names of the column family
directly.

Columns

Type in the specific column names of the column family in the correct
format depending on the column type.

This field appears only if you select the Specify columns check box.

Columns range start

Type in the start column name of the correct data type.

Columns range end

Type in the end column name of the correct data type.

Columns range limit

Type in the number of columns to be read between the start column and
the end column.

Advanced settings

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at the Job
level as well as at each component level.

Global Variables

Global Variables

NB_LINE: the number of rows read by an input component or
transferred to an output component. This is an After variable and it returns an
integer.

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see the Talend Studio User Guide.

Usage

Usage rule

This component always needs an output link.

Scenario: Handling data with Cassandra

This scenario applies only to a Talend solution with Big Data.

This scenario describes a simple Job that reads the employee data from a CSV file,
writes the data to a Cassandra keyspace, then extracts the personal information of some
employees and displays the information on the console.

use_case_cassandrainput.png
This scenario requires six components, which are:

  • tCassandraConnection: opens a connection
    to the Cassandra server.

  • tFileInputDelimited: reads the input
    file, defines the data structure and sends it to the next component.

  • tCassandraOutput: writes the data it
    receives from the preceding component into a Cassandra keyspace.

  • tCassandraInput: reads the data from the
    Cassandra keyspace.

  • tLogRow: displays the data it receives
    from the preceding component on the console.

  • tCassandraClose: closes the connection to
    the Cassandra server.

Dropping and linking the components

  1. Drop the following components from the Palette onto the design workspace: tCassandraConnection, tFileInputDelimited, tCassandraOutput, tCassandraInput, tLogRow
    and tCassandraClose.
  2. Connect tFileInputDelimited to tCassandraOutput using a Row > Main link.
  3. Do the same to connect tCassandraInput to
    tLogRow.
  4. Connect tCassandraConnection to tFileInputDelimited using a Trigger > OnSubjobOk
    link.
  5. Do the same to connect tFileInputDelimited to tCassandraInput and tCassandraInput to tCassandraClose.
  6. Label the components to better identify their functions.

Configuring the components

Opening a Cassandra connection

  1. Double-click the tCassandraConnection
    component to open its Basic settings view
    in the Component tab.

    use_case_cassandrainput1.png

  2. Select the Cassandra version that you are using from the DB Version list. In this example, it is Cassandra 1.1.2.
  3. In the Server field, type in the hostname
    or IP address of the Cassandra server. In this example, it is localhost.
  4. In the Port field, type in the listening
    port number of the Cassandra server.
  5. If required, type in the authentication information for the Cassandra
    connection: Username and Password.

Reading the input data

  1. Double-click the tFileInputDelimited
    component to open its Component view.

    use_case_cassandrainput2.png

  2. Click the […] button next to the
    File Name/Stream field to browse to the
    file that you want to read data from. In this scenario, the file is
    D:/Input/Employees.csv. The CSV file
    contains four columns: id, age, name
    and ManagerID:

    id;age;name;ManagerID
    1;20;Alex;1
    2;40;Peter;1
    3;25;Mark;1
    4;26;Michael;1
    5;30;Christophe;2
    6;26;Stephane;3
    7;37;Cedric;3
    8;52;Bill;4
    9;43;Jack;2
    10;28;Andrews;4

  3. In the Header field, enter 1 so that the first row in the CSV file will be
    skipped.
  4. Click Edit schema to define the data to
    pass on to the tCassandraOutput component.

    use_case_cassandrainput7.png
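The effect of the Header setting on the sample file can be reproduced outside the Studio with plain Python. This is a sketch for illustration only (tFileInputDelimited itself generates Java code), using an excerpt of the sample data above.

```python
import csv
import io

# Excerpt of D:/Input/Employees.csv from the scenario above.
SAMPLE = """id;age;name;ManagerID
1;20;Alex;1
2;40;Peter;1
3;25;Mark;1
"""

def read_delimited(text, header=1, delimiter=";"):
    """Skip `header` rows, then return the remaining rows as lists of
    fields, mimicking the Header field of tFileInputDelimited."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    return rows[header:]

# read_delimited(SAMPLE) skips the first row, so only data rows remain.
```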

Writing data to a Cassandra keyspace

  1. Double-click the tCassandraOutput
    component to open its Basic settings view
    in the Component tab.

    use_case_cassandrainput3.png

  2. Type in the required information for the connection or use the existing
    connection you have configured before. In this scenario, the Use existing connection check box is
    selected.
  3. In the Keyspace configuration area, type
    in the name of the keyspace: Employee in
    this example, and select Drop keyspace if exists and
    create
    from the Action on
    keyspace
    list.
  4. In the Column family configuration area,
    type in the name of the column family: Employee_Info in this example, and select Drop column family if exists and create from
    the Action on column family list.

    The Define column family structure check
    box appears. In this example, clear this check box.
  5. In the Action on data list, select the
    action you want to carry on, Upsert in
    this example.
  6. Click Sync columns to retrieve the schema
    from the preceding component.
  7. Select the key column of the column family from the Key column list. In this example, it is id.

    If needed, select the Include key in
    columns
    check box.
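The "Drop ... if exists and create" actions chosen above correspond roughly to the CQL statements sketched below. This is a hypothetical approximation of what the component issues, not its actual generated code; the column types and the replication settings are assumptions.

```python
def drop_and_create_ddl(keyspace, table, columns, key_column):
    """Build CQL that approximates 'Drop ... if exists and create' for a
    keyspace and column family (illustrative; types are assumptions)."""
    cols = ", ".join("%s %s" % (name, ctype) for name, ctype in columns)
    return [
        "DROP KEYSPACE IF EXISTS %s" % keyspace,
        "CREATE KEYSPACE %s WITH replication = "
        "{'class': 'SimpleStrategy', 'replication_factor': 1}" % keyspace,
        "CREATE TABLE %s.%s (%s, PRIMARY KEY (%s))"
        % (keyspace, table, cols, key_column),
    ]

# The scenario's keyspace and column family, with assumed column types.
statements = drop_and_create_ddl(
    "Employee", "Employee_Info",
    [("id", "int"), ("age", "int"), ("name", "text"), ("managerid", "int")],
    "id",
)
```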

Reading data from the Cassandra keyspace

  1. Double-click the tCassandraInput
    component to open its Component
    view.

    use_case_cassandrainput4.png

  2. Type in the required information for the connection or use the existing
    connection you have configured before. In this scenario, the Use existing connection check box is
    selected.
  3. In the Keyspace configuration area, type
    in the name of the keyspace: Employee in
    this example.
  4. In the Column family configuration area,
    type in the name of the column family: Employee_Info in this example.
  5. Select Edit schema to define the data
    structure to be read from the Cassandra keyspace. In this example, three
    columns id, name and age are
    defined.

    components-use_case_tcassandrainput_schema.png

  6. If needed, select the Include key in output
    columns
    check box, and then select the key column of the
    column family you want to include from the Key
    column
    list.
  7. From the Row key type list, select
    Integer because id is of integer type in this example.

    Keep the Default option for the row key
    Cassandra type because its value will become the corresponding Cassandra
    type Int32 automatically.
  8. In the Query configuration area, select
    the Specify row keys check box and specify
    the row keys directly. In this example, three rows will be read. Next,
    select the Specify columns check box and
    specify the column names of the column family directly. This scenario will
    read three columns from the keyspace: id,
    name and age.
  9. If needed, the Key start and the
    Key end fields allow you to define the
    range of rows, and the Key limit field
    allows you to specify the number of rows within the range of rows to be
    read. Similarly, the Columns range start
    and the Columns range end fields allow you
    to define the range of columns of the column family, and the Columns range limit field allows you to specify
    the number of columns within the range of columns to be read.
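Taken together, the tCassandraInput settings configured above behave roughly like the CQL query sketched below. This is an approximation for illustration only: the actual row keys are whatever you entered in the Row Keys table, and the ids used here are assumptions.

```python
def build_input_query(keyspace, table, columns, key_column, row_keys):
    """Approximate the CQL described by the Specify row keys and
    Specify columns settings (illustrative sketch only)."""
    return "SELECT %s FROM %s.%s WHERE %s IN (%s)" % (
        ", ".join(columns), keyspace, table, key_column,
        ", ".join(str(k) for k in row_keys),
    )

# Assumed row keys 1, 2, 3 for the three rows read in this scenario.
query = build_input_query("Employee", "Employee_Info",
                          ["id", "name", "age"], "id", [1, 2, 3])
# -> SELECT id, name, age FROM Employee.Employee_Info WHERE id IN (1, 2, 3)
```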

Displaying the information of interest

  1. Double-click the tLogRow component to
    open its Component view.
  2. In the Mode area, select Table (print values in cells of a table).

Closing the Cassandra connection

  1. Double-click the tCassandraClose
    component to open its Component
    view.

    use_case_cassandrainput5.png

  2. Select the connection to be closed from the Component List.

Saving and executing the Job

  1. Press Ctrl+S to save your Job.
  2. Execute the Job by pressing F6 or
    clicking Run on the Run tab.

    The personal information of three employees is displayed on the
    console.
    use_case_cassandrainput6.png

tCassandraInput properties for Apache Spark Batch

These properties are used to configure tCassandraInput running in the Spark Batch Job framework.

The Spark Batch
tCassandraInput component belongs to the Databases family.

The component in this framework is available only if you have subscribed to one of the Talend solutions with Big Data.

Basic settings

Schema and
Edit Schema

A schema is a row description. It defines the number of fields (columns) to
be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Edit schema to make changes to the schema.
If the current schema is of the Repository type, three
options are available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this
    option to change the schema to Built-in for
    local changes.

  • Update repository connection: choose this
    option to change the schema stored in the repository and decide whether to propagate
    the changes to all the Jobs upon completion. If you just want to propagate the
    changes to the current Job, you can select No
    upon completion and choose this schema metadata again in the [Repository Content] window.

The schema of this component does not support the Object type and the List type.

Keyspace

Type in the name of the keyspace from which you want to read data.

Column family

Type in the name of the column family from which you want to read data.

Selected column function

Select the columns about which you need to retrieve the TTL (time to live) or the writeTime
property.

The TTL property determines the time for records in a column to expire; the writeTime
property indicates the time when a record was created.

For further information about these properties, see Datastax’s documentation for Cassandra
CQL.
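In CQL, the properties that the Selected column function setting relies on are exposed through the TTL and WRITETIME functions, as in the sketch below. The table and column names here are placeholders, not values from the component.

```python
def build_ttl_query(table, column):
    """Select a column together with its TTL and write time, using the
    standard CQL TTL() and WRITETIME() functions."""
    return "SELECT %s, TTL(%s), WRITETIME(%s) FROM %s" % (
        column, column, column, table)

# build_ttl_query("Employee_Info", "name")
# -> SELECT name, TTL(name), WRITETIME(name) FROM Employee_Info
```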

Filter function

Define the filters you need to use to select the records to be processed.

The component generates a WHERE ... ALLOW FILTERING clause from the filters you define, so
this filter function is subject to the limitations of that Cassandra clause.

Order by clustering column

Select how you need to sort the retrieved records. You can select NONE so as not to sort the data.

Use limit

Select this check box to display the Limit per partition
field, in which you enter the number of rows to be retrieved, starting from the first
row.
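Together, the Filter function, Order by clustering column and Use limit settings describe a query of the shape sketched below. The clause ordering follows CQL syntax (ALLOW FILTERING comes last, and PER PARTITION LIMIT requires a sufficiently recent Cassandra version); this is a sketch of the query shape, not the component's generated code.

```python
def build_filtered_query(table, columns, filters=(), order_by=None,
                         per_partition_limit=None):
    """Sketch of the query shape implied by the Filter function,
    Order by clustering column and Use limit settings."""
    parts = ["SELECT %s FROM %s" % (", ".join(columns), table)]
    if filters:
        parts.append("WHERE %s" % " AND ".join(filters))
    if order_by:
        parts.append("ORDER BY %s" % order_by)
    if per_partition_limit is not None:
        parts.append("PER PARTITION LIMIT %d" % per_partition_limit)
    if filters:
        # ALLOW FILTERING is the final clause in CQL.
        parts.append("ALLOW FILTERING")
    return " ".join(parts)
```

For example, `build_filtered_query("t", ["a"], ["a > 5"], per_partition_limit=10)` produces `SELECT a FROM t WHERE a > 5 PER PARTITION LIMIT 10 ALLOW FILTERING`.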

Usage

Usage rule

This component is used as a start component and requires an output link.

This component must use one and only one tCassandraConfiguration component present in the same Job to connect to
Cassandra. The presence of more than one tCassandraConfiguration component
in the same Job causes the execution of the Job to fail.

This component, along with the Spark Batch component Palette it belongs to, appears only
when you are creating a Spark Batch Job.

Note that in this documentation, unless otherwise
explicitly stated, a scenario presents only Standard Jobs,
that is to say traditional
Talend
data integration Jobs.

Spark Connection

You need to use the Spark Configuration tab in
the Run view to define the connection to a given
Spark cluster for the whole Job. In addition, since the Job expects its dependent jar
files for execution, you must specify the directory in the file system to which these
jar files are transferred so that Spark can access these files:

  • Yarn mode: when using Google
    Dataproc, specify a bucket in the Google Storage staging
    bucket
    field in the Spark
    configuration
    tab; when using other distributions, use a
    tHDFSConfiguration
    component to specify the directory.

  • Standalone mode: you need to choose
    the configuration component depending on the file system you are using, such
    as tHDFSConfiguration
    or tS3Configuration.

This connection is effective on a per-Job basis.

Related scenarios

For a scenario about how to use the same type of component in a Spark Batch Job, see Writing and reading data from MongoDB using a Spark Batch Job.


Document from Talend: https://help.talend.com