Five Steps to an Awesome Data Model in Apache Cassandra™ ― Scotch.io
November 18, 2019
Congratulations, you're starting out on the Cassandra distributed database, a favorite choice among architects and developers for its performance, scalability, continuous uptime, and global data distribution. Whether you plan on using it for ecommerce, IoT, fraud, or anything else, it's important not only to understand the database itself, but also how to create the right data model to fit your application's performance and scalability requirements.
Fixing a poorly designed data model after an application is in production is an experience that nobody wants to go through. Therefore, it's better to take some time upfront and use a proven methodology. And, that's exactly what you'll learn here. We've broken it down into five steps:
Step 1: Understand your application workflow
Step 2: Model the queries required by the application
Step 3: Design the tables
Step 4: Determine primary keys
Step 5: Use the right data types effectively
Data Modeling in Cassandra vs. Relational Databases
You're likely already familiar with relational databases (RDBMS) such as Oracle, MySQL, and PostgreSQL, so let's start with how Cassandra differs from relational databases when it comes to data modeling:
- Denormalization is expected. With relational databases, designers are usually encouraged to store data in a normalized form. In Cassandra, storing the same data redundantly in multiple tables is a feature of a good data model.
- Writes are (almost) free. Due to Cassandra's architecture, writes are shockingly fast compared to relational databases.
- No joins. Relational database users usually reference fields from multiple tables in a single query by joining tables. With Cassandra, this functionality doesn't exist, so developers must structure their data model to provide equivalent functionality.
- Consistency is tunable. Relational databases typically provide strong, transactional consistency. Cassandra lets you choose, for each read or write, how many replicas must respond (for example, ONE, QUORUM, or ALL), trading consistency against latency and availability.
- Indexing is different. With relational databases, queries are usually optimized by simply creating an index on a field. In Cassandra, tables are usually designed to support specific queries, and secondary indexes are useful only in specific circumstances, rather than being a "silver bullet."
How Cassandra Stores Data
Cassandra clusters have multiple nodes, and data is typically stored redundantly across those nodes so that the database continues to operate even when nodes are down. Rows in a table are spread across the cluster at a location determined by the partition key: the key is hashed to a token, and the token identifies the nodes where the data and its replicas are stored. A Cassandra cluster can be conceptually represented as a ring, where each node is responsible for storing the data whose tokens fall in a given range.
Queries that look up records based on the partition key are extremely fast because Cassandra can immediately determine the host holding required data. Since clusters can potentially have hundreds or even thousands of nodes, Cassandra can handle many simultaneous queries because queries and data are distributed across cluster nodes.
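To make the ring idea concrete, here is a minimal Python sketch of how a partition key could be mapped to the nodes that hold it. It is a toy model, not Cassandra's implementation: MD5 stands in for the real partitioner, each node gets a single token, and the node names, the `Ring` class, and the replication factor of 3 are all assumptions made for illustration.

```python
import bisect
import hashlib


def token_for(partition_key: str) -> int:
    """Hash a key to a 64-bit token (MD5 is a stand-in for the real partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Toy token ring: each node owns the token range that ends at its own token."""

    def __init__(self, nodes, replication_factor=3):
        # Assumption: one token per node, derived from the node name.
        self.replication_factor = min(replication_factor, len(nodes))
        self.entries = sorted((token_for(name), name) for name in nodes)
        self.tokens = [token for token, _ in self.entries]

    def replicas(self, partition_key: str):
        """Return the owning node plus the next nodes clockwise on the ring."""
        token = token_for(partition_key)
        # First node whose token is >= the key's token; wrap around past the end.
        start = bisect.bisect_left(self.tokens, token) % len(self.entries)
        return [
            self.entries[(start + i) % len(self.entries)][1]
            for i in range(self.replication_factor)
        ]


ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
print(ring.replicas("video-1234"))
```

Because the lookup is a hash plus a binary search, any node can work out where a partition lives without asking the others, which is why partition key lookups stay fast as the cluster grows.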
Three Data Modeling Best Practices
- Spread data evenly around the cluster. For Cassandra to work optimally, data should be spread as evenly as possible across cluster nodes, which depends on selecting a good partition key.
- Minimize the number of partitions to read. When Cassandra reads data, it's best to read from as few partitions as possible to avoid impacting performance.
- Anticipate how data and requirements will grow. For example, would you design the data model differently if you had 100 versus millions of transactions per user?
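One rough way to compare candidate partition keys is to count how many rows each choice would place in each partition. The Python sketch below does this on synthetic data. The column names, the row counts, and the `skew` measure are assumptions chosen for illustration, not part of any Cassandra tooling.

```python
from collections import Counter


def partition_sizes(rows, partition_key):
    """Count how many rows land in each partition for a given partition key column."""
    return Counter(row[partition_key] for row in rows)


def skew(sizes):
    """Largest partition divided by the average partition size (1.0 means even)."""
    average = sum(sizes.values()) / len(sizes)
    return max(sizes.values()) / average


# Synthetic data: 1,000 transactions from 100 users, 90% of them from one country.
rows = [
    {"userid": f"user-{i % 100}", "country": "US" if i % 10 else "NZ"}
    for i in range(1000)
]

by_country = partition_sizes(rows, "country")
by_user = partition_sizes(rows, "userid")

print("country:", len(by_country), "partitions, skew", skew(by_country))
print("userid: ", len(by_user), "partitions, skew", skew(by_user))
```

A low-cardinality key such as country produces a few large, uneven partitions, while a high-cardinality key such as userid produces many small ones. The same check is worth repeating with projected volumes to see how partitions would grow over time.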
To learn more about Cassandra’s distributed architecture, and how data is stored, check out the free DataStax Academy courses. You will master Cassandra's internal architecture by studying the read path, write path, and compaction. Topics such as consistency, replication, anti-entropy operations, and gossip ensure you have a strong handle on the technology and the data modeling implications.
Five Steps to Building an Awesome Data Model
It’s always helpful to focus on a concrete example. In the sections that follow, data modeling will be discussed in the context of DataStax’s reference application, KillrVideo, an online video service.
Step 1: Understand your application workflow
With Cassandra, rather than start with the data model, the best practice is to start with the application workflow. This approach is referred to as "query-first design"—building your data model based on what types of queries the database will need to support. For example, in the KillrVideo example below, the sequence of workflow steps matters because it helps us determine that a userid and videoid are required to support subsequent queries, impacting table design.
Step 2: Model the queries required by the application
Taking a query-first approach means not only thinking through the sequence of tasks required, but also determining what data will be required and when. For example, the entity relationship diagram below shows the entities (users, videos, and comments), the data items, and the relationships required by the KillrVideo application. Once the application workflow and key data objects are identified, it's possible to determine the queries the application needs to support.
Step 3: Design the tables
In Cassandra, tables can be grouped into two distinct categories:
- Tables with single-row partitions. These types of tables have primary keys that are also partition keys. They are used to store entities and are usually normalized. It's recommended that these tables be named after the entity they store for easy reference (e.g., videos).
- Tables with multi-row partitions. These types of tables have primary keys that are composed of partition and clustering keys. They are used to store relationships and related entities. Remember that Cassandra doesn't support joins, so structure tables to support queries that relate to multiple data items.
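The difference between the two categories can be pictured with plain Python dictionaries. This is only an analogy: the `comments_by_video` table name, the column names, and the sample rows are hypothetical and used purely for illustration.

```python
# Single-row partitions: the primary key is just the partition key (videoid),
# so each partition holds exactly one entity.
videos = {
    "video-1": {"name": "Intro to Cassandra", "userid": "user-7"},
    "video-2": {"name": "Data Modeling 101", "userid": "user-9"},
}

# Multi-row partitions: partition key (videoid) plus a clustering key (commentid),
# so each partition holds many related rows.
comments_by_video = {
    "video-1": {
        "comment-002": {"userid": "user-3", "comment": "Very helpful"},
        "comment-001": {"userid": "user-9", "comment": "Great overview"},
    },
}


def get_video(videoid):
    """Entity lookup: one partition, one row."""
    return videos.get(videoid)


def get_comments(videoid):
    """Relationship lookup: one partition, many rows in clustering-key order."""
    partition = comments_by_video.get(videoid, {})
    return [partition[commentid] for commentid in sorted(partition)]


print(get_video("video-1")["name"])
print([row["comment"] for row in get_comments("video-1")])
```

Both lookups touch a single partition. What would be a join in a relational database is replaced here by a second table keyed for the query.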
The latest_videos table illustrates what is meant by "query-first design." The application will need to query the most recently uploaded videos every time a user visits the KillrVideo homepage, so this query needs to be very efficient.
The table is partitioned by day (yyyymmdd), and the clustering columns (added_date and videoid) specify how records are grouped and ordered within each partition. By designing the table in this fashion, queries will touch only one partition for the current day and possibly another partition for the day before. This level of optimization and efficiency helps explain how Cassandra can support applications with enormous numbers of queries over very large data sets.
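The read pattern can be sketched in Python as a simulation of one partition per day. It assumes rows are clustered newest first by added_date and then by videoid. The helper names `add_video` and `latest` and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Partition key (yyyymmdd) -> rows in that partition, kept in clustering order.
latest_videos = defaultdict(list)


def add_video(added_date: datetime, videoid: str, name: str) -> None:
    partition = latest_videos[added_date.strftime("%Y%m%d")]
    partition.append({"added_date": added_date, "videoid": videoid, "name": name})
    # Assumed clustering order: added_date descending, then videoid ascending.
    # Two stable sorts: secondary key first, then primary key.
    partition.sort(key=lambda row: row["videoid"])
    partition.sort(key=lambda row: row["added_date"], reverse=True)


def latest(now: datetime, limit: int):
    """Read today's partition, then yesterday's only if more rows are needed."""
    results = []
    for day in (now, now - timedelta(days=1)):
        results.extend(latest_videos.get(day.strftime("%Y%m%d"), []))
        if len(results) >= limit:
            break
    return results[:limit]


add_video(datetime(2019, 11, 17, 8, 0), "video-1", "Yesterday morning")
add_video(datetime(2019, 11, 17, 20, 0), "video-2", "Yesterday evening")
add_video(datetime(2019, 11, 18, 9, 0), "video-3", "Today")

now = datetime(2019, 11, 18, 12, 0)
print([row["videoid"] for row in latest(now, limit=2)])
```

Because rows are already stored in the order the homepage needs, the query never has to sort or scan. It reads the head of at most two partitions.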
Use a Chebotko Diagram to Represent Your Schema
A good tool for mapping the data model that supports an application is known as a Chebotko diagram. Designed to develop the logical and physical data models required to support the application, the Chebotko diagram captures the database schema, showing table names, partition key columns (K), clustering key columns (C) and their ordering, static columns (S), and regular columns with data types. The tables are organized based on the application workflow to support specific workflow steps and application queries.
Step 4: Determine primary keys
In Cassandra, every table has a primary key, which is made up of a partition key followed by zero or more clustering columns that control how rows are laid out within a partition. Getting the primary key right for each table is one of the most crucial aspects of designing a good data model.
In the latest_videos table, yyyymmdd is the partition key, and it is followed by two clustering columns, added_date and videoid, ordered in a fashion that supports retrieving the latest videos.
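In CQL, a key like this might be declared roughly as in the sketch below, held in a Python string so it could be handed to a driver. Only the three key columns come from the description above. The `name` column, the column types, and the clustering order are assumptions, and `key_columns` is a toy parser that handles only this exact form of the PRIMARY KEY clause.

```python
import re

# Hypothetical DDL: the key columns follow the text, everything else is assumed.
LATEST_VIDEOS_DDL = """
CREATE TABLE IF NOT EXISTS latest_videos (
    yyyymmdd   text,
    added_date timestamp,
    videoid    uuid,
    name       text,
    PRIMARY KEY ((yyyymmdd), added_date, videoid)
) WITH CLUSTERING ORDER BY (added_date DESC, videoid ASC);
"""


def key_columns(ddl: str):
    """Split PRIMARY KEY ((partition cols), clustering cols) into two lists."""
    match = re.search(r"PRIMARY KEY\s*\(\s*\(([^)]*)\)\s*,?([^)]*)\)", ddl)
    partition = [col.strip() for col in match.group(1).split(",") if col.strip()]
    clustering = [col.strip() for col in match.group(2).split(",") if col.strip()]
    return partition, clustering


partition_key, clustering_columns = key_columns(LATEST_VIDEOS_DDL)
print("partition key:", partition_key)
print("clustering columns:", clustering_columns)
```

The inner parentheses mark the partition key. Everything after them inside PRIMARY KEY is a clustering column, in the order rows are sorted within the partition.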
Good examples of unique keys are customer IDs, order IDs, and transaction IDs. Relational databases often use simple auto-incrementing integers to assign unique keys to records, but this approach isn't practical in a distributed system like Cassandra. To ensure that each key is unique, Cassandra supports universally unique identifiers (UUIDs) as a native data type. UUIDs are 128-bit numbers that can be generated independently on any node and are, for all practical purposes, unique.
Some developers might prefer to devise their own naming schemes to make keys easier to understand, but it's important to think about the impact if the business changes, rendering the scheme obsolete. UUIDs can sometimes be more maintainable in the long run.
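On the application side, keys like these can be generated without a round trip to the database. Here is a minimal sketch using Python's standard `uuid` module. The variable names are hypothetical, and which UUID version suits a given column is a design choice left to the application.

```python
import uuid

# Random (version 4) UUID, a reasonable fit for an entity key such as a video ID.
videoid = uuid.uuid4()

# Time-based (version 1) UUID, the kind of value a TIMEUUID column holds.
commentid = uuid.uuid1()

print(videoid, "version", videoid.version)
print(commentid, "version", commentid.version)
print(len(videoid.bytes) * 8, "bits")
```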
Step 5: Use the right data types effectively
Cassandra supports a wide variety of data types that will be familiar to most developers including BigInt, Blob, Boolean, Decimal, Double, Float, Inet (IP addresses), Int, Text, VarChar, UUID, TIMEUUID, and more.
It might be tempting to store tags associated with videos in a separate table. When the list of anticipated tags is small, however, using a collection data type that stores tags inside the database record can be more efficient. This simplifies the database design and reduces the number of tables required. Collection data types include sets, lists, maps, tuples, and nested collections.
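The idea is close to keeping a set on the record itself. The Python sketch below models a video row whose `tags` field plays the role of a set collection column. The field names and the helpers `add_tag` and `has_tag` are hypothetical.

```python
video = {
    "videoid": "video-1",
    "name": "Intro to Cassandra",
    # Stands in for a set collection column: unordered, no duplicates.
    "tags": {"cassandra", "tutorial"},
}


def add_tag(record, tag):
    """Add a tag to the record; adding an existing tag has no effect."""
    record["tags"].add(tag.lower())


def has_tag(record, tag):
    return tag.lower() in record["tags"]


add_tag(video, "Tutorial")  # already present, so the set is unchanged
add_tag(video, "nosql")
print(sorted(video["tags"]))
```

The tags travel with the video row, so reading a video and its tags is a single-partition read rather than a second query against a tags table.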
Another data type in Cassandra that provides flexibility is a user-defined type (UDT). UDTs can attach multiple data fields—each named and typed—to a single column. In the KillrVideo example, rather than add multiple address-related fields, an address type can be created.
The user-defined address type can now be included in the users table.
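As a rough analogy in Python, a UDT behaves like a small named record nested inside a row. The field names below (`street`, `city`, `zip_code`, `country`) and the idea of keeping several addresses under labels such as "home" are assumptions for illustration. They are not taken from the KillrVideo schema.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Address:
    """Plays the role of a user-defined address type: named, typed fields."""
    street: str
    city: str
    zip_code: str
    country: str


@dataclass
class User:
    """Plays the role of a row in the users table."""
    userid: str
    email: str
    # One column holding structured values instead of many address_* columns.
    addresses: Dict[str, Address] = field(default_factory=dict)


user = User(userid="user-7", email="pat@example.com")
user.addresses["home"] = Address("1 Main St", "Springfield", "12345", "US")
print(user.addresses["home"].city)
```

Grouping the fields into one type keeps the users table narrow and lets the same address shape be reused wherever else it is needed.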