Cassandra Data Model
In the relational data model, the outermost container is the database. Each database corresponds to a real application; for example, in an online library application the database could be named library. A database contains tables, each of which maps to a real-world entity: books in the library correspond to a book table with multiple fields (columns) that describe a book, such as name, author, and ISBN. Usually each table has a unique identifier (the primary key).
In Cassandra, the logical division that associates similar data is called a column family. The basic Cassandra data structures are the column, which is a name/value pair (plus a client-supplied timestamp of when it was last updated), and the column family, which is a container for rows that have similar, but not identical, column sets. Each row has a unique identifier called the row key. A keyspace is the outermost container for data in Cassandra, corresponding closely to a relational database.
In relational databases, we're used to column names being strings. In Cassandra, both row keys and column names can be strings, but they can also be long integers, UUIDs, or any kind of byte array.
In Cassandra, the basic attributes that can be set per keyspace are:
Replication factor: the number of nodes that will hold a copy (replica) of each row of data. If your replication factor is 3, then three nodes in the ring will have copies of each row, and this replication is transparent to clients.
Replica placement strategy: how the replicas are placed in the ring. There are different strategies for determining which nodes get copies of which keys: SimpleStrategy (formerly known as RackUnawareStrategy), OldNetworkTopologyStrategy (formerly known as RackAwareStrategy), and NetworkTopologyStrategy (formerly known as DatacenterShardStrategy).
Column families: a keyspace is a container for a list of one or more column families. The nesting can be thought of like this:
[Keyspace][ColumnFamily][Key][Column]
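To make the replication factor and replica placement described above more concrete, here is a minimal Python sketch of a SimpleStrategy-like placement: the row key is hashed to a position on the ring, and the next N nodes clockwise hold the replicas. The node names, the token values, and the one-byte MD5 hash are illustrative assumptions, not Cassandra's actual implementation.

```python
import bisect
import hashlib
from typing import List

# Hypothetical four-node ring: (token, node name), sorted by token.
RING = [(0, "node-a"), (64, "node-b"), (128, "node-c"), (192, "node-d")]


def token_for(row_key: str) -> int:
    """Map a row key to a token in 0..255 (toy hash, an assumption)."""
    return hashlib.md5(row_key.encode("utf-8")).digest()[0]


def replicas_for(row_key: str, replication_factor: int) -> List[str]:
    """Walk the ring clockwise from the key's token and pick distinct nodes."""
    tokens = [token for token, _ in RING]
    # First node whose token is >= the key's token, wrapping around the ring.
    start = bisect.bisect_left(tokens, token_for(row_key)) % len(RING)
    # A key cannot have more replicas than there are nodes.
    count = min(replication_factor, len(RING))
    return [RING[(start + i) % len(RING)][1] for i in range(count)]


# With a replication factor of 3, three different nodes hold the row.
print(replicas_for("9352130677", 3))
```

The client only asks for a row key; which nodes hold the copies is decided by the ring, which is what makes replication transparent to clients.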
Example of a Cassandra Book column family:
Book {
key: 9352130677 { name: "Hadoop The Definitive Guide", author: "Tom White", publisher: "O'Reilly", priceInr: 650, category: "hadoop", edition: 4 },
key: 8177228137 { name: "Hadoop in Action", author: "Chuck Lam", publisher: "Manning", priceInr: 590, category: "hadoop" },
key: 8177228137 { name: "Cassandra: The Definitive Guide", author: "Eben Hewitt", publisher: "O'Reilly", priceInr: 600, category: "cassandra" }
}
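The [Keyspace][ColumnFamily][Key][Column] nesting can be pictured as a map of maps. The following Python sketch models it with plain dictionaries; the keyspace name library and the helper get_column are assumptions made for illustration, and only a few of the Book columns are shown.

```python
# Keyspace -> column family -> row key -> column name -> value.
keyspaces = {
    "library": {  # hypothetical keyspace name
        "Book": {
            "9352130677": {
                "name": "Hadoop The Definitive Guide",
                "author": "Tom White",
                "category": "hadoop",
                "edition": 4,
            },
            "8177228137": {
                "name": "Hadoop in Action",
                "author": "Chuck Lam",
                "category": "hadoop",
                # No "edition" column: rows in one column family
                # have similar, but not identical, column sets.
            },
        }
    }
}


def get_column(keyspace, column_family, row_key, column):
    """Follow [Keyspace][ColumnFamily][Key][Column]; None if the column is absent."""
    return keyspaces[keyspace][column_family][row_key].get(column)


print(get_column("library", "Book", "9352130677", "author"))
print(get_column("library", "Book", "8177228137", "edition"))
```

The second lookup returns None rather than failing, which mirrors the fact that a row is free to omit a column that other rows in the same column family have.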
Why is a column family not equivalent to a table in a relational database?
1.) Schema-free: a Cassandra column family does not follow a fixed schema. You can freely add any column to any column family at any time, depending on your needs.
2.) Comparator: a column family has two attributes, a name and a comparator. The comparator indicates how columns will be sorted when they are returned to you in a query: according to long, byte, UTF8, or other ordering.
3.) Data storage: each column family is stored in separate files on disk, so it is important to keep related columns defined together in the same column family. This makes it different from RDBMS tables.
4.) Super columns: a relational table defines only columns, and the user supplies the values, which are the rows. A Cassandra column family can hold columns, or it can be defined as a super column family.
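The effect of the comparator mentioned in point 2 is easiest to see with column names that look like numbers. The sketch below is plain Python, not Cassandra code: the column names are hypothetical, and the two functions only imitate what a UTF8-style and a long-style comparator would do to the order in which columns come back.

```python
# Column names as they might be stored for one row (hypothetical values).
column_names = ["10", "9", "2", "100"]


def sort_utf8(names):
    """Order names by their UTF-8 bytes, as a UTF8-style comparator would."""
    return sorted(names, key=lambda name: name.encode("utf-8"))


def sort_long(names):
    """Order names numerically, as a long-integer comparator would."""
    return sorted(names, key=int)


print(sort_utf8(column_names))  # ['10', '100', '2', '9']
print(sort_long(column_names))  # ['2', '9', '10', '100']
```

The same four columns come back in a different order depending on the comparator, which is why it has to be chosen to match how the column names will be read.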
Columns versus super columns
A column is the most basic unit of data structure in the Cassandra data model. A column is a triplet of a name, a value, and a clock, which you can think of as a timestamp. A super column is a special kind of column. Both kinds of columns are name/value pairs, but a regular column stores a byte array value, whereas the value of a super column is a map of subcolumns (which themselves store byte array values).
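The difference between the two kinds of column can be sketched with two small Python classes. This is only an illustration of the shapes described above; the class names, the add helper, and the details super column are assumptions, not part of any Cassandra client API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Column:
    """A regular column: name, byte-array value, and a clock."""
    name: bytes
    value: bytes
    timestamp: int  # client-supplied clock


@dataclass
class SuperColumn:
    """A super column: its value is a map of subcolumns keyed by name."""
    name: bytes
    subcolumns: Dict[bytes, Column] = field(default_factory=dict)

    def add(self, column: Column) -> None:
        self.subcolumns[column.name] = column


# Hypothetical super column grouping two regular columns of a book.
details = SuperColumn(name=b"details")
details.add(Column(b"author", b"Tom White", timestamp=1))
details.add(Column(b"publisher", b"O'Reilly", timestamp=1))

print(details.subcolumns[b"author"].value)
```

Note that only the subcolumns carry a value and a timestamp; the super column itself is just a named container for them.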
Things to keep in mind while designing a Cassandra column family
Secondary indexes: In a relational database, if you want to find the Hadoop books in the book table, you would write the following query:
SELECT name FROM book WHERE category = 'hadoop';
When handed a query like this, a relational database without an index will perform a full table scan, inspecting each row's category column to find the value you're looking for. This can become very slow once your table grows very large. The relational answer is to create an index on the category column, which acts as a copy of the data that the relational database can look up very quickly.
Cassandra has a different approach to secondary indexes. At a high level, secondary indexes look like normal column families, with the indexed value as the partition key. Unlike normal tables, however, Cassandra's secondary indexes are not distributed across the cluster. They are implemented as local indexes: each node stores an index of only the data that it holds.
To get the same effect yourself in Cassandra, you create a second column family that holds the lookup data. For example, if the Hadoop The Definitive Guide and Hadoop in Action records are stored on the same node, the lookup column family would look like this:
category {
key: hadoop { "Hadoop The Definitive Guide": "", "Hadoop in Action": "" },
key: cassandra { "Cassandra: The Definitive Guide": "" }
}
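The category lookup column family above can be derived from the Book rows. The following Python sketch shows the idea with plain dictionaries; the function name build_category_index is an assumption, and only the two Hadoop rows are included to keep it short.

```python
# Two rows of the Book column family, reduced to the columns used here.
book = {
    "9352130677": {"name": "Hadoop The Definitive Guide", "category": "hadoop"},
    "8177228137": {"name": "Hadoop in Action", "category": "hadoop"},
}


def build_category_index(book_rows):
    """Return {category: {book name: ""}}, the shape of the lookup column family."""
    index = {}
    for columns in book_rows.values():
        # The category value becomes the row key; the book name becomes
        # a column name with an empty value.
        index.setdefault(columns["category"], {})[columns["name"]] = ""
    return index


category = build_category_index(book)
print(sorted(category["hadoop"]))
```

Here the information lives in the column names, and the values are left empty, which is why the lookup rows in the example hold empty strings.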
Materialized view: "Materialized" means storing a full copy of the original data so that everything you need to answer a query is right there, without forcing you to look up the original data. Because you don't have a SQL WHERE clause, you can recreate this effect by writing your data to a second column family that is created specifically to represent that query.
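A rough Python sketch of this pattern follows, assuming the query to serve is "all books in a category". Every write goes to the base data and to a second structure keyed by category that stores a full copy of the row. The names books_by_category, write_book, and books_in_category are hypothetical, and plain dictionaries stand in for column families.

```python
book = {}               # base column family: row key -> columns
books_by_category = {}  # query-specific copy: category -> {row key: columns}


def write_book(row_key, columns):
    """Write the row twice: once to the base data, once to the view."""
    book[row_key] = dict(columns)
    view_row = books_by_category.setdefault(columns["category"], {})
    # Full copy of the row, so the query needs no lookup in the base data.
    view_row[row_key] = dict(columns)


def books_in_category(category):
    """Answer the query from the view alone."""
    return list(books_by_category.get(category, {}).values())


write_book("9352130677", {"name": "Hadoop The Definitive Guide", "category": "hadoop"})
write_book("8177228137", {"name": "Hadoop in Action", "category": "hadoop"})

print([row["name"] for row in books_in_category("hadoop")])
```

A real application would also have to remove the old view entry when a book's category changes; the sketch leaves that out.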
Design queries first, then tables: work out the queries your application will need and model the data around them, instead of modeling the data first as you would in the relational world.
Timestamp: supply a timestamp (or clock) with each write. This is important because Cassandra uses timestamps to determine the most recent value written.
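Why the timestamp matters can be shown with a small last-write-wins sketch in Python. This only illustrates the rule that the newest timestamp decides the value; the write function and the stored (value, timestamp) tuple are assumptions, and handling of equal timestamps is left out.

```python
def write(row, name, value, timestamp):
    """Keep the value only if its timestamp is newer than what is stored."""
    current = row.get(name)
    if current is None or timestamp > current[1]:
        row[name] = (value, timestamp)


row = {}
write(row, "priceInr", 650, timestamp=2)
write(row, "priceInr", 590, timestamp=1)  # arrives later, but is older

print(row["priceInr"])  # (650, 2)
```

The second write is ignored even though it arrived last, because its timestamp is older. This is why clients with badly skewed clocks can end up with surprising results.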