Introduction to Cassandra - A Scalable NoSQL Database

Welcome to big data:

Big data has been among the biggest buzzwords of the past few years. Big data technologies are a set of tools specifically designed and architected to store, process and analyze big data (i.e. data in the order of thousands of GB). These tools are built to handle variety (e.g. anything from audio data to MySQL tables, from JPEG images to web server logs) and velocity (e.g. data ingested at gigabytes per second). Relational database management systems (RDBMS) have been well accepted in the software industry for decades as an efficient way to store and retrieve business data. The beauty of these systems lies in their ability to support low-latency data retrieval and transactions. They are also mature at indexing and reporting on data. However, they were never designed to store data that runs into terabytes, that has no predefined (or at least predictable) structure, and that changes shape every other day.

Hello NoSQL!

NoSQL (Not only SQL) refers to non-relational, largely distributed database systems that enable quick, ad hoc organization and analysis of extremely high-volume, disparate data types (i.e. data of practically any type). NoSQL databases can be considered an alternative to relational databases, with scalability, availability, and fault tolerance as the key factors. They can go beyond relational databases (such as Oracle, SQL Server and DB2) in satisfying the needs of modern business applications. This technology is typically characterized by a very flexible, schemaless data model, horizontal scalability, distributed architectures, and the use of tools that are “not only” SQL.
There are primarily four types of NoSQL databases, each with its own specific attributes:
1.) Key-value stores: These are simple key-value stores in which all data is indexed by key. Examples include DynamoDB, Azure Table Storage (ATS), Riak and BerkeleyDB.
2.) Column stores: These databases store data tables as sections of columns rather than as rows, and offer very high performance and a highly scalable architecture. Examples include Cassandra, HBase, Bigtable and Hypertable.
3.) Document stores: These are key-value stores whose values are complex, semi-structured data, also referred to as documents. Keys are usually unique and are used to retrieve document-oriented information. Examples include MongoDB and CouchDB.
4.) Graph databases: Based on graph theory, these databases are designed for data whose elements are interconnected and whose relationships are well represented as a graph. They are used to store information about networks, such as social connections. Examples include Neo4j.
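
To make the four categories a little more concrete, here is a minimal Python sketch of the shape of data each one holds. It uses plain in-memory dictionaries rather than any real database, and every name and sample value in it (key_value_store, friends_of_friends, the user records) is made up for illustration.

```python
# Key-value store: an opaque value looked up by a single key.
key_value_store = {"user:42": b"\x00\x01serialized-profile"}

# Column store: each column is kept together, so scanning one column
# does not touch the others.
column_store = {
    "name": ["Ada", "Linus", "Grace"],
    "city": ["London", "Helsinki", "New York"],
}

# Document store: a key maps to a semi-structured, possibly nested document.
document_store = {
    "user:42": {
        "name": "Ada",
        "languages": ["en", "fr"],
        "address": {"city": "London"},
    },
}

# Graph database: nodes plus the edges that connect them.
graph = {
    "Ada": {"Linus", "Grace"},
    "Linus": {"Ada"},
    "Grace": {"Ada"},
}


def friends_of_friends(graph, person):
    """Return the people exactly two hops away from `person`."""
    direct = graph[person]
    two_hops = set()
    for friend in direct:
        two_hops |= graph[friend]
    return two_hops - direct - {person}
```

The graph traversal is the kind of query that is awkward to express over rows and columns but natural when relationships are stored directly.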

CAP theorem and why Cassandra makes sense

The CAP theorem (also called Brewer’s theorem after its author, Eric Brewer) states that within a large-scale distributed data system, there are three requirements that have a relationship of sliding dependency: Consistency, Availability, and Partition Tolerance.
Consistency: All database clients read the same value for the same query, even given concurrent updates; that is, a read sees all previously completed writes.
Availability: All database clients are always able to read and write data.
Partition Tolerance: The database can be split across multiple machines, and it continues to function when network failures cut off some of the machines in the cluster.
Brewer’s theorem states that in any given system, you can strongly support only two of the three.

So what does it mean in practical terms to support only two of the three faces of CAP?

CA: To primarily support Consistency and Availability means that the system will block when the network, or part of it, fails. To mitigate this, such a database is typically limited to a single data center.
CP: To primarily support Consistency and Partition Tolerance, the database is architected around data shards (i.e. partitioning) in order to scale. Data will be consistent, but you still run the risk of some data becoming unavailable if nodes fail.
AP: To primarily support Availability and Partition Tolerance, your system may return stale or inaccurate data, but it will always be available for reads and writes, even in the face of network partitioning or failure.
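
The difference between the CP and AP choices can be sketched with a toy two-replica store in Python. This is a deliberately simplified illustration, not how any real database is implemented: the TwoNodeStore class, the Unavailable exception and the `partitioned` flag are all hypothetical names invented for this example.

```python
class Unavailable(Exception):
    """Raised when the CP store refuses a request during a partition."""


class TwoNodeStore:
    """Toy store with two replicas that normally stay in sync."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [{}, {}]
        self.partitioned = False  # True means the replicas cannot talk

    def write(self, node, key, value):
        if self.partitioned:
            if self.mode == "CP":
                # CP: refuse rather than let the replicas diverge.
                raise Unavailable("cannot reach the other replica")
            # AP: accept the write locally; the other replica is now stale.
            self.replicas[node][key] = value
            return
        for replica in self.replicas:
            replica[key] = value

    def read(self, node, key):
        if self.partitioned and self.mode == "CP":
            raise Unavailable("cannot confirm the latest value")
        return self.replicas[node].get(key)


ap = TwoNodeStore("AP")
ap.write(0, "x", 1)
ap.partitioned = True
ap.write(0, "x", 2)        # still accepted during the partition
fresh = ap.read(0, "x")    # 2, read from the replica that took the write
stale = ap.read(1, "x")    # 1, the other replica never saw the update
```

In CP mode the same write during the partition raises Unavailable instead, which is the trade: the data never diverges, but some requests are turned away.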

Cassandra makes sense!

Apache Cassandra is an open source, free to use, distributed, decentralized, elastically and linearly scalable, highly available, fault-tolerant, tunably consistent, column-oriented database that bases its distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web. Cassandra lies in the AP bucket of the CAP theorem: it favors availability and partition tolerance, and lets you tune how much consistency you get. The important features that make Cassandra special are:
Distributed and Decentralized: Cassandra is distributed, which means that it is capable of running on multiple machines while appearing to users as a unified whole. Cassandra is decentralized, which means that there is no single point of failure. All of the nodes in a Cassandra cluster function exactly the same way. There is no master and no slave.
Elastic Scalability: Your cluster can seamlessly scale up and scale back down. Adding more servers to the cluster scales its performance in a roughly linear fashion without manual intervention, and the reverse is equally true.
High Availability and Fault Tolerance: Cassandra is highly available. You can remove failed nodes from the cluster without losing any data and without bringing the whole cluster down. In a similar fashion, you can replicate data to multiple data centers to improve performance.
Tunable Consistency: Consistency essentially means that a read always returns the most recently written value. Cassandra lets you decide the level of consistency you require, in balance with the level of availability. This is controlled by parameters such as the replication factor and the consistency level.
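
How the replication factor and the consistency level interact can be illustrated with a little arithmetic. The Python sketch below is not Cassandra code; the function names are invented for this example. It relies on a commonly cited rule of thumb that is not stated in the text above: a read is guaranteed to overlap with the latest write when the number of replicas written plus the number of replicas read is greater than the replication factor. ONE, QUORUM and ALL are Cassandra consistency level names.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must overlap every write set."""
    return write_replicas + read_replicas > replication_factor


rf = 3
q = quorum(rf)  # 2 of the 3 replicas

# QUORUM writes + QUORUM reads: 2 + 2 > 3, so reads see the latest write.
quorum_both = reads_see_latest_write(rf, q, q)

# ONE write + ONE read: 1 + 1 is not > 3, so a read may hit a stale replica.
one_both = reads_see_latest_write(rf, 1, 1)

# ALL writes + ONE read: 3 + 1 > 3, consistent, but a write now fails
# if any single replica is down.
all_then_one = reads_see_latest_write(rf, rf, 1)
```

Lower levels such as ONE favor availability and latency, while QUORUM and ALL trade some of that away for stronger consistency.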