Cassandra Architecture - The Definitive Guide

What is Cassandra Architecture

Cassandra’s architecture is based on the understanding that system and hardware failures occurs eventually. Before talking about Cassandra lets first talk about terminologies used in architecture design.
Node: Is computer (server) where you store your data. It is the basic infrastructure component of Cassandra.
Data center :A collection of related nodes. A data center can be a physical data center or virtual data center. Replication is set by data center. Depending on the replication factor, data can be written to multiple data centers. However, data centers should never span physical locations.
Cluster: A cluster contains one or more data centers. It can span physical locations.
Commit log: Is a crash-recovery mechanism.All data is written first to the commit log (file) for durability. After all its data has been flushed to SSTables, it can be archived, deleted, or recycled.
Table: A collection of ordered columns fetched by row.
SSTable: A sorted string table (SSTable) is an immutable data file to which Cassandra writes memtables periodically. SSTables are append only and stored on disk sequentially and maintained for each Cassandra table.

Certification in Bigdata Analytics

Gossip: Gossip is a peer-to-peer communication protocol in which nodes periodically exchange state information about themselves and about other nodes they know about. The gossip process runs every second and exchanges state messages with up to three other nodes in the cluster.
Bloom filter: These are quick, nondeterministic, algorithms for testing whether an element is a member of a set. Bloom filters are accessed after every query.

Cassandra addresses the problem of failures by using a peer-to-peer distributed system across homogeneous nodes where data is distributed among all nodes in the cluster. Each node exchanges information across the cluster every second. A sequentially written commit log on each node captures write activity to ensure data durability. Data is then indexed and written to an in-memory structure, called a memtable, which resembles a write-back cache. Once the in-memory data structure is full, the data is written to disk in an SSTable data file. All writes are automatically partitioned and replicated throughout the cluster.

Using a process called compaction Cassandra periodically consolidates SSTables, discarding obsolete data and tombstones (an indicator that data was deleted). Client’s read or write requests can be sent to any node in the cluster. When a client connects to a node with a request, that node serves as the coordinator for that particular client operation. The coordinator acts as a proxy between the client application and the nodes that own the data being requested. The coordinator determines which nodes in the ring should get the request based on how the cluster is configured

img 1

About the Author

Technical Research Analyst - Big Data Engineering

Abhijit is a Technical Research Analyst specialising in Big Data and Azure Data Engineering. He has 4+ years of experience in the Big data domain and provides consultancy services to several Fortune 500 companies. His expertise includes breaking down highly technical concepts into easy-to-understand content.