Top Answers to Cassandra Interview Questions
1. Compare MongoDB with Cassandra.
|Criteria|MongoDB|Cassandra|
|Data Model|Document|Google Bigtable-like|
|Querying of data|Multi-indexed|Using key or scan|
2. What is Cassandra?
Apache Cassandra is an open-source, distributed NoSQL database management system. It is designed to store and manage large volumes of data across many nodes with no single point of failure. Originally developed at Facebook, Cassandra is written in Java, scales well for Big Data workloads, and supports flexible schemas. Among the various types of NoSQL databases, Cassandra is a hybrid of a column-oriented and a key–value store. The keyspace is the outermost container for an application's data, and tables (also called column families) are the entities stored within a keyspace.
3. List the benefits of using Cassandra.
Compared with traditional relational databases, Apache Cassandra delivers near real-time performance, which simplifies the work of Developers, Administrators, Data Analysts, and Software Engineers.
- Instead of a master–slave architecture, Cassandra is built on a peer-to-peer architecture, so there is no single point of failure.
- It is highly flexible: nodes can be added to any Cassandra cluster in any data center, and any client can send its request to any server.
- Cassandra scales up and down easily as requirements change. It offers high throughput for read and write operations and does not need to be restarted while scaling.
- Cassandra is also known for its strong data replication: data is stored at multiple locations, so users can retrieve it from another location if one node fails. Users can choose how many replicas to create.
- It performs well on massive datasets, which makes it a popular NoSQL choice for many organizations.
- It operates on a column-oriented structure, which speeds up and simplifies slicing. Data access and retrieval also become more efficient with a column-based data model.
- Apache Cassandra supports a schema-free/schema-optional data model, so you do not need to declare in advance all the columns your application requires.
4. Explain the concept of tunable consistency in Cassandra.
Tunable consistency is a key feature that makes Cassandra a favored database choice of Developers, Analysts, and Big Data Architects. Consistency refers to how up-to-date and synchronized a row of data is on all of its replicas. Cassandra's tunable consistency allows users to select the consistency level best suited to their use cases. It supports two kinds of consistency: eventual consistency and strong consistency.
Eventual consistency guarantees that, when no new updates are made to a given data item, all accesses eventually return the last updated value. Systems with eventual consistency are said to have achieved replica convergence.
For strong consistency, Cassandra supports the following condition:
R + W > N where,
N – Number of replicas
W – Number of nodes that need to agree for a successful write
R – Number of nodes that need to agree for a successful read
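The condition above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `is_strongly_consistent` is made up for this example and is not part of Cassandra or any driver.

```python
def is_strongly_consistent(r: int, w: int, n: int) -> bool:
    """Return True when read and write replica sets must overlap (R + W > N)."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must each be between 1 and n")
    return r + w > n


# With N = 3 replicas:
print(is_strongly_consistent(r=2, w=2, n=3))  # True: quorum reads and quorum writes overlap
print(is_strongly_consistent(r=1, w=1, n=3))  # False: only eventual consistency
print(is_strongly_consistent(r=1, w=3, n=3))  # True: every replica has the write
```

When the check returns True, at least one replica contacted by a read has also acknowledged the latest write, which is why the read can return the most recent value.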
5. How does Cassandra write?
Cassandra handles a write in two steps: first, it appends the write to a commit log on disk, and then it applies the write to an in-memory structure known as a memtable. Once both steps succeed, the write is acknowledged. When a memtable fills up, its contents are flushed to disk as SSTables (sorted string tables). Because writes are sequential appends rather than in-place updates, Cassandra offers fast write performance.
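The write path can be pictured with a toy model in Python. This is a simplified sketch, not Cassandra's implementation: the class `ToyNode` and its `memtable_limit` threshold are hypothetical, and plain lists and dicts stand in for files on disk.

```python
class ToyNode:
    """Toy model of the write path: commit log -> memtable -> SSTable flush."""

    def __init__(self, memtable_limit: int = 3):
        self.commit_log = []   # stands in for the append-only log on disk
        self.memtable = {}     # in-memory structure, keyed by row key
        self.sstables = []     # immutable, sorted snapshots standing in for disk files
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def write(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))  # step 1: record the write for durability
        self.memtable[key] = value            # step 2: apply it in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        # Persist the memtable as a sorted, immutable SSTable and start a new one.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed writes no longer need to be replayed


node = ToyNode(memtable_limit=2)
node.write("b", "2")
node.write("a", "1")  # the second write reaches the limit and triggers a flush
node.write("c", "3")
print(node.sstables)  # [(('a', '1'), ('b', '2'))]
print(node.memtable)  # {'c': '3'}
```

Note that the flushed SSTable is sorted by key even though the writes arrived out of order.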
6. Define the management tools in Cassandra.
DataStax OpsCenter: It is a web-based management and monitoring solution for Cassandra clusters and DataStax. It is free to download, and an additional edition of OpsCenter is also available.
SPM: It primarily monitors Cassandra metrics along with various OS and JVM metrics. Besides Cassandra, SPM also monitors Hadoop, Spark, Solr, Storm, ZooKeeper, and other Big Data platforms. The main features of SPM include correlation of events and metrics, distributed transaction tracing, real-time graphs with zooming, anomaly detection, and heartbeat alerting.
7. Define memtable.
A memtable is an in-memory, write-back cache that holds content in a key and column format. The data in a memtable is sorted by key, and each column family has its own memtable, from which column data is looked up by key. A memtable accumulates writes until it is full and is then flushed to disk.
8. What is SSTable? How is it different from other relational tables?
SSTable stands for ‘Sorted String Table.’ It is the on-disk data file in Cassandra to which memtables are regularly flushed. SSTables are stored on disk and exist for each Cassandra table. Unlike relational tables, SSTables are immutable: no data items can be added or removed once an SSTable has been written. For each SSTable, Cassandra also creates supporting structures such as a partition index, a partition summary, and a bloom filter.
9. Explain the concept of Bloom Filter.
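Associated with each SSTable, a Bloom filter is an off-heap data structure (held in native memory rather than on the Java heap) used to check whether the SSTable might contain the requested data before performing any disk I/O. It can report false positives but never false negatives, so a negative answer lets Cassandra skip that SSTable entirely.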
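A minimal Bloom filter can be sketched in Python to show the idea. This is a toy version under stated assumptions: the class `ToyBloomFilter`, its bit-array size, and its use of salted SHA-256 hashes are choices made for this example, not what Cassandra itself uses.

```python
import hashlib


class ToyBloomFilter:
    """Toy Bloom filter: can say 'definitely absent' or 'possibly present'."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, key: str):
        # Derive several bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key: str) -> bool:
        # False means the key was never added; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


bf = ToyBloomFilter()
for k in ("user:1", "user:2"):
    bf.add(k)
print(bf.might_contain("user:1"))  # True: added keys are never reported as absent
```

A read would consult the filter first and open the SSTable on disk only when `might_contain` returns True.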
10. Explain CAP Theorem.
Distributed systems must scale out when additional resources are needed, and the CAP theorem plays a major role in choosing that scaling strategy. The consistency, availability, and partition tolerance (CAP) theorem states that a distributed system like Cassandra can guarantee only two of these three characteristics at the same time.
One of them needs to be sacrificed. Consistency guarantees that a client receives the most recent write; availability guarantees a reasonable response within a minimum time; and partition tolerance means that the system continues to operate when network partitions occur. Since network partitions cannot be ruled out in a distributed system, the two practical options are AP and CP.
11. State the differences between a node, a cluster, and a data center in Cassandra.
Cassandra has several components. A node is a single machine running Cassandra, and a cluster is a collection of nodes that together hold related data. Data centers are useful when serving customers in different geographical areas: the nodes of a cluster can be grouped into different data centers.
12. How to write a query in Cassandra?
We can write queries in Cassandra using CQL (Cassandra Query Language). The cqlsh shell is used for interacting with the database.
13. What OS does Cassandra support?
Cassandra runs on Linux. Older releases also supported Windows, but Windows support was removed in Cassandra 4.0.
14. What is Cassandra Data Model?
Cassandra data model consists of four main components:
Cluster: Made up of multiple nodes and keyspaces
Keyspace: A namespace to group multiple column families, typically one per application
Column: Consisting of a column name, value, and timestamp
Column Family: Multiple columns with the row key reference
15. What is CQL?
CQL (Cassandra Query Language) is the language used to access and query the Apache Cassandra distributed database. Its CQL parser leaves all the implementation details to the server. The syntax of CQL is similar to SQL, but it does not alter the Cassandra data model.
16. Explain the concept of compaction in Cassandra.
Compaction is a maintenance process in Cassandra in which SSTables are reorganized to optimize the data structures on disk. It is needed because memtables are repeatedly flushed to new SSTables, which then have to be merged. There are two types of compaction in Cassandra.
Minor compaction: It starts automatically when a new SSTable is created. Here, Cassandra merges all the equally sized SSTables into one.
Major compaction: It is triggered manually using nodetool. It compacts all SSTables of a column family into one.
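The core idea of merging SSTables can be sketched in Python. This is a toy illustration only: the function `compact` and the `(key, value, timestamp)` tuple layout are assumptions made for this example, and real compaction strategies are considerably more involved.

```python
def compact(sstables):
    """Merge SSTables into one, keeping the newest value for each key.

    Each SSTable is modeled as a list of (key, value, timestamp) tuples.
    """
    latest = {}
    for table in sstables:
        for key, value, ts in table:
            # Keep a version only if it is newer than the one seen so far.
            if key not in latest or ts > latest[key][1]:
                latest[key] = (value, ts)
    # Emit the merged data sorted by key, like any other SSTable.
    return [(key, value, ts) for key, (value, ts) in sorted(latest.items())]


older = [("a", "1", 10), ("b", "2", 11)]
newer = [("a", "9", 20), ("c", "3", 21)]
merged = compact([older, newer])
print(merged)  # [('a', '9', 20), ('b', '2', 11), ('c', '3', 21)]
```

The outdated value of key `a` disappears in the merged result, which is how compaction reclaims disk space and reduces the number of files a read has to consult.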
17. Does Cassandra support ACID transactions?
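Unlike relational databases, Cassandra does not support full ACID transactions. It does provide atomic and isolated writes within a single partition, as well as lightweight transactions for conditional updates.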
18. Explain Cqlsh.
Cqlsh stands for Cassandra Query Language Shell, the interactive terminal for CQL. It is a Python-based command-line prompt used on Linux or Windows, and it executes CQL commands like ASSUME, CAPTURE, CONSISTENCY, COPY, DESCRIBE, and many others. With cqlsh, users can define a schema, insert data, and execute a query.
19. What is Super Column in Cassandra?
A Cassandra Super Column is a unique element that groups similar collections of data. Super columns are key–value pairs whose values are columns. A super column is a sorted array of columns, and the elements follow this hierarchy: keyspace > column family > super column > column data structure in JSON.
Similar to row keys, super column entries contain no independent values but are used to collect other columns. It is interesting to note that super column keys appearing in different rows do not necessarily match.
20. Define the consistency levels for write operations in Cassandra.
- ALL: Highly consistent. A write must be written to the commitlog and memtable on all replica nodes in the cluster.
- EACH_QUORUM: A write must be written to the commitlog and memtable on a quorum of replica nodes in every data center.
- LOCAL_QUORUM: A write must be written to the commitlog and memtable on a quorum of replica nodes in the same data center as the coordinator.
- ONE: A write must be written to the commitlog and memtable of at least one replica node.
- TWO, THREE: Same as ONE but with at least two and three replica nodes, respectively.
- LOCAL_ONE: A write must be written to at least one replica node in the local data center.
- SERIAL: Provides linearizable consistency for conditional (lightweight transaction) updates by preventing unconditional updates.
- LOCAL_SERIAL: Same as SERIAL but restricted to the local data center.
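To make the levels above concrete, the sketch below computes how many replica acknowledgements a level needs. It is a simplification that assumes a single data center, so LOCAL_QUORUM and EACH_QUORUM behave the same; the function names `quorum` and `required_acks` are hypothetical and not part of any Cassandra API. A quorum is a majority of replicas, that is, the replication factor divided by two (rounded down) plus one.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def required_acks(level: str, replication_factor: int) -> int:
    """Replica acknowledgements needed, assuming a single data center."""
    fixed = {"ONE": 1, "LOCAL_ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        needed = fixed[level]
    elif level in ("LOCAL_QUORUM", "EACH_QUORUM"):
        needed = quorum(replication_factor)
    elif level == "ALL":
        needed = replication_factor
    else:
        raise ValueError(f"unsupported level in this sketch: {level}")
    if needed > replication_factor:
        raise ValueError(f"{level} needs {needed} replicas, only {replication_factor} exist")
    return needed


for level in ("ONE", "LOCAL_QUORUM", "ALL"):
    print(level, required_acks(level, replication_factor=3))
# ONE 1
# LOCAL_QUORUM 2
# ALL 3
```

With a replication factor of 3, a quorum write still succeeds when one replica is down, whereas ALL does not.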
21. What is the difference between Column and Super Column?
Both elements are tuples with a name and a value. However, a column's value is a string, while a super column's value is a map of columns with different data types.
Unlike Columns, Super Columns do not contain the third component of timestamp.
22. What is Column Family?
As the name suggests, a column family is a structure that can hold an unbounded number of rows. Each row is identified by a key and contains key–value pairs, where the key is the name of the column and the value is the column data. It is much like a hashmap in Java or a dictionary in Python. Remember, the rows are not limited to a predefined list of columns here. The column family is flexible: one row can have 100 columns while another has only 2.
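The dictionary analogy can be written out directly in Python. The column family name `users`, the row keys, and the column names below are invented for illustration only.

```python
# A column family modeled as a dict of rows; each row is a dict of columns.
users = {
    "alice": {"email": "alice@example.com", "city": "Paris", "age": "31"},
    "bob": {"email": "bob@example.com"},  # rows need not share the same columns
}

# A new column can be added to one row without touching any other row.
users["bob"]["last_login"] = "2024-01-01"

for row_key, columns in users.items():
    print(row_key, sorted(columns))
# alice ['age', 'city', 'email']
# bob ['email', 'last_login']
```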
23. Define the use of the source command in Cassandra.
The SOURCE command is used in cqlsh to execute a file containing CQL statements.
24. What is Thrift?
Thrift is a legacy RPC protocol and API combined with a code generation tool. It predates CQL as Cassandra's client interface, and its purpose in Cassandra is to provide access to the database from many different programming languages.
25. Explain Tombstone in Cassandra.
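A tombstone is a marker written to a row to indicate that a column has been deleted. The marked columns are physically removed later, during compaction, once the table's gc_grace_seconds period has passed. Tombstones matter because Cassandra supports eventual consistency: a delete must be recorded and propagated like any other write, so that replicas that missed it do not bring the deleted data back.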
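Why a delete has to be stored as a marker can be shown with a small Python sketch. It assumes, for illustration, that replicas are reconciled by picking the version with the newest timestamp; the names `TOMBSTONE` and `reconcile` are hypothetical.

```python
TOMBSTONE = object()  # sentinel standing in for a deletion marker


def reconcile(*versions):
    """Pick the newest (value, timestamp) version; a tombstone reads as None."""
    value, _ = max(versions, key=lambda version: version[1])
    return None if value is TOMBSTONE else value


replica_a = ("alice@example.com", 100)  # this replica missed the delete
replica_b = (TOMBSTONE, 200)            # this replica recorded the delete later

print(reconcile(replica_a, replica_b))  # None: the newer tombstone wins
print(reconcile(replica_a))             # alice@example.com
```

The second call shows what would happen if the delete had simply erased the data on replica B: only replica A's old value would remain, and the deleted column would reappear.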
26. On what platforms does Cassandra run?
Since Cassandra is a Java application, it can run on any platform that provides a Java Runtime Environment (JRE) or Java Virtual Machine (JVM). It is commonly run on Red Hat, CentOS, Debian, and Ubuntu Linux platforms.
27. Name the ports that Cassandra uses.
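By default, Cassandra uses port 7000 for inter-node cluster communication, 9160 for Thrift clients, and 8080 for JMX in early releases (7199 in later ones). CQL native-protocol clients connect on port 9042. These are all TCP ports and can be changed in the configuration files: cassandra.yaml for the cluster and client ports, and the startup scripts (bin/cassandra.in.sh in early releases) for the JMX port.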
28. Can you add or remove column families in a working cluster?
Yes, but it must be done carefully, following these steps:
- Empty the commitlog with ‘nodetool drain’
- Shut down Cassandra and verify that no data is left in the commitlog
- Delete the SSTable files for the removed CFs
29. What is replication factor in Cassandra?
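Replication factor is the number of copies of each piece of data stored across the cluster, and it is set per keyspace. On clusters with authentication enabled, it is important to increase the replication factor of the system_auth keyspace so that users can still log into the cluster when a node is down.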
30. Can we change the replication factor on a live cluster?
Yes, but it will require running repair to alter the replica count of the existing data.
31. How to iterate all rows in a Column Family?
Use get_range_slices from the Thrift API. Start the iteration with an empty string as the start key; after each call, the last key read serves as the start key for the next call.
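The paging pattern described above can be sketched in Python without a running cluster. Everything here is a stand-in: `fetch_range` is a hypothetical helper that imitates a range call over an in-memory dict in sorted key order, and it treats the start key as inclusive, which is why every page after the first skips its first row.

```python
def fetch_range(rows, start_key, count):
    """Hypothetical range call: up to `count` rows with key >= start_key."""
    keys = [k for k in sorted(rows) if k >= start_key]
    return [(k, rows[k]) for k in keys[:count]]


def iterate_all(rows, page_size=3):
    """Yield every row by paging, reusing the last key read as the next start key."""
    if page_size < 2:
        raise ValueError("page_size must be at least 2 to make progress")
    start = ""          # the empty string sorts before every key
    first_page = True
    while True:
        page = fetch_range(rows, start, page_size)
        # The start key is inclusive, so later pages repeat the last key already seen.
        fresh = page if first_page else page[1:]
        if not fresh:
            return
        yield from fresh
        start = page[-1][0]
        first_page = False


data = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}
print(list(iterate_all(data, page_size=3)))
# [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)]
```

In a real cluster the order of the rows depends on the partitioner, so the sorted key order used here is an assumption of the sketch.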