Sharding is the process of storing data records across multiple machines and is MongoDB’s approach to meeting the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data or to provide acceptable read and write throughput. Sharding solves the problem with horizontal scaling.
10.1 Sharding Introduction
Sharding is a method for storing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high throughput operations.
10.1.1 Purpose of Sharding
Database systems with large data sets and high throughput applications can challenge the capacity of a single server. High query rates can exhaust the CPU capacity of the server. Larger data sets exceed the storage capacity of a single machine. Finally, working set sizes larger than the system’s RAM stress the I/O capacity of disk drives.
To address these issues of scale, database systems have two basic approaches: vertical scaling and sharding. Vertical scaling adds more CPU and storage resources to a single server to increase capacity. Sharding, or horizontal scaling, by contrast, divides the data set and distributes the data over multiple servers, or shards. Each shard is an independent database, and collectively, the shards make up a single logical database.
Sharding addresses the challenge of scaling to support high throughput and large data sets:
- Sharding reduces the number of operations each shard handles. Each shard processes fewer operations as the cluster grows. As a result, a cluster can increase capacity and throughput horizontally.
For example, to insert data, the application only needs to access the shard responsible for that record.
- Sharding reduces the amount of data that each server needs to store. Each shard stores less data as the cluster grows.
For example, if a database has a 1 terabyte data set, and there are 4 shards, then each shard might hold only 256GB of data. If there are 40 shards, then each shard might hold only 25GB of data.
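The arithmetic behind this example can be sketched in a few lines of Python. This is only an illustration: it assumes a perfectly even distribution and counts 1 terabyte as 1024 GB, and the function name data_per_shard_gb is made up for this sketch.

```python
def data_per_shard_gb(total_gb: float, shard_count: int) -> float:
    """Return the data each shard holds if the data set is spread evenly."""
    if shard_count < 1:
        raise ValueError("a cluster needs at least one shard")
    return total_gb / shard_count


TOTAL_GB = 1024  # 1 terabyte, counted here as 1024 GB (an assumption)

for shards in (1, 4, 40):
    per_shard = data_per_shard_gb(TOTAL_GB, shards)
    print(f"{shards:>2} shards -> {per_shard:.1f} GB per shard")
```

In practice the distribution is rarely perfectly even, so these figures are best read as rough targets rather than guarantees.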
10.1.2 Sharding in MongoDB
MongoDB supports sharding through the configuration of a sharded cluster.
A sharded cluster has the following components: shards, query routers and config servers.
- Shards store the data. To provide high availability and data consistency, in a production sharded cluster, each shard is a replica set.
- Query Routers, or mongos instances, interface with client applications and direct operations to the appropriate shard or shards. The query router processes and targets operations to shards and then returns results to the clients. A sharded cluster can contain more than one query router to divide the client request load. A client sends requests to one query router. Most sharded clusters have many query routers.
- Config servers store the cluster’s metadata. This data contains a mapping of the cluster’s data set to the shards. The query router uses this metadata to target operations to specific shards. Production sharded clusters have exactly 3 config servers.
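To make the role of the metadata more concrete, here is a minimal Python sketch of how a query router could use a chunk-to-shard mapping to target an operation. It is a toy model, not MongoDB's actual implementation: the chunk bounds, the shard names and both function names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical cluster metadata, of the kind a config server might hold.
# Chunk i covers shard key values from CHUNK_LOWER_BOUNDS[i] up to (but not
# including) the next lower bound; the last chunk is open-ended.
CHUNK_LOWER_BOUNDS = [0, 100, 200, 300]
CHUNK_OWNERS = ["shardA", "shardB", "shardA", "shardC"]


def target_shard(shard_key_value: int) -> str:
    """Return the shard that owns the chunk containing this shard key value."""
    if shard_key_value < CHUNK_LOWER_BOUNDS[0]:
        raise ValueError("value is below the lowest chunk bound")
    index = bisect_right(CHUNK_LOWER_BOUNDS, shard_key_value) - 1
    return CHUNK_OWNERS[index]


def target_shards_for_range(low: int, high: int) -> list[str]:
    """Return the distinct shards whose chunks overlap the range [low, high]."""
    first = max(bisect_right(CHUNK_LOWER_BOUNDS, low) - 1, 0)
    last = bisect_right(CHUNK_LOWER_BOUNDS, high) - 1
    return sorted(set(CHUNK_OWNERS[first:last + 1]))


print(target_shard(150))                # shardB
print(target_shards_for_range(50, 250)) # ['shardA', 'shardB']
```

An operation on a single shard key value goes to one shard, while a range that spans several chunks may have to be sent to more than one shard.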
10.1.3 Data Partitioning
MongoDB distributes data, or shards, at the collection level. Sharding partitions a collection’s data by the shard key.
- Shard Keys – A shard key is either an indexed field or an indexed compound field that exists in every document in the collection. MongoDB divides the shard key values into chunks and distributes the chunks evenly across the shards. To divide the shard key values into chunks, MongoDB uses either range based partitioning or hash based partitioning.
- Range Based Sharding – For range based partitioning, MongoDB divides the data set into ranges determined by the shard key values. Documents with “close” shard key values are likely to be part of the same chunk.
- Hash Based Sharding – For hash based partitioning, MongoDB computes a hash of a field’s value, and then uses these hashes to create chunks. With hash based partitioning, two documents with “close” shard key values are unlikely to be part of the same chunk. This ensures a more random distribution of a collection in the cluster.
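The difference between the two strategies can be illustrated with a short Python sketch. It rests on several assumptions: the chunk boundaries are invented, MD5 truncated to 64 bits stands in for whatever hash function the server uses, and the helper names range_chunk and hashed_chunk are hypothetical.

```python
import hashlib
from bisect import bisect_right

HASH_SPACE = 2 ** 64  # size of the 64-bit hash space used in this sketch


def range_chunk(key: int, lower_bounds: list[int]) -> int:
    """Range based: index of the chunk whose key range contains the key."""
    return bisect_right(lower_bounds, key) - 1


def hashed_chunk(key: int, chunk_count: int) -> int:
    """Hash based: hash the key, then split the hash space into equal ranges.

    MD5 is an assumption made for illustration only.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    hashed = int.from_bytes(digest[:8], "big")
    return hashed * chunk_count // HASH_SPACE


LOWER_BOUNDS = [0, 500, 1000, 1500]  # hypothetical chunk boundaries
close_keys = range(1000, 1010)

# With range based partitioning, neighbouring keys share a chunk.
print([range_chunk(k, LOWER_BOUNDS) for k in close_keys])
# With hash based partitioning, the same keys are spread over the chunks.
print([hashed_chunk(k, 4) for k in close_keys])
```

Range based partitioning keeps neighbouring keys together, which suits range queries, while hash based partitioning trades that locality for a more even spread.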
10.1.4 Maintaining a Balanced Data Distribution
The addition of new data or the addition of new servers can result in data distribution imbalances within the cluster. For example, a particular shard may contain significantly more chunks than another shard, or one chunk may be significantly larger than the others.
MongoDB ensures a balanced cluster using two background processes: splitting and the balancer.
- Splitting – Splitting is a background process that keeps chunks from growing too large.
- Balancing – The balancer is a background process that manages chunk migrations. The balancer can run from any of the query routers in a cluster.
For example: if collection users has 100 chunks on shard 1 and 50 chunks on shard 2, the balancer will migrate chunks from shard 1 to shard 2 until the collection achieves balance.
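The example above can be simulated with a small Python sketch. It is a simplified model rather than the real balancer: it moves one chunk at a time from the fullest shard to the emptiest one, and the threshold parameter is a hypothetical knob that decides when the cluster counts as balanced.

```python
def balance(chunk_counts: dict[str, int], threshold: int = 1) -> list[tuple[str, str]]:
    """Migrate chunks from the fullest to the emptiest shard until balanced.

    chunk_counts is updated in place. The return value lists every migration
    as a (source shard, destination shard) pair.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1, or the loop may never settle")
    migrations = []
    while True:
        fullest = max(chunk_counts, key=chunk_counts.get)
        emptiest = min(chunk_counts, key=chunk_counts.get)
        if chunk_counts[fullest] - chunk_counts[emptiest] <= threshold:
            return migrations
        chunk_counts[fullest] -= 1
        chunk_counts[emptiest] += 1
        migrations.append((fullest, emptiest))


users_chunks = {"shard1": 100, "shard2": 50}
moves = balance(users_chunks)
print(len(moves), users_chunks)  # 25 {'shard1': 75, 'shard2': 75}
```

After 25 migrations, both shards hold 75 chunks of the users collection.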
10.2 Sharding Concepts
10.2.1 Sharded Cluster Components
Sharded clusters implement sharding. A sharded cluster consists of the following components:
- Shards – A shard is a MongoDB instance that holds a subset of a collection’s data. Each shard is either a single mongod instance or a replica set. In production, all shards are replica sets.
- Config Servers – Each config server is a mongod instance that holds metadata about the cluster. The metadata maps chunks to shards.
- Routing Instances – Each router is a mongos instance that routes the reads and writes from applications to the shards. Applications do not access the shards directly.
10.2.2 Sharded Cluster Architectures
The following documents introduce deployment patterns for sharded clusters:
- Sharded Cluster Requirements
While sharding is a powerful and compelling feature, sharded clusters have significant infrastructure requirements and increase the overall complexity of a deployment. As a result, only deploy sharded clusters when indicated by application and operational requirements.
Sharding is the only solution for some classes of deployments. Use sharded clusters if:
- the size of your system’s active working set will soon exceed the capacity of your system’s maximum RAM.
- your data set approaches or exceeds the storage capacity of a single MongoDB instance.
- a single MongoDB instance cannot meet the demands of your write operations, and all other approaches have not reduced contention.
If these attributes are not present in your system, sharding will only add complexity to your system without adding much benefit.
- Production Cluster Architecture
In a production cluster, you must ensure that data is redundant and that your systems are highly available. To that end, a production cluster must have the following components:
1. Three Config Servers – Each config server must be on a separate machine. A single sharded cluster must have exclusive use of its config servers. If you have multiple sharded clusters, you will need a separate group of config servers for each cluster.
2. Two or More Replica Sets As Shards – These replica sets are the shards.
3. One or More Query Routers (mongos) – The mongos instances are the routers for the cluster. Typically, deployments have one mongos instance on each application server.
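As a rough checklist, the three requirements above can be expressed as a small Python validation sketch. The ClusterPlan class, its fields and the production_problems function are hypothetical names invented for this illustration; they do not correspond to any MongoDB tool or API.

```python
from dataclasses import dataclass


@dataclass
class ClusterPlan:
    """A hypothetical description of a planned sharded cluster."""
    config_server_hosts: list[str]  # one entry per config server
    replica_set_shards: int         # number of shards that are replica sets
    mongos_instances: int           # number of query routers


def production_problems(plan: ClusterPlan) -> list[str]:
    """Return the production requirements from the list above that the plan misses."""
    problems = []
    if len(plan.config_server_hosts) != 3:
        problems.append("need exactly three config servers")
    if len(set(plan.config_server_hosts)) != len(plan.config_server_hosts):
        problems.append("each config server must be on a separate machine")
    if plan.replica_set_shards < 2:
        problems.append("need two or more replica sets as shards")
    if plan.mongos_instances < 1:
        problems.append("need one or more mongos query routers")
    return problems


plan = ClusterPlan(["cfg1", "cfg2", "cfg3"], replica_set_shards=2, mongos_instances=1)
print(production_problems(plan))  # []
```

An empty list means the plan meets the three requirements listed in this section; it says nothing about sizing or network layout.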
- Sharded Cluster Test Architecture
For testing and development, you can deploy a minimal sharded cluster. These non-production clusters have the following components:
- One config server.
- At least one shard. Shards are either replica sets or standalone mongod instances.
- One mongos instance.
10.2.3 Sharded Cluster Behavior
These documents address the distribution of data and queries to a sharded cluster as well as specific security and availability considerations for sharded clusters.
- Shard Keys – MongoDB uses the shard key to divide a collection’s data across the cluster’s shards.
- Sharded Cluster High Availability – Sharded clusters provide ways to address some availability concerns.
- Sharded Cluster Query Routing – The cluster’s routers, or mongos instances, send reads and writes to the relevant shard or shards.
- Tag Aware Sharding – Tags associate specific ranges of shard key values with specific shards for use in managing deployment patterns.
10.2.4 Sharding Mechanics
The following points describe sharded cluster processes:
- Sharded Collection Balancing – Balancing distributes a sharded collection’s data across all of the shards in the cluster.
- Chunk Migration Across Shards – MongoDB migrates chunks to shards as part of the balancing process.
- Chunk Splits in a Sharded Cluster – When a chunk grows beyond the configured size, MongoDB splits the chunk in half.
- Shard Key Indexes – Sharded collections must keep an index that starts with the shard key.
- Sharded Cluster Metadata – The cluster maintains internal metadata that reflects the location of data within the cluster.
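The chunk split described in the list above can be sketched in Python as follows. This is a simplification built on stated assumptions: a chunk is modelled as a plain list of shard key values, the number of documents stands in for the configured chunk size, and split_chunk is a hypothetical helper.

```python
def split_chunk(keys: list[int], max_docs: int) -> list[list[int]]:
    """Split a chunk in half at its median key once it holds more than max_docs.

    Returns a list with one chunk (no split needed) or two chunks (after a split).
    """
    if len(keys) <= max_docs:
        return [keys]
    ordered = sorted(keys)
    middle = len(ordered) // 2
    return [ordered[:middle], ordered[middle:]]


print(split_chunk([1, 2, 3], max_docs=4))          # [[1, 2, 3]]
print(split_chunk(list(range(10)), max_docs=4))    # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
```

In this sketch a split only divides the key range into two chunks; spreading those chunks across shards is the balancer's job.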
10.3 Sharded Cluster Tutorials
The following points provide instructions for administering sharded clusters:
- Deploy a Sharded Cluster – Set up a sharded cluster by creating the needed data directories, starting the required MongoDB instances, and configuring the cluster settings.
- Shard a Collection Using a Hashed Shard Key – Shard a collection based on hashes of a field’s values in order to ensure even distribution over the collection’s shards.
- Add Shards to a Cluster – Add a shard to add capacity to a sharded cluster.
- View Cluster Configuration – View status information about the cluster’s databases, shards, and chunks.
- Remove Shards from an Existing Sharded Cluster – Migrate a single shard’s data and remove the shard.
- Migrate Config Servers with Different Hostnames – Migrate a config server to a new system that uses a new hostname. If possible, avoid changing the hostname and instead use the Migrate Config Servers with the Same Hostname procedure.
- Manage Shard Tags – Use tags to associate specific ranges of shard key values with specific shards.
- Sharded Cluster Data Management – Practices that address common issues in managing large sharded data sets.
- Troubleshoot Sharded Clusters – Presents solutions to common issues and concerns relevant to the administration and use of sharded clusters.