Shard key selection strategy

Author: Chuan Chen | Category: MongoDB

The shard key is the core mechanism in a MongoDB sharded cluster that determines data distribution. A well-chosen shard key ensures balanced cluster load and efficient queries, while a poor choice can lead to performance bottlenecks or uneven data distribution. Understanding the characteristics of shard keys and their impact on business scenarios is crucial.

Basic Characteristics of Shard Keys

The shard key must be an existing field or a compound field in the collection and must meet the following conditions:

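Hard to change: Treat the choice of shard key as long-lived. Older releases treat both the key and its values as immutable; MongoDB 4.2 and later allow updating a document's shard key value (unless the field is _id), but changing the key itself remains costly.
Indexed: The collection needs an index that starts with the shard key. On an empty collection, sh.shardCollection() creates this index; on a populated collection, create it first.
Cardinality requirement: The shard key should have sufficiently high cardinality (number of distinct values) so the data can be split into many chunks.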

// Example of creating a sharded collection
sh.shardCollection("orders.products", { productId: 1, region: 1 })

Types of Shard Keys and Selection Strategies

Hashed Shard Key

Distributes data evenly by computing a hash of the field value. This suits write-heavy workloads, but range queries on the hashed field cannot be targeted and must be sent to every shard.

// Creating a hashed shard key
db.products.createIndex({ "productId": "hashed" })
sh.shardCollection("ecommerce.products", { "productId": "hashed" })

Typical use cases:

High-cardinality fields such as order IDs or user IDs, including monotonically increasing ones
Write-intensive applications, such as IoT sensor data
Business scenarios that do not require range queries
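
To illustrate why hashing evens out writes, here is a minimal sketch in plain JavaScript. It does not talk to MongoDB: the fnv1a function is a stand-in hash chosen for brevity (MongoDB's hashed indexes use their own 64-bit hash), and the four shards and the P-prefixed product IDs are assumptions made for the example.

```javascript
// Stand-in 32-bit FNV-1a hash (an assumption for illustration;
// MongoDB's hashed index uses a different, 64-bit hash).
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Count how many keys land on each of `shardCount` hypothetical shards.
function hashedDistribution(keys, shardCount) {
  const counts = new Array(shardCount).fill(0);
  for (const key of keys) {
    counts[fnv1a(String(key)) % shardCount] += 1;
  }
  return counts;
}

// Sequential product IDs: the worst case for a ranged key.
const productIds = Array.from({ length: 10000 }, (_, i) => `P${i}`);
const perShard = hashedDistribution(productIds, 4);
console.log(perShard); // four counts of roughly equal size
```

Even though the IDs are sequential, each hypothetical shard receives a similar share, which is the property a hashed shard key relies on.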

Range Shard Key

Distributes data based on the natural order of field values. Range queries can be routed to the few shards that hold the matching range, but monotonically increasing values can create write hotspots.

// Creating a range shard key
db.logs.createIndex({ "timestamp": 1 })
sh.shardCollection("analytics.logs", { "timestamp": 1 })

Applicable scenarios:

Time-series data (logs, monitoring data)
Scenarios requiring frequent range queries
Scenarios with naturally increasing data, provided the resulting write hotspot on the newest chunk is acceptable
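
The trade-off of a range shard key can be sketched without a cluster. The chunk layout below is hypothetical (four chunks with numeric boundaries standing in for timestamps), and shardsForRange is an illustrative helper, not a MongoDB API.

```javascript
// Hypothetical chunk layout for a ranged key: each chunk covers [min, max).
const chunks = [
  { shard: "shardA", min: 0, max: 100 },
  { shard: "shardB", min: 100, max: 200 },
  { shard: "shardC", min: 200, max: 300 },
  { shard: "shardD", min: 300, max: Infinity },
];

// Return the shards whose chunks overlap the query range [from, to).
function shardsForRange(chunkList, from, to) {
  const shards = new Set();
  for (const chunk of chunkList) {
    if (chunk.min < to && from < chunk.max) shards.add(chunk.shard);
  }
  return [...shards];
}

// A narrow range query touches a single shard...
const rangeQueryShards = shardsForRange(chunks, 120, 180);
// ...but so does every insert of a new, ever-larger value.
const latestInsertShards = shardsForRange(chunks, 1000, 1001);

console.log(rangeQueryShards);   // [ 'shardB' ]
console.log(latestInsertShards); // [ 'shardD' ]
```

The same ordering that keeps a range query on one shard also sends every new maximum value to the last chunk.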

Compound Shard Key

Combines multiple fields to balance query performance and distribution uniformity.

// Creating a compound shard key
// Shard key patterns accept only 1 (ranged) or "hashed" per field, so both fields are ascending
db.orders.createIndex({ "customerId": 1, "orderDate": 1 })
sh.shardCollection("retail.orders", { "customerId": 1, "orderDate": 1 })

Design considerations:

Place high-cardinality fields first
Design field order based on query patterns
Consider the correlation between field values
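
Field order matters because a compound key is compared field by field. The sketch below assumes ascending order on both fields and uses plain JavaScript comparison as a simplification of MongoDB's BSON ordering; compareShardKeys and the sample orders are hypothetical.

```javascript
// Compare two documents on a list of shard key fields, in order.
function compareShardKeys(fields, a, b) {
  for (const field of fields) {
    if (a[field] < b[field]) return -1;
    if (a[field] > b[field]) return 1;
  }
  return 0;
}

const orders = [
  { customerId: "C2", orderDate: "2023-01-05" },
  { customerId: "C1", orderDate: "2023-03-01" },
  { customerId: "C2", orderDate: "2023-01-01" },
  { customerId: "C1", orderDate: "2023-01-15" },
];

// Sorting by (customerId, orderDate) keeps each customer's orders adjacent.
const sorted = [...orders].sort((a, b) =>
  compareShardKeys(["customerId", "orderDate"], a, b)
);
console.log(sorted.map(o => `${o.customerId}/${o.orderDate}`));
// [ 'C1/2023-01-15', 'C1/2023-03-01', 'C2/2023-01-01', 'C2/2023-01-05' ]
```

Because the first field dominates the ordering, documents that share it sit next to each other in the key space, which is why the leading field should match the most common query filter.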

Core Factors in Shard Key Selection

Data Distribution Balance

Avoid "hotspot shard" issues. For example, using a monotonically increasing _id (such as the default ObjectId) as a range shard key routes every new write to the chunk that covers the highest key range, so a single shard absorbs all inserts.

Bad practice:

// May cause write hotspots
sh.shardCollection("events.tracking", { "_id": 1 })

Improved solution:

// Use a compound shard key to distribute writes
sh.shardCollection("events.tracking", { "deviceId": 1, "timestamp": 1 })
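
The difference between the two designs can be simulated in a few lines. This is a sketch under stated assumptions: the split points, the four devices, and the routeByRange helper are all hypothetical, and real chunk boundaries are managed by the balancer.

```javascript
// Route a key to a shard index given sorted split points
// (hypothetical chunk boundaries).
function routeByRange(splitPoints, key) {
  let index = 0;
  while (index < splitPoints.length && key >= splitPoints[index]) index++;
  return index;
}

// 1,000 new events from 4 devices, arriving after existing data
// whose timestamps end at 3000.
const newEvents = Array.from({ length: 1000 }, (_, i) => ({
  deviceId: `D${i % 4}`,
  timestamp: 3000 + i,
}));

// Key on the increasing value alone: boundaries were set by older data.
const byTimestamp = [0, 0, 0, 0];
for (const e of newEvents) {
  byTimestamp[routeByRange([1000, 2000, 3000], e.timestamp)]++;
}

// Key led by deviceId: boundaries fall between device IDs.
const byDevice = [0, 0, 0, 0];
for (const e of newEvents) {
  byDevice[routeByRange(["D1", "D2", "D3"], e.deviceId)]++;
}

console.log(byTimestamp); // [ 0, 0, 0, 1000 ]
console.log(byDevice);    // [ 250, 250, 250, 250 ]
```

With the increasing key every insert lands on the last shard, while the device-led key spreads the same writes across all four.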

Query Pattern Alignment

The shard key should align with the most frequent query patterns. For example, in an e-commerce platform, querying orders by user is more common than by product:

// Align with user-centric query patterns
sh.shardCollection("ecommerce.orders", { "userId": 1, "status": 1 })

// Queries can be efficiently routed to specific shards
db.orders.find({ "userId": "U1001", "status": "shipped" })
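
As a rough mental model, a query can be routed to specific shards when it constrains a leading prefix of the shard key, and is otherwise broadcast to all shards. The helper below is a simplified sketch of that rule: routingFor is hypothetical and only checks which fields appear in the filter, ignoring operators and other details that the real router considers.

```javascript
// Classify a query filter against a shard key. "targeted" means the filter
// constrains at least the first shard key field; "broadcast" means it does not.
function routingFor(shardKeyFields, query) {
  let prefixLength = 0;
  while (
    prefixLength < shardKeyFields.length &&
    shardKeyFields[prefixLength] in query
  ) {
    prefixLength++;
  }
  return { mode: prefixLength > 0 ? "targeted" : "broadcast", prefixLength };
}

const orderShardKey = ["userId", "status"];

console.log(routingFor(orderShardKey, { userId: "U1001", status: "shipped" }));
// { mode: 'targeted', prefixLength: 2 }
console.log(routingFor(orderShardKey, { userId: "U1001" }));
// { mode: 'targeted', prefixLength: 1 }
console.log(routingFor(orderShardKey, { status: "shipped" }));
// { mode: 'broadcast', prefixLength: 0 }
```

A filter on status alone skips the leading field, so under this model it would be sent to every shard.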

Shard Key Cardinality

Low-cardinality fields can lead to uneven data distribution. For example, using a "gender" field as the shard key can distribute data to at most 2 shards.

// Shard issues caused by low-cardinality fields
sh.shardCollection("users.profiles", { "gender": 1 })

// Improved solution: Combine with high-cardinality fields
sh.shardCollection("users.profiles", { "gender": 1, "userId": 1 })
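
One way to check cardinality before committing to a key is to count distinct key values in a sample. The sketch below runs on an in-memory array; distinctKeyCount and the generated profiles are hypothetical, and on a real collection the sample would come from a query.

```javascript
// Count distinct shard key values in a sample of documents.
function distinctKeyCount(docs, fields) {
  const seen = new Set();
  for (const doc of docs) {
    seen.add(JSON.stringify(fields.map(f => doc[f])));
  }
  return seen.size;
}

// 1,000 hypothetical profiles with only two gender values.
const profiles = Array.from({ length: 1000 }, (_, i) => ({
  userId: `U${i}`,
  gender: i % 2 === 0 ? "F" : "M",
}));

const genderOnly = distinctKeyCount(profiles, ["gender"]);
const genderAndUser = distinctKeyCount(profiles, ["gender", "userId"]);

console.log(genderOnly);    // 2
console.log(genderAndUser); // 1000
```

The number of distinct key values is an upper bound on the number of chunks the data can be split into, so the first key can never use more than two.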

Shard Key Design for Special Scenarios

Time-Series Data

Timestamps as shard keys should be combined with other fields to prevent hotspots:

// Optimized shard key for time-series data
// Shard key patterns accept only 1 (ranged) or "hashed", so timestamp is ascending here
sh.shardCollection("iot.sensorData", { "sensorId": 1, "timestamp": 1 })

// Time-range queries can still effectively utilize sharding
db.sensorData.find({
  "sensorId": "SENSOR-001",
  "timestamp": { $gte: ISODate("2023-01-01") }
})

Multi-Tenant Systems

Tenant ID should be the first field in the shard key:

// Shard key design for multi-tenant systems
sh.shardCollection("saas.events", { "tenantId": 1, "eventType": 1 })

// Ensure tenant data locality
db.events.find({ "tenantId": "ACME_Corp", "eventType": "login" })

Geospatial Data

Combine geohash with business attributes:

// Shard key for geospatial data
sh.shardCollection("maps.pois", { "geoHash": 1, "category": 1 })

// Support geolocation queries
db.pois.createIndex({ "location": "2dsphere" })
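
The source does not say how the geoHash field is populated. One possibility, shown here as an assumption, is the standard base-32 geohash encoding, where nearby points share a common prefix. The encodeGeohash function below is an illustrative implementation, not a MongoDB feature.

```javascript
const BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

// Encode a latitude/longitude pair as a geohash string of `precision` characters.
function encodeGeohash(lat, lon, precision = 6) {
  const latRange = [-90, 90];
  const lonRange = [-180, 180];
  let hash = "";
  let bits = 0;
  let value = 0;
  let useLon = true; // geohash interleaves bits, starting with longitude

  while (hash.length < precision) {
    const range = useLon ? lonRange : latRange;
    const coord = useLon ? lon : lat;
    const mid = (range[0] + range[1]) / 2;
    if (coord >= mid) {
      value = value * 2 + 1; // upper half: emit a 1 bit
      range[0] = mid;
    } else {
      value = value * 2;     // lower half: emit a 0 bit
      range[1] = mid;
    }
    useLon = !useLon;
    if (++bits === 5) {      // every 5 bits become one base-32 character
      hash += BASE32[value];
      bits = 0;
      value = 0;
    }
  }
  return hash;
}

console.log(encodeGeohash(57.64911, 10.40744, 3)); // "u4p"
```

A short prefix groups points from the same area into the same key range, while the second shard key field (category in the example above) adds cardinality within a cell.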

Limitations on Modifying Shard Keys

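In older MongoDB versions a shard key cannot be changed once set. MongoDB 4.4 and later can add suffix fields to an existing key with refineCollectionShardKey, and MongoDB 5.0 and later can reshard a collection with reshardCollection. Where these commands are unavailable or unsuitable, the key can be changed indirectly through the following methods: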

Create a new collection with a new shard key.
Use ETL tools to migrate data.
Implement a dual-write strategy at the application layer.

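// Example of data migration (a sketch; assumes app.orders_new is already sharded on the new key)
const oldCollection = db.getSiblingDB("app").orders;
const newCollection = db.getSiblingDB("app").orders_new;

// Copy in batches to limit round trips; insert() is deprecated in favor of insertMany()
let batch = [];
oldCollection.find().forEach(doc => {
  batch.push(doc);
  if (batch.length === 1000) {
    newCollection.insertMany(batch, { ordered: false });
    batch = [];
  }
});
if (batch.length > 0) newCollection.insertMany(batch, { ordered: false });

// Switch collection references at the application layer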

Monitoring and Optimizing Shard Keys

Use the following commands to monitor sharded cluster status:

// View shard distribution
db.orders.getShardDistribution()

// Check data balancing status
sh.status()

// Inspect in-progress operations on the collection
// (run explain() on a query to see which shards it targets)
db.currentOp(true).inprog.forEach(op => {
  if (op.ns === "app.orders") printjson(op);
})
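
Once per-shard document counts are known, for example read off the output of getShardDistribution(), a small helper can turn them into a skew figure. This is a sketch: summarizeDistribution, the shard names, and the counts are hypothetical, and what counts as too skewed depends on the workload.

```javascript
// Summarize per-shard document counts: each shard's share of the total and
// the largest count relative to a perfectly even split (1.0 means balanced).
function summarizeDistribution(docCounts) {
  const entries = Object.entries(docCounts);
  const total = entries.reduce((sum, [, count]) => sum + count, 0);
  const ideal = total / entries.length;
  const shares = {};
  let maxSkew = 0;
  for (const [shard, count] of entries) {
    shares[shard] = count / total;
    maxSkew = Math.max(maxSkew, count / ideal);
  }
  return { total, shares, maxSkew };
}

const summary = summarizeDistribution({ shardA: 700, shardB: 200, shardC: 100 });
console.log(summary.total);              // 1000
console.log(summary.maxSkew.toFixed(2)); // "2.10"
```

Here the busiest shard holds about 2.1 times its fair share, a sign that the shard key is concentrating data.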

Key monitoring metrics: