Shard Key Selection Strategy
The shard key is the core mechanism in a MongoDB sharded cluster that determines data distribution. A well-chosen shard key ensures balanced cluster load and efficient queries, while a poor choice can lead to performance bottlenecks or uneven data distribution. Understanding the characteristics of shard keys and their impact on business scenarios is crucial.
Basic Characteristics of Shard Keys
The shard key must be an existing field or a compound field in the collection and must meet the following conditions:
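- Immutable (older versions): Before MongoDB 4.2, a document's shard key value cannot be modified once set. From 4.2 onward it can be updated unless the shard key field is the immutable _id field.
- Indexed: The shard key must be supported by an index whose leading fields are the shard key (either an existing index or one created when the collection is sharded).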
- Cardinality requirement: The shard key should have sufficiently high cardinality (number of distinct values).
// Example of creating a sharded collection
sh.shardCollection("orders.products", { productId: 1, region: 1 })
Types of Shard Keys and Selection Strategies
Hashed Shard Key
Distributes data evenly by computing a hash of the field value. Suitable for write-heavy scenarios, but range queries on the shard key cannot be targeted and are broadcast to all shards.
// Creating a hashed shard key
db.products.createIndex({ "productId": "hashed" })
sh.shardCollection("ecommerce.products", { "productId": "hashed" })
Typical use cases:
- High-cardinality identifier fields such as order IDs or user IDs, including monotonically increasing ones
- Write-intensive applications, such as IoT sensor data
- Business scenarios that do not require range queries
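The even spread that hashing gives to sequential values can be seen in a small simulation. The sketch below is plain JavaScript, not MongoDB code: `mix32` is a stand-in integer mixer (MongoDB uses its own internal hash function), and `shardForHashedKey`, `countPerShard`, and the four-shard layout are assumptions made for illustration.

```javascript
// Stand-in 32-bit integer mixer (the MurmurHash3 finalizer); NOT MongoDB's actual hash.
function mix32(x) {
  x = Math.imul(x ^ (x >>> 16), 0x85ebca6b);
  x = Math.imul(x ^ (x >>> 13), 0xc2b2ae35);
  return (x ^ (x >>> 16)) >>> 0;
}

// Map a numeric key to one of `shardCount` equal slices of the 32-bit hash space.
function shardForHashedKey(key, shardCount) {
  return Math.floor(mix32(key) / (0x100000000 / shardCount));
}

// Count how many keys land on each shard.
function countPerShard(keys, shardCount) {
  const counts = new Array(shardCount).fill(0);
  for (const key of keys) counts[shardForHashedKey(key, shardCount)]++;
  return counts;
}

// 10,000 monotonically increasing IDs, the worst case for a range shard key.
const sequentialIds = Array.from({ length: 10000 }, (_, i) => 100000 + i);
const hashedCounts = countPerShard(sequentialIds, 4);
console.log(hashedCounts);
```

Because neighbouring IDs hash to unrelated positions, a query for a range of IDs would have to visit every shard in this model, which is the trade-off described above.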
Range Shard Key
Distributes data based on the natural order of field values. Supports efficient range queries but may cause hotspot issues.
// Creating a range shard key
db.logs.createIndex({ "timestamp": 1 })
sh.shardCollection("analytics.logs", { "timestamp": 1 })
Applicable scenarios:
- Time-series data (logs, monitoring data)
- Scenarios requiring frequent range queries
- Scenarios with naturally increasing data
Compound Shard Key
Combines multiple fields to balance query performance and distribution uniformity.
// Creating a compound shard key
db.orders.createIndex({ "customerId": 1, "orderDate": -1 })
sh.shardCollection("retail.orders", { "customerId": 1, "orderDate": -1 })
Design considerations:
- Place high-cardinality fields first
- Design field order based on query patterns
- Consider the correlation between field values
Core Factors in Shard Key Selection
Data Distribution Balance
Avoid "hotspot shard" issues. For example, using a monotonically increasing _id as a range shard key concentrates all new writes on the shard that holds the chunk with the highest key range.
Bad practice:
// May cause write hotspots
sh.shardCollection("events.tracking", { "_id": 1 })
Improved solution:
// Use a compound shard key to distribute writes
sh.shardCollection("events.tracking", { "deviceId": 1, "timestamp": 1 })
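The hotspot effect can be illustrated with a simplified model of range-based chunk lookup. This is a sketch under stated assumptions, not how the cluster is implemented: `compareKeys`, `chunkIndexFor`, the chunk bounds, and the device names are all hypothetical, and real chunk metadata is managed by the config servers.

```javascript
// Compare two shard key values given as arrays of fields, e.g. [deviceId, timestamp].
function compareKeys(a, b) {
  for (let i = 0; i < a.length; i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return 0;
}

// Chunk i covers keys from lowerBounds[i] up to (not including) lowerBounds[i + 1].
// The first chunk also absorbs anything below its bound; the last chunk is open-ended.
function chunkIndexFor(key, lowerBounds) {
  let index = 0;
  for (let i = 0; i < lowerBounds.length; i++) {
    if (compareKeys(key, lowerBounds[i]) >= 0) index = i;
  }
  return index;
}

function countWrites(keys, lowerBounds) {
  const counts = new Array(lowerBounds.length).fill(0);
  for (const key of keys) counts[chunkIndexFor(key, lowerBounds)]++;
  return counts;
}

// Case 1: shard key is an increasing value only; every new write exceeds all bounds.
const increasingBounds = [[0], [1000], [2000], [3000]];
const increasingWrites = Array.from({ length: 100 }, (_, i) => [5000 + i]);
const increasingCounts = countWrites(increasingWrites, increasingBounds);

// Case 2: shard key is (deviceId, timestamp); writes arrive from four devices in turn.
const devices = ["dev-a", "dev-b", "dev-c", "dev-d"];
const compoundBounds = devices.map(d => [d, 0]);
const compoundWrites = Array.from({ length: 100 }, (_, i) => [devices[i % 4], 5000 + i]);
const compoundCounts = countWrites(compoundWrites, compoundBounds);

console.log(increasingCounts); // [ 0, 0, 0, 100 ]
console.log(compoundCounts);   // [ 25, 25, 25, 25 ]
```

In the first case every write lands in the last chunk. In the second, the leading deviceId decides the chunk, so the increasing timestamp no longer funnels writes to one place.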
Query Pattern Alignment
The shard key should align with the most frequent query patterns. For example, in an e-commerce platform, querying orders by user is more common than by product:
// Align with user-centric query patterns
sh.shardCollection("ecommerce.orders", { "userId": 1, "status": 1 })
// Queries can be efficiently routed to specific shards
db.orders.find({ "userId": "U1001", "status": "shipped" })
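A rough way to reason about routing: mongos can target a query to specific shards when the filter includes the shard key or a leading prefix of a compound shard key, and otherwise has to broadcast it. The sketch below encodes only that rule. `matchedPrefixLength` and `routingFor` are hypothetical helpers, and the model ignores details such as operator types and hashed keys.

```javascript
// Count how many leading shard key fields the query filter constrains.
function matchedPrefixLength(shardKeyFields, filter) {
  let matched = 0;
  for (const field of shardKeyFields) {
    if (!(field in filter)) break;
    matched++;
  }
  return matched;
}

// Simplified rule: a filter that constrains at least the first shard key field
// can be targeted; anything else is broadcast to every shard.
function routingFor(shardKeyFields, filter) {
  return matchedPrefixLength(shardKeyFields, filter) > 0 ? "targeted" : "scatter-gather";
}

const orderShardKey = ["userId", "status"];
console.log(routingFor(orderShardKey, { userId: "U1001", status: "shipped" })); // targeted
console.log(routingFor(orderShardKey, { userId: "U1001" }));                    // targeted
console.log(routingFor(orderShardKey, { status: "shipped" }));                  // scatter-gather
```

The last case shows why field order matters: a filter on status alone skips the leading userId field, so no shard can be ruled out.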
Shard Key Cardinality
Low-cardinality fields can lead to uneven data distribution. For example, using a "gender" field as the shard key can distribute data to at most 2 shards.
// Shard issues caused by low-cardinality fields
sh.shardCollection("users.profiles", { "gender": 1 })
// Improved solution: Combine with high-cardinality fields
sh.shardCollection("users.profiles", { "gender": 1, "userId": 1 })
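Cardinality can be estimated on a sample of documents before committing to a key. Documents that share one shard key value cannot be split across chunks, so the number of distinct values is an upper bound on the number of chunks. The sketch below is plain JavaScript over made-up sample data; `distinctKeyCount` and the `profiles` array are illustrative assumptions.

```javascript
// Count distinct combinations of the candidate shard key fields in a sample.
function distinctKeyCount(docs, fields) {
  const seen = new Set();
  for (const doc of docs) {
    seen.add(JSON.stringify(fields.map(field => doc[field])));
  }
  return seen.size;
}

// Made-up sample: 1,000 profiles with a unique userId and a two-valued gender field.
const profiles = Array.from({ length: 1000 }, (_, i) => ({
  userId: i,
  gender: i % 2 === 0 ? "F" : "M",
}));

console.log(distinctKeyCount(profiles, ["gender"]));           // 2
console.log(distinctKeyCount(profiles, ["gender", "userId"])); // 1000
```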
Shard Key Design for Special Scenarios
Time-Series Data
Timestamps as shard keys should be combined with other fields to prevent hotspots:
// Optimized shard key for time-series data
sh.shardCollection("iot.sensorData", { "sensorId": 1, "timestamp": -1 })
// Time-range queries can still effectively utilize sharding
db.sensorData.find({
"sensorId": "SENSOR-001",
"timestamp": { $gte: ISODate("2023-01-01") }
})
Multi-Tenant Systems
Tenant ID should be the first field in the shard key:
// Shard key design for multi-tenant systems
sh.shardCollection("saas.events", { "tenantId": 1, "eventType": 1 })
// Ensure tenant data locality
db.events.find({ "tenantId": "ACME_Corp", "eventType": "login" })
Geospatial Data
Combine geohash with business attributes:
// Shard key for geospatial data
sh.shardCollection("maps.pois", { "geoHash": 1, "category": 1 })
// Support geolocation queries
db.pois.createIndex({ "location": "2dsphere" })
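The geoHash field is not computed by MongoDB; the application has to store it on each document. The sketch below assumes geoHash holds a standard base-32 geohash string and shows one way to derive it from latitude and longitude. `encodeGeohash` is a hypothetical helper name, and the default precision of 6 characters is an arbitrary choice.

```javascript
const GEOHASH_ALPHABET = "0123456789bcdefghjkmnpqrstuvwxyz";

// Standard geohash encoding: interleave longitude and latitude bisection bits
// (longitude first) and emit one base-32 character per 5 bits.
function encodeGeohash(lat, lon, precision = 6) {
  const latRange = [-90, 90];
  const lonRange = [-180, 180];
  let hash = "";
  let bits = 0;
  let bitCount = 0;
  let useLon = true;
  while (hash.length < precision) {
    const range = useLon ? lonRange : latRange;
    const value = useLon ? lon : lat;
    const mid = (range[0] + range[1]) / 2;
    if (value >= mid) {
      bits = (bits << 1) | 1;
      range[0] = mid; // keep the upper half
    } else {
      bits = bits << 1;
      range[1] = mid; // keep the lower half
    }
    useLon = !useLon;
    if (++bitCount === 5) {
      hash += GEOHASH_ALPHABET[bits];
      bits = 0;
      bitCount = 0;
    }
  }
  return hash;
}

console.log(encodeGeohash(57.64911, 10.40744, 4)); // u4pr
```

A longer geohash of the same point always starts with the shorter one, so points in the same cell share a prefix and sort next to each other under a range shard key.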
Limitations on Modifying Shard Keys
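Before MongoDB 4.4, a collection's shard key cannot be changed once set. MongoDB 4.4 added refineCollectionShardKey, which appends suffix fields to the existing key, and MongoDB 5.0 added reshardCollection, which changes the key entirely. Where those commands are unavailable or unsuitable, adjustments can be made indirectly through the following methods: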
- Create a new collection with a new shard key.
- Use ETL tools to migrate data.
- Implement a dual-write strategy at the application layer.
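// Example of data migration: copy documents in batches
// (orders_new is assumed to be sharded with the new key already)
const oldCollection = db.getSiblingDB("app").orders;
const newCollection = db.getSiblingDB("app").orders_new;
let batch = [];
oldCollection.find().forEach(doc => {
  batch.push(doc);
  if (batch.length === 1000) {
    newCollection.insertMany(batch, { ordered: false });
    batch = [];
  }
});
if (batch.length > 0) newCollection.insertMany(batch, { ordered: false });
// Switch collection references at the application layer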
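On MongoDB 4.4 and later, the refineCollectionShardKey admin command can append fields to an existing shard key, and on 5.0 and later reshardCollection can replace the key. The sketch below only builds the command document, choosing between the two by checking whether the old key is a prefix of the new one. `isPrefixOf` and `shardKeyChangeCommand` are hypothetical helpers, and the result would still have to be passed to `db.adminCommand(...)` in mongosh.

```javascript
// True when every field of oldKey appears, in order and with the same value,
// at the start of newKey.
function isPrefixOf(oldKey, newKey) {
  const oldEntries = Object.entries(oldKey);
  const newEntries = Object.entries(newKey);
  return oldEntries.length <= newEntries.length &&
    oldEntries.every(([field, value], i) =>
      newEntries[i][0] === field && newEntries[i][1] === value);
}

// Build the admin command document for moving a collection from oldKey to newKey.
// Returns null when the two keys are identical and nothing needs to change.
function shardKeyChangeCommand(namespace, oldKey, newKey) {
  if (isPrefixOf(oldKey, newKey)) {
    if (Object.keys(oldKey).length === Object.keys(newKey).length) return null;
    // Appending suffix fields: refineCollectionShardKey (MongoDB 4.4+).
    return { refineCollectionShardKey: namespace, key: newKey };
  }
  // Any other change: reshardCollection (MongoDB 5.0+).
  return { reshardCollection: namespace, key: newKey };
}

const refine = shardKeyChangeCommand(
  "app.orders", { customerId: 1 }, { customerId: 1, orderDate: 1 });
const reshard = shardKeyChangeCommand(
  "app.orders", { customerId: 1 }, { region: 1, customerId: 1 });
console.log(refine);
console.log(reshard);
```

Refining a key requires an index that supports the new key to exist first, and resharding rewrites the whole collection, so both are worth rehearsing on a test cluster.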
Monitoring and Optimizing Shard Keys
Use the following commands to monitor sharded cluster status:
// View shard distribution
db.orders.getShardDistribution()
// Check data balancing status
sh.status()
// List in-progress operations on app.orders
db.currentOp(true).inprog.forEach(op => {
  if (op.ns === "app.orders") printjson(op);
})
Key monitoring metrics:
- Data volume differences between shards
- Read/write load on each shard
- Whether queries are effectively routed (the shardName field in explain results)
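The first metric can be reduced to a couple of numbers once per-shard document counts are available. The sketch below is a minimal example in plain JavaScript: `distributionSummary` is a hypothetical helper, the input shape (shard name mapped to document count) is an assumption, and the counts are made up rather than taken from a real cluster.

```javascript
// Share of all documents held by each shard, plus the largest-to-smallest ratio.
function distributionSummary(docCountByShard) {
  const entries = Object.entries(docCountByShard);
  const total = entries.reduce((sum, [, count]) => sum + count, 0);
  const shares = Object.fromEntries(
    entries.map(([shard, count]) => [shard, count / total]));
  const counts = entries.map(([, count]) => count);
  return { total, shares, skewRatio: Math.max(...counts) / Math.min(...counts) };
}

// Made-up counts for three shards.
const summary = distributionSummary({ shard0: 400000, shard1: 350000, shard2: 50000 });
console.log(summary.shares.shard0); // 0.5
console.log(summary.skewRatio);     // 8
```

A skew ratio that stays well above 1 after the balancer has run suggests the shard key, rather than balancing lag, is the cause.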