Best practices for choosing a shard key
Basic Concepts of Shard Keys
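The shard key is the field (or set of fields) that determines how a MongoDB sharded cluster distributes a collection's documents across shards. It directly affects data distribution, query performance, and the scalability of the cluster. Changing a shard key later is costly: before MongoDB 4.4 it could not be changed at all, 4.4 added the ability to refine a key by appending suffix fields, and 5.0 added resharding onto a new key. Careful consideration is therefore required during selection.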
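// Example: Specifying a shard key when creating a sharded collection
// shardCollection is an admin command, so it is run with db.adminCommand()
// (before MongoDB 6.0, run sh.enableSharding("orders") first).
// Ranged shard key fields must be 1; descending (-1) fields are rejected.
db.adminCommand({
  shardCollection: "orders.products",
  key: { customerId: 1, orderDate: 1 }
})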
Core Principles for Choosing a Shard Key
Cardinality Principle
High-cardinality fields make better shard keys. For example, a user ID is a better choice than a gender field because it has far more distinct values. A chunk cannot be split below a single shard key value, so a low-cardinality field caps the number of chunks, which leads to uneven data distribution and "hotspot" shards.
// Bad example: Low-cardinality field
{ status: 1 } // May only have values like "pending", "shipped", "delivered"
// Good example: High-cardinality field
{ _id: 1 } // ObjectId values are unique, but they also increase over time, so ranged sharding on _id alone still concentrates inserts
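To make the cardinality argument concrete, here is a minimal sketch in plain JavaScript (no MongoDB connection needed). The helper `distinctKeyValues` and the sample data are hypothetical and exist only for illustration. Because a chunk cannot be split below a single shard key value, the number of distinct values is an upper bound on the number of chunks.

```javascript
// Illustrative helper (not a MongoDB API): counts the distinct values
// of a candidate shard key field in a sample of documents.
function distinctKeyValues(docs, field) {
  return new Set(docs.map((doc) => doc[field])).size;
}

// Hypothetical sample: 1,000 orders cycling through three statuses.
const statuses = ["pending", "shipped", "delivered"];
const sampleOrders = Array.from({ length: 1000 }, (_, i) => ({
  orderId: `O${i}`,
  status: statuses[i % statuses.length],
}));

// At most 3 chunks are possible when sharding on status.
console.log(distinctKeyValues(sampleOrders, "status")); // 3
// Up to 1,000 chunks are possible when sharding on orderId.
console.log(distinctKeyValues(sampleOrders, "orderId")); // 1000
```

Running a similar count against a sample of real data is a cheap sanity check before committing to a key.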
Write Distribution Principle
The shard key should spread write operations evenly across all shards. A timestamp on its own is generally a poor ranged shard key because it increases monotonically: every new document falls into the chunk with the highest key range, so all inserts land on whichever shard currently holds that chunk.
// May cause write hotspots
{ createdAt: 1 } // All new documents are written to the same shard range
// Better choice
{ _id: "hashed" } // Uses hashed sharding to distribute writes evenly
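The difference between the two choices can be sketched with a small simulation in plain JavaScript. Everything here is an assumption made for illustration: the four-shard layout, the fixed key space, and the `mix32` function, which is a generic 32-bit integer mixer standing in for MongoDB's own hash function.

```javascript
const SHARD_COUNT = 4;
const KEY_SPACE = 1000000;

// Ranged routing (simplified): each shard owns one contiguous quarter
// of the key space.
function rangedShard(key) {
  return Math.min(SHARD_COUNT - 1, Math.floor(key / (KEY_SPACE / SHARD_COUNT)));
}

// Stand-in hash: a generic 32-bit mixer, NOT the hash MongoDB uses.
function mix32(n) {
  let h = n | 0;
  h = Math.imul(h ^ (h >>> 16), 0x85ebca6b);
  h = Math.imul(h ^ (h >>> 13), 0xc2b2ae35);
  return (h ^ (h >>> 16)) >>> 0;
}

// Hashed routing (simplified): the shard is chosen from the hash value.
function hashedShard(key) {
  return mix32(key) % SHARD_COUNT;
}

// Counts how many writes each shard receives under a routing function.
function countWrites(route, keys) {
  const counts = new Array(SHARD_COUNT).fill(0);
  for (const key of keys) counts[route(key)] += 1;
  return counts;
}

// 10,000 new documents with monotonically increasing keys.
const newKeys = Array.from({ length: 10000 }, (_, i) => 900000 + i);
const rangedCounts = countWrites(rangedShard, newKeys);
const hashedCounts = countWrites(hashedShard, newKeys);

console.log("ranged:", rangedCounts); // every write lands on the last shard
console.log("hashed:", hashedCounts); // writes are spread across all shards
```

In a real cluster the balancer moves chunks between shards, but it does not remove the insert hotspot: the chunk with the highest range keeps receiving every new document.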
Query Isolation Principle
The shard key should support common query patterns. Ideally, queries should only target the necessary shards (query isolation).
// Query example
db.orders.find({ customerId: "C12345" })
// If customerId is the shard key or its leading prefix, MongoDB can route the query to only the shards that own matching chunks
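The routing rule can be approximated in a few lines of plain JavaScript. This is a deliberately simplified model, not how mongos is implemented: the helpers `matchedKeyPrefix` and `routing` are hypothetical, only top-level query fields are inspected, and operator-specific behavior (for example, hashed keys only being targetable by equality) is ignored.

```javascript
// Returns the leading shard key fields that the query constrains.
// Relies on JavaScript preserving insertion order for these keys.
function matchedKeyPrefix(shardKey, query) {
  const prefix = [];
  for (const field of Object.keys(shardKey)) {
    if (!(field in query)) break; // a gap ends the usable prefix
    prefix.push(field);
  }
  return prefix;
}

// Simplified classification: any usable prefix means the query can be
// targeted; none means it must be sent to every shard.
function routing(shardKey, query) {
  return matchedKeyPrefix(shardKey, query).length > 0 ? "targeted" : "broadcast";
}

const shardKey = { customerId: 1, orderDate: 1 };

console.log(routing(shardKey, { customerId: "C12345" })); // targeted
console.log(routing(shardKey, { orderDate: { $gt: new Date("2023-01-01") } })); // broadcast
console.log(matchedKeyPrefix(shardKey, { customerId: "C12345", orderDate: new Date("2023-06-01") }));
// [ 'customerId', 'orderDate' ]
```

On a live cluster, the explain output of a query shows which shards were actually contacted, which is the authoritative check.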
Compound Shard Key Strategy
When a single field cannot meet all requirements, a compound shard key can be used. This typically combines a high-cardinality field with a field that provides query isolation.
// Example of a compound shard key
{
customerId: 1, // Provides query isolation
orderId: 1 // Provides high cardinality
}
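One way to see what the second field buys is to look at the largest group of documents that share a single shard key value, since such a group can never be split across chunks. The sketch below uses plain JavaScript with a hypothetical helper, `largestUnsplittableGroup`, and made-up data containing one very large customer.

```javascript
// Size of the largest set of documents sharing one shard key value.
// Documents in such a set always stay together in a single chunk.
function largestUnsplittableGroup(docs, fields) {
  const counts = new Map();
  for (const doc of docs) {
    const keyValue = JSON.stringify(fields.map((f) => doc[f]));
    counts.set(keyValue, (counts.get(keyValue) || 0) + 1);
  }
  return Math.max(...counts.values());
}

// Hypothetical data: one heavy customer with 5,000 orders and
// 100 small customers with 10 orders each.
const orders = [];
for (let i = 0; i < 5000; i++) {
  orders.push({ customerId: "C-big", orderId: `B${i}` });
}
for (let c = 0; c < 100; c++) {
  for (let i = 0; i < 10; i++) {
    orders.push({ customerId: `C${c}`, orderId: `S${c}-${i}` });
  }
}

// With customerId alone, the heavy customer forms one 5,000-document block.
console.log(largestUnsplittableGroup(orders, ["customerId"])); // 5000
// With the compound key, every document has its own key value.
console.log(largestUnsplittableGroup(orders, ["customerId", "orderId"])); // 1
```

Queries that filter on customerId alone can still be targeted with the compound key, because customerId remains the leading prefix.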
Hashed Sharding vs. Range Sharding
Hashed Sharding
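Distributes data by the hash of the field value. It spreads writes evenly, even for monotonically increasing fields, but adjacent values end up on different shards, so range queries on the hashed field cannot be targeted and are broadcast to all shards.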
// Example of hashed sharding
db.adminCommand({
  shardCollection: "logs.entries",
  key: { _id: "hashed" }
})
Range Sharding
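Distributes data based on value ranges. Documents with nearby key values live in the same chunk, so range queries on the shard key prefix touch only a few shards, but skewed or monotonically increasing values may lead to uneven data distribution.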
// Example of range sharding
db.adminCommand({
  shardCollection: "products.inventory",
  key: { category: 1, price: 1 }
})
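The trade-off between the two strategies shows up when a range query is issued. The following plain JavaScript sketch counts how many shards hold matching values for a price range under each strategy. The four-shard layout, the price ceiling, and the `mix32` stand-in hash are all assumptions for illustration; MongoDB uses its own hash function and chunk boundaries.

```javascript
const SHARD_COUNT = 4;
const MAX_PRICE = 1000;

// Ranged routing (simplified): each shard owns a contiguous price band.
function rangedShard(price) {
  return Math.min(SHARD_COUNT - 1, Math.floor(price / (MAX_PRICE / SHARD_COUNT)));
}

// Stand-in hash: a generic 32-bit mixer, NOT the hash MongoDB uses.
function mix32(n) {
  let h = n | 0;
  h = Math.imul(h ^ (h >>> 16), 0x85ebca6b);
  h = Math.imul(h ^ (h >>> 13), 0xc2b2ae35);
  return (h ^ (h >>> 16)) >>> 0;
}

// Hashed routing (simplified): neighbouring prices scatter across shards.
function hashedShard(price) {
  return mix32(price) % SHARD_COUNT;
}

// Lists the shards that hold at least one value in [low, high].
function shardsTouched(route, low, high) {
  const shards = new Set();
  for (let price = low; price <= high; price++) shards.add(route(price));
  return [...shards].sort((a, b) => a - b);
}

console.log(shardsTouched(rangedShard, 100, 200)); // [ 0 ]
console.log(shardsTouched(hashedShard, 100, 200)); // matching values sit on several shards
```

With ranged sharding the whole interval sits in one price band, while hashing scatters the same interval, which is why the router has to ask every shard.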
Relationship Between Shard Keys and Indexes
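MongoDB requires an index that starts with the shard key. When you shard an empty or nonexistent collection, shardCollection creates that index automatically; for a collection that already contains data, you must create it beforehand. Additional compound indexes can still be created to optimize queries. Queries on sharded collections that include the shard key are more efficient because they can be routed to specific shards.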
// Index automatically created by the shard key
db.products.getIndexes()
// Output will show the shard key index
// Additional indexes can be added to optimize specific queries
db.products.createIndex({ category: 1, rating: -1 })
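Whether an existing index can back a shard key comes down to a prefix check: the index must either be the shard key itself or a compound index that starts with it. The plain JavaScript sketch below models that rule. The helper `indexSupportsShardKey` is hypothetical, and the model is simplified: it compares field names, field order, and whether a field is hashed, and ignores other index properties.

```javascript
// Returns true when the shard key fields form a leading prefix of the
// index fields. Simplified model, not a MongoDB API.
function indexSupportsShardKey(indexKey, shardKey) {
  const indexFields = Object.entries(indexKey);
  const shardFields = Object.entries(shardKey);
  if (shardFields.length > indexFields.length) return false;
  return shardFields.every(([field, kind], i) => {
    const [indexField, indexKind] = indexFields[i];
    const sameHashing = (kind === "hashed") === (indexKind === "hashed");
    return indexField === field && sameHashing;
  });
}

const inventoryShardKey = { category: 1, price: 1 };

// The shard key is a prefix of this index, so it qualifies.
console.log(indexSupportsShardKey({ category: 1, price: 1, rating: -1 }, inventoryShardKey)); // true
// Useful for queries, but it cannot back the shard key.
console.log(indexSupportsShardKey({ category: 1, rating: -1 }, inventoryShardKey)); // false
// Same fields in a different order do not qualify either.
console.log(indexSupportsShardKey({ price: 1, category: 1 }, inventoryShardKey)); // false
```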
Real-World Case Studies
E-commerce Order System
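An orders collection might choose {customerId: 1, orderDate: 1} as the shard key (ranged shard key fields must be ascending):
- customerId ensures orders from the same user are in the same or adjacent chunks
- orderDate orders each customer's documents by time and lets a large customer's data be split across chunks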
// Order collection sharding configuration
db.adminCommand({
  shardCollection: "ecommerce.orders",
  key: { customerId: 1, orderDate: 1 }
})
IoT Device Data
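Sensor data might use {deviceId: "hashed", timestamp: 1} (compound hashed shard keys require MongoDB 4.4 or later):
- deviceId hashing ensures even write distribution across devices
- timestamp supports time-range queries within a single device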
// IoT data sharding configuration (requires MongoDB 4.4+)
db.adminCommand({
  shardCollection: "iot.sensorData",
  key: { deviceId: "hashed", timestamp: 1 }
})
Common Pitfalls in Shard Key Selection
Monotonically Increasing Fields
Using fields like auto-incrementing IDs or timestamps alone as shard keys can cause all new writes to go to the same shard.
// Problematic shard key
{ autoIncrementId: 1 } // Write hotspot issue
// Improved solution
{ autoIncrementId: "hashed" } // Uses hashed sharding
Inefficient Query Patterns
If common queries do not include the shard key prefix, it can lead to broadcast queries (querying all shards).
// Shard key is {customerId:1, orderDate:1}
db.orders.find({ orderDate: { $gt: ISODate("2023-01-01") } })
// This query will scan all shards because it lacks the customerId condition
Shard Key Adjustment Strategies
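The options for changing a shard key depend on the MongoDB version. MongoDB 4.4 and later can refine a shard key by appending suffix fields (refineCollectionShardKey), and MongoDB 5.0 and later can reshard a collection onto an entirely new key (reshardCollection). On older versions, the key can only be adjusted indirectly using the following steps:
- Create a new collection with the new shard key
- Copy the data from the original collection into the new collection
- Rename the collections, or point the application at the new collection (renaming a sharded collection requires MongoDB 5.0 or later)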
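// Example of shard key adjustment process ("mydb" is a placeholder database name)
// 1. Create the new, empty collection already sharded on the new key
db.adminCommand({
  shardCollection: "mydb.tempCollection",
  key: { newShardKey: 1 }
})
// 2. Copy the data; $merge can write to a sharded collection, $out cannot
db.originalCollection.aggregate([{
  $merge: { into: "tempCollection" }
}])
// 3. Swap the names (renaming a sharded collection requires MongoDB 5.0+;
//    on older versions, point the application at tempCollection instead)
db.originalCollection.renameCollection("oldCollection")
db.tempCollection.renameCollection("originalCollection")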
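On MongoDB 4.4 and later, the copy-and-rename procedure can often be avoided with the built-in refineCollectionShardKey and reshardCollection admin commands. The sketch below only builds the command documents in plain JavaScript so that they can be inspected before use. The builder functions and the `ecommerce.orders` key fields are hypothetical; the command names and their `key` field are the real ones.

```javascript
// Builds a refineCollectionShardKey command (MongoDB 4.4+).
// The refined key must keep the current key as its prefix, so the new
// fields are appended after the existing ones.
function buildRefineCommand(namespace, currentKey, suffixFields) {
  return {
    refineCollectionShardKey: namespace,
    key: { ...currentKey, ...suffixFields },
  };
}

// Builds a reshardCollection command (MongoDB 5.0+), which moves the
// collection onto an entirely new shard key.
function buildReshardCommand(namespace, newKey) {
  return { reshardCollection: namespace, key: newKey };
}

const refine = buildRefineCommand("ecommerce.orders", { customerId: 1 }, { orderId: 1 });
const reshard = buildReshardCommand("ecommerce.orders", { orderId: "hashed" });

console.log(JSON.stringify(refine));
// {"refineCollectionShardKey":"ecommerce.orders","key":{"customerId":1,"orderId":1}}
console.log(JSON.stringify(reshard));
// {"reshardCollection":"ecommerce.orders","key":{"orderId":"hashed"}}

// In mongosh, the documents would be passed to db.adminCommand(refine)
// or db.adminCommand(reshard).
```

Refining requires an index that supports the refined key to exist beforehand, and resharding rewrites the whole collection, so both are best rehearsed on a staging cluster first.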
Monitoring and Optimization
Use MongoDB tools to monitor sharded cluster performance, focusing on:
- Data balance across shards
- Whether queries effectively utilize the shard key
- Presence of hotspot shards
// Check shard distribution
db.collection.getShardDistribution()
// View slow query logs (the profiler runs on mongod, so connect to each shard's primary rather than to mongos)
db.setProfilingLevel(1, { slowms: 50 })
db.system.profile.find().sort({ ts: -1 }).limit(10)
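Once per-shard document counts are known, spotting a hotspot is simple arithmetic. The plain JavaScript sketch below takes counts keyed by shard name and reports each shard's share and how far the largest shard is from an even split. The helper `summarizeBalance`, the shard names, and the numbers are hypothetical; in practice the counts would be read from the getShardDistribution() output.

```javascript
// Summarizes how evenly documents are spread across shards.
// skew = largest share divided by the ideal share (1 / number of shards);
// 1.0 means perfectly even, larger values mean a heavier hotspot.
function summarizeBalance(docCounts) {
  const entries = Object.entries(docCounts);
  const total = entries.reduce((sum, [, count]) => sum + count, 0);
  const shares = Object.fromEntries(
    entries.map(([shard, count]) => [shard, count / total])
  );
  const [hottest] = entries.reduce((a, b) => (b[1] > a[1] ? b : a));
  const idealShare = 1 / entries.length;
  return { total, shares, hottest, skew: shares[hottest] / idealShare };
}

// Hypothetical counts for a three-shard cluster.
const report = summarizeBalance({ shard0: 7000, shard1: 2000, shard2: 1000 });

console.log(report.hottest); // shard0
console.log(report.shares);  // shard0 holds 70% of the documents
console.log(report.skew.toFixed(2)); // 2.10
```

A skew well above 1 that persists after the balancer has run usually points back to the shard key (low cardinality or a monotonically increasing field) rather than to a balancing problem.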