Best practices for choosing a shard key

Author: Chuan Chen | Category: MongoDB

Basic Concepts of Shard Keys

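The shard key is a critical field in a MongoDB sharded cluster that determines how data is distributed. It directly affects data distribution across shards, query performance, and the scalability of the cluster. Before MongoDB 4.4, a shard key could not be changed once the collection was sharded. MongoDB 4.4 added refineCollectionShardKey for appending fields to an existing key, and MongoDB 5.0 added reshardCollection for replacing it, but resharding rewrites the whole collection and is expensive, so careful consideration is still required during selection.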

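// Example: Specifying a shard key when creating a sharded collection
// shardCollection is an administrative command, so it runs against the admin database
db.adminCommand({
  shardCollection: "orders.products",
  key: { customerId: 1, orderDate: 1 }  // Ranged shard key fields must be ascending (1)
})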

Core Principles for Choosing a Shard Key

Cardinality Principle

High-cardinality fields are better suited as shard keys. For example, a user ID is more suitable than a gender field because it has far more distinct values. Documents that share a shard key value cannot be split across chunks, so a low-cardinality field caps the number of chunks, which leads to uneven data distribution and "hotspot" shards.

// Bad example: Low-cardinality field
{ status: 1 }  // May only have values like "pending", "shipped", "delivered"

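// Better example: High-cardinality field
{ _id: 1 }  // ObjectId values are unique, but they increase over time, so a ranged key on _id still concentrates inserts (see Write Distribution Principle)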
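
To make the cardinality comparison concrete, here is a minimal sketch in plain JavaScript (it should run in Node.js or mongosh) that profiles candidate fields over a sample of documents. The sample data and the `profileField` helper are hypothetical and not part of MongoDB; in practice the sample would come from the real collection.

```javascript
// Hypothetical sample of order documents (illustrative data only).
const sampleOrders = [
  { userId: "u1", status: "pending" },
  { userId: "u2", status: "shipped" },
  { userId: "u3", status: "shipped" },
  { userId: "u4", status: "delivered" },
  { userId: "u5", status: "shipped" },
  { userId: "u6", status: "pending" },
];

// Count the distinct values of a field and the share of documents
// held by its most common value.
function profileField(docs, field) {
  const counts = new Map();
  for (const doc of docs) {
    const value = doc[field];
    counts.set(value, (counts.get(value) || 0) + 1);
  }
  const maxCount = Math.max(...counts.values());
  return {
    field,
    distinctValues: counts.size,
    topValueShare: maxCount / docs.length,
  };
}

console.log(profileField(sampleOrders, "status")); // 3 distinct values, top share 0.5
console.log(profileField(sampleOrders, "userId")); // 6 distinct values
```

A field where one value holds a large share of the documents is a poor candidate even if its distinct count looks acceptable, because all documents with that value must live in the same chunk.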

Write Distribution Principle

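The shard key should ensure that write operations are evenly distributed across all shards. Using a timestamp as a ranged shard key is generally not ideal: its values increase monotonically, so every new document falls into the chunk with the highest range, and all inserts land on the one shard that holds that chunk.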

// May cause write hotspots
{ createdAt: 1 }  // All new documents are written to the same shard range

// Better choice
{ _id: "hashed" }  // Uses hashed sharding to distribute writes evenly
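
The difference can be illustrated with a small simulation in plain JavaScript. This is only a sketch under stated assumptions: the chunk boundaries are made up, the hash is a 32-bit FNV-1a stand-in rather than the function MongoDB uses for hashed indexes, and shards are picked by modulo instead of by splitting the hash range into chunks.

```javascript
const SHARD_COUNT = 4;

// Ranged placement: chunk upper bounds derived from data that already exists.
// The last chunk is open-ended, like the MongoDB chunk that extends to MaxKey.
const upperBounds = [250, 500, 750, Infinity];
function rangedShard(key) {
  return upperBounds.findIndex((bound) => key < bound);
}

// Stand-in hash (32-bit FNV-1a). MongoDB's hashed indexes use a different function.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Simplification: choose the shard by hash modulo the shard count.
function hashedShard(key) {
  return fnv1a(String(key)) % SHARD_COUNT;
}

// Insert 1000 new documents whose keys keep increasing past the existing data.
function simulateWrites(placeFn) {
  const writesPerShard = new Array(SHARD_COUNT).fill(0);
  for (let key = 1000; key < 2000; key++) {
    writesPerShard[placeFn(key)] += 1;
  }
  return writesPerShard;
}

console.log("ranged:", simulateWrites(rangedShard)); // every write hits the last shard
console.log("hashed:", simulateWrites(hashedShard)); // writes spread across all shards
```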

Query Isolation Principle

The shard key should support common query patterns. Ideally, queries should only target the necessary shards (query isolation).

// Query example
db.orders.find({ customerId: "C12345" })

// If customerId is the shard key or a prefix of it, mongos can route the query to only the shards that hold matching data

Compound Shard Key Strategy

When a single field cannot meet all requirements, a compound shard key can be used. This typically combines a high-cardinality field with a field that provides query isolation.

// Example of a compound shard key
{
  customerId: 1,   // Provides query isolation
  orderId: 1      // Provides high cardinality
}
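
One way to see why the second field helps is to count distinct shard key values, since a chunk can only be split between distinct values. The sketch below is plain JavaScript over hypothetical data; `distinctKeyCount` is an illustrative helper, not a MongoDB API.

```javascript
// Hypothetical orders: customer "C1" is much larger than the others.
const orders = [
  { customerId: "C1", orderId: 101 },
  { customerId: "C1", orderId: 102 },
  { customerId: "C1", orderId: 103 },
  { customerId: "C1", orderId: 104 },
  { customerId: "C2", orderId: 201 },
];

// Count distinct shard key tuples. This number bounds how finely
// the data can be divided into chunks.
function distinctKeyCount(docs, keyFields) {
  const seen = new Set();
  for (const doc of docs) {
    seen.add(JSON.stringify(keyFields.map((field) => doc[field])));
  }
  return seen.size;
}

console.log(distinctKeyCount(orders, ["customerId"]));            // 2
console.log(distinctKeyCount(orders, ["customerId", "orderId"])); // 5
```

With `customerId` alone, all of C1's orders share one key value and must stay in a single chunk. Adding `orderId` gives every order its own key value, so a large customer can be split while queries on `customerId` still target a contiguous key range.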

Hashed Sharding vs. Range Sharding

Hashed Sharding

Distributes data by hashing the field value. Suitable for randomizing write distribution, but range queries on the hashed field cannot be targeted and are broadcast to all shards.

// Example of hashed sharding (run against the admin database)
db.adminCommand({
  shardCollection: "logs.entries",
  key: { _id: "hashed" }
})

Range Sharding

Distributes data based on value ranges. Supports efficient range queries but may lead to uneven data distribution.

// Example of range sharding (run against the admin database)
db.adminCommand({
  shardCollection: "products.inventory",
  key: { category: 1, price: 1 }
})
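
The following plain JavaScript sketch shows why range sharding keeps range queries local. The chunk table, shard names, and helper functions are hypothetical; a real cluster stores chunk metadata on the config servers, and mongos performs this kind of lookup internally.

```javascript
// Hypothetical chunk table for the shard key { category: 1, price: 1 }.
// Each chunk covers [min, max) in key order; null stands in for MinKey/MaxKey.
const chunks = [
  { shard: "shardA", min: null, max: ["books", 20] },
  { shard: "shardB", min: ["books", 20], max: ["garden", 0] },
  { shard: "shardC", min: ["garden", 0], max: null },
];

// Compare two compound keys field by field, left to right.
function compareKeys(a, b) {
  for (let i = 0; i < a.length; i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return 0;
}

// Return the shards whose chunks overlap the inclusive key range [low, high].
function shardsForRange(low, high) {
  return chunks
    .filter((chunk) =>
      (chunk.min === null || compareKeys(chunk.min, high) <= 0) &&
      (chunk.max === null || compareKeys(chunk.max, low) > 0))
    .map((chunk) => chunk.shard);
}

// Books priced from 5 to 15 fall inside a single chunk.
console.log(shardsForRange(["books", 5], ["books", 15])); // [ 'shardA' ]
// Books priced from 5 to 50 span two neighboring chunks.
console.log(shardsForRange(["books", 5], ["books", 50])); // [ 'shardA', 'shardB' ]
```

Under hashed sharding, neighboring key values hash to unrelated positions, so the same range query would have to contact every shard.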

Relationship Between Shard Keys and Indexes

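MongoDB requires an index that starts with the shard key. When an empty collection is sharded, MongoDB creates that index automatically if it does not exist; a collection that already contains data needs the index created beforehand. Additional compound indexes can be created to optimize other queries. Queries on sharded collections that include the shard key are more efficient because mongos can route them to specific shards.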

// Index automatically created by the shard key
db.products.getIndexes()
// Output will show the shard key index

// Additional indexes can be added to optimize specific queries
db.products.createIndex({ category: 1, rating: -1 })
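
As a rough illustration of the prefix rule, the sketch below checks whether an index key pattern starts with a given shard key. `indexSupportsShardKey` is a hypothetical helper written for this article. It assumes each field's direction or type must match exactly, which is the conservative reading, and it ignores other index properties such as partial or sparse options.

```javascript
// Return true when shardKey is a prefix of indexKey: same fields, same order,
// and the same direction or type ("hashed") at each position.
function indexSupportsShardKey(indexKey, shardKey) {
  const indexEntries = Object.entries(indexKey);
  const shardEntries = Object.entries(shardKey);
  if (shardEntries.length > indexEntries.length) return false;
  return shardEntries.every(
    ([field, spec], i) =>
      indexEntries[i][0] === field && indexEntries[i][1] === spec
  );
}

const shardKey = { category: 1, price: 1 };

// Starts with the shard key, so it could back the sharded collection.
console.log(indexSupportsShardKey({ category: 1, price: 1, rating: -1 }, shardKey)); // true
// Useful for queries, but it does not start with the full shard key.
console.log(indexSupportsShardKey({ category: 1, rating: -1 }, shardKey)); // false
```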

Real-World Case Studies

E-commerce Order System

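An orders collection might choose {customerId: 1, orderDate: 1} as the shard key:

customerId keeps orders from the same customer in one contiguous key range, so they sit in the same chunk or in neighboring chunks
orderDate orders each customer's documents by time and lets a large customer be split across chunks

// Orders collection sharding configuration (run against the admin database)
db.adminCommand({
  shardCollection: "ecommerce.orders",
  key: { customerId: 1, orderDate: 1 }  // Ranged shard key fields must be ascending (1)
})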

IoT Device Data

Sensor data might use {deviceId: "hashed", timestamp: 1} (compound shard keys with a hashed field require MongoDB 4.4 or later):

deviceId hashing ensures even write distribution
timestamp supports time-range queries within a single device

// IoT data sharding configuration (run against the admin database)
db.adminCommand({
  shardCollection: "iot.sensorData",
  key: { deviceId: "hashed", timestamp: 1 }
})

Common Pitfalls in Shard Key Selection

Monotonically Increasing Fields

Using fields like auto-incrementing IDs or timestamps alone as shard keys can cause all new writes to go to the same shard.

// Problematic shard key
{ autoIncrementId: 1 }  // Write hotspot issue

// Improved solution
{ autoIncrementId: "hashed" }  // Uses hashed sharding

Inefficient Query Patterns

If common queries do not include the shard key prefix, it can lead to broadcast queries (querying all shards).

// Shard key is {customerId:1, orderDate:1}
db.orders.find({ orderDate: { $gt: ISODate("2023-01-01") } })
// This query will scan all shards because it lacks the customerId condition
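
The prefix rule can be sketched as a small classifier in plain JavaScript. This is a deliberate simplification with hypothetical helper names: it only inspects top-level filter fields and ignores operators such as `$or`, whereas the real routing logic in mongos is more involved.

```javascript
// Collect the leading shard key fields that the filter constrains.
function matchedPrefix(shardKey, filter) {
  const prefix = [];
  for (const field of Object.keys(shardKey)) {
    if (!(field in filter)) break;
    prefix.push(field);
  }
  return prefix;
}

// A query can be targeted only if it constrains at least the first key field.
function routingFor(shardKey, filter) {
  return matchedPrefix(shardKey, filter).length > 0 ? "targeted" : "broadcast";
}

const shardKey = { customerId: 1, orderDate: 1 };

console.log(routingFor(shardKey, { customerId: "C12345" }));                          // targeted
console.log(routingFor(shardKey, { orderDate: { $gt: new Date("2023-01-01") } }));    // broadcast
console.log(routingFor(shardKey, { customerId: "C12345", orderDate: { $gt: new Date("2023-01-01") } })); // targeted
```

On a real cluster, `explain()` on the query reports which shards were contacted, which is the reliable way to confirm whether a query is targeted.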

Shard Key Adjustment Strategies

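MongoDB 5.0 and later can change a collection's shard key in place with the reshardCollection command, and MongoDB 4.4 and later can append fields to an existing key with refineCollectionShardKey. On older versions the shard key cannot be modified directly, but it can be adjusted indirectly using the following methods:

Copy data from the original collection to a new collection
Shard the new collection on the new shard key
Rename the collections, or point the application at the new collection

// Example of shard key adjustment process
// 1. Copy the data ($out cannot write to a collection that is already sharded)
db.originalCollection.aggregate([{
  $out: "tempCollection"
}])

// 2. A non-empty collection needs an index on the shard key before it can be sharded
db.tempCollection.createIndex({ newShardKey: 1 })
db.adminCommand({
  shardCollection: "db.tempCollection",
  key: { newShardKey: 1 }
})

// 3. Swap the names. Renaming a sharded collection requires MongoDB 5.0 or later;
//    on older versions, point the application at tempCollection instead.
db.originalCollection.renameCollection("oldCollection")
db.tempCollection.renameCollection("originalCollection")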

Monitoring and Optimization

Use MongoDB tools to monitor sharded cluster performance, focusing on:

Data balance across shards
Whether queries effectively utilize the shard key
Presence of hotspot shards

// Check shard distribution
db.collection.getShardDistribution()

// View slow operations (connect to each shard's mongod; the profiler cannot be enabled through mongos)
db.setProfilingLevel(1, { slowms: 50 })
db.system.profile.find().sort({ ts: -1 }).limit(10)
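
A simple balance check over per-shard counts might look like the sketch below. The shard names, document counts, the `findHotShards` helper, and the 1.5x tolerance are all assumptions made for illustration; the counts would normally be read from the output of getShardDistribution().

```javascript
// Hypothetical per-shard document counts (illustrative values only).
const docsPerShard = { shard0: 120000, shard1: 118000, shard2: 410000 };

// Flag shards that hold more than `tolerance` times the mean document count.
function findHotShards(counts, tolerance = 1.5) {
  const values = Object.values(counts);
  const mean = values.reduce((sum, n) => sum + n, 0) / values.length;
  return Object.keys(counts).filter((shard) => counts[shard] > mean * tolerance);
}

console.log(findHotShards(docsPerShard)); // [ 'shard2' ]
```

Document counts are only one signal. A shard can hold an average amount of data and still be a hotspot if most reads or writes land on it, so this kind of check complements the profiler output rather than replacing it.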
