Monitoring Metrics (CPU, Memory, Disk, Network)
Database performance monitoring is a critical focus for operations and development teams. As a popular NoSQL database, MongoDB's operational status directly impacts business stability. Monitoring core metrics such as CPU, memory, disk, and network makes it possible to identify potential issues early and optimize performance.
CPU Usage Monitoring
MongoDB's CPU usage reflects the workload of query processing, index building, and other operations. High CPU usage may lead to increased query latency.
Key Metrics:
- cpu_usage.user: Percentage of CPU time spent in user space
- cpu_usage.system: Percentage of CPU time spent in system space
- cpu_usage.nice: CPU usage by low-priority processes
- globalLock.activeClients: Number of active client connections
Example Code (Node.js to Fetch CPU Metrics):
const { exec } = require('child_process');

exec('mongostat --host=localhost --rowcount=1 --noheaders', (error, stdout) => {
  if (error) return console.error(`mongostat failed: ${error.message}`);
  // Column positions differ across mongostat versions; adjust the slice
  // to match the CPU columns in your installation's output.
  const [usr, sys] = stdout.trim().split(/\s+/).slice(1, 3);
  console.log(`User CPU: ${usr}%, System CPU: ${sys}%`);
});
Common Issue Scenarios:
- Prolonged usage above 80% may indicate the need for query optimization or additional indexes
- High system space usage may be caused by disk I/O waits
- Sudden CPU spikes are often related to complex aggregation queries
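To trace a CPU spike back to the responsible queries, the database profiler can record slow operations. A minimal mongo shell sketch (the 100 ms threshold is illustrative):
db.setProfilingLevel(1, 100)  // level 1 = record operations slower than 100 ms
db.system.profile.find().sort({ ts: -1 }).limit(5)  // most recent slow operations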
Memory Usage Analysis
MongoDB's memory usage patterns differ significantly from those of traditional databases: the legacy MMAPv1 engine relied on memory-mapped files, while the default WiredTiger engine manages its own internal cache.
Core Memory Metrics:
- mem.resident: Resident physical memory size (MB)
- mem.virtual: Virtual memory usage (MB)
- mem.mapped: Size of memory-mapped files (MB)
- wiredTiger.cache.bytes: WiredTiger cache usage
Memory Optimization Recommendations:
- The working set should be smaller than the configured WiredTiger cache size (storage.wiredTiger.engineConfig.cacheSizeGB)
- Monitor the page_faults metric to detect frequent page faults
- For datasets larger than 4GB, configure at least 1GB of WiredTiger cache
Example (Mongo Shell to Check Memory):
db.serverStatus().mem
// Sample Output:
{
"resident" : 1456,
"virtual" : 3254,
"mapped" : 1024,
"supported" : true
}
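To gauge how full the WiredTiger cache is, its current usage can be compared against the configured maximum (both fields appear in serverStatus when WiredTiger is the storage engine). A minimal mongo shell sketch:
// Compare current WiredTiger cache usage with the configured maximum
const cache = db.serverStatus().wiredTiger.cache;
const used = cache["bytes currently in the cache"];
const max = cache["maximum bytes configured"];
print(`cache used: ${(100 * used / max).toFixed(1)}% of ${(max / 1048576).toFixed(0)} MB`)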
Disk I/O Performance
Disk performance directly impacts write throughput and data persistence speed, particularly in write-intensive scenarios.
Key Disk Metrics:
- disk.io.wait: Percentage of time spent waiting for I/O
- backgroundFlush.average_ms: Average time taken for disk flushes (ms)
- wiredTiger.log.syncs: Number of log sync operations
- storage.free: Free disk space (GB)
Typical Troubleshooting:
- When disk.io.wait consistently exceeds 50%, consider:
  - Upgrading disks (replacing HDD with SSD)
  - Adjusting journal.commitIntervalMs
  - Checking for excessive random writes
- Use iostat for additional diagnostics:
iostat -xm 1  # View device-level I/O statistics
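Journal sync pressure can also be sampled directly from serverStatus; a rising sync rate together with high disk.io.wait points at the storage layer. A minimal mongo shell sketch (the one-second window is illustrative):
// Sample the WiredTiger log-sync counter twice to estimate syncs per second
const s1 = db.serverStatus().wiredTiger.log["log sync operations"];
sleep(1000);  // shell built-in, milliseconds
const s2 = db.serverStatus().wiredTiger.log["log sync operations"];
print(`log syncs/sec: ${s2 - s1}`)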
Network Traffic Monitoring
Network bottlenecks can cause replication delays and client timeouts, especially in sharded cluster environments.
Core Network Metrics:
- network.bytesIn: Inbound data volume (bytes)
- network.bytesOut: Outbound data volume (bytes)
- network.numRequests: Total number of requests received
- repl.network.getmores: Count of fetch (getmore) operations from secondary nodes
Network Optimization Example:
// Configure connection pool size (Node.js driver example)
const { MongoClient } = require('mongodb');
const url = 'mongodb://localhost:27017/?maxPoolSize=20&socketTimeoutMS=360000';
const client = new MongoClient(url);
// Monitor current connections
db.serverStatus().connections
// Sample Output:
{
"current" : 42,
"available" : 818,
"totalCreated" : 291
}
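Because network.bytesIn and network.bytesOut are cumulative counters, they only become meaningful as deltas over time. A minimal Node.js sketch (the one-second interval and localhost URL are illustrative):
const { MongoClient } = require('mongodb');

// Sample the cumulative network counters twice to estimate bytes per second
async function sampleNetwork(url) {
  const client = new MongoClient(url);
  await client.connect();
  const admin = client.db('admin');
  const a = (await admin.command({ serverStatus: 1 })).network;
  await new Promise((resolve) => setTimeout(resolve, 1000));
  const b = (await admin.command({ serverStatus: 1 })).network;
  console.log(`in: ${b.bytesIn - a.bytesIn} B/s, out: ${b.bytesOut - a.bytesOut} B/s`);
  await client.close();
}

sampleNetwork('mongodb://localhost:27017');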
Metric Aggregation and Visualization
Combine Prometheus and Grafana for professional-grade monitoring:
Prometheus Configuration Example:
scrape_configs:
  - job_name: 'mongodb'
    static_configs:
      - targets: ['mongodb-exporter:9216']
    metrics_path: /metrics
Key Grafana Dashboard Charts:
- Combined CPU/Memory/Disk trend graph
- QPS statistics by operation type (see the shell sketch after this list)
- Replica set member latency heatmap
- Connection pool utilization dashboard
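For a quick look at QPS by operation type without a dashboard, the opcounters section of serverStatus can be sampled twice. A minimal mongo shell sketch:
// Estimate QPS per operation type from two opcounters samples
const a = db.serverStatus().opcounters;
sleep(1000);
const b = db.serverStatus().opcounters;
["insert", "query", "update", "delete", "getmore", "command"].forEach(
  (op) => print(`${op}: ${b[op] - a[op]}/s`)
);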
Alert Rule Configuration
Set reasonable threshold-based alerts according to actual business needs:
Typical Alert Rules:
- CPU > 90% for 5 consecutive minutes (see the Prometheus rule sketch after this list)
- Memory working set exceeds 90% of available cache
- Disk space remaining less than 20%
- Primary-secondary replication delay > 30 seconds
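As a sketch, the first rule above could be expressed as a Prometheus alerting rule. This assumes host CPU metrics are collected by node_exporter (metric node_cpu_seconds_total):
groups:
  - name: mongodb-alerts
    rules:
      - alert: HighCpuUsage
        expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 90% for 5 minutes on {{ $labels.instance }}"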
MongoDB Atlas Alert Example:
{
"eventTypeName": "OUTSIDE_METRIC_THRESHOLD",
"metricName": "ASSERT_REGULAR",
"operator": "GREATER_THAN",
"threshold": 10,
"units": "RAW",
"notifications": [
{
"typeName": "SMS",
"intervalMin": 5
}
]
}
Performance Benchmarking
Establishing performance baselines helps identify abnormal changes:
Using sysbench for Testing (note: stock sysbench ships without a MongoDB driver, so the flags below assume a MongoDB-enabled build such as Percona's sysbench-mongodb):
sysbench --test=oltp --mongodb-db=test \
--mongodb-collection=bench \
--num-threads=8 --max-requests=100000 \
run
Key Benchmark Metrics to Record:
- 95th percentile query latency (see the helper sketch after this list)
- Transactions per second (TPS)
- Error rate
- Resource utilization curves
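Percentile latency is easy to mis-compute; here is a small JavaScript helper using the nearest-rank method (the sample values are illustrative):
// 95th-percentile latency from sampled values (nearest-rank method)
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
console.log(percentile([12, 8, 31, 9, 15, 22, 7, 45, 11, 18], 95)); // prints 45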
Real-World Troubleshooting Case
Scenario: An e-commerce platform experiences slow MongoDB responses during a major sales event.
Diagnosis Steps:
1. Discover globalLock.currentQueue.total consistently > 50
2. Check db.currentOp() to identify slow queries:
db.currentOp({ "active": true, "secs_running": {"$gt": 3} })
3. Use explain() to analyze the slow queries and find a missing order-status index
4. Performance improves 6x after adding the index:
db.orders.createIndex({status: 1, createTime: -1})
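To confirm the new index is actually used, explain() should report an IXSCAN stage in the winning plan rather than a COLLSCAN (the status value is illustrative):
db.orders.find({ status: "paid" }).sort({ createTime: -1 }).explain("executionStats")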
Special Considerations for Containerized Environments
Monitoring differences when deploying in Kubernetes:
- Distinguish between container-internal and external metrics
- Monitor the impact of resource limits:
resources:
  limits:
    cpu: "2"
    memory: "4Gi"
  requests:
    cpu: "1"
    memory: "2Gi"
- Use the sidecar pattern for metric collection (e.g., a mongodb-exporter container); for ad-hoc access to the exporter:
kubectl port-forward pod/mongodb-0 9216:9216
Historical Data Analysis
Use $out to aggregate historical monitoring data:
db.metrics.aggregate([
{
$match: {timestamp: {$gte: ISODate("2023-01-01")}}
},
{
$group: {
_id: {$dateToString: {format: "%Y-%m-%d", date: "$timestamp"}},
avgCPU: {$avg: "$cpu.usage"},
peakConn: {$max: "$connections.current"}
}
},
{$out: "daily_stats"}
])
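The $out stage materializes the summary into the daily_stats collection, which can then be queried directly:
// Read back the daily summaries in date order
db.daily_stats.find().sort({ _id: 1 })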