Monitoring Metrics (CPU, Memory, Disk, Network)
Database performance monitoring is a critical focus for operations and development teams. As a popular NoSQL database, MongoDB's operational status directly impacts business stability. Monitoring core metrics such as CPU, memory, disk, and network makes it possible to identify potential issues early and optimize performance.
CPU Usage Monitoring
MongoDB's CPU usage reflects the workload of query processing, index building, and other operations. High CPU usage may lead to increased query latency.
Key Metrics:
- cpu_usage.user: Percentage of CPU time spent in user space
- cpu_usage.system: Percentage of CPU time spent in system space
- cpu_usage.nice: CPU usage by low-priority processes
- globalLock.activeClients: Number of active client connections
Example Code (Node.js to Fetch CPU Metrics):
const { exec } = require('child_process');

exec('mongostat --host=localhost --rowcount=1 --noheaders', (error, stdout) => {
  if (error) return console.error(`mongostat failed: ${error.message}`);
  // Column positions differ across mongostat versions; adjust the slice
  // to match the CPU columns in your installation's output.
  const [usr, sys] = stdout.trim().split(/\s+/).slice(1, 3);
  console.log(`User CPU: ${usr}%, System CPU: ${sys}%`);
});
Common Issue Scenarios:
- Prolonged usage above 80% may indicate the need for query optimization or additional indexes
- High system space usage may be caused by disk I/O waits
- Sudden CPU spikes are often related to complex aggregation queries
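To trace a CPU spike back to the responsible queries, the database profiler can record slow operations. A minimal mongo shell sketch (the 100 ms threshold is illustrative):
db.setProfilingLevel(1, 100)  // level 1 = record operations slower than 100 ms
db.system.profile.find().sort({ ts: -1 }).limit(5)  // most recent slow operations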
Memory Usage Analysis
MongoDB's memory usage patterns differ significantly from those of traditional databases: the legacy MMAPv1 engine relied on memory-mapped files, while the default WiredTiger engine manages its own internal cache.
Core Memory Metrics:
- mem.resident: Resident physical memory size (MB)
- mem.virtual: Virtual memory usage (MB)
- mem.mapped: Size of memory-mapped files (MB)
- wiredTiger.cache.bytes: WiredTiger cache usage
Memory Optimization Recommendations:
- The working set should be smaller than the configured WiredTiger cache size (storage.wiredTiger.engineConfig.cacheSizeGB)
- Monitor the page_faults metric to detect frequent page faults
- For datasets larger than 4GB, configure at least 1GB of WiredTiger cache
Example (Mongo Shell to Check Memory):
db.serverStatus().mem
// Sample Output:
{
"resident" : 1456,
"virtual" : 3254,
"mapped" : 1024,
"supported" : true
}
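To gauge how full the WiredTiger cache is, its current usage can be compared against the configured maximum (both fields appear in serverStatus when WiredTiger is the storage engine). A minimal mongo shell sketch:
// Compare current WiredTiger cache usage with the configured maximum
const cache = db.serverStatus().wiredTiger.cache;
const used = cache["bytes currently in the cache"];
const max = cache["maximum bytes configured"];
print(`cache used: ${(100 * used / max).toFixed(1)}% of ${(max / 1048576).toFixed(0)} MB`)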
Disk I/O Performance
Disk performance directly impacts write throughput and data persistence speed, particularly in write-intensive scenarios.
Key Disk Metrics:
- disk.io.wait: Percentage of time spent waiting for I/O
- backgroundFlush.average_ms: Average time taken for disk flushes (ms)
- wiredTiger.log.syncs: Number of log sync operations
- storage.free: Free disk space (GB)
Typical Troubleshooting:
- When disk.io.wait consistently exceeds 50%, consider:
  - Upgrading disks (replacing HDD with SSD)
  - Adjusting journal.commitIntervalMs
  - Checking for excessive random writes
- Use iostat for additional diagnostics:
iostat -xm 1  # View device-level I/O statistics
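Journal sync pressure can also be sampled directly from serverStatus; a rising sync rate together with high disk.io.wait points at the storage layer. A minimal mongo shell sketch (the one-second window is illustrative):
// Sample the WiredTiger log-sync counter twice to estimate syncs per second
const s1 = db.serverStatus().wiredTiger.log["log sync operations"];
sleep(1000);  // shell built-in, milliseconds
const s2 = db.serverStatus().wiredTiger.log["log sync operations"];
print(`log syncs/sec: ${s2 - s1}`)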
Network Traffic Monitoring
Network bottlenecks can cause replication delays and client timeouts, especially in sharded cluster environments.
Core Network Metrics:
- network.bytesIn: Inbound data volume (bytes)
- network.bytesOut: Outbound data volume (bytes)
- network.numRequests: Total number of requests received
- repl.network.getmores: Count of fetch (getmore) operations from secondary nodes
Network Optimization Example:
// Configure connection pool size (Node.js driver example)
const { MongoClient } = require('mongodb');
const url = 'mongodb://localhost:27017/?maxPoolSize=20&socketTimeoutMS=360000';
const client = new MongoClient(url);
// Monitor current connections
db.serverStatus().connections
// Sample Output:
{
"current" : 42,
"available" : 818,
"totalCreated" : 291
}
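Because network.bytesIn and network.bytesOut are cumulative counters, they only become meaningful as deltas over time. A minimal Node.js sketch (the one-second interval and localhost URL are illustrative):
const { MongoClient } = require('mongodb');

// Sample the cumulative network counters twice to estimate bytes per second
async function sampleNetwork(url) {
  const client = new MongoClient(url);
  await client.connect();
  const admin = client.db('admin');
  const a = (await admin.command({ serverStatus: 1 })).network;
  await new Promise((resolve) => setTimeout(resolve, 1000));
  const b = (await admin.command({ serverStatus: 1 })).network;
  console.log(`in: ${b.bytesIn - a.bytesIn} B/s, out: ${b.bytesOut - a.bytesOut} B/s`);
  await client.close();
}

sampleNetwork('mongodb://localhost:27017');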
Metric Aggregation and Visualization
Combine Prometheus and Grafana for professional-grade monitoring:
Prometheus Configuration Example:
scrape_configs:
  - job_name: 'mongodb'
    static_configs:
      - targets: ['mongodb-exporter:9216']
    metrics_path: /metrics
Key Grafana Dashboard Charts:
- Combined CPU/Memory/Disk trend graph
- QPS statistics by operation type (see the shell sketch after this list)
- Replica set member latency heatmap
- Connection pool utilization dashboard
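For a quick look at QPS by operation type without a dashboard, the opcounters section of serverStatus can be sampled twice. A minimal mongo shell sketch:
// Estimate QPS per operation type from two opcounters samples
const a = db.serverStatus().opcounters;
sleep(1000);
const b = db.serverStatus().opcounters;
["insert", "query", "update", "delete", "getmore", "command"].forEach(
  (op) => print(`${op}: ${b[op] - a[op]}/s`)
);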
Alert Rule Configuration
Set reasonable threshold-based alerts according to actual business needs:
Typical Alert Rules:
- CPU > 90% for 5 consecutive minutes (see the Prometheus rule sketch after this list)
- Memory working set exceeds 90% of available cache
- Disk space remaining less than 20%
- Primary-secondary replication delay > 30 seconds
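As a sketch, the first rule above could be expressed as a Prometheus alerting rule. This assumes host CPU metrics are collected by node_exporter (metric node_cpu_seconds_total):
groups:
  - name: mongodb-alerts
    rules:
      - alert: HighCpuUsage
        expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 90% for 5 minutes on {{ $labels.instance }}"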
MongoDB Atlas Alert Example:
{
"eventTypeName": "OUTSIDE_METRIC_THRESHOLD",
"metricName": "ASSERT_REGULAR",
"operator": "GREATER_THAN",
"threshold": 10,
"units": "RAW",
"notifications": [
{
"typeName": "SMS",
"intervalMin": 5
}
]
}
Performance Benchmarking
Establishing performance baselines helps identify abnormal changes:
Using sysbench for Testing (note: stock sysbench ships without a MongoDB driver, so the flags below assume a MongoDB-enabled build such as Percona's sysbench-mongodb):
sysbench --test=oltp --mongodb-db=test \
--mongodb-collection=bench \
--num-threads=8 --max-requests=100000 \
run
Key Benchmark Metrics to Record:
- 95th percentile query latency (see the helper sketch after this list)
- Transactions per second (TPS)
- Error rate
- Resource utilization curves
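Percentile latency is easy to mis-compute; here is a small JavaScript helper using the nearest-rank method (the sample values are illustrative):
// 95th-percentile latency from sampled values (nearest-rank method)
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
console.log(percentile([12, 8, 31, 9, 15, 22, 7, 45, 11, 18], 95)); // prints 45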
Real-World Troubleshooting Case
Scenario: An e-commerce platform experiences slow MongoDB responses during a major sales event.
Diagnosis Steps:
1. Discover globalLock.currentQueue.total consistently > 50
2. Check db.currentOp() to identify slow queries:
db.currentOp({ "active": true, "secs_running": {"$gt": 3} })
3. Use explain() to analyze the slow queries and find a missing order-status index
4. Performance improves 6x after adding the index:
db.orders.createIndex({status: 1, createTime: -1})
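To confirm the new index is actually used, explain() should report an IXSCAN stage in the winning plan rather than a COLLSCAN (the status value is illustrative):
db.orders.find({ status: "paid" }).sort({ createTime: -1 }).explain("executionStats")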
Special Considerations for Containerized Environments
Monitoring differences when deploying in Kubernetes:
- Distinguish between container-internal and external metrics
- Monitor the impact of resource limits:
resources:
  limits:
    cpu: "2"
    memory: "4Gi"
  requests:
    cpu: "1"
    memory: "2Gi"
- Use the sidecar pattern for metric collection (e.g., a mongodb-exporter container); for ad-hoc access to the exporter:
kubectl port-forward pod/mongodb-0 9216:9216
Historical Data Analysis
Use $out to aggregate historical monitoring data:
db.metrics.aggregate([
{
$match: {timestamp: {$gte: ISODate("2023-01-01")}}
},
{
$group: {
_id: {$dateToString: {format: "%Y-%m-%d", date: "$timestamp"}},
avgCPU: {$avg: "$cpu.usage"},
peakConn: {$max: "$connections.current"}
}
},
{$out: "daily_stats"}
])
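The $out stage materializes the summary into the daily_stats collection, which can then be queried directly:
// Read back the daily summaries in date order
db.daily_stats.find().sort({ _id: 1 })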