Design of a Custom Performance Monitoring System
Core Objectives of Performance Monitoring Systems
The core objectives of a performance monitoring system are to collect, analyze, and display application performance metrics in real time, helping developers quickly identify performance bottlenecks. A good custom monitoring system should be highly real-time, minimally intrusive (low overhead on the monitored application), scalable, and easy to use. Building a custom system also makes it possible to design monitoring metrics tailored to specific business scenarios, rather than relying on the fixed metrics of generic solutions.
System Architecture Design
Data Collection Layer
The data collection layer is responsible for gathering performance data from both the client and server sides. Frontend performance data is typically collected using the Performance API:
// Example of frontend performance data collection
// Note: performance.timing (Navigation Timing Level 1) still works but is
// deprecated; new code should prefer performance.getEntriesByType('navigation').
const getPerformanceMetrics = () => {
  const timing = window.performance.timing;
  const metrics = {
    dns: timing.domainLookupEnd - timing.domainLookupStart,
    tcp: timing.connectEnd - timing.connectStart,
    ttfb: timing.responseStart - timing.requestStart,
    download: timing.responseEnd - timing.responseStart,
    domReady: timing.domComplete - timing.domLoading,       // DOM processing time
    loadEvent: timing.loadEventEnd - timing.loadEventStart, // load handler time
    total: timing.loadEventEnd - timing.navigationStart     // full page load
  };
  return metrics;
};
// Using MutationObserver to monitor DOM change activity
let lastBatchTime = performance.now();
const observer = new MutationObserver((mutations) => {
  const now = performance.now();
  const perfData = {
    mutationCount: mutations.length,
    msSinceLastBatch: now - lastBatchTime
  };
  lastBatchTime = now;
  // Send data to the collection service (sendToCollector is app-defined)
  sendToCollector(perfData);
});
observer.observe(document.body, { childList: true, subtree: true, attributes: true });
Data Transmission Layer
The data transmission layer must consider network conditions and performance impact, typically employing the following strategies:
- Use Web Workers for data preprocessing and compression (see the sketch after the reporter class below)
- Implement batch reporting to reduce request frequency
- Support offline caching and resumable transmission
- Adopt lightweight protocols such as Protocol Buffers
// Example of batch reporting implementation
class PerformanceReporter {
  constructor() {
    this.queue = [];
    this.maxBatchSize = 10;
    this.flushInterval = 5000; // ms
    this.init();
  }

  init() {
    setInterval(() => this.flush(), this.flushInterval);
    // sendBeacon is designed to survive page unload, so the normal flush is
    // safe here; synchronous XHR on unload is deprecated in modern browsers.
    window.addEventListener('beforeunload', () => this.flush());
  }

  add(data) {
    this.queue.push(data);
    if (this.queue.length >= this.maxBatchSize) {
      this.flush();
    }
  }

  flush() {
    if (this.queue.length === 0) return;
    const batch = this.queue.splice(0, this.queue.length);
    // A Blob carries an explicit content type; a bare string is sent as text/plain
    const payload = new Blob([JSON.stringify(batch)], { type: 'application/json' });
    navigator.sendBeacon('/collect', payload);
  }
}
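The first strategy in the list above, offloading preprocessing to a Web Worker, can be sketched as follows. This is a minimal illustration, assuming a worker file named collector-worker.js and a simple per-type averaging step; the aggregation logic and the `reporter` instance (a PerformanceReporter from the example above) stand in for your own pipeline:
// collector-worker.js — aggregates raw entries off the main thread
self.onmessage = ({ data: entries }) => {
  const sums = {};
  for (const { type, duration } of entries) {
    sums[type] = sums[type] || { total: 0, count: 0 };
    sums[type].total += duration;
    sums[type].count += 1;
  }
  // Post back one small record per metric type instead of every raw entry
  self.postMessage(Object.entries(sums).map(([type, s]) => ({
    type,
    avgDuration: s.total / s.count,
    sampleCount: s.count
  })));
};

// Main thread: hand raw entries to the worker, queue the aggregated result
const worker = new Worker('collector-worker.js');
worker.onmessage = ({ data }) => data.forEach((metric) => reporter.add(metric));
worker.postMessage([{ type: 'ttfb', duration: 182 }, { type: 'ttfb', duration: 201 }]);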
Data Storage Layer
The data storage design must consider query efficiency and storage costs:
- Time-series databases (e.g., InfluxDB) for storing raw metrics
- Elasticsearch for log-type data
- Redis for caching hotspot data
- Data partitioning strategies (by time/business line/region)
-- Example of a time-series table structure (illustrative; the TAGS clause
-- follows TDengine-style super tables -- adjust types/syntax for your database)
CREATE TABLE performance_metrics (
  time        TIMESTAMP,
  app_id      STRING,
  page_url    STRING,
  device_type STRING,
  dns_latency FLOAT,
  tcp_latency FLOAT,
  ttfb        FLOAT,
  dom_ready   FLOAT,
  load_time   FLOAT,
  region      STRING
) TAGS (env STRING);
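If raw metrics go to InfluxDB as suggested above, writes use the line protocol over HTTP. A minimal sketch against the InfluxDB 2.x write endpoint; the host, org, bucket, and token below are placeholders:
// Writing one point to InfluxDB 2.x (host/org/bucket/token are placeholders)
const line =
  'performance_metrics,app_id=shop-web,region=us-east ' +
  `ttfb=182.5,dom_ready=940 ${Date.now()}`;
fetch('https://influx.example.com/api/v2/write?org=my-org&bucket=perf&precision=ms', {
  method: 'POST',
  headers: { Authorization: 'Token MY_TOKEN' },
  body: line // line protocol: measurement,tags fields timestamp
});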
Key Performance Metrics Design
Frontend Core Metrics
- Loading Performance Metrics (see the PerformanceObserver sketch after the long-task example below)
  - FCP (First Contentful Paint)
  - LCP (Largest Contentful Paint)
  - TTI (Time to Interactive)
  - FID (First Input Delay)
- Runtime Metrics
  - Memory usage trends
  - Long task statistics (tasks longer than 50 ms)
  - Layout shifts (CLS)
  - Animation frame rates
// Long task monitoring (reportLongTask is app-defined)
const longTaskObserver = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // 'longtask' entries are by definition over 50 ms; the check is belt-and-braces
    if (entry.duration > 50) {
      reportLongTask({
        duration: entry.duration,
        startTime: entry.startTime,
        container: entry.name || 'unknown'
      });
    }
  }
});
longTaskObserver.observe({ entryTypes: ['longtask'] });
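The loading metrics listed earlier (FCP, LCP, CLS) can be captured with the same PerformanceObserver mechanism. A minimal sketch, assuming an app-defined reportMetric function:
// FCP: reported once via the 'paint' entry type
new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    if (entry.name === 'first-contentful-paint') {
      reportMetric({ name: 'FCP', value: entry.startTime });
    }
  }
}).observe({ type: 'paint', buffered: true });

// LCP: the last entry observed is the current candidate
new PerformanceObserver((list) => {
  const entries = list.getEntries();
  reportMetric({ name: 'LCP', value: entries[entries.length - 1].startTime });
}).observe({ type: 'largest-contentful-paint', buffered: true });

// CLS: accumulate layout-shift values not caused by recent user input
let clsScore = 0;
new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    if (!entry.hadRecentInput) clsScore += entry.value;
  }
}).observe({ type: 'layout-shift', buffered: true });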
Backend Core Metrics
- System Resource Metrics (a polling sketch follows the Flask example below)
  - CPU usage
  - Memory consumption
  - Disk I/O
  - Network bandwidth
- Application Performance Metrics
  - API response times (P95/P99)
  - Error rates
  - Database query performance
  - Queue backlog status
# Flask middleware example (assumes a statsd client exposing timing()/increment(),
# e.g. the datadog package's statsd object)
import time
from flask import g, request

@app.before_request
def before_request():
    g.start_time = time.time()

@app.after_request
def after_request(response):
    duration = time.time() - g.start_time
    statsd.timing(f'api.{request.endpoint}.duration', duration * 1000)
    if response.status_code >= 500:
        statsd.increment(f'api.{request.endpoint}.errors')
    return response
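The system resource metrics in the list above are usually scraped by an agent. As an illustrative sketch rather than a full agent, a Node.js process can poll its host with the built-in os module; reportMetric is app-defined:
// Minimal Node.js resource poller
const os = require('os');
setInterval(() => {
  reportMetric({
    loadAvg1m: os.loadavg()[0],                           // 1-minute load average
    memUsedRatio: 1 - os.freemem() / os.totalmem(),       // host memory utilization
    heapUsedMB: process.memoryUsage().heapUsed / 2 ** 20  // this process's heap
  });
}, 10000);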
Data Analysis and Visualization
Data Aggregation Strategies
- Time-based aggregation (1min/5min/1h)
- Percentile calculations (P50/P95/P99)
- Anomaly detection algorithms (3σ principle/IQR)
- Year-over-year and month-over-month analysis
// Anomaly detection example: rolling-window 3σ rule
function detectAnomalies(data, windowSize = 10, threshold = 3) {
  const anomalies = [];
  for (let i = windowSize; i < data.length; i++) {
    const window = data.slice(i - windowSize, i);
    const mean = window.reduce((a, b) => a + b, 0) / windowSize;
    const std = Math.sqrt(
      window.reduce((a, b) => a + Math.pow(b - mean, 2), 0) / windowSize
    );
    // Flag points more than `threshold` standard deviations from the window mean
    if (Math.abs(data[i] - mean) > threshold * std) {
      anomalies.push({ index: i, value: data[i] });
    }
  }
  return anomalies;
}
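The percentile aggregation listed above (P50/P95/P99) can be computed with a simple nearest-rank method once a window of samples is collected; a sketch:
// Nearest-rank percentile: smallest value with at least p% of samples at or below it
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Usage: percentile(latencies, 95) -> P95, percentile(latencies, 99) -> P99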
Visualization Design Principles
- Dashboard Design
  - Core metrics overview
  - Trend comparison charts
  - Geographic heatmaps
  - Anomaly alert panels
- Interactive Features
  - Time range selection
  - Drill-down analysis
  - Threshold alert configuration
  - Data export
<!-- ECharts visualization example -->
<div id="perf-chart" style="width: 100%; height: 400px;"></div>
<script>
  const chart = echarts.init(document.getElementById('perf-chart'));
  chart.setOption({
    tooltip: { trigger: 'axis' },
    legend: { data: ['P50', 'P95', 'P99'] },
    xAxis: { type: 'category', data: ['00:00', '03:00', '06:00', '09:00', '12:00'] },
    yAxis: { type: 'value', name: 'Response Time (ms)' },
    series: [
      { name: 'P50', type: 'line', data: [120, 132, 145, 160, 172] },
      { name: 'P95', type: 'line', data: [220, 282, 291, 334, 390] },
      { name: 'P99', type: 'line', data: [320, 432, 501, 534, 620] }
    ]
  });
</script>
Alert Mechanism Implementation
Alert Rule Design
- Threshold alerts (static thresholds/dynamic baselines)
- Sudden change alerts (month-over-month/year-over-year change rates)
- Composite alerts (multiple conditions combined)
- Dependency alerts (triggered by dependencies)
# Alert rule configuration example
alert_rules:
  - name: "API Response Time Anomaly"
    metrics: "api.response_time.p99"
    condition: "value > 1000 || (value - baseline) / baseline > 0.5"
    window: "5m"
    severity: "critical"
    receivers: ["ops-team"]
  - name: "Frontend Error Rate Increase"
    metrics: "js.error_rate"
    condition: "value > 0.01 && increase(1h) > 0.005"
    window: "1h"
    severity: "warning"
Alert Noise Reduction Strategies
- Alert aggregation (merging notifications for the same issue)
- Alert suppression (high-priority alerts suppress low-priority ones)
- Alert snoozing (temporarily silencing known or acknowledged issues)
- Alert escalation (unacknowledged alerts escalate notifications)
# Alert aggregation example
from collections import defaultdict

def aggregate_alerts(alerts):
    grouped = defaultdict(list)
    for alert in alerts:
        key = (alert['metric'], alert['service'])
        grouped[key].append(alert)
    result = []
    for key, group in grouped.items():
        if len(group) > 3:  # Aggregate if more than 3 similar alerts
            sample = group[0]
            result.append({
                **sample,
                'count': len(group),
                'first_occurrence': min(a['time'] for a in group),
                'last_occurrence': max(a['time'] for a in group),
            })
        else:
            result.extend(group)
    return result
System Optimization Directions
Collection End Optimization
- Dynamic sampling rate adjustment (reduce sampling rate under high load)
- Metric prioritization (full collection for core metrics, sampling for secondary metrics)
- Data preprocessing (simple aggregation on the client side)
- Heartbeat detection (monitoring collector status)
// Dynamic sampling rate implementation
// (getMetricPriority is app-defined: maps a metric type to a priority tier)
function shouldSample(metricType) {
  const samplingRates = {
    critical: 1.0,   // always collect core metrics
    important: 0.5,
    normal: 0.1
  };
  const rate = samplingRates[getMetricPriority(metricType)] || 0.1;
  return Math.random() < rate;
}
Server-Side Optimization
- Data sharding
- Stream processing instead of batch processing
- Cold and hot data separation
- Read-write separation architecture
// Stream processing example (Kafka Streams, simplified pseudo-code)
StreamsBuilder builder = new StreamsBuilder();
builder.stream("raw-metrics")
    .filter((k, v) -> v != null)
    .mapValues(this::parseMetric)
    .groupBy((k, v) -> v.getMetricType())
    .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
    .aggregate(
        this::initAggregate,
        this::aggregateMetrics,
        Materialized.as("metrics-store"))
    .toStream()
    .to("aggregated-metrics");
KafkaStreams streams = new KafkaStreams(builder.build(), config);
Practical Application Scenarios
E-Commerce Promotion Scenarios
- Core transaction chain monitoring (order/payment)
- Inventory service-specific monitoring
- Flash sale system queue monitoring
- Regional access hotspot monitoring
// E-commerce specific metrics example
{
  "checkout_load_time": 1240,
  "payment_success_rate": 0.992,
  "inventory_cache_hit_rate": 0.87,
  "flash_sale_queue_length": 1423,
  "recommend_api_latency": 56
}
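For the core transaction chain, spans between checkout steps can be measured directly with the User Timing API; a minimal sketch (the step names and reportMetric are illustrative):
// Mark checkout milestones and report the measured span
performance.mark('checkout:start');
// ... checkout form submitted, payment confirmed ...
performance.mark('checkout:paid');
performance.measure('checkout_total', 'checkout:start', 'checkout:paid');
const [measure] = performance.getEntriesByName('checkout_total');
reportMetric({ name: 'checkout_total', value: measure.duration });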
Content Platform Scenarios
- Video loading performance monitoring
- Content recommendation click-through rate monitoring
- Comment posting success rate
- Image compression performance monitoring
// Video playback monitoring (reportVideoMetric is app-defined; note that
// HTMLVideoElement exposes no standard bitrate property, so resolution is
// reported instead)
const startLoadTime = performance.now();
videoElem.addEventListener('loadedmetadata', () => {
  const loadTime = performance.now() - startLoadTime;
  reportVideoMetric({
    event: 'metadata_loaded',
    duration: loadTime,
    resolution: `${videoElem.videoWidth}x${videoElem.videoHeight}`,
    bufferedRanges: videoElem.buffered.length // count of buffered time ranges
  });
});