
Design of a Custom Performance Monitoring System

Author: Chuan Chen · Reads: 62,045 · Category: Performance Optimization

Core Objectives of Performance Monitoring Systems

The core objectives of a performance monitoring system are to collect, analyze, and display application performance metrics in real-time, helping developers quickly identify performance bottlenecks. An excellent custom performance monitoring system should possess characteristics such as high real-time capability, low intrusiveness, scalability, and ease of use. By customizing the monitoring system, it is possible to design dedicated monitoring metrics tailored to specific business scenarios rather than relying on fixed metrics from generic solutions.

System Architecture Design

Data Collection Layer

The data collection layer is responsible for gathering performance data from both the client and server sides. Frontend performance data is typically collected using the Performance API:

// Example of frontend performance data collection
// (Navigation Timing Level 1; deprecated but still widely supported.
// Run after the load event, otherwise loadEventEnd is still 0.)
const getPerformanceMetrics = () => {
  const timing = window.performance.timing;
  return {
    dns: timing.domainLookupEnd - timing.domainLookupStart,
    tcp: timing.connectEnd - timing.connectStart,
    ttfb: timing.responseStart - timing.requestStart,
    download: timing.responseEnd - timing.responseStart,
    domReady: timing.domComplete - timing.domLoading,
    loadEvent: timing.loadEventEnd - timing.loadEventStart,
    total: timing.loadEventEnd - timing.navigationStart
  };
};

// Using MutationObserver to monitor DOM performance changes
const startTime = performance.now();
const observer = new MutationObserver((mutations) => {
  const perfData = {
    mutationCount: mutations.length,
    processingTime: performance.now() - startTime
  };
  // Send data to the collection service
  sendToCollector(perfData);
});
observer.observe(document.body, {
  childList: true,
  subtree: true,
  attributes: true
});

Data Transmission Layer

The data transmission layer must consider network conditions and performance impact, typically employing the following strategies:

  1. Use Web Workers for data preprocessing and compression
  2. Implement batch reporting to reduce request frequency
  3. Support offline caching and resumable transmission
  4. Adopt lightweight protocols such as Protocol Buffers
// Example of batch reporting implementation
class PerformanceReporter {
  constructor() {
    this.queue = [];
    this.maxBatchSize = 10;
    this.flushInterval = 5000;
    this.init();
  }

  init() {
    setInterval(() => this.flush(), this.flushInterval);
    window.addEventListener('beforeunload', () => this.flushSync());
  }

  add(data) {
    this.queue.push(data);
    if (this.queue.length >= this.maxBatchSize) {
      this.flush();
    }
  }

  flush() {
    if (this.queue.length === 0) return;
    
    const batch = [...this.queue];
    this.queue = [];
    
    navigator.sendBeacon('/collect', JSON.stringify(batch));
  }

  flushSync() {
    if (this.queue.length === 0) return;
    // Synchronous XHR so the request completes before unload;
    // sendBeacon is generally preferable when available
    const xhr = new XMLHttpRequest();
    xhr.open('POST', '/collect', false);
    xhr.setRequestHeader('Content-Type', 'application/json');
    xhr.send(JSON.stringify(this.queue));
    this.queue = [];
  }
}
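Strategy 3 above (offline caching and resumable transmission) can be sketched as a retry queue: failed batches stay queued and are retried with exponential backoff. `sendFn` here is a hypothetical async transport callback, not part of the reporter above.

```javascript
// Sketch of a retry queue for resumable reporting.
// Failed batches remain queued and are retried on the next drain;
// nextDelay() gives an exponential backoff interval for scheduling.
class RetryQueue {
  constructor(sendFn, baseDelay = 1000, maxRetries = 5) {
    this.sendFn = sendFn;       // hypothetical async transport
    this.baseDelay = baseDelay;
    this.maxRetries = maxRetries;
    this.pending = [];
  }

  enqueue(batch) {
    this.pending.push({ batch, attempts: 0 });
  }

  // Delay (ms) before the next retry after `attempts` failures
  nextDelay(attempts) {
    return this.baseDelay * Math.pow(2, attempts);
  }

  async drain() {
    const remaining = [];
    for (const item of this.pending) {
      try {
        await this.sendFn(item.batch);
      } catch (e) {
        item.attempts += 1;
        // Drop batches that exceeded the retry budget
        if (item.attempts < this.maxRetries) remaining.push(item);
      }
    }
    this.pending = remaining;
  }
}
```

In a browser, `pending` would additionally be persisted (e.g. to localStorage or IndexedDB) so batches survive page reloads.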

Data Storage Layer

The data storage design must consider query efficiency and storage costs:

  1. Time-series databases (e.g., InfluxDB) for storing raw metrics
  2. Elasticsearch for log-type data
  3. Redis for caching hotspot data
  4. Data partitioning strategies (by time/business line/region)
-- Example time-series table (TDengine-style super table; types illustrative)
CREATE STABLE performance_metrics (
  ts TIMESTAMP,
  app_id NCHAR(64),
  page_url NCHAR(255),
  device_type NCHAR(32),
  dns_latency FLOAT,
  tcp_latency FLOAT,
  ttfb FLOAT,
  dom_ready FLOAT,
  load_time FLOAT,
  region NCHAR(32)
) TAGS (env NCHAR(16));
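The partitioning strategy in point 4 comes down to computing a partition key at write time. A minimal sketch, assuming daily partitions per business line (the `appId` field name is illustrative):

```javascript
// Compute a partition key from timestamp + business line,
// e.g. "shop_2024-01-15" for routing writes to daily partitions.
function partitionKey(metric) {
  const day = new Date(metric.time).toISOString().slice(0, 10); // YYYY-MM-DD
  return `${metric.appId}_${day}`;
}
```

The same key can double as an index prefix, so time-range queries scoped to one business line only touch the relevant partitions.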

Key Performance Metrics Design

Frontend Core Metrics

  1. Loading Performance Metrics

    • FCP (First Contentful Paint)
    • LCP (Largest Contentful Paint)
    • TTI (Time to Interactive)
    • FID (First Input Delay)
  2. Runtime Metrics

    • Memory usage trends
    • Long task statistics (>50ms tasks)
    • Layout shifts (CLS)
    • Animation frame rates
// Long task monitoring
const observer = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    if (entry.duration > 50) {
      reportLongTask({
        duration: entry.duration,
        startTime: entry.startTime,
        container: entry.name || 'unknown'
      });
    }
  }
});
observer.observe({entryTypes: ['longtask']});

Backend Core Metrics

  1. System Resource Metrics

    • CPU usage
    • Memory consumption
    • Disk I/O
    • Network bandwidth
  2. Application Performance Metrics

    • API response times (P95/P99)
    • Error rates
    • Database query performance
    • Queue backlog status
# Flask middleware example
import time
from flask import g, request

@app.before_request
def before_request():
    g.start_time = time.time()

@app.after_request
def after_request(response):
    duration = time.time() - g.start_time
    statsd.timing(f'api.{request.endpoint}.duration', duration * 1000)
    if response.status_code >= 500:
        statsd.increment(f'api.{request.endpoint}.errors')
    return response

Data Analysis and Visualization

Data Aggregation Strategies

  1. Time-based aggregation (1min/5min/1h)
  2. Percentile calculations (P50/P95/P99)
  3. Anomaly detection algorithms (3σ principle/IQR)
  4. Year-over-year and month-over-month analysis
// Anomaly detection example
function detectAnomalies(data, windowSize = 10, threshold = 3) {
  const anomalies = [];
  for (let i = windowSize; i < data.length; i++) {
    const window = data.slice(i - windowSize, i);
    const mean = window.reduce((a,b) => a + b, 0) / windowSize;
    const std = Math.sqrt(
      window.reduce((a,b) => a + Math.pow(b - mean, 2), 0) / windowSize
    );
    if (Math.abs(data[i] - mean) > threshold * std) {
      anomalies.push({index: i, value: data[i]});
    }
  }
  return anomalies;
}
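The percentile calculations mentioned above (P50/P95/P99) can be computed with the nearest-rank method; a minimal sketch:

```javascript
// Nearest-rank percentile: sort a copy of the samples, then pick
// the value at rank ceil(p/100 * n), converted to a 0-based index.
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}
```

For large streams, exact sorting is replaced in practice by approximate structures such as t-digest or HDR histograms, which bound memory while keeping tail percentiles accurate.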

Visualization Design Principles

  1. Dashboard Design

    • Core metrics overview
    • Trend comparison charts
    • Geographic heatmaps
    • Anomaly alert panels
  2. Interactive Features

    • Time range selection
    • Drill-down analysis
    • Threshold alert configuration
    • Data export
<!-- ECharts visualization example -->
<div id="perf-chart" style="width: 100%;height:400px;"></div>
<script>
const chart = echarts.init(document.getElementById('perf-chart'));
chart.setOption({
  tooltip: {trigger: 'axis'},
  legend: {data: ['P50', 'P95', 'P99']},
  xAxis: {type: 'category', data: ['00:00','03:00','06:00','09:00','12:00']},
  yAxis: {type: 'value', name: 'Response Time(ms)'},
  series: [
    {name: 'P50', type: 'line', data: [120, 132, 145, 160, 172]},
    {name: 'P95', type: 'line', data: [220, 282, 291, 334, 390]},
    {name: 'P99', type: 'line', data: [320, 432, 501, 534, 620]}
  ]
});
</script>

Alert Mechanism Implementation

Alert Rule Design

  1. Threshold alerts (static thresholds/dynamic baselines)
  2. Sudden change alerts (month-over-month/year-over-year change rates)
  3. Composite alerts (multiple conditions combined)
  4. Dependency alerts (triggered by dependencies)
# Alert rule configuration example
alert_rules:
  - name: "API Response Time Anomaly"
    metrics: "api.response_time.p99"
    condition: "value > 1000 || (value - baseline) / baseline > 0.5"
    window: "5m"
    severity: "critical"
    receivers: ["ops-team"]
  
  - name: "Frontend Error Rate Increase"
    metrics: "js.error_rate"
    condition: "value > 0.01 && increase(1h) > 0.005"
    window: "1h"
    severity: "warning"
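The conditions in the rules above combine a static threshold with a dynamic-baseline check. A hedged sketch of how such a condition might be evaluated (the limits mirror the first rule's `value > 1000 || (value - baseline) / baseline > 0.5`):

```javascript
// Evaluate a static-threshold OR dynamic-baseline condition,
// as in the "API Response Time Anomaly" rule above.
function breachesRule(value, baseline, { absLimit = 1000, relLimit = 0.5 } = {}) {
  const absoluteBreach = value > absLimit;
  // Guard against a zero/missing baseline before dividing
  const relativeBreach = baseline > 0 && (value - baseline) / baseline > relLimit;
  return absoluteBreach || relativeBreach;
}
```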

Alert Noise Reduction Strategies

  1. Alert aggregation (merging notifications for the same issue)
  2. Alert suppression (high-priority alerts suppress low-priority ones)
  3. Alert snoozing (temporarily silencing resolved issues)
  4. Alert escalation (unacknowledged alerts escalate notifications)
# Alert aggregation example
from collections import defaultdict

def aggregate_alerts(alerts):
    grouped = defaultdict(list)
    for alert in alerts:
        key = (alert['metric'], alert['service'])
        grouped[key].append(alert)
    
    result = []
    for key, group in grouped.items():
        if len(group) > 3:  # Aggregate if more than 3 similar alerts
            sample = group[0]
            result.append({
                **sample,
                'count': len(group),
                'first_occurrence': min(a['time'] for a in group),
                'last_occurrence': max(a['time'] for a in group)
            })
        else:
            result.extend(group)
    return result
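Suppression (point 2) can be sketched as filtering out lower-severity alerts for any service that already has a higher-severity alert in the same batch:

```javascript
// Keep only the highest-severity alerts per service; a critical
// alert for a service suppresses its warnings and infos.
const SEVERITY_RANK = { critical: 2, warning: 1, info: 0 };

function suppressAlerts(alerts) {
  const maxSeverity = {};
  for (const a of alerts) {
    const r = SEVERITY_RANK[a.severity] ?? 0;
    maxSeverity[a.service] = Math.max(maxSeverity[a.service] ?? 0, r);
  }
  return alerts.filter(
    (a) => (SEVERITY_RANK[a.severity] ?? 0) === maxSeverity[a.service]
  );
}
```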

System Optimization Directions

Collection End Optimization

  1. Dynamic sampling rate adjustment (reduce sampling rate under high load)
  2. Metric prioritization (full collection for core metrics, sampling for secondary metrics)
  3. Data preprocessing (simple aggregation on the client side)
  4. Heartbeat detection (monitoring collector status)
// Dynamic sampling rate implementation
function shouldSample(metricType) {
  const samplingRates = {
    'critical': 1.0,
    'important': 0.5,
    'normal': 0.1
  };
  const rate = samplingRates[getMetricPriority(metricType)] || 0.1;
  return Math.random() < rate;
}
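Point 1 above (reducing the sampling rate under high load) can be layered on top of the priority-based rate: scale it down as load rises. Here `queueLength`/`maxQueue` stand in for whatever load signal the collector exposes; both names are illustrative.

```javascript
// Scale a base sampling rate down linearly as load rises;
// at or beyond maxQueue the rate bottoms out at minRate.
function adaptiveRate(baseRate, queueLength, maxQueue, minRate = 0.01) {
  const load = Math.min(queueLength / maxQueue, 1); // clamp to 0..1
  return Math.max(minRate, baseRate * (1 - load));
}
```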

Server-Side Optimization

  1. Data sharding
  2. Stream processing instead of batch processing
  3. Cold and hot data separation
  4. Read-write separation architecture
// Stream processing example (Kafka Streams, simplified)
StreamsBuilder builder = new StreamsBuilder();
builder.stream("raw-metrics")
  .filter((k, v) -> v != null)
  .mapValues(this::parseMetric)
  .groupBy((k, v) -> v.getMetricType())
  .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
  .aggregate(
    this::initAggregate,
    this::aggregateMetrics,
    Materialized.as("metrics-store"))
  .toStream()
  .to("aggregated-metrics");
KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();
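Hot/cold separation (point 3) is essentially routing by data age at write or compaction time. A minimal sketch, assuming a retention boundary expressed in days:

```javascript
// Route a metric to the hot or cold tier by age.
// hotDays is the retention boundary for the hot (fast) tier.
function storageTier(metricTime, now, hotDays = 7) {
  const ageMs = now - metricTime;
  return ageMs <= hotDays * 24 * 60 * 60 * 1000 ? 'hot' : 'cold';
}
```

The hot tier (e.g. SSD-backed time-series storage or Redis) serves dashboard queries; the cold tier holds compressed, downsampled history for occasional analysis.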

Practical Application Scenarios

E-Commerce Promotion Scenarios

  1. Core transaction chain monitoring (order/payment)
  2. Inventory service-specific monitoring
  3. Flash sale system queue monitoring
  4. Regional access hotspot monitoring
// E-commerce specific metrics example
{
  "checkout_load_time": 1240,
  "payment_success_rate": 0.992,
  "inventory_cache_hit_rate": 0.87,
  "flash_sale_queue_length": 1423,
  "recommend_api_latency": 56
}
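Metrics like `payment_success_rate` above are typically derived from raw events rather than reported directly; a minimal sketch of that aggregation (the `status` field name is illustrative):

```javascript
// Derive a success rate from raw payment events,
// e.g. the payment_success_rate field above.
function successRate(events) {
  if (events.length === 0) return null; // no data, not 0% success
  const ok = events.filter((e) => e.status === 'success').length;
  return ok / events.length;
}
```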

Content Platform Scenarios

  1. Video loading performance monitoring
  2. Content recommendation click-through rate monitoring
  3. Comment posting success rate
  4. Image compression performance monitoring
// Video playback monitoring
const startLoadTime = performance.now();
videoElem.addEventListener('loadedmetadata', () => {
  const loadTime = performance.now() - startLoadTime;
  reportVideoMetric({
    event: 'metadata_loaded',
    duration: loadTime,
    videoDuration: videoElem.duration,
    bufferedRanges: videoElem.buffered.length
  });
});

