Design of a Custom Performance Monitoring System
Core Objectives of Performance Monitoring Systems
The core objectives of a performance monitoring system are to collect, analyze, and display application performance metrics in real time, helping developers quickly identify performance bottlenecks. A good custom monitoring system should be highly real-time, minimally intrusive (low overhead on the monitored application), scalable, and easy to use. Building a custom system also makes it possible to design monitoring metrics tailored to specific business scenarios, rather than relying on the fixed metrics of generic solutions.
System Architecture Design
Data Collection Layer
The data collection layer is responsible for gathering performance data from both the client and server sides. Frontend performance data is typically collected using the Performance API:
// Example of frontend performance data collection
// Note: performance.timing (Navigation Timing Level 1) still works but is
// deprecated; new code should prefer performance.getEntriesByType('navigation').
const getPerformanceMetrics = () => {
  const timing = window.performance.timing;
  const metrics = {
    dns: timing.domainLookupEnd - timing.domainLookupStart,
    tcp: timing.connectEnd - timing.connectStart,
    ttfb: timing.responseStart - timing.requestStart,
    download: timing.responseEnd - timing.responseStart,
    domReady: timing.domComplete - timing.domLoading,       // DOM processing time
    loadEvent: timing.loadEventEnd - timing.loadEventStart, // load handler time
    total: timing.loadEventEnd - timing.navigationStart     // full page load
  };
  return metrics;
};
// Using MutationObserver to monitor DOM change activity
let lastBatchTime = performance.now();
const observer = new MutationObserver((mutations) => {
  const now = performance.now();
  const perfData = {
    mutationCount: mutations.length,
    msSinceLastBatch: now - lastBatchTime
  };
  lastBatchTime = now;
  // Send data to the collection service (sendToCollector is app-defined)
  sendToCollector(perfData);
});
observer.observe(document.body, { childList: true, subtree: true, attributes: true });
Data Transmission Layer
The data transmission layer must consider network conditions and performance impact, typically employing the following strategies:
- Use Web Workers for data preprocessing and compression (see the sketch after the reporter class below)
- Implement batch reporting to reduce request frequency
- Support offline caching and resumable transmission
- Adopt lightweight protocols such as Protocol Buffers
// Example of batch reporting implementation
class PerformanceReporter {
  constructor() {
    this.queue = [];
    this.maxBatchSize = 10;
    this.flushInterval = 5000; // ms
    this.init();
  }

  init() {
    setInterval(() => this.flush(), this.flushInterval);
    // sendBeacon is designed to survive page unload, so the normal flush is
    // safe here; synchronous XHR on unload is deprecated in modern browsers.
    window.addEventListener('beforeunload', () => this.flush());
  }

  add(data) {
    this.queue.push(data);
    if (this.queue.length >= this.maxBatchSize) {
      this.flush();
    }
  }

  flush() {
    if (this.queue.length === 0) return;
    const batch = this.queue.splice(0, this.queue.length);
    // A Blob carries an explicit content type; a bare string is sent as text/plain
    const payload = new Blob([JSON.stringify(batch)], { type: 'application/json' });
    navigator.sendBeacon('/collect', payload);
  }
}
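The first strategy in the list above, offloading preprocessing to a Web Worker, can be sketched as follows. This is a minimal illustration, assuming a worker file named collector-worker.js and a simple per-type averaging step; the aggregation logic and the `reporter` instance (a PerformanceReporter from the example above) stand in for your own pipeline:
// collector-worker.js — aggregates raw entries off the main thread
self.onmessage = ({ data: entries }) => {
  const sums = {};
  for (const { type, duration } of entries) {
    sums[type] = sums[type] || { total: 0, count: 0 };
    sums[type].total += duration;
    sums[type].count += 1;
  }
  // Post back one small record per metric type instead of every raw entry
  self.postMessage(Object.entries(sums).map(([type, s]) => ({
    type,
    avgDuration: s.total / s.count,
    sampleCount: s.count
  })));
};

// Main thread: hand raw entries to the worker, queue the aggregated result
const worker = new Worker('collector-worker.js');
worker.onmessage = ({ data }) => data.forEach((metric) => reporter.add(metric));
worker.postMessage([{ type: 'ttfb', duration: 182 }, { type: 'ttfb', duration: 201 }]);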
Data Storage Layer
The data storage design must consider query efficiency and storage costs:
- Time-series databases (e.g., InfluxDB) for storing raw metrics
- Elasticsearch for log-type data
- Redis for caching hotspot data
- Data partitioning strategies (by time/business line/region)
-- Example of a time-series table structure (illustrative; the TAGS clause
-- follows TDengine-style super tables -- adjust types/syntax for your database)
CREATE TABLE performance_metrics (
  time        TIMESTAMP,
  app_id      STRING,
  page_url    STRING,
  device_type STRING,
  dns_latency FLOAT,
  tcp_latency FLOAT,
  ttfb        FLOAT,
  dom_ready   FLOAT,
  load_time   FLOAT,
  region      STRING
) TAGS (env STRING);
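If raw metrics go to InfluxDB as suggested above, writes use the line protocol over HTTP. A minimal sketch against the InfluxDB 2.x write endpoint; the host, org, bucket, and token below are placeholders:
// Writing one point to InfluxDB 2.x (host/org/bucket/token are placeholders)
const line =
  'performance_metrics,app_id=shop-web,region=us-east ' +
  `ttfb=182.5,dom_ready=940 ${Date.now()}`;
fetch('https://influx.example.com/api/v2/write?org=my-org&bucket=perf&precision=ms', {
  method: 'POST',
  headers: { Authorization: 'Token MY_TOKEN' },
  body: line // line protocol: measurement,tags fields timestamp
});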
Key Performance Metrics Design
Frontend Core Metrics
- Loading Performance Metrics (see the PerformanceObserver sketch after the long-task example below)
  - FCP (First Contentful Paint)
  - LCP (Largest Contentful Paint)
  - TTI (Time to Interactive)
  - FID (First Input Delay)
- Runtime Metrics
  - Memory usage trends
  - Long task statistics (tasks longer than 50 ms)
  - Layout shifts (CLS)
  - Animation frame rates
// Long task monitoring (reportLongTask is app-defined)
const longTaskObserver = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // 'longtask' entries are by definition over 50 ms; the check is belt-and-braces
    if (entry.duration > 50) {
      reportLongTask({
        duration: entry.duration,
        startTime: entry.startTime,
        container: entry.name || 'unknown'
      });
    }
  }
});
longTaskObserver.observe({ entryTypes: ['longtask'] });
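The loading metrics listed earlier (FCP, LCP, CLS) can be captured with the same PerformanceObserver mechanism. A minimal sketch, assuming an app-defined reportMetric function:
// FCP: reported once via the 'paint' entry type
new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    if (entry.name === 'first-contentful-paint') {
      reportMetric({ name: 'FCP', value: entry.startTime });
    }
  }
}).observe({ type: 'paint', buffered: true });

// LCP: the last entry observed is the current candidate
new PerformanceObserver((list) => {
  const entries = list.getEntries();
  reportMetric({ name: 'LCP', value: entries[entries.length - 1].startTime });
}).observe({ type: 'largest-contentful-paint', buffered: true });

// CLS: accumulate layout-shift values not caused by recent user input
let clsScore = 0;
new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    if (!entry.hadRecentInput) clsScore += entry.value;
  }
}).observe({ type: 'layout-shift', buffered: true });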
Backend Core Metrics
- System Resource Metrics (a polling sketch follows the Flask example below)
  - CPU usage
  - Memory consumption
  - Disk I/O
  - Network bandwidth
- Application Performance Metrics
  - API response times (P95/P99)
  - Error rates
  - Database query performance
  - Queue backlog status
# Flask middleware example (assumes a statsd client exposing timing()/increment(),
# e.g. the datadog package's statsd object)
import time
from flask import g, request

@app.before_request
def before_request():
    g.start_time = time.time()

@app.after_request
def after_request(response):
    duration = time.time() - g.start_time
    statsd.timing(f'api.{request.endpoint}.duration', duration * 1000)
    if response.status_code >= 500:
        statsd.increment(f'api.{request.endpoint}.errors')
    return response
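The system resource metrics in the list above are usually scraped by an agent. As an illustrative sketch rather than a full agent, a Node.js process can poll its host with the built-in os module; reportMetric is app-defined:
// Minimal Node.js resource poller
const os = require('os');
setInterval(() => {
  reportMetric({
    loadAvg1m: os.loadavg()[0],                           // 1-minute load average
    memUsedRatio: 1 - os.freemem() / os.totalmem(),       // host memory utilization
    heapUsedMB: process.memoryUsage().heapUsed / 2 ** 20  // this process's heap
  });
}, 10000);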
Data Analysis and Visualization
Data Aggregation Strategies
- Time-based aggregation (1min/5min/1h)
- Percentile calculations (P50/P95/P99)
- Anomaly detection algorithms (3σ principle/IQR)
- Year-over-year and month-over-month analysis
// Anomaly detection example: rolling-window 3σ rule
function detectAnomalies(data, windowSize = 10, threshold = 3) {
  const anomalies = [];
  for (let i = windowSize; i < data.length; i++) {
    const window = data.slice(i - windowSize, i);
    const mean = window.reduce((a, b) => a + b, 0) / windowSize;
    const std = Math.sqrt(
      window.reduce((a, b) => a + Math.pow(b - mean, 2), 0) / windowSize
    );
    // Flag points more than `threshold` standard deviations from the window mean
    if (Math.abs(data[i] - mean) > threshold * std) {
      anomalies.push({ index: i, value: data[i] });
    }
  }
  return anomalies;
}
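The percentile aggregation listed above (P50/P95/P99) can be computed with a simple nearest-rank method once a window of samples is collected; a sketch:
// Nearest-rank percentile: smallest value with at least p% of samples at or below it
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Usage: percentile(latencies, 95) -> P95, percentile(latencies, 99) -> P99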
Visualization Design Principles
- Dashboard Design
  - Core metrics overview
  - Trend comparison charts
  - Geographic heatmaps
  - Anomaly alert panels
- Interactive Features
  - Time range selection
  - Drill-down analysis
  - Threshold alert configuration
  - Data export
<!-- ECharts visualization example -->
<div id="perf-chart" style="width: 100%; height: 400px;"></div>
<script>
  const chart = echarts.init(document.getElementById('perf-chart'));
  chart.setOption({
    tooltip: { trigger: 'axis' },
    legend: { data: ['P50', 'P95', 'P99'] },
    xAxis: { type: 'category', data: ['00:00', '03:00', '06:00', '09:00', '12:00'] },
    yAxis: { type: 'value', name: 'Response Time (ms)' },
    series: [
      { name: 'P50', type: 'line', data: [120, 132, 145, 160, 172] },
      { name: 'P95', type: 'line', data: [220, 282, 291, 334, 390] },
      { name: 'P99', type: 'line', data: [320, 432, 501, 534, 620] }
    ]
  });
</script>
Alert Mechanism Implementation
Alert Rule Design
- Threshold alerts (static thresholds/dynamic baselines)
- Sudden change alerts (month-over-month/year-over-year change rates)
- Composite alerts (multiple conditions combined)
- Dependency alerts (triggered by dependencies)
# Alert rule configuration example
alert_rules:
  - name: "API Response Time Anomaly"
    metrics: "api.response_time.p99"
    condition: "value > 1000 || (value - baseline) / baseline > 0.5"
    window: "5m"
    severity: "critical"
    receivers: ["ops-team"]
  - name: "Frontend Error Rate Increase"
    metrics: "js.error_rate"
    condition: "value > 0.01 && increase(1h) > 0.005"
    window: "1h"
    severity: "warning"
Alert Noise Reduction Strategies
- Alert aggregation (merging notifications for the same issue)
- Alert suppression (high-priority alerts suppress low-priority ones)
- Alert snoozing (temporarily silencing known or acknowledged issues)
- Alert escalation (unacknowledged alerts escalate notifications)
# Alert aggregation example
from collections import defaultdict

def aggregate_alerts(alerts):
    grouped = defaultdict(list)
    for alert in alerts:
        key = (alert['metric'], alert['service'])
        grouped[key].append(alert)
    result = []
    for key, group in grouped.items():
        if len(group) > 3:  # Aggregate if more than 3 similar alerts
            sample = group[0]
            result.append({
                **sample,
                'count': len(group),
                'first_occurrence': min(a['time'] for a in group),
                'last_occurrence': max(a['time'] for a in group),
            })
        else:
            result.extend(group)
    return result
System Optimization Directions
Collection End Optimization
- Dynamic sampling rate adjustment (reduce sampling rate under high load)
- Metric prioritization (full collection for core metrics, sampling for secondary metrics)
- Data preprocessing (simple aggregation on the client side)
- Heartbeat detection (monitoring collector status)
// Dynamic sampling rate implementation
// (getMetricPriority is app-defined: maps a metric type to a priority tier)
function shouldSample(metricType) {
  const samplingRates = {
    critical: 1.0,   // always collect core metrics
    important: 0.5,
    normal: 0.1
  };
  const rate = samplingRates[getMetricPriority(metricType)] || 0.1;
  return Math.random() < rate;
}
Server-Side Optimization
- Data sharding
- Stream processing instead of batch processing
- Cold and hot data separation
- Read-write separation architecture
// Stream processing example (Kafka Streams, simplified pseudo-code)
StreamsBuilder builder = new StreamsBuilder();
builder.stream("raw-metrics")
    .filter((k, v) -> v != null)
    .mapValues(this::parseMetric)
    .groupBy((k, v) -> v.getMetricType())
    .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
    .aggregate(
        this::initAggregate,
        this::aggregateMetrics,
        Materialized.as("metrics-store"))
    .toStream()
    .to("aggregated-metrics");
KafkaStreams streams = new KafkaStreams(builder.build(), config);
Practical Application Scenarios
E-Commerce Promotion Scenarios
- Core transaction chain monitoring (order/payment)
- Inventory service-specific monitoring
- Flash sale system queue monitoring
- Regional access hotspot monitoring
// E-commerce specific metrics example
{
  "checkout_load_time": 1240,
  "payment_success_rate": 0.992,
  "inventory_cache_hit_rate": 0.87,
  "flash_sale_queue_length": 1423,
  "recommend_api_latency": 56
}
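For the core transaction chain, spans between checkout steps can be measured directly with the User Timing API; a minimal sketch (the step names and reportMetric are illustrative):
// Mark checkout milestones and report the measured span
performance.mark('checkout:start');
// ... checkout form submitted, payment confirmed ...
performance.mark('checkout:paid');
performance.measure('checkout_total', 'checkout:start', 'checkout:paid');
const [measure] = performance.getEntriesByName('checkout_total');
reportMetric({ name: 'checkout_total', value: measure.duration });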
Content Platform Scenarios
- Video loading performance monitoring
- Content recommendation click-through rate monitoring
- Comment posting success rate
- Image compression performance monitoring
// Video playback monitoring (reportVideoMetric is app-defined; note that
// HTMLVideoElement exposes no standard bitrate property, so resolution is
// reported instead)
const startLoadTime = performance.now();
videoElem.addEventListener('loadedmetadata', () => {
  const loadTime = performance.now() - startLoadTime;
  reportVideoMetric({
    event: 'metadata_loaded',
    duration: loadTime,
    resolution: `${videoElem.videoWidth}x${videoElem.videoHeight}`,
    bufferedRanges: videoElem.buffered.length // count of buffered time ranges
  });
});