Install
openclaw skills install sre-log-analytics
Analyze system logs by time range using the Google SRE framework to summarize operations, classify errors, score health, and suggest improvements.
This skill provides a systematic log analysis workflow based on the Google SRE (Site Reliability Engineering) framework.
Trigger Conditions: Use this skill when:
Logs can be filtered by the following methods:
YYYY-MM-DD to YYYY-MM-DD)
Classify exceptions based on SRE best practices:
Assign a system health score from 1 to 5 based on error rate and exception frequency:
Provide suggestions based on Google SRE principles:
/var/log/, application logs are determined by deployment location)
grep/awk for timestamp filtering (logic description only)
tail/head for segmented reading
Analysis dimensions based on the Google SRE framework:
| Analysis Dimension | Check Content |
|---|---|
| Error Rate | Proportion of error logs in total logs |
| Error Type Distribution | Aggregate statistics by error type |
| Error Timing | Time distribution of errors; whether they occur in sudden bursts |
| Resource Usage | Whether any resource has been exhausted |
| Dependency Status | Whether the errors are caused by an external dependency failure |
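The error rate and error timing dimensions in the table above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the function names `error_rate` and `errors_per_hour` are hypothetical, and the hourly bucket size is an arbitrary choice that the skill does not prescribe.

```python
from collections import Counter
from datetime import datetime


def error_rate(error_count, total_count):
    """Proportion of error logs in total logs; 0.0 when there are no logs."""
    if total_count == 0:
        return 0.0
    return error_count / total_count


def errors_per_hour(error_timestamps):
    """Count errors per hour bucket (assumed bucket size) to reveal sudden bursts."""
    return Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in error_timestamps
    )


# Hypothetical sample data, not taken from any real log.
rate = error_rate(5, 200)
buckets = errors_per_hour([
    datetime(2024, 1, 1, 10, 5),
    datetime(2024, 1, 1, 10, 40),
    datetime(2024, 1, 1, 10, 59),
    datetime(2024, 1, 1, 11, 10),
])
```

An hour whose count is far above its neighbors is a candidate for a sudden burst; what counts as "far above" is left to the analyst.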
Merge similar exceptions to avoid duplicate reporting:
Output a structured report including:
See references/report-template.md for a reference output template.
According to user needs:
Below are logic descriptions of common log filtering operations. The skill itself includes no actual scripts:
Logic Steps:
1. Input: log_file_path, start_time, end_time
2. Initialize empty result list
3. For each line in log_file:
   a. Extract timestamp string from the line
   b. Parse timestamp to datetime object
   c. If start_time <= datetime <= end_time:
      i. Add line to result list
4. Output: result list
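The skill ships only the logic description, but the steps above could be sketched in Python roughly as follows. This is a minimal sketch, assuming every log line begins with a `YYYY-MM-DD HH:MM:SS` timestamp; the function name `filter_by_time_range` and the two constants are hypothetical, and real logs may need a different format string. It accepts any iterable of lines (an open file object works) rather than a path.

```python
from datetime import datetime

# Assumption: each line starts with a timestamp in this layout.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
TIMESTAMP_LENGTH = 19  # length of "YYYY-MM-DD HH:MM:SS"


def filter_by_time_range(lines, start_time, end_time):
    """Return the lines whose leading timestamp lies in [start_time, end_time]."""
    result = []
    for line in lines:
        try:
            timestamp = datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)
        except ValueError:
            # Lines without a parseable timestamp (for example stack-trace
            # continuation lines) are skipped in this sketch.
            continue
        if start_time <= timestamp <= end_time:
            result.append(line)
    return result


# Hypothetical sample lines for illustration only.
sample_lines = [
    "2024-01-01 09:59:59 INFO service started",
    "2024-01-01 10:00:00 ERROR connection refused",
    "    at continuation line without timestamp",
    "2024-01-01 11:30:00 INFO request handled",
    "2024-01-01 12:00:01 ERROR timeout",
]
filtered = filter_by_time_range(
    sample_lines,
    datetime(2024, 1, 1, 10, 0, 0),
    datetime(2024, 1, 1, 12, 0, 0),
)
```

Both bounds are inclusive, matching the `start_time <= datetime <= end_time` condition in step 3c.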
Logic Steps:
1. Input: log_lines
2. Initialize empty error list
3. Define error keywords: ["ERROR", "FATAL", "SEVERE", "Exception", "Error:"]
4. For each line in log_lines:
   a. If any keyword matches the line:
      i. Add line to error list
5. Output: error list, error_count = len(error_list)
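As an illustration, the error extraction steps above map almost directly onto Python. The sketch below assumes that "matches" means a case-sensitive substring test, which the description does not specify; the function name `extract_errors` is hypothetical.

```python
# Keyword list taken from step 3 of the logic description.
ERROR_KEYWORDS = ["ERROR", "FATAL", "SEVERE", "Exception", "Error:"]


def extract_errors(log_lines, keywords=ERROR_KEYWORDS):
    """Return (error_list, error_count) for lines containing any keyword."""
    error_list = [
        line for line in log_lines
        if any(keyword in line for keyword in keywords)  # case-sensitive substring match
    ]
    return error_list, len(error_list)


# Hypothetical sample lines for illustration only.
sample_lines = [
    "2024-01-01 10:00:00 INFO service started",
    "2024-01-01 10:00:05 ERROR connection refused",
    "java.lang.IllegalStateException: bad state",
    "2024-01-01 10:00:09 WARN retrying",
]
errors, error_count = extract_errors(sample_lines)
```

Plain substring matching can over-match, for example an INFO line whose message merely mentions the word "Exception", so the result is best treated as a candidate list.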
Logic Steps:
1. Input: error_lines
2. Initialize empty aggregation dictionary
3. For each line in error_lines:
   a. Extract error type keyword from line (e.g., OOM, connection refused, timeout)
   b. If keyword exists in aggregation:
      i. aggregation[keyword].count += 1
      ii. Add line to aggregation[keyword].samples
   c. Else:
      i. Create new entry in aggregation with count = 1, samples = [line]
4. Sort aggregation by count descending (or by severity)
5. Output: sorted aggregation result
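The aggregation steps above could look roughly like this in Python. Several details are assumptions made for the sketch: the regular expressions in `ERROR_TYPE_PATTERNS` are hypothetical stand-ins for "extract error type keyword", unmatched lines fall into an `"other"` bucket, and the `max_samples` cap is an addition to keep sample lists short (the description keeps every line). Sorting is by count only, not by severity.

```python
import re

# Hypothetical patterns for the example error types named in step 3a.
ERROR_TYPE_PATTERNS = {
    "OOM": re.compile(r"\bOOM\b|out of memory|OutOfMemory", re.IGNORECASE),
    "connection refused": re.compile(r"connection refused", re.IGNORECASE),
    "timeout": re.compile(r"timed? ?out", re.IGNORECASE),
}


def classify(line):
    """Return the first matching error type, or "other" if none matches."""
    for error_type, pattern in ERROR_TYPE_PATTERNS.items():
        if pattern.search(line):
            return error_type
    return "other"


def aggregate_errors(error_lines, max_samples=3):
    """Group error lines by type and sort the groups by count, descending."""
    aggregation = {}
    for line in error_lines:
        entry = aggregation.setdefault(classify(line), {"count": 0, "samples": []})
        entry["count"] += 1
        if len(entry["samples"]) < max_samples:
            entry["samples"].append(line)
    return sorted(
        aggregation.items(), key=lambda item: item[1]["count"], reverse=True
    )


# Hypothetical sample lines for illustration only.
sample_errors = [
    "ERROR connection refused to db:5432",
    "ERROR request timed out",
    "ERROR Connection refused to cache",
    "FATAL java.lang.OutOfMemoryError",
    "ERROR disk failure",
]
aggregated = aggregate_errors(sample_errors)
```

Each element of `aggregated` is a `(error_type, {"count": ..., "samples": [...]})` pair, so similar exceptions are reported once with a count instead of line by line.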