ops-mcp-server

# Ops MCP Server Skill

Access your infrastructure's observability data and execute operational procedures through a unified MCP interface.

## Capabilities at a Glance

| Module | Tools | What it answers |
|--------|-------|----------------|
| **Events** (Kubernetes) | `list-events-from-ops`, `get-events-from-ops` | What happened to a pod/deployment/node? |
| **Metrics** (Prometheus) | `list-metrics-from-prometheus`, `query-metrics-from-prometheus`, `query-metrics-range-from-prometheus` | Is CPU/memory/traffic normal? What changed over time? |
| **Logs** (Elasticsearch) | `list-log-indices-from-elasticsearch`, `search-logs-from-elasticsearch`, `query-logs-from-elasticsearch` | What errors are in the logs? What did service X log? |
| **Traces** (Jaeger) | `get-services-from-jaeger`, `get-operations-from-jaeger`, `find-traces-from-jaeger`, `get-trace-from-jaeger` | Why is this request slow? Where did it fail? |
| **SOPs** | `list-sops-from-ops`, `list-sops-parameters-from-ops`, `execute-sops-from-ops` | Run a standard operational procedure |

## Setup (first-time)

```bash
# 1. Use mcporter with npx (no installation needed)
# Or install globally: npm i -g mcporter

# 2. Register the server
cd ~/.openclaw/workspace
npx mcporter config add ops-mcp-server --url http://localhost/mcp

# 3. Authenticate (if needed)
npx mcporter auth ops-mcp-server
# On failure, add to ~/.openclaw/workspace/config/mcporter.json:
# "headers": { "Authorization": "Bearer YOUR_TOKEN" }

# 4. Verify
npx mcporter list ops-mcp-server
npx mcporter call ops-mcp-server list-events-from-ops page_size=5

# 5. Set env var
export OPS_MCP_SERVER_URL="http://localhost/mcp"
```

---

## How to Investigate: Decision Guide

When a user describes a problem, use this guide to pick the first tools to call, then follow up with the others to build a complete picture.

### 🔴 "Something is broken / service is down"

1. **Kubernetes Events first**: check whether pods crashed, restarted, or were evicted
   ```
   get-events-from-ops subject_pattern="ops.clusters.*.namespaces.<ns>.pods.*.events"
   ```
2. **Logs**: search for errors around the time of the incident
   ```
   query-logs-from-elasticsearch query='FROM logs-* | WHERE @timestamp > NOW() - 30 minutes | WHERE level == "error" | LIMIT 50'
   ```
3. **Traces**: find failed or slow requests
   ```
   find-traces-from-jaeger serviceName=<service> tags={"error":"true"}
   ```

### 🟡 "Performance is degraded / requests are slow"

1. **Metrics**: check resource saturation
   ```
   query-metrics-from-prometheus query="100 - (avg(rate(node_cpu_seconds_total{mode='idle'}[5m])) * 100)"
   query-metrics-range-from-prometheus query="node_memory_MemAvailable_bytes" time_range="1h" step="1m"
   ```
2. **Traces**: find slow spans
   ```
   find-traces-from-jaeger serviceName=<service> durationMin=1000
   ```
3. **Logs**: look for timeouts or slow-query warnings

### 🔵 "I need to run a procedure / restart something"

1. **List available SOPs**
   ```
   list-sops-from-ops
   ```
2. **Get parameters**
   ```
   list-sops-parameters-from-ops sops_id=<id>
   ```
3. **Execute**
   ```
   execute-sops-from-ops sops_id=<id> parameters='{...}'
   ```

### 🟢 "General health check / nothing specific"

Start with events plus one key metrics query, then go deeper based on what you find.
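
For the event lookup that opens the "service is down" flow above, the subject pattern can be assembled from a namespace instead of typed by hand. The sketch below is illustrative only: `pod_events_subject` and `print_events_call` are hypothetical helper names, not part of the skill, and the `key=value` call form is assumed to match the one used in the Setup verification step. It prints the command rather than running it, so nothing is sent to the server.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of the skill) for step 1 of the
# "service is down" flow.

# pod_events_subject <namespace> [cluster]
# Prints the pod-events subject pattern; cluster defaults to the * wildcard.
pod_events_subject() {
  local ns="$1" cluster="${2:-*}"
  printf 'ops.clusters.%s.namespaces.%s.pods.*.events\n' "$cluster" "$ns"
}

# print_events_call <namespace>
# Dry run: echoes the mcporter command instead of executing it.
print_events_call() {
  printf 'npx mcporter call ops-mcp-server get-events-from-ops subject_pattern="%s"\n' \
    "$(pod_events_subject "$1")"
}

print_events_call checkout
```

Keeping the pattern quoted matters here, since an unquoted `*` would be expanded by the shell before the tool ever sees it.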

---

## Tool Quick Reference

### Events: NATS subject pattern format

```
# Namespace resources
ops.clusters.{cluster}.namespaces.{ns}.{resourceType}.{name}.{observation}

# Node level
ops.clusters.{cluster}.nodes.{nodeName}.{observation}

# Notifications
ops.notifications.providers.{provider}.channels.{channel}.severities.{severity}
```

- Wildcards: `*` = one segment, `>` = everything remaining (tail only)
- Observation types: `status` | `events` | `alerts` | `findings`
- Time is Unix milliseconds: `$(date +%s)000`

### Logs: ES|QL query patterns

```sql
-- Recent errors
FROM logs-* | WHERE @timestamp > NOW() - 30 minutes | WHERE level == "error" | LIMIT 100

-- Top errors by frequency
FROM logs-* | WHERE @timestamp > NOW() - 1 hour | WHERE level == "error" | STATS count = COUNT(*) BY message | SORT count DESC | LIMIT 10

-- Specific service
FROM logs-* | WHERE service == "checkout-service" | WHERE @timestamp > NOW() - 1 hour | LIMIT 50
```

### Metrics: PromQL patterns

```
# CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)

# Memory available
node_memory_MemAvailable_bytes

# HTTP error rate
rate(http_requests_total{status=~"5.."}[5m])
```

---

## Detailed Examples & Reference Files

For complete parameter lists, output formats, and advanced patterns, read the relevant file:

- **events** → `examples/events.md`
- **metrics** → `examples/metrics.md`
- **logs** → `examples/logs.md`
- **traces** → `examples/traces.md`
- **sops** → `examples/sops.md`
- **event subject format design** → `references/design.md`

Read the relevant example file before making complex tool calls you're unsure about.

---

## What This Skill is NOT For

- Direct infrastructure changes (use dedicated automation tooling)
- Real-time alerting (investigation only, not a monitoring agent)
- Writing to or modifying operational data (all access is read-only)
