arista-device-health

# Arista EOS Device Health Check

Structured triage procedure for assessing Arista EOS device health. Produces a prioritized findings report with severity classifications and recommended actions.

EOS runs on a Linux kernel with individual processes (agents) managing each protocol and subsystem. This architecture means Linux-native tools (`bash top`, `bash df`, `bash dmesg`) are valid supplements to EOS show commands. Agent health is a first-class concern: a crashed or stuck agent can silently affect a subsystem even when aggregate device metrics look normal. The procedure covers core device health first, then extends to MLAG and VXLAN/EVPN for data center deployments.

## When to Use

- Device reported as slow, dropping traffic, or unresponsive
- Scheduled health audit of Arista switches (leaf, spine, or border)
- Post-change verification after upgrades, patches, or configuration changes
- Capacity planning data collection for CPU, memory, and link utilization
- Incident response when an Arista switch is suspected as the fault domain
- MLAG inconsistency or split-brain detection in a data center pair
- VXLAN/EVPN overlay health assessment (VTEP reachability, BGP EVPN peering)
- Agent crash or restart detected in syslog

## Prerequisites

- SSH or console access to the device (privilege 1 minimum)
- EOS 4.28 or later (commands validated against EOS 4.30+)
- Network reachability to management interface confirmed
- Awareness of the device's normal baseline (CPU, memory, traffic patterns)
- For MLAG configurations: access to both MLAG peers for cross-validation
- For VXLAN/EVPN: knowledge of expected VTEP count and BGP EVPN peer list

## Procedure

Follow this sequence. Core device health (Steps 1–5) applies to all deployments. Steps 6–7 are data center extensions for MLAG and VXLAN/EVPN; skip them if not configured on this device.
### Step 1: Establish Baseline Context

```
show version
show uptime
show clock
show inventory
```

Record: hostname, EOS version, uptime, hardware model, total memory, current time. Short uptime indicates a recent reload or crash; check `show reload cause` for the trigger. Note the hardware platform: Arista's 7050X, 7280R, 7500R, and 720X series have different memory and CPU profiles.

### Step 2: CPU and Memory Assessment

EOS reports CPU and memory through both EOS commands and Linux-native tools.

```
show processes top once
show version | include Free memory
```

**Linux-native alternative** (provides more granularity):

```
bash timeout 5 top -bn1 | head -n 20
bash free -m
```

`show processes top once` shows per-process CPU and memory, similar to Linux `top`. EOS CPU is typically consumed by agents, because each protocol has its own process. Key agents to watch:

- **Stp**: Spanning Tree; high CPU indicates topology instability
- **Rib**: Routing Information Base; route churn or large table operations
- **Ebra**: EOS Bridge Agent; L2 forwarding, MAC learning storms
- **IpRib**: IP RIB agent; routing table processing
- **Bgp**: BGP agent; peer negotiations, route processing
- **Fap-sobek** / **Sand**: ASIC agents; forwarding plane, hardware programming

EOS memory: `show version` includes total and free memory. Free memory below 30% of total is a warning and below 20% is critical, matching the Threshold Tables section. Linux `free -m` provides buffer/cache detail. EOS uses the Linux page cache aggressively, so "used" memory includes reclaimable cache.

### Step 3: Agent and Daemon Health

EOS runs each subsystem as an independent agent. A crashed or stuck agent affects its subsystem silently: device-level metrics may look normal while one protocol is completely inoperative.

```
show agent
show agent [name] logs | tail -n 20
show logging last 50 | include AGENT|agent|crash|restart
```

`show agent` lists all EOS agents with their PID, state, and uptime. Healthy state: all agents show `running` with uptimes matching device uptime.
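The healthy-state rule for agents can be sketched in Python. This is a minimal illustration, not a parser for `show agent`: the `AgentRecord` type, its field names, and the `slack_seconds` tolerance are all assumptions introduced here, and the records are presumed to have been extracted from the CLI output by some other means. The severity mapping follows the document's threshold table (restarted agent is a warning, crashed or not running is critical).

```python
from dataclasses import dataclass


@dataclass
class AgentRecord:
    """Hypothetical, already-parsed view of one agent (not an EOS data type)."""
    name: str
    state: str            # e.g. "running", "crashed", "not running"
    uptime_seconds: int


def classify_agent(agent: AgentRecord, device_uptime_seconds: int,
                   slack_seconds: int = 300) -> str:
    """Return 'critical', 'warning', or 'ok' for a single agent.

    slack_seconds is an assumed tolerance: agents start shortly after
    boot, so a small gap versus device uptime is not treated as a restart.
    """
    if agent.state.lower() != "running":
        return "critical"          # subsystem is down
    if agent.uptime_seconds + slack_seconds < device_uptime_seconds:
        return "warning"           # agent restarted after boot
    return "ok"


# Illustrative values only; these are not real device output.
agents = [
    AgentRecord("Bgp", "running", 86_350),
    AgentRecord("Stp", "running", 1_200),
    AgentRecord("Ebra", "crashed", 0),
]
device_uptime = 86_400
findings = {a.name: classify_agent(a, device_uptime) for a in agents}
print(findings)
```

Keeping the comparison on pre-parsed records avoids baking any particular `show agent` output layout into the check, since that layout can differ between EOS releases.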
Red flags:

- Agent with shorter uptime than device → it crashed and restarted
- Agent in `crashed` or `not running` state → subsystem is down
- Multiple agent restarts in logs → recurring instability

If an agent shows a recent restart, check its logs: `show agent [name] logs`. Common restart causes: memory exhaustion, uncaught exception, watchdog timeout.

### Step 4: Interface Health

```
show interfaces status
show interfaces counters errors
show interfaces counters discards
```

For interfaces with errors:

```
show interfaces [name]
show interfaces [name] transceiver
```

Error classification:

- CRC / FCS errors → Layer 1 (cabling, optics, SFP seating)
- Input errors without CRC → buffer overruns, MTU mismatch
- Alignment errors → duplex mismatch or L1 issue
- Output discards → egress congestion; check QoS or link capacity
- Late collisions → duplex mismatch (should not occur on modern links)

Optics DOM: `show interfaces transceiver` provides Tx/Rx power, temperature, voltage. Low Rx power (below -10 dBm for most 10G SFP+) indicates fiber or optic degradation.

### Step 5: Environment and Platform

```
show environment all
show environment cooling
show environment power
show environment temperature
show reload cause
show logging last 30 | include ERR|WARN|traceback|panic
```

Check: all temperature sensors within limits, all power supplies OK, all fans operational. `show environment all` is a comprehensive single command.

`show reload cause` explains why the device last reloaded. Expected causes: user-initiated, software upgrade. Unexpected causes: kernel panic, watchdog timeout, power loss.

Review syslog for tracebacks, panics, or agent crashes. EOS logs are also accessible via `bash cat /var/log/messages | tail -n 50`.

### Step 6: MLAG Health (Data Center Extension)

Skip this step if MLAG is not configured.
MLAG state is often the single most important health indicator in an Arista data center deployment: a degraded MLAG pair can cause traffic black-holes or asymmetric forwarding.

```
show mlag
show mlag detail
show mlag interfaces
show mlag config-sanity
```

**MLAG state assessment:**

| Field | Healthy Value | Problem Indicator |
|-------|--------------|-------------------|
| state | active | disabled, inactive |
| negotiation status | connected | connecting, disconnected |
| peer-link status | up | down, errdisabled |
| local-interface status | up | down |
| config-sanity | consistent | inconsistent |

`show mlag detail` provides:

- **Peer address** and **peer-link** interface status
- **Heartbeat** interval and status; lost heartbeats indicate control plane reachability issues between peers
- **Reload delay** timers, which are critical for preventing traffic loss during boot

`show mlag config-sanity` checks for configuration mismatches between peers. Inconsistencies can cause traffic loops or black-holes. Address any `inconsistent` result before continuing other triage.

`show mlag interfaces`: verify all MLAG interfaces show `active-full` on both sides. `active-partial` means one side is down, which reduces redundancy. `disabled` means the MLAG interface is administratively or operationally down.

### Step 7: VXLAN/EVPN Health (Data Center Extension)

Skip this step if VXLAN/EVPN is not configured. This assesses overlay fabric health for VXLAN deployments using BGP EVPN as the control plane.

```
show vxlan vtep
show vxlan address-table
show interfaces vxlan 1
show bgp evpn summary
```

**VXLAN health checks:**

- `show vxlan vtep`: verify expected VTEP count. Missing VTEPs indicate underlay reachability or BGP EVPN peering issues.
- `show interfaces vxlan 1`: VTEP source interface must be up. Flood list should match expected remote VTEPs (for flood-and-learn) or be empty (for BGP EVPN with ingress replication).
- `show vxlan address-table`: MAC learning across the overlay.
  Stale or missing entries indicate control plane issues.

**BGP EVPN peering:**

- `show bgp evpn summary`: all EVPN peers must be in Established state. Peers in Active/Connect state have transport or configuration issues.
- Compare EVPN route count to baseline; significant deviation indicates route withdrawal or redistribution problems.

For L3 VXLAN (symmetric IRB):

```
show ip route vrf [name] summary
show bgp evpn route-type ip-prefix
```

Verify: VRF route counts match expectations, no unexpected route withdrawals, EVPN type-5 routes present for inter-VRF routing.

## Threshold Tables

Reference: `references/threshold-tables.md` for detailed per-parameter thresholds.

| Parameter | Normal | Warning | Critical | Notes |
|-----------|--------|---------|----------|-------|
| CPU 5-min avg | < 40% | 40–70% | > 70% | Per-agent breakdown in `show processes top` |
| Memory free | > 30% | 20–30% | < 20% | Linux page cache is reclaimable |
| Agent state | All running | Agent restarted | Agent crashed/not running | Check `show agent` |
| Interface error rate | < 0.01% | 0.01–0.1% | > 0.1% | |
| Output discards/hr | < 100 | 100–1000 | > 1000 | |
| MLAG state | active/connected | config inconsistency | peer-link down | |
| VXLAN VTEP count | Matches expected | Missing VTEPs | Vxlan1 interface down | |
| Temperature | Within spec | 5°C of max | At/above max | Per-sensor |

## Decision Trees

### Primary Triage

```
Is the device reachable?
├── No → Check console, power, environment. Check reload cause after recovery.
└── Yes
    ├── Agent health issue?
    │   ├── Agent crashed / not running → Subsystem is down
    │   │   ├── Identify which agent → Determines affected protocol/feature
    │   │   ├── Bgp agent → BGP sessions will be down
    │   │   ├── Stp agent → STP not running, L2 loops possible
    │   │   ├── Ebra agent → L2 forwarding affected
    │   │   └── Action: check agent logs, attempt restart, escalate to TAC
    │   ├── Agent restarted recently → Subsystem had a disruption
    │   │   └── Check logs: `show agent [name] logs` for crash reason
    │   └── All agents running, uptimes match → Agent layer healthy
    │
    ├── CPU issue?
    │   ├── Identify top agent by CPU
    │   │   ├── Stp high → STP topology change, check link flaps
    │   │   ├── Rib/IpRib high → Route churn, check BGP/OSPF peers
    │   │   ├── Ebra high → MAC learning storm, check for loops
    │   │   ├── Bgp high → Large table, peer negotiation, route policy
    │   │   └── Fap/Sand high → ASIC programming backlog
    │   └── No single agent dominant → General overload, check traffic rates
    │
    ├── Memory issue?
    │   ├── Free < 20% → Check top consumers via `show processes top once`
    │   ├── Linux cache: `bash free -m` → Cache is reclaimable, not a real leak
    │   └── Steady growth → Memory leak in agent, collect data, plan reload
    │
    ├── Interface errors? → Classify error type
    │   ├── CRC/FCS errors → Layer 1 (cable, optic, SFP)
    │   ├── Output discards → QoS or congestion
    │   └── Alignment errors → Duplex mismatch
    │
    ├── MLAG issue? (if configured)
    │   ├── See MLAG State Triage tree below
    │   └── MLAG healthy → Continue
    │
    ├── VXLAN/EVPN issue? (if configured)
    │   ├── Missing VTEPs → Check underlay reachability, BGP EVPN peers
    │   ├── EVPN peer not Established → Check transport, route-map, ASN
    │   └── VTEP and peers OK → Overlay healthy
    │
    └── All within thresholds → Document clean health
```

### MLAG State Triage

```
MLAG configured?
├── No → Skip MLAG checks
└── Yes
    ├── MLAG state: active?
    │   ├── No → MLAG disabled or inactive
    │   │   └── Check: `show mlag` for reason, peer reachability
    │   └── Yes → Continue
    │
    ├── Peer-link status: up?
    │   ├── No → CRITICAL: peer-link down
    │   │   ├── Traffic orphaned on MLAG interfaces
    │   │   ├── Check: physical link, SFP, port-channel members
    │   │   └── Verify: reload-delay timers configured to prevent storms
    │   └── Yes → Continue
    │
    ├── Negotiation: connected?
    │   ├── No → Control plane issue between peers
    │   │   └── Check: heartbeat, peer IP, VLAN configuration
    │   └── Yes → Continue
    │
    ├── Config sanity: consistent?
    │   ├── No → Configuration mismatch
    │   │   ├── `show mlag config-sanity` for details
    │   │   ├── Common: VLAN mismatch, STP priority, trunk allowed VLANs
    │   │   └── Fix mismatches before trusting MLAG state
    │   └── Yes → Continue
    │
    ├── MLAG interfaces: all active-full?
    │   ├── active-partial → One side down, reduced redundancy
    │   │   └── Check: interface status on both peers
    │   ├── disabled → Interface operationally down
    │   │   └── Check: physical link, member port status
    │   └── active-full → Healthy
    │
    └── All checks pass → MLAG healthy
```

### Escalation Criteria

Escalate to senior engineer or Arista TAC when:

- CPU sustained > 90% for 15+ minutes with no identifiable cause
- Memory below 15% free with no recent change to explain consumption
- Agent in crashed state that does not recover after restart
- Multiple agents crashing or restarting within a short period
- Traceback or kernel panic in logs (`show logging`, `bash dmesg`)
- Any environmental alarm (power, fan, temperature)
- More than 3 routing neighbor state changes in the last hour
- MLAG peer-link failure with no clear physical cause
- MLAG config-sanity inconsistencies that cannot be resolved
- VXLAN overlay losing more than 10% of expected VTEPs

## Report Template

```
DEVICE HEALTH REPORT
====================
Device: [hostname]
Platform: Arista EOS
Model: [from show inventory]
Software: [EOS version]
Uptime: [uptime string]
Check Time: [timestamp]
Performed By: [operator/agent]

SUMMARY: [HEALTHY | WARNING | CRITICAL | EMERGENCY]

AGENT HEALTH:
  Total agents: [count]
  Running: [count] | Crashed: [count] | Not running: [count]
  Recent restarts: [list any agents with uptime < device uptime]

FINDINGS:
1. [Severity] [Component] - [Description]
   Domain: [System | Interface | MLAG | VXLAN | Environment]
   Observed: [metric value]
   Threshold: [normal/warning/critical range]
   Action: [recommended action]

2. ...

DC EXTENSIONS:
  MLAG: [healthy | degraded | critical | not configured]
    State: [active/inactive], Peer-link: [up/down], Config-sanity: [consistent/inconsistent]
  VXLAN/EVPN: [healthy | degraded | critical | not configured]
    VTEPs: [observed/expected], EVPN peers: [established/total]

RECOMMENDATIONS:
- [Prioritized action list]

NEXT CHECK: [date based on severity: CRITICAL 24hr, WARNING 7d, HEALTHY 30d]
```

## Troubleshooting

### Device Unresponsive to SSH

Try console access. If console is also unresponsive, check power and environment via out-of-band management. After recovery: `show reload cause` for the trigger, `bash dmesg | tail -n 50` for kernel messages, `show logging last 50` for pre-crash syslog. EOS stores crash data in `/var/log/`, accessible via `bash ls -la /var/log/`.

### Agent Crash and Recovery

If `show agent` shows an agent as `crashed` or with uptime shorter than the device:

1. Check agent logs: `show agent [name] logs | tail -n 30`
2. Check syslog: `show logging last 50 | include [agent-name]`
3. Linux-level: `bash journalctl -u [agent-name] --no-pager | tail -n 30`
4. If the agent is not running, attempt: `agent [name] shutdown` then `no agent [name] shutdown` in configuration mode (note: this is a config change)

Agent crashes are the most common EOS-specific failure mode. Collect logs before restarting, because the crash data is needed for TAC investigation.

### CPU Spikes During Health Check

EOS show commands can briefly spike CPU on the management plane. Wait 30 seconds after connecting before collecting CPU data. Use `show processes top once` rather than interactive `top` (which holds a session) for scripted checks.
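Once a settled CPU reading has been collected, it can be compared with the CPU bands from the threshold table and the escalation rule (sustained above 90% for 15 or more minutes). The sketch below is one possible way to express that in Python. The function names and the sample format, a list of `(minutes_since_start, cpu_pct)` tuples, are assumptions made for illustration; how the samples are gathered from the device is left out.

```python
def classify_cpu(cpu_5min_avg_pct: float) -> str:
    """Map a 5-minute CPU average to the document's bands:
    normal < 40%, warning 40-70%, critical > 70%."""
    if cpu_5min_avg_pct < 0:
        raise ValueError("CPU percentage cannot be negative")
    if cpu_5min_avg_pct > 70:
        return "critical"
    if cpu_5min_avg_pct >= 40:
        return "warning"
    return "normal"


def sustained_above(samples, limit_pct=90.0, min_minutes=15.0) -> bool:
    """Return True if consecutive samples stay above limit_pct for at
    least min_minutes. samples must be sorted by time."""
    run_start = None
    for minute, pct in samples:
        if pct > limit_pct:
            if run_start is None:
                run_start = minute      # a new high-CPU run begins
            if minute - run_start >= min_minutes:
                return True
        else:
            run_start = None            # run broken by a lower sample
    return False


# Illustrative samples only; these are not real device readings.
steady_high = [(0, 95.0), (5, 96.0), (10, 93.0), (15, 97.0)]
with_dip = [(0, 95.0), (5, 60.0), (10, 93.0), (15, 97.0)]

print(classify_cpu(55.0))
print(sustained_above(steady_high), sustained_above(with_dip))
```

Separating the single-reading band from the sustained check mirrors the document's distinction between a threshold finding and an escalation trigger.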
### MLAG Split-Brain Detection

If both peers report MLAG state `active` but negotiation shows `disconnected`, this is a split-brain condition. Both peers believe they are the primary. Causes: peer-link physical failure, peer-link VLAN issue, or heartbeat network partition.

Immediate action: verify physical peer-link connectivity. If peer-link is physically up but MLAG reports disconnected, check VLAN trunking on the peer-link and the heartbeat network (usually the management VRF).

### Linux-Native Diagnostics

EOS's Linux foundation means standard Linux tools work from `bash`:

- `bash top -bn1`: process-level CPU and memory (more granular than EOS `top`)
- `bash free -m`: memory with buffer/cache detail
- `bash df -h`: filesystem usage (important for /mnt/flash and /var/log)
- `bash dmesg | tail`: kernel messages (hardware errors, driver issues)
- `bash uptime`: load averages
- `bash cat /proc/meminfo`: detailed memory breakdown

These supplement EOS show commands when deeper investigation is needed. Prefix all commands with `bash` from the EOS CLI, or enter `bash` for a shell.

### Inconsistent Memory Readings

EOS's Linux page cache uses available memory for file caching. `show version` may report low free memory while actual memory pressure is low. Use `bash free -m` and check the `available` column (not `free`) for the real available memory. The `available` column accounts for reclaimable cache and buffers.
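The difference between `free` and `available` can be illustrated with a short Python sketch that reads captured `free -m` text. It assumes the procps-ng layout with an `available` column in the header; the sample text is illustrative, not taken from a real switch, and the function names are hypothetical. Applying the document's "Memory free" bands to the `available` figure is also an assumption made here.

```python
def memory_columns(free_output: str) -> dict:
    """Return the 'Mem:' row of `free -m` output as {column: MiB}."""
    lines = [ln for ln in free_output.splitlines() if ln.strip()]
    header = lines[0].split()
    if "available" not in header:
        raise ValueError("no 'available' column in this free output")
    mem_row = next(ln for ln in lines if ln.lstrip().startswith("Mem:"))
    values = mem_row.split()[1:]        # drop the 'Mem:' label
    return {name: int(value) for name, value in zip(header, values)}


def classify_memory(pct_available: float) -> str:
    """Bands from the threshold table: > 30% normal, 20-30% warning,
    < 20% critical (applied here to 'available', by assumption)."""
    if pct_available < 20:
        return "critical"
    if pct_available <= 30:
        return "warning"
    return "normal"


# Illustrative sample in procps-ng format; not real device output.
SAMPLE = """\
              total        used        free      shared  buff/cache   available
Mem:           7821        2345         512         123        4963        5012
Swap:             0           0           0
"""

cols = memory_columns(SAMPLE)
pct_free = 100.0 * cols["free"] / cols["total"]
pct_available = 100.0 * cols["available"] / cols["total"]
print(f"free: {pct_free:.1f}%  available: {pct_available:.1f}%")
print(classify_memory(pct_free), classify_memory(pct_available))
```

In this sample the `free` figure alone would look critical while the `available` figure is comfortably normal, which is the situation the section describes.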
