juniper-device-health

# Juniper JunOS Device Health Check Structured triage procedure for assessing Juniper device health across MX, SRX, EX, QFX, and PTX platforms. Produces a prioritized findings report with severity classifications and recommended actions. JunOS separates Routing Engine (RE) and Packet Forwarding Engine (PFE). These are independent health domains — a healthy RE does not guarantee a healthy PFE, and vice versa. This procedure assesses both explicitly. ## When to Use - Device reported as slow, dropping traffic, or unresponsive - Scheduled health audit of Juniper routers, switches, or firewalls - Post-change verification after commits, upgrades, or ISSU - Capacity planning data collection for RE CPU, memory, and link utilization - Incident response when a Juniper device is suspected as the fault domain - RE failover event — verify mastership and standby RE state - Chassis alarm triggered — severity triage and root cause identification ## Prerequisites - SSH or console access to the device (login class with `view` permissions minimum) - JunOS 21.x or later (commands validated against JunOS 23.2+) - Network reachability to management interface or fxp0 confirmed - Awareness of the device's normal baseline (CPU, memory, traffic patterns) - For dual-RE systems: know which RE should be master under normal operations - Knowledge of recent commit history if correlating symptoms with changes ## Procedure Follow this sequence. Each step produces data for the final report. RE mastership verification is mandatory first — all subsequent data is RE-scoped. ### Step 1: Verify RE Mastership (Mandatory) On dual-RE systems, health data comes from the RE you are logged into. If you are on the backup RE, all metrics reflect the standby engine — not the active forwarding path. This step is non-negotiable. ``` show chassis routing-engine | match "Slot|Current state|Mastership" show route summary | match "Router ID" show system uptime ``` Verify: your session is on the master RE. If `Current state` shows `Backup`, switch to master: `request routing-engine login other-routing-engine`. On single-RE platforms, confirm RE is `Master` (not in a degraded state). Record: hostname, RE slot, mastership state, uptime, last reboot reason. Short uptime after an unexpected reboot — investigate immediately. ### Step 2: Alarm Analysis JunOS surfaces alarms as first-class status indicators. Check chassis and system alarms before deeper investigation — alarms may already identify the problem. ``` show chassis alarms show system alarms ``` Alarm severities: - **Major** — service-affecting condition, requires immediate attention - **Minor** — degraded but service continues, investigate promptly If alarms are present, record each alarm's class, description, and time. Major alarms take priority over all other triage — address them first. Common alarm sources: FPC offline, power supply failure, rescue config not set, license expiry, FRU removal. No alarms → proceed with systematic health assessment. ### Step 3: Routing Engine Health RE handles control plane: routing protocols, management, commit operations. ``` show chassis routing-engine show system processes extensive | match "PID|last pid|%CPU" | head 20 show task replication ``` Key fields from `show chassis routing-engine`: - **CPU utilization** — temperature, idle percentage (idle below 30% is warning) - **Memory utilization** — total and used; watch for used > 80% - **Temperature** — compare to platform-specific thresholds - **Start time** — recent RE restart indicates crash or failover - **Load averages** — 1min/5min/15min; sustained > 1.0 per core is elevated High RE CPU with top process identification: - `rpd` — routing protocol daemon: route churn, table size, peer instability - `chassisd` — chassis management: sensor polling issues, FPC communication - `snmpd` — SNMP polling storms - `mgd` — management: large config, slow commit, CLI session overload - `kmd` — key management: IKE/IPsec negotiation storms (SRX) RE CPU spikes during `commit` operations are normal (can hit 80–90% briefly). Compare against commit history: `show system commit`. ### Step 4: PFE Health PFE handles data plane forwarding independently from RE. A healthy RE with a degraded PFE means traffic is being dropped even though the control plane looks fine. ``` show chassis fpc show chassis fpc detail show pfe statistics traffic show pfe statistics error ``` `show chassis fpc`: - **State** must be `Online`. Any other state (`Present`, `Offline`, `Empty`) indicates a hardware issue or intentional deactivation. - **CPU Total** — PFE CPU utilization; above 80% is warning, above 90% critical - **Memory heap utilization** — above 80% indicates PFE memory pressure `show pfe statistics traffic`: - Compare input vs output packet counts — large delta indicates drops - Check `fabric input drops` and `local input drops` for discard sources `show pfe statistics error`: - Any non-zero error counters warrant investigation - Sustained incrementing errors (check twice 30 seconds apart) indicate active issues On MX platforms with multiple FPCs, check each FPC individually. A single degraded FPC affects only interfaces on that linecard. ### Step 5: System Resources ``` show system storage show system memory show system core-dumps show system commit | head 10 ``` **Storage:** JunOS partitions can fill from logs, core dumps, or failed upgrades. Any partition above 85% used is warning. `/var` filling above 90% can prevent commits and logging. **Memory:** `show system memory` gives kernel-level view. Compare to RE memory from Step 3 for consistency. Sustained growth without corresponding config changes suggests a memory leak. **Core dumps:** Presence of recent core files (within last 7 days) indicates process crashes. Record the process name and timestamp — this is JTAC-relevant data. **Commit history:** Recent commits correlate with symptoms. A device that was healthy before a commit and unhealthy after has an obvious investigation path. ### Step 6: Interface and Routing Health ``` show interfaces terse | match "down|err" show interfaces extensive [name] | match "error|drop|CRC|carrier" show route summary show bgp summary show ospf neighbor show isis adjacency ``` For each interface with errors: - CRC errors → Layer 1 (cabling, optics, SFP) - Input errors without CRC → buffer overruns, MTU mismatch - Output drops → congestion or policer drops - Carrier transitions → link flap, check SFP DOM: `show interfaces diagnostics optics [name]` Routing: verify expected neighbor count, all adjacencies in Established/Full state. BGP prefix counts deviating > 10% from baseline indicate route churn. ### Step 7: Environment ``` show chassis environment show chassis temperature-thresholds show chassis power show chassis fan ``` Check: all temperature sensors within thresholds, all power supplies OK, all fans operational. Any environmental alarm maps directly to Major alarm severity. On platforms with redundant RE: check both RE temperatures. A standby RE running hot may indicate cooling issues even if master RE temperature is normal. ## Threshold Tables Reference: `references/threshold-tables.md` for detailed per-parameter thresholds. | Parameter | Normal | Warning | Critical | Notes | |-----------|--------|---------|----------|-------| | RE CPU idle | > 40% | 20–40% | < 20% | Spikes during commit are normal | | RE memory used | < 75% | 75–85% | > 85% | | | RE load avg (1min) | < 0.7/core | 0.7–1.5/core | > 1.5/core | Scale by RE core count | | PFE CPU | < 60% | 60–80% | > 80% | Per-FPC | | PFE heap used | < 70% | 70–85% | > 85% | Per-FPC | | Storage partition | < 80% | 80–90% | > 90% | /var critical for commits | | Interface error rate | < 0.01% | 0.01–0.1% | > 0.1% | | | Output drops/hr | < 100 | 100–1000 | > 1000 | | | Chassis alarm | None | Minor present | Major present | | | Temperature | Within spec | 5°C of max | At/above max | Per-sensor | ## Decision Trees ### Primary Triage ``` Is the device reachable? ├── No → Check console, power, environment. Collect core dumps after recovery. └── Yes ├── Verify RE mastership → On master RE? │ ├── No → Switch to master RE, restart triage │ └── Yes → Continue │ ├── Chassis/system alarms present? │ ├── Major alarm → Address immediately │ │ ├── FPC offline → show chassis fpc detail, check PFE │ │ ├── Power supply failure → show chassis power, environment │ │ ├── RE failover → show chassis routing-engine, check standby │ │ └── Other Major → Collect alarm detail, escalate │ ├── Minor alarm → Note for report, continue triage │ └── No alarms → Continue systematic assessment │ ├── RE CPU issue? │ ├── rpd high → Route churn: check BGP/OSPF/ISIS neighbors │ ├── chassisd high → FPC/sensor communication: check chassis fpc │ ├── snmpd high → Polling storm: check SNMP community/clients │ ├── mgd high → Commit or CLI overload: check system commit │ ├── kmd high → IKE storms (SRX): check IPsec SA count │ └── Recent commit correlates → Rollback candidate │ ├── PFE issue? (RE healthy but traffic drops) │ ├── FPC not Online → Hardware issue, check fpc detail │ ├── PFE CPU > 80% → Forwarding overload │ │ └── Check traffic rates, filter complexity, NH resolution │ ├── PFE drops incrementing → Identify drop category │ │ ├── Fabric drops → Linecard-to-fabric issue │ │ ├── Local drops → Punt/exception path overload │ │ └── Discard → Filter or policer drops (may be expected) │ └── PFE memory pressure → Session/route table exhaustion │ ├── Memory issue? │ ├── RE memory > 85% → Identify top consumers via processes │ ├── Storage > 90% → Clean logs, core dumps, old images │ │ └── /var full → Immediate: prevents commits and logging │ └── Core dumps present → Process crash, collect for JTAC │ ├── Interface errors? → Classify error type │ ├── CRC/input errors → Layer 1 (cable, optic, SFP) │ ├── Output drops → QoS policer or congestion │ └── Carrier transitions → Link flap, check optics DOM │ └── All within thresholds → Document clean health ``` ### Alarm Severity Triage ``` Alarm detected ├── Major alarm? │ ├── FPC Offline │ │ ├── Single FPC → Affects only interfaces on that linecard │ │ ├── check: show chassis fpc detail [slot] │ │ └── Action: power cycle FPC if transient, RMA if persistent │ ├── Power Supply Failure │ │ ├── Redundancy lost → Immediate replacement │ │ └── Both PSUs failed → Emergency, device at risk │ ├── RE Failover │ │ ├── Was this planned? → Verify new master is healthy │ │ └── Unplanned → Investigate old master: show chassis routing-engine │ └── Other Major → Collect detail, open JTAC case │ └── Minor alarm? ├── Rescue config not set → `request system configuration rescue save` ├── License expiry → Check feature impact, plan renewal ├── FRU removal → Verify intentional, document └── Other Minor → Note in report, monitor ``` ### Escalation Criteria Escalate to senior engineer or JTAC when: - RE CPU sustained above 90% for 15+ minutes with no identifiable cause - RE memory above 90% used with no recent config change - PFE offline or in non-Online state after power cycle attempt - Core dumps present from critical processes (rpd, chassisd, pfed) - Major chassis alarm with no clear remediation - Multiple FPC failures or fabric errors - RE failover loop (multiple failovers in short period) - Any environmental alarm (power, fan, temperature) - More than 3 routing neighbor state changes in the last hour ## Report Template ``` DEVICE HEALTH REPORT ==================== Device: [hostname] Platform: JunOS Model: [from show chassis hardware] Software: [JunOS version] RE Slot: [RE0 or RE1] Mastership: [Master — verified] Uptime: [uptime string] Check Time: [timestamp] Performed By: [operator/agent] SUMMARY: [HEALTHY | WARNING | CRITICAL | EMERGENCY] ALARMS: Chassis: [count] ([Major: n, Minor: n] or None) System: [count] ([Major: n, Minor: n] or None) Details: [alarm descriptions if present] FINDINGS: 1. [Severity] [Component] — [Description] Domain: [RE | PFE | Chassis | Interface | Routing] Observed: [metric value] Threshold: [normal/warning/critical range] Action: [recommended action] 2. ... RE/PFE STATUS: RE: [healthy/degraded/critical] — CPU idle: [n]%, Memory: [n]% used PFE: [per-FPC state summary] Dual-RE: [master/backup state, or N/A for single-RE] RECOMMENDATIONS: - [Prioritized action list] NEXT CHECK: [date based on severity — CRITICAL: 24hr, WARNING: 7d, HEALTHY: 30d] ``` ## Troubleshooting ### Device Unresponsive to SSH Try console access. If console is also unresponsive, check power and environment via out-of-band management (craft interface, console server). After recovery: `show system core-dumps`, `show chassis routing-engine` for reboot reason, `show log messages | match "kernel|panic|watchdog"`. ### Logged Into Backup RE If `show chassis routing-engine` shows your RE as `Backup`, you are collecting standby metrics. Switch to master: `request routing-engine login other-routing-engine`. If master RE is unreachable from backup, this indicates master RE failure — check `show chassis routing-engine` from backup for master's last known state. ### RE CPU Spikes During Commit JunOS RE CPU can spike to 80–90% during commit operations. This is expected behavior — the config daemon and rpd both consume CPU during commit processing. Verify: `show system commit` to confirm a recent commit, then wait 2–3 minutes and re-check. Sustained high CPU after commit settles indicates a real problem. ### PFE Drops With Healthy RE The RE (control plane) and PFE (data plane) are independent. High PFE drops with a normal RE means traffic is being discarded at the forwarding level. Check: `show pfe statistics traffic` for drop categories, `show chassis fpc detail` for PFE CPU and memory. Common causes: filter/policer drops (may be expected), next-hop resolution failures, PFE memory exhaustion from large tables. ### Storage Full Preventing Commits If `/var` is above 95%, commits will fail. Clear space: `request system storage cleanup` — removes old logs, core dumps, and temporary files. If that is insufficient: `show system storage` to identify the largest consumers, then selectively remove old software images or rotated log files. ### Dual-RE Failover Investigation After an RE failover: verify new master is healthy (Steps 1–3), then investigate the old master. From the new master: `show chassis routing-engine` shows both REs' state. Check `show log messages | match "mastership|failover|switchover"` for the event trigger. Common causes: RE crash (core dump present), watchdog timeout, manual switchover, GRES/NSR failure.

juniper-device-health

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载 Zip 包

juniper-device-health