返回顶部
o

ollama-load-balancer

Ollama load balancer for Llama, Qwen, DeepSeek, and Mistral inference across multiple machines. Load balancing with auto-discovery via mDNS, health checks, queue management, automatic failover, retry on node failure, and zombie request cleanup. Zero configuration. 负载均衡Ollama推理分发。Balanceador de carga Ollama para inferencia distribuida.

作者: admin | 来源: ClawHub
源自
ClawHub
版本
V 1.0.4
安全检测
已通过
192
下载量
0
收藏
概述
安装方式
版本历史

ollama-load-balancer

# Ollama Load Balancer You are managing an Ollama load balancer that distributes inference requests across multiple Ollama instances with automatic discovery, health monitoring, and failover. The load balancer handles all routing decisions transparently. ## What the load balancer solves Ollama has no built-in load balancing. One machine goes down, your app gets errors. No health checks, no failover, no queue management. You're manually pointing clients at specific machines and hoping they stay up. This load balancer auto-discovers Ollama instances via mDNS, monitors their health continuously, and distributes load based on real-time scoring. The load balancer automatically retries on failure. Zero config files. Zero Docker. `pip install ollama-herd`, run two commands, and load balancing is active. ## Deploy the load balancer ```bash pip install ollama-herd herd # start the load balancer on port 11435 herd-node # start load balancer backend node on each machine ``` Package: [`ollama-herd`](https://pypi.org/project/ollama-herd/) | Repo: [github.com/geeks-accelerator/ollama-herd](https://github.com/geeks-accelerator/ollama-herd) ## Load Balancer Endpoint The load balancer runs at `http://localhost:11435`. Drop-in replacement for direct Ollama connections — same API, same model names, with load balancing built in. ```python from openai import OpenAI # Load balancer client — requests are balanced across all backend nodes load_balancer_client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed") load_balanced_response = load_balancer_client.chat.completions.create( model="llama3.3:70b", messages=[{"role": "user", "content": "Explain load balancing for LLM inference"}] ) ``` ## Load Balancer Health Monitoring ### Fleet-wide load balancer health check (15 automated checks) ```bash curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool ``` The load balancer checks: offline nodes, degraded nodes, memory pressure, underutilized nodes, model thrashing, request timeouts, error rates. Each load balancer check returns severity (info/warning/critical) and recommendations. ### Load balancer node status and metrics ```bash curl -s http://localhost:11435/fleet/status | python3 -m json.tool ``` Returns per-node: status (online/degraded/offline), CPU utilization, memory usage, loaded models with context lengths, and load balancer queue depths (pending/in-flight/done/failed). ### Load balancer queue depths ```bash curl -s http://localhost:11435/fleet/status | python3 -c " import sys, json # Load balancer queue inspection data = json.load(sys.stdin) for key, q in data.get('queues', {}).items(): print(f\"{key}: {q['pending']} pending, {q['in_flight']}/{q['max_concurrent']} in-flight\") " ``` ## Load Balancer Auto-Recovery - **Load balancer auto-retry** — if a node fails before the first response chunk, the load balancer re-scores and retries on the next-best node (up to 2 retries, configurable via `FLEET_MAX_RETRIES`) - **Load balancer zombie reaper** — background task detects in-flight requests stuck longer than 10 minutes and cleans them up - **Load balancer context protection** — strips dangerous `num_ctx` parameters that would trigger model reloads - **Load balancer VRAM-aware fallback** — routes to an already-loaded model instead of triggering a cold load - **Load balancer auto-pull** — optionally pulls missing models (disabled by default, toggle via settings) - **Load balancer holding queue** — when all nodes are busy, requests wait (up to 30s) rather than failing ## Load Balancer API Endpoints ### Models available through the load balancer ```bash # All models across the load-balanced fleet curl -s http://localhost:11435/api/tags | python3 -m json.tool # Models currently loaded in load balancer backend memory curl -s http://localhost:11435/api/ps | python3 -m json.tool # OpenAI-compatible model list via load balancer curl -s http://localhost:11435/v1/models | python3 -m json.tool ``` ### Load balancer request traces ```bash curl -s "http://localhost:11435/dashboard/api/traces?limit=20" | python3 -m json.tool ``` ### Load balancer usage statistics ```bash curl -s http://localhost:11435/dashboard/api/usage | python3 -m json.tool ``` ### Load balancer model recommendations ```bash curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool ``` ### Load balancer settings (runtime toggles) ```bash # View load balancer config curl -s http://localhost:11435/dashboard/api/settings | python3 -m json.tool # Toggle load balancer features curl -s -X POST http://localhost:11435/dashboard/api/settings \ -H "Content-Type: application/json" \ -d '{"auto_pull": false}' ``` ### Load balancer model management ```bash # View per-node model details behind the load balancer curl -s http://localhost:11435/dashboard/api/model-management | python3 -m json.tool # Pull a model to a load balancer backend node curl -s -X POST http://localhost:11435/dashboard/api/pull \ -H "Content-Type: application/json" \ -d '{"model": "llama3.3:70b", "node_id": "load-balancer-node-1"}' # Delete a model from a load balancer node curl -s -X POST http://localhost:11435/dashboard/api/delete \ -H "Content-Type: application/json" \ -d '{"model": "old-model:7b", "node_id": "load-balancer-node-1"}' ``` ### Load balancer per-app analytics ```bash curl -s http://localhost:11435/dashboard/api/apps | python3 -m json.tool ``` ## Load Balancer Dashboard Web dashboard at `http://localhost:11435/dashboard` with eight tabs: Fleet Overview, Trends, Model Insights, Apps, Benchmarks, Health, Recommendations, Settings. All load balancer data updates in real-time via Server-Sent Events. ## Load Balancer Operational Queries ### Recent load balancer failures with error details ```bash sqlite3 ~/.fleet-manager/latency.db "SELECT request_id, model, status, error_message, latency_ms/1000.0 as secs FROM request_traces WHERE status='failed' ORDER BY timestamp DESC LIMIT 10" ``` ### Load balancer retry frequency by node ```bash sqlite3 ~/.fleet-manager/latency.db "SELECT node_id, SUM(retry_count) as retries, COUNT(*) as total FROM request_traces GROUP BY node_id ORDER BY retries DESC" ``` ### Load balancer requests per hour ```bash sqlite3 ~/.fleet-manager/latency.db "SELECT CAST((timestamp % 86400) / 3600 AS INTEGER) as hour, COUNT(*) as requests FROM request_traces GROUP BY hour ORDER BY hour" ``` ### Test load balancer inference ```bash curl -s http://localhost:11435/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Test load balancing across nodes"}],"stream":false}' curl -s http://localhost:11435/api/chat \ -d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Verify load balancer routing"}],"stream":false}' ``` ## Load Balancer Guardrails - Never restart or stop the load balancer or node agents without explicit user confirmation. - Never delete or modify files in `~/.fleet-manager/` (contains load balancer latency data, traces, and logs). - Do not pull or delete models on load balancer nodes without user confirmation — downloads can be 10-100+ GB. - If a load balancer node shows as offline, report it rather than attempting to SSH into the machine. - If all load balancer nodes are saturated, suggest the user check the dashboard. ## Load Balancer Failure Handling - Connection refused → load balancer may not be running, suggest `herd` or `uv run herd` - 0 nodes online → suggest starting `herd-node` on load balancer backend devices - mDNS discovery fails → use `--router-url http://router-ip:11435` - Load balancer requests hang → check for `num_ctx` in client requests; verify with `grep "Context protection" ~/.fleet-manager/logs/herd.jsonl` - Load balancer API errors → check `~/.fleet-manager/logs/herd.jsonl`

标签

skill ai

通过对话安装

该技能支持在以下平台通过对话安装:

OpenClaw WorkBuddy QClaw Kimi Claude

方式一:安装 SkillHub 和技能

帮我安装 SkillHub 和 ollama-load-balancer-1776067996 技能

方式二:设置 SkillHub 为优先技能安装源

设置 SkillHub 为我的优先技能安装源,然后帮我安装 ollama-load-balancer-1776067996 技能

通过命令行安装

skillhub install ollama-load-balancer-1776067996

下载 Zip 包

⬇ 下载 ollama-load-balancer v1.0.4

文件大小: 3.82 KB | 发布时间: 2026-4-14 14:08

v1.0.4 最新 2026-4-14 14:08
Cross-platform support: macOS, Linux, and Windows. Updated OS metadata, descriptions, and hardware recommendations.

Archiver·手机版·闲社网·闲社论坛·羊毛社区· 多链控股集团有限公司 · 苏ICP备2025199260号-1

Powered by Discuz! X5.0   © 2024-2025 闲社网·线报更新论坛·羊毛分享社区·http://xianshe.com

p2p_official_large
返回顶部