Ollama Herd. Local Ollama inference.
v1.2.0 — Ollama Herd is a multimodal model router that herds your Ollama LLMs into one smart endpoint. Route Llama, Qwen, DeepSeek, Phi, Mistral, and more.
Like a lobster shell, security has layers — review code before you run it.
Ollama — Herd Your Ollama LLMs Into One Endpoint
You have Ollama running on multiple machines. This skill gives you one Ollama endpoint that routes every request to the best available device automatically. No more hardcoding IPs, no more manual load balancing, no more "which machine has that model loaded?"
Setup Ollama Herd
pip install ollama-herd # install the Ollama router
herd # start the Ollama router on port 11435
herd-node # run on each machine with Ollama installed
Now point everything at http://localhost:11435 instead of http://localhost:11434. Same Ollama API, same Ollama models, smarter Ollama routing.
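Because the API is unchanged, the official Ollama Python client can be pointed at the router too. A minimal sketch, assuming the `ollama` Python package is installed and the router is running locally (the package is not part of this skill; it is shown only for illustration):
# Official Ollama Python client pointed at the Herd router instead of a single daemon
from ollama import Client

herd = Client(host="http://localhost:11435")  # router port, not 11434
reply = herd.chat(
    model="llama3.3:70b",  # any model available somewhere in the fleet
    messages=[{"role": "user", "content": "Hello from the herd"}],
)
print(reply["message"]["content"])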
Package: ollama-herd | Repo: github.com/geeks-accelerator/ollama-herd
Use your Ollama models through the fleet
OpenAI SDK (drop-in Ollama routing)
# ollama_openai_client — route Ollama requests via OpenAI SDK
from openai import OpenAI
ollama_client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")
ollama_response = ollama_client.chat.completions.create(
    model="llama3.3:70b",  # any Ollama model
    messages=[{"role": "user", "content": "Hello from Ollama"}],
    stream=True,
)
for chunk in ollama_response:
    print(chunk.choices[0].delta.content or "", end="")
Ollama API (same as before, different port)
# Ollama chat — routed through the Ollama fleet
curl http://localhost:11435/api/chat -d '{
"model": "qwen3:235b",
"messages": [{"role": "user", "content": "Hello via Ollama Herd"}],
"stream": false
}'
# List all Ollama models across all machines
curl http://localhost:11435/api/tags
# Ollama models currently in GPU memory
curl http://localhost:11435/api/ps
# Ollama embeddings
curl http://localhost:11435/api/embeddings -d '{
"model": "nomic-embed-text",
"prompt": "Ollama embedding search query"
}'
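For RAG pipelines in Python, the same embeddings endpoint can be called with a plain HTTP request. A small sketch using the third-party `requests` library, assuming the response follows the standard Ollama /api/embeddings shape with an "embedding" field:
# Query the routed embeddings endpoint from Python
import requests

resp = requests.post(
    "http://localhost:11435/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "Ollama embedding search query"},
    timeout=60,
)
resp.raise_for_status()
vector = resp.json()["embedding"]  # standard Ollama /api/embeddings response field
print(f"{len(vector)} dimensions")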
What the Ollama router does
When a request comes in, the router scores every online Ollama node on seven signals:
- Thermal — is the model already loaded in GPU memory? (+50 for hot)
- Memory fit — how much headroom does the node have?
- Queue depth — how many requests are already waiting?
- Wait time — estimated latency based on recent history
- Role affinity — large models prefer big machines
- Availability — is the node reliably available?
- Context fit — does the loaded context window cover the request?
The highest-scoring node handles the request. If it fails, the router automatically retries on the next best node.
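The exact scoring is internal to the router, but the shape of the decision is easy to sketch. A hypothetical illustration (field names and weights are invented; only the +50 hot-model bonus comes from the list above):
# Hypothetical sketch of Herd-style routing: score every node, try the best first,
# and fall back to the next best one if a node fails before the first chunk.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    model_is_hot: bool     # model already loaded in GPU memory
    free_vram_gb: float    # memory headroom
    queue_depth: int       # requests currently waiting
    est_wait_s: float      # latency estimate from recent history
    role_affinity: float   # big machines score higher for big models
    availability: float    # 0..1 reliability
    ctx_fits: bool         # loaded context window covers the request

def score(n: Node) -> float:
    s = 50.0 if n.model_is_hot else 0.0   # thermal bonus (documented above)
    s += n.free_vram_gb                   # memory fit   (invented weight)
    s -= 5.0 * n.queue_depth              # queue depth  (invented weight)
    s -= n.est_wait_s                     # wait time    (invented weight)
    s += n.role_affinity                  # role affinity (invented weight)
    s += 10.0 * n.availability            # availability (invented weight)
    s += 5.0 if n.ctx_fits else -20.0     # context fit  (invented weight)
    return s

def route(nodes: list[Node], send) -> str:
    # send(node) forwards the request and raises ConnectionError on failure
    for node in sorted(nodes, key=score, reverse=True):
        try:
            return send(node)
        except ConnectionError:
            continue  # auto-retry on the next best node
    raise RuntimeError("no Ollama node could serve the request")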
Supported Ollama models
Any model that runs on Ollama works through the fleet. Popular choices:
| Model | Sizes | Best for |
|---|---|---|
| llama3.3 | 8B, 70B | General-purpose inference |
| qwen3 | 0.6B–235B | Multilingual reasoning |
| qwen3.5 | 0.8B–397B | Latest-generation model |
| deepseek-v3 | 671B (37B active) | GPT-4o-class alternative |
| deepseek-r1 | 1.5B–671B | Reasoning (o3-style) |
| phi4 | 14B | Small, fast model |
| mistral | 7B | Fast; strong on European languages |
| gemma3 | 1B–27B | Google's open model |
| codestral | 22B | Code generation |
| qwen3-coder | 30B (3.3B active) | Agentic coding |
| nomic-embed-text | 137M | Embeddings for RAG |
Ollama Resilience features
- Auto-retry — re-routes to the next best node on failure (before the first chunk)
- VRAM-aware fallback — routes to a loaded model in the same category instead of cold-loading one
- Context protection — prevents `num_ctx` from triggering expensive model reloads (see the sketch after this list)
- Zombie reaper — cleans up stuck in-flight requests
- Auto-pull — downloads missing Ollama models to the best node automatically
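How the router implements context protection is not documented here, but the idea can be sketched. A hypothetical illustration, assuming the router tracks the `num_ctx` each node currently has loaded:
# Hypothetical sketch of context protection: reuse the already-loaded num_ctx
# whenever the request fits in it, so Ollama is not forced to reload the model.
def protect_num_ctx(requested: int | None, loaded: int) -> int:
    if requested is not None and requested > loaded:
        # The prompt genuinely needs a larger window; a reload (or a bigger node)
        # is unavoidable, so pass the request through.
        return requested
    return loaded  # fits in the loaded window, so keep it and avoid a reload

options = {"num_ctx": protect_num_ctx(requested=4096, loaded=8192)}  # -> keeps 8192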
Also available via Ollama Herd
The same Ollama fleet router handles three more workloads:
Ollama Image generation
curl -o image.png http://localhost:11435/api/generate-image \
-H "Content-Type: application/json" \
-d '{"model":"z-image-turbo","prompt":"a sunset via Ollama Herd","width":1024,"height":1024,"steps":4}'
Ollama Speech-to-text
curl http://localhost:11435/api/transcribe -F "audio=@recording.wav"
Ollama Embeddings
curl http://localhost:11435/api/embeddings -d '{"model":"nomic-embed-text","prompt":"Ollama embedding text"}'
Ollama Dashboard
http://localhost:11435/dashboard — 8 tabs: Fleet Overview, Trends, Model Insights, Apps, Benchmarks, Health, Recommendations, Settings. Real-time queue visibility with [TEXT], [IMAGE], [STT], [EMBED] badges.
Ollama Request tagging
Track per-project Ollama usage:
ollama_response = ollama_client.chat.completions.create(
    model="llama3.3:70b",  # Ollama model
    messages=messages,
    extra_body={"metadata": {"tags": ["my-ollama-project", "reasoning"]}},
)
Full documentation: github.com/geeks-accelerator/ollama-herd
Ollama Guardrails
- Never restart the router or node agents without user confirmation.
- Never delete or modify files in `~/.fleet-manager/` (Ollama data).
- Never pull or delete Ollama models without user confirmation.