Install
openclaw skills install mlx-apple-silicon-mlx
MLX-powered local AI: run LLMs, Stable Diffusion, speech-to-text, and embeddings natively on Apple Silicon via MLX. Ollama uses MLX for LLM inference, mflux uses MLX for Flux image generation, DiffusionKit uses MLX for Stable Diffusion 3, and Qwen3-ASR uses MLX for transcription. One fleet router coordinates all four across Mac Studio, Mac Mini, and MacBook Pro.
Everything in this fleet runs on Apple's MLX framework. LLM inference, image generation, speech-to-text, and embeddings are all MLX-native and optimized for Apple Silicon's unified memory architecture.
| Capability | Tool | MLX usage |
|---|---|---|
| LLM inference | Ollama | MLX backend for model loading and inference on Apple Silicon |
| Image gen (Flux) | mflux | Pure MLX implementation of Flux diffusion models |
| Image gen (SD3) | DiffusionKit | MLX-native Stable Diffusion 3 and 3.5 |
| Speech-to-text | Qwen3-ASR | MLX-accelerated audio transcription |
| Embeddings | Ollama | MLX backend for embedding model inference |
One router. One framework. Four modalities. All local.
pip install ollama-herd # PyPI: https://pypi.org/project/ollama-herd/
herd # start the router (port 11435)
herd-node # run on each device — finds the router automatically
# Install image generation backends
uv tool install mflux # Flux models (~7s at 512px)
uv tool install diffusionkit # Stable Diffusion 3/3.5
All tools leverage MLX for Metal-accelerated inference on Apple Silicon's GPU cores.
Ollama runs models using MLX on Apple Silicon. With unified memory, the CPU and GPU share one address space, so model weights never have to be copied across a PCIe bus into separate GPU memory.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")
response = client.chat.completions.create(
model="llama3.3:70b",
messages=[{"role": "user", "content": "Explain MLX unified memory"}],
stream=True,
)
for chunk in response:
print(chunk.choices[0].delta.content or "", end="")
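When the streamed reply is needed as a single string (for logging or post-processing), the chunks can be accumulated instead of printed. A minimal sketch: `collect_stream` is a hypothetical helper name, and it assumes only the chunk shape used in the example above (`choices[0].delta.content`), skipping chunks that carry no choices or no text.

```python
def collect_stream(chunks):
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        # Some chunks may have an empty choices list or a None delta.
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:
            parts.append(text)
    return "".join(parts)
```

With the `response` iterator from the example above, `text = collect_stream(response)` would hold the full reply once the stream ends.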
Both mflux and DiffusionKit are pure MLX implementations — no PyTorch, no CUDA.
# Flux via mflux (fastest)
curl -o flux.png http://localhost:11435/api/generate-image \
-H "Content-Type: application/json" \
-d '{"model": "z-image-turbo", "prompt": "a neural network visualization", "width": 1024, "height": 1024}'
# Stable Diffusion 3 via DiffusionKit
curl -o sd3.png http://localhost:11435/api/generate-image \
-H "Content-Type: application/json" \
-d '{"model": "sd3-medium", "prompt": "a circuit board landscape", "width": 1024, "height": 1024, "steps": 20}'
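The same two requests can be issued from Python. This is a sketch using only the standard library: the endpoint and JSON fields are taken from the curl examples above, while `build_image_request` and `generate_image` are hypothetical helper names, and the assumption that the response body is raw image bytes is inferred from the use of `curl -o`.

```python
import json
from urllib.request import Request, urlopen

ROUTER = "http://localhost:11435"  # router address used throughout this document


def build_image_request(model, prompt, width=1024, height=1024, steps=None):
    """Build a POST request for the router's /api/generate-image endpoint."""
    payload = {"model": model, "prompt": prompt, "width": width, "height": height}
    if steps is not None:
        payload["steps"] = steps  # only sent when set, as in the SD3 example
    return Request(
        ROUTER + "/api/generate-image",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_image(path, **kwargs):
    """Send the request and write the response body to `path`.

    Assumes the endpoint returns raw image bytes (inferred from `curl -o`).
    """
    with urlopen(build_image_request(**kwargs)) as resp, open(path, "wb") as out:
        out.write(resp.read())
```

For example, `generate_image("sd3.png", model="sd3-medium", prompt="a circuit board landscape", steps=20)` would mirror the second curl call.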
Qwen3-ASR transcribes audio using MLX acceleration.
curl http://localhost:11435/api/transcribe \
-F "file=@meeting.wav" \
-F "model=qwen3-asr"
Ollama embedding models run on the MLX backend.
curl http://localhost:11435/api/embed \
-H "Content-Type: application/json" \
-d '{"model": "nomic-embed-text", "input": "Apple MLX framework for machine learning"}'
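Embeddings are usually compared with cosine similarity. The sketch below assumes the router forwards Ollama's `/api/embed` response unchanged, with a top-level `"embeddings"` list holding one vector per input; that pass-through behavior is an assumption, not something stated here. `embed` and `cosine_similarity` are hypothetical helper names.

```python
import json
import math
from urllib.request import Request, urlopen


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embed(texts, model="nomic-embed-text", base_url="http://localhost:11435"):
    """Request embeddings for a string or list of strings.

    Assumes the response JSON contains an "embeddings" list (Ollama's shape).
    """
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    req = Request(base_url + "/api/embed", data=body,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)["embeddings"]
```

With a running router, `a, b = embed(["MLX on Apple Silicon", "unified memory"])` followed by `cosine_similarity(a, b)` would give a score between -1 and 1.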
| Chip | GPU Cores | Memory | LLM Sweet Spot | Image Gen |
|---|---|---|---|---|
| M1 | 8 | 8-16GB | 3-7B models | Slow |
| M2 Pro | 19 | 32GB | 14B models | Capable |
| M3 Max | 40 | 128GB | 70B models | Fast |
| M3 Ultra | 80 | 256GB | 120B+ models | Very fast |
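The "LLM Sweet Spot" column roughly tracks how large a model's weights are relative to unified memory. A back-of-envelope sketch: weight size is approximately parameters times bits per weight divided by 8. It ignores the KV cache, activations, and the memory the OS needs. The function names and the 75% usable-memory fraction are illustrative assumptions, not figures from this project.

```python
def estimate_weights_gb(params_billion, bits_per_weight=4):
    """Approximate weight size in decimal GB: parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, memory_gb, bits_per_weight=4, usable_fraction=0.75):
    """True if the weights fit in an assumed usable share of unified memory."""
    needed = estimate_weights_gb(params_billion, bits_per_weight)
    return needed <= memory_gb * usable_fraction


print(estimate_weights_gb(70, 4))  # 35.0
print(fits(70, 128))               # True
print(fits(70, 32))                # False
```

Under these assumptions, a 70B model quantized to 4 bits needs about 35 GB for weights alone, which is why it lands on the 128GB row rather than the 32GB one.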
# Fleet overview
curl -s http://localhost:11435/fleet/status | python3 -m json.tool
# Model recommendations based on your hardware
curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool
# Health checks
curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool
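For scripted checks, the same three endpoints can be read from Python. A minimal sketch with the standard library: the paths come from the curl commands above, `fetch_json` is a hypothetical helper, and nothing is assumed about the response beyond it being JSON. The `opener` parameter exists only so the function can be exercised without a live router.

```python
import json
from urllib.request import urlopen

ROUTER = "http://localhost:11435"
ENDPOINTS = {
    "fleet": "/fleet/status",
    "recommendations": "/dashboard/api/recommendations",
    "health": "/dashboard/api/health",
}


def fetch_json(name, base_url=ROUTER, opener=urlopen, timeout=5):
    """Fetch one of the monitoring endpoints and decode its JSON body."""
    url = base_url.rstrip("/") + ENDPOINTS[name]
    with opener(url, timeout=timeout) as resp:
        return json.load(resp)
```

With the router running, `print(json.dumps(fetch_json("fleet"), indent=2))` would produce the same output as the first curl command.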
The dashboard at http://localhost:11435/dashboard shows every node, model, and queue in real time.
Ollama Herd is open source (MIT) and built on the MLX ecosystem. Contributions are welcome.
CLAUDE.md provides full architectural context. Fork, branch, and open a PR.