MLX Local AI — LLM, Image Gen, STT, Embeddings Native on Apple Silicon
MLX-powered local AI — run LLMs, Stable Diffusion, speech-to-text, and embeddings natively on Apple Silicon via MLX. Ollama uses MLX for LLM inference, mflux...
MLX Local AI — Apple's ML Framework Powers Your Entire Fleet
Everything in this fleet runs on Apple's MLX framework. LLM inference, image generation, speech-to-text, embeddings — all MLX-native, all optimized for Apple Silicon's unified memory architecture.
The MLX stack
| Capability | Tool | MLX usage |
|---|---|---|
| LLM inference | Ollama | MLX backend for model loading and inference on Apple Silicon |
| Image gen (Flux) | mflux | Pure MLX implementation of Flux diffusion models |
| Image gen (SD3) | DiffusionKit | MLX-native Stable Diffusion 3 and 3.5 |
| Speech-to-text | Qwen3-ASR | MLX-accelerated audio transcription |
| Embeddings | Ollama | MLX backend for embedding model inference |
One router. One framework. Four modalities. All local.
Setup
pip install ollama-herd # PyPI: https://pypi.org/project/ollama-herd/
herd # start the router (port 11435)
herd-node # run on each device — finds the router automatically
# Install image generation backends
uv tool install mflux # Flux models (~7s at 512px)
uv tool install diffusionkit # Stable Diffusion 3/3.5
All tools leverage MLX for Metal-accelerated inference on Apple Silicon's GPU cores.
LLM inference via MLX
Ollama runs models using MLX on Apple Silicon. Unified memory means the entire model stays in one address space — no PCIe bottleneck.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Explain MLX unified memory"}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")
Image generation via MLX
Both mflux and DiffusionKit are pure MLX implementations — no PyTorch, no CUDA.
# Flux via mflux (fastest)
curl -o flux.png http://localhost:11435/api/generate-image \
-H "Content-Type: application/json" \
-d '{"model": "z-image-turbo", "prompt": "a neural network visualization", "width": 1024, "height": 1024}'
# Stable Diffusion 3 via DiffusionKit
curl -o sd3.png http://localhost:11435/api/generate-image \
-H "Content-Type: application/json" \
-d '{"model": "sd3-medium", "prompt": "a circuit board landscape", "width": 1024, "height": 1024, "steps": 20}'
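The same image requests can be issued from Python. The following is a minimal sketch using only the standard library. The endpoint, model names, and JSON fields come from the curl examples above; the helper names `build_image_request` and `generate_image` are hypothetical, and the sketch assumes the router returns the image bytes directly in the response body, as the `curl -o` usage suggests.

```python
import json
import urllib.request

ROUTER_URL = "http://localhost:11435"  # router address from the Setup section


def build_image_request(model, prompt, width=1024, height=1024, steps=None):
    """Build (but do not send) a POST request for /api/generate-image."""
    payload = {"model": model, "prompt": prompt, "width": width, "height": height}
    if steps is not None:
        # Only included when explicitly set, as in the SD3 curl example.
        payload["steps"] = steps
    return urllib.request.Request(
        f"{ROUTER_URL}/api/generate-image",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_image(out_path, **kwargs):
    """Send the request and write the response body to out_path.

    Assumption: the response body is the image itself.
    Returns the number of bytes written.
    """
    request = build_image_request(**kwargs)
    with urllib.request.urlopen(request, timeout=600) as response:
        data = response.read()
    with open(out_path, "wb") as f:
        f.write(data)
    return len(data)
```

With a router running, `generate_image("sd3.png", model="sd3-medium", prompt="a circuit board landscape", steps=20)` would mirror the second curl command. Keeping request construction separate from sending makes the payload easy to inspect without a live fleet.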
Speech-to-text via MLX
Qwen3-ASR transcribes audio using MLX acceleration.
curl http://localhost:11435/api/transcribe \
-F "file=@meeting.wav" \
-F "model=qwen3-asr"
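For callers that prefer Python over curl, the upload can be sketched with the standard library by encoding the form as `multipart/form-data` by hand. The field names `file` and `model` come from the curl command above; the helper names `encode_multipart` and `transcribe` are hypothetical. The response format is not documented here, so this sketch simply returns the raw response text.

```python
import uuid
import urllib.request
from pathlib import Path

ROUTER_URL = "http://localhost:11435"  # router address from the Setup section


def encode_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        header = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        parts.append(f"{header}{value}\r\n".encode("utf-8"))
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(file_header.encode("utf-8") + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))  # closing delimiter
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(audio_path, model="qwen3-asr"):
    """POST an audio file to /api/transcribe and return the raw response text."""
    path = Path(audio_path)
    body, content_type = encode_multipart(
        {"model": model}, "file", path.name, path.read_bytes()
    )
    request = urllib.request.Request(
        f"{ROUTER_URL}/api/transcribe",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        return response.read().decode("utf-8")
```

Calling `transcribe("meeting.wav")` against a running router would correspond to the curl command above.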
Embeddings via MLX
Ollama embedding models run on the MLX backend.
curl http://localhost:11435/api/embed \
-d '{"model": "nomic-embed-text", "input": "Apple MLX framework for machine learning"}'
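Embeddings are usually compared with cosine similarity. The sketch below requests a vector from the endpoint shown above and compares two vectors in plain Python. It assumes the router passes through Ollama's `/api/embed` response shape (`{"embeddings": [[...]]}`); the helper names `embed` and `cosine_similarity` are hypothetical.

```python
import json
import math
import urllib.request

ROUTER_URL = "http://localhost:11435"  # router address from the Setup section


def embed(text, model="nomic-embed-text"):
    """Request an embedding vector for text from /api/embed.

    Assumption: the response JSON has an "embeddings" list of vectors.
    """
    request = urllib.request.Request(
        f"{ROUTER_URL}/api/embed",
        data=json.dumps({"model": model, "input": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)["embeddings"][0]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

With a router running, `cosine_similarity(embed("Apple MLX framework"), embed("machine learning on Mac"))` would give a rough measure of how related the two phrases are.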
Why MLX matters for local AI
- Unified memory: model weights, activations, and the KV cache share one memory pool, so there is no CPU-to-GPU transfer overhead.
- Metal acceleration: MLX runs its kernels through Metal on Apple Silicon GPU cores (up to 80 on the M3 Ultra).
- Lazy evaluation: MLX defers computation until a result is actually needed, so unused intermediates are never materialized and memory pressure stays lower.
- Dynamic shapes: computation graphs are built dynamically, so changing input sizes does not trigger a recompilation (unlike some CUDA frameworks).
- Apple-maintained: MLX is developed by Apple's machine learning research team and targets Apple Silicon specifically.
Fleet performance on Apple Silicon
| Chip | GPU Cores | Memory | LLM Sweet Spot | Image Gen |
|---|---|---|---|---|
| M1 | 8 | 8-16GB | 3-7B models | Slow |
| M2 Pro | 19 | 32GB | 14B models | Capable |
| M3 Max | 40 | 128GB | 70B models | Fast |
| M3 Ultra | 80 | 256GB | 120B+ models | Very fast |
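The "LLM Sweet Spot" column roughly follows from weight-memory arithmetic: weights occupy about (parameters × bits per weight ÷ 8) bytes. The sketch below is a back-of-the-envelope estimate, not a benchmark. The 4-bit default, the 20% overhead for KV cache and activations, and the 75% usable-memory fraction are illustrative assumptions rather than figures from this project, and both function names are hypothetical.

```python
def weight_memory_gb(params_billion, bits_per_weight=4):
    """Approximate weight size in GB: parameters x bits per weight / 8 bits per byte."""
    return params_billion * bits_per_weight / 8  # billions of bytes, i.e. decimal GB


def fits_in_memory(params_billion, memory_gb, bits_per_weight=4,
                   overhead=0.2, usable_fraction=0.75):
    """Rough check of whether a model fits in unified memory.

    overhead: extra fraction for KV cache and activations (illustrative assumption).
    usable_fraction: share of RAM assumed available to the model (illustrative assumption).
    """
    required = weight_memory_gb(params_billion, bits_per_weight) * (1 + overhead)
    return required <= memory_gb * usable_fraction


for params, memory in [(7, 16), (14, 32), (70, 32), (70, 128)]:
    verdict = "fits" if fits_in_memory(params, memory) else "does not fit"
    print(f"{params}B at 4-bit on {memory}GB: {verdict}")
```

Under these assumptions a 70B model at 4-bit needs about 35 GB for weights alone, which is why it lands on the 128GB row rather than the 32GB one.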
Monitor your MLX fleet
# Fleet overview
curl -s http://localhost:11435/fleet/status | python3 -m json.tool
# Model recommendations based on your hardware
curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool
# Health checks
curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool
Dashboard at http://localhost:11435/dashboard — see every node, every model, every queue in real time.
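For scripted monitoring, the three endpoints above can be polled from Python. This is a minimal sketch: the paths are taken from the curl commands, while the helper names (`endpoint_url`, `fetch_json`, `snapshot`) are hypothetical. The response schemas are not documented here, so the code treats each response as opaque JSON.

```python
import json
import urllib.request

ROUTER_URL = "http://localhost:11435"  # router address from the Setup section

# Paths taken from the curl commands above.
ENDPOINTS = {
    "fleet": "/fleet/status",
    "recommendations": "/dashboard/api/recommendations",
    "health": "/dashboard/api/health",
}


def endpoint_url(name, base_url=ROUTER_URL):
    """Return the full URL for a named monitoring endpoint."""
    return base_url.rstrip("/") + ENDPOINTS[name]


def fetch_json(name, base_url=ROUTER_URL, timeout=10):
    """GET one endpoint and parse the body as JSON."""
    with urllib.request.urlopen(endpoint_url(name, base_url), timeout=timeout) as response:
        return json.load(response)


def snapshot(base_url=ROUTER_URL):
    """Collect every endpoint, recording per-endpoint errors instead of aborting."""
    results = {}
    for name in ENDPOINTS:
        try:
            results[name] = fetch_json(name, base_url)
        except (OSError, ValueError) as exc:  # network failures or invalid JSON
            results[name] = {"error": str(exc)}
    return results
```

Running `print(json.dumps(snapshot(), indent=2))` on a machine that can reach the router would produce output comparable to the three curl commands combined.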
Full documentation
- Agent Setup Guide — all 4 model types
- Image Generation Guide — 3 backends
- API Reference
Contribute
Ollama Herd is open source (MIT) and built on the MLX ecosystem. We welcome contributions:
- Star on GitHub — helps others discover the project
- Open an issue — bug reports, feature requests, questions
- AI agents welcome: CLAUDE.md provides full architectural context. Fork, branch, PR.
- 412 tests, async Python, runs in under 40 seconds. Hard to break things.
Guardrails
- No automatic downloads — all model pulls require explicit user confirmation.
- Model deletion requires explicit user confirmation.
- All requests stay local — no data leaves your network.
- Never delete or modify files in ~/.fleet-manager/.
