MLX Local AI — LLM, Image Gen, STT, Embeddings Native on Apple Silicon

MLX-powered local AI — run LLMs, Stable Diffusion, speech-to-text, and embeddings natively on Apple Silicon via MLX. Ollama uses MLX for LLM inference, mflux...

MIT-0 · Free to use, modify, and redistribute. No attribution required.
0 current installs · 0 all-time installs
by Twin Geeks (@twinsgeeks)

Security Scan
VirusTotal: Pending
OpenClaw: Benign (medium confidence)
Purpose & Capability
The name/description (run MLX-powered LLMs, image gen, STT, embeddings on Apple Silicon) matches the runtime instructions (start a local fleet router, install 'ollama-herd', run 'herd'/'herd-node', call localhost:11435 APIs). One minor inconsistency: the docs reference installing tools via an 'uv tool' CLI but 'uv' is not declared in the skill's required binaries; python3/pip are listed as optional but the examples assume pip is available. Otherwise the requested artifacts (local binaries, no external credentials) are proportional to the stated purpose.
Instruction Scope
SKILL.md stays focused on running a local fleet and calling a localhost API. However there are two contradictions to surface: (1) the text says 'No automatic downloads' and 'All requests stay local', but the instructions include user-run 'pip install ollama-herd' and 'uv tool install ...' which will download packages/models from external registries if executed; (2) the fleet 'finds the router automatically' implies some local network discovery/peer communication — users should confirm whether services bind only to localhost or to external interfaces. The instructions do not ask the agent to read unrelated files or exfiltrate secrets.
Install Mechanism
This is instruction-only (no install spec embedded in the skill). The runtime examples tell the user to run pip and 'uv tool' to install components; that is a user-driven install process rather than an automatic install performed by the skill bundle. Because nothing in the skill package itself is downloaded or written, the install mechanism risk from the registry package is low — but the user-facing instructions will cause external downloads when followed.
Credentials
The skill declares no required environment variables or credentials and the examples use localhost endpoints and a placeholder API key. No secrets or unrelated service tokens are requested. This is proportionate to a local-only ML fleet manager.
Persistence & Privilege
The skill is not always-on and does not declare elevated privileges. Metadata references config/log paths under ~/.fleet-manager, which is reasonable for a fleet manager. The skill does not request to modify other skills or system-wide agent settings.
Assessment
This skill appears to be what it claims: a macOS-local MLX fleet manager. Before you install or run the commands, check these points:
1. Verify the upstream code (the PyPI 'ollama-herd' package and the GitHub repo) yourself; read the package source before pip installing.
2. Confirm how the 'herd' service binds network interfaces (localhost-only vs 0.0.0.0) and whether node discovery uses LAN broadcasts. If you only want local access, keep the service bound to localhost and enable firewall rules.
3. Expect large external downloads when you run 'uv tool install' or fetch models; confirm the download prompts and origins.
4. The SKILL.md says 'No automatic downloads', but the examples show user-run installs that download code and models. Treat that statement as meaning 'no silent background downloads' rather than 'no downloads ever'.
5. If you plan to run this on a multi-device fleet, audit how authentication and discovery are handled to avoid exposing model endpoints to untrusted machines.
If you want stronger assurance, ask the skill author for a signed release URL, or inspect the pip package contents and startup scripts before running them.

Like a lobster shell, security has layers — review code before you run it.

Current version: v1.0.0

Tags: apple-mlx · apple-silicon · diffusionkit · latest · local-ai · mac-mini · mac-studio · metal · mflux · mlx · mlx-inference · ollama · on-device · unified-memory

License

MIT-0
Free to use, modify, and redistribute. No attribution required.

Runtime requirements

Clawdis
OS: macOS
Any bin: curl, wget

SKILL.md

MLX Local AI — Apple's ML Framework Powers Your Entire Fleet

Everything in this fleet runs on Apple's MLX framework. LLM inference, image generation, speech-to-text, embeddings — all MLX-native, all optimized for Apple Silicon's unified memory architecture.

The MLX stack

| Capability       | Tool         | MLX usage                                                     |
|------------------|--------------|---------------------------------------------------------------|
| LLM inference    | Ollama       | MLX backend for model loading and inference on Apple Silicon  |
| Image gen (Flux) | mflux        | Pure MLX implementation of Flux diffusion models              |
| Image gen (SD3)  | DiffusionKit | MLX-native Stable Diffusion 3 and 3.5                         |
| Speech-to-text   | Qwen3-ASR    | MLX-accelerated audio transcription                           |
| Embeddings       | Ollama       | MLX backend for embedding model inference                     |

One router. One framework. Four modalities. All local.

Setup

pip install ollama-herd    # PyPI: https://pypi.org/project/ollama-herd/
herd                       # start the router (port 11435)
herd-node                  # run on each device — finds the router automatically

# Install image generation backends
uv tool install mflux           # Flux models (~7s at 512px)
uv tool install diffusionkit    # Stable Diffusion 3/3.5

All tools leverage MLX for Metal-accelerated inference on Apple Silicon's GPU cores.

LLM inference via MLX

Ollama runs models using MLX on Apple Silicon. Unified memory means the entire model stays in one address space — no PCIe bottleneck.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Explain MLX unified memory"}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")

Image generation via MLX

Both mflux and DiffusionKit are pure MLX implementations — no PyTorch, no CUDA.

# Flux via mflux (fastest)
curl -o flux.png http://localhost:11435/api/generate-image \
  -H "Content-Type: application/json" \
  -d '{"model": "z-image-turbo", "prompt": "a neural network visualization", "width": 1024, "height": 1024}'

# Stable Diffusion 3 via DiffusionKit
curl -o sd3.png http://localhost:11435/api/generate-image \
  -H "Content-Type: application/json" \
  -d '{"model": "sd3-medium", "prompt": "a circuit board landscape", "width": 1024, "height": 1024, "steps": 20}'

Speech-to-text via MLX

Qwen3-ASR transcribes audio using MLX acceleration.

curl http://localhost:11435/api/transcribe \
  -F "file=@meeting.wav" \
  -F "model=qwen3-asr"

Embeddings via MLX

Ollama embedding models run on the MLX backend.

curl http://localhost:11435/api/embed \
  -d '{"model": "nomic-embed-text", "input": "Apple MLX framework for machine learning"}'

Why MLX matters for local AI

  • Unified memory — model weights, activations, and KV cache share one memory pool. No CPU-GPU transfer overhead.
  • Metal acceleration — MLX compiles to Metal shaders that run on Apple Silicon GPU cores (up to 80 on M3/M4 Ultra).
  • Lazy evaluation — MLX only computes what's needed, reducing memory pressure.
  • Dynamic shapes — no recompilation when input sizes change (unlike some CUDA frameworks).
  • Apple-maintained — MLX is developed by Apple's ML research team, optimized for every chip generation.
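The lazy-evaluation point is worth unpacking: operations build a graph of deferred work, and nothing runs until a result is demanded. This is a toy sketch of that idea in plain Python, not the MLX API (MLX exposes it via `mx.eval` on arrays):

```python
# Toy illustration of lazy evaluation (NOT the MLX API): operations
# build a graph of deferred work; computation happens only on demand.
class Lazy:
    def __init__(self, fn, *deps):
        self.fn, self.deps = fn, deps
        self._value = None
        self.computed = False

    def eval(self):
        if not self.computed:
            self._value = self.fn(*(d.eval() for d in self.deps))
            self.computed = True
        return self._value

def const(x):
    return Lazy(lambda: x)

a = const(2)
b = const(3)
c = Lazy(lambda x, y: x + y, a, b)   # nothing computed yet
unused = Lazy(lambda x: x * 100, c)  # never demanded, never pays its cost

print(c.computed)       # False: the sum has not run yet
print(c.eval())         # 5: computed on demand
print(unused.computed)  # False: unneeded work was skipped
```

The memory-pressure benefit falls out of the same structure: branches of the graph nobody asks for are never materialized.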

Fleet performance on Apple Silicon

| Chip     | GPU Cores | Memory | LLM Sweet Spot | Image Gen |
|----------|-----------|--------|----------------|-----------|
| M1       | 8         | 8-16GB | 3-7B models    | Slow      |
| M2 Pro   | 19        | 32GB   | 14B models     | Capable   |
| M3 Max   | 40        | 128GB  | 70B models     | Fast      |
| M4 Ultra | 80        | 256GB  | 120B+ models   | Very fast |
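The table above reduces to a simple memory-based rule of thumb. A sketch of a sizing helper using those rows as thresholds; the cutoffs are illustrative, taken from the table rather than from any official guidance:

```python
# Rough sizing helper derived from the chip table above. Thresholds are
# illustrative: a quantized N-billion-parameter model needs several GB
# of unified memory plus headroom for KV cache and the OS.
def llm_sweet_spot(memory_gb: int) -> str:
    if memory_gb >= 256:
        return "120B+ models"
    if memory_gb >= 128:
        return "70B models"
    if memory_gb >= 32:
        return "14B models"
    return "3-7B models"

print(llm_sweet_spot(32))  # "14B models", matching the M2 Pro row
```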

Monitor your MLX fleet

# Fleet overview
curl -s http://localhost:11435/fleet/status | python3 -m json.tool

# Model recommendations based on your hardware
curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool

# Health checks
curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool

Dashboard at http://localhost:11435/dashboard — see every node, every model, every queue in real time.
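The health endpoint lends itself to scripted checks. A sketch of flagging unhealthy nodes from its JSON; the response shape used here (a "nodes" list with "name" and "healthy" fields) is a guess for illustration, so inspect the real /dashboard/api/health output before relying on it:

```python
import json

# Flag unhealthy nodes from a health-check response. The field names
# ("nodes", "name", "healthy") are assumed, not documented.
def unhealthy_nodes(health_json: str) -> list[str]:
    data = json.loads(health_json)
    return [n["name"] for n in data.get("nodes", []) if not n.get("healthy", False)]

sample = '{"nodes": [{"name": "studio", "healthy": true}, {"name": "mini", "healthy": false}]}'
print(unhealthy_nodes(sample))  # ['mini']
```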

Full documentation

Contribute

Ollama Herd is open source (MIT) and built on the MLX ecosystem. We welcome contributions:

  • Star on GitHub — helps others discover the project
  • Open an issue — bug reports, feature requests, questions
  • AI agents welcome: CLAUDE.md provides full architectural context. Fork, branch, PR.
  • 412 tests, async Python, runs in under 40 seconds. Hard to break things.

Guardrails

  • No automatic downloads — all model pulls require explicit user confirmation.
  • Model deletion requires explicit user confirmation.
  • All requests stay local — no data leaves your network.
  • Never delete or modify files in ~/.fleet-manager/.

Files

1 total