# Relay Backends
Use any device running an OpenAI-compatible API as an inference backend for your mycellm node. The relay device provides the compute — mycellm provides the routing, credit accounting, and network presence.
## How it works

```
iPad / Phone / GPU box                    Your mycellm node
┌──────────────────────┐                  ┌────────────────────┐
│ Ollama / LM Studio / │    ← HTTP →      │ mycellm serve      │
│ PocketPal / vLLM     │                  │ --relay device:80  │
│ :8080/v1/models      │                  │ announces models   │
└──────────────────────┘                  └────────────────────┘
                                                    │
                                             QUIC to network
```

- The relay device runs any app that exposes `/v1/models` and `/v1/chat/completions`
- mycellm discovers models from the relay’s `/v1/models` endpoint
- Models are announced to the network as `relay:<model-name>`
- Inference requests are proxied transparently to the relay device
- Credits accrue to your node (you contributed the compute)
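The discovery and announce steps above can be sketched in a few lines. This is an illustrative sketch, not mycellm’s actual code: `announce_names` and `discover_relay_models` are hypothetical helper names, but the `/v1/models` payload shape and the `relay:` prefix follow the behavior described above.

```python
import json
from urllib.request import urlopen

def announce_names(models_response: dict) -> list[str]:
    """Map a relay's /v1/models payload to the names announced on the network."""
    return [f"relay:{m['id']}" for m in models_response.get("data", [])]

def discover_relay_models(base_url: str) -> list[str]:
    """Fetch /v1/models from a relay device and return the announced names."""
    with urlopen(f"{base_url}/v1/models") as resp:
        return announce_names(json.load(resp))

# Example payload in the shape an OpenAI-compatible server returns:
payload = {"data": [{"id": "llama3.2:3b"}, {"id": "phi-4-mini"}]}
print(announce_names(payload))  # ['relay:llama3.2:3b', 'relay:phi-4-mini']
```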
## Via CLI flag

```sh
mycellm serve --relay http://ipad.lan:8080
```

Multiple relays:

```sh
mycellm serve --relay http://ipad.lan:8080 --relay http://ollama.lan:11434
```

## Via environment variable

```sh
MYCELLM_RELAY_BACKENDS=http://ipad.lan:8080,http://ollama.lan:11434
```

## Via dashboard

Open the dashboard → Models tab → Relay Device tab → paste the device URL and click Add Relay.
Connected relays show online/offline status and their discovered models.
## Via API

```sh
curl -X POST http://localhost:8420/v1/node/relay/add \
  -H "Content-Type: application/json" \
  -d '{"url": "http://ipad.lan:8080", "name": "iPad Pro"}'
```

## Via chat REPL

```
/relay add http://ipad.lan:8080
/relay                               # list all relays
/relay refresh                       # re-discover models
/relay remove http://ipad.lan:8080
```

## Compatible apps

Any app that exposes an OpenAI-compatible API works as a relay:
| App | Platform | Notes |
|---|---|---|
| Ollama | macOS, Linux, Windows | Default port 11434. Batches requests. |
| LM Studio | macOS, Linux, Windows | Enable “Local Server” in sidebar |
| llama.cpp server | Any | `llama-server --port 8080` |
| vLLM | Linux (CUDA) | High-throughput, continuous batching |
| LocalAI | Any | Drop-in OpenAI replacement |
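To make “OpenAI-compatible” concrete, here is a minimal stub relay that answers the two endpoints mycellm needs. It is a toy for experimentation, not a real backend: it advertises one fake model and returns a canned completion.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

MODELS = {"data": [{"id": "demo-model", "owned_by": "stub"}]}

class StubRelay(BaseHTTPRequestHandler):
    def _send_json(self, payload: dict) -> None:
        body = json.dumps(payload).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Model discovery endpoint: mycellm polls this to find models.
        if self.path == "/v1/models":
            self._send_json(MODELS)
        else:
            self.send_error(404)

    def do_POST(self):
        # Inference endpoint: mycellm proxies chat requests here.
        if self.path == "/v1/chat/completions":
            length = int(self.headers.get("Content-Length", 0))
            request = json.loads(self.rfile.read(length) or b"{}")
            self._send_json({
                "object": "chat.completion",
                "model": request.get("model", "demo-model"),
                "choices": [{"index": 0,
                             "message": {"role": "assistant", "content": "hello"},
                             "finish_reason": "stop"}],
            })
        else:
            self.send_error(404)

# To run it: HTTPServer(("0.0.0.0", 8080), StubRelay).serve_forever()
# then: mycellm serve --relay http://<this-host>:8080
```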
## iPad / iPhone as a relay

Apple Silicon devices (M1–M4) are excellent inference backends. You need an app that runs a local LLM and exposes an OpenAI-compatible API server.
Currently the best option for iOS/iPadOS is running Ollama via a Mac on the same network, then pointing the relay at that Mac. Native iOS apps with API server support are still emerging — check the App Store for new options.
For Mac devices (MacBook, Mac Mini, Mac Studio):

- Install Ollama or LM Studio
- Pull a model: `ollama pull llama3.2:3b`
- Ollama serves on port 11434 by default
- Add as relay: `mycellm serve --relay http://<mac-ip>:11434`
An M4 Mac with 16 GB of RAM can run 8B models at ~30 tok/s via Metal.
## API reference

### GET /v1/node/relay

List all relay backends and their status.

```json
{
  "relays": [
    {
      "url": "http://ipad.lan:8080",
      "name": "ipad",
      "online": true,
      "models": ["llama3.2:3b", "phi-4-mini"],
      "model_count": 2
    }
  ]
}
```

### POST /v1/node/relay/add
```json
{"url": "http://ipad.lan:8080", "name": "iPad Pro", "max_concurrent": 2}
```

`max_concurrent` controls how many simultaneous requests mycellm sends to this device (default: 32). Set it lower for constrained devices like iPads (2), higher for beefy GPU servers.
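The effect of `max_concurrent` can be modeled as a per-device semaphore. The sketch below is illustrative, not mycellm’s code (the class and method names are invented, and `asyncio.sleep` stands in for the proxied HTTP call), but it shows the guarantee: no more than `max_concurrent` requests are ever in flight to one device.

```python
import asyncio

class RelaySlot:
    """Cap in-flight requests to one relay device, like max_concurrent."""
    def __init__(self, max_concurrent: int):
        self._sem = asyncio.Semaphore(max_concurrent)
        self._in_flight = 0
        self.peak = 0  # highest number of simultaneous requests observed

    async def proxy(self, request_id: int) -> str:
        async with self._sem:              # blocks when the device is saturated
            self._in_flight += 1
            self.peak = max(self.peak, self._in_flight)
            await asyncio.sleep(0.01)      # stand-in for the HTTP round trip
            self._in_flight -= 1
            return f"response-{request_id}"

async def main() -> int:
    slot = RelaySlot(max_concurrent=2)     # an iPad-class device
    await asyncio.gather(*(slot.proxy(i) for i in range(8)))
    return slot.peak

# asyncio.run(main()) returns a peak of at most 2:
# even with 8 queued requests, only two run at once.
```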
### POST /v1/node/relay/remove

```json
{"url": "http://ipad.lan:8080"}
```

### POST /v1/node/relay/refresh

Re-discover models from all relay backends. Returns count of new models found.
## How relay models appear on the network

Relay models are prefixed with `relay:` to distinguish them from locally-loaded models:

```
GET /v1/models
```

```json
{
  "data": [
    {"id": "Qwen2.5-3B-Q8_0", "owned_by": "local"},
    {"id": "relay:llama3.2:3b", "owned_by": "relay:ipad"},
    {"id": "relay:phi-4-mini", "owned_by": "relay:ipad"}
  ]
}
```

To the rest of the network, these models are indistinguishable from locally-loaded models. Peers route inference requests to your node, and your node proxies them to the relay device.
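A client can recover the local/relay split from that listing by parsing the `relay:` prefix. A small sketch (the function name is illustrative; `str.removeprefix` requires Python 3.9+):

```python
def split_models(listing: dict) -> tuple[list[str], dict[str, list[str]]]:
    """Partition /v1/models entries into local IDs and relay IDs grouped by device."""
    local, by_relay = [], {}
    for m in listing["data"]:
        if m["id"].startswith("relay:"):
            device = m["owned_by"].removeprefix("relay:")
            by_relay.setdefault(device, []).append(m["id"].removeprefix("relay:"))
        else:
            local.append(m["id"])
    return local, by_relay

listing = {"data": [
    {"id": "Qwen2.5-3B-Q8_0", "owned_by": "local"},
    {"id": "relay:llama3.2:3b", "owned_by": "relay:ipad"},
    {"id": "relay:phi-4-mini", "owned_by": "relay:ipad"},
]}
local, relays = split_models(listing)
print(local)   # ['Qwen2.5-3B-Q8_0']
print(relays)  # {'ipad': ['llama3.2:3b', 'phi-4-mini']}
```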
## Concurrency

Each model source has different concurrency characteristics:
| Source | Concurrent requests | Why |
|---|---|---|
| Local GGUF (llama.cpp) | 1 per model | C library context is not thread-safe |
| API Provider | 32 per model (default) | Cloud server handles backpressure |
| Device Relay | 32 per model (default) | Remote device handles backpressure |
A node with 2 local models can serve 2 concurrent users — one per model. Adding relay or API provider models adds more concurrent capacity without the hardware constraint.
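The capacity arithmetic works out as follows. The per-source defaults are taken from the table above; the helper itself is purely illustrative.

```python
# Per-model concurrent-request limits, from the table above (defaults).
CONCURRENCY = {"local_gguf": 1, "api_provider": 32, "device_relay": 32}

def node_capacity(model_sources: list[str]) -> int:
    """Total simultaneous requests a node can serve across its models."""
    return sum(CONCURRENCY[source] for source in model_sources)

# Two local GGUF models: one request each.
print(node_capacity(["local_gguf", "local_gguf"]))                  # 2
# Adding one relay model raises capacity without more local hardware.
print(node_capacity(["local_gguf", "local_gguf", "device_relay"]))  # 34
```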
Tune per device with `max_concurrent`:

```sh
# iPad relay — limited device, keep low
curl -X POST localhost:8420/v1/node/relay/add \
  -d '{"url": "http://ipad:8080", "max_concurrent": 2}'

# Cloud API — high throughput
curl -X POST localhost:8420/v1/node/models/load \
  -d '{"name": "gpt-4o", "backend": "openai", "api_base": "...", "max_concurrent": 64}'
```

## Automatic health checking
mycellm polls relay backends every 60 seconds to detect:
- New models added to the relay device
- Models removed from the relay device
- Relay device going offline/coming back online
If a relay goes offline, its models are marked unavailable and requests route elsewhere on the network.