Configuration

The Buttress server reads a single TOML file passed via --config. Every section is optional; omit a section to use its defaults.

Minimal example

[server]
port = 2080

[[generators]]
type = "ggml-llm"
[generators.model]
repo_id = "ggml-org/gpt-oss-20b-GGUF"
quantization = "mxfp4"
n_ctx = 12800

Top-level sections

Section	Purpose
`[env]`	Environment variables exported into the process only if not already set
`[server]`	HTTP/WebSocket listener (port, log level, body limits)
`[runtime]`	Global defaults shared by every generator
`[runtime.session_cache]`	KV-cache reuse store for `ggml-llm`
`[autodiscover]`	LAN UDP / HTTP discovery toggles
`[openai_compat]`	Enable OpenAI-compatible HTTP routes
`[anthropic_messages]`	Enable Anthropic-compatible HTTP routes
`[[generators]]`	Array of generator instances — one entry per loaded model

`[server]`

Key	Type	Default	Description
`id`	string	`buttress-<machineId>`	Stable server id used for binding and discovery
`name`	string	`Buttress Server (<short id>)`	Friendly name shown in BRICKS Controller
`port`	number	`2080`	HTTP/WebSocket port (overridden by `--port`)
`log_level`	string	unset	One of `debug`, `info`, `warn`, `error`
`max_body_size`	number or string	`"50MB"`	Max upload size; accepts `"100MB"`, `"1GB"`, or raw bytes
`session_timeout`	number or string	`60000`	WebSocket idle timeout in ms; also accepts `"1m"`, `"30s"`
`temp_file_dir`	string	`$TMPDIR/.buttress`	Directory for STT audio uploads and other temp files
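
As an illustration of the keys above, a fuller [server] block might look like the sketch below. The id, name, and directory values are hypothetical placeholders; the size and duration strings use the formats listed in the table.

```toml
[server]
id = "buttress-lab-01"               # hypothetical stable id
name = "Lab Buttress"                # hypothetical friendly name
port = 2080                          # --port on the command line still wins
log_level = "info"                   # debug | info | warn | error
max_body_size = "100MB"              # string form; raw bytes are also accepted
session_timeout = "1m"               # string form of the default 60000 ms
temp_file_dir = "/var/tmp/buttress"  # hypothetical path
```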

`[runtime]`

Global defaults shared by every generator. Per-generator values under [generators.model] win; otherwise these defaults apply.

[runtime]
cache_dir = "~/.buttress/models"
huggingface_token = "hf_..."
n_gpu_layers = "auto"

Key	Type	Description
`cache_dir`	string	Model and metadata cache root (default `~/.buttress/models`)
`huggingface_token`	string	Hugging Face auth token; falls back to `$HUGGINGFACE_TOKEN`. Applied to all backends regardless of variable name
`http_headers`	table	Extra headers attached to Hugging Face / HTTP downloads
`context_release_delay_ms`	number	Idle time before unloading a context (default `10000`; `0` = immediate)
`prefer_variants`	string[]	Override backend variant probe order (ggml backends)
`n_threads`	number	CPU thread count
`n_ctx`	number	Context window (per-model value wins; auto-capped at training context)
`n_gpu_layers`	number or `"auto"`	Layers offloaded to GPU (default `"auto"`)
`n_batch`	number	Prompt batch size. Note: the model layer defaults `n_batch` to `512`, which shadows the runtime value unless `n_batch` is set explicitly under `[generators.model]`
`n_ubatch`	number	Prompt micro-batch size
`n_parallel`	number	Parallel sequences (default `4`)
`n_cpu_moe`	number	MoE expert layers offloaded to CPU
`flash_attn_type`	string	`"on"`, `"off"`, or `"auto"`. Default is GPU-conditional: `"auto"` when a GPU backend is selected, `"off"` on CPU
`cache_type_k`, `cache_type_v`	string	KV-cache dtype (`f16`, `f32`, `q8_0`, `q4_0`, …)
`kv_unified`	boolean	Use a unified KV cache across sequences
`swa_full`	boolean	Materialize full attention even for sliding-window layers
`ctx_shift`	boolean	Allow llama.cpp’s rolling context shift
`use_mmap`, `use_mlock`	boolean	Memory-mapping / locking
`no_extra_bufts`	boolean	Disable extra compute buffer types
`cpu_mask`, `cpu_strict`	string / boolean	CPU affinity (advanced)
`devices`	string[]	Restrict to specific GGML devices
Speculative keys	various	`speculative`, `spec_type`, `spec_draft_n_max`, `spec_draft_n_min`, `spec_draft_p_min`, `spec_draft_p_split`
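
The n_batch note above is easy to miss, so here is a minimal sketch of how it plays out, assuming the precedence works exactly as the table describes. The numeric values are illustrative only.

```toml
[runtime]
n_ctx = 8192       # shared default; a per-model n_ctx would win
n_parallel = 4
n_batch = 1024     # on its own this is shadowed by the model-layer default of 512

[[generators]]
type = "ggml-llm"

[generators.model]
repo_id = "ggml-org/gpt-oss-20b-GGUF"
quantization = "mxfp4"
n_batch = 1024     # set explicitly here so the intended batch size takes effect
```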

`[runtime.session_cache]`

For ggml-llm generators, the server can persist KV cache state between requests so that a follow-up completion sharing a prompt prefix skips prompt processing.

[runtime.session_cache]
enabled = true
max_size_bytes = "10GB"
max_entries = 1000

Key	Default	Description
`enabled`	`true`	Enable persistent KV cache
`max_size_bytes`	`"10GB"`	Total disk budget; accepts `"500MB"`, `"50GB"`, or a number
`max_entries`	`1000`	Max number of cached states (LRU eviction)

Cache files are stored under {cache_dir}/.session-state-cache/. mlx-llm keeps a separate session cache under {cache_dir}/mlx-session-cache/, configured independently per generator.

`[[generators]]`

Each [[generators]] block declares one model the server can host. Repeat the block to host multiple. Every block has a type, an optional [generators.backend] table, and a [generators.model] table.

[[generators]]
type = "ggml-llm"

[generators.backend]
# backend selection and resource planning

[generators.model]
repo_id = "..."
# model identity and runtime overrides
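
To host more than one model, repeat the block. The sketch below reuses the repos from the examples on this page; each [generators.model] table attaches to the [[generators]] entry immediately above it.

```toml
# First generator: a GGUF chat model
[[generators]]
type = "ggml-llm"

[generators.model]
repo_id = "ggml-org/gpt-oss-20b-GGUF"
quantization = "mxfp4"

# Second generator: a whisper.cpp speech-to-text model
[[generators]]
type = "ggml-stt"

[generators.model]
repo_id = "BricksDisplay/whisper-ggml"
filename = "ggml-large-v3-turbo-q8_0.bin"
```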

Common `[generators.model]` keys

Shared by all generator types (ggml-llm, ggml-stt, mlx-llm):

Key	Type	Description
`repo_id` (required)	string	Hugging Face repo (`org/repo`)
`revision`	string	Default `"main"`
`download`	boolean	Pre-download at server startup (default `false`)

Honored by ggml-llm and ggml-stt only (mlx-llm derives quantization from the repo itself and ignores these):

Key	Type	Description
`filename`	string	Pin a specific artifact in the repo
`url`	string	Direct download URL (skips manifest lookup)
`quantization`	string	Preferred quant tag — e.g. `q4_0`, `q8_0`, `mxfp4`
`preferred_quantizations`	string[]	Ordered fallback list when `quantization` doesn’t match (alias: `quantizations`)
`allow_local_file`	boolean	Required to use `local_path` or `mmproj_local_path`
`local_path`	string	Use a local file as the load path. Repo metadata is still resolved from Hugging Face, so `repo_id` is still required
`api_base`, `base_url`	string	Override Hugging Face API / blob hosts (mirrors or proxies)
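
As a sketch of the local-file keys, the fragment below loads a GGUF file from disk. The path is a hypothetical placeholder; repo_id stays in place because repo metadata is still resolved from Hugging Face.

```toml
[[generators]]
type = "ggml-llm"

[generators.model]
repo_id = "ggml-org/gpt-oss-20b-GGUF"           # still required for metadata
allow_local_file = true                         # must be set to use local_path
local_path = "/models/gpt-oss-20b-mxfp4.gguf"   # hypothetical local file
```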

`ggml-llm` (llama.cpp / GGUF)

[[generators]]
type = "ggml-llm"

[generators.backend]
variant_preference = ["cuda", "vulkan", "default"]
gpu_memory_fraction = 0.95

[generators.model]
repo_id = "ggml-org/gpt-oss-20b-GGUF"
quantization = "mxfp4"
n_ctx = 12800
download = true

[generators.backend] only controls backend selection and resource planning. Runtime overrides (n_ctx, n_gpu_layers, flash_attn_type, etc.) go under [generators.model].

[generators.backend]

Key	Type	Default	Description
`variant`	string	auto	Force `cuda`, `vulkan`, `snapdragon`, or `default`
`variant_preference`	string[]	`["cuda", "vulkan", "snapdragon", "default"]`	Probe order when `variant` is unset
`gpu_memory_fraction`	number	`0.85`	Max GPU fraction the hardware guardrails may plan against
`cpu_memory_fraction`	number	`0.5`	Max RAM fraction for CPU-side buffers
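
For example, to skip probing and pin one backend while leaving more GPU headroom, a configuration along these lines should work. The fraction values are illustrative, not recommendations.

```toml
[[generators]]
type = "ggml-llm"

[generators.backend]
variant = "vulkan"           # forces this variant; variant_preference is only consulted when variant is unset
gpu_memory_fraction = 0.7    # illustrative; the default is 0.85
cpu_memory_fraction = 0.5

[generators.model]
repo_id = "ggml-org/gpt-oss-20b-GGUF"
quantization = "mxfp4"
```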

[generators.model]: in addition to the common ggml keys above, the following [runtime] keys can be overridden per generator: n_ctx, n_gpu_layers, n_batch, n_ubatch, n_threads, n_parallel, n_cpu_moe, flash_attn_type, cache_type_k, cache_type_v, kv_unified, swa_full, ctx_shift, use_mmap, use_mlock, no_extra_bufts, cpu_mask, cpu_strict, devices.

Multimodal (mtmd), which auto-downloads the matching mmproj-*.gguf from the same repo:

Key	Type	Description
`enable_mtmd`	boolean	Default `false`
`mmproj_filename`	string	Pin a specific projector file
`mmproj_url`	string	Direct URL override
`mmproj_local_path`	string	Local projector file (requires `allow_local_file = true`)
`mmproj_use_gpu`	boolean	Unset = auto (true when `n_gpu_layers > 0`)
`mmproj_image_min_tokens`	number	Min visual tokens (dynamic-resolution models; `-1` = unset)
`mmproj_image_max_tokens`	number	Max visual tokens (`-1` = unset)

Speculative decoding

Key	Type	Description
`speculative`	string	Draft model identifier
`spec_type`	string	Strategy (backend-defined)
`spec_draft_n_max`	int	Max drafted tokens per step
`spec_draft_n_min`	int	Min drafted tokens
`spec_draft_p_min`	number	Min acceptance probability
`spec_draft_p_split`	number	Split threshold

`ggml-stt` (whisper.cpp)

[[generators]]
type = "ggml-stt"

[generators.backend]
variant_preference = ["cuda", "vulkan", "default"]

[generators.model]
repo_id = "BricksDisplay/whisper-ggml"
filename = "ggml-large-v3-turbo-q8_0.bin"
use_gpu = true
use_flash_attn = "on"
download = true

[generators.backend]

Key	Type	Default	Description
`variant`	string	auto	Force `cuda`, `vulkan`, or `default`
`variant_preference`	string[]	`["cuda", "vulkan", "default"]`	Probe order
`gpu_memory_fraction`	number	`0.85`	Same meaning as for `ggml-llm`
`cpu_memory_fraction`	number	`0.5`	Same meaning as for `ggml-llm`

[generators.model] — in addition to the common ggml keys above:

Key	Type	Default	Description
`repo_id`	string	`"BricksDisplay/whisper-ggml"`	Optional for `ggml-stt`: defaults to this repo (unlike `ggml-llm`, where it is required)
`preferred_quantizations`	string[]	`["q8_0", <no-quant>, "q5_1"]`	Default fallback chain
`use_gpu`	boolean	`true`	Set to `false` to force CPU even when a GPU is available
`use_flash_attn`	string or boolean	`"auto"`	`"on"`, `"off"`, or `"auto"`. `true` / `false` are accepted as shortcuts. `"auto"` enables flash-attn when a GPU is in use

Runtime extras — under [runtime] for ggml-stt only:

Key	Type	Description
`max_threads`	number	Caps the whisper.cpp thread count
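
Putting the ggml-stt keys together, a CPU-only setup could be sketched as follows. It relies on the defaulted repo_id described above, and the thread cap is an illustrative value.

```toml
[runtime]
max_threads = 4            # illustrative cap on whisper.cpp threads

[[generators]]
type = "ggml-stt"

[generators.model]
# repo_id omitted: ggml-stt falls back to "BricksDisplay/whisper-ggml"
use_gpu = false            # force CPU even when a GPU is available
use_flash_attn = false     # boolean shortcut for "off"
```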

`mlx-llm` (Apple Silicon)

[[generators]]
type = "mlx-llm"

[generators.model]
repo_id = "mlx-community/Qwen2.5-VL-3B-Instruct-4bit"
vlm = true
download = true

There is no [generators.backend] section for mlx-llm. On first use, the backend creates a Python virtualenv at {cache_dir}/mlx-env and installs the packages named by mlx_lm_package and mlx_vlm_package, plus torch and torchvision (required by some VLM processors). If an existing venv already has mlx_vlm and torch importable, the install step is skipped.

[generators.model] takes the common repo_id / revision / download keys plus:

Key	Type	Default	Description
`adapter_path`	string	—	Local LoRA adapter directory
`vlm`	`"auto"` or boolean	`"auto"`	Force VLM (`true`) vs text-only (`false`); `"auto"` infers from the repo
`tokenizer_config`	table	—	Forwarded to `mlx_lm.load(..., tokenizer_config=...)`
`model_config`	table	—	Forwarded to `mlx_lm.load(..., model_config=...)`

quantization, filename, and preferred_quantizations are not used; the MLX repo itself determines the quantization.

Runtime extras, under [runtime] for mlx-llm:

Key	Type	Default	Description
`mlx_env_dir`	string	`{cache_dir}/mlx-env`	Location of the auto-managed Python venv
`mlx_lm_package`	string	`"mlx-lm==0.31.1"`	pip spec used when provisioning the venv
`mlx_vlm_package`	string	`"mlx-vlm==0.4.0"`	pip spec used when provisioning the venv
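
If the venv needs to live somewhere other than the cache root, or the package versions need to be pinned explicitly, the runtime extras can be combined with a generator as sketched below. The mlx_env_dir path is a hypothetical placeholder; the package specs are simply the documented defaults written out.

```toml
[runtime]
mlx_env_dir = "/opt/buttress/mlx-env"   # hypothetical location for the managed venv
mlx_lm_package = "mlx-lm==0.31.1"       # pip spec used when provisioning
mlx_vlm_package = "mlx-vlm==0.4.0"

[[generators]]
type = "mlx-llm"

[generators.model]
repo_id = "mlx-community/Qwen2.5-VL-3B-Instruct-4bit"
vlm = true
```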

`[autodiscover]`

The server announces itself on UDP 8089 so Foundation devices on the same LAN can find it. Auto-discovery is on by default.

[autodiscover]
[autodiscover.udp]
port = 8089

[autodiscover.udp.announcements]
enabled = true
interval = 5000

[autodiscover.udp.requests]
enabled = true
responseDelay = 100

[autodiscover.http]
enabled = true
path = "/buttress/info"
cors = true

Set autodiscover = false to disable discovery entirely. See the autodiscovery reference for protocol details.
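
One TOML detail matters here: a bare key belongs to whichever table header precedes it, so the boolean form has to appear before the first [table] in the file. It also cannot coexist with an [autodiscover] table, because TOML rejects defining the same key as both a value and a table. A sketch:

```toml
# Top-level keys go first, before any [table] header.
autodiscover = false

[server]
port = 2080

# Do not also declare [autodiscover] or [autodiscover.udp] in this file.
```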

`[env]`

Environment variables applied at startup, but only if they are not already set in the system environment. System variables and command-line exports take precedence.

[env]
HUGGINGFACE_TOKEN = "hf_..."
CUDA_VISIBLE_DEVICES = "0"

The ggml backends read HUGGINGFACE_TOKEN (not HF_TOKEN). For a single token that applies to every backend regardless of variable name, set [runtime] huggingface_token instead.

Compatibility endpoints

These endpoints are experimental. The schemas, error shapes, and CORS defaults may change.

The server can expose OpenAI- and Anthropic-compatible HTTP routes alongside the native WebSocket RPC. Each is opt-in.

[openai_compat]
enabled = true
# cors_allowed_origins = "*"

[anthropic_messages]
enabled = true
# cors_allowed_origins = ["http://localhost:3000"]

Endpoint	Config flag
`POST /oai-compat/v1/chat/completions`	`[openai_compat] enabled = true`
`GET /oai-compat/v1/models`	`[openai_compat] enabled = true`
`POST /anthropic-messages/v1/messages`	`[anthropic_messages] enabled = true`
`POST /anthropic-messages/v1/messages/count_tokens`	`[anthropic_messages] enabled = true`

You can also enable each endpoint via env var: ENABLE_OPENAI_COMPAT_ENDPOINT=1 or ENABLE_ANTHROPIC_MESSAGES_ENDPOINT=1.

Next steps

Workspace binding

LAN auto-discovery
