Skip to main content

Documentation Index

Fetch the complete documentation index at: https://docs.bricks.tools/llms.txt

Use this file to discover all available pages before exploring further.

The Buttress server reads a single TOML file passed via --config. Every section is optional; omit it to use defaults.

Minimal example

[server]
port = 2080

[[generators]]
type = "ggml-llm"
[generators.model]
repo_id = "ggml-org/gpt-oss-20b-GGUF"
quantization = "mxfp4"
n_ctx = 12800

[server]

KeyTypeDefaultDescription
portnumber2080HTTP/WebSocket port
log_levelstring"info"One of debug, info, warn, error
idstringbuttress-<machineId>Stable server id used for binding and discovery
namestringauto-generatedFriendly name shown in BRICKS Controller
max_body_sizenumber or string52428800 (50 MB)Max upload size; accepts "50MB", "1GB", etc.
session_timeoutnumber or string60000 (1 min)WebSocket idle timeout; accepts "1m", "30s"
temp_file_dirstring<os-tmpdir>/.buttressDirectory for STT audio uploads and other temp files

[runtime]

Where the server stores downloaded models.
[runtime]
cache_dir = "~/.buttress/models"
huggingface_token = ""
KeyDefaultDescription
cache_dir~/.buttress/modelsWhere downloaded model files live
huggingface_token""Hugging Face auth token; falls back to HF_TOKEN env var

[runtime.session_cache]

For ggml-llm generators, the server can persist KV cache state between requests so that a follow-up completion sharing a prompt prefix skips prompt processing.
[runtime.session_cache]
enabled = true
max_size_bytes = "10GB"
max_entries = 1000
KeyDefaultDescription
enabledtrueEnable persistent KV cache
max_size_bytes"10GB"Total disk budget; accepts "500MB", "50GB", or a number
max_entries1000Max number of cached states (LRU eviction)
Cache files are stored under {cache_dir}/.session-state-cache/.

[[generators]]

Each [[generators]] block declares one model the server can host. Repeat the block to host multiple.

LLM (llama.cpp / GGML)

[[generators]]
type = "ggml-llm"

[generators.backend]
variant_preference = ["cuda", "vulkan", "default"]

[generators.model]
repo_id = "ggml-org/gpt-oss-20b-GGUF"
quantization = "mxfp4"
n_ctx = 12800

LLM (MLX, Apple Silicon only)

[[generators]]
type = "mlx-llm"

[generators.model]
repo_id = "mlx-community/Llama-3.2-3B-Instruct-4bit"
n_ctx = 8192

Speech-to-text (Whisper / GGML)

[[generators]]
type = "ggml-stt"

[generators.backend]
variant_preference = ["coreml", "default"]

[generators.model]
repo_id = "BricksDisplay/whisper-ggml"
filename = "ggml-small.bin"
KeyDescription
typeOne of ggml-llm, mlx-llm, ggml-stt
backend.variant_preferenceOrdered list of backend variants. LLM accepts cuda, vulkan, snapdragon, default. STT accepts coreml, default
model.repo_idHugging Face repo id
model.filenameSpecific file inside the repo (STT only)
model.quantizationQuantization tag matching the repo (LLM only)
model.n_ctxContext window length in tokens (LLM only)

[autodiscover]

The server announces itself on UDP 8089 so Foundation devices on the same LAN can find it. Auto-discovery is on by default.
[autodiscover]
[autodiscover.udp]
port = 8089

[autodiscover.udp.announcements]
enabled = true
interval = 5000

[autodiscover.udp.requests]
enabled = true
responseDelay = 100

[autodiscover.http]
enabled = true
path = "/buttress/info"
cors = true
Set [autodiscover] = false to disable discovery entirely. See the autodiscovery reference for protocol details.

[env]

Environment variables applied at startup, but only if they are not already set in the system environment. System variables and command-line exports take precedence.
[env]
HF_TOKEN = "hf_..."
CUDA_VISIBLE_DEVICES = "0"

Compatibility endpoints

These endpoints are experimental. The schemas, error shapes, and CORS defaults may change.
The server can expose OpenAI- and Anthropic-compatible HTTP routes alongside the native WebSocket RPC. Each is opt-in.
[openai_compat]
enabled = true
# cors_allowed_origins = "*"

[anthropic_messages]
enabled = true
# cors_allowed_origins = ["http://localhost:3000"]
EndpointConfig flag
POST /oai-compat/v1/chat/completions[openai_compat] enabled = true
GET /oai-compat/v1/models[openai_compat] enabled = true
POST /anthropic-messages/v1/messages[anthropic_messages] enabled = true
POST /anthropic-messages/v1/messages/count_tokens[anthropic_messages] enabled = true
You can also enable each endpoint via env var: ENABLE_OPENAI_COMPAT_ENDPOINT=1 or ENABLE_ANTHROPIC_MESSAGES_ENDPOINT=1.

Next steps

Workspace binding

Pair the server with a BRICKS workspace and enable auth.

LAN auto-discovery

How Foundation devices find your server on the LAN.