The Buttress server runs from npm and exposes a single executable, bricks-buttress. It works on macOS, Linux, and Windows; for GPU acceleration on Linux, install CUDA or Vulkan drivers before starting the server.

Hardware

Resource   Recommended
GPU        NVIDIA (CUDA), AMD/Intel (Vulkan), or Apple Silicon (Metal)
RAM        At least 2× the size of the largest model you plan to load
Disk       Enough free space in cache_dir to hold every model you download
Network    Wired LAN; UDP broadcasts must reach Foundation devices

The server runs without a GPU, but throughput drops sharply and capability scoring marks the host as a less-preferred backend.

Install from npm

Requires Node.js 22+ (or Bun).
npm install -g @fugood/buttress-server
This installs the bricks-buttress binary on your PATH.
On a fresh Apple Silicon Mac, install with Bun (bun add -g @fugood/buttress-server) for faster cold starts and lower memory overhead.

Run the server

Without a config, the server starts on port 2080 with sensible defaults:
bricks-buttress
With a TOML config:
bricks-buttress --config ./config.toml
The same flag also accepts an inline TOML string in place of a path:
bricks-buttress --config '[server]
port = 3000

[[generators]]
type = "ggml-llm"
[generators.model]
repo_id = "ggml-org/gemma-3-270m-qat-GGUF"'
See the configuration reference for the full schema.

CLI flags

Flag                        Description
-p, --port <port>           Port to listen on (default: 2080)
-c, --config <path|toml>    Path to a TOML file or an inline TOML string
-v, --version               Print the server version
-h, --help                  Show help

The port resolves in this order: --port flag → [server] port in TOML → default 2080.
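
The precedence above can be sketched as a small shell function. This is an illustration of the documented order only, not the server's own code; resolve_port and its two arguments are hypothetical names.

```shell
# Hypothetical helper mirroring the documented precedence:
# --port flag, then [server] port from TOML, then the default 2080.
resolve_port() {
  flag_port="$1"   # value passed via --port, empty if absent
  toml_port="$2"   # value of [server] port in the TOML config, empty if absent
  if [ -n "$flag_port" ]; then
    echo "$flag_port"
  elif [ -n "$toml_port" ]; then
    echo "$toml_port"
  else
    echo 2080
  fi
}

resolve_port 3000 4000   # prints 3000 (flag wins)
resolve_port "" 4000     # prints 4000 (TOML value used)
resolve_port "" ""       # prints 2080 (default)
```

In practice this means a --port flag on the command line always takes effect, even when the config file sets a different port.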

Environment variables

Variable                              Effect
NODE_ENV                              Set to development for verbose logs
ENABLE_OPENAI_COMPAT_ENDPOINT         Set to 1 to enable the OpenAI-compat endpoint
ENABLE_ANTHROPIC_MESSAGES_ENDPOINT    Set to 1 to enable the Anthropic messages endpoint
HF_TOKEN                              Hugging Face token for downloading gated models

System environment variables override values set under [env] in your TOML config.
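
The override rule can be illustrated with a short shell sketch. It only models the stated behaviour (a system variable, when set, wins over the [env] value); effective_env is a hypothetical helper, and the TOML value is assumed to have been read already.

```shell
# Hypothetical helper illustrating the documented rule: a variable set in the
# system environment wins over the value given under [env] in the TOML config.
effective_env() {
  name="$1"         # variable name, e.g. NODE_ENV
  toml_value="$2"   # value taken from the [env] table (assumed already parsed)
  # printenv exits non-zero when the variable is not in the environment,
  # in which case the TOML value is printed instead.
  printenv "$name" || printf '%s\n' "$toml_value"
}

export NODE_ENV=development
effective_env NODE_ENV production   # prints development
```

To set a variable for a single run without exporting it, prefix the command in the usual shell way, for example NODE_ENV=development bricks-buttress.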

macOS GPU memory

On Apple Silicon Macs, the GPU is allowed about 70% of system memory by default. To raise the cap before loading large models:
# Allow up to 128 GB on a 128 GB host
sudo sysctl iogpu.wired_limit_mb=137438

# Restore the default
sudo sysctl iogpu.wired_limit_mb=0
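
For hosts with a different amount of memory, the value can be derived rather than hard-coded. The sketch below is a hypothetical helper, wired_limit_mb, that converts a share of physical memory into a figure for iogpu.wired_limit_mb. Dividing bytes by 1,000,000 is an assumption chosen because it reproduces the 137438 value used above for a 128 GB host.

```shell
# Hypothetical helper: convert a share of physical memory into a value for
# iogpu.wired_limit_mb. The bytes-to-MB divisor (1,000,000) is an assumption
# that matches the 137438 figure shown for a 128 GB host.
wired_limit_mb() {
  total_bytes="$1"   # physical memory in bytes, e.g. from: sysctl -n hw.memsize
  percent="$2"       # share of memory to allow the GPU, 1-100
  echo $(( total_bytes * percent / 100 / 1000000 ))
}

wired_limit_mb 137438953472 100   # prints 137438
wired_limit_mb 137438953472 90    # prints 123695

# On a Mac, the result could then be applied with (not run here):
#   sudo sysctl iogpu.wired_limit_mb="$(wired_limit_mb "$(sysctl -n hw.memsize)" 90)"
```

Choosing a share below 100% leaves some memory for the operating system and other processes; the right share depends on what else runs on the host.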

Verify

When the server starts, it prints a line with a LAN-reachable URL, such as "Visit http://<ip>:2080/status to see status via LAN". Open that URL, or http://localhost:2080/status from the same machine, to load the status dashboard. For each backend (GGML-LLM, GGML-STT, MLX-LLM), the dashboard shows:
  • The list of loaded generators and which ones currently hold an active model context
  • Parallel slot usage and queued requests for STT
  • Recent model-load history and completion / transcription history (collapsible)
A Refresh button on the page reloads the same data on demand. The /status page has no authentication, so run the server only on a trusted LAN.

For machine-readable output, query the JSON endpoints directly:
# Server identity, capabilities, generators, auth status
curl http://localhost:2080/buttress/info

# Live generator/queue/history snapshot (same data as the dashboard)
curl http://localhost:2080/buttress/status
/buttress/info is what Foundation devices read during HTTP fallback discovery — see LAN auto-discovery.
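
In a startup script it can help to wait until the server answers before continuing. The following is a minimal sketch that polls /buttress/info with curl. wait_for_buttress, its default URL, and its retry count are hypothetical choices, and it assumes the endpoint returns a success status once the server is ready.

```shell
# Hypothetical helper: poll /buttress/info until the server responds.
# Assumes the endpoint returns an HTTP success status once the server is up.
wait_for_buttress() {
  base_url="${1:-http://localhost:2080}"   # server base URL
  attempts="${2:-10}"                      # number of tries, one second apart
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f: treat HTTP errors as failures, -s: silent, --max-time: per-try timeout
    if curl -fs --max-time 2 -o /dev/null "$base_url/buttress/info"; then
      echo "server is up at $base_url"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "server did not respond at $base_url" >&2
  return 1
}

# Example (not run here): wait up to 30 seconds for a local server.
#   wait_for_buttress http://localhost:2080 30
```

The function returns a non-zero status on timeout, so it can gate later steps, for example wait_for_buttress && curl http://localhost:2080/buttress/status.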

Next steps

Configuration

Configure generators, caching, and compatibility endpoints.

Workspace binding

Pair the server with a workspace and enable JWT auth.