Installation

The Buttress server is distributed through npm and exposes a single executable, bricks-buttress. It runs on macOS, Linux, and Windows. For GPU acceleration on Linux, install the CUDA or Vulkan drivers before starting the server.

Hardware

Resource	Recommended
GPU	NVIDIA (CUDA), AMD/Intel (Vulkan), or Apple Silicon (Metal)
RAM	At least 2× the size of the largest model you plan to load
Disk	Enough free space in `cache_dir` to hold every model you download
Network	Wired LAN — UDP broadcasts must reach Foundation devices

The server also runs without a GPU, but throughput drops sharply and capability scoring marks the host as a less-preferred backend.
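
The RAM guideline in the table can be turned into a quick check before loading models. The sketch below is only an illustration: recommended_ram_bytes is a hypothetical helper, not part of bricks-buttress, and it assumes the models are local files whose on-disk size approximates their size once loaded.

```shell
# Hypothetical helper (not part of bricks-buttress): prints twice the size,
# in bytes, of the largest file passed in, following the 2x RAM guideline.
recommended_ram_bytes() {
  local largest=0 size file
  for file in "$@"; do
    size=$(wc -c < "$file")   # portable way to read a file size
    size=$((size))            # drop padding that some wc versions add
    if [ "$size" -gt "$largest" ]; then
      largest=$size
    fi
  done
  echo $((largest * 2))
}
```

Called with the model files you intend to load, it prints a byte count to compare against the memory installed on the host.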

Install from npm

Requires Node.js 22+ (or Bun).

npm install -g @fugood/buttress-server

This installs the bricks-buttress binary on your PATH.

On a fresh Apple Silicon Mac, install with Bun (bun add -g @fugood/buttress-server) for faster cold starts and lower memory overhead.

Run the server

Without a config, the server starts on port 2080 with sensible defaults:

bricks-buttress

With a TOML config:

bricks-buttress --config ./config.toml

To pass the configuration inline, give the same flag a TOML string instead of a path:

bricks-buttress --config '[server]
port = 3000

[[generators]]
type = "ggml-llm"
[generators.model]
repo_id = "ggml-org/gemma-3-270m-qat-GGUF"'

See the configuration reference for the full schema.
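
For anything longer than a few lines, a file is usually easier to edit than an inline string. As a sketch, the block below writes the same example settings shown above to config.toml in the current directory. The file name is only an example, and the launch command is left as a comment because it starts the server.

```shell
# Write the example settings from this page to a file in the current directory.
cat > config.toml <<'EOF'
[server]
port = 3000

[[generators]]
type = "ggml-llm"
[generators.model]
repo_id = "ggml-org/gemma-3-270m-qat-GGUF"
EOF

# Then start the server with the file path:
# bricks-buttress --config ./config.toml
```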

CLI flags

Flag	Description
`-p, --port <port>`	Port to listen on (default: `2080`)
`-c, --config <path|toml>`	Path to a TOML file or an inline TOML string
`-v, --version`	Print the server version
`-h, --help`	Show help

The port resolves in this order: --port flag → [server] port in TOML → default 2080.
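
The precedence can be pictured as a chain of fallbacks. The function below illustrates the documented order only and is not the server's actual code. resolve_port and its two arguments are hypothetical names, with an empty string standing for "not provided".

```shell
# Illustration of the documented order: --port flag, then [server] port, then 2080.
# $1 = value of --port (may be empty), $2 = port from the TOML config (may be empty).
resolve_port() {
  local flag_port="$1" toml_port="$2"
  echo "${flag_port:-${toml_port:-2080}}"
}

resolve_port 4000 3000   # prints 4000: the flag wins
resolve_port "" 3000     # prints 3000: falls back to the TOML value
resolve_port "" ""       # prints 2080: falls back to the default
```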

Environment variables

Variable	Effect
`NODE_ENV`	Set to `development` for verbose logs
`ENABLE_OPENAI_COMPAT_ENDPOINT`	Set to `1` to enable the OpenAI-compat endpoint
`ENABLE_ANTHROPIC_MESSAGES_ENDPOINT`	Set to `1` to enable the Anthropic messages endpoint
`HF_TOKEN`	Hugging Face token for downloading gated models

System environment variables override values set under [env] in your TOML config.
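
A small sketch of that rule follows. It assumes "set" means the variable is present in the process environment, since this page does not say how empty values are treated. effective_value is a hypothetical helper for illustration, not a Buttress command.

```shell
# Hypothetical helper: prints the value that wins for a variable name.
# $1 = variable name, $2 = value given under [env] in the TOML config.
effective_value() {
  local name="$1" toml_value="$2" value
  if value=$(printenv "$name"); then
    printf '%s\n' "$value"        # present in the system environment: it wins
  else
    printf '%s\n' "$toml_value"   # absent from the environment: TOML value applies
  fi
}
```

For instance, with NODE_ENV exported as development in the shell, effective_value NODE_ENV production prints development.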

macOS GPU memory

On Apple Silicon Macs, the GPU is allowed about 70% of system memory by default. To raise the cap before loading large models:

# Allow up to 128 GB on a 128 GB host
sudo sysctl iogpu.wired_limit_mb=137438

# Restore the default
sudo sysctl iogpu.wired_limit_mb=0

Verify

When the server starts, it prints a line with a LAN-reachable URL, such as "Visit http://<ip>:2080/status to see status via LAN". Open that URL, or http://localhost:2080/status from the same machine, to load the status dashboard. For each backend (GGML-LLM, GGML-STT, MLX-LLM), the dashboard shows:

The list of loaded generators and which ones currently hold an active model context
Parallel slot usage and queued requests for STT
Recent model-load history and completion / transcription history (collapsible)

A Refresh button on the page reloads the same data on demand. The /status page has no authentication, so run the server only on a trusted LAN. For machine-readable output, query the JSON endpoints directly:

# Server identity, capabilities, generators, auth status
curl http://localhost:2080/buttress/info

# Live generator/queue/history snapshot (same data as the dashboard)
curl http://localhost:2080/buttress/status

Foundation devices read /buttress/info during HTTP fallback discovery; see LAN auto-discovery.
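
In scripts it can help to wait until the server answers before sending requests. The loop below is a sketch that polls an endpoint such as /buttress/info with curl. wait_for_server, the retry count, and the one-second delay are illustrative choices, not part of Buttress.

```shell
# Poll a URL until it returns a successful HTTP status or attempts run out.
# $1 = URL to probe, $2 = maximum number of attempts (default 10).
wait_for_server() {
  local url="$1" attempts="${2:-10}" i=0
  while [ "$i" -lt "$attempts" ]; do
    # --fail makes curl exit non-zero on HTTP error responses
    if curl --silent --fail --output /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

A typical use would be wait_for_server http://localhost:2080/buttress/info 30 followed by the curl commands above once it returns successfully.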

Next steps

Configuration

Workspace binding