# AI & LLM Integration
omfreebdy includes llama.cpp for running AI models locally with GPU acceleration via Vulkan.
## Why Vulkan?

FreeBSD doesn’t support CUDA natively: CUDA requires Linux kernel modules, and Linux jails can’t access CUDA via the linuxulator (attempts fail with Error 304). Vulkan, by contrast, works natively with NVIDIA GPUs via drm-kmod.
Performance is excellent: ~84 tokens/sec on an RTX 3070 with Qwen3-4B.
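To measure throughput on your own hardware, llama.cpp’s bundled `llama-bench` tool can run a standard benchmark. A sketch, assuming `llama-bench` is installed alongside llama-server and the model path matches the Quick Start below:

```sh
# Benchmark with all layers offloaded to the GPU (-ngl 99).
# Reports prompt-processing (pp) and token-generation (tg) tokens/sec.
llama-bench \
  -m /usr/local/share/llama/models/Qwen3-4B-Q4_K_M.gguf \
  -ngl 99
```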
## Architecture

```
FreeBSD Host
├── llama-server (rc service, port 8080)
│   ├── Uses Vulkan backend for GPU acceleration
│   ├── Serves OpenAI-compatible API
│   └── Models: /usr/local/share/llama/models/*.gguf
│
└── Linux Jail (optional, for hermes-agent)
    ├── Ubuntu rootfs via linuxulator
    ├── hermes-agent installation
    └── Connects to llama-server via localhost:8080
```

## Quick Start

### Download a Model
```sh
# Create model directory
sudo mkdir -p /usr/local/share/llama/models
sudo chown -R nobody:nobody /usr/local/share/llama

# Download Qwen3 4B (~2.3GB, fits on 8GB VRAM)
fetch -o /usr/local/share/llama/models/Qwen3-4B-Q4_K_M.gguf \
  "https://huggingface.co/Qwen/Qwen3-4B-GGUF/resolve/main/Qwen3-4B-Q4_K_M.gguf"
```

Ensure you have sufficient GPU VRAM: the Q4_K_M quantization requires approximately 4GB.
### Configure the Service

```sh
sudo sysrc llama_server_enable=YES
sudo sysrc llama_server_model=/usr/local/share/llama/models/Qwen3-4B-Q4_K_M.gguf
sudo sysrc llama_server_args="--host 0.0.0.0 --port 8080 --device Vulkan0 -ngl 99"
sudo sysrc llama_server_user=nobody
```

### Start the Server

```sh
sudo service llama-server start
```

### Test the API
```sh
# Health check
curl http://localhost:8080/health

# Test inference
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3","messages":[{"role":"user","content":"Hello!"}]}'
```
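The endpoint also supports OpenAI-style streaming. As a sketch, adding `"stream": true` makes the server return the reply incrementally as server-sent events:

```sh
# Stream tokens as they are generated; -N disables curl's output buffering
# so each SSE chunk (lines prefixed with "data:") prints immediately.
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3","stream":true,"messages":[{"role":"user","content":"Hello!"}]}'
```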
## Service Configuration

llama-server is configured via `/etc/rc.conf`:
| Variable | Description | Example |
|---|---|---|
| `llama_server_enable` | Enable the service | `YES` |
| `llama_server_model` | Path to GGUF model | `/usr/local/share/llama/models/Qwen3-4B-Q4_K_M.gguf` |
| `llama_server_args` | Server arguments | `--host 0.0.0.0 --port 8080 --device Vulkan0 -ngl 99` |
| `llama_server_user` | User to run as | `nobody` |
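Applied together, the `sysrc` commands from the Quick Start leave a block like the following in `/etc/rc.conf` (shown for reference; the model path is the Qwen3 example):

```sh
# /etc/rc.conf (llama-server section)
llama_server_enable="YES"
llama_server_model="/usr/local/share/llama/models/Qwen3-4B-Q4_K_M.gguf"
llama_server_args="--host 0.0.0.0 --port 8080 --device Vulkan0 -ngl 99"
llama_server_user="nobody"
```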
### Key Arguments

| Argument | Description |
|---|---|
| `--device Vulkan0` | Use the first Vulkan GPU |
| `-ngl 99` | Offload all layers to the GPU (99 is more than any model has) |
| `--host 0.0.0.0` | Listen on all interfaces |
| `--port 8080` | API port |
## hermes-agent

hermes-agent is an AI agent framework that can use llama-server as its backend. Since it requires Linux, we run it in a Linux jail.

### Create the Linux Jail

Using Sylve (or manually with `jail.conf`):
```sh
sylve jail create hermes-agent \
  --type linux \
  --distro ubuntu \
  --release jammy \
  --ip-inherit
```

### Install hermes-agent
```sh
# Enter the jail
sudo jexec $(jls -j hermes-agent jid) bash

# Install dependencies
apt update && apt install -y python3 python3-venv python3-pip git curl

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"

# Clone and install
cd /opt
git clone https://github.com/NousResearch/hermes-agent.git
cd hermes-agent
python3 -m venv venv
source venv/bin/activate
pip install -e .
```

### Configure hermes-agent
Create `~/.hermes/config.yaml` in the jail:
```yaml
model:
  provider: custom
  default: qwen3
  base_url: http://localhost:8080/v1

custom_providers:
  - name: llama-cpp
    base_url: http://localhost:8080/v1
    api_key: llama-cpp

terminal:
  backend: local
  cwd: /root
```

Create `~/.hermes/.env`:
```sh
OPENAI_API_KEY=llama-cpp
OPENAI_BASE_URL=http://localhost:8080/v1
```

### Test hermes-agent
```sh
cd /opt/hermes-agent
source venv/bin/activate
hermes chat -q "Hello, what model are you?"
```

## Helper Script

Create a wrapper script to auto-start everything:
```sh
#!/bin/sh
# ~/bin/hermes - Start llama-server and run hermes-agent

JAIL_HOSTNAME="hermes-agent"
LLAMA_URL="http://localhost:8080/health"
TIMEOUT=60

# Ensure llama-server is running
if ! service llama-server status >/dev/null 2>&1; then
    echo "Starting llama-server..."
    sudo service llama-server start
fi

# Wait for ready
for i in $(seq 1 $TIMEOUT); do
    if curl -sf "$LLAMA_URL" >/dev/null 2>&1; then
        break
    fi
    sleep 1
done

# Get jail name and run hermes
JAIL_NAME=$(jls name host.hostname | awk -v h="$JAIL_HOSTNAME" '$2 == h {print $1}' | head -1)
exec sudo jexec "$JAIL_NAME" bash -c 'cd /opt/hermes-agent && source venv/bin/activate && hermes chat "$@"' -- "$@"
```
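Once the script is saved (e.g. to `~/bin/hermes`, assuming `~/bin` is on your `PATH`), make it executable and call it directly; arguments are passed straight through to `hermes chat`:

```sh
chmod +x ~/bin/hermes
hermes -q "Hello, what model are you?"
```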
## Model Options

Popular GGUF models from HuggingFace:
| Model | Size | VRAM | Notes |
|---|---|---|---|
| Qwen3-4B-Q4_K_M | 2.3GB | ~4GB | Good balance of quality/speed |
| Qwen3-4B-Q5_K_M | 2.9GB | ~5GB | Slightly higher quality |
| Qwen3-4B-Q8_0 | 4.3GB | ~6GB | Best quality for 4B |
| Llama-3.1-8B-Q4_K_M | 4.9GB | ~6GB | Larger, more capable |
## Troubleshooting

- Check the log: `tail /var/log/llama-server.log`
- Verify the model exists: `ls -la /usr/local/share/llama/models/`
- Check Vulkan: `vulkaninfo | head -20`
- Check the GPU: `nvidia-smi`
- Verify GPU offload in the logs: look for “offloaded X/X layers to GPU”
- Confirm the Vulkan device: the log should show “using device Vulkan0”
- Monitor GPU usage: `nvidia-smi`
- Check llama-server is running: `curl http://localhost:8080/health`
- Verify the jail has network access: `jexec <JID> curl http://localhost:8080/health`
- Check that the jail uses ip_inherit or has networking configured
- Verify llama-server is listening: `sockstat -4 | grep 8080`
## File Locations

| File | Purpose |
|---|---|
| `/usr/local/share/llama/models/*.gguf` | Model files |
| `/var/log/llama-server.log` | Server logs |
| `/var/run/llama_server.pid` | PID file |
| `/usr/local/etc/rc.d/llama-server` | RC script |
| `~/.hermes/config.yaml` (in jail) | hermes-agent config |