# AI & LLM Integration
omfreebdy includes llama.cpp for running AI models locally with GPU acceleration via Vulkan.
## Why Vulkan?

FreeBSD doesn’t support CUDA natively: CUDA requires Linux kernel modules, and Linux jails can’t access CUDA via the linuxulator (attempts fail with Error 304). Vulkan, by contrast, works natively with NVIDIA GPUs via drm-kmod.
Performance is excellent: ~84 tokens/sec on an RTX 3070 with Qwen3-4B.
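To measure throughput on your own hardware, llama.cpp’s bundled `llama-bench` tool can run a standard benchmark. A sketch, assuming `llama-bench` is installed alongside llama-server and the model path matches the Quick Start below:

```sh
# Benchmark with all layers offloaded to the GPU (-ngl 99).
# Reports prompt-processing (pp) and token-generation (tg) tokens/sec.
llama-bench \
  -m /usr/local/share/llama/models/Qwen3-4B-Q4_K_M.gguf \
  -ngl 99
```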
## Architecture

```
FreeBSD Host
├── llama-server (rc service, port 8080)
│   ├── Uses Vulkan backend for GPU acceleration
│   ├── Serves OpenAI-compatible API
│   └── Models: /usr/local/share/llama/models/*.gguf
│
└── Linux Jail (optional, for hermes-agent)
    ├── Ubuntu rootfs via linuxulator
    ├── hermes-agent installation
    └── Connects to llama-server via localhost:8080
```

## Quick Start

### Download a Model
```sh
# Create model directory
sudo mkdir -p /usr/local/share/llama/models
sudo chown -R nobody:nobody /usr/local/share/llama

# Download Qwen3 4B (~2.3GB, fits on 8GB VRAM)
fetch -o /usr/local/share/llama/models/Qwen3-4B-Q4_K_M.gguf \
  "https://huggingface.co/Qwen/Qwen3-4B-GGUF/resolve/main/Qwen3-4B-Q4_K_M.gguf"
```

Ensure you have sufficient GPU VRAM: the Q4_K_M quantization requires approximately 4GB.
### Configure the Service

```sh
sudo sysrc llama_server_enable=YES
sudo sysrc llama_server_model=/usr/local/share/llama/models/Qwen3-4B-Q4_K_M.gguf
sudo sysrc llama_server_args="--host 0.0.0.0 --port 8080 --device Vulkan0 -ngl 99"
sudo sysrc llama_server_user=nobody
```

### Start the Server

```sh
sudo service llama-server start
```

### Test the API
```sh
# Health check
curl http://localhost:8080/health

# Test inference
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3","messages":[{"role":"user","content":"Hello!"}]}'
```
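The endpoint also supports OpenAI-style streaming. As a sketch, adding `"stream": true` makes the server return the reply incrementally as server-sent events:

```sh
# Stream tokens as they are generated; -N disables curl's output buffering
# so each SSE chunk (lines prefixed with "data:") prints immediately.
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3","stream":true,"messages":[{"role":"user","content":"Hello!"}]}'
```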
## Service Configuration

llama-server is configured via `/etc/rc.conf`:
| Variable | Description | Example |
|---|---|---|
| `llama_server_enable` | Enable the service | `YES` |
| `llama_server_model` | Path to GGUF model | `/usr/local/share/llama/models/Qwen3-4B-Q4_K_M.gguf` |
| `llama_server_args` | Server arguments | `--host 0.0.0.0 --port 8080 --device Vulkan0 -ngl 99` |
| `llama_server_user` | User to run as | `nobody` |
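Applied together, the `sysrc` commands from the Quick Start leave a block like the following in `/etc/rc.conf` (shown for reference; the model path is the Qwen3 example):

```sh
# /etc/rc.conf (llama-server section)
llama_server_enable="YES"
llama_server_model="/usr/local/share/llama/models/Qwen3-4B-Q4_K_M.gguf"
llama_server_args="--host 0.0.0.0 --port 8080 --device Vulkan0 -ngl 99"
llama_server_user="nobody"
```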
### Key Arguments

| Argument | Description |
|---|---|
| `--device Vulkan0` | Use the first Vulkan GPU |
| `-ngl 99` | Offload all layers to the GPU (99 is more than any model has) |
| `--host 0.0.0.0` | Listen on all interfaces |
| `--port 8080` | API port |
## hermes-agent

hermes-agent is an AI agent framework that can use llama-server as its backend. Since it requires Linux, we run it in a Linux jail.

### Create the Linux Jail

Using Sylve (or manually with `jail.conf`):
```sh
sylve jail create hermes-agent \
  --type linux \
  --distro ubuntu \
  --release jammy \
  --ip-inherit
```

### Install hermes-agent
```sh
# Enter the jail
sudo jexec $(jls -j hermes-agent jid) bash

# Install dependencies
apt update && apt install -y python3 python3-venv python3-pip git curl

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"

# Clone and install
cd /opt
git clone https://github.com/NousResearch/hermes-agent.git
cd hermes-agent
python3 -m venv venv
source venv/bin/activate
pip install -e .
```

### Configure hermes-agent
Create `~/.hermes/config.yaml` in the jail:
```yaml
model:
  provider: custom
  default: qwen3
  base_url: http://localhost:8080/v1

custom_providers:
  - name: llama-cpp
    base_url: http://localhost:8080/v1
    api_key: llama-cpp

terminal:
  backend: local
  cwd: /root
```

Create `~/.hermes/.env`:
```sh
OPENAI_API_KEY=llama-cpp
OPENAI_BASE_URL=http://localhost:8080/v1
```

### Test hermes-agent
```sh
cd /opt/hermes-agent
source venv/bin/activate
hermes chat -q "Hello, what model are you?"
```

## Helper Script

Create a wrapper script to auto-start everything:
```sh
#!/bin/sh
# ~/bin/hermes - Start llama-server and run hermes-agent

JAIL_HOSTNAME="hermes-agent"
LLAMA_URL="http://localhost:8080/health"
TIMEOUT=60

# Ensure llama-server is running
if ! service llama-server status >/dev/null 2>&1; then
    echo "Starting llama-server..."
    sudo service llama-server start
fi

# Wait for ready
for i in $(seq 1 $TIMEOUT); do
    if curl -sf "$LLAMA_URL" >/dev/null 2>&1; then
        break
    fi
    sleep 1
done

# Get jail name and run hermes
JAIL_NAME=$(jls name host.hostname | awk -v h="$JAIL_HOSTNAME" '$2 == h {print $1}' | head -1)
exec sudo jexec "$JAIL_NAME" bash -c 'cd /opt/hermes-agent && source venv/bin/activate && hermes chat "$@"' -- "$@"
```
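Once the script is saved (e.g. to `~/bin/hermes`, assuming `~/bin` is on your `PATH`), make it executable and call it directly; arguments are passed straight through to `hermes chat`:

```sh
chmod +x ~/bin/hermes
hermes -q "Hello, what model are you?"
```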
## Model Options

Popular GGUF models from HuggingFace:
| Model | Size | VRAM | Notes |
|---|---|---|---|
| Qwen3-4B-Q4_K_M | 2.3GB | ~4GB | Good balance of quality/speed |
| Qwen3-4B-Q5_K_M | 2.9GB | ~5GB | Slightly higher quality |
| Qwen3-4B-Q8_0 | 4.3GB | ~6GB | Best quality for 4B |
| Llama-3.1-8B-Q4_K_M | 4.9GB | ~6GB | Larger, more capable |
## Troubleshooting

- Check the log: `tail /var/log/llama-server.log`
- Verify the model exists: `ls -la /usr/local/share/llama/models/`
- Check Vulkan: `vulkaninfo | head -20`
- Check the GPU: `nvidia-smi`
- Verify GPU offload in the logs: look for “offloaded X/X layers to GPU”
- Confirm the Vulkan device: the log should show “using device Vulkan0”
- Monitor GPU usage: `nvidia-smi`
- Check llama-server is running: `curl http://localhost:8080/health`
- Verify the jail has network access: `jexec <JID> curl http://localhost:8080/health`
- Check that the jail uses ip_inherit or has networking configured
- Verify llama-server is listening: `sockstat -4 | grep 8080`
## File Locations

| File | Purpose |
|---|---|
| `/usr/local/share/llama/models/*.gguf` | Model files |
| `/var/log/llama-server.log` | Server logs |
| `/var/run/llama_server.pid` | PID file |
| `/usr/local/etc/rc.d/llama-server` | RC script |
| `~/.hermes/config.yaml` (in jail) | hermes-agent config |