AI & LLM Integration

omfreebdy includes llama.cpp for running AI models locally with GPU acceleration via Vulkan.

Performance is excellent: ~84 tokens/sec on an RTX 3070 with Qwen3-4B.

FreeBSD Host
├── llama-server (rc service, port 8080)
│   ├── Uses Vulkan backend for GPU acceleration
│   ├── Serves OpenAI-compatible API
│   └── Models: /usr/local/share/llama/models/*.gguf
└── Linux Jail (optional, for hermes-agent)
    ├── Ubuntu rootfs via linuxulator
    ├── hermes-agent installation
    └── Connects to llama-server via localhost:8080

Step 1: Download a Model

# Create model directory
sudo mkdir -p /usr/local/share/llama/models
sudo chown -R nobody:nobody /usr/local/share/llama
# Download Qwen3 4B (~2.3GB, fits on 8GB VRAM)
fetch -o /usr/local/share/llama/models/Qwen3-4B-Q4_K_M.gguf \
"https://huggingface.co/Qwen/Qwen3-4B-GGUF/resolve/main/Qwen3-4B-Q4_K_M.gguf"

Step 2: Configure the Service

sudo sysrc llama_server_enable=YES
sudo sysrc llama_server_model=/usr/local/share/llama/models/Qwen3-4B-Q4_K_M.gguf
sudo sysrc llama_server_args="--host 0.0.0.0 --port 8080 --device Vulkan0 -ngl 99"
sudo sysrc llama_server_user=nobody
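
The sysrc commands above simply write variables into /etc/rc.conf; after running them, that file should contain entries like the following fragment:

```shell
# /etc/rc.conf entries written by the sysrc commands above
llama_server_enable="YES"
llama_server_model="/usr/local/share/llama/models/Qwen3-4B-Q4_K_M.gguf"
llama_server_args="--host 0.0.0.0 --port 8080 --device Vulkan0 -ngl 99"
llama_server_user="nobody"
```
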

Step 3: Start the Server

sudo service llama-server start

Step 4: Test the API

# Health check
curl http://localhost:8080/health
# Test inference
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen3","messages":[{"role":"user","content":"Hello!"}]}'

The llama-server is configured via /etc/rc.conf:

Variable              Description          Example
llama_server_enable   Enable the service   YES
llama_server_model    Path to GGUF model   /usr/local/share/llama/models/Qwen3-4B-Q4_K_M.gguf
llama_server_args     Server arguments     --host 0.0.0.0 --port 8080 --device Vulkan0 -ngl 99
llama_server_user     User to run as       nobody

Argument          Description
--device Vulkan0  Use the first Vulkan GPU
-ngl 99           Offload all layers to the GPU (99 is more than any model has)
--host 0.0.0.0    Listen on all interfaces
--port 8080       API port

hermes-agent Setup (Optional)

hermes-agent is an AI agent framework that can use llama-server as its backend. Since it requires Linux, we run it in a Linux jail.

Step 1: Create the Linux Jail

Using Sylve (or manually with jail.conf):

sylve jail create hermes-agent \
--type linux \
--distro ubuntu \
--release jammy \
--ip-inherit

Step 2: Install hermes-agent

# Enter the jail
sudo jexec $(jls -j hermes-agent jid) bash
# Install dependencies
apt update && apt install -y python3 python3-venv python3-pip git curl
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"
# Clone and install
cd /opt
git clone https://github.com/NousResearch/hermes-agent.git
cd hermes-agent
python3 -m venv venv
source venv/bin/activate
pip install -e .

Step 3: Configure hermes-agent

Create ~/.hermes/config.yaml in the jail:

model:
  provider: custom
  default: qwen3
  base_url: http://localhost:8080/v1
  custom_providers:
    - name: llama-cpp
      base_url: http://localhost:8080/v1
      api_key: llama-cpp

terminal:
  backend: local
  cwd: /root

Create ~/.hermes/.env:

OPENAI_API_KEY=llama-cpp
OPENAI_BASE_URL=http://localhost:8080/v1

Step 4: Test hermes-agent

cd /opt/hermes-agent
source venv/bin/activate
hermes chat -q "Hello, what model are you?"

Create a wrapper script to auto-start everything:

#!/bin/sh
# ~/bin/hermes - Start llama-server and run hermes-agent
JAIL_HOSTNAME="hermes-agent"
LLAMA_URL="http://localhost:8080/health"
TIMEOUT=60

# Ensure llama-server is running
if ! service llama-server status >/dev/null 2>&1; then
    echo "Starting llama-server..."
    sudo service llama-server start
fi

# Wait for the server to become ready
for i in $(seq 1 "$TIMEOUT"); do
    if curl -sf "$LLAMA_URL" >/dev/null 2>&1; then
        break
    fi
    sleep 1
done

# Resolve the jail name from its hostname and run hermes inside it
JAIL_NAME=$(jls name host.hostname | awk -v h="$JAIL_HOSTNAME" '$2 == h {print $1}' | head -1)
exec sudo jexec "$JAIL_NAME" bash -c 'cd /opt/hermes-agent && source venv/bin/activate && hermes chat "$@"' -- "$@"

Popular GGUF models from HuggingFace:

Model                Size    VRAM   Notes
Qwen3-4B-Q4_K_M      2.3GB   ~4GB   Good balance of quality/speed
Qwen3-4B-Q5_K_M      2.9GB   ~5GB   Slightly higher quality
Qwen3-4B-Q8_0        4.3GB   ~6GB   Best quality for 4B
Llama-3.1-8B-Q4_K_M  4.9GB   ~6GB   Larger, more capable
llama-server won't start
  1. Check the log: tail /var/log/llama-server.log
  2. Verify model exists: ls -la /usr/local/share/llama/models/
  3. Check Vulkan: vulkaninfo | head -20
  4. Check GPU: nvidia-smi
Slow inference
  1. Verify GPU offload in logs: look for "offloaded X/X layers to GPU"
  2. Confirm Vulkan device: should show "using device Vulkan0"
  3. Monitor GPU usage: nvidia-smi
Jail can't reach llama-server
  1. Check llama-server is running: curl http://localhost:8080/health
  2. Verify jail has network: jexec <JID> curl http://localhost:8080/health
  3. Check jail uses ip_inherit or has network configured
  4. Verify llama-server is listening: sockstat -4 | grep 8080
File                                  Purpose
/usr/local/share/llama/models/*.gguf  Model files
/var/log/llama-server.log             Server logs
/var/run/llama_server.pid             PID file
/usr/local/etc/rc.d/llama-server      RC script
~/.hermes/config.yaml (in jail)       hermes-agent config