Choosing LLM models for Blackbox

When we started setting up Blackbox, our self-hosted AI workstation, one of the biggest questions was:
Which local large language models (LLMs) should we run?

With so many models available — from tiny 1B parameter experiments to heavyweight 70B giants — the challenge was finding a perfect fit:

  • Fast enough for everyday use (on CPU only — no fancy GPU here)
  • Light enough not to crush our hardware
  • Smart enough to feel "alive" in conversation
  • Private enough to never leave our home server

After weeks of digging, pulling, testing, and chatting, we built a small but powerful model stack tailored to Blackbox's hardware and our personal needs.


🖥️ Hardware Constraints (aka "Why Not 70B?")

Our device, Blackbox, is a Lenovo ThinkCentre M710q (SFF) upgraded with:

  • Intel i5-6500T CPU (quad-core)
  • 16GB RAM
  • Samsung 980 500GB NVMe
  • 1TB Crucial SATA SSD for media

There’s no dedicated GPU.

This means:

  • GPU-accelerated models were out
  • RAM-hungry models (like 13B, 34B, 70B) were impractical
  • Efficiency and small size were absolutely critical

The sweet spot ended up being models between 2B and 7B parameters — especially ones optimized for quantization (like 4-bit versions).
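
As a back-of-the-envelope check (our own approximation; actual usage varies with quantization format, context length, and runtime overhead), a 4-bit quantized model needs roughly half a byte per parameter for weights, plus a buffer for the KV cache and the runtime itself:

```python
# Rough RAM estimate for a 4-bit quantized model on CPU.
# The overhead figure is a ballpark assumption, not a measured number.

def estimate_ram_gb(params_billion: float, bits: int = 4, overhead_gb: float = 1.0) -> float:
    """Weights at `bits` per parameter, plus a flat allowance for
    KV cache, context buffers, and the inference runtime."""
    weights_gb = params_billion * bits / 8
    return weights_gb + overhead_gb

for name, size in [("Gemma 2B", 2.0), ("Phi 2.7B", 2.7), ("Hermes 3 3B", 3.0), ("Mistral 7B", 7.0)]:
    print(f"{name}: ~{estimate_ram_gb(size):.1f} GB")

# Mistral 7B lands around 4.5 GB, comfortable inside 16 GB of RAM.
# A 70B model at 4-bit would want roughly 36 GB and simply doesn't fit.
print(f"70B model: ~{estimate_ram_gb(70):.1f} GB")
```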


🌟 The Models We Chose

| Model | Size | Strengths | Weaknesses | Relative Speed | Best For |
| --- | --- | --- | --- | --- | --- |
| Hermes 3:3B | 3B params | Witty, human-like conversations | Can ramble or repeat when poorly prompted | ⚡⚡⚡ | Daily chatting, journaling |
| Mistral 7B Instruct | 7B params | Logical, structured reasoning | Slower on CPU | ⚡⚡ | Problem solving, technical tasks |
| Gemma 2B | 2B params | Blazing fast, simple answers | Robotic tone | ⚡⚡⚡⚡ | Scripting, idea generation |
| Phi 2.7B | 2.7B params | Very lightweight and concise | Hallucinates facts, overconfident tone | ⚡⚡⚡ | Fact checking, quick help |

🌍 Where They Come From

A little about the creators:

  • Hermes 3:3B — Fine-tuned by Nous Research, famous for making open-weight conversational models that feel surprisingly human.
  • Mistral 7B Instruct — Released by Mistral AI, a French AI company whose models aim to be ultra-capable at small sizes.
    (Mistral 7B is widely regarded as one of the strongest 7B-class open models available.)
  • Gemma 2B — Released by Google DeepMind, focused on extreme model efficiency without giving up safety guardrails.
  • Phi 2.7B — Built by Microsoft Research, trained mainly on synthetic textbook-quality data and code to be tight, smart, and CPU-friendly.

⚙️ Pulling the Models

Thanks to Ollama, installing models was dead simple:

ollama pull hermes3:3b
ollama pull mistral:latest
ollama pull gemma:2b
ollama pull phi:2.7b

Each command downloads the model and stores it locally under `/usr/share/ollama/.ollama/models` (or wherever your Ollama installation manages its models).  

From there, the models are instantly available for serving via Open WebUI or the Ollama API.
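
If you want to talk to a model outside the UI, the Ollama HTTP API is just as simple. Here's a minimal sketch in Python, assuming Ollama is listening on its default port (11434):

```python
# Minimal sketch: ask a locally pulled model a question via the Ollama HTTP API.
# Assumes the default endpoint at http://localhost:11434.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "hermes3:3b",    # any of the models pulled above
        "prompt": "Summarize why small local models suit CPU-only hardware.",
        "stream": False,          # return one JSON object instead of a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```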


🏎️ Speed vs Intelligence Tradeoff

One of the biggest insights during testing was that speed is not optional.
If a model takes more than a minute to answer basic questions, the experience feels broken.

Here’s a rough feel based on CPU-only usage:

| Model | Average Response Time (first token) |
| --- | --- |
| Gemma 2B | 0.8 seconds |
| Phi 2.7B | 1.2 seconds |
| Hermes 3:3B | 1.5 seconds |
| Mistral 7B | 3–4 seconds |

So in practice:

  • Gemma is fast but often sounds robotic.
  • Hermes hits the perfect balance of natural speed and conversation flow.
  • Mistral feels slow for daily chatting, but shines when deep logical reasoning is needed.
  • Phi 2.7B is blazingly fast, but it hallucinates too often to be useful. For example, when we asked it to describe a leopard gecko, it described a cryptid that was half leopard, half gecko!
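
Those timings are informal, stopwatch-style numbers from our own chats. If you want to reproduce something similar on your own hardware, a small timing sketch against Ollama's streaming API looks like this (our own helper, not part of Ollama):

```python
# Rough "time to first token" check against the Ollama streaming API.
# The streaming endpoint returns newline-delimited JSON chunks, so the
# arrival of the first non-empty line is a decent proxy for the first token.
import json
import time
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint

def first_token_latency(model: str, prompt: str = "What day comes after Tuesday?") -> float:
    start = time.perf_counter()
    with requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
        timeout=300,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:                      # first chunk received
                json.loads(line)          # each chunk is a small JSON object
                return time.perf_counter() - start
    return float("nan")

for model in ("gemma:2b", "phi:2.7b", "hermes3:3b", "mistral:latest"):
    first_token_latency(model)            # warm-up run: loads the model into RAM
    print(f"{model}: {first_token_latency(model):.2f}s to first token")
```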

🧪 Use Cases We Matched

We didn’t just want "one model to rule them all."

Instead, we mapped models to flexible roles:

  • Hermes 3:3B → Default sandbox for conversations, journaling, creative drafting, and lightweight research.
  • Mistral 7B Instruct → Technical troubleshooting, direct answers, and fast iterations.
  • Gemma 2B → Quick scripting, simple idea generation, and CPU-friendly experiments.
  • Phi 2.7B → Fact-checking, lookup tasks, and memory-efficient background work.

By switching models on demand inside Open WebUI, we can pick the right brain for the right job — without overloading the system.
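
The same "right brain for the right job" idea also works outside the UI. Here's a tiny sketch of how we think about routing; the role names and the mapping table are purely our own convention:

```python
# Toy role-based router: map a task type to the model best suited for it,
# then send the prompt to that model via the Ollama API.
# ROLE_TO_MODEL is our own convention, not something Ollama defines.
import requests

ROLE_TO_MODEL = {
    "chat": "hermes3:3b",           # conversations, journaling, drafting
    "reasoning": "mistral:latest",  # troubleshooting, technical tasks
    "scripting": "gemma:2b",        # quick scripts, idea generation
    "lookup": "phi:2.7b",           # lightweight background checks
}

def ask(role: str, prompt: str) -> str:
    """Send a prompt to whichever model is registered for this role."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": ROLE_TO_MODEL[role], "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(ask("reasoning", "Why might a systemd service exit with status 203?"))
```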


🚀 Why Local LLMs Are Worth It

Despite the effort, the payoff is huge:

  • Zero cloud reliance
  • Absolute privacy
  • Customization freedom (custom prompts, system behavior; see the sketch after this list)
  • Better cost control (no API keys or subscriptions)
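
That customization can be as small as overriding the system prompt per request. Here's a minimal sketch using Ollama's chat endpoint; the persona text is purely illustrative:

```python
# Sketch: give a local model a custom persona via a system message.
# Uses Ollama's /api/chat endpoint; the persona below is just an example.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "hermes3:3b",
        "messages": [
            {"role": "system", "content": "You are Blackbox, a concise home-lab assistant."},
            {"role": "user", "content": "Which of our local models should handle shell scripting?"},
        ],
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```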

It’s honestly addictive knowing that everything — from casual chats to deep research help — happens entirely inside Blackbox, with no outside servers.

And seeing a model like Hermes 3:3B laugh at jokes or suggest ideas almost instantly feels borderline magical.


✨ Final Thoughts

The world is rushing toward cloud-based AI subscriptions.
But having your own AI brain, running on your hardware, answering to nobody but you — that's the real future.

Blackbox might not have a fancy GPU yet (⚡ coming soon?) but it’s already a personal AI lab that rivals anything mainstream.

And we’re just getting started.