Choosing LLM models for Blackbox

When we started setting up Blackbox, our self-hosted AI workstation, one of the biggest questions was:
Which large language models (LLMs) should we run locally?
With so many models available — from tiny 1B-parameter experiments to heavyweight 70B giants — the challenge was finding models that are:
- Fast enough for everyday use (on CPU only — no fancy GPU here)
- Light enough not to crush our hardware
- Smart enough to feel "alive" in conversation
- Private enough to never leave our home server
After weeks of digging, pulling, testing, and chatting, we built a small but powerful model stack tailored to Blackbox's hardware and our personal needs.
🖥️ Hardware Constraints (aka "Why Not 70B?")
Our device, Blackbox, is a Lenovo ThinkCentre M710q (SFF) upgraded with:
- Intel i5-6500T CPU (quad-core)
- 16GB RAM
- Samsung 980 500GB NVMe
- 1TB Crucial SATA SSD for media
There’s no dedicated GPU.
This means:
- GPU-accelerated models were out
- RAM-hungry models (like 13B, 34B, 70B) were impractical
- Efficiency and small size were absolutely critical
The sweet spot ended up being models between 2B and 7B parameters — especially ones optimized for quantization (like 4-bit versions).
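A quick back-of-the-envelope check makes that cutoff obvious: a model's weights need roughly parameter count × bits per weight ÷ 8 bytes of RAM, plus some overhead for the KV cache and runtime. Here's a minimal sketch of that arithmetic (the 20% overhead factor is our own rough assumption, not a measured number):

```python
# Rough RAM estimate for a quantized model: params * bits / 8, plus runtime overhead.
# The 1.2x overhead factor is a loose assumption for KV cache and runtime buffers.
def estimated_ram_gb(params_billion: float, bits_per_weight: int = 4, overhead: float = 1.2) -> float:
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

for name, size in [("Gemma 2B", 2.0), ("Phi 2.7B", 2.7), ("Hermes 3 3B", 3.0), ("Mistral 7B", 7.0)]:
    print(f"{name}: ~{estimated_ram_gb(size):.1f} GB")
```

At 4-bit quantization, even the 7B model lands around 4GB and fits comfortably in 16GB of RAM alongside the OS; a 70B model would need 40GB+ and simply would not fit.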
🌟 The Models We Chose
| Model | Size | Strengths | Weaknesses | Relative Speed | Best For |
|---|---|---|---|---|---|
| Hermes 3:3B | 3B params | Witty, human-like conversations | Can ramble or repeat when poorly prompted | ⚡⚡⚡ | Daily chatting, journaling |
| Mistral 7B Instruct | 7B params | Logical, structured reasoning | Slower on CPU | ⚡ | Problem solving, technical tasks |
| Gemma 2B | 2B params | Blazing fast, simple answers | Robotic tone | ⚡⚡⚡⚡ | Scripting, idea generation |
| Phi 2.7B | 2.7B params | Very lightweight and concise | Hallucinates facts, overconfident tone | ⚡⚡⚡ | Fact checking, quick help |
🌍 Where They Come From
A little about the creators:
- Hermes 3:3B — Fine-tuned by Nous Research, famous for making open-weight conversational models that feel surprisingly human.
- Mistral 7B Instruct — Released by Mistral AI, a French research group. Mistral models aim to be ultra-capable at small sizes (Mistral 7B is arguably one of the strongest 7B models available right now).
- Gemma 2B — A Google DeepMind project focused on extreme model efficiency without giving up safety guardrails.
- Phi 2.7B — Built by Microsoft Research, trained mainly on synthetic textbook-quality data and code to be tight, smart, and CPU-friendly.
⚙️ Pulling the Models
Thanks to Ollama, installing models was dead simple:
```bash
ollama pull hermes3:3b
ollama pull mistral:latest
ollama pull gemma:2b
ollama pull phi:2.7b
```
Each command downloads the model and stores it locally under `/usr/share/ollama/.ollama/models` (or wherever your Ollama installation manages its models).
From there, the models are instantly available for serving via Open WebUI or the Ollama API.
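Once a model is pulled, you can talk to it over the Ollama HTTP API without any UI at all. A minimal sketch in Python, assuming Ollama is serving on its default port (11434) on the same machine:

```python
import json
import urllib.request

# Ask a locally pulled model a question via Ollama's /api/generate endpoint.
payload = {
    "model": "hermes3:3b",
    "prompt": "In two sentences, why do small quantized models suit CPU-only servers?",
    "stream": False,  # return the whole answer as a single JSON object
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```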
🏎️ Speed vs Intelligence Tradeoff
One of the biggest insights during testing was that speed is not optional.
If a model takes more than a minute to answer basic questions, the experience feels broken.
Here’s a rough feel based on CPU-only usage:
| Model | Average Response Time (first token) |
|---|---|
| Gemma 2B | 0.8 seconds |
| Phi 2.7B | 1.2 seconds |
| Hermes 3:3B | 1.5 seconds |
| Mistral 7B | 3–4 seconds |
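Take these as impressions rather than benchmarks. If you want your own numbers, here's a minimal sketch that streams from the same /api/generate endpoint and times the first token (note that a cold start also pays the model load time, which skews the first run):

```python
import json
import time
import urllib.request

# Measure time-to-first-token by streaming from Ollama's /api/generate endpoint.
# Assumes the default local endpoint used in the earlier example.
def first_token_latency(model: str, prompt: str = "Say hello in one short sentence.") -> float:
    payload = {"model": model, "prompt": prompt, "stream": True}
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.time()
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # the stream is newline-delimited JSON chunks
            if line.strip() and json.loads(line).get("response"):
                return time.time() - start
    return float("nan")

for model in ["gemma:2b", "phi:2.7b", "hermes3:3b", "mistral:latest"]:
    print(f"{model}: {first_token_latency(model):.1f}s to first token")
```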
So in practice:
- Gemma is fast but often sounds robotic.
- Hermes hits the perfect balance of natural speed and conversation flow.
- Mistral feels slow for daily chatting, but shines when deep logical reasoning is needed.
- Phi 2.7B is blazingly fast, but hallucinates too often to be trusted. For example, when we asked it to describe a leopard gecko, it described a cryptid that was half leopard, half gecko!
🧪 Use Cases We Matched
We didn’t just want "one model to rule them all."
Instead, we mapped models to flexible roles:
- Hermes 3:3B → Default sandbox for conversations, journaling, creative drafting, and lightweight research.
- Mistral 7B Instruct → Technical troubleshooting, direct answers, and fast iterations.
- Gemma 2B → Quick scripting, simple idea generation, and CPU-friendly experiments.
- Phi 2.7B → Fact-checking, lookup tasks, and memory-efficient background work.
By switching models on demand inside Open WebUI, we can pick the right brain for the right job — without overloading the system.
🚀 Why Local LLMs Are Worth It
Despite the effort, the payoff is huge:
- Zero cloud reliance
- Absolute privacy
- Customization freedom (custom prompts, system behavior; see the Modelfile sketch after this list)
- Better cost control (no API keys or subscriptions)
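That customization freedom mostly comes down to Ollama Modelfiles: you can bake a system prompt and sampling defaults into a named model of your own. A minimal sketch (the blackbox-chat name and the persona text are placeholders we invented for illustration):

```bash
# Create a custom persona on top of the Hermes base model.
# The model name and system prompt below are illustrative placeholders.
cat > Modelfile <<'EOF'
FROM hermes3:3b
SYSTEM "You are Blackbox, a private home-server assistant. Be concise and never suggest cloud services."
PARAMETER temperature 0.7
EOF

ollama create blackbox-chat -f Modelfile
ollama run blackbox-chat "What can you help me with today?"
```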
It’s honestly addictive knowing that everything — from casual chats to deep research help — happens entirely inside Blackbox, with no outside servers.
And seeing a model like Hermes 3:3B laugh at jokes or suggest ideas almost instantly feels borderline magical.
✨ Final Thoughts
The world is rushing toward cloud-based AI subscriptions.
But having your own AI brain, running on your hardware, answering to nobody but you — that's the real future.
Blackbox might not have a fancy GPU yet (⚡ coming soon?), but it’s already a personal AI lab that rivals anything mainstream.
And we’re just getting started.