Choosing LLM models for Blackbox

When we started setting up Blackbox, our self-hosted AI workstation, one of the biggest questions was:
Which local large language models (LLMs) should we run?

With so many models available — from tiny 1B parameter experiments to heavyweight 70B giants — the challenge was finding a perfect fit:

  • Fast enough for everyday use (on CPU only — no fancy GPU here)
  • Light enough not to crush our hardware
  • Smart enough to feel "alive" in conversation
  • Private enough to never leave our home server

After weeks of digging, pulling, testing, and chatting, we built a small but powerful model stack tailored to Blackbox's hardware and our personal needs.


🖥️ Hardware Constraints (aka "Why Not 70B?")

Our device, Blackbox, is a Lenovo ThinkCentre M710q (SFF) upgraded with:

  • Intel i5-6500T CPU (quad-core)
  • 16GB RAM
  • Samsung 980 500GB NVMe
  • 1TB Crucial SATA SSD for media

There’s no dedicated GPU.

This means:

  • GPU-accelerated models were out
  • RAM-hungry models (like 13B, 34B, 70B) were impractical
  • Efficiency and small size were absolutely critical

The sweet spot ended up being models between 2B and 7B parameters — especially ones optimized for quantization (like 4-bit versions).
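
As a back-of-the-envelope check (our own approximation; actual usage varies with quantization format, context length, and runtime overhead), a 4-bit quantized model needs roughly half a byte per parameter for weights, plus a buffer for the KV cache and the runtime itself:

```python
# Rough RAM estimate for a 4-bit quantized model on CPU.
# The overhead figure is a ballpark assumption, not a measured number.

def estimate_ram_gb(params_billion: float, bits: int = 4, overhead_gb: float = 1.0) -> float:
    """Weights at `bits` per parameter, plus a flat allowance for
    KV cache, context buffers, and the inference runtime."""
    weights_gb = params_billion * bits / 8
    return weights_gb + overhead_gb

for name, size in [("Gemma 2B", 2.0), ("Phi 2.7B", 2.7), ("Hermes 3 3B", 3.0), ("Mistral 7B", 7.0)]:
    print(f"{name}: ~{estimate_ram_gb(size):.1f} GB")

# Mistral 7B lands around 4.5 GB, comfortable inside 16 GB of RAM.
# A 70B model at 4-bit would want roughly 36 GB and simply doesn't fit.
print(f"70B model: ~{estimate_ram_gb(70):.1f} GB")
```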


🌟 The Models We Chose

| Model | Size | Strengths | Weaknesses | Relative Speed | Best For |
| --- | --- | --- | --- | --- | --- |
| Hermes 3:3B | 3B params | Witty, human-like conversations | Can ramble or repeat when poorly prompted | ⚡⚡⚡ | Daily chatting, journaling |
| Mistral 7B Instruct | 7B params | Logical, structured reasoning | Slower on CPU | ⚡⚡ | Problem solving, technical tasks |
| Gemma 2B | 2B params | Blazing fast, simple answers | Robotic tone | ⚡⚡⚡⚡ | Scripting, idea generation |
| Phi 2.7B | 2.7B params | Very lightweight and concise | Hallucinates facts, overconfident tone | ⚡⚡⚡ | Fact checking, quick help |

🌍 Where They Come From

A little about the creators:

  • Hermes 3:3B — Fine-tuned by Nous Research, famous for making open-weight conversational models that feel surprisingly human.
  • Mistral 7B Instruct — Released by Mistral AI, a French AI company whose models aim to be ultra-capable at small sizes.
    (Mistral 7B is widely regarded as one of the strongest 7B-class open models available.)
  • Gemma 2B — Released by Google DeepMind, focused on extreme model efficiency without giving up safety guardrails.
  • Phi 2.7B — Built by Microsoft Research, trained mainly on synthetic textbook-quality data and code to be tight, smart, and CPU-friendly.

⚙️ Pulling the Models

Thanks to Ollama, installing models was dead simple:

ollama pull hermes3:3b
ollama pull mistral:latest
ollama pull gemma:2b
ollama pull phi:2.7b

Each command downloads the model and stores it locally under `/usr/share/ollama/.ollama/models` (or wherever your Ollama installation manages its models).  

From there, the models are instantly available for serving via Open WebUI or the Ollama API.
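
If you want to talk to a model outside the UI, the Ollama HTTP API is just as simple. Here's a minimal sketch in Python, assuming Ollama is listening on its default port (11434):

```python
# Minimal sketch: ask a locally pulled model a question via the Ollama HTTP API.
# Assumes the default endpoint at http://localhost:11434.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "hermes3:3b",    # any of the models pulled above
        "prompt": "Summarize why small local models suit CPU-only hardware.",
        "stream": False,          # return one JSON object instead of a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```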


🏎️ Speed vs Intelligence Tradeoff

One of the biggest insights during testing was that speed is not optional.
If a model takes more than a minute to answer basic questions, the experience feels broken.

Here’s a rough feel based on CPU-only usage:

| Model | Average Response Time (first token) |
| --- | --- |
| Gemma 2B | 0.8 seconds |
| Phi 2.7B | 1.2 seconds |
| Hermes 3:3B | 1.5 seconds |
| Mistral 7B | 3–4 seconds |

So in practice:

  • Gemma is fast but often sounds robotic.
  • Hermes hits the perfect balance of natural speed and conversation flow.
  • Mistral feels slow for daily chatting, but shines when deep logical reasoning is needed.
  • Phi 2.7B is blazingly fast, but it hallucinates too often to be useful. For example, when we asked it to describe a leopard gecko, it described a cryptid that was half leopard, half gecko!
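
Those timings are informal, stopwatch-style numbers from our own chats. If you want to reproduce something similar on your own hardware, a small timing sketch against Ollama's streaming API looks like this (our own helper, not part of Ollama):

```python
# Rough "time to first token" check against the Ollama streaming API.
# The streaming endpoint returns newline-delimited JSON chunks, so the
# arrival of the first non-empty line is a decent proxy for the first token.
import json
import time
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint

def first_token_latency(model: str, prompt: str = "What day comes after Tuesday?") -> float:
    start = time.perf_counter()
    with requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
        timeout=300,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:                      # first chunk received
                json.loads(line)          # each chunk is a small JSON object
                return time.perf_counter() - start
    return float("nan")

for model in ("gemma:2b", "phi:2.7b", "hermes3:3b", "mistral:latest"):
    first_token_latency(model)            # warm-up run: loads the model into RAM
    print(f"{model}: {first_token_latency(model):.2f}s to first token")
```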

🧪 Use Cases We Matched

We didn’t just want "one model to rule them all."

Instead, we mapped models to flexible roles:

  • Hermes 3:3B → Default sandbox for conversations, journaling, creative drafting, and lightweight research.
  • Mistral 7B Instruct → Technical troubleshooting, direct answers, and fast iterations.
  • Gemma 2B → Quick scripting, simple idea generation, and CPU-friendly experiments.
  • Phi 2.7B → Fact-checking, lookup tasks, and memory-efficient background work.

By switching models on demand inside Open WebUI, we can pick the right brain for the right job — without overloading the system.
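
The same "right brain for the right job" idea also works outside the UI. Here's a tiny sketch of how we think about routing; the role names and the mapping table are purely our own convention:

```python
# Toy role-based router: map a task type to the model best suited for it,
# then send the prompt to that model via the Ollama API.
# ROLE_TO_MODEL is our own convention, not something Ollama defines.
import requests

ROLE_TO_MODEL = {
    "chat": "hermes3:3b",           # conversations, journaling, drafting
    "reasoning": "mistral:latest",  # troubleshooting, technical tasks
    "scripting": "gemma:2b",        # quick scripts, idea generation
    "lookup": "phi:2.7b",           # lightweight background checks
}

def ask(role: str, prompt: str) -> str:
    """Send a prompt to whichever model is registered for this role."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": ROLE_TO_MODEL[role], "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(ask("reasoning", "Why might a systemd service exit with status 203?"))
```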


🚀 Why Local LLMs Are Worth It

Despite the effort, the payoff is huge:

  • Zero cloud reliance
  • Absolute privacy
  • Customization freedom (custom prompts, system behavior; see the sketch after this list)
  • Better cost control (no API keys or subscriptions)
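
That customization can be as small as overriding the system prompt per request. Here's a minimal sketch using Ollama's chat endpoint; the persona text is purely illustrative:

```python
# Sketch: give a local model a custom persona via a system message.
# Uses Ollama's /api/chat endpoint; the persona below is just an example.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "hermes3:3b",
        "messages": [
            {"role": "system", "content": "You are Blackbox, a concise home-lab assistant."},
            {"role": "user", "content": "Which of our local models should handle shell scripting?"},
        ],
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```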

It’s honestly addictive knowing that everything — from casual chats to deep research help — happens entirely inside Blackbox, with no outside servers.

And seeing a model like Hermes 3:3B laugh at jokes or suggest ideas almost instantly feels borderline magical.


✨ Final Thoughts

The world is rushing toward cloud-based AI subscriptions.
But having your own AI brain, running on your hardware, answering to nobody but you — that's the real future.

Blackbox might not have a fancy GPU yet (⚡ coming soon?) but it’s already a personal AI lab that rivals anything mainstream.

And we’re just getting started.