I Ran Local LLMs for a Month: The Honest Truth About What Actually Works

📖 You May Also Like:I Switched from ChatGPT to Dee · Cursor vs Claude Code vs Copil

Last month I did something extreme: I cut off cloud AI services entirely and used only local LLMs for all my daily tasks.

Three reasons drove this decision. First, data privacy — every time I pasted company code into ChatGPT, I felt uneasy. Second, network dependency — a few outages had already stalled my workflow. Third, curiosity — can 2026's local models actually get the job done?

After 30 days, here's my verdict: They can fight, but you need to pick your battles. Some scenarios work surprisingly well; others are a complete waste of time.

My Setup

Let me clarify my hardware first, so you know where I'm coming from:

Primary machine: MacBook Pro M3 Max (64GB unified memory)
Backup machine: Custom desktop (RTX 4090 24GB, 64GB DDR5)
Software stack: Ollama + Open WebUI + LM Studio + Continue (VS Code extension)
Models tested: DeepSeek-R1-Distill-Qwen-32B (Q4 quantized), Llama 3.1-70B (Q4), Qwen2.5-32B, CodeLlama-34B, Phi-3-medium, Mistral-Small

Both the M3 Max and RTX 4090 are top-tier consumer hardware. If they struggle, regular users won't stand a chance.

Scenario 1: Code Completion & Generation ✅ Works with Caveats

This is the most mature use case for local models.

What Works Well

Using the Continue extension with Ollama gives a decent Copilot-like experience in VS Code. I mainly used the distilled DeepSeek-R1 (32B, Q4 quantization):

Inline completion: Single-line and short function completions work well. 1-2 second latency — slower than Copilot, but acceptable.
Bug fixing: Pasting error messages for analysis. Most common errors are correctly identified.
Simple refactoring: Extracting functions, renaming variables, and splitting modules — solid performance.

Where It Falls Apart

But complex scenarios quickly expose the limits:

Cross-file context: Local models have limited context windows. Asking "trace the data flow for this feature across the entire project" — the 32B model hallucinated badly at 4K context. Even at 8K it couldn't keep details straight. Qwen2.5-72B was better, but running it at Q4 on the RTX 4090 was painfully slow.
New tech stacks: I needed help with a Rust library released two months ago. The local model flat-out said "I don't know this library" — its training data didn't include it.
Major rewrites: Converting a module from sync to async Rust took 5 rounds of back-and-forth with the model because its output wouldn't compile. I would have finished faster just writing it myself.

Verdict

Code completion: Highly recommended. Works great with Continue Copilot-Local. Free, no privacy concerns.
Complex refactoring: Don't torture yourself. For cross-file changes, new libraries, or deep business logic — just use a cloud model.

Scenario 2: Document Analysis & Summarization ⚠️ Manage Your Expectations

This is often hyped as "deploy local LLMs to analyze 100-page documents with one click."

Real Experience

Using Open WebUI with Qwen2.5-32B, I tried:

A 45-page PDF contract
A 12-page research paper
A ~2000-line codebase

The PDF contract: Half failure. Local OCR and PDF parsing were problematic — the model received garbled text order. Worse, the 32B model started "forgetting" early content when processing long text. When I asked "what's clause 2 on page 3?", its answer referenced text that wasn't on page 3 or page 7.

The research paper: Barely usable. 12 pages was within the model's range, but its summary read like "pick one sentence from each paragraph" — mechanical, lacking real understanding and synthesis. The gap with GPT-5.5's summary quality was noticeable.

The codebase: Failed. The model couldn't keep the overall structure of 2000 lines in context. When I asked "where's the main entry point?", it pointed to a wrong file. I ended up manually feeding key files one by one, which worked better but was labor-intensive.

Optimization Attempt

I tried RAG (Retrieval-Augmented Generation) with ChromaDB for vector storage and a local model for embeddings:

Text → Chunking (512 tokens) → Embedding → Vector Search → Relevant Chunks + Query → LLM Answer

This improved results noticeably, but the setup complexity skyrocketed. The two days I spent configuring RAG would have been enough to just read those three documents.

Verdict

Short documents (<10 pages): Local models can handle this, but quality lags behind cloud.
Long documents: Don't bother unless you're willing to set up a full RAG pipeline.
Code analysis: File-by-file analysis works; holistic analysis doesn't.

Scenario 3: Daily Q&A & Brainstorming ✅✅ This Is Surprisingly Good

This was the most unexpected win for local models.

Why It Works

Daily Q&A has specific characteristics:
- Doesn't require a vast knowledge base — local models handle common-sense questions fine
- Low latency (local inference, no network round-trip)
- No content restrictions (you know what I mean)

With DeepSeek-R1-32B and Llama 3.1-70B:
- "Brainstorm name ideas for this product feature"
- "Rewrite this paragraph more concisely"
- "What are the pros and cons of different approaches to this problem?"

In these scenarios, I'd say the quality gap with cloud models is within 10%, but the benefits are completely free, zero latency, and ask whatever you want.

One underrated advantage: at 2 AM when you're deep in code, local models respond consistently. No API rate limiting during peak hours.

System Prompt Tuning

I found local models rely much more on system prompts than cloud models. A good system prompt can double output quality.

My current default:

You are my work assistant. If you're not sure about something, say "I'm not sure" instead of guessing.
Keep answers concise and direct. When I need code, give me code, not a description of the approach.

Adding this cut Llama 3.1's verbosity by about 60%.

Scenario 4: Translation & Proofreading ✅ Another Strong Use Case

Chinese-English translation is a sweet spot for local models. I don't need deep contextual understanding — just faithful translation.

Side-by-Side Test

I had both DeepSeek-R1-32B (local) and GPT-5.5 (cloud) translate the same technical passage:

Original:

"The system employs a federated learning paradigm where model gradients are aggregated across distributed nodes without exposing raw training data, ensuring differential privacy guarantees."

GPT-5.5 (CN):

"该系统采用联邦学习范式，在分布式节点之间聚合模型梯度而不暴露原始训练数据，确保差分隐私保证。"

Local DeepSeek-R1-32B (CN):

"系统采用了联邦学习范式——模型梯度在分布式节点之间聚合，原始训练数据不会暴露，从而提供差分隐私保障。"

Honestly, the local version reads more naturally with that well-placed em-dash. The quality gap is negligible.

Batch Translation

I wrote a script to batch-translate Markdown files via Ollama's API — 100 paragraphs in about 7 minutes, with good enough quality. Doing the same with cloud API would cost $20-30 monthly.

Scenario 5: Real-time Information & Web Search ❌ Don't Even Try

Local LLMs can't access the internet. This is their biggest weakness.

I tried a workaround:
1. Search manually
2. Paste search results to local model
3. Have the model answer based on those results

Every step was tedious and results were inconsistent:
- Search quality varies wildly; the model sometimes gets misled by bad results
- Switching between browser and local model is a terrible UX
- Time-sensitive info (stock prices, weather, breaking news) is completely out of reach

Verdict: Stick with cloud for real-time queries. Local models have zero advantage here.

Performance Benchmarks

Here are real-world numbers from my M3 Max (64GB):

Model	Parameters	Quantization	Speed	VRAM Usage
DeepSeek-R1-Distill-Qwen-32B	32B	Q4_K_M	18-22 tok/s	~20GB
Llama 3.1-70B	70B	Q4_K_M	6-8 tok/s	~42GB
Qwen2.5-32B	32B	Q4_K_M	20-25 tok/s	~19GB
CodeLlama-34B	34B	Q4_K_M	17-20 tok/s	~20GB
Phi-3-medium	14B	Q4_K_M	35-40 tok/s	~9GB
Mistral-Small	22B	Q4_K_M	25-30 tok/s	~14GB

Note: Running a 70B model on the M3 Max requires closing all other apps. Speed is noticeably slow. For daily use, I recommend the 32B range.

The RTX 4090 showed similar numbers, but with better CUDA optimization, speeds were about 5-10% faster.

My Recommended Stack

After a month of experimentation, here's my daily setup:

MacBook Pro (travel/office):
- Ollama + Open WebUI (great UI, highly recommend)
- DeepSeek-R1-32B or Qwen2.5-32B (Q&A and writing)
- Continue VS Code extension + CodeLlama (code completion)
- Phi-3 (quick tasks, fast response)

Desktop (lab/development):
- Ollama (better multi-model management than LM Studio)
- Llama 3.1-70B (when quality matters most)
- ChromaDB + local embeddings (RAG for long docs)

This combo for a month:
- API bill: from ~$150 to $0
- Privacy concerns: completely gone
- User experience: acceptable for most scenarios

Final Verdict: Who Should (and Shouldn't) Use Local LLMs

✅ Good Fit

Privacy-conscious users — company code, sensitive data, personal information stays local
Heavy users — hundreds of requests daily. Local is free.
Unreliable internet — on flights, subways, remote areas
Tinkerers — setting up RAG, tuning parameters, testing models is fun
Programming beginners — local code completion is good enough, no extra cost

❌ Bad Fit

Quality chasers — cloud models still win on every metric for best-in-class answers
Non-technical users — installing Ollama, downloading models, tuning parameters is too complex
Anyone needing fresh information — real-time data, latest news, online docs
Laptop commuters — 16GB MacBook Air will struggle badly with 32B models
Anyone who values time over money — the setup and debugging cost often exceeds API fees

What I Learned

After the month-long experiment, I didn't abandon cloud models entirely. My strategy is now hybrid:

Code completion, daily Q&A, translation, document summaries → Local models
Complex refactoring, long-form analysis, web search, creative writing → Cloud models
Sensitive data, offline environments → Local models only

This combo cut my monthly API bill from ~$150 to near zero, without significantly sacrificing core experience.

2026's local LLMs have reached "usable" status. They're not cloud model replacements — they're excellent complements, especially for privacy and cost.

If you want to try, start with Ollama and Qwen2.5-32B. You'll be up and running in ten minutes. Then explore based on your needs. But remember one principle: don't expect local models to handle every scenario. Figure out what you actually need this month.

A month of hard lessons distilled into one sentence: a tool's value depends on the job. Local LLMs are good enough for some things. Admitting their limits is also a form of wisdom.

I Ran Local LLMs for a Month: The Honest Truth About What Actually Works

My Setup

Scenario 1: Code Completion & Generation ✅ Works with Caveats

What Works Well

Where It Falls Apart

Verdict

Scenario 2: Document Analysis & Summarization ⚠️ Manage Your Expectations

Real Experience

Optimization Attempt

Verdict

Scenario 3: Daily Q&A & Brainstorming ✅✅ This Is Surprisingly Good

Why It Works

System Prompt Tuning

Scenario 4: Translation & Proofreading ✅ Another Strong Use Case

Side-by-Side Test

Batch Translation

Scenario 5: Real-time Information & Web Search ❌ Don't Even Try

Performance Benchmarks

My Recommended Stack

Final Verdict: Who Should (and Shouldn't) Use Local LLMs

✅ Good Fit

❌ Bad Fit

What I Learned

Related Articles

💬 Comments