Artificial Intelligence

Deploying Local LLMs for Enterprise: Ollama, vLLM, and RAG Pipelines

Cloud AI APIs are convenient but expensive, rate-limited, and send your data to third parties. Here's how enterprises deploy private LLMs with full data sovereignty — and when it actually makes sense.

13 min readMarch 6, 2026

Filip Golubovic

Founder & Lead Engineer

Why Local LLMs Are No Longer Optional

OpenAI's GPT-4o costs $5 per million input tokens. At enterprise scale — thousands of customer interactions, document processing pipelines, internal knowledge queries — that compounds into six-figure annual API bills. And every request sends your proprietary data through a third-party API. For regulated industries, this is a compliance liability. For competitive businesses, it's a strategic risk.

The open-source LLM landscape has matured dramatically. Meta's Llama 3.1 (405B parameters), Mistral's Mixtral, Alibaba's Qwen 2.5, and DeepSeek-V3 now match or exceed GPT-4 on many benchmarks (Hugging Face Open LLM Leaderboard, 2025). The infrastructure to serve these models — Ollama, vLLM, TGI — has become production-ready. The question has shifted from "Can we run local LLMs?" to "How do we deploy them properly?"

Ollama vs vLLM: Choosing Your Serving Layer

Two frameworks dominate local LLM serving, and they solve different problems:

Ollama: Simplicity First

One-command model downloads and serving — `ollama run llama3.1` and you're live
Built-in model management (pull, list, remove, copy)
OpenAI-compatible API endpoint — drop-in replacement for existing code
Automatic quantization support (GGUF format via llama.cpp)
Runs on consumer hardware — M-series Macs, single GPU workstations
Best for: development, prototyping, small-scale internal tools, teams under 50 users

vLLM: Production Performance

PagedAttention algorithm — 2-4x higher throughput than naive serving (Kwon et al., UC Berkeley, 2023)
Continuous batching for handling concurrent requests efficiently
Tensor parallelism for multi-GPU deployments
OpenAI-compatible API server with streaming support
Supports GPTQ, AWQ, and SqueezeLLM quantization
Best for: production workloads, high-concurrency environments, enterprise deployments serving hundreds of users

Start with Ollama for validation, deploy with vLLM for production. This is the pattern we use with every enterprise client.

Model Selection: Not Bigger, Smarter

Deploying the largest model available is rarely the right choice. The decision depends on task complexity, latency requirements, and available hardware:

**Classification, extraction, routing:** 7-8B models (Llama 3.1 8B, Mistral 7B). Fast, cheap to serve, excellent for structured tasks
**Summarization, Q&A, content generation:** 13-34B models (Qwen 2.5 32B, Mixtral 8x7B). Best balance of quality and cost
**Complex reasoning, code generation, analysis:** 70B+ models (Llama 3.1 70B, DeepSeek-V3). Requires multi-GPU but matches cloud API quality
**Specialized domains:** Fine-tuned smaller models outperform generic large models. A fine-tuned 8B model on your domain data often beats a generic 70B model

Stanford's HELM benchmark (Holistic Evaluation of Language Models) provides task-specific comparisons that inform these choices — generic leaderboard rankings don't capture domain-specific performance.

RAG Architecture: Making LLMs Useful

A raw LLM knows nothing about your business. Retrieval-Augmented Generation (RAG) bridges this gap by injecting relevant context from your data into every query. The architecture has three layers:

Ingestion Pipeline

Document parsing (PDFs, Confluence, Notion, Slack, email) using Unstructured.io or LlamaParse
Intelligent chunking — not fixed-size splits but semantic boundaries (paragraphs, sections, topics)
Embedding generation using models like BGE, E5, or Nomic Embed
Vector storage in Pinecone, Weaviate, Qdrant, or PostgreSQL with pgvector
Metadata extraction for filtering (date, source, department, document type)

Retrieval Layer

Hybrid search combining vector similarity and keyword matching (BM25 + embeddings)
Re-ranking with cross-encoder models (Cohere Rerank, BGE Reranker) for precision
Query expansion and decomposition for complex questions
Contextual compression to fit more relevant information within the context window
Citation tracking — every generated answer links back to source documents

Generation Layer

Prompt engineering with retrieved context, system instructions, and guardrails
Structured output enforcement (JSON mode, function calling)
Hallucination detection through source verification
Response caching for repeated queries (materially reducing compute cost on high-repetition workloads)
Feedback loops for continuous improvement of retrieval quality

The most common RAG failure isn't the LLM — it's the chunking strategy. Poor chunks mean irrelevant retrieval, which means hallucinated answers regardless of model quality.

Infrastructure Requirements

Hardware requirements depend on model size and concurrency needs:

**7-8B models (quantized):** Single NVIDIA A10G or L4 GPU (24GB VRAM). ~$0.50-1.00/hour on AWS. Handles 20-50 concurrent users.
**13-34B models (quantized):** Single A100 40GB or 2x A10G GPUs. ~$1.50-3.00/hour. Handles 50-200 concurrent users.
**70B+ models (quantized):** 2-4x A100 80GB GPUs with tensor parallelism. ~$6-12/hour. Handles 200+ concurrent users.
**Vector database:** Managed Qdrant or pgvector on RDS. 1M documents ≈ 5-10GB storage. ~$100-300/month.
**Embedding inference:** Dedicated GPU instance or CPU inference for smaller embedding models. ~$50-200/month.

AWS SageMaker, Google Cloud Vertex AI, and Azure ML all offer managed GPU instances with auto-scaling. For maximum control, bare-metal providers like Lambda Labs, CoreWeave, or RunPod offer meaningful cost advantages over hyperscalers — the gap varies by GPU tier and commitment, but is consistently material for sustained workloads.

Cost Comparison: Cloud API vs. Self-Hosted

For an enterprise processing 10 million tokens per day (roughly 500 support conversations or 200 document analyses):

**OpenAI GPT-4o:** ~$50/day input + $150/day output = ~$6,000/month
**Anthropic Claude 3.5 Sonnet:** ~$30/day input + $150/day output = ~$5,400/month
**Self-hosted Llama 3.1 70B (2x A100):** ~$4,320/month GPU cost + ~$300/month infrastructure = ~$4,620/month with unlimited throughput and full data control
**Self-hosted Llama 3.1 8B (1x A10G):** ~$720/month all-in, suitable for most focused tasks
**Key insight:** Self-hosting breaks even at roughly 5M tokens/day. Below that, APIs are more economical. Above that, self-hosting compounds savings.

Data Sovereignty and Compliance

For financial services, healthcare, legal, and government clients, data sovereignty isn't optional. The EU AI Act (Regulation 2024/1689) imposes specific obligations on AI deployers regarding transparency, human oversight, and data governance. Self-hosted LLMs address these requirements directly:

Data never leaves your infrastructure — zero third-party data processing
Full audit trails of every query and response (required for GDPR Article 30 records of processing)
Data residency compliance — run models in specific AWS regions to satisfy GDPR, CCPA, or local regulations
Custom data retention policies — delete training data and logs on your schedule, not the vendor's
Penetration testing and security auditing of your AI stack — impossible with cloud APIs
Model versioning and rollback capabilities for reproducible outputs

Production Deployment Checklist

Load testing with realistic traffic patterns before going live
Auto-scaling configuration based on queue depth, not just CPU/GPU utilization
Health checks and automatic restart for model serving processes
Request rate limiting and authentication on API endpoints
Monitoring dashboards: tokens/second, latency percentiles (p50, p95, p99), error rates
Prompt injection guardrails and output filtering
Automated model updates and A/B testing framework
Disaster recovery: model weights in S3/GCS, infrastructure-as-code with Terraform
Cost monitoring alerts — GPU instances left running are expensive mistakes

When to Stay on Cloud APIs

Self-hosting isn't always the answer. Stay on cloud APIs when:

Token volume is under 5M/day — API costs are lower than GPU infrastructure
You need frontier model capabilities (GPT-4o, Claude Opus) that open-source hasn't matched
Your team lacks ML infrastructure experience and doesn't want to hire for it
Speed to market matters more than long-term cost optimization
You're experimenting with multiple models and don't want infrastructure commitment

The pragmatic approach: use cloud APIs for complex reasoning tasks where frontier models excel, self-host smaller models for high-volume structured tasks (classification, extraction, embeddings), and build RAG pipelines that work with both. This hybrid architecture gives you the best of both worlds.

The enterprise LLM stack in 2026 isn't "cloud or local" — it's a deliberate architecture where each model deployment serves the use case it's best suited for, with data sovereignty and cost efficiency as primary constraints.

Explore our AI engineering services — private LLM deployment, RAG systems, AI agents, and governance frameworks built for production.

Building something like this?

Tell us what you're working on.

Scope your project View services See our work