Agrus
Deployment guide

Private LLM: a deployment guide for enterprise teams

How to deploy a private LLM on your own infrastructure — VPC, on-prem, or dedicated cloud — with compliance, cost, and model-selection tradeoffs spelled out. Written by engineers who deploy them.

If you're reading this, someone in your organization has said the words “private LLM” in a meeting and another person has nodded gravely. This guide is for both of them.

1. What is a private LLM, exactly?

A private LLM is a large language model deployed in a way that gives a single organization full control over the prompts, completions, fine-tuning data, and model weights involved in inference. Nothing leaves the organization's perimeter to a shared multi-tenant SaaS vendor.

The category covers three real architectures:

  • Self-hosted, open weights. You run an open-weights model (Llama 3.3, Qwen 2.5, Mistral Large, DeepSeek-V3) on infrastructure you control — your VPC, your on-prem cluster, or a dedicated cloud tenancy. The weights, the inference pipeline, and every byte of prompt data are yours.
  • Dedicated frontier-model tenancy. A vendor (Anthropic, OpenAI, Google, AWS Bedrock) gives you a contractually isolated deployment of their model. The weights stay with the vendor, but your inference pipeline is single-tenant, with audited no-training and no-logging commitments.
  • Hybrid. Most production deployments end up hybrid: an open-source model handles 80-95% of routine, retrieval-augmented queries; a dedicated frontier-model path handles the long tail. Routing is deterministic and audited.
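
To make "deterministic and audited" concrete, here is a minimal sketch of a router, assuming an upstream classifier has already set two flags on each query. The `Query` fields, the route names, and the audit record layout are hypothetical and not a description of any particular product.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    text: str
    needs_long_context: bool = False  # hypothetical flag set upstream
    high_stakes: bool = False         # hypothetical flag set upstream


def route(query: Query) -> str:
    """Pure function of the query: the same input always yields the same route."""
    if query.high_stakes or query.needs_long_context:
        return "dedicated-frontier"
    return "self-hosted-open-weights"


def audit_record(query: Query) -> str:
    """Record the decision with a hash of the prompt, not the prompt itself."""
    record = {
        "prompt_sha256": hashlib.sha256(query.text.encode("utf-8")).hexdigest(),
        "route": route(query),
    }
    return json.dumps(record, sort_keys=True)


routine = Query("Summarise policy section 4.2")
long_tail = Query("Draft an indemnity clause", high_stakes=True)
print(route(routine))    # self-hosted-open-weights
print(route(long_tail))  # dedicated-frontier
print(audit_record(routine))
```

Because `route` has no randomness and no hidden state, replaying the audit log reproduces every routing decision, which is the property an auditor usually wants to see.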

What a private LLM is not: it isn't a flag you toggle on your ChatGPT Enterprise account. It isn't a checkbox on a vendor's pricing page. It's a deployment shape, and shape matters because regulators and CISOs ask questions that checkboxes can't answer.

2. Why your CISO is asking about this

Three pressures show up in nearly every conversation we have with a CIO or CISO across our six verticals (healthcare, insurance, legal, private equity, family offices, corporate intelligence).

2.1 Regulatory pressure

HIPAA, SOC 2, ISO 27001, ABA Model Rule 1.6 (client confidentiality), NAIC model laws for insurance, FINRA rules for broker-dealers, the EU AI Act, Canada's PIPEDA: under each of these regimes and audit frameworks, unencrypted, multi-tenant SaaS AI access to regulated data is hard to defend and is typically flagged as a control failure. The default architecture of ChatGPT or Claude.ai sends prompts to a vendor's shared inference cluster. That's not a workable posture for most regulated workloads.

2.2 Contractual pressure

Customer data agreements increasingly contain explicit clauses barring re-disclosure to AI vendors. A B2B SaaS company building an AI-powered feature on top of customer data has to be able to answer, in writing: “Does any customer prompt or document flow to OpenAI, Anthropic, Google, or any third party?” For the kind of buyer who matters, the only acceptable answer is no.

2.3 Competitive pressure

Even where regulators don't care, your CISO probably does. Whether your team's prompts, internal documents, and code snippets end up training a model that other customers use is a competitive question. The boring, true answer is that responsible vendors don't train on customer data by default. The defensible answer is to deploy in a way where the question is moot.

3. The three deployment patterns

We deploy private LLMs in one of three shapes. The right shape is usually obvious within thirty minutes of understanding the compliance posture, the data volume, and where the data lives.

3.1 VPC deployment

The most common shape. The LLM (open weights, typically) runs inside your existing AWS, GCP, or Azure VPC. Inference traffic stays inside your private network. Audit logs go to your existing SIEM. Cost: roughly $4K-$80K/month depending on model size, concurrency, and whether you reserve capacity.

Best for: healthcare, insurance, professional services. Most CIOs are comfortable with workloads inside an existing VPC. SOC 2 and ISO 27001 evidence is easier here because the controls extend your existing program rather than introducing a new boundary.
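
One cheap pre-deployment check that inference traffic can stay inside the private network is to confirm every configured endpoint has a private address. The sketch below assumes endpoints are listed as IP literals; the endpoint names and addresses are hypothetical, and a real review would also cover DNS, routing tables, and egress rules.

```python
import ipaddress


def non_private_endpoints(endpoints: dict[str, str]) -> list[str]:
    """Return the names of endpoints whose address is outside private ranges."""
    return sorted(
        name
        for name, ip in endpoints.items()
        if not ipaddress.ip_address(ip).is_private
    )


# Hypothetical endpoint inventory; names and addresses are illustrative only.
endpoints = {
    "inference": "10.0.12.7",
    "vector-db": "10.0.14.3",
    "telemetry": "8.8.8.8",  # public address, should be flagged
}
print(non_private_endpoints(endpoints))  # ['telemetry']
```

Note that `is_private` also returns True for loopback and link-local addresses, so this is a coarse filter rather than proof of network isolation.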

3.2 Dedicated cloud tenancy

A separate cloud account or tenancy dedicated to the AI workload, outside your production VPC but still in a cloud you control. This is the pattern we recommend when (a) you want to keep cloud-spend visibility distinct for AI, (b) the workload has different network egress requirements (model downloads, base-model upgrades), or (c) you're running multiple AI workloads with different sensitivity levels.

3.3 On-prem deployment

The model runs on hardware you own, in a data center you control. Adoption has narrowed: in 2026 it's used mostly by defense contractors, healthcare systems with strict data-residency rules, certain investigation firms, and customers in jurisdictions (e.g. several Middle Eastern countries, parts of Canada's health sector) where regulators effectively require on-prem.

On-prem is real engineering: hardware procurement, sizing, cooling, redundancy, hardware-failure runbooks. Plan for 6-12 weeks from hardware-order to production for the first deployment, then 1-3 weeks for subsequent replicas.

4. Open-source vs frontier models

In 2026, the open-source vs frontier debate is less binary than it was even six months ago. The defaults we use:

4.1 Open-source models we routinely deploy

  • Llama 3.3 70B / Llama 3.1 8B. The strongest general-purpose open weights in production use. Good multilingual coverage, strong tool-use through community fine-tunes.
  • Qwen 2.5 72B / 32B / 14B / 7B. Excellent for multilingual retrieval-augmented use cases. Strong reasoning at 32B and above. The 7B is our most-frequent “agent worker” model.
  • DeepSeek-V3 / R1. Reasoning-heavy use cases. Slower per-token, but R1's reasoning trace is genuinely useful for compliance audits: you can show the regulator the step-by-step reasoning the model produced.
  • Mistral / Mixtral families. Strong instruction-following and lower latency. We use Mistral Small / Medium where latency budget is tight.

4.2 When we recommend frontier models, privately

Frontier models (Claude Opus/Sonnet, GPT-4-class, Gemini Pro) still lead on (a) genuinely novel reasoning, (b) long-context legal and clinical drafting, (c) some agentic workflows with high tool-call variety. When we recommend them, we route through dedicated tenancy contracts (Anthropic models on AWS Bedrock, OpenAI models on Azure OpenAI, Google models on Vertex AI) with audited no-training, zero-retention, and audit-log delivery. Never through consumer endpoints.

4.3 The framework

Our short rubric for picking a model:

  1. What's the smallest model that meets your accuracy floor on your real data? Test, don't guess.
  2. Does the use case need a reasoning trace for audit? If yes, prefer DeepSeek-R1 family or Claude Opus.
  3. Is there a latency budget under 500ms? If yes, prefer a 7B-14B open-source model with quantization, or a small frontier tier.
  4. Is the prompt or output high-stakes legal text? Use frontier (privately).
  5. Does the output need to be cited back to source documents? RAG matters more than model choice. Don't over-rotate on the LLM if your retrieval pipeline is weak.
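
The rubric above can be sketched as a small function. This is an illustration only: the `UseCase` fields are hypothetical names for the questions in the list, and the checks are kept independent because the rubric does not say how to prioritise conflicting answers.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    needs_reasoning_trace: bool
    latency_budget_ms: int
    high_stakes_legal: bool
    needs_citations: bool


def recommend(uc: UseCase) -> list[str]:
    """Return the rubric notes that apply to this use case."""
    notes = ["Find the smallest model that meets the accuracy floor on real data."]
    if uc.needs_reasoning_trace:
        notes.append("Prefer the DeepSeek-R1 family or Claude Opus.")
    if uc.latency_budget_ms < 500:
        notes.append("Prefer a quantized 7B-14B open model or a small frontier tier.")
    if uc.high_stakes_legal:
        notes.append("Use a frontier model through a private tenancy.")
    if uc.needs_citations:
        notes.append("Fix retrieval before upgrading the model.")
    return notes


claims_triage = UseCase(
    needs_reasoning_trace=True,
    latency_budget_ms=2000,
    high_stakes_legal=False,
    needs_citations=True,
)
for note in recommend(claims_triage):
    print("-", note)
```

When two notes pull in different directions (for example a reasoning trace and a sub-500ms budget), the output lists both, and the tradeoff still has to be settled by testing on real data.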

5. The honest cost model

Most pricing pages obscure this. Here's how we model it for customers.

Workload                                       Model class                                         All-in monthly
---------------------------------------------  --------------------------------------------------  -------------------------------------------
Single team, internal use, ~10K queries/day    7B-13B open weights on 1 A100/H100                  $4K-$12K
Department-wide, ~100K queries/day             32B-72B open weights on 2-4 H100                    $15K-$45K
Enterprise, customer-facing, ~1M queries/day   70B+ open weights HA cluster + frontier fallback    $50K-$200K
Frontier model in dedicated tenancy            Claude Sonnet / GPT-class                           Per-token + tenancy minimum (vendor-quoted)

These are the numbers we actually quote, not optimistic 2024 estimates. They include inference compute, monitoring, an SRE on-call rotation, model upgrades, and quarterly evaluation. They don't include your data-prep costs, which can be larger than the inference bill for the first 6 months.
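
A quick way to compare the tiers is to convert each monthly range into a cost per query. The sketch below uses the figures from the table and assumes a 30-day month with volume sitting at the stated queries/day; both are simplifications, and the frontier row is left out because it is vendor-quoted.

```python
# (workload, queries per day, low monthly USD, high monthly USD), from the table
ROWS = [
    ("Single team", 10_000, 4_000, 12_000),
    ("Department-wide", 100_000, 15_000, 45_000),
    ("Enterprise", 1_000_000, 50_000, 200_000),
]


def cost_per_query(monthly_usd: float, queries_per_day: int, days: int = 30) -> float:
    """All-in monthly cost divided by the number of queries served in the month."""
    return monthly_usd / (queries_per_day * days)


for name, qpd, low, high in ROWS:
    low_cpq = cost_per_query(low, qpd)
    high_cpq = cost_per_query(high, qpd)
    print(f"{name}: ${low_cpq:.4f}-${high_cpq:.4f} per query")

# Single team: $0.0133-$0.0400 per query
# Department-wide: $0.0050-$0.0150 per query
# Enterprise: $0.0017-$0.0067 per query
```

Under these assumptions the per-query cost falls by roughly an order of magnitude between the smallest and largest tier, which is why volume matters so much to the self-hosting decision.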

The single biggest cost-control lever isn't the model choice; it's retrieval quality. A team running 14B with excellent retrieval beats 70B with mediocre retrieval, on both cost and accuracy. We spend a disproportionate share of every engagement on retrieval evaluation.
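
Retrieval evaluation can start very simply. The sketch below computes recall@k, one common retrieval metric, over a hand-labelled set of queries; the choice of metric, the document IDs, and the labels are assumptions for illustration rather than a description of any specific evaluation suite.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("each query needs at least one labelled relevant document")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


# Hypothetical labelled queries: ranked retriever output and the relevant IDs.
runs = [
    (["d3", "d7", "d1", "d9"], {"d1", "d2"}),  # finds 1 of 2 in the top 3
    (["d4", "d5", "d6", "d8"], {"d4"}),        # finds 1 of 1 in the top 3
]
print(mean_recall_at_k(runs, k=3))  # 0.75
```

Tracking a number like this before and after each retrieval change makes it possible to tell whether a smaller model with better retrieval is actually closing the gap.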

6. Compliance overlays

Every private LLM deployment we ship has a compliance map. The map is a one-page document — produced before code — that names the regulatory regime, lists the relevant controls, and maps each one to a specific architectural decision. Your CISO signs it. We don't build until the map exists.

The regimes we see most frequently:

  • HIPAA + HITECH for healthcare. The Privacy Rule, Security Rule, and Breach Notification Rule each have a column in our map. Special attention to: minimum-necessary, BAAs for AI sub-processors, audit log retention.
  • SOC 2 for any vendor serving regulated customers. The five trust services criteria map onto LLM-specific controls (model-update change management, prompt-injection mitigation, eval drift monitoring).
  • ISO 27001 for European or multinational customers. Annex A controls applied to an AI workload have specific gotchas, especially around cryptographic protection of model weights and inference logs.
  • EU AI Act for any deployment touching EU subjects. The high-risk category is the hard one; we map each system component to the conformity assessment requirements early.
  • NAIC model laws and state insurance regulator guidance for insurance carriers. ABA Model Rule 1.6 and state bar guidance for law firms. AML / KYC frameworks for financial services.
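
As an illustration of the map's structure, here is a minimal sketch that treats it as a list of control-to-decision rows and refuses to proceed while any row is unmapped or the CISO has not signed. The class, function names, and example rows are hypothetical; the real artifact is a one-page document, not code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlMapping:
    regime: str    # e.g. "HIPAA", "SOC 2"
    control: str   # the control being addressed
    decision: str  # the architectural decision; empty means not yet mapped


def unmapped(mappings: list[ControlMapping]) -> list[ControlMapping]:
    """Rows that name a control but no architectural decision."""
    return [m for m in mappings if not m.decision.strip()]


def ready_to_build(mappings: list[ControlMapping], ciso_signed: bool) -> bool:
    """Build only when the map exists, is complete, and is signed."""
    return bool(mappings) and not unmapped(mappings) and ciso_signed


# Hypothetical rows for illustration only.
compliance_map = [
    ControlMapping("HIPAA", "Audit log retention", "Audit log sent to customer-owned SIEM"),
    ControlMapping("SOC 2", "Prompt-injection mitigation", "Output guard on every completion"),
    ControlMapping("HIPAA", "BAAs for AI sub-processors", ""),
]
print([m.control for m in unmapped(compliance_map)])     # ['BAAs for AI sub-processors']
print(ready_to_build(compliance_map, ciso_signed=True))  # False
```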

7. A reference architecture

The shape of a typical private LLM deployment we ship for a healthcare or finance customer:

           ┌──────────────────────────────────────────────┐
           │  Customer perimeter (VPC or dedicated cloud) │
           │                                              │
   user → ─┼─►  Agent runtime  ◄────►  Eval & audit       │
           │       │                     pipeline         │
           │       ▼                                      │
           │  Tool router ──►  Retrieval (vector + BM25)  │
           │       │                  │                   │
           │       ▼                  ▼                   │
           │  Inference layer ──► Private LLM             │
           │     (open weights or dedicated frontier)     │
           │       │                                      │
           │       ▼                                      │
           │  Output guard + PHI scrubber + logger        │
           │                                              │
           └──────────────────────────────────────────────┘
                            │
                            ▼
                  Audit log (SIEM / customer-owned)

Five components, all inside the perimeter. The audit log goes to the customer's SIEM, not ours. The model weights live where the customer wants them. Prompt and completion data never leave the boundary.
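
The diagram's "vector + BM25" retrieval box implies two ranked lists that have to be merged. The guide does not say how; one widely used option is reciprocal rank fusion (RRF), sketched below with the conventional constant k=60. The document IDs are hypothetical, and the two input lists stand in for the output of a real vector index and a real BM25 index.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document scores sum(1 / (k + rank)) across lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by document ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]  # hypothetical output of the vector index
bm25_hits = ["b", "d", "a"]    # hypothetical output of the BM25 index
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))  # ['b', 'a', 'd', 'c']
```

RRF only needs ranks, not comparable scores, which is convenient when the two retrievers score on unrelated scales.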

The components we own day-to-day are the agent runtime, the tool router, the inference layer wrapper, the output guard, and the eval/audit pipeline. We integrate with whatever vector database, identity provider, and SIEM the customer already has — we don't force a stack swap.
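
To show where the output guard sits, here is a deliberately small sketch of a scrubber that masks two identifier patterns and reports how many it replaced so the logger can record the event. The patterns and labels are assumptions for illustration; two regexes are nowhere near sufficient for real PHI detection.

```python
import re

# Illustrative patterns only; real PHI detection needs far broader coverage.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def scrub(text: str) -> tuple[str, dict[str, int]]:
    """Replace each matched identifier with a label and count the replacements."""
    counts: dict[str, int] = {}
    for label, pattern in PATTERNS.items():
        text, replaced = pattern.subn(f"[{label}]", text)
        counts[label] = replaced
    return text, counts


completion = "Patient SSN 123-45-6789, contact jane.doe@example.org."
clean, counts = scrub(completion)
print(clean)   # Patient SSN [SSN], contact [EMAIL].
print(counts)  # {'SSN': 1, 'EMAIL': 1}
```

Returning the counts alongside the cleaned text lets the audit log record that a redaction happened without storing the redacted value.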

8. When NOT to deploy a private LLM

The honest list of cases where we'll tell you not to do this:

  • The data isn't actually sensitive, and a SaaS vendor with a BAA/DPA covers your compliance posture. Don't pay $20K/month to replace a $200/month seat for marketing copy.
  • The use case is so small that the operational overhead exceeds the value. Below ~5K queries/day, the math is rarely on self-hosted's side.
  • You haven't shipped any AI-assisted workflow yet. Start with a SaaS pilot under tight data restrictions, prove the workflow, then bring it private. Don't build the private-deployment cathedral until you know the prayers.
  • You don't have a stakeholder who'll own the AI roadmap. Private LLMs are 30% engineering and 70% ongoing operation. Without an owner, the platform decays.

9. How Agrus deploys a private LLM

Our default engagement shape for a new private LLM deployment:

  1. 30-minute scoping call with an engineer and our compliance lead. Free. We tell you in this call whether the problem fits the private-LLM shape or not.
  2. Discovery Sprint (2-3 weeks, $15K-$30K fixed). Deliverables: working prototype, architecture document, compliance map your CISO can sign, model-selection recommendation, cost model. You can stop here and take the artifacts elsewhere — we've had customers do this.
  3. Build Engagement (8-16 weeks, T&M $200-$280/hr blended). Production deployment, monitoring, eval pipeline, runbook handover, customer-team enablement.
  4. Managed SLA ($4K-$15K/month per agent system). 24/7 on-call, model upgrades, drift monitoring, quarterly evaluation. Optional.

The 2-3 week Discovery Sprint is where most of the strategic risk comes out. If a private LLM is the wrong shape, we'll tell you at the end of week one and refund the rest. If the shape is right, the sprint produces the architecture, the compliance map, and the prototype that becomes the basis for the production build.

Next step

Scope a private LLM deployment in 30 minutes.

Engineer + compliance lead on the call. We tell you whether this is the right shape for your problem, regardless of whether you hire us.

Frequently asked questions

What is a private LLM?

A private LLM is a large language model deployed on infrastructure you control — your VPC, your on-prem servers, or a dedicated cloud tenancy — so that no prompts, completions, or fine-tuning data are sent to a third-party SaaS AI vendor. It can be an open-weights model (Llama, Qwen, Mistral, DeepSeek) you host yourself, or a frontier model (Claude, GPT, Gemini) accessed through a contract that guarantees no training on your data and no cross-tenant logging.

Is a private LLM the same as a self-hosted LLM?

Self-hosted is a strict subset of private. A self-hosted LLM is one where you run the model weights on your own compute. A private LLM also covers cases where a frontier vendor exposes a dedicated, audited, single-tenant deployment to you — the weights stay with the vendor, but no other customer shares your inference pipeline. Both put your data outside the multi-tenant SaaS perimeter.

Why would a CIO or CISO ask about a private LLM?

Three reasons recur. First, regulators: HIPAA, SOC 2, ISO 27001, ABA Model Rule 1.6, NAIC model laws, EU AI Act high-risk categories. Second, contractual: customer data agreements that prohibit re-disclosure to AI vendors. Third, competitive: a model trained on your team's questions and proprietary documents accidentally improving a competitor's experience is an exfiltration event.

Are open-source LLMs really good enough for production?

For most enterprise tasks, yes. Llama 3.3 70B, Qwen 2.5 72B, and DeepSeek-V3 are within striking distance of GPT-4-class on most benchmarks that matter for retrieval-augmented enterprise tasks. For frontier reasoning, claim writing, or highly nuanced legal drafting, frontier models still lead. The right answer is usually a mix: open-source for the high-volume, sensitive, retrieval-augmented path; frontier for the long-tail, reasoning-heavy path — through a private deployment of that frontier model.

What does a private LLM cost?

A small private LLM (7B-13B parameters) running on a single A100 or H100 can serve a single team for $4,000-$12,000/month all-in, including hosting, monitoring, and SLA. A 70B-class model serving an enterprise workload typically runs $25,000-$80,000/month, and more once you add a high-availability cluster and a frontier fallback. These numbers move fast: newer hardware (B200, MI300X) and smaller models with comparable quality (Llama 3.3, Qwen 2.5) are pushing the floor down quarterly.

How fast can Agrus deploy a private LLM?

A Discovery Sprint produces a working prototype in 2-3 weeks at fixed price ($15K-$30K), including model selection, RAG setup on a sample of your data, and a compliance map your CISO can sign. A production-grade Build Engagement that follows typically lands in 8-16 weeks for a single department, 4-6 months for an enterprise-wide rollout.

Can we use Anthropic, OpenAI, Google or AWS models privately?

Yes, with the right contract. Anthropic offers Claude through AWS Bedrock and dedicated deployments with no-training guarantees. OpenAI has Azure OpenAI and enterprise zero-data-retention agreements. Google has Vertex AI. AWS Bedrock fronts multiple model families. We evaluate the actual contractual fine print — including audit rights and cross-region data residency — before recommending any of them.

When does a private LLM NOT make sense?

When your use case is genuinely non-sensitive, low-volume, and the SaaS option has a vendor-side BAA or DPA that satisfies your compliance team — for example, a marketing team using a frontier model for blog drafting where no customer data ever enters the prompt. Don't build a $30K/month private deployment to replace a $200/month seat license.

