— Issue 001 — Architecture — 14 min read — Apr 30, 2026

An LLM trained on internal documents, explained.

Most teams want an AI that knows their company. Most AI tools don't, and they won't tell you so. This essay walks through what's actually under the hood when an LLM is "trained on your internal documents," why retrieval-augmented generation is doing the real work, and what that means if you're choosing between fine-tuning, prompting, or building something private from scratch.

The phrase is doing a lot of work

"An LLM trained on your internal documents" is a sentence that gets used to mean three completely different things, and the difference matters.

The first meaning is fine-tuning — taking an existing model and adjusting its weights with a corpus of your own data so the model itself becomes a permanent expert on, say, your support tickets or your contract templates. This is technically training, in the strictest sense. It's also slow, expensive, hard to update, and almost never what teams actually need.

The second is in-context learning: pasting a document into a prompt and asking the model to answer questions about it. This isn't training at all. The model holds the document in its working memory for a single conversation and forgets it the moment the chat ends. It's useful for one-off tasks, but it stops working once the material outgrows the model's context window, and everything has to be pasted in again for every new conversation.

The third is retrieval-augmented generation, or RAG. This is the one that actually delivers the experience people imagine when they say "an LLM trained on internal documents." The model isn't trained on your data. The model is paired with a retrieval system that hands it the right slice of your data, on demand, every time someone asks a question. The output reads like a model that knows your company. The mechanism underneath is something subtler and a lot more practical.

— TL;DR

Most "private AI" products you'll see are not training models on your documents. They're using retrieval-augmented generation: pulling the right chunks of your data into the prompt at the moment of the question. The model stays generic. Your data stays yours. And the answers are grounded in real sources.

Why fine-tuning is usually the wrong tool

Fine-tuning sounds like the obvious answer. You have documents. You have an LLM. You combine them. The model "knows" your company.

In practice, fine-tuning has three problems for the kind of company knowledge most teams care about.

Your data changes constantly. A fine-tuned model is a snapshot. The day after you train it, somebody updates a runbook, closes a deal, deploys a fix. The model still believes the old version. To keep up, you'd need to retrain — and retraining is hours to days, not seconds. By the time the new version ships, the data has moved again.

You lose the source. When a fine-tuned model gives you an answer, it can't show you which document the answer came from, because the document isn't a document anymore — it's been smeared into the model's weights. For internal questions where the answer matters, "trust me, this is what I learned during training" is not good enough.

Permissions don't survive. If junior teammates and senior leadership are asking the same model, but only senior staff should see the cap table, the model has no way to enforce that. Once data is in the weights, everyone with access to the model has access to the data.

Fine-tuning works well for things like teaching a model a particular voice, a domain vocabulary, or a structured output format. It doesn't work well for "answer questions about my live, changing, permission-scoped company data."

What retrieval-augmented generation actually does

RAG is, at its core, a very simple idea wrapped in some careful engineering. Instead of training the model to know your data, you keep the data outside the model and hand it to the model whenever you need an answer.

The flow looks like this.

— RAG · simplified flow

Index: read every doc, Slack thread, and ticket → split into chunks → embed

↓

Store: save the chunks and their vectors in a database the retrieval system can search

↓

Query: user asks a question → embed the question → find the closest chunks

↓

Ground: hand those chunks to the LLM along with the question

↓

Answer: LLM generates a response based only on what was retrieved, with sources
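The flow above can be sketched in a few dozen lines of Python. This is a minimal illustration under loud assumptions, not a production design: `embed` here is a toy word-count stand-in that only matches shared words (a real system would call a learned embedding model), the "store" is a plain list rather than a vector database, and the file names, function names, and sample text are all hypothetical. The final step, sending the prompt to an LLM, is left out because it depends on the provider.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str      # where the text came from, kept so answers can cite it
    text: str
    vector: Counter  # toy embedding: word counts (assumption, not a real model)


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def index(docs: dict[str, str], chunk_words: int = 40) -> list[Chunk]:
    """Index + Store: split each document into fixed-size word chunks and embed them."""
    store = []
    for source, body in docs.items():
        words = body.split()
        for i in range(0, len(words), chunk_words):
            text = " ".join(words[i:i + chunk_words])
            store.append(Chunk(source, text, embed(text)))
    return store


def retrieve(store: list[Chunk], question: str, k: int = 2) -> list[Chunk]:
    """Query: embed the question and return the k closest chunks."""
    q = embed(question)
    return sorted(store, key=lambda c: cosine(q, c.vector), reverse=True)[:k]


def build_prompt(question: str, chunks: list[Chunk]) -> str:
    """Ground: hand the retrieved chunks to the LLM along with the question."""
    sources = "\n".join(f"[{c.source}] {c.text}" for c in chunks)
    return f"Answer using only these sources and cite them.\n{sources}\n\nQuestion: {question}"


# Hypothetical two-document corpus.
docs = {
    "runbook.md": "Friday deploys need sign-off from the on-call engineer before noon.",
    "acme-contract.txt": "Acme receives a 30-day refund window on all annual plans.",
}
store = index(docs)
question = "What refund window did we give Acme?"
top = retrieve(store, question, k=1)
prompt = build_prompt(question, top)
```

Because each chunk carries its `source`, the prompt (and therefore the answer) can point back to the document it came from.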

Three things to notice. First, the model itself is unchanged. It's still a generic LLM — the same one anyone else uses. Second, the data never enters the model permanently. It's loaded into the prompt at the moment of the question and discarded after. Third, because the retrieval system knows exactly which chunks were used, every answer can be linked back to its source.

That last point is what makes RAG defensible for company use. When the model says "we agreed to a 30-day refund window with Acme," it can show you the exact email thread, contract clause, or Slack message that backs the claim. If the retrieval pulled the wrong source, you'll see it. If the model misread the source, you'll see that too. Nothing is hidden inside the weights, so a wrong answer can be traced to the step that produced it.

A fine-tuned model gives you an answer. A RAG system gives you an answer plus a receipt.

Embeddings, in one paragraph

The piece of RAG that does the heavy lifting — and the part that's easiest to get wrong — is embedding. An embedding is a numerical representation of a piece of text that captures its meaning, not just its words. Two sentences that say the same thing in different ways will have similar embeddings; two sentences that share words but mean different things won't. When a user asks a question, the question gets embedded too, and the retrieval system finds the chunks whose embeddings are closest to the question's embedding. That's why RAG can answer "what did we promise the customer about pricing?" using a thread that never said the word "promise" — because the meaning matches even when the keywords don't.
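"Closest" usually means highest cosine similarity between vectors. The sketch below shows the ranking step with hand-assigned, hypothetical 3-dimensional vectors; real embedding models produce hundreds or thousands of dimensions, and the numbers here are invented purely to illustrate the arithmetic.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, hand-assigned for illustration only.
chunks = {
    "We told the customer the price stays fixed for a year.": [0.9, 0.1, 0.2],
    "The pricing page redesign ships on Friday.": [0.2, 0.9, 0.1],
}
# Stand-in vector for: "What did we promise the customer about pricing?"
question_vec = [0.8, 0.2, 0.3]

ranked = sorted(
    chunks,
    key=lambda text: cosine_similarity(question_vec, chunks[text]),
    reverse=True,
)
best = ranked[0]
```

With these made-up vectors, the first chunk ranks highest even though it never contains the word "promise", which is the behaviour the paragraph above describes.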

Why this beats keyword search

Native search inside Slack, Notion, Drive, or Jira is largely keyword-based. You ask for "refund policy" and it shows you every doc that contains the word "refund." That's useful when you remember the exact language. It's nearly useless when you don't, or when the right answer was written in different words by a different person two years ago. Embedding-based retrieval doesn't depend on the exact words. It matches on meaning. That's why a well-built RAG system feels like a coworker who actually read the docs, and native search feels like grep.
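To make the failure mode concrete, here is a deliberately naive keyword matcher. It is an assumption-laden simplification: real product search adds stemming, ranking, and other refinements, and the document names and text are hypothetical. The point is only that exact-word matching misses a relevant document written in different words.

```python
import re


def keyword_search(docs: dict[str, str], query: str) -> list[str]:
    """Return names of docs that contain every word of the query (case-insensitive)."""
    terms = re.findall(r"\w+", query.lower())
    hits = []
    for name, body in docs.items():
        words = set(re.findall(r"\w+", body.lower()))
        if all(term in words for term in terms):
            hits.append(name)
    return hits


# Hypothetical corpus: both documents are about refunds, only one says so.
docs = {
    "faq.md": "Our refund policy covers annual plans.",
    "support-thread.txt": "Customers can get their money back within 30 days of purchase.",
}

hits = keyword_search(docs, "refund policy")
```

Here `hits` contains only `faq.md`. The support thread describes the same policy but never uses the words "refund" or "policy", so a keyword match cannot find it.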

What "grounded" means and why it's not optional

"Grounded" is the word the field uses for an LLM whose answer is constrained to source material. Ungrounded, an LLM will produce an answer to almost anything because it was trained to be fluent. Fluency without grounding is hallucination. Confident, well-written hallucination is the worst possible failure mode for internal knowledge — it sounds exactly like a real answer, and the team will believe it.

A grounded LLM works inside a smaller frame. It's told: here's the question, here are the documents we found, answer using only what's in these documents and tell us when you can't. The model is still a generic LLM. The grounding happens through the prompt. But the behavioural shift is dramatic — the model becomes citable, correctable, and accountable.
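A grounding prompt of the kind described above might be assembled like this. The wording of the instructions and the function name `grounded_prompt` are illustrative assumptions; teams tune this text heavily, and nothing here is tied to a particular LLM provider.

```python
def grounded_prompt(question: str, documents: list[tuple[str, str]]) -> str:
    """Build a prompt that restricts the model to the supplied (source, text) pairs."""
    numbered = "\n".join(
        f"[{i}] ({source}) {text}"
        for i, (source, text) in enumerate(documents, start=1)
    )
    return (
        "Answer the question using only the documents below.\n"
        "Cite the bracketed number of every document you rely on.\n"
        'If the documents do not contain the answer, reply: "The documents do not cover this."\n\n'
        f"Documents:\n{numbered}\n\n"
        f"Question: {question}"
    )


# Hypothetical retrieved chunk.
prompt = grounded_prompt(
    "How do we handle Friday deploys?",
    [("runbook.md", "Friday deploys need sign-off from the on-call engineer.")],
)
```

Numbering the documents gives the model something concrete to cite, which is what makes the answer checkable afterwards.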

The cost of this is honesty. A grounded system has to be willing to say "I don't know" or "the documents don't cover this" when the retrieval comes up empty. That's a feature, not a bug. An AI that lies fluently is more dangerous than an AI that admits it doesn't know.
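One way to make "I don't know" a first-class outcome is to refuse before the LLM is ever called when retrieval finds nothing close enough. The sketch below assumes similarity scores between 0 and 1 and an arbitrary threshold of 0.5; both the threshold and the `generate` callable (a stand-in for whatever calls the LLM) are hypothetical and would need tuning against real data.

```python
from typing import Callable

NO_ANSWER = "The documents don't cover this."


def select_context(scored_chunks: list[tuple[float, str]], min_score: float = 0.5) -> list[str]:
    """Keep only chunks whose similarity score clears the threshold, best first."""
    kept = [(score, text) for score, text in scored_chunks if score >= min_score]
    return [text for score, text in sorted(kept, key=lambda pair: pair[0], reverse=True)]


def answer_or_abstain(
    scored_chunks: list[tuple[float, str]],
    generate: Callable[[list[str]], str],
    min_score: float = 0.5,
) -> str:
    """Return a generated answer, or an explicit refusal if nothing relevant was retrieved."""
    context = select_context(scored_chunks, min_score)
    if not context:
        return NO_ANSWER  # skip the LLM call: there is nothing to ground on
    return generate(context)
```

A threshold check like this only catches empty retrieval. The prompt-level instruction to admit gaps is still needed for cases where chunks are retrieved but don't actually answer the question.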

The privacy dimension

Most teams who care about "an LLM trained on internal documents" care because they don't want to paste their data into a public chat tool. Fair concern. The architecture matters here.

In a properly designed RAG system, your private data is stored in a system you control or a vendor you trust. When a question comes in, only the relevant chunks are pulled into the prompt, not the entire corpus. The model that generates the answer can be a hosted API call, provided the provider's terms make the prompt content ephemeral: not used to train the model and not retained beyond the request. The end result is that the AI has access to your data only at the moment of the question, only for the chunks that matter, and never as a permanent part of its weights.

This is the design pattern Archively uses. Tools connect through scoped, revocable access tokens. Data is encrypted in transit and at rest. Source permissions from your tools are honoured during retrieval — a junior teammate asking about the cap table won't get an answer if Drive doesn't already let them see it. Private content is never used to train external models. The audit trail is real: every query, every retrieval, every source, logged.
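As a generic illustration of permission-aware retrieval with an audit trail (not Archively's actual implementation, which the essay does not detail), the sketch below filters chunks by who may see them before ranking, then logs the query and the sources returned. The `allowed_users` field, the `AuditLog` class, and all sample data are hypothetical; a real system would sync permissions from the source tools and compute scores from embeddings.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Chunk:
    source: str
    text: str
    allowed_users: frozenset[str]  # hypothetical: mirrored from the source tool's sharing settings
    score: float = 0.0             # similarity to the current question, computed elsewhere


@dataclass
class AuditLog:
    entries: list[dict] = field(default_factory=list)

    def record(self, user: str, question: str, sources: list[str]) -> None:
        """Append one entry per query: who asked, what, and which sources were used."""
        self.entries.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "question": question,
            "sources": sources,
        })


def retrieve_for_user(
    chunks: list[Chunk], user: str, question: str, log: AuditLog, k: int = 3
) -> list[Chunk]:
    """Drop chunks the user cannot see before ranking, then log what was returned."""
    visible = [c for c in chunks if user in c.allowed_users]
    top = sorted(visible, key=lambda c: c.score, reverse=True)[:k]
    log.record(user, question, [c.source for c in top])
    return top


# Hypothetical data: the cap table is visible to the CEO only.
chunks = [
    Chunk("cap-table.xlsx", "Ownership breakdown by shareholder.", frozenset({"ceo"}), score=0.95),
    Chunk("handbook.md", "Equity vests over four years.", frozenset({"ceo", "junior"}), score=0.60),
]
log = AuditLog()
junior_view = retrieve_for_user(chunks, "junior", "Who owns what?", log)
ceo_view = retrieve_for_user(chunks, "ceo", "Who owns what?", log)
```

Filtering before ranking matters: a restricted chunk never reaches the prompt, so the model cannot leak it no matter how the question is phrased.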

What this looks like in practice

The promise of an LLM that knows your company is, when delivered properly, more grounded and less mystical than the marketing makes it sound. You don't get a model that "thinks" about your business. You get a system that can find the right paragraph in the right document in milliseconds and ask a generic LLM to read it on your behalf.

That's plenty. It means you can ask "What did we agree on for the renewal terms with Acme?" and get an answer pulled from the actual email thread. It means a new hire can ask "How do we handle Friday deploys?" and get the runbook, not a hallucination that sounds like a runbook. It means a founder can ask "What's likely to ship next sprint?" and get a synthesis of real Jira tickets and Slack debates, not a guess.

None of that requires fine-tuning. None of it requires giving up your data. It requires a retrieval system that knows where everything lives, an embedding model that captures meaning rather than words, and a generation step that's honest about its sources.

The short version

Fine-tuning bakes data into model weights. Slow, expensive, hard to update, no source attribution. Right answer for voice and format, wrong answer for live company knowledge.
In-context learning pastes documents into a prompt for one conversation. Fine for ad-hoc tasks. Doesn't scale.
RAG keeps data outside the model and retrieves the right chunks at query time. Fast to update, sources stay attached, permissions can be honoured, the model never owns the data. This is what most "private AI" or "LLM trained on internal documents" products actually are.

If you're evaluating tools that promise an AI grounded in your company's knowledge, the right question isn't "did you train the model on our data?" It's "where does our data live, who has access to it, and can you show me the source of every answer?" The architecture answers all three. The marketing usually doesn't.

Archively is built on the third pattern, designed for B2B SaaS teams who want the upside of AI on their own company knowledge and refuse to trade their privacy or their audit trail to get there. It connects to the tools you already use, retrieves what matters, hands it to the model, and shows you the source of every claim.
