RAG With Auth Inheritance: Permission-Aware Retrieval for Enterprise AI
Most enterprise RAG systems leak. The moment retrieval stops asking who wants the answer, it will surface documents the person was never allowed to open. Auth inheritance — making retrieval enforce the same permissions as the source systems — is what makes RAG safe to ship inside a company.
The problem nobody catches until launch
Retrieval-augmented generation demos well because everyone in the demo is an administrator. Every document is in scope, every answer is impressive, and nobody notices that the system never asked who was asking. The leak shows up later — the first time a real employee types a real question and the assistant answers, helpfully and fluently, from a document they were never allowed to open.
It might be another team's contract. A salary band. A board deck. The model didn't do anything wrong: it summarised the context it was handed. The mistake happened one layer down, where the retrieval system returned the nearest matching chunks without checking whether the person could see them. A vector database, left to its own devices, returns similarity — not authorisation.
What "auth inheritance" means
Auth inheritance is a simple invariant: the retrieval layer enforces exactly the same access control as the systems the content came from. If you cannot open a file in SharePoint, Drive, or the application it lives in, the RAG layer must not retrieve it for you — not as a citation, not as hidden context, not in a summary. The permissions ride along with the data instead of being dropped at the door of the vector store.
Stated as a test you can hold a system to: the RAG layer should be incapable of returning a chunk the asker could not have opened directly. If that property isn't enforced in code, you don't have permission-aware RAG — you have a search engine that ignores permissions, wired to a model that will read whatever it's given.
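
A minimal sketch of that test, assuming a toy in-memory corpus. The names `SOURCE_ACL`, `can_open`, `retrieve`, and `leaked_chunks` are hypothetical stand-ins, not part of any real library; in a real system `can_open` would ask the source system directly, independently of whatever filter the retriever applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str


# Hypothetical source-system ACLs: doc_id -> users who can open the document.
SOURCE_ACL = {
    "handbook": {"alice", "bob"},
    "salary-bands": {"alice"},
}

CHUNKS = [
    Chunk("handbook", "Holiday policy: 25 days."),
    Chunk("salary-bands", "Band 5: confidential figures."),
]


def can_open(user: str, doc_id: str) -> bool:
    """Stand-in for asking the source system whether the user may open a document."""
    return user in SOURCE_ACL.get(doc_id, set())


def retrieve(user: str, query: str) -> list[Chunk]:
    """Toy retriever: keyword match, restricted to documents the user can open."""
    words = query.lower().split()
    return [
        c for c in CHUNKS
        if can_open(user, c.doc_id) and any(w in c.text.lower() for w in words)
    ]


def leaked_chunks(user: str, queries: list[str]) -> list[Chunk]:
    """Return every retrieved chunk the user could not have opened directly."""
    return [
        c for q in queries for c in retrieve(user, q)
        if not can_open(user, c.doc_id)
    ]


# The invariant: for any user and any set of probe queries, nothing leaks.
assert leaked_chunks("bob", ["band", "policy", "confidential"]) == []
```

The useful part is the shape of the assertion rather than the toy retriever: run it for every test user against a broad set of probe queries, and fail the build if the list is ever non-empty.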
Where permission-aware RAG actually breaks
The failure modes are predictable, and most of them hide until production:
- Index-time permissions go stale. Access captured when a document was ingested keeps answering questions long after that access was revoked. Yesterday's offboarded employee can still reach the content through the embeddings.
- Chunk-level drift. A document's access list has to propagate to every chunk derived from it — and to every summary, title, and synthetic Q&A generated on top of it. Miss one derivation and the content leaks through the side door.
- Group and role resolution. Access lists reference groups, not just individuals. If you resolve membership from a snapshot taken at ingestion rather than live against the identity provider, your permissions are always one reorg behind.
- Multi-tenant bleed. In a shared index, one tenant's vectors must never be searchable by another. A filter clause is not isolation; it's a single typo away from a cross-tenant disclosure.
- Embedding-level leakage. Even when you withhold the raw chunk text from the user, feeding it to the model, or caching its embedding where another query path can reach it, can expose the underlying content. The protected surface is everything derived from the document, not just the document.
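
One way to guard against chunk-level drift is to make inheritance the only way a derived artefact can be created. The sketch below assumes a hypothetical `Artefact` record and `derive` helper; the fixed-size chunking is purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artefact:
    """Anything stored in the index: a chunk, a summary, a title, a synthetic Q&A."""
    kind: str
    text: str
    source_doc_id: str
    allowed_principals: frozenset[str]


def derive(parent: Artefact, kind: str, text: str) -> Artefact:
    """Create a derived artefact that always inherits the parent's source and ACL."""
    return Artefact(kind, text, parent.source_doc_id, parent.allowed_principals)


def chunk_document(doc_id: str, body: str, acl: set[str], size: int = 40) -> list[Artefact]:
    """Split a document into fixed-size chunks, copying the document ACL to each."""
    frozen = frozenset(acl)
    return [
        Artefact("chunk", body[i:i + size], doc_id, frozen)
        for i in range(0, len(body), size)
    ]


# Usage: every chunk and every summary carries the same ACL as the source.
chunks = chunk_document("board-deck", "x" * 100, {"group:board"})
summary = derive(chunks[0], "summary", "Q3 outlook")
assert summary.allowed_principals == chunks[0].allowed_principals
```

Because `derive` takes the parent rather than an ACL argument, there is no code path that produces a summary or synthetic Q&A without its source document's permissions.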
The patterns that work
None of this requires exotic infrastructure. It requires treating identity as a first-class input to retrieval rather than an afterthought.
- Filter at query time, not only index time. Store the allowed principals as metadata on each vector, then filter the search by the caller's resolved identity and groups before ranking — a pre-filter, not a post-filter. Post-filtering after top-k silently shrinks (or empties) results and tempts teams to widen k until something comes back.
- Propagate identity end-to-end. Carry the user's token through the retrieval call and resolve their groups live against the IdP at query time, so a revocation takes effect on the next question, not the next re-index.
- Partition hard boundaries physically. Give each tenant its own namespace or index. Reserve metadata filtering for in-tenant document permissions; don't lean on it for the boundary that would be a breach if it failed.
- Re-check at generation. Before the model cites its sources, confirm those exact documents are still authorised for the caller. The extra few milliseconds buy a check that still holds when a cache is stale.
- Log every retrieval with the principal and the document IDs. When someone asks whether the assistant could have leaked a document, an audit trail is the difference between a five-minute answer and a security incident.
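
The difference between pre-filtering and post-filtering is easiest to see on a toy index. This sketch uses precomputed similarity scores and hypothetical principal names such as `group:eng`; a real vector store would apply the filter inside its own search call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vec:
    doc_id: str
    score: float  # similarity to the query, precomputed for the demo
    allowed_principals: frozenset[str]


INDEX = [
    Vec("board-deck", 0.95, frozenset({"group:board"})),
    Vec("salary-bands", 0.93, frozenset({"group:hr"})),
    Vec("contract-acme", 0.90, frozenset({"group:legal"})),
    Vec("eng-handbook", 0.70, frozenset({"group:eng"})),
    Vec("eng-oncall", 0.65, frozenset({"group:eng"})),
]


def visible(v: Vec, principals: set[str]) -> bool:
    """True if the caller holds at least one principal the vector allows."""
    return bool(v.allowed_principals & principals)


def post_filter(principals: set[str], top_k: int) -> list[str]:
    """Rank everything, cut to top_k, then drop what the caller cannot see."""
    ranked = sorted(INDEX, key=lambda v: v.score, reverse=True)[:top_k]
    return [v.doc_id for v in ranked if visible(v, principals)]


def pre_filter(principals: set[str], top_k: int) -> list[str]:
    """Restrict to what the caller can see, then rank and cut to top_k."""
    allowed = [v for v in INDEX if visible(v, principals)]
    ranked = sorted(allowed, key=lambda v: v.score, reverse=True)[:top_k]
    return [v.doc_id for v in ranked]


engineer = {"user:dana", "group:eng"}
print(post_filter(engineer, top_k=3))  # [] because the top three are all off-limits
print(pre_filter(engineer, top_k=3))   # ['eng-handbook', 'eng-oncall']
```

The post-filter returns nothing for a caller who has two perfectly good documents available, which is exactly the pressure that leads teams to widen k.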
The rule of thumb: if access control lives anywhere other than inside the retrieval call, it isn't access control — it's a suggestion. Permission-aware RAG enforces the invariant in code, every query, before ranking.
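
As a rough illustration of the partitioning and logging patterns, the sketch below keeps one store per tenant and emits one JSON line per retrieval. `TenantIndex` and `audit_record` are hypothetical names, and the in-memory dictionary stands in for genuinely separate namespaces or indexes; ranking is omitted for brevity.

```python
import json
import time
from collections import defaultdict


class TenantIndex:
    """One separate store per tenant; a search never touches another tenant's store."""

    def __init__(self) -> None:
        self._stores: dict[str, list[dict]] = defaultdict(list)

    def add(self, tenant: str, doc_id: str, allowed: set[str]) -> None:
        self._stores[tenant].append({"doc_id": doc_id, "allowed": set(allowed)})

    def search(self, tenant: str, principals: set[str]) -> list[str]:
        # The tenant boundary is the choice of store; the ACL filter only
        # handles document permissions inside that tenant.
        return [
            r["doc_id"]
            for r in self._stores.get(tenant, [])
            if r["allowed"] & principals
        ]


def audit_record(principal: str, tenant: str, doc_ids: list[str]) -> str:
    """Serialise one retrieval as a JSON line for an append-only audit log."""
    return json.dumps(
        {"ts": time.time(), "principal": principal, "tenant": tenant, "doc_ids": doc_ids},
        sort_keys=True,
    )


index = TenantIndex()
index.add("acme", "acme-contract", {"group:legal"})
index.add("globex", "globex-contract", {"group:legal"})

hits = index.search("acme", {"user:sam", "group:legal"})
print(audit_record("user:sam", "acme", hits))
```

Both tenants use the same group name here on purpose: the caller's principals match both documents, yet only the caller's own tenant store is ever consulted.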
Index-time ACLs vs query-time filtering
The two are not a choice; they're two halves of the same mechanism. Index-time work attaches the authorisation metadata — which principals and groups may see this chunk — to every vector as it's written. Query-time work resolves the caller's live identity and uses it to constrain the search. Index-time tells the store what each chunk requires; query-time tells it what the asker currently has. You need both: index-time alone goes stale, and query-time alone has nothing to filter against.
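
To make the two halves concrete, here is a small sketch in which the index stays fixed while the identity provider changes. `IDP_GROUPS`, `resolve_principals`, and the implicit `group:everyone` principal are assumptions made for the example, not a real IdP API.

```python
# Hypothetical IdP state: user -> current groups. Mutated below to simulate a revocation.
IDP_GROUPS = {"dana": {"finance"}}

# Index-time half: each stored chunk records the principals it requires.
INDEX = [
    {"doc_id": "q3-forecast", "allowed_principals": {"group:finance"}},
    {"doc_id": "eng-handbook", "allowed_principals": {"group:everyone"}},
]


def resolve_principals(user: str) -> set[str]:
    """Query-time half: look up the caller's groups at the moment of the query."""
    groups = IDP_GROUPS.get(user, set())
    return {f"user:{user}", "group:everyone"} | {f"group:{g}" for g in groups}


def search(user: str) -> list[str]:
    """Return doc IDs whose required principals intersect what the caller has now."""
    principals = resolve_principals(user)
    return [r["doc_id"] for r in INDEX if r["allowed_principals"] & principals]


before = search("dana")                  # ['q3-forecast', 'eng-handbook']
IDP_GROUPS["dana"].discard("finance")    # access revoked in the IdP; index untouched
after = search("dana")                   # ['eng-handbook']
```

Nothing was re-indexed between the two calls. The stored metadata says what each chunk requires, and the live lookup decides whether the caller still has it.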
# Query-time ACL pre-filter (conceptual)
principals = idp.resolve(user.token)  # live: user + current groups
hits = vectors.search(
    embedding=embed(question),
    filter={"allowed_principals": {"any_of": principals}},  # applied before ranking
    top_k=8,
)
sources = [h for h in hits if authz.can_read(user, h.doc_id)]  # re-check before generation
answer = llm.generate(question, context=sources)
audit.log(user, question, [h.doc_id for h in sources])
Why this usually means a private deployment
Documents sensitive enough to need access-control inheritance are frequently too sensitive to send to a third-party model API in the first place. Once you're enforcing per-user retrieval over contracts, financials, or regulated records, the same risk assessment that demands auth inheritance tends to push the whole stack — model and vector store — onto infrastructure you control. The permission problem and the data-residency problem are usually the same project wearing two hats.
We cover that side of the decision in "Local LLM vs Cloud AI API: How to Choose", and the deployment mechanics in "Deploying Local LLMs for Enterprise: Ollama, vLLM, and RAG Pipelines".
A checklist before you ship
- Can a user retrieve a chunk from a document they cannot open directly? Prove the answer is no, with a test.
- Does revoking access take effect on the next query, or the next re-index?
- Are group memberships resolved live, or frozen at ingestion?
- Is tenant isolation a physical boundary, or a filter clause?
- Are derived artefacts — summaries, titles, synthetic Q&A — inheriting the source document's permissions?
- Is every retrieval logged with the principal and the documents returned?
If any answer is uncertain, the system isn't ready for sensitive content yet. Auth inheritance is the line between an AI assistant your security team can sign off on and one they have to switch off.
This is the kind of work we do under AI engineering — private, permission-aware retrieval built to the access model your business already runs on.