Five Layers of Building RAG for Enterprise
In my earlier post about RAG, I shared the experience of building a RAG system from scratch. Back then, I was mostly excited that it worked at all. But after months of practice, debugging, and exposure to enterprise scenarios during my internship, I started thinking about a more fundamental question: How do you actually make RAG work in a real business setting?
Not a demo-level "ask a question, get an answer" — but a system that runs reliably, can be measured quantitatively, and degrades gracefully when things go wrong.
This post organizes my current thinking into five layers. It won't apply to every situation, but it's the framework I've arrived at through reflection and practice.
Layer 1: Top-Level Design — Scenario Definition
Many RAG projects fail not because the technology is lacking, but because the scenario wasn't properly defined upfront. Before writing a single line of code, three questions need answers:
Who are the users?
Internal employees and external customers have fundamentally different requirements. Internal users may tolerate some hallucination — they have domain knowledge to judge. External customers have zero tolerance — one wrong answer can become a complaint. This sets the quality bar for the entire system.
What does the data look like?
This is more complex than it sounds. Are you dealing with structured tables, semi-structured documents, or OCR results from scanned PDFs? Data cleaning difficulty directly determines the ceiling of your RAG system — I learned this firsthand. Parsing Markdown blog content and processing HTML from Rust documentation are entirely different challenges.
What are the timeliness requirements?
Does the knowledge base need real-time updates or daily batch syncing? This shapes the entire data pipeline architecture.
Layer 2: Core Pipeline — Data and Architecture
Once the scenario is clear, it's time to build the RAG pipeline. I break it into three sub-layers.
Data Layer: Cleaning SOP + Semantic Chunking + Metadata Annotation
"Garbage in, garbage out" is especially brutal in RAG. Data cleaning isn't a one-time task — it requires a standardized SOP:
- Cleaning rules: Noise removal (headers, footers, watermarks), format normalization, encoding fixes
- Semantic chunking: Don't just split by character count — split at semantic boundaries. I used LangChain's RecursiveCharacterTextSplitter for shinBlog's RAG, but enterprise scenarios may need document-type-specific strategies
- Metadata annotation: Chunks must not lose context. Each chunk needs source, title, section, and date metadata — critical for retrieval and citation tracking
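The chunking-plus-metadata step above can be sketched in a few lines. This is a hypothetical stand-in for a real splitter (not the LangChain API itself): it breaks on paragraph boundaries, merges paragraphs up to a size budget so chunks end at semantic breaks, and attaches provenance metadata to every chunk.

```python
# Minimal sketch of semantic-ish chunking with metadata annotation.
# chunk_with_metadata is a hypothetical helper; a production system would
# use a document-type-specific splitter instead.

def chunk_with_metadata(text, source, title, max_chars=500):
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, buf = [], ""
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the budget.
        if buf and len(buf) + len(para) + 2 > max_chars:
            chunks.append(buf)
            buf = para
        else:
            buf = f"{buf}\n\n{para}" if buf else para
    if buf:
        chunks.append(buf)
    # Each chunk keeps its provenance, which retrieval and citation rely on.
    return [
        {"text": c, "metadata": {"source": source, "title": title, "chunk_index": i}}
        for i, c in enumerate(chunks)
    ]
```

The key design point is that metadata travels with the chunk from ingestion onward; bolting it on later is much harder.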
Retrieval Layer: Hybrid Search → Reranking → Top K
Pure vector search has clear limitations — exact-match queries (product numbers, error codes) perform poorly. The mature approach is hybrid retrieval:
- Vector + BM25: Semantic similarity and keyword matching complement each other
- Rerank: Cross-encoders re-rank initial results, significantly improving relevance
- Top K control: More isn't better. Usually Top 3-5 is the sweet spot — too many adds noise, too few risks missing key information
I applied a similar approach in Kokoron's memory system — ChromaDB for vector search, combined with importance scoring for reranking.
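The hybrid idea can be shown with a toy sketch. Everything here is illustrative: the keyword score is a crude overlap measure standing in for BM25, and the embeddings are hand-rolled vectors rather than real model output.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query, doc):
    # Crude token overlap as a stand-in for BM25 lexical matching.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_search(query, query_vec, docs, alpha=0.5, top_k=3):
    # docs: list of (text, embedding). Blend semantic and lexical scores,
    # then keep the top-k candidates for a downstream reranker.
    scored = [
        (alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_score(query, text), text)
        for text, vec in docs
    ]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:top_k]
```

Note how an exact-match query like an error code gets rescued by the lexical half even when embeddings alone would rank it poorly.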
Generation Layer: System Prompt Constraints + Citation
Retrieval is the means; generation is the goal. This layer solves two key problems:
- System Prompt constraints: Carefully designed prompts limit the model to answering based on retrieved documents, reducing hallucination
- Citation: Every answer should cite its sources. This improves trustworthiness and makes it easy for users to verify information. In enterprise settings, citation is almost mandatory
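Both constraints can live in the prompt itself. A minimal sketch, assuming chunks arrive as dicts with `text` and `source` keys (a shape I'm choosing for illustration):

```python
def build_prompt(question, chunks):
    # Number each document so the model can cite it as [n].
    context = "\n\n".join(
        f"[{i + 1}] (source: {c['source']})\n{c['text']}" for i, c in enumerate(chunks)
    )
    return (
        "Answer ONLY from the numbered documents below. "
        "Cite sources as [n] after each claim. If the documents do not "
        'contain the answer, reply "I don\'t know."\n\n'
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```

The numbered-source convention also makes it trivial to map `[n]` citations in the output back to the original documents for display.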
Layer 3: Quality Defense — Evaluation Loop
Building the pipeline isn't enough. Without evaluation, you can't tell whether the system is good or degrading.
Offline Evaluation
- Golden Dataset: Human-annotated standard Q&A pairs as a baseline
- Recall@K: Proportion of queries for which a document containing the correct answer appears in the top-K retrieved results — measures retrieval quality
- Faithfulness: Whether generated answers are faithful to retrieved documents rather than fabricated — measures generation quality
These metrics form the baseline for version iterations.
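Recall@K in particular is cheap to compute once you have a golden dataset. A minimal sketch, where `retrieve` is whatever retrieval function you're evaluating:

```python
def recall_at_k(golden, retrieve, k=5):
    # golden: list of (query, relevant_doc_id) pairs from the golden dataset.
    # retrieve(query, k) -> ranked list of doc ids.
    # Fraction of queries whose relevant document appears in the top-k results.
    hits = sum(1 for query, doc_id in golden if doc_id in retrieve(query, k))
    return hits / len(golden) if golden else 0.0
```

Running this on every version of the pipeline turns "did my chunking change help?" into a number instead of a guess.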
Online Monitoring
After launch, offline metrics alone aren't sufficient:
- Thumbs-down rate: Proportion of user-flagged bad answers
- Citation click rate: Whether users actually check the sources — high click rates indicate verification intent
- No-answer rate: How often the system says "I don't know" — too high means knowledge gaps, too low suggests forced answers
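These three rates can be derived from a simple event log. The log shape below is an assumption for illustration, not a prescribed schema:

```python
from collections import Counter

def monitor_stats(events):
    # events: dicts like {"type": "answer" | "no_answer" | "thumbs_down"
    # | "citation_click"} — a hypothetical log format.
    counts = Counter(e["type"] for e in events)
    total = counts["answer"] + counts["no_answer"]
    return {
        "no_answer_rate": counts["no_answer"] / total if total else 0.0,
        "thumbs_down_rate": counts["thumbs_down"] / total if total else 0.0,
        "citation_click_rate": counts["citation_click"] / total if total else 0.0,
    }
```

Tracked over time, a drifting no-answer rate is often the first visible symptom of knowledge-base gaps or retrieval regressions.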
Layer 4: Stability — Performance Optimization
Enterprise scenarios aren't demos. You face real concurrency and cost pressure.
Semantic Cache
If 1,000 users ask the same type of question, why call the LLM 1,000 times?
Semantic caching returns cached results for semantically similar queries. Unlike traditional caching, "same" here means similarity in vector space. The impact can be dramatic — 10x QPS improvement, 90% reduction in LLM costs.
Implementation considerations:
- Cache invalidation (clear related caches when knowledge base updates)
- Similarity threshold tuning (too low returns irrelevant answers, too high negates the cache)
- Caution with personalization (different user contexts may need different answers for the same question)
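The core mechanism is small enough to sketch. This in-memory version (a simplification; production caches would persist entries and scope them per user) illustrates the threshold lookup and the invalidation hook:

```python
import math

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.92):
        self.threshold = threshold  # tune per the trade-off described above
        self.entries = []  # list of (query_embedding, answer)

    def get(self, query_vec):
        # Return a cached answer only if some stored query is similar enough.
        best = max(self.entries, key=lambda e: _cosine(query_vec, e[0]), default=None)
        if best and _cosine(query_vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query_vec, answer):
        self.entries.append((query_vec, answer))

    def invalidate(self):
        # Called when the knowledge base updates, so stale answers don't leak.
        self.entries.clear()
```

A real deployment would swap the linear scan for a vector index, but the threshold-and-invalidate logic stays the same.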
Async Streaming
Time-to-first-token is a critical UX metric. Nobody wants to stare at a blank screen for 5 seconds.
SSE (Server-Sent Events) is the most common approach. I used FastAPI + SSE in shinBlog's RAG service. The target: first-token latency under 200ms.
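The SSE framing itself is just a generator over the token stream. A minimal sketch of the wire format (the framework integration is left as a note below):

```python
def sse_events(token_stream):
    # Wrap an LLM token stream in Server-Sent Events framing:
    # each event is "data: <payload>\n\n", ending with a done marker.
    for token in token_stream:
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"
```

With FastAPI, a generator like this would typically be returned via `StreamingResponse(sse_events(stream), media_type="text/event-stream")`, so the first token reaches the browser as soon as the model emits it.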
Layer 5: Fallback — Safety Net
The last layer is often the most overlooked, yet the most critical — how to handle things gracefully when the system is uncertain or broken.
Rejection Mechanism
When retrieval similarity falls below a threshold (e.g., 0.6), refuse to answer instead of fabricating. Saying "I don't know" is always better than being wrong.
In Koclaw's memory system, I set a rule to not auto-inject memories with importance < 2 — the logic is the same: low-quality information is worse than none.
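The rejection rule reduces to one guard in front of generation. A minimal sketch, assuming retrieval returns `(similarity, chunk)` pairs sorted best-first:

```python
def answer_or_refuse(hits, generate, threshold=0.6):
    # hits: list of (similarity, chunk), best match first.
    # Refuse when even the best hit falls below the confidence threshold.
    if not hits or hits[0][0] < threshold:
        return "I don't know: the knowledge base has no reliable answer for this."
    return generate([chunk for _, chunk in hits])
```

The threshold itself should come from the offline evaluation set, not a gut feeling, since it directly trades the no-answer rate against the hallucination rate.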
Rule Intervention
Certain high-sensitivity queries shouldn't go through the LLM:
- Queries containing personal privacy information
- Questions about legal, medical, or other professional domains
- Queries with specific sensitive keywords
These should be routed to rule engines or human handlers.
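A keyword gate in front of the pipeline is the simplest form of this routing. The keyword list below is purely illustrative; a real system would maintain it per domain and pair it with classifier-based detection:

```python
# Illustrative sensitive-term list; maintained per deployment in practice.
SENSITIVE_KEYWORDS = {"ssn", "password", "diagnosis", "lawsuit"}

def route(query):
    # Route high-sensitivity queries away from the LLM before retrieval runs.
    tokens = set(query.lower().split())
    if tokens & SENSITIVE_KEYWORDS:
        return "human"  # escalate to a rule engine or human handler
    return "rag"        # normal RAG pipeline
```

The point is that this check runs before any retrieval or generation, so sensitive content never touches the model at all.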
Degradation Strategy
When the vector database goes down, the system shouldn't crash — it should fall back to BM25 keyword search. Quality drops, but basic availability is maintained.
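The fallback is a try/except around the primary retriever. A minimal sketch, assuming the vector client raises `ConnectionError` when the database is unreachable (substitute your client's actual exception type):

```python
def search_with_fallback(query, vector_search, bm25_search):
    # Prefer vector search; if the vector DB is unreachable, degrade to
    # BM25 keyword search instead of failing the whole request.
    try:
        return vector_search(query)
    except ConnectionError:
        return bm25_search(query)
```

Pair this with a metric on how often the fallback fires, so a silently degraded system doesn't go unnoticed.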
Personal Reflections
Building RAG looks simple — chunk, embed, retrieve, generate. Four steps, done. But making it run reliably in production requires far more consideration than those four steps.
From shinBlog's RAG assistant, to Koclaw's memory retrieval system, to enterprise RAG requirements I encountered during my internship — each round of practice reinforced the same lesson: the technical solution is 30% of the work; the other 70% is engineering, observability, and edge case handling.
This five-layer framework isn't a destination but a starting point for thinking. If you're working on RAG, I hope these thoughts are useful.
Questions or discussions welcome via GitHub.