---
title: "How mbRAG Delivers Research-Grade Accuracy with 4 Pipeline Levels"
date: "2026-04-13"
author: "MokingBird Team"
tags: ["mbRAG", "RAG", "retrieval-augmented generation", "local AI", "LLM", "accuracy"]
---
# How mbRAG Delivers Research-Grade Accuracy with 4 Pipeline Levels
Retrieval-Augmented Generation is one of the most practically useful ideas in applied AI. Instead of asking a language model to answer from its training data alone — which is frozen at a cutoff date and doesn't include your specific documents — RAG retrieves relevant passages from your documents and gives them to the model as context. The result is answers that are grounded in your actual data.
The idea is simple. The implementation is where most frameworks fail.
---
## The Problem with Existing RAG Frameworks
LangChain made RAG accessible. It also made it fragile. Breaking changes between versions. Vendor lock-in through obscured abstractions. Poor retrieval accuracy on anything beyond toy datasets. A configuration surface that exposes implementation details you shouldn't have to care about.
The deeper problem is that most RAG implementations use a single retrieval strategy — typically simple cosine similarity over dense embeddings — and tune nothing else. This works acceptably for simple queries over well-structured documents. It breaks down for:
- Complex, multi-part questions
- Documents with overlapping or contradictory information
- Long documents where chunk context is lost
- Queries that require reasoning across multiple passages
mbRAG was built to solve these failure modes. Not by wrapping LangChain, but by implementing every major RAG approach from scratch in a unified, stable system.
---
## Four Levels, One System
mbRAG organizes its capabilities into four pipeline levels. You choose based on your latency requirements and accuracy needs.
### L1 Basic (~0.8s)
**Similarity search + direct generation**
Dense embedding lookup over your vector store, followed by direct LLM generation from the top-k retrieved chunks. Fast, reliable for straightforward Q&A, appropriate when latency matters more than maximum accuracy.
Best for: chatbots, real-time lookup, simple factual queries over clean documents.
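To make the L1 step concrete, here is a minimal sketch of top-k retrieval by cosine similarity. It is not mbRAG's code: the function names, the in-memory `index` dictionary, and the three-dimensional toy vectors are all assumptions for illustration. A real setup would get its vectors from an embedding model and query a vector store instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=3):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec))
              for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy "embeddings"; real ones have hundreds of dimensions.
index = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "warranty": [0.7, 0.2, 0.3],
}
results = top_k([1.0, 0.0, 0.0], index, k=2)
```

The retrieved chunks would then be placed into the LLM prompt as context. Everything L2 and above adds happens between this lookup and generation.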
### L2 Enhanced (~1.5s)
**Reranking + improved context**
Adds a reranking step after initial retrieval. Candidate chunks are scored a second time using a cross-encoder (or LLM-based reranker) that considers the full query-chunk pair — not just the embedding similarity. This step alone typically improves precision by 15–25% over L1 on complex queries.
Best for: knowledge bases, document Q&A, customer support systems.
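The two-stage shape of reranking can be sketched as follows. The pair scorer here is a deliberately simple term-overlap function standing in for a cross-encoder, which would read the query and passage jointly. `overlap_score` and `rerank` are hypothetical names, not part of mbRAG.

```python
def overlap_score(query, passage):
    """Toy pair scorer: fraction of query terms found in the passage.

    A stand-in for a cross-encoder or LLM-based reranker.
    """
    query_terms = set(query.lower().split())
    passage_terms = set(passage.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & passage_terms) / len(query_terms)

def rerank(query, candidates, score_pair=overlap_score, top_n=2):
    """Re-score first-stage candidates on the (query, passage) pair."""
    rescored = [(passage, score_pair(query, passage)) for passage in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_n]

# Candidates as they might arrive from the first-stage vector search.
candidates = [
    "Shipping usually takes five business days.",
    "Refunds are issued within 30 days of purchase.",
    "Our refund policy covers damaged items only.",
]
reranked = rerank("refund policy for damaged items", candidates)
```

The design point is that `score_pair` sees both texts at once, so it can reward a passage that addresses the query even when the embedding similarity ranked it low. Swapping in a real model only means passing a different `score_pair`.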
### L3 Smart (~2.2s)
**Multi-query + ensemble retrieval**
Generates multiple query variations from your original question (multi-query expansion), retrieves results for each variation, and combines them using an ensemble approach that merges sparse retrieval (BM25) with dense retrieval. Covers more of the relevant result space and handles ambiguous queries significantly better.
Best for: research assistance, legal document review, technical documentation search.
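The article does not say which fusion method mbRAG's ensemble uses. One common way to merge ranked lists from dense retrieval, BM25, and several query variants is reciprocal rank fusion (RRF), sketched below as an assumption. The chunk ids and the constant `k=60` are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one ranking.

    Each list contributes 1 / (k + rank) for every id it contains,
    with rank starting at 1. Ids that rank well in many lists rise.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)

dense_ranking = ["c3", "c1", "c7"]    # from vector search
sparse_ranking = ["c1", "c9", "c3"]   # from BM25
variant_ranking = ["c1", "c3", "c5"]  # from a reformulated query
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking, variant_ranking])
```

RRF only needs ranks, not scores, which is convenient here because BM25 scores and cosine similarities are not on a comparable scale.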
### L4 Advanced (~3.5s)
**Full contextual retrieval + reasoning**
The full pipeline. Contextual retrieval at all 6 levels, multi-query expansion, ensemble retrieval, cross-encoder reranking, parent document retrieval (retrieve the full section around a relevant chunk rather than the chunk alone), and LLM-based reasoning over the assembled context.
This is the pipeline that delivers the 40–50% accuracy improvement over naive implementations on complex queries.
Best for: research-grade applications, sensitive document analysis, high-stakes Q&A where accuracy is critical.
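Parent document retrieval, mentioned above, can be sketched as a lookup from matched chunks back to the sections they came from. The `Chunk` type, the ids, and the sample text are hypothetical. The idea is only that small chunks are used for matching while the larger parent section is what reaches the LLM.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    parent_id: str
    text: str

def expand_to_parents(hit_chunk_ids, chunks, parents):
    """Map ranked chunk hits to parent sections, de-duplicated in rank order."""
    seen = set()
    sections = []
    for chunk_id in hit_chunk_ids:
        parent_id = chunks[chunk_id].parent_id
        if parent_id not in seen:
            seen.add(parent_id)
            sections.append(parents[parent_id])
    return sections

parents = {
    "sec-2": "Section 2: Termination. Either party may terminate with notice.",
    "sec-5": "Section 5: Liability. Liability is capped at the contract value.",
}
chunks = {
    "c1": Chunk("c1", "sec-2", "Either party may terminate"),
    "c2": Chunk("c2", "sec-2", "with notice."),
    "c3": Chunk("c3", "sec-5", "Liability is capped"),
}
sections = expand_to_parents(["c2", "c1", "c3"], chunks, parents)
```

Two hits from the same section collapse into one context entry, so the prompt is not padded with the same section twice.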
---
## The 8-Step RetrievalOrchestrator
Every query, at any level, passes through the RetrievalOrchestrator, the core of the system. Steps marked with a level (for example, L3+) run only at that level or higher. Here's what happens:
1. **Query analysis** — Parse the query, identify entity types, determine optimal retrieval strategy
2. **Query expansion** (L3+) — Generate multiple reformulations to cover different phrasings
3. **Embedding** — Embed the query using your configured embedding model
4. **Vector retrieval** — Fetch candidate chunks from your vector store
5. **Sparse retrieval** (L3+) — BM25 retrieval in parallel for ensemble fusion
6. **Reranking** (L2+) — Score candidates using cross-encoder for precision
7. **Context enhancement** — Apply contextual enrichment based on your level configuration
8. **Response generation** — Pass assembled context to the LLM with appropriate prompt structure
The result is a grounded response with source attribution — you know which passages informed each answer.
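The level gating in the list above can be written down as data. This sketch only encodes which of the eight steps run at each pipeline level, as stated in the list. The step identifiers and the `active_steps` function are illustrative names, not the RetrievalOrchestrator's actual interface.

```python
# (step name, minimum pipeline level at which the step runs)
STEPS = [
    ("query_analysis", 1),
    ("query_expansion", 3),
    ("embedding", 1),
    ("vector_retrieval", 1),
    ("sparse_retrieval", 3),
    ("reranking", 2),
    ("context_enhancement", 1),
    ("response_generation", 1),
]

def active_steps(level):
    """Return the ordered step names that run at the given level (1 to 4)."""
    if level not in (1, 2, 3, 4):
        raise ValueError(f"unknown pipeline level: {level}")
    return [name for name, min_level in STEPS if level >= min_level]

for level in (1, 2, 3, 4):
    print(f"L{level}: {' -> '.join(active_steps(level))}")
```

Read this way, L1 runs five of the steps, L2 adds reranking, and L3 and L4 run all eight. What separates L4 from L3 is how much each step does, not which steps run.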
---
## 6-Level Contextual Retrieval: The Signature Innovation
Standard RAG splits documents into chunks and indexes them. The problem: chunks lose context. A chunk that says "the agreement was terminated on March 15th" doesn't tell you which agreement. A chunk that references "the method described above" doesn't include what that method is.
MokingBird's **6-Level Contextual Retrieval** enriches every chunk before indexing it:
| Level | Context added |
|-------|-------------|
| L1 | Document title and section heading |
| L2 | + Preceding paragraph summary |
| L3 | + Following paragraph summary |
| L4 | + Document-level metadata (author, date, type) |
| L5 | + Entity relationships extracted from surrounding text |
| L6 | + Full document summary as additional context signal |
Each level adds computational cost at indexing time but dramatically improves retrieval quality. At L6, every chunk carries rich contextual signals that allow the vector store to match it against queries that would have missed it entirely in a naive chunking setup.
This is the primary driver of the 40–50% accuracy improvement on complex document queries.
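As a rough illustration of the table above, the sketch below prefixes a chunk with the context fields for levels 1 through a chosen level before the text is embedded. The field keys, labels, and output format are assumptions. The summaries and entity relationships would come from separate extraction steps that are not shown.

```python
# One entry per context level, in the order of the table above.
CONTEXT_FIELDS = [
    ("heading", "Title and section"),        # L1
    ("prev_summary", "Previous paragraph"),  # L2
    ("next_summary", "Next paragraph"),      # L3
    ("metadata", "Metadata"),                # L4
    ("entities", "Entities"),                # L5
    ("doc_summary", "Document summary"),     # L6
]

def enrich_chunk(chunk_text, context, level):
    """Return the chunk prefixed with context fields for levels 1..level."""
    if not 1 <= level <= len(CONTEXT_FIELDS):
        raise ValueError(f"level must be between 1 and {len(CONTEXT_FIELDS)}")
    lines = []
    for key, label in CONTEXT_FIELDS[:level]:
        value = context.get(key)
        if value:  # skip fields that were not extracted
            lines.append(f"{label}: {value}")
    lines.append(chunk_text)
    return "\n".join(lines)

context = {
    "heading": "Supplier Agreement 2023 > Termination",
    "prev_summary": "Defines the notice period for both parties.",
    "next_summary": "Lists obligations that survive termination.",
}
enriched = enrich_chunk("The agreement was terminated on March 15th.", context, level=2)
print(enriched)
```

With the heading attached, the chunk from the earlier example now says which agreement was terminated, so a query naming that agreement can match it.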
---
## What You Can Connect
**17 document formats, including:**
PDF (PyMuPDF + pypdf + pdfplumber with auto-fallback), DOCX, Excel, CSV, JSON, Markdown, PowerPoint, Email, Images with OCR (Tesseract + EasyOCR), Web content, and Multimodal documents.
**10 LLM providers, including:**
- Cloud: OpenAI (GPT-4, GPT-3.5), Anthropic Claude, Google Gemini
- Local: Ollama (llama3, mistral, phi, and any locally available model), HuggingFace Direct, vLLM, llama.cpp
**8 embedding providers, including:**
OpenAI, local sentence-transformers, Ollama, HuggingFace, Cohere, and custom API endpoints.
**6 vector store backends:**
ChromaDB (default, local), FAISS, Qdrant, Pinecone, Weaviate, Milvus.
**6 retrieval strategies:**
Similarity search, MMR (Maximal Marginal Relevance — diversity-aware retrieval), Multi-Query, Ensemble (sparse + dense fusion), Parent Document, Contextual (all 6 levels).
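Of the strategies listed, MMR is the easiest to misread, so here is a minimal greedy sketch of it. Each pick maximizes `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the highest similarity to anything already selected. The vectors, ids, and parameter name are illustrative assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec, candidates, k=2, lambda_mult=0.5):
    """Greedily select k chunk ids, trading relevance against redundancy."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        best_id, best_score = None, float("-inf")
        for chunk_id in remaining:
            relevance = cosine(query_vec, candidates[chunk_id])
            redundancy = max(
                (cosine(candidates[chunk_id], candidates[s]) for s in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1.0 - lambda_mult) * redundancy
            if score > best_score:
                best_id, best_score = chunk_id, score
        selected.append(best_id)
        remaining.remove(best_id)
    return selected

# "a" and "b" are near-duplicates; "c" is less relevant but different.
candidates = {
    "a": [0.9, 0.3, 0.0],
    "b": [0.9, 0.35, 0.0],
    "c": [0.8, 0.0, 0.6],
}
diverse = mmr([1.0, 0.0, 0.0], candidates, k=2, lambda_mult=0.5)
plain = mmr([1.0, 0.0, 0.0], candidates, k=2, lambda_mult=1.0)
```

With `lambda_mult=1.0` the selection reduces to plain similarity search and returns the two near-duplicates. Lowering it makes the second pick favor a chunk that adds new information.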
---
## Use Cases
**Legal document review.** A law firm has thousands of contracts. Using mbRAG at L4 with parent document retrieval, associates can query across the entire contract corpus and get answers with precise source citations — without uploading documents to any cloud service.
**Medical literature search.** A researcher is working with a corpus of clinical papers. mbRAG's 6-level contextual retrieval maintains the meaning of clinical concepts across chunk boundaries, reducing false retrievals that would otherwise introduce incorrect context into LLM responses.
**Enterprise knowledge base.** An engineering team wants to query across internal wikis, Confluence pages, Slack exports, and technical specifications. mbRAG's 17-format support ingests the full corpus; L3 multi-query expansion handles the varied phrasings engineers use when searching for the same concept.
**Software engineering assistance.** A developer feeds mbRAG their entire codebase documentation, API specs, and internal runbooks. Queries like "how does the auth middleware work with the rate limiter?" are answered accurately because contextual retrieval preserves the relationships between components across documents.
---
## Accuracy vs. Latency: Choosing Your Level
The right pipeline level depends on your use case:
| Use case | Recommended level | Reason |
|----------|------------------|--------|
| Real-time chat | L1–L2 | Lowest latency (~0.8–1.5s) |
| Document Q&A | L2–L3 | Good accuracy, acceptable latency |
| Research / analysis | L3–L4 | Maximum accuracy, latency acceptable |
| Batch processing | L4 | Latency irrelevant, accuracy is everything |
You can also configure mbRAG to automatically select the pipeline level based on query complexity — simple queries route to L1, complex multi-part queries escalate to L4.
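How mbRAG scores query complexity is not described here. As a purely hypothetical sketch, a router could count a few surface signals and map them to a level. The thresholds and keyword list below are made up for illustration and would need tuning on real queries.

```python
import re

# Hypothetical cue words that suggest comparison or multi-step reasoning.
REASONING_CUES = {"compare", "why", "versus", "difference", "across"}

def choose_level(query):
    """Map a rough complexity score (0 to 3) to a pipeline level (1 to 4)."""
    words = re.findall(r"[a-z0-9']+", query.lower())
    score = 0
    if len(words) > 12:                          # long query
        score += 1
    if "and" in words or query.count("?") > 1:   # multi-part question
        score += 1
    if REASONING_CUES & set(words):              # asks for reasoning
        score += 1
    return 1 + score

simple = choose_level("What is the refund window?")
complex_ = choose_level(
    "Compare the termination clauses across the 2023 and 2024 supplier "
    "contracts and explain why they differ?"
)
```

A heuristic like this is cheap enough to run on every query. A misrouted query costs either some latency (sent too high) or some accuracy (sent too low), which is worth keeping in mind when setting thresholds.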
---
## Fully Local, Fully Yours
mbRAG runs on your hardware. Your documents are indexed locally. Your vector stores are stored on your filesystem. Your queries are processed on your machine.
If you use a local LLM (Ollama, llama.cpp), the entire pipeline — from document ingestion to final answer — runs without a single network request. Complete air-gap operation.
If you connect to a cloud LLM using your own API key, your key is stored locally and transmitted only to the provider's API, never to MokingBird servers.
---
## Beyond a Single Chat Interface
RAG is often implemented as a chat feature. But retrieval is infrastructure — and mbRAG is designed for that scope.
The same core retrieval framework can power multiple product surfaces:
- **Internal knowledge assistants** — query HR policies, engineering runbooks, or company wikis
- **Documentation copilots** — help developers navigate large codebases or API documentation
- **Compliance-support retrieval** — surface relevant regulatory clauses or audit evidence
- **Research synthesis pipelines** — aggregate and query across large academic or technical corpora
- **Customer support tooling** — answer support queries from product documentation
One well-built RAG layer can serve all of these. mbRAG is built for that role.
---
## Download Free
mbRAG is available as part of MokingBird AI — free to download and use.
Advanced features (L3 and L4 pipelines, all 6 vector store backends, all 10 LLM providers, full analytics) are available in the Premium tier.
Download at [ai.mokingbird.xyz](https://ai.mokingbird.xyz).