How to Deploy RAG for Documentation Search: Complete Tutorial 2025 🤖

✍️Enzo

📅10/22/2025

⏱️12 min

#RAG#AI#Documentation#Search#LLM#Vector Database#MkDocs#LangChain#ChromaDB#FastAPI#OpenAI#Semantic Search

You know that feeling? You're an SRE, it's 3 AM, a service is down, and you need to find THE rollback procedure in a 500-page doc... You scroll, you search, you curse the colleague who wrote "see previous section" without saying which one. 😩

Plot twist: What if I told you we can transform this nightmare into a fluid conversation with an AI documentation chatbot that knows all your documentation by heart? At our company, with 50 SREs juggling between incidents and maintenance, implementing a RAG system reduced critical information search time by 3x.

In this RAG tutorial, you'll learn how to implement a complete Retrieval-Augmented Generation system on MkDocs Material documentation using LangChain, ChromaDB, and FastAPI. This step-by-step guide shows you how to build an intelligent documentation assistant that delivers accurate answers with source citations.

The Problem: Documentation that works... but not really

Our context before RAG

Picture this: 50 SREs, 7 teams, MkDocs Material documentation with:

500+ pages of runbooks, procedures, API docs
Limited native search (no semantic search)
Complex tree navigation with 5 levels of depth
Broken links that multiply like gremlins

Our teams' daily routine:

# Classic scenario at 3 AM
1. Incident detected → Service X down
2. Search procedure → 15 minutes of navigation
3. "Oh no, this isn't the right version"
4. Re-search → 10 more minutes
5. Procedure found → FINALLY!

Total: 25 minutes lost in critical situation 😱

The painful stats:

27 minutes/day on average per SRE searching for info
43% of Slack questions about "where do we find this doc?"
67% of on-calls lose time on documentation search

Key insight: The problem wasn't our doc quality, but its cognitive accessibility!

Native MkDocs Material limitations and why you need semantic search

While MkDocs Material is excellent, traditional keyword search has significant limitations that a RAG implementation can solve:

❌ Keyword-only search (no semantic understanding)
❌ No context understanding across documents
❌ Results sometimes too numerous or off-topic
❌ Can't ask questions in natural language
❌ No cross-document info aggregation
❌ Search limited to titles and first paragraphs
❌ No notion of priority or urgency

The community has long requested semantic search improvements, as shown by a GitHub issue that has remained open for several years.

Comparison with other solutions:

Solution	Semantic Search	Conversational AI	Existing Integration
Native MkDocs Material	❌	❌	✅
Algolia DocSearch	⚠️ Limited	❌	⚠️ Complex setup
RAG + LLM	✅	✅	✅
GitBook	✅	⚠️ Basic	❌ Migration required

Concrete example:

Question: "How to rollback the auth service urgently?"
MkDocs search: 47 results with "rollback", "auth", "service"
Time to find THE right info: 12 minutes 😤

The Solution: How to implement RAG for documentation search

Our RAG architecture with LangChain and ChromaDB

Here's how we built our AI-powered documentation assistant using a complete RAG stack:

# Complete RAG tech stack for documentation search
TECH_STACK = {
    "backend": "FastAPI",           # Fast REST API for RAG endpoints
    "embeddings": "OpenAI text-embedding-3-small",  # Vector embeddings (1536 dimensions by default, reducible via the `dimensions` parameter; $0.02/1M tokens)
    "vector_db": "ChromaDB",        # Vector database for semantic search (alternative: Pinecone, Weaviate)
    "llm": "GPT-4o-mini",          # LLM for response generation ($0.15/1M input tokens)
    "framework": "LangChain",      # RAG orchestration framework
    "docs_source": "MkDocs Material",
    "deployment": "Docker + K8s",
    "monitoring": "Prometheus + Grafana",  # RAG metrics tracking
    "cache": "Redis",              # Semantic cache for performance
}

RAG Workflow:

[Figure: RAG architecture flow, from documentation indexing through the vector database to LLM-generated answers with source citations]

[Figure: RAG sequence diagram showing the request-response cycle between user, API, and vector database]

How to integrate RAG with MkDocs Material documentation

The genius of our approach: no need to modify MkDocs! This RAG tutorial shows you how to scrape existing content and build a vector database index in parallel, enabling semantic search without changing your current documentation setup.

# Base configuration for MkDocs indexing
MKDOCS_CONFIG = {
    "docs_path": "/app/docs",
    "base_url": "https://docs.company.com",
    "chunk_size": 1000,      # Optimal for runbooks
    "chunk_overlap": 200,    # Maintains coherence
    "file_types": [".md"],
    "exclude_patterns": ["temp/", "drafts/"]
}
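
To make the config above more concrete, here is a minimal sketch of how an indexer could collect files under `docs_path` while honoring `file_types` and `exclude_patterns`. The helper name `find_doc_files` and the rule "skip a file when one of its parent folders matches an exclude pattern" are assumptions for illustration, not part of our production code.

```python
from pathlib import Path


def find_doc_files(docs_path, file_types, exclude_patterns):
    """Return sorted relative POSIX paths of the files to index.

    Hypothetical helper: a file is skipped when any directory component
    of its relative path matches an exclude pattern such as "drafts/".
    """
    root = Path(docs_path)
    excluded_dirs = {pattern.strip("/") for pattern in exclude_patterns}
    selected = []
    for path in root.rglob("*"):
        if not path.is_file() or path.suffix not in file_types:
            continue
        relative = path.relative_to(root)
        # relative.parts[:-1] are the directory names above the file
        if excluded_dirs.intersection(relative.parts[:-1]):
            continue
        selected.append(relative.as_posix())
    return sorted(selected)
```

Called as `find_doc_files("/app/docs", [".md"], ["temp/", "drafts/"])`, it would return only the Markdown files outside `temp/` and `drafts/`.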

Step 1: Building the FastAPI backend for RAG

How to create the /ask endpoint with streaming responses

Here's the complete FastAPI implementation for our RAG system with OpenAI embeddings and streaming:

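from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class QuestionRequest(BaseModel):
    question: str

# `rag` is assumed to be an object created at startup that exposes
# `llm` (a LangChain chat model) and `vector_store` (the Chroma store)

@app.post("/ask")
def ask_question_stream(request: QuestionRequest):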
    question = request.question
    model = rag.llm

    # Base URL configuration
    BASE_DOCS_URL = "https://docs.company.com"

    # Optimized retriever for technical docs
    retriever = rag.vector_store.as_retriever(
        search_type="mmr",  # Maximum Marginal Relevance
        search_kwargs={
            "k": 8,           # 8 chunks for rich context
            "fetch_k": 20,    # Larger initial pool
            "lambda_mult": 0.7  # Balance relevance/diversity
        }
    )

    retrieved_docs = retriever.invoke(question)

    if not retrieved_docs:
        def empty_response():
            yield "❌ No relevant context found. Are you in the correct folder?"
        return StreamingResponse(empty_response(), media_type="text/plain")

    # Context construction with enriched metadata
    context_with_metadata = []
    sources_found = set()

    for i, doc in enumerate(retrieved_docs):
        relative_path = doc.metadata.get("relative_path", "Unknown")
        file_name = doc.metadata.get("file_name", "Unknown")
        chunk_id = doc.metadata.get("chunk_id", i)

        # Clickable URL generation (strip only a trailing .mdx/.md extension;
        # chained replace() calls would turn "page.mdx" into "pagex")
        clean_path = relative_path.removesuffix('.mdx').removesuffix('.md')
        doc_url = f"{BASE_DOCS_URL}/{clean_path}/"

        sources_found.add((relative_path, doc_url))

        # Enriched context with section headers
        context_piece = f"""
Source: {relative_path}
URL: {doc_url}
Section: Chunk {chunk_id + 1}
Content:
{doc.page_content}
---"""
        context_with_metadata.append(context_piece)

    context = "\n".join(context_with_metadata)

    # System prompt optimized for SREs
    system_prompt = (
        "You are a specialized SRE documentation assistant. "
        "Your role is to help Site Reliability Engineers find accurate, "
        "actionable information quickly during incidents and maintenance.\n\n"

        "📋 RESPONSE GUIDELINES:\n"
        "- Provide clear, step-by-step answers when possible\n"
        "- Prioritize emergency procedures and troubleshooting steps\n"
        "- Always cite specific documentation sources\n"
        "- Include direct links to full documentation\n"
        "- If multiple approaches exist, mention alternatives\n\n"

        "🎯 FORMAT YOUR RESPONSE:\n"
        "## Answer\n"
        "[Detailed response with actionable steps]\n\n"
        "## 📚 Sources\n"
        "[List each source with clickable links]\n\n"

        "Only use information from the provided context. "
        "If unsure, acknowledge limitations explicitly."
    )

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "system", "content": f"Context:\n{context}"},
        {"role": "user", "content": f"Question: {question}"}
    ]

    def response_stream():
        yield f"Analyzing {len(retrieved_docs)} chunks from {len(sources_found)} documentation files...\n\n"

        for chunk in model.stream(messages):
            if chunk.content:
                yield chunk.content

        # Clickable sources at the end of response
        yield "\n\n---\n📖 **Complete documentation links:**\n"
        for relative_path, doc_url in sorted(sources_found):
            yield f"- [{relative_path}]({doc_url})\n"

    return StreamingResponse(response_stream(), media_type="text/plain")

The key improvements we added:

Automatic URLs: Each source becomes a clickable link
Adaptive prompt: The system prompt can be switched by question type (tutorial, API, troubleshooting...), as covered in the best practices section
Streaming: Real-time response, no waiting
Enriched metadata: Clear context and provenance
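
URL generation is the easiest part to get subtly wrong, so here is a small standalone sketch of the path-to-URL mapping. It assumes MkDocs' default `use_directory_urls: true` behavior, where `page.md` is served at `/page/` and `index.md` at its folder URL. The function name `build_doc_url` is hypothetical.

```python
from pathlib import PurePosixPath


def build_doc_url(relative_path: str, base_url: str) -> str:
    """Map a Markdown source path to its published directory-style URL."""
    path = PurePosixPath(relative_path.replace("\\", "/"))
    if path.suffix in {".md", ".mdx"}:
        path = path.with_suffix("")  # strip only the trailing extension
    if path.name == "index":
        path = path.parent  # index.md is served at its folder URL
    base = base_url.rstrip("/")
    if str(path) in {".", ""}:
        return f"{base}/"
    return f"{base}/{path.as_posix()}/"
```

For example, `build_doc_url("runbooks/auth/rollback.md", "https://docs.company.com")` gives `https://docs.company.com/runbooks/auth/rollback/`.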

Key optimizations applied

🎯 Enhanced retrieval:

MMR (Maximum Marginal Relevance): Avoids redundant chunks
k=8: Sweet spot between context and relevance for technical docs
lambda_mult=0.7: Optimal diversity/similarity balance

💡 Pro tip: These parameters were adjusted after 2 weeks of testing with our SRE teams!
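
If MMR feels like a black box, the selection rule can be sketched in a few lines of plain Python. This is an illustration of the idea, not LangChain's actual implementation: `mmr_select` and the list-based vectors are made up for the example, and `candidate_vecs` plays the role of the `fetch_k` pool.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, candidate_vecs, k=8, lambda_mult=0.7):
    """Return indices of up to k candidates chosen by Maximum Marginal Relevance."""
    relevance = [cosine(query_vec, vec) for vec in candidate_vecs]
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        best_index, best_score = None, -math.inf
        for i in remaining:
            # Penalty: similarity to the closest chunk already selected
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_index, best_score = i, score
        selected.append(best_index)
        remaining.remove(best_index)
    return selected
```

With `lambda_mult=1.0` this degenerates into a plain similarity ranking. Lower values increasingly penalize chunks that look like ones already picked, which is what removes near-duplicate paragraphs from the context.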

Step 2: Implementing document indexing with LangChain

How to build an automated indexing script for vector embeddings

import os
import yaml
from pathlib import Path
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# LangChain-based document indexer for RAG implementation
class MkDocsIndexer:
    def __init__(self, docs_path: str, base_url: str):
        self.docs_path = Path(docs_path)
        self.base_url = base_url
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            separators=["\n## ", "\n### ", "\n\n", "\n", " ", ""]
        )

    def load_mkdocs_config(self):
        """Load MkDocs config to respect structure"""
        config_path = self.docs_path / "mkdocs.yml"
        if config_path.exists():
            with open(config_path, 'r') as f:
                return yaml.safe_load(f)
        return {}

    def extract_metadata(self, file_path: Path) -> dict:
        """Extract enriched metadata for SRE docs"""
        relative_path = file_path.relative_to(self.docs_path)

        # Parse front matter for tags and metadata
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()

        metadata = {
            "source": str(file_path),
            "relative_path": str(relative_path),
            "file_name": file_path.stem,
            "last_modified": file_path.stat().st_mtime
        }

        # Automatic doc type detection
        if "runbook" in str(relative_path).lower():
            metadata["doc_type"] = "runbook"
        elif "api" in str(relative_path).lower():
            metadata["doc_type"] = "api_doc"
        elif "troubleshoot" in str(relative_path).lower():
            metadata["doc_type"] = "troubleshooting"
        else:
            metadata["doc_type"] = "general"

        return metadata

    def process_documents(self):
        """Process all MkDocs documents"""
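        # TextLoader keeps the raw Markdown, so the "\n## " separators still match
        # (loader_cls=None would fail when the loader is instantiated)
        from langchain_community.document_loaders import TextLoader

        loader = DirectoryLoader(
            str(self.docs_path),
            glob="**/*.md",
            loader_cls=TextLoader,
            loader_kwargs={"encoding": "utf-8"},
            show_progress=True
        )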

        documents = loader.load()
        processed_docs = []

        for doc in documents:
            # Enrich with metadata
            enhanced_metadata = self.extract_metadata(Path(doc.metadata["source"]))
            doc.metadata.update(enhanced_metadata)

            # Intelligent splitting by sections
            chunks = self.text_splitter.split_documents([doc])

            # Add chunk_id for navigation
            for i, chunk in enumerate(chunks):
                chunk.metadata["chunk_id"] = i
                processed_docs.append(chunk)

        return processed_docs

# Usage
indexer = MkDocsIndexer("/app/docs", "https://docs.company.com")
documents = indexer.process_documents()
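
To see what `chunk_size=1000` and `chunk_overlap=200` mean in practice, here is a deliberately simplified sliding-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one also tries the separators in order before cutting); it only shows how consecutive chunks share their boundary text. The name `split_with_overlap` is an assumption.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200):
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

A 2,500-character page becomes three chunks of 1000, 1000 and 900 characters, and the last 200 characters of each chunk reappear at the start of the next one. That repetition is what keeps a procedure step from being cut off from its context.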

SRE document type management

# Automatic classification by content type
DOC_TYPES_CONFIG = {
    "runbook": {
        "weight": 1.5,      # High priority for incidents
        "keywords": ["incident", "rollback", "emergency", "critical"]
    },
    "api_doc": {
        "weight": 1.2,
        "keywords": ["endpoint", "authentication", "request", "response"]
    },
    "troubleshooting": {
        "weight": 1.4,      # High priority for debugging
        "keywords": ["error", "debug", "logs", "diagnostic"]
    },
    "general": {
        "weight": 1.0,
        "keywords": []
    }
}
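
One possible way to use these weights is to re-rank retrieved chunks by multiplying each similarity score by the weight of its `doc_type`. The sketch below assumes scores where higher means more similar (some vector stores return distances instead, which would need converting first). `rerank_by_doc_type` is a hypothetical helper.

```python
DOC_TYPE_WEIGHTS = {"runbook": 1.5, "troubleshooting": 1.4, "api_doc": 1.2, "general": 1.0}


def rerank_by_doc_type(scored_chunks, weights=DOC_TYPE_WEIGHTS):
    """Sort (similarity, metadata) pairs by similarity times the doc-type weight.

    Assumes a higher similarity means a better match.
    """
    def weighted(item):
        similarity, metadata = item
        return similarity * weights.get(metadata.get("doc_type", "general"), 1.0)

    return sorted(scored_chunks, key=weighted, reverse=True)
```

With this rule, a runbook chunk scoring 0.60 (weighted 0.90) outranks a general page scoring 0.80, which matches the intent of prioritizing incident material.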

Step 3: Integration into SRE workflow

Deployment with Docker and Kubernetes

# docker-compose.yml for local dev
version: '3.8'
services:
  rag-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - DOCS_PATH=/app/docs
      - BASE_DOCS_URL=https://docs.company.com
    volumes:
      - ./docs:/app/docs:ro
      - ./vector_db:/app/vector_db
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  # Simple frontend for testing
  rag-frontend:
    build: ./frontend
    ports:
      - "3000:3000"
    environment:
      - REACT_APP_API_URL=http://localhost:8000

User interface for SREs

import { useState } from 'react';
import { marked } from 'marked';

// Simple but effective React component
function RAGChat() {
    const [question, setQuestion] = useState('');
    const [response, setResponse] = useState('');
    const [loading, setLoading] = useState(false);

    const askQuestion = async () => {
        setLoading(true);
        setResponse('');

        try {
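            // Named `res` to avoid shadowing the `response` state variable
            const res = await fetch('/api/ask', {
                method: 'POST',
                headers: { 'Content-Type': 'application/json' },
                body: JSON.stringify({ question })
            });
            if (!res.ok) throw new Error(`HTTP ${res.status}`);

            const reader = res.body.getReader();
            const decoder = new TextDecoder();

            while (true) {
                const { done, value } = await reader.read();
                if (done) break;

                // stream: true keeps multi-byte characters (emoji) intact across chunks
                const chunk = decoder.decode(value, { stream: true });
                setResponse(prev => prev + chunk);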
            }
        } catch (error) {
            setResponse('❌ Error: ' + error.message);
        }

        setLoading(false);
    };

    return (
        <div className="rag-chat">
            <div className="quick-questions">
                <h3>🚀 Quick SRE Questions:</h3>
                <button onClick={() => setQuestion("How to rollback the auth service?")}>
                    Rollback Auth Service
                </button>
                <button onClick={() => setQuestion("Critical incident procedure?")}>
                    Critical Incident
                </button>
                <button onClick={() => setQuestion("Debug 502 gateway error?")}>
                    Debug 502 Error
                </button>
            </div>

            <textarea
                value={question}
                onChange={(e) => setQuestion(e.target.value)}
                placeholder="Ask your question about our documentation..."
                rows={3}
            />

            <button onClick={askQuestion} disabled={loading}>
                {loading ? '🔍 Searching...' : '🤖 Ask RAG'}
            </button>

            {response && (
                <div className="response"
                     dangerouslySetInnerHTML={{__html: marked(response)}} />
            )}
        </div>
    );
}

✅ Best Practices: What we learned in the field

The DOs: What you absolutely must do

🎯 Chunking and indexing

# ✅ DO: Respect logical doc structure
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # Optimal for technical docs
    chunk_overlap=200,      # Maintains context
    separators=[
        "\n## ",           # Main sections first
        "\n### ",          # Then subsections
        "\n\n",            # Paragraphs
        "\n", " ", ""      # Finally words/characters
    ]
)

# ✅ DO: Enrich metadata
metadata = {
    "doc_type": "runbook",     # Classification
    "urgency": "critical",     # Priority level
    "last_updated": timestamp, # Freshness
    "team": "platform",       # Ownership
    "tags": ["k8s", "auth"]   # Key concepts
}

🔍 Intelligent retrieval configuration

# ✅ DO: Adjust according to question type
def get_retriever_config(question: str) -> dict:
    if "emergency" in question.lower() or "incident" in question.lower():
        return {"k": 12, "doc_types": ["runbook", "troubleshooting"]}
    elif "api" in question.lower():
        return {"k": 6, "doc_types": ["api_doc"]}
    else:
        return {"k": 8, "doc_types": "all"}
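
The dictionary returned above still has to be turned into retriever arguments. Here is a minimal sketch, assuming chunks carry a `doc_type` metadata field and that the retriever accepts a `filter` entry in `search_kwargs` using Chroma's `$in` operator. Check both against the versions you have installed. `to_search_kwargs` is a hypothetical name.

```python
def to_search_kwargs(config: dict) -> dict:
    """Translate {"k": ..., "doc_types": ...} into retriever search kwargs."""
    kwargs = {"k": config["k"]}
    doc_types = config.get("doc_types", "all")
    if doc_types != "all":
        # Metadata filter in Chroma's "where" syntax; assumes chunks carry
        # a "doc_type" metadata field set at indexing time
        kwargs["filter"] = {"doc_type": {"$in": list(doc_types)}}
    return kwargs
```

The result could then be passed as `search_kwargs` to `as_retriever(...)`, so an incident question only searches runbooks and troubleshooting pages.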

🧠 Adaptive prompt engineering

# ✅ DO: Adapt prompt according to SRE context
def build_system_prompt(urgency_level, doc_types):
    base_prompt = "You are a specialized SRE assistant."

    if urgency_level == "critical":
        return base_prompt + """
        🚨 CRITICAL INCIDENT MODE:
        - Prioritize immediate actionable steps
        - Include rollback procedures when relevant
        - Mention escalation contacts if available
        - Be concise but complete
        """
    elif "api_doc" in doc_types:  # Matches the "api_doc" label set at indexing time
        return base_prompt + """
        📡 API DOCUMENTATION MODE:
        - Provide exact endpoint syntax
        - Include authentication details
        - Show request/response examples
        - Mention rate limits and error codes
        """

    return base_prompt + " Standard documentation assistance mode."
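
The `urgency_level` argument has to come from somewhere. A simple keyword heuristic is one option, sketched below. The term list and the function name `detect_urgency` are assumptions for illustration; a real deployment would probably also look at on-call status or the channel the question came from.

```python
import re

# Hypothetical list of words that signal an ongoing incident
CRITICAL_TERMS = ("incident", "emergency", "outage", "down", "urgent", "rollback")


def detect_urgency(question: str) -> str:
    """Return "critical" when the question contains an incident-related word, else "normal"."""
    words = set(re.findall(r"[a-z0-9]+", question.lower()))
    return "critical" if words.intersection(CRITICAL_TERMS) else "normal"
```

Matching whole words rather than substrings avoids false alarms such as "download" triggering on "down".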

📈 Monitoring and metrics

# ✅ DO: Track important metrics
METRICS_TO_TRACK = {
    "usage": ["questions_per_day", "unique_users", "peak_hours"],
    "quality": ["avg_response_time", "user_satisfaction", "sources_clicked"],
    "content": ["most_asked_topics", "unused_docs", "missing_answers"],
    "performance": ["search_latency", "llm_response_time", "error_rate"]
}

# ✅ DO: Structured logs for analytics
import hashlib

logger.info("rag_query", extra={
    "question": hashlib.sha256(question.encode("utf-8")).hexdigest(),  # Stable digest; built-in hash() changes between processes
    "doc_count": len(retrieved_docs),
    "response_time": response_time,
    "user_id": user_id,
    "urgency": urgency_level
})

🔒 Security and privacy

# ✅ DO: Implement guardrails
def validate_question(question: str) -> bool:
    """Verify the question is appropriate"""

    # No sensitive data in logs
    if any(pattern in question.lower() for pattern in
           ["password", "secret", "token", "key"]):
        return False

    # Size limit to prevent abuse
    if len(question) > 500:
        return False

    return True

# ✅ DO: Anonymize logs
import re

def sanitize_for_logs(text: str) -> str:
    """Remove sensitive info from logs"""
    patterns = [
        r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b',  # IPs
        r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',  # Emails
        r'\b(?:api[-_]?key|token|secret)[-_]?\w*(?:\s*[:=]\s*\S+)?'  # Credential names and any value assigned to them
    ]

    for pattern in patterns:
        text = re.sub(pattern, '[REDACTED]', text, flags=re.IGNORECASE)

    return text

The DON'Ts: Pitfalls to absolutely avoid

❌ DON'T: Neglect data freshness

# ❌ DON'T: Static index without updates
# Problem: Outdated docs = bad advice during incidents!

# ✅ DO: Automatic update system
def schedule_index_updates():
    """Update index when docs change"""

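    # Webhook from Git for real-time triggers
    @app.post("/webhook/docs-updated")
    async def handle_docs_update():
        # Must be async: create_task needs the running event loop
        asyncio.create_task(reindex_documents())
        return {"status": "reindex scheduled"}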

    # Backup: periodic modification scan
    scheduler.add_job(
        func=check_for_updates,
        trigger="interval",
        minutes=30,
        id='docs_freshness_check'
    )
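
What a periodic check could look like: the sketch below compares content hashes between two scans to find files that need re-indexing or removal. The function names `snapshot` and `diff_snapshots` are hypothetical, and where the previous snapshot is stored (file, Redis, database) is left open.

```python
import hashlib
from pathlib import Path


def snapshot(docs_path: str) -> dict:
    """Map each Markdown file (relative path) to a SHA-256 digest of its content."""
    root = Path(docs_path)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in root.rglob("*.md")
    }


def diff_snapshots(previous: dict, current: dict) -> dict:
    """Report which files need (re)indexing and which should leave the index."""
    return {
        "changed": sorted(p for p, digest in current.items() if previous.get(p) != digest),
        "deleted": sorted(p for p in previous if p not in current),
    }
```

Hashing content instead of relying on modification times avoids re-embedding files that were merely touched by a fresh Git checkout, which keeps embedding costs down.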

❌ DON'T: Ignore user context

# ❌ DON'T: Identical response for everyone
# Problem: Junior vs Senior SRE = different needs

# ✅ DO: Adapt according to user
def personalize_response(user_profile, question, base_answer):
    if user_profile.experience_level == "junior":
        return add_explanatory_context(base_answer)
    elif user_profile.team == "security":
        return emphasize_security_aspects(base_answer)
    elif user_profile.on_call_status:
        return prioritize_quick_actions(base_answer)

    return base_answer

❌ DON'T: Blindly trust the LLM

# ❌ DON'T: No validation of critical responses
# Problem: Hallucination = aggravated incident!

# ✅ DO: Validation for critical procedures
def validate_critical_response(question, response, doc_sources):
    """Validate responses for sensitive procedures"""

    critical_keywords = ["delete", "drop", "destroy", "remove", "rollback"]

    if any(keyword in question.lower() for keyword in critical_keywords):
        # Require explicit and recent source
        if not doc_sources or not has_recent_source(doc_sources):
            return add_validation_warning(response)

        # Double-check with pattern matching
        if not validate_procedure_steps(response):
            return add_uncertainty_disclaimer(response)

    return response

def add_validation_warning(response):
    return f"""
 ⚠️  **WARNING: Critical procedure detected**
This response concerns a sensitive operation.
Please verify in official documentation before executing.

{response}

🔗 **Validation required**: Consult a Senior SRE if in doubt
"""

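The `has_recent_source` check used above is not shown in this article. Here is one way it could work, assuming each source is the chunk metadata dict produced at indexing time, with `last_modified` stored as a Unix timestamp. The 90-day threshold is an arbitrary example value.

```python
import time

MAX_SOURCE_AGE_DAYS = 90  # hypothetical freshness threshold


def has_recent_source(doc_sources, now=None, max_age_days=MAX_SOURCE_AGE_DAYS):
    """True if at least one source was modified within the last max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    return any(source.get("last_modified", 0) >= cutoff for source in doc_sources)
```

Sources without a `last_modified` value count as stale, so a missing timestamp leads to a warning rather than silent trust.
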
❌ DON'T: Forget production performance

# ❌ DON'T: No intelligent caching
# Problem: Repetitive questions = exploded OpenAI costs

# ✅ DO: Semantic cache with adaptive TTL
import hashlib
import re

class SemanticCache:
    def __init__(self):
        self.cache = {}
        self.similarity_threshold = 0.92

    def get_cache_key(self, question: str) -> str:
        """Key based on question embedding"""
        embedding = get_question_embedding(question)
        return hashlib.md5(str(embedding).encode()).hexdigest()

    def should_cache_response(self, question: str) -> bool:
        """Decide if a response deserves caching"""
        # No cache for questions with timestamps/IDs
        # (checked first so it also applies to "how to ..." questions)
        if re.search(r'\b\d{10,}\b', question):
            return False

        # Frequent question shapes are the best cache candidates
        if any(term in question.lower() for term in
               ["how to", "what is", "explain"]):
            return True

        return True
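
Note that hashing an embedding only matches questions whose embeddings are exactly identical. To actually use `similarity_threshold`, the lookup has to compare vectors. Below is a toy sketch of that idea with a linear scan and cosine similarity. The class name `SimilarityCache` is hypothetical, and at real scale you would query the vector database instead of looping in Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SimilarityCache:
    """Toy semantic cache: linear scan over stored (embedding, answer) pairs."""

    def __init__(self, similarity_threshold: float = 0.92):
        self.similarity_threshold = similarity_threshold
        self.entries = []  # list of (embedding, answer)

    def put(self, embedding, answer):
        self.entries.append((list(embedding), answer))

    def get(self, embedding):
        """Return the answer of the most similar stored question, or None."""
        best_answer, best_score = None, self.similarity_threshold
        for stored, answer in self.entries:
            score = cosine_similarity(embedding, stored)
            if score >= best_score:
                best_answer, best_score = answer, score
        return best_answer
```

A rephrased question whose embedding lands within the threshold of a cached one then reuses the cached answer instead of triggering a new LLM call.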

❌ DON'T: Neglect user experience

# ❌ DON'T: Too technical responses for everyone
# Problem: Manager asking question = unreadable response

# ✅ DO: Automatic level adaptation
def adjust_technical_level(response: str, user_role: str) -> str:
    """Adapt technical level according to user"""

    if user_role in ["manager", "product", "business"]:
        return simplify_technical_terms(response)
    elif user_role in ["intern", "junior"]:
        return add_educational_context(response)
    elif user_role in ["senior", "staff", "principal"]:
        return add_advanced_details(response)

    return response

import re

def simplify_technical_terms(text: str) -> str:
    """Replace jargon with simple terms"""
    replacements = {
        "rollback": "revert to previous version",
        "pod": "application container",
        "ingress": "traffic entry point",
        "namespace": "isolated environment"
    }

    for tech_term, simple_term in replacements.items():
        # Whole-word match so "pod" does not rewrite "podcast"
        text = re.sub(rf"\b{re.escape(tech_term)}\b",
                      lambda m, s=simple_term: f"{s} ({m.group(0)})", text)

    return text

📊 Results: Real-world RAG implementation metrics

Concrete impact after 3 months of RAG deployment

# Before/after RAG metrics
RESULTS = {
    "average_search_time": {
        "before": "18 minutes/day/SRE",
        "after": "6 minutes/day/SRE",
        "improvement": "-67%",
    },
    "incident_resolution": {
        "before": "MTTR = 23 minutes",
        "after": "MTTR = 16 minutes",
        "improvement": "-30%",
    },
    "team_satisfaction": {
        "before": "6.2/10",
        "after": "8.7/10",
        "improvement": "+40%",
    }
}
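
The improvement figures are plain relative changes, (after - before) / before. A tiny helper makes them easy to recompute on your own numbers (`percent_change` is just an illustrative name):

```python
def percent_change(before: float, after: float) -> int:
    """Relative change from before to after, rounded to the nearest percent."""
    return round((after - before) / before * 100)


print(percent_change(18, 6))     # -67 (search time)
print(percent_change(23, 16))    # -30 (MTTR)
print(percent_change(6.2, 8.7))  # 40 (satisfaction)
```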

Top 5 most asked questions to RAG:

"How to rollback the API gateway service?" (67 times)
"P1 incident escalation procedure?" (54 times)
"Debug rate limiting errors?" (43 times)
"Emergency database access?" (38 times)
"Monitoring alerts configuration?" (31 times)

Conclusion: Implementing RAG for documentation success

Deploying a RAG system on MkDocs Material documentation is like hiring a senior SRE who knows all procedures by heart, never sleeps, and responds instantly during emergencies. This semantic search solution transforms how teams access knowledge.

Concrete benefits of our RAG implementation:

67% reduction in documentation search time
Semantic search with natural language queries
AI-powered responses with accurate source citations
Automatic detection of documentation gaps
Stress reduction during critical incidents

The best part? The system keeps getting better if you feed usage data back into it. Retrieval does not retrain itself, but every logged question shows which topics are missing or poorly covered, so you can fix the docs, re-index, and tune the retrieval parameters.

Ready to implement RAG for your documentation? This tutorial gives you everything needed to build an AI documentation assistant with FastAPI, ChromaDB, and OpenAI. Your "3 AM future self" will thank you! 😄

🔥 Bonus challenge: Measure the time your teams spend searching for info this week. Then re-measure in a month after implementing your RAG. The results will surprise you!

Next steps to implement your own RAG system

Evaluate your existing documentation and identify priority sources
Choose your RAG tech stack: vector database (ChromaDB/Pinecone), LLM (OpenAI/Claude), framework (LangChain)
Implement a RAG prototype with a subset of your documentation using this tutorial
Test semantic search quality and collect user feedback
Deploy progressively by adding sources and optimizing vector embeddings

💬 Stay in touch

📧 Email : tavernetech@gmail.com
🐙 GitHub : @DrakkarStorm
📺 YouTube : @TaverneTechh

Thank you for following me on this adventure! 🚀

This article was written with ❤️ for the DevOps community.
