How to Deploy RAG for Documentation Search: Complete Tutorial 2025
You know that feeling? You're an SRE, it's 3 AM, a service is down, and you need to find THE rollback procedure in a 500-page doc... You scroll, you search, you curse the colleague who wrote "see previous section" without saying which one. 😩
Plot twist: What if I told you we can transform this nightmare into a fluid conversation with an AI documentation chatbot that knows all your documentation by heart? At our company, with 50 SREs juggling between incidents and maintenance, implementing a RAG system reduced critical information search time by 3x.
In this RAG tutorial, you'll learn how to implement a complete Retrieval-Augmented Generation system on MkDocs Material documentation using LangChain, ChromaDB, and FastAPI. This step-by-step guide shows you how to build an intelligent documentation assistant that delivers accurate answers with source citations.
The Problem: Documentation that works... but not really
Our context before RAG
Picture this: 50 SREs, 7 teams, MkDocs Material documentation with:
- 500+ pages of runbooks, procedures, API docs
- Limited native search (no semantic search)
- Complex tree navigation with 5 levels of depth
- Broken links that multiply like gremlins
Our teams' daily routine:
# Classic scenario at 3 AM
1. Incident detected → Service X down
2. Search procedure → 15 minutes of navigation
3. "Oh no, this isn't the right version"
4. Re-search → 10 more minutes
5. Procedure found → FINALLY!
Total: 25 minutes lost in a critical situation 😱
The painful stats:
- 27 minutes/day on average per SRE searching for info
- 43% of Slack questions about "where do we find this doc?"
- 67% of on-calls lose time on documentation search
Key insight: The problem wasn't our doc quality, but its cognitive accessibility!
Native MkDocs Material limitations and why you need semantic search
While MkDocs Material is excellent, traditional keyword search has significant limitations that a RAG implementation can solve:
❌ Keyword-only search (no semantic understanding)
❌ No context understanding across documents
❌ Results sometimes too numerous or off-topic
❌ Can't ask questions in natural language
❌ No cross-document info aggregation
❌ Search limited to titles and first paragraphs
❌ No notion of priority or urgency
The community has long requested semantic search improvements, as evidenced by this GitHub issue, which has remained open for several years.
Comparison with other solutions:
| Solution | Semantic Search | Conversational AI | Existing Integration |
|---|---|---|---|
| Native MkDocs Material | ❌ | ❌ | ✅ |
| Algolia DocSearch | ⚠️ Limited | ❌ | ⚠️ Complex setup |
| RAG + LLM | ✅ | ✅ | ✅ |
| GitBook | ✅ | ⚠️ Basic | ❌ Migration required |
Concrete example:
- Question: "How to rollback the auth service urgently?"
- MkDocs search: 47 results with "rollback", "auth", "service"
- Time to find THE right info: 12 minutes
The Solution: How to implement RAG for documentation search
Our RAG architecture with LangChain and ChromaDB
Here's how we built our AI-powered documentation assistant using a complete RAG stack:
# Complete RAG tech stack for documentation search
TECH_STACK = {
    "backend": "FastAPI",  # Fast REST API for RAG endpoints
    "embeddings": "OpenAI text-embedding-3-small",  # Vector embeddings (1536 dimensions by default, $0.02/1M tokens)
    "vector_db": "ChromaDB",  # Vector database for semantic search (alternatives: Pinecone, Weaviate)
    "llm": "GPT-4o-mini",  # LLM for response generation ($0.15/1M input tokens)
    "framework": "LangChain",  # RAG orchestration framework
    "docs_source": "MkDocs Material",
    "deployment": "Docker + K8s",
    "monitoring": "Prometheus + Grafana",  # RAG metrics tracking
    "cache": "Redis",  # Semantic cache for performance
}
RAG Workflow:
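In outline, a RAG workflow like this one typically runs in two phases: indexing (chunk, embed, store) and answering (embed the question, retrieve the closest chunks, build the prompt for the LLM). The sketch below only illustrates that flow. Everything in it is an assumption made for the example: `toy_embed` is a word-count stand-in for a real embedding model, the in-memory list stands in for ChromaDB, and the sample chunks are made up.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Phase 1: indexing (chunk -> embed -> store)
def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    return [(chunk, toy_embed(chunk)) for chunk in chunks]


# Phase 2: answering (embed question -> retrieve -> build the LLM prompt)
def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    q = toy_embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, context: list[str]) -> str:
    joined = "\n---\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {question}"


# Made-up chunks, for illustration only
index = build_index([
    "rollback the auth service with the deploy tool",
    "configure monitoring alerts in grafana",
    "escalation procedure for a critical incident",
])
top = retrieve(index, "how to rollback the auth service", k=1)
prompt = build_prompt("how to rollback the auth service", top)
```

In the real stack, the embedding call, the vector store, and the final LLM call replace the toy pieces; the shape of the flow stays the same.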


How to integrate RAG with MkDocs Material documentation
The genius of our approach: no need to modify MkDocs! This RAG tutorial shows you how to scrape existing content and build a vector database index in parallel, enabling semantic search without changing your current documentation setup.
# Base configuration for MkDocs indexing
MKDOCS_CONFIG = {
    "docs_path": "/app/docs",
    "base_url": "https://docs.company.com",
    "chunk_size": 1000,  # Characters per chunk, optimal for runbooks
    "chunk_overlap": 200,  # Characters shared between neighbors, maintains coherence
    "file_types": [".md"],
    "exclude_patterns": ["temp/", "drafts/"],
}
Step 1: Building the FastAPI backend for RAG
How to create the /ask endpoint with streaming responses
Here's the complete FastAPI implementation for our RAG system with OpenAI embeddings and streaming:
import re

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class QuestionRequest(BaseModel):
    question: str

# `rag` is assumed to be created at startup and to expose `llm` (a LangChain
# chat model) and `vector_store` (the Chroma index built in Step 2)

@app.post("/ask")
def ask_question_stream(request: QuestionRequest):
    question = request.question
    model = rag.llm
    # Base URL configuration
    BASE_DOCS_URL = "https://docs.company.com"
    # Optimized retriever for technical docs
    retriever = rag.vector_store.as_retriever(
        search_type="mmr",  # Maximum Marginal Relevance
        search_kwargs={
            "k": 8,  # 8 chunks for rich context
            "fetch_k": 20,  # Larger initial pool
            "lambda_mult": 0.7,  # Balance relevance/diversity
        },
    )
    retrieved_docs = retriever.invoke(question)
    if not retrieved_docs:
        def empty_response():
            yield "No relevant context found. Are you in the correct folder?"
        return StreamingResponse(empty_response(), media_type="text/plain")
    # Context construction with enriched metadata
    context_with_metadata = []
    sources_found = set()
    for i, doc in enumerate(retrieved_docs):
        relative_path = doc.metadata.get("relative_path", "Unknown")
        chunk_id = doc.metadata.get("chunk_id", i)
        # Clickable URL generation: strip only a trailing .md/.mdx extension
        # (chained str.replace calls would turn "page.mdx" into "pagex")
        clean_path = re.sub(r"\.mdx?$", "", relative_path)
        doc_url = f"{BASE_DOCS_URL}/{clean_path}/"
        sources_found.add((relative_path, doc_url))
        # Enriched context with source, URL, and chunk position
        context_piece = (
            f"\nSource: {relative_path}\n"
            f"URL: {doc_url}\n"
            f"Section: Chunk {chunk_id + 1}\n"
            f"Content:\n{doc.page_content}\n---"
        )
        context_with_metadata.append(context_piece)
    context = "\n".join(context_with_metadata)
    # System prompt optimized for SREs
    system_prompt = (
        "You are a specialized SRE documentation assistant. "
        "Your role is to help Site Reliability Engineers find accurate, "
        "actionable information quickly during incidents and maintenance.\n\n"
        "RESPONSE GUIDELINES:\n"
        "- Provide clear, step-by-step answers when possible\n"
        "- Prioritize emergency procedures and troubleshooting steps\n"
        "- Always cite specific documentation sources\n"
        "- Include direct links to full documentation\n"
        "- If multiple approaches exist, mention alternatives\n\n"
        "FORMAT YOUR RESPONSE:\n"
        "## Answer\n"
        "[Detailed response with actionable steps]\n\n"
        "## Sources\n"
        "[List each source with clickable links]\n\n"
        "Only use information from the provided context. "
        "If unsure, acknowledge limitations explicitly."
    )
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "system", "content": f"Context:\n{context}"},
        {"role": "user", "content": f"Question: {question}"},
    ]
    def response_stream():
        yield f"Analyzing {len(retrieved_docs)} chunks from {len(sources_found)} documentation files...\n\n"
        for chunk in model.stream(messages):
            if chunk.content:
                yield chunk.content
        # Clickable sources at the end of response
        yield "\n\n---\n**Complete documentation links:**\n"
        for relative_path, doc_url in sorted(sources_found):
            yield f"- [{relative_path}]({doc_url})\n"
    return StreamingResponse(response_stream(), media_type="text/plain")
The key improvements we added:
- Automatic URLs: Each source becomes a clickable link
- Adaptive prompt: System detects question type (tutorial, API, troubleshooting...)
- Streaming: Real-time response, no waiting
- Enriched metadata: Clear context and provenance
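The automatic URL step is easy to get subtly wrong, so here is a minimal sketch of a standalone helper. It assumes MkDocs' default `use_directory_urls: true`, where `guide/index.md` is served at `guide/` and `guide/setup.md` at `guide/setup/`. The name `doc_url_for` is hypothetical and not part of the endpoint above.

```python
from pathlib import PurePosixPath


def doc_url_for(relative_path: str, base_url: str) -> str:
    """Map a Markdown path inside docs/ to its published URL (directory-style URLs assumed)."""
    # Normalize Windows separators so the path can be used in a URL
    path = PurePosixPath(relative_path.replace("\\", "/"))
    # Strip only a real trailing extension, never a ".md" in the middle of a name
    if path.suffix in (".md", ".mdx"):
        path = path.with_suffix("")
    # Index pages are served at their directory URL
    if path.name == "index":
        path = path.parent
    base = base_url.rstrip("/")
    slug = path.as_posix()
    return f"{base}/" if slug in ("", ".") else f"{base}/{slug}/"


url = doc_url_for("runbooks/auth/rollback.md", "https://docs.company.com")
```

If your site sets `use_directory_urls: false`, the mapping is different (pages end in `.html`), so adjust the helper accordingly.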
Key optimizations applied
🎯 Enhanced retrieval:
- MMR (Maximum Marginal Relevance): Avoids redundant chunks
- k=8: Sweet spot between context and relevance for technical docs
- lambda_mult=0.7: Leans toward relevance while still penalizing near-duplicate chunks (1.0 is pure relevance, 0.0 is maximum diversity)
💡 Pro tip: These parameters were adjusted after 2 weeks of testing with our SRE teams!
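To see what `lambda_mult` trades off, here is a dependency-free sketch of the usual MMR selection rule: each candidate is scored as `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is its highest similarity to anything already selected. This is an illustration under assumptions, not the library's code; the vectors are made up, and `mmr` and `cosine` are hypothetical helper names.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.7):
    """Return the indices of k documents chosen by Maximum Marginal Relevance."""
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to any chunk that was already picked
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Made-up vectors: docs 0 and 1 are near-duplicates, doc 2 covers another angle
query = [1.0, 1.0, 0.0]
docs = [[1.0, 0.1, 0.0], [1.0, 0.05, 0.0], [0.0, 1.0, 0.0]]

diverse = mmr(query, docs, k=2, lambda_mult=0.7)
relevance_only = mmr(query, docs, k=2, lambda_mult=1.0)
```

With `lambda_mult=0.7` the second pick skips the near-duplicate in favor of the chunk that adds new information; with `lambda_mult=1.0` the ranking falls back to plain similarity and returns both near-duplicates.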
Step 2: Implementing document indexing with LangChain
How to build an automated indexing script for vector embeddings
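Before the full indexer, it may help to see what `chunk_size` and `chunk_overlap` actually mean. The sketch below is a deliberately simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to cut at the separators it is given), and `split_with_overlap` is a hypothetical name used only here.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij" * 250)  # 2500 characters of sample text
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of storing and embedding some text twice.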
import yaml
from pathlib import Path
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# LangChain-based document indexer for RAG implementation
class MkDocsIndexer:
    def __init__(self, docs_path: str, base_url: str):
        self.docs_path = Path(docs_path)
        self.base_url = base_url
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            separators=["\n## ", "\n### ", "\n\n", "\n", " ", ""],
        )

    def load_mkdocs_config(self):
        """Load MkDocs config to respect structure"""
        # Adjust this path if mkdocs.yml lives in the project root rather than in docs_path
        config_path = self.docs_path / "mkdocs.yml"
        if config_path.exists():
            with open(config_path, "r", encoding="utf-8") as f:
                return yaml.safe_load(f)
        return {}

    def extract_metadata(self, file_path: Path) -> dict:
        """Extract enriched metadata for SRE docs"""
        # POSIX-style path so it can be turned into a URL later
        relative_path = file_path.relative_to(self.docs_path).as_posix()
        # Front matter parsing (tags, owners) can be added here
        metadata = {
            "source": str(file_path),
            "relative_path": relative_path,
            "file_name": file_path.stem,
            "last_modified": file_path.stat().st_mtime,
        }
        # Automatic doc type detection from the path
        path_lower = relative_path.lower()
        if "runbook" in path_lower:
            metadata["doc_type"] = "runbook"
        elif "api" in path_lower:
            metadata["doc_type"] = "api_doc"
        elif "troubleshoot" in path_lower:
            metadata["doc_type"] = "troubleshooting"
        else:
            metadata["doc_type"] = "general"
        return metadata

    def process_documents(self):
        """Process all MkDocs documents"""
        loader = DirectoryLoader(
            str(self.docs_path),
            glob="**/*.md",
            loader_cls=TextLoader,  # loader_cls=None is not callable; plain-text loading keeps the Markdown headers
            loader_kwargs={"encoding": "utf-8"},
            show_progress=True,
        )
        documents = loader.load()
        processed_docs = []
        for doc in documents:
            # Enrich with metadata
            enhanced_metadata = self.extract_metadata(Path(doc.metadata["source"]))
            doc.metadata.update(enhanced_metadata)
            # Intelligent splitting by sections
            chunks = self.text_splitter.split_documents([doc])
            # Add chunk_id for navigation
            for i, chunk in enumerate(chunks):
                chunk.metadata["chunk_id"] = i
                processed_docs.append(chunk)
        return processed_docs

# Usage
indexer = MkDocsIndexer("/app/docs", "https://docs.company.com")
documents = indexer.process_documents()
SRE document type management
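The per-type weights in this section's configuration are not applied anywhere in the code shown in this tutorial. One possible way to use them, sketched under assumptions, is to rescale each retrieved chunk's similarity score by its document type's weight before the final ranking. `rerank_by_doc_type` and the sample scores are hypothetical, and the sketch assumes a higher-is-better similarity score rather than a distance.

```python
# Weights copied from DOC_TYPES_CONFIG
DOC_TYPE_WEIGHTS = {
    "runbook": 1.5,
    "api_doc": 1.2,
    "troubleshooting": 1.4,
    "general": 1.0,
}


def rerank_by_doc_type(scored_chunks, weights=DOC_TYPE_WEIGHTS):
    """Sort (doc_type, similarity, text) tuples by similarity * type weight, best first."""
    return sorted(
        scored_chunks,
        # Unknown document types fall back to a neutral weight of 1.0
        key=lambda item: item[1] * weights.get(item[0], 1.0),
        reverse=True,
    )


# Made-up retrieval results, for illustration only
hits = [
    ("general", 0.80, "Team onboarding overview"),
    ("runbook", 0.60, "Auth service rollback steps"),
    ("api_doc", 0.70, "Auth API reference"),
]
ranked = rerank_by_doc_type(hits)
```

Here the runbook moves to the top even though its raw similarity is the lowest, which is the behavior the "high priority for incidents" comment suggests. Whether that boost is too aggressive is something to check against real queries.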
# Automatic classification by content type
DOC_TYPES_CONFIG = {
    "runbook": {
        "weight": 1.5,  # High priority for incidents
        "keywords": ["incident", "rollback", "emergency", "critical"],
    },
    "api_doc": {
        "weight": 1.2,
        "keywords": ["endpoint", "authentication", "request", "response"],
    },
    "troubleshooting": {
        "weight": 1.4,  # High priority for debugging
        "keywords": ["error", "debug", "logs", "diagnostic"],
    },
    "general": {
        "weight": 1.0,
        "keywords": [],
    },
}
Step 3: Integration into SRE workflow
Deployment with Docker and Kubernetes
# docker-compose.yml for local dev
version: '3.8'
services:
  rag-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - DOCS_PATH=/app/docs
      - BASE_DOCS_URL=https://docs.company.com
    volumes:
      - ./docs:/app/docs:ro
      - ./vector_db:/app/vector_db
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
  # Simple frontend for testing
  rag-frontend:
    build: ./frontend
    ports:
      - "3000:3000"
    environment:
      - REACT_APP_API_URL=http://localhost:8000
User interface for SREs
// Simple but effective React component
import { useState } from 'react';
import { marked } from 'marked';

function RAGChat() {
  const [question, setQuestion] = useState('');
  const [response, setResponse] = useState('');
  const [loading, setLoading] = useState(false);
  const askQuestion = async () => {
    setLoading(true);
    setResponse('');
    try {
      // Named `res` so it does not shadow the `response` state
      const res = await fetch('/api/ask', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ question })
      });
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      const reader = res.body.getReader();
      const decoder = new TextDecoder();
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        // stream: true keeps multi-byte characters intact across chunk boundaries
        const chunk = decoder.decode(value, { stream: true });
        setResponse(prev => prev + chunk);
      }
    } catch (error) {
      setResponse('Error: ' + error.message);
    } finally {
      setLoading(false);
    }
  };
  return (
    <div className="rag-chat">
      <div className="quick-questions">
        <h3>Quick SRE Questions:</h3>
        <button onClick={() => setQuestion("How to rollback the auth service?")}>
          Rollback Auth Service
        </button>
        <button onClick={() => setQuestion("Critical incident procedure?")}>
          Critical Incident
        </button>
        <button onClick={() => setQuestion("Debug 502 gateway error?")}>
          Debug 502 Error
        </button>
      </div>
      <textarea
        value={question}
        onChange={(e) => setQuestion(e.target.value)}
        placeholder="Ask your question about our documentation..."
        rows={3}
      />
      <button onClick={askQuestion} disabled={loading}>
        {loading ? 'Searching...' : 'Ask RAG'}
      </button>
      {response && (
        <div className="response"
          dangerouslySetInnerHTML={{
            // marked() does not sanitize HTML; sanitize before rendering model output
            __html: marked(response)
          }} />
      )}
    </div>
  );
}
Best Practices: What we learned in the field
The DOs: What you absolutely must do
🎯 Chunking and indexing
# ✅ DO: Respect logical doc structure
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # Optimal for technical docs
    chunk_overlap=200,  # Maintains context
    separators=[
        "\n## ",  # Main sections first
        "\n### ",  # Then subsections
        "\n\n",  # Paragraphs
        "\n", " ", "",  # Finally lines, words, characters
    ],
)
# ✅ DO: Enrich metadata
metadata = {
    "doc_type": "runbook",  # Classification
    "urgency": "critical",  # Priority level
    "last_updated": timestamp,  # Freshness (e.g. the file's modification time)
    "team": "platform",  # Ownership
    "tags": ["k8s", "auth"],  # Key concepts
}
Intelligent retrieval configuration
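One gap worth noting for this section: a `doc_types` list only helps if it is turned into a metadata filter on the retriever. A minimal sketch of that translation follows, assuming Chroma-style filter syntax (`$in`) and the `doc_type` metadata field written by the indexer. `to_search_kwargs` is a hypothetical helper, and whether your vector store version accepts this exact filter shape should be checked against its documentation.

```python
def to_search_kwargs(config: dict, fetch_k: int = 20, lambda_mult: float = 0.7) -> dict:
    """Translate {"k": ..., "doc_types": ...} into retriever search_kwargs."""
    kwargs = {
        "k": config["k"],
        # The candidate pool must be at least as large as the number of results
        "fetch_k": max(fetch_k, config["k"]),
        "lambda_mult": lambda_mult,
    }
    doc_types = config.get("doc_types", "all")
    if doc_types != "all":
        # Assumed Chroma-style metadata filter on the "doc_type" field
        kwargs["filter"] = {"doc_type": {"$in": list(doc_types)}}
    return kwargs


incident_kwargs = to_search_kwargs({"k": 12, "doc_types": ["runbook", "troubleshooting"]})
default_kwargs = to_search_kwargs({"k": 8, "doc_types": "all"})
```

The resulting dictionary is meant to be passed as `search_kwargs` to `as_retriever(search_type="mmr", ...)`, in the same way as the fixed values in the `/ask` endpoint.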
# ✅ DO: Adjust according to question type
def get_retriever_config(question: str) -> dict:
    # The parameter is the question text itself (the original signature named it
    # question_type but then read an undefined `question`)
    q = question.lower()
    if "emergency" in q or "incident" in q:
        return {"k": 12, "doc_types": ["runbook", "troubleshooting"]}
    elif "api" in q:
        return {"k": 6, "doc_types": ["api_doc"]}
    else:
        return {"k": 8, "doc_types": "all"}
Adaptive prompt engineering
# ✅ DO: Adapt prompt according to SRE context
def build_system_prompt(urgency_level, doc_types):
    base_prompt = "You are a specialized SRE assistant."
    if urgency_level == "critical":
        return base_prompt + """
🚨 CRITICAL INCIDENT MODE:
- Prioritize immediate actionable steps
- Include rollback procedures when relevant
- Mention escalation contacts if available
- Be concise but complete
"""
    elif "api" in doc_types:
        return base_prompt + """
API DOCUMENTATION MODE:
- Provide exact endpoint syntax
- Include authentication details
- Show request/response examples
- Mention rate limits and error codes
"""
    # Leading space so the two sentences do not run together
    return base_prompt + " Standard documentation assistance mode."
Monitoring and metrics
# ✅ DO: Track important metrics
METRICS_TO_TRACK = {
    "usage": ["questions_per_day", "unique_users", "peak_hours"],
    "quality": ["avg_response_time", "user_satisfaction", "sources_clicked"],
    "content": ["most_asked_topics", "unused_docs", "missing_answers"],
    "performance": ["search_latency", "llm_response_time", "error_rate"],
}
# ✅ DO: Structured logs for analytics
import hashlib

# logger, retrieved_docs, response_time, user_id and urgency_level come from the request handler
logger.info("rag_query", extra={
    # Stable digest instead of the raw text; the built-in hash() changes between processes
    "question": hashlib.sha256(question.encode("utf-8")).hexdigest(),
    "doc_count": len(retrieved_docs),
    "response_time": response_time,
    "user_id": user_id,
    "urgency": urgency_level,
})
Security and privacy
# ✅ DO: Implement guardrails
import re

def validate_question(question: str) -> bool:
    """Verify the question is appropriate"""
    # Reject questions that mention credentials, so they never reach the logs
    if any(pattern in question.lower() for pattern in
           ["password", "secret", "token", "key"]):
        return False
    # Size limit to prevent abuse
    if len(question) > 500:
        return False
    return True

# ✅ DO: Anonymize logs
def sanitize_for_logs(text: str) -> str:
    """Remove sensitive info from logs"""
    patterns = [
        r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b',  # IPs
        r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',  # Emails
        r'\b(?:api[-_]?key|token|secret)[-_]?\w*\b',  # Credentials
    ]
    for pattern in patterns:
        text = re.sub(pattern, '[REDACTED]', text, flags=re.IGNORECASE)
    return text
The DON'Ts: Pitfalls to absolutely avoid
❌ DON'T: Neglect data freshness
# ❌ DON'T: Static index without updates
# Problem: Outdated docs = bad advice during incidents!
# ✅ DO: Automatic update system
from fastapi import BackgroundTasks

# Webhook from Git for real-time triggers
@app.post("/webhook/docs-updated")
def handle_docs_update(background_tasks: BackgroundTasks):
    # A sync handler has no running event loop, so asyncio.create_task would fail here;
    # BackgroundTasks runs reindex_documents (defined elsewhere) after the response is sent
    background_tasks.add_task(reindex_documents)
    return {"status": "reindex scheduled"}

def schedule_index_updates():
    """Backup: periodic modification scan"""
    scheduler.add_job(
        func=check_for_updates,
        trigger="interval",
        minutes=30,
        id="docs_freshness_check",
    )
❌ DON'T: Ignore user context
# ❌ DON'T: Identical response for everyone
# Problem: Junior vs Senior SRE = different needs
# ✅ DO: Adapt according to user
def personalize_response(user_profile, question, base_answer):
    if user_profile.experience_level == "junior":
        return add_explanatory_context(base_answer)
    elif user_profile.team == "security":
        return emphasize_security_aspects(base_answer)
    elif user_profile.on_call_status:
        return prioritize_quick_actions(base_answer)
    return base_answer
❌ DON'T: Blindly trust the LLM
# ❌ DON'T: No validation of critical responses
# Problem: Hallucination = aggravated incident!
# ✅ DO: Validation for critical procedures
def validate_critical_response(question, response, doc_sources):
    """Validate responses for sensitive procedures"""
    critical_keywords = ["delete", "drop", "destroy", "remove", "rollback"]
    if any(keyword in question.lower() for keyword in critical_keywords):
        # Require explicit and recent source
        if not doc_sources or not has_recent_source(doc_sources):
            return add_validation_warning(response)
        # Double-check with pattern matching
        if not validate_procedure_steps(response):
            return add_uncertainty_disclaimer(response)
    return response

def add_validation_warning(response):
    return f"""
⚠️ **WARNING: Critical procedure detected**
This response concerns a sensitive operation.
Please verify in official documentation before executing.
{response}
**Validation required**: Consult a Senior SRE if in doubt
"""
❌ DON'T: Forget production performance
# ❌ DON'T: No intelligent caching
# Problem: Repetitive questions = exploded OpenAI costs
# ✅ DO: Semantic cache with adaptive TTL
import hashlib
import re

class SemanticCache:
    def __init__(self):
        self.cache = {}
        self.similarity_threshold = 0.92

    def get_cache_key(self, question: str) -> str:
        """Key based on question embedding"""
        # Note: a hash only matches identical embeddings. To reuse answers for
        # rephrased questions, compare embeddings against similarity_threshold instead.
        embedding = get_question_embedding(question)
        return hashlib.md5(str(embedding).encode()).hexdigest()

    def should_cache_response(self, question: str) -> bool:
        """Decide if a response deserves caching"""
        # No cache for questions with timestamps/IDs (checked first so it always applies)
        if re.search(r'\b\d{10,}\b', question):
            return False
        # Cache frequent questions longer
        if any(term in question.lower() for term in
               ["how to", "what is", "explain"]):
            return True
        return True
❌ DON'T: Neglect user experience
# ❌ DON'T: Too technical responses for everyone
# Problem: Manager asking question = unreadable response
# ✅ DO: Automatic level adaptation
import re

def adjust_technical_level(response: str, user_role: str) -> str:
    """Adapt technical level according to user"""
    if user_role in ["manager", "product", "business"]:
        return simplify_technical_terms(response)
    elif user_role in ["intern", "junior"]:
        return add_educational_context(response)
    elif user_role in ["senior", "staff", "principal"]:
        return add_advanced_details(response)
    return response

def simplify_technical_terms(text: str) -> str:
    """Replace jargon with simple terms"""
    replacements = {
        "rollback": "revert to previous version",
        "pod": "application container",
        "ingress": "traffic entry point",
        "namespace": "isolated environment",
    }
    for tech_term, simple_term in replacements.items():
        # Whole-word match so "pod" does not rewrite words that merely contain it
        text = re.sub(rf"\b{re.escape(tech_term)}\b", f"{simple_term} ({tech_term})", text)
    return text
Results: Real-world RAG implementation metrics
Concrete impact after 3 months of RAG deployment
# Before/after RAG metrics
RESULTS = {
    "average_search_time": {
        "before": "18 minutes/day/SRE",
        "after": "6 minutes/day/SRE",
        "improvement": "-67%",
    },
    "incident_resolution": {
        "before": "MTTR = 23 minutes",
        "after": "MTTR = 16 minutes",
        "improvement": "-30%",
    },
    "team_satisfaction": {
        "before": "6.2/10",
        "after": "8.7/10",
        "improvement": "+40%",
    },
}
Top 5 most asked questions to RAG:
- "How to rollback the API gateway service?" (67 times)
- "P1 incident escalation procedure?" (54 times)
- "Debug rate limiting errors?" (43 times)
- "Emergency database access?" (38 times)
- "Monitoring alerts configuration?" (31 times)
Conclusion: Implementing RAG for documentation success
Deploying a RAG system on MkDocs Material documentation is like hiring a senior SRE who knows all procedures by heart, never sleeps, and responds instantly during emergencies. This semantic search solution transforms how teams access knowledge.
Concrete benefits of our RAG implementation:
- 67% reduction in documentation search time
- Semantic search with natural language queries
- AI-powered responses with accurate source citations
- Automatic detection of documentation gaps
- Stress reduction during critical incidents
The best part? The system gets better as it is used, provided you close the loop. The vector database does not learn from queries on its own, but the questions your team asks reveal missing or hard-to-find docs, and every documentation fix is picked up at the next reindex.
Ready to implement RAG for your documentation? This tutorial gives you everything needed to build an AI documentation assistant with FastAPI, ChromaDB, and OpenAI. Your "3 AM future self" will thank you!
🔥 Bonus challenge: Measure the time your teams spend searching for info this week. Then re-measure in a month after implementing your RAG. The results will surprise you!
Next steps to implement your own RAG system
- Evaluate your existing documentation and identify priority sources
- Choose your RAG tech stack: vector database (ChromaDB/Pinecone), LLM (OpenAI/Claude), framework (LangChain)
- Implement a RAG prototype with a subset of your documentation using this tutorial
- Test semantic search quality and collect user feedback
- Deploy progressively by adding sources and optimizing vector embeddings
💬 Stay in touch
- 📧 Email: tavernetech@gmail.com
- GitHub: @DrakkarStorm
- 📺 YouTube: @TaverneTechh
Thank you for following me on this adventure!
This article was written with ❤️ for the DevOps community.