Software Development
RAG Pipeline Builder
Generates a complete RAG pipeline with chunking strategy, embedding model selection, vector store setup, hybrid retrieval, and a quality evaluation harness. Useful for standing up production RAG without expert-level tuning. Engineers shipping LLM-backed document QA, search, or summarization features; AI engineers replacing naive semantic search with production RAG; founders building knowledge-base chatbots on proprietary content. most importantly — how to measure retrieval quality. Naive RAG implementations hallucinate confidently, retrieve irrelevant passages, or miss the right document entirely. A structured builder produces a pipeline with sensible defaults for the corpus plus an evaluation harness so regressions are visible.
One-Time Purchase
$19.99
RAG Pipeline — Internal Docs QA at Northbeam Analytics
Corpus: ~8,000 internal docs (markdown + PDF), avg 1,200 tokens each, ~10M tokens total. Query pattern: factual ("What's our international travel policy?"), lookup ("Who owns the auth service?"), and light synthesis ("Summarize recent on-call rotation changes"). Stack: TypeScript, Vercel AI SDK, pgvector in existing Postgres, Claude Sonnet for generation. Multi-tenant: No.
Summary
Headline
Recommended pipeline: token-aware chunking at 512 tokens with 64-token overlap, text-embedding-3-small for embeddings, hybrid retrieval (pgvector cosine + Postgres BM25 via tsvector), cross-encoder reranking with bge-reranker-base, and Claude Sonnet for generation with citation enforcement. The non-obvious choice is the reranker — it adds ~140ms but lifts recall@5 by 11 points on the eval set, which matters more than latency for an internal-docs use case.
Pipeline Stages
Stage budget per query (p50)
Stage 1 — Chunking
import { encode } from 'gpt-tokenizer'
export function chunkDocument(text: string, opts = { size: 512, overlap: 64 }) {
const tokens = encode(text)
const chunks: { tokens: number[]; start: number; end: number }[] = []
let i = 0
while (i < tokens.length) {
const end = Math.min(i + opts.size, tokens.length)
chunks.push({ tokens: tokens.slice(i, end), start: i, end })
if (end === tokens.length) break
i += opts.size - opts.overlap
}
return chunks
}
Why 512 / 64
512 tokens fits roughly one logical section of an internal doc. 64-token overlap preserves continuity across section boundaries without ballooning storage. Larger chunks (1024+) hurt retrieval precision; smaller chunks (256) fragment policy text and miss the answer. The eval set confirmed 512 as the sweet spot at recall@5 — bigger or smaller chunks both lost 4–7 points.
Stage 2 — Embedding
import OpenAI from 'openai'
const openai = new OpenAI()
export async function embed(texts: string[]) {
const res = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: texts,
dimensions: 1536,
})
return res.data.map((d) => d.embedding)
}
Why text-embedding-3-small
At this corpus size, text-embedding-3-small matches -large on the eval within margin while costing 6.5x less and embedding 2x faster. Re-evaluate if the corpus grows past ~50M tokens or if domain-specific terminology starts hurting recall.
Stage 3 — Hybrid Retrieval
Vector search alone misses keyword-anchored questions like "What is the SAML-SSO setup URL?" — the literal token "SAML-SSO" is what matters, and embeddings smooth over it. BM25 alone misses paraphrased questions. Reciprocal Rank Fusion (RRF) on the two ranked lists is the simplest combiner that works.
Stage 4 — Reranking
import { CrossEncoder } from '@xenova/transformers'
const reranker = await CrossEncoder.from_pretrained('BAAI/bge-reranker-base')
export async function rerank(query: string, candidates: Chunk[], topK = 5) {
const scores = await reranker.predict(candidates.map((c) => [query, c.text]))
return candidates
.map((c, i) => ({ ...c, score: scores[i] }))
.sort((a, b) => b.score - a.score)
.slice(0, topK)
}
Stage 5 — Generation
Claude Sonnet with a system prompt that enforces grounded answers and inline citations. The retrieval result is rendered as <doc id="123" path="...">...</doc> blocks, and the prompt requires every claim to cite at least one doc id.
Eval Metrics — 200-question Holdout
| Metric | Naive (vector-only) | Hybrid | Hybrid + Rerank | Status |
|---|---|---|---|---|
| Recall@5 | 0.68 | 0.74 | 0.85 | Ship |
| MRR | 0.51 | 0.59 | 0.71 | Ship |
| Answer faithfulness (human-rated 1–5) | 3.4 | 3.9 | 4.4 | Ship |
| Citation precision | 0.62 | 0.78 | 0.91 | Ship |
| p50 end-to-end latency | 1.55s | 1.62s | 1.83s | Watch |
| Hallucination rate | 12% | 6% | 2% | Ship |
Tradeoffs and Open Calls
Chunk size is corpus-specific
The 512-token chunk size was tuned on the eval set. If new doc types arrive (long-form RFCs, transcripts), revisit chunking. Long-form transcripts in particular usually need 1024-token chunks with smarter boundary detection (sentence-aware splitting).
Top-k tradeoff
Top-40 candidates into the reranker is the latency-optimal point on this corpus. Going to top-100 lifts recall by ~2pp but adds ~250ms. For an internal tool that's a bad trade; for a customer-facing search box where missed answers are visible, it might be worth it.
Reranker choice
bge-reranker-base runs on CPU at this scale and is the best precision-per-dollar option. bge-reranker-large is 3pp better on recall@5 but needs a GPU and roughly triples per-query cost. Worth revisiting only if the corpus or query volume grows materially.
Citation enforcement is load-bearing
The hallucination rate dropped from 12% to 2% mostly because the prompt now requires inline doc-id citations and a follow-up validator strips answers that fail citation parsing. Do not ship without the validator — without it, the generation step regresses to ~7% hallucination even with the reranker in place.
This sample illustrates the skill's output format. Northbeam Analytics is a fictional company used recurringly across these sample outputs. Real corpus statistics and eval results are never included in sample outputs.
View full sample →
All sales final. No refunds on digital products.
Includes support for Claude Code, Codex, OpenClaw, and Google Antigravity in the same license.
Also in AI Engineering
Bundle price: $55. Compare this skill with the full workflow bundle or Pro access.
Best for
Engineering teams shipping LLM-backed document-QA, search, or summarization features who are past prototype but haven't owned a retrieval system before, AI engineers replacing naive top-k cosine search with hybrid retrieval and a real eval harness, and founders building knowledge-base chatbots on proprietary content where retrieval quality directly determines the product's perceived intelligence. Most valuable when the corpus is well-defined (a docs site, a support knowledge base, a contract repository — not "the open web") and the team needs sensible defaults plus a measurement loop rather than handcrafted retrieval research.
Not ideal for
Production RAG systems that are already running at scale and need targeted optimization (re-ranking model selection, query routing, multi-vector strategies) — the builder produces a sensible starting pipeline, not advanced tuning for a system that already works. Also a poor fit when the underlying problem is actually a search problem (structured data, exact-match queries) rather than a retrieval-augmented generation problem; a traditional search stack will outperform RAG on those workloads.
Included in this purchase
- Claude Code, Codex, OpenClaw, and Google Antigravity skill files.
- Setup guidance for the right adapter in your workspace.
- One-time license for the purchased skill version.
Setup
Plan for a short setup in the repository or workspace where the skill will run. Some coding familiarity helps for implementation-heavy outputs.
Related Skills
$19.99
One-time license
$19.99
One-time license
$19.99
One-time license
Future Updates
This purchase includes the current version of the skill. If you want future adapter updates — meaning compatibility and packaging updates as supported platforms evolve — plus new catalog additions included automatically, upgrade to Pro.