What do you do with 100 technical books you'll never finish reading? I built an MCP.
I've been collecting high-quality technical books for years, mostly from Humble Bundle sales (IYKYK). Some I've read. Some I keep meaning to. But they're all sitting right there on my hard drive, over 100 of them, so the question became: how do I get more value out of what I already own? Keyword search doesn't cut it. If I want to answer "how do I justify decoupling to a skeptical executive?", the relevant passages say things like "selling architecture" or "business risk." You'd have to already know the answer to find it. And there was a second motivation: I'd been circling RAG at a theoretical distance for a while. I wanted to understand the patterns firsthand, not from diagrams.
So I built a semantic search engine over my own library, exposed as an MCP server that Claude can query. Everything runs locally: embedding, vector storage, retrieval. No data leaves my machine and there's no per-query cost. Books get chunked into passages, embedded into vectors where meaning maps to proximity, and indexed once upfront so queries are fast. When I ask Claude an architecture question, it searches the library, pulls the most relevant passages, and grounds its answer in them, with titles and page numbers attached.
The obvious payoff: I ask a question and get actual passages from Neal Ford, Vaughn Vernon, Svyatoslav Kotusev, and friends instead of an AI's vague recollection of them. Even the books I haven't gotten to yet are earning their disk space. The bigger shift is in my professional work. When I develop an artifact with AI now, whether it's an architecture recommendation, a position paper, or a decision record, the reasoning traces back to named authorities on my shelf, not whatever advice happened to get baked into a public model. "Software Architecture: The Hard Parts covers this trade-off directly: here's the passage" carries a different weight in a review than "the AI suggested it."
I also now understand the RAG trade-offs I used to nod along to: local vs. cloud, semantic vs. keyword, batch vs. real-time. Not as bullet points, but as choices I made and lived with. If you're an architect who's been circling RAG at a distance, this is the project I'd recommend. Small enough to finish. Real enough to teach you the patterns. So here's how to make what I made.
How to build it
The system is two pipelines. An indexing pipeline runs once (and again whenever you add books): extract text, chunk it, embed it, store it. A query pipeline runs every time you ask a question: embed the query, find the nearest passages, hand them to Claude. Everything below runs on a laptop with Python 3.11, a virtual environment, and four libraries: pypdf, ebooklib, sentence-transformers, and chromadb.
Step 1: Extract the text
PDFs and EPUBs need different handling. For PDFs, pypdf in layout mode gets you page-by-page text. For EPUBs, ebooklib plus BeautifulSoup strips the HTML down to chapter text. Keep the page or chapter number with each block of text; it becomes the citation metadata later, and citations are half the point.
One guard worth adding: if a PDF averages under 100 characters per page, it's probably a scanned image without a text layer. Flag it and skip it rather than indexing garbage.
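A minimal sketch of that guard, assuming the extractor has already produced one string per page. The function names and the shape of the returned records are illustrative, not taken from a particular library.

```python
def looks_scanned(pages: list[str], min_avg_chars: int = 100) -> bool:
    """Return True when the average extracted text per page is suspiciously low."""
    if not pages:
        return True  # nothing extractable at all: treat the file as unusable
    total_chars = sum(len(text.strip()) for text in pages)
    return total_chars / len(pages) < min_avg_chars


def pages_with_numbers(pages: list[str]) -> list[dict]:
    """Pair each non-empty page with its 1-based page number for later citations."""
    return [
        {"page": number, "text": text}
        for number, text in enumerate(pages, start=1)
        if text.strip()
    ]
```

With pypdf, the input list could come from something like `[page.extract_text(extraction_mode="layout") or "" for page in PdfReader(path).pages]`.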
Step 2: Chunk it (this is where tutorials will mislead you)
Embedding models work on passages, so the text has to be divided. The common recommendation is small chunks, a few hundred tokens. That works for documentation and FAQs. It fails for architecture books, because these authors develop an argument across several paragraphs before landing the conclusion. Cut too early and you sever the argument from its payoff, and your search results read like sentences ripped out of context.
I landed on chunks of about 2,400 characters, roughly a page of dense text, with a 300-character overlap between consecutive chunks so no sentence gets orphaned at a boundary. The chunker also walks backward from the target position to find the nearest sentence boundary instead of cutting mid-sentence. This was the design decision I could only have reached by watching my own search results fail, and it's the first thing I'd tell anyone building one of these: chunking strategy is a first-class architecture decision, not a config value to copy from a tutorial.
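One way to sketch a chunker along those lines. The sentence-boundary regex and the rule "don't back up past half the target size" are simplifying assumptions for illustration, and the function name is hypothetical.

```python
import re

# Assumed sentence boundary: terminal punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"[.!?]\s")


def chunk_text(text: str, target: int = 2400, overlap: int = 300) -> list[str]:
    """Split text into overlapping chunks that end on sentence boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + target, len(text))
        if end < len(text):
            # Walk backward: use the last sentence end inside the window,
            # unless that would shrink the chunk below half the target size.
            matches = list(_SENTENCE_END.finditer(text[start:end]))
            if matches and matches[-1].end() > target // 2:
                end = start + matches[-1].end()
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            break
        # Step back by the overlap so consecutive chunks share context.
        start = max(end - overlap, start + 1)
    return chunks
```

The half-target floor keeps a window with one early period from producing a tiny chunk. Text with no sentence punctuation at all simply falls back to a hard cut at the target length.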
Step 3: Embed the passages
Each chunk goes through a sentence embedding model. I used all-MiniLM-L6-v2 from the sentence-transformers library: small, fast, runs locally, and produces a 384-dimensional vector per passage. The geometry is the whole trick: passages with similar meanings land near each other in that space, regardless of vocabulary. That's what lets "justify decoupling to a skeptical executive" find a passage about "selling architecture."
from sentence_transformers import SentenceTransformer

# chunks is the list of passage strings produced in Step 2
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks, batch_size=64, normalize_embeddings=True)
Normalizing the embeddings matters: it makes cosine similarity a simple dot product, which keeps search fast.
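To see why, here is a small pure-Python check with toy three-dimensional vectors standing in for real 384-dimensional embeddings. The helper names are illustrative.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity computed the long way, with both norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]

# Once both vectors have unit length, the norms in the denominator are 1,
# so the plain dot product already is the cosine similarity.
assert math.isclose(dot(normalize(a), normalize(b)), cosine(a, b))
```

The normalization cost is paid once at indexing time, so each query-time comparison needs only the multiply-and-sum.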
Step 4: Store them in ChromaDB
ChromaDB is a local vector database that persists to disk and needs no server to run. Each entry gets an ID (book ID plus chunk index), the embedding, the original passage text, and metadata (title, author, page range). Insert in batches of a few hundred. If a book was indexed before and has since changed, delete its old entries first.
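A sketch of the delete-then-insert step, assuming `collection` is a ChromaDB collection (for example from `chromadb.PersistentClient(path=...).get_or_create_collection(...)`) and that embeddings are passed as lists of floats. The function name, the `book_id:index` ID format, and the `book_id` metadata key are illustrative choices.

```python
def store_book(collection, book_id, chunks, embeddings, metadatas, batch_size=200):
    """Replace a book's entries in a ChromaDB-style collection, in batches."""
    # Drop anything left over from an earlier indexing run of this book.
    collection.delete(where={"book_id": book_id})
    for start in range(0, len(chunks), batch_size):
        stop = min(start + batch_size, len(chunks))
        collection.add(
            ids=[f"{book_id}:{i}" for i in range(start, stop)],
            embeddings=embeddings[start:stop],
            documents=chunks[start:stop],
            # Tag every entry with its book so it can be found and deleted later.
            metadatas=[{**meta, "book_id": book_id} for meta in metadatas[start:stop]],
        )
```

Storing `book_id` in the metadata is what makes the "delete its old entries first" step a single filtered call.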
Step 5: Track what's indexed
A small JSON state file maps each book file to its SHA-256 hash. On each indexing run, hash every file and compare: new or changed files get indexed, everything else gets skipped. Add a skip list for the files you don't want in the index (in my case, fiction and game books that came along in the bundles). This turns re-indexing from an hour-long job into a few seconds when nothing changed.
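The change detection can be sketched with the standard library alone. The function names, the flat `{filename: hash}` state layout, and the extension filter are assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash a file in 1 MiB blocks so large PDFs never load into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def load_state(state_file: Path) -> dict:
    return json.loads(state_file.read_text()) if state_file.exists() else {}


def files_to_index(books_dir: Path, state_file: Path, skip: set[str]) -> list[Path]:
    """Return the books whose hash is new or differs from the recorded one."""
    state = load_state(state_file)
    pending = []
    for path in sorted(books_dir.iterdir()):
        if path.suffix.lower() not in {".pdf", ".epub"} or path.name in skip:
            continue
        if state.get(path.name) != file_sha256(path):
            pending.append(path)
    return pending


def mark_indexed(path: Path, state_file: Path) -> None:
    """Record a book's current hash after it has been indexed successfully."""
    state = load_state(state_file)
    state[path.name] = file_sha256(path)
    state_file.write_text(json.dumps(state, indent=2))
```

Calling `mark_indexed` only after a book is fully stored means an interrupted run gets retried on the next pass.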
Step 6: Expose it as an MCP server
MCP, the Model Context Protocol, is an open standard for connecting AI assistants to external tools over JSON-RPC. The server is a Python process that loads the embedding model and the ChromaDB client once at startup and exposes three tools. Loading once matters: loading per query would add seconds to every search, while loading at startup keeps queries in the millisecond range.
The first, search_books, embeds a query and returns the ten nearest passages. The second, list_books, returns the catalog so Claude knows what's on the shelf. The third, get_book_passages, reads sequential chunks from a specific book, which lets Claude pull the surrounding context after a search hit. Three tools is deliberate. Each maps to a distinct need (find, browse, read around), and the tool descriptions in the schema are worth writing carefully, because Claude reads them to decide when to call what.
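As an illustration, the core of search_books might look like the sketch below. It assumes `model` is the sentence-transformers model and `collection` is the ChromaDB collection, both loaded once at startup. The returned field names and the `title` and `pages` metadata keys are hypothetical, and registering the function as a tool with an MCP SDK is left out.

```python
def search_books(query, model, collection, n_results=10):
    """Embed the query and return the nearest passages with citation metadata."""
    vector = model.encode([query], normalize_embeddings=True)[0]
    result = collection.query(
        query_embeddings=[[float(x) for x in vector]],
        n_results=n_results,
    )
    # ChromaDB returns one result list per query; only one query was sent.
    hits = zip(result["documents"][0], result["metadatas"][0], result["distances"][0])
    return [
        {
            "text": text,
            "title": meta.get("title"),
            "pages": meta.get("pages"),
            "distance": distance,
        }
        for text, meta, distance in hits
    ]
```

Returning the title and page range alongside each passage is what lets Claude attach citations to its answer.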
Step 7: Wire it into Claude
Claude Code (and Claude Desktop) read an MCP configuration file that names the server, the command to launch it, and environment variables pointing at your books directory and ChromaDB path. From then on, the server starts automatically and the tools show up in every conversation. Ask an architecture question and watch the citations arrive.
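For a sense of what that file contains, the snippet below builds and prints an entry in the `mcpServers` layout these clients read. The server name, every path, and both environment variable names are placeholders I made up for illustration.

```python
import json

config = {
    "mcpServers": {
        "book-library": {  # hypothetical server name
            "command": "/path/to/venv/bin/python",  # placeholder: interpreter in your venv
            "args": ["/path/to/server.py"],  # placeholder: the MCP server script
            "env": {
                "BOOKS_DIR": "/path/to/books",  # hypothetical variable name
                "CHROMA_PATH": "/path/to/chroma",  # hypothetical variable name
            },
        }
    }
}

# Print the JSON to paste into the client's MCP configuration file.
print(json.dumps(config, indent=2))
```

Pointing `command` at the virtual environment's interpreter means the server starts with the right libraries without any activation step.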
Where it tops out
For a personal library of a hundred books, this architecture runs comfortably on one laptop with subsecond queries. The same conceptual layers (extraction, chunking, embedding, vector storage, and an API in front) appear in enterprise-scale RAG systems. Only the infrastructure underneath changes: at hundreds of thousands of passages you'd shard, and at millions you'd reach for a distributed vector database. Build the laptop version and the enterprise diagrams stop being abstract.
I wrapped the project up with a component-by-component guide for my colleagues, a path to understand what I built and how, so they can take the idea and run with it. Writing that guide exposed every place I only half-understood my own system, which may have been the most valuable consequence of all. Happy to share it if there's interest.