Building a Search Engine in Rust: Crawl, Index, Rank
A search engine is two deceptively small verbs stacked on top of each other: go find every page, and then find the one I actually want. Each sounds trivial until you build it, and then each turns into its own little world. I wanted to feel both, so I wrote RSeek — a web crawler and full-text search engine in Rust, small enough to read in an afternoon and complete enough to actually rank results.
The crawler: a polite flood
Crawling is breadth-first search over the web graph: start at a seed URL, fetch it, scrape its links, fetch those, and keep going. The catch is that “keep going” can mean thousands of pages at once, and you want to be fast without being a menace. Rust’s async story makes the balance pleasant. I fetch with a hyper client over Tokio, and put a Semaphore in front of it so only N requests are ever in flight:
let semaphore = Arc::new(Semaphore::new(concurrency)); // politeness + a resource cap
let visited = Arc::new(Mutex::new(HashSet::new())); // never fetch the same URL twice
let permit = semaphore.clone().acquire_owned().await?;
tokio::spawn(async move {
let page = crawl_url(url, client).await; // fetch + parse off-thread
drop(permit); // release the slot for the next URL
// ... extract links, enqueue the unseen ones ...
});
Two small pieces of shared state do the heavy lifting: the semaphore bounds concurrency, and a HashSet of visited URLs behind a Mutex keeps the crawl from chasing its own tail. Link extraction leans on the scraper crate — a single CSS selector pulls every anchor, and a base-URL join turns relative hrefs into absolute ones:
let selector = Selector::parse("a[href]").unwrap();
document.select(&selector)
.filter_map(|el| el.value().attr("href"))
.filter(|href| href.starts_with("http"))
.map(String::from)
.collect::<Vec<_>>()
The index: turning pages into something searchable
A crawled page on its own is useless to a searcher; it has to become an inverted index — a map from each word to the documents that contain it. I used the probly_search crate, adding every page as a document with two fields, its title and its body, run through a simple whitespace tokenizer:
let index = Arc::new(Mutex::new(Index::<usize>::new(2))); // 2 fields: title, content
index.add_document(
&[extract_title, extract_content], // how to pull each field
tokenizer,
doc_id,
&page,
);
Ranking: why BM25 beats counting
The part that makes it a search engine rather than a grep is ranking. The naive instinct is to count keyword matches, but that’s wrong in two ways: a page that says “rust” fifty times isn’t fifty times more relevant, and a match on a rare word should count for far more than a match on “the.” BM25 — the algorithm that quietly powers most real text search — bakes in exactly those intuitions. It rewards term frequency but with diminishing returns, scales each term by how rare it is across the corpus (IDF), and normalizes for document length so a long page can’t win just by being long. RSeek leans on a ready BM25 scorer, and the moment results came back ordered by genuine relevance instead of raw counts, the toy turned into a tool:
let results = index.query(&query, &mut bm25::new(), tokenizer, &[1., 1.]);
for r in results { println!("{} (score {:.3})", r.key, r.score); }
What it taught me
RSeek is small on purpose, and it skips the hard problems a production engine obsesses over — persistence, distributed indexing, crawl politeness rules, deduplication at scale. But building the two halves end to end made the shape of the problem click. Gathering is a concurrency problem, and Rust’s Semaphore-plus-spawn model made it feel almost relaxing. Ranking is an information problem, and BM25 is one of those rare formulas that’s simple enough to hold in your head and yet good enough to have outlasted decades of fancier ideas. Put the two together and you have, in a few hundred lines, the same skeleton that sits underneath the search box you used today.
The code is on GitHub at github.com/arazmj/rseek.