Development Log 1: The Joy of Search (Part 1)
Journey into building a fully local search engine in Rust, learning concepts deployed in modern search applications such as Google & Perplexity.
I’m excited to be producing my very first development log! After a few interesting articles I have posted on my Unity newsletter, I thought that this one deserved some well-earned love. This is the culmination of my work building a fully local search engine from scratch, in the multi-paradigm, memory-safe and efficient systems programming language, Rust!
Introduction: Why Search?
For starters, what got me interested in the problem of search, and why choose it for a personal project? Throughout projects I have completed in my first year, I found a massive efficiency boost through the use of AI-powered search tools, such as Perplexity and Phind.
As many ML-savvy readers will be aware, these have all of the benefits of existing chatbots, such as key summarisation and human-friendly output, but without all of the pesky issues with hallucinations. To put it plainly, citing reliable sources and using these in the input context is much more effective than any antipsychotic.
These tools further allowed me to appreciate search as the powerful bedrock technology for interfacing with information: database indices (e.g. in Spotify), documentation search for different programming languages, and anywhere else you have a lot of natural language to sift through!
Alongside this, the concepts necessary to understand the entire search engine technology stack require learning about ML, algorithmic optimisations and more. It was also a fantastic way to really test the limits of the Rust programming language, and unsurprisingly, I am still nowhere near the bottom of that iceberg.
System Overview
Before going into technical detail about the core engine, I wanted first to overview how the entire application fits together. In Part 2, I hope to cover this in more detail, including how to implement a web application in Rust with Rocket, language interoperability with FFIs (foreign function interfaces), and other key concepts for building polyglot apps.
This is a very rough sketch which I hope to improve on in Part 2. In this article we will primarily be discussing the core engine components (Crawler, Parser, Indexer, Ranker), along with the meta search and summarisation services.
Search Pipeline
To start off, I will outline the basics of the core search engine, which takes some query from the user and outputs an ordered list of web documents for you to scan through. A distinction must be made between the core engine and the browser: the browser does all of the graphical rendering work for each web document, covering both static HTML elements and dynamic (canonically JavaScript) elements, when a user accesses it. That is a separate endeavour from the core engine.
The core engine can be functionally split into four areas: crawling, parsing, indexing and ranking. I will discuss these in order, including elements of my personal implementation and possible alternatives. Keep this pipeline in mind as you read each section:
A set of web documents is obtained by crawling.
Web documents are parsed to extract important content.
Parsed web documents are indexed.
Indexed web documents are ranked by relevancy before displaying to the user.
Crawling
The crawling process begins with defining a set of seed domain names. Domain names give us a unique identifier for the DNS resolver to look up and find the location of individual web pages. Before moving on, you should definitely watch the following video to learn about DNS (it isn’t necessary to understand all the details, but it may help you frame the system as you read on).
The seed domain list I used for this project can be found here (university-domains-list). It is a list of domains of various educational institutions, with an sTLD (sponsored top-level domain) of .edu. This is because I wanted my results to focus more on reputable academic resources as opposed to a random assortment of pages.
There are numerous approaches to crawling, but the fundamental understanding is that this is a graph search problem. For every page in the seed domain list, we can resolve that page, extract its content and hyperlinks, and continue following those links until we have extracted as many pages as we like (or can store), or reached a particular depth in the search tree.

Fetcher
This is a high-level overview of this subsystem. The fetcher (obtaining the page content given a domain name) has already been implemented for us, utilising the reqwest library. Reqwest is an HTTP client that supports two important features:
JSON handling: it integrates well with Serde, a Rust library used to serialise and deserialise (process) JSON data obtained from HTTP requests.
Compliance with the HTTP/1.1 and HTTP/2 protocols, plus support for TLS (Transport Layer Security), which is important for HTTPS domains and for requests actually returning the content they’re supposed to.
Again, even though this is not strictly necessary, this is a great educational video on HTTP protocols.
Extractor
The extractor has also been implemented for us in the form of a CSS selector and HTML parser library called Scraper. I debated implementing this myself, but it would require some dedication; if anyone is interested in me tackling this for a future project, let me know!
For a short overview, HTML5 pages are defined in terms of an indented ‘tree’ of tags. When this is parsed out into a traversable tree structure, it is known as the DOM (Document Object Model). Scraper also provides the functionality to select individual elements (a common one being p for ‘paragraph’) based on CSS selectors; for example, HTML tags can be attached to an id.
Scheduler
The scheduler and ‘uniqueness verifier’ both come under the search process used to follow links from the initial domains. Most algorithms will maintain a visited set which ensures no URL is revisited. In my concrete implementation, I opted to use a HashSet data structure from Rust’s built-in collections library, which ensures O(1) lookup and no duplicate storage.
For the unacquainted, O(1) lookup means that checking whether a URL has already been visited takes a constant amount of work, regardless of how many URLs have been stored (the HashSet is based on a hashing algorithm, the details of which I won’t go into here, which maps each unique URL string to a key used for fast lookup).
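As a quick illustration, here is a minimal sketch of the visited-set check; the names are illustrative rather than taken from my implementation.
use std::collections::HashSet;

fn main() {
    // A visited set ensures each URL is only crawled once.
    let mut visited: HashSet<String> = HashSet::new();
    let url = String::from("https://example.edu/");

    // insert() returns false if the value was already present.
    if visited.insert(url.clone()) {
        println!("{url} has not been seen before, so schedule it for crawling");
    }
}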
There are multiple common approaches to the scheduler, some of which are unique to web crawling:
Depth-first, breadth-first, best-first (no relevance weighting).
Focused (above with relevance weighting).
My current approach simply completes a breadth-first traversal. A breadth-first search enumerates all of the links for each new link traversed from the original domain, and will continue until a specified depth limit is reached. You can imagine it as ‘expanding out uniformly’ from the start node (one of our seed domains).
At the implementation level, some optimisations can be made. We need to keep track of our next URLs to visit, and this is simple to represent with a struct, collecting important fields such as depth (or the extended weighted priority) and the resolvable domain name.
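For illustration, such a struct might look something like the following sketch; the exact fields in my implementation may differ, but crawl_depth matches the comparison code shown further below.
// Fields are illustrative; crawl_depth matches the comparison code below.
pub struct UrlToVisit {
    pub url: String,
    pub crawl_depth: u32,
}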
We begin from our seed domain and place all of its neighbours (reachable through one hyperlink) in a Binary Heap data structure, a binary tree used as a common approach for implementing priority queues. Rust provides us with a built-in max-heap, which stores the largest element at the root.
However, this heap relies on knowing how to compare UrlToVisit values.
One of the interesting things I noticed when learning Rust was the way comparisons are implemented. For those of you familiar with Python, the standard comparison operators (<, >, ==) do all the work, and low-level details only become important when comparing different data types, which was definitely the cause of many errors when I first started programming.
Below is how I implemented comparisons manually for the crawler (apologies ahead of time, as Substack does not currently support syntax highlighting and taking an image didn’t seem like the right idea).
impl Ord for UrlToVisit {
    fn cmp(&self, other: &Self) -> std::cmp::Ordering {
        self.crawl_depth.cmp(&other.crawl_depth).reverse()
    }
}
Here we can see a few keywords commonly used in Rust. The first is impl, which gives functionality or behaviour to a struct, in this case UrlToVisit. Ord is a trait (the functionality defined through impl) which defines a binary total ordering (requiring reflexivity, antisymmetry and so on). We need not concern ourselves with the mathematical properties of total ordering, but we must be certain that crawl_depth is comparable. It is, since it is a u32 integer, and there are no ambiguous comparisons; as an example, this wouldn’t be the case for floats, where we could be comparing NaN. reverse() ensures lower crawl depth is prioritised, since we are working with a max-heap.
Rust also provides static typing and annotations, which we can see here: we are comparing two UrlToVisit values (both behind a reference to Self) and outputting a std::cmp::Ordering, an enum with three possible values: Less, Equal or Greater.
This comparison trait is then used by the binary heap implementation to maintain a priority queue of the next links to search, with a focus on smaller depth first. This would not be strictly necessary for a BFS performed synchronously, but since we are crawling asynchronously, we may be waiting on page extraction while we continue to dig deeper into the crawl, which is not desirable behaviour.
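As a side note, BinaryHeap also expects the companion traits PartialOrd, PartialEq and Eq to be implemented consistently with the Ord above. A minimal sketch (not necessarily the exact code used in the project, and assuming the UrlToVisit fields sketched earlier):
use std::cmp::Ordering;

impl PartialOrd for UrlToVisit {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        // Delegate to the Ord implementation above.
        Some(self.cmp(other))
    }
}

impl PartialEq for UrlToVisit {
    fn eq(&self, other: &Self) -> bool {
        // Consistent with cmp: two entries are 'equal' when their depths match.
        self.crawl_depth == other.crawl_depth
    }
}

impl Eq for UrlToVisit {}
With these in place, pushing UrlToVisit values into a std::collections::BinaryHeap and popping will always return the shallowest pending URL first.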
The crawl is completed asynchronously, as each HTTP request to extract web page content (made through reqwest) is also asynchronous.
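To make this concrete, a minimal sketch of an asynchronous fetch with reqwest might look like the following; the function name is illustrative, and running it requires an async runtime such as Tokio.
// Fetch a page body; awaiting does not block the thread while the request is in flight.
async fn fetch_page(url: &str) -> Result<String, reqwest::Error> {
    let response = reqwest::get(url).await?; // non-blocking HTTP GET
    response.text().await                    // read the response body as a String
}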
The second approach (focused crawling) aims to filter out irrelevant links (or equivalently focus on more relevant links) if we have information about the search query or other constraints on links to select. This is something I will be extending the project with in a future article, if it turns out well!
Parsing
The parsing process involves extracting important web page content, tokenising and preprocessing this content for later ranking, and also some translation work.
Firstly, we define a custom struct which will hold all the information we would like to maintain about a document.
#[derive(Debug, Hash, Eq, PartialEq, Clone, Serialize, Deserialize)]
pub struct Document {
    pub url: String,
    pub content: Vec<String>,
    pub description: String,
    images: Vec<String>,
    links: Vec<String>,
    pub title: String,
}
This also demonstrates another Rust language feature, which is that the derive procedural macro can implement traits automatically (essentially expanding out the struct definition to include default implementations before compiling the code).
Now we need to consider how we extract these individual fields from raw HTML data. Luckily, scraper provides a lot of functionality for this. We define selectors for the different HTML elements that contain the information we need.
let title_selector = Selector::parse("title").unwrap();
let p_selector = Selector::parse("p").unwrap();
let h_selector = Selector::parse("h1, h2, h3, h4, h5, h6").unwrap();
let metadata_selector = Selector::parse("meta").unwrap();
let image_selector = Selector::parse("img").unwrap();
In this case, we have unwrap’d the result, which is not safe practice in Rust: our parse method can return an Error type, which will immediately cause the program to crash (panic) if unwrapped directly. The reason for this approach is that the parsing service is logically disconnected from the rest of the application, and the rest of the search pipeline does not work without being able to parse the extracted documents.
As you can see, we have not defined a content selector; this is because we actually aggregate text from all h-type tags as well as p tags.
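Putting the selectors to work, a sketch of this aggregation step could look like the following; the function name is illustrative, and the selectors are re-created locally so the snippet is self-contained.
use scraper::{Html, Selector};

// Aggregate text from heading and paragraph elements into the content field.
fn extract_content(html: &str) -> Vec<String> {
    let document = Html::parse_document(html);
    let h_selector = Selector::parse("h1, h2, h3, h4, h5, h6").unwrap();
    let p_selector = Selector::parse("p").unwrap();

    document
        .select(&h_selector)
        .chain(document.select(&p_selector))
        .map(|element| element.text().collect::<String>())
        .collect()
}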
One problem I ran into when implementing the parsing process was that the crawler often returned non-English web pages, meaning that I needed some way to detect and/or translate text content. This falls into the quite complex field of machine translation; luckily, LibreTranslate provided all of the necessary functionality.
Finally, all of this aggregated and translated text content can be effectively tokenised and preprocessed. One stage of this preprocessing is demonstrated below:
fn tokenise(content: String) -> Vec<String> {
    let content: String = content.chars().filter(|c| c.is_alphabetic() || c.is_whitespace()).collect();
    let words: Vec<String> = content.split_whitespace().map(|s| s.to_string().to_lowercase()).collect();
    // Remove stop words - words which do not add semantic value to the string as a sentence.
    let stop_words = vec![...];
    words.into_iter().filter(|word| !stop_words.contains(&word.as_str())).collect()
}
This is also a good example for some of the functional language features provided by Rust.
Common procedure when preprocessing natural language also includes stemming, but in this instance we only care about details at the term level. Here we are tokenising and removing stop words. The first step also includes filtering out any special or numerical characters.
Under the hood, when we are extracting the content variable, we convert our original String into a stream of characters (decoded from its UTF-8 bytes). We then apply a filter over this stream, which selects only elements that are either alphabetic or whitespace (useful for splitting the content again later).
This is a common functional pattern. Essentially, we treat iterating over our content as generating a sequence of character tokens and applying the higher-order function filter over those tokens, as opposed to moving a pointer over (iterating through) the sequence.
Indexing
The indexing process essentially takes some vector (the Rust analogue to a list of elements) of documents and stores it for ranking later. Two types of indices were used; a B-Tree index was also tested experimentally, but it was never of any use for the ranking algorithms that I implemented.
Document-Term
Document-term indices store a map of documents to the terms they contain. The naive approach to creating this index is to produce a HashMap which stores documents (by a unique identifier) as keys, and a vector of their extracted terms (as per the above tokenisation) as values.
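A minimal sketch of this construction might be the following, keyed here by URL as the unique identifier (an assumption for illustration):
use std::collections::HashMap;

// Map each document (identified here by its URL) to its extracted terms.
fn build_document_term_index(docs: &[(String, Vec<String>)]) -> HashMap<String, Vec<String>> {
    let mut index = HashMap::new();
    for (url, terms) in docs {
        index.insert(url.clone(), terms.clone());
    }
    index
}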
Inverted
The inverted index was slightly more difficult to create. It requires us to map individual terms onto a vector of documents that contain it.
The process was roughly as follows (a code sketch is given after the list):
Iterate through each document.
If we haven’t seen a term in the map yet, add to map with document and default term frequency of 1.
If term is already in the map but document is not, add to map with document and default term frequency of 1.
If both the term and the document mapping exist, increment term frequency.
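Under the assumption that each term maps to per-document frequency counts, a sketch of this construction (with illustrative names, not my exact code) could look like:
use std::collections::HashMap;

// term -> (document identifier -> term frequency)
type InvertedIndex = HashMap<String, HashMap<String, u32>>;

fn build_inverted_index(docs: &[(String, Vec<String>)]) -> InvertedIndex {
    let mut index: InvertedIndex = HashMap::new();
    for (url, terms) in docs {
        for term in terms {
            let postings = index.entry(term.clone()).or_insert_with(HashMap::new);
            // First sighting of (term, document) starts the count at 1;
            // subsequent sightings simply increment the frequency.
            *postings.entry(url.clone()).or_insert(0) += 1;
        }
    }
    index
}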
Both of these index types were serialised into JSON files so that they could be deserialised and read at a later date.
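A sketch of how such an index might be persisted with serde_json, assuming the index type implements Serialize (the file name is illustrative):
use std::collections::HashMap;

// Persist an index as JSON so it can be reloaded later.
fn save_index(index: &HashMap<String, Vec<String>>, path: &str) -> std::io::Result<()> {
    let json = serde_json::to_string(index).expect("index should serialise to JSON");
    std::fs::write(path, json)
}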
Ranking
The ranking procedure is one of the most technically demanding aspects of the project, allowing us to provide a set of more relevant web documents to the user depending on their search query, and it is situated in the broader field of information retrieval.
My implementation includes three ranking methods, but there are many more. Google (as we discuss later) utilises an ingenious algorithm called PageRank, which focuses on edge information (i.e. the hyperlink structure between pages) as opposed to the content of the documents themselves. This may be something I attempt to implement in the future!
Document Clustering (Word2Vec)
First, it will be useful to overview the document clustering method. The basic outline is that we can obtain a semantic representation of each web document, in such a way that we can consider certain documents to be more ‘similar’. Then, we can apply clustering algorithms such as k-means to both the document representations and the query representation, to determine which documents should be displayed to the user first.
So, how can such a representation be obtained? The first approach, and admittedly an extremely inefficient one, is to use Word2Vec. Roughly speaking, Word2Vec uses feed-forward (or recurrent) neural networks to learn an embedding of a word in a vector space defined by semantic similarity to other words. This similarity is based on the context in which words are used, i.e. dog is similar to cat, queen is similar to king, and so on.
This is a very rough overview and the reader is encouraged to see the original research paper, which I may overview in a future article (https://arxiv.org/pdf/1301.3781). For our case, we can black-box Word2Vec and utilise it for our document clustering. In order to do this, we can take some subset of documents from the document-term index, compute an averaged (weighted) Word2Vec representation over terms, and cluster these with k-means. We can also compute the weighted average Word2Vec for each word in the search query, and find closest clusters for comparison.
Above is a rough sketch of a possible embedding space, in this case using a Word2Vec representation of length 2. In reality, the Word2Vec model I used, pre-loaded from the Gensim library, had output embeddings with 64 dimensions. Above we can see that we may be able to group our documents into categories on the basis of their Word2Vec representations.
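To make the idea concrete, here is a sketch of an (unweighted) averaged document embedding; the map of term embeddings is assumed to have been loaded elsewhere (in my case, served from Gensim), and the names and types are illustrative.
use std::collections::HashMap;

// Average the Word2Vec vectors of a document's terms; terms without a known
// embedding are simply skipped.
fn document_embedding(terms: &[String], embeddings: &HashMap<String, Vec<f32>>) -> Option<Vec<f32>> {
    let vectors: Vec<&Vec<f32>> = terms.iter().filter_map(|t| embeddings.get(t)).collect();
    if vectors.is_empty() {
        return None;
    }
    let dim = vectors[0].len();
    let mut avg = vec![0.0f32; dim];
    for v in &vectors {
        for (i, value) in v.iter().enumerate() {
            avg[i] += value;
        }
    }
    for value in &mut avg {
        *value /= vectors.len() as f32;
    }
    Some(avg)
}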
When we have obtained our embedding space, we can define a specific distance function to work with, which is necessary for clustering. For our case, we have opted to use Minkowski distance and cosine similarity. Minkowski distance will define a valid distance metric no matter the dimensions of our embedding space. Cosine similarity compares the angle between our representations.
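For reference, minimal sketches of these two measures (assuming equal-length vectors) might look like:
// Minkowski distance of order p between two embeddings.
fn minkowski_distance(a: &[f32], b: &[f32], p: f32) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs().powf(p))
        .sum::<f32>()
        .powf(1.0 / p)
}

// Cosine similarity: the cosine of the angle between two embeddings.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}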
The particular strength of cosine similarity is that it is independent of document length, but for Minkowski distance we have to normalise document lengths first. A reasonable way to approach this, which was used in the project, is PCA (principal component analysis), since it keeps most of the original information. PCA simply extracts the key directions explaining the most variability in our vector embeddings; the reader can learn about this concept in more depth from one of my favourite YouTube channels, StatQuest:
To finalise this section, I wanted to expand on my point about the inefficiency of Word2Vec embeddings. The primary bottleneck is that we have to compute (and average) an embedding for every term in every document.
Document Clustering (Sentence Transformers)
This uses the same recipe as above, but instead of Word2Vec, Sentence Transformers are used to build semantic representations, in this case (as indicated by the name) for a whole document rather than individual words.
In the next article, I will add more technical detail on how sentence transformers work, since I find them fairly interesting, and they yielded far faster search times than the Word2Vec approach (each embedding takes roughly the same time to compute, but the Word2Vec approach scales with the number of terms in each document).
The key paper is linked here: (SBERT).
BM25
The BM25 approach is a derivative of TF-IDF rankings, and is used on our inverted index, as opposed to our document-term index.
Instead of clustering together similar documents, the algorithm simply focuses on the relevance of each document to the user’s query, and then uses this as a score to compare documents. Below is the formula used for the sorting:
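For reference, the standard Okapi BM25 score of a document D for a query Q with terms q_1, ..., q_n (which the quantities described below correspond to) is:

score(D, Q) = \sum_{i=1}^{n} IDF(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{fieldLen}{avgFieldLen}\right)}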
IDF stands for ‘inverse document frequency’. This is a more involved formula that essentially determines how rare a term is in a corpus (and thus its informativeness in distinguishing between documents).
f computes the frequency of the input term within the document D. This necessitates usage of the inverted index, since we can directly look up term frequencies for each document.
fieldLen and avgFieldLen are essentially the length and average length of the documents in the corpus.
b, k1 are free parameters. b controls document length normalisation, and k1 controls how much the term frequency influences the document scoring. Reducing k1 reduces impact of term frequency.
In essence, BM25 is a more complicated way of determining relevance based on the extent of usage of the query term(s) in the documents being ranked, and accounts for minutiae like weighting the extent of repetition of an individual term, and so on.
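As a sketch, the per-term contribution described above could be computed as follows; the parameter defaults k1 = 1.2 and b = 0.75 mentioned in the comment are commonly used values, not necessarily those used in my project.
// Score contribution of a single query term for one document.
// Typical defaults are k1 = 1.2 and b = 0.75.
fn bm25_term_score(idf: f64, term_freq: f64, field_len: f64, avg_field_len: f64, k1: f64, b: f64) -> f64 {
    let numerator = term_freq * (k1 + 1.0);
    let denominator = term_freq + k1 * (1.0 - b + b * field_len / avg_field_len);
    idf * numerator / denominator
}
The document’s overall score is then the sum of this quantity over every term in the query.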
Some fun trivia on the side: BM25 is actually the default ranking algorithm used by Elasticsearch, a popular open-source full-text search and analytics engine.
For a more in-depth analysis of BM25, readers are encouraged to consult the following resource:
Meta Search
Meta search is a term which refers to the process of aggregating and filtering results from various existing engines. Meta search can be extremely useful, especially considering that engines have their own unique strengths. A tool I use often is called SearXNG, and a list of active instances can be found here: SearXNG Instances.
In my project, I decided not only to collect results from a local search engine made from scratch, but also from Google and DuckDuckGo sources.
DuckDuckGo has an actively maintained search CLI (command line interface) tool called DDGR, which provides the ability to make automated requests, as long as we can parse the output. It also provides other functionality such as user locale (geographical location and timezone filters) and various other search filters. The outputs are simple to parse with regular expressions, which are well supported in Rust.
Google allows users to create a customisable ‘programmable engine’ which contains various default filters and configurations that were useful for this project. In particular, the top-level domain could be manually filtered. In my project, users can specify their own programmable engine key and API key as environment variables (the details of which are specified in the README). A Flask-based microservice is used to make the request and parse the results.
Extra Features
Summarisation with Local LLM
Summarisations for individual results can be made with various local LLMs. The user can specify the model in their environment variables, choosing from a variety of pre-trained models supported by HuggingFace’s AutoTokenizer class; the one I tested with (with highly questionable outputs) was DistilGPT2.
One of the primary issues with local summarisation is inference speed. If the user does not have the necessary GPU resources for a larger model (even when quantised, meaning that less memory is needed to store model weights), then the React application will drop the majority of summarisation requests.
I ensured that if the user does have the necessary GPU resources, then they are supported through either Metal or CUDA, key GPU programming frameworks supporting kernel development for accelerating ML tasks (something which I will be talking about at length in my next project).
Autosuggestions with a Trie
This and below were not explicitly mentioned in the system diagram at the beginning of this article, as they are included within the React application (client-side) code.
Autosuggestions are a common feature found in many search tools, including the main Perplexity search bar. There are many different ways to implement autosuggestion, but the simplest and most common is to make use of a Trie.
The Trie is filled with words from a reference sentence database, starting from the root node and including children which are not currently defined (for the given example, the leaf node ten would point to tent with arrow t if we included this from a modified sentence database). Once it has been constructed, we have something like the following:
Here we can determine all of the words which can follow from a given prefix. As we are typing out a word in the search bar, e.g. ‘hell’, the Trie can collect all autosuggestions for a single term, e.g. ‘hello’ or ‘hellenised’ (assuming these words were contained in the training sentences), by tracing the tree from the root. Full sentence autocompletions can also be provided this way, but this was not implemented in this case.
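For readers who want a feel for the data structure, here is a minimal Rust sketch of a Trie; the project’s actual implementation lives in the client-side React code, so this is purely illustrative.
use std::collections::HashMap;

// Each node stores its children keyed by character, plus a flag marking complete words.
#[derive(Default)]
struct TrieNode {
    children: HashMap<char, TrieNode>,
    is_word: bool,
}

impl TrieNode {
    // Insert a word character by character, creating missing children along the way.
    fn insert(&mut self, word: &str) {
        let mut node = self;
        for c in word.chars() {
            node = node.children.entry(c).or_default();
        }
        node.is_word = true;
    }

    // Walk down the prefix; if any character is missing, there are no suggestions.
    fn find(&self, prefix: &str) -> Option<&TrieNode> {
        let mut node = self;
        for c in prefix.chars() {
            node = node.children.get(&c)?;
        }
        Some(node)
    }
}
From the node returned by find, all completions of the prefix can be collected by walking the subtree.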
Conclusion
I would like to provide a live demo of the project, but currently I am having issues with video recording, and there are still a few bugs in the project. It should be publicly available in the following repository if anyone wants to try it out before I fix various bugs: https://github.com/ThomasMTurner/Educate.
There are some other extra features I have yet to mention (in particular how Levenshtein Distance can be used for query typo correction), which I hope to include in the next edition.
Hopefully this was a penetrable and technically enlightening discussion on how search engines work under the hood, and some of you may even be motivated to try making this project for yourselves! Soon, I hope to release Part 2, and discuss other exciting projects that I am currently working on.
All the best,
Complexity.