What is Information Retrieval?
Information Retrieval (IR), commonly known as search, is the process of obtaining relevant information from a collection of documents (e.g. web pages, images, videos, etc.), often in response to a user query ("search term"). It is a fundamental area in computer science and information science that focuses on the organization, storage, retrieval, and evaluation of information.
Search engines, recommendation systems, digital libraries, healthcare, e-commerce, and more all rely heavily on information retrieval.
The underlying requirement or desire for specific knowledge that prompts a search is called an information need or search intent. For instance, a user might want to know "the best restaurants nearby" or "current weather conditions," but users don't always state that need directly in the query. Interpreting what they actually mean is a big topic in IR called query understanding.
On a very high level, you can build an information retrieval system, or a search engine, by indexing raw data and documents into a corpus. When the user has a query, you then use IR techniques to retrieve the most relevant documents from the corpus.
Of course, there is a lot more to it than that -- and more importantly, large language models (LLMs) have made significant advancements in IR and search.
We can break it down here to give you the basic intuition of how IR works, and then how LLMs have changed the game.
Queries
Queries are how users articulate what they are looking for. For example, a user looking for pizza places might submit the query "best pizza restaurants near me."
Because users can submit queries in natural language, queries come in many forms.
When Google first started, users could enter simple keywords or phrases (e.g., "weather tomorrow"), combine terms with logical operators like AND, OR, NOT (e.g., "cats AND dogs"), and use quotes to match exact phrases (e.g., "cat pictures").
There is even a term for advanced Google searching: "Google-fu". This can include filters, ranges, or advanced operators (e.g., "filetype:pdf" to restrict results to PDF files).
Even before LLMs, users could use more complex queries, like full sentences or questions (e.g., "What is the tallest building in the world?"), but they were limited by the available features of the search engine.
The same information need can be expressed in multiple ways. For example, "cheap laptops" and "affordable notebooks" may lead to similar results. Queries may sometimes be vague or ambiguous. For instance, "jaguar" could refer to a car, an animal, or a software name.
To tackle all these problems, queries often undergo preprocessing to improve retrieval performance. Some common steps include:
- Tokenization: Breaking the query into individual terms (e.g., splitting "best cars 2023" into "best," "cars," and "2023").
- Stemming/Lemmatization: Reducing words to their root form (e.g., "running" becomes "run").
- Stopword Removal: Eliminating common words like "and," "the," and "of" that may not contribute to the meaning.
- Expansion: Enhancing the query with synonyms or related terms to broaden its scope.
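Here is a minimal sketch of the first three steps in Python, assuming the NLTK library is installed and its tokenizer, stopword, and stemmer data have been downloaded (expansion is omitted; in practice it often relies on a synonym dictionary or embedding neighbours):

```python
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def preprocess(query: str) -> list[str]:
    # Tokenization: break the query into individual terms.
    tokens = word_tokenize(query.lower())
    # Stopword removal: drop common words that add little meaning.
    stop_words = set(stopwords.words("english"))
    tokens = [t for t in tokens if t.isalnum() and t not in stop_words]
    # Stemming: reduce each word to its root form.
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]

print(preprocess("The best running shoes of 2023"))
# → ['best', 'run', 'shoe', '2023']
```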
Understanding the context of the query is therefore key to improving retrieval accuracy. For example, user location, search history, and preferences can refine the results for "restaurants near me."
The Search Index
Well, to return results for a query, you need to have a way to store and retrieve documents.
A corpus is the collection of data or documents that an Information Retrieval system searches to retrieve relevant results. It serves as the repository of information from which answers to user queries are drawn.
A corpus can consist of text, images, audio, video, or a combination of these. For example, web pages for a search engine, academic papers for a digital library, or customer reviews for an e-commerce site.
The size and diversity of a corpus depend on its application. A specialized corpus is focused on a specific domain (e.g., medical literature). A general corpus covers a wide range of topics (e.g., Wikipedia).
Documents in the corpus are often preprocessed to enhance retrieval performance. This includes tokenization, stemming, and stopword removal, matching the query preprocessing steps mentioned earlier.
Indexing is the process of organizing and structuring the documents in a corpus to enable fast and efficient retrieval. It creates a data structure, often an inverted index: essentially a dictionary that maps each term to the documents in which it appears.
"apple" → Doc1, Doc3, Doc7
"banana" → Doc2, Doc5
Indexes can grow large, so compression techniques (e.g., delta encoding, variable-length coding) are used to reduce their size while maintaining speed.
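For instance, delta encoding stores the gaps between sorted document IDs rather than the IDs themselves; small gaps can then be stored in fewer bits using a variable-length code. A quick sketch:

```python
from itertools import accumulate

# Posting list for one term: sorted document IDs.
doc_ids = [3, 7, 21, 22, 40]

# Delta encoding: keep the first ID, then store gaps between neighbours.
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
print(gaps)  # → [3, 4, 14, 1, 18]

# Decoding is a running sum over the gaps.
print(list(accumulate(gaps)))  # → [3, 7, 21, 22, 40]
```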
Results and Search Quality
The quality of search results is a critical aspect of search engines. It is influenced by various factors, including:
- Relevance: Are the best results at the top?
- Precision: Out of all the results the system showed, how many were actually useful?
- Recall: Did the system miss anything important?
A popular metric is the F1 score, the harmonic mean of precision and recall. In fact, F1 is a useful evaluation metric for many AI tasks, even generative AI.
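As a quick sketch, here is how precision, recall, and F1 could be computed for a single query, given the set of retrieved results and the set of truly relevant documents (the document IDs are made up):

```python
def precision_recall_f1(retrieved: set, relevant: set):
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 3 results shown, 1 was relevant; 2 relevant documents existed in total.
print(precision_recall_f1({"Doc1", "Doc2", "Doc3"}, {"Doc1", "Doc4"}))
# → (0.333..., 0.5, 0.4)
```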
But notice how search quality is a bit different from other AI tasks: you are optimising a list of results, not just a single one. This problem is called ranking.
Ranking is all about deciding which search results to show, and in what order, when someone types in a query. Imagine you're searching for "best coffee shops near me." Ranking determines which coffee shops make the list, because the result ranked first gets far more attention than the one ranked twentieth.
If the search engine shows you irrelevant or low-quality results at the top, you'll get frustrated and might stop using it.
To do that, the system assigns a "score" to each result based on things like:
- How closely the result matches your query (e.g., “coffee shop” is in the title).
- How often people click on that result for similar searches.
- Whether the result is recent (because nobody wants outdated info).
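A hypothetical scoring function might blend these signals as a weighted sum. The signal names and weights below are illustrative assumptions, not any real engine's formula:

```python
def score(result: dict) -> float:
    return (0.5 * result["text_match"]    # how well the text matches the query (0..1)
            + 0.3 * result["click_rate"]  # historical click-through rate (0..1)
            + 0.2 * result["freshness"])  # recency of the result (0..1)

results = [
    {"name": "Cafe A", "text_match": 0.9, "click_rate": 0.4, "freshness": 0.8},
    {"name": "Cafe B", "text_match": 0.6, "click_rate": 0.9, "freshness": 0.5},
]
for r in sorted(results, key=score, reverse=True):
    print(r["name"], round(score(r), 2))  # → Cafe A 0.73, Cafe B 0.67
```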
Search quality evaluation is how we check whether the search system is actually doing its job well. After all, just because the search engine gives you a result doesn't mean it's the right one.
This might all sound niche and old school. But even if you haven't heard of these concepts, I am sure you have heard of the following terms that are much more popular today: vector databases, RAG, and embedding models.
Let's dive in.
Vector databases
Vector databases are the search indexes of the LLM era.
As search engine builders evaluate the quality of their systems, they often struggle to capture the nuanced semantics of language, which limits search accuracy and relevance.
More specifically, they face the following challenges:
- Semantic Understanding: Traditional IR systems often struggled to grasp the meaning behind words. For instance, they might not recognize that "car" and "automobile" refer to the same concept, leading to less relevant search results.
- Contextual Limitations: These systems typically treated words independently, ignoring the context in which they appeared. This approach could miss the subtleties of language, such as sarcasm or idiomatic expressions.
- High Dimensionality: As datasets grew, the dimensionality of data increased, making it computationally intensive to process and retrieve relevant information efficiently.
- Scalability Issues: Handling vast amounts of unstructured data, like images, audio, and video, posed significant challenges for traditional IR systems, which were primarily designed for structured text data.
To address these challenges, the field has seen a shift towards vector-based approaches.
It's quite hard to explain how vector databases help, so here are some analogies:
A traditional search index is like a basic map that shows roads and landmarks. It gives you a general idea of locations but lacks detail.
A vector database is like a detailed map that shows roads, landmarks, and even the exact location of each point of interest. It enables much more nuanced and context-aware retrieval of information.
This makes sense: as you may recall, when documents are indexed, they go through a tokenization process, which loses key context like ordering, relationships, and semantic meaning.
In a vector database, the whole document is instead 'squeezed' into a vector: a list of numbers that represents the document.
Mathematically, this 'squeezing' is a lot like capturing all the complexities of a real life area on earth into a map.
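To make this concrete, here is a minimal sketch of turning text into a vector, assuming the sentence-transformers library and its all-MiniLM-L6-v2 model:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vector = model.encode("A beautiful photo of a sunset over a mountain range.")
print(vector.shape)  # → (384,): the whole document as a list of 384 numbers
```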
So instead of indexing documents into an inverted index as tokens, which look like this:
"apple" → Doc1, Doc3, Doc7
"banana" → Doc2, Doc5
You index the whole document into a vector, which looks like this:
[0.1, 0.2, 0.3] → Doc1
[0.4, 0.5, 0.6] → Doc2
With vector databases, when you perform a search, the query is also converted into a vector.
- Query to Vector: Your query "sunset over the mountains" is converted into a vector, such as [0.7, 0.3, 0.1].
- Document Vectors: Each document in the database has already been converted into its own vector, which encodes its conceptual meaning:
  [0.65, 0.35, 0.15] → Doc1: "A beautiful photo of a sunset over a mountain range."
  [0.9, 0.2, 0.0] → Doc2: "Tips for climbing Mount Everest."
- Similarity Search: The system calculates how similar your query vector is to each document vector, using mathematical techniques like cosine similarity, and ranks documents based on these scores:
  - Doc1 (very similar) would rank high.
  - Doc2 (less similar) would rank lower or might not even appear.
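Here is a minimal sketch of this similarity search over the toy vectors above, in plain Python with no external libraries:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product divided by the product of vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [0.7, 0.3, 0.1]
docs = {
    "Doc1": [0.65, 0.35, 0.15],  # sunset over a mountain range
    "Doc2": [0.9, 0.2, 0.0],     # climbing Mount Everest
}

# Rank documents by similarity to the query vector.
for doc_id, vec in sorted(docs.items(), key=lambda kv: cosine(query, kv[1]), reverse=True):
    print(doc_id, round(cosine(query, vec), 3))
# → Doc1 0.994, Doc2 0.974
```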
This retrieval step can then be combined with a large language model that generates a response based on the retrieved documents, a technique called Retrieval Augmented Generation (RAG). Learn more about RAG in this article.
Optimising a search engine
Evaluating and iterating on a vector search system involves assessing its performance, identifying areas for improvement, and implementing changes to optimize results.
Metrics
There are a few ways to measure the performance of a search engine. Precision, recall, and F1 score measure the relevance of the results, while NDCG (Normalized Discounted Cumulative Gain) measures the quality of their ordering.
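As a sketch, NDCG compares the discounted gain of the returned ordering against the ideal ordering, assuming graded relevance labels (e.g., 0 = irrelevant, 2 = highly relevant):

```python
import math

def dcg(relevances: list[float]) -> float:
    # Each position's gain is discounted by log2 of its rank.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances: list[float]) -> float:
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# Relevance labels of the returned results, in the order they were shown:
print(round(ndcg([2, 0, 1]), 2))  # → 0.95 (good, but not the ideal order [2, 1, 0])
```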
Click-through rate measures how often users interact with retrieved results. You can also track engagement, such as time spent interacting with results or satisfaction scores from user feedback.
Benchmarking and user behavior analysis
Develop a dataset representative of the queries and data your system will handle. Include user queries and expected results as a ground truth. Allow users to rate or flag results as relevant or irrelevant.
Analyze patterns in queries and click-through data to identify strengths and weaknesses. Study multi-step sessions where users refine their search; this can indicate that the initial query failed to retrieve relevant results.
Iteration
Vector search relies heavily on the quality of embeddings. Improve them by fine-tuning an embedding model on your specific dataset to make it more domain-relevant. You can also add signals from other sources, incorporating metadata or external knowledge (e.g., user history, timestamps) to enrich embeddings.
Combine vector similarity with traditional search (this is called hybrid search) to balance semantic relevance and exact matches.
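A hypothetical hybrid scorer might blend the two scores with a tunable weight; the names and the 0.5 default below are illustrative assumptions, not a standard formula:

```python
def hybrid_score(keyword_score: float, vector_score: float, alpha: float = 0.5) -> float:
    # alpha = 1.0 is pure keyword search; alpha = 0.0 is pure vector search.
    return alpha * keyword_score + (1 - alpha) * vector_score

# A document that matches the exact phrase but is semantically so-so:
print(hybrid_score(keyword_score=0.9, vector_score=0.5))  # → 0.7
```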
Conclusion
While support for search tasks is coming soon in Datograde, the key principles of observe, evaluate and optimise are still the same.
Supporting developers to optimise their search engines, vector databases, and retrieval augmented generation is on our roadmap - watch this space!