Using Vector Embeddings to Overengineer 404 pages

After spending a significant amount of time working with vector embeddings, I've started to see more and more use cases for them in everyday problems. One of the most interesting ones I've seen is using vector embeddings to find the page that the user was looking for when they hit a 404 page.

What is a vector embedding?

A vector embedding is a way to represent a word or phrase as a vector of numbers. This is useful because it allows us to do math on words and phrases. For example, we can find the word closest in meaning to another word by finding the word whose vector has the smallest distance to it.
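
To make "math on words" concrete, here is a minimal sketch with made-up three-dimensional vectors. The numbers and the helper names (toyVectors, euclideanDistance, closestWord) are invented for illustration; real embedding models produce vectors with hundreds of dimensions.

```typescript
// Toy "embeddings", invented purely for illustration.
const toyVectors: Record<string, number[]> = {
  cat: [0.9, 0.1, 0.0],
  kitten: [0.85, 0.2, 0.05],
  car: [0.1, 0.9, 0.3],
};

// Euclidean distance: the square root of the summed squared differences.
const euclideanDistance = (a: number[], b: number[]): number => {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }

  return Math.sqrt(a.reduce((sum, value, index) => sum + (value - b[index]) ** 2, 0));
};

// Returns the word (other than the target) whose vector is closest to the target's vector.
const closestWord = (target: string): string => {
  if (!(target in toyVectors)) {
    throw new Error('Unknown word: ' + target);
  }

  let best = '';
  let bestDistance = Infinity;

  for (const word of Object.keys(toyVectors)) {
    if (word === target) {
      continue;
    }

    const distance = euclideanDistance(toyVectors[target], toyVectors[word]);

    if (distance < bestDistance) {
      best = word;
      bestDistance = distance;
    }
  }

  return best;
};

console.log(closestWord('cat')); // prints "kitten"
```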

The Financial Times has a great interactive article that explains vector embeddings in more detail.

How can we use vector embeddings to find the page that the user was looking for?

We can find the page that the user was looking for by finding the page whose vector has the smallest distance to the vector of the user's query. In the context of a 404 page, the user's query is the URL that they were trying to access.

It is surprisingly simple:

  1. we need to create a database of all the pages on our site
  2. we need to create a vector embedding for each page URL
  3. we need to create a vector embedding for the user's query
  4. we need to find the page with the smallest distance between the vector of the page and the vector of the user's query

In the case of AIMD, I am doing all of this in memory, but you could also do it in a database (e.g. Pinecone). It all depends on how much data you have and how much compute you have available.

Deciding on a vector embedding model

The first step is to decide on a vector embedding model. I am using Supabase/gte-small because it is a small model and it outperforms OpenAI's text-embedding-ada-002 model.

I wrote this abstraction that creates a vector embedding for a given text:

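import { pipeline } from '@xenova/transformers';

const loadExtractor = () => pipeline('feature-extraction', 'Supabase/gte-small');

// Loading the model is expensive, so the pipeline is created once and then reused.
let extractor: ReturnType<typeof loadExtractor> | undefined;

export const generateEmbedding = async (subject: string): Promise<number[]> => {
  extractor ??= loadExtractor();

  const extract = await extractor;

  // Mean pooling collapses the per-token vectors into a single vector for the whole text.
  const result = await extract(subject, {
    normalize: true,
    pooling: 'mean',
  });

  if (result.type === 'float32') {
    return Array.from(result.data) as number[];
  }

  throw new Error('Expected embedding type to be float32');
};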
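
As far as I understand it, the normalize option scales each embedding to unit length. One practical consequence is that the cosine similarity of two normalized vectors is simply their dot product. The sketch below checks this with two made-up vectors; the helper names (toUnitLength, dotProduct, cosineSimilarity) are hypothetical and not part of any library.

```typescript
// Scales a vector to unit length (L2 normalization).
const toUnitLength = (vector: number[]): number[] => {
  const magnitude = Math.sqrt(vector.reduce((sum, value) => sum + value * value, 0));

  if (magnitude === 0) {
    throw new Error('Cannot normalize a zero vector');
  }

  return vector.map((value) => value / magnitude);
};

const dotProduct = (a: number[], b: number[]): number => {
  return a.reduce((sum, value, index) => sum + value * b[index], 0);
};

// Full cosine similarity: the dot product divided by the product of the magnitudes.
const cosineSimilarity = (a: number[], b: number[]): number => {
  return dotProduct(a, b) / (Math.sqrt(dotProduct(a, a)) * Math.sqrt(dotProduct(b, b)));
};

// Made-up vectors, both with magnitude 5.
const vectorA = [3, 4];
const vectorB = [4, 3];

// Both lines print approximately 0.96 (24 / 25).
console.log(cosineSimilarity(vectorA, vectorB));
console.log(dotProduct(toUnitLength(vectorA), toUnitLength(vectorB)));
```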

Creating a database of all the pages on our site

The next step is to create a database of all the pages on our site.

Let's assume that we have an array of all the pages on our site:

type SitemapEntry = {
  loc: string;
};

const staticPages: SitemapEntry[] = [
  {
    loc: 'https://aimd.app/',
  },
  {
    loc: 'https://aimd.app/blog',
  },
  {
    loc: 'https://aimd.app/blog/2024-01-15-top-seo-trends-for-2024-what-should-you-focus-on',
  },
  {
    loc: 'https://aimd.app/blog/2024-01-07-maximizing-article-visibility-understanding-and-applying-e-e-a-t-in-seo',
  },
  // ...
];

We can then create a database of all the pages on our site by creating a vector embedding for each page URL:

type Metadata = Record<string, string | number | boolean>;

type DatabaseEntry = {
  metadata: Metadata;
  vector: number[];
};

const entries: DatabaseEntry[] = [];

for (const page of staticPages) {
  const vector = await generateEmbedding(page.loc);

  entries.push({
    metadata: {
      url: page.loc,
    },
    vector,
  });
}

entries is now a database of all the pages on our site.

Finding the page with the smallest distance between the vector of the page and the vector of the user's query

The last step is to find the page with the smallest distance between the vector of the page and the vector of the user's query.

Let's assume that we have a user's query:

const query = 'https://aimd.app/blog/2024-01-17-using-vector-embeddings-to-overengineer-404-pages';

First, we need to create a vector embedding for the user's query:

const queryVector = await generateEmbedding(query);

Then, we need a way to measure how close two vectors are. For this, we can use cosine similarity, where a higher score means a closer match:

import similarity from 'compute-cosine-similarity';

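Finally, we can find the page whose vector is the most similar to the vector of the user's query:

type ClosestEntry = {
  entry: DatabaseEntry | null;
  score: number;
};

// Cosine similarity is higher for closer vectors, so we keep the entry with the highest score.
const closestEntry = entries.reduce<ClosestEntry>((closest, entry) => {
  const score = similarity(queryVector, entry.vector);

  // Skip entries for which no score could be computed.
  if (score !== null && score > closest.score) {
    return {
      entry,
      score,
    };
  }

  return closest;
}, {
  entry: null,
  score: -Infinity,
});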

closestEntry.entry is now the page that has the most similar URL to the page the user was looking for.

The best part is that the match does not even need to be the exact page that the user was looking for, e.g. if that page was removed. It will be whichever page has the most similar URL to the one the user requested.

Using with Remix

Just to complete the example, here is how you would use this with Remix:

// app/routes/$.tsx

import { type MetaFunction } from '@remix-run/node';
import { Link, useLoaderData } from '@remix-run/react';
import { json, type LoaderFunctionArgs } from '@remix-run/server-runtime';
import { findNearestUrl } from '#app/services/sitemap.server';

export const meta: MetaFunction = () => {
  return [
    {
      title: '404',
    },
  ];
};

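export const loader = async ({ request }: LoaderFunctionArgs) => {
  const nearestUrl = await findNearestUrl(request.url);

  // Keep the 404 status so that browsers and crawlers know the requested page does not exist.
  return json(
    {
      nearestUrl,
    },
    {
      status: 404,
    },
  );
};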

const Route = () => {
  const data = useLoaderData<typeof loader>();

  return (
    <div>
      <h1>404</h1>
      <p>Were you looking for <Link to={data.nearestUrl.url}>{data.nearestUrl.url}</Link>?</p>
    </div>
  );
};

export default Route;

Now your 404 page will suggest the page that the user was most likely looking for.
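
The route above imports findNearestUrl from #app/services/sitemap.server, but this post does not show that module. Here is a minimal sketch of what it could look like, assuming a database built as described earlier. The names Embedder, findNearestEntry and createFindNearestUrl are hypothetical, and cosine similarity is written out by hand (instead of using compute-cosine-similarity) only to keep the sketch free of dependencies.

```typescript
type Metadata = Record<string, string | number | boolean>;

type DatabaseEntry = {
  metadata: Metadata;
  vector: number[];
};

// Any function that turns text into a vector, e.g. generateEmbedding from earlier.
type Embedder = (text: string) => Promise<number[]>;

// Cosine similarity written out by hand so that the sketch has no dependencies.
const cosineSimilarity = (a: number[], b: number[]): number => {
  let dot = 0;
  let normA = 0;
  let normB = 0;

  for (let index = 0; index < a.length; index++) {
    dot += a[index] * b[index];
    normA += a[index] * a[index];
    normB += b[index] * b[index];
  }

  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
};

// Returns the entry most similar to the query vector, or null for an empty database.
const findNearestEntry = (queryVector: number[], entries: DatabaseEntry[]): DatabaseEntry | null => {
  let best: DatabaseEntry | null = null;
  let bestScore = -Infinity;

  for (const entry of entries) {
    const score = cosineSimilarity(queryVector, entry.vector);

    if (score > bestScore) {
      best = entry;
      bestScore = score;
    }
  }

  return best;
};

// Hypothetical factory: binds an embedder and a prebuilt database
// into a function with the shape that the route calls.
const createFindNearestUrl = (embed: Embedder, entries: DatabaseEntry[]) => {
  return async (requestUrl: string): Promise<{ url: string } | null> => {
    const nearest = findNearestEntry(await embed(requestUrl), entries);

    return nearest ? { url: String(nearest.metadata.url) } : null;
  };
};
```

Passing generateEmbedding and entries to createFindNearestUrl would produce a function with the shape the route expects. Because this sketch returns null for an empty database, the route would need to handle that case before reading nearestUrl.url.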

Examples of this in the wild

Here is how that looks once deployed:

Even though none of these pages exist, they all produce a 404 page that links to the correct page.

In practice, this type of hint will be most useful for pages that were removed or renamed. For example, I have accidentally introduced numerous 404s on this site by changing the dates of the posts.

Do we even need 404 pages?

This is a bit of a tangent, but I think it is worth mentioning that we might not even need 404 pages. Instead, we could just redirect the user to the page that they were looking for.

Realistically, the only reason we have 404 pages is that we don't know what the user was looking for. But if we can use vector embeddings to find the page that the user was looking for, then we can just redirect them to that page.
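
One caveat is that the nearest page is not always a good match, so a blind redirect could send users somewhere unrelated. A possible middle ground, sketched below, is to redirect only when the similarity is above a cut-off and to fall back to a suggestion otherwise. The Match and Decision types, the decide helper and the 0.9 threshold are all assumptions made for illustration, not something AIMD does today.

```typescript
type Match = {
  url: string;
  // Cosine similarity between the requested URL and the matched URL, from -1 to 1.
  similarity: number;
};

type Decision =
  | { kind: 'redirect'; to: string }
  | { kind: 'suggest'; to: string }
  | { kind: 'not-found' };

// Arbitrary cut-off chosen for illustration; a real value would need tuning per site.
const REDIRECT_THRESHOLD = 0.9;

const decide = (match: Match | null, threshold: number = REDIRECT_THRESHOLD): Decision => {
  // Nothing in the database: show a plain 404 page.
  if (match === null) {
    return { kind: 'not-found' };
  }

  // Confident match: send the user straight to the page.
  if (match.similarity >= threshold) {
    return { kind: 'redirect', to: match.url };
  }

  // Weak match: keep the 404 page, but suggest the closest page.
  return { kind: 'suggest', to: match.url };
};

console.log(decide({ url: '/blog', similarity: 0.95 })); // { kind: 'redirect', to: '/blog' }
console.log(decide({ url: '/blog', similarity: 0.5 })); // { kind: 'suggest', to: '/blog' }
console.log(decide(null)); // { kind: 'not-found' }
```

In a Remix loader, the redirect branch could return redirect(decision.to) using the redirect helper from @remix-run/node, while the other two branches would keep the 404 status.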

I will be experimenting with this on AIMD in the future.

Why not just use Levenshtein distance?

You might be wondering why not just use Levenshtein distance to find the page that the user was looking for.

The reason is that Levenshtein distance is only going to be useful for typos, whereas vector embeddings go further and find the pages that are the most semantically similar. For example, suppose a user was looking for a URL such as https://wagsandwiggles.com/the-importance-of-keeping-your-dogs-nails-at-the-appropriate-length/ (The Importance of Keeping Your Dog's Nails at the Appropriate Length), but the article (and its URL) was changed to https://wagsandwiggles.com/paw-care-essentials-the-impact-of-nail-length-on-your-dog-s-health/ (Paw Care Essentials: The Impact of Nail Length on Your Dog's Health). Vector embeddings will still be able to find the correct page.
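
To illustrate the difference, here is a minimal sketch of Levenshtein distance applied to the two slugs above. The levenshtein helper is a plain textbook implementation, and typoSlug is a made-up misspelling added for comparison.

```typescript
// Dynamic-programming Levenshtein distance that keeps only one previous row in memory.
const levenshtein = (a: string, b: string): number => {
  // Distance from the empty prefix of `a` to every prefix of `b`.
  let previous: number[] = [];

  for (let j = 0; j <= b.length; j++) {
    previous.push(j);
  }

  for (let i = 1; i <= a.length; i++) {
    const current: number[] = [i];

    for (let j = 1; j <= b.length; j++) {
      const substitutionCost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;

      current.push(Math.min(
        previous[j] + 1, // deletion
        current[j - 1] + 1, // insertion
        previous[j - 1] + substitutionCost, // substitution
      ));
    }

    previous = current;
  }

  return previous[b.length];
};

const oldSlug = 'the-importance-of-keeping-your-dogs-nails-at-the-appropriate-length';
const newSlug = 'paw-care-essentials-the-impact-of-nail-length-on-your-dog-s-health';
// Made-up typo: "importence" instead of "importance".
const typoSlug = 'the-importence-of-keeping-your-dogs-nails-at-the-appropriate-length';

console.log(levenshtein(oldSlug, typoSlug)); // 1, a single substituted character
console.log(levenshtein(oldSlug, newSlug)); // a far larger number of edits
```

A typo is a single edit away from the real slug, but the renamed article is so many edits away that edit distance alone would struggle to tell it apart from an unrelated page.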


And that's really it! I hope you found this useful. If you have any questions, feel free to reach out to me on Twitter.