Ideas for free
Search is broken

Tuesday 24 December 2024 at 12:00 CET

Sometimes, I have ideas, and while it’s unlikely I’ll ever pursue them, I can’t stop thinking about them. So I write them down. I hope to get them out of my head, and ideally, into the head of someone else who might benefit from them. If you want to make this, or even just take one small idea from the pile, please do. It’s yours.


It will come as no surprise to anyone that web search is irrevocably broken.

Google[1] hasn’t worked well since 2021 at the latest. Bing is, well, still a punchline, even if it works. DuckDuckGo is Bing with a privacy guard.

And instead of trying to solve anything, they instead add another layer on top, so we don’t even have to go to the source of the shitpost to learn how to use school glue to keep cheese on your pizza.

It’s over. The SEO scammers won. Honestly, they won the minute that ~~DoubleClick acquired Google~~ Google acquired DoubleClick.

So rather than asking, “how do we fix search?”, we need to be asking, “what does search look like in a world where the web is hostile and will eat you if you make a sound?”

So let’s define the problem

Google started to become useless the moment they forgot that their goal was, and should still be, to send people away.

A search engine is not the destination. At best, it’s a well-placed guidepost. Usually, it’s more akin to unrolling a taped-together map while standing on a hilltop in the rain. No one wants to spend time at a search engine; they want a pointer to the right place so they can move on.

And “search”, the verb? It’s the wrong word. We don’t “search” the web, casting our eyes about until we find what we’re looking for. No, we “navigate”. We use the tools at hand to derive information that’ll get us closer to our destination. And if we don’t know what our destination is, we “discover”, learning more about the world around us until we can answer the question, “where do I want to be?”

There are two problems at hand here, not one, and to imagine that the same tools will solve both is hubris.

Search engines are still (somewhat) good at navigation, but they’re now pretty terrible at discovery. They can help us find the website for something we can name: our bank, or documentation for a tool we use frequently; but they have become truly awful at discovering meaningful information amongst all the slop.

Local-first

The big feature of Google and other search engines is that they crawl and index the entire web (allegedly, though I think they’re not doing as well at this as they used to).

But what if this is a mistake?

The web[2] we have isn’t the same web that existed when search engines started becoming popular. In 1995, there were probably around 10 thousand websites online. In 2000, 10 million. Nowadays, there’s probably over a billion. Many of those newer websites are decent, but I am willing to bet that as the number has increased, the percentage of absolute shite has gone up, and up, and up.

We don’t want to search the entire web any more. Just like living in a megacity, we don’t need to know the whole thing; our neighbourhood is where we spend the most time.

Start at home

What if the crawler was personal?

I mean, not that everyone has their own crawler (though that’d be pretty cool), but that the way the information was exposed was based on our subscriptions?

Many of us subscribe to lots of resources, and most of those resources are, under the hood, people. We follow friends and interesting people on social media, and subscribe to blogs or newsletters.

Why not start there, with the people and organisations we follow, and the stuff they publish? Index the documents of those that opt in, and make them available to those who search, respecting the privacy of the feed by only exposing people to the content they would be able to see by browsing.

Social media is understandable… ish. Both ActivityPub and the AT Protocol (ATProto) are open protocols that can be built upon (and there are others, less open, which we can ignore). We could, in theory, index the posts of everyone who signs up, similarly to Tootfinder.
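To make that slightly more concrete, here’s a rough sketch (in Python, with requests as an arbitrary choice) of how an indexer might pull someone’s public posts out of an ActivityPub outbox. It assumes the “first” and “next” pages are plain URLs and that the server serves public content without authentication, which isn’t true everywhere: some servers require signed fetches, and this ignores that entirely.

```python
# Sketch: pull public posts from an ActivityPub actor's outbox.
# Assumes "first" and "next" are plain URLs and that the server allows
# unauthenticated fetches; servers requiring signed fetches will refuse this.
import requests

ACCEPT = {"Accept": "application/activity+json"}
PUBLIC = "https://www.w3.org/ns/activitystreams#Public"

def fetch_public_posts(actor_url, limit=50):
    actor = requests.get(actor_url, headers=ACCEPT, timeout=10).json()
    outbox = requests.get(actor["outbox"], headers=ACCEPT, timeout=10).json()
    page_url = outbox.get("first")
    posts = []
    while page_url and len(posts) < limit:
        page = requests.get(page_url, headers=ACCEPT, timeout=10).json()
        for activity in page.get("orderedItems", []):
            obj = activity.get("object")
            # Only index Create activities whose object is addressed to Public.
            if (activity.get("type") == "Create" and isinstance(obj, dict)
                    and PUBLIC in obj.get("to", [])):
                posts.append({"id": obj.get("id"), "content": obj.get("content", "")})
        page_url = page.get("next")
    return posts
```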

But what about websites? We could crawl the RSS or Atom feeds of websites that a user subscribes to, and then index the contents. I would personally prefer if this whole thing was opt-in, and feeds get us some way towards this: if a website exposes a feed, it’s doing so because it wants the content to be distributed.
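As a sketch of the boring-but-workable version: parse each subscribed feed and push its entries into a local full-text index. feedparser and SQLite’s FTS5 are illustrative choices here, not requirements, and the feed URL is made up.

```python
# Sketch: index entries from a subscribed feed into a local full-text
# index. feedparser + SQLite FTS5 are illustrative, not prescriptive.
import sqlite3
import feedparser

db = sqlite3.connect("index.db")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(url, title, body)")

def index_feed(feed_url):
    feed = feedparser.parse(feed_url)
    for entry in feed.entries:
        # Prefer full content when the feed provides it; fall back to summary.
        body = entry.content[0].value if "content" in entry else entry.get("summary", "")
        db.execute("INSERT INTO docs VALUES (?, ?, ?)",
                   (entry.get("link", ""), entry.get("title", ""), body))
    db.commit()

index_feed("https://example.com/feed.atom")  # hypothetical feed URL
for url, title in db.execute("SELECT url, title FROM docs WHERE docs MATCH ?", ("search",)):
    print(url, title)
```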

This only gets us so far, though, as typically only the last 10 articles show up in a feed; we can’t go backwards, but at least we could index new content. There are other problems too: for example, many websites only put the first paragraph or so of a document in the feed, don’t publish updates, and sometimes don’t even include formatting or paragraph separators.

As this whole thing is kind of radical, I’d like to propose something a bit more out there: request that website operators implement structured archival. This proposal asks for special feeds, which provide the necessary structure. This makes it very clear that a website is opting into an index (rather than the complete failure of opt-out that is robots.txt), too.
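For a flavour of what this could look like mechanically: RFC 5005 (“Feed Paging and Archiving”) already defines one possible shape, where each archive feed links to the one before it via a rel="prev-archive" link. Assuming a site opts in by publishing archives that way, a crawler can walk the entire history in a short loop (feedparser, again, is just an illustrative choice):

```python
# Sketch: walk an RFC 5005-style archived feed backwards through its
# rel="prev-archive" links. Assumes the site has opted in by publishing
# such archives; feedparser is an illustrative choice.
import feedparser

def walk_archives(feed_url):
    while feed_url:
        feed = feedparser.parse(feed_url)
        yield from feed.entries
        # Follow the prev-archive link, if present, to older entries.
        feed_url = next((link.href for link in feed.feed.get("links", [])
                         if link.get("rel") == "prev-archive"), None)
```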

And then go further

Searching everything ever posted on the websites I subscribe to sounds… nice. Not that useful though, until we can go further.

Of course, the people I follow… read things too, and they share links, and republish (“boost”) posts by others. They might follow others too, though it’s up to them whether they disclose this information.

So: we have some indication of the other stuff that those people share. Let’s index that too (if permitted). We could weight the information not just by relevance, but also by a distance score: how many hops to get there, how many routes there are, etc.
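To make “distance score” slightly less hand-wavy, here’s a toy formula. The half-per-hop decay and the small bonus per extra route are numbers I made up purely for illustration:

```python
# Toy scoring sketch: text relevance, dampened by social distance and
# nudged upwards when several independent routes lead to the same thing.
# The 0.5 decay and 0.1 bonus are arbitrary illustrative constants.
def score(relevance, hops, routes):
    distance_decay = 0.5 ** hops             # each hop away halves the weight
    corroboration = 1 + 0.1 * (routes - 1)   # extra routes add a small boost
    return relevance * distance_decay * corroboration
```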

The same applies to blogs: authors often follow other blogs, and so those could be exposed somehow. It could be a .well-known text file, an XML document linked from the home page, or something else. If we go down the ActivityPub route for archival, it doesn’t seem that far-fetched for a blog to “follow” other blogs, just like people follow other people.
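As a sketch of the simplest possible version: a plain-text file at a hypothetical .well-known path, one feed URL per line. Both the path and the format here are entirely invented for illustration; nothing like this is standardised (yet).

```python
# Sketch: discover a blog's own subscriptions from a hypothetical
# /.well-known/blogroll file (one feed URL per line, # for comments).
# Both the path and the format are invented for illustration.
import requests
from urllib.parse import urljoin

def fetch_blogroll(site_url):
    resp = requests.get(urljoin(site_url, "/.well-known/blogroll"), timeout=10)
    if resp.status_code != 200:
        return []
    return [line.strip() for line in resp.text.splitlines()
            if line.strip() and not line.startswith("#")]
```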

The wider web is pretty much up for grabs

Once we leave social media, we probably have to stop thinking too hard about consent. There’s a difference between capturing a small thought that someone put out on the internet, and a larger document, especially when it’s in the public interest, such as a news story or information about a company’s board. The latter needs to be discoverable.

It’s inevitable that a discovery index needs to also scrape content (at least text) from arbitrary websites, as long as they are pertinent (i.e. accessible through the social graph). Which means that we can’t just rely on nicely-structured documents: we also need to parse the HTML (and, often, evaluate the JavaScript) to figure out what the content is. Unpleasant, but necessary.
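A hedged sketch of that unpleasant-but-necessary bit, using BeautifulSoup as an arbitrary choice (and punting entirely on JavaScript evaluation, which real pages will often demand):

```python
# Sketch: extract readable text from an arbitrary page. Skips JavaScript
# evaluation entirely, which many real pages will need.
import requests
from bs4 import BeautifulSoup

def extract_text(url):
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    # Throw away the obvious non-content elements.
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()
    # Prefer <article> or <main> when the page has one.
    root = soup.find("article") or soup.find("main") or soup.body or soup
    return " ".join(root.get_text(separator=" ").split())
```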

A note about privacy

I am not a privacy or security expert. Everything in this document should be considered suspect with regards to these incredibly important topics. I strongly recommend that anyone working on a modern search/discovery engine is well-versed in privacy and all the ways in which modern, connected technology fails minorities and mistreated groups, often with catastrophic results.

This idea is free as in birds

If you like this idea, it’s yours. While I’d be happy to discuss it with you (and please get in touch!), and there’s even a non-zero chance I might get on board, it’s more likely I’ll wish you the best of luck, help a little when I can, and tell everyone I know how wonderful you are.

You can read more ridiculous ideas by browsing the series:

  1. Starting from scratch
  2. Structured archival, and the web as it once was
  3. Search is broken

A big thanks to Irene Knapp (irenes) and danny mcClanahan for reviewing this, providing valuable feedback, and contributing many ideas on top of my own.

Footnotes

  1. Other search engines do exist, but I prefer to refer to the monopoly player by name rather than dancing around it.

  2. I refuse to capitalise “web” or “internet” because they belong to everyone, like “cheese”[3]. Additionally, this language is my language and I shall do as I please.

  3. I subscribe to the theory that there is a single cheese and we simply curate, package, and devour parts of it.


If you enjoyed this post, you can subscribe to this blog using Atom.

Maybe you have something to say. You can email me or toot at me. I love feedback. I also love gigantic compliments, so please send those too.

Please feel free to share this on any and all good social networks.

This article is licensed under the Creative Commons Attribution 4.0 International Public License (CC-BY-4.0).