How Yakread's RSS recommendation algorithm works
Welcome to my weekly newsletter about practical ways to make the Internet better, focused on my own work in that space. I'm Jacob O'Bryant.
As previously discussed, I am currently in the middle of overhauling Yakread's user experience. The plan is to have the default experience be completely algorithmically driven, the same way it is on YouTube or TikTok, but for articles instead of videos. You'll still be able to add your own newsletter/RSS subscriptions and bookmarks; it just won't be required, and it won't be emphasized in the onboarding flow for new users. The home page will look like this:
I spent the past week working on the core of that change: Yakread's recommendation algorithm. Yakread has had algorithmic recommendations for a while, but they've been a bit of an afterthought—a fallback plan for new users who don't actually add their own subscriptions. If the algorithmic feed is going to become the main attraction, that algorithm better be good.
What follows is a description of how the not-yet-released version of Yakread works, which will hopefully become public within two weeks (🤞).
Yakread sources RSS feeds from users' subscriptions and bookmarks. Any RSS feed that Yakread becomes aware of goes into a pool of feeds that could potentially be recommended. Moderation works on a per-feed basis: whenever a new feed enters that pool, it shows up in an admin dashboard on my end, and it won't get recommended unless I approve it. Besides helping to keep the recommendation quality high, this also ensures that Yakread never recommends a private RSS feed (say, for a user's bookmarks on another service like Pinboard). I also need to start checking robots.txt or something to ensure that people have a way to opt out of being recommended.
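To make that concrete, here's a minimal sketch of the per-feed approval gate (the field names and structure are purely illustrative, not Yakread's actual schema):

```python
# Illustrative sketch of the feed pool and moderation gate.
# Field names here are hypothetical, not Yakread's real schema.
feed_pool = [
    {"url": "https://example.com/feed.xml", "approved": True},
    {"url": "https://private.example.com/bookmarks.rss", "approved": False},
]

def recommendable_feeds(pool):
    """Only feeds that have been explicitly approved can ever be recommended."""
    return [feed for feed in pool if feed["approved"]]
```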
Yakread looks at three different signals to measure your preferences: opens (i.e. when you click on an article on the home page), likes, and dislikes. Yakread aggregates those signals per-feed, not per-article: if you open three separate articles from my website's RSS feed and hit the dislike button on two of them, that all counts towards your preference for my website in general, not any individual article (at least not at this stage of the recommendation pipeline—more on that below).
Specifically, each signal gets a certain point value. I've somewhat arbitrarily set them as 1 point for an open, 5 points for a like, and -5 points for a dislike. In the previous example, your overall preference for my website would be measured as 3 opens + 2 dislikes = (3 × 1) + (2 × −5) = −7 points.
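As a tiny code sketch of that scoring (using the point values above; the function and event names are just for illustration):

```python
# Point values per signal, as described above (somewhat arbitrary).
POINTS = {"open": 1, "like": 5, "dislike": -5}

def feed_rating(events):
    """Aggregate one user's signals for one feed into a single rating.

    `events` is a list of signal names, e.g. the example above would be
    ["open", "open", "open", "dislike", "dislike"].
    """
    return sum(POINTS[event] for event in events)

print(feed_rating(["open"] * 3 + ["dislike"] * 2))  # => -7
```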
Periodically, Yakread makes a big list of all the user-feed preferences ("ratings") that have accumulated so far (about 22k ratings as of writing) and loads it into an off-the-shelf recommendation tool (Apache Spark's MLlib, with the "implicit preferences" option enabled). That tool ("library", in programmer parlance) does the complicated bit of taking all the ratings and turning them into a mathematical model which can be used to predict what other feeds a particular user might like. It gives you the ability to ask "what is the probability that Alice will like the articles on Joe's website?" and get an answer like "0.782" in response. The more data you accumulate, the more accurate the answers become.
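For the curious, here's roughly what that step looks like in PySpark (Yakread itself isn't written in Python, so treat this as a sketch of the equivalent MLlib call with made-up data and hyperparameters, not the actual code):

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("feed-recs").getOrCreate()

# One row per (user, feed) pair: the aggregated point total from the previous step.
ratings = spark.createDataFrame(
    [(0, 0, 3.0), (0, 1, -7.0), (1, 0, 6.0), (1, 2, 1.0)],
    ["user", "feed", "rating"],
)

als = ALS(
    userCol="user",
    itemCol="feed",
    ratingCol="rating",
    implicitPrefs=True,  # the "implicit preferences" option mentioned above
    rank=10,             # illustrative hyperparameters
    maxIter=10,
    regParam=0.1,
)
model = als.fit(ratings)

# Score user/feed pairs the user hasn't interacted with yet.
candidates = spark.createDataFrame([(0, 2), (1, 1)], ["user", "feed"])
model.transform(candidates).show()  # adds a "prediction" column
```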
That alone isn't quite all we need to populate Yakread's home page with good articles, though. We could naively select whichever RSS feeds have the highest predicted rating for the current user (and then show some recent articles from those feeds), but we'd quickly run into a problem known as popularity bias, also known as "the rich get richer." A few RSS feeds with broad appeal would tend to get the most clicks initially, which would cause them to get recommended more often, which would lead to even more clicks, and so on, even though those feeds would likely not be the best possible recommendations for every person.
So we need a way to mitigate that feedback loop. While working on The Sample, I came up with a technique that I call "popularity smoothing." First, you count how many times each feed has been recommended in the past. Then you pick a random percentile and throw out any feeds that are above that percentile in terms of how often they've already been recommended. For example, if our randomly chosen percentile is 57%, then we throw out the top 43% most popular feeds. Once we have our leftover batch of less-popular feeds, then we select whichever of those feeds have the highest predicted ratings.
(Alternatively, given a percentile p, you can include the first α + pβ percent of feeds, where α and β are parameters that you can tune to make popularity smoothing more or less aggressive.)
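Here's a sketch of that parameterized version in Python (the α and β defaults are illustrative, not tuned values):

```python
import random

def popularity_smooth(feeds, rec_counts, alpha=0.2, beta=0.6):
    """Keep only the least-recommended alpha + p*beta fraction of feeds.

    `feeds` is a list of feed ids and `rec_counts` maps feed id -> how many
    times that feed has been recommended before. With alpha=0 and beta=1 this
    reduces to the simpler "random percentile" version described above.
    """
    p = random.random()
    keep_fraction = min(1.0, alpha + p * beta)
    by_popularity = sorted(feeds, key=lambda f: rec_counts.get(f, 0))
    cutoff = max(1, int(len(by_popularity) * keep_fraction))
    return by_popularity[:cutoff]
```

Whatever survives this filter then gets ranked by the model's predicted ratings, as described above.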
Once we have a set of feeds to recommend, we have to pick an article from each of those feeds. This bit isn't personalized: if we've decided to recommend Joe's website and you've never seen any articles from Joe before, then it's probably good enough to just pick whichever of Joe's articles has the most opens and likes in general.
Recency is another important factor, however: Joe's popular blog post from two years ago may or may not still be relevant today. To address this, I'm thinking I'll just look at opens, likes, and dislikes from the previous three months (or some other timeframe). If people are still engaging with an article from two years ago, then there's a higher chance it's still relevant.
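Here's a sketch of that article-selection step, assuming a 90-day window and the same point values as before (again illustrative, not Yakread's actual code):

```python
from datetime import datetime, timedelta

# Same point values as the per-feed ratings above.
POINTS = {"open": 1, "like": 5, "dislike": -5}

def pick_article(articles, window_days=90, now=None):
    """Pick the article from one feed with the best recent engagement.

    `articles` maps article id -> list of (timestamp, signal) pairs across
    all users. Only signals inside the trailing window count, so an old post
    that nobody engages with anymore naturally falls out of contention.
    """
    now = now or datetime.now()
    cutoff = now - timedelta(days=window_days)

    def score(events):
        return sum(POINTS[sig] for ts, sig in events if ts >= cutoff)

    return max(articles, key=lambda a: score(articles[a]))
```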
And that's about it, or at least a high-level view—there are of course lots of little details that go into implementing, well, pretty much anything. But hopefully that all makes sense! There are also plenty of other things I could do, like content-based modeling: I could fire up the old GPT-whatever and use it to analyze the articles' textual content and infer what topics they're about, and then incorporate that into the recommendation model in one way or another. (And then I could slap a "Powered by AI" sticker on the website and increase my chances of getting into Y Combinator by 300%!) But based on some previous experience, my hunch is that the simple approach I've described above will be good enough for now.
Published 8 May 2023