Plans for scaling Findka


I’ve hit a milestone: the first article about Findka written by someone other than myself. I woke up on Sunday morning to find I had 70 new signups overnight. Last week I got about 150 new signups over several days thanks to the Show HN post, but that had slowed to a trickle. At first I thought maybe this new spike came from bots, but that didn’t make sense since the recipients were clicking the signup confirmation links, and I’d already put in reCAPTCHA. But a quick search turned up that article. Since it was posted, I’ve had 270 new signups. The number of ratings has increased from 20K to 64K (it was 3K before Show HN). Content items went from 2K to 4K. I haven’t looked at any samples of what items were added, but my brother suggested that maybe Findka will be better at recommending anime now.

Someone on HN asked how well I thought my current algorithm (computing the entire rating matrix and storing it in memory without factorization) would scale. I said I thought it would scale just fine for at least several months. Turns out it only lasted a week. It handled up to 20K or 30K ratings just fine, but after that the computation time went up steeply and started making my server crash. In my original estimate I was only considering the amount of memory that the matrix would take up after being computed, but it turns out the real issue is pulling all the ratings from the DB and computing the matrix in the first place. I tried tweaking the code for a while but wasn’t able to make it run without crashing the JVM. (In the meantime, I’ve limited the server to only using 20K of the ratings. Sad.)

Maybe I could make it work with some more tweaking or if I upgraded the server again, but it’d still be a dead end. Might as well start doing matrix factorization now. I’m going to use a latent factor model, which means that I’ll use ML to learn a vector of features for each item. It’s like an automatic version of the Music Genome Project which Pandora uses to power their music recommender system. For each new song, they have people listen to it and assign it numeric values for 450 different features (like “female vocals”, “rock”, etc). With a latent factor model, you provide a bunch of user ratings, specify how many features each item should have, and let the computer figure out what all the feature values should be.
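To make the idea concrete, here is a minimal sketch of one common way to fit a latent factor model: plain stochastic gradient descent on the known ratings. It is not Findka’s actual code. The toy ratings (+1 for like, -1 for dislike), the function names, and the hyperparameters are all made up for illustration.

```python
import numpy as np

def factorize(ratings, n_users, n_items, n_factors=2,
              epochs=300, lr=0.05, reg=0.02, seed=0):
    """Learn user and item feature vectors from (user, item, rating) triples.

    Hypothetical helper: SGD on squared error with L2 regularization.
    """
    rng = np.random.default_rng(seed)
    user_f = rng.normal(scale=0.1, size=(n_users, n_factors))
    item_f = rng.normal(scale=0.1, size=(n_items, n_factors))
    for _ in range(epochs):
        # Visit the known ratings in a fresh random order each epoch.
        for idx in rng.permutation(len(ratings)):
            u, i, r = ratings[idx]
            pu = user_f[u].copy()
            qi = item_f[i].copy()
            err = r - pu @ qi  # prediction error for this one rating
            user_f[u] += lr * (err * qi - reg * pu)
            item_f[i] += lr * (err * pu - reg * qi)
    return user_f, item_f

def rmse(ratings, user_f, item_f):
    """Root mean squared error of the model on the given ratings."""
    errs = [r - user_f[u] @ item_f[i] for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errs))))

# Invented toy data: users 0-1 like items 0-1, users 2-3 like items 2-3.
ratings = [
    (0, 0, 1.0), (0, 1, 1.0), (0, 2, -1.0),
    (1, 0, 1.0), (1, 2, -1.0),
    (2, 1, -1.0), (2, 2, 1.0), (2, 3, 1.0),
    (3, 0, -1.0), (3, 3, 1.0),
]
user_f, item_f = factorize(ratings, n_users=4, n_items=4)
print(rmse(ratings, user_f, item_f))
```

Only the known ratings are ever touched, so nothing the size of the full rating matrix has to be built in memory. The rows of `item_f` play the role of the hand-assigned feature values in the Pandora example, except nobody has to name them.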

To represent a user, you combine the feature vectors for all the items they’ve rated. Then you can compare that user vector to any item vector and see how close they are in the vector space.
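Here is a rough sketch of that step, assuming the item vectors have already been learned. Two things here are my own assumptions rather than anything settled: combining item vectors with a rating-weighted sum, and measuring closeness with cosine similarity. The item vectors and ratings are invented toy values.

```python
import numpy as np

def user_vector(item_vectors, user_ratings):
    """Rating-weighted sum of the vectors of the items a user has rated."""
    vec = np.zeros(item_vectors.shape[1])
    for item, rating in user_ratings.items():
        vec += rating * item_vectors[item]
    return vec

def cosine(a, b):
    """Cosine similarity, defined as 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def recommend(item_vectors, user_ratings, top_n=3):
    """Rank the items the user has not rated by closeness to the user vector."""
    u = user_vector(item_vectors, user_ratings)
    scores = [
        (cosine(u, item_vectors[i]), i)
        for i in range(len(item_vectors))
        if i not in user_ratings
    ]
    return [i for _, i in sorted(scores, reverse=True)[:top_n]]

# Invented item vectors: items 0 and 1 are similar, as are items 2 and 3.
item_vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])
user_ratings = {0: 1.0, 2: -1.0}  # liked item 0, disliked item 2
recs = recommend(item_vectors, user_ratings)
print(recs)
```

With these toy values, the user who liked item 0 and disliked item 2 gets item 1 ranked ahead of item 3, which matches the intuition that nearby vectors mean similar taste.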

So that’s what I’ll be working on today. I’ll start out computing it from my laptop once per day or so, but at some point I’ll likely have the main server spin up a temporary worker server and compute the model from there.

Once that’s done, I’m going to continue with my Biff/Materialize integration. That’ll help with scaling the rest of Findka, particularly the analytics queries (CRUD operations have been doing just fine). I’m hopeful that once I have that and matrix factorization in place, I won’t have any major scaling issues for a while (knock on wood).

Needless to say, this has all been extremely exciting for me. Sort of—actually I pretty much never get visibly excited. But it makes me think positively about the probability of Findka succeeding. I now know that Findka’s landing page conversion rate is good. Once I can demonstrate that retention is also good, I’m going to start experimenting with monetization. I’ll be happy if all those metrics check out.

(For now I’m putting the social networking features on hold. I’ll start working on those again if retention isn’t good and working on the algorithm doesn’t fix it, or if retention is good and I think social networking will help to accelerate growth.)

Oh, also: I’ve ordered a couple books: Recommender Systems: The Textbook (as opposed to Recommender Systems: The Opera, I guess) and Designing Data-Intensive Applications.