How Not to Not Sort by Average Rating

My friend Chris pointed out two webpages recommending algorithms for dealing with user-added ranking systems: Bayesian Rating – How to implement a weighted rating system and How Not to Sort by Average Rating. They’re both very problematic, and so I decided to write a response.

Both pages contain a clear description of the problem. For example, Bayesian Rating says

the artworks in TheBroth gallery are visitor rated, using a rather simple + and – system. If you like an item, rate it “plus”. If you don’t like it, give it a “minus”.

The rating of an item would then be: number of positive votes divided by number of total votes. For example, 4 + votes and 1 – vote would correspond to a rating of 0.8, or 80%.

Now if you want to rank the items based on this simple equation, the following happens:

Assume you have on item with a rating of 0.93, based on 100s of votes. Now another new item is rated by a total of 2 visitors (or even just one), and they rate it +. Boom, it goes straight to #1 position in the ranking, as its rating is 100%!

That’s an interesting problem. What does Bayesian Rating recommend as a solution?

Bayesian Rating is using the Bayesian Average. This is a mathematical term that calculates a rating of an item based on the “believability” of the votes. The greater the certainty based on the number of votes, the more the Bayesian rating approximates the plain, unweighted rating. When there are very few votes, the bayesian rating of an item will be closer to the average rating of all items.

Use this equation:

Now, I’m no expert in Bayesian statistics, but I do know Bayes’s Theorem, and this isn’t it. I suspect you can get there from here, by assuming that all these distributions are normal, the prior is also normal, and their variances are particular carefully chosen values… but this is make-believe math. The result, in fact, is a reasonable formula, but it’s not correct Bayesian reasoning.

What’s even funnier, though, is How Not to Sort’s answer:

CORRECT SOLUTION: Score = Lower bound of Wilson score confidence interval for a Bernoulli parameter

Say what: We need to balance the proportion of positive ratings with the uncertainty of a small number of observations. Fortunately, the math for this was worked out in 1927 by Edwin B. Wilson. What we want to ask is: Given the ratings I have, there is a 95% chance that the “real” fraction of positive ratings is at least what? Wilson gives the answer. Considering only positive and negative ratings (i.e. not a 5-star scale), the lower bound on the proportion of positive ratings is given by:

\left(\hat{p}+\frac{z_{\alpha/2}^2}{2n}\pm z_{\alpha/2}\sqrt{[\hat{p}(1-\hat{p}) + z_{\alpha/2}^2/4n]/n}\right)/(1+z_{\alpha_2}^2/n).

Now the Wilson Score Confidence Interval is an interesting beast, and not irrelevant. It’s worthwhile seeing how we get there, since it’s completely unstated in How Not.

Suppose that I ask you to compute rankings of a large number of items M, given that each item i has been rated

N_i

times, and of those ratings,

k_i

were positive. I do not give you any more information for computing your rankings, which is good, because otherwise you could end up doing something crazy like MovieLens, which uses raters’ past ratings to predict future ratings by detecting similarities between movies and between raters.

The obvious assumption to make, then, is that the probability that a rater rates item i positively is some constant, unknown

p_i

, and if an infinite number of raters were available,

k_i/N_i

would converge to

p_i

. Unfortunately, we only have finitely many ratings, so instead we observe a binomial distribution:

P[k_i = \kappa | N_i, p_i] = \mathrm{Binom}(\kappa;N_i,p_i) = p_i^\kappa(1-p_i)^{N_i-\kappa} {N_i\choose\kappa}

. Our goal, then, is to somehow estimate

p_i

given the data we have. This is the standard case for Bayesian Estimation.

Bayesian Estimation works by constructing a probability distribution for the parameter we seek to estimate by multiplying together a Conditional Probability (the binomial above) and a Prior probability. In this case, the Prior is some probability density function, call it D(p). D is the probability density of getting any particular

p_i

. It’s a normalized probability distribution, and since all

p_i

have to be between 0 and 1, D is only nonzero between 0 and 1. Note that by adopting any such prior, we are assuming that all the items we are rating are drawn from some single distribution. This is perfectly acceptable, since we have no information that allows us to distinguish between these items. In this case, Bayes’s Theorem tells us that

P[p_i = q | N_i, k_i] = D(q)\cdot\mathrm{Binom}(k_i;N_i,q)

. This is called the posterior distribution.

To get the Wilson Confidence Interval noted above, we have to take D() to be a “flat prior”, i.e. D(q) = 1 for all q in the domain (0 to 1). A flat prior means that we have no reason to expect any value for

p_i

to be more likely than any other value. Given a flat prior, the posterior distribution is the same as the conditional distribution, i.e. a binomial. The confidence interval formula actually just computes percentiles into the binomial distribution.

So given this information, what does How Not recommend?

I would pick 0.05.

In other words, replace each set of ratings by the 5th percentile estimate. This choice seems fairly arbitrary, and far from the best estimator for

p_i

(that would be the 50th percentile). Where does it come from? By choosing the 5th percentile, we produce lower estimators for wider distributions, which correspond to those with fewer ratings. The effect, then, is similar to choosing a prior in which smaller values are much more probable than larger values, but it’s hard to solve backwards to figure out exactly what prior we’re effectively using. We might also be able to reproduce Bayesian Rating‘s result, approximately, by some particular choice of Normal prior, centered at the mean of all ratings.

If you have a strong feeling about what the prior probability distribution D should be, then you’re pretty much done. Bayes’s Theorem gives you the Posterior distribution on

p_i

, and all you have to do is pull out a value from it. The most popular way to do this is Maximum Likelihood Estimation, which, as the name implies, means just picking the value that gives you the highest posterior probability density. Symbolically, MLE means choosing

\bar{p_i} = \arg\max_q D(q)P[k_i = \kappa|N_i,q]

.

What if you don’t have a strong feeling about D? Then we need to estimate D somehow. Sometimes when you have a hammer, you really want to smash stuff, so we can simply apply Bayesian Estimation again, this time to D itself. To do so, we have to introduce a hyperprior E, that is, a prior probability distribution whose domain consists of all possible probability distributions. The conditional probability is now the probability, given D, that the entire dataset would be generated:

P[\{k_i\} | \{N_i\}, D] = \prod_{i=1}^M P[k_i | N_i, D] \\= \prod_{i=1}^M \int_0^1 P[k_i | N_i, p_i] D(p_i)dp_i

.

We then choose D by MLE again:

\bar{D} = \arg\max_\Delta E(\Delta)\prod_{i=1}^M \int_0^1 \mathrm{Binom}(k_i;N_i,p_i)\Delta(p_i)dp_i

. We can then use

\bar{D}

as our prior to compute

\bar{p_i}

.

Of course, we have now just gotten ourselves back to the same problem: we have to choose a hyperprior. Hyperpriors are pretty cool, though, because they allow us to make some very general statements about what kinds of priors we find acceptable. For example, the hyperprior could enforce smooth priors, or weak (i.e. nearly flat) priors, or (truncated) Normal priors, or any other class of priors that fits with your intuition about your particular rating problem.

We also have the option of choosing a totally flat hyperprior, meaning that E(d) is a constant regardless of its argument, and all prior distributions are equally likely a priori. This simplifies the MLE… but only slightly. What we’re left with is unfortunately not a very friendly expression, and I have no reason to expect that a closed-form solution exists in general (and many hours’ worth of failed attempts). In fact, the product-of-integrals formulation looks to me a bit like a continuum analogue of the Satisfiability problem, which is NP-hard.

Given that the flat-hyperprior MLE problem is hard, we can take one step back and perform what I call “Large Likelihood Estimation”: pick a value for
D that achieves a high posterior probability. For example, a decent choice might be

D(q) = \sum_{i=1}^M\frac{1}{M}\delta\left(q-\frac{k_i}{N_i}\right)

. This prior simply says that each rating should be one of the observed ratings. As such, it’s unfortunately rather trivial, equivalent to a flat prior on this dataset. However, it’s a good starting point for better priors, which can be found by optimizing the likelihood function while varying the coefficients and positions of the delta functions, using something like Nonlinear Conjugate Gradients. In fact, I have a hunch that this may actually produce the globally optimal prior.

Anyway, the point of this ramble is that something as simple as interpreting +/- ratings can wind up tremendously complicated. It’s complicated even though we actually made some rather strong simplifying assumptions: ratings are binomial,

N_i

is completely independent from

p_i

, all items are interchangeable… imagine the complexity if everything were on a 5-star scale!