byte[]
Philomena Contributor
@The_Park
Let’s define the problem. We want to know, in an objective sense, a reasonable lower bound on how well people will rate an image on average. We don’t currently have perfect information about how people will rate the image, just a small sample of people who voted on it. For the sake of convenience (and it is a big stretch), we will assume the sample is randomly selected from independent people who wanted to see that type of image. This gives us a statistical basis for making inferences.

An obvious lower bound for how well people will rate an image is that 100% of them will downvote it. But that isn’t a very useful metric, because it assigns the same score to all images even though people do like some images more than others – it doesn’t provide any ranking information.

To find a more informative lower bound on the proportion of people who will like an image, we need to incorporate some of the information about how people have rated it so far. Here our assumption that the voters so far are a random sample helps us: if you take many random samples for the same image, with different people voting each time, the sample proportions will mostly converge on a small range of values centered around the true value. This is called the central limit theorem.

Because the central limit theorem lets us treat these sample averages as a normal distribution, we can use the math of the normal distribution to help us out. To do this, I will introduce some notation.

p̂ (p-hat, p with a circumflex) is the sample proportion. This is the number of upvotes divided by the total votes.
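To see the central limit theorem in action, here is a minimal Python sketch (the true proportion `TRUE_P` and the sample size `N` are made-up values for illustration). It draws many independent samples of voters and shows that the sample proportions cluster tightly around the true value:

```python
import random
import statistics

random.seed(42)
TRUE_P = 0.8   # hypothetical true proportion of people who like the image
N = 50         # hypothetical number of voters per sample

# Draw many independent samples of N voters; each voter upvotes with
# probability TRUE_P, and we record the proportion of upvotes per sample.
sample_props = [
    sum(random.random() < TRUE_P for _ in range(N)) / N
    for _ in range(10_000)
]

mean = statistics.mean(sample_props)
sd = statistics.stdev(sample_props)
print(f"mean of sample proportions: {mean:.3f}")   # close to TRUE_P
print(f"spread of sample proportions: {sd:.3f}")   # small, shrinks as N grows
```

The mean of the sample proportions lands very close to the true value, and their spread is exactly the standard deviation the rest of this post works with.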
n is sample size, or how many people voted.
p is the population proportion. This is the value we don’t know (the true rating of the image), and the one we would like to find a lower bound for.
μ is the average of all the sample proportions. If our samples are really random and independent, it should be equal to the population proportion p.
σ (Greek letter sigma) is the population standard deviation. For a normal distribution, we expect 68% of our samples to fall between one standard deviation below and one standard deviation above the average.

A good lower bound for our true rating is μ – 3σ, because 99.85% of all of our samples will be above that value. However, if we knew what μ was, we wouldn’t need any of this stuff, because we could just use that as the rating! So we have to estimate based on p̂. We will naively guess that p̂ = p, and let the sample size control our standard deviation. This should intuitively make sense: if we choose an enormous sample size, and the voters are all randomly sampled, it is much more likely that we are close to the true value than if we only have a small number of people.

The formula for the sample standard error is expressed as
`
    sqrt(p̂(1 - p̂))
s = ——————————————
       sqrt(n)
`
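As a quick sanity check, the standard error is easy to compute directly. This is a minimal sketch (the function name `standard_error` is mine, not anything Derpibooru uses):

```python
import math

def standard_error(upvotes: int, total: int) -> float:
    """Sample standard error of the proportion: sqrt(p̂(1 - p̂)) / sqrt(n)."""
    phat = upvotes / total
    return math.sqrt(phat * (1 - phat)) / math.sqrt(total)

# 50 upvotes out of 100 votes: p̂ = 0.5, s = 0.5 / 10 = 0.05
print(standard_error(50, 100))

# The same 90% approval rate is far more certain with 500 votes than with 10.
print(standard_error(9, 10), standard_error(450, 500))
```

Note how the same observed proportion yields a much smaller standard error at a larger sample size – that is the intuition from the previous paragraph made concrete.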
So our formula for the estimate for the value the true proportion will be above 99.85% of the time, the value we were interested in, is
`
        sqrt(p̂(1 - p̂))
p̂ - 3 * ——————————————
           sqrt(n)
`
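Translating that formula directly into code makes its failure modes easy to see. A minimal sketch (again, `naive_lower_bound` is my name for it):

```python
import math

def naive_lower_bound(upvotes: int, total: int) -> float:
    """p̂ minus three standard errors – the naive estimate from above."""
    phat = upvotes / total
    s = math.sqrt(phat * (1 - phat)) / math.sqrt(total)
    return phat - 3 * s

# One vote, one upvote: the standard error collapses to zero, so the
# "lower bound" is a perfect 1.0 – absurdly optimistic for a single vote.
print(naive_lower_bound(1, 1))

# One upvote out of four: the bound goes negative, which is meaningless
# for a proportion.
print(naive_lower_bound(1, 4))
```

Both failure cases – tiny samples and extreme proportions – are exactly the ones the next paragraph addresses.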
This almost works as intended, and it is pretty close to what we actually do, except it annoyingly doesn’t work correctly when the sample size is really small (when an image only has a few votes) or when the proportion is really extreme (everyone liked or disliked it). This is what Edwin Bidwell Wilson’s scoring interval, which we abbreviate to the Wilson score interval, fixes.

Wilson’s modifications to the formula greatly improve its stability in these extreme cases. If you want to read more about it, hit up the Wikipedia page on the binomial proportion confidence interval.

Note: Derpibooru is actually less pessimistic than this and uses the value that 99.5% of all possible sample proportions are greater than, rather than 99.85%.
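For reference, the lower bound of the Wilson score interval looks like this in code. This is a sketch of the textbook formula, not Derpibooru’s actual implementation (which is in Elixir and, as noted, uses a less pessimistic confidence level); the function name is mine:

```python
import math

def wilson_lower_bound(upvotes: int, total: int, z: float = 3.0) -> float:
    """Lower bound of the Wilson score interval for the true upvote proportion.

    z is the number of standard deviations of confidence; z = 3 matches the
    99.85% bound discussed above.
    """
    if total == 0:
        return 0.0  # no votes: no evidence, so the most pessimistic bound
    phat = upvotes / total
    denom = 1 + z * z / total
    centre = phat + z * z / (2 * total)
    spread = z * math.sqrt((phat * (1 - phat) + z * z / (4 * total)) / total)
    return (centre - spread) / denom

# 90% approval with 100 votes ranks above 90% approval with only 10 votes.
print(wilson_lower_bound(90, 100), wilson_lower_bound(9, 10))

# Unlike the naive bound, a single upvote no longer yields a perfect score.
print(wilson_lower_bound(1, 1))
```

Unlike the naive estimate, this bound stays inside [0, 1], handles tiny samples gracefully, and still rewards images whose approval rate is backed by more votes.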