byte[]
Philomena Contributor
@The_Park
Let’s define the problem. We want to know, in an objective sense, a reasonable lower bound on how well people will rate an image on average. We don’t currently have perfect information about how people will rate the image, just a small sample of people who voted on it. For the sake of convenience (and it is a big stretch), we will assume the sample is randomly selected from independent people who wanted to see that type of image. This gives us a statistical basis for making inferences.

An obvious lower bound for how well people will rate an image is that 100% of them will downvote it. But that isn’t a very useful metric, because it assigns the same score to all images even though people do like some images more than others – it doesn’t provide any ranking information.

To find a more informative lower bound on the proportion of people who will like an image, we need to incorporate some of the information about how people have rated it so far. Here our assumption that the voters so far are a random sample helps us: if you take many random samples for the same image, with different people voting each time, the sample proportions will mostly converge on a small range of values centered around the true value. This is called the central limit theorem.

Because the central limit theorem lets us treat these sample averages as a normal distribution, we can use the math of the normal distribution to help us out. To do this, I will introduce some notation.

p̂ (p-hat, p with a circumflex) is the sample proportion. This is the number of upvotes divided by the total votes.
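To see the central limit theorem in action, here is a minimal Python sketch (the true proportion `TRUE_P` and the sample size `N` are made-up values for illustration). It draws many independent samples of voters and shows that the sample proportions cluster tightly around the true value:

```python
import random
import statistics

random.seed(42)
TRUE_P = 0.8   # hypothetical true proportion of people who like the image
N = 50         # hypothetical number of voters per sample

# Draw many independent samples of N voters; each voter upvotes with
# probability TRUE_P, and we record the proportion of upvotes per sample.
sample_props = [
    sum(random.random() < TRUE_P for _ in range(N)) / N
    for _ in range(10_000)
]

mean = statistics.mean(sample_props)
sd = statistics.stdev(sample_props)
print(f"mean of sample proportions: {mean:.3f}")   # close to TRUE_P
print(f"spread of sample proportions: {sd:.3f}")   # small, shrinks as N grows
```

The mean of the sample proportions lands very close to the true value, and their spread is exactly the standard deviation the rest of this post works with.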
n is sample size, or how many people voted.
p is the population proportion. This is the value we don’t know (the true rating of the image), and the one we would like to find a lower bound for.
μ is the average of all the sample proportions. If our samples are really random and independent, it should be equal to the population proportion p.
σ (Greek letter sigma) is the population standard deviation. For a normal distribution, we expect 68% of our samples to fall between one standard deviation below and one standard deviation above the average.

A good lower bound for our true rating is μ – 3σ, because 99.85% of all of our samples will be above that value. However, if we knew what μ was, we wouldn’t need any of this stuff, because we could just use that as the rating! So we have to estimate based on p̂. We will naively guess that p̂ = p, and let the sample size control our standard deviation. This should intuitively make sense: if we choose an enormous sample size, and the voters are all randomly sampled, it is much more likely that we are close to the true value than if we only have a small number of people.

The formula for the sample standard error is expressed as
`
    sqrt(p̂(1 - p̂))
s = ——————————————
       sqrt(n)
`
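As a quick sanity check, the standard error is easy to compute directly. This is a minimal sketch (the function name `standard_error` is mine, not anything Derpibooru uses):

```python
import math

def standard_error(upvotes: int, total: int) -> float:
    """Sample standard error of the proportion: sqrt(p̂(1 - p̂)) / sqrt(n)."""
    phat = upvotes / total
    return math.sqrt(phat * (1 - phat)) / math.sqrt(total)

# 50 upvotes out of 100 votes: p̂ = 0.5, s = 0.5 / 10 = 0.05
print(standard_error(50, 100))

# The same 90% approval rate is far more certain with 500 votes than with 10.
print(standard_error(9, 10), standard_error(450, 500))
```

Note how the same observed proportion yields a much smaller standard error at a larger sample size – that is the intuition from the previous paragraph made concrete.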
So our formula for the estimate for the value the true proportion will be above 99.85% of the time, the value we were interested in, is
`
        sqrt(p̂(1 - p̂))
p̂ - 3 * ——————————————
           sqrt(n)
`
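Translating that formula directly into code makes its failure modes easy to see. A minimal sketch (again, `naive_lower_bound` is my name for it):

```python
import math

def naive_lower_bound(upvotes: int, total: int) -> float:
    """p̂ minus three standard errors – the naive estimate from above."""
    phat = upvotes / total
    s = math.sqrt(phat * (1 - phat)) / math.sqrt(total)
    return phat - 3 * s

# One vote, one upvote: the standard error collapses to zero, so the
# "lower bound" is a perfect 1.0 – absurdly optimistic for a single vote.
print(naive_lower_bound(1, 1))

# One upvote out of four: the bound goes negative, which is meaningless
# for a proportion.
print(naive_lower_bound(1, 4))
```

Both failure cases – tiny samples and extreme proportions – are exactly the ones the next paragraph addresses.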
This almost works as intended, and it is pretty close to what we actually do, except it annoyingly doesn’t work correctly when the sample size is really small (when an image only has a few votes) or when the proportion is really extreme (everyone liked or disliked it). This is what Edwin Bidwell Wilson’s scoring interval, which we abbreviate to the Wilson score interval, fixes.

Wilson’s modifications to the formula greatly improve its stability in these extreme cases. If you want to read more about it, hit up the Wikipedia page on the binomial proportion confidence interval.

Note: Derpibooru is actually less pessimistic than this and uses the value that 99.5% of all possible sample proportions are greater than, rather than 99.85%.
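For reference, the lower bound of the Wilson score interval looks like this in code. This is a sketch of the textbook formula, not Derpibooru’s actual implementation (which is in Elixir and, as noted, uses a less pessimistic confidence level); the function name is mine:

```python
import math

def wilson_lower_bound(upvotes: int, total: int, z: float = 3.0) -> float:
    """Lower bound of the Wilson score interval for the true upvote proportion.

    z is the number of standard deviations of confidence; z = 3 matches the
    99.85% bound discussed above.
    """
    if total == 0:
        return 0.0  # no votes: no evidence, so the most pessimistic bound
    phat = upvotes / total
    denom = 1 + z * z / total
    centre = phat + z * z / (2 * total)
    spread = z * math.sqrt((phat * (1 - phat) + z * z / (4 * total)) / total)
    return (centre - spread) / denom

# 90% approval with 100 votes ranks above 90% approval with only 10 votes.
print(wilson_lower_bound(90, 100), wilson_lower_bound(9, 10))

# Unlike the naive bound, a single upvote no longer yields a perfect score.
print(wilson_lower_bound(1, 1))
```

Unlike the naive estimate, this bound stays inside [0, 1], handles tiny samples gracefully, and still rewards images whose approval rate is backed by more votes.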