Feature suggestions and discussion [READ THE FIRST POST]

Exedrus
Heart Gem -
Gold Bit -
Not a Llama - Happy April Fools Day!
Friendship, Art, and Magic (2017) - Celebrated Derpibooru's five year anniversary with friends.
Happy Derpy! -
Responsible Disclosure -
Silver Patron -
Artist -

@byte[]
>I don’t suppose you can see the problem here?

This seems reasonable, assuming there isn't some technical limitation preventing this. (Is spoiler handling done client-side?)

I mean, ideally people could label each line so the text could be more concise, but failing that this would at least give them some context before they had to resort to a scroll-over to scan the tags.

Edit: Seems like you might also be able to make it just display the first line/outer-or-clause that matches rather than the shortest? Then people could just order them according to how important each individual check is.
derpy727

Feature Request: Provide REAL SHA512 for Derpibooru-optimised versions of images in Search/JSON API

I was surprised to have discovered, and then to have my suspicions confirmed on IRC, that, contrary to what one naturally expects from reading about it in Search Syntax documentation…

the sha512_hash field always equals the orig_sha512_hash field, which is the SHA512 of the submitted image before the image optimisation pass.

In other words, if one downloads an image from derpibooru, its SHA doesn't always match up with what is said in JSON! An example of such is this post: >>1360613

What this means from a practical standpoint:

1) The natural expectations of JSON API users are violated. The very first downloader script I found checks the files it downloaded against sha512_hash field.

2) There's literally no other way to find the original derpibooru page by a given image downloaded from Derpibooru except by heavyweight reverse image search (which doesn't have a JSON API even). What could have been a lightweight /search.json?q=sha512_hash:<the_hash> GET query becomes an upload of the full picture with server-side similarity computation. Which is why it is quite probably also severely rate-limited by Derpibooru admins, too (I've never tried, because I don't want to stress a site I like with unnecessary load).
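The two uses described in points 1) and 2) could be sketched as follows. This is only a minimal sketch: the `/search.json?q=sha512_hash:<the_hash>` query form is taken from the post above, while the helper names are made up for illustration, and the lookup can only work when the local file is byte-identical to whatever the site hashed.

```python
import hashlib
from urllib.parse import quote


def sha512_hex(data: bytes) -> str:
    """Return the lowercase hex SHA-512 digest of raw file bytes."""
    return hashlib.sha512(data).hexdigest()


def hash_search_path(data: bytes) -> str:
    """Build the GET path for a search by hash (query form quoted from the post)."""
    return "/search.json?q=" + quote("sha512_hash:" + sha512_hex(data))


def matches_api_hash(data: bytes, api_hash: str) -> bool:
    """The check a downloader script would run after fetching a file."""
    return sha512_hex(data) == api_hash.strip().lower()
```

A downloader would call `matches_api_hash(file_bytes, record_hash)` with the hash field from the JSON record (hypothetical variable names); the complaint above is that this check fails whenever the served file differs from the original upload.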

I'm not even sure what is the use-case for hashes of original submissions alone. The majority of the images I encounter in the wild come from Derpibooru (or from some secondary site, where they were mangled by that site's own compression and so are beyond searching, whether by original or Derpibooru-optimised SHA, which is why that is out of scope for this discussion). Does anyone expect some user to go to, say, DeviantArt, download a pony image, compute its SHA, and search with it on Derpibooru… for what exactly? If they got the original image directly from the author, it probably had all the info they could possibly need alongside.

Maybe you're using it to detect duplicate submissions… no, scratch that. Here's an example (marginally NSFW) of duplicate images with literally the same SHA, that nobody cares about. But if you did, SHA512 of post-optimisation images would be even BETTER for detecting duplicates, because it won't be affected by various metadata, bytes appended by imageboard-anti-duplicate-image-check scripts, etc, that are stripped by the optimisation pass.

I literally fail to see why you do not use internally, and/or provide via the JSON API, the hashes of the images that you serve to your users. The sha512_hash field is already described in the Search Syntax docs as if it were exactly that counterpart of the orig_sha512_hash field, so why not actually make it so?

My use case is that I have a bunch of images I collected over time from various sources, and I wanted to organise them a bit by fetching image tags for them from Derpibooru and moving them to the respective folders. Fetching by sha512_hash seemed like the perfect match for this task… if only it weren't for this deficiency.

I honestly do not want to jump through hoops like uploading my whole collection to Derpibooru's reverse image search via POST requests. And since I'm not quite confident in Derpibooru's reverse search algorithm (which is it, btw?), it would probably be followed by downloading images from all search results, in order to have my image similarity algorithm determine if I really got the tags for the image I wanted (by, at least, comparing the SHAs, lol). I would, of course, rate-limit it muchly, but still I hate to think about this unnecessary stress on the site (and about the time it will take to complete).
byte[]

Admin
Site Developer
@derpy727

>There's literally no other way to find the original derpibooru page by a given image downloaded from Derpibooru except by heavyweight reverse image search (which doesn’t have a JSON API even).
Technically, it actually does have a JSON API; POST with the image form-encoded in the image param, and set Accept to application/json, or append .json to the URL, or you can use it as a bookmarklet with the GET query param scraper_url. However, based on what you wrote, this probably wouldn't be very helpful to you.
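A request of the kind described here might be assembled as below. This is a sketch under assumptions: the field name `image` and the `Accept: application/json` header come from the post, but the endpoint URL is not given in the thread and is passed in by the caller, and multipart/form-data is assumed as the form encoding for a file upload. The function only builds the request and does not send it.

```python
import uuid
from urllib.request import Request


def build_reverse_search_request(url: str, image_bytes: bytes,
                                 filename: str = "query.png") -> Request:
    """Build (but do not send) a POST carrying the image in the 'image' form field."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="image"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        image_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {
        "Accept": "application/json",  # ask for the JSON response, per the post
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return Request(url, data=body, headers=headers, method="POST")
```

Passing the result to `urllib.request.urlopen` would actually perform the upload, so any rate limiting belongs around that call.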

>I'm not even sure what is the use-case for hashes of original submissions alone. […] Maybe you're using it to detect duplicate submissions… no, scratch that.
Actually, that's correct. SHA-512 prevents exact duplicate copies of images from being uploaded. This catches about 50% of our duplicate uploads; the rest are caught with a heavier-weight perceptual deduplication.

>But if you did, SHA512 of post-optimisation images would be even BETTER for detecting duplicates, because […]
This is a faulty assertion. Optimized image data do not have a single normal form that they will naturally and obviously be reduced to for every set of input pixels, and the output data and resultant hash are likely to be different for runs on differently-encoded inputs.

>And since I'm not quite confident in Derpibooru's reverse search algorithm (which is it, btw?)
Homegrown. See here.

ADDENDUM: Cloudflare also messes with files downloaded from the site. Don't expect anything except large (>2MB) PNGs to have hashes matching those from our own collection.


As far as I can tell, nothing I can personally do on my side would be helpful for you.
derpy727

@byte[]
>Technically, it actually does have a JSON API

Oh, I see. It's been a while and I didn't investigate it thoroughly at the time, I guess.

>However, based on what you wrote, this probably wouldn't be very helpful to you.

Why? It probably would, if I really do have to resort to it. Do you perhaps have a suggested request/byte rate limit for me, or should I refrain from using it altogether?

>Actually, that's correct. SHA-512 prevents exact duplicate copies of images from being uploaded. This catches about 50% of our duplicate uploads; the rest are caught with a heavier-weight perceptual deduplication.

I wonder how that explains my accidental finding of an exact duplicate. All of those are one select … having count(*)>1 query away for you.
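The kind of query hinted at here can be tried against an in-memory SQLite database. A minimal sketch: the `images` table and its `id` and `sha512` columns are hypothetical and say nothing about the site's real schema.

```python
import sqlite3


def find_duplicate_hashes(rows):
    """rows: iterable of (image_id, sha512_hex). Returns {hash: count} for repeated hashes."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, sha512 TEXT)")
    con.executemany("INSERT INTO images (id, sha512) VALUES (?, ?)", rows)
    dupes = con.execute(
        "SELECT sha512, COUNT(*) FROM images GROUP BY sha512 HAVING COUNT(*) > 1"
    ).fetchall()
    con.close()
    return dict(dupes)
```

For example, `find_duplicate_hashes([(1, "aa"), (2, "bb"), (3, "aa")])` reports that the hash `"aa"` occurs twice.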

>But if you did, SHA512 of post-optimisation images would be even BETTER for detecting duplicates, because […]
>This is a faulty assertion. Optimized image data do not have a single "normal form"

No, not really. But an image doesn't have to reduce uniquely in order for a hash check to be useful. It is sufficient for it to reduce to a smaller number of variants, and hence be better for detecting duplicates. Though now I realise you might not want to spend CPU time on optimising an image before duplication check, so as to not present abusers with a DoS avenue. Then again, it depends on how heavy your processing is; maybe it's not going to put much of a dent in your CPU budget behind a reasonable upload rate limiter. In a quest for a normal form you could hash not the image file as a whole, but its pixel data in RGB, etc. The possibilities are endless, really.
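The difference between hashing a file and hashing its decoded pixel data can be illustrated with a toy container. This is a sketch only: `zlib` stands in for a real image decoder, the 4x4 image is made up, and a real implementation would also have to settle bit depth, alpha, and colour-space questions before the pixel bytes are comparable.

```python
import hashlib
import zlib


def file_hash(blob: bytes) -> str:
    """Hash the file exactly as stored, including any trailing junk."""
    return hashlib.sha512(blob).hexdigest()


def pixel_hash(blob: bytes, width: int, height: int) -> str:
    """Hash the decoded pixels plus dimensions, ignoring container-level differences."""
    # decompressobj() stops at the end of the stream; extra bytes go to unused_data.
    pixels = zlib.decompressobj().decompress(blob)
    header = f"{width}x{height}:RGB8:".encode()
    return hashlib.sha512(header + pixels).hexdigest()


pixels = bytes(range(48))        # a made-up 4x4 RGB image
plain = zlib.compress(pixels)
padded = plain + b"\x00" * 16    # same image with bytes appended to dodge a file-hash check
```

Here `file_hash(plain)` and `file_hash(padded)` differ, while `pixel_hash(plain, 4, 4)` and `pixel_hash(padded, 4, 4)` agree.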

>Homegrown. See here.

Oh. OH. Explains why I had no confidence in it. Yeah, it probably will not help me. Were you constrained by CPU budget per image when you chose this? If I may make a suggestion, you really should use something like ph_dct_imagehash() from pHash library for this (64-bit hashes compared by Hamming distance), combined with a proper Metric Tree data structure for search, unless not being very accurate with reverse image search is your intention. And exposing phash values via the JSON API and allowing searches by them within a given bit distance would REALLY be helpful with regard to not having to needlessly upload files and compute hashes for them on your server.
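To make the suggestion concrete, here is a sketch of Hamming-distance search over 64-bit hashes using a BK-tree, one common kind of metric tree. The hash values are assumed to be produced elsewhere (no perceptual hash is computed here), and the class and method names are illustrative only.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


class BKTree:
    """Metric tree keyed on Hamming distance; each node is (hash, {distance: child})."""

    def __init__(self):
        self.root = None

    def add(self, h: int) -> None:
        if self.root is None:
            self.root = (h, {})
            return
        node = self.root
        while True:
            d = hamming(h, node[0])
            if d == 0:
                return  # already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (h, {})
                return
            node = child

    def search(self, h: int, max_dist: int) -> list:
        """Return all stored hashes within max_dist bits of h."""
        found = []
        stack = [self.root] if self.root is not None else []
        while stack:
            value, children = stack.pop()
            d = hamming(h, value)
            if d <= max_dist:
                found.append(value)
            # Triangle inequality: matches can only sit in subtrees whose edge
            # distance lies between d - max_dist and d + max_dist.
            for dist, child in children.items():
                if d - max_dist <= dist <= d + max_dist:
                    stack.append(child)
        return found
```

The pruning step is what lets a query skip most of the tree when the allowed bit distance is small.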

>ADDENDUM: Cloudflare also messes with files downloaded from the site. Don't expect anything except large (>2MB) PNGs to have hashes matching those from our own collection.

Luckily, those constitute a very small percentage of my collection. If THIS were the only source of search failures, I very well could live with that failure rate.

Edit: I might have misread that. So it's the small files that get mangled? Anyway, it appears that it's a configurable option, and I'm not sure why you'd need Cloudflare to optimize your already optimized images, but I can hardly ask you not to, can I? Oh well, this leaves the possibilities of pixel data SHA or (if you use the lossy variety of Cloudflare optimisation) some less mangling-sensitive image hash, like phash.

>As far as I can tell, nothing I can personally do on my side would be helpful for you.

Oh, I really do hope otherwise.
byte[]

Admin
Site Developer
@derpy727
>Though now I realise you might not want to spend CPU time on optimising an image before duplication check, so as to not present abusers with a DoS avenue.
Bingo.

>In a quest for a normal form you could hash not the image file as a whole, but its pixel data in RGB, etc.
It wouldn't really be helpful; there are just too many input variables to expect to reduce everything to the same 8-bit RGBA gAMA-normalized cHRM-normalized pixel data.

>Were you constrained by CPU budget per image when you chose this?
Severely. While it's not constrained so much anymore, I'd like to prevent it from being eaten up excessively.

>If I may make a suggestion, you really should use something like ph_dct_imagehash() from pHash library
GPLv3, can't use it.

>unless not being very accurate with reverse image search is your intention
I mean, it does have a nearly zero false negative rate.

>So it's the small files that get mangled?
Yes.

>Anyway, it appears that it's a configurable option, and I'm not sure why you'd need Cloudflare to optimize your already optimized images
I don't know either, but disabling Polish makes image loads almost 3x slower for me on everything I use, so I have opted to keep it on.
derpy727

@byte[]
>GPLv3, can't use it.

Well then, how about this? The license seems as permissive as it gets. Though if you look at the implementation of its phash(), I'd argue it's so simple as to be barely copyrightable. It's just a few lines you could do in plain C and opencv yourself. Or implement your own flavor. It's not like perceptual hash implementations are even compatible among themselves, though the general idea is the same.

Oh well, I guess you still have your CPU concerns. It's just a pity that for anime there's iqdb.org, which indexes not just one, but a dozen booru sites and does a great job at finding stuff, yet we can't have such a nice thing for ponies.

>I mean, it does have a nearly zero false negative rate.
Haha.

>I don't know either, but disabling Polish makes image loads almost 3x slower for me on everything I use, so I have opted to keep it on.
Well, that's fair, even if strange. I wonder though if Cloudflare's optimisation is idempotent, so that you could actually pass images through them once and then download their version and replace yours, and after they process it again it will still have the SHA intact. Just wondering out loud, it's clearly too many hoops for you to jump through.

Oh well, it would appear I'll have to try to make do with your reverse image search. Now that I know your algorithm, I can probably learn to generate small grayscale thumbnails that would weigh almost nothing yet produce the same 5 intensity values as the original image, to spare your upload bandwidth and CPU. Unless, of course, you're willing to expose an API where I could provide those 5 numbers + threshold directly.

Then, though I'd have muchly preferred to fetch just phashes instead, for the search results obtained I intend to download thumb_tiny/thumb_small thumbnails, which should be just barely enough to reliably compare them to my original images, while not wasting much of your bandwidth again. I hope that's alright by you.
derpy727

>Unless, of course, you're willing to expose an API where I could provide those 5 numbers + threshold directly.

I'll take that silence as "No", which is a real real pity. It appears that the image search API doesn't accept images less than 32768 pixels total, so there is that limit to how small I can make my queries.

By the way, now that I've looked at your fingerprinting algo a bit closer, I'm not sure if I'm missing something, or it really is as pointless as it looks.

Why even compute the costly convolution with a unit gaussian kernel? It's unit, so that isn't going to change the total sum, save for the image border effects. For a 3x3 gaussian kernel and the default opencv border strategy "reflect", your computation is almost exactly equal to

total_sum - 0.25*sum_of_border_values + 0.25*sum_of_values_one_pixel_away_from_border

So for a w*h sized image the difference is on the order of (w+h)/(w*h), and so is less than 1% of the sum for even the smallest images (183x183), and that's assuming the most unlikely case of the highest contrast between border pixels and one_pixel_away_border_pixels.
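This claim can be checked numerically with a small pure-Python sketch. It assumes the separable kernel [1/4, 1/2, 1/4] (which is what the 0.25 coefficients in the formula imply) and a mirror border that does not repeat the edge pixel; it does not call OpenCV and is not the site's code.

```python
def reflect101(i, n):
    """Mirror an out-of-range index without repeating the edge pixel."""
    if i < 0:
        return -i
    if i >= n:
        return 2 * (n - 1) - i
    return i


def blur_1d(row):
    """Convolve one row with the unit-sum kernel [1/4, 1/2, 1/4]."""
    n = len(row)
    return [
        0.25 * row[reflect101(i - 1, n)] + 0.5 * row[i] + 0.25 * row[reflect101(i + 1, n)]
        for i in range(n)
    ]


def blur_3x3(img):
    """Separable 3x3 blur: rows first, then columns."""
    rows = [blur_1d(r) for r in img]
    cols = [blur_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def total(img):
    return sum(sum(r) for r in img)


def predicted_row_sum(row):
    """The border-correction formula from the post, applied to a single row."""
    return sum(row) - 0.25 * (row[0] + row[-1]) + 0.25 * (row[1] + row[-2])
```

For a single row the formula holds exactly; in two dimensions the blurred total can differ from the plain total by at most 127.5 * (w + h) for 8-bit data, which is the (w+h)/(w*h) scaling mentioned above.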

I'm also not sure if (half_x+1) and (half_y+1) instead of just half_x and half_y in the quadrant size computation is an attempt to prevent overlap of gaussians between quadrants (why?) or just an off-by-one error. If it weren't for that, the average intensity would contain no info that the average of the quadrant intensities didn't encode already. But since this basically throws away the horizontal and vertical central lines after the blur is applied, it merely encodes the contrast on the middle lines, and so deviates from naive summation by a whopping 1% at most too, and that's again in the unlikely case that e.g. your central lines are all white and their immediate surroundings are black.

Assuming you search for dup fingerprints by comparing abs(i1-i2) to threshold, if there is a costlier way to encode some border and middle line contrast info into the least significant bits of an image fingerprint, I'm yet to see it. Quite probably, most of the time, this pointless info from the least meaningful 1% of the image simply doesn't matter during comparison, which is probably good.

By all accounts, you should've just summed quadrant pixels directly, without pointless blurring and a separate computation of the total average. At least, there are faster ways to compute exactly your fingerprint without the convolution that doesn't affect the result for 99% of pixels.
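Direct summation of the kind proposed here might look like the sketch below. It is an illustration, not the site's implementation: it assumes a grayscale image given as a 2D list at least 2x2 in size, splits it at half_x and half_y with no overlap or gap, and reports mean intensities for the whole image and each quadrant.

```python
def quadrant_fingerprint(gray):
    """gray: 2D list of intensities. Returns (average, nw, ne, sw, se) mean intensities."""
    h, w = len(gray), len(gray[0])
    half_y, half_x = h // 2, w // 2

    def mean(y0, y1, x0, x1):
        region_sum = sum(sum(row[x0:x1]) for row in gray[y0:y1])
        return region_sum / ((y1 - y0) * (x1 - x0))

    return (
        mean(0, h, 0, w),             # whole image
        mean(0, half_y, 0, half_x),   # north-west
        mean(0, half_y, half_x, w),   # north-east
        mean(half_y, h, 0, half_x),   # south-west
        mean(half_y, h, half_x, w),   # south-east
    )
```

With this split the overall average is fully determined by the four quadrant means and their areas, which is the redundancy pointed out above.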
byte[]

Admin
Site Developer
>I’ll take that silence as "No", which is a real real pity.
It was never added because those numbers are meaningless outside the context of the app. I can run the queries for you if you want.

I'd also like to add that I didn't write the current hash algorithm, and I have no training or experience whatsoever in matters of computer vision.
derpy727

@byte[]
>It was never added because those numbers are meaningless outside the context of the app.

Well, the usefulness of such an API is indisputable, I'd say. Now that you've generously shared the algorithm, it's frankly not rocket science to supply such an API with the necessary numbers. I'm sure people could use it to add metadata and/or organize pony pics. You (derpibooru) might even provide an official tool yourself.

And that aside, such a simple fingerprint could likely be computed on the client side with javascript, so if you were to integrate it into your reverse search page, most queries would just use the 5-numbers API and never even upload anything, with the upload fallback reserved for javascript-disabled users. You could also integrate this check into the image submission page, so that duplicate uploads would be stopped right after people select a file, before they even attempt to pointlessly populate tags, etc.

>I can run the queries for you if you want.

Oh, I might just take you up on that offer. Though my collection tends to grow at a rate of ~20 pics a day, and I surely can't ask you to run queries for me continuously, so I'd rather work towards a more permanent solution that also doesn't single me out as the only lucky pone, heh. I think being able to do mass queries could be useful to others too.

But even though I have already succeeded in generating ~800-byte 183x183 PNG pics that have a fingerprint within 0.1% of the original image's, unfortunately the server rejects them with code 500, and it appears to be just a minimum size restriction. I have to append zero bytes to the pics until they reach ~8KiB before they begin to work. And since the actual limit probably isn't the image resolution, like I thought, but the file size, maybe in reality I could even make do with resolutions smaller than 183x183 and hence images of less than 800 bytes. What an exercise in frustration. Oh well.

Since it appears the problem of smaller image requests can't be solved from my side, I guess I'll have to take what I can, and as for the rest — Derpibooru surely will be able to handle twenty 8KiB requests a day.

What format would you prefer me to provide my fingerprints in? A CSV with 5 columns of floats? We're talking ~23k images here, just in case. Doesn't sound like too much even for querying with 8KiB requests, now that I think of it.

Anyway, at the minimum, for every fingerprint I'd want a list of potentially matching posts' ids. In whatever format, as long as the correspondence between my fingerprints and post ids is kept, since pics of thumb_small size might be not enough to restore which was which.

>I'd also like to add that I didn't write the current hash algorithm, and I have no training or experience whatsoever in matters of computer vision.

Ah, sorry, I was using "you" in the general plural sense of "you, the derpibooru devs". Forgive me if that wasn't clear, I didn't even really consider if you personally wrote it or not, and certainly didn't intend to belittle anyone personally or collectively.

I do think that things could be done better, but any performance improvements to the current algo probably aren't worth the effort and the slightly less readable code, and to suggest another algorithm with confidence I'd need to know more about how well your current fingerprinting serves you today, test possible replacements for cost/benefit, etc, which I'm not in a position to do, so my code commentary was pointless, I guess.
derpy727

@byte[]
Ok, I've actually made it 6 columns. Headers present in file, LF line endings.

id, average_intensity, nw_intensity, ne_intensity, sw_intensity, se_intensity

I've added an id column so that you could just give me (my_id, found_post_id) pairs directly from SELECT.
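A file in the layout described (header row, six columns, LF line endings) could be produced along these lines. The column names are the ones listed above; the function name and the sample values are placeholders.

```python
import csv
import io

HEADER = ["id", "average_intensity", "nw_intensity",
          "ne_intensity", "sw_intensity", "se_intensity"]


def fingerprints_to_csv(rows) -> str:
    """rows: iterable of (id, average, nw, ne, sw, se). Returns CSV text with LF endings."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")  # csv defaults to CRLF otherwise
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buf.getvalue()
```

The explicit `lineterminator` matters because Python's csv writer emits CRLF by default.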

Download from here
derpy727

PS. Ah yes, while I'm at it, there's another funny thing I noticed about the hash algorithm.
As you can see, it basically computes the fingerprint for a grayscale version of the picture, by multiplying the channels by the human color perception constants (r*=0.2126; g*=0.7152; b*=0.0772) and summing the results.

The first minor nit is that it could be done by opencv itself, by loading images with CV_LOAD_IMAGE_GRAYSCALE, though it has slightly differing constants baked in (see documentation for cvtColor()), and though there might be minor precision advantages in loading 3 uint8 channels first, processing them separately and only then doing the conversion.

And secondly, more importantly, CvMat.load() returns color channels in a different order. It's BGR, instead of RGB. The current code fails to account for that and so multiplies R by the constant for B, and vice-versa. So it tends to underestimate the importance of red and overestimate the importance of blue by some ~3 times (which also means, that if someone uploads a grayscale version of an already uploaded pic, i.e. one passed through the correct conversion formula, it might not get detected as a duplicate).
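The effect of the channel mix-up can be shown on a single pixel. This is a sketch with hypothetical function names; the weights are copied as quoted in the post (the standard Rec. 709 blue weight is 0.0722), and pixels are plain (B, G, R) tuples rather than OpenCV matrices.

```python
# Weights as quoted in the post; the standard Rec. 709 blue weight is 0.0722.
R_W, G_W, B_W = 0.2126, 0.7152, 0.0772


def luma_from_bgr(pixel):
    """Correct handling: unpack a (B, G, R) pixel before applying the weights."""
    b, g, r = pixel
    return R_W * r + G_W * g + B_W * b


def luma_channels_swapped(pixel):
    """The bug described above: treating (B, G, R) data as if it were (R, G, B)."""
    r, g, b = pixel  # wrong on purpose: the first channel is actually blue
    return R_W * r + G_W * g + B_W * b


pure_red_bgr = (0, 0, 255)
```

For `pure_red_bgr` the swapped version yields 0.0772 * 255 instead of 0.2126 * 255, a factor of about 2.75, while a neutral gray pixel comes out the same either way.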
Clover the Clever
A Perfectly Normal Pony - <@CloverTheClever> I'd pay to see Carcer in a fursuit
Always Codes Drunk - It explains a lot
From the Night - I have technically banned myself a bunch of times...
Since the Beginning  - User number zero

Lord and Saviour
@derpy727
You are right on the BGR/RGB front — I actually fixed that way back when but apparently never committed the code. Doh. On the gaussian stuff, yeah, it's probably mostly redundant.

The perceptual dedupe code is mine, mostly, and was indeed intended to be very low CPU, reasonably robust, and err on the side of a low false-negative rate — it was always intended to be used to prevent accidental duplication of images on the site, not to provide reverse search. Since I wrote it years back, OpenCV's gotten better, CPU has gotten cheaper, and other libraries have come into existence.

A major issue with hamming distance based approaches is storage and search. The threshold based approach is trivial to store and search with extremely good efficiency. While there are some options for faster hamming space search these days, it's still an ongoing concern.
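One way to picture why the threshold approach is easy to search: sort on one component and every query becomes a range scan plus a cheap filter. The sketch below is an in-memory stand-in with made-up names; a database would presumably get the same effect from an ordinary index and range conditions, but that is an assumption, not a description of the site.

```python
import bisect


class ThresholdIndex:
    """Fingerprints sorted by their first component so a range scan narrows candidates."""

    def __init__(self, items):
        # items: iterable of (post_id, (average, nw, ne, sw, se))
        self.items = sorted(items, key=lambda it: it[1][0])
        self.keys = [fp[0] for _, fp in self.items]

    def query(self, fp, threshold):
        """Return post ids whose every component is within threshold of fp."""
        lo = bisect.bisect_left(self.keys, fp[0] - threshold)
        hi = bisect.bisect_right(self.keys, fp[0] + threshold)
        return [
            post_id
            for post_id, other in self.items[lo:hi]
            if all(abs(a - b) <= threshold for a, b in zip(fp, other))
        ]
```

Hamming-space search has no such one-dimensional ordering to lean on, which is why it needs structures like metric trees.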

If I were doing it again from scratch today then I would probably look into that in more detail, and use a locality sensitive hashing algorithm using a normalised image (probably 64x64) and its RGB intensities (rather than throwing the colour info away). I've also experimented with feature detection and summing as another input to LSH algorithms in the past (eg oriented BRIEF keypoint extraction, sum the count of keypoints per bucket for the normalized segmented image), which thus encodes some structural information about local contrast variance without resorting to local binary pattern analysis in the hashing algorithm and gains the process some robustness.
Background Pony #F764
@Princess Luna
I'm interested in that. It seems like "Galleries" are just the equivalent of FiMFiction bookshelves, or fav folders: a purely personal thing, editable only by one person, and viewable only via "include image" search. I would like to see something like a tag system: visible on the image and editable, with "First Prev [all] Next Last" links on the image page, regardless of "?q=". In short, something that makes reading comics and other sequential art not a pain in the ass.
Background Pony #B4DE
Feature request: block HTTP Referer on outside links
Exposing that someone likes MLP may be embarrassing. And considering that some images on this site are… questionable, this may be very embarrassing.

Solution: don't tell 3rd party sites who sends people to them.

How to achieve that:
noreferrer
Referrer policy
Referrer policy support in browsers
DoublePipe
Economist -
Not a Llama - Happy April Fools Day!

@Background Pony #C1A0
Thanks for the suggestion! We'll be adding a Referrer Policy with the "same-origin" value along with the other changes described in this thread. As 'Can I use' says, it is unfortunately not supported by all browsers, but hopefully having more sites use it will prompt quicker implementations.
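As an illustration of what sending such a policy involves, here is a generic WSGI middleware that adds the header to every response. It is a sketch only and says nothing about how the site itself is built; the function name is hypothetical, and `same-origin` is the value mentioned above.

```python
def referrer_policy_middleware(app, policy="same-origin"):
    """Wrap a WSGI app so every response carries a Referrer-Policy header."""

    def wrapped(environ, start_response):
        def start_with_policy(status, headers, exc_info=None):
            # Drop any existing value so the header is not sent twice.
            headers = [(k, v) for k, v in headers if k.lower() != "referrer-policy"]
            headers.append(("Referrer-Policy", policy))
            return start_response(status, headers, exc_info)

        return app(environ, start_with_policy)

    return wrapped
```

With `same-origin`, supporting browsers send the referrer only on requests to the same origin, so outside links receive none.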
Exedrus

@TheAnonShy
IIRC, last time it was brought up in IRC with the devs, they indicated they were waiting for the demographics for web browsers to change. Some browsers with issues supporting WebM were still used heavily enough to be a problem. It sounded like this was steadily becoming less of an issue, so this would become viable at some point, just not yet.

Note: I haven't been on IRC recently, so I'm not sure if things may have changed in the meantime.