derpy727
Feature Request: Provide REAL SHA512 for Derpibooru-optimised versions of images in Search/JSON API
I was surprised to discover, and then to have my suspicion confirmed on IRC, that, contrary to what one naturally expects from reading the Search Syntax documentation, the sha512_hash field always equals the orig_sha512_hash field, which is the SHA512 of the submitted image before the image optimisation pass.
In other words, if one downloads an image from Derpibooru, its SHA512 doesn’t always match what the JSON says! One example is this post: >>1360613 (deleted)
What this means from a practical standpoint:
- The natural expectations of JSON API users are violated. The very first downloader script I found checks the files it has downloaded against the sha512_hash field.
- There’s literally no way to find the original Derpibooru page for an image downloaded from Derpibooru other than the heavyweight reverse image search (which doesn’t even have a JSON API). What could have been a lightweight /search.json?q=sha512_hash:<the_hash> GET query becomes an upload of the full picture followed by a server-side similarity computation, which is quite probably why it is also severely rate-limited by the Derpibooru admins (I’ve never tried, because I don’t want to stress a site I like with unnecessary load).
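To make the two points above concrete, here is a minimal Python sketch of the check such a downloader script performs, and of the lightweight query I have in mind. It is only an illustration under assumptions: the base URL and the helper names (sha512_hex, matches_json_hash, hash_search_url) are mine, the only things taken from the above are the sha512_hash field and the /search.json?q=sha512_hash:<the_hash> form, and nothing here actually talks to the site.

```python
import hashlib
from urllib.parse import urlencode

# Assumption: the site's base URL; adjust as needed.
BASE_URL = "https://derpibooru.org"


def sha512_hex(data: bytes) -> str:
    """Return the SHA512 of the given bytes as a lowercase hex string."""
    return hashlib.sha512(data).hexdigest()


def matches_json_hash(data: bytes, image_json: dict) -> bool:
    """Compare downloaded bytes against the sha512_hash field of an image's JSON.

    With the current behaviour this returns False whenever the served file
    was changed by the optimisation pass.
    """
    return sha512_hex(data) == image_json["sha512_hash"].lower()


def hash_search_url(digest: str) -> str:
    """Build the lightweight GET query URL for a given hex digest."""
    return f"{BASE_URL}/search.json?" + urlencode({"q": f"sha512_hash:{digest}"})
```

A script would call hash_search_url(sha512_hex(file_bytes)) and issue a single GET, instead of uploading the whole picture to reverse search.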
I’m not even sure what the use case is for hashes of original submissions alone. The majority of the images I encounter in the wild come from Derpibooru (or from some secondary site, where they were mangled by that site’s own compression and so are beyond searching, whether by the original or the Derpibooru-optimised SHA, which is why that case is out of scope for this discussion). Does anyone expect some user to go to, say, DeviantArt, download a pony image, compute its SHA, and search with it on Derpibooru… for what exactly? If they got the original image directly from the author, it probably came with all the info they could possibly need.
Maybe you’re using it to detect duplicate submissions… no, scratch that. Here’s an example (marginally NSFW) of duplicate images with literally the same SHA that nobody cares about. But if you did care, the SHA512 of post-optimisation images would be even BETTER for detecting duplicates, because it wouldn’t be affected by the various metadata, the bytes appended by scripts that defeat imageboard duplicate-image checks, and so on, all of which are stripped by the optimisation pass.
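A toy Python illustration of that last point. Everything in it is made up for the demo: the "image" is just a byte string, and toy_optimise is a hypothetical stand-in for the real optimisation pass (which I know nothing about internally) that simply drops whatever follows an end marker. It only shows why hashing after such a pass would ignore appended junk, assuming the pass is deterministic.

```python
import hashlib

END_MARKER = b"IEND"  # toy end-of-image marker, purely illustrative


def sha512_hex(data: bytes) -> str:
    """Return the SHA512 of the given bytes as a hex string."""
    return hashlib.sha512(data).hexdigest()


def toy_optimise(data: bytes) -> bytes:
    """Hypothetical stand-in for an optimisation pass: drop trailing bytes."""
    end = data.find(END_MARKER)
    if end == -1:
        return data  # no marker found, leave the data untouched
    return data[: end + len(END_MARKER)]


original = b"toy-image-pixels" + END_MARKER
padded = original + b"\x00anti-duplicate-junk"  # same picture, extra bytes

# Hashes of the raw submissions differ because of the appended bytes.
print(sha512_hex(original) == sha512_hex(padded))  # False
# Hashes taken after the (toy) optimisation pass agree.
print(sha512_hex(toy_optimise(original)) == sha512_hex(toy_optimise(padded)))  # True
```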
I literally fail to see why you do not use internally, and/or provide via the JSON API, the hashes of the images that you actually serve to your users. The sha512_hash field is already described in the Search Syntax docs as if it were exactly that counterpart of the orig_sha512_hash field, so why not actually make it so?
My own need for this is that I have a bunch of images I collected over time from various sources, and I wanted to organise them a bit by fetching their tags from Derpibooru and moving them to the respective folders. Fetching by sha512_hash seemed like the perfect match for this task… if only it weren’t for this deficiency.
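Roughly, the workflow I had in mind looks like the Python sketch below. It is only a sketch under assumptions: lookup_tags is a hypothetical callable that would wrap the hash query and return the image's tags (or None when nothing matches), and using the first tag as the folder name is an arbitrary choice for the example.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Callable, Dict, List, Optional


def file_sha512(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images are not loaded into memory at once."""
    digest = hashlib.sha512()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def organise(
    collection: Path,
    dest: Path,
    lookup_tags: Callable[[str], Optional[List[str]]],
) -> Dict[str, Optional[str]]:
    """Move each file whose hash is known into dest/<first tag>/.

    lookup_tags is hypothetical: it maps a hex digest to a list of tags,
    or None when the hash is unknown. Unknown files are left where they are.
    Returns a mapping of file name to chosen folder name (or None).
    """
    result: Dict[str, Optional[str]] = {}
    # Snapshot the file list first, since files are moved during the loop.
    for path in sorted(p for p in collection.iterdir() if p.is_file()):
        tags = lookup_tags(file_sha512(path))
        if not tags:
            result[path.name] = None
            continue
        folder = dest / tags[0]
        folder.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(folder / path.name))
        result[path.name] = tags[0]
    return result
```

With the field behaving as it does now, lookup_tags would come back empty for every file that went through the optimisation pass, which is the whole problem.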
I honestly do not want to jump through hoops like uploading my whole collection to Derpibooru’s reverse image search via POST requests. And since I’m not quite confident in Derpibooru’s reverse search algorithm (which one is it, btw?), that would probably be followed by downloading the images from all the search results, so that my own image similarity algorithm could determine whether I really got the tags for the image I wanted (by, at least, comparing the SHAs, lol). I would, of course, rate-limit it heavily, but I still hate to think about this unnecessary stress on the site (and about the time it would take to complete).