As for misspellings and the like, I would put the raw text first, followed by the correction in brackets: "he went 2 (to) the shpo (shop)", and so on. The same convention can also be applied to translations.
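To show how that bracket convention could also feed search, here is a rough sketch that splits an annotated transcript into its raw and corrected readings. The function name `split_transcript` and the exact pattern are my own assumptions, not an existing feature. It also assumes every "word (something)" pair is a correction, so genuine parentheses in the image text would be misread.

```python
import re

# Assumed convention: a raw token, one space, then the correction in parentheses.
_ANNOTATION = re.compile(r"(\S+) \(([^()]+)\)")


def split_transcript(annotated: str) -> tuple[str, str]:
    """Return (raw, corrected) readings of an annotated transcript."""
    # Keep the token as written and drop the bracketed correction.
    raw = _ANNOTATION.sub(lambda m: m.group(1), annotated)
    # Swap each token for its bracketed correction.
    corrected = _ANNOTATION.sub(lambda m: m.group(2), annotated)
    return raw, corrected


raw, corrected = split_transcript("he went 2 (to) the shpo (shop)")
print(raw)        # he went 2 the shpo
print(corrected)  # he went to the shop
```

Indexing both readings would let a search for "shop" find the image while the transcript still shows what was actually written.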
I also support the idea of overlaid translations, though I appreciate that's an entire system to implement. A user-submitted transcript, by contrast, is no different from the description box; the only extra work is the search functionality.
There also needs to be a checkbox to confirm that an image has no text to transcribe. That way, a search for images missing a transcript only turns up the ones that still need one submitted.
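A minimal sketch of why the checkbox matters for that search, assuming a made-up `Image` record with a `transcript` field and a `no_text` flag (both names are hypothetical, not the site's actual schema):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Image:
    id: int
    transcript: Optional[str] = None  # user-submitted transcript, if any
    no_text: bool = False             # the "nothing to transcribe" checkbox


def needs_transcript(image: Image) -> bool:
    """True if the image has no transcript and is not marked as text-free."""
    has_transcript = bool(image.transcript and image.transcript.strip())
    return not has_transcript and not image.no_text


def missing_transcripts(images: list[Image]) -> list[int]:
    """IDs of images that still need someone to submit a transcript."""
    return [img.id for img in images if needs_transcript(img)]


images = [
    Image(1, transcript="he went 2 (to) the shpo (shop)"),
    Image(2),                # nothing submitted yet
    Image(3, no_text=True),  # confirmed to contain no text
    Image(4, transcript="   "),  # blank submission counts as missing
]
print(missing_transcripts(images))  # [2, 4]
```

Without the flag, image 3 would sit in the "missing" results forever, which is the problem the checkbox is meant to solve.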