Hobby-hacking Eric

2008-07-22

encodings-aware hex editor

Here's another coding-project idea: I would like to see a hex editor that knows how to display characters in encodings other than ASCII (specifically, I want to debug messed-up UTF-8 text files).

Neither Google nor apt-cache search turns up such an editor, at least not in the free/open-source world or among Linux and Mac OS X freeware. On Debian-based systems, there are a couple that handle some Japanese encodings, but nothing that deals with UTF-8.

Likely features:
  • toggle between an ASCII-only mode and a show-as-UTF-8 mode
  • good UI for the fact that UTF-8 characters have a variable length in bytes
  • graceful handling of encoding errors
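
The last two bullets are the fiddly part. As a rough illustration (a minimal Python sketch, not taken from any existing tool), one way to get both is to lean on a strict decoder and use the position it reports when it fails. The helper name `split_utf8` is made up for this example:

```python
def split_utf8(data: bytes):
    """Split data into (raw_bytes, char) pairs; char is None for invalid bytes."""
    chunks = []
    pos = 0
    while pos < len(data):
        rest = data[pos:]
        try:
            text, bad = rest.decode("utf-8"), b""
        except UnicodeDecodeError as err:
            # err.start and err.end delimit the offending bytes within `rest`.
            text = rest[:err.start].decode("utf-8")
            bad = rest[err.start:err.end]
        for ch in text:
            raw = ch.encode("utf-8")  # 1 to 4 bytes, depending on the character
            chunks.append((raw, ch))
            pos += len(raw)
        if bad:
            # Keep the undecodable bytes as their own chunk instead of giving up.
            chunks.append((bad, None))
            pos += len(bad)
    return chunks


if __name__ == "__main__":
    sample = b"bli\xcb?k \xcb\x88d"
    for raw, ch in split_utf8(sample):
        print(raw.hex(), repr(ch))
```

Keeping the raw bytes next to each decoded character is what would let a UI highlight a whole multi-byte character at once, and the `None` entries mark exactly where the encoding went wrong.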


Haskellers could possibly do this as a part (plugin?) of Yi, or maybe just a completely standalone product.

And if you want a slightly simpler project, a UTF-8 hex dumper would be good. Hmmph... come to think of it, maybe it would have been more productive to just go write that instead of this blog post.

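Edit: Well, I went ahead and made a stupid little dumper for my needs. Here is the output on some sample corrupted UTF-8. The bytes of each character are grouped together in the hex column, and bytes that don't decode are shown between « and » on the right: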
20 28 5b 47 65 6f 72 67 69 61 6e              ([Georgian
3a 20 e183a1 e183 3f e183a5 e183             : ს«e1 83»?ქ«e1 83»
3f e183 20 e18397 e18395 e18394 e1839a e183  ?«e1 83» თველ«e1 83»
3f 5d 0a                                     ?]
20 28 5b 47 65 72 6d 61 6e 3a 20              ([German:
44 65 75 74 73 63 68 6c 61 6e 64             Deutschland
5d 20 5b 49 50 41 3a 20 cb88 64 c994         ] [IPA: ˈdɔ
c9aa 74 ca83 6c 61 6e 74 5d 29 2c 20         ɪtʃlant]),
6f 66 66 69 63 69 61 6c 6c 79 20             officially
74 68 65 20 46 65 64 65 72 61 6c             the Federal
20 52 65 70 75 62 6c 69 63 20 6f              Republic o
66 20 47 65 72 6d 61 6e 79 20 28             f Germany (
42 75 6e 64 65 73 72 65 70 75 62             Bundesrepub
6c 69 6b 20 44 65 75 74 73 63 68             lik Deutsch
6c 61 6e 64 2c 20 5b 49 50 41 3a             land, [IPA:
20 cb88 62 ca8a 6e 64 c999 73 72 65 70        ˈbʊndəsrep
75 62 6c 69 cb 3f 6b 20 cb88 64              ubli«cb»?k ˈd
c994 c9aa 74 ca83 6c 61 6e 74 5d 29 2c       ɔɪtʃlant]),
20 69 73 20 61 20 63 6f 75 6e 74              is a count
72 79 20 69 6e 20 43 65 6e 74 72             ry in Centr
61 6c 20 45 75 72 6f 70 65 2e 20             al Europe.
0a
The highlighting was done by hand. I should probably go figure out how to colourise the corrupted characters. Or maybe I should just go ahead and package this and put it up on Hackage? Make it available via darcs? I would need a decent name. So far, I have hexy-xxy and hexdump-utf8, neither of which is that great :-/
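
On a terminal, colourising mostly comes down to wrapping the corrupted bytes in ANSI escape sequences. What follows is a rough Python sketch of that idea, not the dumper's actual code. It assumes the input has already been split into (raw bytes, character) pairs, with None standing in for bytes that don't decode. The names `format_row`, `RED` and `RESET` and the default column width are invented for the example:

```python
RED = "\x1b[31m"   # ANSI escape: red foreground
RESET = "\x1b[0m"  # ANSI escape: back to the default colours


def format_row(chunks, colour=True, width=43):
    """Render one dump row from (raw_bytes, char_or_None) pairs."""
    hex_cells, text_cells = [], []
    for raw, ch in chunks:
        cell = raw.hex()  # all bytes of one character glued together
        if ch is None:
            # Undecodable bytes: show them as «e1 83» in the text column.
            shown = "«" + " ".join(f"{b:02x}" for b in raw) + "»"
            if colour:
                cell = RED + cell + RESET
                shown = RED + shown + RESET
        else:
            # Hide control characters such as the newline (a choice of this sketch).
            shown = ch if ch.isprintable() else ""
        hex_cells.append(cell)
        text_cells.append(shown)
    # Pad by visible width, since the escape codes take up no space on screen.
    visible = sum(2 * len(raw) for raw, _ in chunks) + max(len(chunks) - 1, 0)
    padding = " " * max(width - visible, 0)
    return " ".join(hex_cells) + padding + "  " + "".join(text_cells)


if __name__ == "__main__":
    row = [(b"u", "u"), (b"b", "b"), (b"l", "l"), (b"i", "i"),
           (b"\xcb", None), (b"?", "?"), (b"k", "k")]
    print(format_row(row))
```

Passing `colour=False` gives plain output, which is probably what you want when the dump is redirected to a file.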


1 comment:

kowey said...

I ended up going with uhexdump, for what it's worth