Hobby-hacking Eric

2008-07-21

simply reading and writing UTF-8 in Haskell

A year and a half ago, I posted what seemed to be the simplest recipe for reading and writing UTF-8 in Haskell. In this post, I will provide an even simpler recipe, made possible by Eric Mertens' utf8-string package.

For those who are not familiar with Haskell, its internal representation for characters is Unicode, but for IO it effectively assumes that it is reading and writing in the ISO-8859-1 format. This used to be annoying for those of us who wanted to work with the UTF-8 encoding, but now there is a very simple solution, perfect for those of us who don't want to think too much and just want to get the job done.
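To make the mismatch concrete, here is a small sketch, assuming the utf8-string package is installed. It uses encode and decode from Codec.Binary.UTF8.String to show that a single Haskell Char can occupy more than one byte once it is written out as UTF-8. The name sample is just an illustration, not something from the package.

```haskell
import Codec.Binary.UTF8.String (encode, decode)
import Data.Word (Word8)

-- One Unicode character: U+00E9 (e with an acute accent).
-- "sample" is a hypothetical name used only for this sketch.
sample :: String
sample = "\233"

main :: IO ()
main = do
  let bytes = encode sample :: [Word8]
  print (length sample)           -- 1: one character inside Haskell
  print bytes                     -- [195,169]: two bytes in UTF-8
  print (decode bytes == sample)  -- True: decoding recovers the character
```

An IO layer that treats every byte as one ISO-8859-1 character would read those two bytes back as two separate characters, which is exactly the annoyance described above.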

the example


The sample problem from my last post was to take a UTF-8 encoded file as input, reverse each of its lines, and write the results to a new file with the same name plus a ".rev" extension. The solution might be self-explanatory if you are used to Haskell, but I will make some minor comments below, just in case.

import System.IO.UTF8
import Prelude hiding (readFile, writeFile)
import System.Environment (getArgs)

main =
 do args <- getArgs
    mapM_ reverseUTF8File args

reverseUTF8File f =
 do c <- readFile f
    writeFile (f ++ ".rev") $ reverseLines c

reverseLines = unlines . map reverse . lines

In the above code, we use some drop-in replacements for some System.IO functions. Some of these functions are also provided in the Prelude, so we must hide them so that they do not overlap with what we import. (Alternatively, we could import the UTF-8 ones qualified, which could be handy in contexts where we want the option of reading and writing in UTF-8 without committing to it). The rest is straightforward. Notice that we do not jump through any hoops whatsoever. In fact, you can pretty much take any pre-existing Haskell program that you have written and turn it into a UTF-8 version by changing the import statements.
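For the qualified-import alternative, a minimal sketch might look like the following. It assumes a release of utf8-string that still ships the System.IO.UTF8 module (as far as I know, later releases may no longer include it), and the alias U is an arbitrary choice of mine.

```haskell
-- Assumes a utf8-string release that provides System.IO.UTF8.
import qualified System.IO.UTF8 as U  -- "U" is an arbitrary alias
import System.Environment (getArgs)

main :: IO ()
main = do
  args <- getArgs
  mapM_ reverseUTF8File args

-- Read a file as UTF-8 and write the reversed lines next to it.
reverseUTF8File :: FilePath -> IO ()
reverseUTF8File f = do
  c <- U.readFile f
  U.writeFile (f ++ ".rev") (reverseLines c)

-- Pure helper: reverse the characters of each line.
reverseLines :: String -> String
reverseLines = unlines . map reverse . lines
```

With this layout nothing from the Prelude needs hiding, and the plain readFile and writeFile stay available for files that should not be treated as UTF-8.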

Here are the results of running this script on a UTF-8 sampler:
)udrU( یتوہ ںیہن فیلکت ےھجم روا ںوہ اتکس اھک چناک ںیم 
)othsaP( يوږوخ هن ام هغه ،مش ېلړوخ هشيش هز
)naeroK(요아않 지프아 도래그 .요어있 수 을먹 를리유 는나
)keerG( .ατοπίτ ωθάπ αν ςίρωχ άιλαυγ ανέμσαπσ ωάφ αν ώροπΜ
)cidnalecI / aksnelsÍ( .gim aðiem ða sseþ ná relg ðite teg gÉ
)hsiloP( .izdokzs ein im i ,ołkzs ćśej ęgoM
)nainamoR( .etșenăr ăm un ae iș ălcits cnânăm ăs toP
)nainiarkU( .ьтидокшоп ен інем онов й ,олкш итсї ужом Я
)nainemrA( ։րենըչ տսիգնահնա իծնի և լետւո իկապա մանրԿ
)naigroeG( .ავიკტმ არა ად მაჭვ სანიმ
)idniH( .तह हन डप ईक स सउ झम ,ह तकस ख चक म
)werbeH( .יל קיזמ אל הזו תיכוכז לוכאל לוכי ינא
)hsiddiY( .ײװ טשינ רימ טוט סע ןוא זאלג ןסע ןעק ךיא
)cibarA( .ينملؤي ل اذه و جاجزلا لكأ ىلع رداق انأ
)esenapaJ( 。んせまけつ傷を私はれそ。すまれらべ食をスラ
)iahT( บจเนฉหใำทมไนมตแ ดไกจะรกนกนฉ
)slobmys ycnerruc( ₯·₮·₭·₫·₪·₩·₨·₧·₦·₥·₤·₣·₢·₡·¢·$·€·£·¥


The utf8-string package is available on HackageDB. Thanks to Eric M. for providing this little wrapper! It's a perfect example of the kind of thing which seems obvious... after somebody else has thought to do it.


5 comments:

Eric said...

I'm glad you've found the library useful!

Alexander Strange said...

The Japanese sentence is missing the first character. I hope that's not a bug?

falcon said...

As far as I can tell, the Urdu, Pashto, Persian and Arabic sentences are basically gibberish. Perhaps the source of these sentences is like that? Any way, I love any post that contains Haskell and Urdu :)

Bill Mill said...

Blogger's RSS feed leaves out the < signs from the "<-" operators, which makes the code sample extremely confusing when read over RSS.

I came here completely humbled in my rudimentary ability to read Haskell, ready to ask where "args" came from. I hate blogger!
