Hobby-hacking Eric

2007-02-02

reading and writing UTF-8 in Haskell

We were discussing on the Haskell fr what it would take to work with UTF-8. Here is an example I cobbled together. Do whatever you want; it's going in the public domain. Note that I've also posted it to the wiki, so please make corrections there if you can.

Note that I don't really know what the best practices are wrt to reading and writing UTF-8, but here's what works for me.


> module Main where

> import Control.Monad (mapM_)
> import Data.Word (Word8)
> import Foreign.Marshal.Array (allocaArray, peekArray, pokeArray)
> import System.Environment (getArgs)
> import System.IO (hFileSize, Handle, hGetBuf, hPutBuf, openBinaryFile,
> IOMode(ReadMode, WriteMode))


We're going to be using the 2002 UTF-8 implementation by Sven Moritz Hallberg. I don't know if this is the best one, but it's what darcs uses. (I do wonder though, if it's worth turning it into a little library, something like Data.Char.UTF8)


> import UTF8


We perform the demonstration on a list of files, specified as command line arguments. What we want to show is that we can both read and write UTF-8, so the demonstration will be of reading a file in, reverse every one of its lines, and writing it back out with the extension '.reversed'


> main :: IO ()
> main =
> do args <- getArgs
> mapM_ reverseUTF8File args

> reverseUTF8File :: FilePath -> IO ()
> reverseUTF8File f =
> do fb <- readFileBytes f
> case decode fb of
> (cs, []) -> writeFileBytes (f ++ ".reverse") $ encode $ reverseLines cs
> (_, xs) -> fail $ show xs
> where
> reverseLines = unlines . map reverse . lines


For this to work, we need to have some helper functions for reading and writing [Word8]. I don't know if this is the right way to go about it.


> readFileBytes :: FilePath -> IO [Word8]
> readFileBytes f =
> do h <- openBinaryFile f ReadMode
> hsize <- fromIntegral `fmap` hFileSize h
> hGetBytes h hsize
>
> writeFileBytes :: FilePath -> [Word8] -> IO ()
> writeFileBytes f ws =
> do h <- openBinaryFile f WriteMode
> hPutBytes h (length ws) ws

> hGetBytes :: Handle -> Int -> IO [Word8]
> hGetBytes h c = allocaArray c $ \p ->
> do c' <- hGetBuf h p c
> peekArray c' p
>
> hPutBytes :: Handle -> Int -> [Word8] -> IO ()
> hPutBytes h c ws = allocaArray c $ \p ->
> do pokeArray p ws
> hPutBuf h p c


2 comments:

Neil Mitchell said...

"I do wonder though, if it's worth turning it into a little library, something like Data.Char.UTF8"

It's always worth taking your code and splitting it up into libraries! It gives back to the community (whose libraries you are using), it encourages you to design a clean and consistent abstraction to the library - which usually results in better code in your application as well.

It probably won't end up in other people contributing (unless you are lucky), but it will make other people think you are cool :)

kowey said...

Thanks for the encouragement. I mentioned this to pesco, the author of the UTF-8 library, and he says that he's been thinking of doing exactly this. Yay!