We were discussing on the Haskell fr what it would take to work with UTF-8.  Here is an example I cobbled together.  Do whatever you want; it's going in the public domain.  Note that I've also posted it to the wiki, so please make corrections 
there if you can.
Note that I don't really know what the best practices are wrt to reading and writing UTF-8, but here's what works for me.
> module Main where
 
> import Control.Monad (mapM_)
> import Data.Word (Word8)
> import Foreign.Marshal.Array (allocaArray, peekArray, pokeArray)
> import System.Environment (getArgs)
> import System.IO (hFileSize, Handle, hGetBuf, hPutBuf, openBinaryFile,
>                   IOMode(ReadMode, WriteMode))
We're going to be using the 2002 
UTF-8 implementation by Sven Moritz Hallberg. I don't know if this is the best one, but it's what darcs uses. (I do wonder though, if it's worth turning it into a little library, something like Data.Char.UTF8)
> import UTF8
We perform the demonstration on a list of files, specified as command line arguments. What we want to show is that we can both read and write UTF-8, so the demonstration will be of reading a file in, reverse every one of its lines, and writing it back out with the extension '.reversed'
> main :: IO ()
> main =
>  do args <- getArgs
>     mapM_ reverseUTF8File args
 
> reverseUTF8File :: FilePath -> IO ()
> reverseUTF8File f =
>   do fb <- readFileBytes f
>      case decode fb of
>        (cs, []) -> writeFileBytes (f ++ ".reverse") $ encode $ reverseLines cs
>        (_,  xs) -> fail $ show xs
>   where
>     reverseLines = unlines . map reverse . lines
For this to work, we need to have some helper functions for reading and writing [Word8]. I don't know if this is the right way to go about it.
> readFileBytes :: FilePath -> IO [Word8]
> readFileBytes f =
>   do h <- openBinaryFile f ReadMode
>      hsize <- fromIntegral `fmap` hFileSize h
>      hGetBytes h hsize
>
> writeFileBytes :: FilePath -> [Word8] -> IO ()
> writeFileBytes f ws =
>  do h <- openBinaryFile f WriteMode
>     hPutBytes h (length ws) ws
 
> hGetBytes :: Handle -> Int -> IO [Word8]
> hGetBytes h c = allocaArray c $ \p ->
>                   do c' <- hGetBuf h p c
>                      peekArray c' p
>
> hPutBytes :: Handle -> Int -> [Word8] -> IO ()
> hPutBytes h c ws = allocaArray c $ \p ->
>                      do pokeArray p ws
>                         hPutBuf h p c
We were discussing on the Haskell fr what it would take to work with UTF-8.  Here is an example I cobbled together.  Do whatever you want; it's going in the public domain.  Note that I've also posted it to the wiki, so please make corrections 
there if you can.
Note that I don't really know what the best practices are wrt to reading and writing UTF-8, but here's what works for me.
> module Main where
 
> import Control.Monad (mapM_)
> import Data.Word (Word8)
> import Foreign.Marshal.Array (allocaArray, peekArray, pokeArray)
> import System.Environment (getArgs)
> import System.IO (hFileSize, Handle, hGetBuf, hPutBuf, openBinaryFile,
>                   IOMode(ReadMode, WriteMode))
We're going to be using the 2002 
UTF-8 implementation by Sven Moritz Hallberg. I don't know if this is the best one, but it's what darcs uses. (I do wonder though, if it's worth turning it into a little library, something like Data.Char.UTF8)
> import UTF8
We perform the demonstration on a list of files, specified as command line arguments. What we want to show is that we can both read and write UTF-8, so the demonstration will be of reading a file in, reverse every one of its lines, and writing it back out with the extension '.reversed'
> main :: IO ()
> main =
>  do args <- getArgs
>     mapM_ reverseUTF8File args
 
> reverseUTF8File :: FilePath -> IO ()
> reverseUTF8File f =
>   do fb <- readFileBytes f
>      case decode fb of
>        (cs, []) -> writeFileBytes (f ++ ".reverse") $ encode $ reverseLines cs
>        (_,  xs) -> fail $ show xs
>   where
>     reverseLines = unlines . map reverse . lines
For this to work, we need to have some helper functions for reading and writing [Word8]. I don't know if this is the right way to go about it.
> readFileBytes :: FilePath -> IO [Word8]
> readFileBytes f =
>   do h <- openBinaryFile f ReadMode
>      hsize <- fromIntegral `fmap` hFileSize h
>      hGetBytes h hsize
>
> writeFileBytes :: FilePath -> [Word8] -> IO ()
> writeFileBytes f ws =
>  do h <- openBinaryFile f WriteMode
>     hPutBytes h (length ws) ws
 
> hGetBytes :: Handle -> Int -> IO [Word8]
> hGetBytes h c = allocaArray c $ \p ->
>                   do c' <- hGetBuf h p c
>                      peekArray c' p
>
> hPutBytes :: Handle -> Int -> [Word8] -> IO ()
> hPutBytes h c ws = allocaArray c $ \p ->
>                      do pokeArray p ws
>                         hPutBuf h p c
reading and writing UTF-8 in Haskell
2 comments:
"I do wonder though, if it's worth turning it into a little library, something like Data.Char.UTF8"
It's always worth taking your code and splitting it up into libraries! It gives back to the community (whose libraries you are using), it encourages you to design a clean and consistent abstraction to the library - which usually results in better code in your application as well.
It probably won't end up in other people contributing (unless you are lucky), but it will make other people think you are cool :)
Thanks for the encouragement. I mentioned this to pesco, the author of the UTF-8 library, and he says that he's been thinking of doing exactly this. Yay!
Post a Comment