package uuuu
Sources
sha256=6b7ee8f3e343813b0c6ac8ddb7f6720b2ccd27b4208313d3bcff5d7d984fc3a6
sha512=4b26676fe809d2aba74614ab739315aa7b0f08469ff97400e05efba05e6e3fed4edf3ce5ed3b920e438f46f6b344f81c30a90534b27a8b09f4f0c69c29f68cc3
Description
A simple mapper from ISO-8859-* to Unicode. Useful for translating ISO-8859-* input to Unicode.
Published: 31 Dec 2021
README
Uuuu
Uhuhuhuhuhuh! uuuu (Universal Unifier to Unicode Un OCaml) is a small library that normalizes ISO-8859 input to Unicode code points. It uses the mapping tables provided by the Unicode Consortium.
This project converts those tables to OCaml code, then provides a non-blocking, best-effort decoder that translates ISO-8859 bytes into Unicode code points.
How to use it?
uuuu has an interface in the style of dbuenzli's libraries such as uutf, so it should be easy to use and to build on. uuuu has a simple goal: to offer a general way to decode ISO-8859 input and normalize it to Unicode code points. The decoder must keep memory consumption under control and must never block. Finally, a decoding error should not stop the decoding process.
Here is a small example that uses uutf to translate Latin-1 input to UTF-8:
let trans ic oc =
  let decoder = Uuuu.decoder (Uuuu.encoding_of_string "latin1") (`Channel ic) in
  let encoder = Uutf.encoder `UTF_8 (`Channel oc) in
  let rec go () = match Uuuu.decode decoder with
    | `Await -> assert false (* XXX(dinosaure): impossible when you use `String or `Channel as source. *)
    | `Uchar _ as uchar -> ignore @@ Uutf.encode encoder uchar ; go ()
    | `End -> ignore @@ Uutf.encode encoder `End
    | `Malformed err -> failwith err in
  go ()
let () = trans stdin stdout
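The example above raises on the first `Malformed value. To illustrate the best-effort idea (an error is reported but decoding continues), here is a minimal stdlib-only sketch. It does not use uuuu's API; the names decode_best_effort and toy_table are hypothetical, and the choice of U+FFFD as a replacement is an assumption made for illustration.

```ocaml
(* Stdlib-only sketch, not uuuu's implementation.
   [table] maps each byte 0..255 to a code point, or -1 when unmapped. *)
let decode_best_effort (table : int array) (input : string) : string * int =
  let buf = Buffer.create (String.length input) in
  let errors = ref 0 in
  String.iter
    (fun c ->
      match table.(Char.code c) with
      | -1 ->
          (* Unmapped byte: count it, emit U+FFFD, and keep going. *)
          incr errors;
          Buffer.add_utf_8_uchar buf Uchar.rep
      | cp -> Buffer.add_utf_8_uchar buf (Uchar.of_int cp))
    input;
  (Buffer.contents buf, !errors)

(* Toy table (an assumption for the demo): identity mapping, as in
   ISO-8859-1, with byte 0xA5 deliberately marked as unmapped. *)
let toy_table : int array =
  Array.init 256 (fun b -> if b = 0xA5 then -1 else b)
```

The function returns the UTF-8 output together with the number of bytes that had no mapping, so the caller can decide afterwards whether the errors matter.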
About encoding_of_string
uuuu accepts the aliases listed in the IANA character sets database: https://www.iana.org/assignments/character-sets.xhtml
Any other alias raises an exception. This function is case-insensitive.
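As an illustration of case-insensitive alias resolution, the sketch below maps a few IANA aliases to a charset value. It is not uuuu's code: the type charset, the function charset_of_string, and the use of Invalid_argument are assumptions made for this example, and only a handful of aliases are listed.

```ocaml
(* Hypothetical helper, not part of uuuu. *)
type charset = Latin1 | Latin9

let charset_of_string (name : string) : charset =
  (* Lowercase first so that the lookup ignores case. *)
  match String.lowercase_ascii name with
  | "iso_8859-1:1987" | "iso-ir-100" | "iso_8859-1" | "iso-8859-1"
  | "latin1" | "l1" | "ibm819" | "cp819" | "csisolatin1" -> Latin1
  | "iso-8859-15" | "iso_8859-15" | "latin-9" -> Latin9
  | other -> invalid_arg ("unknown charset alias: " ^ other)
```

With this sketch, charset_of_string "LATIN1" and charset_of_string "iso-8859-1" both yield Latin1, while an unlisted name such as "utf-8" raises Invalid_argument.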
About translation tables
uuuu integrates the translation tables provided by the Unicode Consortium. They are not expected to change, so we store them statically in an int array.
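To make the idea of a static table concrete, here is a small sketch of a byte-indexed int array for ISO-8859-15. The layout used inside uuuu may differ; the names iso_8859_15 and code_point_of_byte are hypothetical. ISO-8859-15 matches ISO-8859-1 except at the eight positions overridden below.

```ocaml
(* Sketch only: index = input byte, value = Unicode code point. *)
let iso_8859_15 : int array =
  (* Start from the identity mapping shared with ISO-8859-1. *)
  let t = Array.init 256 (fun b -> b) in
  (* Override the eight positions where ISO-8859-15 differs. *)
  List.iter
    (fun (byte, cp) -> t.(byte) <- cp)
    [ (0xA4, 0x20AC); (0xA6, 0x0160); (0xA8, 0x0161); (0xB4, 0x017D);
      (0xB8, 0x017E); (0xBC, 0x0152); (0xBD, 0x0153); (0xBE, 0x0178) ];
  t

(* A lookup is a single array access. *)
let code_point_of_byte (c : char) : Uchar.t =
  Uchar.of_int iso_8859_15.(Char.code c)
```

A table of 256 integers per encoding keeps lookups constant-time and needs no allocation while decoding.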
About encoding
uuuu only supports decoding to Unicode code points. Encoding support is not planned, since people should only use Unicode nowadays.
A larger decoder
uuuu is part of a larger project, rosetta, which is a decoder for several other encodings. If you need to handle more encodings than ISO-8859, you should look at that higher-level library.
Distribution
uuuu ships a small binary, uuuu.to_utf8, that translates an ISO-8859 stream to UTF-8. It is provided as an example of how to use uuuu with uutf.