markup 0.6 · OCaml Package

Error-recovering functional HTML and XML parsers and writers.

Markup.ml is an HTML and XML parsing and serialization library. It:

Is error-recovering, so you can get a best-effort parse of malformed input.
Reports all errors before recovery, so you can get strict parsing instead.
Is based on the XML grammar and HTML parser from the respective specifications.
Accepts document fragments, but can be told to accept only full documents.
Detects character encodings automatically.
Supports both simple synchronous (this module) and non-blocking usage (Markup_lwt).
Is streaming and lazy. Partial input is processed as soon as received, but only as needed.
Does one pass over the input and emits a stream of SAX-style parsing signals. A helper (tree) allows that to be easily converted into DOM-style trees.

The usage is straightforward. For example:

open Markup

(* Correct and pretty-print HTML. *)
channel stdin
|> parse_html |> signals |> pretty_print
|> write_html |> to_channel stdout

(* Show up to 10 XML well-formedness errors to the user. Stop after
   the 10th, without reading more input. *)
let report =
  let count = ref 0 in
  fun location error ->
    error |> Error.to_string ~location |> prerr_endline;
    count := !count + 1;
    if !count >= 10 then raise_notrace Exit

string "some xml" |> parse_xml ~report |> signals |> drain

(* Load HTML into a custom document tree data type. *)
type html = Text of string | Element of string * html list

file "some_file"
|> fst  (* file returns a (stream, close) pair; keep only the stream here *)
|> parse_html
|> signals
|> tree
  ~text:(fun ss -> Text (String.concat "" ss))
  ~element:(fun (_, name) _ children -> Element (name, children))

The interface is centered around four functions. In pseudocode:

val parse_html : char stream   -> signal stream
val write_html : signal stream -> char stream
val parse_xml  : char stream   -> signal stream
val write_xml  : signal stream -> char stream

Most of the remaining functions either create streams from, or write streams to, strings, files, and channels, or manipulate streams. Examples of the latter are next and the combinators map and fold.

Apart from this module, Markup.ml provides two other top-level modules:

Markup_lwt
Markup_lwt_unix

Most of the interface of Markup_lwt is specified in signature ASYNCHRONOUS, which will later be shared with a planned Markup_async module.

Markup.ml is developed on GitHub and distributed under the BSD license. This documentation is for version 0.6 of the library. Documentation for older versions can be found on the releases page.

Streams

type async

type sync

Phantom types for use with ('a, 's) stream in place of 's. See explanation below.

type ('a, 's) stream

Streams of elements of type 'a.

In simple usage, when using only this module Markup, the additional type parameter 's is always sync, and there is no need to consider it further.

However, if you are using Markup_lwt, you may create some async streams. The difference between the two is that next on a sync stream retrieves an element before next "returns," while next on an async stream might not retrieve an element until later. As a result, it is not safe to pass an async stream where a sync stream is required. The phantom types are used to make the type checker catch such errors at compile time.

Errors

The parsers try to recover from errors automatically. Look in module Error if you need to debug parser output or want stricter behavior.

type location = int * int

Line and column for parsing errors. Both numbers are one-based.

module Error : sig ... end

Error type and to_string function.

Encodings

The parsers detect encodings automatically. Look in module Encoding if you need to specify an encoding.

module Encoding : sig ... end

Common Internet encodings such as UTF-8 and UTF-16; also includes some less popular encodings that are sometimes necessary for parsing XML encoding declarations.

Signals

type name = string * string

Expanded name: a namespace URI followed by a local name.

type xml_declaration = {
  version : string;
  encoding : string option;
  standalone : bool option;
}

Representation of an XML declaration, e.g. <?xml version="1.0" encoding="utf-8"?>.

type doctype = {
  doctype_name : string option;
  public_identifier : string option;
  system_identifier : string option;
  raw_text : string option;
  force_quirks : bool;
}

Representation of a document type declaration. The HTML parser fills in all fields besides raw_text. The XML parser reads declarations roughly, and fills only the raw_text field with the text found in the declaration.

type signal = [
  | `Start_element of name * (name * string) list
  | `End_element
  | `Text of string list
  | `Doctype of doctype
  | `Xml of xml_declaration
  | `PI of string * string
  | `Comment of string
]

Parsing signals. The parsers emit them according to the following grammar:

doc     ::= `Xml? misc* `Doctype? misc* element misc*
misc    ::= `PI | `Comment
element ::= `Start_element content* `End_element
content ::= `Text | element | `PI | `Comment

As a result, emitted `Start_element and `End_element signals are always balanced, and, if there is an XML declaration, it is the first signal.

If parsing with ~context:`Document, the signal sequence will match the doc production until the first error. If parsing with ~context:`Fragment, it will match content*. If ~context is not specified, the parser will pick one of the two by examining the input.

As an example, if the XML parser is parsing

<?xml version="1.0"?><root>text<nested>more text</nested></root>

it will emit the signal sequence

`Xml {version = "1.0"; encoding = None; standalone = None};
`Start_element (("", "root"), []);
`Text ["text"];
`Start_element (("", "nested"), []);
`Text ["more text"];
`End_element;
`End_element

The `Text signal carries a string list instead of a single string because on 32-bit platforms, strings cannot be larger than 16MB. In case the parsers encounter a very long sequence of text, one whose length exceeds about Sys.max_string_length / 2, they will emit a `Text signal with several strings.
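The signal sequence above can be inspected directly. The following is a minimal sketch, assuming the markup library is linked; the names sample, sample_signals, and text_of_signal are illustrative and not part of the library. It prints each signal with signal_to_string and shows how to recover the full text of a `Text signal by concatenating its chunks.

```ocaml
(* The sample document from the text above. *)
let sample =
  "<?xml version=\"1.0\"?><root>text<nested>more text</nested></root>"

(* Parse eagerly into a list of signals. *)
let sample_signals : Markup.signal list =
  Markup.string sample
  |> Markup.parse_xml
  |> Markup.signals
  |> Markup.to_list

(* A `Text payload is a list of chunks; concatenate them to get the
   complete string. Other signals carry no text. *)
let text_of_signal = function
  | `Text chunks -> Some (String.concat "" chunks)
  | _ -> None

(* Print a human-readable form of every signal, one per line. *)
let () =
  List.iter
    (fun signal -> print_endline (Markup.signal_to_string signal))
    sample_signals
```

Always concatenating the chunks, rather than assuming a single string, keeps the code correct on platforms where long text is split.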

type content_signal = [
  | `Start_element of name * (name * string) list
  | `End_element
  | `Text of string list
]

A restriction of type signal to only elements and text, i.e. no comments, processing instructions, or declarations. This can be useful for pattern matching in applications that only care about the content and element structure of a document. See the helper content.

val signal_to_string : [< signal ] -> string

Provides a human-readable representation of signals for debugging.

Parsers

type 's parser

An 's parser is a thin wrapper around a (signal, 's) stream that supports access to additional information that is not carried directly in the stream, such as source locations.

val signals : 's parser -> (signal, 's) stream

Converts a parser to its underlying signal stream.

val location : _ parser -> location

Evaluates to the location of the last signal emitted on the parser's signal stream. If no signals have yet been emitted, evaluates to (1, 1).
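Because location reports the position of the most recently emitted signal, it has to be queried while the stream is being read, not afterwards. A minimal sketch, assuming the markup library is linked; located_signals is a hypothetical helper name:

```ocaml
(* Pair every signal (rendered as a string) with the location reported
   by the parser right after that signal was retrieved. *)
let located_signals (input : string) : (Markup.location * string) list =
  let xml_parser = Markup.string input |> Markup.parse_xml in
  let signals = Markup.signals xml_parser in
  let rec collect acc =
    match Markup.next signals with
    | None -> List.rev acc
    | Some signal ->
      (* Query the location before pulling the next signal. *)
      let where = Markup.location xml_parser in
      collect ((where, Markup.signal_to_string signal) :: acc)
  in
  collect []
```

The parser value must be kept around for this: once only the result of signals is retained, the location information is no longer reachable.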

XML

val parse_xml : 
  ?report:(location -> Error.t -> unit) ->
  ?encoding:Encoding.t ->
  ?namespace:(string -> string option) ->
  ?entity:(string -> string option) ->
  ?context:[ `Document | `Fragment ] ->
  (char, 's) stream ->
  's parser

Creates a parser that converts an XML byte stream to a signal stream.

For simple usage, bytes |> parse_xml |> signals.

If ~report is provided, report is called for every error encountered. You may raise an exception in report, and it will propagate to the code reading the signal stream.

If ~encoding is not specified, the parser detects the input encoding automatically. Otherwise, the given encoding is used.

~namespace is called when the parser is unable to resolve a namespace prefix. If it evaluates to Some s, the parser maps the prefix to s. Otherwise, the parser reports `Bad_namespace.

~entity is called when the parser is unable to resolve an entity reference. If it evaluates to Some s, the parser inserts s into the text or attribute being parsed without any further parsing of s. s is assumed to be encoded in UTF-8. If entity evaluates to None instead, the parser reports `Bad_token. See xhtml_entity if you are parsing XHTML.

The meaning of ~context is described at signal, above.
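Since exceptions raised in ~report propagate to the reader of the signal stream, strict parsing can be sketched as follows. This assumes the markup library is linked; the exception Malformed and the functions parse_strictly and is_well_formed are hypothetical names introduced for illustration.

```ocaml
(* Raised on the first error the parser reports. *)
exception Malformed of string

(* Parse XML, aborting at the first reported error instead of letting
   the parser recover. *)
let parse_strictly (input : string) : Markup.signal list =
  let report location error =
    raise (Malformed (Markup.Error.to_string ~location error))
  in
  Markup.string input
  |> Markup.parse_xml ~report
  |> Markup.signals
  |> Markup.to_list

(* True if the parser reports no errors for the input. *)
let is_well_formed (input : string) : bool =
  match parse_strictly input with
  | _ -> true
  | exception Malformed _ -> false
```

Omitting ~report gives the default best-effort behavior, in which errors are recovered from silently.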

val write_xml : 
  ?report:((signal * int) -> Error.t -> unit) ->
  ?prefix:(string -> string option) ->
  ([< signal ], 's) stream ->
  (char, 's) stream

Converts an XML signal stream to a byte stream.

If ~report is provided, it is called for every error encountered. The first argument is a pair of the signal causing the error and its index in the signal stream. You may raise an exception in report, and it will propagate to the code reading the byte stream.

~prefix is called when the writer is unable to find a prefix in scope for a namespace URI. If it evaluates to Some s, the writer uses s for the URI. Otherwise, the writer reports `Bad_namespace.
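Signal streams do not have to come from a parser. As a minimal sketch, assuming the markup library is linked, the following builds a small stream by hand and serializes it; greeting_signals, write_errors, and greeting_xml are illustrative names.

```ocaml
(* A hand-built signal sequence: one element with one attribute,
   no namespaces. *)
let greeting_signals : Markup.signal list = [
  `Start_element (("", "greeting"), [(("", "lang"), "en")]);
  `Text ["hello"];
  `End_element;
]

(* Counts errors reported by the writer. *)
let write_errors = ref 0

(* Serialize the signals to a string of XML. *)
let greeting_xml : string =
  let report (_signal, _index) _error = incr write_errors in
  Markup.of_list greeting_signals
  |> Markup.write_xml ~report
  |> Markup.to_string

let () = print_endline greeting_xml
```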

HTML

val parse_html : 
  ?report:(location -> Error.t -> unit) ->
  ?encoding:Encoding.t ->
  ?context:[ `Document | `Fragment of string ] ->
  (char, 's) stream ->
  's parser

Similar to parse_xml, but parses HTML with embedded SVG and MathML, and never emits the signals `Xml or `PI. In addition, the `Fragment tag of ~context carries a string argument.

For HTML fragments, you should specify the enclosing element, e.g. `Fragment "body". This is because, when parsing HTML, error recovery and the interpretation of text depend on the current element. For example, the text

foo</bar>

parses differently in title elements than in p elements. In the former, it is parsed as foo</bar>, while in the latter, it is foo followed by a parse error due to unmatched tag </bar>. To get these behaviors for a fragment consisting of only the text, you set ~context to `Fragment "title" and `Fragment "p", respectively.

If you use `Fragment "svg", the fragment is assumed to be SVG markup. Likewise, `Fragment "math" causes the parser to parse MathML markup.

If ~context is omitted, the parser guesses it from the input stream. For example, if the first signal would be `Doctype, the context is set to `Document, but if the first signal would be `Start_element "td", the context is set to `Fragment "tr". If the first signal would be `Start_element "g", the context is set to `Fragment "svg".
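The effect of the fragment context on the foo</bar> example can be observed with a sketch like the one below. It assumes the markup library is linked; text_in_context is a hypothetical helper that returns the text extracted from the fragment together with the number of errors reported.

```ocaml
(* Parse [input] as a fragment inside [element]. Returns the text
   found and the number of parse errors reported. *)
let text_in_context (element : string) (input : string) : string * int =
  let errors = ref 0 in
  let report _location _error = incr errors in
  let text =
    Markup.string input
    |> Markup.parse_html ~report ~context:(`Fragment element)
    |> Markup.signals
    |> Markup.text
    |> Markup.to_string
  in
  (text, !errors)

let () =
  List.iter
    (fun element ->
      let text, errors = text_in_context element "foo</bar>" in
      Printf.printf "%s: %S (%d errors)\n" element text errors)
    ["title"; "p"]
```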

val write_html : ([< signal ], 's) stream -> (char, 's) stream

Similar to write_xml, but emits HTML5 instead of XML.

Input sources

val string : string -> (char, sync) stream

Evaluates to a stream that retrieves successive bytes from the given string.

val buffer : Buffer.t -> (char, sync) stream

Evaluates to a stream that retrieves successive bytes from the given buffer. Take care not to modify the buffer while the stream is iterating over it.

val channel : Pervasives.in_channel -> (char, sync) stream

Evaluates to a stream that retrieves bytes from the given channel. If the channel cannot be read, the next read of the stream results in raising Sys_error.

Note that this input source is synchronous because Pervasives.in_channel reads are blocking. For non-blocking channels, see Markup_lwt_unix.

val file : string -> (char, sync) stream * (unit -> unit)

file path opens the file at path, then evaluates to a pair s, close, where reading from stream s retrieves successive bytes from the file, and calling close () closes the file. If the file cannot be opened, raises Sys_error immediately. Otherwise, with respect to s, behaves as channel.
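Because file returns a pair, the stream has to be extracted before parsing, and close should be called on both the normal and the exceptional path. A minimal sketch, assuming the markup library is linked; parse_xml_file is a hypothetical helper name:

```ocaml
(* Read and parse an XML file eagerly, closing the file whether or not
   parsing raises. *)
let parse_xml_file (path : string) : Markup.signal list =
  let stream, close = Markup.file path in
  match
    stream |> Markup.parse_xml |> Markup.signals |> Markup.to_list
  with
  | signals -> close (); signals
  | exception e -> close (); raise e
```

The stream is consumed with to_list before close is called. Because streams are lazy, closing the file first and reading afterwards would fail.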

val fn : (unit -> char option) -> (char, sync) stream

fn f is a stream that retrieves bytes by calling f (). If the call results in Some c, the stream emits c. If the call results in None, the stream is considered to have ended.

This is actually an alias for stream, restricted to type char.

Output destinations

val to_string : (char, sync) stream -> string

Eagerly retrieves bytes from the given stream and assembles a string.

val to_buffer : (char, sync) stream -> Buffer.t

Eagerly retrieves bytes from the given stream and places them into a buffer.

val to_channel : Pervasives.out_channel -> (char, sync) stream -> unit

Eagerly retrieves bytes from the given stream and writes them to the given channel. If writing fails, raises Sys_error.

val to_file : string -> (char, sync) stream -> unit

Eagerly retrieves bytes from the given stream and writes them to the given file. If writing fails, or the file cannot be opened, raises Sys_error. Note that the file is truncated (cleared) before writing. If you wish to append to a file, open it with the appropriate flags and use to_channel on the resulting channel.

Stream operations

val stream : (unit -> 'a option) -> ('a, sync) stream

stream f creates a stream that repeatedly calls f (). Each time f () evaluates to Some v, the next item in the stream is v. The first time f () evaluates to None, the stream ends.

val next : ('a, sync) stream -> 'a option

Retrieves the next item in the stream, if any.

val peek : ('a, sync) stream -> 'a option

Retrieves the next item in the stream, if any, but does not remove the item from the stream.
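The relationship between stream, next, and peek can be sketched with a small generator. This assumes the markup library is linked; countdown is a hypothetical name.

```ocaml
(* A stream that yields from, from - 1, ..., 1 and then ends. *)
let countdown (from : int) : (int, Markup.sync) Markup.stream =
  let remaining = ref from in
  Markup.stream (fun () ->
    if !remaining <= 0 then None
    else begin
      let value = !remaining in
      decr remaining;
      Some value
    end)

let () =
  let s = countdown 2 in
  (* peek looks at the next item without consuming it... *)
  assert (Markup.peek s = Some 2);
  (* ...so the following next still sees the same item. *)
  assert (Markup.next s = Some 2);
  assert (Markup.next s = Some 1);
  assert (Markup.next s = None)
```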

val fold : ('a -> 'b -> 'a) -> 'a -> ('b, sync) stream -> 'a

fold f init s eagerly folds over the items v, v', v'', ... of s, i.e. evaluates f (f (f init v) v') v''...

val map : ('a -> 'b) -> ('a, 's) stream -> ('b, 's) stream

map f s applies f to each item of s, and produces the resulting stream.

val filter : ('a -> bool) -> ('a, 's) stream -> ('a, 's) stream

filter f s is s without the items for which f evaluates to false.

val filter_map : ('a -> 'b option) -> ('a, 's) stream -> ('b, 's) stream

filter_map f s applies f to each item v of s. If f v evaluates to Some v', the result stream has v'. If f v evaluates to None, no item corresponding to v appears in the result stream.

val iter : ('a -> unit) -> ('a, sync) stream -> unit

iter f s eagerly applies f to each item of s, i.e. evaluates f v; f v'; f v''...

val drain : ('a, sync) stream -> unit

drain s eagerly consumes s. This is useful for observing side effects, such as parsing errors when you don't care about the parsing signals themselves. It is equivalent to iter ignore s.

val of_list : 'a list -> ('a, sync) stream

Produces a (lazy) stream from the given list.

val to_list : ('a, sync) stream -> 'a list

Eagerly converts the given stream to a list.
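The combinators compose in the usual pipeline style. The following sketch, assuming the markup library is linked, uses plain integers and strings in place of signals; the function names are illustrative.

```ocaml
(* filter and map are lazy; to_list forces the whole pipeline. *)
let squares_of_evens (xs : int list) : int list =
  Markup.of_list xs
  |> Markup.filter (fun x -> x mod 2 = 0)
  |> Markup.map (fun x -> x * x)
  |> Markup.to_list

(* fold is eager: it consumes the stream and returns the accumulator. *)
let sum (xs : int list) : int =
  Markup.of_list xs |> Markup.fold (+) 0

(* filter_map drops the items mapped to None and unwraps the rest. *)
let present (xs : string option list) : string list =
  Markup.of_list xs
  |> Markup.filter_map (fun x -> x)
  |> Markup.to_list
```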

Utility

val content : ([< signal ], 's) stream -> (content_signal, 's) stream

Converts a signal stream into a content_signal stream by filtering out all signals besides `Start_element, `End_element, and `Text.
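Since content_signal has only three cases, a pattern match over a content stream is exhaustive without a catch-all branch. As a sketch, assuming the markup library is linked, the hypothetical function max_depth below computes the maximum element nesting depth of an XML document:

```ocaml
(* Track the current depth and the deepest level seen so far. *)
let max_depth (input : string) : int =
  Markup.string input
  |> Markup.parse_xml
  |> Markup.signals
  |> Markup.content
  |> Markup.fold
       (fun (depth, deepest) signal ->
         match signal with
         | `Start_element _ -> (depth + 1, max deepest (depth + 1))
         | `End_element -> (depth - 1, deepest)
         | `Text _ -> (depth, deepest))
       (0, 0)
  |> snd
```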

val tree : 
  text:(string list -> 'a) ->
  element:(name -> (name * string) list -> 'a list -> 'a) ->
  ([< signal ], sync) stream ->
  'a option

Assembles tree data structures from signal streams. This is done by first ignoring all signals except `Text, `Start_element, and `End_element. The remaining signals are then parsed according to the following grammar:

tree    ::= text | element
text    ::= `Text
element ::= `Start_element tree* `End_element

Each time the function matches text, it calls ~text to convert it into your tree type 'a. Each time the function matches element, it calls ~element with the element's name, attributes, and list of nested subtrees, for the same purpose. The result of the whole call is the tree representing the top-level text or element found. If the signal stream has multiple top-level trees (if it is a sequence of top-level text and elements), only the first one is matched.

For example,

type dom = Text of string | Element of string * dom list

"<p>HTML5 is <em>easy</em> to parse"
|> string
|> parse_html
|> signals
|> tree
  ~text:(fun ss -> Text (String.concat "" ss))
  ~element:(fun (_, name) _ children -> Element (name, children))

results in the structure

Some (Element ("p", [
  Text "HTML5 is ";
  Element ("em", [Text "easy"]);
  Text " to parse"]))

val elements : 
  (name -> (name * string) list -> bool) ->
  ([< signal ] as 'a, 's) stream ->
  (('a, 's) stream, 's) stream

elements f s scans the signal stream s for `Start_element (name, attributes) signals that satisfy f name attributes. Each such matching signal is the beginning of a substream that ends with the corresponding `End_element signal. The result of elements f s is the stream of these substreams. In simpler words, elements f s creates a sequence of streams of elements in s that match f.

Matches don't nest. If there is a matching element contained in another matching element, only the top one results in a substream.

Code using elements does not have to read each substream to completion, or at all. However, once the calling code has requested the next substream, it should not try to read a previous one.
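A typical use of elements is to extract the text of every element with a given name. The sketch below assumes the markup library is linked; texts_of is a hypothetical helper. Each substream is read to completion inside map, before the next substream is requested, which respects the ordering constraint described above.

```ocaml
(* Collect the text content of every element whose local name is
   [local_name], ignoring namespaces and attributes. *)
let texts_of (local_name : string) (xml : string) : string list =
  Markup.string xml
  |> Markup.parse_xml
  |> Markup.signals
  |> Markup.elements (fun (_, name) _attributes -> name = local_name)
  |> Markup.map (fun substream ->
       substream |> Markup.text |> Markup.to_string)
  |> Markup.to_list
```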

val text : ([< signal ], 's) stream -> (char, 's) stream

Extracts all the text in a signal stream by discarding all markup, i.e. for each `Text ss signal, the result stream has the bytes of the strings ss, and all other signals are ignored.

val trim : ([> `Text of string list ] as 'a, 's) stream -> ('a, 's) stream

Trims whitespace in a signal stream. For each signal `Text ss, transforms ss so that the result strings ss' satisfy

String.concat "" ss' = String.trim (String.concat "" ss)

All signals for which String.concat "" ss' = "" are then dropped.

val normalize_text : 
  ([> `Text of string list ] as 'a, 's) stream ->
  ('a, 's) stream

Concatenates adjacent `Text signals, then eliminates all empty strings, then all `Text [] signals. Signals besides `Text are unaffected. Note that signal streams emitted by the parsers already have normalized text. This function is useful when you are inserting text into a signal stream after parsing, or generating streams from scratch, and would like to clean up the `Text signals.
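As a sketch of the clean-up described above, assuming the markup library is linked, the following normalizes a hand-written stream containing empty strings and empty `Text signals. The names raw and cleaned are illustrative.

```ocaml
(* A messy hand-built stream: adjacent text, empty strings, and
   empty `Text signals. *)
let raw : Markup.signal list = [
  `Text ["a"; ""];
  `Text ["b"];
  `Text [];
  `Comment "note";
  `Text [""];
]

(* After normalization, only one non-empty `Text signal and the
   comment should remain. *)
let cleaned : Markup.signal list =
  Markup.of_list raw
  |> Markup.normalize_text
  |> Markup.to_list

let () =
  List.iter (fun s -> print_endline (Markup.signal_to_string s)) cleaned
```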

val pretty_print : ([> content_signal ] as 'a, 's) stream -> ('a, 's) stream

Adjusts the `Text signals in the given stream so that the output appears nicely indented when the stream is written.

val html5 : ([< signal ], 's) stream -> (signal, 's) stream

Converts a signal stream into an HTML5 signal stream by stripping any document type declarations, XML declarations, and processing instructions, and prefixing the HTML5 doctype declaration. This is useful when converting between XHTML and HTML, for example.
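A conversion from XHTML to HTML5 might look like the sketch below, assuming the markup library is linked. The input document and the names xhtml_source, as_html5, and html5_text are illustrative.

```ocaml
(* A small XHTML document with an XML declaration. *)
let xhtml_source =
  "<?xml version=\"1.0\"?>\
   <html xmlns=\"http://www.w3.org/1999/xhtml\">\
   <body><p>Hi</p></body></html>"

(* Parse as XML, then rewrite the prolog for HTML5. *)
let as_html5 : Markup.signal list =
  Markup.string xhtml_source
  |> Markup.parse_xml
  |> Markup.signals
  |> Markup.html5
  |> Markup.to_list

(* Serialize the adjusted signals with the HTML writer. *)
let html5_text : string =
  Markup.of_list as_html5 |> Markup.write_html |> Markup.to_string

let () = print_endline html5_text
```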

val xhtml : 
  ?dtd:[ `Strict_1_0 | `Transitional_1_0 | `Frameset_1_0 | `Strict_1_1 ] ->
  ([< signal ], 's) stream ->
  (signal, 's) stream

Similar to html5, but does not strip processing instructions, and prefixes an XHTML document type declaration and an XML declaration. The ~dtd argument specifies which DTD to refer to in the doctype declaration. The default is `Strict_1_1.

val xhtml_entity : string -> string option

Translates XHTML entities. This function is for use with the ~entity argument of parse_xml when parsing XHTML.

val strings_to_bytes : (string, 's) stream -> (char, 's) stream

strings_to_bytes s is the stream of all the bytes of all strings in s.
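This is convenient when input arrives in chunks, for instance from a network buffer. A minimal sketch, assuming the markup library is linked; chunks and chunked_text are illustrative names:

```ocaml
(* Input split at arbitrary points, including inside a tag. *)
let chunks = ["<a>te"; "xt</"; "a>"]

(* Flatten the chunks into a byte stream, parse it, and extract the
   text content. *)
let chunked_text : string =
  Markup.of_list chunks
  |> Markup.strings_to_bytes
  |> Markup.parse_xml
  |> Markup.signals
  |> Markup.text
  |> Markup.to_string
```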

Asynchronous interface

module type ASYNCHRONOUS = sig ... end

Markup.ml interface for monadic I/O libraries such as Lwt and Async. This signature is included in the signature of Markup_lwt, with type 'a io replaced by 'a Lwt.t, and will be included in the planned Markup_async. To use the functions in this interface, use Markup_lwt.

Conformance status

The HTML parser seeks to implement section 8 of the HTML5 specification. That section describes a parser, part of a full-blown user agent, that is building up a DOM representation of an HTML document. Markup.ml is neither inherently part of a user agent, nor does it build up a DOM representation. With respect to section 8 of HTML5, Markup.ml is concerned with only the syntax. When that section requires that the user agent perform an action, Markup.ml emits enough information for a hypothetical user agent based on it to be able to decide to perform this action. Likewise, Markup.ml seeks to emit enough information for a hypothetical user agent to build up a conforming DOM.

The XML parser seeks to be a non-validating implementation of the XML and Namespaces in XML specifications.

The rest of this section lists known deviations from HTML5, XML, and Namespaces in XML. Some of these deviations are meant to be corrected in future versions of Markup.ml, while others will probably remain. The latter satisfy some or all of the following properties:

They require non-local adjustment, especially of past nodes. For example, adjusting the start signal of the root node mid-way through the signal stream is difficult for a one-pass parser.
They are minor. Users implementing less than a conforming browser typically don't care about them, and they typically have to do with obscure error recovery.
They can easily be corrected by code written over Markup.ml that builds up a DOM or maintains other auxiliary data structures during parsing.

To be corrected

XML: There is no attribute value normalization.
HTML: The adoption agency algorithm is not implemented, because it requires non-local adjustments.
HTML: Foster parenting is not implemented, because it requires non-local adjustments.
HTML: Quirks mode is not honored. This affects the interaction between automatic closing of p elements and opening of table elements.
HTML: The parser ignores the head element pointer.
HTML: The parser ignores the form element pointer.
HTML: The parser ignores interactions between form and template.
HTML: The form translation for isindex is completely ignored. isindex is handled as an unknown element.

To remain

HTML: Except when detecting encodings, the parser does not try to read <meta> tags for encoding declarations. The user of Markup.ml should read these, if necessary. They are part of the emitted signal stream.
HTML: noscript elements are always parsed, as are script elements. For conforming behavior, if the user of Markup.ml "supports scripts," the user should serialize the content of noscript to a `Text signal using write_html.
HTML: Elements such as title that belong in head, but are found between head and body, are not moved into head.
HTML: <html> tags found in the body do not have their attributes added to the `Start_element "html" signal emitted at the beginning of the document.
