Library
Module
Module type
Parameter
Class
Class type
A monadic parser combinator library.
The parser combinators provided by this module can be used to build parsers for context-sensitive, infinite look-ahead grammars that are reasonably efficient and produce good error messages due to a controlled use of backtracking. The performance of the resulting parsers should be sufficient for most applications. The parsers get their input from character streams provided by the MParser_Char_Stream
module, which means that it is possible to parse files up to a size of at least 1GB.
The MParser
module is an OCaml version of the FParsec library for F# by Stephan Tolksdorf and the Parsec library for Haskell by Daan Leijen. The interface of the MParser
module is very similar to the interfaces of Parsec and FParsec. For this reason, we keep the documentation here rather terse. See the excellent documentation of Parsec and FParsec for more information. Parsers should be easily portable from these two libraries to MParser
(although some functions might behave subtly different). Where the behavior of Parsec and FParsec differs, MParser
generally behaves like FParsec (but there might be exceptions).
A significant drawback of the implementation is that it relies on the standard OCaml types char
and string
and therefore there is currently no support for Unicode.
The state of a parser consists of the input to be parsed, the current position in the input, the number of the current line, the position of the first character of the current line in the input, and an optional user state. A position p
is valid if it satisfies 0 <= p && p < l
, where l
is the length of the input; all other positions are invalid. Characters can only be read from valid positions.
The following functions that directly access or change the parser state should only be used to write very low-level parsers. All other parsers should be composed from parser combinators (see below).
val init : MParser_Char_Stream.t -> 's -> 's state
init input user
returns an initial parser state using the input stream input
and the initial user state user
.
val is_eof : 's state -> bool
is_eof s
returns true
if the current position of s
is not a valid position, and false
otherwise. If is_eof
returns false
, a character can be read from the current position.
advance s
returns the state s
with the position advanced by one character if the current position of s
is a valid position. Otherwise, the same state is returned. This function does not register newlines. If the current character is a newline, advance_state_nl
should be used instead.
advance_state s n
returns the state s
with the position advanced by n
characters if the current position of s
is a valid position. Otherwise, the same state is returned. This function does not register newlines. If the current character is a newline, advance_state_nl
should be used instead.
advance_state_nl s n
returns the state s
with the position advanced by n
characters and the line counter increased by one if the current position of s
is a valid position. Otherwise, the same state is returned.
val read_char : 's state -> char option
read_char s
returns Some c
where c
is the character at the current position, or None
if this position is not a valid position.
val read_index : 's state -> int -> char option
read_index s pos
returns Some c
where c
is the character at the position pos
, or None
if this position is not a valid position.
val next_char : 's state -> char option
next_char s
returns Some c
where c
is the character after the current position, or None
if this position is not a valid position.
val prev_char : 's state -> char option
prev_char s
returns Some c
where c
is the character before the current position, or None
if this position is not a valid position.
val read_string : 's state -> int -> string
read_string s maxlen
returns a string containing the next n
characters, where n
is the minimum of maxlen
and the number of characters remaining from the current position. If the current position is not a valid position, the empty string is returned.
val match_char : 's state -> char -> bool
match_char s c
returns true
if c
is the character at the current position, and false
otherwise.
val match_string : 's state -> string -> bool
match_string s str
returns true
if the input starting at the current position matches the string str
, and false
otherwise.
When building parsers from the parser combinators and running them using the parse
functions (see below), error handling and reporting is nearly automatic. If a parser run fails, the parse
functions return a human-readable (plain English) error message that is generated from the labels attached to the parsers using the labelling operators <?>
and <??>
.
The following types and functions can be used for explicit creation of errors in parsers and for customizing the handling of errors returned by parser runs. For this purpose the parse
functions also return the actual error
value in the case of a failed parser run. Typical applications for customized error handling are the internationalization of error messages and the automatic processing of parse errors.
An input position, consisting of an index into the input, a line number, and a column number.
type error_message =
| Unexpected_error of string
An unexpected symbol occurred in the input.
*)| Expected_error of string
A symbol that was expected in the input could not be parsed.
*)| Message_error of string
An error occurred that does not fit into any other category.
*)| Compound_error of string * error
An error occurred while parsing a part of a compound.
*)| Backtrack_error of error
The parser backtracked after an error occurred.
*)| Unknown_error
An unknown error occurred.
*)The type of error messages returned by parsers.
and error =
| Parse_error of pos * error_message list
| No_error
The type of errors returned by parsers.
*)Creates an Unexpected_error
. The argument should describe the unexpected symbol that occurred in the input.
Creates an Expected_error
. The argument should describe the symbol that was expected but could not be parsed.
Creates a Message_error
. The argument should contain the complete error message.
Creates a Compound_error
. The string argument should describe the compound that could not be parsed; the error argument should be the error that caused to compound parser to fail.
Creates a Backtrack_error
. The argument should be the error that caused the parser to backtrack.
Merges two errors. The behavior of the error reporting is undefined if Parse_error
values from different positions are merged.
To make handling of parse errors possible, the reply of a parser must not only indicate whether the parser has failed or succeeded, but also whether the parser has consumed input. When a parser is run, the general rule is that when it fails, alternative parsers created using the <|>
and choice
combinators are only tried if the first parser did not consume input. Thus by default the resulting parsers are predictive (non-backtracking). This behavior can be changed by using combinators like attempt
and look_ahead
. By this means the MParser
module can be used to build efficient parsers for a very large class of languages that provide nearly automatic handling of errors, which is virtually impossible with full-backtracking parsers (because the position causing the failure cannot be determined).
This approach to combinator parsing has been pioneered by Daan Leijen's Parsec library. A more detailed presentation of it can be found in the following paper: Daan Leijen and Erik Meijer, Parsec: Direct-Style Monadic Parser Combinators For The Real World, Technical Report UU-CS-2001-35, Departement of Computer Science, Universiteit Utrecht, 2001.
type ('a, 's) reply =
| Empty_failed of error
The parser failed without consuming input.
*)| Empty_ok of 'a * 's state * error
The parser succeeded without consuming input.
*)| Consumed_failed of error
The parser failed after consuming input.
*)| Consumed_ok of 'a * 's state * error
The parser succeeded after consuming input.
*)The type of replies returned by parsers.
type ('a, 's) parser = ('a, 's) t
The type of parsers with result type 'a
and user state type 's
.
make_ok consumed result state error
returns Empty_ok (result, state,
error)
if consumed = false
, and Consumed_ok (result, state, error)
if consumed = true
.
make_failed consumed error
returns Empty_failed error
if consumed =
false
, and Consumed_failed error
of consumed = true
.
val is_consumed : ('a, 's) reply -> bool
is_consumed reply
returns true
if reply
is Consumed_failed
or Consumed_ok
, and false
otherwise.
val is_empty : ('a, 's) reply -> bool
is_consumed reply
returns true
if reply
is Empty_failed
or Empty_ok
, and false
otherwise.
val is_error : ('a, 's) reply -> bool
is_error reply
returns true
if reply
is Empty_failed
or Consumed_failed
, and false
otherwise.
val is_ok : ('a, 's) reply -> bool
is_error reply
returns true
if reply
is Empty_ok
or Consumed_ok
, and false
otherwise.
set_error reply error
returns reply
with the error message replaced by error
.
The result of a parser run. In the case of Failed
, it contains a human-readable error message.
val parse : ('a, 's) t -> MParser_Char_Stream.t -> 's -> 'a result
parse p s user
runs the parser p
on the input stream s
using the initial user state user
.
parse_string p str user
runs the parser p
on the input stream produced from the string str
using the initial user state user
. The stream is created with MParser_Char_Stream.from_string
.
val parse_channel : ('a, 's) t -> Pervasives.in_channel -> 's -> 'a result
parse_string p chn user
runs the parser p
on the input stream produced from the channel chn
using the initial user state user
. The stream is created with MParser_Char_Stream.from_channel
.
Note: A statement of the form "parser p
is equivalent to q
", where q
is a compound parser, means that p
is functionally equivalent to q
, that is, it behaves exactly the same as q
, although it might be implemented differently. Using p
is generally more efficient than using the compound parser q
and should therefore be preferred.
val return : 'a -> ('a, 's) t
return x
always succeeds with the result x
without consuming any input.
try_return f x msg s0
succeeds with the result f x
without consuming input if f x
does not raise an exception. Otherwise, it fails with a Message_error
with error message msg
at state s0
. This combinator is useful where a result must be computed from another parser result and where this computation may raise an exception.
A variant of try_return
for functions with two parameters.
A variant of try_return
for functions with three parameters.
val fail : string -> ('a, 's) t
fail msg
always fails with a Message_error
with error message msg
. The fail
parser pretends having consumed input, so that all error messages are overwritten.
val message : string -> ('a, 's) t
message msg
always fails with a Message_error
with error message msg
without consuming input, so that the error message is merged with other errors generated for the same input position.
val zero : ('a, 's) t
zero
always fails with an Unknown_error
without consuming input.
p >>= f
first applies the parser p
, then applies f
to the resulting value, and finally applies the resulting parser. Since the second parser can depend on the result of the first parser, it is possible to parse context-sensitive grammars.
p << q
is equivalent to p >>= (fun x -> q >> return x)
.
p >>? q
behaves like p >> q
, but if q
fails without consuming input, it backtracks and pretends not having consumed input, even if p
has consumed input.
p |>> f
is equivalent to p >>= (fun x -> return (f x))
.
A variant of (|>>)
for functions with two parameters.
A variant of (|>>)
for functions with three parameters.
val pipe4 :
('a, 's) t ->
('b, 's) t ->
('c, 's) t ->
('d, 's) t ->
('a -> 'b -> 'c -> 'd -> 'e) ->
('e, 's) t
A variant of (|>>)
for functions with four parameters.
p <|> q
first applies p
. If p
fails without consuming input, it applies q
.
choice [p1; p2; ...; pn ]
is equivalent to p1 <|> p2 <|> ... <|> pn
<|> zero
.
attempt p
behaves like p
, but if p
fails after consuming input, it backtracks and pretends not having consumed input. The error message of p
is wrapped inside a Backtrack_error
.
p <?> label
attaches the label label
to p
. If p
fails without consuming input, the error message of p
is replaced by an Expected_error
with the label label
.
p <??> label
behaves like p <?> label
, but if p
fails after consuming input, the error message of p
is wrapped inside a Compound_error
.
look_ahead p
behaves like p
, but restores the original state after parsing. It always returns an empty reply.
followed_by p msg
succeeds without consuming input and returns ()
if p
succeeds at the current position. Otherwise, it fails without consuming input and returns an Expected_error
with error message msg
.
not_followed_by p msg
succeeds without consuming input and returns ()
if p
does not succeed at the current position. Otherwise, it fails without consuming input and returns an Unexpected_error
with error message msg
.
option p
is equivalent to p >>= (fun r -> return (Some r)) <|>$
None
.
pair p q
is equivalent to p >>= (fun x -> q >>= (fun y -> return (x,
y)))
.
many p
parses zero or more occurrences of p
and returns a list of the results returned by p
.
many1 p
parses one or more occurrences of p
and returns a list of the results returned by p
.
many_fold_left f a p
is equivalent to many p |>> List.fold_left f a
.
many1_fold_left f a p
is equivalent to many1 p |>> List.fold_left f a
.
many_rev_fold_left f a p
is equivalent to many p |>> List.rev |>> List.fold_left f a
.
many1_rev_fold_left f a p
is equivalent to many1 p |>> List.rev |>> List.fold_left f a
.
chain_left p op x
parses zero or more occurrences of p
, separated by op
. It returns the value obtained by the left-associative application of the functions returned by op
to the results of p
. If there are zero occurrences of p
, the value x
is returned.
chain_left1 p op
parses one or more occurrences of p
, separated by op
. It returns the value obtained by the left-associative application of the functions returned by op
to the results of p
.
chain_right p op x
parses zero or more occurrences of p
, separated by op
. It returns the value obtained by the right-associative application of the functions returned by op
to the results of p
. If there are zero occurrences of p
, the value x
is returned.
chain_right1 p op
parses one or more occurrences of p
, separated by op
. It returns the value obtained by the right-associative application of the functions returned by op
to the results of p
.
count n p
parses exactly n
occurrences of p
and returns a list of the results returned by p
.
between left right p
is equivalent to left >> p << right
.
sep_by p sep
parses zero or more occurrences of p
, separated by sep
. It returns a list of the results returned by p
.
sep_by1 p sep
parses one or more occurrences of p
, separated by sep
. It returns a list of the results returned by p
.
sep_end_by p sep
parses zero or more occurrences of p
, separated and optionally ended by sep
. It returns a list of the results returned by p
.
sep_end_by1 p sep
parses one or more occurrences of p
, separated and optionally ended by sep
. It returns a list of the results returned by p
.
end_by p sep
parses zero or more occurrences of p
, separated and ended by sep
. It returns a list of the results returned by p
.
end_by1 p sep
parses one or more occurrences of p
, separated and ended by sep
. It returns a list of the results returned by p
.
many_until p q
parses zero or more occurrences of p
until q
succeeds and returns a list of the results returned by p
. It is equivalent to many (not_followed_by q "" >> p) << q
. Note that q
is parsed twice and should therefore not have side effects.
skip_many_until p q
is equivalent to skip (many_until p q)
.
val get_input : (MParser_Char_Stream.t, 's) t
Returns the input stream.
val get_index : (int, 's) t
Returns the current index into the input.
val register_nl : int -> int -> (unit, 's) t
register_nl lines chars_after_nl
increases the line counter by lines
and sets the beginning of the current line to chars_after_nl
characters before the current index.
val eof : (unit, 's) t
Parses the end of the input.
val get_user_state : ('s, 's) t
Returns the current user state of the parser.
val set_user_state : 's -> (unit, 's) t
Sets the current user state of the parser.
val update_user_state : ('s -> 's) -> (unit, 's) t
update_user_state f
applies f
to the user state of the parser.
The following specialized parsers and parser combinators work directly on the characters of the input stream and are therefore more efficient than the general combinators. Generally, the basic character and string parsers only consume input when they succeed.
val skip_nchars : int -> (unit, 's) t
skip_nchars n
skips n
characters of the input. Newlines are not registered. This parser never fails, even if there are less than n
characters left.
val char : char -> (char, 's) t
char c
parses the character c
and returns it.
val skip_char : char -> (unit, 's) t
skip_char c
is equivalent to skip (char c)
.
val any_char : (char, 's) t
Parses any character and returns it. This parser does not register newlines. Use any_char_or_nl
if the current character can be a newline.
val skip_any_char : (unit, 's) t
skip_any_char
is equivalent to skip any_char
.
val any_char_or_nl : (char, 's) t
any_char_or_nl
is equivalent to newline <|> any_char
.
val skip_any_char_or_nl : (unit, 's) t
skip_any_char_or_nl
is equivalent to skip any_char_or_nl
.
val peek_char : (char, 's) t
Returns the character at the position after the current position or fails if this is not a valid position. This parser does not consume input.
val string : string -> (string, 's) t
string s
parses the string s
and returns it.
val skip_string : string -> (unit, 's) t
skip_string s
is equivalent to skip (string s)
.
val any_string : int -> (string, 's) t
any_string n
parses any string of n
characters and returns it. Fails if there are less than n
characters left in the input.
many_chars p
parses zero or more occurrences of p
and returns a string of the results returned by p
.
many1_chars p
parses one or more occurrences of p
and returns a string of the results returned by p
.
skip_many_chars p
is equivalent to skip (many_chars p)
.
skip_many1_chars p
is equivalent to skip (many1_chars p)
.
many_chars_until p q
parses zero or more occurrences of p
until q
succeeds and returns a string of the results returned by p
. It is equivalent to many_chars (not_followed_by q "" >> p) << q
. Note that q
is parsed twice and should therefore not have side effects.
skip_many_chars_until p q
is equivalent to skip (many_chars_until p q)
.
val satisfy : (char -> bool) -> (char, 's) t
satisfy p
parses a character for which p
returns true
and returns this character. It fails with an Unknown_error
if the character at the current position does not satisfy p
.
val satisfy_l : (char -> bool) -> string -> (char, 's) t
satisfy_l p label
is equivalent to satisfy p <?> label
.
val skip_satisfy : (char -> bool) -> (unit, 's) t
skip_satisfy p
is equivalent to skip (satisfy p)
.
val skip_satisfy_l : (char -> bool) -> string -> (unit, 's) t
skip_satisfy_l p label
is equivalent to skip (satisfy_l p label)
.
val nsatisfy : int -> (char -> bool) -> (string, 's) t
nsatisfy n p
parses the next n
characters if p
returns true
for each of them. Otherwise it fails with an Unknown_error
without consuming input.
val many_satisfy : (char -> bool) -> (string, 's) t
many_satisfy p
is equivalent to many_chars (satisfy p)
.
val many1_satisfy : (char -> bool) -> (string, 's) t
many1_satisfy p
is equivalent to many1_chars (satisfy p)
.
val skip_many_satisfy : (char -> bool) -> (unit, 's) t
skip_many_satisfy p
is equivalent to skip_many (satisfy p)
.
val skip_many1_satisfy : (char -> bool) -> (unit, 's) t
skip_many1_satisfy p
is equivalent to skip_many1 (satisfy p)
.
val next_char_satisfies : (char -> bool) -> (unit, 's) t
next_char_satisfies p
succeeds without consuming input if p
returns true
for the character after the current position. Otherwise it fails with an Unknown_error
.
val prev_char_satisfies : (char -> bool) -> (unit, 's) t
prev_char_satisfies p
succeeds without consuming input if p
returns true
for the character before the current position. Otherwise it fails with an Unknown_error
.
val any_of : string -> (char, 's) t
any_of str
parses any character occurring in the string str
and returns it.
val none_of : string -> (char, 's) t
none_of str
parses any character not occurring in the string str
and returns it.
is_not c
parses any character that is not accepted by parser c
. Fails with Unknown_error
if the character is accepted by c
.
val uppercase : (char, 's) t
Parses an English uppercase letter and returns it.
val lowercase : (char, 's) t
Parses an English lowercase letter and returns it.
val letter : (char, 's) t
Parses an English letter and returns it.
val digit : (char, 's) t
Parses a decimal digit and returns it.
val hex_digit : (char, 's) t
Parses a hexadecimal digit and returns it.
val oct_digit : (char, 's) t
Parses an octal digit and returns it.
val alphanum : (char, 's) t
Parses an English letter or a decimal digit and returns it.
val tab : (char, 's) t
Parses a tab character ('\t'
) and returns it.
val blank : (char, 's) t
Parses a space or a tab character (' '
or '\t'
and returns it.
val newline : (char, 's) t
Parses a newline ('\n'
, '\r'
, or the sequence '\r', '\n'
). If it succeeds, it always returns '\n'
. The position in the parser state is correctly updated.
val space : (char, 's) t
Parses a space (' '
), a tab ('\t'
) or a newline ('\n'
, '\r'
, or the sequence '\r', '\n'
). If a newline is parsed, it returns '\n'
and correctly updates the position in the parser state. Otherwise it returns the parsed character.
val non_space : (char, 's) t
non_space
is equivalent to is_not space
, with a better error message.
val spaces : (unit, 's) t
spaces
is equivalent to skip_many_chars space
.
val spaces1 : (unit, 's) t
spaces
is equivalent to skip_many_chars1 space
.
The associativity of an operator. An operator (#)
is left-associative if a # b # c = (a # b) # c
, right-associative if a # b # c = a # (b #
c)
, and non-associative if applying (#)
to an expression with head operator (#)
is not allowed. Note that a value of this type specifies only how an expression like a # b # c
is parsed, not how it is interpreted semantically.
The type of operators on type 'a
. The function returned by the parser argument to the Infix
, Prefix
, and Postfix
constructor is used to build the result of applying the operator to its operands.
expression operators term
parses any well-formed expression that can built from the basic terms parsed by term
and the operators specified in the operator table operators
. The operator table is a list of operator
lists that is ordered in descending precedence. All elements in one list of operators
have the same precedence, but may have different associativities.
Adjacent prefix and postfix operators of the same precedence are not well-formed. For example, if (-)
denotes prefix negation, --x
is not a well-formed expression (if (--)
does not denote an operator on its own). If a prefix and a postfix operator of the same precedence are applied to an expression, the prefix operator is applied before the postfix operator.
The following example demonstrates the usage of the expression
parser. It implements a minimalistic calculator that can be used to evaluate expressions like eval "(1 + 2 * 3) / -2"
, which returns -3
.
open MParser
open Tokens
exception Syntax_error
let infix sym f assoc = Infix (skip_symbol sym >> return f, assoc)
let prefix sym f = Prefix (skip_symbol sym >> return f)
let negate x = -x
let operators =
[
[ prefix "-" negate ];
[ infix "*" ( * ) Assoc_left; infix "/" ( / ) Assoc_left ];
[ infix "+" ( + ) Assoc_left; infix "-" ( - ) Assoc_left ];
]
let rec term s = (parens expr <|> decimal) s
and expr s = expression operators term s
let eval s =
match parse_string expr s () with
| Success x -> x
| Failed (msg, _) ->
print_string msg;
raise Syntax_error
module MakeRegexp (Regexp : MParser_Sig.Regexp) : sig ... end