package biocaml


Module Biocaml_base.Fasta

FASTA files. The FASTA family of file formats has several mutually incompatible descriptions (1, 2, 3, 4, etc.). Roughly, FASTA files have the following format:

# comment
# comment
...
>description
sequence
>description
sequence
...

Comment lines are allowed at the top of the file. Comments usually start with a '#' character but sometimes with a ';' character. The fmt properties configure which of these is accepted during parsing and printing.

Description lines begin with the '>' character. Various conventions are used for the content but there is no requirement. We simply return the string following the '>' character.

Sequences are most often a sequence of characters denoting nucleotides or amino acids, and thus an item's sequence field is set to a string. Sequences may span multiple lines.

However, sequence lines are sometimes used to provide quality scores, either as space-separated integers or as ASCII-encoded scores. To support the former case, we provide the sequence_to_int_list function. For the latter case, see modules Phred_score and Solexa_score.

FASTA files are used for both short sequences and very large ones, e.g. a whole genome. In the latter case, the main API of this module, which returns each sequence as an in-memory string, might be too costly. Consider using the Parser0 module instead, which does not merge multiple sequence lines into one string. That API is slightly more difficult to use, but the trade-off may be worthwhile.
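To illustrate why line-level pieces help with large sequences, here is a standalone sketch that computes the length of each record's sequence without ever concatenating the lines. The piece type is a local stand-in with the same shape as item0, and sequence_lengths is a hypothetical helper; the sketch does not call Parser0, whose interface is not shown on this page.

```ocaml
(* Local stand-in with the same shape as item0; not the library type. *)
type piece =
  [ `Comment of string
  | `Empty_line
  | `Description of string
  | `Partial_sequence of string ]

(* Return (description, total sequence length) for each record, in order.
   Only a running counter is kept per record, so no full sequence is built. *)
let sequence_lengths (pieces : piece Seq.t) : (string * int) list =
  let step (finished, current) piece =
    match piece, current with
    | `Description d, None -> (finished, Some (d, 0))
    | `Description d, Some previous -> (previous :: finished, Some (d, 0))
    | `Partial_sequence s, Some (d, n) ->
      (finished, Some (d, n + String.length s))
    (* Sequence data before any description is ignored in this sketch. *)
    | `Partial_sequence _, None -> (finished, None)
    | (`Comment _ | `Empty_line), _ -> (finished, current)
  in
  let finished, current = Seq.fold_left step ([], None) pieces in
  List.rev
    (match current with
     | None -> finished
     | Some last -> last :: finished)
```

Because the input is a Seq.t, the pieces can be produced lazily, for example one per line read from a channel, so memory use stays bounded by the longest single line.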

Format Specifiers:

Variations in the format are controlled by the following settings, all of which have a default value. These properties are combined into the fmt type for convenience and the defaults into default_fmt.

  • allow_sharp_comments: Allow comment lines beginning with a '#' character. Default: true.
  • allow_semicolon_comments: Allow comment lines beginning with a ';' character. Default: false.

Setting both allow_sharp_comments and allow_semicolon_comments to true allows both kinds of comment lines. Setting both to false disallows comment lines.

  • allow_empty_lines: Allow lines with only whitespace anywhere in the file. Default: false.
  • max_line_length: Require sequence lines to be shorter than the given length. None means there is no restriction. Note that this does not restrict the length of an item's sequence field, because a sequence can span multiple lines. Default: None.
  • alphabet: Require every sequence character to be one of those in the given string. None means any character is allowed. Default: None.
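
The settings and defaults listed above can be sketched as a record plus a constructor with optional labelled arguments. The block below is a standalone mirror written for illustration: it redefines fmt, default_fmt and the fmt function locally with the documented fields and defaults instead of using the library, and dna_fmt is a hypothetical example value.

```ocaml
(* Local mirror of the documented fmt record; not the library type. *)
type fmt = {
  allow_sharp_comments : bool;
  allow_semicolon_comments : bool;
  allow_empty_lines : bool;
  max_line_length : int option;
  alphabet : string option;
}

(* Defaults as documented in the list above. *)
let default_fmt = {
  allow_sharp_comments = true;
  allow_semicolon_comments = false;
  allow_empty_lines = false;
  max_line_length = None;
  alphabet = None;
}

(* Constructor with optional labelled arguments. Any argument that is
   omitted falls back to the corresponding default. *)
let fmt
    ?(allow_sharp_comments = default_fmt.allow_sharp_comments)
    ?(allow_semicolon_comments = default_fmt.allow_semicolon_comments)
    ?(allow_empty_lines = default_fmt.allow_empty_lines)
    ?max_line_length
    ?alphabet
    () =
  { allow_sharp_comments;
    allow_semicolon_comments;
    allow_empty_lines;
    max_line_length;
    alphabet }

(* Hypothetical example: also accept ';' comments and restrict sequences
   to a small DNA alphabet. *)
let dna_fmt = fmt ~allow_semicolon_comments:true ~alphabet:"ACGTN" ()
```

The trailing unit argument lets every optional argument be omitted, so calling fmt () yields a value equal to default_fmt.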
type header = private Base.string Base.list

A header is a list of comment lines.

type item = private {
  description : Base.string;
  sequence : Base.string;
}
val sexp_of_item : item -> Sexplib0.Sexp.t
val item_of_sexp : Sexplib0.Sexp.t -> item
val item : description:Base.string -> sequence:Base.string -> item

Parsing

type fmt = {
  allow_sharp_comments : Base.bool;
  allow_semicolon_comments : Base.bool;
  allow_empty_lines : Base.bool;
  max_line_length : Base.int Base.option;
  alphabet : Base.string Base.option;
}
val fmt : ?allow_sharp_comments:Base.bool -> ?allow_semicolon_comments:Base.bool -> ?allow_empty_lines:Base.bool -> ?max_line_length:Base.int -> ?alphabet:Base.string -> Base.unit -> fmt
val default_fmt : fmt
val sequence_to_int_list : Base.string -> (Base.int Base.list, [> `Msg of Base.string ]) Base.Result.t

Parse a space-separated list of integers.
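
As a rough illustration of this kind of parsing, the following standalone sketch splits a string on spaces and converts each token, returning a result with a `Msg error. It is not the library's implementation: ints_of_string is a hypothetical name, and details such as accepting repeated spaces or an empty input are assumptions of the sketch.

```ocaml
(* Standalone sketch; not the library implementation. *)
let ints_of_string (s : string) : (int list, [> `Msg of string ]) result =
  (* Split on single spaces and drop the empty tokens that repeated
     spaces produce. *)
  let tokens =
    String.split_on_char ' ' s |> List.filter (fun t -> t <> "")
  in
  let rec go acc = function
    | [] -> Ok (List.rev acc)
    | t :: rest -> (
      match int_of_string_opt t with
      | Some n -> go (n :: acc) rest
      | None -> Error (`Msg (Printf.sprintf "invalid integer: %S" t)))
  in
  go [] tokens
```

Note that int_of_string_opt also accepts OCaml integer literal forms such as 0x1F, so a stricter check would be needed if only decimal scores should pass.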

type item0 = [
  | `Comment of Base.string
  | `Empty_line
  | `Description of Base.string
  | `Partial_sequence of Base.string
]

An item0 is a lower-level value than an item. It is useful for parsing files with large sequences because you get each sequence in smaller pieces.

  • `Comment _ - Single comment line without the final newline. The initial comment character is retained.
  • `Empty_line - A line with only whitespace characters. The contents are not provided.
  • `Description _ - Single description line without the initial '>' or the final newline.
  • `Partial_sequence _ - One piece of a sequence. Consecutive partial sequences together make up the sequence of a single item.
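
The relationship between these pieces and a complete item can be sketched as follows. This is a standalone illustration under stated assumptions: piece and record are local stand-ins for item0 and item, merge is a hypothetical helper, and the sketch makes no claim about how Parser is actually implemented.

```ocaml
(* Local stand-ins for item0 and item; not the library types. *)
type piece =
  [ `Comment of string
  | `Empty_line
  | `Description of string
  | `Partial_sequence of string ]

type record = { description : string; sequence : string }

(* Group pieces into records: each `Description starts a new record and
   the `Partial_sequence pieces that follow are concatenated in order. *)
let merge (pieces : piece list) : record list =
  let finish current acc =
    match current with
    | None -> acc
    | Some (description, buf) ->
      { description; sequence = Buffer.contents buf } :: acc
  in
  let rec go current acc = function
    | [] -> List.rev (finish current acc)
    | `Description d :: rest ->
      go (Some (d, Buffer.create 64)) (finish current acc) rest
    | `Partial_sequence s :: rest ->
      (* Sequence data before any description is ignored in this sketch. *)
      (match current with
       | Some (_, buf) -> Buffer.add_string buf s
       | None -> ());
      go current acc rest
    | (`Comment _ | `Empty_line) :: rest -> go current acc rest
  in
  go None [] pieces
```

A Buffer is used so that joining many short lines stays linear in the total sequence length, which repeated string concatenation would not.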
val sexp_of_item0 : item0 -> Sexplib0.Sexp.t
val item0_of_sexp : Sexplib0.Sexp.t -> item0
val __item0_of_sexp__ : Sexplib0.Sexp.t -> item0
type parser_error = [
  | `Fasta_parser_error of Base.int * Base.string
]
val sexp_of_parser_error : parser_error -> Sexplib0.Sexp.t
val parser_error_of_sexp : Sexplib0.Sexp.t -> parser_error
val __parser_error_of_sexp__ : Sexplib0.Sexp.t -> parser_error
module Parser0 : sig ... end

Low-level parsing

val unparser0 : item0 -> Base.string
module Parser : sig ... end

High-level parsing

val unparser : item -> Base.string