package lambdasoup
Sources
sha256=05d97f38e534a431176ed8d3dbe6dfb7bdcf7770109193c5a69dff53e38f10fe
Description
Lambda Soup is an HTML scraping library inspired by Python's Beautiful Soup. It provides lazy traversals from HTML nodes to their parents, children, siblings, etc., and to nodes matching CSS selectors. The traversals can be manipulated using standard functional combinators such as fold, filter, and map.
The DOM tree is mutable. You can use Lambda Soup for automatic HTML rewriting in scripts. Lambda Soup rewrites its own ocamldoc page this way.
A major goal of Lambda Soup is to be easy to use, including in interactive sessions, and to have a minimal learning curve. It is a very simple library.
Published: 05 Sep 2024
README
Lambda Soup
Lambda Soup is a functional HTML scraping and manipulation library for OCaml aimed at being easy to use.
Lambda Soup is simple. It provides a set of elementary traversals for getting from node to node, familiar functional combinators such as filter, map, and fold, and support for all CSS selectors that still make sense when not running in a browser (and a few obvious extensions on top of that).
Here is a trivial self-contained example:
(parse "<p class='Hello'>World!</p>") $ ".Hello" |> R.leaf_text;;
- : string = "World!"
And, a mutation:
let soup = parse "<p class='Hello'>World!</p>" in
wrap (soup $ ".Hello" |> R.child) (create_element "strong");
soup |> to_string;;
- : string = "<p class=\"Hello\"><strong>World!</strong></p>"
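The combinators mentioned above can be sketched in a similar way. This is a minimal illustration, assuming the library is linked and `Soup` is opened; the markup and the names `done_items` and `total_length` are made up for the example.

```ocaml
open Soup

(* Illustrative input; any HTML fragment would do. *)
let soup =
  parse "<ul><li class='done'>write</li><li>test</li><li class='done'>ship</li></ul>"

(* Select every <li>, keep those carrying the class "done",
   and collect their text content into an ordinary list. *)
let done_items =
  soup $$ "li"
  |> filter (fun li -> List.mem "done" (classes li))
  |> to_list
  |> List.map R.leaf_text

(* Fold over the same traversal to total the text length of all items. *)
let total_length =
  soup $$ "li"
  |> fold (fun acc li -> acc + String.length (R.leaf_text li)) 0
```

Here `$$` selects all matching elements, while `$` (used in the examples above) selects the first match and raises if there is none.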
For some more examples, see the Lambda Soup postprocessor that runs on Lambda Soup's own documentation after it is generated by ocamldoc.
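As a rough idea of what such a rewriting script can look like, here is a small sketch. It is not taken from the postprocessor itself; the helper name `sanitize` and the choice of rewrites are assumptions made for illustration.

```ocaml
open Soup

(* Hypothetical helper: rewrite a parsed document in place. *)
let sanitize soup =
  (* Collect the matches into a list first, so the tree is not
     mutated while the lazy traversal is still running. *)
  soup $$ "script" |> to_list |> List.iter delete;
  (* Add rel="nofollow" to every link. *)
  soup $$ "a" |> iter (set_attribute "rel" "nofollow")

let soup =
  parse "<div><script>alert(1)</script><a href='/x'>x</a><a href='/y'>y</a></div>"

let () = sanitize soup

(* Serialize the rewritten tree back to a string. *)
let rewritten = to_string soup
```

Because the tree is mutable, `sanitize` returns `unit` and the changes are visible through the original `soup` value.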
The library is tested thoroughly.
Lambda Soup is based on Markup.ml. As a consequence, it resolves entity references, detects character encodings automatically, and converts everything to UTF-8. You can also use Lambda Soup on XML by parsing the XML with Markup.ml and feeding the signals to Lambda Soup.
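A minimal sketch of the XML route, assuming both the `markup` and `lambdasoup` libraries are linked. The XML document and the names `xml`, `ids`, and `texts` are invented for the example.

```ocaml
(* Illustrative XML input. *)
let xml = "<feed><entry id='a'>one</entry><entry id='b'>two</entry></feed>"

(* Parse with Markup.ml's XML parser, then hand the signal stream to Lambda Soup. *)
let soup =
  Markup.string xml
  |> Markup.parse_xml
  |> Markup.signals
  |> Soup.from_signals

(* From here on, the usual selectors and traversals apply. *)
let ids =
  Soup.(soup $$ "entry" |> to_list |> List.map (R.attribute "id"))

let texts =
  Soup.(soup $$ "entry" |> to_list |> List.map R.leaf_text)
```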
Installing
opam install lambdasoup
Starting from scratch
To use Lambda Soup interactively, you need to have done something like this:
your-package-manager install ocaml opam
opam init
eval `opam config env` # Or restart your shell
opam install lambdasoup
and make sure your ~/.ocamlinit file looks something like this:
let () =
try Topdirs.dir_directory (Sys.getenv "OCAML_TOPLEVEL_PATH")
with Not_found -> ()
;;
#use "topfind";;
Then, run ocaml -short-paths to start the top-level, and scrape away!
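Once the top-level is running, a session might begin as sketched below. This assumes the findlib setup from the ~/.ocamlinit file above; the binding name `greeting` is only for illustration.

```ocaml
(* Load the library through findlib; requires the topfind setup shown above. *)
#require "lambdasoup";;

(* Bring parse, ($), R, and friends into scope, as the README examples assume. *)
open Soup;;

let greeting = parse "<p class='Hello'>World!</p>" $ ".Hello" |> R.leaf_text;;
```

The `R` module holds variants of functions that raise instead of returning an option, which is convenient in interactive use.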
Depending
Lambda Soup uses semantic versioning, but is currently in 0.x.x. For now, the minor version number will be incremented on breaking changes. So, to give yourself a chance to review the changelog before your code breaks, put the following constraint on Lambda Soup: lambdasoup {< "0.7.0"}.
Documentation
Lambda Soup's interface consists of one module Soup, whose signature is documented here.
Developing
See CONTRIBUTING. All feedback is welcome – open an issue on GitHub, or send me an email at antonbachin@yahoo.com. If you find yourself repeatedly writing the same helper on top of Lambda Soup's functions, perhaps we should add it to Lambda Soup.
History
Lambda Soup was originally written to answer a Stack Overflow question in November 2015.
Dependencies (4)
- ocaml >= "4.03.0"
- markup >= "1.0.0"
- dune >= "2.7.0"
- camlp-streams >= "5.0.1"
Dev Dependencies (2)
- ounit2 with-test
- bisect_ppx dev & >= "2.5.0"
Used by (13)
- calculon-web
- camyll >= "0.4.0" & < "0.4.2"
- dream >= "1.0.0~alpha2"
- dream-livereload
- dream-serve
- gradescope_submit
- memtrace_viewer < "v0.15.0"
- minima-theme
- plist < "1.0.0"
- river >= "0.2"
- soupault < "1.9.0" | >= "1.13.0"
- universal-portal
- virtual_dom >= "v0.14.0"
Conflicts
None