mirror of
https://github.com/docling-project/docling.git
synced 2026-03-26 06:01:04 +00:00
- Implementation of an HTML backend that (optionally) uses a headless browser (via Playwright) to render HTML pages into images, and adds provenance with bounding boxes to all elements in the converted docling document.
- Conversion preserves the reading order given by the HTML DOM tree.
- Added support for HTML "input" fields: checkboxes, radio buttons, text inputs, etc.
- Added support for a key-value convention in HTML (i.e. elements with id "key1" and "key1_value1" are paired as key-values; see the test cases for examples).
- Heuristic that glues independent inline HTML elements containing single-character text into larger text blocks.
- Support for inline styling (bold, italic, etc.).

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
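
The key-value convention mentioned above can be illustrated with a minimal sketch. This is not the backend's actual code: it uses only the Python standard library, and the names `IdTextCollector`, `pair_key_values`, and the id pattern `key<N>` / `key<N>_value<M>` are assumptions generalized from the single "key1" / "key1_value1" example in this message.

```python
import re
from html.parser import HTMLParser

# Assumed id pattern: "key1" is a key, "key1_value1" is one of its values.
VALUE_RE = re.compile(r"^(key\d+)_value\d+$")

# Void elements never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class IdTextCollector(HTMLParser):
    """Collect the text content of every element that carries an id."""

    def __init__(self):
        super().__init__()
        self.text_by_id = {}  # element id -> accumulated text
        self._stack = []      # id (or None) for each currently open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        elem_id = dict(attrs).get("id")
        self._stack.append(elem_id)
        if elem_id is not None:
            self.text_by_id.setdefault(elem_id, "")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Text belongs to every open ancestor that has an id.
        for elem_id in self._stack:
            if elem_id is not None:
                self.text_by_id[elem_id] += data


def pair_key_values(html):
    """Return {key text: [value texts]} for ids following the assumed pattern."""
    collector = IdTextCollector()
    collector.feed(html)
    collector.close()
    pairs = {}
    for elem_id, text in collector.text_by_id.items():
        match = VALUE_RE.match(elem_id)
        if match and match.group(1) in collector.text_by_id:
            key_text = collector.text_by_id[match.group(1)].strip()
            pairs.setdefault(key_text, []).append(text.strip())
    return pairs


example_html = (
    '<div><span id="key1">Name</span><span id="key1_value1">Ada</span>'
    '<span id="key2">City</span><span id="key2_value1">Zurich</span></div>'
)
print(pair_key_values(example_html))
```

Matching on element ids keeps the pairing independent of where the key and value sit in the DOM, which seems to be the intent of the convention. The test cases referenced above remain the authoritative description of the behavior.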