How to extract and structure text from PDF files with Python and machine learning
Posos aims to centralise all the medical knowledge and to make it searchable for every health professional. The goal is to help them provide the best possible care for each patient by supporting their therapeutic decision-making with reliable and up-to-date data. The data is extracted from more than 100,000 files (HTML pages and PDFs) coming from more than 200 authoritative medical sources, and fed into a textual search engine.
Summary
Documents are semantically and visually organized into sections and paragraphs by their authors. User queries are often related to only a few sections in a document. Reproducing this structure helps the search engine produce more relevant results.
HTML pages are structured by design, and thus easy to split into parts. HTML tags (e.g. <h1>, <h2>, <p>) usually convey a good understanding of how the knowledge is organized. It is a different story for PDF files.
In this article we will explain how to extract and structure text from PDF files with machine learning.
If you are interested in our search engine, we published a series of three articles on how medical-specific NLP tools can be built and used.
- The first article explains why it is necessary to train domain-specific word embedding and how Posos does it.
- The second article describes the Named-Entity Recognition (NER) and Named-Entity Linking (NEL) models that we built.
- The third article details how NLP can be used to improve the accuracy of a medical search engine.
We built an end-to-end pipeline which collects and transforms data into up-to-date medical knowledge. Transforming PDF files is a two-step process:
- parsing: extracting the characters (letters) along with their metadata (e.g. coordinates, font, size, color, etc.), as well as lines (used to build tables) and images,
- structuring: assembling characters into paragraphs, which are then classified as text or title and finally hierarchized.
Understanding PDF files
As in many professional fields, health authorities convey the majority of their reports via electronic documents first developed with office suites, then converted to searchable PDF files before publication.
A searchable PDF is a computer-generated PDF in which you can highlight, select and copy text. Searchable PDFs are like text files: they only store the needed characters of the fonts and the layout of the text on each page. Since the fonts are in vector format, they are extremely compact and can be enlarged without losing sharpness. PDFs also support arbitrary vector graphics as illustrations. Unlike image PDFs (scanned PDFs), which require OCR, a searchable PDF usually contains everything needed for parsing.
Within text strings, characters are shown using character codes (integers) that map to glyphs in the current font using an encoding [1].
Understanding the problem
Because we learned how to read and write, we understand document layouts: letters form words, words form sentences, sentences form paragraphs and so on. We also know which sentences are titles, and how titles are hierarchized.
But for a computer, a PDF file is simply a list of characters which are not always ordered. It is like a puzzle where letters are the pieces. Assembling characters into sentences becomes tricky when dealing with pages made of several columns of text.
Another interesting task is to determine whether a sentence should be labeled as text or title (or something else). Although it seems natural to us humans, finding a set of rules to address this problem is complex, especially since what makes a sentence a title (or body text) may differ from one document to another. For instance, colors, left indentation or font case may convey different information about the text we want to classify.
To solve this puzzle we developed a pipeline made of several steps (see the sketch after this list), consisting of:
- assembling characters into lines we call textlines,
- merging textlines into blocks of text we call textblocks,
- classifying textblocks as text or title,
- ranking title textblocks.
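To keep the vocabulary straight, here is a minimal sketch of the objects these steps manipulate. The class and field names are illustrative (they mirror pdfplumber-style character attributes), not the exact classes used in our pipeline.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Char:
    """A single character extracted from the PDF, with its layout metadata."""
    text: str
    x0: float        # left coordinate on the page
    top: float       # distance from the top of the page
    fontname: str
    size: float

@dataclass
class TextLine:
    """A horizontal line of text: characters sharing the same imaginary baseline."""
    chars: List[Char] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "".join(c.text for c in self.chars)

@dataclass
class TextBlock:
    """A paragraph or title: one or more visually similar textlines."""
    lines: List[TextLine] = field(default_factory=list)
    label: str = "text"          # later set to "text" or "title" by the classifier
    rank: Optional[int] = None   # hierarchy level, filled in for titles only
```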
Partitioning space into textlines
First we divide the page space into textual areas, from which we assemble characters into textlines. Letters are usually stored in reading order. That being said, it is often not the case for footers, which are likely to appear at the top of the character stream. It is also often very difficult to sort characters on multi-column pages.
We developed a layout-based segmentation method to split the page space into homogeneous textual areas, like the several columns of a multi-column page. Those areas are then ordered from top-left to bottom-right, and within each area we assemble characters into textlines based on imaginary horizontal lines. We end up with an ordered list of textlines we will use to create textblocks.
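A minimal way to reproduce the "imaginary horizontal lines" idea is to cluster characters on their vertical position, then sort each group from left to right. This is only a sketch: the tolerance value is an assumption, and the real pipeline works inside the textual areas found by the layout-based segmentation.

```python
def group_chars_into_textlines(chars, y_tolerance=2.0):
    """Group characters whose vertical positions are within `y_tolerance`
    of each other, then order each group from left to right.
    `chars` are dicts with "text", "x0" and "top" keys (pdfplumber-style)."""
    lines = []
    for ch in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        # Attach the character to the last line if it sits on (roughly) the same baseline.
        if lines and abs(lines[-1][-1]["top"] - ch["top"]) <= y_tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    # Within each line, order characters from left to right.
    return [sorted(line, key=lambda c: c["x0"]) for line in lines]

# Example usage: textlines = group_chars_into_textlines(page.chars)
```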
Preprocessing
Some elements extracted in the previous step would hinder the structuring process. Therefore we apply a set of rule-based deterministic algorithms (one of them is sketched after this list) to:
- discard non relevant characters (e.g. invisible, shadowed, upside-down, etc.),
- tag characters with specific metadata (e.g. underlined, subscript, superscript, etc.),
- discard repetitive headers and footers as well as page numbers,
- detect the page title and the published date (if any),
- extract the table of contents (if any, for later use),
- detect and transform tables (characters surrounded by rectangles) into structured HTML tables.
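As an illustration of these rules, here is one way repetitive headers and footers can be detected: a textline whose text and vertical position recur on most pages is probably page furniture. The threshold below is an assumption, not the value used in our pipeline.

```python
from collections import Counter

def find_repetitive_lines(pages_textlines, min_ratio=0.6):
    """Return the (text, rounded vertical position) pairs that appear on at least
    `min_ratio` of the pages: likely headers, footers or other page furniture.
    `pages_textlines` is a list (one item per page) of lists of (text, top) tuples."""
    counts = Counter()
    for page in pages_textlines:
        # Count each (text, position) pair at most once per page.
        counts.update({(text, round(top)) for text, top in page})
    threshold = min_ratio * len(pages_textlines)
    return {key for key, n in counts.items() if n >= threshold}
```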
Characters and their metadata are easily extracted with tools like pdftohtml [2] (based on Poppler [3]) or pdfplumber [4] (based on pdfminer.six [5]).
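For example, with pdfplumber every character comes with its text, coordinates, font name and size ("document.pdf" below is a placeholder path):

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    first_page = pdf.pages[0]
    for char in first_page.chars[:5]:
        # Each char is a dict carrying, among others, its text, coordinates, font and size.
        print(char["text"], char["x0"], char["top"], char["fontname"], char["size"])
    # Lines and rectangles (useful to rebuild tables) are exposed as well.
    print(len(first_page.lines), len(first_page.rects))
```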
Using feature engineering and machine learning to structure documents
We implemented machine learning models in order to merge textlines into textblocks and classify them. Each model is trained with hand-crafted features which describe textlines or textblocks themselves (unary features) or their relations with other elements (relative features).
Most of the time invested in improving the models' performance was dedicated to enhancing the hand-crafted features, not to reducing the models' complexity or to fine-tuning them. This problem is a good example where a few well-designed features are more beneficial than the machine learning algorithms themselves.
Merging textlines into textblocks
A textblock is a set of one or more visually similar textlines (horizontal lines of text). It is usually a title or paragraph and can be spread out across several columns or pages.
To decide whether two textlines should be merged into the same textblock, we create several features. At this stage of the pipeline they are mainly layout-based and boolean, therefore one-hot encoded. A part of the resulting vector represents the style difference between the two textlines (e.g. are they both italic, bold or underlined; do they have the same color, font and font size). Some semantic information is also encoded (e.g. punctuation, font case). Finally, continuous features are added, like the pixel interval between the two textlines, which is discretized and normalized over the minimum and maximum intervals found across the document (excluding outliers).
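A hedged sketch of what such a pairwise feature vector can look like is shown below; the feature names, attributes and normalization are illustrative, not the exact ones we use.

```python
def pair_features(prev, curr, min_gap, max_gap):
    """Build a small feature vector describing whether `curr` should be merged
    with `prev` into the same textblock. `prev` and `curr` are dicts carrying
    style and layout attributes; `min_gap`/`max_gap` are the document-level
    extrema of vertical intervals (outliers removed), used for normalization."""
    gap = curr["top"] - prev["bottom"]
    normalized_gap = (gap - min_gap) / (max_gap - min_gap) if max_gap > min_gap else 0.0
    return [
        # Style similarities between the two textlines (boolean, one-hot friendly).
        float(prev["bold"] == curr["bold"]),
        float(prev["italic"] == curr["italic"]),
        float(prev["font"] == curr["font"]),
        float(prev["size"] == curr["size"]),
        float(prev["color"] == curr["color"]),
        # Simple semantic cues.
        float(prev["text"].rstrip().endswith((".", ":", ";"))),
        float(curr["text"][:1].isupper()),
        # Layout cue: vertical interval, normalized over the document.
        normalized_gap,
    ]
```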
We end up with 23 features as the input of an ensemble model consisting of an SVM, a random forest and a neural network. The model is trained on 12,305 pairs of textlines collected from a dataset of 27 PDF files we labeled. It is a binary classifier which predicts block_in if the second textline should be merged with the first one into the same textblock, or block_start to start a new textblock.
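With scikit-learn, such an ensemble can be assembled in a few lines; the hyperparameters below are placeholders, not the ones we actually use.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Soft-voting ensemble over an SVM, a random forest and a small neural network.
ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(probability=True)),  # probability=True is required for soft voting
        ("forest", RandomForestClassifier(n_estimators=100)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)),
    ],
    voting="soft",
)

# X: (n_pairs, 23) feature matrix, y: "block_in" / "block_start" labels.
# ensemble.fit(X_train, y_train)
# predictions = ensemble.predict(X_test)
```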
Classifying textblocks
The next step of the pipeline is the classification of textblocks as text or title. In the same way as for the segmentation, we designed a set of 43 hand-crafted features. We reuse most of the segmentation layout-based and semantic features, plus some title-specific ones: for instance, whether the text of the textblock appears in the table of contents (if any), whether it contains verbs (titles usually don't) or whether it is the last textblock of the page. The meaning of the text also conveys a lot of information, especially if it starts with some kind of numbering (e.g. “IV.2.a) This is a title”).
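The numbering feature, for instance, can be approximated with a regular expression; the pattern below is an illustrative guess, not the exact one used in our pipeline.

```python
import re

# Matches common heading numberings such as "1.", "1.2.3.", "IV.2.a)" or "A)".
NUMBERING_PATTERN = re.compile(r"^\s*((\d+|[IVXLC]+|[A-Za-z])[.)])+\s+", re.IGNORECASE)

def starts_with_numbering(text: str) -> bool:
    """True if the textblock starts with something that looks like a title number."""
    return NUMBERING_PATTERN.match(text) is not None

print(starts_with_numbering("IV.2.a) This is a title"))           # True
print(starts_with_numbering("This paragraph describes dosing."))  # False
```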
Using the same dataset, we trained a similar ensemble model on 5,582 textblocks.
Ranking title textblocks
The last step is to hierarchize textblocks into a tree structure. Text textblocks are always leaf nodes, as opposed to title textblocks, which should always be parent nodes. Title textblocks with the same hierarchical level usually share the same style (font, font size, color, indentation) and the same title numbering pattern (e.g. ”1.1”, ”1.2”, ”2.1”, ”3.1”), if any. That being said, it is sometimes not the case in long PDF files (hundreds of pages), where titles with different styles may have the same rank.
First we tried to apply the previous method to title hierarchization, which can be expressed as a classification problem: given two title textblocks as input, the model outputs one of three labels: same_rank, upper_rank or lower_rank. Though simple, this solution doesn’t take advantage of the global structure and therefore yields mediocre results.
We ended up implementing a deterministic algorithm which outperforms our (simple) machine learning model. It relies on a set of implicit rules which hold for most documents (a simplified version is sketched after this list):
- the first found title is at the root and therefore does not have a parent node (other titles with the same style will most likely also be at the root),
- a title with a style never encountered is a child of the previous node (provided it satisfies a few additional rules),
- all titles with the same style are sibling children (allowing titles with different styles to have the same rank, if they are not in the same branch).
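The following is a simplified sketch of such a rule set, assuming each title textblock carries a hashable style key (font, font size, color, indentation). It illustrates the stack-based idea, not the exact algorithm we run in production.

```python
def rank_titles(titles):
    """Assign a hierarchy level (0 = root) to each title textblock.
    `titles` is an ordered list of dicts with a hashable "style" key;
    a "rank" key is added in place."""
    known_ranks = {}   # style -> rank already assigned to that style
    stack = []         # styles of the currently open branches, from root to leaf

    for title in titles:
        style = title["style"]
        if style in known_ranks:
            # Titles sharing a known style are siblings: reuse its rank
            # and close any deeper branches.
            rank = known_ranks[style]
            del stack[rank:]
        else:
            # A never-seen style becomes a child of the previous title.
            rank = len(stack)
            known_ranks[style] = rank
        stack.append(style)
        title["rank"] = rank
    return titles
```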
The algorithm achieves a precision score of 0.84 over 1,024 title textblocks. In detail, the algorithm outputs the rank of almost every title correctly, but a single error (especially near the beginning) usually shifts the rank of every following title, causing a major score drop.
Title hierarchization is more complex than textline segmentation and textblock classification, and therefore cannot be solved easily with a deterministic algorithm nor with simple machine learning models. Deep learning approaches [6] have proven successful and might be our next focus.
Some implementation details
We trained and tested our pipeline on a dataset of 27 PDF files published by several French public health agencies, each up to a few tens of pages long. Those files were hand-picked for their layout quality so that our models would generalize well. Although they differ enough from one another (preventing over-fitting), they also look alike despite having different templates (they are all medical documents). This explains the fairly good results we got on textline segmentation and textblock classification.
Processing time depends on the PDF complexity (e.g. multi-column layouts, tables, etc.), but it takes up to approximately one second to parse a page (with Python and scikit-learn [7]). We are currently using our pipeline on more than 20k PDF files (some of them more than 500 pages long), transformed into 420k paragraphs (a text textblock with its set of titles). They are fed into our search engine and updated on a weekly basis.
Conclusion
The pipeline we developed to structure PDF files is in its early versions. For this first end-to-end implementation we focused on the most common labels (text and title), but we are considering adding other ones like quote, reference, figure/table_label, address and so on.
We are also looking into sequence labeling to improve title hierarchization, as well as computer vision to improve textline segmentation and textblock classification. That being said, we want to keep our pipeline fast and light on computational resources.
This approach has greatly improved the quality of the results in our search engine. This, in turn, helps healthcare professionals provide better care to their patients. Posos is keeping up to date with research on document structuration, as it improves healthcare professionals’ daily practice.
References
[1] PDF, Wikipedia (https://en.wikipedia.org/wiki/PDF)
[2] pdftohtml (https://doc.ubuntu-fr.org/pdftohtml)
[3] Poppler (https://poppler.freedesktop.org/)
[4] pdfplumber (https://github.com/jsvine/pdfplumber)
[5] pdfminer.six (https://github.com/pdfminer/pdfminer.six)
[6] Bentabet, Najah-Imane & Juge, Rémi & Ferradans, Sira. (2019). Table-Of-Contents generation on contemporary documents.
[7] scikit-learn (https://scikit-learn.org/stable/)