The pdftotext package provides functions for extracting plain text from PDF documents. It uses the C++ library Poppler, which must be installed on the system. The output of the Haskell pdftotext library is identical to the output of Poppler's pdftotext tool.
import qualified Data.Text.IO as T
import Pdftotext

main :: IO ()
main = do
  Just pdf <- openFile "path/to/file.pdf"
  T.putStrLn $ pdftotext Physical pdf
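The pattern match on Just in the example above makes the program fail with a pattern-match error if the document cannot be opened. A slightly more defensive variant might look like the sketch below. It uses only the functions shown above (openFile, pdftotext, Physical) and assumes that openFile yields Nothing when the file cannot be loaded; the error message and the exit behaviour are illustrative choices, not part of the library.

```haskell
import qualified Data.Text.IO as T
import Pdftotext
import System.Exit (exitFailure)
import System.IO (hPutStrLn, stderr) -- explicit list avoids a clash with Pdftotext.openFile

main :: IO ()
main = do
  result <- openFile "path/to/file.pdf"
  case result of
    -- Assumption: Nothing signals that the PDF could not be opened.
    Nothing -> do
      hPutStrLn stderr "Could not open path/to/file.pdf"
      exitFailure
    -- Same extraction call as in the example above.
    Just pdf -> T.putStrLn (pdftotext Physical pdf)
```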
The pdftotext package comes with an executable program, pdftotext.hs, which can print the text extracted from a PDF and basic information about the document.
$> pdftotext.hs info test/simple.pdf
File : test/simple.pdf
Pages : 4
Properties
Title : Simple document for testing
Author : G. Eyaeb
Subject : Testing
Creator : pdflatex
Producer: LaTeX with hyperref
Keywords: haskell,pdf
$> pdftotext.hs text --pages 1,4 test/simple.pdf
Simple document for testing
deserve neither
liberty nor safety.
See help for more information:
$> pdftotext.hs --help
$> pdftotext.hs text --help
$> pdftotext.hs info --help
The library uses Poppler via FFI, so internally all functions are of type IO. However, their non-IO variants (which use unsafePerformIO) should be safe to use. The module Pdftotext.Internal exposes all IO-typed functions.
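The general pattern of pairing an IO-typed function with a pure wrapper can be sketched as follows. This is not the library's code: countWordsIO is a hypothetical stand-in for an FFI call, and countWords is a hypothetical pure variant built on top of it. The wrapper is only reasonable when the result depends on nothing but the argument.

```haskell
import System.IO.Unsafe (unsafePerformIO)

-- Hypothetical stand-in for an IO-typed binding; a real binding
-- would call into a foreign library instead of computing locally.
countWordsIO :: String -> IO Int
countWordsIO txt = pure (length (words txt))

-- Hypothetical pure variant. The same input always gives the same
-- result, so hiding the IO does not change observable behaviour.
-- NOINLINE is the usual precaution for unsafePerformIO wrappers.
countWords :: String -> Int
countWords txt = unsafePerformIO (countWordsIO txt)
{-# NOINLINE countWords #-}

main :: IO ()
main = do
  n <- countWordsIO "one two three"
  -- Both variants agree on the same input; prints True.
  print (n == countWords "one two three")
```

Code that needs explicit control over when the underlying call happens can stay with the IO-typed variant.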
The project is hosted at https://sr.ht/~geyaeb/haskell-pdftotext/ . The homepage provides links to the Mercurial repository, the mailing list and the ticket tracker.
Patches, suggestions, questions and general discussions can be sent to the mailing list. Detailed information about sending patches by email can be found at https://man.sr.ht/hg.sr.ht/email.md.