NAME

pdf2xml - extract text from PDF files and wrap it in XML

SYNOPSIS

 pdf2xml [OPTIONS] pdf-file > output.xml

For more information, see the man page of the command-line tool pdf2xml. pdf2xml can also be used as a library via the pdf2xml function:

 use Text::PDF2XML;

 my $xml = pdf2xml( $pdf_file, %options );

 pdf2xml( $pdf_file, output => \*STDOUT, %options );
 pdf2xml( $pdf_file, output => 'file.xml', %options );

 %options = (
    conversion_tool         => 'pdfXtk',        # use pdfXtk (default = 'tika')
    keep_vocabulary         => 1,               # don't reset the vocabulary
    vocabulary              => 'filename',      # plain text file
    vocabulary_from_pdf     => 0,               # skip pdftotext
    vocabulary_from_raw_pdf => 0,               # skip pdftotext -raw
    vocabulary_from_tika    => 1,               # read voc from Apache Tika
    java                    => '/path/to/java', # java binary
    java_heap               => '8g',            # default = 1g
    split_into_characters   => 1,               # split into characters
    detect_languages        => 1,               # enable language detection
    keep_languages          => 'en',            # only keep English sentences
    lowercase               => 0,               # switch off lower-casing
    dehyphenate             => 0,               # switch off de-hyphenation
    character_merging       => 0,               # skip char merging
    paragraph_merging       => 0,               # skip paragraph merging
    request_timeout         => 180,             # server request timeout (Tika)
    verbose                 => 1                # verbose output
    );

 pdf2xml( $pdf_file, output => 'file.xml', %options );

Note that options persist into the next pdf2xml call! To change the behaviour in subsequent calls while the library is loaded, you need to overwrite them explicitly.
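One way to avoid surprises from persistent options is to pass a complete option set on every call. The following is only a sketch: the helper name call_options and the %BASELINE hash are hypothetical, and the baseline values are assumptions read off the comments in the option list above (for example, that lower-casing is on unless switched off). It also assumes that setting an option to 1 restores the behaviour; check the module's own documentation for the actual defaults.

```perl
use strict;
use warnings;

# Assumed baseline: values inferred from the comments in the option list,
# not confirmed defaults of Text::PDF2XML.
my %BASELINE = (
    conversion_tool   => 'tika',
    java_heap         => '1g',
    lowercase         => 1,
    dehyphenate       => 1,
    character_merging => 1,
    paragraph_merging => 1,
);

# Hypothetical helper: return the baseline with per-call overrides applied,
# so that every option is stated explicitly on every call.
sub call_options {
    my (%overrides) = @_;
    return ( %BASELINE, %overrides );
}

my %first  = call_options( lowercase => 0 );   # lower-casing off for this call
my %second = call_options();                   # lowercase explicitly back to 1

# With Text::PDF2XML loaded, the calls would then look like this:
#   pdf2xml( 'a.pdf', output => 'a.xml', call_options( lowercase => 0 ) );
#   pdf2xml( 'b.pdf', output => 'b.xml', call_options() );
```

Because later pairs in a Perl list override earlier ones when assigned to a hash, the overrides win over the baseline, and options not overridden are reset to their baseline value rather than inherited from the previous call.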

DESCRIPTION

Extract text from PDF using external tools and some post-processing heuristics. Here is an example with and without post-processing:

  raw:    <p>PRESENTATION ET R A P P E L DES PRINCIPAUX RESULTATS 9</p>
  clean:  <p>PRESENTATION ET RAPPEL DES PRINCIPAUX RESULTATS 9</p>

  raw:    <p>2. Les c r i t è r e s de choix : la c o n s o m m a t i o n 
             de c o m b u s - t ib les et l e u r moda l i t é 
             d ' u t i l i s a t i on d 'une p a r t , 
             la concen t r a t ion d ' a u t r e p a r t 16</p>

  clean:  <p>2. Les critères de choix : la consommation 
             de combustibles et leur modalité 
             d'utilisation d'une part, 
             la concentration d'autre part 16</p>

TODO

Character merging heuristics are very simple. Using the longest string forming a valid word from the vocabulary may lead to many incorrect words in context for some languages. Also, the implementation of the merging procedure is probably not the most efficient one.
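To illustrate the kind of longest-match heuristic described here, the following sketch greedily merges adjacent tokens into the longest concatenation found in a vocabulary. It is not the module's actual implementation; the function name merge_tokens and the tiny vocabulary are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: starting at each position, merge as many adjacent
# tokens as possible such that their concatenation (lower-cased) is a
# vocabulary entry. Tokens that cannot be merged are kept unchanged.
sub merge_tokens {
    my ( $vocab, @tokens ) = @_;
    my @out;
    my $i = 0;
    while ( $i < @tokens ) {
        my $best      = $i;    # index of the last token in the best match
        my $candidate = '';
        for my $j ( $i .. $#tokens ) {
            $candidate .= $tokens[$j];
            $best = $j if $j > $i && $vocab->{ lc $candidate };
        }
        push @out, join( '', @tokens[ $i .. $best ] );
        $i = $best + 1;
    }
    return @out;
}

# Toy vocabulary (an assumption for this example only).
my %vocab = map { $_ => 1 }
    qw(presentation et rappel des principaux resultats);

my @merged = merge_tokens( \%vocab,
    split ' ', 'PRESENTATION ET R A P P E L DES PRINCIPAUX RESULTATS 9' );

print join( ' ', @merged ), "\n";
# prints: PRESENTATION ET RAPPEL DES PRINCIPAUX RESULTATS 9
```

The sketch also shows why the heuristic can go wrong: with a large vocabulary, a longer concatenation that happens to be a valid word is always preferred, even when the shorter words were intended. The inner loop rescans the rest of the token list from every position, so the run time is quadratic in the number of tokens.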

De-hyphenation heuristics could also be improved. The problem is to keep it as language-independent as possible.
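A vocabulary-driven check is one way to keep de-hyphenation language-independent. The sketch below joins a token ending in a hyphen with the following token only if the joined form is a known word. The function name dehyphenate and the vocabulary are hypothetical and do not reflect how the module itself works.

```perl
use strict;
use warnings;

# Hypothetical helper: join "combus-" + "tibles" only if the result is a
# vocabulary entry; otherwise leave both tokens as they are.
sub dehyphenate {
    my ( $vocab, @tokens ) = @_;
    my @out;
    while (@tokens) {
        my $tok = shift @tokens;
        if ( $tok =~ /^(.+)-$/ && @tokens ) {
            my $joined = $1 . $tokens[0];
            if ( $vocab->{ lc $joined } ) {
                shift @tokens;      # consume the second half
                $tok = $joined;
            }
        }
        push @out, $tok;
    }
    return @out;
}

# Toy vocabulary (an assumption for this example only).
my %vocab = map { $_ => 1 } qw(combustibles);

my @clean = dehyphenate( \%vocab, qw(de combus- tibles et porte- monnaie) );

print join( ' ', @clean ), "\n";
# prints: de combustibles et porte- monnaie
```

The second hyphenated pair is left untouched because the joined form is not in the vocabulary, which avoids destroying genuine compounds at the cost of missing words the vocabulary does not cover.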

SEE ALSO

Apache Tika: http://tika.apache.org

The Poppler Developers: http://poppler.freedesktop.org

pdfXtk: http://sourceforge.net/projects/pdfxtk/

COPYRIGHT AND LICENSE

Copyright (C) 2013 by Joerg Tiedemann

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.