NAME

PDF::Xtract - Extracting sub PDF documents from a multi page PDF document, much faster than PDF::Extract!

SYNOPSIS

        # Please read Manual-PDF-Xtract.pdf in the distribution for detailed documentation.
        
        use PDF::Xtract;
        $pdf=new PDF::Xtract;
        @pages=(10..30,5,7); # Pages to extract (10 to 30, then 5 and 7), in the order required for output.
        $pages=\@pages;
        $pdf->savePDFExtract( PDFDoc=>"c:/Docs/my.pdf", PDFSaveAs=>"out.pdf", PDFPages=>$pages ); # Saves extracted pages to "out.pdf"

        print "Content-Type: text/plain\n\n<xmp>",  $pdf->getPDFExtract; # May be useful for a web-site!
 

OR

        # Extract and save, in the current directory, all the pages in a PDF document with nice names.
        use PDF::Xtract;
        $pdf=new PDF::Xtract( PDFDoc=>"test.pdf" );
        @tmp=$pdf->getPDFExtractVariables("PDFPageCountIn");
        $PageCount=${$tmp[0]};
        print STDERR "Total Pages = $PageCount\n";
        $tmp=length($PageCount);
        for ( $CurPage=1; $CurPage <= $PageCount; $CurPage++ ) {
                @CurPage=($CurPage); $CurRef=\@CurPage;
                $index=sprintf("%0${tmp}d",$CurPage);
                $pdf->savePDFExtract( PDFPages=>$CurRef,PDFSaveAs=>"$index.pdf" );
        }

DESCRIPTION

The PDF::Xtract module is derived from Noel Sharrock's PDF::Extract module, but is much faster. It is a group of methods that let the user extract required pages from an existing PDF document as a new PDF document.

PDF::Xtract is published as a separate module because its variables and functions differ significantly from those of PDF::Extract. While the code is, for the most part, a shameless copy of PDF::Extract, certain changes in the logic make this module much faster with large PDF files.

Notable differences between Xtract and Extract are also highlighted in this document.

With PDF::Xtract one can:-

  • Associate a PDF document to a PDF::Xtract object.

  • Get total number of pages in PDF document.

  • Extract required pages from a PDF document, as a new PDF document, in any specified page number order.

  • Specify name of file to save extracted PDF document.

AUTHOR

Sunil S, sunils_at_hpcl_co_in

Created by modifying PDF::Extract module by Noel Sharrock (http://www.lgmedia.com.au/PDF/Extract.asp) (Without PDF::Extract this would not be there!)

Many thanks for the inspiration from my colleagues at Hindustan Petroleum Corporation Limited, Mumbai, India.

COPYRIGHT

Copyright (c) 2005 by Sunil S. All rights reserved.

LICENSE

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, i.e., under the terms of the ``Artistic License'' or the ``GNU General Public License''.

The C library at the core of this Perl module can additionally be redistributed and/or modified under the terms of the ``GNU Library General Public License''.

DISCLAIMER

This package is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

See the ``GNU General Public License'' for more details.

PDF::Xtract - Extracting sub PDF documents from a multipage PDF document

Notes

Fri Jun 17 00:01:21 IST 2005

Version 0.08

Output did not open properly in Acrobat Reader! Xref table writing changed back to what Noel was doing.

Fri Apr 15 21:15:17 IST 2005

Version 0.07

Trying to accelerate output generation, which seems to be the remaining bottleneck.

Thu Apr 7 15:02:22 IST 2005

Version 0.06

Noticed that Xtract fails with very large PDFs (>400MB). This is now fixed by changing the way the file is read and interpreted. Fringe benefit: the module uses less memory than before. A new variable, PDFReadSize, is introduced; it specifies the number of bytes to read at a time when reading the input file.
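
As an illustration of the general idea behind a setting like PDFReadSize, here is a minimal sketch of reading a file in fixed-size blocks instead of loading it whole. The helper read_in_chunks and its callback interface are hypothetical and are not part of PDF::Xtract; the module's actual reading code may differ.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of PDF::Xtract: read $path in blocks of
# $read_size bytes and pass each block to $callback with its byte offset.
# Only one block is held in memory at a time.
sub read_in_chunks {
    my ($path, $read_size, $callback) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode($fh);                  # PDFs are binary; avoid newline translation
    my $offset = 0;
    while (1) {
        my $got = read($fh, my $chunk, $read_size);
        die "Read error on $path: $!" unless defined $got;
        last if $got == 0;         # end of file
        $callback->($chunk, $offset);
        $offset += $got;
    }
    close($fh);
    return $offset;                # total number of bytes read
}
```

A caller could, for example, pass a block size of 1024 * 1024 and a callback that scans each block for object markers. A real parser would also have to handle markers that straddle two blocks, which this sketch does not attempt.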

Thu Mar 10 15:02:47 IST 2005

Noticed a problem with include objects! A workaround is in place.

Thu Feb 20 15:02:47 IST 2005

The operational sequence within the module is being changed. The new organisation is as below:

The essential variable for doing anything is PDFDoc. Extraction and generation of the document run as soon as PDFPages is defined. The output is written to the disk file named by PDFSaveAs if one is given; otherwise it goes to the default extract file, $TempExtractFile.

Populating PDFExtract is now secondary! If someone asks for it, we return the content of the file $TempExtractFile.
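
The sequence described above can be modelled roughly as follows. This is only a sketch of the decision logic under the assumptions stated in the note; plan_extraction is a hypothetical helper, not a function of PDF::Xtract, and the hash merely mimics the module's named variables.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only; NOT PDF::Xtract's own code.
# $vars mimics the named variables PDFDoc, PDFPages and PDFSaveAs.
# Returns the path the extract would be written to, or nothing if
# extraction would not run yet.
sub plan_extraction {
    my ($vars, $temp_extract_file) = @_;

    # PDFDoc is essential for doing anything.
    die "PDFDoc is required\n" unless defined $vars->{PDFDoc};

    # Extraction only runs once PDFPages is defined.
    return unless defined $vars->{PDFPages};

    # Write to PDFSaveAs if given, else fall back to the temporary file.
    my $save_as = $vars->{PDFSaveAs};
    return (defined $save_as && length $save_as)
        ? $save_as
        : $temp_extract_file;
}
```

With PDFDoc and PDFPages set but no PDFSaveAs, the helper returns the temporary file path, which mirrors the note that the extract content is then served from $TempExtractFile.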