NAME

Archive::Tar::Stream - pure perl IO-friendly tar file management

VERSION

Version 0.02

SYNOPSIS

Archive::Tar::Stream grew from a requirement to process very large archives containing email backups, where the IO hit for unpacking a tar file, repacking parts of it, and then unlinking all the files was prohibitive.

Archive::Tar::Stream takes two file handles, one purely for reads and one purely for writes. It does no seeking: it just unpacks individual records from the input filehandle and packs records to the output filehandle.

This module does not attempt to do any file handle management or compression for you. External zcat and gzip are quite fast and use separate cores.

    use Archive::Tar::Stream;
    use IO::File;

    # add a single file to an output tar stream
    my $ts = Archive::Tar::Stream->new(outfh => $outfh);
    $ts->AddFile($name, -s $fh, $fh);

    # remove large non-jpeg files from a tar.gz
    my $infh = IO::File->new("zcat $infile |") || die "oops";
    my $gzfh = IO::File->new("| gzip > $outfile") || die "double oops";
    my $filter = Archive::Tar::Stream->new(infh => $infh, outfh => $gzfh);
    $filter->StreamCopy(sub {
        my ($header, $outpos, $fh) = @_;

        # we want all small files
        return 'KEEP' if $header->{size} < 64 * 1024;
        # and any other jpegs
        return 'KEEP' if $header->{name} =~ m/\.jpg$/i;

        # no, seriously: ask to be re-called with the file contents
        return 'EDIT' unless $fh;

        return 'KEEP' if mimetype_of_filehandle($fh) eq 'image/jpeg';

        # ok, we don't want other big files
        return 'SKIP';
    });
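
The synopsis leaves mimetype_of_filehandle undefined. A minimal stand-in, assuming it only needs to recognise JPEG data by its leading magic bytes, might look like the sketch below. The helper body is hypothetical and is not part of this module.

```perl
use strict;
use warnings;

# Hypothetical helper: not provided by Archive::Tar::Stream.
# Returns 'image/jpeg' when the handle starts with the JPEG
# start-of-image marker (FF D8 FF), otherwise a generic type.
sub mimetype_of_filehandle {
    my ($fh) = @_;

    seek($fh, 0, 0) or die "seek failed: $!";
    my $got = read($fh, my $magic, 3);
    die "read failed: $!" unless defined $got;
    seek($fh, 0, 0) or die "seek failed: $!";   # leave the handle rewound

    return 'image/jpeg' if $got == 3 && $magic eq "\xFF\xD8\xFF";
    return 'application/octet-stream';
}
```

A real deployment would more likely call a dedicated file-type library; the point here is only that the chooser needs a seekable filehandle to inspect.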

SUBROUTINES/METHODS

new

    my $ts = Archive::Tar::Stream->new(%args);

Args:

  infh      - filehandle to read from
  outfh     - filehandle to write to
  inpos     - initial offset in infh
  outpos    - initial offset in outfh
  safe_copy - boolean

Offsets are for informational purposes only, but can be useful if you are tracking offsets of items within your tar files separately. All read and write functions update these offsets. If you don't provide offsets, they will default to zero.
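
One way to use the offsets is to build an index of where each member lands in the output. The sketch below assumes that the $outpos passed to the chooser is the position at which that member's header will be written if it is kept; the names index_chooser and copy_with_index are made up for this example.

```perl
use strict;
use warnings;

my %offset_of;   # member name => assumed offset of its header in the output

# Chooser that keeps everything and records where each member lands.
sub index_chooser {
    my ($header, $outpos, $fh) = @_;
    $offset_of{ $header->{name} } = $outpos;
    return 'KEEP';
}

# Wiring sketch: $infh and $outfh are handles you have already opened.
# The module is loaded lazily so the chooser can be exercised on its own.
sub copy_with_index {
    my ($infh, $outfh) = @_;
    require Archive::Tar::Stream;
    my $ts = Archive::Tar::Stream->new(infh => $infh, outfh => $outfh);
    $ts->StreamCopy(\&index_chooser);
    return \%offset_of;
}
```

If the output file already holds data, passing its current size as outpos to new() should make the recorded offsets absolute rather than relative.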

Safe Copy is the default - you have to explicitly turn it off. If Safe Copy is set, every file is first extracted from the input filehandle and stored in a temporary file before appending to the output filehandle. This uses slightly more IO, but guarantees that a truncated input file will not corrupt the output file.
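
The idea behind Safe Copy can be illustrated without the module. The following is a sketch of the general staging technique, not the module's actual implementation: the payload is collected in a temporary file, and the output is only touched once every expected byte has arrived. The function name staged_copy is hypothetical.

```perl
use strict;
use warnings;
use File::Temp ();

# Illustration only: stage $size bytes from $in in a temporary file,
# and append to $out only once the full payload has arrived.
sub staged_copy {
    my ($in, $out, $size) = @_;

    my $tmp = File::Temp->new;          # removed automatically on destruction
    binmode($tmp);
    my $left = $size;
    while ($left > 0) {
        my $want = $left < 65536 ? $left : 65536;
        my $got  = read($in, my $buf, $want);
        die "read failed: $!" unless defined $got;
        die "input truncated: $left bytes missing\n" if $got == 0;
        print {$tmp} $buf or die "write failed: $!";
        $left -= $got;
    }

    # Only now touch the output stream.
    $tmp->flush;
    seek($tmp, 0, 0) or die "seek failed: $!";
    local $/ = \65536;                  # readline returns fixed-size chunks
    while (defined(my $chunk = readline($tmp))) {
        print {$out} $chunk or die "write failed: $!";
    }
    return $size;
}
```

With a truncated input the function dies before writing anything, which is the guarantee described above; the cost is one extra write and read of the payload.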

SafeCopy

   $ts->SafeCopy(0);

Sets the "safe_copy" field mentioned above; pass a false value to turn Safe Copy off.

InPos

OutPos

Read-only accessors for the internal position trackers for the two tar streams.

AddFile

Adds a file to the output filehandle, adding sensible defaults for all the extra header fields.

Requires: outfh

   my $header = $ts->AddFile($name, $size, $fh, %extra);

See TARHEADER for documentation of the header fields.

You must provide 'size' due to the non-seeking nature of this library, but "-s $fh" is usually fine.

Returns the complete header that was written.

AddLink

   my $header = $ts->AddLink($name, $linkname, %extra);

Adds a symlink to the output filehandle.

See TARHEADER for documentation of the header fields.

Returns the complete header that was written.

StreamCopy

Streams all records from the input filehandle and provides an easy way to write them to the output filehandle.

Requires: infh

Optional: outfh - required if you return 'KEEP'

    $ts->StreamCopy(sub {
        my ($header, $outpos, $fh) = @_;
        # ...
        return 'KEEP';
    });

The chooser function can either return a single 'action' or a tuple of action and a new header.

The action can be:

  KEEP - copy this file as is (possibly with a changed header) to the output tar
  EDIT - re-call $Chooser with a filehandle
  SKIP - skip over the file and call $Chooser on the next one
  EXIT - skip and also stop further processing
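
As an example of the tuple form, a chooser can change only the header and keep the data untouched. The sketch below assumes the tuple is returned as a plain two-element list; prefix_chooser is a made-up name.

```perl
use strict;
use warnings;

# Chooser that moves every member under "backup/" and keeps the data as is.
sub prefix_chooser {
    my ($header, $outpos, $fh) = @_;
    $header->{name} = "backup/$header->{name}";
    return ('KEEP', $header);
}
```

It would be passed as $ts->StreamCopy(\&prefix_chooser). The tar name field is only 100 bytes wide, so very long resulting names may need extra care.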

EDIT mode:

The file will be copied to a temporary file and the filehandle passed to $Chooser, which can truncate, rewrite or otherwise edit it. So long as $Chooser updates $header->{size} and returns the header as the second element of its return value ($newheader), it's all good.

You don't have to change the file, of course: EDIT is also a good way just to view the contents of some files as you stream them.
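
A sketch of an editing chooser, assuming the temporary filehandle supports truncate and that the module copies $header->{size} bytes from it once 'KEEP' is returned. The 1024-byte cap and the name truncate_logs are invented for the example.

```perl
use strict;
use warnings;

my $MAX = 1024;   # hypothetical cap on stored log size, in bytes

# Keep only the first $MAX bytes of oversized .log members.
sub truncate_logs {
    my ($header, $outpos, $fh) = @_;

    return 'KEEP' unless $header->{name} =~ m/\.log$/;
    return 'KEEP' if $header->{size} <= $MAX;

    # First call has no filehandle: ask to be called again with one.
    return 'EDIT' unless $fh;

    truncate($fh, $MAX) or die "truncate failed: $!";
    $header->{size} = $MAX;
    return ('KEEP', $header);
}
```

The header size and the file contents must agree, otherwise the output tar will be misaligned for every member that follows.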

A standard usage pattern looks like this:

  $ts->StreamCopy(sub {
    my ($header, $outpos, $fh) = @_;

    # simple checks
    return 'KEEP' if do_want($header);
    return 'SKIP' if dont_want($header);

    return 'EDIT' unless $fh;

    # checks that require a filehandle
  });

ReadBlocks

Requires: infh

   my $raw = $ts->ReadBlocks($nblocks);

Reads 'n' blocks of 512 bytes from the input filehandle and returns them as a single scalar.

Returns undef at EOF on the input filehandle. Any further calls after undef is returned will die. This is to avoid naive programmers creating infinite loops.

nblocks is optional, and defaults to 1.
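
Because a call after EOF dies, a read loop should test for undef and then stop. A small sketch (count_blocks is a hypothetical helper that takes an Archive::Tar::Stream object created with infh set):

```perl
use strict;
use warnings;

# Count the 512-byte blocks remaining on the input of $ts.
sub count_blocks {
    my ($ts) = @_;
    my $count = 0;
    while (defined(my $raw = $ts->ReadBlocks())) {
        $count += length($raw) / 512;
    }
    # ReadBlocks has returned undef: do not call it again on this object.
    return $count;
}
```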

WriteBlocks

Requires: outfh

   my $pos = $ts->WriteBlocks($buffer, $nblocks);

Write blocks to the output filehandle. If the buffer is too short, it will be padded with zero bytes. If it's too long, it will be truncated.

nblocks is optional, and defaults to 1.

Returns the position in the output stream at which the blocks were written (for a header block, that is the position of the header).
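
The pad-or-truncate behaviour matches what Perl's pack does with the "a" format. The sketch below shows that technique on its own; whether the module uses pack internally is an assumption, and fit_to_blocks is a made-up name.

```perl
use strict;
use warnings;

# Force $buffer to exactly $nblocks * 512 bytes: pack's "a" format
# pads short strings with NUL bytes and truncates long ones.
sub fit_to_blocks {
    my ($buffer, $nblocks) = @_;
    $nblocks ||= 1;
    my $bytes = 512 * $nblocks;
    return pack("a$bytes", $buffer);
}
```

Note the silent truncation: a 600-byte buffer written with the default nblocks of 1 loses its last 88 bytes.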

ReadHeader

Requires: infh

   my $header = $ts->ReadHeader(%Opts);

Read a single 512 byte header off the input filehandle and convert it to a TARHEADER format hashref. Returns undef at the end of the file.

If the option (SkipInvalid => 1) is passed, it will skip over blocks which fail to pass the checksum test.

WriteHeader

Requires: outfh

   my $newheader = $ts->WriteHeader($header);

Write a single 512 byte header, built from a TARHEADER format hashref, to the output filehandle.

Returns a copy of the header with _pos set to the position in the output file.

ParseHeader

   my $header = $ts->ParseHeader($block);

Parse a single block of raw bytes into a TARHEADER format header. $block must be exactly 512 bytes.

Returns undef if the block fails the checksum test.
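
For reference, the checksum test in the ustar format is the sum of all 512 header bytes, with the 8-byte checksum field at offset 148 counted as spaces. A standalone sketch of that check follows; the function names are hypothetical and this is not the module's own code.

```perl
use strict;
use warnings;

# Sum of all 512 header bytes, with the 8-byte checksum field
# (offset 148) counted as ASCII spaces, per the ustar layout.
sub tar_checksum {
    my ($block) = @_;
    die "header block must be 512 bytes\n" unless length($block) == 512;
    substr($block, 148, 8) = ' ' x 8;      # works on our private copy
    return unpack('%32C*', $block);
}

# True when the stored octal checksum matches the computed one.
sub checksum_ok {
    my ($block) = @_;
    my ($stored) = substr($block, 148, 8) =~ m/([0-7]+)/;
    return defined $stored && oct($stored) == tar_checksum($block);
}
```

An all-zero block, such as the end-of-archive marker, has no stored checksum and so fails this test.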

BlankHeader

  my $header = $ts->BlankHeader(%extra);

Create a header with sensible defaults. That means time() for mtime, 0777 for mode, etc.

It then applies any 'extra' fields from %extra to generate a final header. Also validates the keys in %extra to make sure they're all known keys.

CreateHeader

   my $block = $ts->CreateHeader($header);

Creates a 512 byte block from the TARHEADER format header.
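
To show what such a block contains, here is a sketch that packs a ustar header by hand using the standard field widths. It illustrates the on-disk layout only; the module's own output may differ in details such as the magic or padding, and make_header_block is a made-up name.

```perl
use strict;
use warnings;

# ustar field widths, in order; they add up to 512 bytes.
my $TEMPLATE = 'a100 a8 a8 a8 a12 a12 a8 a1 a100 a6 a2 a32 a32 a8 a8 a155 a12';

# Build a header block. Numeric fields are stored as octal strings,
# NUL-padded to the field width by pack.
sub make_header_block {
    my (%h) = @_;
    my @fields = (
        $h{name},
        sprintf('%07o',  $h{mode}  // 0777),
        sprintf('%07o',  $h{uid}   // 0),
        sprintf('%07o',  $h{gid}   // 0),
        sprintf('%011o', $h{size}  // 0),
        sprintf('%011o', $h{mtime} // time()),
        ' ' x 8,                                  # checksum placeholder
        $h{typeflag} // '0',
        $h{linkname} // '',
        "ustar\0", '00',                          # magic and version
        $h{uname} // '', $h{gname} // '',
        sprintf('%07o', 0), sprintf('%07o', 0),   # devmajor, devminor
        $h{prefix} // '',
        '',                                       # trailing padding
    );
    my $block = pack($TEMPLATE, @fields);
    my $sum   = unpack('%32C*', $block);          # checksum over spaces
    substr($block, 148, 8) = sprintf("%06o\0 ", $sum);
    return $block;
}
```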

CopyBytes

   $ts->CopyBytes($bytes);

Copies bytes from input to output filehandle, rounded up to block size, so only whole blocks are actually copied.
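
The rounding is plain integer arithmetic. A sketch (blocks_for is a hypothetical helper, and BLOCKSIZE is defined locally rather than taken from the module):

```perl
use strict;
use warnings;

use constant BLOCKSIZE => 512;

# Number of whole blocks needed to hold $bytes of file data.
sub blocks_for {
    my ($bytes) = @_;
    return int(($bytes + BLOCKSIZE - 1) / BLOCKSIZE);
}
```

So a request to copy 1000 bytes moves two blocks, that is 1024 bytes, and an empty file moves none.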

DumpBytes

   $ts->DumpBytes($bytes);

Just like CopyBytes, but it doesn't write anywhere. Reads full blocks off the input filehandle, rounding up to block size.
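
Together with ReadHeader, this is enough to list an archive without writing anything. A sketch, using only the methods documented here; list_members is a made-up name, and how the trailing zero blocks of a finished tar are reported by ReadHeader is not covered by this example.

```perl
use strict;
use warnings;

# Return [name, size] pairs for every member readable from $ts,
# an Archive::Tar::Stream object created with infh set.
sub list_members {
    my ($ts) = @_;
    my @members;
    while (my $header = $ts->ReadHeader()) {
        push @members, [ $header->{name}, $header->{size} ];
        $ts->DumpBytes($header->{size});   # skip the data and its padding
    }
    return @members;
}
```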

FinishTar

   $ts->FinishTar();

Writes 5 blocks of zero bytes to the output file, which makes GNU tar happy that it has found the end of the file.

Don't use this if you're planning on concatenating multiple files together.

CopyToTempFile

   my $fh = $ts->CopyToTempFile($header->{size});

Creates a temporary file (with File::Temp) and fills it with the contents of the file on the input stream. It reads entire blocks, and discards the padding.

CopyFromFh

   $ts->CopyFromFh($fh, $header->{size});

Copies the contents of the filehandle to the output stream, padding out to block size.

TARHEADER format

This is the "BlankHeader" output, which includes all the fields in a standard tar header:

  my %hash = (
    name => '',
    mode => 0777,
    uid => 0,
    gid => 0,
    size => 0,
    mtime => time(),
    typeflag => '0', # this is actually the STANDARD plain file format, phooey.  Not 'f' like Tar writes
    linkname => '',
    uname => '',
    gname => '',
    devmajor => 0,
    devminor => 0,
    prefix => '',
  );

You can read more about the tar header format produced by this module on wikipedia: http://en.wikipedia.org/wiki/Tar_(file_format)#UStar_format or here: http://www.mkssoftware.com/docs/man4/tar.4.asp

Type flags:

  '0' Normal file
  (ASCII NUL) Normal file (now obsolete)
  '1' Hard link
  '2' Symbolic link
  '3' Character special
  '4' Block special
  '5' Directory
  '6' FIFO
  '7' Contiguous file
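
When filtering by type it can help to map the flag to a readable name first. A small lookup sketch; the labels and the function name type_name are chosen for this example and are not defined by the module.

```perl
use strict;
use warnings;

my %TYPE_NAME = (
    '0'  => 'file',
    "\0" => 'file',          # obsolete spelling of a normal file
    '1'  => 'hardlink',
    '2'  => 'symlink',
    '3'  => 'chardev',
    '4'  => 'blockdev',
    '5'  => 'directory',
    '6'  => 'fifo',
    '7'  => 'contiguous',
);

# Readable type for a TARHEADER format hashref.
sub type_name {
    my ($header) = @_;
    my $flag = $header->{typeflag};
    $flag = '0' unless defined($flag) && length($flag);
    return $TYPE_NAME{$flag} // 'unknown';
}
```

A chooser could then, for instance, return 'SKIP' for anything whose type_name is not 'file'.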

Obviously some module wrote 'f' as the type - I must have found that during original testing. That's bogus though.

AUTHOR

Bron Gondwana, <perlcode at brong.net>

BUGS

Please report any bugs or feature requests to bug-archive-tar-stream at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Archive-Tar-Stream. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Archive::Tar::Stream

LATEST COPY

The latest copy of this code, including development branches, can be found at

http://github.com/brong/Archive-Tar-Stream/

LICENSE AND COPYRIGHT

Copyright 2011 Opera Software Australia Pty Limited

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.