
NAME

Alvis::Pipeline - Perl extension for passing XML documents along the Alvis pipeline

SYNOPSIS

 use Alvis::Pipeline;
 my $in = Alvis::Pipeline::Read->new(host => "harvester.alvis.info",
                                     port => 16716,
                                     spooldir => "/home/alvis/spool");
 my $out = Alvis::Pipeline::Write->new(port => 29168);
 while (my $xml = $in->read(1)) {
     my $transformed = process($xml);
     $out->write($transformed);
 }

DESCRIPTION

This module provides a simple means for components in the Alvis pipeline to pass documents between themselves without needing to know about the underlying transfer protocol. Pipe objects may be created either for reading or writing; components in the middle of the pipeline will create one of each. Each pipe supports exactly one transfer method, which is either read() or write() depending on the type of the pipe. The granularity of reading and writing is the XML document; neither smaller fragments nor larger aggregates can be transferred.

The documents expected to pass through this pipeline are those representing documents acquired for, and being analysed by, Alvis. These documents are expressed as XML constructed according to the specifications described in the Metadata Format for Enriched Documents. However, while this is the motivating example pipeline that led to the creation of this module, there is no reason why other kinds of documents should not also be passed through a pipeline using this software.

The pipeline protocol is described below, to facilitate the development of independent implementations in other languages.

METHODS

new()

 my $in = Alvis::Pipeline::Read->new(host => "harvester.alvis.info",
                                     port => 16716,
                                     spooldir => "/home/alvis/spool");
 my $out = Alvis::Pipeline::Write->new(port => 29168);

Creates a new pipeline, either for reading or for writing. Any number of name-value pairs may be passed as parameters. Among these, most are optional but some are mandatory:

  • Read-pipes must specify both the host and port of the component that they will read from, and spooldir, a directory that is writable by the user the process is running as. (When files become available by being written down a write-pipe, they are immediately read in the background, then stored in the specified spool directory until picked up by a reader.)

  • Pipes may specify loglevel [default 0]: higher levels provide some commentary on under-the-hood behaviour.

option()

 $old = $pipe->option("foo");
 $pipe->option(bar => 23);

Can be used to set the value for a specific option, or to retrieve its value.

read()

 # Read-pipes only
 $xml = $in->read($block);

Reads an XML document from the specified inbound pipe, and returns it as a string. If there is no document ready to read, it either returns an undefined value (if no argument is provided, or if the argument is false) or blocks (if the argument is provided and true). read() throws an exception if an error occurs.

Once a document has been read in this way, it will no longer be available for subsequent read()s, so a sequence of read() calls will read all the available records one at a time.

Once a document has been read, it is the responsibility of the reader to process it and pass it on to the next component in the pipeline. If something catastrophic happens, and the record is lost, then an out-of-band mechanism may be used to request a new copy of the record from the writer. The Alvis::Pipeline module does not directly support such requests; they are considered to be application-level and therefore not appropriate for this low-level module to deal with.

(As a matter of application design, we offer the observation that, in Alvis, the <id> attribute on the top-level element specifies the identity of the record, and should remain unchanged even if the record itself is updated; so any out-of-band request for records to be re-sent should do so by specifying the IDs of the required records.)

write()

 # Write-pipes only
 $out->write($xmlDocument);

Writes an XML document to the specified outbound pipe. The document may be passed in either as a DOM tree (XML::LibXML::Element) or a string containing the text of the document. Throws an exception if an error occurs.

This method returns only when the record has been successfully transferred to the receiver at the other end of the pipeline, so the sender is then able to forget about the transferred record, which is now the responsibility of the next component in the pipeline.

close()

 $pipe->close();

Closes a pipe, after which no further reading or writing may be done on it. This is important for read-pipes, as it frees up the Internet port that the server is listening on.
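
As a rough illustration of how read(), write() and close() fit together in a mid-pipeline component, here is a minimal sketch. The drain_pipe name and the $process callback are hypothetical and not part of Alvis::Pipeline; $in and $out are assumed to be a read-pipe and a write-pipe created with new() as shown above.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Alvis::Pipeline: forward every document
# currently waiting on $in to $out, then close both pipes.
# $in and $out are assumed to behave like the read- and write-pipes
# documented above; $process is a caller-supplied transformation.
sub drain_pipe {
    my ($in, $out, $process) = @_;
    my $count = 0;

    # read() and write() are documented to throw on error, so trap that.
    my $ok = eval {
        # read() with no argument does not block: it returns undef
        # when no document is ready.
        while (defined(my $xml = $in->read())) {
            $out->write($process->($xml));
            $count++;
        }
        1;
    };
    my $error = $@;

    # Close both pipes whether or not an exception was thrown, so that
    # the read-pipe's listening port is released.
    $in->close();
    $out->close();

    die $error unless $ok;
    return $count;
}
```

Called as drain_pipe($in, $out, \&process), this returns the number of documents forwarded. Passing a true argument to read() instead would make the loop wait for further documents rather than finish when the spool is empty.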

PIPELINE PROTOCOL

Because the pipeline is unidirectional, it is very simple: there is no back-channel by which a downstream component can talk to an upstream one, and the protocol consists entirely of wrappings for the documents that are sent downstream.

Each document packet consists of the following, in order:

  1. The magic literal string Alvis::Pipeline, followed by a single newline character.

  2. Decimal-rendered protocol version-number (currently 1), followed by a single newline character.

  3. Decimal-rendered integer byte-count, followed by a single newline character. Note that the protocol counts bytes rather than characters: these two counts can differ when the document contains non-ASCII characters in a multi-byte encoding such as UTF-8.

  4. The XML document itself (or other binary object), of the length specified.

  5. The magic literal string --end--, followed by a single newline character.
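
The wrapping listed above could be produced along these lines. This is a sketch of the format as described, not the module's own code: encode_packet is a hypothetical name, and encoding the document as UTF-8 before counting is an assumption.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: wrap one document in the packet format listed above.
# $doc is assumed to be a Perl character string; it is encoded as UTF-8
# first so that length() counts bytes rather than characters.
sub encode_packet {
    my ($doc) = @_;
    my $bytes = encode('UTF-8', $doc);
    return join('',
        "Alvis::Pipeline\n",       # 1. magic string
        "1\n",                     # 2. protocol version
        length($bytes), "\n",      # 3. byte-count
        $bytes,                    # 4. the document itself
        "--end--\n",               # 5. terminator
    );
}
```

For instance, the 16-character document <name>Zoë</name> occupies 17 bytes in UTF-8, so its packet would carry a byte-count of 17.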

For example, the simple document

        <dinosaur type="sauropod">
          Brachiosaurus
        </dinosaur>

would be sent as the following packet:

        Alvis::Pipeline
        1
        55
        <dinosaur type="sauropod">
          Brachiosaurus
        </dinosaur>
        --end--

This packaging allows the downstream component to locate object boundaries and to consistency-check the stream.
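
A receiver might perform those checks roughly as follows. This sketch follows the packet layout listed above rather than the module's actual implementation; decode_packet is a hypothetical name, and the filehandle is assumed to deliver raw bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: read one packet from the filehandle $fh and return
# the document as a byte string. Dies if the stream does not match the
# packet layout described above.
sub decode_packet {
    my ($fh) = @_;

    my $magic = readline($fh);
    die "bad magic\n"
        unless defined $magic && $magic eq "Alvis::Pipeline\n";

    my $version = readline($fh);
    die "unsupported version\n"
        unless defined $version && $version eq "1\n";

    my $count = readline($fh);
    my ($want) = defined $count ? $count =~ /\A(\d+)\n\z/ : ();
    die "bad byte-count\n" unless defined $want;

    # read() may return fewer bytes than requested, so loop until the
    # whole document has arrived.
    my $doc = '';
    while (length($doc) < $want) {
        my $got = read($fh, $doc, $want - length($doc), length($doc));
        die "truncated document\n" unless $got;
    }

    my $end = readline($fh);
    die "bad terminator\n"
        unless defined $end && $end eq "--end--\n";

    return $doc;
}
```

Because the byte-count says exactly where the document stops, the terminator check catches a sender and receiver that have fallen out of step before the next packet is misread.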

SEE ALSO

Alvis Task T3.2 - Metadata Format for Enriched Documents. Milestone M3.2 - Month 12 (December 2004). Includes a useful overview of the Alvis processing pipeline. http://www.miketaylor.org.uk/alvis/t3-2/m3-2.html

AUTHOR

Mike Taylor, <mike@indexdata.com>

COPYRIGHT AND LICENSE

Copyright (C) 2005 by Index Data ApS.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.