The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Sub::Slice::Manual - user guide for Sub::Slice

USING Sub::Slice

Sub::Slice is a way of breaking down a long-running process and maintaining state across a stateless protocol. This allows the client to draw a progress bar or abort the process part-way through.

The mechanism used by Sub::Slice is similar to the session management used on many web user authentication systems. However rather than simply passing an ID back as a token as these systems do, in Sub::Slice a data structure with richer information is passed to the client, allowing the client to make some intelligent decisions rather than blindly maintain state.

Overview

Use of Sub::Slice is best explained with a minimal example. Assume that there is a remoting protocol between the client and server such as XML/HTTP. For the sake of brevity, assume that methods called in package Server:: on the client are magically remoted to the server.

The server does two things. The first is to issue a token for the client to use:

        #Server
        sub create_token {
                my $job = new Sub::Slice();
                return $job->token;
        }

The second is to provide the routine into which the token is passed for each iteration:

        sub do_work {
                my $token = shift;
                my $job = new Sub::Slice(token => $token);

                at_start $job sub {
                        my $files = files_to_process();

                        #Store some data defining the work to do
                        $job->store("files", $files); 
                };

                at_stage $job "each_iteration" sub {

                        #Get some work
                        my $files = $job->fetch("files"); 
                        my $file = shift @$files;

                        my $was_ok = process_file($file);

                        #Record we did the work
                        $job->store("files", $files);

                        #Check if there's any more work left to do 
                        $job->done() unless(@$files); 

                };
        }

The client somehow gets a token back from the server. It then passes this back to the server for each iteration. It can inspect the token to check if there is any more work left.

        #Client
        my $token = Server::create_token();
        for(1 .. MAX_ITERATIONS) {
                Server::do_work($token);
                last if $token->{done};
        }

Named stages

You can create any number of named stages. This is useful if you need to complete several phases of iterative processing.

        sub do_work {
                my $token = shift;
                my $job = new Sub::Slice(token => $token);

                at_start $job sub {
                        my $files = files_to_process();
                        $job->store("files", $files);
                        $job->store("position", 0);
                        $job->next_stage("check_files"); #explicitly start with this stage
                };

                at_stage $job "check_files", sub {
                        my $files = $job->fetch("files");
                        my $position = $job->fetch("position");
                        my $file = $files->[$position];
                        check_file($file) or $job->abort("check failed for $file");
                        $job->store("position", $position+1);

                        #Decide when to move to next stage
                        $job->next_stage("publish_files") if($position == scalar @$files);
                };

                at_stage $job "publish_files" sub {
                        my $files = $job->fetch("files");
                        my $file = shift @$files;               
                        my $was_ok = process_file($file);
                        $job->store("files", $files);
                        $job->done() unless(@$files); 
                };              
        }

If next_stage is NOT called, Sub::Slice begins iterations using the FIRST at_stage block within the routine. In the example above, we could have omitted the next_stage call from at_start and the "check_files" sub would have still been called first.

Return values

Return values from the at_* methods are accessible via $job->return_value(). This allows you to capture the return value from the coderef and do something with it (such as store it into $job with set_data or return it to the caller):

        sub do_work {
                my $token = shift;
                my $job = new Sub::Slice(token => $token);

                at_start $job sub {
                        my $files = files_to_process();
                        $job->store("files", $files);
                        $job->store("position", 0);

                        #This won't return from do_work, only from the at_* coderef
                        return scalar(@$files) . " files to process";
                };

                at_stage $job "check_files", sub {
                        my $files = $job->fetch("files");
                        my $position = $job->fetch("position");
                        my $file = $files->[$position];
                        check_file($file) or $job->abort("check failed for $file");
                        $job->store("position", $position+1);
                        return "processed $file";
                };

                #Get the return value from the at_* coderef
                return $job->return_value; 
        }

Aborting a job

If the there is a problem, the server may abort the Sub::Slice iteration. Extending the example above:

        my $was_ok = process_file($file);
        $job->abort("Unable to process $file") unless($was_ok);

The client can detect this by checking the abort property of the token:

        for(1 .. MAX_ITERATIONS) {
                Server::do_work($token);
                if($token->{done} || $token->{abort}) {
                        warn "Remote error: " .  $token->{error} 
                                if($token->{abort});
                        last;
                }
        }

The error property is also set by the server calling abort().

Premature termination from the client

The client can stop the iteration by setting the done property of the token to a true value:

        for(1 .. MAX_ITERATIONS) {
                Server::do_work($token);
                $token->{done} = 1 if( user_pressed_cancel() );
                last if($token->{done} || $token->{abort});
        }

Handling transport errors

If there is a transport error occurs during one of the client-server exchanges during a sub-sliced job, the client will know there has been an error from it's transport client interface. It WONT know what state the server is in - the server may or may not have successfully completed the last batch of iterations.

The client should therefore simply represent the token to the server (which won't have been updated to reflect the work that may or may not have been done) and let the server worry about how to interpret this. For example if the server could reprocess the units of work if they are re-runnable, or if could elect to skip them if it knows they have already been done.

Of course if the tranport keeps failing, the client is entitled to give up entirely on job or save it for retrying later. If the client does just give up on the job, the server should have a cleaner process in place to mop up any dangling sub-slice jobs that were never finished off by the client (see "Using a belt-and-braces cleaner process").

Tuning the iterations per call from the client

You can use the iterations property of the token to tell the server the maxmimum number of iterations it should perform on the next call. Beware that zero is interpreted as an infinite number of iterations.

        use Time::HiRes;
        use constant ACCEPTABLE_WAIT => 5; #Can cope with 5 sec between updates on client

        while(1) {
                my $i1 = $token->{count};
                my $t1 = time();

                Server::do_work($token);

                my $di = $token->{count} - $i1;
                my $dt = time() - $t;
                my $iterations_per_sec = $di/$dt;

                #Calculate the optimal number of iterations to perform per call
                $token->{iterations} = int($iterations_per_sec * ACCEPTABLE_WAIT) || 1;

                last if($token->{done} || $token->{abort});
        }

Using at_start and at_end

The at_start and at_end blocks are run, respectively, before the first iteration, and after the last. They allow you to perform initialisation and a "commit" once all the at_stage steps have completed successfully (such as sending a set of generated files to a sink):

        sub do_work {
                my $token = shift;
                my $job = new Sub::Slice(token => $token);

                at_start $job sub {
                        find_files($job);                       
                };

                at_stage "each_iteration" sub {
                        work_through_list($job);
                }

                at_end $job sub {
                        post_files($job);
                };
        }

WARNING: You might consider using at_start and at_end to allocate and deallocate resources used for the job (an example would be using a lock file to ensure that jobs are run serially), but at_end will not be run if a job dies in an at_stage coderef. This means that resources allocated in at_start could leak if a job is aborted by something dying in an at_stage coderef. You can code defensively around this, by trapping any exceptions and deallocating resources if the job has been aborted:

        sub do_work {
                my $token = shift;
                my $job = new Sub::Slice(token => $token);

                #Trap any exceptions
                eval {
                        at_start $job sub {
                                my $id = create_lock_file();
                                $job->store(lockfile_id => $id);
                        };
        
                        #NB when something dies, Sub::Slice sets $job->abort($@) and rethrows the exception
                        at_stage "each_iteration" sub {
                                #Either your code or Sub::Slice::Backend::* might raise an exception
                                die("an exception") unless went_smoothly($job);
                        }; 
                };
                my $error = $@; #save away $@

                # Cleanup if job has succeeded or failed
                #  we could have used an at_end block for the successful case, 
                #  but we'd need still need to check for the job being aborted here 
                if($job->token->{done} || $job->token->{abort}) {
                        my $id = $job->fetch(lockfile_id);
                        release_lock_file($id);
                };

                die($error) if($error); #Rethrow exceptions
        }

Storing BLOB data

If you need to store large amounts of data against a key, you can use the store_blob method rather than the store method.

        $job->store('key1' => $wee_thing);
        $job->store_blob('key2' => $huge_thing);

To get this back on a subsequent iteration, use the fetch_blob method

        $wee_thing = $job->fetch('key1');
        $huge_thing = $job->fetch_blob('key2');

Alternatively you can let Sub::Slice take care of when to use BLOB storage by giving it a threshold size:

        my $job = new Sub::Slice(token => $token, 'auto_blob_threshold' => 4096); #Store anything >4K as a BLOB
        $job->store('key1' => $wee_thing);
        $job->store('key2' => $huge_thing); #this will be stored as a blob if bigger than 4K
        $wee_thing = $job->fetch('key1');
        $huge_thing = $job->fetch('key2');

Note that currently auto_blob_threshold only applies to scalars containing characters or bytes (not references).

It's up to the backend exactly what it does with BLOB data. The different store/fetch methods give the backend the opportunity to use a more efficient storage strategy for large chunks of BLOB data. For example, in the default Filesystem backend, normal key-value data is serialised into a storable file for each Sub::Slice job whereas BLOB data is held outside in separate files.

Using a belt-and-braces cleaner process

Although Sub::Slice cleans up jobs that are finished, the data from jobs never completed will persist. In real life, this kind of "should never happen" error has a habit of happening occasionally so it's advisable to use a cleaner process to clear up any ancient Sub::Slice junk:

        # clean up jobs from the default path, after the default period
        perl -MSub::Slice::Backend::Filesystem -we "Sub::Slice::Backend::Filesystem->new()->cleanup"

This will delete anything over the default age threshold from the default subslice path. You can override defaults:

        # clean up jobs from /var/tmp/sslice, after 20 days
        Sub::Slice::Backend::Filesystem->new(path => '/var/tmp/sslice')->cleanup(20)

Alternatively, you can clean up with a couple of find(1) commands (if under UNIX):

        find /var/tmp/sub_slice -type f -mtime +1 -exec rm {} \;
        find /var/tmp/sub_slice -type d -mtime +1 -empty -exec rmdir {} \;

This is basically what the cleanup call does, plus some additional error checking.

EXAMPLES

Simple HTTP remoting

This is a "roll your own remoting protocol" example to show how Sub::Slice iteracts with some real remoting code.

        ##################################################################
        # Client
        ##################################################################

        use XML::Simple;
        use HTTP::Request::Common qw(POST);
        use LWP::UserAgent;
        use constant MAX_ITERATIONS => 100;

        my $token = create_token();
        my $i = 0;
        while (!$token->{done}) { 
                die "Long loop, presumed infinite" if $i++ > MAX_ITERATIONS;
                do_work($token);
                last if $token->{done};
        }

        # Proxy methods for remoting
        sub create_token {
                _call_method("create_token");
        }

        sub do_work {
                _call_method("do_work", shift());
        }

        # Primative HTTP remoting
        sub _call_method {
                my ($name, $arg) = @_;
                my $xml = (defined $arg) ? XMLout($arg) : '<opt/>'; #Serialise data as XML
                my $ua = LWP::UserAgent->new;
                my $req = POST 'http://myserver/cgi-bin/server.pl', 
                        [ method => $name, args => $xml ];      
                my $res = $ua->request($req);
                die($res->message) unless($res->is_success);
                $xml = $res->content();
                %$arg = %{XMLin($xml)};
                $arg;
        }

        ##################################################################
        # Server - server.pl (installed into cgi-bin on myserver)
        ##################################################################

        use CGI;
        use XML::Simple;
        use Sub::Slice;
        use constant ALLOWED_METHOD => 
                { map {$_ => 1} qw(create_token do_work) };

        # Primitive HTTP server dispatcher
        my $method = CGI::param('method');
        die("Unrecognised method ($method) called") 
                unless ALLOWED_METHOD->{$method};
        my $args = XMLin( CGI::param('args')); #Deserialise data
        my $out = $method->($args);
        $out = XMLout ({ %$out }); # won't serialize a blessed class
        print "Content-type: text/xml\n\n";
        print $out;

        sub create_token {
                my $job = new Sub::Slice();
                return $job->token;
        }

        sub do_work { #Mindlessly simple example
                my $token = shift;
                my $job = new Sub::Slice (token => $token);

                at_start $job sub {
                        $job->store("steps_to_take", 50);
                        $job->store("position", 0);
                };

                at_stage $job "check_files", sub {
                        my $position = $job->fetch("position");
                        $job->store("position", $position+1);
                        $job->done() if($position >= $job->fetch("steps_to_take"));
                };
                $job->token;
        }

Note that we return the job token from &do_work, so that it gets updated on the client. If we didn't do this, we would keep resetting position to 0, and we would loop forever (until we hit MAX_ITERATIONS).

Using Sub::Slice with SOAP::Lite

Rolling your own remoting protocol may be fun, but it's often more sensible to use a standard one such as XML/RPC or SOAP. Here's an example SOAP server using Sub::Slice with SOAP::Lite.

        ##################################################################
        # SOAP Server - soapserver.pl (installed into cgi-bin on myserver)
        ##################################################################
        use strict;
        use SOAP::Transport::HTTP;
        SOAP::Transport::HTTP::CGI
                -> dispatch_to('My::SoapServer')
                -> handle;

        package My::SoapServer;

        sub create_token {
                my $class = shift;
                my @args = @_;

                my $job = new Sub::Slice();
                at_start $job sub {
                        $job->store(args => @args);
                        $job->store("steps_to_take", scalar @args);
                        $job->store("position", 0);
                };

                return $job->token;
        }

        sub do_work {
                my $class = shift;
                my ($token) = @_;

                my $job = new Sub::Slice( token => $token);
                at_stage $job "check_files" sub {
                        my $position = $job->fetch("position");
                        $job->store("position", $position+1);
                        $job->done() if($position >= $job->fetch("steps_to_take"));
                };
                return $job->token;
        }

Here's its client-side counterpart. The first few lines are where we define our SOAP application and process the request:

        ##################################################################
        # SOAP Client
        ##################################################################
        use strict;
        use SOAP::Lite;
        my $soap = SOAP::Lite
                ->uri("http://myserver.org/My/SoapServer")
                ->proxy("http://myserver/cgi-bin/soapserver.pl");
        my @filenames = qw(badger.xhtml vole.xhtml dormouse.xhtml);
        my ($token) = $soap->create_token(@filenames)->result;
        while (!$token->{done} && !$token->{abort}) {
                ($token) = $soap->do_work($token)->result;
        }

It starts by defining the $soap object which will connect to our application at the proxy url. The methods called on $soap are transparently proxied over HTTP to our application. The proxied methods don't return the values from the remote method, but rather a SOAP message object. To access the return values from the remote method we call paramsall() on the SOAP message object. And finally, our example retrieves an updated copy of the $token after each remote call.

See SOAP::Lite for more information. In particular, the sections on AutoBinding and AutoDispatching may give you some ideas about how to improve on keeping the token up-to-date in the client.

VERSION

$Revision: 1.16 $ on $Date: 2004/12/17 16:31:27 $ by $Author: tims $

AUTHOR

John Alden <cpan _at_ bbc _dot_ co _dot_ uk>

COPYRIGHT

(c) BBC 2004. This program is free software; you can redistribute it and/or modify it under the GNU GPL.

See the file COPYING in this distribution, or http://www.gnu.org/licenses/gpl.txt