
NAME

WWW::Meta::XML::Browser - Perl module to simulate a browser session described in an XML file

SYNOPSIS

  use WWW::Meta::XML::Browser;

  my $session = WWW::Meta::XML::Browser->new();
  $session->process_file('file.xml');
  $session->process_all_request_nodes();
  $session->print_all_request_results();

ABSTRACT

This module reads an XML file from a given source and makes the HTTP requests defined in it. The result of each request can be filtered using an XSL stylesheet. Subsequent requests can be built from the results of that transformation.

DESCRIPTION

WRITING A SESSION DESCRIPTION FILE

The most important part of working with WWW::Meta::XML::Browser is writing a session description file. Such a file describes which HTTP requests are made and how their results are handled.

The session description file is a simple XML file. The root element is <www-meta-xml-browser> and the DTD can be found at http://www.boksa.de/pub/xml/dtd/www-meta-xml-browser_v0.08.dtd, which leads us to the following construct:

  <?xml version="1.0" ?>
  <!DOCTYPE www-meta-xml-browser SYSTEM "http://www.boksa.de/pub/xml/dtd/www-meta-xml-browser_v0.08.dtd">
  <www-meta-xml-browser>
  <!-- ... -->
  </www-meta-xml-browser>

The optional meta element can be specified as a child of the root element. It acts as a container for information about how the request elements are handled.

META-PERL INFORMATION

The perl element is a child of the meta element and can contain Perl-related information. The perl element can have one of the child elements described below.

ELEMENT: callback; ATTRIBUTES: name

The callback element defines an anonymous subroutine which can later be used as a callback. The name under which the callback can be accessed is given by the required name attribute. The form of the callback (parameters, return value) depends on how it is used later. An example of a (not very useful :-)) result-callback is the following:

  <callback name="some-callback"><![CDATA[
  sub {
    my ($result) = @_;

    return $result;
  }
  ]]></callback>
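
How the module turns this text into a subroutine is not described here. A minimal sketch, assuming the content of the callback element is compiled with a string eval; the helper compile_callback is hypothetical and not part of the module:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of WWW::Meta::XML::Browser:
# compiles the text of a callback element into a code reference.
sub compile_callback {
    my ($source) = @_;
    my $code = eval $source;    # string eval of "sub { ... }"
    die "callback did not compile: $@" if $@;
    die "callback is not a subroutine\n" unless ref $code eq 'CODE';
    return $code;
}

# The CDATA content of the example callback element above.
my $source = <<'END_CALLBACK';
sub {
  my ($result) = @_;

  return $result;
}
END_CALLBACK

my $callback = compile_callback($source);
print $callback->('<html></html>'), "\n";
```

If callbacks are compiled along these lines, a session description file can run arbitrary Perl code, so it should only be loaded from a trusted source.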

REQUEST DEFINITIONS

A session description file must contain at least one request.

DEFINING A REQUEST WITHOUT CONTENT

Under the root element we add one element for each request we want to make. A fairly complete request looks like the following:

  <request url="http://www.google.de/" method="get" stylesheet="google-index.xsl" result-callback="some-callback">
  </request>

The only required attribute of the request element is url; all other attributes can be left out.

If method is left out the default method get will be used.

If stylesheet is left out, the raw HTML is transformed into a valid XML document, which is then stored as the result of that request.

The result-callback attribute lets the user change the raw HTML before it is transformed into an XML document, by calling the specified callback. This callback can be an entry of the callbacks hash specified when the instance is created, or a callback defined in the XML file (see "ELEMENT: callback; ATTRIBUTES: name"). If a callback of the same name is specified in both the callbacks hash and the XML file, the one from the hash is used. A result callback is called with the raw HTML as its only parameter and must return a valid HTML document.
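
The precedence between the two sources of callbacks can be sketched as follows. The helper find_callback and both hashes are hypothetical illustrations and not the module's internals:

```perl
use strict;
use warnings;

# Hypothetical lookup: callbacks passed to new() take precedence over
# callbacks defined in the session description file.
sub find_callback {
    my ($name, $from_hash, $from_xml) = @_;
    return $from_hash->{$name} if exists $from_hash->{$name};
    return $from_xml->{$name}  if exists $from_xml->{$name};
    return;    # unknown callback name
}

my %from_hash = ('some-callback' => sub { return "hash: $_[0]" });
my %from_xml  = (
    'some-callback'  => sub { return "xml: $_[0]" },
    'other-callback' => sub { return "xml: $_[0]" },
);

print find_callback('some-callback',  \%from_hash, \%from_xml)->('<html/>'), "\n";
print find_callback('other-callback', \%from_hash, \%from_xml)->('<html/>'), "\n";
```

The first call uses the callback from the hash, the second falls back to the one that only exists in the XML file.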

DEFINING A REQUEST WITH CONTENT

The request element has an optional child element, content, which specifies the content of the request. It is used as a child of the request element as follows (remember that & has to be written as &amp; in XML):

  <request url="http://www.google.de/search" method="get">
    <content>q=42&amp;ie=ISO-8859-1&amp;hl=de&amp;meta=</content>
  </request>

This example shows that the content will be sent using the specified method (get in this case) to the url of the request (http://www.google.de/search).

EMBEDDED REQUESTS

Embedded requests can be used to fetch pages from a result page. They can be created in the XSL stylesheet to dynamically parse a tree of pages.

As soon as a www-meta-xml-browser-request-element is created in the XSL stylesheet it is processed like a normal request-element and the result is inserted.

If the result consists of multiple pages, the container attribute has to be specified; it is used as the new root for the merged (optionally transformed) pages.

REPLACEMENT EXPRESSIONS IN A SESSION DESCRIPTION FILE

In some cases static URLs and static content are not sufficient for what has to be done.

For these cases WWW::Meta::XML::Browser offers an easy way to use arguments passed to the instance during creation, or values from a previous result.

To access arguments passed to the instance during creation the following simple syntax is used:

  #{args:key}

The word key has to be replaced with a key of the hash containing the arguments. The whole expression #{args:key} is then replaced with the corresponding value from the hash.
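
The substitution itself can be pictured with a small sketch. The helpers replace_args and lookup_arg are hypothetical, and dying on an unknown key is a choice of this sketch, not documented behaviour of the module:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the argument for $key or dies.
sub lookup_arg {
    my ($args, $key) = @_;
    die "no argument named '$key'\n" unless exists $args->{$key};
    return $args->{$key};
}

# Hypothetical helper: replaces every #{args:key} token in $text
# with the matching value from the args hash.
sub replace_args {
    my ($text, $args) = @_;
    $text =~ s/#\{args:([^}]+)\}/lookup_arg($args, $1)/ge;
    return $text;
}

my %args = (query => 'perl', lang => 'de');
print replace_args('q=#{args:query}&hl=#{args:lang}', \%args), "\n";
# prints: q=perl&hl=de
```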

Accessing previous results works in much the same way; as the following examples show, it offers even more possibilities:

  #{0:0:/foo}
  #{4:1-3:/foo/too}
  #{1::/foo/@argument}
  #escape{0:0:/foo}
  #escape{4:1-3:/foo/too}
  #escape{1::/foo/@argument}

The first three examples and the last three examples differ only in the word escape. This command simply tells the module to url-escape the value that is returned by the later part of the expression.
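
URL-escaping here presumably means percent-encoding. A minimal sketch under that assumption; the helper url_escape is hypothetical, operates on byte strings, and the module may well rely on an existing library such as URI::Escape instead:

```perl
use strict;
use warnings;

# Hypothetical helper: percent-encodes every byte that is not an
# unreserved URL character (letters, digits, "-", "_", ".", "~").
sub url_escape {
    my ($value) = @_;
    $value =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord($1))/ge;
    return $value;
}

print url_escape('foo bar&baz=1'), "\n";    # prints: foo%20bar%26baz%3D1
```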

Let's look at these expressions in detail:

The first part (the number before the first colon) specifies the index (starting with 0) of the request which we want to access. This index can be mapped directly to the session description file.

The second part (between the first and the second colon) specifies which subrequest results are looked at (subrequests are explained below). The 0 in the first example selects the first subrequest. The 1-3 in the second example selects the second, third and fourth subrequests (remember, indexing starts at 0). The third example, with an empty second part, accesses all subrequests.

The last part (after the second colon) specifies an XPath expression, which is looked up in each of the selected subrequest results; a list of all values which match the expression is generated.
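
The three parts can be taken apart with a regular expression. This is only a sketch of the syntax described above: parse_expression and the returned hash layout are hypothetical and do not reflect how the module parses expressions internally:

```perl
use strict;
use warnings;

# Hypothetical parser for expressions such as #{4:1-3:/foo/too}.
# Returns the request index, the list of subrequest indices (undef
# meaning "all subrequests"), the XPath expression and the escape flag.
sub parse_expression {
    my ($expr) = @_;
    my ($escape, $request, $range, $xpath) =
        $expr =~ /^#(escape)?\{(\d+):((?:\d+(?:-\d+)?)?):(.+)\}$/
        or die "not a result expression: $expr\n";

    my $indices;
    if ($range =~ /^(\d+)-(\d+)$/) {
        $indices = [ $1 .. $2 ];    # e.g. 1-3 becomes 1, 2, 3
    }
    elsif (length $range) {
        $indices = [ $range ];      # a single subrequest
    }

    return {
        request => $request,
        indices => $indices,
        xpath   => $xpath,
        escape  => defined $escape ? 1 : 0,
    };
}

for my $expr ('#{0:0:/foo}', '#{4:1-3:/foo/too}', '#escape{1::/foo/@argument}') {
    my $parsed = parse_expression($expr);
    my $subs = $parsed->{indices} ? join(',', @{ $parsed->{indices} }) : 'all';
    print "request $parsed->{request}, subrequests $subs, xpath $parsed->{xpath}\n";
}
```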

Each value of this list replaces the whole replacement expression in turn, and one HTTP request is made for each replacement.

Naturally, if the url or the content contains more than one replacement expression, all possible combinations are requested; the number of requests is the product of the numbers of values matched by the individual expressions.
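
The way the combinations multiply can be illustrated with a short sketch. The helper expand_combinations, the example.com URL and the value lists are made up for illustration; the module's own expansion (see parse_string()) may be implemented differently:

```perl
use strict;
use warnings;

# Hypothetical helper: takes a template split into literal strings and
# array references of candidate values, and builds every combination.
sub expand_combinations {
    my (@parts) = @_;
    my @results = ('');
    for my $part (@parts) {
        my @values = ref $part ? @$part : ($part);
        my @next;
        for my $prefix (@results) {
            push @next, $prefix . $_ for @values;
        }
        @results = @next;
    }
    return @results;
}

# Two values for the first expression, three for the second.
my @urls = expand_combinations(
    'http://example.com/', [ 'a', 'b' ], '/page/', [ 1, 2, 3 ],
);
print scalar(@urls), " requests\n";    # prints: 6 requests
print "$_\n" for @urls;
```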

These different HTTP requests make up the subrequests, which are stored and can be accessed when needed. Please note that subrequests can be merged into a single subrequest result using merge_subrequests().

CREATING A NEW BROWSER OBJECT

To create a new browser object, the new() method is called with an optional hash of options.

  $browser = WWW::Meta::XML::Browser->new(%options);

The following options are possible:

  args => \%args

\%args is a reference to a hash whose values can be accessed from the session description file by their keys. The syntax to access the hash values from the session file is #{args:key}, where key is a key from the hash.

  debug => 1

When the debug option is set, the module produces a lot of debug output about execution times.

  debug_callback => \&debug

\&debug has to be a reference to a subroutine taking two parameters. The first parameter is a number >= 0 which describes the logging level. The second parameter is the message to be printed. Please note that there is a default routine _debug().
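
A debug callback with this signature might look like the following sketch. The names format_debug and my_debug and the output format are this example's own choices; the constructor call is shown only as a comment so that the sketch runs without the module installed:

```perl
use strict;
use warnings;

# Builds the line to print; indenting by level is this sketch's choice.
sub format_debug {
    my ($level, $message) = @_;
    return ('  ' x $level) . "[level $level] $message\n";
}

# Debug callback taking the logging level and the message.
sub my_debug {
    my ($level, $message) = @_;
    print STDERR format_debug($level, $message);
    return;
}

# It would be handed to the constructor like this:
#   my $browser = WWW::Meta::XML::Browser->new(
#       debug          => 1,
#       debug_callback => \&my_debug,
#   );

my_debug(0, 'processing request');
my_debug(1, 'request finished');
```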

  result_doc_callback => \&result

\&result has to be a reference to a subroutine taking one parameter. The parameter is an instance of XML::LibXML::Document and can be manipulated. The subroutine must return an instance of XML::LibXML::Document. Please note that there is a default routine _result().

  callbacks => \%callbacks

\%callbacks is a reference to a hash of references to subroutines. These subroutines can be used in various situations during the processing of the XML file.

PROCESSING A SESSION DESCRIPTION FILE

To read the session description file one of the following methods is called, depending on the source of the file.

  $browser->process_file($file);
                 -or-
  $browser->process_url($url);
                 -or-
  $browser->process_string($string);
                 -or-
  $browser->process_xml_doc($doc);

The names of the methods should be self-explanatory:

process_file() is called when the session description file is on a local disk and read by the script directly (this should be the most common case).

process_url() is called when the session description file is accessed by an HTTP request.

process_string() is called when the session description data is available in a scalar variable.

process_xml_doc() is called when the XML document has already been parsed (as is done by the three methods above) and we have an instance of XML::LibXML::Document.

PROCESSING THE REQUESTS FROM THE SESSION DESCRIPTION FILE

After the session description file has been processed as shown above, the request nodes contained in the XML document can be processed.

  $browser->process_all_request_nodes();
                   -or-
  while (my $r_node = $browser->get_next_request_node()) {
    $subrequest_result = $browser->process_request_node($r_node);
  }

process_all_request_nodes() encapsulates the second construction with the while loop. Both constructions execute all HTTP requests generated from the session description file and store the results of the (optionally transformed) requests.

ACCESSING THE RESULTS

The result of a specific request can be accessed with a simple call which returns an instance of XML::LibXML::Document.

  $result = $browser->get_request_result($request_index, $subrequest_index);

To access the results one has to understand how results are stored. The results are stored in a two-dimensional array.

The first index (which starts with 0 for the first request) describes the request which can be found in the session description file.

The second index identifies the subrequest, i.e. the position among all requests generated from that request node after the replacements in the url or content have been expanded.

For example $browser->get_request_result(0, 2) returns the result of the third request generated from the first request node in the session description file.
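
The two-dimensional layout can be pictured as an array of arrays. In this sketch the store and the helper get_result are hypothetical stand-ins, and plain strings take the place of XML::LibXML::Document instances:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the result storage: one inner array per
# request node, holding one entry per generated subrequest.
my @results = (
    [ 'request 0, subrequest 0', 'request 0, subrequest 1', 'request 0, subrequest 2' ],
    [ 'request 1, subrequest 0' ],
);

# Hypothetical accessor mirroring the argument order of get_request_result().
sub get_result {
    my ($store, $request_index, $subrequest_index) = @_;
    return $store->[$request_index][$subrequest_index];
}

print get_result(\@results, 0, 2), "\n";    # prints: request 0, subrequest 2
```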

EXPORT

None by default.

METHODS

The following methods are available:

$browser = WWW::Meta::XML::Browser->new(%options);

This class method constructs a new WWW::Meta::XML::Browser object and returns a reference to it.

The hash %options can be used to control the behaviour of the module and to provide some data to it as well. At the moment the following Key/Value pairs are supported:

  KEY:                    VALUE:          DESCRIPTION:
  ---------------         -----------     -------------
  args                    \%args          a reference to a hash of arguments which can be used in
                                          requests
  debug                   0/1             a boolean value which controls whether debugging
                                          information is printed
  debug_callback          \&debug         a reference to a debug-callback
  result_doc_callback     \&result        a reference to a result-doc-callback
  callbacks               \%callbacks     a reference to a hash of subroutines which can be used as
                                          callbacks in different situations

process_url($url);

Reads the XML file containing the session description from the specified url and constructs an XML document from it, which is then passed to process_xml_doc().

process_file($file);

Reads the XML file containing the session description and constructs an XML document from it, which is then passed to process_xml_doc().

process_string($string);

Constructs an XML document from the given string, which is then passed to process_xml_doc().

process_xml_doc($doc);

Takes the given XML document and reads the request nodes in it. These request nodes are stored internally to be processed.

$node = get_next_request_node();

Returns the next request node, which can then be processed using process_request_node().

process_all_request_nodes();

Iterates over all request nodes and processes each of them.

$subrequest_result = process_request_node($r_node);

Processes the request node. This subroutine does the actual work: it generates all permutations of the url, all permutations of the content, and all combinations of url and content; it then makes the requests, processes the results and returns the (optionally transformed) results.

@processed_content = process_content_nodeset($c_nodeset);

Processes a content nodeset and generates all possible permutations by replacing the tokens.

make_request($url, $method, $content);

Makes a request to $url sending the $content using $method and returns the result. If a username and a password have been specified within the url, they will be used for HTTP Basic authentication if necessary.
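
As an illustration of what credentials within the url amount to, the sketch below extracts them and builds the value of an HTTP Basic Authorization header. The helper basic_auth_from_url and the example url are hypothetical; the module most likely leaves this to its HTTP library rather than doing it by hand:

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

# Hypothetical helper: extracts "user:password@" from a URL and builds
# the value of an HTTP Basic "Authorization" header. Returns nothing
# if the URL carries no credentials.
sub basic_auth_from_url {
    my ($url) = @_;
    my ($user, $password) = $url =~ m{^https?://([^:/\@]+):([^/\@]*)\@}
        or return;
    # The empty second argument suppresses the trailing newline.
    return 'Basic ' . encode_base64("$user:$password", '');
}

my $header = basic_auth_from_url('http://user:secret@example.com/private/');
print "Authorization: $header\n" if defined $header;
```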

$doc = process_result_doc($res, $stylesheet);

Processes the result ($res) as returned by make_request() by transforming it into an XML document. Internally process_result() is called with $res->content() and $stylesheet.

$doc = process_result($result, $stylesheet);

Processes the result string ($result) by transforming it into an XML document. If an XSL stylesheet ($stylesheet) has been specified for the given request, the XML document is transformed using that stylesheet. The resulting XML document is then returned.

$xml_string = process_embedded_request($embedded_request_node);

Processes an embedded request node by processing it as a normal node (using process_request_node()). If the embedded request node returns only one XML document, it is transformed to a string and returned. If it returns more than one XML document, the documents are merged under the name specified by the $EMBEDDED_REQUEST_CONTAINER_ATTRIBUTE attribute of the embedded request node.

$result = get_request_result($request_index, $subrequest_index);

Returns the request-result specified by $request_index and $subrequest_index.

print_all_request_results();

Iterates over all the request results and prints them.

Prints the specified request result.

merge_subrequests($request_index, $wrapper_name);

Merges the subrequests of the request (specified by $request_index) into a new XML document which consists of a new root element ($wrapper_name) with all the subrequest results as children of this root element.

merge_xml_array($array, $wrapper_name)

Merges the XML documents in @{$array} by building a new XML document with a new root element ($wrapper_name) and the XML documents in @{$array} as children of the root element.

parse_string($s, $r);

Recursively parses the string passed as $s and writes the replacement results to @{$r}, which will be an array containing all possible permutations, created by the replacement of the specified tokens.

$callback = _read_callback($result_callback);

Reads the callback from the callbacks hash or from the XML file and returns a reference to it. If the callback can not be found 'undef' is returned.

_debug($l, $msg);

Default debug-callback. Prints $msg as a debugging message to STDERR. $l gives information about the logging level.

$doc = _result($doc);

Default result-doc-callback. Just returns $doc as it was passed to the subroutine.

SEE ALSO

The DTD for the session description files can be found at: http://www.boksa.de/pub/xml/dtd/www-meta-xml-browser_v0.08.dtd

Documentation and a HOWTO can be found at: http://www.boksa.de/perl/modules/www-meta-xml-browser/

AUTHOR

Benjamin Boksa, <benjamin@boksa.de>

COPYRIGHT AND LICENSE

Copyright 2003 by Benjamin Boksa

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.