
NAME

LoadWorm - WebSite Stress and Validation Tool

DESCRIPTION

The LoadWorm is a tool to load a website with requests, and to record the resultant performance, from a web client's perspective. It can also be used for various investigative purposes, such as validation of the website, or discovering all the referrers to a page, etc.

It consists of two main parts -

  • LoadWorm - traverses a website, pushing all the buttons, and entering data according to specific input instructions. It will ignore specified URLs, limit the number of times a single page is visited, and limit the depth of the entire search. The amount of processing required to perform all these tricks makes it too slow to act directly as a web-loading tool, so the LoadMaster/Slave was invented to handle that job.

  • LoadMaster/Slave - Takes a list of URLs (such as that produced by the LoadWorm), and directs several "slaves", usually on separate host computers, to make hits on these URLs at a tunable rate. The "slaves" collect data on the response times (and successes/failures), which can be harvested and analyzed by the LoadMaster.

The LoadWorm's operation is controlled by a configuration file. The LoadMaster/Slave reads the same configuration file for some of its configurable settings (proxy, verbosity, etc.), but is controlled mainly through a Tk-based GUI.

The LoadWorm and LoadMaster/Slave work on Windows NT and Unix (tested on Solaris and Linux), or any combination of these systems.

WEBSITE TRAVERSAL

The LoadWorm takes one or more URLs as input (specified in its configuration file, loadworm.cfg).

  • It follows all links, down to a configurable depth. You may specify a different depth limit for different branches of the website(s).

  • Ignores specific links, as specified by matching the URL to regular expressions in the configuration file.

  • Generates INPUT data for FORMs, and traverses every possible 'SELECT' option and 'SUBMIT' button (filterable by the 'ignore' statements in the configuration file). (In version 1.0, only non-multiple SELECT elements are supported.)

  • The user may specify lists of values for each INPUT field of any of the FORMs. (In version 1.0, only text-type INPUT fields are supported.)

  • A check for a valid response can be customized for each URL (selected by a regular expression). The validation routine can be written by the user, in Perl, and is automatically embedded into the process.

  • The results of a LoadWorm session are recorded in a Perl accessible database, including a list of all links (child to parent(s)), all errors encountered, all links that were ignored, all images that were downloaded, and the timings for every download. The user's validation routine may also write to any of these tables.

  • A separate program (LoadMaster/Slave) is a high intensity web loader that will take the route charted by the LoadWorm and repeat the whole route, performing the request, response and validation steps without the overhead of the route calculations inherent in the LoadWorm's configuration.

  • For known bugs and limitations, see NOTES, at the bottom of this document.

WEBSITE LOADING

Website loading is performed by the LoadMaster program. The LoadMaster runs on a master computer. One or more LoadSlaves may run on the same computer, or on different computers on the same network. The operator can control all LoadSlaves from the LoadMaster: starting them, pausing them, and tuning the loading rate (e.g., total hits per second).

Which URLs are actually loaded by the LoadSlaves is specified in a file named visits.txt. This is simply a list of fully specified URLs, with CGI parameters, such as the list generated by the LoadWorm. (The PUT method is not yet implemented here.)
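For illustration, a visits.txt might contain entries like the following (these particular CGI scripts and parameters are made up, not taken from a real session):

        http://webdev.savesmart.com/owa/categories.get?dept=10
        http://webdev.savesmart.com/owa/search.get?words=coffee
        http://webdev.savesmart.com/owa/cart.get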

The LoadMaster also reads some parameters from the same configuration file that serves the LoadWorm. It conveys these parameters to all the LoadSlaves, as well as transmitting to them the visits list.

Each LoadSlave can be configured with a simple rewrite mechanism to replace specified parameters in each URL with a value received from a previous response. Thus, if the website maintains its session state via a CGI parameter, each slave can log itself in as a separate session. This simple mechanism can be enhanced by modifying the Perl code.
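A minimal sketch of the idea in Perl is shown here; the parameter name session_id and the capture pattern are hypothetical, and the actual hook lives inside the LoadSlave's own code, which may look different:

        # $response_content and $url stand in for the body of the previous
        # response and the next URL to be requested; session_id is a made-up
        # CGI parameter name.
        my $session;
        if ( $response_content =~ /session_id=(\w+)/ ) {
                $session = $1;   # remember the value the server handed back
        }
        $url =~ s/session_id=[^&]*/session_id=$session/ if defined $session;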

Since it does not need to do any special calculations for laying down the route, the LoadSlave can perform its operations more quickly, and with less memory, than the LoadWorm. This makes it possible to run several slaves on the same host computer. Each LoadSlave must be started manually on each of the several hosts; this simplifies the security situation, as the LoadMaster does not need to directly control anyone else's computer. Give each LoadSlave the IP address of the LoadMaster when you start it. You can start the LoadMaster first, or all the LoadSlaves first, or in any combination.

Thus, on the master host computer, use the command:

  • perl loadmaster.pl

and on each slave computer, use the command:

  • perl loadslave.pl {IP_ADDR:port_number of LoadMaster}

The IP_ADDR and port number of the LoadMaster are displayed on the LoadMaster GUI when you start it up. The default port number of the LoadMaster is 9676 ("WORM" on a phone pad), but it may come up differently, especially if you're running two LoadMasters on the same computer.
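For example, if the LoadMaster GUI reports 192.168.1.10:9676 (an address made up here for illustration), each slave would be started with:

  • perl loadslave.pl 192.168.1.10:9676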

If the LoadMaster crashes, or is turned off, the LoadSlaves will wait patiently for it to come back up, and each will reconnect when it does. To finish a test, you can terminate all the LoadSlaves from the LoadMaster GUI, then terminate the LoadMaster. The owners of the host computers you've borrowed for the load test might want to terminate the test on their computer. They can do that by closing the LoadSlave on their computer, with no ill effect on your test except for the lost data and load.

THE CONFIGURATION FILE

The operation of the LoadWorm is controlled by its configuration file. This file is named loadworm.cfg, and is found in the current working directory. It is structured like a profile.ini file, with [section] headers specifying separate sections, and with parameters and attribute=value pairs within each section. The sections include:

[Mode]

Various modes are set here; depth, timeouts, printing, error management, etc. See "[Mode]".

[Traverse]

URLs listed here are the anchor(s) of the target website. See "[Traverse]".

[Ignore]

URLs listed here will be ignored in the traversal. See "[Ignore]".

[Input]

The user may specify values to be tried as input to each INPUT field in each FORM. See "[Input]".

[Limit]

To prevent infinite recursion, each page is visited a limited number of times (see "Recurse" in "[Mode]"). In this section you can specify different limits for different pages. See "[Limit]".

[ReferersReport]

The webpages that link to the URLs listed here will be recorded as such in a "links" database. See "[ReferersReport]".

[Validation]

User customizable routines to validate the data that is returned for each URL requested. See "[Validation]".

[Proxy]

A URL specifying the location of the proxy for web access, if any. See "[Proxy]".

[NoProxy]

Domain names for which the proxy is not to be used. See "[NoProxy]".

[Credentials]

Authentication credentials for different net locations and realms. See "[Credentials]".

[Mode]

Depth = n

The loadworm will go to a maximum of 'n' links down from the anchor URL. Depth=1 would load only the anchor page, and none of its links.

Random = {0,1}

If non-zero, then links will be traversed in random order, rather than in the order that they appear in the visits file. A value of 1 will traverse all links in random order.

Recurse = n

Each URL will be traversed only once, unless the Recurse value is more than one. Then each URL will be traversed the number of times specified by Recurse.

Timeout = secs

Specifies the timeout period for all links (in seconds). If a link does not download completely within the time specified by this value, then it is considered a timeout error. Default = 120 seconds.

NoImages = {0,1}

If non-zero, ignores all image links.

Verbose = {0,1}

Controls the verbosity of standard output as the loadworm runs. Use 0 for the greatest degree of quiet. Reports on the actual performance of the loadworm are generated from the database it creates.

Harvest = {0,1}

Turns the option to harvest the results from the loadslaves off or on. Turning it off improves manageability, since the slaves then do not need to maintain a record of the results. This also reduces disk thrashing when multiple loadslaves are running on a host. Harvest=0 is useful if you are monitoring the load on the server's side.

[Traverse]

Specifies the URL(s) that are the anchor(s) of the website to be tested by this loadworm execution.

[Ignore]

A list of regular expressions which, if matching a generated URL, will cause that URL to be ignored. For instance, .*\.netscape\.com would prevent the loadworm from traversing any link to the websites of Netscape. Note that if the URL is explicitly listed in the [Traverse] section, then any [Ignore] match will, in its turn, be ignored.

[ReferersReport]

A list of regular expressions which, when they match a generated URL, will record in a database all webpages that link to that URL.

[Validation]

Each link can be validated with a custom Perl subroutine. The subroutine is selected by matching the URL to a regex. The subroutine is given the URL and the resultant webpage. The validation routine can then verify the accuracy of the response, and can write to the loadworm database files to record successes and/or errors. Particularly, the checks table is reserved for this. It is tied to the hash %main::Checks, which is conventionally a hash whose keys are the URLs, and whose values are whatever string the validation routine wishes to report about this URL/response pair. A zero returned from the validation routine will tell the loadworm to ignore all links within this page. A non-zero return will allow normal processing to continue. For example:

  • .*=AnyURL.pm::Check

This will match any URL, and will call your subroutine, "Check", in your package "AnyURL.pm". AnyURL.pm must be in the @INC path, and must include a package statement (e.g. package AnyURL). See the example, AnyURL.pm, for details.
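A minimal sketch of such a package follows. It assumes, per the description above, that the subroutine receives the URL and the returned page content as its two arguments; the error text being matched and the messages recorded in %main::Checks are made up for illustration, and the AnyURL.pm example shipped with LoadWorm remains the reference.

        package AnyURL;

        sub Check {
                my ($url, $content) = @_;   # URL requested and the page that came back

                # Record a note about this URL/response pair in the checks table.
                if ( defined $content && $content =~ /Internal Server Error/i ) {
                        $main::Checks{$url} = "server error page returned";
                        return 0;   # tell the loadworm to ignore all links within this page
                }

                $main::Checks{$url} = "OK";
                return 1;           # allow normal processing to continue
        }

        1;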

[Proxy]

A URL specifying the location of the proxy for web access, if any.

[NoProxy]

Domain names for which the proxy is not to be used.

[Credentials]

Specifies a list of user ids and passwords for each of the realms that may require authentication. The "net location" and "realm" are separated by a slash, then "user id" and "password" are separated by a comma. "Netlocation/realm" and "userid,password" are then associated with an equals sign, as in:

  • webdev.savesmart.com/Test Server=MyID,twi9y

"webdev.savemart.com" is the net location, "Test Server" is the realm, "MyID" is the user id, and "twi9y" is the password.

[Input]

Each line specifies a list of values to be iterated across whenever a URL and INPUT line name match the specified regular expression. The list is specified as a Perl statement suitable for eval. This feature will later allow more elaborate input generation, but for now it allows the specification of a list of values via qw(list). For example:

  • login.get, name = qw(test1)

  • login.get, cardnumber = qw(test1234)

  • login.get, email = NULL

The URL is matched to the first regex (before the comma), then the NAME of the INPUT field is matched to the second regex (following the comma). Then the list of values specified by the Perl statement (following the equals sign) is iterated on the matched URL. The special syntax of NULL is provided to allow the field to have a null value.

[Limit]

Each line specifies a regex that will match a URL, and the number of times that that URL should be visited in a LoadWorm traversal. Thus,

  • owa/categories\.get=50

  • owa/favorite\.get=10

  • owa/cart\.get=5

  • owa/specials\.get=10

  • owa/search\.get=10

The owa/categories.get CGI script will be called only 50 times in the traversal, owa/favorite.get only 10, owa/cart.get only 5, etc. The count is for all URLs that match these regular expressions. Thus, it doesn't matter what the CGI parameters to these scripts might be; the scripts themselves will be called only as many times as the [Limit] section specifies.

Example of a Configuration File

        [Mode]
        Harvest=1
        Depth=10
        Random=1
        Recurse=
        Timeout=30
        Verbose=0
        NoImages=0
        UserAgent=Mozilla/4.01 [en] (WinNT; I)
        Editor="C:\Program Files\TextPad\TxtPad32.exe"
        
        [Traverse]
        http://webdev.savesmart.com
        
        [Credentials]
        webdev.savesmart.com/Test Server=MyID,twi9y
        
        [Ignore]
        www\.
        www6\.
        maps\.
        justgo\.com
        netscape\.com
        /owa/go_home\.get.*
        
        [Limit]
        owa/categories\.get=50
        owa/favorite\.get=10
        owa/cart\.get=5
        owa/specials\.get=10
        owa/search\.get=10
        
        [ReferersReport]
        \.savesmart\.com:900\/
        favorite\.get
        
        [Validation]
        
        [Proxy]
        http://ssgw.savesmart.com
        
        [NoProxy]
        admin
        webdev.savesmart.com
        
        [Input]
        login.get,name=qw(test1)
        login.get,cardnumber=qw(test1234)
        login.get,email=NULL

THE RESULTS DATABASE

NOTE: This information is not current, but it gives you the general idea of what is possible once we tie up a few loose ends.

The results of a session of LoadWorm are recorded in a Perl accessible database. Although some information is printed to standard output as the session progresses, the most interesting results should be discovered by scanning the LoadWorm database for that session. The database consists of several hash-tied tables. Each table is keyed by the URL associated with it, and the value is a string representing the result. For some of these tables, the result is an array of strings representing several interactions with that URL. Unfortunately, Perl's built-in Tie::Hash will not record arrays in a tied table, so for these tables the data is written as ASCII text to a sequential file. The Perl code listed below (tbl2hash.pl) can be used to pull this sequential file back into a hashed array in your Perl report generator.

referers

This relates URLs of the website to the parent pages that contain them. @referers{$childURL} is an array of URLs of pages that link to $childURL. (This table does not include images. These are recorded in the images table. It does include all ignored URLs.) Note: this file is not a hash-tied database file, but a sequential file containing data that can be imported into a hashed table with the Perl code listed below (tbl2hash.pl).

errors

This is a list of all the URLs that failed to download. $errors{$URL} is the error message associated with the attempt to download $URL.
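For example, a report over this table might look like the sketch below. It assumes the table is a DBM file named "errors" in the session's working directory; since the DBM flavour is not specified here, the generic dbmopen front end is used.

        # Print every failed URL with its error message.
        my %errors;
        dbmopen(%errors, "errors", undef) or die "Cannot open errors table: $!";
        while ( my ($url, $message) = each %errors ) {
                print "$url\n\t$message\n";
        }
        dbmclose(%errors);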

ignores

This is a list of all URLs that were encountered in the website, but were ignored because they match some regular expression in the [Ignore] section of the configuration file. $ignore{$URL} is the regular expression that caused $URL to be included in this list.

timings

This text file records the time of each request, and the time of completion of that request. Each record consists of two (or more) lines. The first line contains the URL. The second line contains the start time, the finish time, and the size in a string like (hh:mm:ss.hh,hh:mm:ss.hh size). The size might be the string "FAILED", instead, indicating that the request failed. Then, the following lines will contain the reason for the failure, until a line containing a copy of the original "FAILED" line. Thus, timings includes the time for failed downloads as well as successful ones.
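A sketch of a reader for this file is shown below. It assumes the layout just described, and that the line which closes a failed record repeats the original timing line verbatim; adjust it to the format your session actually produces.

        # Walk the timings file, printing one summary per request.
        open TIM, "<timings" or die "Cannot open timings: $!";
        while ( my $url = <TIM> ) {
                chomp $url;
                my $timing = <TIM>;
                last unless defined $timing;
                chomp $timing;
                if ( $timing =~ /FAILED/ ) {
                        my @reason;
                        while ( my $line = <TIM> ) {
                                chomp $line;
                                last if $line eq $timing;   # repeated FAILED line closes the record
                                push @reason, $line;
                        }
                        print "$url\n\tFAILED: @reason\n";
                }
                else {
                        print "$url\n\t$timing\n";
                }
        }
        close TIM;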

checks

This table is written by the user-customized validation routine(s).

   tbl2hash.pl

        # Import the sequential "linkages" file into a hash of arrays:
        # an unindented line is a URL key, and the indented lines that
        # follow it are the values belonging to that key.
        my %Linkages = ();
        my $ky;
        open TBL, "<linkages" or die "Cannot open linkages: $!";
        while ( <TBL> )  {
                chomp;
                next if /^\s*$/;
                if ( $_ !~ /^\s/ )  {
                        $ky = $_;
                }
                else {
                        s/^\s*//;
                        push @{  $Linkages{$ky}  }, $_;
                }
        }
        close TBL;
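
Once the table is loaded, a report generator can use it like any other hash of arrays; for instance (the URL shown is made up for illustration):

        my $child = "http://webdev.savesmart.com/owa/cart.get";
        print "$child is linked from:\n";
        print "\t$_\n" for @{ $Linkages{$child} || [] };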

NOTES

  • Watch for #tag in CGI function names.

  • To specify an image click position, define the key as the image name, and the value as "image.x=x&image.y=y".

  • Due to our as yet incomplete control of the TCP/IP layer in this program, we cannot actually duplicate the conditions of modem (or any other low data rate) access to the website. Some conditions of our multiple-client, high-speed data transfers will differ from those when many clients access the website at lower speeds.

  • Each loadslave is limited to twenty-three simultaneous connections. Subsequent connections fail when trying to register (or is it when trying to connect?). This is a limitation imposed by the operating system when the Perl executable was compiled. We have hard-coded a governor at 20 connections to avoid this limit. Multiple instances may be run on a single host, but each one has the same limit.

  • The LoadMaster can not accept connections from more than 28 LoadSlaves (for the same reason).

  • Consequently, the upper limit to the loadtest is 28x23, or 644 simultaneous connections to the web-server. Is there any way to increase this? We can run multiple loadmasters, I suppose; does it make sense, then, to have a super-loadmaster, or perhaps loadslave monitors, so that the load master talks to one slave monitor, which deals with the (up to twenty-eight) loadslaves on its own NT? Etc.

PREREQUISITES

These are the versions of Perl modules under which LoadWorm is known to work. It may be just fine with earlier or later versions.

  • Perl 5.004 (thanks, Larry!)

  • LWP from libwww-perl-5_20

  • LWP::Parallel from ParallelUserAgent-v2_31 (a special thanks to Marc Langheinrich!)

  • Tk from Tk402_003

  • Time::Local and Time::HiRes for Unix OS.

  • Win32 for Win32 OS.

  • And various core Perl modules, including English, File::Path, File::Copy, Socket, Carp, FileHandle, and Sys::Hostname.

AUTHOR

Glenn Wood, glenwood@alumni.caltech.edu.

Copyright 1997-1998 SaveSmart, Inc.

Released under the Perl Artistic License.

$Id: LoadWorm.pm,v 1.1.1.1 2001/05/19 02:54:40 Glenn Wood Exp $
