
NAME

WWW::Find - Web Resource Finder

SYNOPSIS

use LWP::UserAgent;
use HTTP::Request;
use WWW::Find;

$agent = LWP::UserAgent->new;
$request = HTTP::Request->new(GET => 'http://begin.url');
$find = WWW::Find->new(
    AGENT      => $agent,
    REQUEST    => $request,
    MAX_DEPTH  => 2,
    MATCH_SUB  => \&match,
    FOLLOW_SUB => \&follow,
);
$find->go;

DEPENDENCIES

HTML::LinkExtor, LWP::UserAgent, HTTP::Request, URI

DESCRIPTION

WWW::Find simplifies the task of searching the web for specific types of information. The inspiration for this project came from the recursive website mirroring program, w3mir. WWW::Find is similar to w3mir, but with a more general feature set.

In a nutshell, a WWW::Find object extracts all the HREF links from an HTML document, creates an HTTP::Request object for each link, matches the HTTP::Response object against user-specified criteria, and then does something with the matching links (possibly repeating the entire operation on certain links). Be careful not to set the MAX_DEPTH parameter too high; otherwise you could easily begin the endless task of requesting every page on the web!
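The depth-limited recursion described above can be sketched without any network access. This is not WWW::Find's actual implementation: the %links_on table, the crawl function, and the way the depth limit is counted are all assumptions made for illustration, and the real module's MAX_DEPTH semantics may differ.

```perl
use strict;
use warnings;

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A real crawl would fetch each page and extract its HREF links instead.
my %links_on = (
    'http://begin.url/'       => ['http://begin.url/a.html', 'http://begin.url/b.txt'],
    'http://begin.url/a.html' => ['http://begin.url/c.html'],
    'http://begin.url/c.html' => ['http://begin.url/d.html'],
);

# Visit the links on $url; call $match on each link, and recurse into a link
# when $follow approves it and the depth limit has not been reached.
sub crawl {
    my ($url, $depth, $max_depth, $match, $follow, $seen) = @_;
    return if $depth >= $max_depth;
    for my $link (@{ $links_on{$url} || [] }) {
        next if $seen->{$link}++;    # never process the same link twice
        $match->($link);
        crawl($link, $depth + 1, $max_depth, $match, $follow, $seen)
            if $follow->($link);
    }
    return;
}

my @matched;
crawl(
    'http://begin.url/', 0, 2,
    sub { push @matched, $_[0] if $_[0] =~ /html?$/i },    # match callback
    sub { $_[0] =~ /html?$/i },                            # follow callback
    {},
);
print "$_\n" for @matched;
```

With a limit of 2, the sketch reaches a.html and c.html but never requests the links found on c.html, which is why a small limit keeps the crawl bounded.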

In addition to an LWP::UserAgent and an HTTP::Request object, you'll need to create two subroutines: a &match subroutine and a &follow subroutine.

The &follow subroutine should test the HTTP::Response object against user-defined criteria. If it returns true, the entire operation is repeated on the matching link. For example, the following subroutine follows links whose Content-Type header matches the regular expression /text/:

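sub follow {
    my $find_obj = shift;
    # Issue a HEAD request so only the headers are fetched, not the body.
    my $header   = HTTP::Request->new(HEAD => $find_obj->{REQUEST}->uri);
    # request() always returns an HTTP::Response object, even on failure.
    my $response = $find_obj->{AGENT}->request($header);
    # Follow the link only if its content type contains "text".
    return $response->content_type =~ /text/i ? 1 : 0;
}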

The &match subroutine should perform some operation on links matching user-defined criteria. For example, the following subroutine simply prints the URL of every link matching the regular expression /html?$/:

sub match {
    my $find_obj = shift;
    # Print the URL when it ends in "htm" or "html" (case-insensitive).
    if ($find_obj->{REQUEST}->uri =~ /html?$/i) {
        print $find_obj->{REQUEST}->uri . "\n";
    }
    return;
}
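A match callback does not have to print. As a sketch, the variant below collects matching URLs into an array for later use. The names collect_match, @found, and Local::FakeRequest are hypothetical; the stand-in request class exists only so the callback can be tried offline and assumes, as in the example above, that the callback reads the URL from $find_obj->{REQUEST}->uri.

```perl
use strict;
use warnings;

my @found;    # URLs collected by the callback

# Same test as the printing example, but store the URL instead of printing it.
sub collect_match {
    my $find_obj = shift;
    my $uri = $find_obj->{REQUEST}->uri;
    push @found, "$uri" if $uri =~ /html?$/i;
    return;
}

# Stand-in for HTTP::Request (hypothetical, used only to exercise the callback).
package Local::FakeRequest {
    sub new { my ($class, $uri) = @_; return bless { uri => $uri }, $class }
    sub uri { return $_[0]{uri} }
}

for my $url ('http://begin.url/index.html',
             'http://begin.url/logo.png',
             'http://begin.url/old.htm') {
    collect_match({ REQUEST => Local::FakeRequest->new($url) });
}

print "$_\n" for @found;
```

In real use the callback would be passed to the constructor as MATCH_SUB => \&collect_match, and @found would be read after $find->go returns.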

SEE ALSO

HTTP::Request, LWP::UserAgent

AUTHOR

Nathaniel Graham, <broom@cpan.org>

http://www.gnusto.net is the official home page of WWW::Find.

COPYRIGHT AND LICENSE

Copyright 2003 by Nathaniel Graham

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.