WWW::Sitemap - functions for generating a site map for a given site URL.
    use WWW::Sitemap;
    use LWP::UserAgent;

    my $ua = new LWP::UserAgent;
    my $sitemap = new WWW::Sitemap(
        EMAIL       => 'your@email.address',
        USERAGENT   => $ua,
        ROOT        => 'http://www.my.com/'
    );

    $sitemap->url_callback(
        sub {
            my ( $url, $depth, $title, $summary ) = @_;
            print STDERR "URL: $url\n";
            print STDERR "DEPTH: $depth\n";
            print STDERR "TITLE: $title\n";
            print STDERR "SUMMARY: $summary\n";
            print STDERR "\n";
        }
    );
    $sitemap->generate();
    $sitemap->option( 'VERBOSE' => 1 );
    my $len = $sitemap->option( 'SUMMARY_LENGTH' );

    my $root = $sitemap->root();
    for my $url ( $sitemap->urls() )
    {
        if ( $sitemap->is_internal_url( $url ) )
        {
            # do something ...
        }
        my @links = $sitemap->links( $url );
        my $title = $sitemap->title( $url );
        my $summary = $sitemap->summary( $url );
        my $depth = $sitemap->depth( $url );
    }
    $sitemap->traverse(
        sub {
            my ( $sitemap, $url, $depth, $flag ) = @_;
            if ( $flag == 0 )
            {
                # do something at the start of a list of sub-pages ...
            }
            elsif( $flag == 1 )
            {
                # do something for each page ...
            }
            elsif( $flag == 2 )
            {
                # do something at the end of a list of sub-pages ...
            }
        }
    );
The WWW::Sitemap module creates a sitemap for a site by traversing the site with the WWW::Robot module. The sitemap object has methods to access a list of all the URLs in the site and a list of all the links from each of these URLs. The title of each URL and a summary generated from its content are also available, as is its depth: the minimum number of links from the root URL to that page.
WWW::Sitemap
Possible options are:
USERAGENT - User agent used to perform the robot traversal. Defaults to LWP::UserAgent.
VERBOSE - Verbose flag, for printing out useful messages during traversal [0|1]. Defaults to 0.
SUMMARY_LENGTH - Maximum length of the (automatically generated) summary.
EMAIL - E-mail address the robot uses to identify itself. This option is required.
Maximum depth of traversal.
ROOT - Root URL of the site for which the sitemap is being created. This option is required.
    my $sitemap = new WWW::Sitemap(
        EMAIL       => 'your@email.address',
        USERAGENT   => $ua,
        ROOT        => 'http://www.my.com/'
    );
Method for generating the sitemap, based on the constructor options.
$sitemap->generate();
This method defines a callback that is invoked for every URL traversed while generating the sitemap. Its main purpose is to allow custom progress reporting. The callback should be of the form:
    sub {
        my ( $url, $depth, $title, $summary ) = @_;

        # do something ...

    }
Interface to get or set options after object construction.
    $sitemap->option( 'VERBOSE' => 1 );
    my $len = $sitemap->option( 'SUMMARY_LENGTH' );
Returns the root URL for the site.
my $root = $sitemap->root();
Returns a list of all the URLs on the sitemap.
    for my $url ( $sitemap->urls() )
    {
        # do something ...
    }
Returns 1 if $url is an internal URL (i.e. if $url =~ /^$root/).
    if ( $sitemap->is_internal_url( $url ) )
    {
        # do something ...
    }
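The prefix test described above can be sketched in plain Perl. This is an illustration only, not the module's own code: the helper name looks_internal is hypothetical, and the use of \Q...\E is an assumption added here so that regex metacharacters in the root URL (such as the dots in a host name) are matched literally.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of WWW::Sitemap: reports whether $url
# starts with $root. \Q...\E quotes regex metacharacters in $root, so the
# dots in a host name only match literal dots.
sub looks_internal {
    my ( $root, $url ) = @_;
    return $url =~ /^\Q$root\E/ ? 1 : 0;
}

my $root = 'http://www.my.com/';
for my $url ( 'http://www.my.com/about.html', 'http://www.other.com/' ) {
    printf "%-30s %s\n", $url,
        looks_internal( $root, $url ) ? 'internal' : 'external';
}
```

Without the quoting, a root such as http://www.my.com/ would also match http://wwwxmy.com/, because an unescaped dot matches any character.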
Returns a list of all the links from a given URL in the site map.
my @links = $sitemap->links( $url );
Returns the title of the URL.
my $title = $sitemap->title( $url );
Returns a summary of the URL - either from the <META NAME=DESCRIPTION> tag or generated automatically using HTML::Summary.
my $summary = $sitemap->summary( $url );
Returns the minimum number of links to traverse from the root URL of the site to this URL.
my $depth = $sitemap->depth( $url );
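One way to picture this definition of depth is a breadth-first search over the link graph. The sketch below is an assumption about how such a value could be computed, not a description of the module's internals; the %links table and the min_depths function are hypothetical names used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical link table: each URL maps to the URLs it links to.
my %links = (
    '/'  => [ '/a', '/b' ],
    '/a' => [ '/c' ],
    '/b' => [ '/c', '/' ],
    '/c' => [],
);

# Breadth-first search from $root: the first time a URL is reached is
# via a shortest chain of links, so that hop count is its depth.
sub min_depths {
    my ( $root, $links ) = @_;
    my %depth = ( $root => 0 );
    my @queue = ( $root );
    while ( defined( my $url = shift @queue ) ) {
        for my $next ( @{ $links->{ $url } || [] } ) {
            next if exists $depth{ $next };    # already reached earlier
            $depth{ $next } = $depth{ $url } + 1;
            push @queue, $next;
        }
    }
    return \%depth;
}

my $depth = min_depths( '/', \%links );
print "$_ => $depth->{ $_ }\n" for sort keys %$depth;
```

In this toy graph /c is linked from both /a and /b, and the link from /b back to / does not change the root's depth of 0.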
The traverse method traverses the sitemap, starting at the root node, and visits each URL in the order in which it would be displayed in a sequential sitemap of the site. The callback is called at several points in the traversal, indicated by the $flag argument to the callback:
0 - Before each set of daughter URLs of a given URL.
1 - For each URL.
2 - After each set of daughter URLs of a given URL.
See the sitemapper.pl script distributed with this module for an example of the use of the traverse method.
    $sitemap->traverse(
        sub {
            my ( $sitemap, $url, $depth, $flag ) = @_;
            if ( $flag == 0 )
            {
                # do something at the start of a list of sub-pages ...
            }
            elsif( $flag == 1 )
            {
                # do something for each page ...
            }
            elsif( $flag == 2 )
            {
                # do something at the end of a list of sub-pages ...
            }
        }
    );
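As a rough illustration of how the three flag values can drive output, the sketch below builds a nested HTML list. It does not use the module: walk is a hypothetical stand-in for traverse that runs over a hand-written %children table, and the exact order in which the real method issues the flags (for example, around the root URL) is an assumption here.

```perl
use strict;
use warnings;

# Hypothetical tree of pages: each URL maps to its daughter URLs.
my %children = (
    '/'  => [ '/a', '/b' ],
    '/a' => [ '/a/1' ],
);

# Stand-in for traverse(): calls $cb with flag 1 for a URL and then, if
# the URL has daughters, with flag 0, the daughters, and flag 2.
sub walk {
    my ( $cb, $url, $depth ) = @_;
    $cb->( undef, $url, $depth, 1 );
    my @kids = @{ $children{ $url } || [] };
    return unless @kids;
    $cb->( undef, $url, $depth, 0 );
    walk( $cb, $_, $depth + 1 ) for @kids;
    $cb->( undef, $url, $depth, 2 );
}

my @out;

# Callback with the argument order used by traverse; it opens a <ul> on
# flag 0, emits an <li> on flag 1, and closes the <ul> on flag 2.
my $render = sub {
    my ( $sitemap, $url, $depth, $flag ) = @_;
    my $pad = '  ' x $depth;
    if    ( $flag == 0 ) { push @out, "$pad<ul>" }
    elsif ( $flag == 1 ) { push @out, "$pad<li>$url</li>" }
    elsif ( $flag == 2 ) { push @out, "$pad</ul>" }
};

walk( $render, '/', 0 );
print "$_\n" for @out;
```

The same callback shape could be passed to traverse itself, in which case the first argument would be the sitemap object rather than undef.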
LWP::UserAgent HTML::Summary WWW::Robot
Ave Wrigley <Ave.Wrigley@itn.co.uk>
Copyright (c) 1997 Canon Research Centre Europe (CRE). All rights reserved. This script and any associated documentation or files cannot be distributed outside of CRE without express prior permission from CRE.