Text::Corpus::VoiceOfAmerica - Make a corpus of VOA documents for research.
  use Cwd;
  use File::Spec;
  use Data::Dump qw(dump);
  use Log::Log4perl qw(:easy);
  use Text::Corpus::VoiceOfAmerica;
  Log::Log4perl->easy_init ($INFO);
  my $corpusDirectory = File::Spec->catfile (getcwd(), 'corpus_voa');
  my $corpus = Text::Corpus::VoiceOfAmerica->new (corpusDirectory => $corpusDirectory);
  $corpus->update (testing => 1, verbose => 1);
  my $document = $corpus->getDocument (index => 0);
  dump $document->getBody;
  dump $document->getCategories;
  dump $document->getContent;
  dump $document->getDate;
  dump $document->getDescription;
  dump $document->getTitle;
  dump $document->getUri;
Text::Corpus::VoiceOfAmerica can be used to create a temporary corpus of Voice of America news documents for personal research and testing of information processing methods. Read the Voice of America's Terms of Use statement to ensure you abide by it when using this module.
The categories, description, title, etc. of a specified document are accessed using Text::Corpus::VoiceOfAmerica::Document. All errors and warnings are logged using Log::Log4perl, which should be initialized before the module is used.
new
The constructor new creates an instance of the Text::Corpus::VoiceOfAmerica class with the following parameters:
corpusDirectory
corpusDirectory => '...'
corpusDirectory is the directory in which documents are cached using CHI. If corpusDirectory is not defined, the path in the environment variable TEXT_CORPUS_VOICEOFAMERICA_CORPUSDIRECTORY is used, if that is defined. If the directory does not exist, it is created. If no directory is specified, a message is logged and an exception is thrown.
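The lookup order described above can be sketched in plain Perl. This is an illustration only, not the module's own code: the helper name resolve_corpus_directory is hypothetical, and unlike the module it does not log through Log::Log4perl before throwing.

```perl
use strict;
use warnings;
use File::Path qw(make_path);

# Hypothetical helper (not part of Text::Corpus::VoiceOfAmerica): choose the
# corpus directory from an explicit argument, fall back to the environment
# variable, and create the directory if it is missing.
sub resolve_corpus_directory {
    my (%args) = @_;

    # An explicit corpusDirectory argument takes precedence.
    my $directory = $args{corpusDirectory};

    # Otherwise fall back to the documented environment variable.
    $directory = $ENV{TEXT_CORPUS_VOICEOFAMERICA_CORPUSDIRECTORY}
        unless defined $directory;

    # Neither source supplied a directory: throw an exception.
    die "no corpus directory specified\n" unless defined $directory;

    # Create the directory, including any missing parents.
    make_path($directory) unless -d $directory;

    return $directory;
}
```

Calling resolve_corpus_directory (corpusDirectory => $path) returns $path after creating it if needed; calling it with no arguments uses the environment variable or dies.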
getDocument
  getDocument (index => $index, cacheOnly => 0)
  getDocument (uri => $uri, cacheOnly => 0)
getDocument returns a Text::Corpus::VoiceOfAmerica::Document object for the document with index $index or URI $uri. The document indices range from zero to getTotalDocuments() - 1. getDocument returns undef if any errors occur and logs them using Log::Log4perl.
index
index => '...'
index should be the number of the document to return. It should be a non-negative integer less than getTotalDocuments(). If it is out of range, undef is returned.
uri
uri => '...'
uri should be the URL of the document to return. If the document is not in the cache, it is fetched unless cacheOnly evaluates to true, in which case undef is returned.
cacheOnly
cacheOnly => 0
If cacheOnly evaluates to true, only documents already in the cache are returned; undef is returned for a document that is not cached. The default is false.
An example:
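The following is a minimal sketch rather than code from the module's distribution. It assumes $corpus is a Text::Corpus::VoiceOfAmerica object created with new as shown at the top of this document; the helper name cached_title is hypothetical. It passes cacheOnly => 1 so that a URI missing from the cache yields undef instead of a fetch.

```perl
use strict;
use warnings;

# Hypothetical helper: return the title of a document that is already in the
# cache, or undef if it is not cached. $corpus is expected to be a
# Text::Corpus::VoiceOfAmerica object and $uri the URL of the document.
sub cached_title {
    my ($corpus, $uri) = @_;

    # cacheOnly => 1 asks getDocument to consult the cache only.
    my $document = $corpus->getDocument(uri => $uri, cacheOnly => 1);

    # getDocument returns undef on a cache miss or on any error.
    return unless defined $document;

    return $document->getTitle;
}
```

A call such as my $title = cached_title ($corpus, $uri); leaves $title undefined when the document has not been fetched yet.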
getTotalDocuments
getTotalDocuments ()
getTotalDocuments returns the total number of documents in the corpus. The index to the documents in the corpus ranges from zero to getTotalDocuments() - 1.
getURIsInCorpus
getURIsInCorpus ()
getURIsInCorpus returns an array reference of all the URIs in the corpus.
For example:
  use Cwd;
  use File::Spec;
  use Data::Dump qw(dump);
  use Log::Log4perl qw(:easy);
  use Text::Corpus::VoiceOfAmerica;
  Log::Log4perl->easy_init ($INFO);
  my $corpusDirectory = File::Spec->catfile (getcwd(), 'corpus_voa');
  my $corpus = Text::Corpus::VoiceOfAmerica->new (corpusDirectory => $corpusDirectory);
  $corpus->update (testing => 1, verbose => 1);
  dump $corpus->getURIsInCorpus;
update
  update (verbose => 0, testing => 0)
This method updates the set of documents in the corpus by fetching any newly listed documents in the sitemap.xml file.
verbose
verbose => 0
If verbose is positive, then after each new document is fetched a message is logged stating the number of documents remaining to fetch and the approximate time to completion. update returns the number of documents fetched.
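How the approximate time to completion is computed is not documented here. One plausible way to build such a message, assuming each remaining document takes as long on average as those already fetched, is sketched below. The name progress_message and its parameters are hypothetical and not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: build a progress message from the number of documents
# fetched so far, the total number to fetch, and the seconds elapsed so far.
sub progress_message {
    my ($fetched, $total, $elapsedSeconds) = @_;

    my $remaining = $total - $fetched;

    # Assumption: the average time per fetched document also holds for the
    # remaining ones. With nothing fetched yet there is no basis for an
    # estimate, so report zero.
    my $secondsLeft = $fetched > 0
        ? $elapsedSeconds / $fetched * $remaining
        : 0;

    return sprintf('%d documents remaining, about %.0f seconds to completion',
        $remaining, $secondsLeft);
}

# 25 of 100 documents fetched in 50 seconds.
print progress_message(25, 100, 50), "\n";
```

A linear estimate like this is crude when fetch times vary widely, but it needs no state beyond a start time and a counter.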
testing
testing => 0
If testing is true, only one document is added to the corpus.
  use Cwd;
  use File::Spec;
  use Data::Dump qw(dump);
  use Log::Log4perl qw(:easy);
  use Text::Corpus::VoiceOfAmerica;
  Log::Log4perl->easy_init ($INFO);
  my $corpusDirectory = File::Spec->catfile (getcwd(), 'corpus_voa');
  my $corpus = Text::Corpus::VoiceOfAmerica->new (corpusDirectory => $corpusDirectory);
  $corpus->update (testing => 1, verbose => 1);
  dump $corpus->getTotalDocuments;
The example below will print out all the information for each document in the corpus.
  use Cwd;
  use File::Spec;
  use Data::Dump qw(dump);
  use Log::Log4perl qw(:easy);
  use Text::Corpus::VoiceOfAmerica;
  Log::Log4perl->easy_init ($INFO);
  my $corpusDirectory = File::Spec->catfile (getcwd(), 'corpus_voa');
  my $corpus = Text::Corpus::VoiceOfAmerica->new (corpusDirectory => $corpusDirectory);
  $corpus->update (testing => 1, verbose => 1);
  my $totalDocuments = $corpus->getTotalDocuments;
  for (my $i = 0; $i < $totalDocuments; $i++)
  {
    # eval keeps a failure on one document from stopping the loop.
    eval
    {
      my $document = $corpus->getDocument(index => $i);
      # return leaves the eval block cleanly; next here would warn
      # "Exiting eval via next" when warnings are enabled.
      return unless defined $document;
      my %documentInfo;
      $documentInfo{title} = $document->getTitle();
      $documentInfo{body} = $document->getBody();
      $documentInfo{date} = $document->getDate();
      $documentInfo{content} = $document->getContent();
      $documentInfo{categories} = $document->getCategories();
      $documentInfo{description} = $document->getDescription();
      $documentInfo{uri} = $document->getUri();
      dump \%documentInfo;
    };
  }
To install the module, set TEXT_CORPUS_VOICEOFAMERICA_FULL_TESTING to true and run the following commands:
  perl Makefile.PL
  make
  make test
  make install
If you are on Windows, use 'nmake' rather than 'make'.
The module will still install if TEXT_CORPUS_VOICEOFAMERICA_FULL_TESTING is false or not defined, but little testing will be performed.
This module uses XPath expressions to extract links and text. These expressions may become invalid as the format of the various pages changes, which can cause many bugs.
Please email bug reports or feature requests to text-corpus-voiceofamerica@rt.cpan.org, or submit them through the web interface at http://rt.cpan.org/Public/Bug/Report.html?Queue=Text-Corpus-VoiceOfAmerica. The author will be notified, and you can be automatically notified of progress on the bug fix or feature request.
Jeff Kubina <jeff.kubina@gmail.com>
Copyright (c) 2009 Jeff Kubina. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.
corpus, english corpus, information processing, voa, voice of america
CHI, Log::Log4perl, Text::Corpus::VoiceOfAmerica::Document