WebSource - a general data wrapping tool particularly well suited for online data (but what data is not online in some way today ;) )
WebSource provides a general, normalized framework for accessing data made available via the Web. Access to a subpart of the Web is obtained by defining a task. This task is built by composing query building, extraction, fetching and filtering subtasks.
    $source = WebSource->new(wsd => $description);
    @results = $source->query($query);

or

    $result = $source->set_query($query);
    while($result = $source->next_result()) {
        ...
    }
WebSource was originally a generic wrapper around a Web source. Given an XML description of a source, it allows querying the source and retrieving its results. The format of the query and of the results remains source dependent, however.
It is now configurable enough to carry out complex tasks on the Web, such as fetching, extracting and filtering data. Each complex task is described by an XML task description file (a WebSource description) and is decomposed into simple subtasks of different flavors.
Existing subtask flavors are:

- extract
    input: an XML::LibXML::Document
    output: an XML::LibXML::Node
    Applies an XPath expression to the document and returns the set of matching nodes.
- fetch
    input: a URL (or an XML::LibXML::Node containing a URL)
    output: an XML::LibXML::Document
- format
    input: an XML::Document
    output: a string
- filter
    input: anything
    output: anything (but not all)
- external
    input: depends on the external module
    output: depends on the external module
    This type of subtask uses an external Perl module as a task, which allows highly configurable tasks to be defined.
- meta-tag
    input: anything
    output: anything (with updated meta-data)
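The extract, filter and format flavors above can be illustrated with a minimal sketch that uses XML::LibXML directly. It does not use WebSource's own subtask classes; the sample document, the element names and the filtering rule are assumptions made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# A small in-memory document standing in for the output of a fetch subtask.
# (Hypothetical content, not taken from any real source description.)
my $doc = XML::LibXML->load_xml(string => <<'XML');
<results>
  <item><title>First</title><link>http://example.org/1</link></item>
  <item><title>Second</title><link>http://example.org/2</link></item>
</results>
XML

# "extract": apply an XPath expression to the document and collect the nodes.
my @nodes = $doc->findnodes('//item');

# "filter": let some of the extracted nodes through, but not all.
my @kept = grep { $_->findvalue('title') ne 'Second' } @nodes;

# "format": turn each remaining node into a string.
my @lines = map {
    $_->findvalue('title') . ' <' . $_->findvalue('link') . '>'
} @kept;

print "$_\n" for @lines;   # prints: First <http://example.org/1>
```

Each step consumes the output of the previous one, which is the same composition idea a task description expresses declaratively.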
Creates a new WebSource object working with the given WebSource description.
The following named parameters can be given:
wsd
Use a generic engine with the given source description file
max_results
Do not output more than max_results
Pass the initial data to the first subtask
Build a query %hash for the given parameters and push it in
Set the maximum number of results to output to $count
Returns the next result for the task
Returns a hash of the initial task's parameters
Returns the spec of the options translated for Getopt::Mixed
Sets source specific option $opt to value $val
Handles nodes of type <ws:import href="" /> by inserting nodes from the wsd file referenced by href (the imported document) into the current wsd document (the target document). A node from the imported document is inserted into the target document only if the target document does not already contain a node with the same name.
Handles nodes of type <ws:attribute name="aname" value="oname" /> by adding an attribute named aname, with the value of the option named oname, to the parent node. The ws:attribute node is then removed.
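The "insert only if no node with the same name exists" rule can be sketched with XML::LibXML as follows. This is not WebSource's actual implementation: the helper name merge_imported, the two sample documents, and the choice to compare only the element children of the root are all assumptions made for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical target and imported documents (plain element names, no namespace).
my $target = XML::LibXML->load_xml(
    string => '<source><name>local</name><fetch/></source>');
my $imported = XML::LibXML->load_xml(
    string => '<source><name>shared</name><options/></source>');

# Copy child elements of the imported root into the target root,
# skipping any element whose name is already present in the target.
sub merge_imported {
    my ($target_doc, $imported_doc) = @_;
    my $root = $target_doc->documentElement;

    # Names of the element children already present in the target.
    my %present = map { $_->nodeName => 1 } $root->findnodes('*');

    for my $node ($imported_doc->documentElement->findnodes('*')) {
        next if $present{ $node->nodeName };
        # importNode makes a copy that belongs to the target document.
        $root->appendChild( $target_doc->importNode($node) );
        $present{ $node->nodeName } = 1;
    }
    return $target_doc;
}

merge_imported($target, $imported);

my @names = map { $_->nodeName } $target->documentElement->findnodes('*');
print "@names\n";   # prints: name fetch options
```

In this sketch the target's own `name` element wins over the imported one, so an importing description can override anything it defines itself while inheriting the rest.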
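The ws:attribute substitution can likewise be sketched with XML::LibXML. Again this is only an illustration under stated assumptions: the namespace URI bound to the ws prefix is a placeholder (the real one is not given here), and the option name http-method, its value and the %options hash are hypothetical.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Placeholder namespace URI for the "ws" prefix; NOT the real WebSource URI.
my $WS_NS = 'http://example.org/ws';

my $doc = XML::LibXML->load_xml(string => <<"XML");
<ws:fetch xmlns:ws="$WS_NS"><ws:attribute name="method" value="http-method"/></ws:fetch>
XML

# Hypothetical option values, as they might have been set by the caller.
my %options = ('http-method' => 'POST');

my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(ws => $WS_NS);

for my $attr_node ($xpc->findnodes('//ws:attribute')) {
    my $parent = $attr_node->parentNode;
    my $aname  = $attr_node->getAttribute('name');    # attribute to create
    my $oname  = $attr_node->getAttribute('value');   # option to read

    # Add the attribute to the parent when the option has a value ...
    $parent->setAttribute($aname, $options{$oname}) if defined $options{$oname};
    # ... then remove the ws:attribute node itself.
    $parent->removeChild($attr_node);
}

print $doc->documentElement->toString, "\n";
```

After the loop, the `ws:fetch` element carries `method="POST"` and no longer contains the `ws:attribute` child.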
ws-query, WebSource::Extract, WebSource::Fetch, WebSource::Filter, etc.