NAME

Text::Refer - parse Unix "refer" files

This is Alpha code, and may be subject to changes in its public interface. It will stabilize by June 1997, at which point this notice will be removed. Until then, if you have any feedback, please let me know!

SYNOPSIS

Pull in the module:

    use Text::Refer;  

Parse a refer stream from a filehandle:

    while ($ref = input Text::Refer \*FH)  {
        # ...do stuff with $ref...
    }
    defined($ref) or die "error parsing input";

Same, but using a parser object for more control:

    # Create a new parser: 
    $parser = new Text::Refer::Parser LeadWhite=>'KEEP';
    
    # Parse:
    while ($ref = $parser->input(\*FH))  {
        # ...do stuff with $ref...
    }
    defined($ref) or die "error parsing input";

Manipulating reference objects, using high-level methods:

    # Get the title, author, etc.:
    $title      = $ref->title;
    @authors    = $ref->author;      # list context
    $lastAuthor = $ref->author;      # scalar context
    
    # Set the title and authors:
    $ref->title("Cyberiad");
    $ref->author(["S. Trurl", "C. Klapaucius"]);   # arrayref for >1 value!
    
    # Delete the abstract:
    $ref->abstract(undef);

Same, using low-level methods:

    # Get the title, author, etc.:
    $title      = $ref->get('T');
    @authors    = $ref->get('A');      # list context
    $lastAuthor = $ref->get('A');      # scalar context
    
    # Set the title and authors:
    $ref->set('T', "Cyberiad");
    $ref->set('A', "S. Trurl", "C. Klapaucius");
    
    # Delete the abstract:
    $ref->set('X');                    # sets to empty array of values

Output:

    print $ref->as_string;

DESCRIPTION

This module supersedes the old Text::Bib.

This module provides routines for parsing in the contents of "refer"-format bibliographic databases: these are simple text files which contain one or more bibliography records. They are usually found lurking on Unix-like operating systems, with the extension .bib.

Each record in a "refer" file describes a single paper, book, or article. Users of nroff/troff often employ such databases when typesetting papers.

Even if you don't use *roff, this simple, easily-parsed parameter-value format is still useful for recording/exchanging bibliographic information. With this module, you can easily post-process "refer" files: search them, convert them into LaTeX, whatever.

Example

Here's a possible "refer" file with three entries:

    %T Cyberiad
    %A Stanislaw Lem
    %K robot fable 
    %I Harcourt/Brace/Jovanovich
    
    %T Invisible Cities
    %A Italo Calvino
    %K city fable philosophy
    %X In this surreal series of fables, Marco Polo tells an
       aged Kublai Khan of the many cities he has visited in 
       his lifetime.  
    
    %T Angels and Visitations
    %A Neil Gaiman 
    %D 1993

The lines separating the records must be completely blank; that is, they cannot contain anything but a single newline.

See refer(1) or grefer(1) for more information on "refer" files.

Syntax

From the GNU manpage, grefer(1):

The bibliographic database is a text file consisting of records separated by one or more blank lines. Within each record fields start with a % at the beginning of a line. Each field has a one character name that immediately follows the %. It is best to use only upper and lower case letters for the names of fields. The name of the field should be followed by exactly one space, and then by the contents of the field. Empty fields are ignored. The conventional meaning of each field is as follows:

A

The name of an author. If the name contains a title such as Jr. at the end, it should be separated from the last name by a comma. There can be multiple occurrences of the A field. The order is significant. It is a good idea always to supply an A field or a Q field.

B

For an article that is part of a book, the title of the book

C

The place (city) of publication.

D

The date of publication. The year should be specified in full. If the month is specified, the name rather than the number of the month should be used, but only the first three letters are required. It is a good idea always to supply a D field; if the date is unknown, a value such as "in press" or "unknown" can be used.

E

For an article that is part of a book, the name of an editor of the book. Where the work has editors and no authors, the names of the editors should be given as A fields and , (ed) or , (eds) should be appended to the last author.

G

US Government ordering number.

I

The publisher (issuer).

J

For an article in a journal, the name of the journal.

K

Keywords to be used for searching.

L

Label.

NOTE: Uniquely identifies the entry. For example, "Able94".

N

Journal issue number.

O

Other information. This is usually printed at the end of the reference.

P

Page number. A range of pages can be specified as m-n.

Q

The name of the author, if the author is not a person. This will only be used if there are no A fields. There can only be one Q field.

NOTE: Thanks to Mike Zimmerman for clarifying this for me: it means a "corporate" author: when the "author" is listed as an organization such as the UN, or RAND Corporation, or whatever.

R

Technical report number.

S

Series name.

T

Title. For an article in a book or journal, this should be the title of the article.

V

Volume number of the journal or book.

X

Annotation.

NOTE: Basically, a brief abstract or description.

For all fields except A and E, if there is more than one occurrence of a particular field in a record, only the last such field will be used.

If accent strings are used, they should follow the character to be accented. This means that the AM macro must be used with the -ms macros. Accent strings should not be quoted: use one \ rather than two.
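The record syntax quoted above can be sketched in a few lines of plain Perl. This is only an illustration of the format, not the parser this module uses: parse_refer_text() is a hypothetical helper, it does no error checking, and it keeps continuation lines verbatim.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Refer: split "refer" text into
# records, each a hash mapping a one-character field name to an arrayref
# of values (a field may occur more than once, so order is kept).
sub parse_refer_text {
    my ($text) = @_;
    my @records;
    # Records are separated by one or more completely blank lines.
    for my $chunk (split /\n{2,}/, $text) {
        next unless $chunk =~ /\S/;
        my (%rec, $current);
        for my $line (split /\n/, $chunk) {
            if ($line =~ /^%(\S) (.*)$/) {
                # "%", a one-character name, one space, then the contents.
                push @{ $rec{$1} }, $2;
                $current = \$rec{$1}[-1];
            }
            elsif (defined $current) {
                # Any other line continues the previous field.
                $$current .= "\n$line";
            }
        }
        push @records, \%rec;
    }
    return @records;
}

my $text = <<'EOF';
%T Cyberiad
%A Stanislaw Lem
%K robot fable

%T Invisible Cities
%A Italo Calvino
%X In this surreal series of fables, Marco Polo tells an
   aged Kublai Khan of the many cities he has visited.
EOF

my @records = parse_refer_text($text);
printf "%d records; first title: %s\n", scalar(@records), $records[0]{T}[-1];
```

Storing every field as an arrayref, even single-valued ones, keeps repeated fields such as %A in their original order.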

Parsing records from "refer" files

You will nearly always use the input() constructor to create new instances, and nearly always as shown in the "SYNOPSIS".

Internally, the records are parsed by a parser object; if you invoke the class method Text::Refer->input(), a special default parser is used, and this will be good enough for most tasks. For more complex tasks, use the Text::Refer::Parser class (see "CLASS Text::Refer::Parser") to build your own fine-tuned parser, and input() from that instead.

CLASS Text::Refer

Each instance of this class represents a single record in a "refer" file.

Construction and input

new

Class method, constructor. Build an empty "refer" record.

input FILEHANDLE

Class method. Input a new "refer" record from a filehandle. The default parser is used:

    while ($ref = input Text::Refer \*STDIN) {
        # ...do stuff with $ref...
    }

Do not use this as an instance method; it will not re-init the object you give it.

Getting/setting attributes

attr ATTR, [VALUE]

Instance method. Get/set the attribute by its one-character name, ATTR. The VALUE is optional, and may be given in a number of ways:

  • If the VALUE is given as undefined, the attribute will be deleted:

        $ref->attr('X', undef);        # delete the abstract

  • If a defined, non-reference scalar VALUE is given, it is used to replace the existing values for the attribute with that single value:

        $ref->attr('T', "The Police State Rears Its Ugly Head");
        $ref->attr('D', 1997);

  • If an arrayref VALUE is given, it is used to replace the existing values for the attribute with all elements of that array:

        $ref->attr('A', ["S. Trurl", "C. Klapaucius"]);

    We use an arrayref since an empty array would be impossible to distinguish from the next two cases, where the goal is to "get" instead of "set"...
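To see why an arrayref is needed, here is a rough sketch of how the optional VALUE might be told apart. describe_attr_call() is a hypothetical helper written for this illustration; it is not the module's code and only reports what a given call would mean.

```perl
use strict;
use warnings;

# Hypothetical helper: report what a call with these arguments would mean.
sub describe_attr_call {
    my ($attr, @rest) = @_;
    return "get $attr" unless @rest;                 # no VALUE passed at all
    my $value = $rest[0];
    return "delete $attr" unless defined $value;     # VALUE is undef
    return "set $attr to " . scalar(@$value) . " value(s)"
        if ref($value) eq 'ARRAY';                   # VALUE is an arrayref
    return "set $attr to 1 value(s)";                # plain scalar VALUE
}

my @none = ();
print describe_attr_call('A', @none), "\n";     # flattens to no VALUE: a "get"
print describe_attr_call('A', \@none), "\n";    # arrayref: a "set" to zero values
print describe_attr_call('X', undef), "\n";     # explicit undef: a delete
print describe_attr_call('T', "Cyberiad"), "\n";
```

An empty list passed directly disappears from the argument list, so only the arrayref form can express "set to these values" without ambiguity.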

This method returns the current (or new) value of the given attribute, just as get() does:

  • If invoked in a scalar context, the method will return the last value (this is to mimic the behavior of groff). Hence, given the above, the code:

        $author = $ref->attr('A');

    will set $author to "C. Klapaucius".

  • If invoked in an array context, the method will return the list of all values, in order. Hence, given the above, the code:

        @authors = $ref->attr('A');

    will set @authors to ("S. Trurl", "C. Klapaucius").
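The context-sensitive return described above is the kind of thing Perl's built-in wantarray makes easy. The following is a minimal sketch under that assumption; get_values() and %record are hypothetical stand-ins, not the module's internals.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a record: field name => arrayref of values.
my %record = (A => ["S. Trurl", "C. Klapaucius"]);

# Return all values in list context, the last value in scalar context.
sub get_values {
    my ($rec, $field) = @_;
    my @values = @{ $rec->{$field} || [] };
    return wantarray ? @values : $values[-1];
}

my @authors = get_values(\%record, 'A');   # list context: both authors
my $author  = get_values(\%record, 'A');   # scalar context: the last author
my $missing = get_values(\%record, 'X');   # scalar context, no values: undef

print "all: @authors\n";
print "last: $author\n";
print "missing is ", (defined $missing ? "defined" : "undefined"), "\n";
```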

Note: this method is used as the basis of all "named" access methods; hence, the following are equivalent in every way:

    $ref->attr(T => $title)    <=>   $ref->title($title);
    $ref->attr(A => \@authors) <=>   $ref->author(\@authors);
    $ref->attr(D => undef)     <=>   $ref->date(undef);
    $auth  = $ref->attr('A')   <=>   $auth  = $ref->author;
    @auths = $ref->attr('A')   <=>   @auths = $ref->author;

author, book, city, ... [VALUE]

Instance methods. For every one of the standard fields in a "refer" record, this module has designated a high-level attribute name:

   A  author     G  govt_no      N  number        S  series   
   B  book       I  publisher    O  other_info    T  title     
   C  city       J  journal      P  page          V  volume    
   D  date       K  keywords     Q  corp_author   X  abstract  
   E  editor     L  label        R  report_no    

Then, for each field F with high-level attribute name FIELDNAME, the method FIELDNAME() works as follows:

    $ref->attr('F', @args)     <=>   $ref->FIELDNAME(@args)

Which means:

    $ref->attr(T => $title)    <=>   $ref->title($title);
    $ref->attr(A => \@authors) <=>   $ref->author(\@authors);
    $ref->attr(D => undef)     <=>   $ref->date(undef);
    $auth  = $ref->attr('A')   <=>   $auth  = $ref->author;
    @auths = $ref->attr('A')   <=>   @auths = $ref->author;

See the documentation of attr() for the argument list.

get ATTR

Instance method. Get an attribute, by its one-character name. In an array context, it returns all values (empty if none):

    @authors = $ref->get('A');      # returns list of all authors

In a scalar context, it returns the last value (undefined if none):

    $author = $ref->get('A');       # returns the last author

set ATTR, VALUES...

Instance method. Set an attribute, by its one-character name.

    $ref->set('A', "S. Trurl", "C. Klapaucius");

An empty array of VALUES deletes the attribute:

    $ref->set('A');       # deletes all authors

No useful return value is currently defined.

Output

as_string [OPTSHASH]

Instance method. Return the "refer" record as a string, usually for printing:

    print $ref->as_string;

The options are:

Quick

If true, do it quickly, but unsafely. This does no fixup on the values at all: they are output as-is. That means if you used parser-options which destroyed any of the formatting whitespace (e.g., Newline=TOSPACE with LeadWhite=KILLALL), there is a risk that the output object will be an invalid "refer" record.

The fields are output with %L first (if it exists), and then the remaining fields in alphabetical order. The following "safety measures" are normally taken:

  • Lines longer than 76 characters are wrapped (if possible, at a non-word character a reasonable length in, but there is a chance that they will simply be "split" if no such character is available).

  • Any occurrences of '%' immediately after a newline are preceded by a single space.

These safety measures are slightly time-consuming, and are silly if you are merely outputting a "refer" object which you have read in verbatim (i.e., using the default parser-options) from a valid "refer" file. In these cases, you may want to use the Quick option.
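As a rough picture of what the two safety measures involve, here is a standalone sketch. protect_value() is a hypothetical helper, not the code behind as_string(); it wraps at whitespace only, and it ignores the extra width taken by the field tag on the first line.

```perl
use strict;
use warnings;

# Hypothetical helper: make one field value safe to print after its tag.
sub protect_value {
    my ($value, $width) = @_;
    $width ||= 76;
    my @out;
    for my $line (split /\n/, $value) {
        while (length($line) > $width) {
            if ($line =~ /^(.{1,$width})\s+(\S.*)$/) {
                # Break at the last whitespace that fits within $width.
                push @out, $1;
                $line = $2;
            }
            else {
                # No usable break point: split the line hard.
                push @out, substr($line, 0, $width);
                $line = substr($line, $width);
            }
        }
        push @out, $line;
    }
    my $safe = join "\n", @out;
    # A "%" right after a newline would be read as the start of a field.
    $safe =~ s/\n%/\n %/g;
    return $safe;
}

my $abstract = "Costs rose by 20\n% of the budget, per the report.";
print "%X ", protect_value($abstract), "\n";
```

The second line of the sample value starts with '%', so the sketch prints it with one leading space.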

CLASS Text::Refer::Parser

Instances of this class do the actual parsing.

Parser options

The options you may give to new() are as follows:

ForgiveEOF

Normally, the last record in a file must end with a blank line, or else this module will suspect it of being incomplete and return an error. However, if you give this option as true, it will allow the last record to be terminated by an EOF.

GoodFields

By default, the parser accepts any (one-character) field name that is a printable ASCII character (no whitespace). Formally, this is:

    [\041-\176]

However, when compiling parser options, you can supply your own regular expression for validating (one-character) field names. (Note: you must supply the square brackets; they are there to remind you that you should give a well-formed single-character expression.) One standard expression is provided for you:

    $Text::Refer::GroffFields  = '[A-EGI-LN-TVX]';  # legal groff fields

Illegal fields which are encountered during parsing result in a syntax error.

NOTE: You really shouldn't use this unless you absolutely need to. The added regular expression test slows down the parser.
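To get a feel for what the two character classes above accept, the following sketch tests a few field names against each. field_ok() is a hypothetical helper, and the two strings are local copies of the expressions quoted above rather than the module's own variables.

```perl
use strict;
use warnings;

# Local copies of the two character classes quoted above.
my $any_printable = '[\041-\176]';
my $groff_fields  = '[A-EGI-LN-TVX]';

# Hypothetical helper: is $name a single character accepted by $class?
sub field_ok {
    my ($name, $class) = @_;
    return $name =~ /\A$class\z/ ? 1 : 0;
}

for my $name (qw(T Y %)) {
    printf "%%%s  default: %s  groff: %s\n", $name,
        (field_ok($name, $any_printable) ? "ok" : "rejected"),
        (field_ok($name, $groff_fields)  ? "ok" : "rejected");
}
```

With these classes, a nonstandard name such as Y passes the default test but is rejected by the groff-only expression.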

LeadWhite

In many "refer" files, continuation lines (the 2nd, 3rd, etc. lines of a field) are written with leading whitespace, like this:

    %T Incontrovertible Proof that Pi Equals Three
       (for Large Values of Three)
    %A S. Trurl
    %X The author shows how anyone can use various common household 
       objects to obtain successively less-accurate estimations of 
       pi, until finally arriving at a desired integer approximation,
       which nearly always is three.                 

This leading whitespace serves two purposes: (1) it makes it impossible to mistake a continuation line for a field, since % can no longer be the first character, and (2) it makes the entries easier to read. The LeadWhite option controls what is done with this whitespace:

    KEEP        - default; the whitespace is untouched
    KILLONE     - exactly one character of leading whitespace is removed
    KILLALL     - all leading whitespace is removed

See the section "Notes on the parser options" below for hints and warnings.

Newline

The Newline option controls what is done with the newlines that separate adjacent lines in the same field:

    KEEP        - default; the newlines are kept in the field value
    TOSPACE     - convert each newline to a single space
    KILL        - the newlines are removed

See the section "Notes on the parser options" below for hints and warnings.

Default values will be used for any options which are left unspecified.

Notes on the parser options

The default values for Newline and LeadWhite will preserve the input text exactly.

The Newline=TOSPACE option, when used in conjunction with the LeadWhite=KILLALL option, effectively "word-wraps" the text of each field into a single line.

Be careful! If you use the Newline=KILL option with either the LeadWhite=KILLONE or the LeadWhite=KILLALL option, you could end up eliminating all whitespace that separates the word at the end of one line from the word at the beginning of the next line.
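The interaction between the two options is easiest to see on a concrete value. The sketch below mimics the documented behaviors with plain substitutions; apply_options() is a hypothetical helper operating on an already-extracted field value, not the parser itself.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the documented LeadWhite and Newline
# behaviors to one raw multi-line field value.
sub apply_options {
    my ($value, %opt) = @_;
    my $lead    = $opt{LeadWhite} || 'KEEP';
    my $newline = $opt{Newline}   || 'KEEP';

    # Leading whitespace of a continuation line is what follows a newline.
    $value =~ s/\n[ \t]/\n/g  if $lead eq 'KILLONE';
    $value =~ s/\n[ \t]+/\n/g if $lead eq 'KILLALL';

    $value =~ s/\n/ /g if $newline eq 'TOSPACE';
    $value =~ s/\n//g  if $newline eq 'KILL';
    return $value;
}

my $raw = "Incontrovertible Proof that Pi Equals Three\n"
        . "   (for Large Values of Three)";

# Word-wrapped into a single line:
print apply_options($raw, LeadWhite => 'KILLALL', Newline => 'TOSPACE'), "\n";

# The risky combination: "Three" and "(for" run together.
print apply_options($raw, LeadWhite => 'KILLALL', Newline => 'KILL'), "\n";
```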

Public interface

new PARAMHASH

Class method, constructor. Create and return a new parser. See above for the "parser options" which you may give in the PARAMHASH.

create [CLASS]

Instance method. What class of objects to create. The default is Text::Refer.

input FH

Instance method. Create a new object from the next record in a "refer" stream. The actual class of the object is given by the class() method.

Returns the object on success, '0' on expected end-of-file, and undefined on error.

Having two false values makes parsing very simple: just input() records until the result is false, then check to see if that last result was 0 (end of file) or undef (failure).
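The read-until-false idiom can be tried out without a real file. In this sketch make_reader() is a hypothetical stand-in for a parser's input(): it returns queued values one at a time, so the last queued value (0 or undef) decides how the "stream" ends.

```perl
use strict;
use warnings;

# Hypothetical reader standing in for input(): hands back each queued
# item in turn.
sub make_reader {
    my @queue = @_;
    return sub { return shift @queue };
}

# Read until a false value, then report how the stream ended.
sub drain {
    my ($next) = @_;
    my $count = 0;
    my $ref;
    while ($ref = $next->()) {
        $count++;                       # ...do stuff with $ref...
    }
    return defined($ref)
        ? "end of file after $count record(s)"
        : "error after $count record(s)";
}

print drain(make_reader({ T => ['Cyberiad'] }, { T => ['Invisible Cities'] }, 0)), "\n";
print drain(make_reader({ T => ['Cyberiad'] }, undef)), "\n";
```

Because both 0 and undef end the loop, a single defined() test afterwards is enough to tell a clean end of file from a failure.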

NOTES

Under the hood

Each "refer" object has instance variables corresponding to the actual field names ('T', 'A', etc.). Each of these is a reference to an array of the actual values.

Notice that, for maximum flexibility and consistency (but at the cost of some space and access-efficiency), the semantics of "refer" records do not come into play at this time: since everything resides in an array, you can have as many %K, %D, etc. fields as you like, and give them entirely different semantics.

For example, the Library Of Boring Stuff That Everyone Reads (LOBSTER) uses the unused %Y as a "year" field. The parser accommodates this case by politely not choking on LOBSTER .bibs (although why you would want to eat a lobster bib instead of the lobster is beyond me...).

Performance

Tolerable. On my 90MHz/32 MB RAM/I586 box running Linux 1.2.13 and Perl5.002, it parses a typical 500 KB "refer" file (of 1600 records) as follows:

     8 seconds of user time for input and no output
    10 seconds of user time for input and "quick" output
    16 seconds of user time for input and "safe" output

So, figure the individual speeds are:

    input:            200 records ( 60 KB) per second.
    "quick" output:   800 records (240 KB) per second.
    "safe" output:    200 records ( 60 KB) per second.

By contrast, a C program which does the same work is about 8 times as fast. But of course, the C code is 8 times as large, and 8 times as ugly... :-)
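The per-stage rates quoted above can be rechecked with a few lines of arithmetic, assuming that each output time is the combined time minus the 8 seconds spent on input. The exact quotients for the KB figures come out slightly higher than the rounded values in the table.

```perl
use strict;
use warnings;

# Figures from the timing table above: 1600 records, about 500 KB.
my ($records, $kb) = (1600, 500);

# Seconds attributed to each stage (output = combined time minus input).
my %seconds = (
    'input'          => 8,
    '"quick" output' => 10 - 8,
    '"safe" output'  => 16 - 8,
);

for my $stage ('input', '"quick" output', '"safe" output') {
    printf "%-16s %4d records (%5.1f KB) per second\n",
        $stage, $records / $seconds{$stage}, $kb / $seconds{$stage};
}
```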

Note to serious bib-file users

I actually do not use "refer" files for *roffing... I used them as a quick-and-dirty database for WebLib, and that's where this code comes from. If you're a serious user of "refer" files, and this module doesn't do what you need it to, please contact me: I'll add the functionality in.

BUGS

Some combinations of parser-options are silly.

CHANGE LOG

$Id: Refer.pm,v 1.106 1997/04/22 18:41:41 eryq Exp $

Version 1.101

Initial release. Adapted from Text::Bib.

AUTHOR

Copyright (C) 1997 by Eryq, eryq@enteract.com, http://www.enteract.com/~eryq.

NO WARRANTY

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

For a copy of the GNU General Public License, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.