The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Lingua::Treebank::Const - Object modeling constituent from a treebank

SYNOPSIS

  use Lingua::Treebank::Const;

  my $text = <<EOTREE
  (S
    (NP-SBJ (DT this) )
    (VP (VBZ is)
      (NP-PRD (NNP Lisa) ))
    (. .) )
  TREE

  my $utt = Lingua::Treebank::Const->new->from_penn_string($text)

  print $utt->as_penn_text(), "\n";;

Results:

  (S
      (NP-SBJ
          (DT this))
      (VP
          (VBZ is)
          (NP-PRD
              (NNP Lisa) ))
      (. .))

This is configurable (TO DO: document how so).

ABSTRACT

  module defines methods for accessing syntactic constituents; it
  identifies its parents and its children, and can write itself out in
  a variety of formats (currently Penn treebank style).

DESCRIPTION

Module for describing simple constituents of the Penn Treebank. Recursive behaviors are implied.

Note assumption that terminal nodes (those with defined word values) will not have children, and vice versa. This assumption is currently unchecked by the code.

For a number of these methods, the jargonish notion of domination plays a large role, so for those who might not know:

a node A dominates another node B if B is a descendant of A.

Class methods

new

Constructs a new (uninitialized) token. If starting from text, can be used together with the from_penn_string initialization method, as below:

  my $text = <<EOTREE
  (S
    (NP-SBJ (DT this) )
    (VP (VBZ is)
      (NP-PRD (NNP Lisa) ))
    (. .) )
  TREE

  my $utt = Lingua::Treebank::Const->new->from_penn_string($text)

Otherwise, resulting new unit will have no values (parent, children, tag or word set by default.

Instance methods

creation methods

These methods help to populate the fields of these objects from external data.

from_penn_string

given a string of the Penn format, e.g.,

  (S
    (NP-SBJ (DT this) )
    (VP (VBZ is)
      (NP-PRD (NNP Lisa) ))
    (. .) )

populates the current node with tag S and the children field with new objects (tag NP, tag VP, and tag .). This method recurses on new and from_penn_string to do its job.

simple attributes

tag

Records the tag of this constituent (0-domination). (This is the part of the constituent-label before the hyphen -- after the hyphen is the annot, not the tag).

TO DO: example here.

annot

Returns whatever comes after the hyphen in the constituent label.

word

If this constituent is terminal, then word should contain the lexical item that is represented.

text

A string containing the word values of the terminal nodes donminated by this constituent. For example, calling text on a node created from the Penn text given in the description of the new function returns the string "this is lisa .".

parent

Returns the parent of the current node.

children

Returns a reference to an array of Lingua::Treebank::Const objects that are the children of the current node.

Currently does not check whether word is populated.

methods about parentage

These methods ask questions about the dominating ancestors and direct children of the current node. Think of them as navigating up-and-down the tree.

is_terminal

No arguments

Returns whether self is a leaf. Does not check whether children are populated; if automatically generated from the from_penn_string method then this will always be correct.

is_root

No arguments. Boolean. Returns whether the instance is a root node (has no parents).

root

No arguments.

Returns the root node for the instance in question (might be itself)

path_up_to

Takes an ancestor node as argument.

Returns a list of all the nodes (distal first) between the instance and the root.

Returns undefined and carps when the given node is not an ancestor of the instance.

is_descendant_of

Takes presumed ancestor as argument.

Returns whether the ancestor is indeed an ancestor of the current instance.

is_ancestor_of

Takes presumed descendant as argument.

Returns whether current instance is an ancestor of the presumed descendant.

is_sibling

Takes presumed sibling as argument.

Returns whether current instance shares an immediate parent with the presumed sibling.

height

No arguments.

Returns the farthest distance from the current node to a terminal node.

depth

No arguments.

Returns the distance from the instance to the root.

depth_from

what's the distance from the current node up to the node given as argument? (return undefined if the node given is not the ancestor of the instance node)

methods about siblings

These methods ask questions about siblings, and left-right movement in the tree. Think of them as moving left-and-right around in the tree.

get_index

One argument (daughter).

Returns the index of the daughter in the instance's children list. Zero-based, of course.

prev_sib
next_sib

No arguments. Returns next (or previous) sibling at the same level (dependent on the same parent), or the empty list if no such leaf exists.

prev_leaf
next_leaf

No arguments. Returns the leaf just before (or after) any of the leaves of this node, or the empty list if no such leaf exists.

left_leaf
right_leaf

No arguments. Returns leftmost (rightmost) leaf dominated by the instance.

get_all_terminals

No arguments. Returns left-to-right list of all terminal nodes at or below the current node.

find_common_ancestor

One argument: a presumed cousin.

returns the lowest ancestor the instance and the cousin share (or undefined if they do not share an ancestor)

select_ancestors
select_children

Both these methods take a subroutine as an argument and return those [child/ancestor] nodes that return true when the sub is called with the node as an argument.

The expectation is that the sub will not modify the node.

methods about structural comparison

These methods are ways of exposing and comparing regions of local structure.

equiv_to

Tests whether the argument has the same structure (and words) as the instance. These need not be the same object, just the same tags and words in the same structure.

equiv_tags
equiv_words

Handy -- and unimplemented -- shorthands for checking certain kinds of matching structure.

methods about tree structure modification

detach

Argument is DAUGHTER node.

Removes the DAUGHTER from the children list of the current instance. DAUGHTER node will still be a valid node, but it will no longer have a parent; it will be a root.

Note that detach may leave a degenerate tree: it may have no terminal node (one with words) at the end of a branch. To avoid this, use wither instead.

wither

No arguments.

Detaches self from parent. self will become an independent root. If the parent has no other children, will recursively call parent-wither>, making a possibly zero-length list of degenerate roots above it until an ancestor has a different child than the one in this line of descent.

         A                   A
        / \          C   B   |
       B   X        / \      X
      /     \   => D   E     |
     C       Y               Y
    / \
   D   E
           Before    After

        calling C->wither()
prepend
append

Arguments are a LIST of new daughters to be inserted at the beginning/end of the children list.

replace

Arguments are (DAUGHTER, LIST). Replaces DAUGHTER with the elements of LIST in the children of the current instance.

DAUGHTER is now its own root; see detach.

flatten

pull up all terminals to be children of the instance, regardless of how deep they are. Re-attach them to the current node, preserving leaf order.

  A->flatten()

       /        /
      A   ==>  A__
     / \      /|\ \
    X   B    C F D G
   /|\   \
  C F D   E
           \
            G
retract

pulls in and removes one non-terminal node (which node is specified by argument), attaching its children directly to the current node, retaining what surface order the children originally had, e.g.:

   A->retract(X)

       /        /
      A   ==>  A
     / \      /|\
    X   B    C D B
   / \   \    / \ \
  C   D   E  F   G E
     / \
    F   G
detach_at

Argument is INDEX. Removes the daughter at INDEX. Will carp if there is no daughter at INDEX.

The daughter at INDEX remains well-formed, though if you do not maintain your own pointer to it, it will probably be collected by the garbage collector.

insert_at

Arguments are INDEX, LIST of daughters. LIST daughters will be inserted beginning at position INDEX of the current instances children.

utility methods

These methods are methods that may (or not) be useful in programming with these objects. These methods are used internally, but are exposed for the programmer who might need them.

stringify overloading is certainly helpful in debugging, since the perl debugger representation of these objects is complicated by their up-reference to parents.

as_penn_text

Returns a text string representing this constituent.

To do: document additional parameters to this, and the possible effects of changing them

$Lingua::Treebank::Const::CHILD_PROLOG
$Lingua::Treebank::Const::INDENT_CHAR
$Lingua::Treebank::Const::CHILD_EPILOG
stringify

This is the method called by default when the object handle is used in a string (see perldoc overload).

Depending on the value of $Lingua::Treebank::Const::STRINGIFY (see below), the string representation of the object varies. The default behavior is as_penn_text, above.

Note that like any object-ref, copying its stringification does NOT work to retain all its behaviors. Nor does an identical string representation necessarily mean the two objects are the same object; merely, that they have the same structure. (see equiv_to).

numerify

This is the mthod called by default when the object handle is used in a numeric context (usually == or !=).

Returns an integer representing the unique object. Identity on this method does indicate identity of the objects.

Rarely used in client code. The numeric inequality operators are unlikely to have any useful meaning on these objects, though they should behave consistently (you should get consistent answers given any two objects, regardless of methods called on those objects).

power user methods

walk ( &action, &stop_crit, $state, $bf_traversal )

An instance method. &action argument is required, others are optional.

Calls &action (a subroutine ref) as a method on node and its children, recursively, passing the node under consideration and the $state value (if provided).

If &stop_crit is defined, calls it on each node; when &stop_crit returns true, children of that node are not pursued.

For both action and stop_crit commands, if a string is passed, it will be called if a method by that name can be found in the object.

$state is passed into each of the child method calls. This is convenient for things like pushing interesting elements onto a list, or updating a counter. It must be a scalar, but can be a reference.

Passing a true value as $bf_traversal tells walk() to explore the tree breadth-first rather than depth-first. passing a false (but defined) value forces depth-first. Undefined values default to the value of $Lingua::Treebank::Const::BF_TRAVERSAL, which is undef (false) -- and thus depth-first by default.

  # find out how many children each NP has, but don't count anything
  # inside an EDITED node
  my $action = sub {
      my ($self, $state) = @_;
      return unless $self->tag() eq 'NP';

      # just print it
      print scalar @{$self->children}, "\n";

      # or store it in the state variable
      push @{$state}, scalar @{$self->children()};
    };

  my $stop_crit = sub {$_[0]->tag() eq 'EDITED'};

  $tree->walk( $action, $stop_crit, \@counts );

  use List::Util 'sum';
  print "there were ", sum (@counts),
        " total children of NP nodes\n";

Class variables

$Lingua::Treebank::Const::BF_TRAVERSAL

Defaults to undefined. If true, changes the default behavior of the walk() method to be breadth-first rather than depth-first.

$Lingua::Treebank::STRINGIFY

Changes the default stringification of the objects. Can be set to any of the following three values:

as_penn_text

default value.

e.g.

  (S
    (NP
      (NNP Joe)
    )
    (VP
      (VB likes)
      (NP
        (NNP Bach)
      )
    )
    (. .)
  )
words

e.g.

  Joe likes Bach .
preterm_tags

e.g.,

  NNP VB NNP .

To Do

check that destroy doesn't leak (undo parent links?)

dump as latex tree

read in other treebank formats (latex trees?)

EXPORT

None by default.

HISTORY

0.01

Original version; created by h2xs 1.22 with options

  -CAX
        Lingua::Treebank::Const
0.02
Improved comparison code by caching numerify results.

Should give minor speed improvements for data that works with the same tree over more than one operation. Little if any degradation (tiny increase in size) for those who only use each tree once.

Improved documentation.

Now lists all instance methods. Instance method documentation also organized better -- now falls into categories.

0.03
new interface variable

added $VERBOSE variable for suppressing non-fatal errors.

improved parsing

now copes with examples like (e.g.):

  ((FRAG (FOO bar))

critically, earlier versions failed when the tag was empty and not followed by whitespace

0.08
added new methods
select_ancestors
select_children
is_empty_root
walk
new interface variable

added $BF_TRAVERSAL for changing walk() method defaults

0.09
added new methods

TODO: document these, add test cases, update version number

edges

now with new ignore feature!

shared_edges
list_constituents
0.16

Version number jump to keep up with Lingua::Treebank

SEE ALSO

Documentation for Penn treebank http://www.cis.upenn.edu/~treebank/.

AUTHOR

Jeremy Gillmor Kahn, <kahn@cpan.org>

COPYRIGHT AND LICENSE

Copyright 2003 by Jeremy Gillmor Kahn

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.