NAME

Lingua::EN::Gender - Guesses author's gender by analyzing text.

SYNOPSIS

  use Lingua::EN::Gender;

  my $text = "These are the days that try men's souls.";

  my $lingua = Lingua::EN::Gender->new($text);
  print $lingua->gender . "\n";  # male

ABSTRACT

  Lingua::EN::Gender guesses an author's gender by analyzing text
  using the Koppel-Argamon algorithm.

DESCRIPTION

This Perl module implements the Koppel-Argamon algorithm for guessing an author's gender. The algorithm was invented by Moshe Koppel (Bar-Ilan University, Israel) and Shlomo Argamon (Illinois Institute of Technology), and is described at:

  http://www.nytimes.com/2003/08/10/magazine/10wwln-test.html

ALGORITHM

Count the number of words in the document.

For each appearance of the following words, add the points indicated:

  "the"                    17
  "a"                       6
  "some"                    6
  number                    5
  "it"                      2
  "with"                  -14

  possessives,
    ending in "'s"         -5
    pronouns               -3

  "for"                    -4
  "not"                    -4
  word ending with "n't"    4

If the total score is greater than the total number of words, the author is probably male. Otherwise, the author is probably female.
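
The scoring rules above can be sketched in a few lines of Perl. This is only an illustration under stated assumptions, not the module's actual code: the helper names score_text and guess_gender, the tokenizing regex, and the short lists of possessive pronouns and number words are all hypothetical, since the source does not enumerate them.

```perl
use strict;
use warnings;

# Points for individual words, taken from the table above.
my %WORD_POINTS = (
    the  => 17,
    a    => 6,
    some => 6,
    it   => 2,
    with => -14,
    for  => -4,
    not  => -4,
);

# Assumed lists for illustration; the real module may use different ones.
my %POSSESSIVE_PRONOUN = map { $_ => 1 }
    qw(my your his her its our their mine yours hers ours theirs);
my %NUMBER_WORD = map { $_ => 1 }
    qw(one two three four five six seven eight nine ten);

# Hypothetical helper: returns (score, word count) for a text.
sub score_text {
    my ($text) = @_;

    # A word is a run of letters/digits with an optional apostrophe suffix,
    # so "men's" is one word and "well-known" is two.
    my @words = lc($text) =~ /[a-z0-9]+(?:'[a-z]+)?/g;

    my $score = 0;
    for my $word (@words) {
        if    (exists $WORD_POINTS{$word})              { $score += $WORD_POINTS{$word} }
        elsif ($word =~ /^\d+$/ || $NUMBER_WORD{$word}) { $score += 5 }
        elsif ($word =~ /n't$/)                         { $score += 4 }
        elsif ($POSSESSIVE_PRONOUN{$word})              { $score -= 3 }
        elsif ($word =~ /'s$/ && $word ne "it's")       { $score -= 5 }
    }
    return ($score, scalar @words);
}

# Hypothetical helper: applies the "score greater than word count" rule.
sub guess_gender {
    my ($text) = @_;
    my ($score, $count) = score_text($text);
    return $score > $count ? 'male' : 'female';
}
```

For the sentence in the SYNOPSIS, this sketch finds eight words and a score of 12 ("the" adds 17, "men's" subtracts 5), so it guesses male.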

IMPLEMENTATION

The algorithm is fairly straightforward, although there are a few twists and turns. My implementation does the following:

  * Counts hyphenated words as two words.

  * Knows that "it's" is a contraction, not a possessive.

  * Recognizes "fourty," a common misspelling of "forty," as a number.

The biggest complication in my implementation is how it handles numbers. When one number directly follows another, the pair is scored as a single number, although it still counts as two words. For example:

  one hundred

is counted as one number (with a score of 5) and two words. My implementation does not handle the following situation correctly:

  First one.  Two next.

Because "Two" directly follows "one," this is counted as one number (score 5) and four words, when it should be two numbers (score 10) and four words. Handling such cases would not be difficult, but I was lazy, and I doubt it would make much of a difference. Maybe in the next version.
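
The number handling described in this section, and one possible fix for the sentence-boundary case, might look like the following. This is a sketch under assumptions, not the module's code: count_numbers, its break_on_sentence option, and the number-word list are hypothetical names introduced for illustration.

```perl
use strict;
use warnings;

# Assumed number words for illustration only.
my %NUMBER_WORD = map { $_ => 1 }
    qw(one two three four five six seven eight nine ten
       forty fourty hundred thousand);

# Hypothetical helper: counts runs of adjacent numbers, each run scoring once.
# With break_on_sentence => 1, sentence punctuation ends the current run.
sub count_numbers {
    my ($text, %opt) = @_;

    # Tokens are either words or sentence-ending punctuation marks.
    my @tokens = lc($text) =~ /[a-z0-9]+(?:'[a-z]+)?|[.!?]/g;

    my ($numbers, $in_run) = (0, 0);
    for my $token (@tokens) {
        if ($token =~ /^[.!?]$/) {
            $in_run = 0 if $opt{break_on_sentence};
            next;
        }
        if ($token =~ /^\d+$/ || $NUMBER_WORD{$token}) {
            $numbers++ unless $in_run;   # only the first number in a run counts
            $in_run = 1;
        }
        else {
            $in_run = 0;
        }
    }
    return $numbers;
}

print count_numbers("one hundred"), "\n";
print count_numbers("First one.  Two next."), "\n";
print count_numbers("First one.  Two next.", break_on_sentence => 1), "\n";
```

Without the option, "one" and "Two" fall into the same run, which reproduces the behavior described above. With break_on_sentence set, the period ends the run and the two numbers are counted separately. A side effect of this simple approach is that a decimal such as "3.5" would be split at the period, so a real fix would need a more careful tokenizer.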

SEE ALSO

McGrath, Charles. "Sexed Texts." New York Times Magazine, August 10, 2003. http://www.nytimes.com/2003/08/10/magazine/10WWLN.html

Ball, Philip. "Computer program detects author gender." Nature, July 18, 2003. http://www.nature.com/nsu/030714/030714-13.html

I first discovered this work at:

  http://www.bookblog.net/gender/genie.html

AUTHOR

Eugene Eric Kim, <eekim@blueoxen.org>

COPYRIGHT AND LICENSE

Copyright (c) Blue Oxen Associates 2003. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.