NAME

Geo::Coder::US::Import - Import TIGER/Line data into a Geo::Coder::US database

SYNOPSIS

  use Geo::Coder::US::Import;

  Geo::Coder::US->set_db( "/path/to/geocoder.db", 1 );

  Geo::Coder::US::Import->load_tiger_data( "TGR06075" );

  Geo::Coder::US::Import->load_fips_data( "All_fips55.txt" );

DESCRIPTION

Geo::Coder::US::Import provides methods for importing TIGER/Line data into a BerkeleyDB database for use with Geo::Coder::US.

Instead of using this module directly, you may want to use one of the utility scripts included in the eg/ directory of this distribution. The import_tiger.pl script imports uncompressed TIGER/Line files from a given location:

  $ perl eg/import_tiger.pl geocoder.db /path/to/tiger/files/TGRnnnnn

Be sure to leave off the .RT? extensions or import_tiger.pl will complain.

The import_tiger_zip.pl script imports compressed TIGER/Line data by using Archive::Zip to extract only the needed files from the ZIP file into a temporary directory, which it cleans up for you afterwards. This is the preferred method of data import, as it can handle multiple ZIP files at once:

  $ perl eg/import_tiger_zip.pl geocoder.db /path/to/tiger/zips/*.zip

Both of these import scripts need to cache a lot of data in memory, so you may find that you need one or two hundred megs of RAM for the import to run to completion. The import process takes about 6 hours to import all 4 gigabytes of compressed TIGER/Line data on a 2 GHz Linux machine, and it appears to be mostly processor bound. The final BerkeleyDB database produced by such an import tops out around 750 megabytes.

One way of avoiding the RAM bloat on import is to use xargs to run import_tiger_zip.pl on each TIGER/Line ZIP separately:

  $ find ~/tiger -name '*.zip' | \
        xargs -n1 perl eg/import_tiger_zip.pl geocoder.db

Similarly, you can import FIPS-55 place name data into a Geo::Coder::US database with eg/import_fips.pl:

  $ perl eg/import_fips.pl geocoder.db All_fips55.txt

Note that you can make a perfectly good geocoder for a particular region of the US by simply importing only the TIGER/Line and FIPS-55 files for the region you're interested in. You only need to import all of the TIGER/Line data sets in the event that you want a geocoder for the whole US.
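A regional import mostly comes down to choosing which files to feed to the import scripts. The following is a minimal sketch, not part of this distribution. It assumes the downloaded files keep the TGRnnnnn naming used elsewhere in this document, where the first two digits are the state FIPS code. The helper name files_for_states() and the example paths are hypothetical.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical helper: keep only the files whose name starts with TGR
# followed by one of the given two-digit state FIPS codes.
# Assumes TGRnnnnn naming (state code first, then county code).
sub files_for_states {
    my ( $states, @files ) = @_;
    my %wanted = map { $_ => 1 } @$states;
    return grep {
        my $name = basename($_);
        $name =~ /^TGR(\d{2})\d{3}\b/i && $wanted{$1};
    } @files;
}

# Example paths only; 06 is the state FIPS code for California.
my @all = (
    "/path/to/tiger/zips/tgr06075.zip",
    "/path/to/tiger/zips/tgr41051.zip",
);
my @california = files_for_states( ["06"], @all );

# The selected files could then be passed to eg/import_tiger_zip.pl.
print "$_\n" for @california;
```

The same filter works on uncompressed basenames, since it looks only at the leading TGRnnnnn part of each file name.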

CLASS METHODS

load_tiger_data( $tiger_basename )

Loads all data from the specified TIGER/Line data set in order of the following record types: C, 5, 1, 4, 6. This ordering ensures that record references are set correctly. You may prefix $tiger_basename with an absolute or relative path, but do not provide the .RT? filename suffix as part of $tiger_basename or load_tiger_data() will become cranky.

Note that you must first call Geo::Coder::US->set_db() with a true value as its second argument, or set_db() won't open the database for writing.
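Because load_tiger_data() wants a basename without the .RT? suffix, it can help to derive the basenames from a list of record type files. This is a minimal sketch under that assumption. The helper tiger_basenames() is hypothetical and not part of Geo::Coder::US::Import, and the paths are examples only.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a list of TIGER/Line file paths to the unique
# basenames (path minus the .RT? suffix), in first-seen order.
sub tiger_basenames {
    my @paths = @_;
    my ( %seen, @basenames );
    for my $path (@paths) {
        # Skip anything that is not a record type file such as TGR06075.RT1.
        next unless $path =~ /^(.*)\.RT[0-9A-Z]$/i;
        push @basenames, $1 unless $seen{$1}++;
    }
    return @basenames;
}

my @basenames = tiger_basenames(
    "/data/tiger/TGR06075.RT1",
    "/data/tiger/TGR06075.RT6",
    "/data/tiger/TGR06001.RTC",
);
print "$_\n" for @basenames;

# With the database opened for writing, each basename could then be loaded:
#   Geo::Coder::US->set_db( "/path/to/geocoder.db", 1 );
#   Geo::Coder::US::Import->load_tiger_data( $_ ) for @basenames;
```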

load_fips_data( $fips_file )

Loads all the data from the specified FIPS-55 gazetteer file. This provides additional or alternate place name data to supplement TIGER/Line.

load_rtC( $tiger_basename )
load_rt5( $tiger_basename )
load_rt1( $tiger_basename )
load_rt4( $tiger_basename )
load_rt6( $tiger_basename )

Each of these methods loads all records from the TIGER/Line record type specified, with the following exceptions: Type C data is only loaded for records with a FIPS-55 class code beginning with C, D, E, F, T, U or Z (i.e. inhabited places). Type 1 data is only loaded for records with a Census Feature Class Code beginning with A (i.e. street data). Also, Type 1 data for which no feature name or FIPS place and/or county subdivision is found are not loaded. Finally, Type 6 data lacking a matching Type 1 record in the database are not loaded.

You may prefix $tiger_basename with an absolute or relative path, but do not provide the .RT? filename suffix as part of $tiger_basename or the load_rt*() methods will become cranky.
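The loading exceptions described above can be read as simple predicates. The sketch below is illustrative only and is not the module's actual code. The function names and the record keys (cfcc, name, place, cousub) are hypothetical, and it takes one reading of the Type 1 rule: a record needs a feature name and at least one of FIPS place or county subdivision.

```perl
use strict;
use warnings;

# Type C: keep records whose FIPS-55 class code begins with
# C, D, E, F, T, U or Z.
sub keep_type_c {
    my ($class_code) = @_;
    return defined $class_code && $class_code =~ /^[CDEFTUZ]/ ? 1 : 0;
}

# Type 1: keep records whose Census Feature Class Code (CFCC) begins with A
# and that carry a feature name plus a FIPS place and/or county subdivision.
# The hash keys here are assumptions made for this example.
sub keep_type_1 {
    my (%rec) = @_;
    return 0 unless defined $rec{cfcc} && $rec{cfcc} =~ /^A/;
    return 0 unless defined $rec{name} && length $rec{name};
    return 0 unless $rec{place} || $rec{cousub};
    return 1;
}

# Made-up sample record for demonstration.
my %street = ( cfcc => "A41", name => "Mission", place => "67000" );
print "load street record\n" if keep_type_1(%street);
print "load place record\n"  if keep_type_c("C1");
```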

BUGS

The import throws away probably useful data on the assumption that it's not. Similarly, it imports a lot of data you may never use. Mea culpa. Patches welcome.

Also, you will occasionally encounter errors from your DBI driver about duplicate keys for certain records. I think the TIGER/Line data has the odd duplicated TLID in Record Type 1, even though it's not supposed to. These errors are annoying but not fatal, and can probably be ignored.

The import process can take up huge amounts of RAM. Be forewarned. If anyone really needs it, the data cached in memory by the import process could be buffered to disk, but this would slow down the import process considerably (I think). Contact me if you really want to try this -- it might be faster for you to just download a binary version of the fully imported database.

Right now, I can't afford to make the full 750 megabyte database freely downloadable from my website -- the bandwidth charges would eat me alive. Contact me if you can offer funding or mirroring.

SEE ALSO

Geo::Coder::US(3pm), Geo::StreetAddress::US(3pm), Geo::TigerLine(3pm), Geo::Fips55(3pm), DB_File(3pm), Archive::Zip(3pm)

eg/import_tiger.pl, eg/import_tiger_zip.pl, eg/import_fips.pl

You can download the latest TIGER/Line data (as of this writing) from:

http://www.census.gov/geo/www/tiger/tiger2004fe/tgr2004fe.html

You can get the latest FIPS-55 data from:

http://geonames.usgs.gov/fips55/fips55.html

If you have copious spare time, you can slog through the TIGER/Line 2003 and FIPS-55-3 technical manuals:

http://www.census.gov/geo/www/tiger/tiger2003/TGR2003.pdf

http://www.itl.nist.gov/fipspubs/fip55-3.htm

The TIGER/Line 2004 FE schema is more or less unchanged from 2003.

Finally, a few words about FIPS-55-3 class codes:

http://geonames.usgs.gov/classcode.html

APPRECIATION

Considerable thanks are due to Michael Schwern <schwern@pobox.com> for writing the very useful Geo::TigerLine package, which does all the heavy lifting for this module.

AUTHOR

Schuyler Erle <schuyler@nocat.net>

LICENSE

See Geo::Coder::US(3pm) for licensing details.