NAME

Text::Record::Deduper - Separate complete, partial and near duplicate text records

SYNOPSIS

    use Text::Record::Deduper;

    my $deduper = Text::Record::Deduper->new;

    # Find and remove entire lines that are duplicated
    $deduper->dedupe_file("orig.txt");

    # Dedupe comma separated records, duplicates defined by several fields
    $deduper->field_separator(',');
    $deduper->add_key(field_number => 1, ignore_case => 1 );
    $deduper->add_key(field_number => 2, ignore_whitespace => 1);
    # unique records go to file names_uniqs.csv, dupes to names_dupes.csv
    $deduper->dedupe_file('names.csv');

    # Find 'near' dupes by allowing for given name aliases
    my %nick_names = (Bob => 'Robert',Rob => 'Robert');
    my $near_deduper = Text::Record::Deduper->new();
    $near_deduper->add_key(field_number => 2, alias => \%nick_names) or die;
    $near_deduper->dedupe_file('names.txt');

    # Create a text report, names_report.txt to identify all duplicates
    $near_deduper->report_file('names.txt',all_records => 1);

    # Find 'near' dupes in an array of records, returning references 
    # to a unique and a duplicate array
    my ($uniqs,$dupes) = $near_deduper->dedupe_array(\@some_records);

    # Create a report on unique and duplicate records
    $deduper->report_file("orig.txt",all_records => 0);

DESCRIPTION

This module allows you to take a text file of records and split it into a file of unique and a file of duplicate records. Deduping of arrays is also possible.

Records are defined as a set of fields. Fields may be separated by spaces, commas, tabs or any other delimiter. Records are separated by a new line.

If no options are specified, a record is treated as a duplicate only when all of its fields (the entire line) are duplicated.
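The default behaviour can be pictured with a few lines of plain Perl. This is only a minimal sketch of the idea, not the module's implementation; the helper name split_whole_line_dupes is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Record::Deduper: split records into
# first occurrences and later repeats, using the whole line as the key.
sub split_whole_line_dupes {
    my ($records) = @_;
    my (%seen, @uniqs, @dupes);
    for my $record (@$records) {
        if ($seen{$record}++) {
            push @dupes, $record;    # line already seen: a duplicate
        }
        else {
            push @uniqs, $record;    # first time this line appears
        }
    }
    return (\@uniqs, \@dupes);
}

my ($uniqs, $dupes) = split_whole_line_dupes([ 'a,1', 'b,2', 'a,1' ]);
print "unique: @$uniqs\n";
print "dupes:  @$dupes\n";
```

Because the comparison is on the whole line, 'a,1' and 'A,1' would count as different records here.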

By specifying options, you define which fields, or partial fields, make up the key that must not occur more than once in the data. There are also options to ignore case and to ignore leading and trailing white space.

Additionally 'near' or 'fuzzy' duplicates can be defined. This is done by creating aliases, such as Bob => Robert.

This module is useful for finding duplicates that have been created by multiple data entry or by merging similar records.

METHODS

new

The new method creates an instance of a deduping object. This must be called before any of the following methods are invoked.

field_separator

Sets the token to use as the field delimiter. Accepts any character, as well as Perl escaped characters such as "\t". If this method is not called, the deduper assumes you have fixed width fields.

    $deduper->field_separator(',');
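To illustrate what a field separator means for a record, the following sketch splits a line and returns one field by number. It is an assumption about the general approach rather than the module's own code: get_field is a hypothetical helper, and it does not handle quoted fields that contain the separator.

```perl
use strict;
use warnings;

# Hypothetical helper: return field number $n (counting from 1) of a
# character separated record. \Q...\E quotes the separator so that
# characters such as '|' or '.' are matched literally.
sub get_field {
    my ($line, $separator, $n) = @_;
    my @fields = split /\Q$separator\E/, $line, -1;   # -1 keeps trailing empty fields
    return $fields[$n - 1];
}

print get_field("100\tBob\tSmith", "\t", 2), "\n";   # tab separated record
print get_field('100|Bob|Smith',   '|',  3), "\n";   # pipe separated record
```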

add_key

Lets you add a field to the definition of a duplicate record. If no keys have been added, the entire record will become the key, so that only records duplicated in their entirety are removed.

    $deduper->add_key
    (
        field_number => 1, 
        key_length => 5, 
        ignore_case => 1,
        ignore_whitespace => 1,
        alias => \%nick_names
    );

field_number

Specifies the number of the field in the record to add to the key (1,2 ...). Note that this option only applies to character separated data. You will get a warning if you try to specify a field_number for fixed width data.

start_pos

Specifies the position of the field in characters to add to the key. Note that this option only applies to fixed width data. You will get a warning if you try to specify a start_pos for character separated data. You must also specify a key_length.

Note that the first column is numbered 1, not 0.
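Since start_pos counts from 1 while Perl's substr counts from 0, an off-by-one slip is easy to make. The sketch below shows the conversion; fixed_width_field is a hypothetical helper used for illustration, not a method of this module.

```perl
use strict;
use warnings;

# Hypothetical helper: extract a fixed width field. $start_pos counts from 1,
# as described above, while substr counts from 0.
sub fixed_width_field {
    my ($line, $start_pos, $key_length) = @_;
    return substr($line, $start_pos - 1, $key_length);
}

# Build a record with a 4 character id column and two 9 character name columns.
my $line = sprintf('%-4s%-9s%-9s', '100', 'Bob', 'Smith');

print '[', fixed_width_field($line, 5, 9),  "]\n";   # given name column, padding included
print '[', fixed_width_field($line, 14, 9), "]\n";   # surname column, padding included
```

The extracted values still carry their padding, which is where ignore_whitespace becomes useful for fixed width data.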

key_length

The length of a key field. This must be specified if you are using fixed width data (along with a start_pos). It is optional for character separated data.
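For character separated data, a key_length presumably limits the comparison to the leading characters of the field. The sketch below works under that assumption; truncate_key is a hypothetical helper and the module may behave differently in detail.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the first $key_length characters of a field.
# An undefined $key_length means "use the whole field".
sub truncate_key {
    my ($value, $key_length) = @_;
    return $value unless defined $key_length;
    return substr($value, 0, $key_length);
}

print truncate_key('Smithson', 5), "\n";   # truncated to five characters
print truncate_key('Smith',    5), "\n";   # already five characters
print truncate_key('Smythe'),      "\n";   # no key_length, whole field kept
```

Under this assumption, 'Smithson' and 'Smith' would produce the same key when key_length is 5.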

ignore_case

When defining a duplicate, ignore the case of characters, so Robert and ROBERT are equivalent.

ignore_whitespace

When defining a duplicate, ignore white space that leads or trails a field's data.
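The effect of ignore_case and ignore_whitespace can be sketched as a normalisation step applied to a field before comparison. The helper normalise_field is hypothetical and only mirrors the option names; it is not taken from the module's source.

```perl
use strict;
use warnings;

# Hypothetical helper: normalise a field value before it is compared.
# The option names mirror the add_key options described above.
sub normalise_field {
    my ($value, %opt) = @_;
    if ($opt{ignore_whitespace}) {
        $value =~ s/^\s+//;    # strip leading white space
        $value =~ s/\s+$//;    # strip trailing white space
    }
    $value = lc $value if $opt{ignore_case};
    return $value;
}

my $left  = normalise_field('  ROBERT ', ignore_case => 1, ignore_whitespace => 1);
my $right = normalise_field('Robert',    ignore_case => 1, ignore_whitespace => 1);
print $left eq $right ? "same key\n" : "different keys\n";
```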

alias

When defining a duplicate, allow for alias substitution. For example:

    my %nick_names = (Bob => 'Robert',Rob => 'Robert');
    $near_deduper->add_key(field_number => 2, alias => \%nick_names) or die;

Whenever field 2 contains 'Bob', it will be treated as a duplicate of a record where field 2 contains 'Robert'.
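Alias substitution amounts to a hash lookup that maps a value to its canonical form. The following is a minimal sketch of that idea; apply_alias is a hypothetical helper, and its exact, case sensitive match is an assumption rather than documented behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: replace a value with its canonical form when the
# alias hash has an entry for it, otherwise return it unchanged.
sub apply_alias {
    my ($value, $alias) = @_;
    return exists $alias->{$value} ? $alias->{$value} : $value;
}

my %nick_names = (Bob => 'Robert', Rob => 'Robert');
print apply_alias($_, \%nick_names), "\n" for qw(Bob Rob Robert John);
```

After substitution, 'Bob', 'Rob' and 'Robert' all compare equal, while 'John' is left alone.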

dedupe_file

This method takes a file name basename.ext as its only argument. The file is processed to detect duplicates, as defined by the methods above. Unique records are placed in a file named basename_uniqs.ext and duplicates in a file named basename_dupes.ext. Note that if either of these output files exists, it is overwritten. The original file is left intact.

    $deduper->dedupe_file("orig.txt");
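Because existing output files are overwritten, it can help to work out their names before calling dedupe_file. The sketch below derives them with the core File::Basename module; output_file_names is a hypothetical helper, and the _uniqs and _dupes suffixes are taken from the comments in the SYNOPSIS and EXAMPLES.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical helper: derive the output file names for an input file.
# The _uniqs/_dupes suffixes follow the SYNOPSIS comments.
sub output_file_names {
    my ($input) = @_;
    # Split into base name, directory and the last extension (for example ".csv").
    my ($base, $dir, $ext) = fileparse($input, qr/\.[^.]*/);
    return ("${dir}${base}_uniqs$ext", "${dir}${base}_dupes$ext");
}

my ($uniq_file, $dupe_file) = output_file_names('data/names.csv');
print "$uniq_file\n$dupe_file\n";
warn "$uniq_file already exists and would be overwritten\n" if -e $uniq_file;
```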

dedupe_array

This method takes an array reference as its only argument. The array is processed to detect duplicates, as defined by the methods above. Two array references are returned, the first to the set of unique records and the second to the set of duplicates.

Note that the memory constraints of your system may prevent you from processing very large arrays.

    my ($unique_records,$duplicate_records) = $deduper->dedupe_array(\@some_records);

report_file

This method takes a file name basename.ext as its initial argument.

A text report is produced with the following columns:

    record number : the line number of the record

    key : the key values that define record uniqueness

    type : the type of record
            unique    : record only occurs once
            identical : record occurs more than once, first occurrence has parent record number of 0
            alias     : record occurs more than once, after alias substitutions have been applied

    parent record number : the line number of the record that THIS record is a duplicate of.

By default, the report file name is basename_report.ext.

Various setup options may be defined in a hash that is passed as an optional argument to the report_file method. All of these options are optional. They include:

all_records

When this option is set to a positive value, all records will be included in the report. If this value is not set, only the duplicate records will be included in the report.

    $deduper->report_file("orig.txt",all_records => 0);

report_array

This method takes an array as its initial argument. The behaviour is the same as report_file above, except that the report file is named deduper_array_report.txt.

EXAMPLES

Dedupe an array of single records

Given an array of strings:

    my @emails = 
    (
        'John.Smith@xyz.com',
        'Bob.Smith@xyz.com',
        'John.Brown@xyz.com.au',
        'John.Smith@xyz.com'
    );

    use Text::Record::Deduper;

    my $deduper = Text::Record::Deduper->new();
    my ($uniq,$dupe);
    ($uniq,$dupe) = $deduper->dedupe_array(\@emails);

The array reference $uniq now contains

    'John.Smith@xyz.com',
    'Bob.Smith@xyz.com',
    'John.Brown@xyz.com.au'

The array reference $dupe now contains

    'John.Smith@xyz.com'

Dedupe a file of fixed width records

Given a text file names.txt with fixed width columns padded with spaces, and duplicates defined by the second and third columns:

    100 Bob      Smith    
    101 Robert   Smith    
    102 John     Brown    
    103 Jack     White   
    104 Bob      Smythe    
    105 Robert   Smith    


    use Text::Record::Deduper;

    my %nick_names = (Bob => 'Robert',Rob => 'Robert');
    my $near_deduper = Text::Record::Deduper->new();
    $near_deduper->add_key(start_pos =>  5, key_length => 9, ignore_whitespace => 1, alias => \%nick_names) or die;
    $near_deduper->add_key(start_pos => 14, key_length => 9,) or die;
    $near_deduper->dedupe_file("names.txt");
    $near_deduper->report_file("names.txt");

Text::Record::Deduper will produce a file of unique records, names_uniqs.txt, in the same directory as names.txt:

    101 Robert   Smith    
    102 John     Brown    
    103 Jack     White   
    104 Bob      Smythe    
       

and a file of duplicates, names_dupes.txt, in the same directory as names.txt:

    100 Bob      Smith    
    105 Robert   Smith   

The original file, names.txt, is left intact.

A report file, names_report.txt, is created in the same directory as names.txt:

    Number Key                            Type       Parent Parent Key                    
    --------------------------------------------------------------------------------
         1 Bob_Smith                      alias           2 Robert_Smith                  
         2 Robert_Smith                   identical       0                               
         3 John_Brown                     unique          0                               
         4 Jack_White                     unique          0                               
         5 Bob_Smythe                     unique          0                               
         6 Robert_Smith                   identical       2 Robert_Smith                  

TO DO

    Allow for multi line records
    Add batch mode driven by config file or command line options
    Allow option to warn user when over writing output files
    Allow user to customise suffix for uniq and dupe output files

SEE ALSO

sort(1), uniq(1), Text::ParseWords, Text::RecordParser, Text::xSV

AUTHOR

Text::Record::Deduper was written by Kim Ryan <kimryan at cpan d o t org>

COPYRIGHT AND LICENSE

Copyright (C) 2011 Kim Ryan.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.