File::FindSimilars - Fast similar-files finder
use File::FindSimilars;

my $similars_finder = File::FindSimilars->new( { fc_level => $fc_level, } );
$similars_finder->find_for(\@ARGV);
$similars_finder->similarity_check();
Extremely fast file similarity checker. Files with similar sizes and similar names are picked out as likely duplicates.
It uses an advanced soundex vector algorithm to determine the similarity between files. For n files, each with approximately m words in its file name, the computational cost is merely on the order of
O(n^2 * m)
which the author reports to be thousands of times faster than existing file fingerprinting techniques.
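To illustrate how differently spelled names such as 'Grey Coloured Bunnie' and 'ColoredGrayBunny' can still match, here is a minimal sketch of soundex-based word matching. It is not the module's implementation: simple_soundex is a hand-rolled, simplified soundex (the module itself relies on Text::Soundex), and name_words, name_similarity, and the rule for splitting names on case changes are hypothetical helpers introduced only for illustration.

```perl
use strict;
use warnings;

# Simplified soundex table: consonants that sound alike share a digit.
my %CODE;
@CODE{qw(B F P V)}         = (1) x 4;
@CODE{qw(C G J K Q S X Z)} = (2) x 8;
@CODE{qw(D T)}             = (3) x 2;
@CODE{qw(L)}               = (4);
@CODE{qw(M N)}             = (5) x 2;
@CODE{qw(R)}               = (6);

# First letter plus up to three digit codes, padded with zeros to 4 characters.
sub simple_soundex {
    my ($word) = @_;
    ( my $w = uc $word ) =~ s/[^A-Z]//g;
    return '' if $w eq '';
    my ( $first, @rest ) = split //, $w;
    my $prev = $CODE{$first} // 0;
    my $out  = $first;
    for my $ch (@rest) {
        my $c = $CODE{$ch} // 0;
        $out .= $c if $c && $c != $prev;    # skip uncoded letters and repeated codes
        $prev = $c unless $ch eq 'H' || $ch eq 'W';
    }
    return substr( $out . '000', 0, 4 );
}

# Split a file name into words on non-letters and on lower-to-upper case changes.
sub name_words {
    my ($name) = @_;
    return grep { length } split /[^A-Za-z]+|(?<=[a-z])(?=[A-Z])/, $name;
}

# Percentage of the shorter name's soundex codes that also occur in the other name.
sub name_similarity {
    my ( $name_a, $name_b ) = @_;
    my %codes_a = map { ( simple_soundex($_), 1 ) } name_words($name_a);
    my %codes_b = map { ( simple_soundex($_), 1 ) } name_words($name_b);
    my ( $small, $large ) =
      keys(%codes_a) <= keys(%codes_b)
      ? ( \%codes_a, \%codes_b )
      : ( \%codes_b, \%codes_a );
    return 0 unless %$small;
    my $shared = grep { $large->{$_} } keys %$small;
    return 100 * $shared / keys %$small;
}

printf "%s %s\n", simple_soundex('Grey'),     simple_soundex('Gray');
printf "%s %s\n", simple_soundex('Coloured'), simple_soundex('Colored');
printf "%d\n",
  name_similarity( 'Audio Book - The Grey Coloured Bunnie', 'ColoredGrayBunny' );
```

Comparing every pair of n names, with about m words each, is what gives a cost on the order of n^2 * m in a scheme like this one.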
The self-test output will help you understand what the module does and what to expect from its results.
$ make test
PERL_DL_NONLAZY=1 /usr/bin/perl "-Iblib/lib" "-Iblib/arch" test.pl
1..5 todo 2;
# Running under perl version 5.010000 for linux
# Current time local: Wed Nov 5 17:45:19 2008
# Current time GMT: Wed Nov 5 22:45:19 2008
# Using Test.pm version 1.25
# Testing File::FindSimilars version 2.04

. . . .

== Testing 2, files under test/ subdir:

9 test/(eBook) GNU - Python Standard Library 2001.pdf
3 test/Audio Book - The Grey Coloured Bunnie.mp3
5 test/ColoredGrayBunny.ogg
5 test/GNU - 2001 - Python Standard Library.pdf
4 test/GNU - Python Standard Library (2001).rar
9 test/LayoutTest.java
3 test/PopupTest.java
2 test/Python Standard Library.zip
ok 2 # (test.pl at line 83 TODO?!)

Note:

- The findsimilars script will pick out similar files from them in next test.
- Let's assume that the number represent the file size in KB.

== Testing 3 result should be:

## =========
3 'Audio Book - The Grey Coloured Bunnie.mp3' 'test/'
5 'ColoredGrayBunny.ogg' 'test/'

## =========
4 'GNU - Python Standard Library (2001).rar' 'test/'
5 'GNU - 2001 - Python Standard Library.pdf' 'test/'
ok 3

Note:

- There are 2 groups of similar files picked out by the script.
- The similar files are picked because their file names look similar. Note that the first group looks different and spells differently too, which means that the script is versatile enough to handle file names that don't have space in it, and robust enough to deal with spelling mistakes.
- Apart from the file name, the file size plays an important role as well.
- There are 2 files in the second similar files group, the book files group.
- The file 'Python Standard Library.zip' is not considered to be similar to the group because its size is not similar to the group.

== Testing 4, if Python.zip is bigger, result should be:

## =========
3 'Audio Book - The Grey Coloured Bunnie.mp3' 'test/'
5 'ColoredGrayBunny.ogg' 'test/'

## =========
4 'Python Standard Library.zip' 'test/'
4 'GNU - Python Standard Library (2001).rar' 'test/'
5 'GNU - 2001 - Python Standard Library.pdf' 'test/'
ok 4

Note:

- There are now 3 files in the book files group.
- The file 'Python Standard Library.zip' is included in the group because its size is now similar to the group.

== Testing 5, if Python.zip is even bigger, result should be:

## =========
3 'Audio Book - The Grey Coloured Bunnie.mp3' 'test/'
5 'ColoredGrayBunny.ogg' 'test/'

## =========
4 'GNU - Python Standard Library (2001).rar' 'test/'
5 'GNU - 2001 - Python Standard Library.pdf' 'test/'
6 'Python Standard Library.zip' 'test/'
9 '(eBook) GNU - Python Standard Library 2001.pdf' 'test/'
ok 5

Note:

- There are 4 files in the book files group now.
- The file 'Python Standard Library.zip' is still in the group.
- But this time, because it is also considered to be similar to the .pdf file (since their size are now similar, 6 vs 9), a 4th file the .pdf one is now included in the book group.
- If the size of file 'Python Standard Library.zip' is 12(KB), then the book files group will be split into two. Do you know why and which files each group will contain?
The File::FindSimilars package comes with a fully functional demo script findsimilars. Please refer to its help file for further explanations.
This package is highly customizable. Refer to the class method new for details.
This module depends on Text::Soundex, but not File::Find.
Initialize the object.
my $similars_finder = File::FindSimilars->new();
or,
my $similars_finder = File::FindSimilars->new( {} );
which are the same as:
my $similars_finder = File::FindSimilars->new( {
    soundex_weight => 50,                    # percentage of weight that soundex takes,
                                             # the rest is for file size
    fc_threshold   => 75,                    # over which files are considered similar
    delimiter      => "\n## =========\n",    # delimiter between files output
    format         => "%12d '%s' %s'%s'",    # file info print format
    fc_level       => 0,                     # file comparison level
    verbose        => 0,
} );
The values shown above are the default settings. Any of the %config_param attributes can be omitted when calling the new method.
new is the only class method. All other methods are object methods.
soundex_weight: Percentage of the weight that soundex takes; the remaining percentage is for file size.
Pass set_val to change the attribute; omit it to retrieve the attribute value.
fc_threshold: The threshold over which files are considered similar.
delimiter: Delimiter printed between file info outputs.
format: Format used to print file info.
fc_level: File comparison level. Whether to check similar files within the same folder: 0, no; 1, yes.
verbose: Verbose level. Whether to output progress info: 0, no; 1, yes.
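One way to picture how a soundex weight and a similarity threshold could interact is a weighted average of a name score and a size score, compared against the threshold. The formula below is an assumption made for illustration and is not taken from the module's source; size_similarity, combined_score, and is_similar are hypothetical names, and only the defaults of 50 and 75 come from the documentation.

```perl
use strict;
use warnings;

# Size similarity as a percentage: the smaller size relative to the larger one.
sub size_similarity {
    my ( $size_a, $size_b ) = @_;
    my ( $min, $max ) =
      $size_a <= $size_b ? ( $size_a, $size_b ) : ( $size_b, $size_a );
    return 100 if $max == 0;    # two empty files count as identical in size
    return 100 * $min / $max;
}

# Weighted average; $soundex_weight is the percentage given to the name score,
# the remainder goes to the size score (assumed formula).
sub combined_score {
    my ( $name_score, $size_a, $size_b, $soundex_weight ) = @_;
    $soundex_weight //= 50;
    my $size_score = size_similarity( $size_a, $size_b );
    return ( $soundex_weight * $name_score
          + ( 100 - $soundex_weight ) * $size_score ) / 100;
}

# Files count as similar only when the score is strictly over the threshold.
sub is_similar {
    my ( $score, $fc_threshold ) = @_;
    $fc_threshold //= 75;
    return $score > $fc_threshold ? 1 : 0;
}

my $score = combined_score( 100, 4, 5 );    # perfect name score, sizes 4 and 5
printf "%.1f %s\n", $score, is_similar($score) ? 'similar' : 'not similar';
```

Under this assumed formula, raising the soundex weight makes file names count for more and file sizes for less, which matches the documented meaning of the attribute even if the real arithmetic differs.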
find_for($array_ref): Set the directory queue for similarity checking. Each entry in $array_ref is a directory to search. E.g.,
$similars_finder->find_for(\@ARGV);
similarity_check(): Run the similarity check on the queued directories. Information about similar files is printed to stdout according to the configured format and delimiter. E.g.,
$similars_finder->similarity_check();
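Putting the calls together, a small driver script might look like the sketch below. It uses only the new, find_for, and similarity_check calls and the configuration keys documented here; the wrapper name run_similarity_check and the choice of fc_level => 1 are illustrative assumptions. The module is loaded lazily, so the script does nothing when no directories are given.

```perl
use strict;
use warnings;

# Hypothetical wrapper: check the given directories and print similar files.
sub run_similarity_check {
    my (@dirs) = @_;
    require File::FindSimilars;    # loaded only when there is work to do

    my $similars_finder = File::FindSimilars->new( {
        fc_level => 1,             # also compare files within the same folder
        verbose  => 0,             # no progress output
    } );
    $similars_finder->find_for( \@dirs );
    $similars_finder->similarity_check();    # prints groups to stdout
    return;
}

run_similarity_check(@ARGV) if @ARGV;
```

Saved under a name of your choosing, it would be invoked with one or more directories as arguments.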
File::Compare(3), perl(1) and the following scripts.
File::Find::Duplicates - Find duplicate files
http://belfast.pm.org/Modules/Duplicates.html
my %dupes = find_duplicate_files('/basedir1', '/basedir2');
When passed a base directory (or list of such directories) it returns a hash, keyed on filesize, of lists of the identical files of that size.
ch::claudio::finddups - Find duplicate files in given directory
http://www.claudio.ch/Perl/finddups.html
ch::claudio::finddups is a script as well as a package. When called as a script it will search the directory and its subdirectories for files with (possibly) identical content.
To find identical files fast, this program just remembers the Digest::SHA1 hash of each file and signals two files as equal if their hashes match. It outputs lines that can be given to a Bourne shell to compare the two files, and to remove one of them if the comparison indicates that the files are indeed identical.
It can also be used as a package, which gives access to its variables, routines and methods.
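For contrast with the name-and-size approach of File::FindSimilars, the hash-based idea can be sketched as follows. This is not finddups's code: group_by_digest is a hypothetical helper, and the core Digest::SHA module (computing SHA-1) stands in for Digest::SHA1. Note that every file's content has to be read in full.

```perl
use strict;
use warnings;
use Digest::SHA;

# Group files by the SHA-1 digest of their content and return only the
# groups that contain more than one file (candidate duplicates).
sub group_by_digest {
    my (@files) = @_;
    my %by_digest;
    for my $file (@files) {
        # addfile reads the whole file; 'b' selects binary mode
        my $digest = Digest::SHA->new(1)->addfile( $file, 'b' )->hexdigest;
        push @{ $by_digest{$digest} }, $file;
    }
    return grep { @$_ > 1 } values %by_digest;
}

# Print each group of files with matching digests, one file per line.
for my $group ( group_by_digest(@ARGV) ) {
    print join( "\n", sort @$group ), "\n\n";
}
```

Matching digests only flag candidates, which is why a tool of this kind would still compare the files byte by byte before removing anything.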
dupper.pl - finds duplicate files, optionally removes them
http://sial.org/code/perl/scripts/dupper.pl.html
Script to find (and optionally remove) duplicate files in one or more directories. Duplicates are spotted through the use of MD5 checksums.
Please report any bugs or feature requests to bug-file-find-similars at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=File-Find-Similars. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
You can find documentation for this module with the perldoc command.
perldoc File::FindSimilars
You can also look for information at:
RT: CPAN's request tracker
http://rt.cpan.org/NoAuth/Bugs.html?Dist=File-Find-Similars
AnnoCPAN: Annotated CPAN documentation
http://annocpan.org/dist/File-Find-Similars
CPAN Ratings
http://cpanratings.perl.org/d/File-Find-Similars
Search CPAN
http://search.cpan.org/dist/File-Find-Similars/
SUN, Tong <suntong at cpan.org> http://xpt.sourceforge.net/
Copyright (c) 2001-2016 Tong SUN. All rights reserved.
This program is released under the BSD license.
To install File::FindSimilars, copy and paste the appropriate command into your terminal.
cpanm
cpanm File::FindSimilars
CPAN shell
perl -MCPAN -e shell
install File::FindSimilars
For more information on module installation, please visit the detailed CPAN module installation guide.