The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Acme::Tools - Lots of more or less useful subs lumped together and exported into your namespace

SYNOPSIS

 use Acme::Tools;

 print sum(1,2,3);                   # 6
 print avg(2,3,4,6);                 # 3.75

 my @list = minus(\@listA, \@listB); # set operations
 my @list = union(\@listA, \@listB); # set operations

 print length(gzip("abc" x 1000));   # far less than 3000

 writefile("/dir/filename",$string); # convenient
 my $s=readfile("/dir/filename");    # also conventient

 print "yes!" if between($pi,3,4);

 print percentile(0.05, @numbers);

 my @even = range(1000,2000,2);      # even numbers between 1000 and 2000
 my @odd  = range(1001,2001,2);

 my $dice = random(1,6);
 my $color = random(['red','green','blue','yellow','orange']);

 ...and so on.

ABSTRACT

Useful subroutines for perl exported into the using namespace.

DESCRIPTION

A set of more or less useful subs collected for some time...

EXPORT

Almost every sub, about 60 of them.

Beware of name space pollution. But what did you expect from an acme module?

NUMBERS, SETS, STATISTICS

min

Returns the smallest in a list of numbers. Undef is ignored.

 @lengths=(2,3,5,2,10,5);
 $shortest = min(@lengths);   # 2

max

Returns the largest in a list of numbers. Undef is ignored.

 @heights=(123,90,134,132);
 $highest = max(@heights);   # 134

sum

Returns the sum of a list of numbers.

 print sum(1,3,undef,8);   # 12

avg

Returns the average number of a list of numbers. That is sum / count

 print avg(2, 4, 9);   # 5              (2+4+9) / 3 = 5

Also known as arithmetic mean.

geomavg

Returns the geometric average (a.k.a geometric mean) of a list of numbers.

 print geomavg(10,100,1000,10000,100000);               # 1000
 print 0+ (10*100*1000*10000*100000) ** (1/5);          # 1000 same thing
 print exp(avg(map log($_),10,100,1000,10000,100000));  # 1000 same thing, this is how geomavg() works internally

stddev

Standard_Deviation = sqrt(varians)

Varians = ( sum (x[i]-Avgerage)**2)/(n-1)

Standard deviation (stddev) is a measurement of the width of a normal distribution where one stddev on each side of the mean covers 68% and two stddevs 95%. Normal distributions are sometimes called Gauss curves or Bell shapes.

 stddev(4,5,6,5,6,4,3,5,5,6,7,6,5,7,5,6,4)             # 1.0914103126635
 avg(@IQtestscores) + stddev(@IQtestscores)            # the score for IQ = 115 (by one definition)
 avg(@IQtestscores) - stddev(@IQtestscores)            # the score for IQ = 85

median

Returns the median value of a list of numbers. The list don't have to be sorted.

Example 1, list having an odd number of numbers:

 print median(1, 100, 101);   # 100

100 is the middlemost number after sorting.

Example 2, an even number of numbers:

 print median(1005, 100, 101, 99);   # 100.5

100.5 is the average of the two middlemost numbers.

percentile

Returns one or more percentiles of a list of numbers.

Percentile 50 is the same as the median, percentile 25 is the first quartile, 75 is the third quartile.

Input:

First argument is your wanted percentile, or a refrence to a list of percentiles you want from the dataset.

If the first argument to percentile() is a scalar, this percentile is returned.

If the first argument is a reference to an array, then all those percentiles are returned as an array.

Second, third, fourth and so on argument are the numbers from which you want to find the percentile(s).

Examples:

This finds the 50-percentile (the median) to the four numbers 1, 2, 3 and 4:

 print "Median = " . percentile(50, 1,2,3,4);   # 2.5

This:

 @data=(11, 5, 3, 5, 7, 3, 1, 17, 4, 2, 6, 4, 12, 9, 0, 5);
 @p = map percentile($_,@data), (25, 50, 75);

Is the same as this:

 @p = percentile([25, 50, 75], @data);

But the latter is faster, especially if @data is large since it sorts the numbers only once internally.

Example:

Data: 1, 4, 6, 7, 8, 9, 22, 24, 39, 49, 555, 992

Average (or mean) is 143

Median is 15.5 (which is the average of 9 and 22 who both equally lays in the middle)

The 25-percentile is 6.25 which are between 6 and 7, but closer to 6.

The 75-percentile is 46.5, which are between 39 and 49 but close to 49.

Linear interpolation is used to find the 25- and 75-percentile and any other x-percentile which doesn't fall exactly on one of the numbers in the set.

Interpolation:

As you saw, 6.25 are closer to 6 than to 7 because 25% along the set of the twelve numbers is closer to the third number (6) than to he fourth (7). The median (50-percentile) is also really interpolated, but it is always in the middle of the two center numbers if there are an even count of numbers.

However, there is two methods of interpolation:

Example, we have only three numbers: 5, 6 and 7.

Method 1: The most common is to say that 5 and 7 lays on the 25- and 75-percentile. This method is used in Acme::Tools.

Method 2: In Oracle databases the least and greatest numbers always lay on the 0- and 100-percentile.

As an argument on why Oracles (and others?) definition is not the best way is to look at your data as for instance temperature measurements. If you place the highest temperature on the 100-percentile you are sort of saying that there can never be a higher temperatures in future measurements.

A quick non-exhaustive Google survey suggests that method 1 here is most used.

The larger the data sets, the less difference there is between the two methods.

Extrapolation:

In method one, when you want a percentile outside of any possible interpolation, you use the smallest and second smallest to extrapolate from. For instance in the data set 5, 6, 7, if you want an x-percentile of x < 25, this is below 5.

If you feel tempted to go below 0 or above 100, percentile() will die (or croak to be more precise)

Another method could be to use "soft curves" instead of "straight lines" in interpolation. Maybe B-splines or Bezier curves. This is not used here.

For large sets of data Hoares algorithm would be faster than the simple straightforward implementation used in percentile() here. Hoares don't sort all the numbers fully.

Differences between the two main methods described above:

 Data: 1, 4, 6, 7, 8, 9, 22, 24, 39, 49, 555, 992

 Percentile  Method 1                    Method 2
             (Acme::Tools::percentile  (Oracle)
             and others)
 ----------- --------------------------- ---------
 0           -2                          1
 1           -1.61                       1.33
 25          6.25                        6.75
 50 (median) 15.5                        15.5
 75          46.5                        41.5
 99          1372.19                     943.93
 100         1429                        992

Found like this:

 perl -MAcme::Tools -le 'print for percentile([0,1,25,50,75,99,100], 1,4,6,7,8,9,22,24,39,49,555,992)'

And like this in Oracle-databases:

 create table tmp (n number);
 insert into tmp values (1); insert into tmp values (4); insert into tmp values (6);
 insert into tmp values (7); insert into tmp values (8); insert into tmp values (9);
 insert into tmp values (22); insert into tmp values (24); insert into tmp values (39);
 insert into tmp values (49); insert into tmp values (555); insert into tmp values (992);
 select
   percentile_cont(0.00) within group(order by n) per0,
   percentile_cont(0.01) within group(order by n) per1,
   percentile_cont(0.25) within group(order by n) per25,
   percentile_cont(0.50) within group(order by n) per50,
   percentile_cont(0.75) within group(order by n) per75,
   percentile_cont(0.99) within group(order by n) per99,
   percentile_cont(1.00) within group(order by n) per100
 from tmp;

(Oracle also provides a similar function: percentile_disc where disc is short for discrete, meaning no interpolation is taking place. Instead the closest number from the data set is picked.)

resolve

Resolves an equation by Newtons method.

Input: 1-6 arguments.

First argument: must be a coderef to a subroutine (a function)

Second argument: the target, f(x)=target. Default 0.

Third argument: a start position for x. Default 0.

Fourth argument: a small delta value. Default 1e-4 (0.0001).

Fifth argument: a maximum number of iterations before resolve gives up and carps. Default 100 (if fifth argument is not given or is undef). The number 0 means infinite here. If the derivative of the start position is zero or close to zero more iterations are typically needed.

Sixth argument: A number of seconds to run before giving up. If both fifth and sixth argument is given and > 0, resolve stops at whichever comes first.

Output: returns the number x for f(x) = 0

...or equal to the second input argument if present.

Example:

The equation x^2 - 4x - 21 = 0 has two solutions: -3 and 7.

The result of resolve will depend on the start position:

 print resolve(sub{ my $x=shift; $x**2 - 4*$x - 21 });        # -3 with default start position 0
 print resolve(sub{ my $x=shift; $x**2 - 4*$x - 21 },0,3);    # 7  with start position 3
 print "Iterations: $Acme::Tools::Resolve_iterations\n";      # 3 or larger, about 10-15 is normal

The variable $Acme::Tools::Resolve_iterations (which is exported) will be set to the last number of iterations resolve used. Work also if resolve dies (carps).

The variable $Acme::Tools::Resolve_last_estimate (which is exported) will be set to the last estimate. This number will often be close to the solution and can be used even if resolve dies (carps).

BigFloat-example:

If either second, third or fourth argument is an instance of Math::BigFloat, so will the result be:

 use Acme::Tools;
 use Math::BigFloat try => 'GMP';  # pure perl, no warnings if GMP not installed
 my $start=Math::BigFloat->new(1);
 my $gr1 = resolve(sub{my$x=shift; $x-1-1/$x;},0,1);     # 1/2 + sqrt(5)/2
 my $gr2 = resolve(sub{my$x=shift; $x-1-1/$x;},0,$start);# 1/2 + sqrt(5)/2
 Math::BigFloat->div_scale(50); #default is 40
 my $gr3 = resolve(sub{my$x=shift; $x-1-1/$x;},0,$start);# 1/2 + sqrt(5)/2
 print "Golden ratio 1: $gr1\n";
 print "Golden ratio 2: $gr2\n";
 print "Golden ratio 3: $gr3\n";

Output:

 Golden ratio 1: 1.61803398874989
 Golden ratio 2: 1.61803398874989484820458683436563811772029300310882395927211731893236137472439025
 Golden ratio 3: 1.6180339887498948482045868343656381177203091798057610016490334024184302360920167724737807104860909804

See:

http://en.wikipedia.org/wiki/Newtons_method

Math::BigFloat

http://en.wikipedia.org/wiki/Golden_ratio

nvl

The no value function (or null value function)

nvl() takes two or more arguments. (Oracles take just two)

Returns the value of the first input argument with length() > 0.

Return undef if there is no such input argument.

In perl 5.10 and perl 6 this will most often be easier with the // operator, although nvl() and // treats empty strings "" differently. Sub nvl here considers empty strings and undef the same.

replace

Return the string in the first input argument, but where pairs of search-replace strings (or rather regexes) has been run.

Works as replace() in Oracle, or rather regexp_replace() in Oracle 10. Except that this replace() accepts more than three arguments.

Examples:

 print replace("water","ater","ine");  # Turns water into wine
 print replace("water","ater");        # w
 print replace("water","at","eath");   # weather
 print replace("water","wa","ju",
                       "te","ic",
                       "x","y",        # No x is found, no y is returned
                       'r$',"e");      # Turns water into juice. 'r$' says that the r it wants
                                       # to change should be the last letters. This reveals that
                                       # second, fourth, sixth and so on argument is really regexs,
                                       # not normal strings. So use \ (or \\ inside "") to protect
                                       # the special characters of regexes. You probably also
                                       # should write qr/regexp/ instead of 'regexp' if you make
                                       # use of regexps here, just to make it more clear that
                                       # these are really regexps, not strings.

 print replace('JACK and JUE','J','BL'); # prints BLACK and BLUE
 print replace('JACK and JUE','J');      # prints ACK and UE
 print replace("abc","a","b","b","c");   # prints ccc           (not bcc)

If the first argument is a reference to a scalar variable, that variable is changed "in place".

Example:

 my $str="test";
 replace(\$str,'e','ee','s','S');
 print $str;                         # prints teeSt

decode_num

See "decode".

decode

decode() and decode_num() works just as Oracles decode().

decode() and decode_num() accordingly uses perl operators eq and == for comparison.

Examples:

 $a=123;
 print decode($a, 123,3, 214,4, $a);     # prints 3

The first argument is tested against the second, fourth, sixth and so on argument, and then the third, fifth, seventh and so on argument is returned if decode() finds an equal string or number.

In the above example: 123 maps to 3, 124 maps to 4 and the last argument ($a) is returned if decode as the last resort if every other fails.

Since the operator => is synonymous to the comma operator, the above example is probably more readable rewritten like this:

 my $a=123;
 print decode($a, 123=>3, 214=>4, $a);   # 3

More examples:

 my $a=123;
 print decode($a, 123=>3, 214=>7, $a);              # also 3,  note that => is synonym for , (comma) in perl
 print decode($a, 122=>3, 214=>7, $a);              # prints 123
 print decode($a,  123.0 =>3, 214=>7);              # prints 3
 print decode($a, '123.0'=>3, 214=>7);              # prints nothing (undef), no last argument default value here
 print decode_num($a, 121=>3, 221=>7, '123.0','b'); # prints b

Sort of:

 decode($string, %conversion, $default);

The last argument is returned as a default if none of the keys in the keys/value-pairs matched.

A more perl-ish and probaby faster way of doing the same:

 {123=>3, 214=>7}->{$a} || $a                       # (beware of 0)

between

Input: Three arguments.

Returns: Something true if the first argument is numerically between the two next.

ARRAYS, HASHES

distinct

Returns the values of the input list, sorted alfanumerically, but only one of each value. This is the same as "uniq" except uniq does not sort the returned list.

Example:

 print join(", ", distinct(4,9,3,4,"abc",3,"abc"));    # 3, 4, 9, abc
 print join(", ", distinct(4,9,30,4,"abc",30,"abc"));  # 30, 4, 9, abc       note: alphanumeric sort

in

Returns 1 (true) if first argument is in the list of the remaining arguments. Uses the perl-operator eq.

Otherwise it returns 0 (false).

 print in(  5,   1,2,3,4,6);         # 0
 print in(  4,   1,2,3,4,6);         # 1
 print in( 'a',  'A','B','C','aa');  # 0
 print in( 'a',  'A','B','C','a');   # 1

I guess in perl 5.10 or perl 6 you could use the ~~ operator instead.

in_num

Just as sub "in", but for numbers. Internally uses the perl operator == instead of eq .

 print in(5000,  '5e3');          # 0
 print in(5000,   5e3);           # 1 since 5e3 is converted to 5000 before the call
 print in_num(5000, 5e3);         # 1
 print in_num(5000, '+5.0e03');   # 1

union

Input: Two arrayrefs. (Two lists, that is)

Output: An array containing all elements from both input lists, but no element more than once even if it occurs twice or more in the input.

Example, prints 1,2,3,4:

 perl -MAcme::Tools -le 'print join ",", union([1,2,3],[2,3,3,4,4])'              # 1,2,3,4

minus

Input: Two arrayrefs.

Output: An array containing all elements in the first input array but not in the second.

Example:

 perl -MAcme::Tools -le 'print join " ", minus( ["five", "FIVE", 1, 2, 3.0, 4], [4, 3, "FIVE"] )'

Output is five 1 2.

intersect

Input: Two arrayrefs

Output: An array containing all elements which exists in both input arrays.

Example:

 perl -MAcme::Tools -le 'print join" ", intersect( ["five", 1, 2, 3.0, 4], [4, 2+1, "five"] )'      # 4 3 five

Output: 4 3 five

not_intersect

Input: Two arrayrefs

Output: An array containing all elements member of just one of the input arrays (not both).

Example:

 perl -MAcme::Tools -le ' print join " ", not_intersect( ["five", 1, 2, 3.0, 4], [4, 2+1, "five"] )'

The output is 1 2.

uniq

Input: An array of strings (or numbers)

Output: The same array in the same order, except elements which exists earlier in the list.

Same as "distinct" but distinct sorts the returned list, uniq does not.

Example:

 my @t=(7,2,3,3,4,2,1,4,5,3,"x","xx","x",02,"07");
 print join " ", uniq @t;                          # prints  7 2 3 4 1 5 x xx 07

zip

Input: Two arrayrefs.

That is: two arrays containing numbers, strings or anything really.

Output: An array of the two arrays zipped (interlocked, merged) into each other.

 print join " ", zip( [1,3,5], [2,4,6] );               # 1 2 3 4 5 6

zip() can create hashes where the keys are found in the first array and values in the secord in correct order:

 my @media = qw/CD DVD VHS LP Blueray/;
 my @count = qw/20 12 2 4 3/;
 my %count = zip(\@media,\@count);                 # or zip( [@media], [@count] )
 print "I got $count{DVD} DVDs\n";                 # I got 12 DVDs

Dies (croaks) if the two lists are of different sizes.

(TODO: Merge any number of arrayrefs, not just two)

subhash

Copies a subset of keys/values from one hash to another.

Input: First argument is a reference to a hash. The rest of the arguments are a list of the keys of which key/value-pair you want to be copied.

Output: The hash consisting of the keys and values you specified.

Example:

 %population = ( Norway=>4800000, Sweden=>8900000, Finland=>5000000,
                 Denmark=>5100000, Iceland=>260000,
                 India => 1e9, China=>1.3e9, USA=>300e6, UK=>60e6 );

 %scandinavia = subhash( \%population , 'Norway', 'Sweden', 'Denmark' ); # this
 %scandinavia = (Norway=>4500000,Sweden=>8900000,Denmark=>5100000);      # and this is the same

 print "Population of $_ is $scandinavia{$_}\n" for keys %scandinavia;

...prints the populations of the three scandinavian countries.

Note: The values are NOT deep copied when they are references. (Use Storable::dclone() to do that).

hashtrans

Input: a reference to a hash of hashes

Output: a hash like the input-hash, but matrix transposed (kind of). Think of it as if X and Y has swapped places.

 %h = ( 1 => {a=>33,b=>55},
        2 => {a=>11,b=>22},
        3 => {a=>88,b=>99} );
 print serialize({hashtrans(\%h)},'v');

Gives:

 %v=( 'a'=>{'1'=>'33','2'=>'11','3'=>'88'},
      'b'=>{'1'=>'55','2'=>'22','3'=>'99'} );

RANDOM

random

Input: One or two arguments.

Output:

If the first argument is an arrayref: returns a random member of that array without changing the array.

Else: returns a random integer between the integers in argument one and two.

Note: This is different from int($from+rand($to-$from)) because that never returns $to, but random() will.

If there is no second argument and the first is an integer, a random integer between 0 and that number is returned. Including 0 and the number itself.

Examples:

 $dice=random(1,6);                                # 1, 2, 3, 4, 5 or 6
 $dice=random([1..6]);                             # same as previous
 print random(['head','tail','standing on edge']); # prints one of those three strings
 print random(2);                                  # prints 0, 1 or 2
 print 2**random(7);                               # prints 1, 2, 4, 8, 16, 32, 64 or 128

TODO:

Draw numbers from a normal deviation (bell curve) or other statistical deviations (Although there are probably other modules that do that). I.e.:

 print random({-deviation=>'Normal', -average=>178, -stddev=>15});

Another possible TODO: weighted dices, for cheating:

 print random([[1,0],[2,1],[3,1],[4,1],[5,2],[6,2]]); # never 1 and twice as likely to return  5 and 6 as opposed to 2, 3 and 4
 print random([[1,0],2,3,4,[5,2],[6,2]]);             # same, default weight 1 on 2, 3 and 4

random_gauss

Returns an pseudo-random number with a Gaussian distribution instead of the uniform distribution of perls rand() or random() in this module. The algorithm is a variation of the one at http://www.taygeta.com/random/gaussian.html which is both faster and better than adding a long series of rand().

Uses perls rand function internally.

Input: 0 - 3 arguments.

First argument: the average of the distribution. Default 0.

Second argument: the standard deviation of the distribution. Default 1.

Third argument: If a third argument is present, random_gauss returns an array of that many pseudo-random numbers. If there is no third argument, a number (a scalar) is returned.

Output: One or more pseudo-random numbers with a Gaussian distribution. Also known as a Bell curve or Normal distribution.

Example:

 my @I=random_gauss(100, 15, 100000);         # produces 100000 pseudo-random numbers, average=100, stddev=15
 #my @I=map random_gauss(100, 15), 1..100000; # same but more than three times slower
 print "Average is:    ".avg(@I)."\n";        # prints a number close to 100
 print "Stddev  is:    ".stddev(@I)."\n";     # prints a number close to 15

 my @M=grep $_>100+15*2, @I;                  # those above 130
 print "Percent above two stddevs: ".(100*@M/@I)."%\n"; #prints a number close to 2.2%

Example 2:

 my $num=1e6;
 my @h; $h[$_/2]++ for random_gauss(100,15, $num);
 $h[$_] and printf "%3d - %3d %6d %s\n",
   $_*2,$_*2+1,$h[$_],'=' x ($h[$_]*1000/$num)
     for 1..200/2;

...prints an example of the famous Bell curve:

  44 -  45     70 
  46 -  47    114 
  48 -  49    168 
  50 -  51    250 
  52 -  53    395 
  54 -  55    588 
  56 -  57    871 
  58 -  59   1238 =
  60 -  61   1807 =
  62 -  63   2553 ==
  64 -  65   3528 ===
  66 -  67   4797 ====
  68 -  69   6490 ======
  70 -  71   8202 ========
  72 -  73  10577 ==========
  74 -  75  13319 =============
  76 -  77  16283 ================
  78 -  79  20076 ====================
  80 -  81  23742 =======================
  82 -  83  27726 ===========================
  84 -  85  32205 ================================
  86 -  87  36577 ====================================
  88 -  89  40684 ========================================
  90 -  91  44515 ============================================
  92 -  93  47575 ===============================================
  94 -  95  50098 ==================================================
  96 -  97  52062 ====================================================
  98 -  99  53338 =====================================================
 100 - 101  52834 ====================================================
 102 - 103  52185 ====================================================
 104 - 105  50472 ==================================================
 106 - 107  47551 ===============================================
 108 - 109  44471 ============================================
 110 - 111  40704 ========================================
 112 - 113  36642 ====================================
 114 - 115  32171 ================================
 116 - 117  28166 ============================
 118 - 119  23618 =======================
 120 - 121  19873 ===================
 122 - 123  16360 ================
 124 - 125  13452 =============
 126 - 127  10575 ==========
 128 - 129   8283 ========
 130 - 131   6224 ======
 132 - 133   4661 ====
 134 - 135   3527 ===
 136 - 137   2516 ==
 138 - 139   1833 =
 140 - 141   1327 =
 142 - 143    860 
 144 - 145    604 
 146 - 147    428 
 148 - 149    275 
 150 - 151    184 
 152 - 153    111 
 154 - 155     67 

mix

mix() could also have been named shuffle(), as in shuffling a deck of cards.

Note: List::Util::shuffle() is approximately four times faster. Both respects srand().

Example:

Mixes an array in random order. This:

 print mix("a".."z"),"\n" for 1..3;

...could write something like:

 trgoykzfqsduphlbcmxejivnwa
 qycatilmpgxbhrdezfwsovujkn
 ytogrjialbewcpvndhkxfzqsmu

Input:

1. Either a reference to an array as the only input. This array will then be mixed in-place. The array will be changed:

This: @a=mix(@a) is the same as: mix(\@a).

2. Or an array of zero, one or more elements.

Note that an input-array which COINCIDENTLY SOME TIMES has one element (but more other times), and that element is an array-ref, you will probably not get the expected result.

To check distribution:

 perl -MAcme::Tools -le 'print mix("a".."z") for 1..26000'|cut -c1|sort|uniq -c|sort -n

The letters a-z should occur around 1000 times each.

Shuffles a deck of cards: (s=spaces, h=hearts, c=clubs, d=diamonds)

 perl -MAcme::Tools -le '@cards=map join("",@$_),cart([qw/s h c d/],[2..10,qw/J Q K A/]); print join " ",mix(@cards)'

(Uses "cart", which is not a typo, see further down here)

COMPRESSION

"zipb64", "unzipb64", "zipbin", "unzipbin", "gzip", and "gunzip" compresses and uncompresses strings to save space in disk, memory, database or network transfer. Trades time for space. (Beware of wormholes)

zipb64

Compresses the input (text or binary) and returns a base64-encoded string of the compressed binary data. No known limit on input length, several MB has been tested, as long as you've got the RAM...

Input: One or two strings.

First argument: The string to be compressed.

Second argument is optional: A dictionary string.

Output: a base64-kodet string of the compressed input.

The use of an optional dictionary string will result in an even further compressed output in the dictionary string is somewhat similar to the string that is compressed (the data in the first argument).

If x relatively similar string are to be compressed, i.e. x number automatic of email responses to some action by a user, it will pay of to choose one of those x as a dictionary string and store it as such. (You will also use the same dictionary string when decompressing using "unzipb64".

The returned string is base64 encoded. That is, the output is 33% larger than it has to be. The advantage is that this string more easily can be stored in a database (without the hassles of CLOB/BLOB) or perhaps easier transfer in http POST requests (it still needs some url-encoding, normally). See "zipbin" and "unzipbin" for the same without base 64 encoding.

Example 1, normal compression without dictionary:

  $txt = "Test av komprimering, hva skjer? " x 10;  # ten copies of this norwegian string, $txt is now 330 bytes (or chars rather...)
  print length($txt)," bytes input!\n";             # prints 330
  $zip = zipb64($txt);                              # compresses
  print length($zip)," bytes output!\n";            # prints 65
  print $zip;                                       # prints the base64 string ("noise")

  $output=unzipb64($zip);                              # decompresses
  print "Hurra\n" if $output eq $txt;               # prints Hurra if everything went well
  print length($output),"\n";                       # prints 330

Example 2, same compression, now with dictionary:

  $txt = "Test av komprimering, hva skjer? " x 10;  # Same original string as above
  $dict = "Testing av kompresjon, hva vil skje?";   # dictionary with certain similarities
                                                    # of the text to be compressed
  $zip2 = zipb64($txt,$dict);                          # compressing with $dict as dictionary
  print length($zip2)," bytes output!\n";           # prints 49, which is less than 65 in ex. 1 above
  $output=unzipb64($zip2,$dict);                       # uses $dict in the decompressions too
  print "Hurra\n" if $output eq $txt;               # prints Hurra if everything went well

Example 3, dictionary = string to be compressed: (out of curiosity)

  $txt = "Test av komprimering, hva skjer? " x 10;  # Same original string as above
  $zip3 = zipb64($txt,$txt);                           # hmm
  print length($zip3)," bytes output!\n";           # prints 25
  print "Hurra\n" if unzipb64($zip3,$txt) eq $txt;     # hipp hipp ...

zipb64() and zipbin() is really just wrappers around Compress::Zlib and inflate() & co there.

zipbin

zipbin() does the same as zipb64() except that zipbin() does not base64 encode the result. Returns binary data.

See "zip" for documentation.

unzipb64

Opposite of "zipb64".

Input:

First argument: A string made by "zipb64"

Second argument: (optional) a dictionary string which where used in "zipb64".

Output: The original string (be it text or binary).

See "zipb64".

unzipbin

unzipbin() does the same as "unzip" except that unzipbin() wants a pure binary compressed string as input, not base64.

See "unzipb64" for documentation.

gzip

Input: A string you want to compress. Text or binary.

Output: The binary compressed representation of that input string.

gzip() is really the same as Compress:Zlib::memGzip() except that gzip() just returns the input-string if for some reason Compress::Zlib could not be required. Not installed or not found. (Compress::Zlib is a built in module in newer perl versions).

gzip() uses the same compression algorithm as the well known GNU program gzip found in most unix/linux/cygwin distros. Except gzip() does this in-memory. (Both using the C-library zlib).

gunzip

Input: A binary compressed string. I.e. something returned from gzip() earlier or read from a .gz file.

Output: The original larger non-compressed string. Text or binary.

bzip2

See "gzip" and "gunzip".

bzip2() and bunzip2() works just as gzip() and gunzip(), but use another compression algorithm. This is usually better but slower than the gzip-algorithm. Especially in the compression, decompression speed is less different.

See also man bzip2, man bunzip2 and Compress::Bzip2

bunzip2

Decompressed something compressed by bzip2() or the data from a .bz2 file. See "bzip2".

NET, WEB, CGI-STUFF

ipaddr

Input: an IP-number

Output: either an IP-address machine.sld.tld or an empty string if the DNS lookup didn't find anything.

Example:

 perl -MAcme::Tools -le 'print ipaddr("129.240.8.200")'  # prints www.uio.no

Uses perls gethostbyaddr internally.

ipaddr() memoizes the results internally (using the %Acme::Tools::IPADDR_memo hash) so only the first loopup on a particular IP number might take some time.

Some few DNS loopups can take several seconds. Most is done in a fraction of a second. Due to this slowness, medium to high traffic web servers should probably turn off hostname lookups in their logs and just log IP numbers by using HostnameLookups Off in Apache httpd.conf and then use ipaddr afterwards if necessary.

ipnum

ipnum() does the opposite of ipaddr()

Does an attempt of converting an IP address (hostname) to an IP number. Uses DNS name servers via perls internal gethostbyname(). Return empty string (undef) if unsuccessful.

 print ipnum("www.uio.no");   # prints 129.240.13.152

Does internal memoization via the hash %Acme::Tools::IPNUM_memo.

webparams

Input: (optional)

Zero or one input argument: A string of the same type often found behind the first question mark (?) in URLs.

This string can have one or more parts separated by & chars.

Each part consists of key=value pairs (with the first = char being the separation char).

Both key and value can be url-encoded.

If there is no input argument, webparams uses $ENV{QUERY_STRING} instead.

If also $ENV{QUERY_STRING} is lacking, webparams() checks if $ENV{REQUEST_METHOD} eq 'POST'. In that case $ENV{CONTENT_LENGTH} is taken as the number of bytes to be read from STDIN and those bytes are used as the missing input argument.

The environment variables QUERY_STRING, REQUEST_METHOD and CONTENT_LENGTH is typically set by a web server following the CGI standard (which Apache and most of them can do I guess) or in mod_perl by Apache. Although you are probably better off using CGI. Or $R->args() or $R->content() in mod_perl.

Output:

webparams() returns a hash of the key/value pairs in the input argument. Url-decoded.

If an input string has more than one occurrence of the same key, that keys value in the returned hash will become concatenated each value separated by a , char. (A comma char)

Examples:

 use Acme::Tools;
 my %R=webparams();
 print "Content-Type: text/plain\n\n";                          # or rather \cM\cJ\cM\cJ instead of \n\n to be http-compliant
 print "My name is $R{name}";

Storing those four lines in a file in the directory designated for CGI-scripts on your web server (or perhaps naming the file .cgi is enough), and chmod +x /.../cgi-bin/script and the URL http://some.server.somewhere/cgi-bin/script?name=HAL will print My name is HAL to the web page.

http://some.server.somewhere/cgi-bin/script?name=Bond&name=+James+Bond will print My name is Bond, James Bond.

urlenc

Input: a string

Output: the same string URL encoded so it can be sent in URLs or POST requests.

In URLs (web addresses) certain characters are illegal. For instance space and newline. And certain other chars have special meaning, such as +, %, =, ?, &.

These illegal and special chars needs to be encoded to be sent in URLs. This is done by sending them as % and two hex-digits. All chars can be URL encodes this way, but it's necessary just on some.

Example:

 $search="Østdal, Åge";
 my $url="http://machine.somewhere.com/search?q=" . urlenc($search);
 print $url;

Prints http://machine.somewhere.com/search?q=%D8stdal%2C%20%C5ge

urldec

Opposite of "urlenc".

Example, this returns ' ø'. That is space and ø.

 urldec('+%C3')

ht2t

ht2t is short for html-table to table.

This sub extracts an html-<table>s and returns its <tr>s and <td>s as an array of arrayrefs. And strips away any html inside the <td>s as well.

 my @table = ht2t($html,'some string occuring before the <table> you want');

Input: One or two arguments.

First argument: the html where a <table> is to be found and converted.

Second argument: (optional) If the html contains more than one <table>, and you do not want the first one, applying a second argument is a way of telling ht2t which to capture: the one with this word or string occurring before it.

Output: An array of arrayrefs.

ht2t() is a quick and dirty way of scraping (or harvesting as it is also called) data from a web page. Look too HTML::Parse to do this more accurate.

Example:

 use Acme::Tools;
 use LWP::Simple;
 for(
   ht2t(
     get("http://www.norges-bank.no/templates/article____200.aspx"),
     "Effektiv kronekurs"
   )
 ){
  my($country, $countrycode, $currency) = @$_;
  print "$country ($countrycode) uses $currency\n";
 }

Output:

 Australia (AUD) uses Dollar
 Belgia (BEF) uses Franc (Euro)
 Brasil (BRL) uses Real
 Bulgaria (BGN) uses Lev
 Canada (CAD) uses Dollar
 Danmark (DKK) uses Krone

...and so on.

FILES, DIRECTORIES

chall

Does chmod + utime + chown on one or more files.

Returns the number of files of which those operations was successful.

Mode, uid, gid, atime and mtime are set from the array ref in the first argument.

The first argument references an array which is exactly like an array returned from perls internal stat($filename) -function.

Example:

 my @stat=stat($filenameA);
 chall( \@stat, $filenameB, $filenameC, ... );

Copies the chmod, owner, group, access time and modify time from file A to file B and C.

See perldoc -f stat, perldoc -f chmod, perldoc -f chown, perldoc -f utime

makedir

Input: One or two arguments.

Works like perls mkdir() except that makedir() will create nesessary parent directories if they dont exists.

First input argument: A directory name (absolute, starting with / or relative).

Second input argument: (optional) permission bits. Using the normal 0777^umask() as the default if no second input argument is provided.

Example:

 makedir("dirB/dirC")

...will create directory dirB if it does not already exists, to be able to create dirC inside dirB.

Returns true on success, otherwise false.

makedir() memoizes directories it has checked for existence before (trading memory for speed).

See also perldoc -f mkdir, man umask

OTHER

qrlist

Input: An array of values to be used to test againts for existence.

Output: A reference to a regular expression. That is a qr//

The regex sets $1 if it match.

Example:

  my @list=qw/ABc XY DEF DEFG XYZ/;
  my $filter=qrlist("ABC","DEF","XY.");         # makes a regex of it qr/^(\QABC\E|\QDEF\E|\QXYZ\E)$/
  my @filtered= grep { $_ =~ $filter } @list;   # returns DEF and XYZ, but not XYZ

Note: hash lookups are WAY faster.

Source:

 sub qrlist (@) { my $str=join"|",map quotemeta, @_; qr/^($str)$/ }

ansicolor

Perhaps easier to use than Term::ANSIColor ?

Input: One argument. A string where the char ¤ have special meaning and is replaced by color codings depending on the letter following the ¤.

Output: The same string, but with ¤letter replaces by ANSI color codes respected by many types terminal windows. (xterm, telnet, ssh, telnet, rlog, vt100, cygwin, rxvt and such...).

Codes for ansicolor():

 ¤r red
 ¤g green
 ¤b blue
 ¤y yellow
 ¤m magenta
 ¤B bold
 ¤u underline
 ¤c clear
 ¤¤ reset, quits and returns to default text color.

Example:

 print ansicolor("This is maybe ¤ggreen¤¤?");

Prints This is maybe green? where the word green is shown in green.

If Term::ANSIColor is not installed or not found, returns the input string with every ¤ including the following code letters removed. (That is: ansicolor is safe to use even if Term::ANSIColor is not installed, you just don't get the colors).

See also Term::ANSIColor.

ccn_ok

Checks if a Credit Card number (CCN) has correct control digits according to the LUHN-algorithm from 1960. This method of control digits is used by MasterCard, Visa, American Express, Discover, Diners Club / Carte Blanche, JCB and others.

Input:

A credit card number. Can contain non-digits, but they are removed internally before checking.

Output:

Something true or false.

Or more accurately:

Returns undef (false) if the input argument is missing digits.

Returns 0 (zero, which is false) is the digits is not correct according to the LUHN algorithm.

Returns 1 or the name of a credit card company (true either way) if the last digit is an ok control digit for this ccn.

The name of the credit card company is returned like this (without the ' character)

 Returns (wo '')                Starts on                Number of digits
 ------------------------------ ------------------------ ----------------
 'MasterCard'                   51-55                    16
 'Visa'                         4                        13 eller 16
 'American Express'             34 eller 37              15
 'Discover'                     6011                     16
 'Diners Club / Carte Blanche'  300-305, 36 eller 38     14
 'JCB'                          3                        16
 'JCB'                          2131 eller 1800          15

And should perhaps have had:

 'enRoute'                      2014 eller 2149          15

...but that card uses either another control algorithm or no control digits at all. So enRoute is never returned here.

If the control digits is valid, but the input does not match anything in the column starts on, 1 is returned.

(This is also the same control digit mechanism used in Norwegian KID numbers on payment bills)

The first digit in a credit card number is supposed to tell what "industry" the card is meant for:

 MII Digit Value             Issuer Category
 --------------------------- ----------------------------------------------------
 0                           ISO/TC 68 and other industry assignments
 1                           Airlines
 2                           Airlines and other industry assignments
 3                           Travel and entertainment
 4                           Banking and financial
 5                           Banking and financial
 6                           Merchandizing and banking
 7                           Petroleum
 8                           Telecommunications and other industry assignments
 9                           National assignment

...although this has no meaning to Acme::Tools::ccn_ok().

The first six digits is Issuer Identifier, that is the bank (probably). The rest in the "account number", except the last digits, which is the control digit. Max length on credit card numbers are 19 digits.

KID_ok

Checks if a norwegian KID number has an ok control digit.

To check if a customer has typed the number correctly.

This uses the LUHN algorithm (also known as mod-10) from 1960 which is also used internationally in control digits for credit card numbers, and Canadian social security ID numbers as well.

The algorithm, as described in Phrack (47-8) (a long time hacker online publication):

 "For a card with an even number of digits, double every odd numbered
 digit and subtract 9 if the product is greater than 9. Add up all the
 even digits as well as the doubled-odd digits, and the result must be
 a multiple of 10 or it's not a valid card. If the card has an odd
 number of digits, perform the same addition doubling the even numbered
 digits instead."

Input: A KID-nummer. Must consist of digits 0-9 only, otherwise a die (croak) happens.

Output:

- Returns undef if the input argument is missing.

- Returns 0 if the control digit (the last digit) does not satify the LUHN/mod-10 algorithm.

- Returns 1 if ok

See also: "ccn_ok"

writefile

Justification:

Perl needs three or four operations to make a file out of a string:

 open my $FILE, '>', $filename  or die $!;
 print $FILE $text;
 close($FILE);

This is way simpler:

 writefile($filename,$text);

Sub writefile opens the file i binary mode (binmode()) and has two usage modes:

Input: Two arguments

First argument is the filename. If the file exists, its overwritten. If the file can not be opened for writing, a die (a carp really) happens.

Second input argument is one of:

  • Either a scaler. That is a normal string to be written to the file.

  • Or a reference to a scalar. That referred text is written to the file.

  • Or a reference to an array of scalars. This array is the written to the file element by element and \n is automatically appended to each element.

Alternativelly, you can write several files at once.

Example, this:

 writefile('file1.txt','The text....tjo');
 writefile('file2.txt','The text....hip');
 writefile('file3.txt','The text....and hop');

...is the same as this:

 writefile([
   ['file1.txt','The text....tjo'],
   ['file2.txt','The text....hip'],
   ['file3.txt','The text....and hop'],
 ]);

Output: Nothing (for the time being). die()s (croak($!) really) if something goes wrong.

readfile

Just as with "writefile" you can read in a whole file in one operation with readfile(). Instead of:

 open my $FILE,'<', $filename or die $!;
 my $data = join"",<$FILE>;
 close($FILE);

This is simpler:

 my $data = readfile($filename);

More examples:

Reading the content of the file to a scalar variable: (Any content in $data will be overwritten)

 my $data;
 readfile('filename.txt',\$data);

Reading the lines of a file into an array:

 my @lines;
 readfile('filnavn.txt',\@lines);
 for(@lines){
   ...
 }

Note: Chomp is done on each line. That is, any newlines (\n) will be removed. If @lines is non-empty, this will be lost.

Sub readfile is context aware. If an array is expected it returns an array of the lines without a trailing \n. The last example can be rewritten:

 for(readfile('filnavn.txt')){
   ...
 }

With two input arguments, nothing (undef) is returned from readfile().

readdirectory

Input:

Name of a directory.

Output:

A list of all files in it, except of . and .. (on linux/unix systems, all directories have a . and .. directory).

The names of all types of files are returned: normal files, sub directories, symbolic links, pipes, semaphores. That is every thing shown by ls -la except of . and ..

readdirectory does not recurce into sub directories.

Example:

  my @files = readdirectory("/tmp");

Why?

Sometimes calling the built ins opendir, readdir and closedir seems a bit tedious.

This:

 my $dir="/usr/bin";
 opendir(D,$dir);
 my @files=map "$dir/$_", grep {!/^\.\.?$/} readdir(D);
 closedir(D);

Is the same as this:

 my @files=readdirectory("/usr/bin");

See also: File::Find

Why not?

If you got a huge directory with tens or even houndreds of thousands of files, readdirectory() uses more memory than perls opendir/readdir. This isn't usually a concern anymore for "normal" modern computers, but might be the rationale behind perls more tedious way created back in the 80s. The same argument often goes for file slurping.

range

Input:

One, two or tre numeric arguments: x og y and jump.

Output:

If one argument: returns the array (0..x-1)

If two arguments: returns the array (x..y-1)

If three arguments: returns every jumpth number between x and y.

Dies (croaks) if there are zero or more than 3 arguments, or if the third argument is zero.

Examples:

 print join ",", range(11);      # prints 0,1,2,3,4,5,6,7,8,9,10      (but not 11)
 print join ",", range(2,11);    # prints 2,3,4,5,6,7,8,9,10          (but not 11)
 print join ",", range(11,2,-1); # prints 11,10,9,8,7,6,5,4,3
 print join ",", range(2,11,3);  # prints 2,5,8
 print join ",", range(11,2,-3); # prints 11,8,5
 print join ",", range(11,2,+3); # prints nothing

In the Python language, range is a build in and an iterator instead of an array. This saves memory for large sets.

permutations

What is permutations?

Six friends will be eating at a table with six chairs.

How many ways (permutations) can those six be placed when the number of chairs equal the number of people?

 If one person:          one
 If two persons:         two     (they can swap places)
 If three persons:       six
 If four persons:         24
 If five persons:        120
 If six  persons:        720

The formula is x! where the postfix operator !, also known as faculty is defined like: x! = x * (x-1) * (x-2) ... * 1. Example: 5! = 5 * 4 * 3 * 2 * 1 = 120.

Run this to see the 100 first n!

 perl -le 'use Math::BigInt lib=>'GMP';$i=Math::BigInt->new(1);print "$_! = ",$i*=$_ for 1..100'

  1!  = 1
  2!  = 2
  3!  = 6
  4!  = 24
  5!  = 120
  6!  = 720
  7!  = 5040
  8!  = 40320
  9!  = 362880
 10!  = 3628800
 .
 .
 .
 100! = 93326215443944152681699238856266700490715968264381621468592963895217599993229915608941463976156518286253697920827223758251185210916864000000000000000000000000

permutations() takes a list and return a list of arrayrefs for each of the permutations of the input list:

 permutations('a','b');     #returns (['a','b'],['b','a'])

 permutations('a','b','c'); #returns (['a','b','c'],['a','c','b'],
                            #         ['b','a','c'],['b','c','a'],
                            #         ['c','a','b'],['c','b','a'])

Up to five input arguments permutations() is probably as fast as it can be in this pure perl implementation (see source). For more than five, it could be faster. How fast is it now: Running with different n, this many time took that many seconds:

 n   times    seconds
 -- ------- ---------
  2  100000      0.32
  3  10000       0.09
  4  10000       0.33
  5  1000        0.18
  6  100         0.27
  7  10          0.21
  8  1           0.17
  9  1           1.63
 10  1          17.00

If the first argument is a coderef, that sub will be called for each permutation and the return from those calls with be the real return from permutations(). For example this:

 print for permutations(sub{join"",@_},1..3);

...will print the same as:

 print for map join("",@$_), permutations(1..3);

...but the first of those two uses less RAM if 3 has been say 9. Changing 3 with 10, and many computers hasn't enough memory for the latter.

The examples prints:

 123
 132
 213
 231
 312
 321

If you just want to say calculate something on each permutation, but is not interested in the list of them, you just don't take the return. That is:

 my $ant;
 permutations(sub{$ant++ if $_[-1]>=$_[0]*2},1..9);

...is the same as:

 $$_[-1]>=$$_[0]*2 and $ant++ for permutations(1..9);

...but the first uses next to nothing of memory compared to the latter. They have about the same speed. (The examples just counts the permutations where the last number is at least twice as large as the first)

permutations() was created to find all combinations of a persons name. This is useful in "fuzzy" name searches with String::Similarity if you can not be certain what is first, middle and last names. In foreign or unfamiliar names it can be difficult to know that.

trigram

Input: A string (i.e. a name). And an optional x (see example 2)

Output: A list of this strings trigrams (See examlpe)

Example 1:

 print join ", ", trigram("Kjetil Skotheim");

Prints:

 Kje, jet, eti, til, il , l S,  Sk, Sko, kot, oth, the, hei, eim

Example 2:

Default is 3, but here 4 is used instead in the second optional input argument:

 print join ", ", trigram("Kjetil Skotheim", 4);

And this prints:

 Kjet, jeti, etil, til , il S, l Sk,  Sko, Skot, koth, othe, thei, heim

trigram() was created for "fuzzy" name searching. If you have a database of many names, addresses, phone numbers, customer numbers etc. You can use trigram() to search among all of those at the same time. If the search form only has one input field. One general search box.

Store all of the trigrams of the trigram-indexed input fields coupled with each person, and when you search, you take each trigram of you query string and adds the list of people that has that trigram. The search result should then be sorted so that the persons with most hits are listed first. Both the query strings and the indexed database fields should have a space added first and last before trigram()-ing them.

This search algorithm is not includes here yet...

trigram() should perhaps have been named ngram for obvious reasons.

cart

Cartesian product

Easy usage:

Input: two or more arrayrefs with accordingly x, y, z and so on number of elements.

Output: An array of x * y * z number of arrayrefs. The arrays being the cartesian product of the input arrays.

It can be useful to think of this as joins in SQL. In select statements with more tables behind from, but without any where condition to join the tables.

Advanced usage, with condition(s):

Input:

- Either two or more arrayrefs with x, y, z and so on number of elements.

- Or coderefs to subs containing condition checks. Somewhat like where conditions in SQL.

Output: An array of x * y * z number of arrayrefs (the cartesian product) minus the ones that did not fulfill the condition(s).

This of is as joins with one or more where conditions as coderefs.

The coderef input arguments can be placed last or among the array refs to save both runtime and memory if the conditions depend on arrays further back.

Examples, this:

 for(cart(\@a1,\@a2,\@a3)){
   my($a1,$a2,$a3) = @$_;
   print "$a1,$a2,$a3\n";
 }

Prints the same as this:

 for my $a1 (@a1){
   for my $a2 (@a2){
     for my $a3 (@a3){
       print "$a1,$a2,$a3\n";
     }
   }
 }

And this: (with a condition: the sum of the first two should be dividable with 3)

 for( cart( \@a1, \@a2, sub{sum(@$_)%3==0}, \@a3 ) ) {
   my($a1,$a2,$a3)=@$_;
   print "$a1,$a2,$a3\n";
 }

Prints the same as this:

 for my $a1 (@a1){
   for my $a2 (@a2){
     next if 0==($a1+$a2)%3;
     for my $a3 (@a3){
       print "$a1,$a2,$a3\n";
     }
   }
 }

Examples, from the tests:

 my @a1 = (1,2);
 my @a2 = (10,20,30);
 my @a3 = (100,200,300,400);

 my $s = join"", map "*".join(",",@$_), cart(\@a1,\@a2,\@a3);
 ok( $s eq  "*1,10,100*1,10,200*1,10,300*1,10,400*1,20,100*1,20,200"
           ."*1,20,300*1,20,400*1,30,100*1,30,200*1,30,300*1,30,400"
           ."*2,10,100*2,10,200*2,10,300*2,10,400*2,20,100*2,20,200"
           ."*2,20,300*2,20,400*2,30,100*2,30,200*2,30,300*2,30,400");

 $s=join"",map "*".join(",",@$_), cart(\@a1,\@a2,\@a3,sub{sum(@$_)%3==0});
 ok( $s eq "*1,10,100*1,10,400*1,20,300*1,30,200*2,10,300*2,20,200*2,30,100*2,30,400");

reduce

From: Why Functional Programming Matters: http://www.md.chalmers.se/~rjmh/Papers/whyfp.pdf

http://www.md.chalmers.se/~rjmh/Papers/whyfp.html

DON'T TRY THIS AT HOME, C PROGRAMMERS.

 sub reduce (&@) {
   my ($proc, $first, @rest) = @_;
   return $first if @rest == 0;
   local ($a, $b) = ($first, reduce($proc, @rest));
   return $proc->();
 }

Many functions can then be easily implemented by using reduce. Such as:

 sub mean { (reduce {$a + $b} @_) / @_ }

int2roman

Converts integers to roman numbers.

Examples:

 print int2roman(1234);   # prints MCCXXXIV
 print int2roman(1971);   # prints MCMLXXI

Works for numbers up to 3999.

Subroutine from Peter J. Acklam (jacklam(&)math.uio.no) at Mathematical institutt at University of Oslo:

 I = 1
 V = 5
 X = 10
 L = 50
 C = 100 I<centum>
 D = 500
 M = 1000 I<mille>

See http://en.wikipedia.org/wiki/Roman_numbers for more.

num2code

See "code2num"

code2num

num2code() convert numbers (integers) from the normal decimal system to some arbitrary other number system. That can be binary (2), oct (8), hex (16) or others.

Example:

 print num2code(255,2,"0123456789ABCDEF");  # prints FF
 print num2code(14,2,"0123456789ABCDEF");   # prints 0E

...because 255 are converted to hex (0-F) with a return of 2 digits: FF ...and 14 are converted to 0E, with leading 0 because of the second argument 2.

Example:

 print num2code(1234,16,"01")

Prints the 16 binary digits 0000010011010010 which is 1234 converted to binary 0s and 1s.

To convert back:

 print code2num("0000010011010010","01");  #prints 1234

num2code() can be used to compress numeric IDs to something shorter:

 $chars='0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_';
 $code=num2code("241274432",5,$chars);

gcd

"The Euclidean algorithm (also called Euclid's algorithm) is an algorithm to determine the greatest common divisor (gcd) of two integers. It is one of the oldest algorithms known, since it appeared in the classic Euclid's Elements around 300 BC. The algorithm does not require factoring."

Input: two or more positive numbers (integers, without decimals that is)

Output: an integer

Example:

  print gcd(12, 8);   # prints 4

Because (prime number) factoring of 12 is 2 * 2 * 3 and factoring 4 is 2 * 2 and the common ('overlapping') for both 12 and 4 is then 2 * 2. The result is 4.

Example two:

  print gcd(90, 135, 315);               # prints 45
  print gcd(2*3*3*5, 3*3*3*5, 3*3*5*7);  # prints 45

...same tre numbers, 3*3*5 is common = 45.

 sub gcd { my($a,$b,@r)=@_; @r ? gcd($a,gcd($b,@r)) : $b==0 ? $a : gcd($b, $a % $b) }

lcm

lcm() finds the Least Common Multiple of two or more numbers (integers).

Input: two or more positive numbers (integers)

Output: an integer number

Example: 2/21 + 1/6 = 4/42 + 7/42 = 11/42

Where 42 = lcm(21,6).

Example:

  print lcm(45,120,75);   # prints 1800

Because the factors are:

  45 = 2^0 * 3^2 * 5^1
 120 = 2^3 * 3^1 * 5^1
  75 = 2^0 * 3^1 * 5^2

Take the bigest power of each primary number (2, 3 and 5 here). Which is 2^3, 3^2 and 5^2. Multiplied this is 8 * 9 * 25 = 1800.

 sub lcm { my($a,$b,@r)=@_; @r ? lcm($a,lcm($b,@r)) : $a*$b/gcd($a,$b) }

pivot

Resembles the pivot table function in Excel.

pivot() is used to spread out a slim and long table to a visually improved layout.

For instance spreading out the results of group by-selects from SQL:

 pivot( arrayref, columnname1, columnname2, ...)

 pivot( ref_to_array_of_arrayrefs, @list_of_names_to_down_fields )

The first argument is a ref to a two dimensional table.

The rest of the arguments is a list which also signals the number of columns from left in each row that is ending up to the left of the data table, the rest ends up at the top and the last element of each row ends up as data.

                   top1 top1 top1 top1
 left1 left2 left3 top2 top2 top2 top2
 ----- ----- ----- ---- ---- ---- ----
                   data data data data
                   data data data data
                   data data data data

Example:

 my @table=(
               ["1997","Gerd", "Weight", "Summer",66],
               ["1997","Gerd", "Height", "Summer",170],
               ["1997","Per",  "Weight", "Summer",75],
               ["1997","Per",  "Height", "Summer",182],
               ["1997","Hilde","Weight", "Summer",62],
               ["1997","Hilde","Height", "Summer",168],
               ["1997","Tone", "Weight", "Summer",70],
 
               ["1997","Gerd", "Weight", "Winter",64],
               ["1997","Gerd", "Height", "Winter",158],
               ["1997","Per",  "Weight", "Winter",73],
               ["1997","Per",  "Height", "Winter",180],
               ["1997","Hilde","Weight", "Winter",61],
               ["1997","Hilde","Height", "Winter",164],
               ["1997","Tone", "Weight", "Winter",69],
 
               ["1998","Gerd", "Weight", "Summer",64],
               ["1998","Gerd", "Height", "Summer",171],
               ["1998","Per",  "Weight", "Summer",76],
               ["1998","Per",  "Height", "Summer",182],
               ["1998","Hilde","Weight", "Summer",62],
               ["1998","Hilde","Height", "Summer",168],
               ["1998","Tone", "Weight", "Summer",70],
 
               ["1998","Gerd", "Weight", "Winter",64],
               ["1998","Gerd", "Height", "Winter",171],
               ["1998","Per",  "Weight", "Winter",74],
               ["1998","Per",  "Height", "Winter",183],
               ["1998","Hilde","Weight", "Winter",62],
               ["1998","Hilde","Height", "Winter",168],
               ["1998","Tone", "Weight", "Winter",71],
             );

.

 my @reportA=pivot(\@table,"Year","Name");
 print "\n\nReport A\n\n".tablestring(\@reportA);

Will print:

 Report A
 
 Year Name  Height Height Weight Weight
            Summer Winter Summer Winter
 ---- ----- ------ ------ ------ ------
 1997 Gerd  170    158    66     64
 1997 Hilde 168    164    62     61
 1997 Per   182    180    75     73
 1997 Tone                70     69
 1998 Gerd  171    171    64     64
 1998 Hilde 168    168    62     62
 1998 Per   182    183    76     74
 1998 Tone                70     71

.

 my @reportB=pivot([map{$_=[@$_[0,3,2,1,4]]}(@t=@table)],"Year","Season");
 print "\n\nReport B\n\n".tablestring(\@reportB);

Will print:

 Report B
 
 Year Season Height Height Height Weight Weight Weight Weight
             Gerd   Hilde  Per    Gerd   Hilde  Per    Tone
 ---- ------ ------ ------ -----  -----  ------ ------ ------
 1997 Summer 170    168    182    66     62     75     70
 1997 Winter 158    164    180    64     61     73     69
 1998 Summer 171    168    182    64     62     76     70
 1998 Winter 171    168    183    64     62     74     71

.

 my @reportC=pivot([map{$_=[@$_[1,2,0,3,4]]}(@t=@table)],"Name","Attributt");
 print "\n\nReport C\n\n".tablestring(\@reportC);

Will print:

 Report C
 
 Name  Attributt 1997   1997   1998   1998
                 Summer Winter Summer Winter
 ----- --------- ------ ------ ------ ------
 Gerd  Height     170    158    171    171
 Gerd  Weight      66     64     64     64
 Hilde Height     168    164    168    168
 Hilde Weight      62     61     62     62
 Per   Height     182    180    182    183
 Per   Weight      75     73     76     74
 Tone  Weight      70     69     70     71

.

 my @reportD=pivot([map{$_=[@$_[1,2,0,3,4]]}(@t=@table)],"Name");
 print "\n\nReport D\n\n".tablestring(\@reportD);

Will print:

 Report D
 
 Name  Height Height Height Height Weight Weight Weight Weight
       1997   1997   1998   1998   1997   1997   1998   1998
       Summer Winter Summer Winter Summer Winter Summer Winter
 ----- ------ ------ ------ ------ ------ ------ ------ ------
 Gerd  170    158    171    171    66     64     64     64
 Hilde 168    164    168    168    62     61     62     62
 Per   182    180    182    183    75     73     76     74
 Tone                              70     69     70     71

Options:

Options to sort differently and show sums and percents are available. (...MORE DOC ON THAT LATER...)

See also Data::Pivot

tablestring

Input: a reference to an array of arrayrefs -- a two dimensional table of strings and numbers

Output: a string containing the textual table -- a string of two or more lines

The first arrayref in the list refers to a list of either column headings (scalar) or ... (...more later...)

In this output table:

- the columns will not be wider than necessary by its widest value (any <html>-tags are removed in every internal width-calculation)

- multi-lined cell values are handled also

- and so are html-tags, if the output is to be used inside <pre>-tags on a web page.

- columns with just numeric values are right justified (header row excepted)

Example:

 print tablestring([
   [qw/AA BB CCCC/],
   [123,23,"d"],
   [12,23,34],
   [77,88,99],
   ["lin\nes",12,"asdff\nfdsa\naa"],[0,22,"adf"]
 ]);

Prints this string of 11 lines:

 AA  BB CCCC
 --- -- -----
 123 23 d
 12  23 34
 77   8 99
 
 lin 12 asdff
 es     fdsa
        aa
 
 10  22 adf

As you can see, rows containing multi-lined cells gets an empty line before and after the row to separate it more clearly.

upper

Returns input string as uppercase.

Used if perls build in uc() for some reason does not convert æøå and other letters outsize a-z.

æøåäëïöüÿâêîôûãõàèìòùáéíóúýñ => ÆØÅÄËÏÖÜ?ÂÊÎÔÛÃÕÀÈÌÒÙÁÉÍÓÚÝÑ

See also perldoc -f uc

lower

Returns input string as lowercase.

Used if perls build in lc() for some reason does not convert ÆØÅ and other letters outsize A-Z.

ÆØÅÄËÏÖÜ?ÂÊÎÔÛÃÕÀÈÌÒÙÁÉÍÓÚÝÑ => æøåäëïöü?âêîôûãõàèìòùáéíóúýñ

See also perldoc -f lc

serialize

Returns a data structure as a string. See also Data::Dumper (serialize was created long time ago before Data::Dumper appeared on CPAN, before CPAN even...)

Input: One to four arguments.

First argument: A reference to the structure you want.

Second argument: (optional) The name the structure will get in the output string. If second argument is missing or is undef or '', it will get no name in the output.

Third argument: (optional) The string that is returned is also put into a created file with the name given in this argument. Putting a > char in from of the filename will append that file instead. Use '' or undef to not write to a file if you want to use a fourth argument.

Fourth argument: (optional) A number signalling the depth on which newlines is used in the output. The default is infinite (some big number) so no extra newlines are output.

Output: A string containing the perl-code definition that makes that data structure. The input reference (first input argument) can be to an array, hash or a string. Those can contain other refs and strings in a deep data structure.

Limitations:

- Code refs are not handled (just returns sub{die()})

- Regex, class refs and circular recursive structures are also not handled.

Examples:

  $a = 'test';
  @b = (1,2,3);
  %c = (1=>2, 2=>3, 3=>5, 4=>7, 5=>11);
  %d = (1=>2, 2=>3, 3=>\5, 4=>7, 5=>11, 6=>[13,17,19,{1,2,3,'asdf\'\\\''}],7=>'x');
  print serialize(\$a,'a');
  print serialize(\@b,'tab');
  print serialize(\%c,'c');
  print serialize(\%d,'d');
  print serialize(\("test'n roll",'brb "brb"'));
  print serialize(\%d,'d',undef,1);

Prints accordingly:

 $a='test';
 @tab=('1','2','3');
 %c=('1','2','2','3','3','5','4','7','5','11');
 %d=('1'=>'2','2'=>'3','3'=>\'5','4'=>'7','5'=>'11','6'=>['13','17','19',{'1'=>'2','3'=>'asdf\'\\\''}]);
 ('test\'n roll','brb "brb"');
 %d=('1'=>'2',
 '2'=>'3',
 '3'=>\'5',
 '4'=>'7',
 '5'=>'11',
 '6'=>['13','17','19',{'1'=>'2','3'=>'asdf\'\\\''}],
 '7'=>'x');

Areas of use:

- Debugging (first and foremost)

- Storing arrays and hashes and data structures of those on file, database or sending them over the net

- eval earlier stored string to get back the data structure

Be aware of the security implications of evaling a perl code string stored somewhere that unauthorized users can change them! You are probably better of using YAML::Syck or Storable without enabling the CODE-options if you have such security issues. See perldoc Storable or perldoc B::Deparse for how to decompile perl.

dserialize

Debug-serialize, dumping data structures for you to look at.

Same as serialize() but the output is given a newline every 80th character. (Every 80th or whatever $Acme::Tools::Dserialize_width contains)

easter

Input: A year, four digits

Output: array of two numbers: month and date of Easter Sunday that year. Month 3 means March and 4 means April.

 sub easter { use integer;my$Y=shift;my$C=$Y/100;my$L=($C-$C/4-($C-($C-17)/25)/3+$Y%19*19+15)%30;
             (($L-=$L>28||($L>27?1-(21-$Y%19)/11:0))-=($Y+$Y/4+$L+2-$C+$C/4)%7)<4?($L+28,3):($L-3,4) }

...is a "golfed" version of Oudins algorithm (1940) http://astro.nmsu.edu/~lhuber/leaphist.html (see also http://www.smart.net/~mmontes/ec-cal.html )

Valid for any Gregorian year. Dates repeat themselves after 70499183 lunations = 2081882250 days = ca 5699845 year ...but the earth will before that have a different rotation time around the sun and spin time around itself...

time_fp

No input arguments.

Return the same number as perls time() except with decimals (fractions of a second, _fp as in floating point number).

 print time_fp(),"\n";
 print time(),"\n";

Could write:

 1116776232.38632

...if that is the time now.

Or just:

 1116776232

...from perl's internal time() if Time::HiRes isn't installed and available.

sleep_fp

sleep_fp() works just as the built in sleep(), but accepts fractional seconds. Example:

 sleep_fp(0.02);  # sleeps for 20 milliseconds

Sub sleep_fp requires Time::HiRes internally, thus it might take some extra time first time called. To avoid that, use use Time::HiRes in your code.

bytes_readable

Input: a number

Output:

the number with a B behind if the number is less than 1000

the number divided by 1024 with two decimals and "kB" behind if the number is less than 1024*1000

the number divided by 1048576 with two decimals and "MB" behind if the number is less than 1024*1024*1000

the number divided by 1073741824 with two decimals and "GB" behind if the number is less than 1024*1024*1024*1000

the number divided by 1099511627776 with two decimals and "TB" behind otherwise

Examples:

 print bytes_readable(999);                              # 999 B
 print bytes_readable(1000);                             # 0.98 kB
 print bytes_readable(1024);                             # 1.00 kB
 print bytes_readable(1153433.6);                        # 1.10 MB
 print bytes_readable(1181116006.4);                     # 1.10 GB
 print bytes_readable(1209462790553.6);                  # 1.10 TB
 print bytes_readable(1088516511498.24*1000);            # 990.00 TB

BLOOM FILTER SUBROUTINES

Bloom filters can be used to check whether an element (a string) is a member of a large set using much less memory or disk space than other data structures. Trading speed and accuracy for memory usage. While risking false positives, Bloom filters have a very strong space advantage over other data structures for representing sets.

In the example below, a set of 100000 phone numbers (or any string of any length) can be "stored" in just 11992 bytes if you accept that you can only check the data structure for existence of a string and accept false positives with an error rate of 0.01 (that is one percent, error rates are given in numbers larger then 0 and smaller than 1).

You can not retrieve the strings in the set without using "brute force" methods and even then you would get slightly more strings than you put in because of the error rate inaccuracy.

Bloom Filters have many uses.

See also: http://en.wikipedia.org/wiki/Bloom_filter

See also: Bloom::Filter

bfinit

Initialize a new Bloom Filter:

  my $bf = bfinit( error_rate=>0.01, capacity=>100000 );

The same:

  my $bf = bfinit( 0.01, 100000 );

since two arguments is interpreted as error_rate and capacity accordingly.

bfadd

  bfadd($bf, $_) for @phone_numbers;   # Adding strings one at a time

  bfadd($bf, @phone_numbers);          # ...or all at once (faster)

Returns 1 on success. Dies (croaks) if more strings than capacity is added.

bfcheck

  my $phone_number="97713246";
  if ( bfcheck($bf, $phone_number) ) {
    print "Yes, $phone_number was PROBABLY added\n";
  }
  else{
    print "No, $phone_number was DEFINITELY NOT added\n";
  }

Returns true if $phone_number exists in @phone_numbers.

Returns false most of the times, but sometimes true*), if $phone_number doesn't exists in @phone_numbers.

*) This is called a false positive.

Checking more than one key:

 @bools = bfcheck($bf, @keys);    #or ...
 @bools = bfcheck($bf, \@keys);   #better if @keys is large

Returns an array the same size as @keys where each element is true or false accordingly.

bfgrep

Same as bfcheck except it returns the keys that exists in the bloom filter

 @found = bfgrep($bf, @keys);           #or ...
 @found = bfgrep($bf, \@keys);          #better if @keys is large, or ...
 @found = grep bfcheck($bf,$_), @keys;  #same but slower

bfgrepnot

Same as bfgrep except it returns the keys that NOT exists in the bloom filter:

 @not_found = bfgrepnot($bf, @keys);          #or ...
 @not_found = bfgrepnot($bf, \@keys);         #better if @keys is large
 @not_found = grep !bfcheck($bf,$_), @keys);  #same but slower

bfdelete

Deletes from a counting bloom filter.

To enable deleting be sure to initialize the bloom filter with the numeric counting_bits argument. The number of bits could be 2 or 3 for small filters with a small capacity (a small number of keys), but setting the number to 4 ensures that even very large filters with very small error rates would not overflow.

Acme::Tools does not currently support counting_bits => 3 so 4 or 8 is the only practical alternatives.

 my $bf=bfinit(
   error_rate=>0.001,
   capacity=>10e6,
   counting_bits=>4     # a power of 2, i.e. 2, 4, 8, 16 or 32
 );
 bfadd(   $bf, @phone_numbers);     # make sure the phone numbers are unique!
 bfdelete($bf, @phone_numbers);

To examine the frequency of the counters with 4 bit counters and 4 million keys:

 my $bf=bfinit( error_rate=>0.001, capacity=>4e6, counting_bits=>4 );
 bfadd($bf,[1e3*$_+1 .. 1e3*($_+1)]) for 0..4000-1;  # adding 4 million keys one thousand at a time
 my %c; $c{vec($$bf{filter},$_,$$bf{counting_bits})}++ for 0..$$bf{filterlength}-1;
 printf "%8d counters is %2d\n",$c{$_},$_ for sort{$a<=>$b}keys%c;

The output:

 28689562 counters is  0
 19947673 counters is  1
  6941082 counters is  2
  1608250 counters is  3
   280107 counters is  4
    38859 counters is  5
     4533 counters is  6
      445 counters is  7
       46 counters is  8
        1 counters is  9

Even after the error_rate is changed from 0.001 to a percent of that, 0.00001, the limit of 16 (4 bits) is still far away:

 47162242 counters is  0
 33457237 counters is  1
 11865217 counters is  2
  2804447 counters is  3
   497308 counters is  4
    70608 counters is  5
     8359 counters is  6
      858 counters is  7
       65 counters is  8
        4 counters is  9

In algorithmic terms the number of bits needed is ln of ln of n. Thats why 4 bits (counters up to 15) is "always" good enough.

(Except when adding the same key many times, Acme::Tools::bfadd does not check for that).

Counting bloom filters are not very space efficient: The tables above shows that 84%-85% of the counters are 0 or 1. Most bits are zero-bits.

Deletion of non-existing keys bfdelete croaks on deletion of a non-existing key

bfdelete

Deletes from a counting bloom filter:

 bfdelete($bf, @keys);
 bfdelete($bf, \@keys);

Returns $bf after deletion.

bfaddbf

Adds another bloom filter to a bloom filter.

Bloom filters has the proberty that bit-wise or-ing the bit-filters of two filters with the same capacity and the same number and type of hash functions, adds the filters:

  my $bf1=bfinit(error_rate=>0.01,capacity=>$cap,keys=>[1..500]);
  my $bf2=bfinit(error_rate=>0.01,capacity=>$cap,keys=>[501..1000]);

  bfaddbf($bf1,$bf2);

  print "Yes!" if bfgrep($bf1, 1..1000) == 1000;

Prints yes since bfgrep now returns an array of all the 1000 elements.

Croaks if the filters are of different dimensions.

Works for counting bloom filters as well (counting_bits=4> e.g.)

bfsum

Returns the number of 1's in the filter.

 my $percent=100*bfsum($bf)/$$bf{filterlength};
 printf "The filter is %.1f%% filled\n",$percent; #prints 50.0% or so if filled to capacity

Sums the counters for counting bloom filters (much slower than for non counting).

bfdimensions

Input, two numeric arguments: Capacity and error_rate.

Outputs an array of two numbers: m and k.

  m = - n * log(p) / log(2)**2   # n = capacity, m = bits in filter (divide by 8 to get bytes)
  k = log(1/p) / log(2)          # p = error_rate, uses perls internal log() with base e (2.718)

...that is: m = the best number of bits in the filter and k = the best number of hash functions optimized for the given capacity (n) and error_rate (p). Note that k is a dependent only of the error_rate. At about two percent error rate the bloom filter needs just the same number of bytes as the number of keys.

 Storage (bytes):
 Capacity      Error-rate  Error-rate Error-rate Error-rate Error-rate Error-rate Error-rate Error-rate Error-rate Error-rate Error-rate Error-rate
               0.000000001 0.00000001 0.0000001  0.000001   0.00001    0.0001     0.001      0.01       0.02141585 0.1        0.5        0.99
 ------------- ----------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- 
            10 54.48       48.49      42.5       36.51      30.52      24.53      18.53      12.54      10.56      6.553      2.366      0.5886
           100 539.7       479.8      419.9      360        300.1      240.2      180.3      120.4      100.6      60.47      18.6       0.824
          1000 5392        4793       4194       3595       2996       2397       1798       1199       1001       599.6      180.9      3.177
         10000 5.392e+04   4.793e+04  4.194e+04  3.594e+04  2.995e+04  2.396e+04  1.797e+04  1.198e+04  1e+04      5991       1804       26.71
        100000 5.392e+05   4.793e+05  4.193e+05  3.594e+05  2.995e+05  2.396e+05  1.797e+05  1.198e+05  1e+05      5.991e+04  1.803e+04  262
       1000000 5.392e+06   4.793e+06  4.193e+06  3.594e+06  2.995e+06  2.396e+06  1.797e+06  1.198e+06  1e+06      5.991e+05  1.803e+05  2615
      10000000 5.392e+07   4.793e+07  4.193e+07  3.594e+07  2.995e+07  2.396e+07  1.797e+07  1.198e+07  1e+07      5.991e+06  1.803e+06  2.615e+04
     100000000 5.392e+08   4.793e+08  4.193e+08  3.594e+08  2.995e+08  2.396e+08  1.797e+08  1.198e+08  1e+08      5.991e+07  1.803e+07  2.615e+05
    1000000000 5.392e+09   4.793e+09  4.193e+09  3.594e+09  2.995e+09  2.396e+09  1.797e+09  1.198e+09  1e+09      5.991e+08  1.803e+08  2.615e+06
   10000000000 5.392e+10   4.793e+10  4.193e+10  3.594e+10  2.995e+10  2.396e+10  1.797e+10  1.198e+10  1e+10      5.991e+09  1.803e+09  2.615e+07
  100000000000 5.392e+11   4.793e+11  4.193e+11  3.594e+11  2.995e+11  2.396e+11  1.797e+11  1.198e+11  1e+11      5.991e+10  1.803e+10  2.615e+08
 1000000000000 5.392e+12   4.793e+12  4.193e+12  3.594e+12  2.995e+12  2.396e+12  1.797e+12  1.198e+12  1e+12      5.991e+11  1.803e+11  2.615e+09

Error rate: 0.99 Hash functions: 1 Error rate: 0.5 Hash functions: 1 Error rate: 0.1 Hash functions: 3 Error rate: 0.0214158522653385 Hash functions: 6 Error rate: 0.01 Hash functions: 7 Error rate: 0.001 Hash functions: 10 Error rate: 0.0001 Hash functions: 13 Error rate: 0.00001 Hash functions: 17 Error rate: 0.000001 Hash functions: 20 Error rate: 0.0000001 Hash functions: 23 Error rate: 0.00000001 Hash functions: 27 Error rate: 0.000000001 Hash functions: 30

bfstore

Storing and retrieving bloom filters to and from disk uses Storables store and retrieve. This:

 bfstore($bf,'filename.bf');

It the same as:

 use Storable qw(store retrieve);
 ...
 store($bf,'filename.bf');

bfretrieve

This:

 my $bf=bfretrieve('filename.bf');

Or this:

 my $bf=bfinit('filename.bf');

Is the same as:

 use Storable qw(store retrieve);
 my $bf=retrieve('filename.bf');

bfclone

Deep copies the bloom filter data structure. (Which is not very deep)

This:

 my $bfc = bfclone($bf);

Works just as:

 use Storable;
 my $bfc=Storable::dclone($bf);

Object oriented interface to bloom filters

 my $bf=new Acme::Tools::BloomFilter(0.1,1000); #see bfinit above, the same as new
 print ref($bf),"\n";                        # prints Acme::Tools:BloomFilter
 $bf->add(@keys);
 $bf->check($keys[0]) and print "ok\n";      # prints ok
 $bf->grep(\@keys)==@keys and print "ok\n";  # prints ok
 $bf->store('filename.bf');
 my $bf2=bfretrieve('filename.bf');
 $bf2->check($keys[0]) and print "ok\n";  # still ok

 $bf2=$bf->clone();

To instantiate a previously stored bloom filter:

 my $bf = Acme::Tools::BloomFilter->new( '/path/to/stored/bloomfilter.bf' );

The o.o. interface has the same methods as the bf...-subs without the bf-prefix in the names. The bfretrieve is not available as a method, although bfretrieve, Acme::Tools::bfretrieve and Acme::Tools::BloomFilter::retrieve are synonyms.

Internals and speed

The internal hash-functions are md5( "$key$salt" ) from Digest::MD5.

Since md5 returns 128 bits and most medium to large sized bloom filters need only a 32 bit hash function, the result from md5() are split (unpack-ed) into 4 parts 32 bits each and are treated as if 4 hash functions was called. Using different salts to the key on each md5 results in different hash functions.

Digest::SHA512 would have been better, but its slower than Digest::MD5.

String::CRC32::crc32 is faster than Digest::MD5, but not 4 times faster:

 time perl -e'use Digest::MD5 qw(md5);md5("asdf$_") for 1..10e6'       #5.56 sec
 time perl -e'use String::CRC32;crc32("asdf$_") for 1..10e6'           #2.79 sec, faster but not per bit
 time perl -e'use Digest::SHA qw(sha512);sha512("asdf$_") for 1..10e6' #36.10 sec, too slow (sha1, sha224, sha256 and sha384 too)

Theory and math behind bloom filters

http://www.internetmathematics.org/volumes/1/4/Broder.pdf

http://blogs.sun.com/jrose/entry/bloom_filters_in_a_nutshell

http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html

See also Scaleable Bloom Filters: http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf (not implemented here)

...and perhaps http://intertrack.naist.jp/Matsumoto_IEICE-ED200805.pdf

HISTORY

Release history

 0.13   Oct 2010   Non-linux test issue, resolve. improved: bloom filter, tests, doc
 0.12   Oct 2010   Improved tests, doc, bloom filter, random_gauss, bytes_readable
 0.11   Dec 2008   Improved doc
 0.10   Dec 2008

SEE ALSO

AUTHOR

Kjetil Skotheim, <kjetil.skotheim@gmail.com>, <kjetil.skotheim@usit.uio.no>

COPYRIGHT AND LICENSE

1995-2010, Kjetil Skotheim

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

1 POD Error

The following errors were encountered while parsing the POD:

Around line 1591:

Non-ASCII character seen before =encoding in '$search="Østdal,'. Assuming CP1252