Data::Freq - Collects data, counts frequency, and makes up a multi-level counting report
Version 0.04
    use Data::Freq;

    my $data = Data::Freq->new('date');

    while (my $line = <STDIN>) {
        $data->add($line);
    }

    $data->output();
Data::Freq is an object-oriented module to collect data from log files or any kind of data sources, count frequency of particular patterns, and generate a counting report.
See also the command-line tool data-freq.
The simplest usage is to count the lines of a log file by a particular category such as date, username, remote address, and so on.
For more advanced usage, Data::Freq is capable of aggregating counting results at multiple levels. For example, lines of a log file can be grouped into months first, and then under each of the months, they can be further grouped into individual days, where all the frequency of both months and days is summed up consistently.
The example below is a copy from the "SYNOPSIS" section.
    my $data = Data::Freq->new('date');

    while (my $line = <STDIN>) {
        $data->add($line);
    }

    $data->output();
It will generate a report that looks something like this:
    123: 2012-01-01
    456: 2012-01-02
    789: 2012-01-03
    ...
where the left column shows the number of occurrences of each date.
The date/time value is automatically extracted from the log line, where the first field enclosed by a pair of brackets [...] is parsed as a date/time text by the Date::Parse::str2time() function. (See Date::Parse.)
See also "logsplit" in Data::Freq::Record.
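The extraction step can be pictured with the rough sketch below, which pulls out the first bracketed field with a regular expression and normalizes it to a day. It is not the module's implementation: it uses the core Time::Piece module instead of Date::Parse, assumes an Apache-style timestamp, drops the time zone offset for simplicity, and bracketed_day() is a hypothetical helper name.

```perl
use strict;
use warnings;
use Time::Piece;

# Hypothetical helper: returns 'YYYY-MM-DD' for the first [...] field, or undef.
sub bracketed_day {
    my ($line) = @_;

    # Capture the text inside the first pair of brackets.
    my ($stamp) = $line =~ /\[([^\]]+)\]/ or return;

    # Drop a trailing time zone offset such as " +0900" (simplification).
    $stamp =~ s/\s+[+-]\d{4}$//;

    # Assumed timestamp format: 01/Jan/2012:01:02:03
    my $t = eval { Time::Piece->strptime($stamp, '%d/%b/%Y:%H:%M:%S') } or return;
    return $t->strftime('%Y-%m-%d');
}

my $line = '127.0.0.1 - user1 [01/Jan/2012:01:02:03 +0900] "GET / HTTP/1.1" 200 44';
print bracketed_day($line), "\n";   # prints 2012-01-01
```

Date::Parse::str2time() accepts many more formats than this fixed pattern, which is why the module relies on it rather than on a single format string.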
The initialization parameters for the new() method can be customized for a multi-level analysis.
If the field specifications are given, e.g.
    Data::Freq->new(
        {type => 'date'},           # field spec for level 1
        {type => 'text', pos => 2}, # field spec for level 2
    );
    # assuming the position 2 (third portion, 0-based)
    # is the remote username.
then the output will look like this:
    123: 2012-01-01
      100: user1
       20: user2
        3: user3
    456: 2012-01-02
      400: user1
       50: user2
        6: user3
    ...
Below is another example along this line:
    Data::Freq->new('month', 'day');
        # Level 1: 'month'
        # Level 2: 'day'
with the output:
    12300: 2012-01
      123: 2012-01-01
      456: 2012-01-02
      789: 2012-01-03
      ...
    45600: 2012-02
      456: 2012-02-01
      789: 2012-02-02
      ...
See "field specification" for more details about the initialization parameters.
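The two-level month/day grouping can be thought of as a nested tally in which every record increments one counter per level. The sketch below illustrates the idea with plain hashes; it is not how Data::Freq stores its data, and the sample dates are made up.

```perl
use strict;
use warnings;

# Made-up sample input: one normalized date per record.
my @days = qw(2012-01-01 2012-01-01 2012-01-02 2012-02-01);

my %tree;   # month => { count => N, days => { day => N } }
for my $day (@days) {
    my $month = substr($day, 0, 7);   # '2012-01-01' -> '2012-01'
    $tree{$month}{count}++;           # level 1: month total
    $tree{$month}{days}{$day}++;      # level 2: day total under the month
}

# Build the report: each month, then its days indented by two spaces.
my @report;
for my $month (sort keys %tree) {
    push @report, "$tree{$month}{count}: $month";
    for my $day (sort keys %{ $tree{$month}{days} }) {
        push @report, "  $tree{$month}{days}{$day}: $day";
    }
}
print "$_\n" for @report;
```

Because both counters are incremented for the same record, each month total always equals the sum of the day totals beneath it.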
The data source is not restricted to log files. For example, a CSV file can be analyzed as below:
    my $data = Data::Freq->new({pos => 0}, {pos => 1});
    # or more simply, Data::Freq->new(0, 1);

    open(my $csv, '<', 'source.csv') or die "Cannot open source.csv: $!";

    while (<$csv>) {
        chomp;   # keep the trailing newline out of the last column
        $data->add([split /,/]);
    }
Note: the add() method also accepts an array ref, in which case the input is used as given instead of being split by the default logsplit() function. (See "logsplit" in Data::Freq::Record.)
For more generic input data, a hash ref can also be given to the add() method.
E.g.
    my $data = Data::Freq->new({key => 'x'}, {key => 'y'});
    # Note: keys *cannot* be abbreviated like Data::Freq->new('x', 'y')

    $data->add({x => 'foo', y => 'abc'});
    $data->add({x => 'bar', y => 'def'});
    $data->add({x => 'foo', y => 'ghi'});
    $data->add({x => 'bar', y => 'jkl'});
    ...
In the field specifications, the value of pos or key can also be an array ref, in which case the elements selected by the pos or key are joined by a space (or the value of $").
This is useful when a log format contains a date that is not enclosed by a pair of brackets [...].
    my $data = Data::Freq->new({type => 'date', pos => [0..3]});

    # Log4x with %d{dd MMM yyyy HH:mm:ss,SSS}
    $data->add("01 Jan 2012 01:02:03,456 INFO - test log\n");

    # pos 0: "01"
    # pos 1: "Jan"
    # pos 2: "2012"
    # pos 3: "01:02:03,456"
As a result, "01 Jan 2012 01:02:03,456" will be parsed as a date string.
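The effect of an array ref for pos can be sketched in plain Perl as an array slice joined with $". The example below is only an approximation: select_fields() is a hypothetical helper, and it splits on whitespace, whereas the module's logsplit() function applies its own rules.

```perl
use strict;
use warnings;

# Hypothetical helper: pick fields by index and join them with $"
# (the list separator, a single space by default).
sub select_fields {
    my ($line, @pos) = @_;
    my @fields = split ' ', $line;   # plain whitespace split (simplification)
    return join $", @fields[@pos];
}

my $line = "01 Jan 2012 01:02:03,456 INFO - test log\n";
print select_fields($line, 0 .. 3), "\n";   # prints 01 Jan 2012 01:02:03,456
```

The joined string is what would then be handed to the date parser for a field of type 'date'.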
The output() method accepts different types of parameters as below:
A file handle or an instance of IO::*
By default, the result is printed out to STDOUT. With this parameter given, it can be any other output destination.
A callback subroutine ref
If a callback is specified, it will be invoked with a node object (Data::Freq::Node) passed as an argument. See "frequency tree" for more details about the tree structure.
Roughly, each node represents a counting result for each line in the default output format, in the depth-first order (i.e. the same order as the default output lines).
    $data->output(sub {
        my $node = shift;
        print "Count: ", $node->count, "\n";
        print "Value: ", $node->value, "\n";
        print "Depth: ", $node->depth, "\n";
        print "\n";
    });
A hash ref of options to control output format
    $data->output({
        with_root  => 0   , # also prints total (root node)
        transpose  => 0   , # prints values before counts
        indent     => ' ' , # repeats (depth - 1) times
        separator  => ': ', # separates the count and the value
        prefix     => ''  , # prepended before the count
        no_padding => 0   , # disables padding for the count
    });
The format option can be specified together with a file handle.
    $data->output(\*STDERR, {indent => "\t"});
The output does not include the grand total by default. If the with_root option is set to a true value, the total count will be printed as the first line (level 0), and all the subsequent levels will be shifted to the right.
The transpose option flips the order of the count and the value in each line. E.g.
    2012-01: 12300
      2012-01-01: 123
      2012-01-02: 456
      2012-01-03: 789
      ...
    2012-02: 45600
      2012-02-01: 456
      2012-02-02: 789
      ...
The indent unit (repeated appropriate times) and the separator (between the count and the value) can be customized with the respective options, indent and separator.
The default output format has apparent ambiguity between the indent and the padding for alignment.
For example, consider the output below:
    1200000: Level 1
      900000: Level 2
        900000: Level 3
           5: Level 2
    ...
where the second "Level 2" appears to have a deeper indent than the "Level 3."
Although the positions of colons (:) are consistently aligned, it may seem to be slightly inconsistent.
The indent depth will be clearer if a prefix is added:
    $data->output({prefix => '* '});

    * 1200000: Level 1
      * 900000: Level 2
        * 900000: Level 3
      *      5: Level 2
    ...
Alternatively, the no_padding option can be set to a true value to disable the left padding.
    $data->output({no_padding => 1});

    1200000: Level 1
      900000: Level 2
        900000: Level 3
      5: Level 2
    ...
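The interplay of indent, prefix, and padding can be reproduced with a small formatter. The sketch below is not taken from the module: format_rows() is a hypothetical helper, and it assumes a two-space indent unit and that each count is right-aligned to the widest count at the same depth.

```perl
use strict;
use warnings;

# Hypothetical formatter. Assumptions: two-space indent unit by default,
# and counts padded to the widest count found at the same depth.
sub format_rows {
    my ($rows, %opt) = @_;
    my $indent = defined $opt{indent} ? $opt{indent} : '  ';
    my $prefix = defined $opt{prefix} ? $opt{prefix} : '';

    # Widest count per depth, used for the left padding.
    my %width;
    for my $row (@$rows) {
        my ($depth, $count) = @$row;
        my $len = length $count;
        $width{$depth} = $len if !$width{$depth} || $len > $width{$depth};
    }

    return map {
        my ($depth, $count, $value) = @$_;
        my $pad = $opt{no_padding} ? '' : ' ' x ($width{$depth} - length $count);
        ($indent x ($depth - 1)) . $prefix . $pad . "$count: $value";
    } @$rows;
}

# Each row is [depth, count, value], already in depth-first order.
my @rows = (
    [1, 1200000, 'Level 1'],
    [2,  900000, 'Level 2'],
    [3,  900000, 'Level 3'],
    [2,       5, 'Level 2'],
);

print "$_\n" for format_rows(\@rows);                    # default padding
print "$_\n" for format_rows(\@rows, prefix => '* ');    # prefix marks the depth
print "$_\n" for format_rows(\@rows, no_padding => 1);   # no alignment padding
```

With the default settings the last row is pushed right by its padding, which is what makes it look more deeply nested than it is; the prefix or no_padding variants remove that ambiguity.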
Each argument passed to the new() method is passed on to the new() method of Data::Freq::Field. (See "new" in Data::Freq::Field.)
For example,
    Data::Freq->new(
        'month',
        'day',
    );
is equivalent to
    Data::Freq->new(
        Data::Freq::Field->new('month'),
        Data::Freq::Field->new('day'),
    );
and because of the way the argument is interpreted by the Data::Freq::Field class, it is also equivalent to
    Data::Freq->new(
        Data::Freq::Field->new({type => 'month'}),
        Data::Freq::Field->new({type => 'day'}),
    );
type => { 'text' | 'number' | 'date' }
The basic data types are 'text', 'number', and 'date', which determine how each input value is normalized for frequency counting and how the results are sorted.
The 'date' type can also be written as a format string for the POSIX::strftime() function. (See POSIX.)
    Data::Freq->new('%Y-%m');

    Data::Freq->new({type => '%H'});
If the type is simply specified as 'date', the format defaults to '%Y-%m-%d'.
In addition, the keywords below can be used as synonyms:
    'year'  : equivalent to '%Y'
    'month' : equivalent to '%Y-%m'
    'day'   : equivalent to '%Y-%m-%d'
    'hour'  : equivalent to '%Y-%m-%d %H'
    'minute': equivalent to '%Y-%m-%d %H:%M'
    'second': equivalent to '%Y-%m-%d %H:%M:%S'
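The keyword table maps directly onto POSIX::strftime() calls. The sketch below shows how a timestamp might be normalized for each keyword; normalize_time() is a hypothetical helper, and it formats in UTC only to keep the example deterministic, which may differ from the time zone handling in the module.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Keyword-to-format table copied from the list above.
my %format_for = (
    year   => '%Y',
    month  => '%Y-%m',
    day    => '%Y-%m-%d',
    hour   => '%Y-%m-%d %H',
    minute => '%Y-%m-%d %H:%M',
    second => '%Y-%m-%d %H:%M:%S',
);

# Hypothetical helper: normalize an epoch time by keyword or raw format.
sub normalize_time {
    my ($type, $epoch) = @_;
    my $format = $format_for{$type} || $type;   # fall back to a strftime format
    return strftime($format, gmtime $epoch);    # UTC (assumption, see above)
}

my $epoch = 1325379723;   # 2012-01-01 01:02:03 UTC
print normalize_time('month', $epoch), "\n";   # prints 2012-01
print normalize_time('hour',  $epoch), "\n";   # prints 2012-01-01 01
print normalize_time('%H',    $epoch), "\n";   # prints 01
```

All records that normalize to the same string fall into the same group, so a coarser format yields fewer and larger groups.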
aggregate => { 'unique' | 'max' | 'min' | 'average' }
The aggregate parameter alters how each count is calculated. By default, the count of a node is the sum of the counts of its child nodes.
    'unique' : the number of distinct child values
    'max'    : the maximum count of the child nodes
    'min'    : the minimum count of the child nodes
    'average': the average count of the child nodes
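The four aggregate modes can be illustrated with List::Util over a set of child counts. This is a sketch of the arithmetic only: aggregate_count() is a hypothetical helper, the counts are made up, and any mode name other than the four listed above (here the label 'sum') falls through to the default summation.

```perl
use strict;
use warnings;
use List::Util qw(sum max min);

# Hypothetical helper: derive a parent count from its children's counts.
sub aggregate_count {
    my ($mode, %child_count) = @_;
    my @counts = values %child_count;
    return 0 unless @counts;

    return scalar @counts         if $mode eq 'unique';    # distinct child values
    return max(@counts)           if $mode eq 'max';
    return min(@counts)           if $mode eq 'min';
    return sum(@counts) / @counts if $mode eq 'average';
    return sum(@counts);                                   # default: plain sum
}

# Made-up child counts for one parent node.
my %users = (user1 => 10, user2 => 8, user3 => 6);

printf "%-8s %s\n", $_, aggregate_count($_, %users)
    for qw(sum unique max min average);
```

For these counts the results are 24 (sum), 3 (unique), 10 (max), 6 (min), and 8 (average).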
sort => { 'value' | 'count' | 'first' | 'last' }
The sort parameter is used as the key by which the group of records will be sorted for the output.
    'value': sort by the normalized value
    'count': sort by the frequency count
    'first': sort by the first occurrence in the input
    'last' : sort by the last occurrence in the input
order => { 'asc' | 'desc' }
The order parameter controls whether the results are sorted in ascending or descending order.
pos => { 0, 1, 2, -1, -2, ... }
If the pos parameter is given, or if an integer (or a list of integers) is given without a parameter name, the value to be counted is selected at those indices, either from an array ref input or from a text split by the logsplit() function.
key => { any key(s) for input hash refs }
If the key parameter is given, the input is assumed to be a hash ref, and the value to be counted is selected by the specified key(s).
convert => sub {...}
If the convert parameter is set to a subroutine ref, it is invoked to convert the value to a normalized form for frequency counting.
The subroutine is expected to take one string argument and return a converted string.
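A converter is simply a subroutine that maps one string to its normalized form. The sketch below shows one possible converter and its effect on a tally; the request paths are made up, and the plain hash stands in for the module's counting.

```perl
use strict;
use warnings;

# Example converter: takes one string and returns its normalized form.
# Here it drops the query string and lowercases the request path.
my $convert = sub {
    my ($value) = @_;
    $value =~ s/\?.*//s;
    return lc $value;
};

# Plain-hash tally standing in for the frequency counting.
my %count;
$count{ $convert->($_) }++ for qw(/Index.html /index.html?lang=en /about);

print "$count{$_}: $_\n" for sort keys %count;
```

The first two paths collapse into the single value /index.html with a count of 2, while /about keeps a count of 1.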
If the type parameter is either text or number, the results are sorted by count in the descending order by default (i.e. the most frequent value first).
For the date type, the sort parameter defaults to value, and the order parameter defaults to asc (i.e. the time-line order).
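The two default orderings can be sketched with Perl's sort over a hash of counts. sorted_values() is a hypothetical helper, and the 'first' and 'last' sort keys are left out because they need the input order, which a plain hash does not record.

```perl
use strict;
use warnings;

# Hypothetical helper: order the keys of a value => count hash
# by 'value' or 'count', in 'asc' or 'desc' order.
sub sorted_values {
    my ($count, $sort, $order) = @_;

    my @keys = sort { $a cmp $b } keys %$count;   # by value
    if ($sort eq 'count') {
        # By count, falling back to the value to break ties.
        @keys = sort { $count->{$a} <=> $count->{$b} || $a cmp $b } @keys;
    }
    return $order eq 'desc' ? reverse(@keys) : @keys;
}

my %users = (user2 => 20, user1 => 100, user3 => 3);
my %days  = ('2012-01-02' => 456, '2012-01-01' => 123);

# text/number default: most frequent value first
print join(', ', sorted_values(\%users, 'count', 'desc')), "\n";

# date default: time-line order
print join(', ', sorted_values(\%days, 'value', 'asc')), "\n";
```

The first call lists user1, user2, user3, and the second lists the two dates in chronological order.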
Once all the data have been collected with the add() method, a frequency tree has been constructed internally.
Suppose the Data::Freq instance is initialized with the two fields as below:
    my $field1 = Data::Freq::Field->new({type => 'month'});
    my $field2 = Data::Freq::Field->new({type => 'text', pos => 2});
    my $data = Data::Freq->new($field1, $field2);
    ...
a result tree that looks like below will be constructed as each data record is added:
    Depth 0         Depth 1              Depth 2
                    $field1              $field2

    {432: root}--+--{123: "2012-01"}--+--{10: "user1"}
                 |                    +--{ 8: "user2"}
                 |                    +--{ 7: "user3"}
                 |                    ...
                 +--{135: "2012-02"}--+--{11: "user3"}
                 |                    +--{ 9: "user2"}
                 |                    ...
                 ...
In the diagram, a node is represented by a pair of braces {...}, and each integer value is the total number of occurrences of the node value, under its parent category.
The root node maintains the grand total of records that have been added.
The tree structure can be recursively visited by the traverse() method.
Below is an example that generates an HTML list:
    print qq(<ul>\n);

    $data->traverse(sub {
        my ($node, $children, $recurse) = @_;

        my ($count, $value) = ($node->count, $node->value);
            # HTML-escape $value if necessary

        print qq(<li>$count: $value);

        if (@$children > 0) {
            print qq(\n<ul>\n);

            for my $child (@$children) {
                $recurse->($child); # invoke recursion
            }

            print qq(</ul>\n);
        }

        print qq(</li>\n);
    });

    print qq(</ul>\n);
Usage:
Data::Freq->new($field1, $field2, ...);
Constructs a Data::Freq object.
The arguments $field1, $field2, etc. are instances of Data::Freq::Field, or any valid arguments that can be passed to "new" in Data::Freq::Field.
The actual data to be analyzed need to be added by the add() method one by one.
The Data::Freq object maintains the counting results, based on the specified fields. The first field ($field1) is used to group the added data into the major category. The next subsequent field ($field2) is for the sub-category under each major group. Any more subsequent fields are interpreted recursively as sub-sub-category, etc.
If no fields are given to the new() method, one field of the text type will be assumed.
    $data->add("A record");

    $data->add("A log line text\n");

    $data->add(['Already', 'split', 'data']);

    $data->add({key1 => 'data1', key2 => 'data2', ...});
Adds a record that increments the counting by 1.
The interpretation of the input depends on the type of fields specified in the new() method. See "evaluate_record" in Data::Freq::Field.
    # I/O
    $data->output();      # print results (default format)
    $data->output(\*OUT); # print results to open handle
    $data->output($io);   # print results to IO::* object

    # Callback
    $data->output(sub {
        my $node = shift;
        # $node is a Data::Freq::Node instance
    });

    # Options
    $data->output({
        with_root  => 0   , # if true, prints total at root
        transpose  => 0   , # if true, prints values before counts
        indent     => ' ' , # repeats (depth - 1) times
        separator  => ': ', # separates the count and the value
        prefix     => ''  , # prepended before the count
        no_padding => 0   , # if true, disables padding for the count
    });

    # Combination
    $data->output(\*STDERR, {opt => ...});
    $data->output($open_fh, {opt => ...});
Generates a report of the counting results.
If no arguments are given, default format results are printed out to STDOUT. Any open handle or an instance of IO::* can be passed as the output destination.
If the argument is a subroutine ref, it is regarded as a callback that will be called for each node of the frequency tree in the depth-first order. (See "frequency tree" for details.)
The following arguments are passed to the callback:
$node: Data::Freq::Node
The current node (Data::Freq::Node)
$children: [$child_node1, $child_node2, ...]
An array ref to the list of child nodes, sorted based on the field
Note: $node->children is an unsorted hash ref of the raw counting data.
    $data->traverse(sub {
        my ($node, $children, $recurse) = @_;

        # Do something with $node before its child nodes

        # $children is a sorted list of child nodes,
        # based on the field specification
        for my $child (@$children) {
            $recurse->($child); # invoke recursion
        }

        # Do something with $node after its child nodes
    });
Provides a way to traverse the result tree with more control than the output() method.
A callback must be passed as an argument, and will be called with $node and $children, as listed above, followed by:
$recurse: sub ($a_child_node)
A subroutine ref, with which the recursion is invoked at a desired time
When the traverse() method is called, the root node is passed as the $node parameter first. Until the $recurse subroutine is explicitly invoked for the child nodes, no recursion will be invoked automatically.
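The callback-driven recursion can be modeled without the module. In the sketch below, traverse_tree() is a hypothetical stand-in for traverse(), and the tree is built from plain hash refs rather than Data::Freq::Node objects; only the calling convention ($node, $children, $recurse) follows the description above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a frequency tree: plain hash refs,
# not Data::Freq::Node objects.
my $root = {
    value => 'root', count => 3, children => [
        { value => '2012-01', count => 2, children => [
            { value => 'user1', count => 2, children => [] },
        ] },
        { value => '2012-02', count => 1, children => [] },
    ],
};

# Calls $callback for the given node only; deeper nodes are visited
# only when the callback itself invokes $recurse on them.
sub traverse_tree {
    my ($node, $callback) = @_;
    my $recurse;
    $recurse = sub {
        my ($n) = @_;
        $callback->($n, $n->{children}, $recurse);
    };
    $recurse->($node);
    undef $recurse;   # break the closure's reference to itself
    return;
}

my @visited;
traverse_tree($root, sub {
    my ($node, $children, $recurse) = @_;
    push @visited, "$node->{count}: $node->{value}";   # before the children
    $recurse->($_) for @$children;                     # descend explicitly
});
print "$_\n" for @visited;
```

Leaving out the $recurse call in the callback would stop the walk at the root, which is the control that the plain output() callback does not offer.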
Returns the root node of the frequency tree. (See "frequency tree" for details.)
The root node is created during the new() method call, and maintains the total number of added records and a reference to its direct child nodes for the first field.
Returns the array ref to the list of fields (Data::Freq::Field).
The returned array should not be modified.
Mahiro Ando, <mahiro at cpan.org>
Please report any bugs or feature requests to bug-data-freq at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Data-Freq. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
You can find documentation for this module with the perldoc command.
perldoc Data::Freq
You can also look for information at:
RT: CPAN's request tracker (report bugs here)
http://rt.cpan.org/NoAuth/Bugs.html?Dist=Data-Freq
AnnoCPAN: Annotated CPAN documentation
http://annocpan.org/dist/Data-Freq
CPAN Ratings
http://cpanratings.perl.org/d/Data-Freq
Search CPAN
http://search.cpan.org/dist/Data-Freq/
Copyright 2012 Mahiro Ando.
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.