DataCube::Schema - An Object Oriented Perl Module for creating Snowflake Schemas.
use strict; use warnings; use DataCube::Schema; # the new constructor my $schema = DataCube::Schema->new; # basic: adding dimensions, hierarchies and measures $schema->add_dimension('country'); $schema->add_dimension('product'); $schema->add_dimension('salesperson'); $schema->add_hierarchy('year','quarter','month','day'); $schema->add_measure('sum','units_sold'); $schema->add_measure('sum','dollar_volume'); $schema->add_measure('average','price_per_unit'); # advanced: adding strict dimensions / hierarchies $schema->add_strict_dimension('country'); $schema->add_strict_hierarchy('year','quarter','month','day'); # advanced: suppressing lattice points $schema->suppress_lattice_point('country','salesperson');
This module provides a pure perl, object oriented, embeddable Star and Snowflake Schema engine. It is self contained and ready to use in data mining and data warehousing applications.
All schemas created by this module are Snowflake Schemas.
Star and Snowflake Schemas are used to organize dimensions and measurements in Data Warehouses. Snowflake Schemas encompass all the functionality of Star Schemas, and provide direct support for hierarchies.
In a Star Schema, many peripheral tables of data are joined to one central table called the "Fact Table". Each peripheral table represents a single dimension.
The Snowflake Schema is an extension of the Star Schema, in which each peripheral "Dimension Table" holding hierarchical data is replaced by a group of tables representing that hierarchy.
To illustrate the difference, consider a single table in a Star Schema which contains a date field of the form "year/month/day". In a Snowflake Schema, this table would be replaced by 3 tables: one containg the year, one containing the month, and one containing the day, all linked together by a primary key / foreign key relationship.
A special table called the "Fact Table" resides in the middle of both Star and Snowflake Schemas. Fact Tables contain a single row of data for each factual event logged during the course of business.
The Fact Table contains redundant and repetitious data (usually in Second Normal Form) and is therefore subject to update anamolies. For this reason, Snowflake Schemas should rarely be used for high performance Relational Databases. They should be used, however, to design "Dimensional Databases" and Data Warehouses, where such redundancy allows for extreme performance gains for complex sql queries, especially those containing aggregation functions on hierarchical relationships.
This module provides several methods to design Snowflake Schemas.
This method adds a single dimension to a schema.
This method adds a single measures to the cubes measure table.
Supported measures inlcude:
count min [field name] max [field name] sum [field name] count [field name] average [field name] product [field name] multi_count [field name]
Here is a description of each measure:
count init_value 0 update_rule ++ report_format integer additivity additive declaration $schema->add_measure('count') description the number of times a dimensional tuple has been inserted into the cube min init_value undef update_rule = if < or undefined report_format decimal additivity additive declaration $schema->add_measure('min','field') description the minimal value of inserted numbers from 'field' max init_value undef update_rule = if > or undefined report_format decimal additivity additive declaration $schema->add_measure('max','field') description the maximal value of inserted numbers from 'field' sum init_value 0 update_rule += report_format decimal additivity additive declaration $schema->add_measure('sum','field') description the sum of inserted numbers from 'field' product init_value 1 update_rule *= report_format decimal additivity additive (ie separable) declaration $schema->add_measure('product','field') description the multiplication of inserted numbers from 'field' average init_value 0 update_rule {average}->{$field}->{sum_total} += $field_value; {average}->{$field}->{observations}++ report_format decimal (sum_total / observations) additivity non-additive declaration $schema->add_measure('average','field') description the average of inserted values from 'field' count (distinct) init_value {} (empty hashref) update_rule {count}->{$field}->{$field_value} = undef report_format integer (ie scalar(keys(%{{count}->{$field}->{$field_value}}))) additivity non-additive declaration $schema->add_measure('count','field') description the count distinct of inserted values from 'field' multi_count (distinct with multiplicity) init_value {} (empty hashref) update_rule {count}->{$field}->{$field_value}++ report_format integer (same as count) additivity non-additive declaration $schema->add_measure('multi_count','field') description the count distinct of inserted values from 'field', also stores the *number of times* that $field_value was 'uniquefied'
This method adds a single hierarchy to a schema. Hierarchies are like Dimensions, except that aggregate measures will be computed on complete Parent - Child chains.
For example, consider the hierarchy "year", "month", "day" and the measure "sum" of "dollars".
The following code:
$schema->add_measure('sum','dollars'); $schema->add_hierarchy('year','month','day'); $cube = DataCube->new($schema); # a bunch of data is fed to the cube # # [...] # # and some time later: $cube->rollup; $cube->report;
will create the following reports:
1. sum_of_dollars 2. sum_of_dollars by year 3. sum_of_dollars by year, month 4. sum_of_dollars by year, month, day
as it probably should.
This method adds a single dimension to a schema, over which no aggregation will be performed.
For example, consider the following code:
$schema->add_dimension('product'); $schema->add_strict_dimension('country'); $schema->add_measure('sum','dollars'); $cube = DataCube->new($schema); # a bunch of data is fed to the cube # # [...] # # and some time later: $cube->rollup; $cube->report;
1. sum_of_dollars by country 2. sum_of_dollars by country, product
Notice that the datacube did not produce the sum_of_dollars irrespective of country.
This method adds a single hierarchy to a schema. No aggregation will be performed over the top-most dimension.
This method suppresses specific rollups / reports from being created during a call to rollup, which may lead to a saving of both time and space.
$schema->add_measure('sum','dollars'); $schema->add_hierarchy('year','month','day'); $schema->suppress_lattice_point('year','month'); $cube = DataCube->new($schema); # a bunch of data is fed to the cube # # [...] # # and some time later: $cube->rollup; $cube->report;
1. sum_of_dollars 2. sum_of_dollars by year 3. sum_of_dollars by year, month, day
This method restricts a datacube to only the specified list of dimensions during rollup.
This method superscedes all others except for add_strict_dimension and add_strict_hierarchy, and may lead to a saving of both time and space.
$schema->add_measure('sum','dollars'); $schema->add_hierarchy('year','month','day'); $schema->assert_lattice_point('overall'); $schema->assert_lattice_point('year','month'); $cube = DataCube->new($schema); # a bunch of data is fed to the cube # # [...] # # and some time later: $cube->rollup; $cube->report;
1. sum_of_dollars 2. sum_of_dollars by year, month
If you do this:
$schema->add_measure('sum','dollars'); $schema->add_strict_hierarchy('year','month','day'); $schema->assert_lattice_point('overall'); $schema->assert_lattice_point('year','month');
you will not get the report
1. sum_of_dollars
because the method call
$schema->add_strict_hierarchy('year','month','day');
confines 'year' to always be present.
When in doubt, do not use 'assert_lattice_point' in the presence of the other lattice assertions (such as 'strict' and 'suppress').
This method restricts a datacube to only the specified list of dimensions and superscedes all other methods.
The base table becomes fixed to the confined point and no rollup occurs even if called.
my $schema = DataCube::Schema->new; $schema->add_dimension('country'); $schema->add_dimension('product'); $schema->add_dimension('salesperson'); $schema->add_hierarchy('year','quarter','month','day'); $schema->add_measure('sum','units_sold'); $schema->add_measure('sum','dollar_volume'); $schema->add_measure('average','price_per_unit'); $schema->confine_to('country','product','year'); my $cube = DataCube->new($schema);
will create a cube with only one table (the base table: 'country','product','year') and only one report:
1. sum_of_dollars etc. by country, product, year
None
Wikipedia on Snowflake Schema:
http://en.wikipedia.org/wiki/Snowflake_schema
David Williams, <david@namimedia.com>
Copyright (C) 2009 by David Williams
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.
To install DataCube, copy and paste the appropriate command in to your terminal.
cpanm
cpanm DataCube
CPAN shell
perl -MCPAN -e shell install DataCube
For more information on module installation, please visit the detailed CPAN module installation guide.