The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

DataCube::Schema - An Object Oriented Perl Module for creating Snowflake Schemas.

SYNOPSIS

  use strict;
  use warnings;
  
  use DataCube::Schema;
  

  # the new constructor
  
  my $schema = DataCube::Schema->new;
    
  
  # basic: adding dimensions, hierarchies and measures
  
  $schema->add_dimension('country');
  $schema->add_dimension('product');
  $schema->add_dimension('salesperson');
  
  $schema->add_hierarchy('year','quarter','month','day');
  
  $schema->add_measure('sum','units_sold');
  $schema->add_measure('sum','dollar_volume');
  $schema->add_measure('average','price_per_unit');
  
    
  # advanced: adding strict dimensions / hierarchies
  
  $schema->add_strict_dimension('country');
  $schema->add_strict_hierarchy('year','quarter','month','day');
  
  
  # advanced: suppressing lattice points
    
  $schema->suppress_lattice_point('country','salesperson');
  
  
  
  

DESCRIPTION

This module provides a pure perl, object oriented, embeddable Star and Snowflake Schema engine. It is self contained and ready to use in data mining and data warehousing applications.

All schemas created by this module are Snowflake Schemas.

BACKGROUND

Star and Snowflake Schemas are used to organize dimensions and measurements in Data Warehouses. Snowflake Schemas encompass all the functionality of Star Schemas, and provide direct support for hierarchies.

STAR SCHEMAS

    In a Star Schema, many peripheral tables of data are joined to one central table called the "Fact Table". Each peripheral table represents a single dimension.

SNOWFLAKE SCHEMAS

    The Snowflake Schema is an extension of the Star Schema, in which each peripheral "Dimension Table" holding hierarchical data is replaced by a group of tables representing that hierarchy.

    To illustrate the difference, consider a single table in a Star Schema which contains a date field of the form "year/month/day". In a Snowflake Schema, this table would be replaced by 3 tables: one containg the year, one containing the month, and one containing the day, all linked together by a primary key / foreign key relationship.

FACT TABLES

    A special table called the "Fact Table" resides in the middle of both Star and Snowflake Schemas. Fact Tables contain a single row of data for each factual event logged during the course of business.

    The Fact Table contains redundant and repetitious data (usually in Second Normal Form) and is therefore subject to update anamolies. For this reason, Snowflake Schemas should rarely be used for high performance Relational Databases. They should be used, however, to design "Dimensional Databases" and Data Warehouses, where such redundancy allows for extreme performance gains for complex sql queries, especially those containing aggregation functions on hierarchical relationships.

BASIC OPERATIONS

This module provides several methods to design Snowflake Schemas.

add_dimension

    This method adds a single dimension to a schema.

add_measure

    This method adds a single measures to the cubes measure table.

    Supported measures inlcude:

        count
        min         [field name]
        max         [field name]
        sum         [field name]
        count       [field name]
        average     [field name]
        product     [field name]
        multi_count [field name]

    Here is a description of each measure:

        count
        
        init_value     0
        update_rule    ++
        report_format  integer
        additivity     additive
        declaration    $schema->add_measure('count')
        description    the number of times a dimensional tuple has been inserted into the cube
    
    
    
        min
        
        init_value     undef
        update_rule    = if < or undefined
        report_format  decimal
        additivity     additive
        declaration    $schema->add_measure('min','field')
        description    the minimal value of inserted numbers from 'field'
    
    
    
        max
        
        init_value     undef
        update_rule    = if > or undefined
        report_format  decimal
        additivity     additive
        declaration    $schema->add_measure('max','field')
        description    the maximal value of inserted numbers from 'field'
    
    
    
        sum
        
        init_value     0
        update_rule    +=
        report_format  decimal
        additivity     additive
        declaration    $schema->add_measure('sum','field')
        description    the sum of inserted numbers from 'field'
    
    
    
        product
        
        init_value     1
        update_rule    *=
        report_format  decimal
        additivity     additive (ie separable)
        declaration    $schema->add_measure('product','field')
        description    the multiplication of inserted numbers from 'field'
    
    
    
        average
        
        init_value     0
        update_rule    {average}->{$field}->{sum_total}   += $field_value;
                       {average}->{$field}->{observations}++
        report_format  decimal (sum_total / observations)
        additivity     non-additive
        declaration    $schema->add_measure('average','field')
        description    the average of inserted values from 'field'
    
    
    
        count (distinct) 
        
        init_value     {} (empty hashref)
        update_rule    {count}->{$field}->{$field_value} = undef
        report_format  integer (ie scalar(keys(%{{count}->{$field}->{$field_value}})))
        additivity     non-additive
        declaration    $schema->add_measure('count','field')
        description    the count distinct of inserted values from 'field'
    
    
    
        multi_count (distinct with multiplicity) 
        
        init_value     {} (empty hashref)
        update_rule    {count}->{$field}->{$field_value}++
        report_format  integer (same as count)
        additivity     non-additive
        declaration    $schema->add_measure('multi_count','field')
        description    the count distinct of inserted values from 'field', also stores the *number of times* that $field_value was 'uniquefied'

add_hierarchy

    This method adds a single hierarchy to a schema. Hierarchies are like Dimensions, except that aggregate measures will be computed on complete Parent - Child chains.

    For example, consider the hierarchy "year", "month", "day" and the measure "sum" of "dollars".

    The following code:

      $schema->add_measure('sum','dollars');
      $schema->add_hierarchy('year','month','day');
      
      $cube = DataCube->new($schema);
    
      # a bunch of data is fed to the cube
      #
      # [...]
      #
      # and some time later:
    
      $cube->rollup;
      $cube->report;

    will create the following reports:

      1.  sum_of_dollars
      2.  sum_of_dollars by year
      3.  sum_of_dollars by year, month
      4.  sum_of_dollars by year, month, day 

    as it probably should.

ADVANCED OPERATIONS

add_strict_dimension

This method adds a single dimension to a schema, over which no aggregation will be performed.

For example, consider the following code:

  $schema->add_dimension('product');
  $schema->add_strict_dimension('country');

  $schema->add_measure('sum','dollars');
  
  $cube = DataCube->new($schema);

  # a bunch of data is fed to the cube
  #
  # [...]
  #
  # and some time later:

  $cube->rollup;
  $cube->report;

will create the following reports:

  1.  sum_of_dollars by country
  2.  sum_of_dollars by country, product

Notice that the datacube did not produce the sum_of_dollars irrespective of country.

add_strict_hierarchy

This method adds a single hierarchy to a schema. No aggregation will be performed over the top-most dimension.

suppress_lattice_point

    This method suppresses specific rollups / reports from being created during a call to rollup, which may lead to a saving of both time and space.

    For example, consider the following code:

      $schema->add_measure('sum','dollars');
      $schema->add_hierarchy('year','month','day');
      $schema->suppress_lattice_point('year','month');
      
      $cube = DataCube->new($schema);
    
      # a bunch of data is fed to the cube
      #
      # [...]
      #
      # and some time later:
    
      $cube->rollup;
      $cube->report;

    will create the following reports:

      1.  sum_of_dollars
      2.  sum_of_dollars by year
      3.  sum_of_dollars by year, month, day 

assert_lattice_point

    This method restricts a datacube to only the specified list of dimensions during rollup.

    This method superscedes all others except for add_strict_dimension and add_strict_hierarchy, and may lead to a saving of both time and space.

    For example, consider the following code:

      $schema->add_measure('sum','dollars');
      $schema->add_hierarchy('year','month','day');
      
      $schema->assert_lattice_point('overall');
      $schema->assert_lattice_point('year','month');
      
      $cube = DataCube->new($schema);
    
      # a bunch of data is fed to the cube
      #
      # [...]
      #
      # and some time later:
    
      $cube->rollup;
      $cube->report;

    will create the following reports:

      1.  sum_of_dollars
      2.  sum_of_dollars by year, month
      

    If you do this:

      $schema->add_measure('sum','dollars');
      $schema->add_strict_hierarchy('year','month','day');
      
      $schema->assert_lattice_point('overall');
      $schema->assert_lattice_point('year','month');

    you will not get the report

      1.  sum_of_dollars

    because the method call

      $schema->add_strict_hierarchy('year','month','day');
      

    confines 'year' to always be present.

    When in doubt, do not use 'assert_lattice_point' in the presence of the other lattice assertions (such as 'strict' and 'suppress').

confine_to

    This method restricts a datacube to only the specified list of dimensions and superscedes all other methods.

    The base table becomes fixed to the confined point and no rollup occurs even if called.

    For example, consider the following code:

        my $schema = DataCube::Schema->new;
        $schema->add_dimension('country');
        $schema->add_dimension('product');
        $schema->add_dimension('salesperson');
        
        $schema->add_hierarchy('year','quarter','month','day');
        
        $schema->add_measure('sum','units_sold');
        $schema->add_measure('sum','dollar_volume');
        $schema->add_measure('average','price_per_unit');
        
        $schema->confine_to('country','product','year');
        
        my $cube = DataCube->new($schema);

    will create a cube with only one table (the base table: 'country','product','year') and only one report:

      1.  sum_of_dollars etc. by country, product, year

EXPORT

None

SEE ALSO

Wikipedia on Snowflake Schema:

http://en.wikipedia.org/wiki/Snowflake_schema

AUTHOR

David Williams, <david@namimedia.com>

COPYRIGHT AND LICENSE

Copyright (C) 2009 by David Williams

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.