NAME

Statistics::LSNoHistory - Least-Squares linear regression package without data history

SYNOPSIS

  # construct from points
  $reg = Statistics::LSNoHistory->new(points => [
    1.0 => 1.0,
    2.1 => 1.9,
    2.8 => 3.2,
    4.0 => 4.1,
    5.2 => 4.9
  ]);

  # other equivalent constructions
  $reg = Statistics::LSNoHistory->new(
    xvalues => [1.0, 2.1, 2.8, 4.0, 5.2],
    yvalues => [1.0, 1.9, 3.2, 4.1, 4.9]
  );
  # or
  $reg = Statistics::LSNoHistory->new;
  $reg->append_arrays(
    [1.0, 2.1, 2.8, 4.0, 5.2],
    [1.0, 1.9, 3.2, 4.1, 4.9]
  );
  # or
  $reg = Statistics::LSNoHistory->new;
  $reg->append_points(
    1.0 => 1.0, 2.1 => 1.9, 2.8 => 3.2, 4.0 => 4.1, 5.2 => 4.9
  );

  # You may also construct from the preliminary statistics of a 
  # previous regression:
  $reg = Statistics::LSNoHistory->new(
    num => 5,
    sumx => 15.1,
    sumy => 15.1,
    sumxx => 56.29,
    sumyy => 55.67,
    sumxy => 55.83,
    minx => 1.0,
    maxx => 5.2,
    miny => 1.0,
    maxy => 4.9
  );
  # thus a branch may be instantiated as follows
  $branch = Statistics::LSNoHistory->new(%{$reg->dump_stats});
  $reg->append_point(6.1, 5.9);
  $branch->append_point(5.8, 6.0);

  # calculate regression values, print some
  printf("Slope: %.2f\n", $reg->slope);
  printf("Intercept: %.2f\n", $reg->intercept);
  printf("Correlation Coefficient: %.2f\n", $reg->pearson_r);
  ...

DESCRIPTION

This package provides standard least squares linear regression functionality without the need to store the complete data history. Like any other least squares fit, it finds the m and k for which the line y = m*x + k best fits (in the least squares sense) the data points (x_1,y_1),...,(x_n,y_n).

In many applications involving linear regression, it is desirable to compute a regression based on the intermediate statistics of a previous regression along with any new data points. Thus there is no need to store a complete data history; a minimal set of intermediate statistics suffices, the number of which, thanks to Gauss, is 6 (num, sumx, sumy, sumxx, sumyy and sumxy). This package additionally tracks the minimum and maximum of the x and y values.
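The idea of regressing from running sums alone can be sketched in plain Perl. This is only an illustration under assumptions: the helper names new_sums, add_point and fit are hypothetical and are not the module's own code; only the six statistic names come from this document.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only; not the module's implementation.
sub new_sums {
    my %s = (num => 0, sumx => 0, sumy => 0, sumxx => 0, sumyy => 0, sumxy => 0);
    return \%s;
}

# Fold one (x, y) observation into the six running sums.
sub add_point {
    my ($s, $x, $y) = @_;
    $s->{num}   += 1;
    $s->{sumx}  += $x;
    $s->{sumy}  += $y;
    $s->{sumxx} += $x * $x;
    $s->{sumyy} += $y * $y;
    $s->{sumxy} += $x * $y;
    return $s;
}

# Least squares slope m and intercept k of y = m*x + k, from the sums only.
sub fit {
    my ($s) = @_;
    my $n   = $s->{num};
    my $sxx = $s->{sumxx} - $s->{sumx}**2 / $n;
    my $sxy = $s->{sumxy} - $s->{sumx} * $s->{sumy} / $n;
    my $m   = $sxy / $sxx;
    my $k   = ($s->{sumy} - $m * $s->{sumx}) / $n;
    return ($m, $k);
}

my $sums = new_sums();
add_point($sums, @$_) for [1.0, 1.0], [2.1, 1.9], [2.8, 3.2], [4.0, 4.1], [5.2, 4.9];
my ($m5, $k5) = fit($sums);

# A new observation arrives later: update the sums, never revisit old points.
add_point($sums, 6.1, 5.9);
my ($m6, $k6) = fit($sums);

printf "5 points: m = %.4f, k = %.4f\n", $m5, $k5;
printf "6 points: m = %.4f, k = %.4f\n", $m6, $k6;
```

Each new point costs a constant amount of work and memory, regardless of how many points came before it.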

The user interface provides a way to instantiate a regression object with either raw data or previous intermediate statistics.

CONSTRUCTOR ARGUMENTS

The constructor (or class method new) takes several possible arguments. The initialization scenario depends on the kinds of arguments passed and falls into one of the following categories:

  • default: new() by itself is equivalent to initializing with no data. All internal statistics are set to zero.

  • data points array: new(points => [x_1 => y_1, x_2 => y_2,..., x_n => y_n]) processes the n specified data points. Note that points expects an array reference even though we've written it in "hash notation" for clarity.

  • data value arrays: new(xvalues => [x_1, x_2,..., x_n], yvalues => [y_1, y_2,..., y_n]) is equivalent to the above.

  • previous state: new(state arguments) requires all of the following intermediate statistics:

    num

    => Number of points.

    sumx

    => Sum of x values.

    sumy

    => Sum of y values.

    sumxx

    => Sum of squared x values.

    sumyy

    => Sum of squared y values.

    sumxy

    => Sum of x*y products.

    minx

    => Minimum x value.

    maxx

    => Maximum x value.

    miny

    => Minimum y value.

    maxy

    => Maximum y value.
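As a cross-check, the statistics listed above can be computed directly from raw data. The sketch below uses only core Perl and List::Util with the data from the SYNOPSIS; the %stats hash is a hypothetical stand-in shaped like the previous-state arguments, not something produced by the module.

```perl
use strict;
use warnings;
use List::Util qw(sum min max);

# Data points from the SYNOPSIS.
my @x = (1.0, 2.1, 2.8, 4.0, 5.2);
my @y = (1.0, 1.9, 3.2, 4.1, 4.9);

# The ten intermediate statistics, keyed by the constructor argument names.
my %stats = (
    num   => scalar @x,
    sumx  => sum(@x),
    sumy  => sum(@y),
    sumxx => sum(map { $_ ** 2 } @x),
    sumyy => sum(map { $_ ** 2 } @y),
    sumxy => sum(map { $x[$_] * $y[$_] } 0 .. $#x),
    minx  => min(@x),
    maxx  => max(@x),
    miny  => min(@y),
    maxy  => max(@y),
);

printf "%-5s => %s\n", $_, $stats{$_} for sort keys %stats;
```

Up to floating-point rounding, these agree with the values passed to new in the SYNOPSIS (for example sumxx of 56.29 and sumxy of 55.83).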

METHODS

    *

    append_point(x,y) processes an additional data point.

    *

    append_points(x_1 => y_1,..., x_n => y_n) processes additional data points, which is equivalent to calling append_point() n times.

    *

    append_arrays([x_1, x_2,..., x_n], [y_1, y_2,..., y_n]) processes additional data points given a pair of x and y data array references. This is also equivalent to calling append_point() n times.

    *

    average_x returns the mean of the x values.

    *

    average_y returns the mean of the y values.

    *

    variance_x returns the (n-1)-style variance of the x values.

    *

    variance_y returns the (n-1)-style variance of the y values.

    *

    slope returns the slope m so that y = m*x + k is a least squares fit. Note that this fit minimizes the sum of squared residuals in y, (y_i - (m*x_i + k))**2, and is thus the standard slope.

    *

    intercept returns the intercept k so that y = m*x + k is a least squares fit. Note again that this fit minimizes the sum of squared residuals in y, and is thus the standard intercept.

    *

    predict(x) predicts a y value, given an x value. Computes m*x + k, where m, k are the standard regression slope and intercept (->slope and ->intercept, respectively) for the most recent data.

    *

    slope_y returns the slope m' so that x = m'*y + k' is a least squares fit. Note that this fit minimizes the sum of squared residuals in x, (x_i - (m'*y_i + k'))**2, and is thus not the standard slope.

    *

    intercept_y returns the intercept k' so that x = m'*y + k' is a least squares fit. Note again that this fit minimizes the sum of squared residuals in x, and is thus not the standard intercept.

    *

    predict_x(y) predicts an x value given a y value. Computes m'*y + k', where m', k' are the regression (y-relative) slope and intercept (->slope_y and ->intercept_y, respectively) for the most recent data.

    *

    pearson_r returns Pearson's r correlation coefficient.

    *

    chi_squared returns the chi squared statistic.

    *

    minimum_x returns the minimum x value.

    *

    maximum_x returns the maximum x value.

    *

    minimum_y returns the minimum y value.

    *

    maximum_y returns the maximum y value.

    *

    dump_stats returns a hash reference of the form

            { num => <val>,
              sumx => <val>,
              sumy => <val>,
              sumxx => <val>,
              sumyy => <val>,
              sumxy => <val>,
              minx => <val>,
              maxx => <val>,
              miny => <val>,
              maxy => <val> }

    in other words, a hash containing all the statistics required by the previous-state form of the constructor above. This effectively dumps the regression history.
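To show how the quantities returned by the methods above relate to such a hash, here is a minimal sketch using the textbook formulas. It assumes, without having checked the module source, that the module computes the same quantities; the x-on-y fit is taken to be x = m'*y + k', as the predict_x description suggests. chi_squared is left out because this document does not define it. The %s hash holds the SYNOPSIS values and %derived is a hypothetical name.

```perl
use strict;
use warnings;

# Statistics in the shape dump_stats is documented to return; the values
# are the ones from the SYNOPSIS.
my %s = (
    num   => 5,
    sumx  => 15.1,  sumy  => 15.1,
    sumxx => 56.29, sumyy => 55.67, sumxy => 55.83,
    minx  => 1.0,   maxx  => 5.2,
    miny  => 1.0,   maxy  => 4.9,
);

# Centered sums of squares and cross products.
my $n   = $s{num};
my $sxx = $s{sumxx} - $s{sumx}**2 / $n;
my $syy = $s{sumyy} - $s{sumy}**2 / $n;
my $sxy = $s{sumxy} - $s{sumx} * $s{sumy} / $n;

# Textbook formulas (an assumption about what the methods compute).
my %derived = (
    average_x  => $s{sumx} / $n,
    average_y  => $s{sumy} / $n,
    variance_x => $sxx / ($n - 1),            # (n-1)-style variance
    variance_y => $syy / ($n - 1),
    slope      => $sxy / $sxx,                # m  in y = m*x + k
    slope_y    => $sxy / $syy,                # m' in x = m'*y + k'
    pearson_r  => $sxy / sqrt($sxx * $syy),
);
$derived{intercept}   = $derived{average_y} - $derived{slope}   * $derived{average_x};
$derived{intercept_y} = $derived{average_x} - $derived{slope_y} * $derived{average_y};

printf "%-12s %.4f\n", $_, $derived{$_} for sort keys %derived;
```

Under these formulas slope * slope_y equals pearson_r squared, so the two fitted lines coincide only when the correlation is perfect.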

BUGS

This technique is more susceptible to roundoff errors than others which store the data. Extra care must be taken to scale the data before processing.
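The roundoff problem can be illustrated with a plain Perl sketch that does not use the module. The helper sxx_from_sums is hypothetical; it computes the sum of squared deviations of x from running sums only, the way a no-history method must. The five x values have a true sum of squared deviations of exactly 10, but they sit on a large offset.

```perl
use strict;
use warnings;

# Sum of squared deviations from the mean, computed only from running sums.
sub sxx_from_sums {
    my @x = @_;
    my ($n, $sum, $sumsq) = (scalar @x, 0, 0);
    for my $v (@x) {
        $sum   += $v;
        $sumsq += $v * $v;
    }
    return $sumsq - $sum**2 / $n;
}

my $offset  = 1e9;
my @raw     = map { $offset + $_ } 1 .. 5;   # 1e9+1 .. 1e9+5
my @shifted = map { $_ - $offset } @raw;     # 1 .. 5

printf "raw:     %s\n", sxx_from_sums(@raw);
printf "shifted: %s\n", sxx_from_sums(@shifted);
```

The shifted data yields exactly 10. With ordinary double-precision numbers the raw data generally does not, because the result is the difference of two values near 5e18 whose spacing is far coarser than 10. Subtracting a constant from x leaves the slope unchanged; the intercept then has to be mapped back, since y = m*(x - c) + k' gives k = k' - m*c.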

AUTHOR

John Pliam <pliam@cpan.org>

SEE ALSO

CPAN modules: Statistics::OLS, Statistics::Descriptive, Statistics::GaussHelmert, Statistics::Regression.

Any book on statistics, any handbook of mathematics, any comprehensive book on numerical algorithms.

Press et al, Numerical Recipes in L [L in {C,Fortran, ...}], Nth edition [N > 0], Cambridge Univ Press.

COPYING

See distribution file COPYING for complete information.
