
Algorithm::ExpectationMaximization -- A pure-Perl implementation for clustering numerical multi-dimensional data with the Expectation-Maximization algorithm.

SYNOPSIS

  use Algorithm::ExpectationMaximization;

  #  First name the data file:

  my $datafile = "mydatafile1.dat";

  #  Next, set the mask to indicate which columns of the datafile to use for
  #  clustering and which column contains a symbolic ID for each data record. For
  #  example, if the symbolic name is in the first column, you want the second column
  #  to be ignored, and you want the next three columns to be used for 3D clustering:

  my $mask = "N0111";

  #  Now construct an instance of the clusterer.  The parameter `K' controls the
  #  number of clusters.  Here is an example call to the constructor for instance
  #  creation:

  my $clusterer = Algorithm::ExpectationMaximization->new(
                                      datafile            => $datafile,
                                      mask                => $mask,
                                      K                   => 3,
                                      max_em_iterations   => 300,
                                      seeding             => 'random',
                                      terminal_output     => 1,
                                      debug               => 0,
                  );
 
  #  Note the choice for `seeding'. The choice `random' means that the clusterer will
  #  randomly select `K' data points to serve as initial cluster centers.  Other
  #  possible choices for the constructor parameter `seeding' are `kmeans' and
  #  `manual'.  With the `kmeans' option for `seeding', the output of a K-means
  #  clusterer is used for the cluster seeds and the initial cluster covariances.  If
  #  you use the `manual' option for seeding, you must also specify the data elements
  #  to use for seeding the clusters.

  #  Here is an example of a call to the constructor when we choose the `manual'
  #  option for seeding the clusters and for specifying the data elements for
  #  seeding.  The data elements are specified by their tag names.  In this case,
  #  these names are `a26', `b53', and `c49':

  my $clusterer = Algorithm::ExpectationMaximization->new(
                                      datafile            => $datafile,
                                      mask                => $mask,
                                      class_priors        => [0.6, 0.2, 0.2],
                                      K                   => 3,
                                      max_em_iterations   => 300,
                                      seeding             => 'manual',
                                      seed_tags           => ['a26', 'b53', 'c49'],
                                      terminal_output     => 1,
                                      debug               => 0,
                                    );

  #  This example call to the constructor also illustrates how you can inject class
  #  priors into the clustering process. The class priors are the prior probabilities
  #  of the class distributions in your dataset.  As explained later, injecting class
  #  priors in the manner shown above makes statistical sense only for the case of
  #  manual seeding.  When you do inject class priors, the order in which the priors
  #  are expressed must correspond to the manually specified seeds for the clusters.

  #  After the invocation of the constructor, the following calls are mandatory
  #  for reasons that should be obvious from the names of the methods:

  $clusterer->read_data_from_file();
  srand(time);
  $clusterer->seed_the_clusters();
  $clusterer->EM();
  $clusterer->run_bayes_classifier();
  my $clusters = $clusterer->return_disjoint_clusters();

  #  where the call to `EM()' is the invocation of the expectation-maximization
  #  algorithm.  The call to `srand(time)' is to seed the pseudo random number
  #  generator afresh for each run of the cluster seeding procedure.  If you want to
  #  see repeatable results from one run to another of the algorithm with random
  #  seeding, you would obviously not invoke `srand(time)'.

  #  The call `run_bayes_classifier()' shown above carries out a disjoint clustering
  #  of all the data points using the naive Bayes' classifier. And the call
  #  `return_disjoint_clusters()' returns the clusters thus formed to you.  Once you
  #  have obtained access to the clusters in this manner, you can display them in
  #  your terminal window by

  foreach my $index (0..@$clusters-1) {
      print "Cluster $index (Naive Bayes):   @{$clusters->[$index]}\n\n"
  }

  #  If you would like to also see the clusters purely on the basis of the posterior
  #  class probabilities exceeding a threshold, call

  my $theta1 = 0.2;
  my $posterior_prob_clusters =
           $clusterer->return_clusters_with_posterior_probs_above_threshold($theta1);

  #  where you can obviously set the threshold $theta1 to any value you wish.  Note
  #  that now you may end up with clusters that overlap.  You can display them in
  #  your terminal window in the same manner as shown above for the naive Bayes'
  #  clusters.

  #  You can write the naive Bayes' clusters out to files, one cluster per file, by
  #  calling

  $clusterer->write_naive_bayes_clusters_to_files();  

  #  The clusters are placed in files with names like

         naive_bayes_cluster1.dat
         naive_bayes_cluster2.dat
         ...

  #  In the same manner, you can write out the posterior-probability based (and possibly
  #  overlapping) clusters to files by calling:

  $clusterer->write_posterior_prob_clusters_above_threshold_to_files($theta1);

  #  where the threshold $theta1 sets the probability threshold for deciding which
  #  data elements to place in a cluster.  These clusters are placed in files with
  #  names like

         posterior_prob_cluster1.dat
         posterior_prob_cluster2.dat
         ...

  # CLUSTER VISUALIZATION:

  #  You must first set the mask for cluster visualization. This mask tells the 
  #  module which 2D or 3D subspace of the original data space you wish to visualize 
  #  the clusters in:

  my $visualization_mask = "111";
  $clusterer->visualize_clusters($visualization_mask);
  $clusterer->visualize_distributions($visualization_mask);
  $clusterer->plot_hardcopy_clusters($visualization_mask);
  $clusterer->plot_hardcopy_distributions($visualization_mask);

  #  where the last two invocations are for writing out the PNG plots of the
  #  visualization displays to disk files.  The PNG image of the posterior
  #  probability distributions is written out to a file named posterior_prob_plot.png
  #  and the PNG image of the disjoint clusters to a file called cluster_plot.png.

  # SYNTHETIC DATA GENERATION:

  #  The module has been provided with a class method for generating multivariate
  #  data for experimenting with the EM algorithm.  The data generation is controlled
  #  by the contents of a parameter file that is supplied as an argument to the data
  #  generator method.  The priors, the means, and the covariance matrices in the
  #  parameter file must follow the syntax shown in the `param1.txt' file in
  #  the `examples' directory.  It is best to edit a copy of this file for your
  #  synthetic data generation needs.

  my $parameter_file = "param1.txt";
  my $out_datafile = "mydatafile1.dat";
  Algorithm::ExpectationMaximization->cluster_data_generator(
                          input_parameter_file => $parameter_file,
                          output_datafile => $out_datafile,
                          total_number_of_data_points => $N );

  #  where the value of $N is the total number of data points you would like to see
  #  generated for all of the Gaussians.  How this total number is divided up amongst
  #  the Gaussians is decided by the prior probabilities for the Gaussian components
  #  as declared in the input parameter file.  The synthetic data may be visualized in a
  #  terminal window and the visualization written out as a PNG image to a diskfile
  #  by

  my $data_visualization_mask = "11";                                            
  $clusterer->visualize_data($data_visualization_mask);                          
  $clusterer->plot_hardcopy_data($data_visualization_mask);

CHANGES

Version 1.1 incorporates much cleanup of the documentation associated with the module. Both the top-level module documentation, especially the Description part, and the comments embedded in the code were revised to make the module easier to use. The basic implementation code remains unchanged.

DESCRIPTION

Algorithm::ExpectationMaximization is a perl5 module for the Expectation-Maximization (EM) method of clustering numerical data that lends itself to modeling as a Gaussian mixture. Since the module is entirely in Perl (in the sense that it is not a Perl wrapper around a C library that actually does the clustering), the code in the module can easily be modified to experiment with several aspects of EM.

Gaussian Mixture Modeling (GMM) is based on the assumption that the data consists of K Gaussian components, each characterized by its own mean vector and its own covariance matrix. Obviously, given observed data for clustering, we do not know which of the K Gaussian components was responsible for any of the data elements. GMM also associates a prior probability with each Gaussian component. In general, these priors will also be unknown. So the problem of clustering consists of estimating the posterior class probability at each data element and also estimating the class priors. Once these posterior class probabilities and the priors are estimated with EM, we can use the naive Bayes' classifier to partition the data into disjoint clusters. Or, for "soft" clustering, we can find all the data elements that belong to a Gaussian component on the basis of the posterior class probabilities at the data elements exceeding a prescribed threshold.
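
In the standard notation for Gaussian mixture models (generic notation, not the module's internal variable names), the density that EM fits to the data is

    p(\mathbf{x}) \;=\; \sum_{k=1}^{K} \pi_k \,
        \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
    \qquad \sum_{k=1}^{K} \pi_k \;=\; 1

where the priors \pi_k, the means \mu_k, and the covariance matrices \Sigma_k of the K Gaussian components are exactly the quantities that the EM iterations must estimate.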

If you do not mind the fact that it is possible for the EM algorithm to occasionally get stuck in a local maximum and to, therefore, produce a wrong answer even when you know the data to be perfectly multimodal Gaussian, EM is probably the most magical approach to clustering multidimensional data. Consider the case of clustering three-dimensional data. Each Gaussian cluster in 3D space is characterized by the following 10 variables: the 6 unique elements of the 3x3 covariance matrix (which must be symmetric positive-definite), the 3 unique elements of the mean, and the prior associated with the Gaussian. Now let's say you expect to see six Gaussians in your data. What that means is that you would want the values for 59 variables (remember the unit-summation constraint on the class priors which reduces the overall number of variables by one) to be estimated by the algorithm that seeks to discover the clusters in your data. What's amazing is that despite the large number of variables that are being simultaneously optimized, the chances are that the EM algorithm will give you a good approximation to the right answer.
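
As a quick check on the arithmetic in the previous paragraph, here is a tiny illustrative Perl snippet (not part of the module) that counts the free parameters estimated for a mixture of K Gaussians in d dimensions:

    use strict;
    use warnings;

    my ($d, $K) = (3, 6);                    # 3D data, six Gaussians
    my $per_gaussian = $d * ($d + 1) / 2     # unique covariance entries
                     + $d                    # mean components
                     + 1;                    # prior
    my $total = $K * $per_gaussian - 1;      # unit-summation constraint on the priors
    print "Free parameters for $K Gaussians in ${d}D: $total\n";   # prints 59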

At its core, EM depends on the notion of unobserved data and the averaging of the log-likelihood of the data actually observed over all admissible probabilities for the unobserved data. But what is unobserved data? While in some cases where EM is used, the unobserved data is literally the missing data, in others, it is something that cannot be seen directly but that nonetheless is relevant to the data actually observed. For the case of clustering multidimensional numerical data that can be modeled as a Gaussian mixture, it turns out that the best way to think of the unobserved data is in terms of a sequence of random variables, one for each observed data point, whose values dictate the selection of the Gaussian for that data point. This point is explained in great detail in my on-line tutorial at https://engineering.purdue.edu/kak/ExpectationMaximization.pdf.

The EM algorithm in our context reduces to an iterative invocation of the following steps: (1) Given the current guess for the means and the covariances of the different Gaussians in our mixture model, use Bayes' Rule to update the posterior class probabilities at each of the data points; (2) Using the updated posterior class probabilities, first update the class priors; (3) Using the updated class priors, update the class means and the class covariances; and go back to Step (1). Ideally, the iterations should terminate when the expected log-likelihood of the observed data has reached a maximum and does not change with any further iterations. The stopping rule used in this module is the detection of no change over three consecutive iterations in the values calculated for the priors.
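
Written out in the standard textbook notation (again generic, not the module's internal variables), one EM iteration consists of computing the posterior class probabilities

    \gamma_{ik} \;=\;
      \frac{\pi_k \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
           {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}

followed by the updates to the priors, the means, and the covariances:

    \pi_k \;=\; \frac{1}{N} \sum_{i=1}^{N} \gamma_{ik}, \qquad
    \boldsymbol{\mu}_k \;=\; \frac{\sum_i \gamma_{ik}\,\mathbf{x}_i}{\sum_i \gamma_{ik}}, \qquad
    \boldsymbol{\Sigma}_k \;=\;
      \frac{\sum_i \gamma_{ik}\,(\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^{\top}}
           {\sum_i \gamma_{ik}}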

This module provides three different choices for seeding the clusters: (1) random, (2) kmeans, and (3) manual. When random seeding is chosen, the algorithm randomly selects K data elements as cluster seeds. That is, the data vectors associated with these seeds are treated as initial guesses for the means of the Gaussian distributions. The covariances are then set to the values calculated from the entire dataset with respect to the means corresponding to the seeds. With kmeans seeding, on the other hand, the means and the covariances are set to whatever values are returned by the kmeans algorithm. And, when seeding is set to manual, you are allowed to choose K data elements --- by specifying their tag names --- for the seeds. The rest of the EM initialization for the manual mode is the same as for the random mode. The algorithm allows for the initial priors to be specified for the manual mode of seeding.
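
Conceptually, the random seeding step amounts to something like the following fragment. This is NOT the module's internal code; it assumes the data has already been read into a hash that maps each symbolic tag to an array reference holding that record's numerical values:

    use strict;
    use warnings;
    use List::Util qw(shuffle);

    # Pick K distinct data records at random; their vectors become the initial
    # guesses for the cluster means.  The initial covariances are then computed
    # from the entire dataset, as described above.
    sub pick_random_seeds {
        my ($data_hash, $K) = @_;
        my @tags = shuffle keys %$data_hash;
        my @seed_tags = @tags[0 .. $K - 1];
        return @{$data_hash}{@seed_tags};    # hash slice: K array references
    }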

Much of the code for the kmeans based seeding of EM was drawn from the Algorithm::KMeans module, also by me. The code from that module used here corresponds to the case when the cluster_seeding option in the Algorithm::KMeans module is set to smart. The smart option for KMeans consists of subjecting the data to a principal components analysis (PCA) to discover the direction of maximum variance in the data space. The data points are then projected onto this direction, a histogram is constructed from the projections, and the histogram is smoothed. The centers of the K largest peaks in the smoothed histogram are used to seed the KMeans based clusterer. As you'd expect, the output of this KMeans clusterer is then used to seed the EM algorithm.

This module uses two different criteria to measure the quality of the clustering achieved. The first is the Minimum Description Length (MDL) proposed originally by Rissanen (J. Rissanen: "Modeling by Shortest Data Description," Automatica, 1978, and "A Universal Prior for Integers and Estimation by Minimum Description Length," Annals of Statistics, 1983). The MDL criterion is a difference of a log-likelihood term for all of the observed data and a model-complexity penalty term. In general, both the log-likelihood and the model-complexity terms increase as the number of clusters increases. The form of the MDL criterion in this module uses for the penalty term the Bayesian Information Criterion (BIC) of G. Schwarz, "Estimating the Dimension of a Model," The Annals of Statistics, 1978. In general, the smaller the value of the MDL quality measure, the better the clustering of the data.
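
In its most commonly used form, with p denoting the number of free parameters of the mixture model and N the number of data points, this criterion can be written as

    \mathrm{MDL} \;=\; -\ln L(\mathcal{X} \mid \hat{\Theta}) \;+\; \frac{p}{2}\,\ln N

where L(\mathcal{X} \mid \hat{\Theta}) is the likelihood of the observed data under the estimated mixture parameters. The module's exact bookkeeping may differ in its constants and sign conventions, but with this form smaller values indicate better clustering, consistent with the remark above.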

For our second measure of clustering quality, we use `trace( SW^-1 . SB)' where SW is the within-class scatter matrix, more commonly denoted S_w, and SB the between-class scatter matrix, more commonly denoted S_b (the underscore means subscript). This measure can be thought of as the normalized average distance between the clusters, the normalization being provided by average cluster covariance SW^-1. Therefore, the larger the value of this quality measure, the better the separation between the clusters. Since this measure has its roots in the Fisher linear discriminant function, we incorporate the word fisher in the name of the quality measure. Note that this measure is good only when the clusters are disjoint. When the clusters exhibit significant overlap, the numbers produced by this quality measure tend to be generally meaningless.
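
Purely for illustration, here is how that measure could be computed for the 2D case with plain Perl arrays. The module itself works with Math::GSL::Matrix objects rather than this hand-rolled code; $SW and $SB below are 2x2 matrices supplied as references to arrays of rows:

    use strict;
    use warnings;

    sub fisher_quality_2d {
        my ($SW, $SB) = @_;
        my ($a, $b, $c, $d) = ($SW->[0][0], $SW->[0][1], $SW->[1][0], $SW->[1][1]);
        my $det = $a * $d - $b * $c;
        die "SW is singular" if abs($det) < 1e-12;
        # inverse of the 2x2 within-class scatter matrix
        my @SWinv = ( [  $d / $det, -$b / $det ],
                      [ -$c / $det,  $a / $det ] );
        # trace(SW^-1 . SB): sum of the diagonal entries of the product
        my $trace = 0;
        for my $i (0 .. 1) {
            for my $k (0 .. 1) {
                $trace += $SWinv[$i][$k] * $SB->[$k][$i];
            }
        }
        return $trace;
    }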

METHODS

The module provides the following methods for EM based clustering, for cluster visualization, for data visualization, and for the generation of data for testing a clustering algorithm:

new():

A call to new() constructs a new instance of the Algorithm::ExpectationMaximization class. A typical form of this call when you want to use the random option for seeding the algorithm is:

    my $clusterer = Algorithm::ExpectationMaximization->new(
                                datafile            => $datafile,
                                mask                => $mask,
                                K                   => 3,
                                max_em_iterations   => 300,
                                seeding             => 'random',
                                terminal_output     => 1,
                                debug               => 0,
                    );

where K is the expected number of clusters and max_em_iterations the maximum number of EM iterations that you want to allow until convergence is achieved. Depending on your dataset and on the choice of the initial seeds, the actual number of iterations needed could be as few as 10 or could run all the way to the 300 allowed here. The output produced by the algorithm shows the actual number of iterations used to arrive at convergence.

The data file supplied through the datafile option is expected to contain entries in the following format

   c20  0  10.7087017086940  9.63528386251712  10.9512155258108  ...
   c7   0  12.8025925026787  10.6126270065785  10.5228482095349  ...
   b9   0  7.60118206283120  5.05889245193079  5.82841781759102  ...
   ....
   ....

where the first column contains the symbolic ID tag for each data record and the rest of the columns the numerical information. Which columns are actually used for clustering is decided by the string value of the mask. For example, if we wanted to cluster on the basis of the entries in just the 3rd, the 4th, and the 5th columns above, the mask value would be N0111, where the character N indicates that the ID tag is in the first column, the character 0 that the second column is to be ignored, and the 1's that follow that the 3rd, the 4th, and the 5th columns are to be used for clustering.
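
For instance, the following fragment (purely illustrative, not the module's own parsing code) shows how a mask such as N0111 lines up with the columns of one record from the data file:

    use strict;
    use warnings;

    my $mask = "N0111";
    my $line = "c20  0  10.7087017086940  9.63528386251712  10.9512155258108";
    my @columns    = split /\s+/, $line;
    my @mask_chars = split //, $mask;

    my ($tag, @fields_for_clustering);
    for my $i (0 .. $#mask_chars) {
        if    ($mask_chars[$i] eq 'N') { $tag = $columns[$i] }
        elsif ($mask_chars[$i] eq '1') { push @fields_for_clustering, $columns[$i] }
        # a '0' in the mask means that column is ignored
    }
    print "tag = $tag,  clustering fields = @fields_for_clustering\n";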

If instead of random seeding, you wish to use the kmeans based seeding, just replace the option random supplied for seeding by kmeans. You can also do manual seeding by designating a specified set of data elements to serve as cluster seeds. The call to the constructor in this case looks like

    my $clusterer = Algorithm::ExpectationMaximization->new(
                                datafile            => $datafile,
                                mask                => $mask,
                                K                   => 3,
                                max_em_iterations   => 300,
                                seeding             => 'manual',
                                seed_tags           => ['a26', 'b53', 'c49'],
                                terminal_output     => 1,
                                debug               => 0,
                    );

where the option seed_tags is set to an anonymous array of symbolic names associated with the data elements.

If you know the class priors, you can supply them through an additional option to the constructor that looks like

    class_priors    => [0.6, 0.2, 0.2],

for the case of K equal to 3. In general, this would be a useful thing to do only for the case of manual seeding. If you go for manual seeding, the order in which the priors are expressed should correspond to the order of the manually chosen tags supplied through the seed_tags option.

Note that the parameter terminal_output is boolean; when not supplied in the call to new() it defaults to 0. When this parameter is set, useful information is displayed in the terminal window in which you invoke the algorithm.

read_data_from_file():
    $clusterer->read_data_from_file()

This is a required call after the constructor is invoked. As you would expect, this call reads in the data for clustering.

seed_the_clusters():
    $clusterer->seed_the_clusters();

This is also a required call. It processes the option you supplied for seeding in the constructor call to choose the data elements for seeding the K clusters.

EM():
    $clusterer->EM();

This is the workhorse of the module, as you would expect. The means, the covariances, and the priors estimated by this method are stored in instance variables that are subsequently accessed by other methods for the purpose of displaying the clusters, the probability distributions, etc.

run_bayes_classifier():
    $clusterer->run_bayes_classifier();

Using the posterior probability distributions estimated by the EM() method, this method partitions the data into the K disjoint clusters using the naive Bayes' classifier.

return_disjoint_clusters():
    my $clusters = $clusterer->return_disjoint_clusters();

This allows you to access the clusters obtained with the application of the naive Bayes' classifier in your own scripts. If, say, you wanted to see the data records placed in each cluster, you could subsequently invoke the following loop in your own script:

    foreach my $index (0..@$clusters-1) {
        print "Cluster $index (Naive Bayes):   @{$clusters->[$index]}\n\n"
    }

where $clusters holds the array reference returned by the call to return_disjoint_clusters().

write_naive_bayes_clusters_to_files():
    $clusterer->write_naive_bayes_clusters_to_files();

This method writes the clusters obtained by applying the naive Bayes' classifier to disk files, one cluster per file. What is written out to each file consists of the symbolic names of the data records that belong to the cluster corresponding to that file. The clusters are placed in files with names like

    naive_bayes_cluster1.dat
    naive_bayes_cluster2.dat
    ...

return_clusters_with_posterior_probs_above_threshold($theta1):
    my $theta1 = 0.2;
    my $posterior_prob_clusters =
       $clusterer->return_clusters_with_posterior_probs_above_threshold($theta1);

This method returns a reference to an array of K anonymous arrays, each consisting of the symbolic names for the data records where the posterior class probability exceeds the threshold as specified by $theta1. Subsequently, you can access each of these posterior-probability based clusters through a loop construct such as

    foreach my $index (0..@$posterior_prob_clusters-1) {
        print "Cluster $index (based on posterior probs exceeding $theta1): " .
              "@{$posterior_prob_clusters->[$index]}\n\n"
    }

write_posterior_prob_clusters_above_threshold_to_files($theta1):
    $clusterer->write_posterior_prob_clusters_above_threshold_to_files($theta1);

This call writes out the posterior-probability based soft clusters to disk files. As in the previous method, the threshold $theta1 sets the probability threshold for deciding which data elements belong to a cluster. These clusters are placed in files with names like

    posterior_prob_cluster1.dat
    posterior_prob_cluster2.dat
    ...

return_individual_class_distributions_above_given_threshold($theta):
    my $theta2 = 0.00001;
    my $class_distributions =
      $clusterer->return_individual_class_distributions_above_given_threshold($theta2);

This is the method to call if you wish to see the individual Gaussians in your own script. The method returns a reference to an array of anonymous arrays, with each anonymous array representing data membership in each Gaussian. Only those data points are included in each Gaussian where the probability exceeds the threshold $theta2. Note that the larger the covariance and the higher the data dimensionality, the smaller this threshold must be for you to see any of the data points in a Gaussian. After you have accessed the Gaussian mixture in this manner, you can display the data membership in each Gaussian through the following sort of a loop:

    foreach my $index (0..@$class_distributions-1) {
        print "Gaussian Distribution $index (only shows data elements " .
              "whose probabilities exceed the threshold $theta2:  " .
              "@{$class_distributions->[$index]}\n\n"
    }

visualize_clusters($visualization_mask):
    my $visualization_mask = "11";
    $clusterer->visualize_clusters($visualization_mask);

The visualization mask here does not have to be identical to the one used for clustering, but must be a subset of that mask. This is convenient for visualizing the clusters in two- or three-dimensional subspaces of the original space. The subset is specified by placing `0's in the positions corresponding to the dimensions you do NOT want to see through visualization. Depending on the mask used, this method creates a 2D or a 3D scatter plot of the clusters obtained through the naive Bayes' classification rule.

visualize_distributions($visualization_mask):
    $clusterer->visualize_distributions($visualization_mask);

This is the method to call if you want to visualize the soft clustering corresponding to the posterior class probabilities exceeding the threshold specified in the call to return_clusters_with_posterior_probs_above_threshold($theta1). Again, depending on the visualization mask used, you will see either a 2D plot or a 3D scatter plot.

plot_hardcopy_clusters($visualization_mask):
    $clusterer->plot_hardcopy_clusters($visualization_mask);

This method creates a PNG file from the gnuplot-created display of the naive Bayes' clusters obtained from the data. The plotting functionality of gnuplot is accessed through the Perl wrappers provided by the Graphics::GnuplotIF module.

plot_hardcopy_distributions($visualization_mask):
    $clusterer->plot_hardcopy_distributions($visualization_mask);

This method creates a PNG file from the gnuplot-created display of the clusters that correspond to the posterior class probabilities exceeding a specified threshold. The plotting functionality of gnuplot is accessed through the Perl wrappers provided by the Graphics::GnuplotIF module.

display_fisher_quality_vs_iterations():
    $clusterer->display_fisher_quality_vs_iterations();

This method measures the quality of clustering by calculating the normalized average squared distance between the cluster centers, the normalization being provided by the average cluster covariance. See the Description for further details. In general, this measure is NOT useful for overlapping clusters.

display_mdl_quality_vs_iterations():
    $clusterer->display_mdl_quality_vs_iterations();

At the end of each iteration, this method measures the quality of clustering by calculating its MDL (Minimum Description Length). As stated earlier in the Description, the MDL measure is a difference of a log-likelihood term for all of the observed data and a model-complexity penalty term. The smaller the value returned by this method, the better the clustering.

return_estimated_priors():
    my $estimated_priors = $clusterer->return_estimated_priors();
    print "Estimated class priors: @$estimated_priors\n";

This method can be used to access the final values of the class priors as estimated by the EM algorithm.

cluster_data_generator():
    Algorithm::ExpectationMaximization->cluster_data_generator(
                            input_parameter_file => $parameter_file,
                            output_datafile => $out_datafile,
                            total_number_of_data_points => 300 
    );

This class method generates multivariate data for clustering if you wish to play with synthetic data for experimenting with the EM algorithm. The input parameter file must specify the priors to be used for the Gaussians, their means, and their covariance matrices. The format of the information contained in the parameter file must be as shown in the file param1.txt provided in the examples directory. It will be easiest for you to just edit a copy of this file for your data generation needs. In addition to the format of the parameter file, the main constraint you need to observe in specifying the parameters is that the dimensionality of the covariance matrices must correspond to the dimensionality of the mean vectors. The multivariate random numbers are generated by calling the Math::Random module. As you would expect, this module requires that the covariance matrices you specify in your parameter file be symmetric and positive definite. Should the covariances in your parameter file not obey this condition, the Math::Random module will let you know.

visualize_data($data_visualization_mask):
    $clusterer->visualize_data($data_visualization_mask);                          

This is the method to call if you want to visualize the data you plan to cluster with the EM algorithm. You'd need to specify the mask argument in a manner similar to the visualization of the clusters, as explained earlier.

plot_hardcopy_data($data_visualization_mask):
    $clusterer->plot_hardcopy_data($data_visualization_mask); 

This method creates a PNG file that can be used to print out a hardcopy of the data in different 2D and 3D subspaces of the data space. The visualization mask is used to select the subspace for the PNG image.

HOW THE CLUSTERS ARE OUTPUT

This module produces two different types of clusters: the "hard" clusters and the "soft" clusters. The hard clusters correspond to the naive Bayes' classification of the data points on the basis of the Gaussian distributions and the class priors estimated by the EM algorithm. Such clusters partition the data into disjoint subsets. On the other hand, the soft clusters correspond to the posterior class probabilities calculated by the EM algorithm. A data element belongs to a cluster if its posterior probability for that Gaussian exceeds a threshold.

After the EM algorithm has finished running, the hard clusters are created by invoking the method run_bayes_classifier() on an instance of the module and then made user-accessible by calling return_disjoint_clusters(). These clusters may then be displayed in a terminal window by dereferencing each element of the array whose reference is returned by return_disjoint_clusters(). The hard clusters can be written out to disk files by invoking write_naive_bayes_clusters_to_files(). This method writes out the clusters to files, one cluster per file. What is written out to each file consists of the symbolic names of the data records that belong to the cluster corresponding to that file. The clusters are placed in files with names like

    naive_bayes_cluster1.dat
    naive_bayes_cluster2.dat
    ...

The soft clusters, on the other hand, are created by calling return_clusters_with_posterior_probs_above_threshold($theta1) on an instance of the module, where the argument $theta1 is the threshold for deciding whether a data element belongs in a soft cluster. The posterior class probability at a data element must exceed the threshold for the element to belong to the corresponding cluster. The soft clusters can be written out to disk files by calling write_posterior_prob_clusters_above_threshold_to_files($theta1). As with the hard clusters, each cluster is placed in a separate file. The filenames for such clusters look like:

    posterior_prob_cluster1.dat
    posterior_prob_cluster2.dat
    ...

WHAT IF THE NUMBER OF CLUSTERS IS UNKNOWN?

The module constructor requires that you supply a value for the parameter K, which is the number of clusters you expect to see in the data. But what if you do not have a good value for K? Note that it is possible to search for the best K by using the two clustering quality criteria included in the module. However, I have intentionally not yet incorporated that feature in the module because it slows down the execution of the code --- especially when the dimensionality of the data increases. However, nothing prevents you from writing a script --- along the lines of the five "canned_example" scripts in the examples directory --- that would use the two clustering quality metrics for finding the best choice for K for a given dataset. Obviously, you will now have to incorporate the call to the constructor in a loop and check the value of the quality measures for each value of K.
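
Such a script might be structured along the following lines. Note that this is only a sketch: the quality-measure methods in this module display their results rather than return them, so the get_mdl_value() accessor shown below is hypothetical and merely stands in for whatever mechanism you choose for capturing the MDL numbers:

    use strict;
    use warnings;
    use Algorithm::ExpectationMaximization;

    my ($best_K, $best_mdl);
    for my $K (2 .. 6) {
        my $clusterer = Algorithm::ExpectationMaximization->new(
                            datafile          => "mydatafile1.dat",
                            mask              => "N0111",
                            K                 => $K,
                            max_em_iterations => 300,
                            seeding           => 'random',
                            terminal_output   => 1,
                        );
        $clusterer->read_data_from_file();
        $clusterer->seed_the_clusters();
        $clusterer->EM();
        my $mdl = $clusterer->get_mdl_value();    # hypothetical accessor
        if (!defined $best_mdl or $mdl < $best_mdl) {
            ($best_K, $best_mdl) = ($K, $mdl);
        }
    }
    print "Best K according to the MDL criterion: $best_K\n";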

THE examples DIRECTORY

Becoming familiar with this directory should be your best strategy to become comfortable with this module (and its future versions). You are urged to start by executing the following five example scripts:

canned_example1.pl

This example applies the EM algorithm to the data contained in the datafile mydatafile.dat. The mixture data in the file corresponds to three overlapping Gaussian components in a star-shaped pattern. The EM based clustering for this data is shown in the files save_example_1_cluster_plot.png and save_example_1_posterior_prob_plot.png, the former displaying the hard clusters obtained by using the naive Bayes' classifier and the latter showing the soft clusters obtained on the basis of the posterior class probabilities at the data points.

canned_example2.pl

The datafile used in this example is mydatafile2.dat. This mixture data corresponds to two well-separated relatively isotropic Gaussians. EM based clustering for this data is shown in the files save_example_2_cluster_plot.png and save_example_2_posterior_prob_plot.png, the former displaying the hard clusters obtained by using the naive Bayes' classifier and the latter showing the soft clusters obtained by using the posterior class probabilities at the data points.

canned_example3.pl

Like the first example, this example again involves three Gaussians, but now their means are not co-located. Additionally, we now seed the clusters manually by specifying three selected data points as the initial guesses for the cluster means. The datafile used for this example is mydatafile3.dat. The EM based clustering for this data is shown in the files save_example_3_cluster_plot.png and save_example_3_posterior_prob_plot.png, the former displaying the hard clusters obtained by using the naive Bayes' classifier and the latter showing the soft clusters obtained on the basis of the posterior class probabilities at the data points.

canned_example4.pl

Whereas the three previous examples demonstrated EM based clustering of 2D data, we now present an example of clustering in 3D. The datafile used in this example is mydatafile4.dat. This mixture data corresponds to three well-separated but highly anisotropic Gaussians. The EM derived clustering for this data is shown in the files save_example_4_cluster_plot.png and save_example_4_posterior_prob_plot.png, the former displaying the hard clusters obtained by using the naive Bayes' classifier and the latter showing the soft clusters obtained on the basis of the posterior class probabilities at the data points.

canned_example5.pl

We again demonstrate clustering in 3D but now we have one Gaussian cluster that "cuts" through the other two Gaussian clusters. The datafile used in this example is mydatafile5.dat. The three Gaussians in this case are highly overlapping and highly anisotropic. The EM derived clustering for this data is shown in the files save_example_5_cluster_plot.png and save_example_5_posterior_prob_plot.png, the former displaying the hard clusters obtained by using the naive Bayes' classifier and the latter showing the soft clusters obtained through the posterior class probabilities at the data points.

Going through the five examples listed above will make you familiar with how to make the calls to the clustering and the visualization methods. The examples directory also includes several parameter files with names like

    param1.txt
    param2.txt
    param3.txt 
    ...

These were used to generate the synthetic data for which the results are shown in the examples directory. Just make a copy of one of these files and edit it if you would like to generate your own multivariate data for clustering. Note that you can generate data with any dimensionality through appropriate entries in the parameter file.

CAVEATS

When you run the scripts in the examples directory, your results will NOT always look like what I have shown in the PNG image files in the directory. As mentioned earlier in Description, the EM algorithm starting from randomly chosen initial guesses for the cluster means can get stuck in a local maximum.

That raises an interesting question of how one judges the correctness of clustering results when dealing with real experimental data. For real data, the best approach is to try the EM algorithm multiple times with all of the seeding options included in this module. It would be safe to say that, at least in low dimensional spaces and with sufficient data, a majority of your runs should yield "correct" results.
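
For instance, a script that re-runs the clusterer with the different seeding options and prints the resulting hard clusters for a side-by-side comparison might be sketched as follows; the manual option is left out here because it also requires a seed_tags list:

    use strict;
    use warnings;
    use Algorithm::ExpectationMaximization;

    for my $seeding ('random', 'kmeans') {
        my $clusterer = Algorithm::ExpectationMaximization->new(
                            datafile          => "mydatafile1.dat",
                            mask              => "N0111",
                            K                 => 3,
                            max_em_iterations => 300,
                            seeding           => $seeding,
                            terminal_output   => 1,
                        );
        $clusterer->read_data_from_file();
        $clusterer->seed_the_clusters();
        $clusterer->EM();
        $clusterer->run_bayes_classifier();
        my $clusters = $clusterer->return_disjoint_clusters();
        print "Seeding option: $seeding\n";
        print "  Cluster $_:  @{$clusters->[$_]}\n" for 0 .. @$clusters - 1;
    }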

Also bear in mind that a pure Perl implementation is not meant for the clustering of very large data files. It is really designed more for researching issues related to EM based approaches to clustering.

REQUIRED

This module requires the following three modules:

   Math::Random
   Graphics::GnuplotIF
   Math::GSL::Matrix

the first for generating the multivariate random numbers, the second for the visualization of the clusters, and the last for access to the Perl wrappers for the GNU Scientific Library. The Matrix module of this library is used for various algebraic operations on the covariance matrices of the Gaussians.

EXPORT

None by design.

BUGS

Please notify the author if you encounter any bugs. When sending email, please place the string 'Algorithm EM' in the subject line.

INSTALLATION

The usual

    perl Makefile.PL
    make
    make test
    make install

if you have root access. If not,

    perl Makefile.PL PREFIX=/some/other/directory/
    make
    make test
    make install

AUTHOR

Avinash Kak, kak@purdue.edu

If you send email, please place the string "EM Algorithm" in your subject line to get past my spam filter.

COPYRIGHT

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

 Copyright 2012 Avinash Kak
