NAME

Mail::Digest::Tools - Tools for digest versions of mailing lists

VERSION

This document refers to version 2.12 of digest.pl, released May 14, 2011.

SYNOPSIS

    use Mail::Digest::Tools qw( 
        process_new_digests
        reprocess_ALL_digests
        reply_to_digest_message
        repair_message_order
        consolidate_threads_multiple
        consolidate_threads_single
        delete_deletables
    );

%config_in and %config_out are two configuration hashes whose setup is discussed in detail below.

    process_new_digests(\%config_in, \%config_out);

    reprocess_ALL_digests(\%config_in, \%config_out);

    $full_reply_file = reply_to_digest_message(
        \%config_in, 
        \%config_out, 
        $digest_number, 
        $digest_entry, 
        $directory_for_reply,
    );

    repair_message_order(
        \%config_in, 
        \%config_out,
        {
            year   => 2004,
            month  => 01,
            day    => 27,
        }
    );

    consolidate_threads_multiple(
        \%config_in,
        \%config_out,
        $first_common_letters,  # optional integer argument; defaults to 20
    );

    consolidate_threads_single(
        \%config_in, 
        \%config_out, 
        [
            'first_dummy_file_for_consolidation.thr.txt',
            'second_dummy_file_for_consolidation.thr.txt',
        ],
    );

    delete_deletables(\%config_out);

DESCRIPTION

Mail::Digest::Tools provides useful tools for processing mail which an individual receives in a 'daily digest' version from a mailing list. Digest versions of mailing lists are provided by a variety of mail processing programs and by a variety of list hosts. Within the Perl community, digest versions of mailing lists are offered by such sponsors as ActiveState, SourceForge, Yahoo! Groups and London.pm. However, you do not have to be interested in Perl to make use of Mail::Digest::Tools. Mail from any of the thousands of Yahoo! Groups, for example, may be processed with this module.

If, when you receive e-mail from the digest version of a mailing list, you simply read the digest in an e-mail client and then discard it, you may stop reading here. If, however, you wish to read or store such mail by subject, read on. As printed in a normal web browser, this document contains 40 pages of documentation. You are urged to print this documentation out and study it before using this module.

To understand how to use Mail::Digest::Tools, we will first take a look at a typical mailing list digest. We will then sketch how that digest looks once processed by Mail::Digest::Tools. We will then discuss Mail::Digest::Tools' exportable functions. Next, we will study how to prepare the two configuration hashes which hold the configuration data. Finally, we will provide some tips for everyday use of Mail::Digest::Tools.

A TYPICAL MAILING LIST DIGEST

Here is a dummied-up version of a typical mailing list digest as it appears once saved to a plain-text file. For illustrative purposes, let us suppose that the file is named: 'Perl-Win32-Users Digest, Vol 1 Issue 9999.txt'

    Send Perl-Win32-Users mailing list submissions to
    perl-win32-users@listserv.ActiveState.com

    When replying, please edit your Subject line so it is more specific
    than "Re: Contents of Perl-Win32-Users digest..."

    Today's Topics:

      1. Introducing Mail::Digest::Tools (James E Keenan)
      2. A Different Discussion (steve)
      3. Re:  Introducing Mail::Digest::Tools (David H Adler)

    ----------------------------------------------------------------------

    Message: 1
    From: "James E Keenan" <jkeen@some.web.address.com>
    To: <Perl-Win32-Users@listserv.activestate.com>
    Subject: Introducing Mail::Digest::Tools
    Date: Sat, 31 Jan 2004 14:10:20 -0600

    Mail::Digest::Tools is the greatest thing since sliced bread.
    Go download it now!

    ------------------------------

    Message: 2
    From: "steve" <steve@some.web.address.com>
    To: <Perl-Win32-Users@listserv.activestate.com>
    Subject: A Different Discussion
    Date: Sat, 31 Jan 2004 14:40:20 -0600

    This is a new topic.  I am not discussing Mail::Digest::Tools in this 
    submission.

    ------------------------------

    Message: 3
    From: "David H Adler" <dha@some.web.address.com>
    To: <Perl-Win32-Users@listserv.activestate.com>
    Subject: Re: Introducing Mail::Digest::Tools
    Date: Sat, 31 Jan 2004 14:50:20 -0600

    Jim, what's this nonsense about sliced bread.  Weren't you on the Atkins 
    diet?  Unlike beer, sliced bread is Off Topic.

    ------------------------------

    _______________________________________________
    Perl-Win32-Users mailing list
    Perl-Win32-Users@listserv.ActiveState.com
    To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs

    End of Perl-Win32-Users Digest

Note that the digest has an overall structure, while each message within the digest has its own structure.

The digest's overall structure consists of:

  • Digest Header

    The digest header consists of one or more paragraphs providing instructions on how to subscribe, post messages, unsubscribe and contact the list administrator.

    In processing a digest, Mail::Digest::Tools generally discards the digest header.

  • Today's Topics

    Next, each daily digest contains a list of the subjects of the messages found in that particular digest. This list is introduced by a paragraph such as:

        Today's Topics

    and is followed by a numbered list of the message subjects and authors. Some digests break the authors into two lines for names and e-mail addresses. Others, such as the example above, list only names.

    When Mail::Digest::Tools processes a digest, it extracts the list of topics as a single chunk and appends it to a file containing the topics from all previous digests which the user has similarly processed.

  • Post-Topics Delimiter

    The list of topics is separated from the first message by a string of characters which the list sponsor has, we hope, determined is not likely to occur in the text of any message posted to that list. In the example above, the post-topics delimiter is the string:

        ----------------------------------------------------------------------

    followed by two \n newlines (so that the delimiter is a paragraph unto itself). Other digests may use a two-line delimiter such as:

        _______________________________________________________
        _______________________________________________________

    or

        --__--__--
  • Source Message Delimiter

    Most mailing list digests use the same string to delimit individual messages within the digest that they use to delimit the list of today's topics from the very first message in the digest. (The author tracked one digest for more than three-and-a-half years that used the same string for both functions -- only to see that digest's provider change its format while this module was being prepared for CPAN!) But the digest may use a different string to separate individual messages from each other. In the sample digest above, the source message delimiter is the string:

        ------------------------------

    followed by two \n newlines (so that the delimiter is a paragraph unto itself).

    As we shall see below, correctly identifying the post-topics delimiter and source message delimiter used in a particular digest is essential to correct configuration of Mail::Digest::Tools, as the module will repeatedly split digests on these delimiters.

  • Individual Messages

    Individual messages have their own structure.

    • Headers

      In addition to normal mail headers, a message in a digest must have a message number representing its position within that day's digest. So a message in a digest will typically have some or all of the following headers:

          Message:
          From:
          Organization:
          Reply-To:
          To:
          CC:
          Date:
          Subject:
    • Message Body

      One or more paragraphs of text, frequently including citations from earlier postings to the mailing list.

      The main objective of Mail::Digest::Tools is to extract headers and bodies from particular digest entries and to append them to plain-text files which hold all postings on a particular subject. See discussion of process_new_digests below.

      Many mailing lists allow subscribers to post in either plain-text or HTML. Some allow users to post attachments; others do not. Others still incorporate the attachments into the message body, often using 'multipart MIME' format. Regrettably, certain mailing list digest programs fail to eliminate redundant MIME parts before posting a message to a digest. This leads to severe bloat once Mail::Digest::Tools extracts a message's content and posts it to a thread file. Mail::Digest::Tools, however, provides its users with the option of stripping redundant MIME parts from a message before posting.

    • Source Message Delimiter

      As discussed above, each message within a digest is delimited by a string which may or may not be the same string which separates the list of Today's Topics from the first message in the digest.

  • Digest Footer

    The digest footer consists of one or more paragraphs containing additional information on the digest and signaling the end of the digest. It follows the source message delimiter corresponding to the last message in a particular digest.

    In processing a given digest, Mail::Digest::Tools generally discards the digest footer.
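
The structure just described can be taken apart with two successive splits. The following is only a minimal sketch of that idea, not the module's own code: split_digest() is a hypothetical helper, and the delimiters (70 and 30 hyphens) are simply those of the sample digest above.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Mail::Digest::Tools: split a digest's
# text into its topics list and its individual messages.
sub split_digest {
    my ($text, $post_topics_delim, $message_delim) = @_;

    # Header and topics precede the post-topics delimiter;
    # messages and footer follow it.
    my ($front, $rest) =
        split /^\Q$post_topics_delim\E[ \t]*\n/m, $text, 2;
    die "Post-topics delimiter not found\n" unless defined $rest;

    # Keep the topics list; the digest header before it is discarded.
    my ($topics) = $front =~ /^(Today's Topics.*)/ms;
    $topics = '' unless defined $topics;
    $topics =~ s/\s+\z//;

    # The chunk after the last message delimiter is the digest footer.
    my @chunks = split /^\Q$message_delim\E[ \t]*\n/m, $rest, -1;
    pop @chunks;

    my @messages;
    for my $chunk (@chunks) {
        $chunk =~ s/\A\s+//;
        $chunk =~ s/\s+\z//;
        next unless length $chunk;

        # Headers end at the first blank line; the rest is the body.
        my ($head, $body) = split /\n[ \t]*\n/, $chunk, 2;
        my %headers = $head =~ /^([\w-]+):[ \t]*(.*)$/mg;
        push @messages, {
            headers => \%headers,
            body    => defined $body ? $body : '',
        };
    }
    return { topics => $topics, messages => \@messages };
}

# A cut-down version of the sample digest shown earlier.
my $post_topics_delim = '-' x 70;
my $message_delim     = '-' x 30;
my $digest = join "\n",
    'Send Perl-Win32-Users mailing list submissions to',
    'perl-win32-users@listserv.ActiveState.com',
    '',
    "Today's Topics:",
    '',
    '  1. Introducing Mail::Digest::Tools (James E Keenan)',
    '  2. A Different Discussion (steve)',
    '',
    $post_topics_delim,
    '',
    'Message: 1',
    'From: "James E Keenan" <jkeen@some.web.address.com>',
    'Subject: Introducing Mail::Digest::Tools',
    '',
    'Go download it now!',
    '',
    $message_delim,
    '',
    'Message: 2',
    'From: "steve" <steve@some.web.address.com>',
    'Subject: A Different Discussion',
    '',
    'This is a new topic.',
    '',
    $message_delim,
    '',
    'End of Perl-Win32-Users Digest',
    '';

my $parsed = split_digest($digest, $post_topics_delim, $message_delim);
printf "%d messages found\n", scalar @{ $parsed->{messages} };
```

Anchoring each delimiter to a whole line keeps the shorter message delimiter from matching inside the longer post-topics delimiter.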

The Typical Digest After Processing with Mail::Digest::Tools

Using the dummy messages provided above, typical use of Mail::Digest::Tools would produce (in a bare-bones configuration) the following results:

  • Two plain-text 'thread' files holding the ongoing discussion of each topic:

    • Introducing Mail::Digest::Tools.thr.txt

          Thread:       Introducing Mail::Digest::Tools
          Message:      001_9999_001
          From:         "James E Keenan" <jkeen@some.web.address.com>
          Text:
      
          Mail::Digest::Tools is the greatest thing since sliced bread.
          Go download it now!
      
          --__--__--
      
          Thread:       Introducing Mail::Digest::Tools
          Message:      001_9999_003
          From: "David H Adler" <dha@some.web.address.com>
          Text:
      
          Jim, what's this nonsense about sliced bread.  Weren't you on the Atkins 
          diet?  Unlike beer, sliced bread is Off Topic.
      
          --__--__--
    • A Different Discussion.thr.txt

          Thread:       A Different Discussion
          Message:      001_9999_002
          From: "steve" <steve@some.web.address.com>
          Text:
      
          This is a new topic.  I am not discussing Mail::Digest::Tools in this 
          submission.
      
          --__--__--
  • A new entry at the end of file todays_topics.txt:

        Today's Topics
    
        ...
    
        Perl-Win32-Users digest, Vol 1 #9999 - 3 msgs.txt
          1. Introducing Mail::Digest::Tools (James E Keenan)
          2. A Different Discussion (steve)
          3. Re:  Introducing Mail::Digest::Tools (David H Adler)
  • A new entry at the end of file digests_log.txt:

        001_9999;Fri Feb  6 18:57:41 2004;Fri Feb  6 18:57:41 2004
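
The message identifiers in the thread files above (001_9999_001 and so on) appear to combine three zero-padded numbers. The sketch below assumes they are the volume, the digest number and the message's position within the digest; that reading, the padding widths and both helper names are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Assumption: identifier = volume, digest number, message number,
# zero-padded and joined by underscores, as in '001_9999_003'.
sub format_message_id {
    my ($volume, $digest_number, $message_number) = @_;
    return sprintf '%03d_%04d_%03d',
        $volume, $digest_number, $message_number;
}

# Reverse operation: recover the three numbers from an identifier.
sub parse_message_id {
    my ($id) = @_;
    my @parts = $id =~ /\A(\d+)_(\d+)_(\d+)\z/
        or die "Unrecognized message identifier: $id\n";
    return map { $_ + 0 } @parts;
}

print format_message_id(1, 9999, 3), "\n";
```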

FUNCTIONS

Mail::Digest::Tools exports no functions by default. Each of its current seven functions is imported only on request by your script.

In everyday use, you will probably call just one of Mail::Digest::Tools' exportable functions in a particular Perl script. Typically, you will import the function as described in the SYNOPSIS above, populate two configuration hashes, and finally call the one function you have imported.

As will become evident, the most challenging part of using Mail::Digest::Tools is not calling the functions. Rather, it is the initial setup and testing of configuration files from which the two configuration hashes passed as arguments to the various Mail::Digest::Tools functions are drawn.

More on those configuration hashes later. For now, let's look at the exportable functions.

process_new_digests

    process_new_digests(\%config_in, \%config_out);

process_new_digests() is the Mail::Digest::Tools function which you will use most frequently on a daily basis. Based on information supplied in the two configuration hashes passed to it as arguments, process_new_digests() does the following:

  • Validates the configuration data.

  • Conducts an analysis of the directory in which thread files for a given digest are stored to determine which thread files are old enough:

    • either to be moved to a subdirectory for archiving -- if you have told the configuration file that you wish to archive older threads in a subdirectory

    • or to be deleted -- if you have told the configuration file that you do not wish to archive older threads

  • Conducts an analysis of the directory in which digest files (i.e., the plain-text versions of mailing list digests you have received) are stored to determine which digest files are new and need processing and which have previously been processed.

  • Updates a log file to put a timestamp on the processing of the new digest file or files. Based on options set in the configuration file, this function may also update a more human-readable version of this log file.

  • Opens each of the digest files identified as needing processing and proceeds to 'strip down' those files. This 'stripping down' includes the following:

    • The digest file's name is analyzed to extract the digest's number as issued by the provider's mailing list program. This number is used to form part of the unique identifier which Mail::Digest::Tools assigns to each message within each digest.

    • The list of today's topics in the digest is extracted and appended to a permanent log file of such topics.

    • The digest's contents are split into individual messages. Each message, in turn, is split into headers and body.

    • If you have requested in the configuration file that superfluous multipart MIME content be purged from messages before posting to thread files, this purging is now conducted.

    • Each message is appended to an appropriate, plain-text thread file which holds the ongoing discussion of that topic. The following factors are taken into consideration:

      • The name of the thread file is derived from the message's subject, though characters in the message's subject which would not be valid in file names on your operating system are skipped over.

      • To the greatest extent possible, extraneous words in a message's subject such as 'Re:' or 'Fwd:' are deleted so that all relevant postings on a given subject can be included in a single thread file. (Should this not succeed and a new thread file beginning with 'Re:' or some similar term be created, you can fix this later by using Mail::Digest::Tools' consolidate_threads_single() function discussed below.)

    • A brief summation of results is printed to standard output.
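
The step in which a thread file's name is derived from a message's subject can be pictured roughly as follows. This is a sketch under stated assumptions rather than the module's actual algorithm: thread_file_name() is a hypothetical helper, the list of reply and forward markers is a guess, and the set of characters to skip is passed in by the caller because it depends on your operating system.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a message subject into a thread file name.
# $invalid_re matches characters not allowed in file names; the default
# below covers only '/' and NUL and is an assumption.
sub thread_file_name {
    my ($subject, $invalid_re) = @_;
    $invalid_re = qr{[/\0]} unless defined $invalid_re;

    my $name = $subject;
    # Strip any run of leading markers: 'Re:', 'Fwd: RE:', 'FW:' ...
    $name =~ s/\A\s*(?:(?:re|fwd?)\s*:\s*)+//i;
    # Skip over characters that are not valid in file names.
    $name =~ s/$invalid_re//g;
    # Collapse runs of whitespace and trim both ends.
    $name =~ s/\s+/ /g;
    $name =~ s/\A\s+|\s+\z//g;

    return "$name.thr.txt";
}

print thread_file_name('Re:  Introducing Mail::Digest::Tools'), "\n";
```

A stricter character class, such as qr{[/\\:*?"<>|]}, could be passed as the second argument on systems with more restrictive file name rules.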

reprocess_ALL_digests

    reprocess_ALL_digests(\%config_in, \%config_out);

reprocess_ALL_digests() is the Mail::Digest::Tools function which you should use ONLY when you are setting up and fine-tuning Mail::Digest::Tools to process a given digest -- and you should NEVER use it thereafter!

Why? Read on!

reprocess_ALL_digests() does almost exactly the same things as does process_new_digests(), but it does them on ALL digest files found in the directory in which you store such digests -- not just on those which have not previously been processed. But in the process it does not merely append new messages to already existing thread files, leaving older thread files untouched. Instead, reprocess_ALL_digests() WIPES OUT your entire directory of thread files and rebuilds it from scratch.

That's cool if you have retained all instances of a given digest which you wish to process into thread files. But if you've thrown out older instances of a given digest and call reprocess_ALL_digests(), you will not be able to process the messages contained in those discarded digests. The message sources are gone. Discarding older digests is fine once you're certain that you've got a given digest configured just the way you want it -- but not until that moment.

  • Example

    Let's make this more concrete. Suppose that you have begun to subscribe to the digest version of the London Perlmongers mailing list. When you receive e-mails from this provider, you store them in a directory whose contents look like this:

        london.pm digest, Vol 1 #1856 - 7 msgs.txt
        london.pm digest, Vol 1 #1857 - 18 msgs.txt
        london.pm digest, Vol 1 #1858 - 15 msgs.txt
        london.pm digest, Vol 1 #1859 - 17 msgs.txt
        london.pm digest, Vol 1 #1860 - 11 msgs.txt

    Initially, you decide that you want to post the messages in these digests to thread files that are discarded after three days. You set up your configuration files to do precisely this. (See below for how this is done.) You then write a script which calls

        reprocess_ALL_digests(\%config_in, \%config_out);

    Three days go by. One or two new london.pm digests arrive each day. You want to process only the newly arrived files, so each day you simply call:

        process_new_digests(\%config_in, \%config_out);

    and on Day 4 Mail::Digest::Tools starts to notify you on standard output that it is discarding thread files which have not been changed (i.e., received new postings) in three days.

    But then you decide that London.pm's contributors are the most witty and erudite Perlmongers anywhere and you wish to archive their contributions until the end of time (or until the first production release of Perl 6, whichever comes first). Fortunately, you've still got all your London.pm digest files going back to the beginning of your subscription. You make appropriate changes to your configuration setup to say, "Instead of killing these thread files after 3 days of inactivity, archive them after 3 days." (Again, we'll see how to do this below.) You then call:

        reprocess_ALL_digests(\%config_in, \%config_out);

    one last time. All your previously existing thread files are wiped out, and all your London.pm digests are reprocessed from scratch. But that's okay, because you've decided to live with your configuration decisions. So you can now begin to discard older digest files and process newly arrived files only with

        process_new_digests(\%config_in, \%config_out);

    Your London.pm thread archive grows exponentially, and you live happily ever after.

The ALL CAPS in reprocess_ALL_digests() is a little warning that this Mail::Digest::Tools function is very powerful, but potentially very dangerous. You are also alerted to this danger by this screen prompt which appears when you call this function:

     By default, this program processes only NEWLY ARRIVED
     [London.pm/other digest] files found in this directory.  Messages in
     these new digests are sorted and appended to the appropriate
     '.thr.txt' files in the 'Threads' subdirectory.

     However, by choosing method 'reprocess_ALL_digests()' you have
     indicated that you wish to process ALL digest files found in this     
     directory -- regardless of whether or not they have previously been
     processed.  This is recommended ONLY for initialization and testing 
     of this program.

     Since this will wipe out all threads files ('.thr.txt') as well -- 
     including threads files for which you no longer have their source 
     digest files -- please confirm that this is your intent by typing 
     ALL at the prompt.


                               GOT IT?

To proceed, you must type ALL in ALL CAPS, hit [Enter], then respond to yet another prompt:

     You have chosen to WIPE OUT all '.thr.txt' files currently
     existing in the 'Threads' subdirectory and reprocess all
     [London.pm/other digest] digest files from scratch.

     Please re-confirm your choice by once again typing 'ALL'
         and hitting [Enter]:

You must again type ALL in ALL CAPS and hit [Enter] to reprocess all digests. Should you fail to type ALL at both of these prompts, your script will default to process_new_digests() and only process newly arrived digest files.
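
The double confirmation and its fallback can be summarized in a few lines. This is a hypothetical sketch of the behavior described above, not the module's code: choose_operation() is an invented name, and whether the real function still shows the second prompt after a failed first answer is not stated here.

```perl
use strict;
use warnings;

# Hypothetical sketch: decide which operation to run. $read_line is a
# code reference returning one line of user input per call.
sub choose_operation {
    my ($read_line) = @_;
    for my $round (1, 2) {
        my $answer = $read_line->();
        $answer = '' unless defined $answer;
        chomp $answer;
        # Anything other than ALL in ALL CAPS falls back to the safe default.
        return 'process_new_digests' unless $answer eq 'ALL';
    }
    return 'reprocess_ALL_digests';
}

# Canned answers stand in for keyboard input in this demonstration.
my @answers = ("ALL\n", "all\n");
print choose_operation(sub { shift @answers }), "\n";
```

In an interactive script the code reference would read from the terminal instead, for example sub { scalar <STDIN> }.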

reply_to_digest_message

    $full_reply_file = reply_to_digest_message(
        \%config_in, 
        \%config_out, 
        $digest_number, 
        $digest_entry, 
        $directory_for_reply,
    );

Once you have begun to follow discussion threads on a mailing list with the aid of Mail::Digest::Tools, you may wish to join the discussion and reply to a message.

If you tried to do this by hitting the 'Reply' button in your e-mail client, you would probably end up with a 'Subject' line in your e-mail that looked like this:

    Re: london.pm digest, Vol 1 #1814 - 2 msgs

Needless to say, this is tacky. So tacky that many mailing list digest programs insert this message into each digest's headers:

    When replying, please edit your Subject line so it is more specific
    than "Re: Contents of london.pm digest, Vol 1, #xxxx..."

You don't want to be tacky; you want to be lazy. You want Perl to do the work of initiating an e-mail with a meaningful subject header for you. Mail::Digest::Tools' reply_to_digest_message() does just this. It creates a plain-text file for you that has a meaningful subject line and prefixes each line of the body of the message with '> '. You then open this plain-text file, edit it to reply to its contents, copy-and-paste it into your e-mail client, and send it.

The arguments passed to reply_to_digest_message() are:

  • a reference to the 'in' configuration hash

  • a reference to the 'out' configuration hash

  • the number of the digest containing the message to which you are replying

  • the number of the message to which you are replying within that digest

  • a path to the directory in which you want the plain-text reply file to be created

  • Example

    Suppose that you wished to reply to message #2 in London.pm digest #1814:

        Message: 2
        From:     James E Keenan <jkeen@some.web.address.com>
        To:       London Perlmongers <london.pm@london.pm.org>
        Date: Fri, 2 Jan 2004 23:41:01 -0500
        Subject: re: language courses
        Reply-To: london.pm@london.pm.org
    
        On Fri, 2 Jan 2004 22:38:40 +0000 (GMT), Ali Young wrote concerning:
            language courses
    
        > Depends what you count as useful. Learning Esperanto means that you 
        > can read the current London.pm website.
    
        BTW, wasn't the Esperanto on the website supposed to expire on 31 Dec?
    
        Jim Keenan
        Brooklyn, NY

    You would call the function as follows:

        $full_reply_file = reply_to_digest_message(
            \%config_in, 
            \%config_out, 
            1814,
            2,
            '/home/jimk/mail/digest/london',
        );

    Mail::Digest::Tools will then create a plain-text file which you can use as the first draft of your reply. It will print this screen prompt:

        To complete reply, edit text in:
          /home/jimk/mail/digest/london/language_courses.reply.txt

    When you open language_courses.reply.txt in your text editor, it will look like this:

        Reply-To:
        london.pm@london.pm.org
    
        Subject:
        language courses
    
        On Fri, 2 Jan 2004 23:41:01 -0500, James E Keenan 
        <jkeen@some.web.address.com> wrote:
    
        > On Fri, 2 Jan 2004 22:38:40 +0000 (GMT), Ali Young wrote concerning:
        >     language courses
        > 
        > > Depends what you count as useful. Learning Esperanto means that you 
        > can 
        > > read the current London.pm website.
        > 
        > BTW, wasn't the Esperanto on the website supposed to expire on 31 Dec?
        > 
        > Jim Keenan
        > Brooklyn, NY
        > 

    The 'Reply-To' and 'Subject' paragraphs are provided simply to give you something to cut-and-paste into a GUI e-mail client. The 'Reply-To' paragraph will only appear if in %config_in the key reply_to_style_flag is defined for a particular digest.

    You edit this plain-text file, pop it into the body of your e-mail window and send it. Not elegant, but it at least gives you a first draft.
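
The two transformations visible in the example above, a reply file named after the subject and a body quoted with '> ', can be sketched as follows. Both helper names are hypothetical and the naming rule is inferred from the single example shown; the real function also re-wraps long lines, which this sketch does not attempt.

```perl
use strict;
use warnings;

# Hypothetical helper: 're: language courses' becomes
# 'language_courses.reply.txt'.
sub reply_file_name {
    my ($subject) = @_;
    (my $name = $subject) =~ s/\A\s*(?:(?:re|fwd?)\s*:\s*)+//i;
    $name =~ s/\s+/_/g;
    return "$name.reply.txt";
}

# Hypothetical helper: prefix every line of the body with '> '.
sub quote_body {
    my ($body) = @_;
    return join '', map { "> $_\n" } split /\n/, $body;
}

print reply_file_name('re: language courses'), "\n";
print quote_body("BTW, wasn't the Esperanto supposed to expire?\n\nJim Keenan\n");
```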

repair_message_order

    repair_message_order(
        \%config_in, 
        \%config_out,
        {
            year   => 2004,
            month  => 01,
            day    => 27,
        }
    );

From time to time you may receive digest versions of mailing lists out of chronological/numerical sequence. This is especially true when e-mail traffic is being disrupted by worms or viruses. You may discover that you have received and processed

    london.pm digest, Vol 1 #1856 - 7 msgs
    london.pm digest, Vol 1 #1858 - 15 msgs

before realizing that you were missing

    london.pm digest, Vol 1 #1857 - 18 msgs

If you were to now process digest 1857 with process_new_digests(), messages from that digest would be appended to their respective thread files after messages from digest 1858. Since the whole point of Mail::Digest::Tools is to be able to read a discussion thread in chronological order, this would not be desirable.

Fortunately, you can fix this problem as follows:

  • Apply process_new_digests()

    Call process_new_digests() as you normally would. In the above example, go ahead and call it on digest 1857 even though it creates thread files with messages out of chronological order.

  • Determine date where need for repair begins

    Examine the timestamps on your digest files for the date of the first digest you received out of sequence. In the above example, that would be the date of digest 1858. Since digest files were received out of proper sequence on or after that date, all thread files generated after that date may have out-of-sequence messages and need re-ordering.

  • Apply repair_message_order() with the repair date

    Call repair_message_order() with the following arguments:

    • a reference to the 'in' configuration hash

    • a reference to the 'out' configuration hash

    • a reference to an anonymous hash whose keys are year, month and day, the values for which keys are the elements of the repair date.

    Mail::Digest::Tools will examine all thread files from midnight local time on that date. Where messages have been posted to the thread files out of proper sequence, they will be reposted in the correct order. The thread file with the correct sequence will overwrite the file with the incorrect sequence.
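
The reordering step can be pictured as a sort of a thread file's entries by message identifier. The sketch below assumes the bare-bones thread file layout shown earlier (entries separated by a '--__--__--' line, each with a 'Message:' line) and assumes that identifier order is the desired order; reorder_thread() and message_key() are hypothetical names, not functions exported by the module.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the three numbers from an entry's
# 'Message:' line, e.g. 'Message:      001_1857_004'.
sub message_key {
    my ($entry) = @_;
    my @parts = $entry =~ /^[ \t]*Message:\s+(\d+)_(\d+)_(\d+)\s*$/m
        or die "Entry has no message identifier\n";
    return \@parts;
}

# Hypothetical helper: return the thread text with entries re-sorted.
sub reorder_thread {
    my ($thread_text, $delim) = @_;

    my @entries;
    for my $entry (split /^[ \t]*\Q$delim\E[ \t]*\n/m, $thread_text) {
        $entry =~ s/\A\s+//;
        $entry =~ s/\s+\z//;
        push @entries, $entry if length $entry;
    }

    # Compare the three numbers in turn, numerically.
    my @sorted =
        map  { $_->[1] }
        sort {
               $a->[0][0] <=> $b->[0][0]
            || $a->[0][1] <=> $b->[0][1]
            || $a->[0][2] <=> $b->[0][2]
        }
        map  { [ message_key($_), $_ ] } @entries;

    return join '', map { "$_\n\n$delim\n\n" } @sorted;
}

# Digest 1857 was processed after digest 1858, so its message is last.
my $delim  = '--__--__--';
my $thread = join '', map { "$_\n\n$delim\n\n" } (
    "Thread:       language courses\nMessage:      001_1856_002\nText:\n\nFirst posting.",
    "Thread:       language courses\nMessage:      001_1858_001\nText:\n\nThird posting.",
    "Thread:       language courses\nMessage:      001_1857_004\nText:\n\nSecond posting.",
);
my $repaired = reorder_thread($thread, $delim);
print $repaired;
```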

consolidate_threads_multiple

    consolidate_threads_multiple(
        \%config_in,
        \%config_out,
    );

or

    consolidate_threads_multiple(
        \%config_in,
        \%config_out,
        $first_common_letters,  # optional integer argument
    );

As described above, Mail::Digest::Tools' process_new_digests() function will, to the greatest extent possible, delete extraneous words such as 'Re:' or 'Fwd:' from a message's subject so that all relevant postings on a given subject can be included in a single thread file. What happens when this is not sufficient? For example, suppose someone posts a message to a list with a slightly misspelled or altered subject line:

  • Original thread file:

       Help telnetting to remote host through CGI.thr.txt
  • Thread file created due to altered subject line:

       Help telnetting to remote host thru CGI.thr.txt

Mail::Digest::Tools offers two functions to address this problem. consolidate_threads_multiple() is the easier to use and will be discussed first. This function presumes that people who re-type e-mail subject lines when replying tend to type the first several words correctly, then make errors or alterations toward the end of the subject line. If the first n letters of the subject line of two or more messages are identical, there is a strong chance that the messages are discussing the same topic and should be posted to the same discussion thread. Mail::Digest::Tools' default value for n is 20, but you can set a different value for a particular digest by passing an optional third argument as shown above. consolidate_threads_multiple() accordingly:

  • Makes a list of all thread files for a particular digest.

  • Identifies groups of thread files whose names share the first 20 letters.

  • Displays a prompt on standard output asking you whether you wish to consolidate the files in each such group:

        Candidates for consolidation:
          Help telnetting to remote host through CGI.thr.txt
          Help telnetting to remote host thru CGI.thr.txt
    
        To consolidate, type YES:  
    • If you type YES in ALL CAPS, the files will be consolidated into a single thread file whose name will be derived from the Subject line of the very first posting to the discussion thread. Standard output will display:

            Files will be consolidated
    • If you type anything other than YES in ALL CAPS -- or simply hit [Enter] -- then the files will not be consolidated and standard output will display:

            Files will not be consolidated
    • If the files are consolidated, the original thread files will not automatically be deleted. Rather, they are renamed with the extension .DELETABLE.

          Help telnetting to remote host through CGI.thr.txt.DELETABLE
          Help telnetting to remote host thru CGI.thr.txt.DELETABLE

      This is a safety precaution. The user can then delete the deletable files by calling the delete_deletables() function discussed below.

  • If there are no files in the threads directory which share the first 20 letters in common (or the first n letters if you have passed the optional third argument), then you are warned at standard output:

        Analysis of the first 20 letters of each file in
          [threads directory] 
          shows no candidates for consolidation.  Please hard-code
          names of files you wish to consolidate as arguments to
          &consolidate_threads_single
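
The grouping rule described above, thread files whose names share their first n letters, can be sketched in a few lines. This is an illustration only: consolidation_candidates() is a hypothetical name, and the comparison here is case-sensitive, which may or may not match the module's behavior.

```perl
use strict;
use warnings;

# Hypothetical helper: group file names by their first $n characters
# and return only the groups with more than one member.
sub consolidation_candidates {
    my ($files, $n) = @_;
    $n = 20 unless defined $n;    # the documented default

    my %groups;
    push @{ $groups{ substr($_, 0, $n) } }, $_ for @$files;

    return [
        map  { [ sort @{ $groups{$_} } ] }
        grep { @{ $groups{$_} } > 1 }
        sort keys %groups
    ];
}

my @thread_files = (
    'Help telnetting to remote host through CGI.thr.txt',
    'Help telnetting to remote host thru CGI.thr.txt',
    'A Different Discussion.thr.txt',
);

for my $group (@{ consolidation_candidates(\@thread_files) }) {
    print "Candidates for consolidation:\n";
    print "  $_\n" for @$group;
}
```

Raising n makes the test stricter: with n set to 40 the two 'Help telnetting' files no longer share a prefix and would not be offered for consolidation.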

consolidate_threads_single

    consolidate_threads_single(
        \%config_in, 
        \%config_out, 
        [
            'first_dummy_file_for_consolidation.thr.txt',
            'second_dummy_file_for_consolidation.thr.txt',
        ],
    );

Suppose that the thread files which you wish to consolidate have names whose spelling diverges before the 21st letter. The algorithm which consolidate_threads_multiple() applies would not detect the potential rationale for consolidation. This could happen when someone tries to change the subject of discussion from:

    Best book for extreme Newbie to programming

to:

    De incunabula nostra (Was Best book for extreme Newbie to programming)

Solution: Hard-code the files to be consolidated as elements of an anonymous array. Pass a reference to that anonymous array as the third argument to consolidate_threads_single() as shown above.

As with consolidate_threads_multiple(), the resulting consolidated file will bear the name of the source file containing the very first posting to the discussion thread. The files so consolidated will not automatically be deleted. Rather, they will be renamed with the extension .DELETABLE as a safety precaution and left for you to delete with delete_deletables().

delete_deletables

    delete_deletables(\%config_out);

Mail::Digest::Tools function delete_deletables() tidies up after use of either consolidate_threads_multiple() or consolidate_threads_single(). Unlike all other public functions provided by Mail::Digest::Tools, delete_deletables() needs to be passed a reference to only one of the two configuration hashes, viz., the 'out' configuration hash. The function simply changes to the directory where thread files for a given digest are stored and deletes all files with the extension .DELETABLE.
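
What this clean-up amounts to can be shown with core modules only. The sketch below is a stand-in written for illustration, not the module's implementation; remove_deletables() is a hypothetical name and takes a directory path directly instead of the 'out' configuration hash. The demonstration runs in a temporary directory so that no real thread files are touched.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: delete every plain file in $threads_dir whose
# name ends in '.DELETABLE' and return the number of files removed.
sub remove_deletables {
    my ($threads_dir) = @_;

    opendir my $dh, $threads_dir
        or die "Cannot open $threads_dir: $!\n";
    my @targets = grep { /\.DELETABLE\z/ } readdir $dh;
    closedir $dh;

    my $removed = 0;
    for my $file (@targets) {
        my $path = File::Spec->catfile($threads_dir, $file);
        next unless -f $path;
        unlink $path or die "Cannot delete $path: $!\n";
        $removed++;
    }
    return $removed;
}

# Demonstration in a throwaway directory.
my $dir = tempdir(CLEANUP => 1);
for my $name (
    'Help telnetting to remote host through CGI.thr.txt.DELETABLE',
    'Help telnetting to remote host thru CGI.thr.txt.DELETABLE',
    'Help telnetting to remote host through CGI.thr.txt',
) {
    open my $fh, '>', File::Spec->catfile($dir, $name)
        or die "Cannot create $name: $!\n";
    close $fh;
}

my $removed = remove_deletables($dir);
print "$removed files deleted\n";
```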

CONFIGURATION SETUP OVERVIEW

To use a Mail::Digest::Tools function, you need to answer two fundamental questions:

  1. What internal structure has the mailing list sponsor provided for a given digest?

  2. How do I want to structure the results of applying Mail::Digest::Tools to a particular digest on my system?

Each of these two questions breaks down into sub-parts. Their answers supply you with the information with which you will construct the two configuration hashes passed to most Mail::Digest::Tools functions. Let us take each in turn.

%config_in: THE INTERNAL STRUCTURE OF A DIGEST

The best way to learn about the internal structure of a mailing list digest (other than to study the application which created the digest in the first place) is to accumulate several instances of the digest on your system in a directory devoted to that purpose. Examine the way the digest's filename is formed. Then examine the digest file itself. You will soon pick up a feel for the structure of the digest, which will guide you in configuring Mail::Digest::Tools for your system. That configuration will take the form of a Perl hash which, for illustrative purposes, we shall here call %xxx_config_in where xxx is a short-hand title for a particular digest.

For heuristic purposes we will examine the characteristics of two mailing list digests which the author has been following and archiving for several years: ActiveState's 'Perl-Win32-Users' digest and Yahoo! Groups' Perl Beginners group digest.

Analysis of Digest's File Name

We must study a digest's file name in order to be able to write a pattern with which we will be able to distinguish a digest file from any non-digest file sitting in the same directory, as well as to be able to extract the digest number from that file name.

Once saved as plain-text files, Perl-Win32-Users digest files typically look like this in a directory:

    Perl-Win32-Users Digest, Vol 1 Issue 1771.txt
    Perl-Win32-Users Digest, Vol 1 Issue 1772.txt

Similarly, the Perl Beginner digest files look like this:

    [PBML] Digest Number 1491.txt
    [PBML] Digest Number 1492.txt

To distinguish Perl-Win32-Users digest files from any other files in the same directory, we compose a string which would form the core of a Perl regular expression, i.e., everything in a pattern except the outer delimiters. Internally, Mail::Digest::Tools passes the file name through a grep { /regexp/ } pattern, so the first key is called grep_formula.

    %pw32u_config_in = (
        grep_formula            => 'Perl-Win32-Users Digest',
        ...
    );

The equivalent pattern for the Perl Beginners digest would be:

    %pbml_config_in = (
        grep_formula            => '\[PBML\]',
        ...
    );

Note that the [ and ] characters have to be escaped with a \ backslash because they are normally metacharacters inside Perl regular expressions.
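The effect of the escaping can be checked in isolation. The sketch below filters a small list of names the way a grep { /regexp/ } would; the two non-digest file names are invented for illustration.

```perl
use strict;
use warnings;

# Sample directory listing; the last two names are hypothetical.
my @files = (
    '[PBML] Digest Number 1491.txt',
    '[PBML] Digest Number 1492.txt',
    'digests_log.txt',
    'notes on PBML.txt',
);

my $grep_formula = '\[PBML\]';    # value from %pbml_config_in

# Interpolating the string into a match treats it as a regular expression.
my @digests = grep { /$grep_formula/ } @files;

# Without the backslashes, [PBML] is a character class matching any one of
# the letters P, B, M or L, so an unrelated file slips through.
my $unescaped = '[PBML]';
my @too_many  = grep { /$unescaped/ } @files;

printf "escaped: %d files, unescaped: %d files\n",
    scalar @digests, scalar @too_many;
```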

We next have to extract the digest number from the digest's file name. Certain mailing list programs give individual digests both a 'Volume' number as well as an individual digest number. Perl-Win32-Users typifies this. In the example above we need to capture both the 1 as volume number and 1771 as digest number. The next key in our configuration hash is called pattern_target:

    %pw32u_config_in = (
        grep_formula            => 'Perl-Win32-Users Digest',
        pattern_target          => '.*Vol\s(\d+),\sIssue\s(\d+)\.txt',
        ...
    );

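Note the two sets of capturing parentheses. Note also that this pattern expects a comma after the volume number, which the sample file names above do not show; adjust the pattern if your saved file names omit that comma.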

Other digests, such as those at Yahoo! Groups, dispense with a volume number and simply increment each digest number:

    %pbml_config_in = (
        grep_formula            => '\[PBML\]',
        pattern_target          => '.*\s(\d+)\.txt$',
        ...
    );

Note that this pattern_target contains only one pair of capturing parentheses.
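A quick way to test a pattern_target before committing it to a configuration hash is to run it against one saved file name. This is a minimal sketch; the Perl-Win32-Users file name used here is an assumption and carries a comma after the volume number, as that pattern requires.

```perl
use strict;
use warnings;

my $pbml_target  = '.*\s(\d+)\.txt$';
my $pw32u_target = '.*Vol\s(\d+),\sIssue\s(\d+)\.txt';

# One pair of capturing parentheses: the digest number.
my ($digest) = '[PBML] Digest Number 1491.txt' =~ /$pbml_target/;

# Two pairs of capturing parentheses: volume number and digest number.
my ($volume, $issue) =
    'Perl-Win32-Users Digest, Vol 1, Issue 1771.txt' =~ /$pw32u_target/;

print "PBML digest $digest\n";
print "pw32u volume $volume, digest $issue\n";
```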

Analysis of Digest's Internal Structure

A digest's internal structure is discussed in detail above (see 'A TYPICAL MAILING LIST DIGEST'). Here we need to identify two characteristics: the way the digest introduces its list of today's topics and the string it uses to delimit the list of today's topics from the first individual message in the digest and all subsequent messages from one another. Continuing with our two examples from above, we provide values for keys topics_intro and source_msg_delimiter:

    %pw32u_config_in = (
        grep_formula            => 'Perl-Win32-Users Digest',
        pattern_target          => '.*Vol\s(\d+),\sIssue\s(\d+)\.txt',
        topics_intro            => 'Today\'s Topics:',
        source_msg_delimiter    => "--__--__--\n\n",
        ...
    );

Note the escaped ' apostrophe character in the value for key topics_intro.

    %pbml_config_in = (
        grep_formula            => '\[PBML\]',
        pattern_target          => '.*\s(\d+)\.txt$',
        topics_intro            => 'Topics in this digest:',
        source_msg_delimiter    => "________________________________________________________________________\n________________________________________________________________________\n\n",
        ...
    );

Note that the values provided for the respective source_msg_delimiter keys had to be double-quoted strings. That's because all such delimiters include two or more \n newline characters so that they form paragraphs unto themselves. Unless indicated otherwise, the values for all other keys in the configuration hash are single-quoted strings.
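The difference between the two quoting styles is easy to demonstrate. The sketch below splits a heavily abbreviated, invented digest on the delimiter; it uses split purely for illustration and makes no claim about how Mail::Digest::Tools itself separates messages.

```perl
use strict;
use warnings;

my $source_msg_delimiter = "--__--__--\n\n";    # double-quoted: \n is a real newline

# Hypothetical, heavily abbreviated digest text.
my $digest = join '',
    "Today's Topics:\n\n   1. RE: OO Perl Issue.\n\n",
    $source_msg_delimiter,
    "Message: 1\nSubject: RE: OO Perl Issue.\n\nfirst body\n\n",
    $source_msg_delimiter,
    "Message: 2\nSubject: Re: New IE Update causes script problems\n\nsecond body\n\n";

# \Q...\E makes split treat the delimiter as literal text, not as a pattern.
my ($topics, @messages) = split /\Q$source_msg_delimiter\E/, $digest;
printf "%d messages after the topics list\n", scalar @messages;

# The same characters in single quotes hold a literal backslash and 'n',
# so they never match the real newlines in the digest.
my $mistyped = '--__--__--\n\n';
my @unsplit  = split /\Q$mistyped\E/, $digest;
printf "%d piece(s) with the single-quoted delimiter\n", scalar @unsplit;
```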

Note: In early 2004, while Mail::Digest::Tools was being prepared for its initial distribution on CPAN, ActiveState changed certain features in the daily digest versions of its mailing lists. Hence, the code example presented above should not be 'copied-and-pasted' into a configuration hash with which you, the user, might follow the current Perl-Win32-Users digest. In particular, the source message delimiter was changed to a string of 30 hyphens followed by 2 \n newline characters:

    "------------------------------\n\n"

However, since it is not unheard of for contributors to a mailing list to use such a string of hyphens within their postings or signatures, using a string of hyphens is not a particularly apt choice for a source message delimiter. In this particular case, the author is getting better (but not fully tested) results by including an additional newline before the hyphen string in order to more uniquely identify the source message delimiter:

    "\n------------------------------\n\n"

Analysis of Individual Messages

The internal structure of an individual message within a digest is also discussed in detail above. Here we need to identify patterns with which we can extract the content of the message's headers.

Certain mailing list digest programs allow a wide variety of headers to appear in digested messages. The Perl-Win32-Users digest typifies this. Each message in a Perl-Win32-Users digest must have a message number and headers for the message's author, recipients, subject and date.

    Message: 1
    From: Chris Smithson <ChrisSmithson@some.web.address.com>
    To: "'Carter Kraus'" <carter@some.web.address.com>,
           "Perl-Win32-Users (E-mail)" <perl-win32-users@activestate.com>
    Subject: RE: OO Perl Issue.
    Date: Wed, 4 Feb 2004 14:17:24 -0600 

But a message in this digest may have additional headers for the author's organization, reply address and/or carbon-copy recipients.

    Message: 5
    Date: Wed, 4 Feb 2004 15:15:44 -0800
    From: Sam Spade <sspade@some.web.address.com>
    Organization: Some Web Address
    Reply-To: Sam Spade <sspade@some.web.address.com>
    To: "Time" <summers@some.web.address.com>
    CC: "Perl List" <perl-win32-users@listserv.activestate.com>
    Subject: Re: New IE Update causes script problems

Patterns are easily developed to capture this information and store it in the configuration hash:

    %pw32u_config_in = (
        grep_formula            => 'Perl-Win32-Users Digest',
        pattern_target          => '.*Vol\s(\d+),\sIssue\s(\d+)\.txt',
        topics_intro            => 'Today\'s Topics:',
        source_msg_delimiter    => "--__--__--\n\n",
        message_style_flag      => '^Message:\s+(\d+)$',
        from_style_flag         => '^From:\s+(.+)$',
        org_style_flag          => '^Organization:\s+(.+)$',
        to_style_flag           => '^To:\s+(.+)$',
        cc_style_flag           => '^CC:\s+(.+)$',
        subject_style_flag      => '^Subject:\s+(.+)$',
        date_style_flag         => '^Date:\s+(.+)$',
        reply_to_style_flag     => '^Reply-To:\s+(.+)$',
        ...
    );

Other mailing list digest programs allow far fewer headers in digested messages. The Yahoo! Groups digests such as Perl Beginner typify this.

    Message: 4
       Date: Sun, 7 Dec 2003 19:24:03 +1100
       From: Philip Streets <phil@some.web.address.com.au>
    Subject: RH9.0, perl 5.8.2 and qmail-localfilter question

The patterns developed to capture this information and store it in the configuration hash would be as follows:

    %pbml_config_in = (
        grep_formula            => '\[PBML\]',
        pattern_target          => '.*\s(\d+)\.txt$',
        topics_intro            => 'Topics in this digest:',
        source_msg_delimiter    => "________________________________________________________________________\n________________________________________________________________________\n\n",
        message_style_flag      => '^Message:\s+(\d+)$',
        from_style_flag         => '^\s+From:\s+(.+)$',
        subject_style_flag      => '^Subject:\s+(.+)$',
        date_style_flag         => '^\s+Date:\s+(.+)$',
        ...
    );

Note that these patterns are written to expect one or more whitespace characters before From: and Date:, as reflected in the values for from_style_flag and date_style_flag.
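The Perl Beginner patterns can be tried against the sample header shown earlier. This is a minimal sketch of one way to apply them line by line; the loop and the %header hash are illustrative assumptions, not the module's internals.

```perl
use strict;
use warnings;

my %flags = (
    message_style_flag => '^Message:\s+(\d+)$',
    from_style_flag    => '^\s+From:\s+(.+)$',
    subject_style_flag => '^Subject:\s+(.+)$',
    date_style_flag    => '^\s+Date:\s+(.+)$',
);

my @header_lines = (
    'Message: 4',
    '   Date: Sun, 7 Dec 2003 19:24:03 +1100',
    '   From: Philip Streets <phil@some.web.address.com.au>',
    'Subject: RH9.0, perl 5.8.2 and qmail-localfilter question',
);

# Try each pattern against each line and keep what the parentheses capture.
my %header;
for my $line (@header_lines) {
    for my $key (sort keys %flags) {
        my ($value) = $line =~ /$flags{$key}/ or next;
        (my $name = $key) =~ s/_style_flag\z//;    # e.g. from_style_flag -> from
        $header{$name} = $value;
    }
}

print "$_: $header{$_}\n" for sort keys %header;
```

Removing the leading \s+ from from_style_flag or date_style_flag would leave those two headers uncaptured for this digest, because the lines begin with spaces.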

We could -- but do not need to -- add the following key-value pairs to the %pbml_config_in hash.

        org_style_flag          => undef,
        to_style_flag           => undef,
        cc_style_flag           => undef,
        reply_to_style_flag     => undef,

Inspection of Messages for Multipart MIME Content

Certain mailing lists allow subscribers to post messages in either plain-text or HTML. Certain lists allow subscribers to post attachments; others do not. When it comes to preparing digests of these messages, the approaches which different lists take lead to different results. The most annoying situation occurs when a list allows a subscriber to post in 'multipart MIME format' and then fails to strip out the redundant HTML part after printing the needed plain-text part.

Example: An all too typical example from an older version of an ActiveState list digest. (ActiveState changed the format of its digests in early 2004 to strip out HTML attachments. Hence, the following code no longer accurately represents what a subscriber to an ActiveState digest will see. Other mailing lists still suffer from MIME bloat, however, so treat the following code as illustrative.) The message begins:

    Message: 1
    To: Perl-Win32-Users@activestate.com
    Subject: Can not tie STDOUT to scolled Tk widget
    From: John_Wonderman@some.web.address.ca
    Date: Thu, 15 Jan 2004 16:25:17 -0500
    This is a multipart message in MIME format.
    --=_alternative 00750F0485256E1C_=
    Content-Type: text/plain; charset="US-ASCII"
    Hi;
    I am trying to implement a scrolling text widget to capture output for for 
    at tk app. Without scrolling:
    my $text = $mw->Text(-width => 78,
           -height => 32,
           -wrap => 'word',
           -font => ['Courier New','11']
    )->pack(-side => 'bottom',
           -expand => 1,
           -fill => 'both',
    );
    ...

When the plain-text part of the message is finished, it is then repeated in HTML:

    --=_alternative 00750F0485256E1C_=
    Content-Type: text/html; charset="US-ASCII"
    <br><font size=2 face="Tahoma">Hi;</font>
    <p><font size=2 face="Tahoma">I am trying to implement a scrolling text
    widget to capture output for for at tk app. Without scrolling:</font>
    <p><font size=2 face="Bitstream Vera Sans Mono">my $text = $mw-&gt;Text(-width
    =&gt; 78,</font>
    <br><font size=2 face="Bitstream Vera Sans Mono">&nbsp; &nbsp; &nbsp; &nbsp;
    -height =&gt; 32,</font>
    <br><font size=2 face="Bitstream Vera Sans Mono">&nbsp; &nbsp; &nbsp; &nbsp;
    -wrap =&gt; 'word',</font>
    <br><font size=2 face="Bitstream Vera Sans Mono">&nbsp; &nbsp; &nbsp; &nbsp;
    -font =&gt; ['Courier New','11']</font>
    <br><font size=2 face="Bitstream Vera Sans Mono">)-&gt;pack(-side =&gt;
    'bottom',</font>
    <br><font size=2 face="Bitstream Vera Sans Mono">&nbsp; &nbsp; &nbsp; &nbsp;
    -expand =&gt; 1,</font>
    <br><font size=2 face="Bitstream Vera Sans Mono">&nbsp; &nbsp; &nbsp; &nbsp;
    -fill =&gt; 'both',</font>

There is no reason to retain this bloat in your thread file. The digest providers should have stripped it out, but the program they were using failed to do so. Other digests, such as those at Yahoo! Groups, eliminate all this blather.

Now, with Mail::Digest::Tools, you can eliminate much of the bloat yourself. After examining 6-10 instances of a particular mailing list digest, you should be able to determine whether the digest needs a dose of digital castor oil or not, and you set key MIME_cleanup_flag accordingly. If the digest contains unnecessary multipart MIME content, you set this flag to 1; otherwise, to 0.

And with that you have completed your analysis of the internal structure of a given digest and entered the relevant information into the first configuration hash:

    %pw32u_config_in = (
        grep_formula            => 'Perl-Win32-Users Digest',
        pattern_target          => '.*Vol\s(\d+),\sIssue\s(\d+)\.txt',
        topics_intro            => 'Today\'s Topics:',
        source_msg_delimiter    => "--__--__--\n\n",
        message_style_flag      => '^Message:\s+(\d+)$',
        from_style_flag         => '^From:\s+(.+)$',
        org_style_flag          => '^Organization:\s+(.+)$',
        to_style_flag           => '^To:\s+(.+)$',
        cc_style_flag           => '^CC:\s+(.+)$',
        subject_style_flag      => '^Subject:\s+(.+)$',
        date_style_flag         => '^Date:\s+(.+)$',
        reply_to_style_flag     => '^Reply-To:\s+(.+)$',
        MIME_cleanup_flag       => 1,
    );

    %pbml_config_in = (
        grep_formula            => '\[PBML\]',
        pattern_target          => '.*\s(\d+)\.txt$',
        topics_intro            => 'Topics in this digest:',
        source_msg_delimiter    => "________________________________________________________________________\n________________________________________________________________________\n\n",
        message_style_flag      => '^Message:\s+(\d+)$',
        from_style_flag         => '^\s+From:\s+(.+)$',
        subject_style_flag      => '^Subject:\s+(.+)$',
        date_style_flag         => '^\s+Date:\s+(.+)$',
        MIME_cleanup_flag       => 0,
    );
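Because most values in %config_in are regular expressions stored as strings, a typing slip only surfaces when the pattern is first used. The following sketch shows one way to check a finished hash up front; the helper uncompilable_patterns is hypothetical and not part of Mail::Digest::Tools, and the hash is abbreviated.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of pattern-bearing keys whose
# values do not compile as regular expressions.
sub uncompilable_patterns {
    my ($config) = @_;
    my @bad;
    for my $key (sort keys %{$config}) {
        next unless $key eq 'grep_formula'
                 || $key eq 'pattern_target'
                 || $key =~ /_style_flag\z/;
        my $pattern = $config->{$key};
        next unless defined $pattern;    # an undef value means the header is unused
        my $ok = eval { my $re = qr/$pattern/; 1 };
        push @bad, $key unless $ok;
    }
    return @bad;
}

my %pbml_config_in = (
    grep_formula       => '\[PBML\]',
    pattern_target     => '.*\s(\d+)\.txt$',
    topics_intro       => 'Topics in this digest:',
    message_style_flag => '^Message:\s+(\d+)$',
    from_style_flag    => '^\s+From:\s+(.+)$',
    subject_style_flag => '^Subject:\s+(.+)$',
    date_style_flag    => '^\s+Date:\s+(.+)$',
    MIME_cleanup_flag  => 0,
);

my @bad = uncompilable_patterns(\%pbml_config_in);
print @bad ? "Check these keys: @bad\n" : "All patterns compile\n";
```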

%config_out: HOW TO PROCESS A DIGEST ON YOUR SYSTEM

%config_in holds the answers to the question: What internal structure has the mailing list sponsor provided for a given digest? In contrast, %config_out will hold the answer to this question: How do I want to structure the results of applying Mail::Digest::Tools to a particular digest on my system?

For purposes of illustration, we will continue to assume that we are processing digest files received from the Perl-Win32-Users and Perl Beginner lists. We will make slightly different choices as to how we process those digest files so as to illustrate different options available from Mail::Digest::Tools.

We shall also assume that we are going to place the scripts from which we call Mail::Digest::Tools functions in the directory above the directories in which we store the digest files once they have been saved as plain-text files. If we call this directory digest and place the scripts in that directory, then we will have a directory structure that starts out like this:

    digest/
        process_new.pl
        process_ALL.pl
        reply_digest_message.pl
        repair_digest_order.pl
        consolidate_threads.pl
        deletables.pl
        pw32u/
            Perl-Win32-Users Digest, Vol 1 Issue 1771.txt
            Perl-Win32-Users Digest, Vol 1 Issue 1772.txt
        pbml/
            [PBML] Digest Number 1491.txt
            [PBML] Digest Number 1492.txt

Required %config_out Keys

There are 9 keys which are required in %config_out in order for Mail::Digest::Tools to function properly. They correspond to 9 decisions which you must make in setting up a Mail::Digest::Tools configuration on your system.

1 Title

Each digest must be given a title which is used whenever Mail::Digest::Tools needs to prompt or warn you on standard output. The key which holds this information in %config_out must be called title; the value for this element should be sensible.

    %pw32u_config_out = (
        title                      => 'Perl-Win32-Users',
        ...
    );

    %pbml_config_out = (
        title                      => 'Perl Beginner',
        ...
    );

2 Digest Directory

For each digest a directory must be designated where individual digest files are stored in plain-text format. The key which holds this information in %config_out must be called dir_digest. In the examples below directories are named by paths relative to the directory where the script invoking a Mail::Digest::Tools function is stored (note that .. denotes the parent of that directory).

    %pw32u_config_out = (
        title                      => 'Perl-Win32-Users',
        dir_digest                 => "../pw32u",
        ...
    );

    %pbml_config_out = (
        title                      => 'Perl Beginner',
        dir_digest                 => "../pbml",
        ...
    );

3 Threads Directory

For each digest a directory must be designated where the thread files created by use of Mail::Digest::Tools functions are stored. The key which holds this information in %config_out must be called dir_threads. In the examples below the threads directory is a subdirectory of the digest directory, but you may make other choices.

    %pw32u_config_out = (
        title                      => 'Perl-Win32-Users',
        dir_digest                 => "../pw32u",
        dir_threads                => "../pw32u/Threads",
        ...
    );

    %pbml_config_out = (
        title                      => 'Perl Beginner',
        dir_digest                 => "../pbml",
        dir_threads                => "../pbml/Threads",
        ...
    );

4 Digests Log File

For each digest a file must be kept which logs whether a given digest file has already been processed or not and, if so, when. The key which holds this information in %config_out must be called digests_log. It has been found convenient to keep this file in the digests directory, but you may make other choices.

    %pw32u_config_out = (
        title                      => 'Perl-Win32-Users',
        dir_digest                 => "../pw32u",
        dir_threads                => "../pw32u/Threads",
        digests_log                => "../pw32u/digests_log.txt",
        ...
    );

    %pbml_config_out = (
        title                      => 'Perl Beginner',
        dir_digest                 => "../pbml",
        dir_threads                => "../pbml/Threads",
        digests_log                => "../pbml/digests_log.txt",
        ...
    );

5 Today's Topics

For each digest a file must be kept which holds an ongoing record of the list of topics found in each individual digest file. The key which holds this information in %config_out must be called todays_topics. It has been found convenient to keep this file in the digests directory, but you may make other choices.

    %pw32u_config_out = (
        title                      => 'Perl-Win32-Users',
        dir_digest                 => "../pw32u",
        dir_threads                => "../pw32u/Threads",
        digests_log                => "../pw32u/digests_log.txt",
        todays_topics              => "../pw32u/todays_topics.txt",
        ...
    );

    %pbml_config_out = (
        title                      => 'Perl Beginner',
        dir_digest                 => "../pbml",
        dir_threads                => "../pbml/Threads",
        digests_log                => "../pbml/digests_log.txt",
        todays_topics              => "../pbml/todays_topics.txt",
        ...
    );

6 Format for Identifying Digest Number in Output

For each digest you must choose how to format the number(s) of the individual digest file being processed when messages from that file are written to a threads file. What you are doing here is formatting the information captured by the pattern_target key in a given digest's %config_in (see above). You express this choice as a single-quoted string which formats the data captured by the Perl regular expression in pattern_target. The formatting is done via the Perl sprintf function. This string is assigned as the value of %config_out key id_format.

We saw above that digests from the Perl-Win32-Users list carried both a volume number and an individual digest number.

    Perl-Win32-Users Digest, Vol 1 Issue 1771.txt
    Perl-Win32-Users Digest, Vol 1 Issue 1772.txt

Both numbers were captured by the Perl regular expression in %pw32u_config_in key pattern_target.

    '.*Vol\s(\d+),\sIssue\s(\d+)\.txt',

Here we have chosen to format the volume number as a 3-digit, 0-padded number and the individual digest number as a 4-digit, 0-padded number. We then join these two data with an underscore.

    %pw32u_config_out = (
        title                      => 'Perl-Win32-Users',
        dir_digest                 => "../pw32u",
        dir_threads                => "../pw32u/Threads",
        digests_log                => "../pw32u/digests_log.txt",
        todays_topics              => "../pw32u/todays_topics.txt",
        id_format                  => 'sprintf("%03d",$1) . \'_\' . sprintf("%04d",$2)',
        ...
    );

We saw above that digests from the Perl Beginners list carried only a digest number -- no volume number.

    [PBML] Digest Number 1491.txt
    [PBML] Digest Number 1492.txt

This number was captured by the Perl regular expression in %pbml_config_in key pattern_target.

    '.*\s(\d+)\.txt$'

Here we have chosen to format the digest number as a 5-digit, 0-padded number.

    %pbml_config_out = (
        title                      => 'Perl Beginner',
        dir_digest                 => "../pbml",
        dir_threads                => "../pbml/Threads",
        digests_log                => "../pbml/digests_log.txt",
        todays_topics              => "../pbml/todays_topics.txt",
        id_format                  => 'sprintf("%05d",$1)',
        ...
    );

Note that if you allow for a 4-digit number, the highest numbered digest you can process off a given mailing list will be 9999. If you allow for a 5-digit number, the upper limit will be 99999. The latter should be sufficient for a lifetime even for a mailing list (e.g., London.pm) which generates 3 or 4 digest files per day or over 1000 per year.
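The two id_format strings can be tried out by hand. The sketch below assumes that the string is evaluated as Perl code while $1 and $2 still hold the captures from pattern_target; that is an inference from the form of the string, not a statement about the module's internals. The Perl-Win32-Users file name is likewise an assumption and includes the comma which that pattern expects.

```perl
use strict;
use warnings;

# Values taken from the configuration hashes discussed above.
my $pbml_target  = '.*\s(\d+)\.txt$';
my $pbml_format  = 'sprintf("%05d",$1)';
my $pw32u_target = '.*Vol\s(\d+),\sIssue\s(\d+)\.txt';
my $pw32u_format = 'sprintf("%03d",$1) . \'_\' . sprintf("%04d",$2)';

my ($pbml_id, $pw32u_id);

if ('[PBML] Digest Number 1491.txt' =~ /$pbml_target/) {
    # Assumption: the format string is evaluated right after the match.
    $pbml_id = eval $pbml_format;
    die "id_format failed: $@" if $@;
}

if ('Perl-Win32-Users Digest, Vol 1, Issue 1771.txt' =~ /$pw32u_target/) {
    $pw32u_id = eval $pw32u_format;
    die "id_format failed: $@" if $@;
}

print "$pbml_id\n";     # zero-padded digest number
print "$pw32u_id\n";    # zero-padded volume and digest number joined by '_'
```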

7 Format for Numbering Individual Messages in Output

For each digest you must choose how to format the number which the digest assigns to its individual messages. Experience suggests that 2 digits should be more than sufficient to format this number, as all digests which the author has observed have fewer than 100 entries. However, below we have arbitrarily decided to allow for up to 9999 entries in a given digest. As with the digest number, the formatting is accomplished via the Perl sprintf function. The result is stored in a %config_out key which must be called output_id_format.

    %pw32u_config_out = (
        title                      => 'Perl-Win32-Users',
        dir_digest                 => "../pw32u",
        dir_threads                => "../pw32u/Threads",
        digests_log                => "../pw32u/digests_log.txt",
        todays_topics              => "../pw32u/todays_topics.txt",
        id_format                  => 'sprintf("%03d",$1) . 
                                           \'_\' . sprintf("%04d",$2)',
        output_id_format           => 'sprintf("%04d",$1)',
        ...
    );

    %pbml_config_out = (
        title                      => 'Perl Beginner',
        dir_digest                 => "../pbml",
        dir_threads                => "../pbml/Threads",
        digests_log                => "../pbml/digests_log.txt",
        todays_topics              => "../pbml/todays_topics.txt",
        id_format                  => 'sprintf("%05d",$1)',
        output_id_format           => 'sprintf("%04d",$1)',
        ...
    );

8 Thread Message Delimiter

For each digest you must compose a string which will separate one message in a threads file from its successor. This string must be double-quoted and assigned to %config_out key thread_msg_delimiter. For readability, this string should terminate in two or more \n newline characters so that the delimiter is always a paragraph unto itself.

This delimiter may -- or may not -- be the same string which the mailing list provider uses to separate messages in the digest files themselves. In other words, you may choose to use the same string for thread_msg_delimiter in %config_out as you reported the list provider used in %config_in key source_msg_delimiter.

In the example below we make the thread_msg_delimiter for the output from Perl-Win32-Users to be the same as its source_msg_delimiter.

    %pw32u_config_out = (
        title                      => 'Perl-Win32-Users',
        dir_digest                 => "../pw32u",
        dir_threads                => "../pw32u/Threads",
        digests_log                => "../pw32u/digests_log.txt",
        todays_topics              => "../pw32u/todays_topics.txt",
        id_format                  => 'sprintf("%03d",$1) . 
                                           \'_\' . sprintf("%04d",$2)',
        output_id_format           => 'sprintf("%04d",$1)',
        thread_msg_delimiter       => "--__--__--\n\n",
        ...
    );

Note: In light of the earlier discussion of the changes ActiveState made to its mailing list digests in early 2004, the reader is cautioned that the code above should not be directly 'copied-and-pasted' into a configuration hash with which you might follow an ActiveState mailing list. Treat it as educational. In particular, the author is now testing the following as a setting for $pw32u_config_out{'thread_msg_delimiter'}:

    "\n--__--__--\n\n",

For threads generated by applying Mail::Digest::Tools to the Perl Beginners list, we choose an output message delimiter which differs from the source message delimiter.

    %pbml_config_out = (
        title                      => 'Perl Beginner',
        dir_digest                 => "../pbml",
        dir_threads                => "../pbml/Threads",
        digests_log                => "../pbml/digests_log.txt",
        todays_topics              => "../pbml/todays_topics.txt",
        id_format                  => 'sprintf("%05d",$1)',
        output_id_format           => 'sprintf("%04d",$1)',
        thread_msg_delimiter       => "_*_*_*_*_*_\n_*_*_*_*_*_\n\n\n",
        ...
    );

Whatever choice you make for the thread_msg_delimiter, it should be a string unlikely to occur within the text of a message and should terminate in two or more newlines.
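The second of these requirements is simple to verify before a delimiter goes into %config_out. The helper below is a hypothetical convenience, not something Mail::Digest::Tools provides.

```perl
use strict;
use warnings;

# Hypothetical check: does a delimiter end in two or more newlines?
sub ends_in_blank_line {
    my ($delimiter) = @_;
    return $delimiter =~ /\n{2,}\z/ ? 1 : 0;
}

my @candidates = (
    [ 'pw32u delimiter'       => "--__--__--\n\n" ],
    [ 'pbml delimiter'        => "_*_*_*_*_*_\n_*_*_*_*_*_\n\n\n" ],
    [ 'only one newline'      => "--__--__--\n" ],
    [ 'single-quoted mistake' => '--__--__--\n\n' ],
);

for my $candidate (@candidates) {
    my ($label, $delimiter) = @{$candidate};
    printf "%-22s %s\n", $label,
        ends_in_blank_line($delimiter) ? 'ok' : 'not ok';
}
```

The last candidate fails because a single-quoted \n is a backslash followed by the letter n rather than a newline, which is why the delimiter must be double-quoted.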

9 Archive or Delete Threads?

For each digest you process with Mail::Digest::Tools, you must decide whether to archive the resulting thread files in a separate directory after a specified period of time, to delete them from disk after a specified period of time, or to do neither and allow them to accumulate indefinitely in the threads directory. Your decision is represented as the value of %config_out key archive_kill_trigger. This value must be one of three numerical values:

     0    Thread files are neither archived nor deleted

     1    Thread files are archived in a separate directory (or directories) 
          after the number of days specified by key 'archive_kill_days' 
          (see below)

    -1    Thread files are deleted after n days as specified by key 
          'archive_kill_days' 

In the examples below we have chosen to archive all threads generated by the Perl-Win32-Users list but to kill all threads generated by the Perl Beginner list after a number of days whose specification we shall come to shortly.

    %pw32u_config_out = (
        title                      => 'Perl-Win32-Users',
        dir_digest                 => "../pw32u",
        dir_threads                => "../pw32u/Threads",
        digests_log                => "../pw32u/digests_log.txt",
        todays_topics              => "../pw32u/todays_topics.txt",
        id_format                  => 'sprintf("%03d",$1) . \'_\' . 
                                           sprintf("%04d",$2)',
        output_id_format           => 'sprintf("%04d",$1)',
        thread_msg_delimiter       => "--__--__--\n\n",
        archive_kill_trigger       => 1,
        ...
    );

    %pbml_config_out = (
        title                      => 'Perl Beginner',
        dir_digest                 => "../pbml",
        dir_threads                => "../pbml/Threads",
        digests_log                => "../pbml/digests_log.txt",
        todays_topics              => "../pbml/todays_topics.txt",
        id_format                  => 'sprintf("%05d",$1)',
        output_id_format           => 'sprintf("%04d",$1)',
        thread_msg_delimiter       => "_*_*_*_*_*_\n_*_*_*_*_*_\n\n\n",
        archive_kill_trigger       => -1,
        ...
    );
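The three trigger values amount to a small decision table. The sketch below spells it out with a hypothetical helper, thread_file_action, which is not part of Mail::Digest::Tools; in particular, treating "after n days" as "age of at least n days" is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: decide what should happen to one thread file,
# given the trigger value, the file's age in days and the day limit.
sub thread_file_action {
    my ($trigger, $age_in_days, $kill_days) = @_;
    my %action_for = (0 => 'keep', 1 => 'archive', -1 => 'delete');
    die "archive_kill_trigger must be 0, 1 or -1\n"
        unless exists $action_for{$trigger};
    # Assumption: a file younger than the limit is always left in place.
    return 'keep' if $age_in_days < $kill_days;
    return $action_for{$trigger};
}

for my $case ([1, 30, 14], [-1, 30, 14], [0, 30, 14], [-1, 3, 14]) {
    my ($trigger, $age, $limit) = @{$case};
    printf "trigger %2d, age %2d days, limit %d days: %s\n",
        $trigger, $age, $limit, thread_file_action($trigger, $age, $limit);
}
```

In a real script the age could come from Perl's -M file test, which reports the time since a file was last modified in days.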

This completes the 9 required keys for %config_out. We now turn to keys which are either optional or which are required if you have assigned a value of 1 or -1 to key archive_kill_trigger.

Optional %config_out Keys

  • Digests Read File

    As an option, Mail::Digest::Tools offers a file to log which instances of a particular digest have previously been processed, in a form more human-readable than the file named in %config_out key digests_log. The digests_log file logs a digest as follows:

        001_9999;Fri Feb  6 18:57:41 2004;Fri Feb  6 18:57:41 2004

    It is probably easier to read this data like this:

        09999:
            first processed at            Fri Feb  6 18:57:41 2004
            most recently processed at    Fri Feb  6 18:57:41 2004

    To choose this option you need to set two keys in %config_out:

    1 digests_read_flag

    This must be assigned a true value such as 1. This tells Mail::Digest::Tools that you indeed want a 'digests read' file.

    2 digests_read

    This should be assigned the name of the 'digests read' file, but it will default to a file digests_read.txt placed in the directory named by key dir_digest.

    Adding these keys to our %config_out, we get:

        %pw32u_config_out = (
            title                      => 'Perl-Win32-Users',
            dir_digest                 => "../pw32u",
            dir_threads                => "../pw32u/Threads",
            digests_log                => "../pw32u/digests_log.txt",
            todays_topics              => "../pw32u/todays_topics.txt",
            id_format                  => 'sprintf("%03d",$1) . \'_\' . 
                                               sprintf("%04d",$2)',
            output_id_format           => 'sprintf("%04d",$1)',
            thread_msg_delimiter       => "--__--__--\n\n",
            archive_kill_trigger       => 1,
            digests_read_flag          => 1,
            digests_read               => "../pw32u/digests_read.txt",
            ...
        );
    
        %pbml_config_out = (
            title                      => 'Perl Beginner',
            dir_digest                 => "../pbml",
            dir_threads                => "../pbml/Threads",
            digests_log                => "../pbml/digests_log.txt",
            todays_topics              => "../pbml/todays_topics.txt",
            id_format                  => 'sprintf("%05d",$1)',
            output_id_format           => 'sprintf("%04d",$1)',
            thread_msg_delimiter       => "_*_*_*_*_*_\n_*_*_*_*_*_\n\n\n",
            archive_kill_trigger       => -1,
            digests_read_flag          => 1,
            digests_read               => "../pbml/digests_read.txt",
            ...
        );

  • Keys Needed When Archiving Thread Files

    If, as discussed above, you have assigned the value 1 to the archive_kill_trigger key in %config_out, then Mail::Digest::Tools will archive older thread files, i.e., it will move thread files from the directory specified in key dir_threads to an archive directory if the thread file has not been modified in a specified number of days. If new messages need to be posted to a thread file which has been archived, that file will be de-archived and brought back to the dir_threads directory. Thread files which are either archived or de-archived via a call to process_new_digests() or reprocess_ALL_digests() will be logged in appropriately named files.

    Hence, the keys you will need to define when archiving thread files are:

    1 archive_kill_days

    This key must be assigned the number of days after which a thread file sitting in the dir_threads directory is moved to an archive directory. If not specified, it defaults to 14 days.

    2 dir_archive_top

    This key must be assigned the name of the top archive directory, i.e., the directory at the top of a tree of archive directories.

    When you track a particular mailing list digest for a number of years, the number of different thread files can grow to enormous proportions. For example, the author has tracked over 10,000 distinct thread files from the Perl-Win32-Users list over a three-and-a-half year period. 10,000 files in a single directory is completely unwieldy and slows directory read-times tremendously. Mail::Digest::Tools therefore by default provides a tree of archive directories: a top directory which contains no thread files but instead holds 27 subdirectories, one for each letter of the English alphabet and one for thread files which start with any other character (guaranteed to work with ASCII only; not tested with other character sets).

        dir_archive_top
            a
            b
            c
            ...
            z
            other

    The user gets to choose where to place the top archive directory but the 27 subdirectories are automatically placed beneath that one. The top archive directory is the value assigned to %config_out key dir_archive_top.

    3 archived_today

    This key should be assigned the name of a file which will log any and all files archived by a single call to process_new_digests() or reprocess_ALL_digests(). (By 'single' call is meant that this is not an ongoing log; it only shows what happened today.) If not assigned a value, it will default to a file called archived_today.txt located in the directory named by key dir_digest.

    4 de_archived_today

    This key should be assigned the name of a file which will log any and all files de-archived by a single call to process_new_digests() or reprocess_ALL_digests(). (By 'single' call is meant that this is not an ongoing log; it only shows what happened today.) If not assigned a value, it will default to a file called de_archived_today.txt located in the directory named by key dir_digest.

    5 archive_config

    This key is reserved for future use. In the current version of Mail::Digest::Tools it does not need to be set, but, should you be obsessive about this, set it to 0.
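    The two decisions described above (is a thread file old enough to archive, and which of the 27 subdirectories does it belong in) can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's internal code: the helper names is_stale and archive_subdir are hypothetical, age is measured with Perl's -M file test, and the subdirectory is assumed to be chosen case-insensitively from the first character of the file name.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $file was last modified more than
# $kill_days days ago. -M reports a file's age in days, measured
# from the time the script started.
sub is_stale {
    my ($file, $kill_days) = @_;
    my $age   = -M $file;                       # undef if file is missing
    my $stale = defined($age) && $age > $kill_days;
    return $stale ? 1 : 0;
}

# Hypothetical helper: pick one of the 27 subdirectories from the first
# character of a thread file's name (ASCII letters only; assumption:
# upper- and lowercase letters share a subdirectory).
sub archive_subdir {
    my ($filename) = @_;
    return $filename =~ /^([A-Za-z])/ ? lc $1 : 'other';
}

# Sample thread file names are invented for illustration.
for my $name ('Win32 OLE question.thr.txt', '[ANN] new release.thr.txt') {
    print "$name -> ", archive_subdir($name), "\n";
}
```

    A caller would combine the two, for example moving a file into "$config_out{dir_archive_top}/" . archive_subdir($name) only when is_stale() reports that it is older than archive_kill_days.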

    Adding these keys to our sample %config_out hashes, we get:

        %pw32u_config_out = (
            title                      => 'Perl-Win32-Users',
            dir_digest                 => "../pw32u",
            dir_threads                => "../pw32u/Threads",
            digests_log                => "../pw32u/digests_log.txt",
            todays_topics              => "../pw32u/todays_topics.txt",
            id_format                  => 'sprintf("%03d",$1) . \'_\' . 
                                               sprintf("%04d",$2)',
            output_id_format           => 'sprintf("%04d",$1)',
            thread_msg_delimiter       => "--__--__--\n\n",
            archive_kill_trigger       => 1,
            digests_read_flag          => 1,
            digests_read               => "../pw32u/digests_read.txt",
            archive_kill_days          => 14,
            dir_archive_top            => "../pw32u/Threads/archive",
            archived_today             => "../pw32u/archived_today.txt",
            de_archived_today          => "../pw32u/de_archived_today.txt",
            ...
        );
    
        %pbml_config_out = (
            title                      => 'Perl Beginner',
            dir_digest                 => "../pbml",
            dir_threads                => "../pbml/Threads",
            digests_log                => "../pbml/digests_log.txt",
            todays_topics              => "../pbml/todays_topics.txt",
            id_format                  => 'sprintf("%05d",$1)',
            output_id_format           => 'sprintf("%04d",$1)',
            thread_msg_delimiter       => "_*_*_*_*_*_\n_*_*_*_*_*_\n\n\n",
            archive_kill_trigger       => -1,
            digests_read_flag          => 1,
            digests_read               => "../pbml/digests_read.txt",
            ...
        );

    Note that since in our example we chose not to archive thread files from the Perl Beginner list -- as evinced by the assignment of -1 to key archive_kill_trigger -- we do not need to assign any values to dir_archive_top, archived_today or de_archived_today in %pbml_config_out.

  • Keys Needed When Deleting Thread Files

    The keys needed for %config_out when you have chosen to delete thread files after a specified interval parallel those you would have needed if you had chosen to archive those files instead.

    1 archive_kill_days

    This key must be assigned the number of days after which a thread file sitting in the dir_threads directory is deleted. If not specified, it defaults to 14 days.

    2 deleted_today

    This key should be assigned the name of a file which will log any and all files deleted by a single call to process_new_digests() or reprocess_ALL_digests(). (By 'single' call is meant that this is not an ongoing log; it only shows what happened today.) If not assigned a value, it will default to a file called deleted_today.txt located in the directory named by key dir_digest.

    Adding these keys to our sample %config_out hashes, we get:

        %pw32u_config_out = (
            title                      => 'Perl-Win32-Users',
            dir_digest                 => "../pw32u",
            dir_threads                => "../pw32u/Threads",
            digests_log                => "../pw32u/digests_log.txt",
            todays_topics              => "../pw32u/todays_topics.txt",
            id_format                  => 'sprintf("%03d",$1) . \'_\' . 
                                               sprintf("%04d",$2)',
            output_id_format           => 'sprintf("%04d",$1)',
            thread_msg_delimiter       => "--__--__--\n\n",
            archive_kill_trigger       => 1,
            digests_read_flag          => 1,
            digests_read               => "../pw32u/digests_read.txt",
            archive_kill_days          => 14,
            dir_archive_top            => "../pw32u/Threads/archive",
            archived_today             => "../pw32u/archived_today.txt",
            de_archived_today          => "../pw32u/de_archived_today.txt",
            ...
        );
    
        %pbml_config_out = (
            title                      => 'Perl Beginner',
            dir_digest                 => "../pbml",
            dir_threads                => "../pbml/Threads",
            digests_log                => "../pbml/digests_log.txt",
            todays_topics              => "../pbml/todays_topics.txt",
            id_format                  => 'sprintf("%05d",$1)',
            output_id_format           => 'sprintf("%04d",$1)',
            thread_msg_delimiter       => "_*_*_*_*_*_\n_*_*_*_*_*_\n\n\n",
            archive_kill_trigger       => -1,
            digests_read_flag          => 1,
            digests_read               => "../pbml/digests_read.txt",
            archive_kill_days          => 14,
            deleted_today              => "../pbml/deleted_today.txt",
            ...
        );

    Note that since in our example we chose to archive thread files from the Perl-Win32-Users list -- as evinced by the assignment of 1 to key archive_kill_trigger -- we do not need to assign any values to deleted_today in %pw32u_config_out.

  • Keys Needed When Stripping Multipart MIME Content from Thread Files

    Recall from above that you had to study a given digest to determine whether or not it contained multipart MIME content in need of stripping out. If a digest, such as the ActiveState Perl-Win32-Users digest, contained a lot of such bloat, you set key MIME_cleanup_flag in %config_in to a value of 1. If, on the other hand, the mailing list provider stripped out the multipart MIME content before distributing the digest, you set that key to a value of 0.

    Mail::Digest::Tools will automatically strip out multipart MIME content once you have set MIME_cleanup_flag to 1. All that is left for you to decide is: Do I want to view a log of which messages processed in a single call of process_new_digests() or reprocess_ALL_digests() had multipart MIME content stripped out -- or not? If so, you must set two keys in %config_out:

    1 MIME_cleanup_log_flag

    This key must be set to a true value such as 1.

    2 mimelog

    This key should be assigned the name of the 'mimelog' file, but if you do not specify a value it will default to a file mimelog.txt placed in the directory named by key dir_digest.

    The logfile so created looks like this:

        Processed                     Problem
    
        001_1775_0003 CASE C
        001_1775_0015 CASE C
        001_1775_0018 CASE C
        001_1775_0021 CASE E

    where items in the 'Processed' column either (a) were successfully stripped of multipart MIME content by Mail::Digest::Tools as specified by the internal rule denoted by the 'CASE'; or (b) were recognized by Mail::Digest::Tools as containing multipart MIME content that could not be stripped out.

    This is relatively esoteric and probably of interest mainly to the module's developer. So if you are not interested in this feature set MIME_cleanup_log_flag to 0 and no mimelog will be created -- but Mail::Digest::Tools will still do its best to strip out extraneous multipart MIME content.
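    If you do keep a mimelog, a quick tally of how often each case occurs may be more useful than reading it line by line. The sketch below is one way to do that; the function name tally_mime_cases is hypothetical, and it assumes every entry has the form "message-id CASE letter" shown in the sample above.

```perl
use strict;
use warnings;

# Hypothetical helper: count mimelog entries per CASE letter.
# Lines that do not look like "<message id> CASE <letter>"
# (the header line, blank lines) are skipped.
sub tally_mime_cases {
    my (@lines) = @_;
    my %count;
    for my $line (@lines) {
        next unless $line =~ /^\s*(\S+)\s+CASE\s+(\w+)\s*$/;
        $count{$2}++;
    }
    return \%count;
}

my @sample = (
    'Processed                     Problem',
    '',
    '001_1775_0003 CASE C',
    '001_1775_0015 CASE C',
    '001_1775_0018 CASE C',
    '001_1775_0021 CASE E',
);
my $tally = tally_mime_cases(@sample);
print "CASE $_: $tally->{$_}\n" for sort keys %$tally;
```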

    Our sample %config_out hashes are now complete. They look like this:

        %pw32u_config_out = (
            title                      => 'Perl-Win32-Users',
            dir_digest                 => "../pw32u",
            dir_threads                => "../pw32u/Threads",
            digests_log                => "../pw32u/digests_log.txt",
            todays_topics              => "../pw32u/todays_topics.txt",
            id_format                  => 'sprintf("%03d",$1) . \'_\' . 
                                               sprintf("%04d",$2)',
            output_id_format           => 'sprintf("%04d",$1)',
            thread_msg_delimiter       => "--__--__--\n\n",
            archive_kill_trigger       => 1,
            digests_read_flag          => 1,
            digests_read               => "../pw32u/digests_read.txt",
            archive_kill_days          => 14,
            dir_archive_top            => "../pw32u/Threads/archive",
            archived_today             => "../pw32u/archived_today.txt",
            de_archived_today          => "../pw32u/de_archived_today.txt",
            mimelog                    => "../pw32u/mimelog.txt",
            MIME_cleanup_log_flag      => 1,
        );
    
        %pbml_config_out = (
            title                      => 'Perl Beginner',
            dir_digest                 => "../pbml",
            dir_threads                => "../pbml/Threads",
            digests_log                => "../pbml/digests_log.txt",
            todays_topics              => "../pbml/todays_topics.txt",
            id_format                  => 'sprintf("%05d",$1)',
            output_id_format           => 'sprintf("%04d",$1)',
            thread_msg_delimiter       => "_*_*_*_*_*_\n_*_*_*_*_*_\n\n\n",
            archive_kill_trigger       => -1,
            digests_read_flag          => 1,
            digests_read               => "../pbml/digests_read.txt",
            archive_kill_days          => 14,
            deleted_today              => "../pbml/deleted_today.txt",
        );

    Note that %pbml_config_out does not have MIME_cleanup_log_flag or mimelog keys. It doesn't need them, because in providing the Perl Beginners mailing list Yahoo! Groups strips out unnecessary multipart MIME content before sending the digest to you.
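    The defaults mentioned in this section can be summarized in code. The following is a sketch only: with_documented_defaults is a hypothetical helper, not part of Mail::Digest::Tools, and it fills in every default at once, whereas the module itself may apply a default only when the corresponding feature is switched on.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of a %config_out-style hash with
# the defaults described in this documentation filled in for any key
# that was left unset.
sub with_documented_defaults {
    my ($config_out_ref) = @_;
    my %c   = %{$config_out_ref};
    my $dir = $c{dir_digest};

    $c{archive_kill_days} = 14 unless defined $c{archive_kill_days};

    # Each of these log files defaults to "<key>.txt" in dir_digest.
    for my $key (qw(digests_read archived_today de_archived_today
                    deleted_today mimelog)) {
        $c{$key} = "$dir/$key.txt" unless defined $c{$key};
    }
    return %c;
}

my %full = with_documented_defaults(
    { title => 'Perl Beginner', dir_digest => '../pbml' }
);
print "$_ => $full{$_}\n" for sort keys %full;
```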

HELPFUL HINTS

... in which the module author shares what he has learned using Mail::Digest::Tools and its predecessors since August 2000.

Initial Configuration and Testing

As mentioned above, if you are considering creating a local archive of threads originating in daily digest versions of a mailing list, you should first accumulate 6-10 instances of such digests and both:

  1. study the internal structure of the digest -- needed to develop a %config_in for the digest; and

  2. carefully consider how you wish to structure the output from the module's use on your system -- needed to develop %config_out for the digest

Once you have developed the initial configuration, you should call reprocess_ALL_digests() on the digests, then open the files created to see if the results are what you want. If they are not what you want, then you need to think about what you should change in %config_in and/or %config_out. Make those changes, then call reprocess_ALL_digests() again. Repeat as needed, making sure not to delete any of the digest files you are using as sources until you are completely satisfied with your configuration.

Once, however, you are satisfied with your configuration, you should call process_new_digests() on new instances of digests and never call reprocess_ALL_digests() for that digest again (lest you not be able to regenerate threads containing messages from digests you have deleted over time).

Where to Store the Configuration Hashes

As mentioned above, you will probably find it convenient to write separate Perl scripts to call each one of Mail::Digest::Tools' public functions. You could code %config_in and %config_out in each of those scripts just before the respective function calls. But that would violate the principle of 'Repeated Code Is a Mistake' and multiply maintenance problems. It's far better to code the two configuration hashes in a separate plain-text file and 'require' that file into your script. That way, any changes you make in the configuration will be automatically picked up by each script that calls a Mail::Digest::Tools function.

Here is an example of such a file holding the configuration hashes governing use of the Perl-Win32-Users digest, along with a script making use of that file.

    # file:  pw32u.digest.data
    $topdir = "E:/Digest/pw32u";
    %config_in =  (
         grep_formula           => 'Perl-Win32-Users digest',
         pattern_target         => '.*Vol\s(\d+),\sIssue\s(\d+)\.txt',
         # next element's value must be double-quoted
         source_msg_delimiter   => "--__--__--\n\n",
         topics_intro           => 'Today\'s Topics:',
         message_style_flag     => '^Message:\s+(\d+)$',
         from_style_flag        => '^From:\s+(.+)$',
         org_style_flag         => '^Organization:\s+(.+)$',
         to_style_flag          => '^To:\s+(.+)$',
         cc_style_flag          => '^CC:\s+(.+)$',
         subject_style_flag     => '^Subject:\s+(.+)$',
         date_style_flag        => '^Date:\s+(.+)$',
         reply_to_style_flag    => '^Reply-To:\s+(.+)$',
         MIME_cleanup_flag      => 1,
    );

    %config_out =  (
         title                  => 'Perl-Win32-Users',
         dir_digest             => $topdir,
         dir_threads            => "$topdir/Threads",
         dir_archive_top        => "$topdir/Threads/archive",
         archived_today         => "$topdir/archived_today.txt",
         de_archived_today      => "$topdir/de_archived_today.txt",
         deleted_today          => "$topdir/deleted_today.txt",
         digests_log            => "$topdir/digests_log.txt",
         digests_read           => "$topdir/digests_read.txt",
         todays_topics          => "$topdir/todays_topics.txt",
         mimelog                => "$topdir/mimelog.txt",
         id_format              => 'sprintf("%03d",$1) . \'_\' . 
                                        sprintf("%04d",$2)',
         output_id_format       => 'sprintf("%04d",$1)',
         MIME_cleanup_log_flag  => 1,
         # next element's value must be double-quoted
         thread_msg_delimiter   => "--__--__--\n\n",
         archive_kill_trigger   => 1,
         archive_kill_days      => 14,
         digests_read_flag      => 1,
         archive_config         => 0,
    );

    # script:  dig.pl
    # USAGE:  perl dig.pl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Mail::Digest::Tools qw( process_new_digests );

    our (%config_in, %config_out);
    # leading ./ is needed on Perl 5.26+, where '.' is no longer in @INC
    my $data_file = './pw32u.digest.data';
    require $data_file;

    process_new_digests(\%config_in, \%config_out);

    print "\nFinished\n";
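Two details of the 'require' mechanism are worth seeing in isolation: a required file must return a true value, and a path that is absolute or begins with ./ is loaded directly rather than searched for in @INC. The following self-contained sketch writes a tiny stand-in configuration file to a temporary directory and loads it; the file name example.digest.data and its abbreviated contents are invented for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Package variables that the required file will populate.
our (%config_in, %config_out);

# Write a tiny stand-in for a configuration file (hypothetical content).
my $dir       = tempdir(CLEANUP => 1);
my $data_file = File::Spec->catfile($dir, 'example.digest.data');
open my $fh, '>', $data_file or die "Couldn't open $data_file: $!";
print {$fh} <<'END_CONFIG';
%config_in  = ( grep_formula => 'Perl-Win32-Users digest' );
%config_out = ( title        => 'Perl-Win32-Users' );
1;    # a required file must return a true value
END_CONFIG
close $fh or die "Couldn't close $data_file: $!";

# An absolute path is loaded directly, without searching @INC.
require $data_file;

print "Loaded configuration for $config_out{title}\n";
```

Ending the configuration file with an explicit 1; is a common safeguard, so that the file keeps loading no matter which statement happens to come last.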

Maintaining Local Archives of More than One Digest

The module author has maintained local archives of more than a half dozen different mailing list digests over the past several years. He has found it convenient to maintain the configuration information for all the digests he is following at a given time in a single configuration file. The advantage to this approach is that if two digests share a similar internal structure (perhaps due to being generated by the same mailing list program or list provider) and if the user chooses to structure the output from the two digests in similar or identical ways, then setting up the configuration hashes becomes much easier and the potential for error is reduced.

Here is a sample directory and file structure for maintaining archives of two different digests on a Win32 system:

    digest/
        digest.data
        process_new.pl
        process_ALL.pl
        reply_digest_message.pl
        repair_digest_order.pl
        consolidate_threads.pl
        deletables.pl
        pw32u/
            Perl-Win32-Users Digest, Vol 1 Issue 1771.txt
            Perl-Win32-Users Digest, Vol 1 Issue 1772.txt
            digest_log.txt
            digest_read.txt
            mimelog.txt
            Threads/
        pbml/
            [PBML] Digest Number 1491.txt
            [PBML] Digest Number 1492.txt
            digest_log.txt
            Threads/

File digest.data would look like this:

    # digest.data
    $topdir = "E:/Digest";
    %digest_structure = (
        pbml =>    {
             grep_formula   => '\[PBML\]',
             pattern_target => '.*\s(\d+)\.txt$',
             ...
           },
        pw32u =>   {
             grep_formula   => 'Perl-Win32-Users digest',
             pattern_target => '.*Vol\s(\d+),\sIssue\s(\d+)\.txt',
             ...
           },
    );
    %digest_output_format = (
        pbml =>    {
             title          => 'Perl Beginner',
             dir_digest     => "$topdir/pbml",
             dir_threads    => "$topdir/pbml/Threads",
             ...
           },
        pw32u =>   {
             title          => 'Perl-Win32-Users',
             dir_digest     => "$topdir/pw32u",
             dir_threads    => "$topdir/pw32u/Threads",
             ...
           },
    );
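When two digests share settings, one way to exploit a single configuration file is to state the shared values once and merge them into each digest's hash. The sketch below illustrates the idea with values taken from the sample hashes earlier in this documentation; the hash name %common_output is hypothetical. (In a real digest.data file the hashes would be package variables, without 'my', so that the calling script's 'our' declarations can see them.)

```perl
use strict;
use warnings;

my $topdir = "E:/Digest";

# Hypothetical: output settings that both digests happen to share.
my %common_output = (
    output_id_format  => 'sprintf("%04d",$1)',
    digests_read_flag => 1,
    archive_kill_days => 14,
);

# In a hash constructor later pairs override earlier ones, so any
# per-digest value listed after %common_output wins.
my %digest_output_format = (
    pbml  => {
        %common_output,
        title                => 'Perl Beginner',
        dir_digest           => "$topdir/pbml",
        archive_kill_trigger => -1,
    },
    pw32u => {
        %common_output,
        title                => 'Perl-Win32-Users',
        dir_digest           => "$topdir/pw32u",
        archive_kill_trigger => 1,
    },
);

for my $key (sort keys %digest_output_format) {
    my $c = $digest_output_format{$key};
    print "$key: $c->{title}, kill days $c->{archive_kill_days}\n";
}
```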

To accommodate this slightly more complex structure in the configuration file, the calling script might be modified as follows:

    # script:  dig.pl
    # USAGE:  perl dig.pl [short-name for digest]
    #!/usr/bin/perl
    use Mail::Digest::Tools qw( process_new_digests );

    my ($this_key, %config_in, %config_out);
    # variables imported from $data_file
    our (%digest_structure, %digest_output_format);    

    # leading ./ is needed on Perl 5.26+, where '.' is no longer in @INC
    my $data_file = './digest.data';
    require $data_file;

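    $this_key = shift @ARGV;
    die "\n     USAGE:  perl dig.pl [short-name for digest]\n"
        unless defined $this_key;
    # $! is not meaningful here, so it is left out of the message
    die "\n     The command-line argument you typed:  $this_key\n     does not call an accessible digest\n"
        unless (defined $digest_structure{$this_key}
            and defined $digest_output_format{$this_key});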

    my ($k,$v);
    while ( ($k, $v) = each %{$digest_structure{$this_key}} ) {
        $config_in{$k} = $v;
    }
    while ( ($k, $v) = each %{$digest_output_format{$this_key}} ) {
        $config_out{$k} = $v;
    }

    process_new_digests(\%config_in, \%config_out);

    print "\nFinished\n";
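The two while loops in the script above copy one digest's settings, key by key, into %config_in and %config_out. The same shallow copy can be written by dereferencing the inner hash in a single assignment, as in this sketch. The helper name config_for is hypothetical, and the hash shown is an abbreviated stand-in for what digest.data would supply.

```perl
use strict;
use warnings;

# Stand-in for a hash loaded from digest.data (abbreviated).
my %digest_structure = (
    pbml  => { grep_formula => '\[PBML\]' },
    pw32u => { grep_formula => 'Perl-Win32-Users digest' },
);

# Hypothetical helper: look up one digest's settings and return a
# shallow copy of them, or die if the short name is unknown.
sub config_for {
    my ($all_ref, $key) = @_;
    my $shown = defined $key ? $key : '(none given)';
    die "No digest is configured under the short name '$shown'\n"
        unless ( defined $key and ref( $all_ref->{$key} ) eq 'HASH' );
    return %{ $all_ref->{$key} };
}

my %config_in = config_for(\%digest_structure, 'pw32u');
print "grep_formula: $config_in{grep_formula}\n";
```

Because the copy is shallow, later changes to %config_in do not alter the master hash as long as the values are plain strings and numbers, which is the case for every key described in this documentation.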

Getting Your Mail to the Right Place on Your System

For several years the module author used the scripts which were predecessors to Mail::Digest::Tools on a Win32 system where mail was read with Microsoft Outlook Express. He would do a "File/Save as.." on an instance of a digest, select text format (*.txt) and save it to an appropriate directory. Later, the author used the shareware e-mail client Poco, in which the same operation was accomplished by highlighting a file and keying "Ctrl+S".

But as the number of digests the author was tracking grew, this procedure became more and more tedious. Fortunately, about that time the author was assigned to write a review of the second edition of the Perl Cookbook, and he learned how to use the Net::POP3 module to receive his e-mail directly. So now he uses a Perl script to get all his digests and save them as text files to appropriate directories -- and then lets a GUI e-mail client take care of the rest.

Here is a script which more or less accomplishes this:

    # script:  get_digests.pl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Net::POP3;
    use Term::ReadKey;

    my ($site, $username, $password);
    my ($verref, $pop3, $messagesref, $undeleted, $msgnum, $message);
    my ($k,$v);
    my ($oldfh, $output);

    my %digests = (
        'pbml'   => "E:/Digest/pbml",
        'pw32u'  => "E:/Digest/pw32u",
        'london' => "E:/Digest/london",
    );

    $site = 'pop3.someISP.com';
    $username = 'myuserid';

    $pop3 = Net::POP3->new($site)
            or die "Couldn't open connection to $site: $!";

    print "Enter password for $username at $site:  ";
    ReadMode('noecho');
    $password = ReadLine(0);
    chomp $password;
    ReadMode(0);
    print "\n";

    defined ($pop3->login($username, $password))
        or die "Can't authenticate: $!";

    $messagesref = $pop3->list 
        or die "Can't get list of undeleted messages: $!";

    while ( ($k,$v) = each %$messagesref ) {
        my ($messageref, $line, %headers);
        print "$k:\t$v\n";
        $messageref = $pop3->top($k);
        local $_;
        foreach (@$messageref) {
            chomp;
            last if (/^\s*$/);
            next unless (/^\s*(Date:|From:|Subject:|To:)/);
            if (/^\s*Date:\s*(.*)/) {
                $headers{'Date'} = $1;
            }
            if (/^\s*From:\s*(.*)/) {
                $headers{'From'} = $1;
            }
            if (/^\s*Subject:\s*(.*)/) {
                $headers{'Subject'} = $1;
            }
            if (/^\s*To:\s*(.*)/) {
                $headers{'To'} = $1;
            }
        }
        if ($headers{'Subject'} =~ /^\[PBML\]/) {
            get_digest($pop3, $k, 'pbml', $headers{'Subject'});
        }
        if ($headers{'Subject'} =~ /^Perl-Win32-Users/) {
            get_digest($pop3, $k, 'pw32u', $headers{'Subject'});
        }
        if ($headers{'Subject'} =~ /^london\.pm/) {
            get_digest($pop3, $k, 'london', $headers{'Subject'});
        }
    }

    $pop3->quit() or die "Couldn't quit cleanly: $!";

    print "Finished!\n";

    sub get_digest {
        my ($pop3, $msgnum, $digest, $subj) = @_;
        print "Retrieving $msgnum: $subj";
        my $message = 
            $pop3->get($msgnum) or die "Couldn't get message $msgnum: $!";
        if ($message) {
            print "\n";
            my $digestfile = "$digests{$digest}/$subj.txt";
            _print_message($digestfile, $message);
            print "Marking $msgnum for deletion\n";
            $pop3->delete($msgnum) or die "Couldn't delete message $msgnum: $!";
        } else {
            print "Failed:  $!\n";
        }
    }

    sub _print_message {
        my ($digestfile, $message) = @_;
        my @lines = @{$message};
        my $counter = 0;
        open(FH, ">$digestfile") 
            or die "Couldn't open $digestfile for writing: $!";
        for (my $i = 0; $i<=$#lines; $i++) {
            chomp($lines[$i]);
            # Identify the first blank line in the digest,
            # i.e., the end of the headers
            if ($lines[$i] =~ /^$/) {
                $counter = $i;
                last;
            }
        }
        # Transfer digest to appropriate directory, skipping over digest header
        # so as to start just above Today's Topics
        foreach my $line (@lines[$counter+1 .. $#lines]) {
            chomp($line);
            # For some reason the $pop3->get() puts a single whitespace at the 
            # start of most (all but the first?) lines
            # That has to be cleaned up so digest.pl can correctly process 
            # header info and identify beginning of Today's Topics
            if ($line =~ /^\s(.*)/) {
                print FH $1, "\n";
            } else {
                print FH $line, "\n";
            }
        }
        close FH or die "Couldn't close after writing: $!";
    }

No promise is made that this script or any script contained in this documentation will work correctly on your system. Hack it up to get it to work the way you want it to.
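In that spirit, the clean-up step performed inside _print_message() can be pulled out into a function that takes and returns plain lists, which makes it easy to try out without a POP3 connection. This is a sketch, not a drop-in replacement: the name digest_body_lines is hypothetical, the sample message lines are invented, and unlike the script above it returns nothing at all when no blank line (end of headers) is found.

```perl
use strict;
use warnings;

# Hypothetical helper: given the lines of a retrieved message, drop
# everything up to and including the first blank line (the mail
# headers) and remove one leading whitespace character from each
# remaining line.
sub digest_body_lines {
    my @lines = @_;          # work on a copy of the caller's lines
    chomp @lines;
    my $first_blank;
    for my $i (0 .. $#lines) {
        if ($lines[$i] =~ /^$/) {
            $first_blank = $i;
            last;
        }
    }
    return () unless defined $first_blank;
    my @body = @lines[ $first_blank + 1 .. $#lines ];
    s/^\s// for @body;       # strip a single leading whitespace char
    return @body;
}

# Invented sample message for illustration only.
my @message = (
    "Subject: [PBML] Digest Number 1491\n",
    "\n",
    " First line of the digest body\n",
    " Second line of the digest body\n",
);
print "$_\n" for digest_body_lines(@message);
```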

ASSUMPTIONS AND QUALIFICATIONS

1 No Change in Mailing List Digest Software

The main assumption on which Mail::Digest::Tools depends for its success is that the provider of a particular digest continues to use the same mailing list software to produce the digest. If the provider changes his/her software, you must modify Mail::Digest::Tools' configuration data accordingly.

2 Digest Must Be One E-mail Without Attachments

At its current stage of development Mail::Digest::Tools is only applicable to mailing list digests which arrive as one continuous file. It is not applicable to digests (e.g., Cygwin, module-authors@perl.org) which are supplied in a format consisting of (a) one file with instructions and a table of contents and (b) all the individual messages provided as e-mail attachments.

3 Perl 5.6+ Only

The program was created with Perl 5.6. Certain features, such as the use of the our modifier, were not available prior to 5.6. Modifications to account for pre-5.6 features are left as an exercise for the user.

4 Time::Local

Mail::Digest::Tools internally uses Perl core extension Time::Local. If at some future point this module is not included as part of a Perl core distribution, you would have to install it manually from CPAN.

HISTORY AND FUTURE DEVELOPMENT

PRE-CPAN HISTORY

ActiveState maintains Perl for Windows-based platforms and also maintains a variety of mailing lists for users of its Windows-compatible versions of Perl. Subscribers to these lists can receive messages either as individual e-mails or as part of a daily digest which contains a listing of the day's topics and the complete text of each message. The messages are often best followed as discussion 'threads' which may extend over several days' worth of digests.

In June of 2000, however, ActiveState had to temporarily take its mailing lists off-line for technical reasons. When these lists were restored to service, their archive capacities were not immediately restored. I had just begun my study of Perl and had come to enjoy reading the Perl-Win32-Users digest. As I set off for the Yet Another Perl Conference in Pittsburgh, I shouted out, 'I want my Perl-Win32-Users digest!' I wrote a Perl script called digest.pl to fill that gap.

ActiveState has since restored archiving capacity to their lists. For reasons that would perhaps best be explored in a psychotherapeutic context, however, I had become attached to my local archive of the 'pw32u' list, so I continued to maintain this program and fine-tune its coding.

In early 2001 it became apparent that this program could be applied to a wide variety of mailing list digests -- not just those provided by ActiveState. In particular, valuable digests provided by Yahoo Groups (formerly E-groups) such as NT Emacs Users, Perl 5 Porters and Perl Beginners could also be archived if digest.pl were modified appropriately. I made those modifications and began to track several other digests. I was able to use the archive I had developed as a window into one part of the Perl community in a Lightning Talk I gave at YAPC::North America in Montreal in June 2001, "An Index of Incivility in the Perl Community."

Maintaining digest.pl was, to a considerable extent, the way I taught myself Perl. Along the way I incorporated my first profiler into the script -- and then discarded it. Some of the subroutines I had written for early versions of the program had applicability to other scripts -- and thus was born my first module -- also since discarded. By July 2003 I was up to version 1.3. Following a suggestion by Uri Guttman at the YAPC::EU conference held in Paris in July 2003, wherever possible the use of separate print statements for each line to be printed was eliminated in favor of concatenating strings to be printed into much larger strings which could be printed all at once. This revision reduced the number of times filehandles had to be opened for writing. A given thread file was now opened only once per call of this program, rather than once for each message in each digest processed per call of the program.

Various other improvements, such as the possibility of stripping out unnecessary multipart MIME content and the introduction of subdirectories for archiving, were made in late 2003. At that point I decided to transform the script into a full-fledged Perl module. At first I tried out an object-oriented structure (with which I was familiar from my first two CPAN modules, List::Compare and Data::Presenter). That OO structure necessitated one constructor and one method call per typical script, but since the constructor did nothing but some cursory validation of the configuration data, it was mostly superfluous. Hence, I jettisoned the OO structure in favor of a functional approach. The result: Mail::Digest::Tools.

CPAN

After these revisions, I was up to version 1.96. Why revert to a lower version number at this point? That is why Mail::Digest::Tools makes its CPAN debut in version 2.04.

v1.97 (2/18/2004): Dealing with the problem that Win32 and Unix/Linux may create different thread names for the same set of source messages because they have different lists of characters forbidden in file names. This became a problem while writing tests for process_new_digests() because it made the names of thread files created via that function more difficult to predict. Tests adjusted appropriately.

v1.98 (2/19/2004): Eliminated suspect uses of /o modifier on regexes. This was causing problems when I called process_new_digests() on two different types of digests in the same script. Also, eliminated code referring to DOS (e.g., code eliminating characters unacceptable in DOS filenames) as I have no way to test this module on a DOS box.
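The kind of problem the /o modifier can cause is easy to reproduce. With /o, a pattern containing an interpolated variable is compiled the first time that match is executed and never again, so a second digest's pattern is silently ignored. The sketch below demonstrates the effect with two hypothetical helper functions; it illustrates the general Perl behavior and is not code taken from the module.

```perl
use strict;
use warnings;

# With /o the interpolated pattern is compiled the first time this
# match is executed and is not recompiled afterwards.
sub matches_with_o {
    my ($string, $pattern) = @_;
    return $string =~ /$pattern/o ? 1 : 0;
}

# Without /o the current value of $pattern is used on every call.
sub matches_without_o {
    my ($string, $pattern) = @_;
    return $string =~ /$pattern/ ? 1 : 0;
}

my @checks = (
    [ '[PBML] Digest Number 1491',      '\[PBML\]' ],
    [ 'Perl-Win32-Users digest, Vol 1', 'Perl-Win32-Users digest' ],
);
for my $check (@checks) {
    my ($subject, $pattern) = @$check;
    printf "%-32s with /o: %d   without /o: %d\n",
        $subject,
        matches_with_o($subject, $pattern),
        matches_without_o($subject, $pattern);
}
```

On the second subject the /o version still tests against the first pattern and so reports no match, while the version without /o matches as intended.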

v1.99 (2/22/2004): ActiveState introduced a new format for its Perl-Win32-Users digest -- the digest which originally inspired the creation of this module's predecessor in 2000. One aspect of this new format was a clear improvement: HTML attachments are now stripped before messages are posted to the digest, so multipart MIME content has either been reduced considerably or eliminated altogether. But another aspect of this new format upset code going back four years: The delimiter immediately following Today's Topics is now different from the delimiters separating each message in the digest. Working around this appeared to be surprisingly difficult, especially since this revision had to be done in the middle of writing a test suite for CPAN distribution. A new key has been added to the %config_in hash for each digest:

    $config_in{'post_topics_delimiter'}
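To illustrate why a separate key is useful, here is a rough sketch of splitting a digest first at the post-topics delimiter and then at the delimiter between messages. Only the key name post_topics_delimiter comes from this documentation. The delimiter values, the variable $message_delimiter and the sample digest are invented for the example, and the module's own parsing code may well work differently.

```perl
use strict;
use warnings;

my %config_in = (
    # Only this key name comes from the documentation; the value is made up.
    post_topics_delimiter => "==========\n",
);
my $message_delimiter = "----------\n";    # hypothetical

# A tiny made-up digest: topics list, post-topics delimiter, two messages.
my $digest = join '',
    "Today's Topics:\n",
    "  1. Regex question\n",
    "  2. Module install\n",
    $config_in{post_topics_delimiter},
    "Message: 1\nBody of first message\n",
    $message_delimiter,
    "Message: 2\nBody of second message\n";

# Split once at the post-topics delimiter to separate topics from messages.
my ($topics, $rest) =
    split /\Q$config_in{post_topics_delimiter}\E/, $digest, 2;

# Then split the remainder at the delimiter between individual messages.
my @msgs = split /\Q$message_delimiter\E/, $rest;

print scalar(@msgs), " messages found\n";
```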

v2.00 (2/23/2004): Testing conducted after the last revision revealed a bug going back several versions in the internal subroutine stripping multipart MIME content. The last paragraph of each message which did not have MIME content was being stripped off. The offending code was found within _analyze_message_body(). (The author recently learned of the CPAN module Email::StripMime. This looks promising as a replacement for the hand-rolled subroutine used within Mail::Digest::Tools, but a full study of its possibilities will be deferred to a later version.) Also in this version, POD was rewritten to reflect the introduction of the post-topics delimiter.

v2.01 (2/24/2004): Backslashes (except as part of \n newline characters) are prohibited in %config_out key thread_msg_delimiter. This is because in the test suite that key's value is used as a variable inside a regular expression which in turn is used as an argument to split(). Preliminary investigation suggests that to work around the backslash metacharacter in that situation would be very time-consuming.
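The general hazard can be sketched as follows. This is not the module's test code: the delimiter and the thread text are hypothetical. A backslash in a string interpolated into a pattern is read as a regex escape, so split() may break the text in the wrong place. Perl's \Q...\E (quotemeta) is the standard way to force a literal match; whether it would have fit the test suite as it stood in 2004 is not something this document establishes.

```perl
use strict;
use warnings;

# Hypothetical delimiter containing a literal backslash followed by 'd'.
my $delimiter = '--\d--';
my $thread    = 'first--\d--second--7--third';

# Interpolated as-is, \d means "any digit", so the split happens at
# '--7--' and the real delimiter is never recognized.
my @raw = split /$delimiter/, $thread;

# With \Q...\E every character of the delimiter is matched literally.
my @quoted = split /\Q$delimiter\E/, $thread;

print scalar(@raw), " pieces unquoted, ", scalar(@quoted), " pieces quoted\n";
```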

v2.02 (2/26/2004): Revised reply_to_digest_message() internal subroutine _strip_down_for_reply to reflect distinction between post-topics delimiter and source message delimiter.

v2.03 (3/04/2004): Fixed bug in readdir call in repair_message_order(). Extensive reworking of test suite.

v2.04 (3/05/2004): No changes in module. Refinement of test suite only.

v2.05 (3/07/2004): Restored the increment of $message_count in _strip_down(), which had been deleted accidentally.

v2.06 (3/10/2004): Correction of errors in test suite. Elimination of use of List::Compare in test suite.

v2.07 (3/11/2004): Correction of error in t/03.t.

v2.08 (3/11/2004): Correction in _clean_up_thread_title and in tests.

v2.10 (3/15/2004): Corrections to README and documentation only.

v2.11 (10/23/2004): Fixed several errors which resulted in "Bizarre copy of hash in leave" error when running test suite under Devel::Cover.

v2.12 (05/14/2011): Added 'mirbsd' to list of Unixish-OSes.

AUTHOR

James E. Keenan (jkeenan@cpan.org).

Creation date: August 21, 2000. Last modification date: May 14, 2011. Copyright (c) 2000-2011 James E. Keenan. United States. All rights reserved.

This software is distributed with absolutely no warranty, express or implied. Use it at your own risk. This is free software which you may distribute under the same terms as Perl itself.