Text::LooseCSV - Highly forgiving variable length record text parser; compare to MS Excel
    use Text::LooseCSV;
    use IO::File;

    $fh = new IO::File $fname;
    $f = new Text::LooseCSV($fh);

    # Some optional settings
    $f->word_delimiter("\t");
    $f->line_delimiter("\n");
    $f->no_quotes(1);

    # Parse/split a line
    while ($rec = $f->next_record())
    {
        if ($rec == -1)
        {
            warn("corrupt rec: ", $f->cur_line);
            next;
        }

        # process $rec as arrayref
        ...
    }

    # Or, (vice-versa) create a variable-length record file
    $line = $f->form_record( [ 'Debbie Does Dallas','30.00','VHS','Classic' ] );
Why another variable-length text record parser? I've had the privilege to parse some of the gnarliest data ever seen and everything else I tried on CPAN choked (at the time I wrote this module). This module has been munching on millions of records of the filthiest data imaginable at several production sites so I thought I'd contribute.
This module follows somewhat loose rules (compare to MS Excel) and will handle embedded newlines, etc. It is capable of handling large files and processes data in line-chunks. If MAX_LINEBUF is reached, however, it will mark the current record as corrupt, return -1 and start over again at the very next line. This will (of course) process tab-delimited data or whatever value you set for word_delimiter.
Methods are called in perl OO fashion.
WARNING: this module modifies $/. line_delimiter() sets $/ and is always called during construction, so don't change $/ during program execution!
new (constructor)
$f = new Text::LooseCSV($fh);
Create a new Text::LooseCSV object for all your variable-length record needs with an optional file handle, $fh (e.g. IO::File). Set properties using the accessor methods as needed.
If $fh is not given, you can use input_file() or input_text().
Returns a blessed Text::LooseCSV object.
line_delimiter
$current_value = $f->line_delimiter("\n");
Get/set LINE_DELIMITER. LINE_DELIMITER defines the line boundary chunks that are read into the buffer and loosely defines the record delimiter.
For parsing, this does not strictly affect the record/field structures as fields may have embedded newlines, etc. However, this DOES need to be set correctly.
Default = "\r\n" NOTE! The default is Windows format.
Always returns the current set value.
WARNING! line_delimiter() also sets $/ and is always called during construction. Due to buffering, don't change $/ or LINE_DELIMITER during program execution!
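If other code in the same program has to read with a different separator, one common Perl idiom is to change $/ only inside a narrow scope with `local`, so the value the parser depends on is restored automatically. A minimal sketch, assuming the scoped read never calls into the parser; the `slurp` helper name is hypothetical and not part of this module:

```perl
use strict;
use warnings;

# Hypothetical helper: read an entire filehandle in one go.
# `local $/` limits the change to this sub; the previous value
# is restored automatically when the sub returns.
sub slurp {
    my ($fh) = @_;
    local $/;                 # undef => read to end of file
    return scalar <$fh>;
}

my $data = "line one\r\nline two\r\n";
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";

$/ = "\r\n";                  # the value line_delimiter("\r\n") is documented to set
my $text = slurp($fh);
close($fh);

# $/ is "\r\n" again here, so line-based reads elsewhere are unaffected.
```

The point of the design is that the change cannot leak: even if `slurp` dies, Perl restores the outer $/ when the scope unwinds.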
word_delimiter
$current_value = $f->word_delimiter("\t");
Get/set WORD_DELIMITER. WORD_DELIMITER defines the field boundaries within the record. WORD_DELIMITER may only be set to a single character, otherwise a warning is generated and the new value is ignored.
Default = "," NOTE! Single character only.
WARNING! Due to buffering, don't change WORD_DELIMITER during program execution!
quote_escape
$current_value = $f->quote_escape("\\");
Get/set QUOTE_ESCAPE. For data whose fields are enclosed in quotes, QUOTE_ESCAPE defines the escape character for '"'. For example, with the default QUOTE_ESCAPE = '"', a quote character is embedded in a field by doubling it (MS Excel style):
"field1 ""junk"" and more, etc"
Default = '"'
WARNING! Due to buffering, don't change QUOTE_ESCAPE during program execution!
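To make the escaping rule concrete, here is a small sketch of how one quoted field could be unescaped. It is an illustration only, not the module's implementation, and `unquote_field` is a hypothetical name:

```perl
use strict;
use warnings;

# Illustration only: strip the enclosing quotes from one field and
# collapse each escaped quote into a plain quote character.
sub unquote_field {
    my ($field, $quote_escape) = @_;
    $quote_escape = '"' unless defined $quote_escape;   # documented default
    return $field unless $field =~ /\A"(.*)"\z/s;       # not a quoted field
    my $inner = $1;
    my $esc   = quotemeta $quote_escape;
    $inner =~ s/$esc"/"/g;                              # escape + quote => quote
    return $inner;
}

# Default (MS Excel style) doubling, then a backslash escape.
my $excel_style = unquote_field('"field1 ""junk"" and more, etc"');
my $backslash   = unquote_field('"field1 \"junk\" and more, etc"', '\\');
```

Both calls yield the same field text, which is why QUOTE_ESCAPE has to match the convention the file was written with.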
word_line_delimiter_escape
$current_value = $f->word_line_delimiter_escape("\\");
Get/set WORD_LINE_DELIMITER_ESCAPE. Sometimes you'll encounter (or want to create) files where WORD_DELIMITERs and/or LINE_DELIMITERs are embedded in the data and the creator had the notion (courtesy?) to escape those characters when they appeared within a field with, say, '\'. If so, you'll want to set WORD_LINE_DELIMITER_ESCAPE to that character.
If WORD_LINE_DELIMITER_ESCAPE is specified, this character must itself be escaped by the same character to be included in a field. For example, for a tab-delimited file where WORD_LINE_DELIMITER_ESCAPE => '\', the following is a sample record with an embedded newline:
    me<TAB>you<TAB>this is a single field that contains an escaped line terminator\
    an escaped tab\<TAB> and an actual \\<TAB>this is the next field...
Do not use WORD_LINE_DELIMITER_ESCAPE for data with fields that are enclosed in quotes.
WORD_LINE_DELIMITER_ESCAPE cannot be '_'; an attempt to set it to that value is silently ignored.
Default = undef()
WARNING! Due to buffering, don't change WORD_LINE_DELIMITER_ESCAPE during program execution!
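The escaping convention can be sketched as a character-by-character split. This is an illustration under stated assumptions, not the module's implementation: `split_escaped` is a hypothetical name, and the sketch assumes the whole record is already in memory.

```perl
use strict;
use warnings;

# Illustration only: split one record on unescaped delimiters and
# drop the escape character in front of each escaped character.
sub split_escaped {
    my ($record, $delim, $esc) = @_;
    my @fields = ('');
    my @chars  = split //, $record;
    while (@chars) {
        my $c = shift @chars;
        if ($c eq $esc && @chars) {
            $fields[-1] .= shift @chars;   # next character is taken literally
        }
        elsif ($c eq $delim) {
            push @fields, '';              # unescaped delimiter starts a new field
        }
        else {
            $fields[-1] .= $c;
        }
    }
    return \@fields;
}

# Third field holds an escaped newline, an escaped tab and an escaped backslash.
my $record = "me\tyou\tline one\\\nline two, a tab\\\t and a backslash \\\\\tnext";
my $fields = split_escaped($record, "\t", "\\");
```

Here `$fields` holds four fields; the embedded newline, tab and backslash all stay inside the third one.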
no_quotes
$current_value = $f->no_quotes($bool);
Get/set NO_QUOTES. Instruct form_record to strip WORD_DELIMITER and LINE_DELIMITER from fields within the record and never to enclose fields in quotes.
By default, if a WORD_DELIMITER or LINE_DELIMITER is encountered in a field value during record formation, that field will be enclosed in quotes. However, if NO_QUOTES = 1, any occurrence of WORD_DELIMITER or LINE_DELIMITER will be stripped from the value and no enclosing quotes will be used.
If ALWAYS_QUOTE = 1 this attribute is ignored and quotes will always be used.
Only affects form_record.
Default = 0 (by default records created with form_record may have fields enclosed in quotes)
always_quote
$current_value = $f->always_quote($bool);
Get/set ALWAYS_QUOTE. Always enclose fields in quotes when using form_record. Only affects form_record. Takes precedence over no_quotes.
Default = 0
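The interaction between NO_QUOTES and ALWAYS_QUOTE can be summarised in a short sketch. It mirrors only the precedence described above and is not the module's code: `form_record_sketch` and its option names are hypothetical, and doubling embedded quote characters is an assumption.

```perl
use strict;
use warnings;

# Illustration only: ALWAYS_QUOTE wins over NO_QUOTES, which wins
# over the default "quote only when a delimiter is present" rule.
sub form_record_sketch {
    my ($fields, %opt) = @_;
    my $wd = defined $opt{word_delimiter} ? $opt{word_delimiter} : ',';
    my $ld = defined $opt{line_delimiter} ? $opt{line_delimiter} : "\r\n";
    my $delims = quotemeta($wd) . '|' . quotemeta($ld);

    my @out;
    for my $value (@$fields) {
        my $v = $value;
        if ($opt{always_quote}) {
            $v =~ s/"/""/g;                # assumption: double embedded quotes
            $v = qq{"$v"};
        }
        elsif ($opt{no_quotes}) {
            $v =~ s/$delims//g;            # strip delimiters, never quote
        }
        elsif ($v =~ /$delims/) {
            $v =~ s/"/""/g;
            $v = qq{"$v"};                 # quote only when needed
        }
        push @out, $v;
    }
    return join $wd, @out;
}

my @rec = ('Smith, Jane', '30.00', 'VHS');
my $default  = form_record_sketch(\@rec);
my $stripped = form_record_sketch(\@rec, no_quotes => 1);
my $quoted   = form_record_sketch(\@rec, no_quotes => 1, always_quote => 1);
```

With NO_QUOTES the comma inside the first field is dropped rather than protected, so that setting loses data whenever a value legitimately contains a delimiter.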
max_linebuf
$current_value = $f->max_linebuf($integer);
Get/set MAX_LINEBUF. A file is read in line chunks, and because newlines may be embedded in field values, many lines may be read and buffered before the whole record is determined. MAX_LINEBUF sets the maximum number of lines that are used to parse a record before the first line of that block is deemed junk and -1 is returned from next_record. Processing then continues at the very next line in the file.
Default = 1000
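The buffering rule can be pictured with a small, self-contained sketch. It is an illustration rather than the module's code: it assumes a record is complete once its double quotes balance, and `read_records` is a hypothetical name.

```perl
use strict;
use warnings;

# Illustration only: join lines until the quotes balance, giving up
# after $max_linebuf lines and restarting at the very next line.
sub read_records {
    my ($lines, $max_linebuf) = @_;
    my @out;
    my $i = 0;
    while ($i < @$lines) {
        my ($buf, $n, $complete) = ('', 0, 0);
        while ($i + $n < @$lines && $n < $max_linebuf) {
            $buf .= $lines->[$i + $n];
            $n++;
            my $quotes = () = $buf =~ /"/g;      # count quote characters
            if ($quotes % 2 == 0) { $complete = 1; last; }
        }
        if ($complete) {
            push @out, $buf;                     # whole record, newlines included
            $i += $n;
        }
        else {
            push @out, -1;                       # first buffered line is junk
            $i++;                                # resume at the very next line
        }
    }
    return \@out;
}

my @multi  = ("a,\"multi\n", "line\",b\n", "c,d\n");
my @broken = ("bad,\"never closed\n", "x,y\n", "p,q\n");
my $ok   = read_records(\@multi, 10);
my $junk = read_records(\@broken, 2);
```

In the second call the unterminated quote swallows two lines before the limit is hit; only the first line is discarded, so the two good records after it are still recovered.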
recadd
$current_value = $f->recadd($bool);
Get/set RECADD. If set to true, LINE_DELIMITER (actually $/) will be added to the end of the value returned from form_record. Only affects form_record.
input_file
$current_value = $f->input_file($fh);
Get/set the filehandle of the file to be parsed (e.g. IO::File object). May also be set in the constructor.
Default = undef
input_text
$textbuf = $f->input_text($text_blob);
Alternative to input_file: feed the entire text of a file or scalar to $f at once. Accepts a scalar or a scalar reference.
Returns the internal textbuf attribute.
next_record
$rec = $f->next_record();
Parses the next record and returns an arrayref of its fields.
Returns '' (the empty string) if EOF is encountered.
Returns -1 if the next record is corrupt (incomplete, etc.) or if MAX_LINEBUF is reached.
WARNING! Due to buffering, don't change $/ or LINE_DELIMITER during program execution!
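The three possible return values suggest a simple processing loop. The sketch below runs against a hypothetical stand-in class (`StubParser`) so that it works without a data file; in real use `$parser` would be a Text::LooseCSV object, and `drain` is likewise an illustrative name.

```perl
use strict;
use warnings;

# Hypothetical stand-in that replays canned [record, raw line] pairs.
package StubParser;
sub new {
    my ($class, @queue) = @_;
    return bless { queue => [@queue], cur => undef }, $class;
}
sub next_record {
    my ($self) = @_;
    my $item = shift @{ $self->{queue} };
    return '' unless defined $item;          # EOF
    $self->{cur} = $item->[1];
    return $item->[0];                       # arrayref, or -1 if corrupt
}
sub cur_line { return $_[0]{cur} }

package main;

# Drain a parser, separating good records from corrupt ones.
sub drain {
    my ($parser) = @_;
    my (@good, @bad);
    while (my $rec = $parser->next_record()) {   # '' at EOF ends the loop
        if (!ref $rec) {                          # -1 marks a corrupt record
            push @bad, $parser->cur_line();
            next;
        }
        push @good, $rec;                         # arrayref of fields
    }
    return (\@good, \@bad);
}

my $parser = StubParser->new(
    [ [ 'a', 'b' ], "a,b\r\n" ],
    [ -1,           "\"broken\r\n" ],
    [ [ 'c' ],      "c\r\n" ],
);
my ($good, $bad) = drain($parser);
```

Testing with `ref` instead of `== -1` avoids a numeric comparison against an array reference, while still treating -1 as the corrupt-record marker.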
cur_line
$raw = $f->cur_line();
Returns the raw text line currently being processed (including a line terminator if originally present).
form_record
$line = $f->form_record($array_of_fields);
Returns a scalar containing the fields of $array_of_fields joined by WORD_DELIMITER into a single variable-length record. Also see recadd.
$array_of_fields may be an array or arrayref.
No known bugs as yet. This code was used at several production sites before it was published.
Reed Sandberg, <reed_sandberg Ӓ yahoo>
Copyright (C) 2001-2007 Reed Sandberg. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.