TITLE

sh2odt - convert Shoebox/Toolbox to OpenOffice .odt file

SYNOPSIS

    sh2odt [-s settings_dir] [-c codepage] [-e encs] [-m] infile [outfile]

Converts Shoebox/Toolbox data to the OpenOffice .odt format.

OPTIONS

    -c codepage     Set system codepage for this process
    -e enc,enc      Add Encode:: subsets in Perl 5.8.1
    -m              MDF character marker support
    -s dir          Directory to find .typ files in [.]

If outfile is omitted, the output file name is the input file name with its extension replaced by .odt. This allows a user to drop a data file on a shortcut.
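The default output name can be sketched as follows. This is an illustration in Python, not the code sh2odt itself uses; the function name default_outfile is hypothetical, and the behaviour for an input file with no extension is an assumption, since the text above does not cover that case.

```python
import os.path

def default_outfile(infile: str) -> str:
    """Replace the extension of infile with .odt (hypothetical helper)."""
    # splitext splits off the last extension only; a name without an
    # extension simply gains ".odt" (an assumption, not documented).
    root, _ext = os.path.splitext(infile)
    return root + ".odt"

if __name__ == "__main__":
    print(default_outfile("story.db"))
```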

DESCRIPTION

sh2odt converts a Shoebox/Toolbox file into an OpenOffice .odt file. To do this it needs to convert the data to Unicode. It also converts interlinear text into character-level frames: each frame contains a single interlinear block and is treated by the system as if it were a character. The text can then be copied and pasted into tables, reflowed like normal text, etc.

Using sh2odt involves two aspects: preparing for conversion, which means supplying information about encoding conversion and even XML template output; and running the program, which means knowing what each command line option does. This manual is not a tutorial, so all the details are listed with little or no indication of relative priority.

Running sh2odt

Here we list the various command line options and give further details on each.

-c

Specifies the default codepage to be used when converting data. In effect it specifies that sh2odt should act as though it were running on a system with the given default codepage. This means that data in any language that has no encoding conversion of its own will be converted using this codepage.
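The fallback can be pictured with a small Python sketch. It is only an illustration under assumed names (decode_field, language_encoding and default_codepage are hypothetical) and says nothing about how sh2odt implements the conversion internally.

```python
from typing import Optional

def decode_field(raw: bytes, language_encoding: Optional[str],
                 default_codepage: str) -> str:
    """Decode raw bytes to Unicode text (hypothetical helper).

    Use the language's own encoding when one is given; otherwise fall
    back to the default codepage, as the -c option is described to do.
    """
    return raw.decode(language_encoding or default_codepage)

if __name__ == "__main__":
    # The same byte reads differently under different default codepages.
    print(decode_field(b"caf\xe9", None, "cp1252"))
    print(decode_field(b"caf\xe9", None, "cp1251"))
```

The two calls show why the choice of default codepage matters: the same byte 0xE9 decodes to a different character in each case.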

-e

Perl has internal support for a large number of industry standard encodings. This option specifies which sets to pull in, in addition to the default set. Values include:

  Byte - standard ISO 8859 type single byte encodings
  CN   - Mainland China encodings including cp 936, GB 12345 and GB 2312
  JP   - Japanese encodings including cp 932 and ISO 2022
  KR   - Korean encodings including cp 949
  TW   - Taiwanese encodings including cp 950
  HanExtra - more Chinese encodings including GB 18030
  JIS2K - more Japanese encodings
  Ebcdic - surely not!
  Symbols - various symbol encodings

See man Encode::Supported or the corresponding module documentation for details of what is supported on your Perl installation.

-m

MDF and perhaps other schemes support inline markers of the form |mk{text}. With this option sh2odt recognises such markers: data marked in this way is output with a character style named after the marker.
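As an illustration of the |mk{text} form, the following Python sketch splits a line into runs of plain and marked text. It is not sh2odt's own parser; the name split_inline is hypothetical, marker names are assumed to consist of word characters, and nested markers are not handled.

```python
import re

# |mk{text}: a bar, a marker name, then the marked text in braces.
# Assumes marker names are word characters and that markers do not nest.
INLINE = re.compile(r"\|(\w+)\{([^{}]*)\}")

def split_inline(text):
    """Return (style, text) runs; style is None for unmarked text."""
    runs = []
    pos = 0
    for m in INLINE.finditer(text):
        if m.start() > pos:
            runs.append((None, text[pos:m.start()]))
        # The marker name doubles as the character style name.
        runs.append((m.group(1), m.group(2)))
        pos = m.end()
    if pos < len(text):
        runs.append((None, text[pos:]))
    return runs

if __name__ == "__main__":
    print(split_inline("see |fv{amigo} here"))
```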

-s

sh2odt requires access to information about the structure of the database and to language information. This is held in files in the same directory as the .prj project file used when running Shoebox/Toolbox. This option names that directory; the default is the current directory.

Preparing for Conversion

The basic need is to be able to specify how to convert text in a particular language into Unicode. This can be done by specifying a conversion mapping in each language file. Shoebox and Toolbox do not have a UI for specifying such conversion information, so we add information to the options/description field. The codepage specification takes the form:

  \codepage = value

The specification needs to be on a line on its own. The value can take a number of forms.

name

A mapping name either from the set of names supported by the Perl Encode module, or specified in an SIL Converters repository.

filename.tec

The path and filename of a TECkit binary mapping file. The path is relative to the settings directory.

none

No mapping should be done. The data is assumed to be in UTF-8 encoding.
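The three forms of value can be told apart mechanically. The Python sketch below is an assumption-laden illustration, not sh2odt's own logic: the function name classify_codepage and the returned labels are hypothetical, and it assumes that none is matched exactly and that a .tec suffix identifies a TECkit file.

```python
import re

# The specification must be on a line of its own: \codepage = value
CODEPAGE_LINE = re.compile(
    r"^\\codepage[ \t]*=[ \t]*(\S.*?)[ \t]*$", re.MULTILINE)

def classify_codepage(description):
    """Return (kind, value) for the description field (hypothetical)."""
    m = CODEPAGE_LINE.search(description)
    if m is None:
        return ("default", None)   # no specification found
    value = m.group(1)
    if value == "none":
        return ("utf8", None)      # data is assumed to be UTF-8 already
    if value.lower().endswith(".tec"):
        return ("teckit", value)   # path relative to the settings dir
    return ("named", value)        # Encode or SIL Converters name

if __name__ == "__main__":
    print(classify_codepage("Vernacular data\n\\codepage = cp1252\n"))
```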

sh2odt creates a style for each marker and outputs the font used for that marker. If the data has been converted to Unicode, the original font is no longer appropriate to the encoding. A suitable replacement font can be given in the description field using

  \unicode_font = value

Where value is the font name to be used for the Unicode form of the data.
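The font choice can be sketched in Python as below. This is a hedged illustration only: the name style_font and its parameters are hypothetical, and falling back to the original font when no \unicode_font line is present is an assumption.

```python
import re

FONT_LINE = re.compile(
    r"^\\unicode_font[ \t]*=[ \t]*(\S.*?)[ \t]*$", re.MULTILINE)

def style_font(description, original_font, converted):
    """Pick the font for a marker's style (hypothetical helper)."""
    m = FONT_LINE.search(description)
    # Only converted data needs the Unicode font; otherwise keep the
    # font already configured for the marker (assumed fallback).
    if converted and m:
        return m.group(1)
    return original_font

if __name__ == "__main__":
    desc = "\\codepage = cp1252\n\\unicode_font = Charis SIL\n"
    print(style_font(desc, "Legacy Font", True))
```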