TITLE

shlex - typesets a dictionary

SYNOPSIS

  shlex -c config.xml [-o outfile] [-s style_info] [-b backend] [-v var=val] [-n] [-d] infile
  shlex -h

Typesets an sfm dictionary according to a configuration to generate output in a given format.

OPTIONS

  -b backend        Which backend module to use [html]
  -c config.xml     Configuration file for controlling the typesetting
  -d                Enable the output of debug elements
  -h                Give full help
  -n                Sort keys by line number of first occurrence (i.e. no sort)
  -o outfile        Where to store the output (or stdout)
  -s style_info     Backend specific configuration
  -v var=val        Assign values

DESCRIPTION

This documentation is aimed at programmers rather than providing a tutorial. I'm too close to the code to write the tutorial at this time.

The best way to really control shlex is to have some kind of understanding of how it works even if the understanding is somewhat high level.

A shlex configuration consists of 2 parts: parameters to be passed to the backend, and a sequence of sections. Each section corresponds to a major section of the dictionary, whether the main dictionary output or an index of some kind, such as a reverse finder or a semantic-domain index. A section sorts the dictionary according to the given sort keys and then processes each nodeset of records sharing a key by passing it to the group in the section called main.

shlex is basically a nodeset processor. It thinks in terms of nodesets. Nodesets are a concept borrowed from XML. For an sfm style database a nodeset can be thought of as a set of ranges. Each range is a contiguous sequence of fields within a record, for example the whole record or all of the fields from one marker up to but not including the next occurrence of a field with the same marker. A set of ranges can be collected from different records (in fact very often each range is from a different record). shlex works with such node or range sets by passing them from parent element to child element according to the type of element involved.
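To make the idea of a range concrete, here is a minimal Python sketch. It is a model for illustration only, not shlex's Perl code: the record layout (a list of marker and text pairs), the sample markers lx, se and ge, and the function name are all assumptions.

```python
# A record is modelled as an ordered list of (marker, text) fields.
# This layout is an assumption for illustration, not shlex's internal format.
record = [
    ("lx", "fruble"),
    ("se", "fruble up"),
    ("ge", "to tidy"),
    ("se", "fruble down"),
    ("ge", "to untidy"),
]

def ranges_for_marker(fields, marker):
    """Return (start, end) index pairs, one per occurrence of `marker`.

    Each range runs from a field with the given marker up to, but not
    including, the next field with the same marker (or the record's end).
    """
    starts = [i for i, (m, _) in enumerate(fields) if m == marker]
    ends = starts[1:] + [len(fields)]
    return list(zip(starts, ends))

whole_record = [(0, len(record))]             # the whole record as one range
subentries = ranges_for_marker(record, "se")  # [(1, 3), (3, 5)]
```

A nodeset in this model is just a list of such ranges, possibly drawn from several records.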

A group element takes a nodeset and passes it to each of its children in turn. There are 3 other elements that take nodesets in the way a group element does. The marker element iterates through its nodeset and passes each node to each of its child elements in turn. The switch element passes its nodeset to each of its children in turn and stops when one of the children returns true, which is useful for alternation. The use-group element passes its nodeset to another named group in the section.
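The difference between these elements is easiest to see in a toy model. The Python sketch below is only an illustration under simplifying assumptions: elements are plain functions, a nodeset is a list, tag matching is ignored, and the output helper and log list are hypothetical stand-ins for real output.

```python
log = []  # collected output, standing in for the typeset document

def output(text, test=lambda nodeset: True):
    """Leaf element: emits text when the nodeset is non-empty and the test passes."""
    def run(nodeset):
        if nodeset and test(nodeset):
            log.append(text)
            return True
        return False
    return run

def group(*children):
    """Passes the whole nodeset to every child in turn."""
    def run(nodeset):
        return any([child(nodeset) for child in children])
    return run

def marker(*children):
    """Passes each node, one at a time, to every child in turn."""
    def run(nodeset):
        return any([child([node]) for node in nodeset for child in children])
    return run

def switch(*children):
    """Passes the nodeset to each child in turn, stopping at the first true result."""
    def run(nodeset):
        return any(child(nodeset) for child in children)
    return run

nodes = ["node1", "node2"]
# The first alternative fails its test, the second succeeds, the third is never tried.
switch(output("A", test=lambda ns: False), output("B"), output("C"))(nodes)
# The marker runs its child once per node.
marker(output("x"))(nodes)
```

After these two calls the log holds B once and x twice.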

The nodeset that is passed can be filtered using the test attribute. It can also be sorted on fields within each node. But this is true for any of the nodeset generating elements.

The configuration file is an XML file that controls the typesetting of each dictionary entry. The language consists of the following key elements.

group

Each group is a sequence of elements which are processed in order to output information. Each group is named and processing starts by executing all the elements in the group named main. A group is executed over a range of markers, which starts out being all the markers in the record.

use-group

Executes another group using the same range of markers that is active for the use-group element.

marker

A marker matches all occurrences of the given sfm (named in the tag attribute) within the parent range of markers. For each match, the optional test is run. If the test passes, processing proceeds in order: if a paragraph attribute is given, a new paragraph is started with that style; any before text is output in the default style of the paragraph; if a style attribute exists, the contents of the field are output using that character style; any children are output as if the marker were a group; and the after text is output using the default paragraph style. Finally, use-group passes its nodeset to another named group in the section.

An element may assemble its nodeset from 3 sources.

First, it can use the nodeset it has been passed, in which case it can filter it using the test attribute and sort it on any fields in each node (the last field occurring is used).

Second, it can make a set of subnodes by taking a marker in each node and assembling a nodeset for that node, in which each subnode is the field with that marker up to the next field with the same marker, which starts a new subnode. All the subnodes from all the nodes in the set are then joined into one big nodeset, which again can be filtered and sorted.

Third, a nodeset may be assembled by looking up records against a particular set of fields in the incoming nodeset. Each node is searched for the first occurrence of each field in the keys attribute and a search key is created. This key is looked up in an index and the resulting nodes are added to the output nodeset. The output nodeset can be filtered and sorted.
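The index lookup is the least obvious of the three, so here is a rough Python model of it. Everything named here (the record layout, the cf marker used as a cross-reference, the function names) is an assumption made for the example; only the behaviour it imitates comes from this documentation: an index stores every combination of its key fields, a lookup uses the first occurrence of each key field, and multi-field keys are joined with a null character.

```python
from itertools import product

def values(fields, marker):
    """All texts of fields with this marker, in record order."""
    return [text for m, text in fields if m == marker]

def build_index(records, keys):
    """Index every combination of the key fields found in each record."""
    index = {}
    for record in records:
        for combo in product(*(values(record, k) for k in keys)):
            index.setdefault("\0".join(combo), []).append(record)
    return index

def lookup(nodeset, index, keys):
    """Build one search key per node from the first occurrence of each key field."""
    found = []
    for node in nodeset:
        firsts = [values(node, k)[:1] for k in keys]
        if all(firsts):  # skip nodes that lack one of the key fields
            found.extend(index.get("\0".join(v[0] for v in firsts), []))
    return found

cat = [("lx", "cat"), ("ge", "feline")]
dog = [("lx", "dog"), ("cf", "cat"), ("cf", "wolf")]

by_headword = build_index([cat, dog], keys=["lx"])
related = lookup([dog], by_headword, keys=["cf"])  # finds the record for "cat"
```

Only the first cf field of the dog record is used, so the second cross-reference is not looked up in this model.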

picture

Outputs a picture whose filename (relative or absolute) is given by the value of the node. One picture is output for each node in the nodeset.

output

Outputs the given text attribute, or the calculated value attribute, if the optional test passes. Returns true if the test passes.

debug

This is much like an output element, but its output is sent to STDOUT for debugging purposes, and only when the -d option is used on the command line.

inline-map

It is possible for some fields to contain style markup, for example:

 \nt Rarely does |fv{fruble} interact with this word
 

The inline-map element contains a list of such style markup elements and then can be applied to a particular field as it is output.

inline-style

This element occurs within an inline-map and maps a particular style markup to an output style.
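As a rough picture of what applying an inline-map involves, the following Python sketch splits a field into styled runs. It is an assumption-laden model rather than shlex's implementation: the tuple layout, the function name and the style name "vernacular" are invented for the example, while the begin and end strings correspond to the attributes of inline-style.

```python
def split_runs(text, styles):
    """Split text into (style, text) runs.

    styles is a list of (begin, end, style_name) tuples, one per inline-style.
    Text outside any markup gets the style None. Runs may nest; the innermost
    open style wins.
    """
    runs, stack, buf, i = [], [], "", 0

    def flush():
        nonlocal buf
        if buf:
            runs.append((stack[-1][1] if stack else None, buf))
            buf = ""

    while i < len(text):
        opened = next((s for s in styles if text.startswith(s[0], i)), None)
        if opened:                                   # a begin string starts a run
            flush()
            stack.append((opened[1], opened[2]))     # remember (end, style)
            i += len(opened[0])
        elif stack and text.startswith(stack[-1][0], i):
            flush()                                  # the end string closes the run
            i += len(stack[-1][0])
            stack.pop()
        else:
            buf += text[i]
            i += 1
    flush()
    return runs

inline_map = [("|fv{", "}", "vernacular")]
runs = split_runs("Rarely does |fv{fruble} interact with this word", inline_map)
```

Begin strings are tried before the end string, which is why an end string must not be the start of any begin string in the same map.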

Indices

Indices are named globally regardless of where they are defined. Each section also creates an index named after the section, prefixed with an underscore (_). An index may be used to access files other than the input file. A section works by iterating over the keys in the index, creating a nodeset from all the records that match each key. It then passes that nodeset to the letter element named primary and then on to the group element named main.

The letter element is used to control the section headings used for character change headings and groupings.

Expression functions

The test attribute is a Perl expression that can be used to further constrain the marker used. The following functions are provided. Remember that since this is a Perl expression, even node paths need to be quoted as strings.

value(nodeset|field, [index])

This takes a string and an optional index. Each nodeset knows about its parent so it is possible to track back through the nodesets to one corresponding to a particular ancestral element in the configuration. In addition, each node in a nodeset has a string associated with it which is considered the value of the node. A nodeset is considered as an array of nodes and by default the first node in a nodeset is used when a string is required.

In addition it is possible to query the fields in the node being processed or filtered. Note it is not possible to query the fields in any other nodes. Each field is considered to be an array of all fields with that name. Again the default index is 0.

count(nodeset|field)

Returns the number of nodes or fields in the referenced array of nodes or fields.
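The "array of fields" view behind value and count can be modelled in a few lines of Python. This is a sketch only: the real functions take a string naming a nodeset or field and work on the current node implicitly, whereas this model passes the node explicitly, and returning an empty string for a missing field is an assumption.

```python
# The node being processed, modelled as (marker, text) pairs (an assumed layout).
node = [("lx", "fruble"), ("ge", "tidy"), ("ge", "arrange")]

def value(fields, marker, index=0):
    """Text of the index-th field with this marker; the default index is 0."""
    matches = [text for m, text in fields if m == marker]
    return matches[index] if index < len(matches) else ""

def count(fields, marker):
    """Number of fields with this marker in the node."""
    return sum(1 for m, _ in fields if m == marker)

first_gloss = value(node, "ge")       # "tidy"
second_gloss = value(node, "ge", 1)   # "arrange"
gloss_count = count(node, "ge")       # 2
```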

position(nodeset)

The marker element iterates over its output nodeset. A child can query the number of the iteration it is being executed in within a marker's nodeset. All other elements just take on the position of their parent.

firstchar(string, sort, level, ignore)

This takes a string and finds the first character in it. It uses sort principles of including characters at a given level and ignoring or including characters of a lower level. Thus if a is a primary character, ' a secondary one and * a tertiary one:

    firstchar("a*'", "unicode", 0, 0) -> "a"   ("*", "'")
    firstchar("a*'", "unicode", 1, 0) -> "a"   ("*", "'")
    firstchar("a*'", "unicode", 0, 1) -> "a"   ()
    firstchar("a*'", "unicode", 1, 1) -> "a'"  ()

The sort parameter specifies the sort type to use, be it numeric, unicode, nothing or some other. See the section on sorting.

split(string)

This returns an array. Keys made from multiple fields are joined using a null character. This function splits them up.

join(string, [string, ...])

Joins elements together to form a splittable string.
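The relationship between join and split amounts to a round trip through a null-separated string. A minimal Python model, with hypothetical function names, looks like this:

```python
KEY_SEPARATOR = "\0"  # keys made from several fields are joined with a null character

def join_key(*parts):
    """Join field values into one splittable key string."""
    return KEY_SEPARATOR.join(parts)

def split_key(key):
    """Split a joined key back into its field values."""
    return key.split(KEY_SEPARATOR)

key = join_key("fruble", "v")
parts = split_key(key)  # ["fruble", "v"]
```

The round trip only works if the field values themselves contain no null characters.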

regexp(string, regexp)

This runs the regexp over the string, returning the first group ($1).

cmp(string, string, sort, level)

Compares two strings using a given sort order at a given level. Returns 1 if the first string is 'greater' than the second string; 0 if they are equal and -1 otherwise.

lower(string)

Returns the lowercase form of string.

upper(string)

Returns the uppercase form of string.

Attributes

Here we list all the key attributes and what they do and mean.

after

The contents of this attribute are output once after all iterations over the nodeset.

before

The contents of this attribute are output once before the nodeset is processed.

begin

Occurs in an inline-style element and gives the identifying string that starts a run in this particular style.

between

The contents of this attribute are output before the indent and before text, but only if this is not the first node being processed.

empty

By default if the contents of the field that defines the start of a node is empty then no node is generated, unless this attribute is non-zero. If an empty nodeset is being processed, then no action occurs. This is for skipping empty fields.

end

Occurs in an inline-style element and gives the identifying string that closes a run in this particular style. Runs may be nested and the end attribute does not have to be unique. The end attribute may not occur as an initial substring of any begin attribute in any inline-style in the parent inline-map.

filename

Indexes (including sections) may draw their data from a file other than the input file. The filename is the file to use for the index, and all access to that index will be drawn from that file.

indent

Specifies how much indent to insert before any before text or output. No indent occurs at the beginning of a paragraph. Values supported are:

newline

Inserts a line break (not a paragraph break).

none

Don't output any space.

space

Output a space. This is the default.

tab

Output a tab.

Note that if there is no output (i.e. style isn't defined) then the default behaviour for indent becomes none.

index

Assembles the output nodeset by assembling lookup keys from the input nodeset and then using these to create an output nodeset according to sort.

inline-map

This attribute indicates that the given inline-map should be used to map style markup to character styles for any field data processed by this element. Notice that the inline-map attribute is inherited by all of this element's children.

keys

Keys have two uses in shlex.

In an index element, keys specifies the fields of a record to use when creating an index key, and all combinations are indexed.

In a processing element, keys specifies the fields to use from the input nodeset when constructing the value of a node. Node values are used for sorting the nodes in a nodeset. Note that only the first occurrence of each field is used.

limit

A record may contain subrecords, and a field may occur both in the record and in its subrecords, which makes it difficult to avoid picking up the subrecord fields when processing the main record. This attribute makes a tag selection stop collecting fields into a node when a particular marker is hit. The attribute is a space-separated list of markers to stop processing at.
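The effect of limit can be sketched in Python as follows. This is a model under assumptions (records as lists of marker and text pairs, a hypothetical collect_nodes function, sample markers lx, ps, ge and se), not the actual shlex code.

```python
def collect_nodes(fields, tag, limit=()):
    """Build one node per occurrence of `tag`.

    A node collects the fields that follow its tag until the next field with
    the same tag, or until any marker listed in `limit` is hit.
    """
    nodes, current = [], None
    for marker, text in fields:
        if marker == tag:
            if current is not None:
                nodes.append(current)
            current = [(marker, text)]       # a new node starts here
        elif marker in limit:
            if current is not None:
                nodes.append(current)        # a limit marker closes the node
            current = None
        elif current is not None:
            current.append((marker, text))
    if current is not None:
        nodes.append(current)
    return nodes

record = [
    ("lx", "fruble"), ("ps", "v"), ("ge", "tidy"),
    ("se", "fruble up"), ("ps", "v"), ("ge", "tidy up"),
]

without_limit = collect_nodes(record, "lx")                     # subentry fields included
with_limit = collect_nodes(record, "lx", limit="se".split())    # stops at the subentry
```

With the limit in place, the main entry's node holds only its own ps and ge fields.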

map

Used in conjunction with an index lookup. This specifies a conversion map to map from input markers to markers to be used inside shlex.

name

Used to name elements, for example groups and indexes, so that they can be referenced again in use-group elements for example.

neg

Negates the result of a processing element. A processing element is considered to be true if it creates a non-empty nodeset and the processing of any children according to that nodeset results in at least one true result. This attribute is useful in a switch element to say: if the test is true then continue processing the other children, otherwise stop.

paragraph

Specifies that this element starts a new paragraph of the given style. Only one paragraph is started for one element.

path

This value is inserted before a picture filename to locate the actual picture file. A warning is given if the file is not found. Since the value is simply prepended to the picture filename, a trailing / is required (or whatever separator is appropriate for your OS; note that on Windows / is easier to use).
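Because the path is joined by plain concatenation, a missing trailing separator silently produces the wrong filename. A tiny Python model of that behaviour, with a hypothetical function name and made-up file names, shows the difference:

```python
import os
import sys

def picture_file(path, filename):
    """Locate a picture by plain concatenation, warning if the file is missing."""
    full = path + filename  # no separator is added automatically
    if not os.path.exists(full):
        print(f"warning: picture not found: {full}", file=sys.stderr)
    return full

good = picture_file("pictures/", "cat.png")  # "pictures/cat.png"
bad = picture_file("pictures", "cat.png")    # "picturescat.png"
```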

scale

This gives a floating point multiplier for the size of an image inserted into the document from a picture element. The size of the image is taken as its pixel size in pts (i.e. presuming 72dpi). Use scale to linearly scale to account for a difference between this and what the real resolution is.
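The arithmetic behind scale is simple enough to write down. In this Python sketch the function names and the sample image dimensions are assumptions; the rule it encodes, one pixel per point before scaling, is the one described above.

```python
def image_size_pt(width_px, height_px, scale=1.0):
    """Size in points: one pixel counts as one point (72 dpi) before scaling."""
    return (width_px * scale, height_px * scale)

def scale_for_dpi(actual_dpi):
    """Scale factor that corrects for an image whose real resolution is actual_dpi."""
    return 72.0 / actual_dpi

# A 600 x 300 pixel scan made at 300 dpi is really 2 x 1 inches,
# so it should come out at about 144 x 72 points.
width_pt, height_pt = image_size_pt(600, 300, scale_for_dpi(300))
```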

sort

Specifies a list of sorting algorithms, one for each of the fields specified in the keys attribute. This attribute is not needed in an index element, since sorting is only done on nodes in a nodeset, not on indexes, which are merely for looking things up. See the section on sorting for more information.

style

Specifies a character style to use to output the value() of a particular node in a nodeset. In the case of group this is the first node of the nodeset. For markers each node in the set causes some output. style may be empty in which case the text is output using the underlying paragraph style.

tag

Specifies a field marker to find within each node from which a new sub node is created. If tag is defined, but empty, the match will occur on any field marker. This results in a nodeset consisting of nodes containing only one field.

test

This is a Perl expression that is used to filter a nodeset. It makes use of the expression functions, and a non-zero result is considered as the test passing.

text

Specifies text to be output. The text is not evaluated and is output unchanged.

unique

When processing an index, only add a record to a nodeset once, i.e. don't have the same record in the nodeset more than once.

value

Specifies an expression to be evaluated and output as text.

Sorting

One of the most complex issues when creating a dictionary is to ensure that the dictionary is sorted correctly in all the different areas where sorting occurs. For this reason shlex supports a relatively powerful array of sorting options. Each of the supported sorting algorithms is listed here.

Sort algorithms are also used for tokenizing, particularly in the firstchar function.

default

This is the default sorting option and uses perl's implicit binary cmp function. The firstchar is simply the first character of the string.

numeric

Does a numeric comparison treating the value as numbers rather than strings. It will also handle strings of the form: x.y.z... as an array of numbers to be compared. firstchar returns the whole string up to the first .
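A small Python model makes the x.y.z behaviour concrete. It is a sketch, not the Perl implementation: the function names are hypothetical, the input is assumed to be well formed (every part a number), and the 1, 0, -1 return convention follows the description of cmp above.

```python
def numeric_key(value):
    """Turn "x.y.z" into a list of numbers to compare part by part."""
    return [float(part) for part in value.split(".")]

def numeric_cmp(a, b):
    """Return 1 if a sorts after b, 0 if they are equal and -1 otherwise."""
    ka, kb = numeric_key(a), numeric_key(b)
    return (ka > kb) - (ka < kb)

def numeric_firstchar(value):
    """The whole string up to the first "."."""
    return value.split(".", 1)[0]

ordered = sorted(["1.10", "1.9", "1.2.3"], key=numeric_key)
# ["1.2.3", "1.9", "1.10"], whereas a plain string sort would put "1.10" first
```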

unicode

This uses Unicode::Collate to compare strings and comparison can be done at different sorting levels. For this the level should be specified after the unicode as in unicode|2. firstchar takes sorting levels into account.

language.lng|order

This sort method uses a Shoebox .lng file and an optional specific order within that language file. If no order is specified then the default sort order is used.

Configuration DTD

A DTD for the configuration file is:

  <!ELEMENT layout (backend*, map*, index*, inline-map*, section+)>

  <!ELEMENT backend (property*)>
  <!ATTLIST backend
        type        CDATA #REQUIRED>
        
  <!ELEMENT inline-map (inline-style+)>
  <!ATTLIST inline-map
        name        CDATA #REQUIRED>
        
  <!ELEMENT inline-style EMPTY>
  <!ATTLIST inline-style
        begin       CDATA #REQUIRED
        end         CDATA #REQUIRED
        style       CDATA #REQUIRED>

  <!ELEMENT property EMPTY>
  <!ATTLIST property
        name        CDATA #REQUIRED
        value       CDATA #REQUIRED>

  <!ELEMENT section ((index | letter)*, group+)>
  <!ATTLIST section
        type        CDATA #REQUIRED
        name        CDATA #IMPLIED
        keys        CDATA #REQUIRED
        sort        CDATA #IMPLIED
        unique      CDATA #IMPLIED
        test        CDATA #IMPLIED>

  <!ELEMENT letter (output | group | marker | switch | use-group | debug | foreach)*>
  <!ATTLIST letter
        name        CDATA #REQUIRED
        test        CDATA #REQUIRED
        value       CDATA #IMPLIED
        text        CDATA #IMPLIED
        paragraph   CDATA #IMPLIED
        before      CDATA #IMPLIED
        style       CDATA #IMPLIED
        after       CDATA #IMPLIED>
        
  <!ELEMENT group (group | use-group | marker | switch | debug | foreach)+>
  <!ATTLIST group
        name        CDATA #REQUIRED
        paragraph   CDATA #IMPLIED
        before      CDATA #IMPLIED
        between     CDATA #IMPLIED
        style       CDATA #IMPLIED
        after       CDATA #IMPLIED
        index       CDATA #IMPLIED
        keys        CDATA #IMPLIED
        sort        CDATA #IMPLIED
        unique      CDATA #IMPLIED
        test        CDATA #IMPLIED
        inline-map  CDATA #IMPLIED
        tag         CDATA #IMPLIED>

  <!ELEMENT use-group EMPTY>
  <!ATTLIST use-group
        name        CDATA #REQUIRED
        text        CDATA #IMPLIED
        index       CDATA #IMPLIED
        keys        CDATA #IMPLIED
        sort        CDATA #IMPLIED
        unique      CDATA #IMPLIED
        test        CDATA #IMPLIED
        inline-map  CDATA #IMPLIED
        tag         CDATA #IMPLIED>

  <!ELEMENT marker (use-group | marker | switch | group | output | debug | foreach)*>
  <!ATTLIST marker
        tag         CDATA #REQUIRED
        test        CDATA #IMPLIED
        paragraph   CDATA #IMPLIED
        indent      CDATA #IMPLIED
        before      CDATA #IMPLIED
        between     CDATA #IMPLIED
        style       CDATA #IMPLIED
        after       CDATA #IMPLIED
        index       CDATA #IMPLIED
        keys        CDATA #IMPLIED
        sort        CDATA #IMPLIED
        unique      CDATA #IMPLIED
        inline-map  CDATA #IMPLIED
        neg         CDATA #IMPLIED>

  <!ELEMENT switch (switch | marker | group | output | debug | use-group | foreach)+>
  <!ATTLIST switch
        tag         CDATA #REQUIRED
        test        CDATA #IMPLIED
        index       CDATA #IMPLIED
        keys        CDATA #IMPLIED
        sort        CDATA #IMPLIED
        unique      CDATA #IMPLIED
        inline-map  CDATA #IMPLIED
        neg         CDATA #IMPLIED>
  
  <!ELEMENT foreach (switch | marker | group | output | debug | use-group | foreach)+>
  <!ATTLIST foreach
        var         CDATA #REQUIRED
        inline-map  CDATA #IMPLIED
        over        CDATA #REQUIRED>

  <!ELEMENT index EMPTY>
  <!ATTLIST index
        filename    CDATA #IMPLIED
        name        CDATA #REQUIRED
        keys        CDATA #REQUIRED>

  <!ELEMENT map (replace*)>
  <!ATTLIST map
        name        CDATA #REQUIRED>

  <!ELEMENT replace EMPTY>
  <!ATTLIST replace
        in          CDATA #REQUIRED
        out         CDATA #REQUIRED>

  <!ELEMENT output EMPTY>
  <!ATTLIST output
        style       CDATA #IMPLIED
        test        CDATA #IMPLIED
        text        CDATA #IMPLIED
        value       CDATA #IMPLIED
        paragraph   CDATA #IMPLIED
        before      CDATA #IMPLIED
        after       CDATA #IMPLIED
        inline-map  CDATA #IMPLIED
        indent      CDATA #IMPLIED>

  <!ELEMENT debug EMPTY>
  <!ATTLIST debug
        test        CDATA #IMPLIED
        text        CDATA #IMPLIED
        value       CDATA #IMPLIED>
  
  <!ELEMENT picture EMPTY>
  <!ATTLIST picture
        path        CDATA #IMPLIED
        scale       CDATA #IMPLIED
        tag         CDATA #IMPLIED
        test        CDATA #IMPLIED
        index       CDATA #IMPLIED
        keys        CDATA #IMPLIED
        sort        CDATA #IMPLIED
        unique      CDATA #IMPLIED
        neg         CDATA #IMPLIED>
        

Limitations

Here are some things that this program won't do.

* Add or change any fields in the database. If you need to munge your data before processing, e.g. splitting up fields, then this should be done before running shlex.

* Handle data that is not Unicode. This program presumes the data it is processing is in Unicode, even if it isn't. If you want to work with legacy encoded data, first convert it to Unicode (say, by treating it as codepage 1252) and then work with it in that form.

TODO

* Add system locale based sorting and configuration.

* Add Chinese sorting (that takes two fields) when I get a Chinese cmp module.

* The documentation is poor and rushed.