
--watch option

SYNOPSIS:

 touch timestamp.file
 treex --watch=timestamp.file my.scen &
 # (or run without & and open another terminal)
 # After all documents are processed, treex is still running, watching timestamp.file.
 # You can modify any modules/blocks and then touch timestamp.file.
 # All modified modules will be reloaded (the number of reloaded modules is printed).
 # The document reader is restarted, so it starts reading the first file again.
 # To exit this "watching loop", either rm timestamp.file or press Ctrl+C.

BENEFITS:

 * Much faster development cycles (e.g. most of the time of en-cs translation is spent on loading).
 * Currently there are non-deterministic problems with loading NER::Stanford; with --watch it gets loaded on all jobs once and then does not have to be reloaded.

TODO:

 * Modules are just reloaded; no constructors are called yet.

NAME

Treex::Core::Run + treex - applying Treex blocks and/or scenarios on data

VERSION

version 2.20160630

SYNOPSIS

In bash:

 > treex myscenario.scen -- data/*.treex
 > treex My::Block1 My::Block2 -- data/*.treex

In Perl:

 use Treex::Core::Run q(treex);
 treex([qw(myscenario.scen -- data/*.treex)]);
 treex([qw(My::Block1 My::Block2 -- data/*.treex)]);

DESCRIPTION

Treex::Core::Run allows you to apply a block, a scenario, or a mixture of both to a set of data files. It is designed to be used primarily from the bash command line, via a thin front-end script called treex. However, the same list of arguments can be passed as an array reference to the treex() function imported from Treex::Core::Run.
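For example, a scenario file can be freely combined with individual blocks in one call, and block parameters use the key=value scenario syntax. A minimal sketch (the scenario and block names are placeholders; it assumes Treex is installed and the blocks exist):

 use Treex::Core::Run qw(treex);

 # Set the language globally, run a scenario file plus one extra block,
 # and list the input files after the "--" separator.
 treex([qw(Util::SetGlobal language=en myscenario.scen My::Block1 -- data/*.treex)]);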

Note that this module supports distributed processing (Linux only!): simply add the -p switch. The treex method then creates a Treex::Core::Parallel::Head object, which extends Treex::Core::Run with parallel processing functionality.
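For example (a sketch only, assuming an SGE cluster with qsub is available; the scenario and data names are placeholders):

 use Treex::Core::Run qw(treex);

 # Distribute the (hypothetical) scenario over 20 SGE jobs.
 treex([qw(-p --jobs 20 myscenario.scen -- data/*.treex)]);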

There are two ways to process the data in parallel. By default, the SGE cluster's qsub is expected to be available. If you have no cluster but want to parallelize the computation at least on a multi-core machine, add the --local switch.
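The local variant looks the same except for the extra switch; a minimal sketch under the same assumptions:

 use Treex::Core::Run qw(treex);

 # No cluster needed: run 4 parallel jobs on the local machine.
 treex([qw(-p --local --jobs 4 myscenario.scen -- data/*.treex)]);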

SUBROUTINES

treex

Creates a new runner and runs the scenario given in the parameters.

USAGE

 usage: treex [-?dEehjLmpqSstv] [long options...] scenario [-- treex_files]
 scenario is a sequence of blocks or *.scen files
 options:
        -h -? --usage --help                Prints this usage information.
        -s --save                           save all documents
        -q --quiet                          Warning, info and debug messages
                                            are suppressed. Only fatal errors
                                            are reported.
        --cleanup                           Delete all temporary files.
        -e STR --error_level STR            Possible values: ALL, DEBUG,
                                            INFO, WARN, FATAL
        -L STR --language STR --lang STR    shortcut for adding
                                            "Util::SetGlobal language=xy" at
                                            the beginning of the scenario
        -S STR --selector STR               shortcut for adding
                                            "Util::SetGlobal selector=xy" at
                                            the beginning of the scenario
        -t --tokenize                       shortcut for adding
                                            "Read::Sentences W2A::Tokenize"
                                            at the beginning of the scenario
                                            (or W2A::XY::Tokenize if used
                                            with --lang=xy)
        --watch STR                         Re-run when the given file is
                                            changed (see the --watch section
                                            above).
        -d --dump_scenario                  Just dump (print to STDOUT) the
                                            given scenario and exit.
        --dump_required_files               Just dump (print to STDOUT) files
                                            required by the given scenario
                                            and exit.
        --cache STR                         Use cache. Required memory is
                                            specified in format
                                            memcached,loading. Numbers are in
                                            GB.
        -v --version                        Print treex and perl version
        -E STR --forward_error_level STR    messages with this level or
                                            higher will be forwarded from the
                                            distributed jobs to the main
                                            STDERR
        -p --parallel                       Parallelize the task on SGE
                                            cluster (using qsub).
        -j INT --jobs INT                   Number of jobs for
                                            parallelization, default 10.
                                            Requires -p.
        --local                             Run jobs locally (might help with
                                            multi-core machines). Requires -p.
        --priority INT                      Priority for qsub, an integer in
                                            the range -1023 to 0 (or 1024 for
                                            admins), default=-100. Requires
                                            -p.
        --memory STR -m STR --mem STR       How much memory should be
                                            allocated for cluster jobs,
                                            default=2G. Requires -p.
                                            Translates to "qsub -hard -l
                                            mem_free=$mem -l h_vmem=2*$mem -l
                                            act_mem_free=$mem". Use --mem=0
                                            and --qsub to set your own SGE
                                            settings (e.g. if act_mem_free is
                                            not available).
        --name STR                          Prefix of submitted jobs.
                                            Requires -p. Translates to "qsub
                                            -N $name-jobname".
        --queue STR                         SGE queue. Translates to "qsub -q
                                            $queue".
        --qsub STR                          Additional parameters passed to
                                            qsub. Requires -p. See --priority
                                            and --mem. You can use e.g.
                                            --qsub="-q *@p*,*@s*" to use just
                                            machines p* and s*. Or e.g.
                                            --qsub="-q *@!(twi*|pan*)" to
                                            skip twi* and pan* machines.
        --workdir STR                       Working directory for temporary
                                            files in parallelized processing.
                                            Directories can be created
                                            automatically using patterns:
                                            {NNN} is replaced by an ordinal
                                            number padded with leading zeros
                                            to the length given by the number
                                            of Ns; {XXXX} is replaced by a
                                            random string whose length equals
                                            the number of Xs (min. 4). If not
                                            specified, directories such as
                                            001-cluster-run, 002-cluster-run,
                                            etc. are created.
        --survive                           Continue collecting jobs' outputs
                                            even if some of them crashed
                                            (risky, use with care!).
        --jobindex INT                      Not to be used manually. If the
                                            number of jobs is J and this
                                            job's index is M, only the I-th
                                            files fulfilling I mod J == M
                                            are processed.
        --outdir STR                        Not to be used manually. Directory
                                            for collecting standard and error
                                            outputs in parallelized
                                            processing.
        --server STR                        Not to be used manually. Used to
                                            point parallel jobs to the head.
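Putting several of the long options together, a typical cluster invocation could look as follows. This is only a sketch: the scenario name, queue name, and data paths are illustrative, and an SGE cluster is assumed.

 use Treex::Core::Run qw(treex);

 # 20 SGE jobs with 4G of memory each, an automatically numbered working
 # directory ({NNN} expands to 001, 002, ...), and a hypothetical queue.
 treex([qw(
     -p --jobs 20 --mem 4G
     --workdir {NNN}-cluster-run
     --queue my.q
     myscenario.scen -- data/*.treex
 )]);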

AUTHORS

Zdeněk Žabokrtský <zabokrtsky@ufal.mff.cuni.cz>

Martin Popel <popel@ufal.mff.cuni.cz>

Martin Majliš

Ondřej Dušek <odusek@ufal.mff.cuni.cz>

COPYRIGHT AND LICENSE

Copyright © 2011-2014 by Institute of Formal and Applied Linguistics, Charles University in Prague

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.