NAME

ShiftJIS::String - functions to manipulate Shift-JIS strings

SYNOPSIS

  use ShiftJIS::String;

  ShiftJIS::String::substr($str, ShiftJIS::String::index($str, $substr));

DESCRIPTION

This module provides functions that emulate the corresponding CORE functions and help you manipulate multibyte character sequences in Shift-JIS.

* 'Hankaku' and 'Zenkaku' mean 'halfwidth' and 'fullwidth' characters in Japanese, respectively.

FUNCTIONS

issjis(LIST)

Returns a boolean indicating whether all the strings in the parameter list are legally encoded in Shift-JIS.

Returns false if LIST includes one or more invalid strings.
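The check described above can be approximated in core Perl with the regular expression given in the CAVEAT section. This is only a sketch: looks_like_sjis is a hypothetical name, and the code is not taken from the module.

```perl
use strict;
use warnings;

# One legal Shift-JIS character, as given in the CAVEAT section.
my $SJIS_CHAR = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper: true only if every string in the list is a
# sequence of legal Shift-JIS characters.
sub looks_like_sjis {
    for my $str (@_) {
        return 0 unless defined $str && $str =~ /\A(?:$SJIS_CHAR)*\z/;
    }
    return 1;
}

# The second call fails because "\x82" is a lead byte with no trail byte.
printf "%s\n", looks_like_sjis("abc", "\x82\xA0") ? "legal" : "illegal";
printf "%s\n", looks_like_sjis("abc", "\x82")     ? "legal" : "illegal";
```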

Length

length(STRING)

Returns the length in characters of the supplied string.
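To see why a character count differs from CORE::length, the following sketch counts characters by matching the CAVEAT regular expression repeatedly. sjis_length is a hypothetical helper written for this illustration, and it assumes the input is valid Shift-JIS.

```perl
use strict;
use warnings;

my $SJIS_CHAR = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper: count characters by matching one legal
# character at a time (assumes the input is valid Shift-JIS).
sub sjis_length {
    my ($str) = @_;
    my $count = () = $str =~ /$SJIS_CHAR/g;
    return $count;
}

# Three double-byte katakana, one space, and "ok".
my $str = "\x83\x65\x83\x58\x83\x67 ok";
printf "%d bytes, %d characters\n", length($str), sjis_length($str);
# prints "9 bytes, 6 characters"
```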

Reverse

strrev(STRING)

Returns a reversed string, i.e., a string that has all characters of STRING but in the opposite order.

index(STRING, SUBSTR)
index(STRING, SUBSTR, POSITION)

Returns the position of the first occurrence of SUBSTR in STRING at or after POSITION. If POSITION is omitted, starts searching from the beginning of the string.

If the substring is not found, returns -1.
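A byte-wise search can report a match in the middle of a double-byte character. The sketch below contrasts CORE::index with a character-wise search; sjis_index is a hypothetical helper, not the module's code, and it assumes valid Shift-JIS input and a non-empty SUBSTR.

```perl
use strict;
use warnings;

my $SJIS_CHAR = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper: position of SUBSTR in STRING, counted in
# characters. Assumes valid Shift-JIS and a non-empty SUBSTR.
sub sjis_index {
    my ($str, $substr, $pos) = @_;
    my @chars  = $str    =~ /$SJIS_CHAR/g;
    my @needle = $substr =~ /$SJIS_CHAR/g;
    $pos = 0 if !defined $pos || $pos < 0;
    for my $i ($pos .. @chars - @needle) {
        return $i if join('', @chars[$i .. $i + $#needle]) eq $substr;
    }
    return -1;
}

# KATAKANA A (0x8341) twice, then ASCII 'A' (0x41).
my $str = "\x83\x41\x83\x41A";
printf "byte-wise: %d, character-wise: %d\n",
    index($str, "A"), sjis_index($str, "A");
# prints "byte-wise: 1, character-wise: 2"
```

The byte-wise result points at the trailing byte of the first katakana, which is not an 'A' at all.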

rindex(STRING, SUBSTR)
rindex(STRING, SUBSTR, POSITION)

Returns the position of the last occurrence of SUBSTR in STRING. If POSITION is specified, returns the last occurrence at or before POSITION.

If the substring is not found, returns -1.

strspn(STRING, SEARCHLIST)

Returns the position of the first occurrence of any character that is not contained in SEARCHLIST.

If STRING consists only of characters in SEARCHLIST, the returned value equals the length of STRING.

While SEARCHLIST is not aware of character ranges, you can utilize mkrange().

  strspn("+0.12345*12", "+-.0123456789");
  # returns 8 (at '*')

strcspn(STRING, SEARCHLIST)

Returns the position of the first occurrence of any character contained in SEARCHLIST.

If STRING does not contain any character in SEARCHLIST, the returned value equals the length of STRING.

While SEARCHLIST is not aware of character ranges, you can utilize mkrange().
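The two forward span functions can be pictured as a walk over the characters of STRING that stops at the first character inside (or outside) SEARCHLIST. The helpers below, sjis_strspn and sjis_strcspn, are hypothetical names for a core-Perl sketch that assumes valid Shift-JIS input; they are not the module's implementation.

```perl
use strict;
use warnings;

my $SJIS_CHAR = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Shared walker: count leading characters while membership in
# SEARCHLIST equals $want (1 for strspn, 0 for strcspn).
sub _span {
    my ($str, $searchlist, $want) = @_;
    my %in = map { $_ => 1 } $searchlist =~ /$SJIS_CHAR/g;
    my $pos = 0;
    for my $ch ($str =~ /$SJIS_CHAR/g) {
        last if ($in{$ch} ? 1 : 0) != $want;
        $pos++;
    }
    return $pos;
}

# Hypothetical helpers mirroring the descriptions above.
sub sjis_strspn  { return _span($_[0], $_[1], 1) }
sub sjis_strcspn { return _span($_[0], $_[1], 0) }

printf "%d\n", sjis_strspn("+0.12345*12", "+-.0123456789");   # 8 (at '*')
printf "%d\n", sjis_strcspn("abc\x81\x40def", "\x81\x40");    # 3 (at IDSP)
```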

rspan(STRING, SEARCHLIST)

Searches for the last occurrence of any character that is not contained in SEARCHLIST.

If such a character is found, returns the position immediately after it; otherwise (every character in STRING is contained in SEARCHLIST), returns 0 (the first position of the string).

While SEARCHLIST is not aware of character ranges, you can utilize mkrange().

rcspan(STRING, SEARCHLIST)

Searches for the last occurrence of any character that is contained in SEARCHLIST.

If such a character is found, returns the position immediately after it; otherwise (no character in STRING is contained in SEARCHLIST), returns 0 (the first position of the string).

While SEARCHLIST is not aware of character ranges, you can utilize mkrange().

Trimming

trim(STRING)
trim(STRING, SEARCHLIST)
trim(STRING, SEARCHLIST, USE_COMPLEMENT)

Erases characters in SEARCHLIST from the beginning and the end of STRING and returns the result.

If USE_COMPLEMENT is true, erases characters that are not contained in SEARCHLIST.

If SEARCHLIST is omitted (or undef), the list of whitespace characters is used, i.e., "\t", "\n", "\r", "\f", "\x20" (SP), and "\x81\x40" (IDSP).

While SEARCHLIST is not aware of character ranges, you can utilize mkrange(), like trim($string, mkrange("\x00-\x20")).
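Trimming has to work on whole characters, because the byte pair "\x81\x40" can also arise from the trailing byte of one character followed by '@'. The following sketch of the default whitespace trimming uses a hypothetical helper, sjis_trim, and assumes valid Shift-JIS input; it is an illustration, not the module's code.

```perl
use strict;
use warnings;

my $SJIS_CHAR = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Default whitespace list from the text above, including IDSP.
my %WS = map { $_ => 1 } ("\t", "\n", "\r", "\f", "\x20", "\x81\x40");

# Hypothetical helper: strip whitespace characters from both ends.
sub sjis_trim {
    my ($str) = @_;
    my @chars = $str =~ /$SJIS_CHAR/g;
    shift @chars while @chars && $WS{ $chars[0] };
    pop   @chars while @chars && $WS{ $chars[-1] };
    return join '', @chars;
}

# IDSP, space, the character 0x8381, '@', tab.
my $trimmed = sjis_trim("\x81\x40 \x83\x81\x40\t");
print unpack('H*', $trimmed), "\n";   # prints "838140"
```

The bytes 0x81 0x40 survive in the result because they belong to two different characters (0x8381 and '@'), not to an ideographic space.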

ltrim(STRING)
ltrim(STRING, SEARCHLIST)
ltrim(STRING, SEARCHLIST, USE_COMPLEMENT)

Erases characters in SEARCHLIST from the beginning of STRING and returns the result.

If USE_COMPLEMENT is true, erases characters that are not contained in SEARCHLIST.

While SEARCHLIST is not aware of character ranges, you can utilize mkrange().

rtrim(STRING)
rtrim(STRING, SEARCHLIST)
rtrim(STRING, SEARCHLIST, USE_COMPLEMENT)

Erases characters in SEARCHLIST from the end of STRING and returns the result.

If USE_COMPLEMENT is true, erases characters that are not contained in SEARCHLIST.

While SEARCHLIST is not aware of character ranges, you can utilize mkrange().

Substring

substr(STRING or SCALAR REF, OFFSET)
substr(STRING or SCALAR REF, OFFSET, LENGTH)
substr(SCALAR, OFFSET, LENGTH, REPLACEMENT)

Works like CORE::substr, but with Shift-JIS character semantics.

If REPLACEMENT is specified as the fourth parameter, replaces part of SCALAR and returns what was there before.

If a reference to a scalar variable is passed as the first argument, an lvalue reference is returned, which you can assign through.

    ${ &substr(\$str,$off,$len) } = $replace;

        works like

    CORE::substr($str,$off,$len) = $replace;

The returned lvalue is not aware of Shift-JIS, so successive assignments through it may cause unexpected results.

If you are not sure, get a fresh lvalue before each assignment.

Split

strsplit(SEPARATOR, STRING)
strsplit(SEPARATOR, STRING, LIMIT)

This function emulates CORE::split, but splits on the literal SEPARATOR string rather than on a pattern. If not in list context, it returns only the number of fields found and does not split into the @_ array.

If an empty string is specified as SEPARATOR, splits the specified string into characters (similarly to CORE::split //, STRING, LIMIT).

  strsplit('', 'This is Perl.', 7);
  # ('T', 'h', 'i', 's', ' ', 'i',  's Perl.')

If an undefined value is specified as SEPARATOR, splits the specified string on whitespace characters (including IDEOGRAPHIC SPACE). Leading whitespace characters do not produce any field (similarly to CORE::split ' ', STRING, LIMIT).

  strsplit(undef, '   This  is   Perl.');
  # ('This', 'is', 'Perl.')

Comparison

strcmp(LEFT-STRING, RIGHT-STRING)

Returns 1 (when LEFT-STRING is greater than RIGHT-STRING), 0 (when LEFT-STRING is equal to RIGHT-STRING), or -1 (when LEFT-STRING is less than RIGHT-STRING).

The order is roughly as follows:

    JIS X 0201 Roman, JIS X 0201 Kana, then JIS X 0208 Kanji (Zenkaku).

For example, 0x41 ('A') is less than 0xB1 (HANKAKU KATAKANA A). 0xB1 is less than 0x8341 (KATAKANA A). 0x8341 is less than 0x8383 (KATAKANA SMALL YA). 0x8383 is less than 0x83B1 (GREEK CAPITAL TAU).

Caveat! Compare the 2nd and the 4th examples. Byte "\xB1" sorts before byte "\x83" when they are leading bytes, but after it when they are trailing bytes. In short, a plain binary comparison does not give the Shift-JIS code point order.

strEQ(LEFT-STRING, RIGHT-STRING)

Returns a boolean indicating whether LEFT-STRING is equal to RIGHT-STRING.

Note: strEQ is an expensive equivalent of CORE's eq operator.

strNE(LEFT-STRING, RIGHT-STRING)

Returns a boolean indicating whether LEFT-STRING is not equal to RIGHT-STRING.

Note: strNE is an expensive equivalent of CORE's ne operator.

strLT(LEFT-STRING, RIGHT-STRING)

Returns a boolean indicating whether LEFT-STRING is less than RIGHT-STRING.

strLE(LEFT-STRING, RIGHT-STRING)

Returns a boolean indicating whether LEFT-STRING is less than or equal to RIGHT-STRING.

strGT(LEFT-STRING, RIGHT-STRING)

Returns a boolean indicating whether LEFT-STRING is greater than RIGHT-STRING.

strGE(LEFT-STRING, RIGHT-STRING)

Returns a boolean indicating whether LEFT-STRING is greater than or equal to RIGHT-STRING.

strxfrm(STRING)

Returns a string transformed so that CORE's cmp can be used for binary comparisons (NOT the length of the transformed string).

I.e. strxfrm($a) cmp strxfrm($b) is equivalent to strcmp($a, $b), as long as your cmp doesn't use any locale other than that of Perl.
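One way to obtain a binary-comparable key is to pad every single-byte character with a leading "\x00", so that all single-byte characters sort before all double-byte ones. The sketch below does this in core Perl. It is an assumption about how such a transform could work, not a statement of how strxfrm() is implemented, and sjis_xfrm is a hypothetical name.

```perl
use strict;
use warnings;

my $SJIS_CHAR = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper: widen single-byte characters to two bytes so
# that a plain cmp follows the order single-byte < double-byte.
sub sjis_xfrm {
    my ($str) = @_;
    return join '',
        map { length($_) == 1 ? "\x00$_" : $_ } $str =~ /$SJIS_CHAR/g;
}

# The five characters from the strcmp example, shuffled.
my @words  = ("\x83\x41", "\xB1", "A", "\x83\xB1", "\x83\x83");
my @sorted = sort { sjis_xfrm($a) cmp sjis_xfrm($b) } @words;

print join(' ', map { unpack 'H*', $_ } @sorted), "\n";
# prints "41 b1 8341 8383 83b1"
```

A plain "\xB1" cmp "\x83\x41" yields 1, which is the opposite of the order listed under strcmp.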

Character Range

mkrange(EXPR, EXPR)

Returns the list of characters (or, if not in list context, a concatenated string) obtained by parsing the specified character range.

A character range is specified with a '-' (HYPHEN-MINUS). The backslashed combinations '\-' and '\\' are used instead of the characters '-' and '\', respectively. The hyphen at the beginning or end of the range is also evaluated as the hyphen itself.

For example, mkrange('+\-0-9a-fA-F') returns ('+', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'A', 'B', 'C', 'D', 'E', 'F').

The order of Shift-JIS characters is: 0x00 .. 0x7F, 0xA1 .. 0xDF, 0x8140 .. 0x9FFC, 0xE040 .. 0xFCFC.

If a true value is specified as the second parameter, reverse character ranges such as '9-0' and 'Z-A' can be used; otherwise, a reverse character range makes the function croak.
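The parsing rules above (ranges, the '\-' and '\\' escapes, a literal hyphen at either end, and the reverse-range switch) can be sketched for single-byte characters as follows. expand_range is a hypothetical helper that handles ASCII only; it ignores double-byte ranges and is not the module's mkrange().

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical helper: expand a range spec made of ASCII characters
# only; "\-" and "\\" stand for a literal '-' and '\'.
sub expand_range {
    my ($spec, $allow_reverse) = @_;
    # Tokens: an escaped pair, or any single ASCII byte.
    my @tok = $spec =~ /\\[\\-]|[\x00-\x7F]/g;
    my @out;
    my $i = 0;
    while ($i < @tok) {
        if ($i + 2 < @tok && $tok[$i + 1] eq '-') {
            # An unescaped hyphen between two tokens marks a range.
            my $lo = ord(substr($tok[$i], -1));
            my $hi = ord(substr($tok[$i + 2], -1));
            if ($lo <= $hi) {
                push @out, map { chr($_) } $lo .. $hi;
            }
            elsif ($allow_reverse) {
                push @out, map { chr($_) } reverse $hi .. $lo;
            }
            else {
                croak "reverse character range in '$spec'";
            }
            $i += 3;
        }
        else {
            push @out, substr($tok[$i], -1);   # drops an escaping backslash
            $i++;
        }
    }
    return wantarray ? @out : join '', @out;
}

print scalar(expand_range('+\-0-9a-fA-F')), "\n";
# prints "+-0123456789abcdefABCDEF"
print scalar(expand_range('9-0', 1)), "\n";
# prints "9876543210"
```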

Transliteration

strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST)
strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST, MODIFIER)
strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST, MODIFIER, PATTERN)
strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST, MODIFIER, PATTERN, TOPATTERN)

Transliterates all occurrences of the characters found in the search list with the corresponding character in the replacement list.

If a reference to a scalar variable is specified as the first argument, returns the number of characters replaced or deleted; otherwise, returns the transliterated string and the specified string is unaffected.

SEARCHLIST and REPLACEMENTLIST

Character ranges (internally utilizing mkrange()) are supported.

If the REPLACEMENTLIST is empty, the SEARCHLIST is replicated.

If the replacement list is shorter than the search list, the final character in the replacement list is replicated until it is long enough (this works differently when the 'd' modifier is used).

MODIFIER

    c   Complement the SEARCHLIST.
    d   Delete found but unreplaced characters.
    s   Squash duplicate replaced characters.
    h   Return a hash (or a hashref in scalar context) of histogram.
    R   Do not use character ranges.
    r   Allow reverse character ranges.
    o   Cache the conversion table internally.

  strtr(\$str, " \x81\x40\n\r\t\f", '', 'd');
    # deletes all whitespace characters including IDEOGRAPHIC SPACE.

If 'h' modifier is specified, returns a hash (or a hashref in scalar context) of histogram (key: a character as a string, value: count), whether the first argument is a reference or not. If you want to get the histogram and the modified string at once, pass a reference as the first argument and use its value after.

If 'R' modifier is specified, '-' is not evaluated as a meta character but HYPHEN-MINUS itself like in tr'''. Compare:

  strtr("90 - 32 = 58", "0-9", "A-J");
    # output: "JA - DC = FI"

  strtr("90 - 32 = 58", "0-9", "A-J", "R");
    # output: "JA - 32 = 58"
    # cf. ($str = "90 - 32 = 58") =~ tr'0-9'A-J';
    # '0' to 'A', '-' to '-', and '9' to 'J'.

If 'r' modifier is specified, you are allowed to use reverse character ranges. For example, strtr($str, "0-9", "9-0", "r") is equivalent to strtr($str, "0123456789", "9876543210").

PATTERN and TOPATTERN

By use of PATTERN and TOPATTERN, you can transliterate the string using lists containing some multi-character substrings.

If called with four arguments, SEARCHLIST, REPLACEMENTLIST, and STRING are split character by character.

If called with five arguments, a multi-character substring that matches PATTERN in SEARCHLIST, REPLACEMENTLIST, or STRING is regarded as a transliteration unit.

If both PATTERN and TOPATTERN are specified, a multi-character substring that matches PATTERN in SEARCHLIST or STRING, or that matches TOPATTERN in REPLACEMENTLIST, is regarded as a transliteration unit.

  print strtr(
    "Caesar Aether Goethe",
    "aeoeueAeOeUe",
    "&auml;&ouml;&uuml;&Auml;&Ouml;&Uuml;",
    "",
    "[aouAOU]e",
    "&[aouAOU]uml;");

  # output: C&auml;sar &Auml;ther G&ouml;the

LISTs as Anonymous Arrays

Instead of specification of PATTERN and TOPATTERN, you can use anonymous arrays as SEARCHLIST and/or REPLACEMENTLIST as follows.

  print strtr(
    "Caesar Aether Goethe",
    [qw/ae oe ue Ae Oe Ue/],
    [qw/&auml; &ouml; &uuml; &Auml; &Ouml; &Uuml;/]
  );

Caching the conversion table

If the 'o' modifier is specified, the conversion table is cached internally. For example,

  foreach (@strings) {
    print strtr($_, $from_list, $to_list, 'o');
  }

will be almost as efficient as this:

  $closure = trclosure($from_list, $to_list);

  foreach (@strings) {
    print &$closure($_);
  }

You can use whichever you like.

Without 'o',

  foreach (@strings) {
    print strtr($_, $from_list, $to_list);
  }

will be very slow since the conversion table is made whenever the function is called.

Generation of the Closure to Transliterate

trclosure(SEARCHLIST, REPLACEMENTLIST)
trclosure(SEARCHLIST, REPLACEMENTLIST, MODIFIER)
trclosure(SEARCHLIST, REPLACEMENTLIST, MODIFIER, PATTERN)
trclosure(SEARCHLIST, REPLACEMENTLIST, MODIFIER, PATTERN, TOPATTERN)

Returns a closure that transliterates the specified string. The return value is just a code reference, not a blessed object. Using this code ref saves time, since you need not specify the parameter list every time.

The functionality of the closure made by trclosure() is equivalent to that of strtr(). In fact, strtr() calls trclosure() internally and uses the returned closure.
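The benefit of a closure is that the conversion table is built once and reused for every string. The core-Perl sketch below illustrates that idea for plain character-to-character transliteration only (no modifiers, ranges, or patterns). make_translator is a hypothetical helper and not the module's trclosure().

```perl
use strict;
use warnings;

my $SJIS_CHAR = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper: build the lookup table once, return a closure.
sub make_translator {
    my ($from, $to) = @_;
    my @from = $from =~ /$SJIS_CHAR/g;
    my @to   = $to   =~ /$SJIS_CHAR/g;
    @to = @from unless @to;          # empty list: replicate SEARCHLIST
    my %table;
    for my $i (0 .. $#from) {
        # A short replacement list reuses its final character.
        my $j = $i <= $#to ? $i : $#to;
        $table{ $from[$i] } = $to[$j] unless exists $table{ $from[$i] };
    }
    return sub {
        my ($str) = @_;
        return join '',
            map { exists $table{$_} ? $table{$_} : $_ } $str =~ /$SJIS_CHAR/g;
    };
}

# Ideographic space to ASCII space; the 0x81 0x40 bytes that straddle
# two characters (0x8381 followed by '@') must stay untouched.
my $z2h = make_translator("\x81\x40", "\x20");
my $out = $z2h->("\x83\x81\x40\x81\x40x");
print unpack('H*', $out), "\n";      # prints "8381402078"
```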

Case of the Alphabet

toupper(STRING)
toupper(SCALAR REF)

Returns an uppercased string of STRING. Converts only half-width Latin characters a-z to A-Z.

If a reference to a scalar variable is specified as the first argument, the string it refers to is uppercased and the number of characters replaced is returned.

tolower(STRING)
tolower(SCALAR REF)

Returns a lowercased string of STRING. Converts only half-width Latin characters A-Z to a-z.

If a reference to a scalar variable is specified as the first argument, the string it refers to is lowercased and the number of characters replaced is returned.

Conversion between hiragana and katakana

If a reference to a scalar variable is specified as the first argument, the string it refers to is converted and the number of characters replaced is returned. Otherwise, the converted string is returned and the specified string is unaffected.

Note
  • The conversion between a voiced (or semi-voiced) hiragana and katakana (a single character), and halfwidth katakana with a voiced or semi-voiced mark (a sequence of two characters) is counted as 1. Similarly, the conversion between hiragana VU, represented by two characters (hiragana U + voiced mark), and katakana VU or halfwidth katakana VU is counted as 1.

  • Conversion concerning halfwidth katakana includes halfwidth symbols: HALFWIDTH IDEOGRAPHIC FULL STOP, HALFWIDTH LEFT CORNER BRACKET, HALFWIDTH RIGHT CORNER BRACKET, HALFWIDTH IDEOGRAPHIC COMMA, HALFWIDTH KATAKANA MIDDLE DOT, HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK, HALFWIDTH KATAKANA VOICED SOUND MARK, HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK. Conversion between hiragana and katakana includes those between hiragana iteration marks and katakana iteration marks.

  • Hiragana WI, WE, small WA and katakana WI, WE, small WA, small KA, small KE will be regarded as hiragana I, E, WA and katakana I, E, WA, KA, KE if the fallback conversion is necessary.

kanaH2Z(STRING)
kanaH2Z(SCALAR REF)

Converts Halfwidth Katakana to Katakana. Hiragana are not affected.

kataH2Z(STRING)
kataH2Z(SCALAR REF)

Converts Halfwidth Katakana to Katakana. Hiragana are not affected.

Note: kataH2Z is an alias of kanaH2Z.

hiraH2Z(STRING)
hiraH2Z(SCALAR REF)

Converts Halfwidth Katakana to Hiragana. Katakana are not affected.

kataZ2H(STRING)
kataZ2H(SCALAR REF)

Converts Katakana to Halfwidth Katakana. Hiragana are not affected.

kanaZ2H(STRING)
kanaZ2H(SCALAR REF)

Converts Hiragana to Halfwidth Katakana, and Katakana to Halfwidth Katakana.

hiraZ2H(STRING)
hiraZ2H(SCALAR REF)

Converts Hiragana to Halfwidth Katakana. Katakana are not affected.

hiXka(STRING)
hiXka(SCALAR REF)

Converts Hiragana to Katakana and Katakana to Hiragana at once. Halfwidth Katakana are not affected.

hi2ka(STRING)
hi2ka(SCALAR REF)

Converts Hiragana to Katakana. Halfwidth Katakana are not affected.

ka2hi(STRING)
ka2hi(SCALAR REF)

Converts Katakana to Hiragana. Halfwidth Katakana are not affected.

Conversion of Whitespace Characters

If a reference to a scalar variable is specified as the first argument, the string it refers to is converted and the number of characters replaced is returned. Otherwise, the converted string is returned and the specified string is unaffected.

spaceH2Z(STRING)
spaceH2Z(SCALAR REF)

Converts "\x20" (space) to "\x81\x40" (ideographic space).

spaceZ2H(STRING)
spaceZ2H(SCALAR REF)

Converts "\x81\x40" (ideographic space) to "\x20" (space).

CAVEAT

A legal Shift-JIS character in this module must match the following regular expression:

   [\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]

Any string from an external source should be checked with the issjis() function, unless you know for certain that it is encoded in Shift-JIS.

Use of an illegal Shift-JIS string may lead to odd results.

Some Shift-JIS double-byte characters have a trailing byte in the range of [\x40-\x7E], viz.,

   @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

The Perl lexer (perhaps) takes no special care of these bytes, so they sometimes cause trouble. For example, a quoted literal ending with a double-byte character whose trailing byte is 0x5C causes a fatal error, since the trailing byte 0x5C escapes the closing quote.

Such a problem doesn't arise when the string is obtained from an external resource. But writing a script that contains Shift-JIS double-byte characters requires the greatest care.

The use of single-quoted heredoc, << '', or \xhh meta characters is recommended in order to define a Shift-JIS string literal.

The safe ASCII-graphic characters, [\x21-\x3F], are:

   !"#$%&'()*+,-./0123456789:;<=>?

They are preferred as the delimiter of quote-like operators.
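The trailing-byte problem can be observed without putting any raw Shift-JIS bytes into the script, by writing the character with \xhh escapes. In the sketch below (core Perl, with the CAVEAT regular expression copied in), a double-byte character whose trailing byte is 0x5C looks like it contains a backslash when examined byte by byte, but not when examined character by character.

```perl
use strict;
use warnings;

my $SJIS_CHAR = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# One double-byte character whose trailing byte is 0x5C ('\'),
# written with \xhh escapes so the source file stays pure ASCII.
my $str = "\x83\x5C";

# Byte-wise, a backslash appears at offset 1.
my $byte_pos = index($str, "\\");

# Character-wise, no character of the string is a backslash.
my $backslashes = grep { $_ eq "\\" } $str =~ /$SJIS_CHAR/g;

printf "byte-wise index: %d, backslash characters: %d\n",
    $byte_pos, $backslashes;
# prints "byte-wise index: 1, backslash characters: 0"
```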

BUGS

This module supposes $[ is always equal to 0, never 1.

AUTHOR

SADAHIRO Tomoyuki <SADAHIRO@cpan.org>

  Copyright(C) 2001-2007, SADAHIRO Tomoyuki. Japan. All rights reserved.

  This module is free software; you can redistribute it
  and/or modify it under the same terms as Perl itself.

SEE ALSO

ShiftJIS::Regexp
ShiftJIS::Collate