ShiftJIS::String - functions to manipulate Shift-JIS strings
use ShiftJIS::String;
ShiftJIS::String::substr($str, ShiftJIS::String::index($str, $substr));
This module provides functions that emulate the corresponding CORE functions and help you manipulate multibyte character sequences in Shift-JIS.
* 'Hankaku' and 'Zenkaku' mean 'halfwidth' and 'fullwidth' characters in Japanese, respectively.
issjis(LIST)
Returns a boolean indicating whether all the strings in the parameter list are legally encoded in Shift-JIS.
Returns false if LIST includes one or more invalid strings.
length(STRING)
Returns the length in characters of the supplied string.
strrev(STRING)
Returns a reversed string, i.e., a string that has all characters of STRING but in the opposite order.
index(STRING, SUBSTR)
index(STRING, SUBSTR, POSITION)
Returns the position of the first occurrence of SUBSTR in STRING at or after POSITION. If POSITION is omitted, starts searching from the beginning of the string.
If the substring is not found, returns -1.
rindex(STRING, SUBSTR)
rindex(STRING, SUBSTR, POSITION)
Returns the position of the last occurrence of SUBSTR in STRING. If POSITION is specified, returns the last occurrence at or before POSITION.
strspn(STRING, SEARCHLIST)
Returns the position of the first occurrence of any character that is not contained in SEARCHLIST.
If STRING consists only of characters in SEARCHLIST, the returned value equals the length of STRING.
SEARCHLIST itself is not aware of character ranges, but you can expand a range with mkrange().
strspn("+0.12345*12", "+-.0123456789"); # returns 8 (at '*')
strcspn(STRING, SEARCHLIST)
Returns the position of the first occurrence of any character contained in SEARCHLIST.
If STRING does not contain any character in SEARCHLIST, the returned value equals the length of STRING.
rspan(STRING, SEARCHLIST)
Searches for the last occurrence of any character that is not contained in SEARCHLIST.
If such a character is found, returns the position immediately after it; otherwise (every character in STRING is contained in SEARCHLIST), returns 0 (the first position of the string).
rcspan(STRING, SEARCHLIST)
Searches for the last occurrence of any character that is contained in SEARCHLIST.
If such a character is found, returns the position immediately after it; otherwise (no character in STRING is contained in SEARCHLIST), returns 0 (the first position of the string).
trim(STRING)
trim(STRING, SEARCHLIST)
trim(STRING, SEARCHLIST, USE_COMPLEMENT)
Erases characters in SEARCHLIST from the beginning and the end of STRING and returns the result.
If USE_COMPLEMENT is true, erases characters that are not contained in SEARCHLIST.
If SEARCHLIST is omitted (or undef), the list of whitespace characters is used, i.e., "\t", "\n", "\r", "\f", "\x20" (SP), and "\x81\x40" (IDSP).
While SEARCHLIST is not aware of character ranges, you can utilize mkrange(), like trim($string, mkrange("\x00-\x20")).
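Why trimming needs character semantics can be shown with a minimal pure-Perl sketch. It does not use the module and is not its implementation: char_rtrim is a hypothetical helper, the character pattern is the legal-character pattern given later in this document, and the input is assumed to be legal Shift-JIS.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only; not part of ShiftJIS::String.
my $CHAR = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Default whitespace list described above.
my %WS = map { $_ => 1 } ("\t", "\n", "\r", "\f", "\x20", "\x81\x40");

# Trim trailing whitespace character by character.
sub char_rtrim {
    my ($str) = @_;
    my @chars = $str =~ /\G$CHAR/g;    # assumes $str is legal Shift-JIS
    pop @chars while @chars && $WS{ $chars[-1] };
    return join '', @chars;
}

# "\x83\x81" is one double-byte character; "\x40" is a separate '@'.
my $str = "\x83\x81\x40";

# A byte-level regex mistakes the bytes "\x81\x40" for IDSP ...
(my $naive = $str) =~ s/(?:[\t\n\r\f\x20]|\x81\x40)+\z//;

# ... while the character-level version leaves the string intact.
my $safe = char_rtrim($str);

printf "naive: %d bytes, safe: %d bytes\n", length $naive, length $safe;
```

The byte-level substitution cuts a character in half, leaving an illegal one-byte string, whereas the character-level version returns all three bytes unchanged.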
ltrim(STRING)
ltrim(STRING, SEARCHLIST)
ltrim(STRING, SEARCHLIST, USE_COMPLEMENT)
Erases characters in SEARCHLIST from the beginning of STRING and returns the result.
rtrim(STRING)
rtrim(STRING, SEARCHLIST)
rtrim(STRING, SEARCHLIST, USE_COMPLEMENT)
Erases characters in SEARCHLIST from the end of STRING and returns the result.
substr(STRING or SCALAR REF, OFFSET)
substr(STRING or SCALAR REF, OFFSET, LENGTH)
substr(SCALAR, OFFSET, LENGTH, REPLACEMENT)
Works like CORE::substr, but with Shift-JIS character semantics.
If REPLACEMENT is specified as the fourth parameter, replaces that part of SCALAR and returns what was there before.
If a reference to a scalar variable is passed as the first argument, an lvalue reference is returned, which you can assign through.
${ &substr(\$str,$off,$len) } = $replace;
works like
CORE::substr($str,$off,$len) = $replace;
The returned lvalue is not aware of Shift-JIS, so successive assignments may cause unexpected results.
If you are not sure, obtain the lvalue anew before each assignment.
strsplit(SEPARATOR, STRING)
strsplit(SEPARATOR, STRING, LIMIT)
This function emulates CORE::split, but splits on the SEPARATOR string, not on a pattern. In non-list context, it returns only the number of fields found and does not split into the @_ array.
If an empty string is specified as SEPARATOR, splits the specified string into characters (similarly to CORE::split //, STRING, LIMIT).
strsplit('', 'This is Perl.', 7); # ('T', 'h', 'i', 's', ' ', 'i', 's Perl.')
If an undefined value is specified as SEPARATOR, splits the specified string on whitespace characters (including IDEOGRAPHIC SPACE). Leading whitespace characters do not produce any field (similarly to CORE::split ' ', STRING, LIMIT).
strsplit(undef, ' This is Perl.'); # ('This', 'is', 'Perl.')
strcmp(LEFT-STRING, RIGHT-STRING)
Returns 1 (when LEFT-STRING is greater than RIGHT-STRING), 0 (when LEFT-STRING is equal to RIGHT-STRING), or -1 (when LEFT-STRING is less than RIGHT-STRING).
The order is roughly as follows:
JIS X 0201 Roman, JIS X 0201 Kana, then JIS X 0208 Kanji (Zenkaku).
For example, 0x41 ('A') is less than 0xB1 (HANKAKU KATAKANA A). 0xB1 is less than 0x8341 (KATAKANA A). 0x8341 is less than 0x8383 (KATAKANA SMALL YA). 0x8383 is less than 0x83B1 (GREEK CAPITAL TAU).
Caveat: compare the 2nd and the 4th examples. The byte "\xB1" sorts before the byte "\x83" when "\x83" is a leading byte, but after it when both are trailing bytes. In short, plain binary ordering does not match the Shift-JIS code point order.
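The caveat can be reproduced with a short pure-Perl sketch. sort_key is a hypothetical helper, not the module's own code: it shows one possible key transform (prefixing each single-byte character with "\x00") under which a plain byte comparison yields the order described above. Input is assumed to be legal Shift-JIS, and the character pattern is the legal-character pattern given later in this document.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; not the module's own code.
my $CHAR = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Prefix every single-byte character with "\x00" so that all single-byte
# characters sort before all double-byte ones under a byte comparison.
sub sort_key {
    my ($str) = @_;
    return join '', map { length($_) == 1 ? "\x00$_" : $_ } $str =~ /\G$CHAR/g;
}

# The five characters from the example above, in arbitrary order.
my @chars = ("\x83\xB1", "\xB1", "\x83\x83", "\x41", "\x83\x41");

my @by_bytes = sort { $a cmp $b } @chars;
my @by_key   = sort { sort_key($a) cmp sort_key($b) } @chars;

print join(' ', map { unpack 'H*', $_ } @by_bytes), "\n";   # 41 8341 8383 83b1 b1
print join(' ', map { unpack 'H*', $_ } @by_key),   "\n";   # 41 b1 8341 8383 83b1
```

The first line shows the halfwidth katakana 0xB1 misplaced at the end by the raw byte comparison; the second line follows the documented order.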
strEQ(LEFT-STRING, RIGHT-STRING)
Returns a boolean indicating whether LEFT-STRING is equal to RIGHT-STRING.
Note: strEQ is an expensive equivalent of the CORE eq operator.
strNE(LEFT-STRING, RIGHT-STRING)
Returns a boolean indicating whether LEFT-STRING is not equal to RIGHT-STRING.
Note: strNE is an expensive equivalent of the CORE ne operator.
strLT(LEFT-STRING, RIGHT-STRING)
Returns a boolean indicating whether LEFT-STRING is less than RIGHT-STRING.
strLE(LEFT-STRING, RIGHT-STRING)
Returns a boolean indicating whether LEFT-STRING is less than or equal to RIGHT-STRING.
strGT(LEFT-STRING, RIGHT-STRING)
Returns a boolean indicating whether LEFT-STRING is greater than RIGHT-STRING.
strGE(LEFT-STRING, RIGHT-STRING)
Returns a boolean indicating whether LEFT-STRING is greater than or equal to RIGHT-STRING.
strxfrm(STRING)
Returns a string transformed so that CORE::cmp can be used for binary comparisons (NOT the length of the transformed string).
That is, strxfrm($a) cmp strxfrm($b) is equivalent to strcmp($a, $b), as long as your cmp doesn't use any locale other than that of Perl.
mkrange(EXPR, EXPR)
Returns the list of characters obtained by parsing the specified character range (in non-list context, as a concatenated string).
A character range is specified with a '-' (HYPHEN-MINUS). The backslashed combinations '\-' and '\\' are used instead of the characters '-' and '\', respectively. The hyphen at the beginning or end of the range is also evaluated as the hyphen itself.
For example, mkrange('+\-0-9a-fA-F') returns ('+', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'A', 'B', 'C', 'D', 'E', 'F').
The order of Shift-JIS characters is: 0x00 .. 0x7F, 0xA1 .. 0xDF, 0x8140 .. 0x9FFC, 0xE040 .. 0xFCFC.
If a true value is specified as the second parameter, reverse character ranges such as '9-0' and 'Z-A' can be used; otherwise, reverse character ranges cause the function to croak.
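The range syntax described above can be sketched in pure Perl. expand_range is a hypothetical, simplified stand-in restricted to single-byte characters; it is not the module's code, it always rejects reverse ranges, and it does not handle double-byte characters as mkrange() does.

```perl
use strict;
use warnings;

# Hypothetical, simplified stand-in for single-byte characters only;
# not the module's mkrange().
sub expand_range {
    my ($expr) = @_;

    # Tokenize into [character, is_range_operator] pairs.
    # '\-' and '\\' yield a literal character; a bare '-' is a range operator.
    my @tok;
    for my $t ($expr =~ /\\[\\-]|./gs) {
        if ($t =~ /\A\\(.)\z/s) { push @tok, [$1, 0] }
        else                    { push @tok, [$t, $t eq '-' ? 1 : 0] }
    }

    my @out;
    my $i = 0;
    while ($i < @tok) {
        my ($from, $is_op) = @{ $tok[$i] };
        if (!$is_op && $i + 2 < @tok && $tok[$i + 1][1]) {
            # FROM '-' TO: expand the whole range.
            my $to = $tok[$i + 2][0];
            die "reverse range '$from-$to'\n" if ord($from) > ord($to);
            push @out, map { chr($_) } ord($from) .. ord($to);
            $i += 3;
        }
        else {
            # A hyphen at the beginning or end stands for itself.
            push @out, $from;
            $i++;
        }
    }
    return wantarray ? @out : join '', @out;
}

print join('', expand_range('+\-0-9a-fA-F')), "\n";   # +-0123456789abcdefABCDEF
```

Like the description of mkrange(), the sketch returns a list in list context and a concatenated string otherwise.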
strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST)
strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST, MODIFIER)
strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST, MODIFIER, PATTERN)
strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST, MODIFIER, PATTERN, TOPATTERN)
Transliterates all occurrences of the characters found in the search list with the corresponding character in the replacement list.
If a reference to a scalar variable is specified as the first argument, returns the number of characters replaced or deleted; otherwise, returns the transliterated string and the specified string is unaffected.
SEARCHLIST and REPLACEMENTLIST
Character ranges (internally utilizing mkrange()) are supported.
If the REPLACEMENTLIST is empty, the SEARCHLIST is replicated.
If the replacement list is shorter than the search list, the final character in the replacement list is replicated until it is long enough (but this works differently when the 'd' modifier is used).
MODIFIER
c   Complement the SEARCHLIST.
d   Delete found but unreplaced characters.
s   Squash duplicate replaced characters.
h   Return a hash (or a hashref in scalar context) of histogram.
R   No use of character ranges.
r   Allow reverse character ranges.
o   Cache the conversion table internally.

strtr(\$str, " \x81\x40\n\r\t\f", '', 'd'); # deletes all whitespace characters, including IDEOGRAPHIC SPACE
If 'h' modifier is specified, returns a hash (or a hashref in scalar context) of histogram (key: a character as a string, value: count), whether the first argument is a reference or not. If you want to get the histogram and the modified string at once, pass a reference as the first argument and use its value after.
If the 'R' modifier is specified, '-' is not evaluated as a meta character but as HYPHEN-MINUS itself, as in tr'''. Compare:
strtr("90 - 32 = 58", "0-9", "A-J");      # output: "JA - DC = FI"
strtr("90 - 32 = 58", "0-9", "A-J", "R"); # output: "JA - 32 = 58"
# cf. ($str = "90 - 32 = 58") =~ tr'0-9'A-J';
# '0' to 'A', '-' to '-', and '9' to 'J'.
If the 'r' modifier is specified, you are allowed to use reverse character ranges. For example, strtr($str, "0-9", "9-0", "r") is equivalent to strtr($str, "0123456789", "9876543210").
PATTERN and TOPATTERN
By use of PATTERN and TOPATTERN, you can transliterate the string using lists containing some multi-character substrings.
If called with four arguments, SEARCHLIST, REPLACEMENTLIST, and STRING are split character by character.
If called with five arguments, a multi-character substring that matches PATTERN in SEARCHLIST, REPLACEMENTLIST, or STRING is regarded as a transliteration unit.
If both PATTERN and TOPATTERN are specified, a multi-character substring that matches PATTERN in SEARCHLIST or STRING, or that matches TOPATTERN in REPLACEMENTLIST, is regarded as a transliteration unit.
print strtr(
    "Caesar Aether Goethe",
    "aeoeueAeOeUe",
    "&auml;&ouml;&ouml;&Auml;&Ouml;&Uuml;",
    "",
    "[aouAOU]e",
    "&[aouAOU]uml;");
# output: C&auml;sar &Auml;ther G&ouml;the
LISTs as Anonymous Arrays
Instead of specifying PATTERN and TOPATTERN, you can use anonymous arrays as SEARCHLIST and/or REPLACEMENTLIST as follows.
print strtr(
    "Caesar Aether Goethe",
    [qw/ae oe ue Ae Oe Ue/],
    [qw/&auml; &ouml; &ouml; &Auml; &Ouml; &Uuml;/]
);
Caching the conversion table
If the 'o' modifier is specified, the conversion table is cached internally. For example,
foreach (@strings) { print strtr($_, $from_list, $to_list, 'o'); }
will be almost as efficient as this:
$closure = trclosure($from_list, $to_list); foreach (@strings) { print &$closure($_); }
You can use whichever you like.
Without 'o',
foreach (@strings) { print strtr($_, $from_list, $to_list); }
will be very slow, since the conversion table is rebuilt each time the function is called.
trclosure(SEARCHLIST, REPLACEMENTLIST)
trclosure(SEARCHLIST, REPLACEMENTLIST, MODIFIER)
trclosure(SEARCHLIST, REPLACEMENTLIST, MODIFIER, PATTERN)
trclosure(SEARCHLIST, REPLACEMENTLIST, MODIFIER, PATTERN, TOPATTERN)
Returns a closure to transliterate the specified string. The return value is a plain code reference, not a blessed object. Using this code reference saves time, since you need not specify the parameter list on every call.
The functionality of the closure made by trclosure() is equivalent to that of strtr(). In fact, strtr() calls trclosure() internally and uses the returned closure.
toupper(STRING)
toupper(SCALAR REF)
Returns an uppercased string of STRING. Converts only half-width Latin characters a-z to A-Z.
If a reference to a scalar variable is specified as the first argument, the string it refers to is uppercased and the number of characters replaced is returned.
tolower(STRING)
tolower(SCALAR REF)
Returns a lowercased string of STRING. Converts only half-width Latin characters A-Z to a-z.
If a reference to a scalar variable is specified as the first argument, the string it refers to is lowercased and the number of characters replaced is returned.
If a reference to a scalar variable is specified as the first argument, the string it refers to is converted and the number of characters replaced is returned. Otherwise, the converted string is returned and the specified string is unaffected.
The conversion between a voiced (or semi-voiced) hiragana and katakana (a single character), and halfwidth katakana with a voiced or semi-voiced mark (a sequence of two characters) is counted as 1. Similarly, the conversion between hiragana VU, represented by two characters (hiragana U + voiced mark), and katakana VU or halfwidth katakana VU is counted as 1.
Conversion concerning halfwidth katakana includes halfwidth symbols: HALFWIDTH IDEOGRAPHIC FULL STOP, HALFWIDTH LEFT CORNER BRACKET, HALFWIDTH RIGHT CORNER BRACKET, HALFWIDTH IDEOGRAPHIC COMMA, HALFWIDTH KATAKANA MIDDLE DOT, HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK, HALFWIDTH KATAKANA VOICED SOUND MARK, HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK. Conversion between hiragana and katakana includes those between hiragana iteration marks and katakana iteration marks.
Hiragana WI, WE, small WA and katakana WI, WE, small WA, small KA, small KE will be regarded as hiragana I, E, WA and katakana I, E, WA, KA, KE if the fallback conversion is necessary.
kanaH2Z(STRING)
kanaH2Z(SCALAR REF)
Converts Halfwidth Katakana to Katakana. Hiragana are not affected.
kataH2Z(STRING)
kataH2Z(SCALAR REF)
Note: kataH2Z is an alias of kanaH2Z.
hiraH2Z(STRING)
hiraH2Z(SCALAR REF)
Converts Halfwidth Katakana to Hiragana. Katakana are not affected.
kataZ2H(STRING)
kataZ2H(SCALAR REF)
Converts Katakana to Halfwidth Katakana. Hiragana are not affected.
kanaZ2H(STRING)
kanaZ2H(SCALAR REF)
Converts Hiragana to Halfwidth Katakana, and Katakana to Halfwidth Katakana.
hiraZ2H(STRING)
hiraZ2H(SCALAR REF)
Converts Hiragana to Halfwidth Katakana. Katakana are not affected.
hiXka(STRING)
hiXka(SCALAR REF)
Converts Hiragana to Katakana and Katakana to Hiragana at once. Halfwidth Katakana are not affected.
hi2ka(STRING)
hi2ka(SCALAR REF)
Converts Hiragana to Katakana. Halfwidth Katakana are not affected.
ka2hi(STRING)
ka2hi(SCALAR REF)
Converts Katakana to Hiragana. Halfwidth Katakana are not affected.
spaceH2Z(STRING)
spaceH2Z(SCALAR REF)
Converts "\x20" (space) to "\x81\x40" (ideographic space).
spaceZ2H(STRING)
spaceZ2H(SCALAR REF)
Converts "\x81\x40" (ideographic space) to "\x20" (space).
A legal Shift-JIS character in this module must match the following regular expression:
[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]
Any string from an external source should be checked with the issjis() function, unless you know for certain that it is encoded in Shift-JIS.
Use of an illegal Shift-JIS string may lead to odd results.
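To make the definition concrete, the pattern above can be applied directly in pure Perl. The sketch below is only an illustration under that definition; looks_like_sjis and char_count are hypothetical names, and in real code the module's issjis() and length() are the functions to use.

```perl
use strict;
use warnings;

# Hypothetical helpers mirroring the pattern above; not part of the module.
my $CHAR = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# True if the whole byte string is a sequence of legal characters.
sub looks_like_sjis {
    my ($bytes) = @_;
    return $bytes =~ /\A(?:$CHAR)*\z/ ? 1 : 0;
}

# Number of characters in a string that is assumed to be legal.
sub char_count {
    my ($bytes) = @_;
    my @chars = $bytes =~ /\G$CHAR/g;
    return scalar @chars;
}

my $ok  = "A\xB1\x83\x41";   # 'A', halfwidth katakana A, fullwidth katakana A
my $bad = "A\x83";           # leading byte with no trailing byte

printf "%d %d %d\n", looks_like_sjis($ok), looks_like_sjis($bad), char_count($ok);   # 1 0 3
```

The four-byte string counts as three characters, while the string ending in a bare leading byte is rejected.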
Some Shift-JIS double-byte characters have a trailing byte in the range of [\x40-\x7E], viz.,
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
The Perl lexer (perhaps) takes no special care of these bytes, so they sometimes cause trouble. For example, a quoted literal ending with a double-byte character whose trailing byte is 0x5C causes a fatal error, since the trailing byte 0x5C escapes the closing quote.
Such a problem doesn't arise when the string is obtained from an external source. But writing a script that contains Shift-JIS double-byte characters requires great care.
The use of single-quoted heredoc, << '', or \xhh meta characters is recommended in order to define a Shift-JIS string literal.
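Both recommendations can be tried in a short sketch that needs no Shift-JIS bytes in the script itself. The code point 0x955C is chosen here only as an example of a double-byte character whose trailing byte is 0x5C; the heredoc body is plain ASCII for the same reason.

```perl
use strict;
use warnings;

# Build the double-byte character 0x955C from \xhh escapes, so the raw
# two-byte sequence never appears inside a quoted literal in the source.
my $char = "\x95\x5C";

# Its trailing byte is the same byte as an ASCII backslash.
my $trail = substr($char, -1);
print $trail eq "\\" ? "trailing byte is 0x5C\n" : "no 0x5C\n";

# A single-quoted heredoc takes its body literally, so a 0x5C at the end
# of a line escapes nothing.
my $text = <<'END';
path\
END
print length($text), "\n";   # 6: the five bytes of "path\" plus the newline
```

Had the same two bytes been typed directly before a closing double quote, the final 0x5C would have been read as the start of an escape sequence.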
The safe ASCII-graphic characters, [\x21-\x3F], are:
!"#$%&'()*+,-./0123456789:;<=>?
They are preferred as the delimiter of quote-like operators.
This module supposes $[ is always equal to 0, never 1.
SADAHIRO Tomoyuki <SADAHIRO@cpan.org>
Copyright(C) 2001-2007, SADAHIRO Tomoyuki. Japan. All rights reserved. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.