Unicode::Transform - conversion among Unicode Transformation Formats
    use Unicode::Transform ':all';

    $unicode_string = utf16be_to_unicode($utf16be_string);
    $utf16le_string = unicode_to_utf16le($unicode_string);
    $utf8_string    = utf32be_to_utf8($utf32be_string);
    $utf8_string    = utf32be_to_utf8(\&chr_utf8, $utf32be_string);
                      # ill-formed octet sequences are allowed.
This module provides functions that convert a string from one Unicode Transformation Format (UTF) to another.
(Exporting: use Unicode::Transform ':conv';)
<SRC_UTF_NAME>_to_<DST_UTF_NAME>([CALLBACK,] STRING)
Returns a string in DST_UTF_NAME corresponding to STRING in SRC_UTF_NAME.
Function names
A function name consists of SRC_UTF_NAME, the string '_to_', and DST_UTF_NAME. SRC_UTF_NAME and DST_UTF_NAME must each be one of the following names, which are the UTF names lowercased and with hyphens removed:
    unicode    (for Perl internal Unicode encoding; see perlunicode)
    utf16le    (for UTF-16LE)
    utf16be    (for UTF-16BE)
    utf32le    (for UTF-32LE)
    utf32be    (for UTF-32BE)
    utf8       (for UTF-8)
    utf8mod    (for UTF-8-Mod)
    utfcp1047  (for CP1047-oriented UTF-EBCDIC)
In all, 64 (i.e. 8 times 8) functions are available. Available function names include utf16be_to_utf32le() and utf8_to_unicode(). DST_UTF_NAME may be the same as SRC_UTF_NAME, as in utf8_to_utf8().
Conversions in which both SRC_UTF_NAME and DST_UTF_NAME begin with 'utf' are well defined and stable. In contrast to these UTFs, the Perl internal Unicode encoding is influenced by platform-dependent features (e.g. 32bit/64bit, ASCII/EBCDIC).
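The naming rule above can be sketched in plain Perl. This is only an illustration of how the 64 names are formed; the variable names @utf_names and @function_names are hypothetical and the snippet does not load the module.

```perl
use strict;
use warnings;

# The eight UTF names listed above (lowercased, hyphens removed).
my @utf_names = qw(unicode utf16le utf16be utf32le utf32be utf8 utf8mod utfcp1047);

# Every SRC/DST pair, including SRC equal to DST, yields one function name.
my @function_names;
for my $src (@utf_names) {
    for my $dst (@utf_names) {
        push @function_names, "${src}_to_${dst}";
    }
}

printf "%d names, first: %s, last: %s\n",
    scalar(@function_names), $function_names[0], $function_names[-1];
```

The list includes identity conversions such as utf8_to_utf8, which may be useful for dropping ill-formed input without changing the encoding.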
Parameters
If the first parameter is a reference, it is regarded as the CALLBACK. A reference is never allowed as STRING. If CALLBACK is given, the second parameter is STRING; otherwise the first is. Currently, only code references are allowed as CALLBACK.
If CALLBACK is omitted, only Unicode scalar values (0x0000..0xD7FF and 0xE000..0x10FFFF) are allowed. Ill-formed octet sequences (which correspond to a code point outside the range of Unicode scalar values) and partial octets (which do not correspond to any code point) are deleted, as if CALLBACK were a code reference that always returns an empty string, sub {''}.
Examples of partial octets: a leading octet in UTF-8 with no following octets, like "\xC2"; the last octet of a UTF-16BE or UTF-16LE string that has an odd number of octets.
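As a minimal sketch of the default behaviour described above (no CALLBACK), the following converts one string containing a partial octet and one containing a surrogate code point. The input strings and variable names are made up for illustration, and the hex values in the comments are what the description above implies rather than captured output.

```perl
use strict;
use warnings;
use Unicode::Transform qw(:conv);

# UTF-8 input: 'A' followed by a lone leading octet (a partial octet).
my $utf8_in  = "A\xC2";
my $utf16be  = utf8_to_utf16be($utf8_in);

# UTF-32BE input: 'A', the surrogate code point 0xD800, then 'B'.
my $utf32be  = "\x00\x00\x00\x41" . "\x00\x00\xD8\x00" . "\x00\x00\x00\x42";
my $utf8_out = utf32be_to_utf8($utf32be);

# Show the results as hex octets.
print unpack('H*', $utf16be),  "\n";   # 0041 if the partial octet is deleted
print unpack('H*', $utf8_out), "\n";   # 4142 if the surrogate is deleted
```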
If CALLBACK is specified, the code reference is called whenever an ill-formed octet sequence or a partial octet appears. The first parameter passed to CALLBACK is the unsigned integer value of its code point; if the value is less than 256, it is a partial octet.
The return value from CALLBACK is inserted at that position. You may use chr_<DST_UTF_NAME>() as CALLBACK (see below). The return value from CALLBACK should be encoded in DST_UTF_NAME.
You can call die or croak in CALLBACK to abort the operation when STRING as a whole is not well-formed.
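The three CALLBACK styles described above might look like the following sketch. The callbacks $replace and $strict, the choice of U+FFFD as a substitute, and the error message text are illustrative assumptions, not part of the module.

```perl
use strict;
use warnings;
use Carp qw(croak);
use Unicode::Transform qw(:conv :chr);

# UTF-32BE input: 'A', the surrogate code point 0xD800, then 'B'.
my $utf32be = "\x00\x00\x00\x41" . "\x00\x00\xD8\x00" . "\x00\x00\x00\x42";

# 1. Keep ill-formed code points by encoding them with chr_utf8().
my $kept = utf32be_to_utf8(\&chr_utf8, $utf32be);

# 2. Substitute U+FFFD for ill-formed sequences and drop partial octets.
#    The return value must already be encoded in the destination UTF.
my $replace = sub {
    my ($value) = @_;
    return $value < 256 ? '' : chr_utf8(0xFFFD);
};
my $replaced = utf32be_to_utf8($replace, $utf32be);

# 3. Abort on the first problem instead of repairing the string.
my $strict = sub { croak sprintf 'malformed input (value 0x%X)', $_[0] };
my $ok = eval { utf8_to_utf16be($strict, "A\xC2"); 1 };

print unpack('H*', $kept),     "\n";
print unpack('H*', $replaced), "\n";
print "conversion aborted: $@" unless $ok;
```

Wrapping the strict conversion in eval lets the caller decide how to report the failure; without it, the exception raised in CALLBACK propagates to the caller as usual.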
(Exporting: use Unicode::Transform ':chr';)
chr_<DST_UTF_NAME>(CODEPOINT)
Returns a string in DST_UTF_NAME corresponding to CODEPOINT. CODEPOINT should be an unsigned integer. If CODEPOINT is outside the range of Unicode scalar values, a corresponding ill-formed octet sequence will be returned.
If CODEPOINT is greater than the maximum value, returns undef. The maximum value of CODEPOINT is:
    0x0010_FFFF  for chr_utf16le() and chr_utf16be()
    0x7FFF_FFFF  for chr_utf8(), chr_utf8mod() and chr_utfcp1047()
    0xFFFF_FFFF  for chr_utf32le() and chr_utf32be()
The maximum value of CODEPOINT for chr_unicode() depends on the platform features (e.g. 32bit/64bit, ASCII/EBCDIC).
The full list of functions provided:
chr_unicode(CODEPOINT)
chr_utf16le(CODEPOINT)
chr_utf16be(CODEPOINT)
chr_utf32le(CODEPOINT)
chr_utf32be(CODEPOINT)
chr_utf8(CODEPOINT)
chr_utf8mod(CODEPOINT)
chr_utfcp1047(CODEPOINT)
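A short sketch of the chr_ functions, assuming the code points U+3042 and U+1F600 as sample input. The hex values in the comments follow from the standard UTF-8, UTF-16 and UTF-32 encodings; the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Transform qw(:chr);

# U+3042 (HIRAGANA LETTER A) in three encodings.
my $utf8    = chr_utf8(0x3042);       # e38182
my $utf16le = chr_utf16le(0x3042);    # 4230
my $utf32be = chr_utf32be(0x3042);    # 00003042

# A supplementary code point becomes a surrogate pair in UTF-16.
my $pair = chr_utf16be(0x1F600);      # d83dde00

# Above the UTF-16 maximum of 0x10FFFF the documented result is undef.
my $too_big = chr_utf16be(0x110000);

print unpack('H*', $_), "\n" for $utf8, $utf16le, $utf32be, $pair;
print defined $too_big ? "defined\n" : "undef\n";
```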
(Exporting: use Unicode::Transform ':ord';)
ord_<SRC_UTF_NAME>(STRING)
Returns the unsigned integer value of the first character of STRING in SRC_UTF_NAME. STRING may begin with an ill-formed octet sequence corresponding to a surrogate code point (0xD800..0xDFFF) or an out-of-range code point (0x110000 and greater). If STRING is empty or begins with a partial octet, returns undef.
ord_unicode(STRING)
ord_utf16le(STRING)
ord_utf16be(STRING)
ord_utf32le(STRING)
ord_utf32be(STRING)
ord_utf8(STRING)
ord_utf8mod(STRING)
ord_utfcp1047(STRING)
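The ord_ functions can be exercised in the same way. This sketch uses made-up input strings and variable names; the values in the comments follow from the description above and the standard encodings.

```perl
use strict;
use warnings;
use Unicode::Transform qw(:ord);

# Well-formed input: only the first character is examined.
my $hiragana  = ord_utf8("\xE3\x81\x82");            # 0x3042
my $first     = ord_utf8("AB");                      # 0x41
my $emoji     = ord_utf16be("\xD8\x3D\xDE\x00");     # 0x1F600

# An ill-formed sequence for a surrogate code point is still reported.
my $surrogate = ord_utf32be("\x00\x00\xD8\x00");     # 0xD800

# Empty input and a leading partial octet both give undef.
my $empty     = ord_utf8('');
my $partial   = ord_utf8("\xC2");

printf "0x%X\n", $_ for $hiragana, $first, $emoji, $surrogate;
print defined $_ ? "defined\n" : "undef\n" for $empty, $partial;
```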
SADAHIRO Tomoyuki <SADAHIRO@cpan.org>
Copyright(C) 2002-2005, SADAHIRO Tomoyuki. Japan. All rights reserved.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
UTF-EBCDIC (and UTF-8-Mod)