UTF8.E for Euphoria 3.1.1


                 Version 2.02, January/13/2018, by Shian Lee.


                             1. Introduction


 This library provides UTF-8 and Unicode conversion routines for any platform.
 It is especially useful for developing applications for the Linux terminal
 (see also the get_xkey() and wait_xkey() functions in MACHINE2.E by Shian
 Lee, www.RapidEuphoria.com archive).

 Although the UTF-8 encoding for Unicode is not the only encoding in use, it
 is definitely the dominant character encoding in many areas, and recommended
 for most applications.

 Unicode is a table of 1114112 (17 * power(2, 16)) code points, where each
 code point represents a single international character. The UTF-8 encoding
 for Unicode is a method of converting a code point (large integer), into a
 sequence of 1 to 4 bytes. Bytes are necessary for any I/O operation such as
 puts() and gets().
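
 As a rough illustration of the 1-to-4-byte conversion described above, the
 sketch below encodes one code point by hand, using only Euphoria built-ins.
 The name encode_point() is hypothetical and is not part of this library; it
 assumes a valid code point (0 to 1114111) and does no error checking. In
 real programs use utf8() instead.

```euphoria
-- Hypothetical helper, for illustration only (not part of UTF8.E).
-- Assumes 0 <= cp <= 1114111; performs no validation.
function encode_point(integer cp)
    if cp < #80 then            -- up to 7 bits: one byte, same as ASCII
        return {cp}
    elsif cp < #800 then        -- up to 11 bits: two bytes
        return {#C0 + floor(cp / #40),
                #80 + and_bits(cp, #3F)}
    elsif cp < #10000 then      -- up to 16 bits: three bytes
        return {#E0 + floor(cp / #1000),
                #80 + and_bits(floor(cp / #40), #3F),
                #80 + and_bits(cp, #3F)}
    else                        -- up to 21 bits: four bytes
        return {#F0 + floor(cp / #40000),
                #80 + and_bits(floor(cp / #1000), #3F),
                #80 + and_bits(floor(cp / #40), #3F),
                #80 + and_bits(cp, #3F)}
    end if
end function

? encode_point('h')     -- {104}
? encode_point(1076)    -- {208,180}
? encode_point(51665)   -- {236,167,145}
```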

 UTF-8 is fully compatible with ASCII codes (0-127), and it is totally safe to
 embed UTF-8 strings in your source code file.


 Disclaimer
 ==========

 Use this library at your own risk. The author will not be responsible for
 any damage or data loss.

 This library was tested on the Linux Mint 18.3 operating system. The code or
 the documentation might still contain errors or mistakes.


 In the descriptions below, to indicate what kind of object may be passed in
 and returned, the following prefixes are used:

 x     - a general object (atom or sequence)

 s     - a sequence (...or a string-sequence)

 a     - an atom

 i     - an integer

 fn    - an integer used as a file number

 st    - a string sequence, or single-character atom



                      2. Routines by Application Area


 2.1 Decoding/Encoding
 =====================

 unicode         - decode UTF-8 string into Unicode string

 utf8            - encode Unicode string into UTF-8 string

 group_utf8      - group UTF-8 characters as sub-strings

 UNICODE_BOM     - the byte-order-mark Unicode code point (#FEFF)

 UTF8_BOM        - the byte-order-mark UTF-8 character ({#EF, #BB, #BF})

 UNICODE_REP     - the replacement-character Unicode code point (#FFFD)

 UTF8_REP        - the replacement-character UTF-8 character ({#EF, #BF, #BD})



 2.2 Mapping
 ===========

 cp437 and cp1252 are commonly used 8-bit encodings for DOS and Windows. For
 more code pages see: ftp://ftp.unicode.org/Public/MAPPINGS/.

 CP437           - the DOS-Latin-US cp437 mapping to Unicode ({#0001, ...})

 CP1252          - the Windows-Latin cp1252 mapping to Unicode ({#0001, ...})



 2.3 Bitwise Logical Operations
 ==============================

 B00000000
 ...
 B11111111       - the binary representations of a byte (0-255)




                   3. Alphabetical Listing of all Routines



 --------------------------------<b00000000>---------------------------------

 Syntax:      include binconst.e
              B00000000 ... B11111111

 Description: B00000000 ... B11111111 (0 ... 255) have been defined as global
              constants.

 Comments:    The binary representation of a byte (an 8-bit integer from 0 to
              255) can often improve the readability of bit-manipulation code,
              such as UTF-8 character encoding.

 Example 1:

              x = B01110011  -- x is 115 (#73)


 Example 2:

              i = B11110101 * #100 + B00001101
              -- i is B11110101_00001101 (#F50D)

              a = B11110101 * #1000000 + B00001101
              -- a is B11110101_00000000_00000000_00001101 (#F500000D)


 Example 3:

              i = and_bits('a', B11011111)
              -- i is 'A' (upper-case 'a', made by clearing bit 5)
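
              As a sketch of how the binary constants can make UTF-8 bit
              tests easier to read, the hypothetical function below (not
              part of this library) classifies a single byte by its leading
              bits. It checks the bit pattern only; it does not reject
              lead bytes that are otherwise disallowed in UTF-8.

```euphoria
include binconst.e

-- Hypothetical helper (not part of the library): returns the total
-- number of bytes in the character that lead byte b starts, 0 if b is
-- a continuation byte, or -1 if b matches no UTF-8 bit pattern.
function utf8_byte_kind(integer b)
    if and_bits(b, B10000000) = B00000000 then
        return 1                      -- 0xxxxxxx: single-byte ASCII
    elsif and_bits(b, B11000000) = B10000000 then
        return 0                      -- 10xxxxxx: continuation byte
    elsif and_bits(b, B11100000) = B11000000 then
        return 2                      -- 110xxxxx: starts 2 bytes
    elsif and_bits(b, B11110000) = B11100000 then
        return 3                      -- 1110xxxx: starts 3 bytes
    elsif and_bits(b, B11111000) = B11110000 then
        return 4                      -- 11110xxx: starts 4 bytes
    end if
    return -1                         -- 11111xxx: never valid
end function

? utf8_byte_kind(208)   -- 2
? utf8_byte_kind(180)   -- 0
? utf8_byte_kind(236)   -- 3
```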


 See Also:    Euphoria 3.1.1: and_bits, or_bits, xor_bits, not_bits


 ---------------------------------<cp1252>-----------------------------------

 Syntax:      include cp1252.e
              CP1252

 Description: CP1252 ({#0001, ...}) has been defined as a global constant.

 Comments:    The sequence CP1252 maps the code page 1252, from 1 to 255, into
              Unicode code points. It allows you to write programs which are
              compatible with both UTF-8 and cp1252 encodings.

              ASCII codes 0 to 127 are identical in both encodings. Code 0
              (NULL) is not included in CP1252, because a Euphoria sequence
              is 1-based and it's easier to start counting from 1.

              Windows Latin cp1252 is a commonly used 8-bit character encoding.
              For more info see https://en.wikipedia.org/wiki/Windows-1252.

 Example:

              x = CP1252['c']  -- x is #0063 ('c')

              x = CP1252[177]  -- x is #00B1 (plus-minus sign)
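
              A minimal sketch of converting a cp1252 byte string into a
              UTF-8 byte string with this mapping. The function name
              cp1252_to_utf8() is hypothetical, and the sketch assumes that
              every element of the input is an integer from 0 to 255. Code
              0 is handled separately because it has no entry in CP1252.

```euphoria
include cp1252.e
include utf8.e

-- Hypothetical helper (not part of the library).
function cp1252_to_utf8(sequence bytes)
    sequence points
    points = repeat(0, length(bytes))   -- code 0 (NULL) stays 0
    for i = 1 to length(bytes) do
        if bytes[i] != 0 then
            points[i] = CP1252[bytes[i]]  -- cp1252 code -> code point
        end if
    end for
    return utf8(points)                 -- code points -> UTF-8 bytes
end function

sequence s
s = cp1252_to_utf8({'c', 177})
-- s should be {99, 194, 177}: 'c', then #00B1 as two UTF-8 bytes
```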


 Example Programs: mapping.ex, boxes.ex

 See Also:    CP437


 ----------------------------------<cp437>-----------------------------------

 Syntax:      include cp437.e
              CP437

 Description: CP437 ({#0001, ...}) has been defined as a global constant.

 Comments:    The sequence CP437 maps the code page 437, from 1 to 255, into
              Unicode code points. It allows you to write programs which are
              compatible with both UTF-8 and cp437 encodings.

              ASCII codes 0 to 127 are identical in both encodings. Code 0
              (NULL) is not included in CP437, because a Euphoria sequence
              is 1-based and it's easier to start counting from 1.

              DOS Latin US cp437 is the standard (8-bit) encoding for IBM PC.
              For more info see https://en.wikipedia.org/wiki/Code_page_437.

 Example:

              x = CP437['c']  -- x is #0063 ('c')

              x = CP437[241]  -- x is #00B1 (plus-minus sign)
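
              The mapping can also be searched in the opposite direction.
              The sketch below uses find() to turn Unicode code points back
              into cp437 codes. The function name unicode_to_cp437() and the
              use of '?' for code points that have no cp437 equivalent are
              assumptions made for this illustration only.

```euphoria
include cp437.e

-- Hypothetical helper (not part of the library).
function unicode_to_cp437(sequence points)
    sequence bytes
    integer code
    bytes = repeat('?', length(points))  -- '?' marks unmappable points
    for i = 1 to length(points) do
        if points[i] = 0 then
            bytes[i] = 0                 -- NULL is not stored in CP437
        else
            code = find(points[i], CP437)
            if code != 0 then
                bytes[i] = code          -- the index is the cp437 code
            end if
        end if
    end for
    return bytes
end function

sequence s
s = unicode_to_cp437({#0063, #00B1})
-- going by the examples above, s should be {99, 241}
```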


 Example Programs: mapping.ex, boxes.ex

 See Also:    CP1252


 ---------------------------------<utf8_bom>---------------------------------

 Syntax:      include utf8.e
              UTF8_BOM

 Description: UTF8_BOM ({#EF, #BB, #BF}) has been defined as a global constant.

 Comments:    Many Windows programs (including Windows Notepad) add the Unicode
              character BOM (byte-order-mark) at the beginning of a document,
              as a magic number to indicate that it's a UTF-8 file. In some
              cases you may have to add or remove the BOM character at the
              beginning of a file.

              For more info see https://en.wikipedia.org/wiki/Byte_order_mark.

 Example:

              x = UTF8_BOM  -- x is {239, 187, 191}
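
              A minimal sketch of removing a leading BOM from a byte string,
              for example from the first line read from a file. The function
              name strip_utf8_bom() is hypothetical and not part of this
              library.

```euphoria
include utf8.e

-- Hypothetical helper (not part of the library).
function strip_utf8_bom(sequence bytes)
    if length(bytes) >= 3 then
        if equal(bytes[1..3], UTF8_BOM) then
            return bytes[4..length(bytes)]   -- drop the three BOM bytes
        end if
    end if
    return bytes                             -- no BOM: return unchanged
end function

sequence s
s = strip_utf8_bom(UTF8_BOM & "abc")   -- s is "abc"
s = strip_utf8_bom("abc")              -- s is "abc" (unchanged)
```

              To add a BOM instead, write UTF8_BOM with puts() before
              writing the rest of the file.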


 See Also:    UNICODE_BOM


 ---------------------------------<utf8_rep>---------------------------------

 Syntax:      include utf8.e
              UTF8_REP

 Description: UTF8_REP ({#EF, #BF, #BD}) has been defined as a global constant.

 Comments:    The replacement-character is used to indicate problems when a
              system is not able to decode a stream of data to a correct
              symbol.

              The function utf8() returns invalid code points, and the
              function group_utf8() returns invalid byte-sequences, as
              UTF8_REP, which is useful for some algorithms.

              See also https://en.wikipedia.org/wiki/Specials_(Unicode_block).

 Example:

              x = UTF8_REP  -- x is {239, 191, 189}
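
              A short sketch of testing an encoding result against UTF8_REP.
              It assumes, as the Syntax of utf8() suggests, that encoding a
              single code point returns a flat byte string. Note that a
              genuine replacement character (#FFFD) in the input encodes to
              the same three bytes.

```euphoria
include utf8.e

sequence s
s = utf8(1114112)    -- one past the last valid code point
if equal(s, UTF8_REP) then
    puts(1, "code point was invalid (or was #FFFD itself)\n")
end if
```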


 See Also:    UNICODE_REP, utf8, group_utf8


 -------------------------------<unicode_bom>--------------------------------

 Syntax:      include utf8.e
              UNICODE_BOM

 Description: UNICODE_BOM (#FEFF) has been defined as a global constant.

 Comments:    Many Windows programs (including Windows Notepad) add the Unicode
              character BOM (byte-order-mark) at the beginning of a document,
              as a magic number to indicate that it's a UTF-8 file. In some
              cases you may have to add or remove the BOM character at the
              beginning of a Unicode string.

              For more info see https://en.wikipedia.org/wiki/Byte_order_mark.

 Example:

              x = UNICODE_BOM  -- x is 65279 (#FEFF)
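
              A minimal sketch of dropping the BOM after decoding. It
              assumes that unicode() decodes the three BOM bytes into the
              single code point UNICODE_BOM and does not remove it itself.

```euphoria
include utf8.e

sequence s
s = unicode(UTF8_BOM & "abc")     -- decode; the BOM is one code point
if length(s) > 0 then
    if equal(s[1], UNICODE_BOM) then
        s = s[2..length(s)]       -- drop the leading BOM
    end if
end if
-- s should now be {97, 98, 99} ("abc")
```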


 See Also:    UTF8_BOM


 -------------------------------<unicode_rep>--------------------------------

 Syntax:      include utf8.e
              UNICODE_REP

 Description: UNICODE_REP (#FFFD) has been defined as a global constant.

 Comments:    The replacement-character is used to indicate problems when a
              system is not able to decode a stream of data to a correct
              symbol.

              The function unicode() returns invalid UTF-8 byte-sequences as
              UNICODE_REP, which is useful for some algorithms. For example
              see edu.ex (Shian Lee, www.RapidEuphoria.com archive).

              See also https://en.wikipedia.org/wiki/Specials_(Unicode_block).

 Example:

              x = UNICODE_REP  -- x is 65533 (#FFFD)
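
              A short sketch of using UNICODE_REP to detect input that was
              not valid UTF-8. How many replacement characters unicode()
              emits for a given invalid byte is not assumed here; the sketch
              only checks whether any are present.

```euphoria
include utf8.e

sequence s
s = unicode({104, 255, 105})   -- byte 255 never occurs in valid UTF-8
if find(UNICODE_REP, s) then
    puts(1, "input contained invalid UTF-8 (or a literal #FFFD)\n")
end if
```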


 See Also:    UTF8_REP, unicode


 ------------------------------<group_utf8>----------------------------------

 Syntax:      include utf8.e
              s = group_utf8(st)

 Description: Group UTF-8 characters in string-sequence st as sub-strings.

 Comments:    Only multi-byte UTF-8 characters are grouped into sub-strings;
              single-byte ASCII characters, 0-127, are not grouped.

              Invalid (excluding long-forms) UTF-8 byte-sequences in st are
              replaced by the UTF-8 replacement character (see UTF8_REP).

              Unlike many languages, Euphoria can easily manipulate any
              multi-byte string directly, regardless of which encoding is
              being used. For example, in Euphoria the sequence
              {{'a', 'b'}, 'c'} has two multi-byte "characters": the first is
              "ab", the second is 'c'.

              Instead of using unicode() and utf8() functions to decode and
              encode strings for I/O operations, you can use group_utf8() to
              "decode" a UTF-8 string, and flat() to "encode" a UTF-8 string
              (see MACHINE2.E by Shian Lee, www.RapidEuphoria.com archive).

              You can also use print() to save multi-byte strings in a file,
              regardless of encoding, and load that file with get().

 Example 1:

              -- "home"
              s = group_utf8({104, 111, 109, 101})
              -- s is {104, 111, 109, 101}

              -- "дом" ("home" in Russian)
              s = group_utf8({208, 180, 208, 190, 208, 188})
              -- s is {{208, 180}, {208, 190}, {208, 188}}

              -- "집" ("home" in Korean)
              s = group_utf8({236, 167, 145})
              -- s is {{236, 167, 145}}


 Example 2:

              integer fn
              object line

              -- save the word "home" (in Russian) to a UTF-8 file
              fn = open("utf8file.txt", "w")
              puts(fn, "дом in Russian...\n")
              close(fn)

              -- load the word "home" (in Russian) from a UTF-8 file
              -- into Euphoria sequence, and print it to the screen
              fn = open("utf8file.txt", "r")
              line = group_utf8(gets(fn))
              close(fn)
              puts(1, flat(line))   -- see flat(), MACHINE2.E
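
              Once a string is grouped, each element is one character, so
              ordinary sequence operations work on whole characters. The
              sketch below counts and reverses the characters of "дом". The
              function join_groups() is a hypothetical stand-in for flat()
              (MACHINE2.E), written here only to keep the sketch
              self-contained.

```euphoria
include utf8.e

-- Hypothetical stand-in for flat() (MACHINE2.E): joins the grouped
-- characters back into one flat byte string.
function join_groups(sequence groups)
    sequence bytes
    bytes = {}
    for i = 1 to length(groups) do
        bytes &= groups[i]      -- works for an atom or a sub-string
    end for
    return bytes
end function

sequence g, r
g = group_utf8({208, 180, 208, 190, 208, 188})   -- "дом"
? length(g)                  -- 3 characters, not 6 bytes

-- reverse by whole characters, so no multi-byte character is split
r = {}
for i = length(g) to 1 by -1 do
    r = append(r, g[i])
end for
r = join_groups(r)
-- r is {208, 188, 208, 190, 208, 180} ("мод")
```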


 Example Program: boxes.ex

 See Also:    unicode, utf8, UTF8_BOM, UTF8_REP
              Euphoria 3.1.1: equal, print, get, Operations on Sequences


 --------------------------------<unicode>-----------------------------------

 Syntax:      include utf8.e
              s2 = unicode(s1)

 Description: Decode UTF-8 string-sequence s1 into Unicode string-sequence s2.

 Comments:    Invalid or long-form UTF-8 byte-sequences in s1 are decoded into
              the Unicode replacement character (see UNICODE_REP).

              While UTF-8 strings are necessary for any I/O operation, it's
              not simple to use regular functions such as match_from() or
              length() to manipulate a multi-byte character string. It's much
              more convenient to manipulate Unicode strings, where each
              character is represented by a single Unicode code point.

              A Unicode code point is a Euphoria integer. Euphoria sees both
              a Unicode string and an ASCII string in the same way: as a
              sequence of integers. You can manipulate a Unicode string in
              the same way, using the same functions, that you use to
              manipulate a regular ASCII string ("abc").

              To output a Unicode string you must first encode it to a UTF-8
              byte-string with utf8().

              For more info see https://en.wikipedia.org/wiki/Unicode.

              For a complete Unicode/UTF-8-character table see:
              http://www.utf8-chartable.de/.
              See also the 'Character Map' application on Windows and Linux
              Mint systems.

 Example 1:

              -- "home"
              s = unicode({104, 111, 109, 101})
              -- s is {104, 111, 109, 101}

              -- "дом" ("home" in Russian)
              s = unicode({208, 180, 208, 190, 208, 188})
              -- s is {1076, 1086, 1084}

              -- "집" ("home" in Korean)
              s = unicode({236, 167, 145})
              -- s is {51665}


 Example 2:

              integer fn
              object line

              -- save the word "home" (in Russian) to a UTF-8 file
              fn = open("utf8file.txt", "w")
              puts(fn, "дом in Russian...\n")
              close(fn)

              -- load the word "home" (in Russian) from a UTF-8 file
              -- into Unicode string, and print it to the screen
              fn = open("utf8file.txt", "r")
              line = unicode(gets(fn))
              close(fn)
              puts(1, utf8(line))
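
              A brief sketch of the point made in the Comments: after
              decoding, the usual sequence routines count and index whole
              characters. The values follow from Example 1 above.

```euphoria
include utf8.e

sequence s
s = unicode({208, 180, 208, 190, 208, 188})   -- "дом"

? length(s)              -- 3
? find(1086, s)          -- 2 (position of the second letter)
? utf8(s[1..2])          -- {208,180,208,190} (the first two letters)
```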


 See Also:    utf8, group_utf8, UNICODE_BOM, UNICODE_REP
              Euphoria 3.1.1: gets, length, print, find, match


 ----------------------------------<utf8>------------------------------------

 Syntax:      include utf8.e
              s = utf8(x)

 Description: Encode a single Unicode code point (integer) or a string-sequence
              of Unicode code points x into UTF-8 string-sequence s.

 Comments:    Invalid Unicode code points in x (less than 0 or greater than
              1114111) are encoded into the Unicode replacement character (see
              UTF8_REP).

              Unicode string-sequence can only be manipulated in memory, since
              it is made of large integers rather than bytes. For I/O operation
              such as puts() you must first encode each Unicode code point into
              a UTF-8 multi-byte character (1 to 4 bytes).

              Note that the length of a UTF-8 string is equal to or *longer*
              than the length of the Unicode string that it encodes.
              Therefore I/O routines which use the actual length of a string,
              such as wrap() and scroll(), may seem to behave in an odd way.

              UTF-8 strings are made of single-byte ASCII characters in the
              range 0 to 127, and multi-byte non-ASCII characters whose bytes
              are all in the range 128 to 255; therefore it is totally safe
              to embed UTF-8 strings in a source code file.

              You must use an editor and terminal which support UTF-8 encoding
              to actually view UTF-8 characters in a source code file or on
              the screen, otherwise you will see a meaningless string of bytes.

              For more info see https://en.wikipedia.org/wiki/UTF-8.

 Example 1:

              s = utf8({104, 111, 109, 101})
              -- s is {104, 111, 109, 101} -- "home"

              s = utf8({1076, 1086, 1084})
              -- s is {208, 180, 208, 190, 208, 188}
              -- "дом" ("home" in Russian)

              s = utf8({51665})
              -- s is {236, 167, 145} -- "집" ("home" in Korean)


 Example 2:

              -- output embedded UTF-8 string "home" in Russian
              puts(1, "*** дом ***\n")
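
              Because a UTF-8 string can be longer than the text it shows,
              padding or aligning by length() of the byte string gives the
              wrong width. The sketch below pads by code points instead. The
              function name pad_right() is hypothetical, and the sketch
              assumes each code point occupies one screen column, which does
              not hold for combining or double-width characters.

```euphoria
include utf8.e

-- Hypothetical helper (not part of the library): pads a UTF-8 string
-- with spaces up to width characters (code points, not bytes).
function pad_right(sequence bytes, integer width)
    sequence points
    points = unicode(bytes)                 -- count in code points
    if length(points) < width then
        points &= repeat(' ', width - length(points))
    end if
    return utf8(points)                     -- back to bytes for output
end function

sequence s
s = pad_right({208, 180, 208, 190, 208, 188}, 5)   -- "дом"
? length(s)     -- 8 bytes: 6 for the three letters plus 2 spaces
```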


 Example Programs: mapping.ex, boxes.ex

 See Also:    unicode, group_utf8, UTF8_BOM, UTF8_REP
              Euphoria 3.1.1: puts, length