.. Copyright (c) 2025 Tobias Erbsland - Erbsland DEV. https://erbsland.dev SPDX-License-Identifier: Apache-2.0 .. _ref-character: .. index:: single: Character Characters ========== The *Erbsland Configuration Language* (:term:`ELCL`) supports all valid :term:`Unicode` characters, except for most control codes. In this section, we define characters, character groups, and ranges that carry specific meanings in the language. .. index:: single: Encoding Encoding -------- #. An :term:`ELCL` document **must** be encoded in :term:`UTF-8`. #. A parser **must** support an optional UTF-8 *BOM* (Byte Order Mark). #. A parser **must** raise an ``Encoding`` error if it encounters an illegal byte sequence in the UTF-8 encoded data. #. A parser **must** raise an ``Encoding`` error if it encounters a valid UTF-8 sequence that represents an illegal Unicode code point. .. micro-parser:: Parsers **must** support at least 7-bit ASCII, but may support additional UTF-8 encoded data. .. index:: single: UTF-8 single: Invalid UTF-8 Illegal UTF-8 Sequences ----------------------- An :term:`ELCL` parser must reject any illegal UTF-8 sequences and terminate with an error. Since not all programming languages or libraries fully implement UTF-8 decoding, the following list outlines illegal UTF-8 sequences that a parser must reject. These sequences are representative examples, not exhaustive lists. .. list-table:: :header-rows: 1 :widths: 25, 75 :width: 100% * - Sequence - Description * - | :text-code:`ED A0 80` ... | :text-code:`ED BF BF` - Low- and high-surrogates are special 16-bit code points used in UTF-16 to encode 32-bit values. They represent the code point range from :text-code:`U+D800` to :text-code:`U+DFFF`, which are illegal in UTF-8 and must be rejected. * - | :text-code:`F4 90 80 80` ... | :text-code:`F5 ...` | :text-code:`F6 ...` | ... | :text-code:`FD ...` - The :term:`Unicode` standard limits the highest valid code point to :text-code:`U+10FFFF`. Any UTF-8 sequence that generates a code point above this range, such as :text-code:`U+110000` and higher, must be rejected. * - | :text-code:`C0 80` | :text-code:`C1 80` | :text-code:`E0 9F BF` | :text-code:`F0 8F BF BF` - UTF-8 multi-byte sequences can encode the same value in multiple ways, but only the shortest possible encoding is allowed. Therefore, sequences that encode a value in a longer sequence than technical necessary are illegal and must be rejected. * - | :text-code:`C2` + 7-bit | :text-code:`EO 80` + 7-bit | :text-code:`FO 80 80` + 7-bit - If a start byte is not followed by the required number of continuation bytes, the sequence is illegal. This can occur if a 7-bit character follows an incomplete sequence, or if the document ends mid-sequence. * - :text-code:`80` — :text-code:`BF` - A continuation byte must only appear after a valid start byte. If encountered elsewhere, it is illegal and must be rejected. * - :text-code:`FE`, :text-code:`FF` - These are invalid start bytes and must be rejected. .. design-rationale:: Enforcing strict UTF-8 handling ensures predictable behavior, as opposed to lenient alternatives like skipping, ignoring, or replacing invalid encodings with the *replacement character*. If encoding issues are not handled upfront, they will surface in the application layer, potentially causing problems or requiring additional error-handling logic. Strict encoding rules ensure that users of an :term:`ELCL` parser can reliably process text from configuration files. Implementation Examples ~~~~~~~~~~~~~~~~~~~~~~~ A safe and complete UTF-8 decoding process, including the rejection of all illegal characters, can be implemented with minimal code if you use bit-tests in your decoder. .. code-block:: cpp :caption: Pseudo C++ code for proper UTF-8 decoding if (at_end()) return Char(); // EOF byte c = get_next_byte(); if (c < 0x80) return Char(c); // 7-bit ASCII uint8_t cSize = 0; uint32_t unicodeValue; if ((c & 0b11100000u) == 0b11000000u && c >= 0b11000010u) { cSize = 2; // 2-byte sequence unicodeValue = (c & 0b00011111u); } else if ((c & 0b11110000u) == 0b11100000u) { cSize = 3; // 3-byte sequence unicodeValue = (c & 0b00001111u); } else if ((c & 0b11111000u) == 0b11110000u && c < 0b11110101u) { cSize = 4; // 4-byte sequence unicodeValue = (c & 0b00000111u); } if (cSize < 2) throw EncodingError(); // Invalid start byte sequence UnsafeConstBytePtr lastIt = it; for (uint8_t i = 1; i < cSize; ++i) { if (at_end()) throw EncodingError(); c = get_next_char(); if ((c & 0b11000000u) != 0b10000000u) throw EncodingError(); // Invalid continuation byte unicodeValue <<= 6; unicodeValue |= (c & 0b00111111u); } if ((cSize == 3 && unicodeValue < 0x800) || cSize == 4 && unicodeValue < 0x10000) { throw EncodingError(); // Over-long UTF-8 sequence. } const auto result = Char(unicodeValue); // Validate against invalid Unicode ranges (surrogates, code points > 0x10FFFF) if (!result.isValidUnicode()) throw EncodingError(); return result; .. code-block:: python :caption: Pseudo Python code for proper UTF-8 decoding def parse_utf8_char() -> str: if at_end(): return None # EOF c = get_next_byte() if c < 0x80: return chr(c) # 7-bit ASCII c_size = 0 unicode_value = 0 if (c & 0b11100000) == 0b11000000 and c >= 0b11000010: c_size = 2 # 2-byte sequence unicode_value = c & 0b00011111 elif (c & 0b11110000) == 0b11100000: c_size = 3 # 3-byte sequence unicode_value = c & 0b00001111 elif (c & 0b11111000) == 0b11110000 and c < 0b11110101: c_size = 4 # 4-byte sequence unicode_value = c & 0b00000111 else: raise EncodingError("Invalid start byte sequence") for _ in range(1, c_size): if at_end(): raise EncodingError("Unexpected end of data") c = get_next_byte() if (c & 0b11000000) != 0b10000000: raise EncodingError("Invalid continuation byte") unicode_value = (unicode_value << 6) | (c & 0b00111111) if c_size == 3 and unicode_value < 0x800 or c_size == 4 and unicode_value < 0x10000: raise EncodingError("Over-long UTF-8 sequence") if not is_valid_unicode(unicode_value): raise EncodingError("Invalid Unicode code point") return chr(unicode_value) .. index:: single: Control single: Control Code single: Illegal Control Codes Illegal Control Codes --------------------- Most control codes are prohibited in an :term:`ELCL` document. The following table lists all illegal control codes. .. list-table:: :header-rows: 1 :widths: 25, 75 :width: 100% * - Code/Range - Description * - :text-code:`U+0000` - The "null" control character is disallowed in any part of a document, including text. The escape sequence ``\u0000`` is not permitted in text. * - | :text-code:`U+0001` — :text-code:`U+0008` | :text-code:`U+000B` — :text-code:`U+000C` | :text-code:`U+000E` — :text-code:`U+001F` | :text-code:`U+007F` — :text-code:`U+00A0` - These control codes are disallowed in documents. However, they may appear in text blocks as escape sequences. The only **valid control codes** in *ELCL* documents are the tab (:cp:`09`), new-line (:cp:`0a`), and carriage-return (:cp:`0d`). .. design-rationale:: Historically, control codes had specific uses, but today, most of them introduce errors or even security vulnerabilities. For this reason, control codes are disallowed in *ELCL* documents, particularly in text. If a control code is needed in text, it can be inserted using the appropriate Unicode escape sequence. The "null" control character is forbidden in text because it frequently causes issues when passing text through API boundaries. Like other control codes, it serves no meaningful purpose in text contexts. If byte-data is needed, *ELCL* provides support for such structures, and if values need to be separated, lists can be used. Prohibiting control codes simplifies text processing, although more complex Unicode behaviors—such as combining characters or directionality markers—remain possible within text blocks. However, the responsibility for handling these complexities can safely be delegated to the application code. .. index:: single: Character single: Named Characters Named Characters in EBNF ------------------------ Characters in the shown EBNF syntax are named according to their Unicode or common names, rather than their function within the language. .. code-block:: bnf TAB ::= #x0009 /* Tab character */ LF ::= #x000A /* Line feed */ CR ::= #x000D /* Carriage return */ SPACE ::= #x0020 /* Space character */ DOUBLE_QUOTE ::= #x0022 /* Double Quote (") */ HASH ::= #x0023 /* Hash symbol (#) */ DOLLAR ::= #x0024 /* Dollar sign ($) */ APOSTROPHE ::= #x0027 /* Apostrophe (') */ ASTERISK ::= #x002A /* Asterisk (*) */ PLUS ::= #x002B /* The plus sign (+) */ COMMA ::= #x002C /* Comma (,) */ HYPHEN ::= #x002D /* Hyphen (-) */ PERIOD ::= #x002E /* Period (.) */ SLASH ::= #x002F /* Slash (/) */ COLON ::= #x003A /* Colon (:) */ LESS_THAN_SIGN ::= #x003C /* Less-Than Sign (<) */ EQUAL ::= #x003D /* Equals sign (=) */ GREATER_THAN_SIGN ::= #x003E /* Greater-Than Sign (>) */ AT_SIGN ::= #x0040 /* At sign (@) */ SQ_BRACKET_OPEN ::= #x005B /* Opening square bracket ([) */ BACKSLASH ::= #x005C /* Backslash (\) */ SQ_BRACKET_CLOSE ::= #x005D /* Closing square bracket (]) */ UNDERSCORE ::= #x005F /* Underscore (_) */ BACKTICK ::= #x0060 /* Backtick (`) */ CU_BRACKET_OPEN ::= #x007B /* Opening curly bracket ({) */ CU_BRACKET_CLOSE ::= #x007D /* Closing curly bracket ({) */ .. index:: single: Character Groups Character Groups in EBNF ------------------------ In ELCL, certain character groups have predefined ranges or sets. Below is a list of important character groups used in the EBNF syntax: .. code-block:: bnf DIGIT ::= [#x0030-#x0039] /* Decimal digits 0-9 */ HEX_DIGIT ::= [#x0030-#x0039#x0041-#x0046#x0061-#x0066] /* Hexadecimal digits 0-9, A-F, a-f */ BIN_DIGIT ::= [#x0030#x0031] /* Binary digits 0, 1 */ ALPHA ::= [#x0041-#x005A#x0061-#x007A] /* Alphabetic characters A-Z, a-z */ TEXT ::= [#x0009#x0020-#x007E#x00A0-#x10FFFF] /* Any printable character (excluding control codes) */ DIGIT_OR_ALPHA ::= DIGIT | ALPHA /* Digits or alphabetic characters */ FORMAT_DIGIT ::= ALPHA | DIGIT | HYPHEN | UNDERSCORE /* One element of a format specifier */ PLUS_MINUS ::= PLUS | HYPHEN /* Plus or minus */ LETTER_E ::= [eE] /* The letter E */ .. note:: The ``TEXT`` character group must exclude low and high surrogates, as well as any characters that are invalid in a UTF-8 encoded document. Features -------- .. list-table:: :header-rows: 1 :width: 100% :widths: 25, 75 * - Feature - Coverage * - :text-code:`core` - The full syntax outlined in this chapter is part of the core language. Errors ------ .. list-table:: :header-rows: 1 :width: 100% :widths: 25, 75 * - Error Code - Causes * - :text-code:`Encoding` - Raised if the parser detects invalid UTF-8 sequences. * - :text-code:`Character` - Raised if an illegal control character is read in the configuration document.