Characters
The Erbsland Configuration Language (ELCL) supports all valid Unicode characters, except for most control codes. In this section, we define characters, character groups, and ranges that carry specific meanings in the language.
Encoding
A parser must support an optional UTF-8 BOM (Byte Order Mark).
A parser must raise an
Encoding
error if it encounters an illegal byte sequence in the UTF-8 encoded data.A parser must raise an
Encoding
error if it encounters a valid UTF-8 sequence that represents an illegal Unicode code point.
Micro-Parsers
Parsers must support at least 7-bit ASCII, but may support additional UTF-8 encoded data.
Illegal UTF-8 Sequences
An ELCL parser must reject any illegal UTF-8 sequences and terminate with an error. Since not all programming languages or libraries fully implement UTF-8 decoding, the following list outlines illegal UTF-8 sequences that a parser must reject. These sequences are representative examples, not exhaustive lists.
Sequence |
Description |
---|---|
ED A0 80 …
ED BF BF
|
Low- and high-surrogates are special 16-bit code points used in UTF-16 to encode 32-bit values. They represent the code point range from U+D800 to U+DFFF, which are illegal in UTF-8 and must be rejected. |
F4 90 80 80 …
F5 ...
F6 ...
…
FD ...
|
The Unicode standard limits the highest valid code point to U+10FFFF. Any UTF-8 sequence that generates a code point above this range, such as U+110000 and higher, must be rejected. |
C0 80
C1 80
E0 9F BF
F0 8F BF BF
|
UTF-8 multi-byte sequences can encode the same value in multiple ways, but only the shortest possible encoding is allowed. Therefore, sequences that encode a value in a longer sequence than technical necessary are illegal and must be rejected. |
C2 + 7-bit
EO 80 + 7-bit
FO 80 80 + 7-bit
|
If a start byte is not followed by the required number of continuation bytes, the sequence is illegal. This can occur if a 7-bit character follows an incomplete sequence, or if the document ends mid-sequence. |
80 — BF |
A continuation byte must only appear after a valid start byte. If encountered elsewhere, it is illegal and must be rejected. |
FE, FF |
These are invalid start bytes and must be rejected. |
Design Rationale
Enforcing strict UTF-8 handling ensures predictable behavior, as opposed to lenient alternatives like skipping, ignoring, or replacing invalid encodings with the replacement character. If encoding issues are not handled upfront, they will surface in the application layer, potentially causing problems or requiring additional error-handling logic. Strict encoding rules ensure that users of an ELCL parser can reliably process text from configuration files.
Implementation Examples
A safe and complete UTF-8 decoding process, including the rejection of all illegal characters, can be implemented with minimal code if you use bit-tests in your decoder.
if (at_end()) return Char(); // EOF
byte c = get_next_byte();
if (c < 0x80) return Char(c); // 7-bit ASCII
uint8_t cSize = 0;
uint32_t unicodeValue;
if ((c & 0b11100000u) == 0b11000000u && c >= 0b11000010u) {
cSize = 2; // 2-byte sequence
unicodeValue = (c & 0b00011111u);
} else if ((c & 0b11110000u) == 0b11100000u) {
cSize = 3; // 3-byte sequence
unicodeValue = (c & 0b00001111u);
} else if ((c & 0b11111000u) == 0b11110000u && c < 0b11110101u) {
cSize = 4; // 4-byte sequence
unicodeValue = (c & 0b00000111u);
}
if (cSize < 2) throw EncodingError(); // Invalid start byte sequence
UnsafeConstBytePtr lastIt = it;
for (uint8_t i = 1; i < cSize; ++i) {
if (at_end()) throw EncodingError();
c = get_next_char();
if ((c & 0b11000000u) != 0b10000000u) throw EncodingError(); // Invalid continuation byte
unicodeValue <<= 6;
unicodeValue |= (c & 0b00111111u);
}
if ((cSize == 3 && unicodeValue < 0x800) || cSize == 4 && unicodeValue < 0x10000) {
throw EncodingError(); // Over-long UTF-8 sequence.
}
const auto result = Char(unicodeValue);
// Validate against invalid Unicode ranges (surrogates, code points > 0x10FFFF)
if (!result.isValidUnicode()) throw EncodingError();
return result;
def parse_utf8_char() -> str:
if at_end():
return None # EOF
c = get_next_byte()
if c < 0x80:
return chr(c) # 7-bit ASCII
c_size = 0
unicode_value = 0
if (c & 0b11100000) == 0b11000000 and c >= 0b11000010:
c_size = 2 # 2-byte sequence
unicode_value = c & 0b00011111
elif (c & 0b11110000) == 0b11100000:
c_size = 3 # 3-byte sequence
unicode_value = c & 0b00001111
elif (c & 0b11111000) == 0b11110000 and c < 0b11110101:
c_size = 4 # 4-byte sequence
unicode_value = c & 0b00000111
else:
raise EncodingError("Invalid start byte sequence")
for _ in range(1, c_size):
if at_end():
raise EncodingError("Unexpected end of data")
c = get_next_byte()
if (c & 0b11000000) != 0b10000000:
raise EncodingError("Invalid continuation byte")
unicode_value = (unicode_value << 6) | (c & 0b00111111)
if c_size == 3 and unicode_value < 0x800 or c_size == 4 and unicode_value < 0x10000:
raise EncodingError("Over-long UTF-8 sequence")
if not is_valid_unicode(unicode_value):
raise EncodingError("Invalid Unicode code point")
return chr(unicode_value)
Illegal Control Codes
Most control codes are prohibited in an ELCL document. The following table lists all illegal control codes.
Code/Range |
Description |
---|---|
U+0000 |
The “null” control character is disallowed in any part of a document, including text. The escape sequence |
U+0001 — U+0008
U+000B — U+000C
U+000E — U+001F
U+007F — U+00A0
|
These control codes are disallowed in documents. However, they may appear in text blocks as escape sequences. |
The only valid control codes in ELCL documents are the tab (\t
), new-line (\n
), and carriage-return (\r
).
Design Rationale
Historically, control codes had specific uses, but today, most of them introduce errors or even security vulnerabilities. For this reason, control codes are disallowed in ELCL documents, particularly in text. If a control code is needed in text, it can be inserted using the appropriate Unicode escape sequence.
The “null” control character is forbidden in text because it frequently causes issues when passing text through API boundaries. Like other control codes, it serves no meaningful purpose in text contexts. If byte-data is needed, ELCL provides support for such structures, and if values need to be separated, lists can be used.
Prohibiting control codes simplifies text processing, although more complex Unicode behaviors—such as combining characters or directionality markers—remain possible within text blocks. However, the responsibility for handling these complexities can safely be delegated to the application code.
Named Characters in EBNF
Characters in the shown EBNF syntax are named according to their Unicode or common names, rather than their function within the language.
TAB ::= #x0009 /* Tab character */
LF ::= #x000A /* Line feed */
CR ::= #x000D /* Carriage return */
SPACE ::= #x0020 /* Space character */
DOUBLE_QUOTE ::= #x0022 /* Double Quote (") */
HASH ::= #x0023 /* Hash symbol (#) */
DOLLAR ::= #x0024 /* Dollar sign ($) */
APOSTROPHE ::= #x0027 /* Apostrophe (') */
ASTERISK ::= #x002A /* Asterisk (*) */
PLUS ::= #x002B /* The plus sign (+) */
COMMA ::= #x002C /* Comma (,) */
HYPHEN ::= #x002D /* Hyphen (-) */
PERIOD ::= #x002E /* Period (.) */
SLASH ::= #x002F /* Slash (/) */
COLON ::= #x003A /* Colon (:) */
LESS_THAN_SIGN ::= #x003C /* Less-Than Sign (<) */
EQUAL ::= #x003D /* Equals sign (=) */
GREATER_THAN_SIGN ::= #x003E /* Greater-Than Sign (>) */
AT_SIGN ::= #x0040 /* At sign (@) */
SQ_BRACKET_OPEN ::= #x005B /* Opening square bracket ([) */
BACKSLASH ::= #x005C /* Backslash (\) */
SQ_BRACKET_CLOSE ::= #x005D /* Closing square bracket (]) */
UNDERSCORE ::= #x005F /* Underscore (_) */
BACKTICK ::= #x0060 /* Backtick (`) */
CU_BRACKET_OPEN ::= #x007B /* Opening curly bracket ({) */
CU_BRACKET_CLOSE ::= #x007D /* Closing curly bracket ({) */
Character Groups in EBNF
In ELCL, certain character groups have predefined ranges or sets. Below is a list of important character groups used in the EBNF syntax:
DIGIT ::= [#x0030-#x0039] /* Decimal digits 0-9 */
HEX_DIGIT ::= [#x0030-#x0039#x0041-#x0046#x0061-#x0066] /* Hexadecimal digits 0-9, A-F, a-f */
BIN_DIGIT ::= [#x0030#x0031] /* Binary digits 0, 1 */
ALPHA ::= [#x0041-#x005A#x0061-#x007A] /* Alphabetic characters A-Z, a-z */
TEXT ::= [#x0009#x0020-#x007E#x00A0-#x10FFFF] /* Any printable character (excluding control codes) */
DIGIT_OR_ALPHA ::= DIGIT | ALPHA /* Digits or alphabetic characters */
FORMAT_DIGIT ::= ALPHA | DIGIT | HYPHEN | UNDERSCORE /* One element of a format specifier */
PLUS_MINUS ::= PLUS | HYPHEN /* Plus or minus */
LETTER_E ::= [eE] /* The letter E */
Note
The TEXT
character group must exclude low and high surrogates, as well as any characters that are invalid in a UTF-8 encoded document.
Features
Feature |
Coverage |
---|---|
core |
The full syntax outlined in this chapter is part of the core language. |
Errors
Error Code |
Causes |
---|---|
Encoding |
Raised if the parser detects invalid UTF-8 sequences. |
Character |
Raised if an illegal control character is read in the configuration document. |