english improved

2026-03-05 20:09:16 +01:00
parent c6609d15f5
commit 733fe8c290
21 changed files with 954 additions and 1042 deletions


@@ -17,11 +17,11 @@ Base.stdout = QuartoNotebookWorker.with_context(stdout)
There were - depending on manufacturer, country, programming language, operating system, etc. - a large variety of encodings.
Still relevant today:
Still relevant today are:
### ASCII
The _American Standard Code for Information Interchange_ was published as a standard in the USA in 1963.
The American Standard Code for Information Interchange (ASCII) was published as a standard in the USA in 1963.
- It defines $2^7=128$ characters, namely:
- 33 control characters, such as `newline`, `escape`, `end of transmission/file`, `delete`
@@ -40,7 +40,7 @@ The _American Standard Code for Information Interchange_ was published as a stan
### ISO 8859 Character Sets
- ASCII uses only 7 bits.
- In a byte, one can fit another 128 characters by setting the 8th bit.
- In a byte, you can fit another 128 characters by setting the 8th bit.
- In 1987/88, various 1-byte encodings were standardized in ISO 8859, all ASCII-compatible, including:
:::{.narrow}
@@ -61,11 +61,11 @@ The _American Standard Code for Information Interchange_ was published as a stan
## Unicode
The goal of the Unicode Consortium is a uniform encoding for all scripts of the world.
The goal of the Unicode Consortium is a uniform encoding for all scripts worldwide.
- Unicode version 1 was published in 1991
- Unicode version 15.1 was published in 2023 with 149,813 characters, including:
- 161 scripts
- Unicode version 17 was published in 2025 with 159,801 characters, including:
- 172 scripts
- mathematical and technical symbols
- Emojis and other symbols, control and formatting characters
- Over 90,000 characters are assigned to the CJK scripts (Chinese/Japanese/Korean)
@@ -73,21 +73,19 @@ The goal of the Unicode Consortium is a uniform encoding for all scripts of the
### Technical Details
- Each character is assigned a `codepoint`. This is simply a sequential number.
- This number is written hexadecimally
- either 4-digit as `U+XXXX` (0th plane)
- or 6-digit as `U+XXXXXX` (further planes)
- Each plane ranges from `U+XY0000` to `U+XYFFFF`, thus can contain $2^{16}=65\;536$ characters.
- 17 planes `XY=00` to `XY=10` are provided so far, thus the value range from `U+0000` to `U+10FFFF`.
- Each character is assigned a `codepoint`, which is simply a sequential number written in hexadecimal
- either with 4 digits as `U+XXXX` (zeroth plane)
- or with 6 digits as `U+XXXXXX` (further planes)
- Each plane ranges from `U+XY0000` to `U+XYFFFF`, thus containing $2^{16}=65\;536$ characters.
- 17 planes `XY=00` to `XY=10` are provided, giving a value range from `U+0000` to `U+10FFFF`.
- Thus, a maximum of 21 bits per character are needed.
- The total number of usable codepoints is slightly less than 0x10FFFF, because certain ranges are reserved for technical reasons. It is about 1.1 million, so plenty of room remains.
- So far, codepoints from the planes have been assigned only from
- Plane 0 = BMP _Basic Multilingual Plane_ `U+0000 - U+FFFF`,
- Plane 1 = SMP _Supplementary Multilingual Plane_ `U+010000 - U+01FFFF`,
- Plane 2 = SIP _Supplementary Ideographic Plane_ `U+020000 - U+02FFFF`,
- Plane 3 = TIP _Tertiary Ideographic Plane_ `U+030000 - U+03FFFF` and
- Plane 14 = SSP _Supplementary Special-purpose Plane_ `U+0E0000 - U+0EFFFF`
have been assigned.
- So far, codepoints have been assigned only from these planes:
- Plane 0 = BMP (Basic Multilingual Plane) `U+0000 - U+FFFF`,
- Plane 1 = SMP (Supplementary Multilingual Plane) `U+010000 - U+01FFFF`,
- Plane 2 = SIP (Supplementary Ideographic Plane) `U+020000 - U+02FFFF`,
- Plane 3 = TIP (Tertiary Ideographic Plane) `U+030000 - U+03FFFF`, and
- Plane 14 = SSP (Supplementary Special-purpose Plane) `U+0E0000 - U+0EFFFF`.
- `U+0000` to `U+007F` is identical to ASCII
- `U+0000` to `U+00FF` is identical to ISO 8859-1 (Latin-1)
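
The plane arithmetic can be checked with a small sketch. The helper `plane` is a hypothetical name introduced here for illustration; it is not part of the source or of Julia's standard library. It assumes only that `UInt32(c)` yields the codepoint of a `Char`.

```julia
# Hypothetical helper (an assumption, not from the source):
# the plane number is the codepoint with its lowest 16 bits shifted away.
plane(c::Char) = Int(UInt32(c) >> 16)

# 17 planes of 2^16 codepoints each cover U+0000 to U+10FFFF.
total = 17 * 2^16
println("total codepoints: ", total)

for c in ('A', 'ä', '€', '😄')
    cp = UInt32(c)
    # print the character, its codepoint in U+XXXX notation, and its plane
    println(c, "  U+", uppercase(string(cp, base = 16, pad = 4)), "  plane ", plane(c))
end
```

Characters from the BMP report plane 0, while the emoji lies in plane 1 (SMP).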
@@ -101,7 +99,7 @@ In the standard, each character is described by
- script direction
- category: uppercase letter, lowercase letter, modifier letter, digit, punctuation, symbol, separator,....
In the Unicode standard, this looks like this (simplified, only codepoint and name):
In the Unicode standard, an entry looks like this (simplified, only codepoint and name):
```
...
U+0041 LATIN CAPITAL LETTER A
@@ -144,8 +142,9 @@ Alternatively, you can use the PDF version of this page. There, all fonts are em
A small helper function:
```{julia}
function printuc(c, n)
for i in 0:n-1
print(c + i)
for i in 1:n
print(c + i - 1)
if i%70 == 0 print("\n") end
end
end
```
@@ -205,8 +204,8 @@ printuc('\U16a0', 40)
__Phaistos Disc__
- This script is not deciphered.
- It is unclear what language is represented.
- There is only one single document in this script: the Phaistos Disc from the Bronze Age
- It is unclear what language it represents.
- Only a single document in this script exists: the Phaistos Disc from the Bronze Age.
```{julia}
@@ -231,7 +230,7 @@ printuc('\U101D0', 46 )
_Unicode transformation formats_ define how a sequence of codepoints is represented as a sequence of bytes.
Since the codepoints are of different lengths, they cannot simply be written down one after the other. Where does one end and the next begin?
Since codepoints are of different lengths, they cannot simply be written down one after the other. Where does one end and the next begin?
- __UTF-32__: The simplest but also most memory-intensive approach is to make them all the same length. Each codepoint is encoded in 4 bytes = 32 bits.
- In __UTF-16__, a codepoint is represented either with 2 bytes or with 4 bytes.
@@ -243,14 +242,14 @@ Since the codepoints are of different lengths, they cannot simply be written dow
- For each codepoint, 1, 2, 3, or 4 full bytes are used.
- With variable-length encoding, one must be able to recognize which byte sequences belong together:
- With variable-length encoding, you must be able to recognize which byte sequences belong together:
- A byte of the form 0xxxxxxx represents an ASCII codepoint of length 1.
- A byte of the form 110xxxxx starts a 2-byte code.
- A byte of the form 1110xxxx starts a 3-byte code.
- A byte of the form 11110xxx starts a 4-byte code.
- All further bytes of a 2-, 3-, or 4-byte code have the form 10xxxxxx.
- Thus, the space available for the codepoint (number of x):
- Thus, the space available for the codepoint (number of x) is:
- One-byte code: 7 bits
- Two-byte code: 5 + 6 = 11 bits
- Three-byte code: 4 + 6 + 6 = 16 bits
@@ -258,24 +257,22 @@ Since the codepoints are of different lengths, they cannot simply be written dow
- Thus, every ASCII text is automatically also a correctly encoded UTF-8 text.
- If the 17 planes (= 21 bits = 1.1 million possible characters) defined for Unicode so far are ever expanded, UTF-8 will be expanded to 5- and 6-byte codes.
- If the 17 planes currently defined in Unicode (about 1.1 million codepoints, which fit into 21 bits) are ever exhausted, the UTF-8 scheme could be extended to 5- and 6-byte sequences, as the original UTF-8 design allowed.
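
The bit patterns listed above can be turned into a short sketch that encodes one codepoint by hand and compares the result with Julia's own encoding. The function `utf8_bytes` is a hypothetical helper written for this illustration; it is not part of the source or of Julia. It deliberately skips validity checks (surrogates, values above `U+10FFFF`).

```julia
# Hypothetical helper (an assumption, not from the source):
# encode a single codepoint into its UTF-8 bytes, without validity checks.
function utf8_bytes(cp::Integer)
    if cp < 0x80                       # 0xxxxxxx
        return UInt8[cp]
    elseif cp < 0x800                  # 110xxxxx 10xxxxxx
        return UInt8[0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    elseif cp < 0x10000                # 1110xxxx 10xxxxxx 10xxxxxx
        return UInt8[0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F),
                     0x80 | (cp & 0x3F)]
    else                               # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return UInt8[0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                     0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
    end
end

# Compare the hand-made encoding with the bytes Julia stores for the same character.
for c in "aä€😄"
    bytes = utf8_bytes(UInt32(c))
    println(c, "  ", bytes, "  matches Julia: ", bytes == codeunits(string(c)))
end
```

The leading byte alone tells a decoder how many continuation bytes follow, which is why the boundaries between characters can always be recovered.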
## Characters and Strings in Julia
## Characters and Character Strings in Julia
### Characters: `Char`
### Characters
The `Char` type encodes a single Unicode character.
- Julia uses single quotes for this: `'a'`.
- Julia uses single quotes for characters: `'a'`.
- A `Char` occupies 4 bytes of memory and
- represents a Unicode codepoint.
- `Char`s can be converted to/from `UInt`s and
- the integer value is equal to the Unicode codepoint.
`Char`s can be converted to/from `UInt`s.
`Char`s can be converted to/from `UInt`s:
```{julia}
UInt('a')
```
@@ -285,10 +282,10 @@ UInt('a')
b = Char(0x2656)
```
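
As a brief sketch of the conversions in both directions, the following uses only functions from Julia's `Base` (`codepoint`, `Char`, `sizeof`, `ncodeunits`); the variable names are chosen here for illustration.

```julia
c  = 'ä'
cp = codepoint(c)          # the codepoint as a UInt32
back = Char(cp)            # and back to a Char

@show cp back == c
@show sizeof(c)            # memory used by a Char value
@show ncodeunits(c)        # bytes this character needs in UTF-8

# Adding an integer to a Char moves along the codepoints.
next = 'a' + 1
@show next
```

This arithmetic on `Char` values is what allows a loop to walk through a block of consecutive codepoints.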
### Character Strings: `String`
### Strings
- For strings, Julia uses double quotes: `"a"`.
- They are UTF-8 encoded, i.e., one character can be between 1 and 4 bytes long.
- In Julia, strings are denoted with double quotes: `"a"`.
- These strings are encoded in UTF-8, where a single character may consist of 1 to 4 bytes.
```{julia}
@@ -305,7 +302,6 @@ __For a non-ASCII string, the number of bytes and the number of characters diffe
asciistr = "Hello World!"
@show length(asciistr) ncodeunits(asciistr);
```
(The space, of course, also counts.)
```{julia}
str = "😄 Hellö 🎶"
@@ -323,7 +319,7 @@ end
### Concatenation of Strings
"Strings with concatenation form a non-commutative monoid."
Strings with concatenation form a non-commutative monoid.
Therefore, Julia writes concatenation multiplicatively, using `*` for concatenation and `^` for repetition.
```{julia}
@@ -338,9 +334,7 @@ str^3, str^0
### String Interpolation
The dollar sign has a special function in strings, which we have often used in
`print()` statements. One can interpolate a variable or expression with it:
The dollar sign has a special function in strings, which we have often used in `print()` statements: it interpolates the value of a variable or expression into the string.
```{julia}
a = 33.4
@@ -351,8 +345,8 @@ s = "The result for $b is equal to $a and the doubled square root of it is $(2sq
### Backslash Escape Sequences
The _backslash_ `\` also has a special function in string constants.
Julia uses the backslash codings known from C and other languages for special characters and for dollar signs and backslashes themselves:
The backslash `\` also has a special function in string constants.
Julia uses the backslash escape sequences known from C and other languages for special characters, and also for dollar signs and backslashes themselves:
```{julia}
@@ -364,9 +358,7 @@ print(s)
### Triple Quotes
Strings can also be delimited with triple quotes.
In this form, line breaks and quotes are preserved:
Strings may also be enclosed in triple quotes, preserving line breaks and embedded quotes:
```{julia}
s = """
@@ -380,8 +372,7 @@ print(s)
### Raw Strings
In a `raw string`, all backslash codings except `\"` are disabled:
In a `raw string`, all backslash escape sequences except for `\"` are disabled:
```{julia}
s = raw"A $ and a \ and two \\ and a 'bla'..."
@@ -400,7 +391,7 @@ print(s)
### Application to Strings
These tests can e.g. be used with `all()`, `any()`, or `count()` on strings:
These tests can be used on strings with `all()`, `any()`, or `count()`:
```{julia}
@@ -449,11 +440,10 @@ replace("π is irrational.", "is" => "is allegedly")
## Indexing of Strings
Strings are immutable but indexable. There are a few special features here.
Strings are immutable but indexable, with a few special features:
- The index numbers the bytes of the string.
- For a non-ASCII string, not all indices are valid, because
- a valid index always addresses a Unicode character.
- For a non-ASCII string, not all indices are valid, because a valid index always points to the first byte of a Unicode character.
Our example string:
```{julia}
@@ -476,7 +466,7 @@ Only the 5th byte is a new character:
str[5]
```
Even when addressing substrings, start and end must always be valid indices, i.e., the end index must also index the first byte of a character, and that character is the last of the substring.
Even when addressing substrings, start and end must always be valid indices; i.e., the end index must also index the first byte of a character, and that character is the last of the substring.
```{julia}
str[1:7]
@@ -504,8 +494,8 @@ The function `nextind()` returns the next valid index.
Why does Julia use a byte index instead of a character index? The main reason is the efficiency of indexing.
- In a long string, e.g., a book text, the position `s[123455]` can be found quickly with a byte index.
- A character index would have to traverse the entire string in UTF-8 encoding to find the n-th character, since the characters can be 1, 2, 3, or 4 bytes long.
- In a long string (e.g., book text), the position `s[123455]` can be found quickly with a byte index.
- A character index would have to traverse the entire string in UTF-8 encoding to find the n-th character, since characters can be 1, 2, 3, or 4 bytes long.
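
The difference between byte indices and character positions can be seen in a short sketch. The helper `nthchar` is a hypothetical name introduced here, not part of the source or of Julia; it relies on the three-argument form `nextind(s, i, n)`, which steps forward `n` characters from index `i`.

```julia
s = "αβγ"                          # each of these letters takes 2 bytes in UTF-8

valid = collect(eachindex(s))      # the valid byte indices of s
@show valid
@show isvalid(s, 2)                # index 2 lies inside 'α'
@show s[nextind(s, 1)]             # the character following s[1]

# Hypothetical helper (an assumption, not from the source):
# the n-th character, found by stepping through the string from the start.
nthchar(s, n) = s[nextind(s, 0, n)]
@show nthchar(s, 3)
```

A lookup by byte index is a direct memory access, whereas `nthchar` has to walk over all preceding characters, which is the cost the byte index avoids.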
Some functions return indices or ranges as results. They always return valid indices:
@@ -530,7 +520,7 @@ str2 = "αβγδϵ"^3
n = findfirst('γ', str2)
```
So one can continue searching from the next valid index after `n=5`:
So you can continue searching from the next valid index after `n=5`:
```{julia}
findnext('γ', str2, nextind(str2, n))