---
engine: julia
---

```{julia}
#| error: false
#| echo: false
#| output: false
using InteractiveUtils
import QuartoNotebookWorker
Base.stdout = QuartoNotebookWorker.with_context(stdout)
```

# Characters, Strings, and Unicode

## Character Encodings (Early History)

Depending on manufacturer, country, programming language, operating system, etc., there used to be a large variety of encodings.

Still relevant today:


### ASCII
The _American Standard Code for Information Interchange_ was published as a standard in the USA in 1963.

- It defines $2^7=128$ characters, namely:
    - 33 control characters, such as `newline`, `escape`, `end of transmission/file`, `delete`
    - 95 graphically printable characters:
        - 52 Latin letters `a-z, A-Z`
        - 10 digits `0-9`
        - 7 punctuation marks `.,:;?!"`
        - 1 space ` `
        - 6 brackets `[{()}]`
        - 7 mathematical operators `+-*/<>=`
        - 12 special characters ``` #$%&'\^_|~`@ ```

- ASCII is still the "lowest common denominator" in the encoding chaos.
- The first 128 Unicode characters are identical to ASCII.
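
The split into 33 control characters and 95 printable characters can be checked directly in Julia. The following is a minimal sketch that relies on the standard predicates `iscntrl` and `isprint` and on the fact that the codepoints 0 to 127 coincide with ASCII; the variable name `ascii_chars` is only illustrative.

```{julia}
# all 128 ASCII characters, built from their numeric codes
ascii_chars = Char.(0:127)

@show count(iscntrl, ascii_chars) count(isprint, ascii_chars);
```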

### ISO 8859 Character Sets

- ASCII uses only 7 bits.
- In a byte, one can fit another 128 characters by setting the 8th bit.
- Starting in 1987/88, various 1-byte encodings were standardized in ISO 8859, all ASCII-compatible, including:

:::{.narrow}
|Encoding | Region | Languages|
|:-----------|:----------|:-------|
|ISO 8859-1 (Latin-1) | Western Europe | German, French, ..., Icelandic |
|ISO 8859-2 (Latin-2) | Eastern Europe | Slavic languages with Latin script |
|ISO 8859-3 (Latin-3) | Southern Europe | Turkish, Maltese, ... |
|ISO 8859-4 (Latin-4) | Northern Europe | Estonian, Latvian, Lithuanian, Greenlandic, Sami |
|ISO 8859-5 (Latin/Cyrillic) | Eastern Europe | Slavic languages with Cyrillic script |
|ISO 8859-6 (Latin/Arabic) | | |
|ISO 8859-7 (Latin/Greek) | | |
|...| | |
|ISO 8859-15 (Latin-9)| | 1999 revision of Latin-1, now including the euro sign |

:::


## Unicode

The goal of the Unicode Consortium is a uniform encoding for all scripts of the world.

- Unicode version 1 was published in 1991.
- Unicode version 15.1 was published in 2023 with 149,813 characters, including:
    - 161 scripts
    - mathematical and technical symbols
    - emojis and other symbols, control and formatting characters
- Over 90,000 characters are assigned to the CJK scripts (Chinese/Japanese/Korean).


### Technical Details

- Each character is assigned a `codepoint`. This is simply a sequential number.
- This number is written in hexadecimal
    - either with 4 digits as `U+XXXX` (plane 0)
    - or with 6 digits as `U+XXXXXX` (further planes)
- Each plane ranges from `U+XY0000` to `U+XYFFFF` and can thus contain $2^{16}=65\;536$ characters.
- 17 planes `XY=00` to `XY=10` are provided so far, giving the value range from `U+0000` to `U+10FFFF`.
- Thus, a maximum of 21 bits per character is needed.
- The number of usable codepoints is slightly less than 0x110000 = 1,114,112, as certain ranges are reserved for technical reasons (for example, the 2,048 surrogate codepoints `U+D800` to `U+DFFF`). It is about 1.1 million, so there is still much room.
- So far, codepoints have been assigned only in the following planes:
    - Plane 0 = BMP _Basic Multilingual Plane_ `U+0000 - U+FFFF`,
    - Plane 1 = SMP _Supplementary Multilingual Plane_ `U+010000 - U+01FFFF`,
    - Plane 2 = SIP _Supplementary Ideographic Plane_ `U+020000 - U+02FFFF`,
    - Plane 3 = TIP _Tertiary Ideographic Plane_ `U+030000 - U+03FFFF` and
    - Plane 14 = SSP _Supplementary Special-purpose Plane_ `U+0E0000 - U+0EFFFF`.
- `U+0000` to `U+007F` is identical to ASCII.
- `U+0000` to `U+00FF` is identical to ISO 8859-1 (Latin-1).
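
As a small illustration of the plane structure and of the Latin-1 compatibility, the following sketch computes the plane of a few characters and decodes a Latin-1 byte sequence. `codepoint` is a Base function; the helpers `plane` and `uplus` and the variable `latin1_bytes` are made-up names for this example.

```{julia}
# plane number = codepoint divided by 2^16 (hypothetical helper)
plane(c::Char) = Int(codepoint(c) >> 16)

# format a codepoint in the usual U+XXXX notation (hypothetical helper)
uplus(c::Char) = "U+" * uppercase(string(codepoint(c), base = 16, pad = 4))

for c in ['A', 'ä', '€', '😄']
    println(c, "  ", uplus(c), "  plane ", plane(c))
end

# Latin-1 bytes are their own codepoints, so decoding is a plain conversion
latin1_bytes = UInt8[0x47, 0x72, 0xfc, 0xdf, 0x65]
join(Char.(latin1_bytes))
```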

### Properties of Unicode Characters

In the standard, each character is described by

- its codepoint (number),
- a name (which consists only of ASCII uppercase letters, digits, spaces, and hyphens), and
- various attributes such as
    - script direction
    - category: uppercase letter, lowercase letter, modifier letter, digit, punctuation, symbol, separator, ...

In the Unicode standard, this looks as follows (simplified, only codepoint and name):

```
...
U+0041 LATIN CAPITAL LETTER A
U+0042 LATIN CAPITAL LETTER B
U+0043 LATIN CAPITAL LETTER C
U+0044 LATIN CAPITAL LETTER D
...
U+00E9 LATIN SMALL LETTER E WITH ACUTE
U+00EA LATIN SMALL LETTER E WITH CIRCUMFLEX
...
U+0641 ARABIC LETTER FEH
U+0642 ARABIC LETTER QAF
...
U+21B4 RIGHTWARDS ARROW WITH CORNER DOWNWARDS
...
```

What does 'RIGHTWARDS ARROW WITH CORNER DOWNWARDS' look like?

Julia uses `\U...` for input of Unicode codepoints: `\u` followed by up to four or `\U` followed by up to eight hexadecimal digits.

```{julia}
'\U21b4'
```
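
The category attribute can also be queried from Julia. The following sketch uses `Base.Unicode.category_abbrev` and `Base.Unicode.category_string`; these are unexported helpers inside Base, so their availability should be treated as an assumption about the Julia version in use, not as a stable API.

```{julia}
# two-letter category code and its description for a few characters
for c in ['A', 'é', '\u0641', '7', ';', '\U21b4']
    println(c, "  ", Base.Unicode.category_abbrev(c), "  ",
            Base.Unicode.category_string(c))
end
```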


### A Selection of Scripts

::: {.content-visible when-format="html"}

:::{.callout-note}
If individual characters or scripts cannot be displayed in your browser, you need to install appropriate
fonts on your computer.

Alternatively, you can use the PDF version of this page. There, all fonts are embedded.
:::

:::

A small helper function:

```{julia}
# print n consecutive Unicode characters, starting at character c
function printuc(c, n)
    for i in 0:n-1
        print(c + i)
    end
end
```

__Cyrillic__

```{julia}
printuc('\U0400', 100)
```

__Tamil__

:::{.cellmerge}
```{julia}
#| echo: true
#| output: false
printuc('\U0be7',20)
```

\begingroup\setmonofont{Noto Sans Tamil}

```{julia}
#| echo: false
#| output: true
printuc('\U0be7',20)
```

\endgroup

:::

__Chess__

```{julia}
printuc('\U2654', 12)
```

__Mathematical Operators__

```{julia}
printuc('\U2200', 255)
```

__Runes__

```{julia}
printuc('\U16a0', 40)
```

:::{.cellmerge}

__Phaistos Disc__

- This script has not been deciphered.
- It is unclear what language it represents.
- There is only a single document in this script: the Phaistos Disc from the Bronze Age.

```{julia}
#| echo: true
#| output: false
printuc('\U101D0', 46 )
```

\begingroup\setmonofont{Phaistos.otf}

```{julia}
#| echo: false
#| output: true
printuc('\U101D0', 46 )
```

\endgroup

:::

### Unicode Transformation Formats: UTF-8, UTF-16, UTF-32

_Unicode transformation formats_ define how a sequence of codepoints is represented as a sequence of bytes.

Since the codepoints are of different lengths, they cannot simply be written down one after the other. Where does one end and the next begin?

- __UTF-32__: The simplest but also most memory-intensive approach is to make them all the same length. Each codepoint is encoded in 4 bytes = 32 bits.
- In __UTF-16__, a codepoint is represented with either 2 or 4 bytes.
- In __UTF-8__, a codepoint is represented with 1, 2, 3, or 4 bytes.
- __UTF-8__ is the most widely used format. Julia also uses it.
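
To get a feeling for the memory requirements, the following sketch compares the size of one short string in the three formats. It uses the Base functions `sizeof`, `length`, and `transcode`; the string `sample` and the three result names are chosen only for this illustration.

```{julia}
sample = "aä€😄"   # characters needing 1, 2, 3 and 4 bytes in UTF-8

utf8_bytes_total  = sizeof(sample)                         # 1-byte code units
utf16_bytes_total = 2 * length(transcode(UInt16, sample))  # 2-byte code units
utf32_bytes_total = 4 * length(sample)                     # 4 bytes per codepoint

@show utf8_bytes_total utf16_bytes_total utf32_bytes_total;
```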


### UTF-8

- For each codepoint, 1, 2, 3, or 4 full bytes are used.

- With variable-length encoding, one must be able to recognize which byte sequences belong together:
    - A byte of the form `0xxxxxxx` represents an ASCII codepoint of length 1.
    - A byte of the form `110xxxxx` starts a 2-byte code.
    - A byte of the form `1110xxxx` starts a 3-byte code.
    - A byte of the form `11110xxx` starts a 4-byte code.
    - All further bytes of a 2-, 3-, or 4-byte code have the form `10xxxxxx`.

- Thus, the space available for the codepoint (number of `x`) is:
    - One-byte code: 7 bits
    - Two-byte code: 5 + 6 = 11 bits
    - Three-byte code: 4 + 6 + 6 = 16 bits
    - Four-byte code: 3 + 6 + 6 + 6 = 21 bits

- Thus, every ASCII text is automatically also a correctly encoded UTF-8 text.

- If the 17 planes defined for Unicode so far (about 1.1 million possible codepoints, which fit into 21 bits) are ever extended, UTF-8 could be extended to 5- and 6-byte codes.
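
The bit patterns above can be turned into a small encoder. This is only an illustrative sketch, not how Julia implements it internally: the function name `utf8_encode` is made up, and it performs no validity checks (surrogates and values above `U+10FFFF` are not rejected).

```{julia}
# hand-written UTF-8 encoder following the byte patterns listed above
function utf8_encode(cp::Integer)
    cp = UInt32(cp)
    if cp < 0x80            # 0xxxxxxx
        UInt8[cp]
    elseif cp < 0x800       # 110xxxxx 10xxxxxx
        UInt8[0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)]
    elseif cp < 0x10000     # 1110xxxx 10xxxxxx 10xxxxxx
        UInt8[0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)]
    else                    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        UInt8[0xf0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3f),
              0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)]
    end
end

# show the resulting bytes in binary for 1-, 2-, 3- and 4-byte codes
for c in ['a', 'ä', '€', '😄']
    println(c, "  ", join(bitstring.(utf8_encode(codepoint(c))), " "))
end
```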


## Characters and Character Strings in Julia

### Characters: `Char`

The `Char` type encodes a single Unicode character.

- Julia uses single quotes for this: `'a'`.
- A `Char` occupies 4 bytes of memory and represents a Unicode codepoint.
- `Char`s can be converted to and from `UInt`s; the integer value is equal to the Unicode codepoint.

```{julia}
UInt('a')
```

```{julia}
b = Char(0x2656)
```
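
Because a `Char` corresponds to an integer codepoint, simple arithmetic on characters is possible. A brief sketch using only Base functionality (`codepoint`, `Int`, and the `+`/`-` methods for `Char`):

```{julia}
@show codepoint('a') Int('a');
@show 'a' + 1;      # the character one codepoint after 'a'
@show 'z' - 'a';    # distance between two codepoints
```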

### Character Strings: `String`

- For strings, Julia uses double quotes: `"a"`.
- They are UTF-8 encoded, i.e., one character can be between 1 and 4 bytes long.

```{julia}
@show typeof('a') sizeof('a') typeof("a") sizeof("a");
```

__For a non-ASCII string, the number of bytes and the number of characters differ:__

```{julia}
asciistr = "Hello World!"
@show length(asciistr) ncodeunits(asciistr);
```
(The space, of course, also counts.)

```{julia}
str = "😄 Hellö 🎶"
@show length(str) ncodeunits(str);
```
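
Where the difference comes from can be seen by looking at the individual characters. The sketch below uses `ncodeunits` on single characters and `codeunits` on a string; the variable `word` is just an example value.

```{julia}
word = "Hellö"

# number of UTF-8 bytes (code units) of each character
for c in word
    println(repr(c), "  ", ncodeunits(c), " byte(s)")
end

# the raw bytes of the string
codeunits(word)
```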

__Iterating over a string iterates over the characters:__

```{julia}
for i in str
    println(i, " ", typeof(i))
end
```

### Concatenation of Strings

"Strings with concatenation form a non-commutative monoid."

Therefore, Julia writes concatenation multiplicatively.

```{julia}
str * asciistr * str
```

Powers with natural exponents are thus also defined.

```{julia}
str^3, str^0
```
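
The monoid properties can be spelled out on a small example: concatenation is associative, the empty string is the neutral element, and the order of the factors matters. The names `x`, `y`, `z` are arbitrary.

```{julia}
x, y, z = "ab", "cd", "ef"

@show (x * y) * z == x * (y * z);   # associative
@show x * "" == x == "" * x;        # "" is the neutral element
@show x * y == y * x;               # not commutative
@show x^3 == x * x * x;             # powers are repeated products
@show x^0 == "";
```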

### String Interpolation

The dollar sign has a special function in strings, which we have often used in
`print()` statements. One can interpolate a variable or expression with it:

```{julia}
a = 33.4
b = "x"

s = "The result for $b is equal to $a and the doubled square root of it is $(2sqrt(a))\n"
```

### Backslash Escape Sequences

The _backslash_ `\` also has a special function in string constants.
Julia uses the backslash escape sequences known from C and other languages for special characters and for dollar signs and backslashes themselves:

```{julia}
s = "This is how one gets \'quotes\" and a \$ sign and a\nline break and a \\ etc... "
print(s)
```


### Triple Quotes

Strings can also be delimited with triple quotes.
In this form, line breaks and quotes are preserved:

```{julia}
s = """
This should
be a "longer"
'text'.
"""

print(s)
```

### Raw Strings

In a `raw string`, all backslash escape sequences except `\"` are disabled:

```{julia}
s = raw"A $ and a \ and two \\ and a 'bla'..."
print(s)
```
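
Raw strings are convenient wherever many backslashes occur, for example in regular expression patterns. A small sketch comparing both spellings of the same pattern; the variable names are illustrative, and `Regex(...)` is the Base constructor for regular expressions.

```{julia}
pattern_escaped = "\\d+\\.\\d+"     # ordinary string: every backslash doubled
pattern_raw     = raw"\d+\.\d+"     # raw string: written as is

@show pattern_raw == pattern_escaped;
@show occursin(Regex(pattern_raw), "pi is about 3.14");
```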

## Further Functions for Characters and Strings (Selection)

### Tests for Characters

```{julia}
@show isdigit('0') isletter('Ψ') isascii('\U2655') islowercase('α')
@show isnumeric('½') iscntrl('\n') ispunct(';');
```

### Application to Strings

These tests can be used, for example, with `all()`, `any()`, or `count()` on strings:

```{julia}
all(ispunct, ";.:")
```

```{julia}
any(isdigit, "It is 3 o'clock! 🕒" )
```

```{julia}
count(islowercase, "Hello, du!!")
```

### Other String Functions

```{julia}
@show startswith("Lampenschirm", "Lamp") occursin("pensch", "Lampenschirm")
@show endswith("Lampenschirm", "irm");
```

```{julia}
@show uppercase("Eis") lowercase("Eis") titlecase("eiSen");
```

```{julia}
# remove newline from end of string

@show chomp("Eis\n") chomp("Eis");
```

```{julia}
split("π is irrational.")
```

```{julia}
replace("π is irrational.", "is" => "is allegedly")
```

## Indexing of Strings

Strings are immutable but indexable. There are a few special features here.

- The index numbers the bytes of the string.
- For a non-ASCII string, not all indices are valid, because a valid index always addresses the first byte of a Unicode character.

Our example string:

```{julia}
str
```

The first character:

```{julia}
str[1]
```

This character is 4 bytes long in UTF-8 encoding. Thus, 2, 3, and 4 are invalid indices.

```{julia}
str[2]
```

The next character starts only at the 5th byte:

```{julia}
str[5]
```

Even when addressing substrings, start and end must always be valid indices, i.e., the end index must also point to the first byte of a character, and that character is the last one of the substring.

```{julia}
str[1:7]
```

The function `eachindex()` returns an iterator over the valid indices:

```{julia}
for i in eachindex(str)
    c = str[i]
    println("$i: $c")
end
```

As usual, `collect()` turns an iterator into a vector.

```{julia}
collect(eachindex(str))
```

The function `nextind()` returns the next valid index.

```{julia}
@show nextind(str, 1) nextind(str, 2);
```
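
Besides `nextind()`, Base offers a few more helpers for working safely with byte indices: `isvalid()` tests an index, `thisind()` returns the start of the character that a byte belongs to, and `prevind()` steps backwards. A short sketch on a separate example string `demo` (an arbitrary name):

```{julia}
demo = "aä€😄"     # characters start at bytes 1, 2, 4 and 7

@show isvalid(demo, 2) isvalid(demo, 3);
@show thisind(demo, 3) prevind(demo, 4) nextind(demo, 4);
@show lastindex(demo) ncodeunits(demo);
```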

Why does Julia use a byte index instead of a character index? The main reason is the efficiency of indexing.

- In a long string, e.g., the text of a book, the position `s[123455]` can be found quickly with a byte index.
- A character index would have to traverse the UTF-8 encoded string from the beginning to find the n-th character, since the characters can be 1, 2, 3, or 4 bytes long.
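
If access by character number is needed nevertheless, it can be built from `nextind()`, whose three-argument form advances by several characters at once. The helper `nthchar` and the string `book` below are hypothetical names for this sketch; the cost of `nthchar` grows with the character number, while the byte index is a direct access.

```{julia}
# k-th character of s (hypothetical helper): skips over the first k-1 characters
nthchar(s::AbstractString, k::Integer) = s[nextind(s, 0, k)]

book = "αβγ"^1000 * "😄"   # 3001 characters, all longer than one byte

@show nthchar(book, 3) nthchar(book, 3001);
@show book[5];             # direct access via byte index
```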

Some functions return indices or ranges as results. They always return valid indices:

```{julia}
findfirst('l', str)
```

```{julia}
findfirst("Hel", str)
```

```{julia}
str2 = "αβγδϵ"^3
```

```{julia}
n = findfirst('γ', str2)
```

So one can continue searching from the next valid index after `n=5`:

```{julia}
findnext('γ', str2, nextind(str2, n))
```
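
Repeating this step yields all positions of a character. A minimal sketch, assuming a helper named `allpositions` (not part of Base) that combines `findfirst()`, `findnext()`, and `nextind()`:

```{julia}
# collect the byte indices of all occurrences of character c in s
function allpositions(c::AbstractChar, s::AbstractString)
    positions = Int[]
    i = findfirst(c, s)
    while i !== nothing
        push!(positions, i)
        i = findnext(c, s, nextind(s, i))   # continue after the match
    end
    return positions
end

allpositions('γ', "αβγδϵ"^3)
```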