---
engine: julia
---

```{julia}
#| error: false
#| echo: false
#| output: false
using InteractiveUtils
import QuartoNotebookWorker
Base.stdout = QuartoNotebookWorker.with_context(stdout)
```

# Characters, Strings, and Unicode

## Character Encodings (Early History)

Depending on manufacturer, country, programming language, operating system, and so on, a large variety of encodings was in use.

Still relevant today are:


### ASCII
The American Standard Code for Information Interchange (ASCII) was published as a standard in the USA in 1963.

- It defines $2^7=128$ characters, namely:
  - 33 control characters, such as `newline`, `escape`, `end of transmission/file`, `delete`
  - 95 graphically printable characters:
    - 52 Latin letters `a-z, A-Z`
    - 10 digits `0-9`
    - 7 punctuation marks `.,:;?!"`
    - 1 space ` `
    - 6 parentheses `[{()}]`
    - 7 mathematical operations `+-*/<>=`
    - 12 special characters ``` #$%&'\^_|~`@ ```

- ASCII is still the "lowest common denominator" in the encoding chaos.
- The first 128 Unicode characters are identical to ASCII.
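
The counts above can be checked directly in Julia. This is a small sketch, assuming that the `Base` predicates `iscntrl` and `isprint` classify the ASCII range in the same way as the list above; the name `ascii_chars` is only illustrative.

```{julia}
# all 128 ASCII characters as a range of Char values
ascii_chars = '\0':'\x7f'
@show length(ascii_chars) count(iscntrl, ascii_chars) count(isprint, ascii_chars);
```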

### ISO 8859 Character Sets

- ASCII uses only 7 bits.
- In a byte, you can fit another 128 characters by setting the 8th bit.
- In 1987/88, various 1-byte encodings were standardized in ISO 8859, all ASCII-compatible, including:

:::{.narrow}
  |Encoding | Region | Languages |
  |:-----------|:----------|:-------|
  |ISO 8859-1 (Latin-1) | Western Europe | German, French, ..., Icelandic |
  |ISO 8859-2 (Latin-2) | Eastern Europe | Slavic languages with Latin script |
  |ISO 8859-3 (Latin-3) | Southern Europe | Turkish, Maltese, ... |
  |ISO 8859-4 (Latin-4) | Northern Europe | Estonian, Latvian, Lithuanian, Greenlandic, Sami |
  |ISO 8859-5 (Latin/Cyrillic) | Eastern Europe | Slavic languages with Cyrillic script |
  |ISO 8859-6 (Latin/Arabic) | | |
  |ISO 8859-7 (Latin/Greek) | | |
  |... | | |
  |ISO 8859-15 (Latin-9) | | 1999: Revision of Latin-1, now including the Euro sign |

:::


## Unicode

The goal of the Unicode Consortium is a uniform encoding for all scripts worldwide.

- Unicode version 1 was published in 1991
- Unicode version 17 was published in 2025 with 159,801 characters, including:
   - 172 scripts
   - mathematical and technical symbols
   - Emojis and other symbols, control and formatting characters
- Over 90,000 characters are assigned to the CJK scripts (Chinese/Japanese/Korean)


### Technical Details

- Each character is assigned a `codepoint`, which is simply a sequential number, written in hexadecimal
    - either with 4 digits as `U+XXXX` (zeroth plane)
    - or with 6 digits as `U+XXXXXX` (further planes)
- Each plane ranges from `U+XY0000` to `U+XYFFFF`, thus containing $2^{16}=65\;536$ codepoints.
- 17 planes `XY=00` to `XY=10` are provided, giving a value range from `U+0000` to `U+10FFFF`.
- Thus, a maximum of 21 bits per character are needed.
- The total number of usable codepoints is slightly less than $17\cdot 2^{16} = 1\;114\;112$, as certain areas are not used for technical reasons (for example, the 2048 surrogate codepoints `U+D800` to `U+DFFF` are reserved for UTF-16). That leaves about 1.1 million, so there is still plenty of room.
- So far, codepoints have been assigned only from these planes:
    - Plane 0 = BMP (Basic Multilingual Plane)  `U+0000 - U+FFFF`,
    - Plane 1 = SMP (Supplementary Multilingual Plane)  `U+010000 - U+01FFFF`,
    - Plane 2 = SIP (Supplementary Ideographic Plane)    `U+020000 - U+02FFFF`,
    - Plane 3 = TIP (Tertiary Ideographic Plane)     `U+030000 - U+03FFFF`, and
    - Plane 14 = SSP (Supplementary Special-purpose Plane) `U+0E0000 - U+0EFFFF`.
- `U+0000` to `U+007F` is identical to ASCII
- `U+0000` to `U+00FF` is identical to ISO 8859-1 (Latin-1)
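
The plane arithmetic can be reproduced in a few lines. This is only a sketch: the helper `plane` is a hypothetical name introduced here, not a function from `Base`, and the second line simply recomputes the codepoint counts, subtracting the 2048 surrogate codepoints `U+D800` to `U+DFFF` as one example of a reserved area.

```{julia}
# plane number = codepoint divided by 2^16 (illustrative helper, not part of Base)
plane(c::Char) = Int(codepoint(c) >> 16)

@show plane('A') plane('\u21b4') plane('\U101D0') plane('\U1F604')
@show 17 * 2^16  Int(0x110000) - 2048;   # all codepoints; all except the surrogates
```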

### Properties of Unicode Characters

In the standard, each character is described by

- its codepoint (number)
- a name (which consists only of ASCII uppercase letters, digits, spaces, and hyphens) and
- various attributes such as
  - script direction
  - category: uppercase letter, lowercase letter, modifier letter, digit, punctuation, symbol, separator, ...

In the Unicode standard, this looks as follows (simplified, only codepoint and name):
```
...
U+0041 LATIN CAPITAL LETTER A
U+0042 LATIN CAPITAL LETTER B
U+0043 LATIN CAPITAL LETTER C
U+0044 LATIN CAPITAL LETTER D
...
U+00E9 LATIN SMALL LETTER E WITH ACUTE
U+00EA LATIN SMALL LETTER E WITH CIRCUMFLEX
...
U+0641 ARABIC LETTER FEH
U+0642 ARABIC LETTER QAF
...
U+21B4 RIGHTWARDS ARROW WITH CORNER DOWNWARDS
...
```

What does 'RIGHTWARDS ARROW WITH CORNER DOWNWARDS' look like?

Julia uses `\U...` (followed by up to 8 hexadecimal digits) or `\u...` (up to 4 hexadecimal digits) for input of Unicode codepoints.

```{julia}
'\U21b4'
```
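
As a quick check, the escape forms and the `Char` constructor should all yield the same character. The sketch below assumes nothing beyond `Base`: `\u` takes up to 4 hexadecimal digits, `\U` up to 8, and both also work inside string literals.

```{julia}
@show '\u21b4' == '\U21b4' == Char(0x21b4)
@show "\u03b1\u03b2" == "αβ";
```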


### A Selection of Scripts

::: {.content-visible when-format="html"}

:::{.callout-note}
If individual characters or scripts are not displayable in your browser, you must install appropriate
fonts on your computer.

Alternatively, you can use the PDF version of this page. There, all fonts are embedded.
:::

:::

A small helper function:
```{julia}
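# Print the n consecutive Unicode characters starting at character c,
# with a line break after every 70 characters.
function printuc(c, n)
    for i in 1:n
        print(c + i - 1)      # Char + Int gives the Char with the shifted codepoint
        if i % 70 == 0
            print("\n")
        end
    end
end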
```

__Cyrillic__


```{julia}
printuc('\U0400', 100)
```

__Tamil__


:::{.cellmerge}
```{julia}
#| echo: true
#| output: false
printuc('\U0be7',20)
```

\begingroup\setmonofont{Noto Sans Tamil}

```{julia}
#| echo: false
#| output: true
printuc('\U0be7',20)
```

\endgroup

:::

__Chess__


```{julia}
printuc('\U2654', 12)
```

__Mathematical Operators__


```{julia}
printuc('\U2200', 255)
```

__Runes__


```{julia}
printuc('\U16a0', 40)
```

:::{.cellmerge}

__Phaistos Disc__

- This script is not deciphered.
- It is unclear what language it represents.
- There is only one single document in this script: the Phaistos Disc from the Bronze Age.


```{julia}
#| echo: true
#| output: false
printuc('\U101D0', 46 )
```

\begingroup\setmonofont{Phaistos.otf}

```{julia}
#| echo: false
#| output: true
printuc('\U101D0', 46 )
```

\endgroup

:::

### Unicode Transformation Formats: UTF-8, UTF-16, UTF-32

_Unicode transformation formats_ define how a sequence of codepoints is represented as a sequence of bytes.

Since codepoints are of different lengths, they cannot simply be written down one after the other. Where does one end and the next begin?

- __UTF-32__: The simplest but also most memory-intensive approach is to make all codes the same length. Each codepoint is encoded in 4 bytes = 32 bits.
- In __UTF-16__, a codepoint is represented either with 2 bytes or with 4 bytes.
- In __UTF-8__, a codepoint is represented with 1, 2, 3, or 4 bytes.
- __UTF-8__ is the format with the highest prevalence. Julia also uses it.
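
The three formats can be compared on a short string. This is a minimal sketch using `codeunits` and `transcode` from `Base`; the variable names `sample`, `utf8`, `utf16`, and `utf32` are only illustrative.

```{julia}
sample = "a€😄"                      # 1-, 3- and 4-byte characters in UTF-8
utf8  = codeunits(sample)            # UTF-8 code units (UInt8)
utf16 = transcode(UInt16, sample)    # UTF-16 code units (UInt16)
utf32 = transcode(UInt32, sample)    # UTF-32 code units (UInt32), i.e. the codepoints
@show length(sample) length(utf8) length(utf16) length(utf32)
@show sizeof(sample) sizeof(utf16) sizeof(utf32);   # memory in bytes
```

The emoji lies outside the BMP, so it needs two 16-bit units in UTF-16, while `a` and `€` need one each.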


### UTF-8

- For each codepoint, 1, 2, 3, or 4 full bytes are used.

- With variable-length encoding, you must be able to recognize which byte sequences belong together:
    - A byte of the form 0xxxxxxx represents an ASCII codepoint of length 1.
    - A byte of the form 110xxxxx starts a 2-byte code.
    - A byte of the form 1110xxxx starts a 3-byte code.
    - A byte of the form 11110xxx starts a 4-byte code.
    - All further bytes of a 2-, 3-, or 4-byte code have the form 10xxxxxx.

- Thus, the space available for the codepoint (number of x) is:
     - One-byte code:  7 bits
     - Two-byte code: 5 + 6 = 11 bits
     - Three-byte code: 4 + 6 + 6 = 16 bits
     - Four-byte code: 3 + 6 + 6 + 6 = 21 bits

- Thus, every ASCII text is automatically also a correctly encoded UTF-8 text.

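- If the 17 planes (equivalent to 21 bits, i.e. approximately 1.1 million possible characters) currently defined in Unicode are ever exhausted, the UTF-8 scheme could in principle be extended to 5- and 6-byte code sequences. The original UTF-8 design allowed such sequences; the current standard (RFC 3629) restricts UTF-8 to at most 4 bytes.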
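
The bit patterns above translate almost literally into code. The following is a sketch for illustration only: the function `utf8_bytes` is a hypothetical helper, not Julia's internal encoder, and it does not reject invalid input such as surrogates or values above `0x10FFFF`. The last line compares its result with the bytes Julia itself stores.

```{julia}
# Encode one codepoint as UTF-8 bytes (illustrative helper, no input validation)
function utf8_bytes(cp::Integer)
    cp = UInt32(cp)
    if cp < 0x80            # 0xxxxxxx
        return UInt8[cp]
    elseif cp < 0x800       # 110xxxxx 10xxxxxx
        return UInt8[0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)]
    elseif cp < 0x10000     # 1110xxxx 10xxxxxx 10xxxxxx
        return UInt8[0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)]
    else                    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return UInt8[0xf0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3f),
                     0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)]
    end
end

@show utf8_bytes(0x41) utf8_bytes(0xf6) utf8_bytes(0x20ac) utf8_bytes(0x1f604)
@show all(utf8_bytes(codepoint(c)) == codeunits(string(c)) for c in "Aö€😄");
```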

## Characters and Strings in Julia

### Characters

The `Char` type encodes a single Unicode character.

- Julia uses single quotes for characters: `'a'`.
- A `Char` occupies 4 bytes of memory and represents a Unicode codepoint.
- `Char`s can be converted to and from `UInt`s; the integer value is equal to the Unicode codepoint.


Converting between `Char` and integer:
```{julia}
UInt('a')
```


```{julia}
b = Char(0x2656)
```
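
Because a `Char` corresponds to a codepoint, simple arithmetic on characters is possible. The sketch below uses only `Base` functionality: adding an integer shifts the codepoint, subtracting two characters gives their distance, and `codepoint()` returns the number as a `UInt32`.

```{julia}
@show 'a' + 1  'z' - 'a'  codepoint('€')  Int('€')  Char(0x20ac);
```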

### Strings

- In Julia, strings are denoted with double quotes: `"a"`.
- These strings are encoded in UTF-8, where a single character may consist of 1 to 4 bytes.


```{julia}
@show typeof('a') sizeof('a') typeof("a") sizeof("a");
```


__For a non-ASCII string, the number of bytes and the number of characters differ:__


```{julia}
asciistr = "Hello World!"
@show length(asciistr) ncodeunits(asciistr);
```

```{julia}
str = "😄 Hellö 🎶"
@show length(str) ncodeunits(str);
```

__Iterating over a string iterates over the characters:__


```{julia}
for i in str
    println(i, "  ", typeof(i))
end
```
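
Iteration also makes it easy to see where the difference between character count and byte count comes from. In this small sketch, `ncodeunits()` applied to a `Char` gives the number of bytes of its UTF-8 encoding; the names `word` and `bytes_per_char` are only illustrative.

```{julia}
word = "Hellö"
bytes_per_char = [ncodeunits(c) for c in word]
@show bytes_per_char sum(bytes_per_char) == ncodeunits(word);
```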

### Concatenation of Strings

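Strings with concatenation form a non-commutative monoid: concatenation is associative, the empty string `""` is the identity element, and in general `a * b` differs from `b * a`.

Since `+` conventionally denotes commutative operations, Julia writes concatenation multiplicatively, with `*`.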
```{julia}
str * asciistr * str
```

Powers with natural exponents are thus also defined.

```{julia}
str^3,  str^0
```
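
The monoid properties can be tried out on two short strings. This is just a sketch with illustrative variable names `u` and `v`; it also shows `string()` and `join()`, two `Base` functions that concatenate as well.

```{julia}
u, v = "ab", "cd"
@show u * v
@show v * u                       # a different result: not commutative
@show u * "" == u                 # the empty string is the identity element
@show u^0 == ""
@show string(u, v, 1.5)           # string() also converts non-string arguments
@show join([u, v, "ef"], ", ");   # concatenation with a separator
```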

### String Interpolation

The dollar sign has a special function in strings: it interpolates the value of a variable (`$name`) or of an expression (`$(expr)`) into the string. This is frequently used in `print()` statements.

```{julia}
a = 33.4
b = "x"

s = "The result for $b is equal to $a and the doubled square root of it is $(2sqrt(a))\n"
```
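
Interpolation is essentially a shorthand for converting the pieces with `string()` and concatenating them. The following sketch, with the illustrative names `radius` and `msg`, builds the same text both ways.

```{julia}
radius = 2.5
msg = "A circle with radius $radius has area $(round(pi * radius^2, digits=2))."
@show msg
@show msg == string("A circle with radius ", radius, " has area ", round(pi * radius^2, digits=2), ".");
```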

### Backslash Escape Sequences

The backslash `\` also has a special function in string constants.
Julia uses the backslash escape sequences known from C and other languages for special characters, and also for dollar signs and backslashes themselves:


```{julia}
s = "This is how one gets \'quotes\" and a \$ sign and a\nline break and a \\ etc... "
print(s)
```


### Triple Quotes

Strings may also be enclosed in triple quotes, preserving line breaks and embedded quotes:

```{julia}
s = """
 This should
be a "longer"
  'text'.
"""

print(s)
```
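
Triple-quoted strings additionally drop the line break directly after the opening `"""` and remove the indentation common to all lines, where the line with the closing `"""` counts as well. A small sketch with the illustrative name `verse`, in which every line including the closing one is indented by at least four spaces:

```{julia}
verse = """
    Roses are red,
      violets are blue.
    """
@show verse == "Roses are red,\n  violets are blue.\n";
```

In the example further above, the closing `"""` starts in the first column, so no indentation is removed there.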

### Raw Strings

In a `raw string`, interpolation with `$` and all backslash escape sequences except for `\"` are disabled:

```{julia}
s = raw"A $ and a \ and two \\ and a 'bla'..."
print(s)
```
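
The difference between ordinary and raw strings becomes visible when comparing lengths and contents. This is a short sketch using only literals:

```{julia}
@show length("\n") length(raw"\n")              # one newline vs. backslash plus n
@show raw"$x and \t" == "\$x and \\t"           # no interpolation, no escapes
@show raw"a \"quoted\" word" == "a \"quoted\" word";
```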

## Further Functions for Characters and Strings (Selection)

### Tests for Characters


```{julia}
@show isdigit('0') isletter('Ψ') isascii('\U2655') islowercase('α')
@show isnumeric('½') iscntrl('\n') ispunct(';');
```

### Application to Strings

These tests can be used on strings with `all()`, `any()`, or `count()`:


```{julia}
all(ispunct, ";.:")
```


```{julia}
any(isdigit, "It is 3 o'clock! 🕒" )
```


```{julia}
count(islowercase, "Hello, du!!")
```
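
The same predicates also work with `filter()`, which keeps only the characters that pass the test and returns a new string. A brief sketch, with `sentence` as an illustrative name:

```{julia}
sentence = "It is 3 o'clock!"
@show filter(isletter, sentence)
@show count(isspace, sentence)
@show all(isascii, sentence);
```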

### Other String Functions


```{julia}
@show startswith("Lampenschirm", "Lamp")  occursin("pensch", "Lampenschirm")
@show endswith("Lampenschirm", "irm");
```


```{julia}
@show uppercase("Eis") lowercase("Eis")  titlecase("eiSen");
```


```{julia}
# remove newline from end of string

@show chomp("Eis\n")  chomp("Eis");
```


```{julia}
split("π is irrational.")
```


```{julia}
replace("π is irrational.", "is" => "is allegedly")
```
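
`split()` also accepts an explicit delimiter, and `join()` is its counterpart. The following sketch combines them with `strip()`, which removes surrounding whitespace; the names `fields` and `cleaned` are only illustrative.

```{julia}
fields = split(" red; green ;blue ", ";")
cleaned = [strip(f) for f in fields]
@show fields
@show cleaned
@show join(cleaned, ", ");
```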

## Indexing of Strings

Strings are immutable but indexable, with a few special features:

- The index counts the bytes of the string, not the characters.
- For a non-ASCII string, not all indices are valid: a valid index always addresses the first byte of a Unicode character.

Our example string:
```{julia}
str
```

The first character
```{julia}
str[1]
```

This character is 4 bytes long in UTF-8 encoding. Thus, 2, 3, and 4 are invalid indices; accessing one of them raises a `StringIndexError`:
```{julia}
str[2]
```

Only the 5th byte starts a new character:

```{julia}
str[5]
```

When addressing substrings, both the start and the end must be valid indices. In particular, the end index must point to the first byte of a character, and that character is the last one in the substring.

```{julia}
str[1:7]
```
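
If a substring should be selected by number of characters rather than by byte positions, `Base` offers `first()`, `last()`, and `chop()`. A small sketch on a string of 2-byte characters (the name `greek` is illustrative):

```{julia}
greek = "αβγδϵ"
@show first(greek, 2) last(greek, 2) chop(greek)
@show greek[1:3] == first(greek, 2);    # byte indices 1 and 3 are the first two characters
```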

The function `eachindex()` returns an iterator over the valid indices:

```{julia}
for i in eachindex(str)
    c = str[i]
    println("$i: $c")
end
```

As usual, `collect()` turns an iterator into a vector.

```{julia}
collect(eachindex(str))
```

The function `nextind()` returns the next valid index.
```{julia}
@show nextind(str, 1) nextind(str, 2);
```
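
Related `Base` functions are `isvalid()`, `thisind()`, `prevind()`, and `lastindex()`. The following sketch applies them to a short string of three 2-byte characters (the name `abc` is illustrative):

```{julia}
abc = "αβγ"      # three 2-byte characters, valid indices 1, 3, 5
@show lastindex(abc) ncodeunits(abc)
@show isvalid(abc, 3) isvalid(abc, 4)
@show thisind(abc, 4) prevind(abc, 5) nextind(abc, 5);
```

Note that `nextind()` applied to the last valid index returns `ncodeunits(abc) + 1`, which is a convenient stopping condition in loops.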

Why does Julia use a byte index instead of a character index? The main reason is the efficiency of indexing.

- In a long string (e.g., book text), the position `s[123455]` can be accessed in constant time with a byte index.
- A character index would have to traverse the entire string in UTF-8 encoding to find the n-th character, since characters can be 1, 2, 3, or 4 bytes long.
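
If the n-th character is really needed, it can be obtained by stepping through the string. This is a sketch under stated assumptions: `nthchar` is a hypothetical helper, not part of `Base`, and it relies on the three-argument form `nextind(s, i, n)`, which advances `n` characters from index `i`. For `n` larger than the number of characters, the access fails with an error.

```{julia}
# Illustrative helper (not part of Base): the n-th character of s.
# nextind(s, 0, n) walks n characters forward from the start, so the cost grows with n.
nthchar(s::AbstractString, n::Integer) = s[nextind(s, 0, n)]

book = "αβγδϵ"^1000
@show nthchar(book, 3) book[5] nthchar(book, 5000);
```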


Some functions return indices or ranges as results. They always return valid indices:


```{julia}
findfirst('l', str)
```


```{julia}
findfirst("Hel", str)
```


```{julia}
str2 = "αβγδϵ"^3
```


```{julia}
n = findfirst('γ', str2)
```

So you can continue searching from the next valid index after `n=5`:

```{julia}
findnext('γ', str2, nextind(str2, n))
```
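
Repeating this step in a loop collects all positions of a character. The function below is only a sketch with the hypothetical name `allpositions`; it relies on `findfirst()` and `findnext()` returning `nothing` when there is no further match.

```{julia}
# Collect the byte indices of all occurrences of c in s (illustrative helper)
function allpositions(c::Char, s::AbstractString)
    positions = Int[]
    i = findfirst(c, s)
    while i !== nothing
        push!(positions, i)
        # continue the search at the next valid index after the match
        i = findnext(c, s, nextind(s, i))
    end
    return positions
end

allpositions('γ', "αβγδϵ"^3)
```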