---
engine: julia
---

```{julia}
#| error: false
#| echo: false
#| output: false
using InteractiveUtils
import QuartoNotebookWorker
Base.stdout = QuartoNotebookWorker.with_context(stdout)
myactive_module() = Main.Notebook
Base.active_module() = myactive_module()
```

# Machine Numbers

```{julia}
#| error: false
#| echo: false
#| output: false
using InteractiveUtils
```

```{julia}
for x ∈ ( 3, 3.3e4, Int16(20), Float32(3.3e4), UInt16(9) )
    @show x sizeof(x) typeof(x)
    println()
end
```

## Integers

Integers are stored as bit patterns of fixed length, so their value range is finite.
**Within this value range**, addition, subtraction, multiplication, and integer division with remainder are exact operations, free of rounding errors.

Integers come in two flavors: `Signed` and `Unsigned`, which can be viewed as machine models of ℤ and ℕ, respectively.

### *Unsigned integers*
```{julia}
subtypes(Unsigned)
```
`UInt`s are binary numbers of n = 8, 16, 32, 64, or 128 bits length with the corresponding value range
$$
0 \le x < 2^n
$$
They are used relatively rarely in *scientific computing*. In low-level programming they are used, e.g., for handling binary data and memory addresses. Julia therefore displays them in hexadecimal by default (prefix `0x`, digits `0-9a-f`).

```{julia}
x = 0x0033efef
@show x typeof(x) Int(x)

z = UInt(32)
@show z typeof(z);
```

### *Signed Integers*
```{julia}
subtypes(Signed)
```
Signed integers have the value range
$$
-2^{n-1} \le x < 2^{n-1}
$$

In Julia, integer literals are 64-bit by default:
```{julia}
x = 42
typeof(x)
```
so they have the value range
$$
-9{,}223{,}372{,}036{,}854{,}775{,}808 \le x \le 9{,}223{,}372{,}036{,}854{,}775{,}807
$$

32-bit integers have the value range
$$
-2{,}147{,}483{,}648 \le x \le 2{,}147{,}483{,}647
$$
The maximum value $2^{31}-1$ happens to be a Mersenne prime:

```{julia}
using Primes
isprime(2^31 - 1)
```

Negative numbers are represented in two's complement:

$x \Rightarrow -x$ corresponds to: _flip all bits, then add 1_

This looks like this:

::: {.content-visible when-format="html"}
{width=50%}
:::

::: {.content-visible when-format="pdf"}
{width=50%}
:::

The most significant bit indicates the sign. However, flipping it alone does not correspond to the operation `x → -x`.
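
The flip-all-bits-then-add-1 rule can be checked directly at the bit level; a small sketch using `reinterpret` to move between signed and unsigned views of the same bits:

```julia
x = Int8(5)
@show bitstring(x)
@show bitstring(-x)

# flip all bits (~), then add 1 — reproduces the two's complement of x:
y = reinterpret(Int8, ~reinterpret(UInt8, x) + 0x01)
@show y == -x;
```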

:::{.callout-important}
All integer arithmetic is __arithmetic modulo $2^n$__

```{julia}
x = 2^62 - 10 + 2^62
```
```{julia}
x + 20
```
No error message, no warning! Fixed-length integers do not lie on a line, but on a circle!

:::
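
The wraparound can be seen directly at the type boundaries. If overflow should raise an error instead of wrapping, the less-known checked operations from `Base` can be used; a quick sketch:

```julia
@show typemax(Int64) + 1 == typemin(Int64)   # silent wraparound

# Base.checked_add throws an OverflowError instead of wrapping:
try
    Base.checked_add(typemax(Int64), 1)
catch e
    @show typeof(e)
end
```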

Let's look at a few integers:

```{julia}
using Printf

for i in (1, 2, 7, -7, 2^50, -1, Int16(7), Int8(7))
    s = bitstring(i)
    t = typeof(i)
    @printf "%20i %-5s ⇒ %s\n" i t s
end
```

Julia has functions that provide type information (*introspection*):

```{julia}
typemin(Int64), typemax(Int64)
```

```{julia}
typemin(UInt64), typemax(UInt64), BigInt(typemax(UInt64))
```

```{julia}
typemin(Int8), typemax(Int8)
```

## Integer Arithmetic

#### Addition, Multiplication
The operations `+`, `-`, `*` follow the usual exact arithmetic **modulo $2^{64}$**.

#### Powers `a^b`

- Powers `a^n` are computed exactly modulo $2^{64}$ for natural exponents `n`.
- For negative exponents, the result is a floating-point number.
- `0^0` is [naturally](https://en.wikipedia.org/wiki/Zero_to_the_power_of_zero#cite_note-T4n3B-4) equal to 1.
```{julia}
(-2)^63, 2^64, 3^(-3), 0^0
```
- For natural exponents, [*exponentiation by squaring*](https://de.wikipedia.org/wiki/Bin%C3%A4re_Exponentiation) is used, so `x^23`, for example, requires only 7 multiplications:
$$
x^{23} = \left( \left( (x^2)^2 \cdot x \right)^2 \cdot x \right)^2 \cdot x
$$
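
The squaring scheme can be sketched as a small recursive function (a hypothetical `power_by_squaring_sketch`, not the optimized routine Julia actually uses):

```julia
# square-and-multiply: O(log n) multiplications instead of n-1
function power_by_squaring_sketch(x, n::Integer)
    n == 0 && return one(x)
    h = power_by_squaring_sketch(x, n ÷ 2)   # reuse the half-exponent result
    return iseven(n) ? h * h : h * h * x
end

@show power_by_squaring_sketch(2, 23) == 2^23;
```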

#### Division

- Division `/` produces a floating-point number.

```{julia}
x = 40/5
```

#### Integer Division and Remainder

- The functions `div(a,b)`, `rem(a,b)`, and `divrem(a,b)` compute the quotient of integer division, the corresponding remainder, or both as a tuple.
- `div(a,b)` has the operator form `a ÷ b` (input: `\div<TAB>`), and `rem(a,b)` the operator form `a % b`.
- By default, division rounds toward zero, so the corresponding remainder has the same sign as the dividend `a`:

```{julia}
@show divrem( 27, 4)
@show ( 27 ÷ 4, 27 % 4)
@show (-27 ÷ 4, -27 % 4 )
@show ( 27 ÷ -4, 27 % -4);
```

- A rounding rule other than `RoundToZero` can be passed as an optional third argument to these functions.
- `?RoundingMode` lists the available rounding modes.
- For the rounding rule `RoundDown` ("toward minus infinity"), which makes the remainder carry the same sign as the divisor `b`, there are also the functions `fld(a,b)` *(floored division)* and `mod(a,b)`:

```{julia}
@show divrem(-27, 4, RoundDown)
@show (fld(-27, 4), mod(-27, 4))
@show (fld( 27, -4), mod( 27, -4));
```

For all rounding modes, the following identity holds:
```
div(a, b, RoundingMode) * b + rem(a, b, RoundingMode) = a
```
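
The identity can be spot-checked for a few rounding modes; a quick sketch:

```julia
# quotient * divisor + remainder must reproduce the dividend in every mode
for mode in (RoundToZero, RoundDown, RoundUp, RoundNearest)
    a, b = -27, 4
    @show mode div(a, b, mode) * b + rem(a, b, mode) == a
end
```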

#### The `BigInt` Type

The `BigInt` type provides arbitrary-length integers. The required memory is allocated dynamically.

Numeric literals automatically get a sufficiently large type:

```{julia}
z = 10
@show typeof(z)
z = 10_000_000_000_000_000          # 10^16
@show typeof(z)
z = 10_000_000_000_000_000_000      # 10^19
@show typeof(z)
z = 10_000_000_000_000_000_000_000_000_000_000_000_000_000  # 10^40
@show typeof(z);
```

Usually, one must explicitly request the `BigInt` type to avoid modulo-$2^{64}$ arithmetic:

```{julia}
@show 3^300 BigInt(3)^300;
```

*Arbitrary precision arithmetic* comes at a significant cost in memory and computation time.

We compare the time and memory requirements for summing 10 million integers as `Int64` and as `BigInt`.

```{julia}
# 10^7 random numbers, uniformly distributed between -10^7 and 10^7
vec_int = rand(-10^7:10^7, 10^7)

# the same numbers as BigInts
vec_bigint = BigInt.(vec_int)
```

A first impression of the time and memory requirements is provided by the `@time` macro:

```{julia}
@time x = sum(vec_int)
@show x typeof(x)
```
```{julia}
@time x = sum(vec_bigint)
@show x typeof(x);
```

Due to Julia's just-in-time compilation, a single execution of a function is not very informative. The `BenchmarkTools` package provides the `@benchmark` macro, which calls a function multiple times and displays the execution times as a histogram.

:::{.ansitight}
```{julia}
using BenchmarkTools

@benchmark sum($vec_int)
```

```{julia}
@benchmark sum($vec_bigint)
```
:::
The `BigInt` addition is more than 30 times slower.

:::{.content-hidden unless-format="xxx"}
The following function computes the sum of all numbers from 1 to n using arithmetic of type T.
Due to the *type promotion rules*, it suffices for `T ≥ Int64` to initialize the accumulator variable with a number of type T.
```{julia}
function mysum(n, T)
    s = T(0)
    for i = 1:n
        s += i
    end
    return s
end
```

A first impression of the time and memory requirements is provided by the `@time` macro:

```{julia}
@time x = mysum(10_000_000, Int64)
@show x typeof(x);
```

```{julia}
@time x = mysum(10_000_000, BigInt)
@show x typeof(x);
```

Due to Julia's just-in-time compilation, a single execution of a function is not very informative. The `BenchmarkTools` package provides the `@benchmark` macro, which calls a function multiple times and displays the execution times as a histogram.

:::{.ansitight}
```{julia}
using BenchmarkTools

@benchmark mysum(10_000_000, Int64)
```

```{julia}
@benchmark mysum(10_000_000, BigInt)
```

The computation of $\sum_{n=1}^{10^7} n$ takes on my PC an average of 2 nanoseconds with standard 64-bit integers and over one second in *arbitrary precision arithmetic*, which additionally allocates nearly 500 MB of memory.

:::
:::

## Floating-Point Numbers

German renders _floating point numbers_ as **[Gleit|Fließ]--[Komma|Punkt]--Zahlen**, and indeed all 4 combinations appear in the literature.

In numerical mathematics, one also often speaks of **machine numbers**.

### Basic Idea

- A "fixed number of digits before and after the decimal point" is unsuitable for many problems.
- A separation between significant digits and magnitude (mantissa and exponent), as in scientific notation, is much more flexible:
$$ 345.2467 \times 10^3\qquad 34.52467\times 10^4\qquad \underline{3.452467\times 10^5}\qquad 0.3452467\times 10^6\qquad 0.03452467\times 10^7$$
- For uniqueness, one must settle on one of these forms. In the mathematical analysis of machine numbers, one often chooses the form where the first digit after the decimal point is nonzero. The mantissa $m$ then satisfies
$$
1 > m \ge (0.1)_b = b^{-1},
$$
where $b$ denotes the base of the number system.
- We choose the form that corresponds to the actual implementation on the computer and define:
The representation with exactly one nonzero digit before the decimal point is the __normalized representation__. Thus
$$
(10)_b = b > m \ge 1.
$$
- For binary numbers such as `1.01101`, this leading digit is always equal to one, so it need not be stored. The actually stored (shortened) mantissa is denoted by $M$, so that
$$m = 1 + M$$
holds.

:::{.callout-note }

## Machine Numbers

The set of machine numbers $𝕄(b, p, e_{min}, e_{max})$ is characterized by the base $b$, the mantissa length $p$, and the exponent range $\{e_{min}, \dots, e_{max}\}$.

In our convention, the mantissa of a normalized machine number has one nonzero digit (of base $b$) before the decimal point and $p-1$ digits after it.

For $b=2$, only $p-1$ bits are needed to store the mantissa of normalized floating-point numbers.

The IEEE 754 standard, implemented by most modern processors and programming languages, defines

- `Float32` as $𝕄(2, 24, -126, 127)$ and
- `Float64` as $𝕄(2, 53, -1022, 1023).$

:::

### Structure of `Float64` according to the [IEEE 754 standard](https://de.wikipedia.org/wiki/IEEE_754)

::: {.content-visible when-format="html"}
![Structure of a `Float64`^[Source: <a href="https://commons.wikimedia.org/wiki/File:IEEE_754_Double_Floating_Point_Format.svg">Codekaizen</a>, <a href="https://creativecommons.org/licenses/by-sa/4.0">CC BY-SA 4.0</a>, via Wikimedia Commons]](../images/1024px-IEEE_754_Double_Floating_Point_Format.png)
:::

::: {.content-visible when-format="pdf"}
{width="70%"}
:::

- 1 sign bit $S$
- 11 bits for the exponent, thus $0\le E \le 2047$
  - The values $E=0$ and $E=(11111111111)_2=2047$ are reserved for encoding special values such as $\pm0, \pm\infty$, NaN _(Not a Number)_, and subnormal numbers.
- 52 bits for the (shortened) mantissa $M,\quad 0\le M<1$, corresponding to approximately 16 decimal digits
- Thus, the represented number is:
$$ x=(-1)^S \cdot(1+M)\cdot 2^{E-1023}$$

An example:
```{julia}
x = 27.56640625
bitstring(x)
```
This can be displayed more nicely:

```{julia}
function printbitsf64(x::Float64)
    s = bitstring(x)
    printstyled(s[1], color = :blue, reverse = true)
    printstyled(s[2:12], color = :green, reverse = true)
    printstyled(s[13:end], color = :red, bold = true, reverse = true)
    print("\n")
end

printbitsf64(x)
```
and we can read off S (blue), E (green), and M (red):
$$
\begin{aligned}
S &= 0\\
E &= (10000000011)_2 = 2^{10} + 2^1 + 2^0 = 1027\\
M &= (.101110010001)_2 = \frac{1}{2} + \frac{1}{2^3} + \frac{1}{2^4} + \frac{1}{2^5} + \frac{1}{2^8} + \frac{1}{2^{12}}\\
x &=(-1)^S \cdot(1+M)\cdot 2^{E-1023}
\end{aligned}
$$

... and thus reconstruct the number:

```{julia}
x = (1 + 1/2 + 1/8 + 1/16 + 1/32 + 1/256 + 1/4096) * 2^4
```

- The machine numbers 𝕄 form a finite, discrete subset of ℝ. There is a smallest and a largest machine number, and apart from these, every x∈𝕄 has a predecessor and a successor in 𝕄.
- What is the successor of x in 𝕄? To obtain it, we set the least significant mantissa bit from 0 to 1.
- Converting a string of zeros and ones into the corresponding machine number is possible, e.g., as follows:

```{julia}
ux = parse(UInt64, "0100000000111011100100010000000000000000000000000000000000000001", base=2)
reinterpret(Float64, ux)
```

However, Julia can do this more simply:

```{julia}
y = nextfloat(x)
```

The predecessor of x is:

```{julia}
z = prevfloat(x)
println(z)
printbitsf64(z)
```

## Machine Epsilon

- The distance between `1` and its successor `nextfloat(1)` is called [**machine epsilon**](https://en.wikipedia.org/wiki/Machine_epsilon).
- For `Float64` with a mantissa length of 52 bits, $\epsilon=2^{-52}$.

```{julia}
@show nextfloat(1.) - 1 2^-52 eps(Float64);
```

- Machine epsilon is a measure of the relative spacing of machine numbers and quantifies the statement: "64-bit floating-point numbers have a precision of approximately 16 decimal digits."
- Machine epsilon is something completely different from the smallest positive floating-point number:

```{julia}
floatmin(Float64)
```

- Part of the literature uses a different definition of machine epsilon, which is half as large:
$$
\epsilon' = \frac{\epsilon}{2}\approx 1.1\times 10^{-16}
$$
is the maximum relative error that can occur when rounding a real number to the nearest machine number.
- Since all numbers in the interval $(1-\epsilon',1+\epsilon']$ are rounded to the machine number $1$, one can also define $\epsilon'$ as *the largest number for which, in machine arithmetic, $1+\epsilon' = 1$ still holds.*

In this way, one can also compute machine epsilon:

:::{.ansitight}

```{julia}
Eps = 1
while 1 != 1 + Eps
    Eps /= 2
    println(1 + Eps)
end
Eps
```

or as a bit pattern:

```{julia}
Eps = 1
while 1 != 1 + Eps
    Eps /= 2
    printbitsf64(1 + Eps)
end
Eps
```

:::
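
Machine epsilon scales with the magnitude of the number: Julia's `eps(x)` returns the spacing from `x` to the next larger machine number, so doubling the number doubles the spacing. A quick check:

```julia
# spacing of machine numbers around 1, 2, and 1024
@show eps(1.0) eps(2.0) eps(1024.0)

# in the binade [2, 4) the spacing is twice that of [1, 2):
@show eps(2.0) == 2 * eps(1.0);
```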

:::{.callout-note}
## The Set of (Normalized) Machine Numbers

- In the interval $[1,2)$ there are $2^{52}$ equidistant machine numbers.
- Then the exponent increases by 1 and the mantissa $M$ is reset to 0. Thus, the interval $[2,4)$ again contains $2^{52}$ equidistant machine numbers, as does $[4,8)$, and so on up to $[2^{1023}, 2^{1024})$.
- Likewise, each of the intervals $\ [\frac{1}{2},1), \ [\frac{1}{4},\frac{1}{2}),\dots$ contains $2^{52}$ equidistant machine numbers, down to $[2^{-1022}, 2^{-1021})$.
- This forms the set $𝕄_+$ of positive machine numbers, and
$$
𝕄 = -𝕄_+ \cup \{0\} \cup 𝕄_+
$$

:::

The largest and smallest positive representable normalized floating-point numbers of a floating-point type can be queried:

```{julia}
@show floatmax(Float64)
printbitsf64(floatmax(Float64))

@show floatmin(Float64)
printbitsf64(floatmin(Float64))
```

## Rounding to Machine Numbers

- The mapping rd: ℝ $\rightarrow$ 𝕄 rounds to the nearest representable number.
- The standard rounding mode is _round to nearest, ties to even_:
If a number lies exactly in the middle between two machine numbers *(tie)*, the one whose last mantissa bit is 0 is chosen.
- Justification: this way, ties round up in 50% of the cases and down in the other 50%, avoiding a "statistical drift" in longer calculations.
- It holds that
$$
\frac{|x-\text{rd}(x)|}{|x|} \le \frac{1}{2} \epsilon
$$
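
Ties-to-even can be observed directly: `round` applies the same rule to decimal ties, and at the bit level a tie between two machine numbers resolves to the one with even (zero) last mantissa bit. A quick sketch:

```julia
# halfway cases go to the neighbor with even last digit:
@show round(2.5) round(3.5)

# the same rule at the bit level: 1 + ε/2 lies exactly between 1.0 and
# nextfloat(1.0) and rounds to 1.0, whose last mantissa bit is 0
@show 1.0 + eps() / 2 == 1.0;
```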

## Machine Number Arithmetic

The machine numbers, as a subset of ℝ, are not algebraically closed: even the sum of two machine numbers is generally not a machine number.

:::{.callout-important}
The IEEE 754 standard requires that machine number arithmetic produce the *rounded exact result*:

The result must equal the one that would come from an exact execution of the corresponding operation followed by rounding.
$$
a \oplus b = \text{rd}(a + b)
$$
The same must hold for the implementation of standard functions such as
`sqrt()`, `log()`, `sin()`, ...: they, too, return the machine number closest to the exact result.
:::
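
The requirement $a \oplus b = \text{rd}(a+b)$ can be checked by redoing the operation in higher precision and rounding the exact result back to `Float64`; a sketch using `BigFloat` as a stand-in for exact arithmetic (the sum of two `Float64` values fits exactly in `BigFloat`'s default 256-bit mantissa):

```julia
a, b = 0.1, 0.2

# Float64 addition in one machine operation ...
s64 = a + b

# ... versus exact addition followed by rounding back to Float64:
sbig = Float64(big(a) + big(b))

@show s64 == sbig;
```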

Arithmetic is *not associative*:

```{julia}
1 + 10^-16 + 10^-16
```

```{julia}
1 + (10^-16 + 10^-16)
```

In the first case (without parentheses), evaluation proceeds from left to right:
$$
\begin{aligned}
1 \oplus 10^{-16} \oplus 10^{-16} &=
(1 \oplus 10^{-16}) \oplus 10^{-16} \\ &=
\text{rd}\left(\text{rd}(1 + 10^{-16}) + 10^{-16} \right)\\
&= \text{rd}(1 + 10^{-16})\\
&= 1
\end{aligned}
$$

In the second case, one obtains:
$$
\begin{aligned}
1 \oplus \left(10^{-16} \oplus 10^{-16}\right) &=
\text{rd}\left(1 + \text{rd}(10^{-16} + 10^{-16}) \right)\\
&= \text{rd}(1 + 2\cdot 10^{-16})\\
&= 1.0000000000000002
\end{aligned}
$$

One should also keep in mind that even "simple" decimal fractions often cannot be represented exactly as machine numbers:

$$
\begin{aligned}
(0.1)_{10} &= (0.000110011001100110011001100...)_2 = (0.000\overline{1100})_2 \\
(0.3)_{10} &= (0.0100110011001100110011001100...)_2 = (0.0\overline{1001})_2
\end{aligned}
$$

```{julia}
printbitsf64(0.1)
printbitsf64(0.3)
```

Consequence:

```{julia}
0.1 + 0.1 == 0.2
```

```{julia}
0.2 + 0.1 == 0.3
```

```{julia}
0.2 + 0.1
```

When a machine number is printed, its binary fraction must be converted to a decimal fraction. One can also display more digits of this decimal expansion:
```{julia}
using Printf
@printf("%.30f", 0.1)
```

```{julia}
@printf("%.30f", 0.3)
```

The binary-fraction mantissa of a machine number can have a long or even infinitely periodic decimal expansion. One should therefore not be misled into assuming "higher precision"!

:::{.callout-important}
Moral: when testing `Float`s for equality, one should almost always define a realistic accuracy `epsilon` appropriate to the problem and test against it:

```julia
epsilon = 1.e-10

if abs(x - y) < epsilon
    # ...
end
```
:::
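
Julia also ships a ready-made tolerance comparison: `isapprox(x, y)` (infix `x ≈ y`, input `\approx<TAB>`), with keyword arguments `rtol` and `atol` for relative and absolute tolerance. A small sketch:

```julia
x = 0.2 + 0.1
y = 0.3

@show x == y          # false: the machine numbers differ
@show isapprox(x, y)  # true: equal up to the default relative tolerance
@show x ≈ y;          # infix form of isapprox
```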

## Normalized and Subnormal Machine Numbers

The gap between zero and the smallest normalized machine number $2^{-1022} \approx 2.22\times 10^{-308}$ is filled with subnormal machine numbers.

For understanding, let's take a simple model:

- Let 𝕄(10,4,±5) be the set of machine numbers to base 10 with 4 mantissa digits (one before the decimal point, 3 after) and the exponent range -5 ≤ E ≤ 5.
- The normalized representation (nonzero leading digit) of, e.g., 1234.0 is then 1.234e3, and that of 0.00789 is 7.890e-3.
- It is important that machine numbers are kept normalized at every computation step. Only then is the full mantissa length used and the accuracy maximal.
- The smallest positive normalized number in our model is `x = 1.000e-5`. Already `x/2` would have to be rounded to 0.
- Here it is advantageous for many applications to also allow *subnormal* numbers and represent `x/2` as `0.500e-5` or `x/20` as `0.050e-5`.
- This *gradual underflow* is, of course, associated with a loss of valid digits and thus of accuracy.

In the `Float` data types, such *subnormal values* are represented by an exponent field in which all bits are equal to zero:

```{julia}
#| echo: false
#| output: false
flush(stdout)
```

:::{.ansitight}
```{julia}
using Printf

x = 2 * floatmin(Float64)  # twice the smallest normalized floating-point number > 0

while x != 0
    x /= 2
    @printf "%14.6g " x
    printbitsf64(x)
end
```
:::
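
Julia can test for this directly: `issubnormal(x)` reports whether `x` is stored in subnormal form. A quick check:

```julia
@show issubnormal(floatmin(Float64))      # smallest normalized number: false
@show issubnormal(floatmin(Float64) / 2)  # falls into the subnormal range
@show issubnormal(5.0e-324);              # the smallest positive Float64
```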

## Special Values

Floating-point arithmetic knows some special values, e.g.
```{julia}
nextfloat(floatmax(Float64))
```

```{julia}
for x ∈ (NaN, Inf, -Inf, -0.0)
    @printf "%6g " x
    printbitsf64(x)
end
```

- An exponent overflow *(overflow)* yields the result `Inf` or `-Inf`.
```{julia}
2/0, -3/0, floatmax(Float64) * 1.01, exp(1300)
```
- One can continue calculating with these values:

```{julia}
-Inf + 20, Inf/30, 23/-Inf, sqrt(Inf), Inf * 0, Inf - Inf
```
- `NaN` *(Not a Number)* stands for the result of an undefined operation. All further operations involving `NaN` also result in `NaN`.

```{julia}
0/0, Inf - Inf, 2.3NaN, sqrt(NaN)
```
- Since `NaN` represents an undefined value, it is not equal to anything, not even to itself. This is sensible: if two variables `x` and `y` are both computed as `NaN`, one should not conclude that they are equal.
- There is therefore a boolean function `isnan()` to test for `NaN`.

```{julia}
x = 0/0
y = Inf - Inf
@show x==y NaN==NaN isfinite(NaN) isinf(NaN) isnan(x) isnan(y);
```

- There is also a "minus zero". It signals an exponent underflow *(underflow)* of a *negative* quantity whose magnitude has become too small.

```{julia}
@show 23/-Inf -2/exp(1200) -0.0==0.0;
```

## Mathematical Functions

Julia has the [usual mathematical functions](https://docs.julialang.org/en/v1/manual/mathematical-operations/#Rounding-functions)
`sqrt, exp, log, log2, log10, sin, cos, ..., asin, acos, ..., sinh, ..., gcd, lcm, factorial, ..., abs, max, min, ...`,

including, e.g., the [rounding functions](https://de.wikipedia.org/wiki/Abrundungsfunktion_und_Aufrundungsfunktion)

- `floor(T,x)` = $\lfloor x \rfloor$
- `ceil(T,x)` = $\lceil x \rceil$

```{julia}
floor(3.4), floor(Int64, 3.5), floor(Int64, -3.5)
```

```{julia}
ceil(3.4), ceil(Int64, 3.5), ceil(Int64, -3.5)
```

Also worth noting is `atan(y, x)`, the [two-argument arctangent](https://de.wikipedia.org/wiki/Arctan2). In other programming languages, it is often implemented as a separate function named *atan2*.
It solves the problem of converting from Cartesian to polar coordinates without awkward case distinctions.

- `atan(y,x)` is the angle of the polar coordinates of (x,y) in the interval $(-\pi,\pi]$. In the 1st and 4th quadrants, it therefore equals `atan(y/x)`.

```{julia}
atan(3, -2), atan(-3, 2), atan(-3/2)
```
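
A typical use is the conversion from Cartesian to polar coordinates; a small sketch (`hypot` computes $\sqrt{x^2+y^2}$ without intermediate overflow):

```julia
x, y = -2.0, 3.0

r = hypot(x, y)   # radius
θ = atan(y, x)    # angle in (-π, π], correct quadrant without case distinctions

# reconstructing the Cartesian coordinates recovers x and y:
@show r * cos(θ) ≈ x r * sin(θ) ≈ y;
```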

## Conversion Strings $\Longleftrightarrow$ Numbers

Conversion is possible with the functions `parse()` and `string()`.

```{julia}
parse(Int64, "1101", base=2)
```

```{julia}
string(13, base=2)
```

```{julia}
string(1/7)
```

```{julia}
string(77, base=16)
```

For converting numerical types into each other, one can use the type names, which also act as constructors:

```{julia}
x = Int8(67)
@show x typeof(x);
```

```{julia}
z = UInt64(3459)
```

```{julia}
y = Float64(z)
```

## Literature

- D. Goldberg, [What Every Computer Scientist Should Know About Floating-Point Arithmetic](http://www.validlab.com/goldberg/paper.pdf)
- C. Vuik, [Some Disasters caused by numerical errors](http://ta.twi.tudelft.nl/users/vuik/wi211/disasters.html)