---
engine: julia
julia:
  exeflags: ["--color=yes"]
---

```{julia}
#| error: false
#| echo: false
#| output: false
using InteractiveUtils

# struct M a::Int end; x = M(22); @show x
# should not print "Main.Notebook.M(22)" but only "M(22)"
function Base.show(io::IO, x::T) where T
    if parentmodule(T) == @__MODULE__
        # Print "TypeName(fields...)" without module prefix
        print(io, nameof(T), "(")
        fields = fieldnames(T)
        for (i, f) in enumerate(fields)
            print(io, getfield(x, f))
            i < length(fields) && print(io, ", ")
        end
        print(io, ")")
    else
        invoke(Base.show, Tuple{IO, Any}, io, x)
    end
end
```

# Machine Numbers

```{julia}
#| error: false
#| echo: false
#| output: false
using InteractiveUtils
```

```{julia}
for x ∈ ( 3, 3.3e4, Int16(20), Float32(3.3e4), UInt16(9) )
    @show x sizeof(x) typeof(x)
    println()
end
```

## Integers

Integers are stored as fixed-length bit patterns. Therefore, the value range is finite.
**Within this value range**, addition, subtraction, multiplication, and integer division with remainder
are exact operations without rounding errors.

Integer numbers come in two types: `Signed` and `Unsigned`, which can be viewed as machine models for ℤ and ℕ, respectively.

### *Unsigned integers*

```{julia}
subtypes(Unsigned)
```

`UInt`s are binary numbers with a bit width of 8, 16, 32, 64, or 128 and the corresponding value range of
$$
0 \le x < 2^n
$$
They are rarely used in *scientific computing*. In low-level hardware programming, they are used, e.g., for handling binary data and memory addresses. By default, Julia displays them as hexadecimal numbers (with prefix `0x` and digits `0-9a-f`).

```{julia}
x = 0x0033efef
@show x typeof(x) Int(x)

z = UInt(32)
@show z typeof(z);
```

### *Signed Integers*

```{julia}
subtypes(Signed)
```

*Integers* have the value range
$$
-2^{n-1} \le x < 2^{n-1}
$$

In Julia, integer numbers are 64-bit by default:

```{julia}
x = 42
typeof(x)
```

Therefore, they have the value range:
$$
-9{,}223{,}372{,}036{,}854{,}775{,}808 \le x \le 9{,}223{,}372{,}036{,}854{,}775{,}807
$$

32-bit integers have the value range
$$
-2{,}147{,}483{,}648 \le x \le 2{,}147{,}483{,}647
$$
By the way, the maximum value $2^{31}-1$ is a Mersenne prime:

```{julia}
using Primes
isprime(2^31-1)
```

Negative numbers are represented in two's complement:

$x \Rightarrow -x$ corresponds to: _flip all bits, then add 1_

This looks as follows:

::: {.content-visible when-format="html"}
{width=50%}
:::

::: {.content-visible when-format="typst"}
{width=50%}
:::

The most significant bit indicates the sign. However, flipping it does not correspond to the operation `x → -x`.
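
To make this concrete, here is a small check (added as an illustration): the bitwise complement operator `~` flips all bits, so `~x + 1` should reproduce `-x`:

```{julia}
x = Int8(7)                 # 8-bit integer, so the bit patterns stay short
@show bitstring(x) bitstring(-x) bitstring(~x + one(x))   # one(x) keeps the result an Int8
@show -x == ~x + one(x);
```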

:::{.callout-important}
All integer arithmetic is __arithmetic modulo $2^n$__.

```{julia}
x = 2^62 - 10 + 2^62
```
```{julia}
x + 20
```
No error message, no warning! Fixed-length integers do not lie on a line, but on a **circle.**

:::

Let's look at a few *integers*:

```{julia}
using Printf

for i in (1, 2, 7, -7, 2^50, -1, Int16(7), Int8(7))
    s = bitstring(i)
    t = typeof(i)
    @printf "%20i %-5s ⇒ %s\n" i t s
end
```

Julia has functions that provide type information (*introspection*):

```{julia}
typemin(Int64), typemax(Int64)
```

```{julia}
typemin(UInt64), typemax(UInt64), BigInt(typemax(UInt64))
```

```{julia}
typemin(Int8), typemax(Int8)
```

## Integer Arithmetic

#### Addition, Multiplication
The operations `+`, `-`, `*` follow the usual rules of exact arithmetic **modulo $2^{64}$**.
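
For instance (a quick illustration of the wrap-around, added here): adding `1` to the largest `Int64` silently wraps to the smallest one:

```{julia}
typemax(Int64) + 1 == typemin(Int64)
```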

#### Powers `a^b`

- Powers `a^n` are computed exactly modulo $2^{64}$ for natural exponents `n`.
- For negative exponents, the result is a floating-point number.
- `0^0` is [naturally](https://en.wikipedia.org/wiki/Zero_to_the_power_of_zero#cite_note-T4n3B-4) equal to 1.

```{julia}
(-2)^63, 2^64, 3^(-3), 0^0
```

- For natural exponents, [*exponentiation by squaring*](https://en.wikipedia.org/wiki/Exponentiation_by_squaring) is used, so for example `x^23` requires only 7 multiplications (see the sketch below):
  $$
  x^{23} = \left( \left( (x^2)^2 \cdot x \right)^2 \cdot x \right)^2 \cdot x
  $$
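
A minimal sketch of the idea (illustrative only; Julia's actual implementation for integer powers lives in `Base.power_by_squaring`):

```{julia}
# Exponentiation by squaring: x^n with O(log n) multiplications
function pow_by_squaring(x, n::Integer)
    n == 0 && return one(x)
    h = pow_by_squaring(x, n ÷ 2)        # x^(n÷2)
    return isodd(n) ? h * h * x : h * h
end

pow_by_squaring(2, 23) == 2^23
```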

#### Division

- Division `/` produces a floating-point number.

```{julia}
x = 40/5
```

#### Integer Division and Remainder

- The functions `div(a,b)`, `rem(a,b)`, and `divrem(a,b)` compute the quotient of integer division, the corresponding remainder, or both as a tuple.
- For `div(a,b)` there is the operator form `a ÷ b` (input: `\div<TAB>`), and for `rem(a,b)` the operator form `a % b`.
- By default, division uses "rounding toward zero", so the corresponding remainder has the same sign as the dividend `a`:

```{julia}
@show divrem( 27, 4)
@show ( 27 ÷ 4, 27 % 4)
@show (-27 ÷ 4, -27 % 4 )
@show ( 27 ÷ -4, 27 % -4);
```

- A rounding rule other than `RoundToZero` can be specified as the third optional argument for these functions.
- `?RoundingMode` shows the possible rounding modes.
- For the rounding rule `RoundDown` ("toward minus infinity" -- so that the corresponding remainder has the same sign as the divisor `b`), there are also the functions `fld(a,b)` *(floored division)* and `mod(a,b)`:

```{julia}
@show divrem(-27, 4, RoundDown)
@show (fld(-27, 4), mod(-27, 4))
@show (fld( 27, -4), mod( 27, -4));
```

For all rounding modes, the following holds:
```
div(a, b, RoundingMode) * b + rem(a, b, RoundingMode) = a
```
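
We can verify this identity directly for a few rounding modes (a small check added here):

```{julia}
a, b = -27, 4
all(div(a, b, r) * b + rem(a, b, r) == a for r in (RoundToZero, RoundDown, RoundUp))
```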

#### The `BigInt` Type

The `BigInt` type supports arbitrary-precision integers with dynamically allocated memory.

Numeric constants automatically have a sufficiently large type:

```{julia}
z = 10
@show typeof(z)
z = 10_000_000_000_000_000                                   # 10 quadrillion = 10^16
@show typeof(z)
z = 10_000_000_000_000_000_000                                # 10 quintillion = 10^19
@show typeof(z)
z = 10_000_000_000_000_000_000_000_000_000_000_000_000_000    # 10^40
@show typeof(z);
```

In most cases, you must explicitly specify the `BigInt` type to avoid modulo $2^{64}$ arithmetic:

```{julia}
@show 3^300 BigInt(3)^300;
```

*Arbitrary precision arithmetic* comes at the cost of significantly higher memory use and computation time.

We compare the time and memory requirements for summing 10 million integers as `Int64` versus `BigInt`.

```{julia}
# 10^7 random numbers, uniformly distributed between -10^7 and 10^7
vec_int = rand(-10^7:10^7, 10^7)

# The same numbers as BigInts
vec_bigint = BigInt.(vec_int)
```

The `@time` macro provides a rough estimate of the required time and memory:

```{julia}
@time x = sum(vec_int)
@show x typeof(x)
```
```{julia}
@time x = sum(vec_bigint)
@show x typeof(x);
```

Due to Julia's just-in-time compilation, timing a single function call is not very informative. The `BenchmarkTools` package provides the `@benchmark` macro, which calls a function multiple times and displays the execution times as a histogram.

:::{.ansitight}
```{julia}
using BenchmarkTools

@benchmark sum($vec_int)
```

```{julia}
@benchmark sum($vec_bigint)
```
:::

The `BigInt` addition is more than 30 times slower.

:::{.content-hidden unless-format="xxx"}
The following function computes the sum of all numbers from 1 to n using arithmetic of type T.
Due to *type promotion rules*, it is sufficient to initialize the accumulator with a value of type T (for `T ≥ Int64`).
```{julia}
function mysum(n, T)
    s = T(0)
    for i = 1:n
        s += i
    end
    return s
end
```

An initial impression of the time and memory requirements is provided by the `@time` macro:

```{julia}
@time x = mysum(10_000_000, Int64)
@show x typeof(x);
```

```{julia}
@time x = mysum(10_000_000, BigInt)
@show x typeof(x);
```

Due to Julia's just-in-time compilation, a single execution of a function is not very informative. The `BenchmarkTools` package provides the `@benchmark` macro, which calls a function multiple times and displays the execution times as a histogram.

:::{.ansitight}
```{julia}
using BenchmarkTools

@benchmark mysum(10_000_000, Int64)
```

```{julia}
@benchmark mysum(10_000_000, BigInt)
```
:::

The computation of $\sum_{n=1}^{10^7} n$ takes an average of 2 milliseconds on my PC with standard 64-bit integers, and over one second with *arbitrary precision arithmetic*, which additionally allocates nearly 500 MB of memory.

:::

## Floating-Point Numbers

In numerical mathematics, the term **machine numbers** is also commonly used.

### Basic Idea

- A "fixed number of digits before and after the decimal point" is unsuitable for many problems.
- A separation between "significant digits" and magnitude (mantissa and exponent), as in scientific notation, is much more flexible.
  $$ 345.2467 \times 10^3\qquad 34.52467\times 10^4\qquad \underline{3.452467\times 10^5}\qquad 0.3452467\times 10^6\qquad 0.03452467\times 10^7$$
- For uniqueness, one must choose one of these forms. In the mathematical analysis of machine numbers, one often chooses the form where the first digit after the decimal point is nonzero. Thus, for the mantissa $m$:
  $$
  1 > m \ge (0.1)_b = b^{-1},
  $$
  where $b$ denotes the base of the number system.
- We choose the form that corresponds to the actual implementation on the computer and specify:

  The representation with exactly one nonzero digit before the decimal point is the __normalized representation__. Thus,
  $$
  (10)_b = b > m \ge 1.
  $$
- For binary numbers, e.g. `1.01101`, the digit before the point is then always equal to one, so it does not need to be stored. We denote the actually stored (shortened) mantissa by $M$, so that
  $$m = 1 + M$$
  holds.
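
In Julia, the functions `significand(x)` and `exponent(x)` return exactly this normalized decomposition $x = m \cdot 2^E$ with $1 \le m < 2$ for positive $x$ (a quick illustration):

```{julia}
x = 27.56640625
m, E = significand(x), exponent(x)
@show m E m*2.0^E;
```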

:::{.callout-note}
## Machine Numbers

The set of machine numbers $𝕄(b, p, e_{min}, e_{max})$ is characterized by the base $b$, the mantissa length $p$, and the value range of the exponent $\{e_{min}, ... ,e_{max}\}$.

In our convention, the mantissa of a normalized machine number has one digit (of base $b$) before the decimal point and $p-1$ digits after the decimal point.

If $b=2$, one needs only $p-1$ bits to store the mantissa of normalized floating-point numbers.

The IEEE 754 standard, implemented by most modern processors and programming languages, defines

- `Float32` as $𝕄(2, 24, -126, 127)$ and
- `Float64` as $𝕄(2, 53, -1022, 1023).$

:::
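
These parameters can be queried in Julia (a short check; `precision` returns the mantissa length $p$ in bits, including the implicit leading bit):

```{julia}
precision(Float32), precision(Float64), exponent(floatmin(Float64)), exponent(floatmax(Float64))
```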

### Structure of `Float64` according to the [IEEE 754 standard](https://en.wikipedia.org/wiki/IEEE_754)

::: {.content-visible when-format="html"}
![Structure of a `Float64` (source:^[Source: <a href="https://commons.wikimedia.org/wiki/File:IEEE_754_Double_Floating_Point_Format.svg">Codekaizen</a>, <a href="https://creativecommons.org/licenses/by-sa/4.0">CC BY-SA 4.0</a>, via Wikimedia Commons ])
](../images/1024px-IEEE_754_Double_Floating_Point_Format.png)
:::

::: {.content-visible when-format="typst"}
{width="70%"}
:::

- 1 sign bit $S$
- 11 bits for the exponent, thus $0\le E \le 2047$
- The values $E=0$ and $E=(11111111111)_2=2047$ are reserved for encoding special values such as
  $\pm0, \pm\infty$, NaN _(Not a Number)_, and subnormal numbers.
- 52 bits for the (shortened) mantissa $M,\quad 0\le M<1$, corresponding to approximately 16 decimal digits
- Thus, the number represented is:
  $$ x=(-1)^S \cdot(1+M)\cdot 2^{E-1023}$$

An example:
```{julia}
x = 27.56640625
bitstring(x)
```
This can be displayed more clearly:

```{julia}
function printbitsf64(x::Float64)
    s = bitstring(x)
    printstyled(s[1], color = :blue, reverse = true)
    printstyled(s[2:12], color = :green, reverse = true)
    printstyled(s[13:end], color = :red, bold = true, reverse = true)
    print("\n")
end

printbitsf64(x)
```

and we can read off S (blue), E (green), and M (red):
$$
\begin{aligned}
S &= 0\\
E &= (10000000011)_2 = 2^{10} + 2^1 + 2^0 = 1027\\
M &= (.101110010001)_2 = \frac{1}{2} + \frac{1}{2^3} + \frac{1}{2^4} + \frac{1}{2^5} + \frac{1}{2^8} + \frac{1}{2^{12}}\\
x &=(-1)^S \cdot(1+M)\cdot 2^{E-1023}
\end{aligned}
$$

... and thus reconstruct the number:

```{julia}
x = (1 + 1/2 + 1/8 + 1/16 + 1/32 + 1/256 + 1/4096) * 2^4
```

- The set of machine numbers 𝕄 forms a finite, discrete subset of ℝ. There exists a smallest and a largest machine number; all other elements of 𝕄 have both a predecessor and a successor in 𝕄.
- What is the successor of our $x$ in 𝕄? We obtain it by setting the lowest mantissa bit from 0 to 1.
- Converting a string of zeros and ones into the corresponding machine number:

```{julia}
ux = parse(UInt64, "0100000000111011100100010000000000000000000000000000000000000001", base=2)
reinterpret(Float64, ux)
```

However, Julia can do this more simply:

```{julia}
y = nextfloat(x)
```

The predecessor of x is:

```{julia}
z = prevfloat(x)
println(z)
printbitsf64(z)
```

## Machine Epsilon

- The distance between `1` and its successor `nextfloat(1.0)` is called [**machine epsilon**](https://en.wikipedia.org/wiki/Machine_epsilon).
- For `Float64` with a mantissa length of 52 bits, $\epsilon=2^{-52}$.

```{julia}
@show nextfloat(1.) - 1 2^-52 eps(Float64);
```

- Machine epsilon measures the relative distance between machine numbers and quantifies the statement: "64-bit floating-point numbers have a precision of approximately 16 decimal digits."
- Machine epsilon should not be confused with the smallest positive floating-point number:

```{julia}
floatmin(Float64)
```

- Part of the literature uses a different definition of machine epsilon, which is half as large:
  $$
  \epsilon' = \frac{\epsilon}{2}\approx 1.1\times 10^{-16}
  $$
  This is the maximum relative error that can occur when rounding a real number to the nearest machine number.
- Since all numbers in the interval $[1-\tfrac{\epsilon'}{2},\ 1+\epsilon']$ are rounded to the machine number $1$, one can also define $\epsilon'$ as: *the largest number for which $1+\epsilon' = 1$ still holds in machine arithmetic.*

This allows us to compute machine epsilon using floating-point arithmetic itself:

:::{.ansitight}

```{julia}
Eps = 1
while 1 != 1 + Eps
    Eps /= 2
    println(1 + Eps)
end
Eps
```

or as a bit pattern:

```{julia}
Eps = 1
while 1 != 1 + Eps
    Eps /= 2
    printbitsf64(1 + Eps)
end
Eps
```

:::

:::{.callout-note}
## The Set of (normalized) Machine Numbers

- In the interval $[1,2)$ there are $2^{52}$ equidistant machine numbers.
- After that, the exponent increases by 1 and the mantissa $M$ is reset to 0. Thus, the interval $[2,4)$ again contains $2^{52}$ equidistant machine numbers, as does the interval $[4,8)$, and so on up to $[2^{1023}, 2^{1024})$.
- Likewise, the intervals $[\frac{1}{2},1),\ [\frac{1}{4},\frac{1}{2}),\dots$ each contain $2^{52}$ equidistant machine numbers, down to $[2^{-1022}, 2^{-1021})$.
- This forms the set $𝕄_+$ of positive machine numbers, and we have
  $$
  𝕄 = -𝕄_+ \cup \{0\} \cup 𝕄_+
  $$

:::
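
The absolute spacing of the machine numbers thus doubles with each power of two, while the relative spacing stays the same. In Julia, `eps(x)` returns the spacing (the *unit in the last place*) at `x` (a quick illustration):

```{julia}
eps(1.0), eps(2.0), eps(1024.0), eps(2.0^52)
```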

The largest and smallest positive representable normalized floating-point numbers of a floating-point type can be queried:

```{julia}
@show floatmax(Float64)
printbitsf64(floatmax(Float64))

@show floatmin(Float64)
printbitsf64(floatmin(Float64))
```

## Rounding to Machine Numbers

- The map $\text{rd}: ℝ \rightarrow 𝕄$ should round to the nearest representable number.
- The standard rounding mode is _round to nearest, ties to even_:
  when a value falls exactly midway between two machine numbers *(a tie)*, the one whose last mantissa bit is 0 is selected.
- Justification: ties are rounded up in half of the cases and down in the other half, which avoids a "statistical drift" in longer calculations.
- It holds that
  $$
  \frac{|x-\text{rd}(x)|}{|x|} \le \frac{1}{2} \epsilon
  $$
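
Ties-to-even can be observed directly (a small check; recall that $\epsilon = 2^{-52}$ for `Float64`):

```{julia}
# 1 + ε/2 lies exactly midway between 1 and 1 + ε: the tie goes to 1 (last mantissa bit 0)
# 1 + 3ε/2 lies exactly midway between 1 + ε and 1 + 2ε: the tie goes to 1 + 2ε
1 + eps()/2 == 1, 1 + 3*eps()/2 == 1 + 2*eps()
```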

## Machine Number Arithmetic

The machine numbers, as a subset of ℝ, are not closed under the arithmetic operations: even the sum of two machine numbers is in general not representable as a machine number.

:::{.callout-important}
The IEEE 754 standard requires that machine number arithmetic produce the *rounded exact result*:

The result must be equal to the one obtained by an exact execution of the corresponding operation followed by rounding.
$$
a \oplus b = \text{rd}(a + b)
$$
The same must hold for the implementation of standard functions such as
`sqrt()`, `log()`, `sin()`, ... -- they also return the machine number closest to the exact result.
:::

Arithmetic is *not associative*:

```{julia}
1 + 10^-16 + 10^-16
```

```{julia}
1 + (10^-16 + 10^-16)
```

In the first case (without parentheses), evaluation proceeds from left to right:
$$
\begin{aligned}
1 \oplus 10^{-16} \oplus 10^{-16} &=
(1 \oplus 10^{-16}) \oplus 10^{-16} \\ &=
\text{rd}\left(\text{rd}(1 + 10^{-16}) + 10^{-16} \right)\\
&= \text{rd}(1 + 10^{-16})\\
&= 1
\end{aligned}
$$

In the second case, one obtains:
$$
\begin{aligned}
1 \oplus \left(10^{-16} \oplus 10^{-16}\right) &=
\text{rd}\left(1 + \text{rd}(10^{-16} + 10^{-16}) \right)\\
&= \text{rd}(1 + 2\cdot 10^{-16})\\
&= 1.0000000000000002
\end{aligned}
$$

One should also remember that even "simple" decimal fractions cannot always be represented exactly as machine numbers:

$$
\begin{aligned}
(0.1)_{10} &= (0.000110011001100110011001100...)_2 = (0.000\overline{1100})_2 \\
(0.3)_{10} &= (0.0100110011001100110011001100...)_2 = (0.0\overline{1001})_2
\end{aligned}
$$

```{julia}
printbitsf64(0.1)
printbitsf64(0.3)
```

Consequence:

```{julia}
0.1 + 0.1 == 0.2
```

```{julia}
0.2 + 0.1 == 0.3
```

```{julia}
0.2 + 0.1
```

When outputting a machine number, the binary fraction must be converted to a decimal fraction. Julia can display more digits of this decimal expansion:
```{julia}
using Printf
@printf("%.30f", 0.1)
```

```{julia}
@printf("%.30f", 0.3)
```
The mantissa of a machine number is a finite binary fraction, so its exact decimal expansion is always finite -- but it can be long, as seen above. One should not be misled into thinking that these extra digits represent "higher precision"!

:::{.callout-important}
Key message: When testing `Float`s for equality, one should almost always define a realistic tolerance `epsilon` appropriate to the problem and test:

```julia
epsilon = 1.e-10

if abs(x - y) < epsilon
    # ...
end
```
:::
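
Julia also provides `isapprox(x, y)` (infix `x ≈ y`, typed as `\approx<TAB>`) for such comparisons; absolute and relative tolerances can be set via the keyword arguments `atol` and `rtol` (a brief aside):

```{julia}
@show (0.2 + 0.1 ≈ 0.3) isapprox(0.2 + 0.1, 0.3, atol = 1e-10);
```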

## Normalized and Subnormal Machine Numbers

The gap between zero and the smallest normalized machine number $2^{-1022} \approx 2.22\times 10^{-308}$
is filled with subnormal machine numbers.

Let's look at a simple model:

- Let 𝕄(10,4,±5) be the set of machine numbers to base 10 with 4 mantissa digits (one before the decimal point, 3 after) and the exponent range $-5 \le E \le 5$.
- Then the normalized representation (nonzero leading digit) of, e.g., 1234.0 is 1.234e3 and of 0.00789 is 7.890e-3.
- It is essential that machine numbers remain normalized at each computation step. Only then is the full mantissa length utilized, maximizing accuracy.
- The smallest positive normalized number in this model is `x = 1.000e-5`. Already `x/2` would have to be rounded to 0.
- For many applications, however, it is advantageous to also allow subnormal numbers and to represent `x/2` as `0.500e-5` or `x/20` as `0.050e-5`.
- This *gradual underflow* is of course associated with a loss of significant digits and thus accuracy.

In the `Float` data types, such *subnormal values* are represented by an exponent field in which all bits are equal to zero:

```{julia}
#| echo: false
#| output: false
flush(stdout)
```

:::{.ansitight}
```{julia}
using Printf

x = 2 * floatmin(Float64)   # twice the smallest normalized floating-point number > 0

while x != 0
    x /= 2
    @printf "%14.6g " x
    printbitsf64(x)
end
```
:::
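
Julia's `issubnormal()` function tests for such values (a quick check):

```{julia}
x = floatmin(Float64) / 2    # half the smallest normalized number -> subnormal
issubnormal(x), issubnormal(floatmin(Float64))
```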

## Special Values

Floating-point arithmetic defines certain special values, e.g.,
```{julia}
nextfloat(floatmax(Float64))
```

```{julia}
for x ∈ (NaN, Inf, -Inf, -0.0)
    @printf "%6g " x
    printbitsf64(x)
end
```

- An exponent overflow leads to the result `Inf` or `-Inf`.
```{julia}
2/0, -3/0, floatmax(Float64) * 1.01, exp(1300)
```
- One can continue calculating with these values:

```{julia}
-Inf + 20, Inf/30, 23/-Inf, sqrt(Inf), Inf * 0, Inf - Inf
```
- `NaN` *(Not a Number)* represents the result of an undefined operation. All further operations with `NaN` also result in `NaN`.

```{julia}
0/0, Inf - Inf, 2.3NaN, sqrt(NaN)
```
- Since `NaN` represents an undefined value, it is not equal to anything, not even to itself. This is sensible: if two variables `x` and `y` both end up as `NaN`, one should not conclude that they are equal.
- There is therefore a Boolean function `isnan()` to test for `NaN`.

```{julia}
x = 0/0
y = Inf - Inf
@show x==y NaN==NaN isfinite(NaN) isinf(NaN) isnan(x) isnan(y);
```

- There is a "minus zero". It signals a numerical underflow of a small *negative* quantity.

```{julia}
@show 23/-Inf -2/exp(1200) -0.0==0.0;
```

## Mathematical Functions

Julia has the [usual mathematical functions](https://docs.julialang.org/en/v1/manual/mathematical-operations/#Rounding-functions)
`sqrt, exp, log, log2, log10, sin, cos, ..., asin, acos, ..., sinh, ..., gcd, lcm, factorial, ..., abs, max, min, ...`,

including, e.g., the [rounding functions](https://en.wikipedia.org/wiki/Floor_and_ceiling_functions)

- `floor(T, x)` = $\lfloor x \rfloor$
- `ceil(T, x)` = $\lceil x \rceil$

```{julia}
floor(3.4), floor(Int64, 3.5), floor(Int64, -3.5)
```

```{julia}
ceil(3.4), ceil(Int64, 3.5), ceil(Int64, -3.5)
```

Also worth noting is `atan(y, x)`, the two-argument arctangent (known as `atan2` in many programming languages, see [atan2](https://en.wikipedia.org/wiki/Atan2)).
It solves the problem of converting Cartesian to polar coordinates without awkward case distinctions:

- `atan(y, x)` is the angle of the polar representation of the point $(x,y)$, in the interval $(-\pi,\pi]$. In the 1st and 4th quadrants it therefore equals `atan(y/x)`.

```{julia}
atan(3, -2), atan(-3, 2), atan(-3/2)
```
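
A complete Cartesian-to-polar conversion then takes just two function calls (a small example; `hypot(x, y)` computes $\sqrt{x^2+y^2}$ while avoiding intermediate overflow):

```{julia}
x, y = -2.0, 3.0
r, φ = hypot(x, y), atan(y, x)   # radius and angle of the point (x, y)
(r, φ)
```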

## Conversion Between Strings and Numbers

Use the functions `parse()` and `string()` for such conversions:

```{julia}
parse(Int64, "1101", base=2)
```

```{julia}
string(13, base=2)
```

```{julia}
string(1/7)
```

```{julia}
string(77, base=16)
```

For converting numerical types into each other, one can use the type names, since type names are also constructors:

```{julia}
x = Int8(67)
@show x typeof(x);
```

```{julia}
z = UInt64(3459)
```

```{julia}
y = Float64(z)
```

## Literature

- D. Goldberg, [What Every Computer Scientist Should Know About Floating-Point Arithmetic](http://www.validlab.com/goldberg/paper.pdf)
- C. Vuik, [Some Disasters caused by numerical errors](http://ta.twi.tudelft.nl/users/vuik/wi211/disasters.html)