Fast float to integer conversions
created , updated
In Rust, the standard way of converting floating point values to integers is with the as
operator. This conversion has various guarantees as listed in the reference. One of them is that it saturates: Input values out of range of the output type convert to the minimal/maximal value of the output type.
assert_eq!(300f32 as u8, 255);
assert_eq!(-5f32 as u8, 0);
Saturation comes with a downside. It is slower than not saturating. In C/C++ this kind of cast is undefined behavior. Rust's casts are never undefined behavior (good), but they are slower than C/C++.
On many hardware targets a float to integer conversion can be done in one instruction. For example CVTTSS2SI
on x86_84+SSE. This is the minimum amount of work we need to do to perform the conversion. This is what the C/C++ cast compiles to. Rust needs more instructions because of the additional guarantees.
Sometimes you want faster conversions and don't need saturation. You might already know that your inputs are in range or you might not care about the result for out of range inputs. My new fast-float-to-integer crate provides this. If in range, then the conversion functions work like the standard as
operator conversion. If not in range, then you get an unspecified value. We get the speed of the C/C++ version without the undefined behavior.
You never get undefined behavior but you can get unspecified behavior. In the unspecified case, you get an arbitrary value. The function returns and you get a valid value of the output type, but there is no guarantee what that value is. On the assembly level the result is specified but we cannot guarantee this in the crate because the behavior varies between platforms.
Assembly
We keep track of the generated assembly in the repository and enforce on CI that it is up to date. Here is a typical assembly comparison on x86_64+SSE.
standard:
f32_to_i64:
cvttss2si rax, xmm0
ucomiss xmm0, dword ptr [rip + .L_0]
movabs rcx, 9223372036854775807
cmovbe rcx, rax
xor eax, eax
ucomiss xmm0, xmm0
cmovnp rax, rcx
ret
fast:
f32_to_i64:
cvttss2si rax, xmm0
ret
As expected, we do the conversion in one instruction. The extra assembly in the standard version performs the saturation.
A more complicated case is f32 to u64. We cannot handle this in one instruction, because the cvttss2si instruction outputs an i64. It cannot handle the values between i64::MAX
and u64::MAX
. (AVX512 could do this, but the intrinsics are not stable and it might not be faster.) With some clever optimization, we can still do better than the standard version. I copied the following Rust code from the repository.
standard:
f32_to_u64:
cvttss2si rax, xmm0
mov rcx, rax
sar rcx, 63
movaps xmm1, xmm0
subss xmm1, dword ptr [rip + .L_0]
cvttss2si rdx, xmm1
and rdx, rcx
or rdx, rax
xor ecx, ecx
xorps xmm1, xmm1
ucomiss xmm0, xmm1
cmovae rcx, rdx
ucomiss xmm0, dword ptr [rip + .L_1]
mov rax, -1
cmovbe rax, rcx
ret
fast:
f32_to_u64:
cvttss2si rcx, xmm0
addss xmm0, dword ptr [rip + .L_0]
cvttss2si rdx, xmm0
mov rax, rcx
sar rax, 63
and rax, rdx
or rax, rcx
ret
First, let us consider a relatively naive branchful solution:
#[inline(always)]
fn _f32_to_u64_branchful(float: f32) -> u64 {
const THRESHOLD_FLOAT: f32 = power_of_two_f32(63);
const THRESHOLD_INTEGER: u64 = 2u64.pow(63);
let in_range = float <= THRESHOLD_FLOAT;
if in_range {
f32_to_i64(float) as u64
} else {
// Subtract the threshold from the float. The result is >= 0 because the input is larger
// than the subtrahend. The result is <= i64::MAX because `u64::MAX - i64::MAX == i64::MAX`.
let in_range_float = float - THRESHOLD_FLOAT;
let integer = f32_to_i64(in_range_float) as u64;
// Overflow is benign because it can only occur for invalid inputs.
integer.overflowing_add(THRESHOLD_INTEGER).0
}
}
We can turn this into a less naive branchless solution:
#[inline(always)]
fn f32_to_u64_branchless(float: f32) -> u64 {
const THRESHOLD: f32 = power_of_two_f32(63);
let integer1 = f32_to_i64(float);
let integer2 = f32_to_i64(float - THRESHOLD);
// If the input is larger than i64::MAX, then integer1 is i64::MIN. This value has 1 as the
// leftmost bit and 0 as the remaining bits. Right shift on signed values is arithmetic, not
// logical [1]. We end up with all 0 (in range) or all 1 (out of range).
let too_large = integer1 >> 63;
// # If the input is not too large:
//
// Integer1 has the correct value. The mask is all 0, which makes the Or result in integer1.
//
// # If the input is too large:
//
// Integer1 is i64::MIN and the mask is all 1. The Or results in `i64::MIN | integer2`.
// integer2 has the correct result minus 2**63. This is the correct result without the
// leftmost bit. The Or adds the missing leftmost bit back.
(integer1 | (integer2 & too_large)) as u64
}
This compiles to the assembly above.