How to force the Rust compiler to use specific x86 instructions (popcnt, etc.)

Sometimes you need tricky operations on your binary data, like counting set bits or leading and trailing zeros (for example, when you want to implement a HAMT or sparse arrays). In Rust this is achieved with methods like .count_ones() and .trailing_zeros(). The problem with those methods is that, with the default x86_64 target settings, they are usually expanded into a huge pile of assembly code. But x86 (and some other architectures) has dedicated instructions for these counts (specifically, popcnt and tzcnt) and they are really fast (on many modern cores, a throughput of about one instruction per cycle with a latency of about 3 cycles). Today we will learn how to force the Rust compiler to use those instructions and what the possible pitfalls are.

Let’s start with an example. Here’s the code that finds population counts of random integers (on Rust Playground):

use rand::prelude::*;

// this is only to have this function separately in the asm output
#[inline(never)]
fn count(a: u32) -> u32 {
    a.count_ones()
}

fn main() {
    let a: u32 = random();
    println!("{}", count(a));
    println!("{}", count(a + 1));
}

And here is the assembly generated for the count function:

playground::count:
	mov	eax, edi
	shr	eax
	and	eax, 1431655765
	sub	edi, eax
	mov	eax, edi
	and	eax, 858993459
	shr	edi, 2
	and	edi, 858993459
	add	edi, eax
	mov	eax, edi
	shr	eax, 4
	add	eax, edi
	and	eax, 252645135
	imul	eax, eax, 16843009
	shr	eax, 24
	ret

Wow. A huge pile of instructions and magic numbers. This is clearly not something we would like to see, especially in performance-critical places. Let’s make this a little bit better (on Rust Playground):

use rand::prelude::*;

// this is only to have this function separately in the asm output
#[inline(never)]
#[cfg_attr(target_arch = "x86_64", target_feature(enable = "popcnt"))]
unsafe fn count(a: u32) -> u32 {
    a.count_ones()
}

fn main() {
    let a: u32 = random();
    println!("{}", unsafe { count(a) });
    println!("{}", unsafe { count(a + 1) });
}

And here is the assembly code of count:

playground::count:
	popcnt	eax, edi
	ret

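Only a single very fast instruction! And this works with all integer types. There is a pitfall though: a function compiled with target_feature must only run on a CPU that actually supports that feature, otherwise the behavior is undefined. That is why Rust requires us to mark such functions as unsafe (recent Rust versions also accept target_feature on safe functions, but calling them from code that does not enable the same features still needs unsafe), so it makes sense to keep the functions that use those features as small as possible.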
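One way to keep the unsafe part contained is to check for the feature at runtime and fall back to the portable version. The following is only a sketch: the names count_popcnt and count_portable are made up for illustration, while is_x86_feature_detected! is the standard library macro for runtime CPU feature detection.

```rust
// Hypothetical helper names; only `is_x86_feature_detected!` and
// `count_ones` come from the standard library.

// Compiled with popcnt enabled, so count_ones() can become one instruction.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "popcnt")]
unsafe fn count_popcnt(a: u32) -> u32 {
    a.count_ones()
}

// Portable fallback compiled with the default target features.
fn count_portable(a: u32) -> u32 {
    a.count_ones()
}

// Safe wrapper: picks the fast path only when the CPU reports popcnt.
pub fn count(a: u32) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("popcnt") {
            // SAFETY: popcnt support was checked at runtime just above.
            return unsafe { count_popcnt(a) };
        }
    }
    count_portable(a)
}

fn main() {
    println!("{}", count(0b1011)); // prints 3
}
```

The runtime check has a small cost of its own, so in a hot loop it is probably better to do the detection once outside the loop and put the whole loop into the target_feature function.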

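Also, you can do something similar with .trailing_zeros() by using target_feature(enable = "bmi1"), which allows tzcnt, and with .leading_zeros() by using target_feature(enable = "lzcnt"), since lzcnt is a separate feature and is not part of bmi1.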
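Here is a minimal sketch of the same trick for zero counts. It assumes that tzcnt is gated by the bmi1 feature and lzcnt by the separate lzcnt feature; the function names are hypothetical and the safe wrappers use runtime detection so the example also runs on CPUs without those features.

```rust
// Hypothetical function names; the feature strings "bmi1" and "lzcnt"
// are the ones this sketch assumes for tzcnt and lzcnt respectively.

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi1")]
unsafe fn trailing_tzcnt(a: u32) -> u32 {
    a.trailing_zeros()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "lzcnt")]
unsafe fn leading_lzcnt(a: u32) -> u32 {
    a.leading_zeros()
}

// Safe wrapper around the bmi1 version with a portable fallback.
pub fn trailing(a: u32) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("bmi1") {
            // SAFETY: bmi1 support was checked at runtime just above.
            return unsafe { trailing_tzcnt(a) };
        }
    }
    a.trailing_zeros()
}

// Safe wrapper around the lzcnt version with a portable fallback.
pub fn leading(a: u32) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("lzcnt") {
            // SAFETY: lzcnt support was checked at runtime just above.
            return unsafe { leading_lzcnt(a) };
        }
    }
    a.leading_zeros()
}

fn main() {
    println!("{} {}", trailing(8), leading(1)); // prints 3 31
}
```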

To find feature names you can refer to the architecture-specific intrinsics documentation (for x86, the std::arch module documentation lists the required feature for each intrinsic).
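If you are not sure whether a feature name is already enabled for the whole build, a quick check with cfg! can help. This is just a sketch with a made-up helper name (compile_time_features); it reports the three features mentioned in this article and says nothing about what the CPU supports at runtime.

```rust
// Hypothetical helper: lists whether each feature was enabled at compile
// time for the whole crate (not whether the running CPU supports it).
fn compile_time_features() -> Vec<(&'static str, bool)> {
    vec![
        ("popcnt", cfg!(target_feature = "popcnt")),
        ("bmi1", cfg!(target_feature = "bmi1")),
        ("lzcnt", cfg!(target_feature = "lzcnt")),
    ]
}

fn main() {
    for (name, enabled) in compile_time_features() {
        let state = if enabled { "enabled" } else { "not enabled" };
        println!("{name}: {state}");
    }
}
```

With a default x86_64 build these are typically reported as not enabled; passing something like -C target-feature=+popcnt or -C target-cpu=native to rustc changes that for the whole crate, at the price of a binary that may not run on older CPUs.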

That’s all, hope you find it useful!