Playing with AES intrinsics

What are intrinsics

Intrinsics are a way to access special CPU instructions through special C functions and data types. They seem fairly portable (YMMV): multiple mainstream compilers (GCC, Intel, Visual Studio, and LLVM) provide support.

Here I’ll look at how to use them from a typical C program, which gives quick access to fancy instructions without having to drop to inline assembly or a separate assembly file.

In this adventure, I’m particularly interested in the AES instruction set found in the latest AMD and Intel CPUs.
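
Since only recent CPUs have these instructions, it can be worth checking for them at run time before taking the fast path. Here is a minimal sketch, assuming GCC or Clang on x86: `__builtin_cpu_supports` is a builtin of those compilers, and `cpu_has_aesni` is a hypothetical helper name of mine. Other compilers need a different mechanism.

```c
#include <stdbool.h>

/* Returns true when the running CPU advertises the AES instruction set.
 * Assumption: GCC or Clang targeting x86; elsewhere we conservatively
 * report "not available" so the caller can use a software fallback. */
static bool cpu_has_aesni(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();                      /* initialise CPU feature data */
    return __builtin_cpu_supports("aes") != 0; /* query the "aes" feature bit */
#else
    return false;
#endif
}
```

A caller would typically test this once at startup and pick either the intrinsics code or a plain C implementation.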

Data types

Intrinsics have their own data types, and the only one I’m interested in here is __m128i (declared in <emmintrin.h>), which represents a 128-bit value. This simple program loads a value into that type from two uint64_t. The aligned SSE load and store operations need a 16-byte-aligned address, which is the reason for the aligned attribute.

    uint64_t _value[2] __attribute__((aligned(16))) = { 0x0, 0x1 };
    __m128i value;
    value = _mm_load_si128((__m128i *) _value);

Now we can get a simple value out using:

    __m128i value;
    uint64_t out[2] __attribute__((aligned(16)));

    _mm_store_si128((__m128i *) out, value);
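
The two snippets above can be put together as a round trip through an SSE register. This is only a sketch: `roundtrip_m128i` is a hypothetical helper name, and it uses C11 `alignas` instead of the GCC-style attribute, on the assumption that a C11 compiler is available.

```c
#include <emmintrin.h> /* SSE2: __m128i, _mm_load_si128, _mm_store_si128 */
#include <stdalign.h>  /* C11 alignas */
#include <stdint.h>

/* Copy two 64-bit words into an SSE register and back out again. */
static void roundtrip_m128i(const uint64_t in[2], uint64_t out[2])
{
    /* Local copies are 16-byte aligned, as the aligned load/store require. */
    alignas(16) uint64_t src[2] = { in[0], in[1] };
    alignas(16) uint64_t dst[2];

    __m128i value = _mm_load_si128((const __m128i *) src); /* aligned load  */
    _mm_store_si128((__m128i *) dst, value);               /* aligned store */

    out[0] = dst[0];
    out[1] = dst[1];
}
```

When the alignment of a buffer cannot be guaranteed, `_mm_loadu_si128` and `_mm_storeu_si128` are the unaligned counterparts.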

Simple operations

From there it’s easy to use simple operations like XOR:

    /* needs <emmintrin.h>, <inttypes.h>, <stdint.h> and <stdio.h> */
    #define ALIGNED(n) __attribute__((aligned(n)))

    uint64_t _v1[2] ALIGNED(16) = { 0x1020304050607080, 0x90a0b0c0d0e0f000 };
    uint64_t _v2[2] ALIGNED(16) = { 0xffffffffffffffff, 0xffffffffffffffff };
    uint64_t _out[2] ALIGNED(16) = { 0, 0 };
    __m128i v1;
    __m128i v2;
    int i;

    v1 = _mm_load_si128((__m128i *) _v1);
    v2 = _mm_load_si128((__m128i *) _v2);
    v1 = _mm_xor_si128(v1, v2);

    _mm_store_si128((__m128i *) _out, v1);
    for (i = 0; i < 2; i++) {
        /* PRIx64 is the portable format for uint64_t ("%lx" is not) */
        printf("%" PRIx64, _out[i]);
    }
    printf("\n");
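
The same XOR works directly on byte buffers, which is the shape block ciphers usually deal with. The sketch below is my own variation, not part of the program above: `xor_block16` is a hypothetical name, and it uses the unaligned load and store intrinsics so the caller does not have to align anything.

```c
#include <emmintrin.h> /* SSE2: _mm_loadu_si128, _mm_xor_si128, _mm_storeu_si128 */
#include <stdint.h>

/* out = a XOR b, for 16-byte blocks with no alignment requirement. */
static void xor_block16(uint8_t out[16], const uint8_t a[16], const uint8_t b[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *) a); /* unaligned load  */
    __m128i vb = _mm_loadu_si128((const __m128i *) b);
    _mm_storeu_si128((__m128i *) out, _mm_xor_si128(va, vb));
}
```

XORing with a block of 0xff bytes, as in the program above, simply flips every bit of the input.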

AES operation

AES is composed of two phases: key expansion, then the rounds.

The first phase is covered by the AESKEYGENASSIST instruction, which is available as the intrinsic _mm_aeskeygenassist_si128.

This operation takes two parameters: the previous round key, and the round constant for the round being generated (the value from AES’s rcon table), which has to be a compile-time constant. Here’s how to generate an AES 128-bit expanded key:

    __m128i aes128_keyexpand(__m128i key, __m128i keygened)
    {
        key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
        key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
        key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
        keygened = _mm_shuffle_epi32(keygened, _MM_SHUFFLE(3,3,3,3));
        return _mm_xor_si128(key, keygened);
    }

    #define KEYEXP(K, I) aes128_keyexpand(K, _mm_aeskeygenassist_si128(K, I))

    /* The initial part of the expanded key is the key itself
     * (_k is the 16-byte aligned buffer holding the AES key). */
    __m128i K0  = _mm_load_si128((__m128i*)(_k));
    /* then every step generates one more part of the expanded key */
    __m128i K1  = KEYEXP(K0, 0x01);
    __m128i K2  = KEYEXP(K1, 0x02);
    __m128i K3  = KEYEXP(K2, 0x04);
    __m128i K4  = KEYEXP(K3, 0x08);
    __m128i K5  = KEYEXP(K4, 0x10);
    __m128i K6  = KEYEXP(K5, 0x20);
    __m128i K7  = KEYEXP(K6, 0x40);
    __m128i K8  = KEYEXP(K7, 0x80);
    __m128i K9  = KEYEXP(K8, 0x1B);
    __m128i K10 = KEYEXP(K9, 0x36);
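
The three shift-and-XOR lines in aes128_keyexpand are the least obvious part. As far as I can tell, they turn the four 32-bit words of the key into their running XOR (w0, w0^w1, w0^w1^w2, w0^w1^w2^w3), which is what the AES key schedule needs before the keygenassist word is mixed in. The sketch below isolates just that step so it can be checked with plain SSE2; `prefix_xor_words` is a hypothetical name used for illustration.

```c
#include <emmintrin.h> /* SSE2: _mm_slli_si128, _mm_xor_si128 */
#include <stdint.h>

/* Apply the same three steps as aes128_keyexpand to four 32-bit words.
 * _mm_slli_si128(key, 4) shifts the whole register left by 4 bytes, so
 * word 0 moves into word 1, word 1 into word 2, and so on. */
static void prefix_xor_words(const uint32_t w[4], uint32_t out[4])
{
    __m128i key = _mm_loadu_si128((const __m128i *) w);

    key = _mm_xor_si128(key, _mm_slli_si128(key, 4)); /* w0, w1^w0, w2^w1, w3^w2 */
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4)); /* w0, w1, w2^w0, w3^w1    */
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4)); /* running XOR of w0..w3   */

    _mm_storeu_si128((__m128i *) out, key);
}
```

In the real key expansion, the word broadcast by `_mm_shuffle_epi32(keygened, _MM_SHUFFLE(3,3,3,3))` is then XORed into all four of these running XORs to give the next round key.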

Then come the rounds: 10 in total, where the last round is special. These are covered by AESENC and AESENCLAST, available as _mm_aesenc_si128 and _mm_aesenclast_si128 respectively.

Both intrinsics take the current state of the encryption as the first parameter and the key part used for the round as the second parameter.

A typical AES 128-bit encryption looks like this:

    /* load the 16-byte message into m */
    __m128i m = _mm_load_si128((const __m128i *) _m);
    /* first xor the loaded message with K0, which is the AES key supplied */
    m = _mm_xor_si128(m, K0);
    /* then do 9 rounds of aesenc, using the associated key parts */
    m = _mm_aesenc_si128(m, K1);
    m = _mm_aesenc_si128(m, K2);
    m = _mm_aesenc_si128(m, K3);
    m = _mm_aesenc_si128(m, K4);
    m = _mm_aesenc_si128(m, K5);
    m = _mm_aesenc_si128(m, K6);
    m = _mm_aesenc_si128(m, K7);
    m = _mm_aesenc_si128(m, K8);
    m = _mm_aesenc_si128(m, K9);
    /* then 1 aesenclast round */
    m = _mm_aesenclast_si128(m, K10);
    /* and then we store the result in an out variable */
    _mm_store_si128((__m128i *) _out, m);
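
For reference, here is one way the pieces could be assembled into a single function. It is a sketch under a few assumptions rather than the actual backend: the names `AESNI_TARGET`, `next_round_key`, `NEXT_KEY` and `aes128_encrypt_block` are mine, the round keys live in an array so the rounds can be a loop, and the GCC/Clang `target` attribute is used so the file builds without passing -maes. It encrypts one 16-byte block only, with no chaining mode.

```c
#include <emmintrin.h> /* SSE2 */
#include <wmmintrin.h> /* AES-NI: _mm_aesenc_si128 and friends */
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
#define AESNI_TARGET __attribute__((target("aes,sse2")))
#else
#define AESNI_TARGET /* assumption: other compilers need no per-function opt-in */
#endif

/* One key-schedule step, same construction as aes128_keyexpand above. */
AESNI_TARGET
static __m128i next_round_key(__m128i key, __m128i assist)
{
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    return _mm_xor_si128(key, _mm_shuffle_epi32(assist, _MM_SHUFFLE(3, 3, 3, 3)));
}

/* A macro, because the round constant must be a compile-time immediate. */
#define NEXT_KEY(K, RCON) next_round_key((K), _mm_aeskeygenassist_si128((K), (RCON)))

/* Encrypt a single 16-byte block with a 16-byte key (AES-128). */
AESNI_TARGET
static void aes128_encrypt_block(const uint8_t key[16], const uint8_t in[16],
                                 uint8_t out[16])
{
    __m128i rk[11]; /* rk[0] is the key itself, rk[1..10] the expanded parts */
    __m128i m;
    int round;

    rk[0]  = _mm_loadu_si128((const __m128i *) key);
    rk[1]  = NEXT_KEY(rk[0], 0x01);
    rk[2]  = NEXT_KEY(rk[1], 0x02);
    rk[3]  = NEXT_KEY(rk[2], 0x04);
    rk[4]  = NEXT_KEY(rk[3], 0x08);
    rk[5]  = NEXT_KEY(rk[4], 0x10);
    rk[6]  = NEXT_KEY(rk[5], 0x20);
    rk[7]  = NEXT_KEY(rk[6], 0x40);
    rk[8]  = NEXT_KEY(rk[7], 0x80);
    rk[9]  = NEXT_KEY(rk[8], 0x1B);
    rk[10] = NEXT_KEY(rk[9], 0x36);

    m = _mm_loadu_si128((const __m128i *) in);
    m = _mm_xor_si128(m, rk[0]);             /* initial whitening with the key */
    for (round = 1; round < 10; round++)
        m = _mm_aesenc_si128(m, rk[round]);  /* 9 full rounds */
    m = _mm_aesenclast_si128(m, rk[10]);     /* final round */
    _mm_storeu_si128((__m128i *) out, m);
}
```

A handy sanity check is the AES-128 example vector from FIPS-197 Appendix C.1: key 000102030405060708090a0b0c0d0e0f and plaintext 00112233445566778899aabbccddeeff should give 69c4e0d86a7b0430d8cdb78070b4c55a. Expanding the key on every call is wasteful, so real code would expand once and reuse the round keys.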

That’s it. This is a very small tutorial that doesn’t go into the depth of every operation used. It came out of writing an AES-NI backend for my AES implementation, and it might help people understand how to use those fancy instructions (SSE, NEON, ...), which more often than not yield a very significant improvement over standard instructions.


posted by Vincent Hanquez on April 12, 2012.

tags c, aes, crypto, intrinsics.

in technical.