Geohash in Golang Assembly
Lessons in absurd optimization
Geohash is a geocoding scheme which maps a latitude/longitude pair to a 64-bit integer or an equivalent short character string. It has become popular as a representation of point data in non-geospecific databases, where the reduction to a single value enables proximity queries to take advantage of standard indexes. Indeed, integer geohashes are used by Redis to implement geospatial commands on top of the sorted set type.
I am not aware of applications that require extremely high performance geohashing, but while premature optimization may be evil, nobody said it wasn’t fun. As the author of a geohash package for Golang, I found myself discussing geohash optimization with a colleague over lunch.
This post describes the journey to over 880M integer geohashes per second for the Golang amd64 architecture.
Specification
The official specification of geohash is the Wikipedia page, as designated by geohash.org. For our purposes, encoding a (lat, lng) pair to a string geohash is a three-stage process:
1. Quantize lat and lng to 32-bit integers.
2. Bit-interleave the integers to produce a 64-bit value (the integer geohash).
3. Base32 encode (with a custom alphabet) to produce the string geohash.
Let’s consider these stages in turn, using Mount Everest (27.988056, 86.925278) as an example.
Quantization
Quantize latitude and longitude by mapping them to the unit interval [0, 1] and multiplying by 2^32. Latitude takes the range [-90, 90], therefore the 32-bit integer representation of latitude is

lat32 = floor(2^32 * (lat + 90.0) / 180.0)

Likewise for longitude:

lng32 = floor(2^32 * (lng + 180.0) / 360.0)

Applied to the Mount Everest coordinates, the results are 0xa7ce23e4 and 0xbdd04391.
Bit Interleaving
The 32-bit quantized latitude and longitude are bit-interleaved to produce a 64-bit value. Note latitude and longitude occupy even and odd bits respectively. The diagram below illustrates the operation.
This 64-bit integer is the integer geohash. In our running example the integer geohash is 0xceb7f254240fd612.
The integer geohash may be taken at lower precision. The precision n integer geohash consists of the n high bits of the integer geohash.
Note this step ultimately provides geohash’s critical prefix property: the geohash of a point at a lower precision is a prefix of the geohash at a higher precision. Therefore proximity searches can be reduced to matching geohash prefixes (as a first pass).
Base32 Encoding
The string geohash is obtained from the integer geohash by base32 encoding. This is standard except for the choice of alphabet.
0123456789bcdefghjkmnpqrstuvwxyz
Encoding starts at the high bits, consuming 5 bits of the integer geohash for each character of precision. As a result the maximum precision is 12 characters or 60 bits.
The string geohash for our Mount Everest example is tuvz4p141zc1.
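Although not needed later in the article, the base32 step can be sketched in Go as follows (encodeBase32 is an illustrative helper, not the package API):

```go
package main

import "fmt"

// Geohash's custom base32 alphabet.
const alphabet = "0123456789bcdefghjkmnpqrstuvwxyz"

// encodeBase32 converts an integer geohash to the 12-character string form,
// consuming 5 bits per character starting from the high bits.
func encodeBase32(gh uint64) string {
	b := make([]byte, 12)
	for i := range b {
		b[i] = alphabet[(gh>>uint(59-5*i))&0x1f]
	}
	return string(b)
}

func main() {
	fmt.Println(encodeBase32(0xceb7f254240fd612)) // tuvz4p141zc1
}
```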
For internal use cases the integer geohash is usually sufficient. The base32 step was included here for completeness, but will not be relevant for the rest of the article.
Implementation
Let’s start with a pure Golang implementation for reference. For expository purposes I have broken out the intermediate functions in a “deconstructed geohash” GitHub repository.
Quantization maps to Go code fairly naturally. Perhaps the only unusual function here is math.Ldexp(x, n), which computes x * 2^n faster than a regular floating point multiplication by exploiting features of floating point representation (more on that later).
// Quantize maps latitude and longitude to 32-bit integers.
func Quantize(lat, lng float64) (lat32 uint32, lng32 uint32) {
	lat32 = uint32(math.Ldexp((lat+90.0)/180.0, 32))
	lng32 = uint32(math.Ldexp((lng+180.0)/360.0, 32))
	return
}
Given 32-bit quantized latitude and longitude, some bit fiddling tricks can be used to interleave them. The following helper spreads out 32 bits into the even bit positions of a 64-bit word.
// Spread out the 32 bits of x into 64 bits, where the bits of x occupy even
// bit positions.
func Spread(x uint32) uint64 {
	X := uint64(x)
	X = (X | (X << 16)) & 0x0000ffff0000ffff
	X = (X | (X << 8)) & 0x00ff00ff00ff00ff
	X = (X | (X << 4)) & 0x0f0f0f0f0f0f0f0f
	X = (X | (X << 2)) & 0x3333333333333333
	X = (X | (X << 1)) & 0x5555555555555555
	return X
}
Interleave is simple to write given Spread:
// Interleave the bits of x and y. In the result, x and y occupy even and odd
// bit levels, respectively.
func Interleave(x, y uint32) uint64 {
	return Spread(x) | (Spread(y) << 1)
}
Now integer geohash encoding is just a matter of composing the quantization and interleaving steps.
// EncodeInt encodes the point (lat, lng) to a 64-bit integer geohash.
func EncodeInt(lat, lng float64) uint64 {
	return Interleave(Quantize(lat, lng))
}
Evaluate performance with a benchmark.
// BenchmarkEncodeInt benchmarks integer geohash encoding.
func BenchmarkEncodeInt(b *testing.B) {
	var lat, lng = 40.463833, -79.972422
	for i := 0; i < b.N; i++ {
		EncodeInt(lat, lng)
	}
}
On my system^{1} I get the following results:
BenchmarkEncodeInt-4    500000000    27.4 ns/op
Assembly
This section will demonstrate a massive performance gain using floating point tricks to speed up quantization, and special instruction sets for interleaving.
Quantization
Quantization requires floating point arithmetic to map to the range [0, 1], then another multiplication by 2^32 and conversion to integer. This may be optimized with a neat trick for integer conversion that exploits features of the IEEE 754 float64 representation.
Example
To see how this works let’s delve into the floating point format. Returning to our example, the latitude of Mount Everest (27.988056) will actually be represented in binary as the following 64 bits.
The three fields of the 64-bit value are: sign bit s, exponent e and significand b. These represent the value:

(-1)^s * 2^(e-1023) * (1.0 + b/2^52)
Thus in the example case of 27.988056 above, the sign bit s is 0. The exponent field is 1027 because 2^(1027-1023) = 16 is the greatest power of two less than 27.988056. The significand b = 0xbfcf13cee9dd8 satisfies 1.0 + b/2^52 = 27.988056/16 = 1.7492535.
Unit Interval Trick
Given a floating point value x in the unit interval [0, 1], consider the representation of the value y = 1 + x. Since y is in the range [1, 2], the largest power of two less than or equal to y is 1.0 and the exponent field will always be 1023. Referring back to the floating point formula, this means the following relation holds:

y = 1.0 + b/2^52
This implies b = floor(2^52 * x). In other words, given x in the unit interval, the bits of floor(2^52 * x) can be read directly from the binary representation of 1 + x. This saves any explicit integer conversion.
Application to Geohash
Note that quantization of latitude is equivalent to computing floor(2^32 * x) where x = (lat + 90.0) / 180.0. The above trick suggests this can be done by actually computing 1 + x instead, and taking the high 32 bits of the significand.
To drive the point home, let’s show an example. From above, 27.988056 should quantize to 0xa7ce23e4. The following helper applies the trick to perform the conversion.
// QuantizeLatBits maps latitude to the range [1, 2] and returns the bits of
// the floating point representation.
func QuantizeLatBits(lat float64) uint64 {
	return math.Float64bits(lat/180.0 + 1.5)
}
When applied to 27.988056 the result is 0x3ffa7ce23e4e1968, where indeed the expected quantized value is in the high bits of the significand.
Implementation
The following Golang assembly implements this trick for latitude quantization.
// func QuantizeLatAsm(lat float64) uint32
TEXT ·QuantizeLatAsm(SB), NOSPLIT, $0
	MOVSD lat+0(FP), X0
	MULSD $(0.005555555555555556), X0
	ADDSD $(1.5), X0
	MOVQ  X0, AX
	SHRQ  $20, AX
	MOVL  AX, ret+8(FP)
	RET
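For readers who prefer not to parse assembly, an equivalent pure Go sketch of the same trick is shown below. Shifting right by 20 drops the low 20 significand bits, and the uint32 truncation discards the sign and exponent fields, leaving exactly the high 32 bits of the significand:

```go
package main

import (
	"fmt"
	"math"
)

// quantizeLat maps latitude into [1, 2] via lat/180 + 1.5, then extracts
// the high 32 bits of the significand (bits 20..51 of the representation).
func quantizeLat(lat float64) uint32 {
	return uint32(math.Float64bits(lat/180.0+1.5) >> 20)
}

func main() {
	fmt.Printf("0x%08x\n", quantizeLat(27.988056)) // 0xa7ce23e4
}
```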
Interleaving
The interleave operation is a natural fit for acceleration with the special Bit Manipulation Instruction Sets (BMI). In particular, the Spread operation implemented above is an example of the general “parallel bit deposit” operation, available as the PDEP instruction in modern Intel and AMD processors.
Parallel bit deposit takes a selector mask and an input. Low bits of the input are deposited in the positions of 1 bits in the selector. For example, given the selector 0xf0ff00f0 and input 0x0000cafe, parallel bit deposit would produce 0xc0af00e0.
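PDEP’s behaviour is easy to emulate in pure Go, which is handy for testing on CPUs without BMI2 (pdep here is a hypothetical helper, not a math/bits function):

```go
package main

import "fmt"

// pdep deposits the low bits of src into the positions of the 1 bits of
// mask, mimicking the BMI2 PDEP instruction.
func pdep(src, mask uint64) uint64 {
	var res uint64
	for ; mask != 0; mask &= mask - 1 {
		bit := mask & -mask // lowest set bit of the mask
		if src&1 != 0 {
			res |= bit
		}
		src >>= 1
	}
	return res
}

func main() {
	// The example from the text.
	fmt.Printf("0x%08x\n", pdep(0x0000cafe, 0xf0ff00f0)) // 0xc0af00e0
}
```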
Given this powerful instruction, Spread can be implemented using PDEP with the alternating selector mask 0x5555555555555555. Therefore Interleave may be implemented by combining two PDEP calls.
// func InterleaveAsm(x, y uint32) uint64
TEXT ·InterleaveAsm(SB), NOSPLIT, $0
	MOVL  x+0(FP), AX
	MOVL  y+4(FP), BX
	MOVQ  $0x5555555555555555, CX
	PDEPQ CX, AX, AX
	PDEPQ CX, BX, BX
	SHLQ  $1, BX
	XORQ  BX, AX
	MOVQ  AX, ret+8(FP)
	RET
Bringing it Together
Combining these techniques gives a full assembly version of EncodeInt.
// func EncodeIntAsm(lat, lng float64) uint64
TEXT ·EncodeIntAsm(SB), NOSPLIT, $0
#define LATF X0
#define LATI R8
#define LNGF X1
#define LNGI R9
#define TEMP R10
#define GHSH R11
#define MASK BX
	MOVSD lat+0(FP), LATF
	MOVSD lng+8(FP), LNGF
	MOVQ  $0x5555555555555555, MASK
	MULSD $(0.005555555555555556), LATF
	ADDSD $(1.5), LATF
	MULSD $(0.002777777777777778), LNGF
	ADDSD $(1.5), LNGF
	MOVQ  LNGF, LNGI
	SHRQ  $20, LNGI
	MOVQ  LATF, LATI
	SHRQ  $20, LATI
	PDEPQ MASK, LATI, GHSH
	PDEPQ MASK, LNGI, TEMP
	SHLQ  $1, TEMP
	XORQ  TEMP, GHSH
	MOVQ  GHSH, ret+16(FP)
	RET
Let’s add a benchmark^{2} to judge the performance gain.
BenchmarkEncodeInt-4        500000000      27.4 ns/op
BenchmarkEncodeIntAsm-4     10000000000    2.32 ns/op
That’s a very satisfying 11.8X speedup! This function is now sufficiently fast that it’s approaching the cost of function call overhead. To illustrate this, consider a no-op version of an encoder function.
// func NoopAsm(lat, lng float64) uint64
TEXT ·NoopAsm(SB), NOSPLIT, $0
	RET
Benchmarking this in the same way shows the following:
BenchmarkEncodeIntAsm-4     10000000000    2.32 ns/op
BenchmarkNoopAsm-4          10000000000    1.40 ns/op
Unfortunately this is unavoidable overhead, since the Go compiler will not inline user-defined assembly functions. George Tankersley explored this frustration in his “I Wanna Go Fast!” talk at GopherCon. One solution would be to implement an EncodeIntBatch method that would amortize the function call overhead over many points. For large enough batch sizes, performance should approach 0.92 ns per point^{3}. I have not done the grunt work to confirm this, instead preferring to move on to more specialist instruction sets.
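Such a batch API might look like the following sketch. EncodeIntBatch is hypothetical, and a pure Go encoder stands in for the assembly version here; a real implementation would run the unrolled encode loop inside a single assembly function, paying the call overhead once per batch rather than once per point:

```go
package main

import (
	"fmt"
	"math"
)

// spread distributes 32 bits into even bit positions of a 64-bit word.
func spread(x uint32) uint64 {
	X := uint64(x)
	X = (X | (X << 16)) & 0x0000ffff0000ffff
	X = (X | (X << 8)) & 0x00ff00ff00ff00ff
	X = (X | (X << 4)) & 0x0f0f0f0f0f0f0f0f
	X = (X | (X << 2)) & 0x3333333333333333
	X = (X | (X << 1)) & 0x5555555555555555
	return X
}

// encodeInt is a stand-in for the assembly encoder.
func encodeInt(lat, lng float64) uint64 {
	lat32 := uint32(math.Ldexp((lat+90.0)/180.0, 32))
	lng32 := uint32(math.Ldexp((lng+180.0)/360.0, 32))
	return spread(lat32) | (spread(lng32) << 1)
}

// encodeIntBatch encodes many points in one call, amortizing call overhead.
func encodeIntBatch(lat, lng []float64, hash []uint64) {
	for i := range hash {
		hash[i] = encodeInt(lat[i], lng[i])
	}
}

func main() {
	lat := []float64{27.988056}
	lng := []float64{86.925278}
	hash := make([]uint64, 1)
	encodeIntBatch(lat, lng, hash)
	fmt.Printf("%016x\n", hash[0]) // ceb7f254240fd612
}
```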
SIMD
Since I was already on the path of absurd optimization, why not go the whole hog? Encouraged by my performance-obsessed colleague, I set out to investigate whether SIMD instruction sets could provide a speedup.
With AVX2’s 256-bit wide vector instructions, it should be possible to perform four integer geohash encodings at once. The quantization step should parallelize naturally, but since there is no vectorized version of the PDEP instruction, the SIMD interleaving would need to be implemented manually. Daniel Lemire has already considered this question in depth and demonstrated that 32-bit integer interleaving can in fact be achieved faster than PDEP with AVX2. The following section will demonstrate how to combine these techniques into an optimized integer geohash encoder.
Quantization
As seen above, quantization simply involves multiplication, addition and bitwise right shift. These all have direct equivalents as vector instructions: _mm256_mul_pd, _mm256_add_pd and _mm256_srli_epi64. Therefore this part should be simple, at least as far as any intrinsics programming can be.
Interleaving
The vectorized implementation of interleaving also reduces to the Spread operation, which distributes the bits of a 32-bit integer to the even positions of a 64-bit word. Our pure Go “bit fiddling” implementation of Spread works in five stages which spread successively smaller segments of the word: first splitting 16-bit halves into 32-bit lanes, then 8-bit values within 16-bit lanes, and so on as shown below.
Each stage of this process is a bitwise shift, an or and an and, which translates directly to AVX2 instructions. However, Daniel Lemire’s post already demonstrated that this approach would be suboptimal. Two tricks are necessary to surpass PDEP performance: the first shuffles bytes, and the second gets us the rest of the way.
The first trick achieves the first two stages of the spread operation at once by using the byte shuffle AVX2 instruction. This instruction may be used to move the four bytes of each 32-bit input into its correct position within each 64-bit lane. After this, three more stages of the spread operation remain.
The next trick exploits the fact that the byte shuffle instruction _mm256_shuffle_epi8 may actually be used to execute a 4-to-8 bit lookup table (LUT). This instruction operates within each 128-bit lane: it takes input x and an index idx, and the output byte at byte position i is x[idx[i] & 0xf]. Note that the last two stages of the spread operation spread nibbles into bytes, and therefore could be represented as a 4-to-8 bit lookup table and implemented with the byte shuffle intrinsic. Moreover, the final three steps of this process can be performed with two byte shuffles (plus some shifting). The combination of these tricks gives the following C implementation of vectorized 32-to-64 bit spread.
static inline __m256i spread(__m256i x)
{
	x = _mm256_shuffle_epi8(x, _mm256_set_epi8(
		-1, 11, -1, 10, -1, 9, -1, 8,
		-1, 3, -1, 2, -1, 1, -1, 0,
		-1, 11, -1, 10, -1, 9, -1, 8,
		-1, 3, -1, 2, -1, 1, -1, 0));
	const __m256i lut = _mm256_set_epi8(
		85, 84, 81, 80, 69, 68, 65, 64,
		21, 20, 17, 16, 5, 4, 1, 0,
		85, 84, 81, 80, 69, 68, 65, 64,
		21, 20, 17, 16, 5, 4, 1, 0);
	__m256i lo = _mm256_and_si256(x, _mm256_set1_epi8(0xf));
	lo = _mm256_shuffle_epi8(lut, lo);
	__m256i hi = _mm256_and_si256(x, _mm256_set1_epi8(0xf0));
	hi = _mm256_shuffle_epi8(lut, _mm256_srli_epi64(hi, 4));
	x = _mm256_or_si256(lo, _mm256_slli_epi64(hi, 8));
	return x;
}
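The 16 entries of the lut constant above are exactly the 4-bit spread values: entry n has the bits of n moved to even positions. A short Go sketch can generate (and double-check) the table:

```go
package main

import "fmt"

// spreadNibble moves bit i of a 4-bit value to bit 2i, matching the 4-to-8
// bit LUT used in the vectorized spread.
func spreadNibble(n uint8) uint8 {
	var r uint8
	for i := uint(0); i < 4; i++ {
		r |= ((n >> i) & 1) << (2 * i)
	}
	return r
}

func main() {
	for n := uint8(0); n < 16; n++ {
		fmt.Print(spreadNibble(n), " ")
	}
	fmt.Println()
	// Output: 0 1 4 5 16 17 20 21 64 65 68 69 80 81 84 85
}
```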
C Implementation
The spread function was the tricky part. This can be used to produce a full implementation of vectorized geohash encoding.
void encode_int(double *lat, double *lng, uint64_t *output)
{
	// Quantize.
	__m256d latq = _mm256_loadu_pd(lat);
	latq = _mm256_mul_pd(latq, _mm256_set1_pd(1/180.0));
	latq = _mm256_add_pd(latq, _mm256_set1_pd(1.5));
	__m256i lati = _mm256_srli_epi64(_mm256_castpd_si256(latq), 20);
	__m256d lngq = _mm256_loadu_pd(lng);
	lngq = _mm256_mul_pd(lngq, _mm256_set1_pd(1/360.0));
	lngq = _mm256_add_pd(lngq, _mm256_set1_pd(1.5));
	__m256i lngi = _mm256_srli_epi64(_mm256_castpd_si256(lngq), 20);
	// Spread.
	__m256i hash = _mm256_or_si256(
		spread(lati),
		_mm256_slli_epi64(spread(lngi), 1));
	_mm256_storeu_si256((__m256i *)output, hash);
}
Notes on this implementation:

- For easier manipulation with SIMD instructions, the “structure of arrays”-like approach to accepting the arguments is preferred. An alternative would have been to take an array of struct { double lat; double lng; }, but this would have been somewhat unnatural for SIMD processing.
- You might have expected to see aligned memory access instead of the unaligned variants (_mm256_loadu_pd and _mm256_storeu_si256). This was chosen with a view to converting the resulting assembly to Golang, where memory alignment cannot be easily controlled.
Golang Assembly
It is perhaps unfortunate that I felt the need to digress into C for the exploratory phase of the SIMD optimization. Intrinsics provide us with low-level access to special instruction sets, combined with the conveniences of the compiler. The Go team has always preferred to support high-level interfaces over machine-level instructions, probably with good reason. However, when combined with Golang’s unnecessary custom assembly language, you end up with a slightly unfriendly environment for micro-optimizations.
Against this backdrop it is unsurprising that some impressive but ugly C-to-Go assembly conversion tools have emerged. I had hoped to employ these techniques with the encode_int kernel, but ended up having to fall back to manual conversion between the objdump output and Golang assembly. Thankfully, for AVX2 instructions the mapping is essentially one-to-one.
// func EncodeIntSimd(lat, lng []float64, hash []uint64)
TEXT ·EncodeIntSimd(SB), NOSPLIT, $0
	MOVQ lat+0(FP), AX
	MOVQ lng+24(FP), BX
	MOVQ hash+48(FP), CX
	VBROADCASTSD reciprocal180<>+0x00(SB), Y0
	VMULPD (AX), Y0, Y0
	VBROADCASTSD onepointfive<>+0x00(SB), Y1
	VADDPD Y1, Y0, Y0
	VPSRLQ $20, Y0, Y0
	VBROADCASTSD reciprocal360<>+0x00(SB), Y2
	VMULPD (BX), Y2, Y2
	VADDPD Y1, Y2, Y1
	VPSRLQ $20, Y1, Y1
	VMOVDQU spreadbyte<>+0x00(SB), Y2
	VPSHUFB Y2, Y0, Y0
	VBROADCASTSD lonibblemask<>+0x00(SB), Y3
	VPAND Y3, Y0, Y4
	VMOVDQU spreadnibblelut<>+0x00(SB), Y5
	VPSHUFB Y4, Y5, Y4
	VBROADCASTSD hinibblemask<>+0x00(SB), Y6
	VPAND Y6, Y0, Y0
	VPSRLQ $4, Y0, Y0
	VPSHUFB Y0, Y5, Y0
	VPSLLQ $8, Y0, Y0
	VPOR Y4, Y0, Y0
	VPSHUFB Y2, Y1, Y1
	VPAND Y3, Y1, Y2
	VPSHUFB Y2, Y5, Y2
	VPAND Y6, Y1, Y1
	VPSRLQ $4, Y1, Y1
	VPSHUFB Y1, Y5, Y1
	VPSLLQ $8, Y1, Y1
	VPOR Y2, Y1, Y1
	VPADDQ Y1, Y1, Y1
	VPOR Y1, Y0, Y0
	VMOVDQU Y0, (CX)
	RET
Implementation notes:
- Memory references are to generated data sections.
- Argument loading is based on the in-memory representation of slices.
As always, performance gains are assessed with a benchmark (note that the SIMD function performs four geohashes per iteration):
BenchmarkEncodeIntAsm-4     10000000000    2.32 ns/op
BenchmarkEncodeIntSimd-4    3000000000     4.53 ns/op
The SIMD implementation therefore takes 1.13 nanoseconds per point, a 2.04X speedup over the singlepoint version. However this assessment is somewhat dishonest, since the SIMD version benefits substantially merely by amortizing function call overhead over the four points. Accounting for function call overhead time per point becomes 0.78 nanoseconds, a 1.17X speedup. Still, at this level that’s not to be sniffed at!
Conclusion
This article has taken us on a tour of floating point representation and special instruction sets, demonstrating massive speedups in geohash encoding. The single-point encoding gains have been incorporated into the mmcloughlin/geohash package, making it the fastest geohash implementation for Golang. Thanks to this absurd optimization work, SIMD-obsessed Gophers can now concisely describe their location 880M times per second!
Thanks for input and advice: Ozan Onay, Rami Jaber, Will Demaine, Joshua Goller.
Discussion: r/golang, r/programming, lobsters, hacker news.

1. Benchmarks were run in the following environment:

$ go version
go version go1.10.3 darwin/amd64
$ sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-7567U CPU @ 3.50GHz

2. The assembly benchmark is the same as BenchmarkEncodeInt:

// BenchmarkEncodeIntAsm benchmarks assembly integer geohash encoding.
func BenchmarkEncodeIntAsm(b *testing.B) {
	var lat, lng = 40.463833, -79.972422
	for i := 0; i < b.N; i++ {
		EncodeIntAsm(lat, lng)
	}
}

3. The estimate for hypothetical EncodeIntBatch performance of 0.92 ns is obtained by subtracting the no-op function call time of 1.40 ns from the EncodeIntAsm time of 2.32 ns.