Geohash is a geocoding scheme which maps a latitude/longitude pair to a 64-bit integer or an equivalent short character string. It has become popular as a representation of point data in non-geo-specific databases, where the reduction to a single value enables proximity queries to take advantage of standard indexes. Indeed, integer geohashes are used by Redis to implement geospatial commands on top of the sorted set type.

I am not aware of applications that require extremely high-performance geohashing, but while premature optimization may be evil, nobody said it wasn’t fun. As the author of a geohash package for Golang, I found myself discussing geohash optimization with a colleague over lunch.

This post describes the journey to over 880M integer geohashes per second in Golang on the `amd64` architecture.

## Specification

The official specification of geohash is the Wikipedia page, as designated by geohash.org. For our purposes, encoding a `(lat, lng)` pair to a string geohash is a three stage process:

1. Quantize `lat` and `lng` to 32-bit integers
2. Bit-interleave integers to produce 64-bit value (the integer geohash)
3. Base32 encode (with custom alphabet) to produce string geohash

Let’s consider these stages in turn, using Mount Everest (27.988056, 86.925278) as an example.

### Quantization

Quantize latitude and longitude by mapping to the unit interval `[0, 1]` and multiplying by `2^32`. Latitude takes the range `[-90, 90]`, therefore the 32-bit integer representation of latitude is

``````
lat32 = floor( 2.0^32 * (lat + 90.0) / 180.0 )
``````

Likewise for longitude:

``````
lng32 = floor( 2.0^32 * (lng + 180.0) / 360.0 )
``````

Applied to Mount Everest coordinates, the result is `0xa7ce23e4` and `0xbdd04391`.
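These values can be reproduced with a short calculation. The following is a sketch; `quantize` is a throwaway helper generalizing the two formulas above, not part of any package.

```go
import "math"

// quantize maps a coordinate in [-max, max] to a 32-bit integer by
// rescaling to the unit interval and scaling up by 2^32.
func quantize(x, max float64) uint32 {
	return uint32(math.Ldexp((x+max)/(2*max), 32))
}
```

Applied as `quantize(27.988056, 90)` and `quantize(86.925278, 180)` this reproduces `0xa7ce23e4` and `0xbdd04391`.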

### Bit Interleaving

The 32-bit quantized latitude and longitude are bit-interleaved to produce a 64-bit value, with latitude and longitude occupying the even and odd bits respectively. This 64-bit integer is the integer geohash. In our running example the integer geohash is `0xceb7f254240fd612`.

The integer geohash may be taken at lower precision. The precision-`n` integer geohash consists of the `n` high bits of the full 64-bit value.

Note this step ultimately provides geohash’s critical prefix property: the geohash of a point at a lower precision is a prefix of the geohash at a higher precision. Therefore proximity searches can be reduced to matching geohash prefixes (as a first pass).
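To see the prefix property in action, here is a small experiment. This is a sketch using a deliberately naive bit-by-bit interleave; all names are illustrative.

```go
import (
	"math"
	"math/bits"
)

// interleave builds the integer geohash bit by bit (slow but clear):
// latitude occupies even bits, longitude odd bits.
func interleave(lat32, lng32 uint32) uint64 {
	var h uint64
	for i := 0; i < 32; i++ {
		h |= uint64(lat32>>uint(i)&1) << uint(2*i)
		h |= uint64(lng32>>uint(i)&1) << uint(2*i+1)
	}
	return h
}

// encodeInt quantizes and interleaves to form the integer geohash.
func encodeInt(lat, lng float64) uint64 {
	lat32 := uint32(math.Ldexp((lat+90.0)/180.0, 32))
	lng32 := uint32(math.Ldexp((lng+180.0)/360.0, 32))
	return interleave(lat32, lng32)
}

// commonPrefixBits counts the leading bits shared by two integer geohashes.
func commonPrefixBits(h1, h2 uint64) int {
	return bits.LeadingZeros64(h1 ^ h2)
}
```

Two points about ten centimetres apart near the Everest summit encode to distinct hashes that share dozens of leading bits, so a prefix comparison places them in the same cell at any coarser precision.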

### Base32 Encoding

The string geohash is obtained from the integer geohash by base32 encoding. This is standard except for the choice of alphabet.

``````
0123456789bcdefghjkmnpqrstuvwxyz
``````

Encoding starts at the high bits, consuming 5 bits of the integer geohash for each character of precision. As a result the maximum precision is 12 characters or 60 bits.

The string geohash for our Mount Everest example is `tuvz4p141zc1`.
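The encoding loop is short enough to sketch. This is a toy `base32Encode`, not any package's API.

```go
// alphabet is the custom geohash base32 alphabet.
const alphabet = "0123456789bcdefghjkmnpqrstuvwxyz"

// base32Encode produces a string geohash of the given character count
// from an integer geohash, consuming 5 bits per character starting at
// the high bits (so chars may be at most 12).
func base32Encode(gh uint64, chars int) string {
	b := make([]byte, chars)
	for i := range b {
		b[i] = alphabet[(gh>>uint(59-5*i))&0x1f]
	}
	return string(b)
}
```

For the Everest integer geohash, `base32Encode(0xceb7f254240fd612, 12)` yields `tuvz4p141zc1`.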

For internal use cases integer geohash is usually sufficient. The base32 step was included here for completeness, but will not be relevant for the rest of the article.

## Implementation

Let’s start with a pure Golang implementation for reference. For expository purposes I have broken out the intermediate functions in a “deconstructed geohash” GitHub repository.

Quantization maps to Go code fairly naturally. Perhaps the only unusual function here is `math.Ldexp(x, n)`, which computes `x * 2^n` by directly manipulating the exponent field of the floating point representation (more on that later).

``````
// Quantize maps latitude and longitude to 32-bit integers.
func Quantize(lat, lng float64) (lat32 uint32, lng32 uint32) {
	lat32 = uint32(math.Ldexp((lat+90.0)/180.0, 32))
	lng32 = uint32(math.Ldexp((lng+180.0)/360.0, 32))
	return
}
``````

Given 32-bit quantized latitude and longitude, some bit fiddling tricks can be used to interleave them. The following helper spreads out 32 bits into the even bit positions of a 64-bit word.

``````
// Spread out the 32 bits of x into 64 bits, where the bits of x occupy even
// bit positions.
func Spread(x uint32) uint64 {
	X := uint64(x)
	X = (X | (X << 16)) & 0x0000ffff0000ffff
	X = (X | (X << 8)) & 0x00ff00ff00ff00ff
	X = (X | (X << 4)) & 0x0f0f0f0f0f0f0f0f
	X = (X | (X << 2)) & 0x3333333333333333
	X = (X | (X << 1)) & 0x5555555555555555
	return X
}
``````

Interleave is simple to write given `Spread`:

``````
// Interleave the bits of x and y. In the result, x and y occupy even and odd
// bit levels, respectively.
func Interleave(x, y uint32) uint64 {
	return Spread(x) | (Spread(y) << 1)
}
``````

Now integer geohash encoding is just a matter of composing the quantization and interleaving steps.

``````
// EncodeInt encodes the point (lat, lng) to a 64-bit integer geohash.
func EncodeInt(lat, lng float64) uint64 {
	return Interleave(Quantize(lat, lng))
}
``````

Evaluate performance with a benchmark.

``````
// BenchmarkEncodeInt benchmarks integer geohash encoding.
func BenchmarkEncodeInt(b *testing.B) {
	var lat, lng = 40.463833, -79.972422
	for i := 0; i < b.N; i++ {
		EncodeInt(lat, lng)
	}
}
``````

On my system¹ I get the following results:

``````
BenchmarkEncodeInt-4       	500000000	        27.4 ns/op
``````

## Assembly

This section will demonstrate a massive performance gain using floating point tricks to speed up quantization, and special instruction sets for interleaving.

### Quantization

Quantization requires floating point arithmetic to map to the range `[0,1]`, then another multiplication by `2^32` and conversion to integer. This may be optimized with a neat trick for integer conversion that exploits features of IEEE 754 `float64` representation.

#### Example

To see how this works let’s delve into the floating point format. Returning to our example, the latitude of Mount Everest (27.988056) is represented in binary as the 64-bit value `0x403bfcf13cee9dd8`. The three fields of this value are: sign bit `s`, exponent `e` and significand `b`. These represent the value:

``````
(-1)^s * 2^(e-1023) * (1.0 + b/2^52)
``````

Thus in the example case of 27.988056 above, the sign bit `s` is `0`. The exponent field is 1027 because `2^(1027-1023) = 16` is the greatest power of two less than 27.988056. The significand `b = 0xbfcf13cee9dd8` satisfies

``````
1.0 + b/2^52 = 27.988056/16 = 1.7492535
``````

#### Unit Interval Trick

Given a floating point value `x` in the unit interval `[0, 1]`, consider the representation of the value `y = 1+x`. Since `y` is in the range `[1,2]`, the largest power of two less than or equal to `y` is 1.0 and the exponent field will always be `1023`. Referring back to the floating point formula this means the following relation holds

``````
y = 1.0 + b/2^52
``````

This implies `b = floor(2^52 * x)`. In other words, given `x` in the unit interval the bits of `floor(2^52 * x)` can be read directly from the binary representation of `1+x`. This saves any explicit integer conversion.
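This relation is easy to check in Go. The helper below extracts the significand field of `1 + x`; choosing `x` exactly representable in a few bits keeps rounding out of the picture (`significand` is an illustrative name, not from any package).

```go
import "math"

// significand extracts the 52-bit significand field of 1+x, which for
// x in [0, 1) equals floor(2^52 * x) (up to rounding in the last bit).
func significand(x float64) uint64 {
	return math.Float64bits(1.0+x) & (1<<52 - 1)
}
```

For example, `significand(0.65625)` returns `21 << 47`, since `0.65625 = 21/32` and `2^52 * 21/32 = 21 * 2^47`.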

#### Application to Geohash

Note that quantization of latitude is equivalent to computing `floor(2^32 * x)` where `x = (lat + 90.0) / 180.0`. The above trick suggests this can be done by actually computing `1 + x` instead, and taking the high 32 bits of the significand.

To drive the point home let’s show an example. From above, 27.988056 should quantize to `0xa7ce23e4`. The following helper applies the trick to perform the conversion.

``````
// QuantizeLatBits maps latitude to the range [1,2] and returns the bits of
// the floating point representation.
func QuantizeLatBits(lat float64) uint64 {
	return math.Float64bits(lat/180.0 + 1.5)
}
``````

When applied to 27.988056 the result is `0x3ffa7ce23e4e1968`, where indeed the expected quantized value `0xa7ce23e4` appears in the high bits of the significand.
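As a sketch, extracting the quantized value then amounts to a single shift of the raw bits (`quantizeLat` is an illustrative name, mirroring `QuantizeLatBits`).

```go
import "math"

// quantizeLat reads the quantized latitude directly out of the float64
// bits of lat/180 + 1.5: shifting right by 20 discards the low 20
// significand bits, and the uint32 conversion drops the sign and
// exponent fields.
func quantizeLat(lat float64) uint32 {
	return uint32(math.Float64bits(lat/180.0+1.5) >> 20)
}
```

Applied to 27.988056 this yields `0xa7ce23e4`, matching the quantization result from earlier.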

#### Implementation

The following Golang assembly implements this trick for latitude quantization.

``````
// func QuantizeLatAsm(lat float64) uint32
TEXT ·QuantizeLatAsm(SB), NOSPLIT, $0
	MOVSD lat+0(FP), X0

	MULSD $(0.005555555555555556), X0
	ADDSD $(1.5), X0
	MOVQ  X0, AX
	SHRQ  $20, AX

	MOVL AX, ret+8(FP)
	RET
``````

### Interleaving

The interleave operation is a natural fit for acceleration with the Bit Manipulation Instruction Sets (BMI). In particular, the `Spread` operation implemented above is an example of the general “parallel bit deposit” operation, available as the `PDEP` instruction on modern Intel and AMD processors.

Parallel bit deposit takes a selector mask and an input. Low bits of the input are deposited in the positions of 1 bits in the selector. For example, given the selector `0xf0ff00f0` and input `0x0000cafe` parallel bit deposit would produce `0xc0af00e0`. Given this powerful instruction, `Spread` can be implemented using `PDEP` with the alternating selector mask `0x5555555555555555`. Therefore `Interleave` may be implemented by combining two `PDEP` calls.
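For machines without BMI2 (or for testing), `PDEP` is easy to model in software. The sketch below reproduces the example above; `pdep` is an illustrative helper, not a standard library function.

```go
// pdep deposits the low bits of src into the positions of set bits in
// mask, working from the least significant bit upwards. It is a
// software model of the PDEP instruction.
func pdep(src, mask uint64) uint64 {
	var res uint64
	for ; mask != 0; mask &= mask - 1 { // clear lowest set bit each pass
		if src&1 != 0 {
			res |= mask & -mask // isolate lowest set bit of mask
		}
		src >>= 1
	}
	return res
}
```

With selector `0xf0ff00f0` and input `0x0000cafe` this produces `0xc0af00e0`, and `pdep(x, 0x5555555555555555)` is exactly the `Spread` operation.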

``````
// func InterleaveAsm(x, y uint32) uint64
TEXT ·InterleaveAsm(SB), NOSPLIT, $0
	MOVL x+0(FP), AX
	MOVL y+4(FP), BX

	MOVQ  $0x5555555555555555, CX
	PDEPQ CX, AX, AX
	PDEPQ CX, BX, BX

	SHLQ $1, BX
	XORQ BX, AX

	MOVQ AX, ret+8(FP)
	RET
``````

### Bringing it Together

Combining these techniques gives a full assembly version of `EncodeInt`.

``````
// func EncodeIntAsm(lat, lng float64) uint64
TEXT ·EncodeIntAsm(SB), NOSPLIT, $0
#define LATF	X0
#define LATI	R8
#define LNGF	X1
#define LNGI	R9
#define TEMP	R10
#define GHSH	R11
#define MASK	R12

	MOVSD lat+0(FP), LATF
	MOVSD lng+8(FP), LNGF

	MULSD $(0.005555555555555556), LATF
	ADDSD $(1.5), LATF

	MULSD $(0.002777777777777778), LNGF
	ADDSD $(1.5), LNGF

	MOVQ LNGF, LNGI
	SHRQ $20, LNGI

	MOVQ LATF, LATI
	SHRQ $20, LATI

	MOVQ  $0x5555555555555555, MASK
	PDEPQ MASK, LATI, GHSH
	PDEPQ MASK, LNGI, TEMP

	SHLQ $1, TEMP
	XORQ TEMP, GHSH

	MOVQ GHSH, ret+16(FP)
	RET
``````

Let’s add a benchmark² to judge the performance gain.

``````
BenchmarkEncodeInt-4       	500000000	        27.4 ns/op
BenchmarkEncodeIntAsm-4    	10000000000	         2.32 ns/op
``````

That’s a very satisfying 11.8X speedup! This function is now sufficiently fast that it’s approaching the cost of function call overhead. To illustrate this, consider a no-op version of an encoder function.

``````
// func NoopAsm(lat, lng float64) uint64
TEXT ·NoopAsm(SB), NOSPLIT, $0
	RET
``````

Benchmarking this in the same way shows the following:

``````
BenchmarkEncodeIntAsm-4    	10000000000	         2.32 ns/op
BenchmarkNoopAsm-4         	10000000000	         1.40 ns/op
``````

Unfortunately this is unavoidable overhead, since the Go compiler will not inline user-defined assembly functions. George Tankersley explored this frustration in his “I Wanna Go Fast!” talk at GopherCon. One solution would be to implement an `EncodeIntBatch` method that amortizes the function call overhead over many points. For large enough batch sizes performance should approach 0.92 ns per point³. I have not done the grunt work to confirm this, instead preferring to move on to more specialist instruction sets.
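For reference, a hypothetical `EncodeIntBatch` might look like the following pure Go sketch. The real version would use an assembly body to realize the speedup; the name and signature here are conjecture.

```go
import "math"

// spreadBits distributes the 32 bits of x into the even bit positions
// of a 64-bit word (the bit fiddling Spread from earlier).
func spreadBits(x uint32) uint64 {
	X := uint64(x)
	X = (X | (X << 16)) & 0x0000ffff0000ffff
	X = (X | (X << 8)) & 0x00ff00ff00ff00ff
	X = (X | (X << 4)) & 0x0f0f0f0f0f0f0f0f
	X = (X | (X << 2)) & 0x3333333333333333
	X = (X | (X << 1)) & 0x5555555555555555
	return X
}

// EncodeIntBatch encodes many points in a single call, amortizing the
// function call overhead. All slices must have equal length.
func EncodeIntBatch(lat, lng []float64, hash []uint64) {
	for i := range hash {
		lat32 := uint32(math.Ldexp((lat[i]+90.0)/180.0, 32))
		lng32 := uint32(math.Ldexp((lng[i]+180.0)/360.0, 32))
		hash[i] = spreadBits(lat32) | spreadBits(lng32)<<1
	}
}
```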

## SIMD

Since I was already on the path of absurd optimization, why not go the whole hog? Encouraged by my performance-obsessed colleague I set out to investigate whether SIMD instruction sets could provide a speedup.

With AVX2’s 256-bit wide vector instructions, it should be possible to perform four integer geohash encodings at once. The quantization step should parallelize naturally, but since there is no vectorized version of the `PDEP` instruction the SIMD interleaving would need to be implemented manually. Daniel Lemire has already considered this question in depth and demonstrated that 32-bit integer interleaving can in fact be achieved faster than `PDEP` with AVX2. The following section will demonstrate how to combine these techniques into an optimized integer geohash encoder.

### Quantization

As seen above, quantization simply involves multiplication, addition and bitwise right shift. These all have direct equivalents as vector instructions: `_mm256_mul_pd`, `_mm256_add_pd` and `_mm256_srli_epi64`. Therefore this part should be simple, at least as far as any intrinsics programming can be.

### Interleaving

The vectorized implementation of interleaving also reduces to the `Spread` operation, which distributes the bits of a 32-bit integer to the even positions of a 64-bit word. Our pure Go “bit fiddling” implementation of `Spread` works in five stages which spread successively smaller segments of the word: first splitting 16-bit halves into 32-bit lanes, then 8-bit values within 16-bit lanes, and so on down to single bits. Each stage is a bitwise shift, `or` and `and`, which translates directly to AVX2 instructions. However, Daniel Lemire’s post already demonstrated that this approach is sub-optimal. Two tricks are necessary to surpass `PDEP` performance: the first is a byte shuffle, and the second is a shuffle-based lookup table that gets us the rest of the way.

The first trick achieves the first two stages of the spread operation at once by using the AVX2 byte shuffle instruction, which can move the four bytes of each 32-bit input into their correct positions within each 64-bit lane. After this, three more stages of the spread operation remain.

The next trick exploits the fact that the byte shuffle instruction `_mm256_shuffle_epi8` may actually be used to execute a 4-to-8 bit lookup table (LUT). The instruction operates within each 128-bit lane: it takes an input `x` and an index `idx`, and the output byte at position `i` is `x[idx[i]&0xf]`. The last two stages of the spread operation spread nibbles into bytes, and therefore can be represented as a 4-to-8 bit lookup table and implemented with the byte shuffle intrinsic. Moreover, the final three stages can be performed with two byte shuffles (plus some shifting). The combination of these tricks gives the following C implementation of a vectorized 32-to-64 bit spread.

``````
static inline __m256i spread(__m256i x)
{
	x = _mm256_shuffle_epi8(x, _mm256_set_epi8(
		-1, 11, -1, 10, -1, 9, -1, 8,
		-1,  3, -1,  2, -1, 1, -1, 0,
		-1, 11, -1, 10, -1, 9, -1, 8,
		-1,  3, -1,  2, -1, 1, -1, 0));

	const __m256i lut = _mm256_set_epi8(
		85, 84, 81, 80, 69, 68, 65, 64,
		21, 20, 17, 16,  5,  4,  1,  0,
		85, 84, 81, 80, 69, 68, 65, 64,
		21, 20, 17, 16,  5,  4,  1,  0);

	__m256i lo = _mm256_and_si256(x, _mm256_set1_epi8(0xf));
	lo = _mm256_shuffle_epi8(lut, lo);

	__m256i hi = _mm256_and_si256(x, _mm256_set1_epi8(0xf0));
	hi = _mm256_shuffle_epi8(lut, _mm256_srli_epi64(hi, 4));

	x = _mm256_or_si256(lo, _mm256_slli_epi64(hi, 8));

	return x;
}
``````
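The LUT constants deserve a sanity check: entry `i` must equal the nibble `i` with its bits spread to even positions. The following Go model mirrors the low/high nibble shuffles (names are illustrative, not from the C code).

```go
// lut[i] spreads the 4 bits of i into the even bit positions of a byte
// pair, e.g. lut[0xf] = 0b01010101 = 85. These are the same constants
// passed to _mm256_set_epi8 above, in ascending index order.
var lut = [16]byte{
	0, 1, 4, 5, 16, 17, 20, 21,
	64, 65, 68, 69, 80, 81, 84, 85,
}

// spreadByte spreads one byte into 16 bits using two LUT lookups,
// mirroring the low and high nibble shuffles.
func spreadByte(b byte) uint16 {
	return uint16(lut[b&0xf]) | uint16(lut[b>>4])<<8
}

// spread32 spreads a 32-bit word byte by byte into the even bit
// positions of a 64-bit word.
func spread32(x uint32) uint64 {
	var r uint64
	for i := 0; i < 4; i++ {
		r |= uint64(spreadByte(byte(x>>uint(8*i)))) << uint(16*i)
	}
	return r
}
```

The model agrees with the bit fiddling `Spread`: for example, `spread32(0xffffffff)` gives the alternating mask `0x5555555555555555`.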

### C Implementation

The `spread` function was the tricky part. This can be used to produce a full implementation of vectorized geohash encoding.

``````
void encode_int(double *lat, double *lng, uint64_t *output)
{
	// Quantize.
	__m256d latq = _mm256_loadu_pd(lat);
	latq = _mm256_mul_pd(latq, _mm256_set1_pd(1/180.0));
	latq = _mm256_add_pd(latq, _mm256_set1_pd(1.5));
	__m256i lati = _mm256_srli_epi64(_mm256_castpd_si256(latq), 20);

	__m256d lngq = _mm256_loadu_pd(lng);
	lngq = _mm256_mul_pd(lngq, _mm256_set1_pd(1/360.0));
	lngq = _mm256_add_pd(lngq, _mm256_set1_pd(1.5));
	__m256i lngi = _mm256_srli_epi64(_mm256_castpd_si256(lngq), 20);

	// Interleave: latitude in even bits, longitude in odd bits.
	__m256i hash = _mm256_or_si256(spread(lati),
	                               _mm256_slli_epi64(spread(lngi), 1));

	_mm256_storeu_si256((__m256i *)output, hash);
}
``````


### Golang Assembly

It is perhaps unfortunate that I felt the need to digress into C for the exploratory phase of the SIMD optimization. Intrinsics provide low-level access to special instruction sets, combined with the conveniences of a compiler. The Go team has always preferred to support high-level interfaces over machine-level instructions, probably with good reason. However, combined with Golang's unnecessary custom assembly language, this leaves you with a slightly unfriendly environment for micro-optimizations.

Against this backdrop it is unsurprising that some impressive but ugly C-to-Go-assembly conversion tools have emerged. I had hoped to employ these tools on the `encode_int` kernel, but ended up falling back to manual conversion between the objdump output and Golang assembly. Thankfully, for AVX2 instructions the mapping is essentially one-to-one.

``````
// func EncodeIntSimd(lat, lng []float64, hash []uint64)
TEXT ·EncodeIntSimd(SB), NOSPLIT, $0
	MOVQ lat+0(FP), AX
	MOVQ lng+24(FP), BX
	MOVQ hash+48(FP), CX

	// Constant setup is elided: Y2 holds the byte shuffle index, Y3 and
	// Y6 the low and high nibble masks, Y5 the 4-to-8 bit spread LUT,
	// Y7/Y9 the 1/180 and 1/360 multipliers, and Y8 the 1.5 offset.

	VMULPD       (AX), Y7, Y0
	VADDPD       Y8, Y0, Y0
	VPSRLQ       $20, Y0, Y0
	VMULPD       (BX), Y9, Y1
	VADDPD       Y8, Y1, Y1
	VPSRLQ       $20, Y1, Y1

	VPSHUFB      Y2, Y0, Y0
	VPAND        Y3, Y0, Y4
	VPSHUFB      Y4, Y5, Y4
	VPAND        Y6, Y0, Y0
	VPSRLQ       $4, Y0, Y0
	VPSHUFB      Y0, Y5, Y0
	VPSLLQ       $8, Y0, Y0
	VPOR         Y4, Y0, Y0

	VPSHUFB      Y2, Y1, Y1
	VPAND        Y3, Y1, Y2
	VPSHUFB      Y2, Y5, Y2
	VPAND        Y6, Y1, Y1
	VPSRLQ       $4, Y1, Y1
	VPSHUFB      Y1, Y5, Y1
	VPSLLQ       $8, Y1, Y1
	VPOR         Y2, Y1, Y1

	VPSLLQ       $1, Y1, Y1
	VPOR         Y1, Y0, Y0
	VMOVDQU      Y0, (CX)

	RET
``````


As always, performance gains are assessed with a benchmark (note that the SIMD function performs four geohashes per iteration):

``````
BenchmarkEncodeIntAsm-4    	10000000000	         2.32 ns/op
BenchmarkEncodeIntSimd-4   	3000000000	         4.53 ns/op
``````

The SIMD implementation therefore takes 1.13 nanoseconds per point, a 2.04X speedup over the single-point version. However this assessment is somewhat dishonest, since the SIMD version benefits substantially merely by amortizing function call overhead over the four points. Accounting for function call overhead, the time per point becomes 0.78 nanoseconds, a 1.17X speedup. Still, at this level that's not to be sniffed at!

## Conclusion

This article has taken us on a tour of floating point representation and special instruction sets, demonstrating massive speedups in geohash encoding. The single-point encoding gains have been incorporated into the `mmcloughlin/geohash` package, making it the fastest geohash implementation for Golang. Thanks to this absurd optimization work, SIMD-obsessed Gophers can now concisely describe their location 880M times per second!

Thanks for input and advice: Ozan Onay, Rami Jaber, Will Demaine, Joshua Goller.

Discussion: r/golang, r/programming, lobsters, hacker news.

1. Benchmarks were run in the following environment:

``````
$ go version
go version go1.10.3 darwin/amd64
$ sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-7567U CPU @ 3.50GHz
``````

2. The assembly benchmark is the same as `BenchmarkEncodeInt`:

``````
// BenchmarkEncodeIntAsm benchmarks assembly integer geohash encoding.
func BenchmarkEncodeIntAsm(b *testing.B) {
	var lat, lng = 40.463833, -79.972422
	for i := 0; i < b.N; i++ {
		EncodeIntAsm(lat, lng)
	}
}
``````

3. The estimate of 0.92 ns for hypothetical `EncodeIntBatch` performance is obtained by subtracting the noop function call time (1.40 ns) from the `EncodeIntAsm` time (2.32 ns).