Geohash in Golang Assembly
Lessons in absurd optimization
Geohash is a geocoding scheme which maps a latitude/longitude pair to a 64-bit integer or an equivalent short character string. It has become popular as a representation of point data in non-geospecific databases, where the reduction to a single value enables proximity queries to take advantage of standard indexes. Indeed, integer geohashes are used by Redis to implement geospatial commands on top of the sorted set type.
I am not aware of applications that require extremely high performance geohashing, but while premature optimization may be evil, nobody said it wasn’t fun. As the author of a geohash package for Golang, I found myself discussing geohash optimization with a colleague over lunch.
This post describes the journey to over 880M integer geohashes per second for the Golang amd64 architecture.
Specification
The official specification of geohash is the Wikipedia page, as designated by geohash.org. For our purposes, encoding a (lat, lng) pair to a string geohash is a three-stage process:
1. Quantize lat and lng to 32-bit integers.
2. Bit-interleave the integers to produce a 64-bit value (the integer geohash).
3. Base32 encode (with a custom alphabet) to produce the string geohash.
Let’s consider these stages in turn, using Mount Everest (27.988056, 86.925278) as an example.
Quantization
Quantize latitude and longitude by mapping them to the unit interval [0, 1] and multiplying by 2^32. Latitude takes the range [-90, 90], therefore the 32-bit integer representation of latitude is

lat32 = floor(2^32 * (lat + 90.0) / 180.0)

Likewise for longitude:

lng32 = floor(2^32 * (lng + 180.0) / 360.0)

Applied to the Mount Everest coordinates, the results are 0xa7ce23e4 and 0xbdd04391.
Bit Interleaving
The 32-bit quantized latitude and longitude are bit-interleaved to produce a 64-bit value. Note latitude and longitude occupy even and odd bits respectively. The diagram below illustrates the operation.
This 64-bit integer is the integer geohash. In our running example the integer geohash is 0xceb7f254240fd612.
The integer geohash may be taken at lower precision. The precision n integer geohash consists of the n high bits of the integer geohash.
Note this step ultimately provides geohash’s critical prefix property: the geohash of a point at a lower precision is a prefix of the geohash at a higher precision. Therefore proximity searches can be reduced to matching geohash prefixes (as a first pass).
Base32 Encoding
The string geohash is obtained from the integer geohash by base32 encoding. This is standard except for the choice of alphabet.
0123456789bcdefghjkmnpqrstuvwxyz
Encoding starts at the high bits, consuming 5 bits of the integer geohash for each character of precision. As a result the maximum precision is 12 characters or 60 bits.
The string geohash for our Mount Everest example is tuvz4p141zc1.
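Although not needed later in the article, the base32 step can be sketched in Go as follows (encodeBase32 is an illustrative helper, not the package API):

```go
package main

import "fmt"

// Geohash's custom base32 alphabet.
const alphabet = "0123456789bcdefghjkmnpqrstuvwxyz"

// encodeBase32 converts an integer geohash to the 12-character string form,
// consuming 5 bits per character starting from the high bits.
func encodeBase32(gh uint64) string {
	b := make([]byte, 12)
	for i := range b {
		b[i] = alphabet[(gh>>uint(59-5*i))&0x1f]
	}
	return string(b)
}

func main() {
	fmt.Println(encodeBase32(0xceb7f254240fd612)) // tuvz4p141zc1
}
```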
For internal use cases the integer geohash is usually sufficient. The base32 step was included here for completeness, but will not be relevant for the rest of the article.
Implementation
Let’s start with a pure Golang implementation for reference. For expository purposes I have broken out the intermediate functions in a “deconstructed geohash” GitHub repository.
Quantization maps to Go code fairly naturally. Perhaps the only unusual function here is math.Ldexp(x, n), which computes x * 2^n faster than a regular floating point multiplication by exploiting features of floating point representation (more on that later).
// Quantize maps latitude and longitude to 32-bit integers.
func Quantize(lat, lng float64) (lat32 uint32, lng32 uint32) {
	lat32 = uint32(math.Ldexp((lat+90.0)/180.0, 32))
	lng32 = uint32(math.Ldexp((lng+180.0)/360.0, 32))
	return
}
Given 32-bit quantized latitude and longitude, some bit fiddling tricks can be used to interleave them. The following helper spreads out 32 bits into the even bit positions of a 64-bit word.
// Spread out the 32 bits of x into 64 bits, where the bits of x occupy even
// bit positions.
func Spread(x uint32) uint64 {
	X := uint64(x)
	X = (X | (X << 16)) & 0x0000ffff0000ffff
	X = (X | (X << 8)) & 0x00ff00ff00ff00ff
	X = (X | (X << 4)) & 0x0f0f0f0f0f0f0f0f
	X = (X | (X << 2)) & 0x3333333333333333
	X = (X | (X << 1)) & 0x5555555555555555
	return X
}
Interleave is simple to write given Spread:
// Interleave the bits of x and y. In the result, x and y occupy even and odd
// bit levels, respectively.
func Interleave(x, y uint32) uint64 {
	return Spread(x) | (Spread(y) << 1)
}
Now integer geohash encoding is just a matter of composing the quantization and interleaving steps.
// EncodeInt encodes the point (lat, lng) to a 64-bit integer geohash.
func EncodeInt(lat, lng float64) uint64 {
	return Interleave(Quantize(lat, lng))
}
Evaluate performance with a benchmark.
// BenchmarkEncodeInt benchmarks integer geohash encoding.
func BenchmarkEncodeInt(b *testing.B) {
	var lat, lng = 40.463833, -79.972422
	for i := 0; i < b.N; i++ {
		EncodeInt(lat, lng)
	}
}
On my system^{1} I get the following results:
BenchmarkEncodeInt-4    500000000    27.4 ns/op
Assembly
This section will demonstrate a massive performance gain using floating point tricks to speed up quantization, and special instruction sets for interleaving.
Quantization
Quantization requires floating point arithmetic to map to the range [0, 1], then another multiplication by 2^32 and conversion to integer. This may be optimized with a neat trick for integer conversion that exploits features of the IEEE 754 float64 representation.
Example
To see how this works let’s delve into the floating point format. Returning to our example, the latitude of Mount Everest (27.988056) will actually be represented in binary as the following 64 bits.
The three fields of the 64-bit value are: sign bit s, exponent e and significand b. These represent the value:

(-1)^s * 2^(e-1023) * (1.0 + b/2^52)
Thus in the example case of 27.988056 above, the sign bit s is 0. The exponent field is 1027 because 2^(1027-1023) = 16 is the greatest power of two less than 27.988056. The significand b = 0xbfcf13cee9dd8 satisfies 1.0 + b/2^52 = 27.988056/16 = 1.7492535.
Unit Interval Trick
Given a floating point value x in the unit interval [0, 1], consider the representation of the value y = 1 + x. Since y is in the range [1, 2], the largest power of two less than or equal to y is 1.0 and the exponent field will always be 1023. Referring back to the floating point formula, this means the following relation holds:

y = 1.0 + b/2^52
This implies b = floor(2^52 * x). In other words, given x in the unit interval, the bits of floor(2^52 * x) can be read directly from the binary representation of 1 + x. This saves any explicit integer conversion.
Application to Geohash
Note that quantization of latitude is equivalent to computing floor(2^32 * x) where x = (lat + 90.0) / 180.0. The above trick suggests this can be done by actually computing 1 + x instead, and taking the high 32 bits of the significand.
To drive the point home, let’s show an example. From above, 27.988056 should quantize to 0xa7ce23e4. The following helper applies the trick to perform the conversion.
// QuantizeLatBits maps latitude to the range [1, 2] and returns the bits of
// the floating point representation.
func QuantizeLatBits(lat float64) uint64 {
	return math.Float64bits(lat/180.0 + 1.5)
}
When applied to 27.988056 the result is 0x3ffa7ce23e4e1968, where indeed the expected quantized value is in the high bits of the significand.
Implementation
The following Golang assembly implements this trick for latitude quantization.
// func QuantizeLatAsm(lat float64) uint32
TEXT ·QuantizeLatAsm(SB), NOSPLIT, $0
	MOVSD lat+0(FP), X0
	MULSD $(0.005555555555555556), X0
	ADDSD $(1.5), X0
	MOVQ  X0, AX
	SHRQ  $20, AX
	MOVL  AX, ret+8(FP)
	RET
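For readers who prefer not to parse assembly, an equivalent pure Go sketch of the same trick is shown below. Shifting right by 20 drops the low 20 significand bits, and the uint32 truncation discards the sign and exponent fields, leaving exactly the high 32 bits of the significand:

```go
package main

import (
	"fmt"
	"math"
)

// quantizeLat maps latitude into [1, 2] via lat/180 + 1.5, then extracts
// the high 32 bits of the significand (bits 20..51 of the representation).
func quantizeLat(lat float64) uint32 {
	return uint32(math.Float64bits(lat/180.0+1.5) >> 20)
}

func main() {
	fmt.Printf("0x%08x\n", quantizeLat(27.988056)) // 0xa7ce23e4
}
```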
Interleaving
The interleave operation is a natural fit for acceleration with the special Bit Manipulation Instruction Sets (BMI). In particular, the Spread operation implemented above is an example of the general “parallel bit deposit” operation, available as the PDEP instruction in modern Intel and AMD processors.
Parallel bit deposit takes a selector mask and an input. Low bits of the input are deposited in the positions of 1 bits in the selector. For example, given the selector 0xf0ff00f0 and input 0x0000cafe, parallel bit deposit would produce 0xc0af00e0.
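PDEP’s behaviour is easy to emulate in pure Go, which is handy for testing on CPUs without BMI2 (pdep here is a hypothetical helper, not a math/bits function):

```go
package main

import "fmt"

// pdep deposits the low bits of src into the positions of the 1 bits of
// mask, mimicking the BMI2 PDEP instruction.
func pdep(src, mask uint64) uint64 {
	var res uint64
	for ; mask != 0; mask &= mask - 1 {
		bit := mask & -mask // lowest set bit of the mask
		if src&1 != 0 {
			res |= bit
		}
		src >>= 1
	}
	return res
}

func main() {
	// The example from the text.
	fmt.Printf("0x%08x\n", pdep(0x0000cafe, 0xf0ff00f0)) // 0xc0af00e0
}
```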
Given this powerful instruction, Spread can be implemented using PDEP with the alternating selector mask 0x5555555555555555. Therefore Interleave may be implemented by combining two PDEP calls.
// func InterleaveAsm(x, y uint32) uint64
TEXT ·InterleaveAsm(SB), NOSPLIT, $0
	MOVL  x+0(FP), AX
	MOVL  y+4(FP), BX
	MOVQ  $0x5555555555555555, CX
	PDEPQ CX, AX, AX
	PDEPQ CX, BX, BX
	SHLQ  $1, BX
	XORQ  BX, AX
	MOVQ  AX, ret+8(FP)
	RET
Bringing it Together
Combining these techniques gives a full assembly version of EncodeInt.
// func EncodeIntAsm(lat, lng float64) uint64
TEXT ·EncodeIntAsm(SB), NOSPLIT, $0
#define LATF X0
#define LATI R8
#define LNGF X1
#define LNGI R9
#define TEMP R10
#define GHSH R11
#define MASK BX
	MOVSD lat+0(FP), LATF
	MOVSD lng+8(FP), LNGF
	MOVQ  $0x5555555555555555, MASK
	MULSD $(0.005555555555555556), LATF
	ADDSD $(1.5), LATF
	MULSD $(0.002777777777777778), LNGF
	ADDSD $(1.5), LNGF
	MOVQ  LNGF, LNGI
	SHRQ  $20, LNGI
	MOVQ  LATF, LATI
	SHRQ  $20, LATI
	PDEPQ MASK, LATI, GHSH
	PDEPQ MASK, LNGI, TEMP
	SHLQ  $1, TEMP
	XORQ  TEMP, GHSH
	MOVQ  GHSH, ret+16(FP)
	RET
Let’s add a benchmark^{2} to judge the performance gain.
BenchmarkEncodeInt-4        500000000      27.4 ns/op
BenchmarkEncodeIntAsm-4     10000000000    2.32 ns/op
That’s a very satisfying 11.8X speedup! This function is now sufficiently fast that it’s approaching the cost of function call overhead. To illustrate this, consider a no-op version of an encoder function.
// func NoopAsm(lat, lng float64) uint64
TEXT ·NoopAsm(SB), NOSPLIT, $0
	RET
Benchmarking this in the same way shows the following:
BenchmarkEncodeIntAsm-4     10000000000    2.32 ns/op
BenchmarkNoopAsm-4          10000000000    1.40 ns/op
Unfortunately this is unavoidable overhead, since the Go compiler will not inline user-defined assembly functions. George Tankersley explored this frustration in his “I Wanna Go Fast!” talk at GopherCon. One solution would be to implement an EncodeIntBatch method that would amortize the function call overhead over many points. For large enough batch sizes, performance should approach 0.92 ns per point^{3}. I have not done the grunt work to confirm this, instead preferring to move on to more specialist instruction sets.
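Such a batch API might look like the following sketch. EncodeIntBatch is hypothetical, and a pure Go encoder stands in for the assembly version here; a real implementation would run the unrolled encode loop inside a single assembly function, paying the call overhead once per batch rather than once per point:

```go
package main

import (
	"fmt"
	"math"
)

// spread distributes 32 bits into even bit positions of a 64-bit word.
func spread(x uint32) uint64 {
	X := uint64(x)
	X = (X | (X << 16)) & 0x0000ffff0000ffff
	X = (X | (X << 8)) & 0x00ff00ff00ff00ff
	X = (X | (X << 4)) & 0x0f0f0f0f0f0f0f0f
	X = (X | (X << 2)) & 0x3333333333333333
	X = (X | (X << 1)) & 0x5555555555555555
	return X
}

// encodeInt is a stand-in for the assembly encoder.
func encodeInt(lat, lng float64) uint64 {
	lat32 := uint32(math.Ldexp((lat+90.0)/180.0, 32))
	lng32 := uint32(math.Ldexp((lng+180.0)/360.0, 32))
	return spread(lat32) | (spread(lng32) << 1)
}

// encodeIntBatch encodes many points in one call, amortizing call overhead.
func encodeIntBatch(lat, lng []float64, hash []uint64) {
	for i := range hash {
		hash[i] = encodeInt(lat[i], lng[i])
	}
}

func main() {
	lat := []float64{27.988056}
	lng := []float64{86.925278}
	hash := make([]uint64, 1)
	encodeIntBatch(lat, lng, hash)
	fmt.Printf("%016x\n", hash[0]) // ceb7f254240fd612
}
```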
SIMD
Since I was already on the path of absurd optimization, why not go the whole hog? Encouraged by my performance-obsessed colleague, I set out to investigate whether SIMD instruction sets could provide a speedup.
With AVX2’s 256-bit wide vector instructions, it should be possible to perform four integer geohash encodings at once. The quantization step should parallelize naturally, but since there is no vectorized version of the PDEP instruction, the SIMD interleaving would need to be implemented manually. Daniel Lemire has already considered this question in depth and demonstrated that 32-bit integer interleaving can in fact be achieved faster than PDEP with AVX2. The following section will demonstrate how to combine these techniques into an optimized integer geohash encoder.
Quantization
As seen above, quantization simply involves multiplication, addition and bitwise right shift. These all have direct equivalents as vector instructions: _mm256_mul_pd, _mm256_add_pd and _mm256_srli_epi64. Therefore this part should be simple, at least as far as any intrinsics programming can be.
Interleaving
The vectorized implementation of interleaving also reduces to the Spread operation, which distributes the bits of a 32-bit integer to the even positions of a 64-bit word. Our pure Go “bit fiddling” implementation of Spread works in five stages which spread successively smaller segments of the word: first splitting 16-bit halves into 32-bit lanes, then 8-bit values within 16-bit lanes, and so on as shown below.
Each stage of this process is a bitwise shift, an or and an and, which translates directly to AVX2 instructions. However, Daniel Lemire’s post already demonstrated that this approach would be suboptimal. Two tricks are necessary to surpass PDEP performance: the first shuffles bytes, and the second gets us the rest of the way.
The first trick achieves the first two stages of the spread operation at once by using the byte shuffle AVX2 instruction. This instruction may be used to move the four bytes of each 32-bit input into its correct position within each 64-bit lane. After this, three more stages of the spread operation remain.
The next trick exploits the fact that the byte shuffle instruction _mm256_shuffle_epi8 may actually be used to execute a 4-to-8 bit lookup table (LUT). This instruction operates within each 128-bit lane: it takes input x and an index idx, and the output byte at byte position i is x[idx[i] & 0xf]. Note that the last two stages of the spread operation spread nibbles into bytes, and therefore could be represented as a 4-to-8 bit lookup table and implemented with the byte shuffle intrinsic. Moreover, the final three steps of this process can be performed with two byte shuffles (plus some shifting). The combination of these tricks gives the following C implementation of vectorized 32-to-64 bit spread.
static inline __m256i spread(__m256i x)
{
	x = _mm256_shuffle_epi8(x, _mm256_set_epi8(
		-1, 11, -1, 10, -1, 9, -1, 8,
		-1, 3, -1, 2, -1, 1, -1, 0,
		-1, 11, -1, 10, -1, 9, -1, 8,
		-1, 3, -1, 2, -1, 1, -1, 0));
	const __m256i lut = _mm256_set_epi8(
		85, 84, 81, 80, 69, 68, 65, 64,
		21, 20, 17, 16, 5, 4, 1, 0,
		85, 84, 81, 80, 69, 68, 65, 64,
		21, 20, 17, 16, 5, 4, 1, 0);
	__m256i lo = _mm256_and_si256(x, _mm256_set1_epi8(0xf));
	lo = _mm256_shuffle_epi8(lut, lo);
	__m256i hi = _mm256_and_si256(x, _mm256_set1_epi8(0xf0));
	hi = _mm256_shuffle_epi8(lut, _mm256_srli_epi64(hi, 4));
	x = _mm256_or_si256(lo, _mm256_slli_epi64(hi, 8));
	return x;
}
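The 16 entries of the lut constant above are exactly the 4-bit spread values: entry n has the bits of n moved to even positions. A short Go sketch can generate (and double-check) the table:

```go
package main

import "fmt"

// spreadNibble moves bit i of a 4-bit value to bit 2i, matching the 4-to-8
// bit LUT used in the vectorized spread.
func spreadNibble(n uint8) uint8 {
	var r uint8
	for i := uint(0); i < 4; i++ {
		r |= ((n >> i) & 1) << (2 * i)
	}
	return r
}

func main() {
	for n := uint8(0); n < 16; n++ {
		fmt.Print(spreadNibble(n), " ")
	}
	fmt.Println()
	// Output: 0 1 4 5 16 17 20 21 64 65 68 69 80 81 84 85
}
```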
C Implementation
The spread function was the tricky part. This can be used to produce a full implementation of vectorized geohash encoding.
void encode_int(double *lat, double *lng, uint64_t *output)
{
	// Quantize.
	__m256d latq = _mm256_loadu_pd(lat);
	latq = _mm256_mul_pd(latq, _mm256_set1_pd(1/180.0));
	latq = _mm256_add_pd(latq, _mm256_set1_pd(1.5));
	__m256i lati = _mm256_srli_epi64(_mm256_castpd_si256(latq), 20);
	__m256d lngq = _mm256_loadu_pd(lng);
	lngq = _mm256_mul_pd(lngq, _mm256_set1_pd(1/360.0));
	lngq = _mm256_add_pd(lngq, _mm256_set1_pd(1.5));
	__m256i lngi = _mm256_srli_epi64(_mm256_castpd_si256(lngq), 20);
	// Spread.
	__m256i hash = _mm256_or_si256(
		spread(lati),
		_mm256_slli_epi64(spread(lngi), 1));
	_mm256_storeu_si256((__m256i *)output, hash);
}
Notes on this implementation:

- For easier manipulation with SIMD instructions, the “structure of arrays”-like approach to accepting the arguments is preferred. An alternative would have been to take an array of struct { double lat; double lng; }, but this would have been somewhat unnatural for SIMD processing.
- You might have expected to see aligned memory access instead of the unaligned variants (_mm256_loadu_pd and _mm256_storeu_si256). This was chosen with a view to converting the resulting assembly to Golang, where memory alignment cannot be easily controlled.
Golang Assembly
It is perhaps unfortunate that I felt the need to digress into C for the exploratory phase of the SIMD optimization. Intrinsics provide us with low-level access to special instruction sets, combined with the conveniences of the compiler. The Go team has always preferred to support high-level interfaces over machine-level instructions, probably with good reason. However, when combined with Golang’s unnecessary custom assembly language, you end up with a slightly unfriendly environment for micro-optimizations.
Against this backdrop it is unsurprising that some impressive but ugly C-to-Go assembly conversion tools have emerged. I had hoped to employ these techniques with the encode_int kernel, but ended up having to fall back to manual conversion between the objdump output and Golang assembly. Thankfully, for AVX2 instructions the mapping is essentially one-to-one.
// func EncodeIntSimd(lat, lng []float64, hash []uint64)
TEXT ·EncodeIntSimd(SB), NOSPLIT, $0
	MOVQ lat+0(FP), AX
	MOVQ lng+24(FP), BX
	MOVQ hash+48(FP), CX
	VBROADCASTSD reciprocal180<>+0x00(SB), Y0
	VMULPD (AX), Y0, Y0
	VBROADCASTSD onepointfive<>+0x00(SB), Y1
	VADDPD Y1, Y0, Y0
	VPSRLQ $20, Y0, Y0
	VBROADCASTSD reciprocal360<>+0x00(SB), Y2
	VMULPD (BX), Y2, Y2
	VADDPD Y1, Y2, Y1
	VPSRLQ $20, Y1, Y1
	VMOVDQU spreadbyte<>+0x00(SB), Y2
	VPSHUFB Y2, Y0, Y0
	VBROADCASTSD lonibblemask<>+0x00(SB), Y3
	VPAND Y3, Y0, Y4
	VMOVDQU spreadnibblelut<>+0x00(SB), Y5
	VPSHUFB Y4, Y5, Y4
	VBROADCASTSD hinibblemask<>+0x00(SB), Y6
	VPAND Y6, Y0, Y0
	VPSRLQ $4, Y0, Y0
	VPSHUFB Y0, Y5, Y0
	VPSLLQ $8, Y0, Y0
	VPOR Y4, Y0, Y0
	VPSHUFB Y2, Y1, Y1
	VPAND Y3, Y1, Y2
	VPSHUFB Y2, Y5, Y2
	VPAND Y6, Y1, Y1
	VPSRLQ $4, Y1, Y1
	VPSHUFB Y1, Y5, Y1
	VPSLLQ $8, Y1, Y1
	VPOR Y2, Y1, Y1
	VPADDQ Y1, Y1, Y1
	VPOR Y1, Y0, Y0
	VMOVDQU Y0, (CX)
	RET
Implementation notes:
- Memory references are to generated data sections.
- Argument loading is based on the in-memory representation of slices.
As always, performance gains are assessed with a benchmark (note that the SIMD function performs four geohashes per iteration):
BenchmarkEncodeIntAsm-4     10000000000    2.32 ns/op
BenchmarkEncodeIntSimd-4    3000000000     4.53 ns/op
The SIMD implementation therefore takes 1.13 nanoseconds per point, a 2.04X speedup over the singlepoint version. However this assessment is somewhat dishonest, since the SIMD version benefits substantially merely by amortizing function call overhead over the four points. Accounting for function call overhead time per point becomes 0.78 nanoseconds, a 1.17X speedup. Still, at this level that’s not to be sniffed at!
Conclusion
This article has taken us on a tour of floating point representation and special instruction sets, demonstrating massive speedups in geohash encoding. The single-point encoding gains have been incorporated into the mmcloughlin/geohash package, making it the fastest geohash implementation for Golang. Thanks to this absurd optimization work, SIMD-obsessed Gophers can now concisely describe their location 880M times per second!
Thanks for input and advice: Ozan Onay, Rami Jaber, Will Demaine, Joshua Goller.
Discussion: r/golang, r/programming, lobsters, hacker news.

1. Benchmarks were run in the following environment:

$ go version
go version go1.10.3 darwin/amd64
$ sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-7567U CPU @ 3.50GHz

2. The assembly benchmark is the same as BenchmarkEncodeInt:

// BenchmarkEncodeIntAsm benchmarks assembly integer geohash encoding.
func BenchmarkEncodeIntAsm(b *testing.B) {
	var lat, lng = 40.463833, -79.972422
	for i := 0; i < b.N; i++ {
		EncodeIntAsm(lat, lng)
	}
}

3. The estimate for hypothetical EncodeIntBatch performance of 0.92 ns is obtained by subtracting the no-op function call time of 1.40 ns from the EncodeIntAsm time of 2.32 ns.