# Parquet, Go, and Unreasonable Amounts of SIMD
9 July 2024
Tags: parquet, go, parquet-go, simd
Summary: This is a presentation about the github.com/parquet-go/parquet-go project.
Achille Roussel
Co-founder, dispatch.run
https://github.com/parquet-go/parquet-go
achille-roussel (GitHub)
: Hello, my name is Achille, I'm the co-founder of dispatch.run, and also an open-source contributor.
: I've worked on a bunch of projects in the Go ecosystem.
: Today I'm here to talk to you about another open-source project I maintain: parquet-go
## What is Parquet?
- Columnar file format
- Per-column encoding for efficient compression
- Popular in "big data"
.image images/parquet-logo.png
: Parquet is a columnar file format that's gotten really popular in the "big data" space.
: It has a lot of the features you want for storing and querying large volumes of data.
: Like very efficient encoding and compression, indexes, bloom filters, etc...
## Why did we develop parquet-go?
.link https://github.com/parquet-go/parquet-go
- Project started in 2021
- Unmaintained packages
- Poor performance
.image images/SCR-20240709-kbhm.png 400 _
: Back when I was working at Segment in 2021, we had a lot of "big data" that we wanted to keep in object store and query efficiently.
: Unfortunately, all the existing open-source packages for reading and writing parquet files were either unmaintained or riddled with bugs, and more importantly they didn't meet our performance requirements.
: So as one usually does, I thought to myself "how hard can it be to write a parquet library?"
: Let me tell you, "It was really hard"
## Who is using parquet-go?
.link https://github.com/parquet-go/parquet-go
- FrostDB (Polar Signals)
- Grafana Tempo
- Grafana Pyroscope
- Redpanda Connect
- ...
.image images/GOPHER_SHARE.png 200 _
: Luckily around that time, I met friends at Polar Signals and Grafana Labs who were in search of a similar solution.
: So I open-sourced parquet-go and created a small community around the project which has grown a lot since.
## How did we make parquet-go fast?
.link https://github.com/parquet-go/parquet-go
- API designed for high-performance
- Unreasonable amounts of assembly
```
$ find . -name '*.go' -and -not -name '*_test.go' | xargs wc -l | tail -1
34609 total
```
```
$ find . -name '*.s' | xargs wc -l | tail -1
9410 total
```
```
$ echo '100 * (9410 / 34609)' | bc -l
27.189459389176225837
```
: So how did we make it fast?
: Two main reasons, one lies in the design of the high-level abstractions.
: The other is that I implemented all the data crunching code paths as vectorized algorithms in assembly.
: When we take a look at the project today, the assembly amounts to over 25% of the Go code by line count.
## Crafting APIs for high-performance software
- Precompute as much as possible
- Receive input/output buffers
- Amortize the cost of abstractions
```
type RowReader interface {
ReadRows(rows []Row) (int, error)
}
```
```
type RowWriter interface {
WriteRows(rows []Row) (int, error)
}
```
: If you want to write high-performance software, the most important thing to get right is your abstractions.
: There are two core interfaces in the package which are the row reader and writer.
: The key here is that we pass an array of rows as input or output buffer.
: This amortizes the cost of abstractions, which means we can have high-performance code while also using high-level constructs.
: Basically this design means that the program can spend most of its CPU time in the innermost parts of the code where it's actually doing work, instead of crossing abstraction boundaries over and over.
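As a rough sketch of how this plays out, here is a hypothetical helper (not code from the package; it assumes ReadRows signals the end of the stream with io.EOF, like an io.Reader does) that reuses a single buffer of rows so the cost of each interface call is amortized over many rows:
```
// copyRows is a hypothetical helper: it reuses one buffer of
// rows across calls, so each ReadRows/WriteRows call pays its
// overhead once for a whole batch of rows.
func copyRows(dst RowWriter, src RowReader) error {
	buf := make([]Row, 128) // reused on every iteration
	for {
		n, err := src.ReadRows(buf)
		if n > 0 {
			if _, werr := dst.WriteRows(buf[:n]); werr != nil {
				return werr
			}
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}
```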
## Run-Length / Binary Packing
.link https://arxiv.org/pdf/1209.2137
**Decoding billions of integers per second through vectorization**
*Daniel Lemire, Leonid Boytsov*
- VPSHUFD
- VPSLLVQ
- VBLENDPD
: In parquet, there is a lot of run-length encoding and bit-packing done to represent various parts of the data.
: We implement this paper from Daniel Lemire which describes how to do efficient bit-packing of integers.
: One thing I learned about these research papers is that they tend to describe the high-level idea, but rarely go into the implementation details.
: So a lot of the difficulty here was figuring out how to "finish" the work.
: I think what I achieved in parquet-go is pretty unique; to my knowledge there aren't many production-scale implementations of these algorithms.
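For intuition, here is a plain scalar reference of bit-packing (a hypothetical loop, nothing like the vectorized code from the paper): each value is written back to back using only bitWidth bits, into a zeroed destination buffer large enough to hold len(values)*bitWidth bits.
```
// packInt32 is a scalar reference sketch: it packs each value
// into bitWidth bits, one bit at a time. dst must be zeroed and
// large enough to hold len(values)*bitWidth bits.
func packInt32(dst []byte, values []int32, bitWidth uint) {
	bitOffset := uint(0)
	for _, v := range values {
		for i := uint(0); i < bitWidth; i++ {
			if (uint32(v)>>i)&1 != 0 {
				dst[(bitOffset+i)/8] |= 1 << ((bitOffset + i) % 8)
			}
		}
		bitOffset += bitWidth
	}
}
```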
## Prefix Sum
.link https://en.algorithmica.org/hpc/algorithms/prefix/
- Really difficult to vectorize due to data dependencies
- Compute and validate at the same time
```
for i := 1; i < len(values); i++ {
values[i] += values[i-1]
}
```
: In several encodings, parquet stores deltas between values, because the deltas are usually small and can be packed into fewer bits.
: There's another paper that we implemented to vectorize the encoding and decoding of deltas.
: But in our case the data comes from external files that may have been corrupted, so we also need to do validation.
: We improved on the reference algorithm a bit to do both computation and validation in the same pass, which avoids thrashing the memory caches with multiple passes over the data.
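A scalar sketch of that idea (hypothetical; the exact checks depend on the encoding, here the running sum must stay within the bounds of the page):
```
// decodeLengths turns deltas into absolute values (prefix sum)
// and validates them in the same pass, instead of re-reading
// the data a second time.
func decodeLengths(values []int32, limit int32) error {
	sum := int32(0)
	for i, delta := range values {
		sum += delta
		if sum < 0 || sum > limit {
			return fmt.Errorf("invalid value at index %d: %d", i, sum)
		}
		values[i] = sum
	}
	return nil
}
```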
## Bytestream Split
.link https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
Swap the bytes of 32-bit and 64-bit floats to improve the compression ratio with Zstd
- VPSCATTER
- VPGATHER
- VPSHUFB
- VPERMB
: This is another weird trick. If you have a column of floating point numbers, instead of storing them as a plain array, store all the first bytes first, then all the second bytes, all the third bytes, etc...
: Then when you compress it, you get much better compression ratios.
: This is an encoding called bytestream split in parquet, and it's also a perfect use case for GATHER/SCATTER instructions, because they let us operate on non-contiguous areas of memory.
: Take one float value, and you can scatter each of its bytes at the right position in the output buffer with a single instruction.
: This encoding ends up being as fast as a simple memory copy.
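A scalar sketch of the encoding for 32-bit floats (hypothetical reference code, assuming dst has room for 4*len(values) bytes; the real code path uses the gather/scatter instructions listed above):
```
// byteStreamSplit groups byte k of every float together, which
// makes the output much friendlier to a general-purpose
// compressor like Zstd.
func byteStreamSplit(dst []byte, values []float32) {
	n := len(values)
	for i, v := range values {
		bits := math.Float32bits(v)
		dst[0*n+i] = byte(bits)
		dst[1*n+i] = byte(bits >> 8)
		dst[2*n+i] = byte(bits >> 16)
		dst[3*n+i] = byte(bits >> 24)
	}
}
```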
## Bloom Filters
AVX2 implementation of split-block bloom filters:
```
name old time/op new time/op delta
FilterInsertBulk 45.1ns ± 2% 17.8ns ± 3% -60.41% (p=0.000 n=10+10)
FilterInsert 3.48ns ± 2% 2.55ns ± 1% -26.86% (p=0.000 n=10+8)
FilterCheck 3.64ns ± 3% 2.66ns ± 2% -26.82% (p=0.000 n=10+9)
name old speed new speed delta
FilterInsertBulk 11.4GB/s ± 2% 28.7GB/s ± 3% +152.61% (p=0.000 n=10+10)
FilterInsert 9.19GB/s ± 2% 12.56GB/s ± 1% +36.71% (p=0.000 n=10+8)
FilterCheck 8.80GB/s ± 3% 12.03GB/s ± 2% +36.61% (p=0.000 n=10+9)
```
: Finally getting into some of the most advanced and unique parts of parquet-go.
: The split-block bloom filters of the parquet spec map extremely well to parallel processing using AVX2, so obviously I had to vectorize all of that as well.
: As you see parquet-go can generate bloom filters at almost 30GB/s.
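Conceptually, inserting a hash into a split-block filter sets exactly one bit in each of the 8 words of a 256-bit block, which is what maps so well onto 8-lane AVX2 operations. A scalar sketch (the salt constants are quoted from the parquet bloom filter spec; double-check against the spec before relying on them):
```
type block [8]uint32

// Per-word salt constants, as listed in the parquet
// split-block bloom filter specification.
var salt = [8]uint32{
	0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
	0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
}

// insert is a scalar sketch: pick a block with a fast range
// reduction of the upper hash bits, then set one bit per word
// using the lower hash bits and the per-word salts.
func insert(blocks []block, hash uint64) {
	b := &blocks[((hash>>32)*uint64(len(blocks)))>>32]
	x := uint32(hash)
	for i := range b {
		b[i] |= 1 << ((x * salt[i]) >> 27)
	}
}
```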
## Parallel xxhash with AVX-512
Parallel computation of xxhash when inserting multiple values to filters
```
name old speed new speed delta
MultiSum64Uint8/4KB 4.97GB/s ± 0% 14.59GB/s ± 1% +193.73% (p=0.000 n=10+10)
MultiSum64Uint16/4KB 3.55GB/s ± 0% 9.46GB/s ± 0% +166.20% (p=0.000 n=10+9)
MultiSum64Uint32/4KB 4.48GB/s ± 0% 13.93GB/s ± 1% +210.93% (p=0.000 n=10+10)
MultiSum64Uint64/4KB 3.57GB/s ± 0% 11.12GB/s ± 1% +211.73% (p=0.000 n=9+10)
MultiSum64Uint128/4KB 2.54GB/s ± 0% 6.49GB/s ± 1% +155.69% (p=0.000 n=10+10)
name old hash/s new hash/s delta
MultiSum64Uint8/4KB 621M ± 0% 1823M ± 1% +193.73% (p=0.000 n=10+10)
MultiSum64Uint16/4KB 444M ± 0% 1182M ± 0% +166.20% (p=0.000 n=10+9)
MultiSum64Uint32/4KB 560M ± 0% 1742M ± 1% +210.93% (p=0.000 n=10+10)
MultiSum64Uint64/4KB 446M ± 0% 1391M ± 1% +211.73% (p=0.000 n=9+10)
MultiSum64Uint128/4KB 317M ± 0% 811M ± 1% +155.69% (p=0.000 n=10+10)
```
: Values inserted in bloom filters must be hashed first, so the hashing algorithm can become a bottleneck in the generation of bloom filters.
: We implemented a vectorized version of xxhash which computes multiple hashes in parallel and can hash 1 to 2 billion values per second.
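For intuition, a scalar equivalent of what such a batched hashing call computes (a hypothetical sketch using github.com/cespare/xxhash/v2 as a stand-in; parquet-go ships its own vectorized implementation):
```
import (
	"encoding/binary"

	"github.com/cespare/xxhash/v2"
)

// multiSum64Uint64 produces one 64-bit xxhash per input value.
// The AVX-512 version computes several of these hashes in
// parallel instead of one at a time.
func multiSum64Uint64(hashes []uint64, values []uint64) {
	var buf [8]byte
	for i, v := range values {
		binary.LittleEndian.PutUint64(buf[:], v)
		hashes[i] = xxhash.Sum64(buf[:])
	}
}
```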
## Contribute back to Go
- bytes.Count implemented with AVX-512
```
│ base │ avx512 │
│ sec/op │ sec/op vs base │
CountSingle/10 4.832n ± 0% 4.858n ± 1% +0.55% (p=0.000 n=10)
CountSingle/32 5.477n ± 1% 5.966n ± 1% +8.93% (p=0.000 n=10)
CountSingle/4K 63.35n ± 1% 39.92n ± 0% -36.98% (p=0.000 n=10)
CountSingle/4M 124.8µ ± 5% 121.9µ ± 3% ~ (p=0.123 n=10)
CountSingle/64M 4.706m ± 3% 4.348m ± 2% -7.61% (p=0.000 n=10)
geomean 996.9n 906.8n -9.03%
│ base │ avx512 │
│ B/s │ B/s vs base │
CountSingle/10 1.928Gi ± 0% 1.917Gi ± 1% -0.55% (p=0.000 n=10)
CountSingle/32 5.441Gi ± 1% 4.995Gi ± 0% -8.20% (p=0.000 n=10)
CountSingle/4K 60.21Gi ± 1% 95.54Gi ± 0% +58.68% (p=0.000 n=10)
CountSingle/4M 31.30Gi ± 5% 32.05Gi ± 3% ~ (p=0.123 n=10)
CountSingle/64M 13.28Gi ± 3% 14.37Gi ± 2% +8.23% (p=0.000 n=10)
geomean 12.13Gi 13.33Gi +9.93%
```
.link https://go-review.googlesource.com/c/go/+/542695
: This was just an overview; there are many more optimizations implemented in the package, and some of them can also be ported back to Go so the rest of the ecosystem benefits from them.
: In particular, I have this CL open to add an AVX-512 implementation of bytes.Count which yields significant throughput improvements, so please someone merge it!
: Remember: to write high-performance software, the most important thing is to get your high-level abstractions right.
: Alright, that's all I had. Thanks for listening, and don't forget to support your open-source maintainers by being generous with GitHub stars!