GopherCon 2018 - Performance Tuning Workshop Notes
These are some notes from my experiences at the GopherCon 2018 Performance Tuning Workshop, run by Dave Cheney and Francesc Campoy. I don’t expect these will be laid out in any particularly useful way; I am mostly taking them so I can remember some of the bits I found most useful in the future.
The contents of this post are licensed under https://creativecommons.org/licenses/by-sa/4.0/ as the basis of the content is this workshop repository: https://github.com/davecheney/gophercon2018-performance-tuning-workshop
Mechanical Sympathy
- Important to know how a thing works to work well with it.
- Don’t need to be a writer/designer/maintainer of the language itself, but having a cursory understanding helps
The Case for Performance Tuning
CPU-Land
- Huge benefits of Moore’s law (doubling of # of transistors on a chip every 18 months for ~50 years) might be slowing down
  - If computers were still going to get faster at the same rate, maybe we don’t need to worry so much
  - Performance now increasing at maybe ~3% per year, meaning doubling every ~20 years
  - CPU frequencies have been basically static for a while, largely due to heat management
- The trend has been to increase the number of cores, rather than increasing the performance of a single core
  - Parallelism does not get you nearly as far as chip performance; often you can’t parallelize enough to effectively use more cores (after a point)
- Most improvements have been architectural
  - e.g., Out of Order Execution, Speculative Execution
  - Optimization has tended towards doing things in bulk well, but one-offs not as well
Memory Land
- Memory capacities are basically unbounded still, but time to access is not improving nearly as well
  - Getting data out of main memory is Slow
  - Caches are faster; speed degrades as you fall down cache levels, and cache sizes are limited (especially L1/L2, whose size is limited by speed-of-light concerns and the need to have stable access times for CPU instruction scheduling)
Takeaways
- For increased performance, will have to write for discrete units, not just CPU
- Compiled languages will enable better working with CPU branch prediction (vs. interpreted)
- Need languages that allow efficient reasoning/working with memory (not hiding it behind a VM or excessive pointer chasing)
Benchmarking
- To improve performance, you need to be able to measure performance
  - Benchmark on idle machines (not shared, no web browsing, etc)
  - Disable power saving and other performance-hurting factors
  - Watch out for VMs and shared cloud hosting (noisy)
- Run benchmarks multiple times (both before and after) to get consistent results
Testing Package Tools
func BenchmarkFoo(b *testing.B) {
	for i := 0; i < b.N; i++ {
		Foo()
	}
}
- By default, benchmarks are not run by go test; add the -bench flag to run them
  - Go will run the regular tests before benchmarks; can disable with -run
  - Can pick core counts w/ -cpu=1,2,4,8, e.g. (actually adjusting GOMAXPROCS)
- Reports b.N and ns/op (ns per loop body)
  - b.N is an increasing number based on how long the benchmark takes (tries to get the most runs in ~1s)
  - there is an open issue to allow setting b.N directly instead of having it determined automatically
- Can increase -benchtime to get a high-enough number of iterations (M * 1k) to be valid/stable
- Can increase -count to get more runs
- Tool from Russ Cox: go get golang.org/x/perf/cmd/benchstat
  - can tell you how stable a set of benchmark results are (with -count, e.g.)
  - tee to a file, and then benchstat the file to get stats
Getting Solid Before/After Samples
- Compile test binaries w/ go test -c && mv foo.test foo.old-version
  - Pass args to the compiled binary as -test.foo instead of -foo (e.g., -test.bench instead of -bench)
- Make sure there are tests so you don’t break things while chasing optimizations
- Can pass both output files to benchstat to get comparative stats (including p-value and number of non-outliers)
Benchmark Setups
func BenchmarkFoo(b *testing.B) {
	expensiveSetup()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		b.StopTimer()
		loopIterSetup()
		b.StartTimer()
		Foo()
	}
}
Benchmark Allocations
Use b.ReportAllocs() in the benchmark function, or force on with -benchmem
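A minimal sketch of where the b.ReportAllocs() call goes. The joinWords function and the sink variable are made up for illustration, and testing.Benchmark is used only so the sketch runs as a standalone program; in a real package the benchmark would live in a _test.go file and be run with go test -bench.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// joinWords is a hypothetical function under test; it allocates the joined string.
func joinWords(words []string) string {
	return strings.Join(words, " ")
}

// sink keeps the result observable so the call is not optimized away.
var sink string

// BenchmarkJoinWords reports allocations per operation alongside ns/op.
func BenchmarkJoinWords(b *testing.B) {
	words := []string{"mechanical", "sympathy", "matters"}
	b.ReportAllocs() // per-benchmark equivalent of passing -benchmem
	for i := 0; i < b.N; i++ {
		sink = joinWords(words)
	}
}

func main() {
	// testing.Benchmark runs the function outside of "go test".
	res := testing.Benchmark(BenchmarkJoinWords)
	fmt.Println(res.String(), res.MemString())
}
```

The B/op and allocs/op columns are the ones to watch when trying to reduce allocations.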
Beware Compiler Optimizations
- (popcount example; see examples in original repo)
- Compiler: “You can’t prove I didn’t do it”; inlined away (no actual function call, just the loop)
  - math/bits package has the built-in instructions (when available)
- To fix: make sure the compiler can’t prove that some other part of the program can’t see it (store to local var, then at the end of the loop, assign value to a package public variable)
  - private package variable could (in the future) still result in optimizing away if the compiler starts supporting that
  - don’t want to just do //go:noinline; want the inlining performance benefits, so just fix the benchmark
- -gcflags "-S" to see assembly (can be useful to see if things got optimized away)
- Also beware of only-constants in benchmarks
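The fix described above can be sketched roughly as follows. This is my own reconstruction, not the code from the workshop repo: popcnt is a standard bit-counting routine standing in for the workshop example, and Result is an assumed name for the exported sink.

```go
package main

import (
	"fmt"
	"testing"
)

// popcnt counts the set bits in x; small enough to be an inlining candidate.
func popcnt(x uint64) uint64 {
	const m1 = 0x5555555555555555
	const m2 = 0x3333333333333333
	const m4 = 0x0f0f0f0f0f0f0f0f
	const h01 = 0x0101010101010101
	x -= (x >> 1) & m1
	x = (x & m2) + ((x >> 2) & m2)
	x = (x + (x >> 4)) & m4
	return (x * h01) >> 56
}

// Result is exported so the compiler cannot prove that nobody reads it.
var Result uint64

func BenchmarkPopcnt(b *testing.B) {
	var r uint64
	for i := 0; i < b.N; i++ {
		r = popcnt(uint64(i)) // varying input avoids benchmarking a constant
	}
	Result = r // store once, after the loop, to keep the loop body cheap
}

func main() {
	// testing.Benchmark is used only so this runs as a standalone program.
	fmt.Println(testing.Benchmark(BenchmarkPopcnt))
}
```

Without the assignment to Result, the loop body has no observable effect and the measured ns/op can collapse to the cost of an empty loop.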
Profiling
- Benchmarks help give you an idea of where you are, but not which parts could or should be optimized
- pprof
  - beware sampling issues
  - runtime/pprof package
  - go tool pprof (not recommended; drop-in replacement exists)
- pprof profiling methods
  - cpu
  - memory (two modes)
  - blocking
  - mutex contention
CPU Profiling
- Samples 100 times per second (every 10ms), recording a stack trace from the Go runtime each time
- Benchmark w/ -cpuprofile=cpu.pb.gz
  - Has a speed cost (not unexpected)
- Examine w/ go tool pprof cpu.pb.gz
  - top inside pprof is not that useful on its own, b/c runtime internals show up there and we can’t do much about them
  - what we really want is cumulative (top -cum)
  - can just do web (opens a graph)
- pprof drop-in replacement: go get github.com/google/pprof
  - pprof ... instead of go tool pprof ...
  - pprof -http=:6060 ... to open a web server w/ flame graphs and other cool stuff
  - code view w/ timing on a per-line basis
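Outside of go test, a CPU profile can also be written with the runtime/pprof package directly. A minimal sketch, where busyWork is a made-up workload and cpu.pb.gz is simply the file name used in these notes:

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

// busyWork is a hypothetical CPU-bound workload that gives the sampler something to see.
func busyWork(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i % 7
	}
	return sum
}

func main() {
	f, err := os.Create("cpu.pb.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Sampling starts here and stops when main returns.
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	log.Println(busyWork(200000000))
}
```

The resulting file can then be examined with the pprof tools mentioned above. A run that finishes in a few milliseconds will contain few or no samples, which is one form of the sampling issue noted earlier.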
Memory Profiling
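- For heap allocations only
- Samples a fraction of allocations rather than all of them (the workshop said 1 of every 1000; can tweak via runtime.MemProfileRate, which is expressed in bytes allocated per sample)
  - tracks code paths that led to allocations, not what is using the memory allocated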
- -alloc_objects vs. -inuse_objects to detect different things (or no flag)
  - no flag gives memory size
  - the other flags give counts
- Apparently it can be used to detect leaks, but not quite straightforward
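For a program that is not a benchmark, a heap profile can be written with runtime/pprof. This is only a sketch: allocate and retained are hypothetical names, and mem.pb.gz is an arbitrary file name.

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

// retained is package-level so the allocations below stay live (in use).
var retained [][]byte

// allocate is a hypothetical workload that makes n heap allocations of the given size.
func allocate(n, size int) {
	for i := 0; i < n; i++ {
		retained = append(retained, make([]byte, size))
	}
}

func main() {
	allocate(1000, 64<<10)

	f, err := os.Create("mem.pb.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	runtime.GC() // run a collection first so the in-use numbers are current
	if err := pprof.WriteHeapProfile(f); err != nil {
		log.Fatal(err)
	}
}
```

Opening the file with the -alloc_objects or -inuse_objects flags from the notes above should point at the make call inside allocate, since the profile records where memory was allocated rather than who holds it.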
Block Profiling
- Similar to CPU profile but records time spent waiting on a shared resource
- Possibly useful for identifying concurrency bottlenecks
  - channel waits
  - mutex waits (can also do mutex profiling)
Mutex Profiling
- Like block profiling for mutexes specifically
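Block and mutex profiles are off by default and have to be switched on in the runtime before they record anything. A rough sketch, assuming a made-up contend workload and using a rate of 1 (record everything), which is likely too expensive for production use:

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

// contend is a hypothetical workload: several goroutines fight over one mutex.
func contend(workers, iters int) int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	counter := 0
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < iters; i++ {
				mu.Lock()
				counter++
				time.Sleep(time.Microsecond) // hold the lock briefly so others must wait
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return counter
}

func main() {
	runtime.SetBlockProfileRate(1)     // sample every blocking event
	runtime.SetMutexProfileFraction(1) // sample every mutex contention event

	contend(4, 100)

	// debug=1 writes a human-readable text form instead of the binary format.
	if err := pprof.Lookup("block").WriteTo(os.Stdout, 1); err != nil {
		log.Fatal(err)
	}
	if err := pprof.Lookup("mutex").WriteTo(os.Stdout, 1); err != nil {
		log.Fatal(err)
	}
}
```

Writing with debug=0 to a file instead would produce output the pprof tools can open.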
General Tips
- Profile one thing at a time (more will hurt performance too much, probably)
Profiling Non-Benchmarks
- Option 1: github.com/pkg/profile, calling defer profile.Start().Stop() (worked pretty well in testing)
- Option 2: golang.org/... (see repo)
- Too much time in syscalls means you should probably be buffering to avoid doing syscalls as often
- "net/http/pprof"
  - can pass the url to the debug endpoints to the pprof tools instead of a file
  - only has a performance hit when you run pprof against the url
- go-wrk to generate load
Execution Tracer
- Can’t catch really fast or rare events with pprof
- Instead of asking the runtime for data, configure the runtime to log/report data
  - Performance hit is not nearly as predictable as pprof
  - start with trace.Start, defer trace.Stop(), usually in main, but can do in a specific function
- go tool trace trace.out (needs Chrome)
  - w/s zoom in/out
  - a/d move left/right
  - ? for help
  - Can see “mark assist” things that slow down goroutines as they help the garbage collector
  - Can see when goroutines get tagged for GC assist (only those that ask for a malloc can get co-opted)
  - STW (stop the world: stop everything so we can free memory)
  - SWEEP (do the memory cleanup)
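A minimal sketch of wiring up the tracer with the runtime/trace package. The fanOut workload is made up so that the trace has a few goroutines to show, and trace.out matches the file name used above.

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
)

// fanOut is a hypothetical workload: each worker sums 0..999 and reports the result.
func fanOut(workers int) int {
	results := make(chan int, workers)
	for w := 0; w < workers; w++ {
		go func() {
			sum := 0
			for i := 0; i < 1000; i++ {
				sum += i
			}
			results <- sum
		}()
	}
	total := 0
	for w := 0; w < workers; w++ {
		total += <-results
	}
	return total
}

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Everything between Start and Stop is recorded by the runtime.
	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	log.Println(fanOut(8))
}
```

Moving the Start/Stop pair into a specific function narrows the trace to just that region, which keeps the output smaller.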
Compiler Optimizations
- 3 Big ones: escape analysis, inlining, dead code elimination
- Go compiler was built from the Plan9 compiler
Escape Analysis
- Can return pointers out of a function (and the compiler will allocate on the heap)
- But can take things that are usually thought of as on the heap and put them on the stack instead
  - e.g., slice that is created and used entirely within a function
  - can’t use make/new to determine heap/stack
  - ...interface{} can make things escape
- Can make the compiler tell you about it w/ -gcflags=-m, or -gcflags="-m -m" for even more info
- There is a threshold of stack growth vs. heap allocation where in-function things might escape anyhow
- const vs var sizes matter for escape analysis
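The two cases above can be sketched like this. newInt, sumLocal, and sinkPtr are names I made up, and the comments describe what I would expect the compiler to decide rather than a guarantee; building with -gcflags=-m shows the actual decisions.

```go
package main

import (
	"fmt"
	"testing"
)

// newInt returns a pointer to a local variable, so the value has to outlive
// the call and is expected to be moved to the heap.
//
//go:noinline
func newInt(v int) *int {
	x := v
	return &x
}

// sumLocal uses make with a constant size and never lets the slice leave the
// function, so the backing array can usually stay on the stack.
func sumLocal() int {
	s := make([]int, 8)
	for i := range s {
		s[i] = i
	}
	total := 0
	for _, v := range s {
		total += v
	}
	return total
}

// sinkPtr keeps the pointer reachable so the allocation is not elided.
var sinkPtr *int

func main() {
	// testing.AllocsPerRun reports the average number of heap allocations per call.
	escaping := testing.AllocsPerRun(1000, func() { sinkPtr = newInt(42) })
	local := testing.AllocsPerRun(1000, func() { _ = sumLocal() })
	fmt.Println("allocs per call, newInt:", escaping, "sumLocal:", local)
}
```

Changing the 8 in sumLocal to a variable length is a quick way to see the const vs var point, since the compiler then has less information about the size.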
Inlining
- Only for leaf functions
  - Leaf functions are functions that do not call other functions
- Only for “small” functions; ratio of preamble overhead to work done matters
  - e.g., getters, setters
  - “can inline” vs. “inlining call to” messages in the -gcflags=-m output
- go1.11 seems to be able to chain inlines, and can inline bits that might panic
- can control inlining
  - -gcflags=-l disables
  - -gcflags="-l -l" more aggressive, maybe bigger binaries (note: NOT the same as no -l (?))
  - -gcflags="-l -l -l" more aggressive, maybe bigger binaries again, might be buggy
  - -gcflags="-l=4" in go1.11 will enable experimental mid stack inlining (bigger and slower right now)
- things like const vs. var won’t affect inlining like they do for escape analysis
- but (public) package variables will
Dead Code Elimination
- Can use some conditional compilation to set consts to different values so that sometimes the compiler can optimize away code (e.g., debug stuff)
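A small sketch of the const trick. The names debug, logf, and add are illustrative only; with conditional compilation, the const would be declared in two files selected by build tags, each giving it a different value.

```go
package main

import "fmt"

// debug is a constant, so "if debug" can be evaluated at compile time and the
// guarded branch dropped from the binary when it is false.
const debug = false

// logf only does work in debug builds.
func logf(format string, args ...interface{}) {
	if debug {
		fmt.Printf("DEBUG: "+format+"\n", args...)
	}
}

// add is a hypothetical function with debug logging sprinkled in.
func add(a, b int) int {
	logf("add(%d, %d)", a, b)
	return a + b
}

func main() {
	fmt.Println(add(2, 3))
}
```

If debug were a var instead, the compiler could not assume its value and the branch would have to stay.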
Tips and Tricks
- unicode.IsSpace(b) for detecting all kinds of spaces
- “concurrency of gophers”
- reduce allocations
  - a la Read([]byte) (int, error) vs. Read() ([]byte, error)
- avoid []byte to string conversions
  - conversion creates a copy
  - pick one representation and stick with it
- there are compiler optimizations for doing v, ok := m[string(key)] (and that specifically, not when doing the conversion before map lookup)
- string concatenation is complicated
  - initial benchmark was appending to properly sized []byte > +-concat > Sprintf > Fprintf
- preallocate slices if the length is known
- goroutines can seem free, but there are memory concerns (each one has a stack of at least 2K)
  - make sure there is a stop condition so the memory can get cleaned up
- async filesystem IO is not widely available
  - filesystem IO will consume an OS thread while running
- use io.Reader/io.Writer instead of passing around large []byte when possible
- Timeouts
  - use SetReadDeadline/SetWriteDeadline/SetDeadline for network IO
  - use worker pool channels
- defer-ing a very small function can be expensive relative to the work the function does
- avoid finalizers (non-deterministic, since connected to GC)
- avoid, or at least minimize, cgo
- http://justforfunc.com <- find video about go execution tracer (uses mandelbrot example from workshop)
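Two of the tips above, preallocating slices and converting inside the map index expression, can be sketched as follows. The function names are made up for illustration.

```go
package main

import "fmt"

// squares preallocates the slice because the final length is known up front.
func squares(n int) []int {
	out := make([]int, 0, n) // one allocation instead of repeated growth in append
	for i := 0; i < n; i++ {
		out = append(out, i*i)
	}
	return out
}

// lookup converts inside the index expression, which is the form the notes say
// the compiler special-cases so the key bytes need not be copied.
func lookup(m map[string]int, key []byte) (int, bool) {
	v, ok := m[string(key)]
	return v, ok
}

func main() {
	fmt.Println(squares(4))

	m := map[string]int{"gopher": 1}
	fmt.Println(lookup(m, []byte("gopher")))
}
```

Writing k := string(key) on its own line before the lookup would, per the notes, fall outside that optimization and pay for the copy.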