GopherCon 2018 - Performance Tuning Workshop Notes
These are some notes from my experiences at the GopherCon 2018 Performance Tuning Workshop, run by Dave Cheney and Francesc Campoy. I don’t expect these will be laid out in any particularly useful way; I am mostly taking them so I can remember some of the bits I found most useful in the future.
The contents of this post are licensed under as the basis of the content is this workshop repository:
Mechanical Sympathy
- Important to know how a thing works to work well with it.
- Don’t need to be a writer/designer/maintainer of the language itself, but having a cursory understanding helps
The Case for Performance Tuning
Huge benefits of Moore’s law (doubling of # of transistors on a chip every 18 months for ~50 years) might be slowing down
- If computers were still going to get faster at the same rate, maybe we don’t need to worry so much
- Performance now increasing at maybe ~3% per year, meaning doubling every ~20 years
- CPU frequencies have been basically static for a while, largely due to heat management
The trend has been to increase the number of cores, rather than increasing the performance of a single core
- Parallelism does not get you nearly as far as chip performance; often you can’t parallelize enough to effectively use more cores (after a point)
Most improvements have been architectural
- e.g., Out of Order Execution, Speculative Execution
- Optimization has tended towards doing things in bulk well, but one-offs not as well
Memory Land
Memory capacities are basically unbounded still, but time to access is not increasing nearly as well
- Getting data out of main memory is Slow
- Cache lines are faster; speed degrades as you fall down cache levels, and cache sizes are limited (especially L1/L2, whose size is limited by speed-of-light concerns and the need to have stable access times for CPU instruction scheduling)
- For increased performance, will have to write for discrete units, not just CPU
- Compiled languages will enable better working with CPU branch prediction (vs. interpreted)
- Need languages that allow efficient reasoning/working with memory (not hiding it behind a VM or excessive pointer chasing)
To improve performance, you need to be able to measure performance
- Benchmark on idle machines (not shared, no web browsing, etc)
- Disable power saving and other performance-hurting factors
- Watch out for VMs and shared cloud hosting (noisy)
Run benchmarks multiple times (both before and after) to get consistent results
Testing Package Tools
func BenchmarkFoo(b *testing.B) {
	for i := 0; i < b.N; i++ {
		// code under test goes here
	}
}
By default, benchmarks are not run by go test; add the -bench flag to run them
- Go will run the regular tests before benchmarks; can disable with a -run pattern that matches no tests (e.g., -run=^$)
- Can pick core counts w/ -cpu, e.g. -cpu=1,2,4 (actually adjusting GOMAXPROCS)
Output includes the iteration count and ns/op (ns per loop body)
b.N is an increasing number based on how long the benchmark takes (tries to get the most runs in ~1s)
- open issue to set b.N directly instead of having it determined
Can increase -benchtime to get a high-enough number of iterations (M * 1k) to be valid/stable
Can increase -count to get more runs
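A minimal sketch of the benchmark shape described above. The fib function is a hypothetical stand-in for the code under test, not something from the workshop. In a real project BenchmarkFib would live in a _test.go file and be run with go test -bench; testing.Benchmark is used here only so the example runs as an ordinary program.

```go
package main

import (
	"fmt"
	"testing"
)

// fib is a hypothetical workload standing in for the code under test.
func fib(n int) int {
	if n < 2 {
		return n
	}
	return fib(n-1) + fib(n-2)
}

// result is package-level so the compiler cannot discard the fib calls.
var result int

// BenchmarkFib has the signature go test looks for; the framework
// chooses b.N and grows it until the timing is stable.
func BenchmarkFib(b *testing.B) {
	var r int
	for i := 0; i < b.N; i++ {
		r = fib(10)
	}
	result = r
}

// run executes the benchmark outside of go test and returns the
// iteration count and the nanoseconds per iteration.
func run() (int, int64) {
	res := testing.Benchmark(BenchmarkFib)
	return res.N, res.NsPerOp()
}

func main() {
	n, ns := run()
	fmt.Printf("%d iterations, %d ns/op\n", n, ns)
}
```

The exact numbers depend on the machine, which is why the notes above suggest repeated runs on an idle box.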
benchstat: tool from Russ Cox
- install with go get
- can tell you how stable a set of benchmark results are (with -count, e.g.)
- tee to a file, and then benchstat the file to get stats
Getting Solid Before/After Samples
Compile test binaries w/ go test -c && mv foo.test foo.old-version
- Pass args as -test.foo instead of -foo
Keep tests in place so you don’t break things while chasing optimizations
Can pass both out files to benchstat to get comparative stats (including p-value and number of non-outliers)
Benchmark Setups
func BenchmarkFoo(b *testing.B) {
	// expensive setup goes here, before the timed loop
	for i := 0; i < b.N; i++ {
		// code under test goes here
	}
}
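One way to keep setup cost out of the measurement is b.ResetTimer, sketched below. The helpers buildInput and sum are hypothetical names invented for this example; the assumption is that the setup is expensive relative to a single iteration.

```go
package main

import (
	"fmt"
	"testing"
)

// buildInput is hypothetical setup work that should not be timed.
func buildInput(n int) []int {
	s := make([]int, n)
	for i := range s {
		s[i] = i
	}
	return s
}

// sum is the (hypothetical) code under test.
func sum(s []int) int {
	total := 0
	for _, v := range s {
		total += v
	}
	return total
}

// total is a package-level sink for the benchmark result.
var total int

func BenchmarkSum(b *testing.B) {
	input := buildInput(1 << 16) // one-off setup
	b.ResetTimer()               // exclude the setup from the measurement
	var t int
	for i := 0; i < b.N; i++ {
		t = sum(input)
	}
	total = t
}

func main() {
	// testing.Benchmark lets the benchmark run outside of go test.
	fmt.Println(testing.Benchmark(BenchmarkSum))
}
```

For setup that has to happen inside the loop, b.StopTimer and b.StartTimer exist as well, though they add their own overhead per call.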
Benchmark Allocations
Use b.ReportAllocs() in the benchmark function, or force on with -benchmem
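A small sketch of allocation reporting. BenchmarkMakeBuffer and the sink variable are hypothetical; the buffer is stored in a package-level variable on the assumption that this forces a heap allocation on every iteration.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps each buffer reachable so the allocation is not optimized away.
var sink []byte

func BenchmarkMakeBuffer(b *testing.B) {
	b.ReportAllocs() // adds B/op and allocs/op to the go test output
	for i := 0; i < b.N; i++ {
		sink = make([]byte, 1024)
	}
}

// allocsPerOp runs the benchmark and returns the measured allocations
// per iteration.
func allocsPerOp() int64 {
	return testing.Benchmark(BenchmarkMakeBuffer).AllocsPerOp()
}

func main() {
	res := testing.Benchmark(BenchmarkMakeBuffer)
	fmt.Println(res.String(), res.MemString())
}
```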
Beware Compiler Optimizations
(popcount example – see examples in original repo)
Compiler: “You can’t prove I didn’t do it” – inlined away (no actual function call, just the loop)
The math/bits package has the built-in instructions (when available)
To fix: make sure the compiler can’t prove that some other part of the program can’t see it (store to local var, then at the end of the loop, assign value to a package public variable)
- private package variable could (in the future) still result in optimizing away if the compiler starts supporting that
- don’t want to just disable inlining; want the inlining performance benefits – just fix the benchmark
Use -gcflags "-S" to see assembly (can be useful to see if things got optimized away)
Also beware of only-constants in benchmarks
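The fix described above can be sketched roughly as follows. This is a reconstruction of the usual bit-twiddling popcount, not a copy of the workshop code; the exported Result variable plays the role of the package public variable, and the input varies with i to avoid benchmarking a constant.

```go
package main

import (
	"fmt"
	"math/bits"
	"testing"
)

const (
	m1  = 0x5555555555555555
	m2  = 0x3333333333333333
	m4  = 0x0f0f0f0f0f0f0f0f
	h01 = 0x0101010101010101
)

// popcnt counts the set bits in x without using math/bits.
func popcnt(x uint64) uint64 {
	x -= (x >> 1) & m1
	x = (x & m2) + ((x >> 2) & m2)
	x = (x + (x >> 4)) & m4
	return (x * h01) >> 56
}

// Result is exported, so the compiler cannot prove nobody reads it.
var Result uint64

func BenchmarkPopcnt(b *testing.B) {
	var r uint64
	for i := 0; i < b.N; i++ {
		r = popcnt(uint64(i)) // varying input, not a constant
	}
	Result = r // store once, after the loop
}

func main() {
	fmt.Println(testing.Benchmark(BenchmarkPopcnt))
	// Cross-check against the standard library implementation.
	fmt.Println(popcnt(0xFF) == uint64(bits.OnesCount64(0xFF)))
}
```

Accumulating into a local and assigning once at the end keeps the store out of the hot loop while still making the result observable.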
Benchmarks help give you an idea of where you are, but not which parts could or should be optimized
- beware sampling issues
- pprof package
- go tool pprof (not recommended – drop-in replacement exists)
pprof profiling methods
- cpu
- memory (two modes)
- blocking
- mutex contention
CPU Profiling
100/s (every 10ms) records a stacktrace from the go runtime
Benchmark w/ -cpuprofile
- Has a speed cost (not unexpected)
Examine w/ go tool pprof cpu.pb.gz
- top inside is not that useful, b/c the runtime and similar code show up and we can’t do anything about those
- and we really want cumulative (top -cum)
- can just do web (opens a graph)
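Outside of go test, a CPU profile can also be collected with the standard library's runtime/pprof package. The sketch below is one way to do that; busy and collect are hypothetical helpers, and the output file name simply mirrors the cpu.pb.gz used above.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime/pprof"
)

// busy is a hypothetical CPU-bound workload.
func busy(n int) int {
	total := 0
	for i := 0; i < n; i++ {
		total += i % 7
	}
	return total
}

// collect runs work with CPU profiling enabled and returns the
// serialized profile.
func collect(work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	work()
	pprof.StopCPUProfile() // flushes the profile into buf
	return buf.Bytes(), nil
}

func main() {
	data, err := collect(func() { busy(200_000_000) })
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := os.WriteFile("cpu.pb.gz", data, 0o644); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("wrote", len(data), "bytes; inspect with: go tool pprof cpu.pb.gz")
}
```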
pprof drop-in replacement
Install with go get
- pprof ... instead of go tool pprof ...
- pprof -http=:6060 ... to open a web server w/ flame graphs and other cool stuff
- code view w/ timing on a per-line basis
Memory Profiling
For heap allocations only
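Samples a fraction of allocations, quoted in the workshop as 1 of every 1000 (can tweak; runtime.MemProfileRate controls it, measured in bytes allocated per sample)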
- tracks code paths that led to allocations, not what is using the memory allocated
Can pass flags to pprof to detect different things (or no flag)
- no flag gives memory size
- the other flags give counts
Apparently it can be used to detect leaks, but not quite straightforward
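For programs that are not benchmarks, a heap profile can be written with runtime/pprof. This is a sketch under the assumption that a snapshot after a forced GC is wanted; allocate, retained, and heapProfile are hypothetical names.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// retained keeps the buffers alive so they count as in-use memory.
var retained [][]byte

// allocate is a hypothetical workload that holds on to n 64 KiB buffers.
func allocate(n int) {
	for i := 0; i < n; i++ {
		retained = append(retained, make([]byte, 64<<10))
	}
}

// heapProfile returns a serialized heap profile.
func heapProfile() ([]byte, error) {
	runtime.GC() // bring the profile statistics up to date first
	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	allocate(64)
	data, err := heapProfile()
	if err != nil {
		fmt.Println("heap profile failed:", err)
		return
	}
	fmt.Println("heap profile size in bytes:", len(data))
}
```

The profile records the stacks that performed the sampled allocations, which matches the point above that it shows where memory was allocated rather than what is holding on to it.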
Block Profiling
Similar to CPU profile but records time spent waiting on a shared resource
Possibly useful for identifying concurrency bottlenecks
- channel waits
- mutex waits (can also do mutex profiling)
Mutex Profiling
- Like block profiling for mutexes specifically
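A rough sketch of turning on block profiling from code. The contend workload is hypothetical and only exists to make goroutines wait on a mutex; runtime.SetMutexProfileFraction together with the "mutex" profile would be the analogous calls for mutex profiling.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

// contend makes several goroutines fight over one mutex.
func contend() {
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 50; j++ {
				mu.Lock()
				time.Sleep(10 * time.Microsecond)
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
}

// blockProfile records blocking events while work runs and returns
// the serialized profile.
func blockProfile(work func()) ([]byte, error) {
	runtime.SetBlockProfileRate(1)       // record every blocking event
	defer runtime.SetBlockProfileRate(0) // switch profiling off again
	work()
	var buf bytes.Buffer
	if err := pprof.Lookup("block").WriteTo(&buf, 0); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := blockProfile(contend)
	if err != nil {
		fmt.Println("block profile failed:", err)
		return
	}
	fmt.Println("block profile size in bytes:", len(data))
}
```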
General Tips
- Profile one thing at a time (more will hurt performance too much, probably)
Profiling Non-Benchmarks
Option 1: the pkg/profile package, calling defer profile.Start().Stop() (worked pretty well in testing)
Option 2: net/http/pprof debug endpoints (see repo)
- can pass the url to the debug endpoints to the pprof tools instead of a file
- only has a performance hit when you run pprof against the url
- use a load-testing tool to generate load
Too much time in syscalls means you should probably be buffering to avoid doing syscalls as often
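A sketch of the debug-endpoint approach using the standard library. Importing net/http/pprof registers its handlers on http.DefaultServeMux; the example exercises the index handler in-process with httptest so it runs without opening a port. The indexStatus helper is hypothetical, and the localhost:6060 address in the comment is an assumption for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	_ "net/http/pprof" // registers /debug/pprof/ handlers on http.DefaultServeMux
)

// indexStatus requests the pprof index page from the default mux and
// returns the HTTP status code.
func indexStatus() int {
	req := httptest.NewRequest(http.MethodGet, "/debug/pprof/", nil)
	rec := httptest.NewRecorder()
	http.DefaultServeMux.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("GET /debug/pprof/ ->", indexStatus())
	// A long-running service would instead serve the default mux, e.g.
	//   http.ListenAndServe("localhost:6060", nil)
	// and the profile URL would be handed to pprof in place of a file.
}
```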
Execution Tracer
Can’t catch really fast or rare events with pprof
Instead of asking the runtime for data, configure the runtime to log/report data
- Performance hit is not nearly as predictable as pprof
- start with trace.Start(...), defer trace.Stop(), usually in main, but can do in a specific function
go tool trace trace.out (needs Chrome)
- w/s zoom in/out
- a/d move left/right
- ? for help
- Can see “mark assist” things that slow down goroutines as they help the garbage collector
- Can see when goroutines get tagged for GC assist (only those that ask for a malloc can get co-opted)
- STW (stop everything so we can free memory)
- SWEEP (do the memory cleanup)
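A minimal sketch of recording a trace with runtime/trace. The work function is a hypothetical workload that spawns goroutines and allocates so that scheduling and GC events have a chance to appear; record is a helper invented for this example.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime/trace"
	"sync"
)

// work is a hypothetical workload with a few allocating goroutines.
func work() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			var buf []int
			for j := 0; j < 10000; j++ {
				buf = append(buf, j*id)
			}
		}(i)
	}
	wg.Wait()
}

// record runs f with the execution tracer enabled and returns the
// trace data.
func record(f func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return nil, err
	}
	f()
	trace.Stop() // flushes remaining events into buf
	return buf.Bytes(), nil
}

func main() {
	data, err := record(work)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := os.WriteFile("trace.out", data, 0o644); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("wrote trace.out; view with: go tool trace trace.out")
}
```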
Compiler Optimizations
- 3 Big ones: escape analysis, inlining, dead code elimination
- Go compiler was built from the Plan 9 compiler
Escape Analysis
Can return pointers out of a function (and the compiler will allocate on the heap)
But can take things that are usually thought of as on the heap and put them on the stack instead
- e.g., slice that is created and used entirely within a function
- can’t use how a value is declared to determine heap/stack
- ...interface{} can make things escape
Can make the compiler tell you about it w/ -gcflags="-m", or -gcflags="-m -m" for even more info
There is a threshold of stack growth vs. heap allocation where in-function things might escape anyhow
sizes matter for escape analysis
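To make the stack-versus-heap distinction concrete, here is a small sketch that counts allocations with testing.AllocsPerRun. The two functions are hypothetical, and go:noinline is used on the assumption that it keeps the compiler from inlining them and changing the result. Building with -gcflags="-m" should print the compiler's own escape decisions for comparison.

```go
package main

import (
	"fmt"
	"testing"
)

// sumLocal uses a constant-size slice that never leaves the function,
// so escape analysis can keep it on the stack.
//
//go:noinline
func sumLocal() int {
	s := make([]int, 64)
	for i := range s {
		s[i] = i
	}
	total := 0
	for _, v := range s {
		total += v
	}
	return total
}

// newCounter returns a pointer to a local, so the value outlives the
// call and has to be allocated on the heap.
//
//go:noinline
func newCounter() *int {
	c := 42
	return &c
}

var (
	intSink int
	ptrSink *int
)

// measure reports average allocations per call for both functions.
func measure() (local, escaped float64) {
	local = testing.AllocsPerRun(100, func() { intSink = sumLocal() })
	escaped = testing.AllocsPerRun(100, func() { ptrSink = newCounter() })
	return local, escaped
}

func main() {
	local, escaped := measure()
	fmt.Println("allocs/run, local slice:", local)
	fmt.Println("allocs/run, returned pointer:", escaped)
}
```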
Inlining
Only for leaf functions
- Leafs are functions that do not call other functions
Only for “small” functions – ratio of preamble overhead to work done matters
- e.g., getters, setters
- “can inline” vs. “inlining call to” in the compiler output
go1.11 seems to be able to chain inlines, and can inline bits that might panic
can control inlining
- -gcflags="-l" disables
- -gcflags="-l -l" more aggressive, maybe bigger binaries (note: NOT the same as no -l (?))
- -gcflags="-l -l -l" more aggressive, maybe bigger binaries again, might be buggy
- -gcflags="-l=4" in go1.11 will enable experimental mid stack inlining (bigger and slower right now)
things like
won’t affect inlining like they do for escape analysis -
but (public) package variables will
Dead Code Elimination
- Can use some conditional compilation to set consts to different values so that sometimes the compiler can optimize away code (e.g., debug stuff)
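A tiny sketch of the idea. The debug constant and double function are hypothetical; with debug fixed at false the compiler can drop the branch entirely. In a real project the constant would typically be defined in two files selected by build tags, which is the conditional compilation part.

```go
package main

import "fmt"

// debug is a compile-time constant; flipping it (for example via a
// build-tagged file) changes which code survives compilation.
const debug = false

func double(x int) int {
	if debug {
		// With debug == false this branch is provably dead.
		fmt.Println("double called with", x)
	}
	return x * 2
}

func main() {
	fmt.Println(double(21))
}
```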
Tips and Tricks
for detecting all kinds of spaces -
“concurrency of gophers”
reduce allocations
- a la Read([]byte) (int, error) vs. Read() ([]byte, error)
- avoid string/[]byte conversions
- conversion creates a copy
- pick one representation and stick with it
there are compiler optimizations for doing v, ok := m[string(key)] (and that specifically, not when doing the conversion before map lookup)
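The map-lookup case can be checked with testing.AllocsPerRun, as in this sketch. The lookup helper and the 64-byte key are hypothetical; the long key is chosen on the assumption that a short one could avoid allocating for unrelated reasons.

```go
package main

import (
	"bytes"
	"fmt"
	"testing"
)

// lookup converts the key inside the index expression, which is the
// form the compiler recognizes and handles without copying the bytes.
func lookup(m map[string]int, key []byte) (int, bool) {
	v, ok := m[string(key)]
	return v, ok
}

// lookupAllocs reports the average allocations per lookup.
func lookupAllocs() float64 {
	key := bytes.Repeat([]byte("g"), 64)
	m := map[string]int{string(key): 1}
	return testing.AllocsPerRun(100, func() { lookup(m, key) })
}

func main() {
	fmt.Println("allocs per lookup:", lookupAllocs())
}
```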
string concatenation is complicated
- initial benchmark was appending to properly sized []byte > +-concat > Sprintf > Fprintf
preallocate slices if the length is known
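A quick sketch comparing append with and without preallocation. The grow and prealloc functions are hypothetical; results are stored in a package-level variable on the assumption that this keeps both slices on the heap so the counts are comparable.

```go
package main

import (
	"fmt"
	"testing"
)

// grow appends into a nil slice, so the backing array is reallocated
// several times as it fills up.
func grow(n int) []int {
	var s []int
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

// prealloc sizes the backing array once, up front.
func prealloc(n int) []int {
	s := make([]int, 0, n)
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

var sink []int

// compare reports average allocations per call for each approach.
func compare(n int) (grown, pre float64) {
	grown = testing.AllocsPerRun(100, func() { sink = grow(n) })
	pre = testing.AllocsPerRun(100, func() { sink = prealloc(n) })
	return grown, pre
}

func main() {
	grown, pre := compare(1000)
	fmt.Println("allocs/run growing:", grown, "preallocated:", pre)
}
```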
goroutines can seem free, but there are memory concerns (each one has a stack of at least 2K)
- make sure there is a stop condition so the memory can get cleaned up
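One common way to give a goroutine a stop condition is a context, sketched below. The worker and runWorker functions are hypothetical; the unbuffered jobs channel is an assumption that makes the example deterministic.

```go
package main

import (
	"context"
	"fmt"
)

// worker processes jobs until ctx is cancelled, then reports how many
// it handled and exits so its stack can be reclaimed.
func worker(ctx context.Context, jobs <-chan int, done chan<- int) {
	processed := 0
	for {
		select {
		case <-ctx.Done():
			done <- processed
			return
		case <-jobs:
			processed++
		}
	}
}

// runWorker starts a worker, feeds it three jobs, cancels it, and
// returns the number of jobs it processed.
func runWorker() int {
	ctx, cancel := context.WithCancel(context.Background())
	jobs := make(chan int)
	done := make(chan int)
	go worker(ctx, jobs, done)
	for i := 0; i < 3; i++ {
		jobs <- i // unbuffered: returns once the worker has taken the job
	}
	cancel()
	return <-done
}

func main() {
	fmt.Println("jobs processed before stop:", runWorker())
}
```

Without the ctx.Done case the worker would block on jobs forever once the sender went away, and its stack would never be freed.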
async filesystem IO is not widely available
- filesystem IO will consume an OS thread while running
instead of passing around large[]byte
when possible -
- use
for network IO - use worker pool channels
- use
defer-ing a function that is too-small can be expensive
avoid finalizers (non-deterministic, since connected to GC)
avoid, or at least minimize, cgo
- <- find video about go execution tracer (uses mandelbrot example from workshop)