GopherCon 2018 - Performance Tuning Workshop Notes

: 27 August 2018
: conference, golang, gophercon2018, notes
: https://github.com/davecheney/gophercon2018-performance-tuning-workshop

These are some notes from my experiences at the GopherCon 2018 Performance Tuning Workshop, run by Dave Cheney and Francesc Campoy. I don’t expect these will be laid out in any particularly useful way; I am mostly taking them so I can remember some of the bits I found most useful in the future.

The contents of this post are licensed under https://creativecommons.org/licenses/by-sa/4.0/ as the basis of the content is this workshop repository: https://github.com/davecheney/gophercon2018-performance-tuning-workshop

Mechanical Sympathy

Important to know how a thing works to work well with it.
Don’t need to be a writer/designer/maintainer of the language itself, but having a cursory understanding helps

The Case for Performance Tuning

CPU-Land

Huge benefits of Moore’s law (doubling of # of transistors on a chip every 18 months for ~50 years) might be slowing down
- If computers were still going to get faster at the same rate, maybe we don’t need to worry so much
- Performance now increasing at maybe ~3% per year, meaning a doubling takes more than 20 years
- CPU frequencies have been basically static for a while, largely due to heat management
The trend has been to increase the number of cores, rather than increasing the performance of a single core
- Parallelism does not get you nearly as far as chip performance; often you can’t parallelize enough to effectively use more cores (after a point)
Most improvements have been architectural
- e.g., Out of Order Execution, Speculative Execution
- Optimization has tended towards doing things in bulk well, but one-offs not as well

Memory Land

Memory capacities are basically unbounded still, but access times are not improving nearly as quickly
- Getting data out of main memory is Slow
- Caches are faster; speed degrades as you fall down cache levels, and cache sizes are limited (especially L1/L2, whose size is limited by speed-of-light concerns and the need to have stable access times for CPU instruction scheduling)

Takeaways

For increased performance, will have to write for discrete units, not just CPU
Compiled languages will enable better working with CPU branch prediction (vs. interpreted)
Need languages that allow efficient reasoning/working with memory (not hiding it behind a VM or excessive pointer chasing)

Benchmarking

To improve performance, you need to be able to measure performance
- Benchmark on idle machines (not shared, no web browsing, etc)
- Disable power saving and other performance-hurting factors
- Watch out for VMs and shared cloud hosting (noisy)
Run benchmarks multiple times (both before and after) to get consistent results

Testing Package Tools

    func BenchmarkFoo(b *testing.B) {
        for i := 0; i < b.N; i++ {
            Foo()
        }
    }

By default, benchmarks are not run by go test; add -bench flag to run them
- Go will run the regular tests before benchmarks; can disable with -run
- Can pick core counts w/ -cpu=1,2,4,8, e.g. (actually adjusting GOMAXPROCS)
Reports b.N and ns/op (ns per loop body)
- b.N starts small and is increased until the benchmark runs long enough to time reliably (~1s by default)
- there is an open issue about setting b.N directly instead of having it determined automatically
Can increase -benchtime to get a high-enough number of iterations (M * 1k) to be valid/stable
Can increase -count to get more runs
Tool from Russ Cox go get golang.org/x/perf/cmd/benchstat can tell you how stable a set of benchmark results are (with -count, e.g.)
- tee to a file, and then benchstat the file to get stats

Getting Solid Before/After Samples

Compile test binaries w/ go test -c && mv foo.test foo.old-version
- When running the compiled test binary, pass flags as -test.foo instead of -foo (e.g., -test.bench=. instead of -bench=.)
Make sure there are tests to make sure you don’t break things while chasing optimizations
Can pass both out files to benchstat to get comparative stats (including p-value and number of non-outliers)

Benchmark Setups

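    func BenchmarkFoo(b *testing.B) {
        expensiveSetup() // one-time setup, excluded from the measurement
        b.ResetTimer()   // discard the time spent so far
        for i := 0; i < b.N; i++ {
            b.StopTimer() // pause the clock for per-iteration setup
            loopIterSetup()
            b.StartTimer() // resume timing just before the code under test
            Foo()
        }
    }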

Benchmark Allocations

Use b.ReportAllocs() in the benchmark function, or force on with -benchmem
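
As a rough sketch of what allocation reporting looks like, the program below drives a benchmark through testing.Benchmark so it can run as a plain program rather than under go test. The allocate function, the sink variable, and runAllocBench are made up for illustration; only b.ReportAllocs() comes from the notes above.

```go
package main

import (
	"fmt"
	"testing"
)

// sink is a package-level variable so the allocation below is observably used
// and cannot be optimized away.
var sink []byte

// allocate is a hypothetical function under test; it makes one heap allocation.
func allocate() []byte {
	return make([]byte, 1024)
}

// benchAllocate has the same shape as a BenchmarkXxx function in a _test.go file.
func benchAllocate(b *testing.B) {
	b.ReportAllocs() // per-benchmark equivalent of passing -benchmem
	for i := 0; i < b.N; i++ {
		sink = allocate()
	}
}

// runAllocBench runs the benchmark outside of "go test".
func runAllocBench() testing.BenchmarkResult {
	return testing.Benchmark(benchAllocate)
}

func main() {
	res := runAllocBench()
	fmt.Println(res.String(), res.MemString())
	fmt.Println("allocs/op:", res.AllocsPerOp(), "bytes/op:", res.AllocedBytesPerOp())
}
```

In a real package the benchmark would live in a _test.go file and be run with go test -bench; the allocs/op and B/op columns are the ones to watch while chasing allocations.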

Beware Compiler Optimizations

(popcount example – see examples in original repo)
Compiler: “You can’t prove I didn’t do it” – inlined away (no actual function call, just the loop)
- math/bits package has the built-in instructions (when available)
To fix: make sure the compiler can’t prove the result is unused (store it in a local var inside the loop, then after the loop assign that value to an exported package variable)
- private package variable could (in the future) still result in optimizing away if the compiler starts supporting that
- don’t want to just do //go:noinline; want the inlining performance benefits – just fix the benchmark
-gcflags "-S" to see assembly (can be useful to see if things got optimized away)
Also beware of only-constants in benchmarks
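
A minimal sketch of the fix described above, assuming a popcount routine similar in spirit to the one in the workshop repo (the exact code there may differ). The names popcnt, benchPopcnt, and Result are illustrative.

```go
package main

import (
	"fmt"
	"testing"
)

const (
	m1  = 0x5555555555555555
	m2  = 0x3333333333333333
	m4  = 0x0f0f0f0f0f0f0f0f
	h01 = 0x0101010101010101
)

// popcnt counts the set bits in x using a standard bit-twiddling routine.
func popcnt(x uint64) uint64 {
	x -= (x >> 1) & m1
	x = (x & m2) + ((x >> 2) & m2)
	x = (x + (x >> 4)) & m4
	return (x * h01) >> 56
}

// Result is an exported package variable, so the compiler cannot prove that
// nothing else reads the value stored in it.
var Result uint64

// benchPopcnt keeps the call alive without disabling inlining.
func benchPopcnt(b *testing.B) {
	var r uint64 // local accumulator keeps the loop body cheap
	for i := 0; i < b.N; i++ {
		r = popcnt(uint64(i)) // input varies, so it is not an only-constants benchmark
	}
	Result = r // publish once, after the loop
}

func main() {
	fmt.Println(testing.Benchmark(benchPopcnt))
	fmt.Println(popcnt(0xFF)) // prints 8
}
```

If the reported ns/op is implausibly small (a fraction of a nanosecond), that is a hint the work was optimized away; checking the assembly with -gcflags "-S" can confirm it.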

Profiling

Benchmarks help give you an idea of where you are, but not which parts could or should be optimized
pprof
- beware sampling issues
- runtime/pprof package
- go tool pprof (not recommended – drop-in replacement exists)
pprof profiling methods
- cpu
- memory (two modes)
- blocking
- mutex contention

CPU Profiling

100/s (every 10ms) records a stacktrace from the go runtime
Benchmark w/ -cpuprofile=cpu.pb.gz
- Has a speed cost (not unexpected)
Examine w/ go tool pprof cpu.pb.gz
- top on its own is not that useful, b/c runtime internals (which we can’t change) dominate the list
- what we really want is cumulative time (top -cum)
- can just do web (opens a graph)
pprof drop-in replacement go get github.com/google/pprof
- pprof ... instead of go tool pprof ...
- pprof -http=:6060 ... to open a web server w/ flame graphs and other cool stuff
- code view w/ timing on a per-line basis

Memory Profiling

For heap allocations only
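Samples a fraction of allocations rather than all of them (controlled by runtime.MemProfileRate; by default about one sample per 512 KiB allocated; can tweak)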
- tracks code paths that led to allocations, not what is using the memory allocated
-alloc_objects vs. -inuse_objects to detect different things (or no flag)
- no flag gives memory size
- the other flags give counts
Apparently it can be used to detect leaks, but not quite straightforward
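
For a program that is not a benchmark, a heap profile can be written with runtime/pprof. This is a sketch only; allocateSome, retained, and captureHeapProfile are illustrative names, and the file name is an arbitrary choice.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
)

// retained keeps allocations reachable so they count as "in use".
var retained [][]byte

// allocateSome is a hypothetical workload that allocates n buffers of 64 KiB.
func allocateSome(n int) {
	for i := 0; i < n; i++ {
		retained = append(retained, make([]byte, 64<<10))
	}
}

// captureHeapProfile returns the current heap profile.
func captureHeapProfile() ([]byte, error) {
	runtime.GC() // bring the allocation statistics up to date first
	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	allocateSome(100)
	data, err := captureHeapProfile()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := os.WriteFile("mem.pb.gz", data, 0o644); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("wrote mem.pb.gz")
}
```

Because the profile records the call stacks that allocated, the pprof output points at allocateSome here, not at whatever is holding on to the memory afterwards.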

Block Profiling

Similar to CPU profile but records time spent waiting on a shared resource
Possibly useful for identifying concurrency bottlenecks
- channel waits
- mutex waits (can also do mutex profiling)

Mutex Profiling

Like block profiling for mutexes specifically
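
Block and mutex profiles are off by default and have to be enabled in the runtime before they record anything. A minimal sketch, assuming a made-up contended workload (contend) and helper names (enableContentionProfiling, captureProfile):

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

// enableContentionProfiling turns on both profiles at the most detailed rate.
func enableContentionProfiling() {
	runtime.SetBlockProfileRate(1)     // record every blocking event
	runtime.SetMutexProfileFraction(1) // record every mutex contention event
}

// contend is a hypothetical workload: several goroutines fight over one mutex.
func contend(workers, rounds int) int {
	var (
		mu    sync.Mutex
		wg    sync.WaitGroup
		total int
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := 0; r < rounds; r++ {
				mu.Lock()
				total++
				time.Sleep(10 * time.Microsecond) // hold the lock so others wait
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return total
}

// captureProfile serializes one of the runtime's named profiles.
func captureProfile(name string) ([]byte, error) {
	p := pprof.Lookup(name)
	if p == nil {
		return nil, fmt.Errorf("unknown profile %q", name)
	}
	var buf bytes.Buffer
	if err := p.WriteTo(&buf, 0); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	enableContentionProfiling()
	contend(4, 100)
	for _, name := range []string{"block", "mutex"} {
		data, err := captureProfile(name)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		if err := os.WriteFile(name+".pb.gz", data, 0o644); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}
}
```

A rate of 1 is convenient for a small experiment but adds overhead; a real service would likely use a coarser rate.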

General Tips

Profile one thing at a time (more will hurt performance too much, probably)

Profiling Non-Benchmarks

Option 1: github.com/pkg/profile calling defer profile.Start().Stop() (worked pretty well in testing)
Option 2: golang.org/... (see repo)
Too much time in syscalls means you should probably be buffering to avoid doing syscalls as often
"net/http/pprof"
- can pass the url to the debug endpoints to the pprof tools instead of a file
- only has a performance hit when you run pprof against the url
go-wrk to generate load

Execution Tracer

Can’t catch really fast or rare events with pprof
Instead of asking the runtime for data, configure the runtime to log/report data
- Performance hit is not nearly as predictable as pprof
- start with trace.Start, defer trace.Stop(), usually in main, but can do in a specific function
go tool trace trace.out (needs Chrome)
- w / s zoom in/out
- a / d move left/right
- ? for help
- Can see “mark assist” things that slow down goroutines as they help the garbage collector
- Can see when goroutines get tagged for GC assist (only those that ask for a malloc can get co-opted)
- STW (stop the world: everything pauses so the GC can do its work)
- SWEEP (do the memory cleanup)
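
A minimal sketch of collecting a trace around a piece of work with runtime/trace. fanOut and captureTrace are hypothetical helpers; the notes above start the trace in main with trace.Start and a deferred trace.Stop, which amounts to the same thing.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime/trace"
	"sync"
)

// fanOut is a hypothetical workload: n goroutines each sum 0..999.
func fanOut(n int) int {
	results := make([]int, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				results[i] += j
			}
		}(i)
	}
	wg.Wait()

	total := 0
	for _, r := range results {
		total += r
	}
	return total
}

// captureTrace records an execution trace while work runs.
func captureTrace(work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return nil, err
	}
	work()
	trace.Stop() // returns only after all trace data has been written
	return buf.Bytes(), nil
}

func main() {
	data, err := captureTrace(func() { fanOut(8) })
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := os.WriteFile("trace.out", data, 0o644); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("wrote trace.out")
}
```

The file can then be viewed with go tool trace trace.out.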

Compiler Optimizations

3 Big ones: escape analysis, inlining, dead code elimination
Go compiler was built from the Plan9 compiler

Escape Analysis

Can return pointers out of a function (and the compiler will allocate on the heap)
But can take things that are usually thought of as on the heap and put them on the stack instead
- e.g., slice that is created and used entirely within a function
- can’t use make/new to determine heap/stack
- ...interface{} can make things escape
Can make the compiler tell you about it w/ -gcflags=-m, or -gcflags="-m -m" for even more info
There is a threshold of stack growth vs. heap allocation where in-function things might escape anyhow
const vs var sizes matter for escape analysis
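
A small sketch that contrasts a value that has to escape with one that can stay on the stack. The type and function names are made up, and the exact compiler decisions depend on the Go version, so the program measures allocations instead of assuming them; building with -gcflags=-m shows the compiler's own reasoning.

```go
package main

import (
	"fmt"
	"testing"
)

type point struct{ x, y int }

// sink is package-level, so anything stored in it is visible outside the
// function that created it.
var sink *point

// newPoint returns a pointer to a local variable. The value must outlive the
// call, so it is expected to be heap allocated once it is stored in sink.
func newPoint(x, y int) *point {
	p := point{x, y}
	return &p
}

// sumLocal builds a slice with a constant size that never leaves the
// function, which makes it a candidate for stack allocation despite make.
func sumLocal() int {
	s := make([]int, 8)
	for i := range s {
		s[i] = i
	}
	total := 0
	for _, v := range s {
		total += v
	}
	return total
}

func main() {
	heap := testing.AllocsPerRun(1000, func() { sink = newPoint(1, 2) })
	stack := testing.AllocsPerRun(1000, func() { _ = sumLocal() })
	fmt.Println("allocs per call, newPoint:", heap)
	fmt.Println("allocs per call, sumLocal:", stack)
}
```

Changing the constant 8 to a value only known at run time is a quick way to see how const vs. var sizes change the outcome.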

Inlining

Only for leaf functions
- Leaf functions are functions that do not call other functions
Only for “small” functions – ratio of preamble overhead to work done matters
- e.g., getters, setters
- “can inline” vs. “inlining call to”
go1.11 seems to be able to chain inlines, and can inline bits that might panic
can control inlining
- -gcflags=-l disables
- -gcflags="-l -l" more aggressive, maybe bigger binaries (note: NOT the same as no -l (?))
- -gcflags="-l -l -l" more aggressive, maybe bigger binaries again, might be buggy
- -gcflags="-l=4" in go1.11 will enable experimental mid stack inlining (bigger and slower right now)
things like const vs. var won’t affect inlining like they do for escape analysis
but (public) package variables will
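
To see the two kinds of compiler messages mentioned above, a tiny program with getter/setter style methods is enough. This is an illustrative sketch (counter and total are made-up names); building it with go build -gcflags=-m should report which functions can be inlined and where calls were inlined, though the exact wording and decisions vary by Go version.

```go
package main

import "fmt"

type counter struct{ n int }

// Value is a small leaf method (a getter): no calls, almost no work, so the
// call overhead would dominate if it were not inlined.
func (c *counter) Value() int { return c.n }

// Add is a small leaf method as well (a setter-style update).
func (c *counter) Add(d int) { c.n += d }

// total calls both small methods; these call sites are where the
// "inlining call to" messages would be expected.
func total(values []int) int {
	var c counter
	for _, v := range values {
		c.Add(v)
	}
	return c.Value()
}

func main() {
	fmt.Println(total([]int{1, 2, 3})) // prints 6
}
```

Rebuilding with -gcflags=-l and comparing benchmark numbers gives a feel for how much the inlining is worth.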

Dead Code Elimination

Can use some conditional compilation to set consts to different values so that sometimes the compiler can optimize away code (e.g., debug stuff)
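
A sketch of the idea, assuming a debug constant that would normally be defined in two files selected by build tags (one setting it to true, one to false); that file layout, and the names debug, debugf, logged, and sum, are assumptions made for this example.

```go
package main

import "fmt"

// debug is a compile-time constant. Because it is const and false, the
// compiler can prove the branch in debugf is never taken and drop it.
const debug = false

// logged counts debug messages; it only changes when debug is true.
var logged int

// debugf prints a message in debug builds and is effectively empty otherwise.
func debugf(format string, args ...interface{}) {
	if debug {
		logged++
		fmt.Printf(format, args...)
	}
}

// sum is ordinary code sprinkled with debug output.
func sum(values []int) int {
	total := 0
	for _, v := range values {
		debugf("adding %d\n", v)
		total += v
	}
	return total
}

func main() {
	fmt.Println(sum([]int{1, 2, 3})) // prints 6
}
```

If debug were a var instead of a const, the compiler could not assume its value and the branch would have to stay in the binary.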

Tips and Tricks

unicode.IsSpace(r) for detecting all kinds of whitespace (it takes a rune)
“concurrency of gophers”
reduce allocations
- a la Read([]byte) (int, error) vs. Read() ([]byte, error)
avoid []byte to string conversions
- conversion creates a copy
- pick one representation and stick with it
there are compiler optimizations for doing v, ok := m[string(key)] (and that specifically, not when doing the conversion before map lookup)
string concatenation is complicated
- initial benchmark was appending to properly sized []byte > + concatenation > Sprintf > Fprintf
preallocate slices if the length is known
goroutines can seem free, but there are memory concerns (each one has a stack of at least 2K)
- make sure there is a stop condition so the memory can get cleaned up
async filesystem IO is not widely available
- filesystem IO will consume an OS thread while running
use io.Reader / io.Writer instead of passing around large []byte when possible
Timeouts
- use SetReadDeadline / SetWriteDeadline / SetDeadline for network IO
- use worker pool channels
defer-ing a function that is too-small can be expensive
avoid finalizers (non-deterministic, since connected to GC)
avoid, or at least minimize, cgo
http://justforfunc.com <- find video about go execution tracer (uses mandelbrot example from workshop)
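
A few of the allocation tips above, sketched in one small program: the m[string(key)] lookup form, preallocating a slice of known length, and building a string by appending to a properly sized []byte. The function names are made up for illustration, and the actual savings should be confirmed with benchmarks and -benchmem.

```go
package main

import "fmt"

// lookupAll counts how many keys are present in m. The conversion is written
// directly inside the index expression, which is the form the compiler can
// optimize to avoid copying the bytes into a new string.
func lookupAll(m map[string]int, keys [][]byte) int {
	found := 0
	for _, key := range keys {
		if _, ok := m[string(key)]; ok {
			found++
		}
	}
	return found
}

// squares preallocates the result because the final length is known up front,
// so append never has to grow the backing array.
func squares(n int) []int {
	out := make([]int, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, i*i)
	}
	return out
}

// join concatenates parts by appending to a []byte sized for the final result.
func join(parts ...string) string {
	size := 0
	for _, p := range parts {
		size += len(p)
	}
	buf := make([]byte, 0, size)
	for _, p := range parts {
		buf = append(buf, p...)
	}
	return string(buf)
}

func main() {
	m := map[string]int{"go": 1, "gopher": 2}
	keys := [][]byte{[]byte("go"), []byte("rust")}
	fmt.Println(lookupAll(m, keys)) // prints 1
	fmt.Println(squares(4))         // prints [0 1 4 9]
	fmt.Println(join("go", "pher")) // prints gopher
}
```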