GopherCon 2018 - Building and Scaling Reddit's Ad Server
These are some notes from my experience at GopherCon 2018. I don’t expect they will be laid out in any particularly useful way; I am mostly taking them so I can remember some of the bits I found most useful in the future.
Note: Reddit's mascot is called “Snoo”
Ad Serving Goals
- Scale
- Speed
  - SLA: < 30ms p99
- Auction in real time for every request
- Pacing: don’t blow budgets inappropriately
Before Rewrite
- Call out to 3rd party for each request
  - Slow
  - Not customizable
  - No observability
Current System
- Tech: Prometheus, RocksDB, Kafka, Thrift, and other stuff
- Call to AdSelector for every request
  - Call to enrichment service for more data (about user, request, etc.)
  - Select ads based on enriched data
  - Log to Kafka for reporting, billing, monitoring, etc.
- Browser fires a tracker event
  - Service writes the impression to Kafka
- Background service (Spark) reads from Kafka and writes back to the enrichment service
- Background service (Spark) reads from Kafka and writes pacing data back to AdSelector
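The per-request flow above can be sketched roughly like this. All names here (`Enrichment`, `selectAd`, `logEvent`, the field names) are illustrative stand-ins, not Reddit's actual API; the real enrichment call and Kafka producer are replaced with trivial local functions:

```go
package main

import "fmt"

// Hypothetical request/ad types, just to show the shape of the flow.
type Request struct {
	UserID string
}

type Enrichment struct {
	Interests []string
}

type Ad struct {
	ID      string
	Keyword string
	Bid     float64
}

// enrich stands in for the call to the enrichment service.
func enrich(req Request) Enrichment {
	return Enrichment{Interests: []string{"golang"}}
}

// selectAd runs a toy auction: highest bid among ads matching an interest wins.
func selectAd(e Enrichment, candidates []Ad) *Ad {
	var best *Ad
	for i := range candidates {
		ad := &candidates[i]
		for _, interest := range e.Interests {
			if ad.Keyword == interest && (best == nil || ad.Bid > best.Bid) {
				best = ad
			}
		}
	}
	return best
}

// logEvent stands in for producing a record to Kafka.
func logEvent(event string, ad *Ad) {
	fmt.Printf("kafka <- %s %s\n", event, ad.ID)
}

func main() {
	ads := []Ad{{ID: "a1", Keyword: "golang", Bid: 1.2}, {ID: "a2", Keyword: "cats", Bid: 9.9}}
	e := enrich(Request{UserID: "u1"})
	if winner := selectAd(e, ads); winner != nil {
		logEvent("impression", winner)
	}
}
```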
AdSelector
- Business Rules
- Auction
- Horizontal Scaling
EventTracker
- SLA <1ms p99
- Reliable
Enrichment Service
- SLA <4ms p99
- gorocksdb
- prefix scan to find data to return
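The prefix scan has a seek-then-iterate shape, which the sketch below shows over an in-memory sorted slice standing in for the RocksDB key space (with gorocksdb the equivalent is seeking the iterator to the prefix and iterating while keys still match it). Key names are hypothetical:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// prefixScan returns all keys starting with prefix, mirroring the
// seek-then-iterate shape of a RocksDB prefix scan.
func prefixScan(sortedKeys []string, prefix string) []string {
	// "Seek": binary search to the first key >= prefix.
	i := sort.SearchStrings(sortedKeys, prefix)
	var out []string
	// Iterate while keys still carry the prefix.
	for ; i < len(sortedKeys) && strings.HasPrefix(sortedKeys[i], prefix); i++ {
		out = append(out, sortedKeys[i])
	}
	return out
}

func main() {
	// Hypothetical enrichment keys, one logical record per user attribute.
	keys := []string{"user:1:age", "user:1:geo", "user:2:geo", "zz:other"}
	sort.Strings(keys)
	fmt.Println(prefixScan(keys, "user:1:"))
}
```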
Other Tools
- Reporting
- Admin
Benefits of Go
- Increased velocity
  - Strong conventions
  - Fast compilation
- Performance is great
- Easy to focus on business logic
Lessons Learned
Getting Production-Ready
- Logging, metrics, etc. all over the place
- Changing transport layers was hard (Thrift)
- Considered go-micro, gizmo, and go-kit
- Picked go-kit for flexibility (see Peter Bourgon’s talk at GopherCon 2015)
- Use a framework/toolkit
Deployment
- Need rapid iteration to roll out in a safe way (keep using the old service along the way)
- Started with a blind proxy to the 3rd party, so A/B tests were possible
  - Start by still returning the old results
  - Log results and do offline analysis for comparison
  - Then switch over when ready
- Go makes rapid iteration easy and safe
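The blind-proxy comparison step could look like this minimal sketch: callers always get the old, trusted answer, while the new system's answer is compared and mismatches are logged for offline analysis. All names here are hypothetical:

```go
package main

import (
	"fmt"
	"log"
)

// shadow serves the old system's response while also exercising the new
// system on the same request and logging disagreements.
func shadow(req string, oldSvc, newSvc func(string) string) string {
	oldResp := oldSvc(req)
	// In production the comparison would happen off the hot path (e.g. in a
	// goroutine, or via logged samples); it is inline here to keep the
	// sketch deterministic.
	if newResp := newSvc(req); newResp != oldResp {
		log.Printf("mismatch for %q: old=%q new=%q", req, oldResp, newResp)
	}
	return oldResp
}

func main() {
	oldSvc := func(r string) string { return "old:" + r }
	newSvc := func(r string) string { return "new:" + r }
	// The caller only ever sees the old service's answer.
	fmt.Println(shadow("ad-request", oldSvc, newSvc))
}
```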
Debugging Latency
- Distributed tracing
  - Client: extract identifiers and send them along w/ the request
  - Server: extract the identifier from the request and inject it into the context
  - Issue with tracing: Thrift didn’t have headers
Handling Slowness / Timeouts
- Add timeouts to the context (and check them) on the client side
- Need to make the server not do unnecessary work also
  - Pass deadlines around
  - For Thrift, include the deadline in request headers, then process it in the server to inject a timeout
- Use deadlines within & across services
Ensuring New Features Don’t Hurt
- Need to make sure SLAs don’t get violated by new features/refactors
- Bender (from Pinterest) for load testing - supports HTTP and Thrift, but not yet gRPC
- Write benchmarks alongside tests
- Use load testing and benchmarks
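A benchmark written alongside tests might look like the sketch below. `buildTargetingKey` is a hypothetical hot-path helper; in a real repo the benchmark would live in a `_test.go` file as `func BenchmarkBuildTargetingKey(b *testing.B)` and run via `go test -bench=.`, but `testing.Benchmark` lets the same sketch run as a plain program:

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// buildTargetingKey is a hypothetical hot-path helper worth benchmarking.
func buildTargetingKey(country, device string) string {
	var b strings.Builder
	b.Grow(len(country) + 1 + len(device))
	b.WriteString(country)
	b.WriteByte(':')
	b.WriteString(device)
	return b.String()
}

func main() {
	// testing.Benchmark runs a benchmark function outside the `go test`
	// harness and reports iterations and ns/op.
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_ = buildTargetingKey("US", "ios")
		}
	})
	fmt.Println(res)
}
```

Tracking numbers like these across refactors is what catches an SLA regression before a load test (or production) does.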