Make trio.lowlevel.checkpoint much faster #1613

njsmith · 2020-06-14T04:42:52Z

This is a bit of a kluge, and will hopefully be cleaned up in the
future when we overhaul KeyboardInterrupt (gh-733) and/or the
cancellation checking APIs (gh-961). But 'checkpoint()' is a common
operation, and this speeds it up by ~7x, so... not bad for a ~4 line
change.

Before this change:

await checkpoint(): ~24k/second
await cancel_shielded_checkpoint(): ~180k/second

After this change:

await checkpoint(): ~170k/second
await cancel_shielded_checkpoint(): ~180k/second
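
To put these figures in per-call terms, here is a small sketch that converts the throughput numbers quoted above into approximate per-checkpoint cost and computes the speedup ratio. The helper names `per_call_us` and `speedup` are made up for this illustration and are not part of Trio; the inputs are just the rounded figures from this description.

```python
def per_call_us(calls_per_second):
    """Approximate cost of one call in microseconds, given a throughput."""
    return 1_000_000 / calls_per_second


def speedup(before_rate, after_rate):
    """Ratio of throughput after the change to throughput before it."""
    return after_rate / before_rate


# Rounded figures quoted in the PR description (calls per second).
CHECKPOINT_BEFORE = 24_000
CHECKPOINT_AFTER = 170_000
SHIELDED = 180_000

print(f"checkpoint() before: {per_call_us(CHECKPOINT_BEFORE):.1f} µs/call")
print(f"checkpoint() after:  {per_call_us(CHECKPOINT_AFTER):.1f} µs/call")
print(f"cancel_shielded_checkpoint(): {per_call_us(SHIELDED):.1f} µs/call")
print(f"speedup: {speedup(CHECKPOINT_BEFORE, CHECKPOINT_AFTER):.1f}x")
```

With these inputs, a full checkpoint drops from roughly 42 µs to roughly 6 µs per call, which is close to the cost of the shielded variant.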

Benchmark script:

import time
import trio

LOOP_SIZE = 1_000_000

async def main():
    start = time.monotonic()
    for _ in range(LOOP_SIZE):
        await trio.lowlevel.checkpoint()
        # To benchmark the shielded variant, use this line instead:
        # await trio.lowlevel.cancel_shielded_checkpoint()
    end = time.monotonic()
    print(f"{LOOP_SIZE / (end - start):.2f} schedules/second")

trio.run(main)


codecov · 2020-06-14T04:51:08Z

Codecov Report

Merging #1613 into master will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #1613   +/-   ##
=======================================
  Coverage   99.69%   99.69%           
=======================================
  Files         110      110           
  Lines       13840    13840           
  Branches     1058     1059    +1     
=======================================
  Hits        13798    13798           
  Misses         27       27           
  Partials       15       15

Impacted Files	Coverage Δ
trio/_core/_run.py	`99.76% <100.00%> (+<0.01%)`	⬆️
trio/_core/_ki.py	`100.00% <0.00%> (ø)`
trio/_core/_thread_cache.py	`100.00% <0.00%> (ø)`
trio/_core/_wakeup_socketpair.py	`100.00% <0.00%> (ø)`

Profiling a user-written DNS microbenchmark in gh-1595 showed that UDP sendto operations were spending a substantial amount of time in _resolve_address, which is responsible for resolving any hostname given in the sendto call. This is weird, since the benchmark wasn't using hostnames, just a raw IP address.

Surprisingly, it turns out that calling getaddrinfo at all is quite expensive, even if you give it an already-resolved plain IP address so there's no work for it to do:

In [1]: import socket

In [2]: %timeit socket.getaddrinfo("127.0.0.1", 80)
5.84 µs ± 53.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [3]: %timeit socket.inet_aton("127.0.0.1")
187 ns ± 1.5 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

I thought getaddrinfo would be effectively free in this case, but apparently I was wrong. This doesn't really matter for TCP connections, since they only pass through this path once per connection, but UDP goes through here on every packet, so the overhead adds up quickly.

Solution: add a fast path to _resolve_address for when the user's address is already resolved, so we skip all the work. On the benchmark in gh-1595 on my laptop, this PR takes us from ~7000 queries/second to ~9000 queries/second, so a ~30% speedup.

The patch is a bit more complicated than I expected. There are three parts:

- The fast path itself.
- Skipping unnecessary checkpoints in _resolve_address, including when we're on the fast path: this is an internal function and it turns out that almost all our callers are already doing checkpoints, so there's no need to do another checkpoint inside _resolve_address. Even with the fast checkpoints from gh-1613, the fast path plus a checkpoint is still only ~8000 queries/second on the DNS microbenchmark, versus ~9000 queries/second without the checkpoint.
- _resolve_address used to always normalize IPv6 addresses into 4-tuples, as a side effect of how getaddrinfo works. The fast path doesn't call getaddrinfo, so the tests needed adjusting so that they don't expect this normalization, and so that our tests for the getaddrinfo normalization code don't hit the fast path.
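
The fast-path idea described above can be sketched with the standard library alone. This is only an illustration under stated assumptions, not Trio's actual `_resolve_address`: the names `is_numeric_host` and `resolve_address_sketch` are hypothetical, the use of `ipaddress.ip_address` to detect literal addresses is my own choice, and the returned boolean exists only so the example can show which path was taken.

```python
import ipaddress
import socket


def is_numeric_host(host, family):
    """Return True if `host` is a literal IP address matching `family`."""
    if not isinstance(host, str):
        return False
    try:
        parsed = ipaddress.ip_address(host)
    except ValueError:
        # Not a literal address (e.g. a hostname): needs real resolution.
        return False
    expected = socket.AF_INET if parsed.version == 4 else socket.AF_INET6
    return family == expected


def resolve_address_sketch(family, address):
    """Hypothetical stand-in for an address resolver with a fast path.

    Returns (resolved_address, used_fast_path).
    """
    host, port = address[0], address[1]
    if isinstance(port, int) and is_numeric_host(host, family):
        # Fast path: already resolved, so skip getaddrinfo entirely.
        # Note that the address is returned unchanged, so an IPv6
        # 2-tuple is NOT normalized into a 4-tuple here.
        return address, True
    # Slow path: let getaddrinfo do the resolution and normalization.
    infos = socket.getaddrinfo(host, port, family, socket.SOCK_DGRAM)
    return infos[0][4], False


print(resolve_address_sketch(socket.AF_INET, ("127.0.0.1", 80)))
print(resolve_address_sketch(socket.AF_INET, ("127.0.0.1", "80")))
```

The second call passes the port as a string, so it falls through to getaddrinfo. The comment on the fast path mirrors the third point in the list: because getaddrinfo is skipped, IPv6 addresses come back exactly as the caller supplied them.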

njsmith force-pushed the fast-checkpoint branch from 880e3af to e59050e · June 14, 2020 04:48

oremanj merged commit 6c13b13 into python-trio:master Jun 14, 2020

njsmith mentioned this pull request Jun 15, 2020

Add a fast path for the simplest cases of socket address resolution #1617

Merged

oremanj mentioned this pull request Jun 26, 2020

Either optimize yield_briefly or de-optimize yield_briefly_no_cancel #70

Closed
