Notes for Homa implementation in Linux:
---------------------------------------
* Thoughts on making TCP and Homa play better together:
* Goals:
* Keep the NIC tx queue from growing long.
* Share bandwidth fairly between TCP and Homa
* If one protocol is using a lot less bandwidth, give it preference
for transmission?
* Balance queue lengths for the protocols?
* Approach #1: analogous to "fair share" CPU scheduling
* Keep separate tx queues for Homa and TCP
* Try to equalize the queue lengths while pacing packets at
network speed?
* Keep track of lengths for each of many queues
* Each queue paces itself based on relative lengths
* Do all pacing centrally, call back to individual queues for output?
* Approach #2:
* Keep track of recent bandwidth consumed by each protocol; when there
is overload, restrict each protocol to its fraction of recent bandwidth.
* "Consumed" has to be measured in terms of bytes offered, not bytes
actually transmitted (otherwise a protocol could get "stuck" at a low
transmission rate?).
* Approach #3: token bucket
* Use a token bucket for each protocol with 50% of available bandwidth
(or maybe less?). Split any extra available bandwidth among the
protocols. Maybe adjust rates for the token buckets based on recent
traffic?
* Also consider the amount of data that is "stuck" in the NIC?
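* A minimal sketch of the per-protocol token bucket in approach #3 (struct
  and function names are hypothetical, not code from the module); each
  bucket refills at its protocol's assigned share of link bandwidth, and
  splitting leftover bandwidth between the protocols is omitted:

      #include <linux/types.h>

      struct proto_bucket {
              u64 tokens;       /* bytes that may be transmitted now */
              u64 burst_bytes;  /* cap on accumulated tokens */
              u64 rate_bps;     /* refill rate: this protocol's share, bits/sec */
              u64 last_ns;      /* time of the previous refill */
      };

      /* Refill the bucket for the elapsed time, then decide whether a
       * packet of len bytes may be sent now (overflow handling omitted). */
      static bool bucket_may_send(struct proto_bucket *b, u64 now_ns, u32 len)
      {
              u64 new_tokens = (now_ns - b->last_ns) * (b->rate_bps / 8)
                              / 1000000000ULL;

              b->last_ns = now_ns;
              b->tokens += new_tokens;
              if (b->tokens > b->burst_bytes)
                      b->tokens = b->burst_bytes;
              if (b->tokens < len)
                      return false;
              b->tokens -= len;
              return true;
      }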
* Notes on Linux qdiscs:
* Remedies to consider for the performance problems at 100 Gbps, where
one tx channel gets very backed up:
* Implement zero-copy on output in order to reduce memory bandwidth
consumption (presumably this will increase throughput?)
* Reserve one channel for the pacer, and don't send non-paced packets
on that channel; this should eliminate the latency problems caused
by short messages getting queued on that channel
* Rework cp_node so that there aren't separate senders and receivers on the
client. Instead, have each client thread send, then conditionally receive,
then send again, etc. Hmmm, I believe there is a reason why this won't
work, but I have forgotten what it is.
* (July 2024) Found throughput problem in 2-node "--workload 50000 --one-way"
benchmark. The first packet for message N doesn't get sent until message
N-1 has been completely transmitted. This results in a gap between the
completion of message N and the arrival of a grant for message N+1, which
wastes throughput. Perhaps send an empty data packet for message N+1 so that
the grant can return earlier? Or, make a bigger change:
* Either a message is entirely scheduled or entirely unscheduled.
* For a scheduled message, send a 0-length data packet immediately,
before starting to copy in from user space.
* The round-trip for a grant can now happen in parallel with copying
bytes in from user space (it takes 16 usec to copy in 60 KB at 30 Gbps).
* The grant will be sent more quickly by the server because it doesn't
have to process a batch of data packets first
* homa_grant_recalc is being invoked a *lot* and it takes a lot of time each
call (13 usec/call, duty cycle > 100%):
* Does it need to relock every ranked RPC every call?
* About 25% of calls are failed attempts to promote an unranked RPC
(perhaps because the slots for its host are all taken?)
* Could it abort early if there is no incoming headroom?
* Notes on refactoring of grant mechanism:
* Need to reimplement FIFO grants
* Replace fifo_grant_increment with fifo_grant_interval
* Refactor so that the msgin structure is always properly initialized?
* The current implementation can execute RPCs multiple times on the server:
* The initial RPC arrives, but takes the slow path to SoftIRQ, which can
take many ms.
* Meantime the client retries, and the retry succeeds: the request is
processed, the response is returned to the client, and the RPC is
freed.
* Eventually SoftIRQ wakes up to handle the original packet, which re-creates
the RPC and it gets serviced a second time.
* SoftIRQ processing can lock out kernel-to-user copies; add a preemption
mechanism where the copying code can set a flag that it needs the lock,
then SoftIRQ releases the lock until the flag is clear?
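* One possible shape for this flag-based preemption (a rough sketch; the
  struct and function names are hypothetical):

      #include <linux/spinlock.h>
      #include <linux/atomic.h>

      struct rpc_lock_state {
              spinlock_t lock;
              atomic_t copy_waiting;   /* set while a copier wants the lock */
      };

      /* SoftIRQ side (BHs are already disabled here, so plain spin_lock
       * suffices): between packets, briefly drop the lock if a copier is
       * waiting; nothing guarantees the copier wins the race, so this is
       * best-effort. */
      static void softirq_maybe_yield(struct rpc_lock_state *st)
      {
              if (atomic_read(&st->copy_waiting)) {
                      spin_unlock(&st->lock);
                      spin_lock(&st->lock);
              }
      }

      /* Copy-to-user side (process context): announce the need for the
       * lock, take it with the _bh variant, clear the flag when done. */
      static void copy_lock(struct rpc_lock_state *st)
      {
              atomic_set(&st->copy_waiting, 1);
              spin_lock_bh(&st->lock);
      }

      static void copy_unlock(struct rpc_lock_state *st)
      {
              spin_unlock_bh(&st->lock);
              atomic_set(&st->copy_waiting, 0);
      }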
* For W3, throttle_min_bytes is a problem: a significant fraction of all
transmitted bytes aren't being counted; as a result, the NIC queue
can build up. Reducing throttle_min_bytes from 1000 to 200 reduced P99
short message latency from 250 us to 200 us.
* Don't understand why W3 performance is so variable under Gen3. Also,
it's worth comparing tthoma output for W4 and W3 under Gen2; e.g.,
W3 has way more active outgoing messages than W4.
* When requesting ACKs, must eventually give up and delete RPC.
* Eliminate msgin.scheduled: just compute when needed?
* Replace msgin.bytes_remaining with bytes_received?
* Include the GSO packet size in the initial packet, so the receiver
can predict by how much grants will get rounded up.
* Out-of-order incoming grants seem to be pretty common.
* Round up bpage fragments to 64-byte boundaries
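* Both rounding notes above reduce to rounding a value up to a multiple; a
  quick illustration with the kernel's ALIGN() and roundup() helpers (the
  function names here are just for illustration):

      #include <linux/kernel.h>    /* ALIGN(), roundup() */
      #include <linux/types.h>

      /* Round a bpage fragment length up to the next 64-byte boundary
       * (64 is a power of two, so ALIGN applies). */
      static inline u32 bpage_frag_round(u32 len)
      {
              return ALIGN(len, 64);
      }

      /* Round a grant offset up to the next multiple of the sender's GSO
       * packet size (which need not be a power of two). */
      static inline u32 grant_round(u32 offset, u32 gso_size)
      {
              return roundup(offset, gso_size);
      }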
* CloudLab cluster issues:
* amd272 had a problem where all xmits from core 47 incurred a 1-2 ms
delay; power-cycling it fixed the problem.
* Notes on performance data from buffer benchmarking 8/2023:
* Restricting buffer space ("buffers"):
* Homa W5: unrestricted usage 3.8 MB/20 nodes
Performance degrades significantly below 80% of this (3 MB/20 nodes)
Homa still has better performance than DCTCP down to 1.9 MB
* DCTCP W5: unrestricted usage 5.8 MB/20 nodes
Performance degrades significantly below 25% of this (1.4 MB/20 nodes)
* TCP W5: unrestricted usage 13.1 MB/20 nodes (all available space)
Performance degrades significantly below 80% of this (10.5 MB/20 nodes)
* Varying the number of nodes ("nodes"):
* Homa usage seems to increase gradually with number of nodes
* TCP/DCTCP usage decreases with number of nodes
* Varying the link utilization ("link_util"):
* Homa: 50% utilization drops min. buffer usage by ~4x relative to 80%
Large messages suffer more at higher utilization
* DCTCP benefits less from lower link utilization: min buffer usage
drops by 35% at 50% utilization vs. 80% utilization
All message sizes benefit as utilization drops
* At 50% utilization, Homa's min buffer usage is less than DCTCP's! (recheck)
* TCP benefits more than DCTCP: 3x reduction in min buffers at 50% vs. 80%
* Varying the DCTCP marking threshold:
* Min buffer usage increases monotonically with marking threshold
* Slowdown is relatively stable (a bit higher at largest and smallest
thresholds).
* For small messages, P50 is better with low threshold, but P99 is higher
* Dynamic windows for Homa:
* Min buffer usage increases monotonically with larger windows
* Slowdown drops, then increases as window size increases
* Unclear whether there is a point where dwin has a better combination
of performance and buffer usage; maybe in the range of 3-4x rttbytes?
* Varying Homa's overcommitment:
* Results don't seem consistent with other benchmarks
* For W5, dropping overcommitment to 4 increases slowdown by 3x, drops
min buffer usage by 2x
* Even an overcommit of 7 increases slowdown by 2x?
* Varying unsched_bytes:
* For W4 and W5, increasing unsched_bytes from 40K to 120K increases
min buffer usage by 2x
* Issues for next round of full-rack testing:
* Verify that unrestricted slowdown curves for the buffer benchmarks match
the "standard" curves.
* Some metrics seem to vary a lot:
* Unrestricted buffer usage in "nodes" vs. "buffers" vs. "link_util"
* 10% drop in Homa performance: "nodes" vs. "buffers" vs. "link_util"
* DCTCP buffer usage in "nodes" first drops, then starts to rise;
recheck?
* Need more nodes to see if buffer usage/node eventually plateaus (need to
run all tests at 40 nodes)
* Explore dynamic windows more thoroughly
* Overcommit results look fishy
* Lower the threshold for assuming 0 min buffer usage (perhaps base on us
rather than MB?)
* Create a "scan_metrics" method to scan .metrics files and look for
outliers in multiple different ways?
* Compute a "P99 average" by computing average of worst 1% slowdowns
in 10 buckets (100?) according to message length?
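* A rough sketch of that "P99 average" (plain C; names invented here):
  bucket samples by message length, average the worst 1% of slowdowns in
  each bucket, then average across non-empty buckets:

      #include <stddef.h>

      struct bucket {
              double *slowdowns;   /* sorted largest-first within the bucket */
              size_t count;
      };

      static double p99_average(const struct bucket *buckets, size_t nbuckets)
      {
              double total = 0.0;
              size_t used = 0, i, j;

              for (i = 0; i < nbuckets; i++) {
                      size_t n = buckets[i].count;
                      size_t worst = (n / 100) ? n / 100 : 1;  /* worst 1%, >= 1 */
                      double sum = 0.0;

                      if (n == 0)
                              continue;
                      for (j = 0; j < worst; j++)
                              sum += buckets[i].slowdowns[j];
                      total += sum / worst;
                      used++;
              }
              return used ? total / used : 0.0;
      }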
* unsched for w3 has huge variation in rtts; 40k is particularly bad. It
isn't keeping up on BW either; try raising client-max? Also, there seem
to be too many metrics files (the problem is with active_nodes in cperf.py?)
* recvmsg doesn't seem to return an address if there is an error? May
need to return the address in a different place?
* IPv6 issues:
* See if error checking made syscalls slower.
* GSO always uses SKB_GSO_TCPV6; sometimes it should be V4.
* Refactor of granting mechanism:
* Eliminate grant_increment: change to fifo_grant_increment instead
* grant_non_fifo may need to grant to a message that is also receiving
regular grants
* What if a message receives data beyond incoming, which completes the
message?
* Pinning memory: see mm.h and mm/gup.c
* get_user_page
* get_user_pages
* pin_user_page (not sure the difference from get_user_page)
* Performance-related tasks:
* Rework FIFO granting so that it doesn't consider homa->max_overcommit
(just find the oldest message that doesn't have a pity grant)? Also,
it doesn't look like homa_grant_fifo is keeping track of pity grants
precisely; perhaps add another RPC field for this?
* Re-implement the duty-cycle mechanism. Use a generalized pacer to
control grants:
* Parameters:
* Allowable throughput
* Max accumulation of credits
* Methods:
* Request (current time, amount) (possibly 2 stages: isItOk and doIt?)
* Or, just reduce the link speed and let the pacer handle this?
* Perhaps limit the number of polling threads per socket, to solve
the problems with having lots of receiver threads?
* Move some reaping to the pacer? It has time to spare
* Figure out why TCP W2 P99 gets worse with higher --client-max
* See if turning off c-states allows shorter polling intervals?
* Modify cp_node's TCP to use multiple connections per client-server pair
* Why is TCP beating Homa on cp_server_ports? Perhaps TCP servers are getting
>1 request per kernel call?
* Things to do:
* If a socket is closed with unacked RPCs, the peer will have to send a
long series of NEED_ACKS (which must be ignored because the socket is
gone) before finally reaping the RPCs. Perhaps have a "no such socket"
packet type?
* Reap most of a message before getting an ack? To do this, sender
must include "received offset" in grants; then sender can free
everything up to the latest received offset.
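* A sketch of that early-reaping idea (field names are invented; assumes
  grants carry a "received offset" and that tx packets are queued in offset
  order, with the caller holding whatever lock protects the queue):

      #include <linux/skbuff.h>

      struct tx_skb_info {          /* hypothetical, stored in skb->cb */
              u32 end_offset;       /* offset just past this packet's data */
      };

      static void reap_below(struct sk_buff_head *packets, u32 received_offset)
      {
              struct sk_buff *skb;

              while ((skb = skb_peek(packets)) != NULL) {
                      struct tx_skb_info *info = (struct tx_skb_info *)skb->cb;

                      if (info->end_offset > received_offset)
                              break;
                      __skb_unlink(skb, packets);
                      kfree_skb(skb);
              }
      }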
* Try more aggressive retries (e.g. if a missing packet is sufficiently
long ago, don't wait for timeout).
* Eliminate hot spots involving NAPI:
* Arrange for incoming bursts to be divided into batches where
alternate batches do their NAPI on 2 different cores.
* To do this, use TCP for Homa!
* Send Homa packets using TCP, and use different ports to force
different NAPI cores
* Interpose on the TCP packet reception hooks, and redirect
real TCP packets back to TCP.
* Consider replacing grantable list with a heap?
* Unimplemented interface functions.
* Learn about CONFIG_COMPAT and whether it needs to be supported in
struct proto and struct proto_ops.
* Learn about security stuff, and functions that need to be called for this.
* Learn about memory management for sk_buffs: how many is it OK to have?
* See tcp_out_of_memory.
* Eventually initialize homa.next_client_port to something random
* Define a standard mechanism for returning errors:
* Socket not supported on server (or server process ends while
processing request).
* Server timeout
* Is it safe to use non-locking skb queue functions?
* Is the RCU usage for sockets safe? In particular, how long is it safe
to use a homa_sock returned by homa_find_socket? Could it be deleted from
underneath us? This question may no longer be relevant, given the
implementation of homa_find_socket.
* Can a packet input handler be invoked multiple times concurrently?
* What is audit_sockaddr? Do I need to invoke it when I read sockaddrs
from user space?
* When a struct homa is destroyed, all of its sockets end up in an unsafe
state in terms of their socktab links.
* Clean up ports and ips in unit_homa_incoming.c
* Plug into Linux capability mechanism (man(7) capabilities)
* Don't return any errors on sends?
* Homa-RAMCloud doesn't retransmit bytes if it transmitted other bytes
recently; should HomaModule do the same? Otherwise, will retransmit
for requests whose service time is just about equal to the resend timer.
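* A possible shape for that suppression (hypothetical field names): record
  when data for the RPC was last transmitted and ignore a RESEND that
  arrives too soon afterwards:

      #include <linux/jiffies.h>
      #include <linux/types.h>

      struct rpc_tx_state {
              unsigned long last_xmit_jiffies;   /* updated on every data xmit */
      };

      /* Only honor a RESEND if nothing has been transmitted for this RPC
       * within the last min_gap jiffies. */
      static bool should_retransmit(struct rpc_tx_state *st,
                                    unsigned long min_gap)
      {
              return time_after(jiffies, st->last_xmit_jiffies + min_gap);
      }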
* Check tcp_transmit_skb to make sure we are doing everything we need to
do with skbuffs (e.g., update sk_wmem_alloc?)
* Add support for cgroups (e.g. to manage memory allocation)
* Questions for Linux experts:
* If an interrupt arrives after a thread has been woken up to receive an
incoming message, but before the kernel call returns, is it possible
for the kernel call to return EINTR, such that the message isn't received
and no one else has woken up to handle it?
* OK to call kmalloc at interrupt level?
Yes, but must specify GFP_ATOMIC as argument, not GFP_KERNEL; the operation
will not sleep, which means it could fail more easily.
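* For example (a trivial illustration, not module code):

      #include <linux/slab.h>

      /* At interrupt level GFP_KERNEL may sleep, so GFP_ATOMIC is required;
       * callers must tolerate NULL since atomic allocation fails more easily. */
      static void *alloc_from_irq(size_t size)
      {
              return kmalloc(size, GFP_ATOMIC);
      }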
* Is it OK to retain struct dst_entry pointers for a long time? Can they
ever become obsolete (e.g. because routes change)? It looks like the
"obsolete" field will take care of this. However, a socket is used to
create a dst entry; what if that socket goes away?
* Can flows and struct_dst's be shared across sockets? What information
must be considered to make these things truly safe for sharing (e.g.
source network port?)?
* Source addresses for things like creating flows: can't just use a single
value for this host? Could be different values at different times?
* How to lock between user-level and bottom-half code?
* Must use a spin lock
* Must invoke spin_lock_bh and spin_unlock_bh, which disable bottom halves
(softirqs) as well as acquiring the lock.
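* A minimal illustration of this pattern (the lock name is hypothetical):

      #include <linux/spinlock.h>

      static DEFINE_SPINLOCK(state_lock);   /* protects state shared with BH code */

      /* Process-context (syscall) side: the _bh variants disable bottom
       * halves so the softirq handler can't deadlock against us. */
      static void update_from_syscall(void)
      {
              spin_lock_bh(&state_lock);
              /* ... touch shared state ... */
              spin_unlock_bh(&state_lock);
      }

      /* Bottom-half (softirq) side: BHs are already disabled, so the plain
       * lock calls are sufficient. */
      static void update_from_softirq(void)
      {
              spin_lock(&state_lock);
              /* ... touch shared state ... */
              spin_unlock(&state_lock);
      }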
* What's the difference between bh_lock_sock and bh_lock_sock_nested?
* Is there a platform-independent way to read a high-frequency clock?
* get_cycles appears to perform an RDTSC
* cpu_khz holds the clock frequency
* do_gettimeofday takes 750 cycles!
* current_kernel_time takes 120 cycles
* sched_clock returns ns, takes 70 cycles
* jiffies variable, plus HZ variable: HZ is 250
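* For reference, the two cheap options look like this (illustrative only):

      #include <linux/timex.h>         /* get_cycles() */
      #include <linux/sched/clock.h>   /* sched_clock() */
      #include <linux/printk.h>

      static void sample_clocks(void)
      {
              cycles_t c = get_cycles();   /* raw TSC ticks; cpu_khz gives the rate on x86 */
              u64 ns = sched_clock();      /* nanoseconds, platform-independent */

              pr_info("cycles=%llu ns=%llu\n",
                      (unsigned long long)c, (unsigned long long)ns);
      }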
* What is the purpose of skbuff clones? Cloning appears to be the recommended
way to transmit a packet while retaining a copy for retransmission.
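* The usual pattern looks roughly like this (the xmit callback is just a
  placeholder for the real transmit path):

      #include <linux/skbuff.h>
      #include <linux/errno.h>

      /* Transmit a clone and keep the original queued in case the data has
       * to be retransmitted; the clone shares the packet data but has its
       * own struct sk_buff. */
      static int xmit_with_copy(struct sk_buff *skb,
                                int (*xmit)(struct sk_buff *skb))
      {
              struct sk_buff *clone = skb_clone(skb, GFP_ATOMIC);

              if (!clone)
                      return -ENOMEM;
              return xmit(clone);
      }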
* If there is an error in ip_queue_xmit, does it free the packet?
* The answer appears to be "yes", and Homa contains code to check this
and log if not.
* Is there a better way to compute packet hashes than Homa's approach
in gro_complete?
* Notes on timers:
* hrtimers execute at irq level, not softirq
* Functions to tell what level is current: in_irq(), in_softirq(), in_task()
* Notes on managing network buffers:
* tcp_sendmsg_locked (tcp.c) invokes sk_stream_alloc_skb, which returns 0
if memory is running short. If this happens, it invokes sk_stream_wait_memory
* tcp_stream_memory_free: its result indicates if there's enough memory for
a stream to accept more data
* Receiving packets (tcp_v4_rcv -> tcp_v4_do_rcv -> tcp_rcv_state_process
in tcp_ipv4.c)
* There is a variable tcp_memory_allocated, but I can't find where it
is increased; unclear exactly what this variable means.
* There is a variable tcp_memory_pressure, plus functions
tcp_enter_memory_pressure and tcp_leave_memory_pressure. The variable
appears to be modified only by those 2 functions.
* Couldn't find any direct calls to tcp_enter_memory_pressure, but a
pointer is stored in the struct proto.
* That pointer is invoked from sk_stream_alloc_skb and
sk_enter_memory_pressure.
* sk_enter_memory_pressure is invoked from sk_page_frag_refill and
__sk_mem_raise_allocated.
* __sk_mem_raise_allocated is invoked from __sk_mem_schedule
* __sk_mem_schedule is invoked from sk_wmem_schedule and sk_rmem_schedule
* Miscellaneous information:
* For raw sockets: "man 7 raw"
* Per-cpu data structures: linux/percpu.h, percpu-defs.h
* What happens when a socket is closed?
* socket.c:sock_close
* socket.c:sock_release
* proto_ops.release -> af_inet.c:inet_release
* af_inet.c:inet_release doesn't appear to do anything relevant to Homa
* proto.close -> sock.c:sk_common_release?
* proto.unhash
* sock_orphan
* sock_put (decrements ref count, frees)
* What happens in a sendmsg syscall (UDP)?
* socket.c:sys_sendmsg
* socket.c:__sys_sendmsg
* socket.c:___sys_sendmsg
* Copy to msghdr and control info to kernel space
* socket.c:sock_sendmsg
* socket.c:sock_sendmsg_nosec
* proto_ops.sendmsg -> af_inet.c:inet_sendmsg
* Auto-bind socket, if not bound
* proto.sendmsg -> udp.c:udp_sendmsg
* Long method ...
* ip_output.c:ip_make_skb
* Seems to collect data for the datagram?
* __ip_append_data
* udp.c:udp_send_skb
* Creates UDP header
* ip_output.c:ip_send_skb
* ip_local_out
* Call stack down to driver for TCP sendmsg
tcp.c: tcp_sendmsg
tcp.c: tcp_sendmsg_locked
tcp_output.c: tcp_push
tcp_output.c: __tcp_push_pending_frames
tcp_output.c: tcp_write_xmit
tcp_output.c: __tcp_transmit_skb
ip_output.c: ip_queue_xmit
ip_output.c: ip_local_out
ip_output.c: __ip_local_out
ip_output.c: ip_output
ip_output.c: ip_finish_output
ip_output.c: ip_finish_output_gso
ip_output.c: ip_finish_output2
neighbor.h: neigh_output
neighbor.c: neigh_resolve_output
dev.c: dev_queue_xmit
dev.c: __dev_queue_xmit
dev.c: dev_hard_start_xmit
dev.c: xmit_one
netdevice.h: netdev_start_xmit
netdevice.h: __netdev_start_xmit
vlan_dev.c: vlan_dev_hard_start_xmit
dev.c: dev_queue_xmit
dev.c: __dev_queue_xmit
dev.c: __dev_xmit_skb
sch_generic.c: sch_direct_xmit
dev.c: dev_hard_start_xmit
dev.c: xmit_one
netdevice.h: netdev_start_xmit
netdevice.h: __netdev_start_xmit
en_tx.c: mlx5e_xmit
* Call stack for packet input handling (this is only approximate):
en_txrx.c: mlx5e_napi_poll
en_rx.c: mlx5e_poll_rx_cq
en_rx.c: mlx5e_handle_rx_cqe
dev.c: napi_gro_receive
dev.c: dev_gro_receive
??? protocol-specific handler
dev.c: napi_skb_finish
dev.c: napi_gro_complete
dev.c: netif_receive_skb_internal
dev.c: enqueue_to_backlog
.... switch to softirq core ....
dev.c: process_backlog
dev.c: __netif_receive_skb
dev.c: __netif_receive_skb_core
dev.c: deliver_skb
ip_input.c: ip_rcv
ip_input.c: ip_rcv_finish
ip_input.c: dst_input
homa_plumbing.c: homa_softirq
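* Saved build command, kept for reference: preprocesses homa_offload.c
  (gcc -E, output to homa_offload.e) and then runs objtool on
  homa_offload.o: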
gcc -g -Wp,-MMD,/users/ouster/homaModule/.homa_offload.o.d -nostdinc -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/compiler-version.h -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -fmacro-prefix-map=./= -std=gnu11 -fshort-wchar -funsigned-char -fno-common -fno-PIE -fno-strict-aliasing -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -fcf-protection=none -m64 -falign-jumps=1 -falign-loops=1 -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup -mtune=generic -mno-red-zone -mcmodel=kernel -Wno-sign-compare -fno-asynchronous-unwind-tables -mindirect-branch=thunk-extern -mindirect-branch-register -mindirect-branch-cs-prefix -mfunction-return=thunk-extern -fno-jump-tables -fpatchable-function-entry=16,16 -fno-delete-null-pointer-checks -O2 -fno-allow-store-data-races -fstack-protector-strong -fno-omit-frame-pointer -fno-optimize-sibling-calls -fno-stack-clash-protection -pg -mrecord-mcount -mfentry -DCC_USING_FENTRY -falign-functions=16 -fno-strict-overflow -fno-stack-check -fconserve-stack -Wall -Wundef -Werror=implicit-function-declaration -Werror=implicit-int -Werror=return-type -Werror=strict-prototypes -Wno-format-security -Wno-trigraphs -Wno-frame-address -Wno-address-of-packed-member -Wmissing-declarations -Wmissing-prototypes -Wframe-larger-than=2048 -Wno-main -Wvla -Wno-pointer-sign -Wcast-function-type -Wno-stringop-overflow -Wno-array-bounds -Wno-alloc-size-larger-than -Wimplicit-fallthrough=5 -Werror=date-time -Werror=incompatible-pointer-types -Werror=designated-init -Wenum-conversion -Wextra -Wunused -Wno-unused-but-set-variable -Wno-unused-const-variable -Wno-packed-not-aligned -Wno-format-overflow -Wno-format-truncation -Wno-stringop-truncation -Wno-override-init -Wno-missing-field-initializers -Wno-type-limits -Wno-shift-negative-value -Wno-maybe-uninitialized -Wno-sign-compare -Wno-unused-parameter -g -gdwarf-4 -g -DMODULE -DKBUILD_BASENAME='"homa_offload"' -DKBUILD_MODNAME='"homa"' -D__KBUILD_MODNAME=kmod_homa -E -o /users/ouster/homaModule/homa_offload.e /users/ouster/homaModule/homa_offload.c ; ./tools/objtool/objtool --hacks=jump_label --hacks=noinstr --hacks=skylake --retpoline --rethunk --stackval --static-call --uaccess --prefix=16 --module /users/ouster/homaModule/homa_offload.o