diff --git a/docs/README.md b/docs/README.md index c862198f2..8933802e2 100644 --- a/docs/README.md +++ b/docs/README.md @@ -17,12 +17,13 @@ support for TCMalloc. All users of TCMalloc should consult the following documentation resources: -* The [TCMalloc Quickstart](quickstart.md) covers downloading, installing, - building, and testing TCMalloc, including incorporating within your codebase. -* The [TCMalloc Overview](overview.md) covers the basic architecture of - TCMalloc, and how that may affect configuration choices. -* The [TCMalloc Reference](reference.md) covers the C and C++ TCMalloc API - endpoints. +* The [TCMalloc Quickstart](quickstart.md) covers downloading, installing, + building, and testing TCMalloc, including incorporating within your + codebase. +* The [TCMalloc Overview](overview.md) covers the basic architecture of + TCMalloc, and how that may affect configuration choices. +* The [TCMalloc Reference](reference.md) covers the C and C++ TCMalloc API + endpoints. More advanced usages of TCMalloc may find the following documentation useful: @@ -51,7 +52,7 @@ We've published several papers relating to TCMalloc optimizations: ## License -The TCMalloc library is licensed under the terms of the Apache -license. See LICENSE for more information. +The TCMalloc library is licensed under the terms of the Apache license. See +LICENSE for more information. Disclaimer: This is not an officially supported Google product. diff --git a/docs/compatibility.md b/docs/compatibility.md index 25f361d19..0b6e24bc5 100644 --- a/docs/compatibility.md +++ b/docs/compatibility.md @@ -4,41 +4,41 @@ This document details what we expect from well-behaved users. Any usage of TCMalloc libraries outside of these technical boundaries may result in breakage when upgrading to newer versions of TCMalloc. -Put another way: don't do things that make TCMalloc API maintenance -tasks harder. If you misuse TCMalloc APIs, you're on your own. +Put another way: don't do things that make TCMalloc API maintenance tasks +harder. If you misuse TCMalloc APIs, you're on your own. -Additionally, because TCMalloc depends on Abseil, Abseil's [compatibility -guidelines](https://abseil.io/about/compatibility) also apply. +Additionally, because TCMalloc depends on Abseil, Abseil's +[compatibility guidelines](https://abseil.io/about/compatibility) also apply. ## What Users Must (And Must Not) Do -* **Do not depend on a compiled representation of TCMalloc.** We do not - promise any ABI compatibility — we intend for TCMalloc to be built from - source, hopefully from head. The internal layout of our types may change at - any point, without notice. Building TCMalloc in the presence of different C++ - standard library types may change Abseil types, especially for pre-adopted - types (`string_view`, `variant`, etc) — these will become typedefs and - their ABI will change accordingly. -* **Do not rely on dynamic loading/unloading.** TCMalloc does not support - dynamic loading and unloading. -* **You may not open namespace `tcmalloc`.** You are not allowed to define - additional names in namespace `tcmalloc`, nor are you allowed to specialize - anything we provide. -* **You may not depend on the signatures of TCMalloc APIs.** You cannot take the - address of APIs in TCMalloc (that would prevent us from adding overloads - without breaking you). You cannot use metaprogramming tricks to depend on - those signatures either. (This is also similar to the restrictions in the C++ - standard.) 
-* **You may not forward declare TCMalloc APIs.** This is actually a sub-point of - "do not depend on the signatures of TCMalloc APIs" as well as "do not open - namespace `tcmalloc`", but can be surprising. Any refactoring that changes - template parameters, default parameters, or namespaces will be a breaking - change in the face of forward-declarations. -* **Do not depend upon internal details.** This should go without saying: if - something is in a namespace or filename/path that includes the word - "internal", you are not allowed to depend upon it. It's an implementation - detail. You cannot friend it, you cannot include it, you cannot mention it or - refer to it in any way. -* **Include What You Use.** We may make changes to the internal `#include` graph - for TCMalloc headers - if you use an API, please include the relevant header - file directly. +* **Do not depend on a compiled representation of TCMalloc.** We do not + promise any ABI compatibility — we intend for TCMalloc to be built + from source, hopefully from head. The internal layout of our types may + change at any point, without notice. Building TCMalloc in the presence of + different C++ standard library types may change Abseil types, especially for + pre-adopted types (`string_view`, `variant`, etc) — these will become + typedefs and their ABI will change accordingly. +* **Do not rely on dynamic loading/unloading.** TCMalloc does not support + dynamic loading and unloading. +* **You may not open namespace `tcmalloc`.** You are not allowed to define + additional names in namespace `tcmalloc`, nor are you allowed to specialize + anything we provide. +* **You may not depend on the signatures of TCMalloc APIs.** You cannot take + the address of APIs in TCMalloc (that would prevent us from adding overloads + without breaking you). You cannot use metaprogramming tricks to depend on + those signatures either. (This is also similar to the restrictions in the + C++ standard.) +* **You may not forward declare TCMalloc APIs.** This is actually a sub-point + of "do not depend on the signatures of TCMalloc APIs" as well as "do not + open namespace `tcmalloc`", but can be surprising. Any refactoring that + changes template parameters, default parameters, or namespaces will be a + breaking change in the face of forward-declarations. +* **Do not depend upon internal details.** This should go without saying: if + something is in a namespace or filename/path that includes the word + "internal", you are not allowed to depend upon it. It's an implementation + detail. You cannot friend it, you cannot include it, you cannot mention it + or refer to it in any way. +* **Include What You Use.** We may make changes to the internal `#include` + graph for TCMalloc headers - if you use an API, please include the relevant + header file directly. diff --git a/docs/design.md b/docs/design.md index 45b17fe7c..595ca68fe 100644 --- a/docs/design.md +++ b/docs/design.md @@ -18,8 +18,7 @@ allocator that has the following characteristics: ## Usage -You use TCMalloc by specifying it as the `malloc` attribute on your binary rules -in Bazel. +You use TCMalloc by specifying it as the `malloc` attribute on your binary rules in Bazel. ## Overview @@ -80,15 +79,15 @@ size-class. The size-classes are designed to minimize the amount of memory that is wasted when rounding to the next largest size-class. When compiled with `__STDCPP_DEFAULT_NEW_ALIGNMENT__ <= 8`, we use a set of -sizes aligned to 8 bytes for raw storage allocated with `::operator new`. 
This +sizes aligned to 8 bytes for raw storage allocated with `::operator new`. This smaller alignment minimizes wasted memory for many common allocation sizes (24, 40, etc.) which are otherwise rounded up to a multiple of 16 bytes. On many compilers, this behavior is controlled by the `-fnew-alignment=...` flag. -When `__STDCPP_DEFAULT_NEW_ALIGNMENT__` is not -specified (or is larger than 8 bytes), we use standard 16 byte alignments for -`::operator new`. However, for allocations under 16 bytes, we may return an -object with a lower alignment, as no object with a larger alignment requirement -can be allocated in the space. +When +`__STDCPP_DEFAULT_NEW_ALIGNMENT__` is not specified (or is larger than 8 bytes), +we use standard 16 byte alignments for `::operator new`. However, for +allocations under 16 bytes, we may return an object with a lower alignment, as +no object with a larger alignment requirement can be allocated in the space. When an object of a given size is requested, that request is mapped to a request of a particular size-class using the @@ -297,9 +296,9 @@ available objects in the spans, more spans are requested from the back-end. When objects are [returned to the central free list](https://github.com/google/tcmalloc/blob/master/tcmalloc/central_freelist.cc), each object is mapped to the span to which it belongs (using the -[pagemap](#pagemap-and-spans)) and then released into that span. If all the objects that -reside in a particular span are returned to it, the entire span gets returned to -the back-end. +[pagemap](#pagemap-and-spans)) and then released into that span. If all the +objects that reside in a particular span are returned to it, the entire span +gets returned to the back-end. ### Pagemap and Spans diff --git a/docs/gperftools.md b/docs/gperftools.md index b05c8e4bd..1100b84ac 100644 --- a/docs/gperftools.md +++ b/docs/gperftools.md @@ -12,12 +12,12 @@ implementation itself. ## History Google open-sourced its memory allocator as part of "Google Performance Tools" -in 2005. At the time, it became easy to externalize code, but more difficult to -keep it in-sync with our internal usage, as discussed by Titus Winters’ in [his -2017 CppCon Talk](https://www.youtube.com/watch?v=tISy7EJQPzI) and the "Software -Engineering at Google" book. Subsequently, our internal implementation diverged -from the code externally. This project eventually was adopted by the community -as "gperftools." +in 2005. At the time, it became easy to externalize code, but more difficult to +keep it in-sync with our internal usage, as discussed by Titus Winters’ in +[his 2017 CppCon Talk](https://www.youtube.com/watch?v=tISy7EJQPzI) and the +"Software Engineering at Google" book. Subsequently, our internal implementation +diverged from the code externally. This project eventually was adopted by the +community as "gperftools." ## Differences @@ -68,4 +68,3 @@ exceptions: Over time, we have found that configurability carries a maintenance burden. While a knob can provide immediate flexibility, the increased complexity can cause subtle problems for more rarely used combinations. - diff --git a/docs/gwp-asan.md b/docs/gwp-asan.md index 8f0b17277..2b75604b4 100644 --- a/docs/gwp-asan.md +++ b/docs/gwp-asan.md @@ -7,8 +7,7 @@ GWP-ASan is a recursive acronym: "**G**WP-ASan **W**ill **P**rovide ## Why not just use ASan? 
-For many cases you **should** use -[ASan](https://clang.llvm.org/docs/AddressSanitizer.html) +For many cases you **should** use [ASan](https://clang.llvm.org/docs/AddressSanitizer.html) (e.g., on your tests). However, ASan comes with average execution slowdown of 2x (compared to `-O2`), binary size increase of 2x, and significant memory overhead. For these reasons, ASan is generally impractical for use in production @@ -17,23 +16,21 @@ designed for widespread use in production. ## How to use GWP-ASan -You can enable GWP-ASan by calling -`tcmalloc::MallocExtension::ActivateGuardedSampling()`. +You can enable GWP-ASan by calling `tcmalloc::MallocExtension::ActivateGuardedSampling()`. To adjust GWP-ASan's sampling rate, see [below](#what-should-i-set-the-sampling-rate-to). When GWP-ASan detects a heap memory error, it prints stack traces for the point of the memory error, as well as the points where the memory was allocated and (if applicable) freed. These stack traces can then be -symbolized -offline to get file names and line numbers. +symbolized offline to get file names and line +numbers. GWP-ASan will crash after printing stack traces. ## CPU and RAM Overhead -For guarded sampling rates above 100M (the default), CPU overhead is negligible. -For sampling rates as low as 8M, CPU overhead is under 0.5%. +For guarded sampling rates above 100M (the default), CPU overhead is negligible. For sampling rates as low as 8M, CPU overhead is under 0.5%. RAM overhead is up to 512 KB on x86\_64, or 4 MB on PowerPC. @@ -56,10 +53,10 @@ CPU overhead, we recommend a sampling rate of 8MB. - GWP-ASan has limited diagnostic information for buffer overflows within alignment padding, since overflows of this type will not touch a guard - page. - For write-overflows, GWP-ASan will still be able to detect the overflow - during deallocation by checking whether magic bytes have been overwritten, - but the stack trace of the overflow itself will not be available. + page. For write-overflows, + GWP-ASan will still be able to detect the overflow during deallocation by + checking whether magic bytes have been overwritten, but the stack trace of + the overflow itself will not be available. ## FAQs @@ -71,7 +68,7 @@ always a true bug, or a sign of hardware failure (see below). ### How do I know a GWP-ASan report isn't caused by hardware failure? The vast majority of GWP-ASan reports we see are true bugs, but occasionally -faulty hardware will be the actual cause of the crash. In general, if you see +faulty hardware will be the actual cause of the crash. In general, if you see the same GWP-ASan crash on multiple machines, it is very likely there's a true software bug. diff --git a/docs/overview.md b/docs/overview.md index d2a663f8e..8b4948d32 100644 --- a/docs/overview.md +++ b/docs/overview.md @@ -8,24 +8,25 @@ TCMalloc is designed to be more efficient at scale than other implementations. Specifically, TCMalloc provides the following benefits: -* Performance scales with highly parallel applications. -* Optimizations brought about with recent C++14 and C++17 standard enhancements, - and by diverging slightly from the standard where performance benefits - warrant. (These are noted within the [TCMalloc Reference](reference.md).) -* Extensions to allow performance improvements under certain architectures, and - additional behavior such as metric gathering. +* Performance scales with highly parallel applications. 
+* Optimizations brought about with recent C++14 and C++17 standard + enhancements, and by diverging slightly from the standard where performance + benefits warrant. (These are noted within the + [TCMalloc Reference](reference.md).) +* Extensions to allow performance improvements under certain architectures, + and additional behavior such as metric gathering. ## TCMalloc Cache Operation Mode TCMalloc may operate in one of two fashions: -* (default) per-CPU caching, where TCMalloc maintains memory caches local to - individual logical cores. Per-CPU caching is enabled when running TCMalloc on - any Linux kernel that utilizes restartable sequences (RSEQ). Support for RSEQ - was merged in Linux 4.18. -* per-thread caching, where TCMalloc maintains memory caches local to - each application thread. If RSEQ is unavailable, TCMalloc reverts to using - this legacy behavior. +* (default) per-CPU caching, where TCMalloc maintains memory caches local to + individual logical cores. Per-CPU caching is enabled when running TCMalloc + on any Linux kernel that utilizes restartable sequences (RSEQ). Support for + RSEQ was merged in Linux 4.18. +* per-thread caching, where TCMalloc maintains memory caches local to each + application thread. If RSEQ is unavailable, TCMalloc reverts to using this + legacy behavior. NOTE: the "TC" in TCMalloc refers to Thread Caching, which was originally a distinguishing feature of TCMalloc; the name remains as a legacy. @@ -35,21 +36,21 @@ locks for most memory allocations and deallocations. ## TCMalloc Features -TCMalloc provides APIs for dynamic memory allocation: `malloc()` using the C +TCMalloc provides APIs for dynamic memory allocation: `malloc()` using the C API, and `::operator new` using the C++ API. TCMalloc, like most allocation frameworks, manages this memory better than raw memory requests (such as through `mmap()`) by providing several optimizations: -* Performs allocations from the operating system by managing - specifically-sized chunks of memory (called "pages"). Having all of these - chunks of memory the same size allows TCMalloc to simplify bookkeeping. -* Devoting separate pages (or runs of pages called "Spans" in TCMalloc) to - specific object sizes. For example, all 16-byte objects are placed within - a "Span" specifically allocated for objects of that size. Operations to get or - release memory in such cases are much simpler. -* Holding memory in *caches* to speed up access of commonly-used objects. - Holding such caches even after deallocation also helps avoid costly system - calls if such memory is later re-allocated. +* Performs allocations from the operating system by managing + specifically-sized chunks of memory (called "pages"). Having all of these + chunks of memory the same size allows TCMalloc to simplify bookkeeping. +* Devoting separate pages (or runs of pages called "Spans" in TCMalloc) to + specific object sizes. For example, all 16-byte objects are placed within a + "Span" specifically allocated for objects of that size. Operations to get or + release memory in such cases are much simpler. +* Holding memory in *caches* to speed up access of commonly-used objects. + Holding such caches even after deallocation also helps avoid costly system + calls if such memory is later re-allocated. The cache size can also affect performance. The larger the cache, the less any given cache will overflow or get exhausted, and therefore require a lock to get @@ -58,7 +59,7 @@ default behavior should be preferred in most cases. 
For more information, consult the [TCMalloc Tuning Guide](tuning.md). Additionally, TCMalloc exposes telemetry about the state of the application's -heap via `MallocExtension`. This can be used for gathering profiles of the live +heap via `MallocExtension`. This can be used for gathering profiles of the live heap, as well as a snapshot taken near the heap's highwater mark size (a peak heap profile). @@ -87,8 +88,8 @@ The TCMalloc API obeys the behavior of C90 DR075 and [DR445](http://www.open-std.org/jtc1/sc22/wg14/www/docs/summary.htm#dr_445) which states: - The alignment requirement still applies even if the size is too small for - any object requiring the given alignment. +> The alignment requirement still applies even if the size is too small for any +> object requiring the given alignment. In other words, `malloc(1)` returns `alignof(std::max_align_t)`-aligned pointer. Based on the progress of diff --git a/docs/platforms.md b/docs/platforms.md index 9dc834d6e..8d4250d69 100644 --- a/docs/platforms.md +++ b/docs/platforms.md @@ -1,7 +1,7 @@ # TCMalloc Platforms -The TCMalloc code is supported on the following platforms. By "platforms", -we mean the union of operating system, architecture (e.g. little-endian vs. +The TCMalloc code is supported on the following platforms. By "platforms", we +mean the union of operating system, architecture (e.g. little-endian vs. big-endian), compiler, and standard library. ## Language Requirements @@ -13,7 +13,7 @@ We guarantee that our code will compile under the following compilation flags: Linux: -* gcc 9.2+, clang 9.0+: `-std=c++17` +* gcc 9.2+, clang 9.0+: `-std=c++17` (TL;DR; All code at this time must be built under C++17. We will update this list if circumstances change.) diff --git a/docs/quickstart.md b/docs/quickstart.md index 5d55aca95..cdfc71842 100644 --- a/docs/quickstart.md +++ b/docs/quickstart.md @@ -13,8 +13,8 @@ starting development using TCMalloc at least run through this quick tutorial. Running the code within this tutorial requires: -* A compatible platform (E.g. Linux). Consult the [Platforms Guide](platforms.md) - for more information. +* A compatible platform (E.g. Linux). Consult the + [Platforms Guide](platforms.md) for more information. * A compatible C++ compiler *supporting at least C++17*. Most major compilers are supported. * [Git](https://git-scm.com/) for interacting with the Abseil source code @@ -45,8 +45,8 @@ Resolving deltas: 100% (1083/1083), done. $ ``` -Git will create the repository within a directory named `tcmalloc`. -Navigate into this directory and run all tests: +Git will create the repository within a directory named `tcmalloc`. Navigate +into this directory and run all tests: ``` $ cd tcmalloc @@ -136,6 +136,7 @@ local_repository( path = "/PATH_TO_SOURCE/Source/tcmalloc", ) ``` + The "name" in the `WORKSPACE` file identifies the name you will use in Bazel `BUILD` files to refer to the linked repository (in this case "com_google_tcmalloc"). diff --git a/docs/reference.md b/docs/reference.md index c57d9bddb..7c3446e98 100644 --- a/docs/reference.md +++ b/docs/reference.md @@ -41,14 +41,14 @@ void* operator new[](std::size_t count, std::align_val_t al, const std::nothrow_t&) noexcept; // C++17 ``` -`operator new`/`operator new[]` allocates `count` bytes. They may be invoked +`operator new`/`operator new[]` allocates `count` bytes. They may be invoked directly but are more commonly invoked as part of a *new*-expression. 
When `__STDCPP_DEFAULT_NEW_ALIGNMENT__` is not specified (or is larger than 8 bytes), we use standard 16 byte alignments for `::operator new` without a `std::align_val_t` argument. However, for allocations under 16 bytes, we may return an object with a lower alignment, as no object with a larger alignment -requirement can be allocated in the space. When compiled with +requirement can be allocated in the space. When compiled with `__STDCPP_DEFAULT_NEW_ALIGNMENT__ <= 8`, we use a set of sizes aligned to 8 bytes for raw storage allocated with `::operator new`. @@ -61,9 +61,9 @@ requested alignment. If the allocation is unsuccessful, a failure terminates the program. NOTE: unlike in the C++ standard, we do not throw an exception in case of -allocation failure, or invoke `std::get_new_handler()` repeatedly in an -attempt to successfully allocate, but instead crash directly. Such behavior can -be used as a performance optimization for move constructors not currently marked +allocation failure, or invoke `std::get_new_handler()` repeatedly in an attempt +to successfully allocate, but instead crash directly. Such behavior can be used +as a performance optimization for move constructors not currently marked `noexcept`; such move operations can be allowed to fail directly due to allocation failures. Within Abseil code, these direct allocation failures are enabled with the Abseil build-time configuration macro @@ -89,7 +89,7 @@ void operator delete[](void* ptr, std::size_t sz, ``` `::operator delete`/`::operator delete[]` deallocate memory previously allocated -by a corresponding `::operator new`/`::operator new[]` call respectively. It is +by a corresponding `::operator new`/`::operator new[]` call respectively. It is commonly invoked as part of a *delete*-expression. Sized delete is used as a critical performance optimization, eliminating the @@ -110,18 +110,18 @@ the `` header file. Implementations require C11 or greater. TCMalloc provides implementation for the following C API functions: -* `malloc()` -* `calloc()` -* `realloc()` -* `free()` -* `aligned_alloc()` +* `malloc()` +* `calloc()` +* `realloc()` +* `free()` +* `aligned_alloc()` For `malloc`, `calloc`, and `realloc`, we obey the behavior of C90 DR075 and [DR445](http://www.open-std.org/jtc1/sc22/wg14/www/docs/summary.htm#dr_445) which states: - The alignment requirement still applies even if the size is too small for - any object requiring the given alignment. +> The alignment requirement still applies even if the size is too small for any +> object requiring the given alignment. In other words, `malloc(1)` returns `alignof(std::max_align_t)`-aligned pointer. Based on the progress of @@ -131,15 +131,15 @@ this alignment in the future. Additionally, TCMalloc provides an implementation for the following POSIX standard library function, available within glibc: -* `posix_memalign()` +* `posix_memalign()` TCMalloc also provides implementations for the following obsolete functions typically provided within libc implementations: -* `cfree()` -* `memalign()` -* `valloc()` -* `pvalloc()` +* `cfree()` +* `memalign()` +* `valloc()` +* `pvalloc()` Documentation is not provided for these obsolete functions. The implementations are provided only for compatibility purposes. 
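As a small illustrative check of the guarantee quoted above (ordinary C/C++
usage, not a TCMalloc extension), the program below verifies that even a 1-byte
`malloc()` result is aligned for `std::max_align_t`, and exercises the
`aligned_alloc()` requirement that `size` be an integral multiple of
`alignment`:

```
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main() {
  // Per C90 DR075/DR445, even a 1-byte request is aligned for any object type.
  void* p = std::malloc(1);
  bool aligned =
      reinterpret_cast<std::uintptr_t>(p) % alignof(std::max_align_t) == 0;
  std::printf("malloc(1) max_align_t-aligned: %s\n", aligned ? "yes" : "no");
  std::free(p);

  // aligned_alloc(): alignment must be a power of two and size an integral
  // multiple of alignment; otherwise the call fails and returns NULL.
  void* q = std::aligned_alloc(64, 128);
  if (q != nullptr) std::free(q);
  return 0;
}
```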
@@ -178,7 +178,7 @@ void* realloc(void *ptr, size_t new_size); ``` `realloc()` re-allocates memory for an existing region of memory by either -expanding or contracting the memory based on the passed `new_size` in bytes, +expanding or contracting the memory based on the passed `new_size` in bytes, returning a `void*` pointer to the start of that memory (which may not change); it does not perform any initialization of new areas of memory. @@ -196,11 +196,11 @@ void* aligned_alloc(size_t alignment, size_t size); not perform any initialization. The `size` parameter must be an integral multiple of `alignment` and `alignment` -must be a power of two. If either of these cases is not satisfied, +must be a power of two. If either of these cases is not satisfied, `aligned_alloc()` will fail and return a NULL pointer. -`aligned_alloc` with `size=0` returns a non-NULL zero-sized pointer. -(Attempting to access memory at this location is undefined.) +`aligned_alloc` with `size=0` returns a non-NULL zero-sized pointer. (Attempting +to access memory at this location is undefined.) ### `posix_memalign()` @@ -215,7 +215,7 @@ type of data pointer in order to be dereferenceable. If the alignment allocation succeeds, `posix_memalign()` returns `0`; otherwise it returns an error value. `posix_memalign` is similar to `aligned_alloc()` but `alignment` be a power of -two multiple of `sizeof(void *)`. If the constraints are not satisfied, +two multiple of `sizeof(void *)`. If the constraints are not satisfied, `posix_memalign()` will fail. `posix_memalign` with `size=0` returns a non-NULL zero-sized pointer. diff --git a/docs/regions-are-not-optional.md b/docs/regions-are-not-optional.md index 7dca969e9..f36103910 100644 --- a/docs/regions-are-not-optional.md +++ b/docs/regions-are-not-optional.md @@ -28,9 +28,9 @@ introduce a significant amount of overhead for allocations between 1 and 10 hugepages, and the overhead could still be considered significant for allocations larger than that.) -* We _cannot_ unback the unused tail of the last hugepage (requirement (2) +* We *cannot* unback the unused tail of the last hugepage (requirement (2) would be violated). -* We _cannot_ assume these requests are necessarily rare and we will have many +* We *cannot* assume these requests are necessarily rare and we will have many smaller ones to fill the unused tail (requirement (1) would be violated). Moreover this is **empirically false** for widely used binaries. @@ -75,7 +75,7 @@ for large allocations, backing and unbacking hugepages on demand. When one region fills, obtain another; fill from the most fragmented to bound total overhead (a policy derived from `HugePageFiller`). -That is _really it_. We do not see this as particularly complicated. The only +That is *really it*. We do not see this as particularly complicated. The only thing left is the implementation of that policy: We used `RangeTracker` because it was convenient, supported exactly the API we needed, and fast enough (even though we're tracking fairly large bitsets). diff --git a/docs/rseq.md b/docs/rseq.md index 36309d0a2..120af1dee 100644 --- a/docs/rseq.md +++ b/docs/rseq.md @@ -8,9 +8,8 @@ freshness: { owner: 'ckennelly' reviewed: '2021-06-15' } ## per-CPU Caches TCMalloc implements its per-CPU caches using restartable sequences (`man -rseq(2)`) on Linux. 
This kernel feature was developed by [Paul Turner and -Andrew Hunter at -Google](http://www.linuxplumbersconf.net/2013/ocw//system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf) +rseq(2)`) on Linux. This kernel feature was developed by +[Paul Turner and Andrew Hunter at Google](http://www.linuxplumbersconf.net/2013/ocw//system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf) and Mathieu Desnoyers at EfficiOS. Restartable sequences let us execute a region to completion (atomically with respect to other threads on the same CPU) or to be aborted if interrupted by the kernel by preemption, interrupts, or signal @@ -18,8 +17,8 @@ handling. Choosing to restart on migration across cores or preemption allows us to optimize the common case - we stay on the same core - by avoiding atomics, over -the more rare case - we are actually preempted. As a consequence of this -tradeoff, we need to make our code paths actually support being restarted. The +the more rare case - we are actually preempted. As a consequence of this +tradeoff, we need to make our code paths actually support being restarted. The entire sequence, except for its final store to memory which *commits* the change, must be capable of starting over. @@ -40,17 +39,17 @@ This carries a few implementation challenges: ## Structure of the `TcmallocSlab` -In per-CPU mode, we allocate an array of `N` `TcmallocSlab::Slabs`. For all +In per-CPU mode, we allocate an array of `N` `TcmallocSlab::Slabs`. For all operations, we index into the array with the logical CPU ID. Each slab is has a header region of control data (one 8-byte header per-size -class). These index into the remainder of the slab, which contains pointers to +class). These index into the remainder of the slab, which contains pointers to free listed objects. ![Memory layout of per-cpu data structures](images/per-cpu-cache-internals.png "Memory layout of per-cpu data structures") -In [C++ -code](https://github.com/google/tcmalloc/blob/master/tcmalloc/internal/percpu_tcmalloc.h), +In +[C++ code](https://github.com/google/tcmalloc/blob/master/tcmalloc/internal/percpu_tcmalloc.h), these are represented as: ``` @@ -82,17 +81,17 @@ struct Header { The atomic `header` allows us to read the state (esp. for telemetry purposes) of a core without undefined behavior. -The fields in `Header` are indexed in `sizeof(void*)` strides into the slab. -For the default value of `Shift=18`, this allows us to cache nearly 32K objects -per CPU. +The fields in `Header` are indexed in `sizeof(void*)` strides into the slab. For +the default value of `Shift=18`, this allows us to cache nearly 32K objects per +CPU. We have allocated capacity for `end-begin` objects for a given size-class. -`begin` is chosen via static partitioning at initialization time. `end` is +`begin` is chosen via static partitioning at initialization time. `end` is chosen dynamically at a higher-level (in `tcmalloc::CPUCache`), as to: -* Avoid running into the next size-classes' `begin` -* Balance cached object capacity across size-classes, according to the specified - byte limit. +* Avoid running into the next size-classes' `begin` +* Balance cached object capacity across size-classes, according to the + specified byte limit. ## Usage: Allocation @@ -149,7 +148,7 @@ pointer is between `[start, commit)`, it returns control to a specified, per-sequence restart header at `abort`. Since the *next* object is frequently allocated soon after the current object, -so the allocation path prefetches the pointed-to object. 
To avoid prefetching a +so the allocation path prefetches the pointed-to object. To avoid prefetching a wild address, we populate `slabs[cpu][begin]` for each CPU/size-class with a pointer-to-self. @@ -185,17 +184,17 @@ This ensures that the 4 bytes prior to `abort` match up with the signature that was configured with the `rseq` syscall. On x86, we can represent this with a nop which would allow for interleaving in -the main implementation. On other platforms - with fixed width instructions - +the main implementation. On other platforms - with fixed width instructions - the signature is often chosen to be an illegal/trap instruction, so it has to be disjoint from the function's body. -## Usage: Deallocation +## Usage: Deallocation Deallocation uses two stores, one to store the deallocated object and another to -update `current`. This is still compatible with the restartable sequence -technique, as there is a *single* commit step, updating `current`. Any -preempted sequences will overwrite the value of the deallocated object until a -successful sequence commits it by updating `current`. +update `current`. This is still compatible with the restartable sequence +technique, as there is a *single* commit step, updating `current`. Any preempted +sequences will overwrite the value of the deallocated object until a successful +sequence commits it by updating `current`. ``` int TcmallocSlab_Push( @@ -233,15 +232,15 @@ kernel to provide zeroed pages from the `mmap` call to obtain memory for the slab metadata. At startup, this leaves the `Header` of each initialized to `current = begin = -end = 0`. The initial push or pop will trigger the overflow or underflow paths +end = 0`. The initial push or pop will trigger the overflow or underflow paths (respectively), so that we can populate these values. -## More Complex Operations: Batches +## More Complex Operations: Batches When the cache under or overflows, we populate or remove a full batch of objects -obtained from inner caches. This amortizes some of the lock acquisition/logic -for those caches. Using a similar approach to push and pop, we update a batch -of `N` items and we update `current to commit the update. +obtained from inner caches. This amortizes some of the lock acquisition/logic +for those caches. Using a similar approach to push and pop, we update a batch of +`N` items and we update `current to commit the update. ## Kernel API and implementation diff --git a/docs/sampling.md b/docs/sampling.md index 8cb6187fe..a0cf37b0d 100644 --- a/docs/sampling.md +++ b/docs/sampling.md @@ -20,11 +20,12 @@ When we pick an allocation such as [Sampler::RecordAllocationSlow()](https://github.com/google/tcmalloc/blob/master/tcmalloc/sampler.cc) to sample we do some additional processing around that allocation using [SampleifyAllocation()](https://github.com/google/tcmalloc/blob/master/tcmalloc/tcmalloc.cc) - -recording stack, alignment, request size, and allocation size. Then we go through -all the active samplers using [ReportMalloc()](https://github.com/google/tcmalloc/blob/master/tcmalloc/tcmalloc.cc) -and tell them about the allocation. We also tell the span that we're sampling it -- we can do this because we do sampling at tcmalloc page sizes, so each sample -corresponds to a particular page in the pagemap. +recording stack, alignment, request size, and allocation size. Then we go +through all the active samplers using +[ReportMalloc()](https://github.com/google/tcmalloc/blob/master/tcmalloc/tcmalloc.cc) +and tell them about the allocation. 
We also tell the span that we're sampling +it - we can do this because we do sampling at tcmalloc page sizes, so each +sample corresponds to a particular page in the pagemap. ## How We Free Sampled Objects diff --git a/docs/stats.md b/docs/stats.md index ed600a561..89717093f 100644 --- a/docs/stats.md +++ b/docs/stats.md @@ -65,13 +65,13 @@ MALLOC: = 12942756418 (12343.2 MiB) Virtual address space used tracking memory allocation. This will grow as the amount of memory used grows. * **Bytes in malloc metadata Arena unallocated:** Metadata is allocated in an - internal Arena. Memory requests to the OS are made in blocks which amortize + internal Arena. Memory requests to the OS are made in blocks which amortize several Arena allocations and this captures memory that is not yet allocated but could be by future Arena allocations. -* **Bytes in malloc metadata Arena unavailable:** The Arena allocator may - fail to allocate a block fully when a subsequent Arena allocation request is - made that is larger than the block's remaining space. This memory is - currently unavailable for allocation. +* **Bytes in malloc metadata Arena unavailable:** The Arena allocator may fail + to allocate a block fully when a subsequent Arena allocation request is made + that is larger than the block's remaining space. This memory is currently + unavailable for allocation. There's a couple of summary lines: @@ -255,9 +255,9 @@ class 10 [ 80 bytes ] : 132 insert hits; 330 insert misses ( ``` As of July 2021, the `TransferCache` misses when inserting or removing a -non-batch size number of objects from the cache. These are reflected in the -"partial" column. The insert and remove miss column is *inclusive* of misses -for both batch size and non-batch size numbers of objects. +non-batch size number of objects from the cache. These are reflected in the +"partial" column. The insert and remove miss column is *inclusive* of misses for +both batch size and non-batch size numbers of objects. ### Per-CPU Information @@ -360,7 +360,7 @@ caches, or to directly satisfy requests that are larger than the sizes supported by the per-thread or per-cpu caches. **Note:** TCMalloc cannot tell whether a span of memory is actually backed by -physical memory, but it uses _unmapped_ to indicate that it has told the OS that +physical memory, but it uses *unmapped* to indicate that it has told the OS that the span is not used and does not need the associated physical memory. For this reason the physical memory of an application may be larger that the amount that TCMalloc reports. diff --git a/docs/temeraire.md b/docs/temeraire.md index 4620e1ec8..4f4c45515 100644 --- a/docs/temeraire.md +++ b/docs/temeraire.md @@ -2,9 +2,9 @@ Andrew Hunter, [Chris Kennelly](ckennelly@google.com) -_Notes on the name_[^cutie]_: the french word for "reckless" or "rash" :), and also -the name of several large and powerful English warships. So: giant and powerful, -but maybe a little dangerous. :)_ +*Notes on the name*[^cutie]*: the french word for "reckless" or "rash" :), and +also the name of several large and powerful English warships. So: giant and +powerful, but maybe a little dangerous. :)* This is a description of the design of the Hugepage-Aware Allocator. We have also published ["Beyond malloc efficiency to fleet efficiency: a hugepage-aware @@ -17,7 +17,7 @@ Temeraire. What do we want out of this redesign? * Dramatic reduction in pageheap size. 
The pageheap in TCMalloc holds - substantial amounts of memory _after_ its attempts to `MADV_DONTNEED` memory + substantial amounts of memory *after* its attempts to `MADV_DONTNEED` memory back to the OS, due to internal fragmentation. We can recover a useful fraction of this. In optimal cases, we see savings of over 90%. We do not expect to achieve this generally, but a variety of synthetic loads suggest @@ -28,7 +28,7 @@ What do we want out of this redesign? hugepages. Services have seen substantial performance gains from **from disabling release** (and going to various other lengths to maximize hugepage usage). -* _reasonable_ allocation speed. This is really stating a non-goal: speed +* *reasonable* allocation speed. This is really stating a non-goal: speed parity with `PageHeap::New`. PageHeap is a relatively light consumer of cycles. We are willing to accept a speed hit in actual page allocation in exchange for better hugepage usage and space overhead. This is not free but @@ -36,7 +36,7 @@ What do we want out of this redesign? regressions in speed. We intentionally accept two particular time hits: * much more aggressive releasing (of entire hugepages), leading to - increased costs for _backing_ memory. + increased costs for *backing* memory. * much more detailed (and expensive) choices of where to fulfill a particular request. @@ -61,8 +61,8 @@ corresponding object we allocated from as free. We will sketch the purpose and approach of each important part. Note that we have fairly detailed unit tests for each of these; one consequence on the implementations is that most components are templated on the -`tcmalloc::SystemRelease` functions[^templated] as we make a strong attempt to be zero -initializable where possible (sadly not everywhere). +`tcmalloc::SystemRelease` functions[^templated] as we make a strong attempt to +be zero initializable where possible (sadly not everywhere). ### `RangeTracker` @@ -92,7 +92,7 @@ and to provide memory for the other components to break up into smaller chunks. `HugeAllocator` is (nearly) trivial: it requests arbitrarily large hugepage-sized chunks from `SysAllocator`, keeps them unbacked, and tracks the available (unbacked) regions. Note that we do not need to be perfectly space -efficient here: we only pay virtual memory and metadata, since _none_ of the +efficient here: we only pay virtual memory and metadata, since *none* of the contents are backed. (We do make our best efforts to be relatively frugal, however, since there’s no need to inflate VSS by large factors.) Nor do we have to be particularly fast; this is well off any hot path, and we’re going to incur @@ -110,7 +110,7 @@ fiddly, but reasonably efficient and not stunningly complicated. #### `HugeCache` This is a very simple wrapper on top of HugeAllocator. It's only purpose is to -store some number of backed _single_ hugepage ranges as a hot cache (in case we +store some number of backed *single* hugepage ranges as a hot cache (in case we rapidly allocate and deallocate a 2 MiB chunk). It is not clear whether the cache is necessary, but we have it and it's not @@ -118,14 +118,14 @@ costing us much in complexity, and will help significantly in some potential antagonistic scenarios, so we favor keeping it. It currently attempts to estimate the optimal cache size based on past behavior. -This may not really be needed, but it's a very minor feature to keep _or_ drop. +This may not really be needed, but it's a very minor feature to keep *or* drop. 
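To make the bookkeeping concrete, here is a deliberately simplified,
hypothetical sketch of the idea behind a range tracker: a bitmap over the pages
of a single hugepage, with a query for the longest free run (the "longest free
range" measure the filler below keys on). This is not the real
`tcmalloc::RangeTracker` interface, only an illustration of the state being
tracked; for a 2 MiB hugepage divided into 8 KiB pages, `N` would be 256.

```
#include <bitset>
#include <cstddef>

// Simplified, illustrative page tracker for one hugepage: set bits mark
// allocated pages.
template <size_t N>
class SimpleRangeTracker {
 public:
  // Marks pages [index, index + len) as allocated.
  void Mark(size_t index, size_t len) {
    for (size_t i = index; i < index + len; ++i) used_.set(i);
  }

  // Marks pages [index, index + len) as free again.
  void Unmark(size_t index, size_t len) {
    for (size_t i = index; i < index + len; ++i) used_.reset(i);
  }

  // Length of the longest run of free pages, i.e. the largest request this
  // hugepage could still satisfy without growing.
  size_t LongestFreeRange() const {
    size_t best = 0, run = 0;
    for (size_t i = 0; i < N; ++i) {
      run = used_.test(i) ? 0 : run + 1;
      if (run > best) best = run;
    }
    return best;
  }

 private:
  std::bitset<N> used_;
};
```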
### `HugePageFiller` (the core…) `HugePageFiller` takes small requests (less than a hugepage) and attempts to pack them efficiently into hugepages. The vast majority of binaries use almost -entirely small allocations[^conditional], so this is the dominant consumer of space and -the most important component. +entirely small allocations[^conditional], so this is the dominant consumer of +space and the most important component. Our goal here is to make our live allocations fit within the smallest set of hugepages possible, so that we can afford to keep all used hugepages fully @@ -136,10 +136,10 @@ requests for 1 page are (usually) the most common, but 4, 8, or even 50+ page requests aren't unheard of. Many 1-page free regions won’t be useful here, and we'll have to request enormous numbers of new hugepages for anything large. -Our solution is to build a heap-ordered data structure on _fragmentation_, not +Our solution is to build a heap-ordered data structure on *fragmentation*, not total amount free, in each hugepage. We use the **longest free range** (the biggest allocation a hugepage can fulfill!) as a measurement of fragmentation. -In other words: if a hugepage has a free range of length 8, we _never_ allocate +In other words: if a hugepage has a free range of length 8, we *never* allocate from it for a smaller request (unless all hugepages available have equally long ranges). This carefully husbands long ranges for the requests that need them, and allows them to grow (as neighboring allocations are freed). @@ -148,10 +148,10 @@ Inside each equal-longest-free-range group, we order our heap by the **number of allocations** (chunked logarithmically). This helps favor allocating from fuller hugepages (of equal fragmented status). Number of allocations handily outperforms the total number of allocated pages here; our hypothesis is that -since allocations of any size are equally likely[^radioactive] to become free at any given -time, and we need all allocations on a hugepage to become free to make the -hugepage empty, we’re better off hoping for 1 10-page allocation to become free -(with some probability P) than 5 1-page allocations (with probability P^5). +since allocations of any size are equally likely[^radioactive] to become free at +any given time, and we need all allocations on a hugepage to become free to make +the hugepage empty, we’re better off hoping for 1 10-page allocation to become +free (with some probability P) than 5 1-page allocations (with probability P^5). The `HugePageFiller` contains support for releasing parts of mostly-empty hugepages as a last resort. @@ -196,7 +196,7 @@ A few important details: given binary, which means we can be less careful about how we organize the set of regions. -* We don’t make _any_ attempt, when allocating from a given region, to find an +* We don’t make *any* attempt, when allocating from a given region, to find an already-backed but unused range. Nor do we prefer regions that have such ranges. @@ -222,7 +222,7 @@ discussing. 1. Small allocations are handed directly to the filler; we add hugepages to the filler as needed. - 1. For slightly larger allocations (still under a full hugepage), we _try_ + 1. For slightly larger allocations (still under a full hugepage), we *try* the filler, but don’t grow it if there’s not currently space. Instead, we look in the regions for free space. If neither the regions or the filler has space, we prefer growing the filler (since it comes in @@ -259,11 +259,9 @@ application. 
[^cutie]: Also the name of [this cutie](https://lh3.googleusercontent.com/VXENOSfqH1L84VMwLVAUA7JIqQh7TYH-IZHLBalvVVuMUeD3w5rOVHPsIp97nYEgmKpQoxsHO-lieGouheNmifA2X6tOPTBleTbQc_WCZIrI_roU2K37iiHg9go6omp2ys0Y7cxYc9c6EWNaCYtKG1dEPyyYLULUarCex4oqwt8KgRl95rd3yKXC6YQeW-TWkDpK786ZaAA3vKJXqT5E-ArPxQccyPH13EAmHrltKatqihC7L4Ym5IfP42u58IJwC5bRnKMczm2WwUfipGDEOvymf63mPNKmGMka50AQV4VGrE7hW_Ateb2roCTGISgZIooBSRwK0PMjqV9hBLP5DmUG4ITSV4FlOI5iWOyMSNZV6Gz5T2FgNez08Wdn98tsEsN4_lPcjdZXyJuHeVRKxAawDwjkbWP3aieXDckHY-bJMt0QfyDhPWzSOpTxTALcZiwoC069K9SrBDVKEKowJ2Zag7OlbpROhqbagM5Wuo_nn6O27yWXpihc8Lptt-Vo_e8kQZ4N2RReby3bxNPdRyv2L8BrDCIWBO-iFk7GcYRd9ox7HSD-7Y0yH1FtMP0FZKD5a2raVmabMQrolhsjc-AfYHgD3xBkNo-uTJ8YnFpqjpTdZz_1=w2170-h1446-no), the real reason for the choice. - -[^templated]: It will be possible, given recent improvements in constexpr usage, to - eliminate this in followups. - -[^conditional]: Here we mean "requests to the pageheap as filtered through sampling, the - central cache, etc" - -[^radioactive]: Well, no, this is false in our empirical data, but to first order. +[^templated]: It will be possible, given recent improvements in constexpr usage, + to eliminate this in followups. +[^conditional]: Here we mean "requests to the pageheap as filtered through + sampling, the central cache, etc" +[^radioactive]: Well, no, this is false in our empirical data, but to first + order. diff --git a/docs/tuning.md b/docs/tuning.md index e5f9bf3ae..eca6566de 100644 --- a/docs/tuning.md +++ b/docs/tuning.md @@ -67,7 +67,7 @@ worth looking at using large page sizes. **Suggestion:** Small-but-slow is *extremely* slow and should be used only where it is absolutely vital to minimize memory footprint over performance at all -costs. Small-but-slow works by turning off and shrinking several of TCMalloc's +costs. Small-but-slow works by turning off and shrinking several of TCMalloc's caches, but this comes at a significant performance penalty. **Note:** Size-classes are determined on a per-page-size basis. So changing the @@ -97,7 +97,7 @@ Releasing memory held by unuable CPU caches is handled by `tcmalloc::MallocExtension::ProcessBackgroundActions`. In contrast `tcmalloc::MallocExtension::SetMaxTotalThreadCacheBytes` controls -the _total_ size of all thread caches in the application. +the *total* size of all thread caches in the application. **Suggestion:** The default cache size is typically sufficient, but cache size can be increased (or decreased) depending on the amount of time spent in @@ -126,15 +126,15 @@ There are two disadvantages of releasing memory aggressively: **Note:** Release rate is not a panacea for memory usage. Jobs should be provisioned for peak memory usage to avoid OOM errors. Setting a release rate -may enable an application to exceed the memory limit for short periods of -time without triggering an OOM. A release rate is also a good citizen behavior -as it will enable the system to use spare capacity memory for applications -which are are under provisioned. However, it is not a substitute for setting -appropriate memory requirements for the job. +may enable an application to exceed the memory limit for short periods of time +without triggering an OOM. A release rate is also a good citizen behavior as it +will enable the system to use spare capacity memory for applications which are +are under provisioned. However, it is not a substitute for setting appropriate +memory requirements for the job. -**Note:** Memory is released from the `PageHeap` and stranded per-cpu caches. 
-It is not possible to release memory from other internal structures, like -the `CentralFreeList`. +**Note:** Memory is released from the `PageHeap` and stranded per-cpu caches. It +is not possible to release memory from other internal structures, like the +`CentralFreeList`. **Suggestion:** The default release rate is probably appropriate for most applications. In situations where it is tempting to set a faster rate it is @@ -143,7 +143,7 @@ cause an OOM at some point. ## System-Level Optimizations -* TCMalloc heavily relies on Transparent Huge Pages (THP). As of February +* TCMalloc heavily relies on Transparent Huge Pages (THP). As of February 2020, we build and test with ``` @@ -158,7 +158,7 @@ cause an OOM at some point. ``` * TCMalloc makes assumptions about the availability of virtual address space, - so that we can layout allocations in cetain ways. We build and test with + so that we can layout allocations in cetain ways. We build and test with ``` /proc/sys/vm/overcommit_memory: @@ -170,34 +170,37 @@ cause an OOM at some point. TCMalloc is built and tested in certain ways. These build-time options can improve performance: -* Statically-linking TCMalloc reduces function call overhead, by obviating the - need to call procedure linkage stubs in the procedure linkage table (PLT). -* Enabling [sized deallocation from - C++14](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3778.html) - reduces deallocation costs when the size can be determined. Sized deallocation - is enabled with the `-fsized-deallocation` flag. This behavior is enabled by - default in GCC), but as of early 2020, is not enabled by default on Clang even - when compiling for C++14/C++17. - - Some standard C++ libraries (such as - [libc++](https://reviews.llvm.org/rCXX345214)) will take advantage of sized - deallocation for their allocators as well, improving deallocation performance - in C++ containers. -* Aligning raw storage allocated with `::operator new` to 8 bytes by compiling - with `__STDCPP_DEFAULT_NEW_ALIGNMENT__ <= 8`. This smaller alignment minimizes - wasted memory for many common allocation sizes (24, 40, etc.) which are - otherwise rounded up to a multiple of 16 bytes. On many compilers, this - behavior is controlled by the `-fnew-alignment=...` flag. - - When `__STDCPP_DEFAULT_NEW_ALIGNMENT__` is not specified (or is larger than 8 - bytes), we use standard 16 byte alignments for `::operator new`. However, for - allocations under 16 bytes, we may return an object with a lower alignment, as - no object with a larger alignment requirement can be allocated in the space. -* Optimizing failures of `operator new` by directly failing instead of throwing - exceptions. Because TCMalloc does not throw exceptions when `operator new` - fails, this can be used as a performance optimization for many move - constructors. - - Within Abseil code, these direct allocation failures are enabled with the - Abseil build-time configuration macro - [`ABSL_ALLOCATOR_NOTHROW`](https://abseil.io/docs/cpp/guides/base#abseil-exception-policy). +* Statically-linking TCMalloc reduces function call overhead, by obviating the + need to call procedure linkage stubs in the procedure linkage table (PLT). +* Enabling + [sized deallocation from C++14](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3778.html) + reduces deallocation costs when the size can be determined. Sized + deallocation is enabled with the `-fsized-deallocation` flag. 
This behavior
+    is enabled by default in GCC, but as of early 2020, is not enabled by
+    default on Clang even when compiling for C++14/C++17.
+
+    Some standard C++ libraries (such as
+    [libc++](https://reviews.llvm.org/rCXX345214)) will take advantage of sized
+    deallocation for their allocators as well, improving deallocation
+    performance in C++ containers.
+
+*   Aligning raw storage allocated with `::operator new` to 8 bytes by compiling
+    with `__STDCPP_DEFAULT_NEW_ALIGNMENT__ <= 8`. This smaller alignment
+    minimizes wasted memory for many common allocation sizes (24, 40, etc.)
+    which are otherwise rounded up to a multiple of 16 bytes. On many compilers,
+    this behavior is controlled by the `-fnew-alignment=...` flag.
+
+    When `__STDCPP_DEFAULT_NEW_ALIGNMENT__` is not specified (or is larger than
+    8 bytes), we use standard 16 byte alignments for `::operator new`. However,
+    for allocations under 16 bytes, we may return an object with a lower
+    alignment, as no object with a larger alignment requirement can be allocated
+    in the space.
+
+*   Optimizing failures of `operator new` by directly failing instead of
+    throwing exceptions. Because TCMalloc does not throw exceptions when
+    `operator new` fails, this can be used as a performance optimization for
+    many move constructors.
+
+    Within Abseil code, these direct allocation failures are enabled with the
+    Abseil build-time configuration macro
+    [`ABSL_ALLOCATOR_NOTHROW`](https://abseil.io/docs/cpp/guides/base#abseil-exception-policy).
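To illustrate the sized-deallocation point above, here is a minimal
standard-C++ sketch (assuming compilation with `-std=c++17
-fsized-deallocation`); the `Node` type is a placeholder and nothing here is
TCMalloc-specific:

```
#include <new>

struct Node {
  double a, b, c;
};

int main() {
  // The new-expression obtains storage via ::operator new(sizeof(Node)).
  Node* n = new Node{};

  // With sized deallocation enabled, the matching delete-expression calls
  // ::operator delete(void*, std::size_t) with sizeof(Node), so the allocator
  // can map the size directly to a size-class instead of deriving it from the
  // pointer.
  delete n;

  // The sized form can also be invoked directly on raw storage.
  void* raw = ::operator new(sizeof(Node));
  ::operator delete(raw, sizeof(Node));
  return 0;
}
```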