A playground/benchmarking program for testing alternative implementations of low-level math routines within Blender.
The following scenarios were profiled to determine which math-heavy code paths within Blender could be improved. From those code paths, select mathematical functions were tested out.
- 2 different cloth simulations, yielding very different profiles, were created
- Baking of the simulation was done on the commandline to reduce interaction and aid consistency
- Open blender as follows (keeps drawing overhead low):
blender --window-geometry 0 0 960 540
- Download the blenderman demo file
- Open the downloaded `blenderman.blend` with Load UI unchecked
- Switch to camera view (NUMPAD-0)
- Profile: Hit play on the timeline
- Open blender as follows (keeps drawing overhead low):
blender --window-geometry 0 0 960 540
- Create a Suzanne monkey
- Array the monkey 40 times in X, 40 times in Y (1600 total)
- Animate the monkey position
- Profile: Hit play on the timeline
- Open blender as follows (keeps drawing overhead low):
blender --window-geometry 0 0 960 540
- Create a UV Sphere with 12 segments, 6 rings
- Apply 7 levels of subdivision
- Go to edit mode, use vertex select, and select 8k faces worth of vertices
- Profile: Hit G and attempt to move the selected vertices around
Interesting code paths found:
- cloth: `muladd_fmatrix_fvector`, `mul_bfmatrix_lfvector`, `closest_on_tri_to_point_v3`, `isect_seg_seg_v3`
- general: `split_loop_nor_fan_do`, `mesh_calc_normals_poly_prepare_cb`, `mesh_verts_calc_normals_accum_cb`, `mesh_edges_calc_vectors_cb`, `mesh_edges_sharp_tag`, `extract_lnor_loop_mesh`, and `extract_pos_nor_loop_mesh`
- The file, lib structure, and even file names are lifted directly from Blender where possible. This is to help those already familiar with Blender know exactly what is what.
- The `bf_bmesh` project under Blender is currently compiled as `C` code, not `C++`. As such, class-like types and parameter references are not used in the API here when reimplementing `bf_blenlib` functions since they need to remain consumable in `bf_bmesh`.
- For now, the naming conventions for APIs like sub_v3_v3v3(...) are kept even though the SSE variant of this API may now be working with an SSE type (effectively 4 floats).
- Passing the SSE types, like `__m128`, by reference is not necessary in function signatures. No speedup was observed when doing so. Even the intrinsic functions themselves take their parameters as value types.
- Usage of the _VECTORCALL calling convention will benefit some non-inlined methods.
- The focus is on SSE2. Known downsides include:
  - Dot products are slower than necessary (requires SSE4 for better performance)
  - A useful FMA intrinsic cannot be used (requires FMA CPU support for better performance)
- The optimal set of compiler flags for this benchmark program deviates from what Blender uses. Rather than using the optimal flags, the flags are kept as close as possible to Blender's. See the CMakeLists.txt for some notes there.
- To simulate a realistic workflow, a quad-sphere is used for geometry data where applicable (a cube with sub-d level 1 applied). This yields 48 triangles worth of data to loop over in a given benchmark iteration. For example, the timing output from the "normal_tri" tests is effectively the time it takes to calculate the normals for all 48 triangles.
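The API notes above (SSE types passed by value, SSE2 as the baseline) can be illustrated with a minimal sketch. The function names `sub_m128_m128` and `dot_m128_m128` are hypothetical and chosen only to echo the existing naming style; this is not the code from this repository. The dot product uses SSE2-level shuffles and scalar adds, which is the kind of sequence that SSE4.1's `_mm_dp_ps` could replace on newer CPUs.

```c
#include <emmintrin.h> /* SSE2 intrinsics; also provides the SSE ones */

/* Hypothetical SSE counterpart of sub_v3_v3v3: operands are passed and
 * returned by value, just like the intrinsics themselves do. */
static inline __m128 sub_m128_m128(__m128 a, __m128 b)
{
  return _mm_sub_ps(a, b);
}

/* 3D dot product using only SSE/SSE2-level instructions.
 * Lane 3 (w) is ignored: only the x, y and z products are summed. */
static inline float dot_m128_m128(__m128 a, __m128 b)
{
  __m128 m = _mm_mul_ps(a, b);                              /* ax*bx, ay*by, az*bz, aw*bw */
  __m128 y = _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1)); /* broadcast lane 1 */
  __m128 z = _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 2, 2, 2)); /* broadcast lane 2 */
  return _mm_cvtss_f32(_mm_add_ss(_mm_add_ss(m, y), z));    /* x + y + z from lane 0 */
}
```

A caller would build its operands once, for example with `_mm_set_ps(0.0f, z, y, x)`, and pass them straight through several such functions without going back to a `float` array in between.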
Command: blender_bench.exe --benchmark_report_aggregates_only=true --benchmark_repetitions=10
Test | i7-8750H Time | Xeon E5-1650 v4 Time | Notes |
---|---|---|---|
BB_dot_v3v3_mean | 121 ns (1x) | 152 ns (1x) | Baseline |
BB_dot_m128m128_mean | 131 ns (0.91x) | 152 ns (1x) | Bad |
BB_dot_m128m128_lf3_mean | 131 ns (0.91x) | 160 ns (0.95x) | Bad |
BB_dot_m128m128_lf4_mean | 109 ns (1.09x) | 130 ns (1.17x) | Good |
BB_dot_m128m128_native_mean | 109 ns (1.09x) | 127 ns (1.20x) | Best |
BB_cross_tri_v3_mean | 178 ns (1x) | 201 ns (1x) | Baseline |
BB_cross_tri_m128_mean | 158 ns (1.12x) | 183 ns (1.1x) | Good |
BB_cross_tri_m128_lf3sf3_mean | 189 ns (0.94x) | 210 ns (0.96x) | Bad |
BB_cross_tri_m128_lf3sf4_mean | 174 ns (1.02x) | 180 ns (1.12x) | Ok |
BB_cross_tri_m128_lf4sf4_mean | 168 ns (1.06x) | 187 ns (1.08x) | Ok |
BB_cross_tri_m128_native_mean | 145 ns (1.23x) | 156 ns (1.29x) | Best |
BB_normal_tri_v3_mean | 408 ns (1x) | 481 ns (1x) | Baseline |
BB_normal_tri_m128_mean | 265 ns (1.54x) | 305 ns (1.58x) | Good |
BB_normal_tri_m128_lf3sf3_mean | 286 ns (1.43x) | 320 ns (1.5x) | Good |
BB_normal_tri_m128_lf3sf4_mean | 259 ns (1.58x) | 297 ns (1.62x) | Good |
BB_normal_tri_m128_lf4sf4_mean | 237 ns (1.72x) | 271 ns (1.78x) | Good |
BB_normal_tri_m128_native_mean | 227 ns (1.80x) | 254 ns (1.9x) | Best |
BB_is_quad_flip_v3_mean | 155 ns (1x) | 200 ns (1x) | Baseline |
BB_is_quad_flip_m128_mean | 128 ns (1.21x) | 153 ns (1.31x) | Good |
BB_is_quad_flip_m128_lf3_mean | 136 ns (1.14x) | 163 ns (1.23x) | Good |
BB_is_quad_flip_m128_lf4_mean | 117 ns (1.32x) | 146 ns (1.37x) | Good |
BB_is_quad_flip_m128_native_mean | 116 ns (1.34x) | 144 ns (1.39x) | Best |
BB_GPU_normal_convert_i10_v3_mean | 142 ns (1x) | 159 ns (1x) | Baseline |
BB_GPU_normal_convert_i10_m128_mean | 141 ns (1.01x) | 142 ns (1.12x) | Ok |
BB_GPU_normal_convert_i10_m128_lf3_mean | 141 ns (1.01x) | 150 ns (1.06x) | Ok |
BB_GPU_normal_convert_i10_m128_lf4_mean | 130 ns (1.09x) | 140 ns (1.14x) | Good |
BB_GPU_normal_convert_i10_m128_native_mean | 128 ns (1.11x) | 135 ns (1.18x) | Best |
The *_sse variations above should not be any faster/slower than the *_lf3sf3 variants. However, this is not the case; they are substantially faster in some cases. It's surprising that even Clang shows codegen differences between the 2 variants.
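The variant suffixes are not defined in this README. Assuming `lf3`/`lf4` stand for loading three or four floats and `sf3`/`sf4` for storing three or four floats, the difference between the variants could look roughly like the sketch below. The helper names are hypothetical. The point is that a 3-float load or store needs several instructions to avoid touching memory past the array, while the 4-float versions are a single instruction each.

```c
#include <emmintrin.h> /* SSE2 intrinsics; also provides the SSE ones */

/* Load exactly 3 floats; lane 3 of the result is zero. */
static inline __m128 load_f3(const float v[3])
{
  __m128 xy = _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)v); /* x, y, 0, 0 */
  __m128 z = _mm_load_ss(v + 2);                                /* z, 0, 0, 0 */
  return _mm_movelh_ps(xy, z);                                  /* x, y, z, 0 */
}

/* Store exactly 3 floats; memory after v[2] is left untouched. */
static inline void store_f3(float v[3], __m128 a)
{
  _mm_storel_pi((__m64 *)v, a);             /* writes x and y */
  _mm_store_ss(v + 2, _mm_movehl_ps(a, a)); /* lane 2 moved to lane 0, writes z */
}

/* With 4-float storage, an unaligned load or store is one instruction. */
static inline __m128 load_f4(const float v[4])
{
  return _mm_loadu_ps(v);
}

static inline void store_f4(float v[4], __m128 a)
{
  _mm_storeu_ps(v, a);
}
```

A "native" variant in this reading would skip both steps because the data already lives in `__m128` values.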
This would allow for quick integration back into the main Blender codebase as all the callers would not notice. The functions would just get faster; for "free".
Speedups of any magnitude are worthwhile if:
- Maintainability is enhanced or not made any worse
- Readability is enhanced or not made any worse
- Chances of additional bugs are low
- The speedups are meaningful for the human operating the software
- There are few, if any, cases where things become slower
If this benchmark's code were to be used, it would score favorably enough in some of those categories but not so much in others.
Is it worth it? Maybe.
Functions with three or more blocks of mathematical operations would stand to gain the most (like cross_tri_v3 and normal_tri_v3 etc.) especially if intermediate results can be kept in SSE types instead of loading/storing to float arrays.
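As a rough illustration of keeping intermediate results in SSE types, the sketch below chains two subtractions, a cross product, and a normalization without writing anything back to a `float` array. It is only loosely modelled on what a triangle normal routine does; `cross_m128` and `normal_tri_m128` are hypothetical names, and the zero-length handling is an assumption rather than Blender's exact behavior. The inputs are assumed to carry 0 in lane 3.

```c
#include <emmintrin.h> /* SSE2 intrinsics; also provides the SSE ones */

/* Cross product of the x, y, z lanes using the common shuffle trick. */
static inline __m128 cross_m128(__m128 a, __m128 b)
{
  __m128 a_yzx = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 0, 2, 1));
  __m128 b_yzx = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 0, 2, 1));
  /* c holds the result rotated as (z, x, y); one more shuffle fixes the order. */
  __m128 c = _mm_sub_ps(_mm_mul_ps(a, b_yzx), _mm_mul_ps(a_yzx, b));
  return _mm_shuffle_ps(c, c, _MM_SHUFFLE(3, 0, 2, 1));
}

/* Unit normal of the triangle (v1, v2, v3); all zeros if it is degenerate. */
static inline __m128 normal_tri_m128(__m128 v1, __m128 v2, __m128 v3)
{
  /* Block 1: edge vectors. Block 2: cross product. Both stay in registers. */
  __m128 n = cross_m128(_mm_sub_ps(v1, v2), _mm_sub_ps(v2, v3));

  /* Block 3: squared length of x, y, z, broadcast to every lane. */
  __m128 sq = _mm_mul_ps(n, n);
  __m128 sum = _mm_add_ps(
      _mm_add_ps(_mm_shuffle_ps(sq, sq, _MM_SHUFFLE(0, 0, 0, 0)),
                 _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(1, 1, 1, 1))),
      _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(2, 2, 2, 2)));
  __m128 len = _mm_sqrt_ps(sum);

  /* Block 4: divide, then mask the result to zero where the length is zero. */
  __m128 nonzero = _mm_cmpgt_ps(len, _mm_setzero_ps());
  return _mm_and_ps(_mm_div_ps(n, len), nonzero);
}
```

A version built on 3-float arrays would instead load and store around each of these blocks, which is where the extra cost discussed above comes from.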
The current scorecard speaks for itself: the usage of 3 floats in the APIs, DNA structs, and RNA interfaces is suboptimal when compared against either 4 floats or the usage of SSE types directly. This should not come as a surprise.
- Performance is affected so much due to this that, in some simple cases, SSE becomes slower than normal code (dot product especially).
- To fully leverage SSE, the usage of SSE types would be ideal, followed closely by 4 float storage.
- SoA (data-oriented) designs can, in theory, be used for further speedups. However, those designs often run counter to the access patterns that Blender needs (e.g. during editing).
Unfortunately, the memory (extra 25% at minimum) and 16 byte alignment requirements would bloat runtime memory usage well beyond what would be tolerable.
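The per-element cost can be made concrete with a small sketch. The struct names below are hypothetical and are not Blender's DNA types; the sizes assume a 4-byte `float`, as on the platforms benchmarked here.

```c
#include <stddef.h>    /* size_t */
#include <emmintrin.h> /* __m128 */

/* Hypothetical element types, for illustration only. */
typedef struct Float3Elem { float co[3]; } Float3Elem; /* 12 bytes, 4-byte aligned */
typedef struct Float4Elem { float co[4]; } Float4Elem; /* 16 bytes, 4-byte aligned */
typedef struct SseElem { __m128 co; } SseElem;          /* 16 bytes, 16-byte aligned */

/* Extra bytes needed to store `count` vectors with 4 lanes instead of 3. */
static size_t extra_bytes_for_padding(size_t count)
{
  return count * (sizeof(Float4Elem) - sizeof(Float3Elem));
}
```

For one million vectors, `extra_bytes_for_padding` reports 4,000,000 bytes: the unused fourth lane makes up a quarter of the 16,000,000-byte padded buffer. Structs that embed an `__m128` member also inherit its 16-byte alignment, which can add further padding around neighboring fields.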
TODO: Figure out what changes are allowed in DNA and explore ways of testing a better format out in isolated cases. For example, for calculating normals:
- Could just the `CD_NORMAL` layer be changed as an experiment?
- Could a `CD_NORMAL_SSE` layer be created, and a few critical code paths be made aware of it, to see if performance can be gained?
You can use vcpkg to install the benchmark library like this:
Windows:
> vcpkg install benchmark --triplet x64-windows
Linux:
~ vcpkg install benchmark --triplet x64-linux
See integration for how to find the location of the vcpkg.cmake file.
Windows:
> mkdir build
> cd build
> cmake .. -DVCPKG_TARGET_TRIPLET=x64-windows -DCMAKE_TOOLCHAIN_FILE=<LOCATION OF vcpkg.cmake FILE>
> cmake --build . --config Release
Linux:
~ mkdir build
~ cd build
~ cmake .. -DVCPKG_TARGET_TRIPLET=x64-linux -DCMAKE_TOOLCHAIN_FILE=<LOCATION OF vcpkg.cmake FILE>
~ cmake --build . --config Release
Help
blender_bench.exe --help
For quick spot checks:
blender_bench.exe --benchmark_repetitions=5 --benchmark_filter=normal_tri
For reporting and better averages:
blender_bench.exe --benchmark_report_aggregates_only=true --benchmark_repetitions=10
ctest --output-on-failure -C Release
- Microsoft Visual C++ 2019 (and likely earlier)
- Clang 5.0+ (and likely earlier)
- GCC 6.1+ (and likely earlier)
Licensed under the GNU General Public License version 3 https://opensource.org/licenses/GPL-3.0.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
This software would not be possible without the help of these great resources. Thanks a lot!
- benchmark for benchmarking support and reporting