Implement experimental GPU two-phase occlusion culling for the standard 3D mesh pipeline. #17413

pcwalton · 2025-01-17T02:57:23Z

Occlusion culling allows the GPU to skip the vertex and fragment shading overhead for objects that can be quickly proved to be invisible because they're behind other geometry. A depth prepass already eliminates most fragment shading overhead for occluded objects, but the vertex shading overhead, as well as the cost of testing and rejecting fragments against the Z-buffer, is presently unavoidable for standard meshes. We currently perform occlusion culling only for meshlets. But other meshes, such as skinned meshes, can benefit from occlusion culling too in order to avoid the transform and skinning overhead for unseen meshes.

This commit adapts the same two-phase occlusion culling technique that meshlets use to Bevy's standard 3D mesh pipeline when the new OcclusionCulling component, as well as the DepthPrepass component, are present on the camera. It has these steps:

1. Early depth prepass: We use the hierarchical Z-buffer from the previous frame to cull meshes for the initial depth prepass, effectively rendering only the meshes that were visible in the last frame.
2. Early depth downsample: We downsample the depth buffer to create another hierarchical Z-buffer, this time with the current view transform.
3. Late depth prepass: We use the new hierarchical Z-buffer to test all meshes that weren't rendered in the early depth prepass. Any meshes that pass this check are rendered.
4. Late depth downsample: Again, we downsample the depth buffer to create a hierarchical Z-buffer in preparation for the early depth prepass of the next frame. This step is done after all the rendering, in order to account for custom phase items that might write to the depth buffer.
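
The four steps above can be illustrated with a small CPU-side toy model. This is only a sketch under stated assumptions, not the code in this PR: the "screen" is a 1D row of pixels, each "mesh" is a hypothetical `Span` at constant depth, smaller depth means nearer (Bevy itself may use a different depth convention), and the hierarchical Z-buffer is reduced to a single coarse level that stores the farthest depth per tile. All names (`Span`, `downsample`, `occluded`, `frame`) are made up for illustration.

```rust
/// A toy "mesh": a horizontal run of pixels at a constant depth (smaller = nearer).
#[derive(Clone, Copy)]
struct Span {
    x0: usize,
    x1: usize, // exclusive; must be greater than x0
    depth: f32,
}

const WIDTH: usize = 16;
const TILE: usize = 4;

/// Builds a one-level "hi-Z": the farthest depth found in each tile.
fn downsample(depth: &[f32]) -> Vec<f32> {
    depth
        .chunks(TILE)
        .map(|tile| tile.iter().copied().fold(f32::NEG_INFINITY, f32::max))
        .collect()
}

/// Conservative test: a span is occluded only if it lies behind the farthest
/// depth of every tile it touches.
fn occluded(span: &Span, hiz: &[f32]) -> bool {
    (span.x0 / TILE..=(span.x1 - 1) / TILE).all(|tile| span.depth > hiz[tile])
}

/// Writes the span into the depth buffer, keeping the nearest depth per pixel.
fn rasterize(span: &Span, depth: &mut [f32]) {
    for d in &mut depth[span.x0..span.x1] {
        *d = (*d).min(span.depth);
    }
}

/// Runs one frame. Returns the indices of the spans that were drawn and the
/// hi-Z buffer to feed into the next frame.
fn frame(spans: &[Span], prev_hiz: &[f32]) -> (Vec<usize>, Vec<f32>) {
    let mut depth = vec![f32::INFINITY; WIDTH];
    let mut drawn = Vec::new();
    let mut deferred = Vec::new();

    // 1. Early depth prepass: cull against last frame's hi-Z.
    for (i, span) in spans.iter().enumerate() {
        if occluded(span, prev_hiz) {
            deferred.push(i);
        } else {
            rasterize(span, &mut depth);
            drawn.push(i);
        }
    }

    // 2. Early depth downsample: hi-Z for the current view.
    let hiz = downsample(&depth);

    // 3. Late depth prepass: retest everything the early pass skipped.
    for i in deferred {
        if !occluded(&spans[i], &hiz) {
            rasterize(&spans[i], &mut depth);
            drawn.push(i);
        }
    }

    // 4. Late depth downsample: hi-Z handed to the next frame.
    (drawn, downsample(&depth))
}
```

In this model, a span hidden behind a wall is skipped once the previous frame's hi-Z contains the wall, and it is picked up again by the late pass in the same frame in which the wall moves away, which is the disocclusion case the late phase exists to handle.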

Note that this patch has no effect on the per-mesh CPU overhead for occluded objects, which remains high for a GPU-driven renderer due to the lack of cold-specialization and retained bins. If cold-specialization and retained bins weren't on the horizon, then a more traditional approach like potentially visible sets (PVS) or low-res CPU rendering would probably be more efficient than the GPU-driven approach that this patch implements for most scenes. However, at this point the amount of effort required to implement a PVS baking tool or a low-res CPU renderer would probably be greater than landing cold-specialization and retained bins, and the GPU-driven approach is the more modern one anyway. It does mean that the performance improvements from occlusion culling as implemented in this patch today are likely to be limited, because of the high CPU overhead for occluded meshes.

Note also that this patch currently doesn't implement occlusion culling for 2D objects or shadow maps. Those can be addressed in a follow-up. Additionally, note that the techniques in this patch require compute shaders, which excludes support for WebGL 2.

This PR is marked experimental because of known precision issues with the downsampling approach when applied to non-power-of-two framebuffer sizes (i.e. most of them). These precision issues can, in rare cases, cause objects to be judged occluded that in fact are not. (I've never seen this in practice, but I know it's possible; it's more likely to happen with small meshes.) As a follow-up to this patch, we'd like to switch to the SPD-based hi-Z buffer shader from the Granite engine, which doesn't suffer from these problems, at which point we should be able to graduate this feature from experimental status. I opted not to include that rewrite in this patch for two reasons: (1) @JMS55 is planning on doing the rewrite to coincide with the new availability of image atomic operations in Naga; (2) to reduce the scope of this patch.
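
To make the non-power-of-two concern concrete, here is a sketch of one common way a 2:1 depth reduction can become non-conservative on odd sizes. It is an illustration under assumptions, not necessarily the exact failure mode in this PR's shader: the buffer is a 1D row, larger values mean farther away, and both function names are hypothetical.

```rust
/// Naive 2:1 reduction keeping the farthest depth of each pair.
/// With an odd number of texels, the trailing texel is never read.
fn downsample_naive(src: &[f32]) -> Vec<f32> {
    src.chunks_exact(2).map(|pair| pair[0].max(pair[1])).collect()
}

/// Conservative variant: the last coarse texel also absorbs the leftover
/// source texel, so no depth sample is ignored.
fn downsample_conservative(src: &[f32]) -> Vec<f32> {
    let mut out = downsample_naive(src);
    if src.len() % 2 == 1 {
        let leftover = src[src.len() - 1];
        match out.last_mut() {
            Some(last) => *last = last.max(leftover),
            None => out.push(leftover),
        }
    }
    out
}
```

With a five-texel row whose last texel is a distant gap, the naive reduction reports only near depths, so an object visible through that gap would be judged occluded, while the conservative reduction keeps the far value and the object survives the test.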

A new example, occlusion_culling, has been added. It demonstrates objects becoming quickly occluded and disoccluded by dynamic geometry and shows the number of objects that are actually being rendered. Also, a new --occlusion-culling switch has been added to scene_viewer, in order to make it easy to test this patch with large scenes like Bistro.

Migration guide

When enqueuing a custom mesh pipeline, work item buffers are now created with bevy::render::batching::gpu_preprocessing::get_or_create_work_item_buffer, not PreprocessWorkItemBuffers::new. See the specialized_mesh_pipeline example.

Showcase

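Occlusion culling example:

![Screenshot 2025-01-15 175051](https://github.com/user-attachments/assets/1544f301-68a3-45f8-84a6-7af3ad431258)

Bistro zoomed out, before occlusion culling:

![Screenshot 2025-01-16 185425](https://github.com/user-attachments/assets/5114bbdf-5dec-4de9-b17e-7aa77e7b61ed)

Bistro zoomed out, after occlusion culling:

![Screenshot 2025-01-16 184949](https://github.com/user-attachments/assets/9dd67713-656c-4276-9768-6d261ca94300)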

In this scene, occlusion culling reduces the number of meshes Bevy has to render from 1591 to 585.

Links:

[*two-phase occlusion culling*]: https://medium.com/@mil_kru/two-pass-occlusion-culling-4100edcad501
[Aaltonen SIGGRAPH 2015]: https://www.advances.realtimerendering.com/s2015/aaltonenhaar_siggraph2015_combined_final_footer_220dpi.pdf
[Some literature]: https://gist.github.com/reduz/c5769d0e705d8ab7ac187d63be0099b5?permalink_comment_id=5040452#gistcomment-5040452
[SPD-based hi-Z buffer shader from the Granite engine]: https://github.com/Themaister/Granite/blob/master/assets/shaders/post/hiz.comp

bushrat011899

Code looks good, just a minor comment around the experimental module and marking it as doc(hidden) for semver reasons. I unfortunately couldn't get the new occlusion_culling example to run on my laptop (Intel i5-1240p iGPU, Windows 10) with either the DX12 or the Vulkan backend.

bushrat011899 · 2025-01-17T04:01:38Z

crates/bevy_core_pipeline/src/experimental/mip_generation/downsample_depth.wgsl

+#endif  // MULTISAMPLE
+#endif  // MESHLET
+#endif  // MESHLET_VISIBILITY_BUFFER_RASTER_PASS_OUTPUT


I am reminded of how spoiled I am getting to just write Rust.

bushrat011899 · 2025-01-17T04:04:46Z

crates/bevy_core_pipeline/src/lib.rs

@@ -14,6 +14,7 @@ pub mod core_2d;
 pub mod core_3d;
 pub mod deferred;
 pub mod dof;
+pub mod experimental;


Might be good to annotate this #[doc(hidden)]. This makes it sem-ver compatible to include breaking changes in this module.

pcwalton (Jan 17, 2025): Do we care about semver compatibility here though if we aren't shipping this in a point release? My concern about #[doc(hidden)] is that it makes the feature less discoverable, and we want testing on it as it's the kind of thing that could have a lot of bugs.

bushrat011899 (Jan 17, 2025): Fair point! While Bevy is pre-1.0 it's probably not important anyway, since every release is a breaking release.

bushrat011899

Can confirm the example now runs on my i5-1240p. In the DX12 backend it says my platform doesn't support occlusion culling, but runs the example fine otherwise. On Vulkan it works as expected, culling approximately 30 meshes. Nice work!

tychedelia

Seeing a panic on my M2 MBP:

2025-01-20T01:09:54.639116Z ERROR wgpu::backend::wgpu_core: Handling wgpu errors as fatal by default
thread 'Compute Task Pool (4)' panicked at /Users/char/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/wgpu-23.0.1/src/backend/wgpu_core.rs:996:18:
wgpu error: Validation Error

Caused by:
  In Device::create_bind_group, label = 'preprocess_late_indexed_gpu_occlusion_culling_bind_group'
    Buffer offset 320 does not respect device's requested `min_storage_buffer_offset_alignment` limit 256


note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Encountered a panic in system `bevy_pbr::render::gpu_preprocess::prepare_preprocess_bind_groups`!
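
The validation error above says that offset 320 is not a multiple of the device's 256-byte `min_storage_buffer_offset_alignment`. As a minimal sketch of the arithmetic involved (not the fix that landed in this PR, and with hypothetical helper names), an offset can be checked and rounded up to the next aligned value like this:

```rust
/// Returns true if `offset` is a multiple of `alignment`.
fn is_aligned(offset: u64, alignment: u64) -> bool {
    offset % alignment == 0
}

/// Rounds `offset` up to the next multiple of `alignment`.
fn align_up(offset: u64, alignment: u64) -> u64 {
    assert!(alignment > 0, "alignment must be non-zero");
    (offset + alignment - 1) / alignment * alignment
}
```

For the values in the log, `is_aligned(320, 256)` is false and `align_up(320, 256)` gives 512, so a binding that starts at byte 320 would have to move to byte 512 (or be bound differently) to pass validation on that device.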

tychedelia · 2025-01-20T00:42:10Z

crates/bevy_core_pipeline/src/experimental/mip_generation/mod.rs

+                texture_storage_2d(TextureFormat::R32Float, StorageTextureAccess::WriteOnly),
+                texture_storage_2d(TextureFormat::R32Float, StorageTextureAccess::WriteOnly),
+                texture_storage_2d(TextureFormat::R32Float, StorageTextureAccess::WriteOnly),
+                texture_storage_2d(TextureFormat::R32Float, StorageTextureAccess::ReadWrite),


Is there a reason this one is marked ReadWrite?

pcwalton (Jan 21, 2025): We call textureStore on it. See mip_6 in downsample_depth.wgsl.

tychedelia (Jan 22, 2025): Ah yup, I see it's the handoff point between first and second.

pcwalton · 2025-01-22T20:50:54Z

Bevy example runner output looks good; all the changed references seem to be false positives.

alice-i-cecile · 2025-01-26T22:50:57Z

Doing a final example runner test and then merging.


… change. (#17688) This patch fixes a bug whereby we're re-extracting every mesh every frame. It's a regression from PR #17413. The code in question has actually been in the tree with this bug for quite a while; it's just that the code didn't actually run unless the renderer considered the previous view transforms necessary. Occlusion culling expanded the set of circumstances under which Bevy computes the previous view transforms, causing this bug to appear more often. This patch fixes the issue by checking to see if the previous transform of a mesh actually differs from the current transform before copying the current transform to the previous transform.
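
The fix described above boils down to "write only when the value differs", so that an unconditional copy does not flag unchanged data as changed. The following is a sketch of that pattern with a hypothetical `Tracked` wrapper; it is not the code from #17688 and does not use Bevy's actual change-detection types.

```rust
/// A value plus a flag recording whether it was written this frame.
struct Tracked<T> {
    value: T,
    changed: bool,
}

impl<T: PartialEq> Tracked<T> {
    /// Unconditional write: always marks the value as changed,
    /// even when the new value equals the old one.
    fn set(&mut self, new_value: T) {
        self.value = new_value;
        self.changed = true;
    }

    /// Writes only if the new value differs. Returns true if a write happened.
    fn set_if_changed(&mut self, new_value: T) -> bool {
        if self.value == new_value {
            return false;
        }
        self.set(new_value);
        true
    }
}
```

Copying the current transform into the previous one with `set` would mark every mesh as changed every frame, while `set_if_changed` leaves static meshes untouched, which is the behaviour the commit message describes.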

pcwalton force-pushed the occlusion-culling-4 branch from 7cd3abd to fd03dd0 Compare January 17, 2025 03:04

pcwalton requested review from JMS55, atlv24, bushrat011899 and IceSentry January 17, 2025 03:04

pcwalton added A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Jan 17, 2025

pcwalton added this to the 0.16 milestone Jan 17, 2025

pcwalton force-pushed the occlusion-culling-4 branch 5 times, most recently from f2c4a5e to 357d4ad Compare January 17, 2025 04:24

pcwalton force-pushed the occlusion-culling-4 branch from 357d4ad to 6aec99d Compare January 17, 2025 05:21

bushrat011899 reviewed Jan 17, 2025

pcwalton added 4 commits January 16, 2025 21:55

Doc check police

34f693d

Widen the LatePreprocessWorkItemIndirectParameters to 64 bytes to work around Intel Iris Xe restrictions

b3bd9c8

Add some missing docs

6ff7c04

Fix DX12

75ee16e

pcwalton requested a review from bushrat011899 January 17, 2025 21:01

Internal import police

fc068fb

bushrat011899 approved these changes Jan 17, 2025

pcwalton self-assigned this Jan 18, 2025

BenjaminBrienen added D-Complex Quite challenging from either a design or technical perspective. Ask for help! D-Shaders This code uses GPU shader languages labels Jan 19, 2025

tychedelia reviewed Jan 20, 2025

JMS55 reviewed Jan 20, 2025

crates/bevy_pbr/src/meshlet/visibility_buffer_raster_node.rs (outdated, resolved)

JMS55 reviewed Jan 20, 2025

crates/bevy_pbr/src/prepass/mod.rs (outdated, resolved)

Address review comment

d08b195

github-actions bot mentioned this pull request Jan 22, 2025

17413 TheBevyFlock/bevy-example-runner#91

Closed

pcwalton added 8 commits January 22, 2025 15:31

Set the push constant offset for the late mesh preprocessing phase too. Fixes a regression that was causing meshes to flicker sometimes.

77bfc6a

Merge remote-tracking branch 'origin/main' into occlusion-culling-4

795ddc4

Merge remote-tracking branch 'origin/main' into occlusion-culling-4

129f12d

Warning police

c1e5053

Merge remote-tracking branch 'origin/main' into occlusion-culling-4

64cb9b7

Update for Bevy changes

c5df9c8

Merge remote-tracking branch 'origin/main' into occlusion-culling-4

a57d2a5

Merge remote-tracking branch 'origin/main' into occlusion-culling-4

9d16350

alice-i-cecile added the M-Needs-Release-Note Work that should be called out in the blog due to impact label Jan 26, 2025

github-actions bot mentioned this pull request Jan 26, 2025

17413 TheBevyFlock/bevy-example-runner#97

Closed

alice-i-cecile added this pull request to the merge queue Jan 27, 2025

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 27, 2025

pcwalton added 4 commits January 26, 2025 18:17

Merge remote-tracking branch 'origin/main' into occlusion-culling-4

76e4f8a

Fix DX12

f025875

Fix WebGL 2 by moving the depth downsample pipeline creation out of a `FromWorld` implementation. This allows the creation of the pipeline to gracefully fail if the current platform doesn't support compute shaders.

dd8e93b

Merge remote-tracking branch 'origin/main' into occlusion-culling-4

6911ddf

github-actions bot mentioned this pull request Jan 27, 2025

17413 TheBevyFlock/bevy-example-runner#99

Closed

alice-i-cecile added this pull request to the merge queue Jan 27, 2025

Merged via the queue into bevyengine:main with commit dda9788 Jan 27, 2025
32 checks passed

mockersf mentioned this pull request Jan 28, 2025

GPU 2-phase occlusion broke rendering on some Android devices #17591

Open

pcwalton mentioned this pull request Feb 5, 2025

Don't mark a previous mesh transform as changed if it didn't actually change. #17688

Merged
