Add rs_allocate_closure free function. #1944

Open · wants to merge 12 commits into base: main
Conversation

curtisblack (Contributor)

Description

Adds a new free function, rs_allocate_closure, which allows the renderer services to provide memory for closure storage, for both built-in (add/mul) and user-defined closures.

The existing osl_* closure handling functions now use this new free function.

The CPU-side fallback calls through to the existing closure pool implementation. The GPU-side fallback returns null.

testshade/testrender provide an example implementation showing how to build a stack-based closure pool for the GPU.
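
For reference, a minimal sketch of the shape of the fallback described above, assuming the function is exported with C linkage and that OSL/oslexec.h provides the OSL types used; the helper osl_pool_allocate in the CPU branch is purely hypothetical, standing in for however the implementation reaches the existing closure pool:

#include <cstddef>
#include <OSL/oslexec.h>

// Hypothetical stand-in for however the fallback reaches the existing
// closure pool on the CPU side.
extern "C" void*
osl_pool_allocate(OSL::OpaqueExecContextPtr ec, size_t size, size_t alignment);

extern "C" void*
rs_allocate_closure(OSL::OpaqueExecContextPtr exec_ctx, size_t size,
                    size_t alignment)
{
#ifndef __CUDA_ARCH__
    // CPU-side fallback: route to the existing closure pool.
    return osl_pool_allocate(exec_ctx, size, alignment);
#else
    // GPU-side fallback: there is no default pool, so return null and
    // rely on the renderer overriding this free function.
    return nullptr;
#endif
}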

Tests

Checklist:

  • I have read the contribution guidelines.
  • I have updated the documentation, if applicable.
  • I have ensured that the change is tested somewhere in the testsuite (adding new test cases if necessary).
  • My code follows the prevailing code style of this project. If I haven't
    already run clang-format v17 before submitting, I definitely will look at
    the CI test that runs clang-format and fix anything that it highlights as
    being nonconforming.

Signed-off-by: Curtis Black <[email protected]>
rs_allocate_closure(OSL::OpaqueExecContextPtr exec_ctx, size_t size,
                    size_t alignment)
{
#ifndef __CUDA_ARCH__
Contributor:

As this is rs_fallback.cpp, it should never be compiled by CUDA; can we remove the #ifndef __CUDA_ARCH__?

Collaborator:

Everything in rs_fallback is done this way. It is compiled by CUDA, is it not?

Contributor:

It is not. rs_fallback.cpp only appears in the set (lib_src ...) source list, which is compiled only for the host; it turns around and just calls the virtual functions of RendererServices (which exist only on the host), and has no way to be customized for other targets. It is really more of a rs_legacy_adapter.cpp.

auto sg = (OSL::ShaderGlobals*)ec;
// Round the current stack pointer up to the requested alignment.
uintptr_t ptr = OIIO::round_to_multiple_of_pow2((uintptr_t)sg->renderstate,
                                                alignment);
// Bump the stack pointer past this allocation.
sg->renderstate = (void*)(ptr + size);
Contributor:

Suggest making use of the testshade/render_state.h RenderState object and adding a stack_buffer data member. Although I think RenderState is uniform across all shades, so we might need a per-thread renderstate in the ExecContext to handle this properly.

Contributor:

Maybe something like:

auto* pst = get_rs_per_shade<TestShadePerShadeState>(ec);
pst->stackbuf = (void*)(ptr + size);

This would require another pointer in ShaderGlobals (but hopefully we can remove some in the future). But then a renderer could have its own PerShadeState to implement some of this.
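
For illustration, the pieces that sketch assumes might look roughly like the following; both TestShadePerShadeState and get_rs_per_shade are hypothetical names taken from the comment above, and stashing the state behind the existing renderstate pointer is only for the sketch (as noted, a dedicated ShaderGlobals pointer would really be needed):

#include <OSL/oslexec.h>

// Hypothetical per-shade state for testshade.
struct TestShadePerShadeState {
    void* stackbuf = nullptr;  // current top of the closure stack
};

// Hypothetical accessor: recover the renderer's per-shade state from the
// opaque execution context. This sketch reuses the renderstate pointer;
// a dedicated ShaderGlobals member would go here instead.
template<typename T>
T*
get_rs_per_shade(OSL::OpaqueExecContextPtr ec)
{
    auto sg = (OSL::ShaderGlobals*)ec;
    return (T*)sg->renderstate;
}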

Signed-off-by: Curtis Black <[email protected]>
}
};

struct RenderState {
Contributor:

Ok, so with this you are defining "RenderState" to be PerShade, and introducing RenderContext to be for all threads.
And arguably ShaderGlobals always was PerShade. So perhaps that is fine for terminology.

Stepping back a minute, if I were to want a kernel to execute 100,000 shades, when I launch that kernel I don't have a PerShade RenderState to pass, only a RenderContext. That kernel would then use the RenderContext and other kernel arguments to build a RenderState to pass into the JIT'd shader.

Ultimately I think we were looking to get the JIT to produce that ShaderGroup-specific kernel_adapter interacting with rs_* functions, like rs_populate_shader_globals(...), that could be launched on the GPU (passing RenderContext, start index, end index), and we end up with a bit of a chicken-and-egg problem between the RenderState and RenderContext: the OSL JIT doesn't know the concrete type/size of the PerThread/PerShade RenderState object to create on the stack, and the renderer doesn't know the JIT'd shader function (just its prototype). Ideally we want them all in the same llvm module so it can all get inlined/optimized together.

But perhaps this is simpler: the final renderer would have to craft a kernel that sets up a RenderState and calls the JIT'd shader function (not sure how it got it). Or for the CPU side, the renderer would have to do its own loop over shades, configuring/reusing a RenderState. Although in this example, it is really PerThread state vs. PerShade.

So for state/context we have:

  1. All Threads
  2. Per Thread
  3. Per Shade (per-shade state could be a subset/section of PerThread state that is updated per shade).

I guess it is a renderer-provided custom ShadingContext with rs free functions to interact with (which makes sense, as we are trying to remove/replace the CPU-side ShadingContext by asking the renderer to take on its responsibilities).

As we have a rs_bitcode module to integrate, the OSL JIT could just directly use renderer-provided types and alloca a RenderState on the stack, then call a rs_init(OpaqueExecutionContext* oec, RenderState*) function to let the renderer populate it.
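
To make the shape of that last idea concrete, here is a purely illustrative C++ rendering of the kernel adapter such a JIT might emit; RenderContext, RenderState, rs_init, and run_shader are all assumed names, and the real thing would be generated LLVM IR rather than hand-written C++:

#include <OSL/oslexec.h>

struct RenderContext { /* all-threads state, provided at kernel launch */ };

// Hypothetical per-shade state with a renderer-defined layout.
struct RenderState {
    void* closure_stack = nullptr;
};

// Hypothetical renderer hook: populate the per-shade state (the
// rs_init(...) idea from the comment above).
extern "C" void
rs_init(OSL::OpaqueExecContextPtr /*oec*/, RenderState* rs)
{
    rs->closure_stack = nullptr;  // e.g. carve out a slice of a pool here
}

// Stand-in for the JIT'd shader-group entry point.
extern "C" void
run_shader(OSL::OpaqueExecContextPtr /*oec*/)
{
}

// ShaderGroup-specific adapter the JIT could produce, launched with
// (RenderContext, start index, end index) as described above.
void
kernel_adapter(RenderContext* /*ctx*/, int start, int end)
{
    for (int i = start; i < end; ++i) {
        RenderState rs;            // alloca'd per-shade state
        OSL::ShaderGlobals sg {};  // would be populated from ctx and i
        sg.renderstate = &rs;
        auto oec = (OSL::OpaqueExecContextPtr)&sg;
        rs_init(oec, &rs);  // let the renderer fill in per-shade state
        run_shader(oec);    // invoke the JIT'd shader function
    }
}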
