
softmax with EVT (draft) #177

Open
wants to merge 1 commit into base: sycl-develop

Conversation

jiyang1011
Collaborator

No description provided.

examples/sycl/pvc/pvc_gemm_with_epilogue_softmax.cpp
size_t workspace_size = Gemm::get_workspace_size(arguments);
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);

gemm_op.can_implement(arguments);
Collaborator

Calling can_implement without checking and acting on the result is pointless.

}
}
auto synchronize = [&] () {};
cst_callbacks.reduce(nullptr, synchronize, 0, 0, true, trD);
Collaborator

Looking at the Nvidia implementation, it looks like the reduce call should be inside the epi_n/epi_m loops, and epi_n/epi_m should be passed to the call instead of the zeroes.

Collaborator Author

This version is a draft. I do know epi_n / epi_m should be passed to the call, but the performance will not be good.

Collaborator

This will break all other implementations that use reduce.

Collaborator

Why won’t the performance be good? Is it because we need to check for the last iteration? If so, the compiler should be able to optimise that since all the information is known at compile time.

include/cutlass/epilogue/fusion/xe_vistor_softmax.hpp
constexpr auto dim1 = decltype(size<1>(visit_results))::value;
constexpr auto dim2 = decltype(size<2>(visit_results))::value;

auto t1 = make_tensor(static_cast<decltype(visit_results) &&>(visit_results).data(),
Collaborator

Can you use better naming here?

Comment on lines 446 to 447
constexpr auto m0 = decltype(size<0>(t1))::value;
constexpr auto m1 = decltype(size<1>(t1))::value;
Collaborator

Are these two dimensions meant to define the reduce dimension and the non-reduce dimension?


auto smem = syclcompat::local_mem<float[Sg_Nums * vec_size]>();

auto t =
Collaborator
@mehdi-goli mehdi-goli Jan 13, 2025

Is t a slice/reshape of t1? And can we have better naming here?
Also, according to your example t1 is 16x2, but your t holds the 16x1 slice of the first row of M. So what happened to the next row of N there?


CUTLASS_PRAGMA_UNROLL
for (int loop = 0; loop < loop_cnt; loop++) {
auto loop_t = t(_, loop, _);
Collaborator

Same here: it would help if we had better naming.


template <uint32_t sg_num, class mem_t, class RTensor>
CUTLASS_DEVICE
auto group_reduce_max1(mem_t smem, RTensor const &t, float *out) {
Collaborator

What does max1 mean?

//

bool verify(const ProblemShapeType& problem_size, ElementCompute alpha, ElementCompute beta) {
return true;
Collaborator

This should have a TODO comment.

);

syclcompat::wait();
#define IDX (l * M * N + i * N + j)
Collaborator

No macros for computation (use a function or lambda if you want it inlined).

syclcompat::wait();
double hbm =
L *
(M * K * sizeof(ElementA) + K * N * sizeof(ElementB) +
Collaborator

Inconsistent: on line 329 we use options.m.

#include "cutlass/workspace.h"

#include "cute/tensor.hpp"
#include "sm90_visitor_tma_warpspecialized.hpp"
Collaborator

Why is that include needed?

inline x { assert(false); }
#endif

SYCL_DEVICE_OCL(float sub_group_reduce_add(float i));
Collaborator

Why is this using the OCL function?

#undef EXP
#undef DIV

#define MAX sycl::max
Collaborator

Suggested change
- #define MAX sycl::max
+ using sycl::max;


template<uint32_t sg_num, uint32_t N, class mem_t>
CUTLASS_DEVICE
void work_group_reduce_max(mem_t &mem, float* vec) {
Collaborator

This is identical to work_group_reduce_sum. Please template on the op and avoid the duplication.

}
}

work_group_reduce_sum<sg_per_wg_n, N, decltype(slm_base)>(slm_base, out);
Collaborator

This seems like an odd way to split this into two functions, because writing the data to SLM is an essential part of the work-group reduce. I suggest moving it into that function.

auto base = sg_local_id * N * step + sg_group_id_n;
auto local_max = mem[base];

if (sg_group_id_n < N) {
Collaborator

Is this if-statement the only reason for the if-statement on line 175? If so, why not:

Suggested change
- if (sg_group_id_n < N) {
+ if (sg_num <= N || sg_group_id_n < N) {

Because of short-circuiting, the second condition isn't evaluated when it isn't needed, and that way you don't need to duplicate the if/else branches starting on line 175.

template<class STensor, class SyncFn, class VTensor>
CUTLASS_DEVICE void
reduce(STensor&& smem_buffer, SyncFn const& sync_fn, int epi_m, int epi_n, bool is_last_iteration, VTensor visit_results) {
constexpr auto dim0 = decltype(size<0>(visit_results))::value;
Collaborator

Why not just:

Suggested change
- constexpr auto dim0 = decltype(size<0>(visit_results))::value;
+ constexpr auto dim0 = size<0>(visit_results);
