liger-kernel tests fail on XPU with triton-xpu #3237

Open
faaany opened this issue Jan 23, 2025 · 1 comment
Labels: bug, community, tests: e2e

faaany commented Jan 23, 2025

Describe the bug

While running the liger-kernel unit tests on XPU with triton-xpu, the following tests fail:

========================================================================== short test summary info ===========================================================================
FAILED test/transformers/test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-True-mean-1.0-dtype1-1e-05-0.0005-8-128-1024-4096] - AssertionError: Number of mismatched elements: 969414
FAILED test/transformers/test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-True-sum-1.0-dtype3-0.001-0.05-8-128-1024-4096] - AssertionError: Number of mismatched elements: 978905
FAILED test/transformers/test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-True-none-1.0-dtype5-0.001-0.05-8-128-1024-4096] - AssertionError: Number of mismatched elements: 1147311
FAILED test/transformers/test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-False-mean-1.0-dtype1-1e-05-0.0005-8-128-1024-4096] - AssertionError: Number of mismatched elements: 950104
FAILED test/transformers/test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-False-sum-1.0-dtype3-0.001-0.05-8-128-1024-4096] - AssertionError: Number of mismatched elements: 1207229
FAILED test/transformers/test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-False-none-1.0-dtype5-0.001-0.05-8-128-1024-4096] - AssertionError: Number of mismatched elements: 1284423
FAILED test/transformers/test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-True-mean-1.0-dtype1-1e-05-0.0005-8-128-1024-4096] - AssertionError: Number of mismatched elements: 916363
FAILED test/transformers/test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-True-sum-1.0-dtype3-0.001-0.05-8-128-1024-4096] - AssertionError: Number of mismatched elements: 1259389
FAILED test/transformers/test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-True-none-1.0-dtype5-0.001-0.05-8-128-1024-4096] - AssertionError: Number of mismatched elements: 1343092
FAILED test/transformers/test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-False-mean-1.0-dtype1-1e-05-0.0005-8-128-1024-4096] - AssertionError: Number of mismatched elements: 907703
FAILED test/transformers/test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-False-sum-1.0-dtype3-0.001-0.05-8-128-1024-4096] - AssertionError: Number of mismatched elements: 1228503
FAILED test/transformers/test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-False-none-1.0-dtype5-0.001-0.05-8-128-1024-4096] - AssertionError: Number of mismatched elements: 1284332
FAILED test/transformers/test_fused_linear_cross_entropy.py::test_amp[True-cast_dtype0-0.005-0.05-8-128-1024-4096] - AssertionError: Number of mismatched elements: 24
FAILED test/transformers/test_fused_linear_cross_entropy.py::test_amp[True-cast_dtype1-0.005-0.05-8-128-1024-4096] - AssertionError: Number of mismatched elements: 52
FAILED test/transformers/test_fused_linear_cross_entropy.py::test_amp[False-cast_dtype2-0.005-0.05-8-128-1024-4096] - AssertionError: Number of mismatched elements: 56
FAILED test/transformers/test_fused_linear_cross_entropy.py::test_amp[False-cast_dtype3-0.005-0.05-8-128-1024-4096] - AssertionError: Number of mismatched elements: 23

The tests fail because the gradients of the linear weight don't match the reference after computing the cross-entropy loss.
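For readers unfamiliar with the "Number of mismatched elements" message, here is a minimal sketch of how such a count could be produced. It assumes an allclose-style criterion, |actual - expected| <= atol + rtol * |expected|, and assumes that the two numbers after the dtype in each test ID (for example 0.001-0.05) are atol and rtol. The function name and the sample values are hypothetical and are not taken from the liger-kernel test helpers.

```python
def count_mismatches(actual, expected, atol, rtol):
    """Count elements that violate |actual - expected| <= atol + rtol * |expected|.

    This mirrors the usual allclose-style tolerance check; it is an
    assumption that the test suite uses this exact criterion.
    """
    if len(actual) != len(expected):
        raise ValueError("inputs must have the same length")
    return sum(
        1
        for a, e in zip(actual, expected)
        if abs(a - e) > atol + rtol * abs(e)
    )


# Hypothetical gradient values, for illustration only.
reference_grad = [0.10, -0.20, 0.30, 0.00]
fused_grad = [0.10, -0.20, 0.35, 0.01]

mismatched = count_mismatches(fused_grad, reference_grad, atol=1e-3, rtol=5e-2)
print(f"Number of mismatched elements: {mismatched}")
```

Under this reading, the test_correctness failures (roughly a million mismatches each) and the test_amp failures (a few dozen each) differ by several orders of magnitude in how many elements fall outside the tolerance.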

In addition, two convergence tests fail:

========================================================================================================================= short test summary info ==========================================================================================================================
FAILED test/convergence/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_qwen2_vl-32-0.0001-dtype0-1e-08-1e-05-0.005-1e-05-0.005-1e-05] - AssertionError: Number of mismatched elements: 1
FAILED test/convergence/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_qwen2_vl-32-0.0001-dtype1-0.001-0.01-0.1-0.01-0.01-0.01] - AssertionError: Number of mismatched elements: 1
============================================================================================================ 2 failed, 2 passed, 9 warnings in 86.61s (0:01:26) ============================================================================================================
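To make it easier to compare failure lists between machines, the summaries above can be reduced to a mapping from test ID to mismatch count. The sketch below is only an illustration: the function name is hypothetical, and it assumes every FAILED line ends with the integer count, as the lines above do.

```python
import re

# Matches "FAILED <test id> - <message ending in an integer>".
FAILED_LINE = re.compile(r"^FAILED\s+(\S+)\s+-\s+.*?(\d+)\s*$")


def parse_failures(summary_text):
    """Return {test_id: mismatched_element_count} for each FAILED line."""
    failures = {}
    for line in summary_text.splitlines():
        match = FAILED_LINE.match(line.strip())
        if match:
            test_id, count = match.groups()
            failures[test_id] = int(count)
    return failures


# Two lines copied from the summaries above.
sample = """\
FAILED test/transformers/test_fused_linear_cross_entropy.py::test_amp[True-cast_dtype0-0.005-0.05-8-128-1024-4096] - AssertionError: Number of mismatched elements: 24
FAILED test/convergence/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_qwen2_vl-32-0.0001-dtype0-1e-08-1e-05-0.005-1e-05-0.005-1e-05] - AssertionError: Number of mismatched elements: 1
"""

failures = parse_failures(sample)
for test_id, count in failures.items():
    print(count, test_id)
```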

Steps to reproduce

git clone https://github.com/faaany/Liger-Kernel.git && cd Liger-Kernel 
pip install -e .[dev] --extra-index-url https://download.pytorch.org/whl/test/xpu
make test 
make test-convergence

Environment details

pytorch-triton-xpu 3.2.0
torch 2.6.0+xpu
liger_kernel 0.5.2
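Since the results may depend on the exact package versions, the list above can be regenerated on another machine with a small standard-library script. This is a sketch: the function name is hypothetical, and the distribution names are assumed to match the ones listed above, which may differ from what pip reports on a given install.

```python
from importlib import metadata


def collect_versions(package_names):
    """Return {name: installed version, or 'not installed' if missing}."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


# Distribution names taken from the list above (assumed spelling).
PACKAGES = ["pytorch-triton-xpu", "torch", "liger_kernel"]

for name, version in collect_versions(PACKAGES).items():
    print(f"{name} {version}")
```

Posting this output from both a failing and a passing setup would show at a glance whether the two environments differ.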

faaany added the bug label Jan 23, 2025
alexbaden self-assigned this Jan 23, 2025
vlad-penkin added this to the 4.6 [Performance] E2E milestone Jan 27, 2025

mgrabban commented Jan 28, 2025

@faaany I could not reproduce your failures; all tests pass for me:

test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-True-mean-1.0-dtype0-0.005-0.05-8-128-1024-4096] PASSED                                         [  1%]
test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-True-mean-1.0-dtype0-0.005-0.05-4-47-31-123] PASSED                                             [  2%]
test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-True-mean-1.0-dtype1-1e-05-0.0005-8-128-1024-4096] PASSED                                       [  4%]
test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-True-mean-1.0-dtype1-1e-05-0.0005-4-47-31-123] PASSED                                           [  5%]
test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-True-sum-1.0-dtype2-5.0-50.0-8-128-1024-4096] PASSED                                            [  6%]
test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-True-sum-1.0-dtype2-5.0-50.0-4-47-31-123] PASSED                                                [  8%]
test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-True-sum-1.0-dtype3-0.001-0.05-8-128-1024-4096] PASSED                                          [  9%]
test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-True-sum-1.0-dtype3-0.001-0.05-4-47-31-123] PASSED                                              [ 11%]
test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-True-none-1.0-dtype4-5.0-50.0-8-128-1024-4096] PASSED                                           [ 12%]
test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-True-none-1.0-dtype4-5.0-50.0-4-47-31-123] PASSED                                               [ 13%]
test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-True-none-1.0-dtype5-0.001-0.05-8-128-1024-4096] PASSED                                         [ 15%]
test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-True-none-1.0-dtype5-0.001-0.05-4-47-31-123] PASSED                                             [ 16%]
test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-False-mean-1.0-dtype0-0.005-0.05-8-128-1024-4096] PASSED                                        [ 18%]
test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-False-mean-1.0-dtype0-0.005-0.05-4-47-31-123] PASSED                                            [ 19%]
test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-False-mean-1.0-dtype1-1e-05-0.0005-8-128-1024-4096] PASSED                                      [ 20%]
test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-False-mean-1.0-dtype1-1e-05-0.0005-4-47-31-123] PASSED                                          [ 22%]
test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-False-sum-1.0-dtype2-5.0-50.0-8-128-1024-4096] PASSED                                           [ 23%]
test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-False-sum-1.0-dtype2-5.0-50.0-4-47-31-123] PASSED                                               [ 25%]
test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-False-sum-1.0-dtype3-0.001-0.05-8-128-1024-4096] PASSED                                         [ 26%]
test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-False-sum-1.0-dtype3-0.001-0.05-4-47-31-123] PASSED                                             [ 27%]
test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-False-none-1.0-dtype4-5.0-50.0-8-128-1024-4096] PASSED                                          [ 29%]
test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-False-none-1.0-dtype4-5.0-50.0-4-47-31-123] PASSED                                              [ 30%]
test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-False-none-1.0-dtype5-0.001-0.05-8-128-1024-4096] PASSED                                        [ 31%]
test_fused_linear_cross_entropy.py::test_correctness[False-0--100-0-None-False-False-none-1.0-dtype5-0.001-0.05-4-47-31-123] PASSED                                            [ 33%]
test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-True-mean-1.0-dtype0-0.005-0.05-8-128-1024-4096] PASSED                                      [ 34%]
test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-True-mean-1.0-dtype0-0.005-0.05-4-47-31-123] PASSED                                          [ 36%]
test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-True-mean-1.0-dtype1-1e-05-0.0005-8-128-1024-4096] PASSED                                    [ 37%]
test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-True-mean-1.0-dtype1-1e-05-0.0005-4-47-31-123] PASSED                                        [ 38%]
test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-True-sum-1.0-dtype2-5.0-50.0-8-128-1024-4096] PASSED                                         [ 40%]
test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-True-sum-1.0-dtype2-5.0-50.0-4-47-31-123] PASSED                                             [ 41%]
test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-True-sum-1.0-dtype3-0.001-0.05-8-128-1024-4096] PASSED                                       [ 43%]
test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-True-sum-1.0-dtype3-0.001-0.05-4-47-31-123] PASSED                                           [ 44%]
test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-True-none-1.0-dtype4-5.0-50.0-8-128-1024-4096] PASSED                                        [ 45%]
test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-True-none-1.0-dtype4-5.0-50.0-4-47-31-123] PASSED                                            [ 47%]
test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-True-none-1.0-dtype5-0.001-0.05-8-128-1024-4096] PASSED                                      [ 48%]
test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-True-none-1.0-dtype5-0.001-0.05-4-47-31-123] PASSED                                          [ 50%]
test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-False-mean-1.0-dtype0-0.005-0.05-8-128-1024-4096] PASSED                                     [ 51%]
test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-False-mean-1.0-dtype0-0.005-0.05-4-47-31-123] PASSED                                         [ 52%]
test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-False-mean-1.0-dtype1-1e-05-0.0005-8-128-1024-4096] PASSED                                   [ 54%]
test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-False-mean-1.0-dtype1-1e-05-0.0005-4-47-31-123] PASSED                                       [ 55%]
test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-False-sum-1.0-dtype2-5.0-50.0-8-128-1024-4096] PASSED                                        [ 56%]
test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-False-sum-1.0-dtype2-5.0-50.0-4-47-31-123] PASSED                                            [ 58%]
test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-False-sum-1.0-dtype3-0.001-0.05-8-128-1024-4096] PASSED                                      [ 59%]
test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-False-sum-1.0-dtype3-0.001-0.05-4-47-31-123] PASSED                                          [ 61%]
test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-False-none-1.0-dtype4-5.0-50.0-8-128-1024-4096] PASSED                                       [ 62%]
test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-False-none-1.0-dtype4-5.0-50.0-4-47-31-123] PASSED                                           [ 63%]
test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-False-none-1.0-dtype5-0.001-0.05-8-128-1024-4096] PASSED                                     [ 65%]
test_fused_linear_cross_entropy.py::test_correctness[True-0.1-42-0.0001-30.0-True-False-none-1.0-dtype5-0.001-0.05-4-47-31-123] PASSED                                         [ 66%]
test_fused_linear_cross_entropy.py::test_correctness_functional[True-True-1.0-dtype0-0.005-0.05-2-2-8-8] PASSED                                                                [ 68%]
test_fused_linear_cross_entropy.py::test_correctness_functional[True-True-1.0-dtype0-0.005-0.05-9-7-41-41] PASSED                                                              [ 69%]
test_fused_linear_cross_entropy.py::test_correctness_functional[True-True-1.0-dtype1-1e-05-0.0005-2-2-8-8] PASSED                                                              [ 70%]
test_fused_linear_cross_entropy.py::test_correctness_functional[True-True-1.0-dtype1-1e-05-0.0005-9-7-41-41] PASSED                                                            [ 72%]
test_fused_linear_cross_entropy.py::test_correctness_functional[True-False-1.0-dtype0-0.005-0.05-2-2-8-8] PASSED                                                               [ 73%]
test_fused_linear_cross_entropy.py::test_correctness_functional[True-False-1.0-dtype0-0.005-0.05-9-7-41-41] PASSED                                                             [ 75%]
test_fused_linear_cross_entropy.py::test_correctness_functional[True-False-1.0-dtype1-1e-05-0.0005-2-2-8-8] PASSED                                                             [ 76%]
test_fused_linear_cross_entropy.py::test_correctness_functional[True-False-1.0-dtype1-1e-05-0.0005-9-7-41-41] PASSED                                                           [ 77%]
test_fused_linear_cross_entropy.py::test_correctness_functional[False-True-1.0-dtype0-0.005-0.05-2-2-8-8] PASSED                                                               [ 79%]
test_fused_linear_cross_entropy.py::test_correctness_functional[False-True-1.0-dtype0-0.005-0.05-9-7-41-41] PASSED                                                             [ 80%]
test_fused_linear_cross_entropy.py::test_correctness_functional[False-True-1.0-dtype1-1e-05-0.0005-2-2-8-8] PASSED                                                             [ 81%]
test_fused_linear_cross_entropy.py::test_correctness_functional[False-True-1.0-dtype1-1e-05-0.0005-9-7-41-41] PASSED                                                           [ 83%]
test_fused_linear_cross_entropy.py::test_correctness_functional[False-False-1.0-dtype0-0.005-0.05-2-2-8-8] PASSED                                                              [ 84%]
test_fused_linear_cross_entropy.py::test_correctness_functional[False-False-1.0-dtype0-0.005-0.05-9-7-41-41] PASSED                                                            [ 86%]
test_fused_linear_cross_entropy.py::test_correctness_functional[False-False-1.0-dtype1-1e-05-0.0005-2-2-8-8] PASSED                                                            [ 87%]
test_fused_linear_cross_entropy.py::test_correctness_functional[False-False-1.0-dtype1-1e-05-0.0005-9-7-41-41] PASSED                                                          [ 88%]
test_fused_linear_cross_entropy.py::test_amp[True-cast_dtype0-0.005-0.05-8-128-1024-4096] PASSED                                                                               [ 90%]
test_fused_linear_cross_entropy.py::test_amp[True-cast_dtype0-0.005-0.05-4-47-31-123] PASSED                                                                                   [ 91%]
test_fused_linear_cross_entropy.py::test_amp[True-cast_dtype1-0.005-0.05-8-128-1024-4096] PASSED                                                                               [ 93%]
test_fused_linear_cross_entropy.py::test_amp[True-cast_dtype1-0.005-0.05-4-47-31-123] PASSED                                                                                   [ 94%]
test_fused_linear_cross_entropy.py::test_amp[False-cast_dtype2-0.005-0.05-8-128-1024-4096] PASSED                                                                              [ 95%]
test_fused_linear_cross_entropy.py::test_amp[False-cast_dtype2-0.005-0.05-4-47-31-123] PASSED                                                                                  [ 97%]
test_fused_linear_cross_entropy.py::test_amp[False-cast_dtype3-0.005-0.05-8-128-1024-4096] PASSED                                                                              [ 98%]
test_fused_linear_cross_entropy.py::test_amp[False-cast_dtype3-0.005-0.05-4-47-31-123] PASSED                                                                                  [100%]

We need to figure out why they are failing for you.
