[Test] Fix and enable 10 tests disabled with PR 789 #794

Open
10 tasks
hukoyu opened this issue Aug 15, 2020 · 14 comments
Labels
p1 Medium priority

Comments

@hukoyu
Collaborator

hukoyu commented Aug 15, 2020

The tests below were disabled with PR:
https://github.com/lsds/sgx-lkl/pull/789/files

Fix the cause of each failure and re-enable the tests.
cc @KenGordon @SeanTAllen @davidchisnall @vtikoo @paulcallen

@hukoyu hukoyu added the needs-triage Bug does not yet have a priority assigned label Aug 15, 2020
@hukoyu hukoyu changed the title Fix and enable 8 tests disabled with PR 789 [Test] Fix and enable 8 tests disabled with PR 789 Aug 15, 2020
@hukoyu hukoyu changed the title [Test] Fix and enable 8 tests disabled with PR 789 [Test] Fix and enable 9 tests disabled with PR 789 Aug 15, 2020
@SeanTAllen
Contributor

SeanTAllen commented Aug 17, 2020

gettimeofday02 passing previously would appear to be happenstance related to async signal bugs. Any test that depends on the delivery of an async signal is likely to fail. gettimeofday02 should either be heavily patched or left disabled with a note to enable once async signal handling is fixed (which is a p0 issue).

The test in its current state hangs because the alarm that should stop the test isn't being delivered, which is a known open issue.

#209

Is there somewhere we want to record that gettimeofday02 should be enabled once #209 is closed?
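
For anyone reading along, the dependency on signal delivery can be sketched roughly as below. This is a minimal illustration under my own assumptions, not the LTP source: `run_until_alarm`, `on_alarm`, and `alarm_fired` are hypothetical names. The loop only ends when the SIGALRM handler runs, so if the signal is never delivered, the fallback deadline is the only thing that prevents a hang.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

/* Set by the SIGALRM handler; hypothetical name for this sketch. */
static volatile sig_atomic_t alarm_fired;

static void on_alarm(int sig)
{
    (void)sig;
    alarm_fired = 1;
}

/* Hypothetical helper: spin until SIGALRM arrives after alarm_s seconds.
 * Returns 1 if the alarm ended the loop, 0 if the fallback deadline
 * (deadline_s seconds) expired first, -1 if the handler was not installed. */
static int run_until_alarm(unsigned alarm_s, unsigned deadline_s)
{
    struct sigaction sa = {0};

    assert(deadline_s > alarm_s);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    alarm_fired = 0;
    time_t deadline = time(NULL) + (time_t)deadline_s;
    alarm(alarm_s);

    while (!alarm_fired) {
        /* Without this check, a lost signal means an endless loop. */
        if (time(NULL) >= deadline) {
            alarm(0); /* cancel the pending alarm */
            return 0;
        }
    }
    return 1;
}
```

On a system with working async signal delivery, `run_until_alarm(1, 3)` should return 1. A return of 0 would point at the alarm not being delivered.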

@SeanTAllen
Contributor

SeanTAllen commented Aug 17, 2020

mmap11 "failure" is unrelated to the functionality under test. It appears to be a deterministic shutdown hang. For me, it happens with both hw and sw modes.

Given that #788 exists and will change the shutdown sequence, I propose waiting to address the mmap11 failure until #788 is merged.

From @vtikoo:

mmap11 creates a detached pthread - https://github.com/lsds/ltp/blob/sgx-lkl/testcases/kernel/syscalls/mmap/mmap11.c#L101. There's an open p0 for fixing detached thread support: #779.

@SeanTAllen
Contributor

SeanTAllen commented Aug 17, 2020

futex_cmp_requeue01 hangs because

while (thread_cnt < tc->num_waiters) {
    sched_yield();
}

never exits.

Here's the full test:

https://github.com/lsds/ltp/blob/sgx-lkl/testcases/kernel/syscalls/futex/futex_cmp_requeue01.c
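
One way to make a loop like the one above fail fast instead of hanging is to bound it with a deadline. The following is only a sketch under assumptions: `wait_for_count` is a hypothetical helper that does not exist in LTP, and the counter is assumed to be an `atomic_int`, which may not match how `thread_cnt` is declared in the test.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <time.h>

/* Hypothetical helper, not part of LTP: yield until *counter reaches
 * target or timeout_ms milliseconds have elapsed.
 * Returns 0 once the target is reached, -1 on timeout. */
static int wait_for_count(atomic_int *counter, int target, long timeout_ms)
{
    struct timespec start, now;

    assert(counter != NULL);
    clock_gettime(CLOCK_MONOTONIC, &start);

    while (atomic_load(counter) < target) {
        clock_gettime(CLOCK_MONOTONIC, &now);
        long elapsed_ms = (long)(now.tv_sec - start.tv_sec) * 1000L
                        + (now.tv_nsec - start.tv_nsec) / 1000000L;
        if (elapsed_ms >= timeout_ms)
            return -1; /* waiters never showed up: report, don't hang */
        sched_yield();
    }
    return 0;
}
```

This does not fix whatever keeps the waiters from running. It only turns the hang into a reportable timeout, which may make the failure easier to triage.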

@SeanTAllen
Contributor

A PR has been opened to address the write05 test:

lsds/ltp#73

@lsds lsds deleted a comment from vtikoo Aug 17, 2020
@vtikoo
Contributor

vtikoo commented Aug 17, 2020

Regarding futex_cmp_requeue01, sched_yield now goes via LKL instead of directly calling lthread_yield - https://github.com/lsds/sgx-lkl-musl/pull/18/files#diff-687e538b71be7b81c2d4ddf641470487.

This could be a regression. Is this a deterministic failure?

@SeanTAllen
Contributor

SeanTAllen commented Aug 17, 2020

> Regarding futex_cmp_requeue01, sched_yield now goes via LKL instead of directly calling lthread_yield - https://github.com/lsds/sgx-lkl-musl/pull/18/files#diff-687e538b71be7b81c2d4ddf641470487.
>
> This could be a regression. Is this a deterministic failure?

@vtikoo it is for me.

@prp
Member

prp commented Aug 17, 2020

> Regarding futex_cmp_requeue01, sched_yield now goes via LKL instead of directly calling lthread_yield - https://github.com/lsds/sgx-lkl-musl/pull/18/files#diff-687e538b71be7b81c2d4ddf641470487.
>
> This could be a regression. Is this a deterministic failure?

I indeed believe that this is causing problems. I think it's one of the reasons why I see shutdown issues with DotNet here: #788 (comment)

@SeanTAllen
Contributor

getcwd04 exits early because it checks that the system has more than one CPU:

if (tst_ncpus() == 1)
    tst_brk(TCONF, "This test needs two cpus at least");

If that were fixed by removing the CPU check (I don't think it is needed given how we patched it to use threads), the test would then fail because it relies on the delivery of an asynchronous signal. It should be possible to re-enable it once #209 is fixed.

@SeanTAllen
Contributor

setresuid04 and setreuid07 are working and can be re-enabled.

@vtikoo
Contributor

vtikoo commented Aug 17, 2020

@prp there seem to be multiple ways to cause cloned host task hangups. Could you clarify whether you think the DotNet failures are specifically related to sched_yield or to cloned host task hangups in general?

@prp
Member

prp commented Aug 17, 2020

In most cases, I see a deadlock in which the termination thread fails to obtain a CPU lock for syscalls but nothing else is running. The DotNet hang is different: one of the DotNet userspace threads keeps invoking sched_yield() and making futex calls, while the termination thread is waiting for the CPU lock.

@SeanTAllen
Contributor

SeanTAllen commented Aug 17, 2020

send01 is failing because it hangs. There's a call to select in a thread that never returns. I'm not sure why it was passing previously.

A couple of things that won't work as fixes right now:

  • using pthread_cancel to cancel the thread. It currently segfaults, and pthread_cancel uses signals, so it's problematic.
  • setting a timeout on the select call. It doesn't always time out, for reasons that I haven't looked into yet.
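
For reference, the timeout behaviour one would expect from select on a conforming POSIX system can be checked in isolation with something like the sketch below. `wait_readable` is a hypothetical helper written for this illustration and is not taken from send01. Running it against a pipe inside SGX-LKL might help narrow down when the timeout is not honoured.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical helper: wait until fd is readable or timeout_ms passes.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
static int wait_readable(int fd, long timeout_ms)
{
    fd_set readfds;
    struct timeval tv;

    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* select() should return 0 once the timeout expires with no data. */
    int rc = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (rc < 0)
        return -1;
    return rc > 0 ? 1 : 0;
}
```

With an empty pipe the helper should return 0 after roughly `timeout_ms`. After a byte is written to the other end it should return 1 immediately.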

@hukoyu hukoyu changed the title [Test] Fix and enable 9 tests disabled with PR 789 [Test] Fix and enable 8 tests disabled with PR 789 Aug 17, 2020
@hukoyu
Collaborator Author

hukoyu commented Aug 17, 2020

write05 patched with lsds/ltp#73 and then with lsds/ltp#74

@hukoyu hukoyu changed the title [Test] Fix and enable 8 tests disabled with PR 789 [Test] Fix and enable 10 tests disabled with PR 789 Aug 20, 2020
@hukoyu
Collaborator Author

hukoyu commented Aug 20, 2020

Tests fstat03 and chroot03 are also failing due to the SIGSEGV (page fault) signal issue. These two were not caught before because build errors prevented the test binaries from being generated (created issue #810 to track that). Fixed the build issue with lsds/ltp#74 and disabled these two tests in PR #812.

@bodzhang bodzhang added p1 Medium priority and removed needs-triage Bug does not yet have a priority assigned labels Aug 21, 2020
5 participants