forced power off (Hyper-V/VMware Workstation) creates page cycle #701

Closed
tjungblu opened this issue Feb 27, 2024 · 21 comments
Comments

@tjungblu
Contributor

tjungblu commented Feb 27, 2024

I have a very curious bbolt corruption on my hands and would like to pick your brains on it.

One partner of ours runs Red Hat Device Edge (the single-node "micro" version of OpenShift) as a VM on top of a Hyper-V server, and it ran into issues after reboots/hibernation. It runs etcd 3.5.10 (so bbolt 1.3.8) on RHEL 9 in the VM, on top of XFS.


(more info from further down the thread)

This was reported by two devs running MicroShift in their VMs on their laptops:

  • they use Windows 10 (with SSDs)
  • one uses Hyper-V, the other VMware Workstation
  • it's very rare, one VM we have logs from went through about 50 reboots before it hit this case
  • in this case, XFS was crashing and had to repair itself (Dirty bit is set. Fs was not properly unmounted and some data may be corrupt., kernel: XFS (nvme0n1p2): Starting recovery (logdev: internal))

The trigger was either laptop hibernation (closing the lid) or a VM restart; no crash or hardware issue was detected. This issue was originally filed with a BSoD in mind.


Upon starting up again, etcd refuses to read the db file with:

Jan 24 13:48:21 hostname[2329]: panic: freepages: failed to get all reachable pages (page 1929: multiple references (stack: [921 2733 2213 1929]))

I was able to get the db file now and quickly ran it through a debugger, which revealed a page cycle:
[image: debugger screenshot of the page cycle]

BTW, the forEachPageInternal itself runs out of stack space at some point, so we might also add some cycle detection as a follow-up.
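
A minimal sketch of what such cycle detection could look like, assuming a hypothetical in-memory map from page ID to child page IDs rather than bbolt's real page structures. The names `children` and `findCycle` are illustrative only, and the edges are just the subset that lies on the path described in this issue.

```go
package main

import "fmt"

// children is a hypothetical view of the branch pages: page ID -> child page IDs.
// Only edges on the path quoted in this issue are included.
var children = map[uint64][]uint64{
	921:  {1929, 2733},
	2733: {2213},
	2213: {1815},
	1815: {2898},
	2898: {2733, 2897},
}

// findCycle walks depth-first from root and returns the path that closes a
// cycle (its last element repeats an earlier one), or nil if there is none.
// It tracks only the pages on the current path, so it is meant for trees.
func findCycle(pages map[uint64][]uint64, root uint64) []uint64 {
	onStack := map[uint64]bool{}
	var stack []uint64
	var walk func(id uint64) []uint64
	walk = func(id uint64) []uint64 {
		if onStack[id] {
			// id is already an ancestor of itself: report the full path.
			return append(append([]uint64{}, stack...), id)
		}
		onStack[id] = true
		stack = append(stack, id)
		for _, c := range pages[id] {
			if cyc := walk(c); cyc != nil {
				return cyc
			}
		}
		stack = stack[:len(stack)-1]
		onStack[id] = false
		return nil
	}
	return walk(root)
}

func main() {
	fmt.Println(findCycle(children, 921))
}
```

With a check like this, the traversal could return an error as soon as a page shows up among its own ancestors, instead of recursing until the stack is exhausted.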

The page with the cycle contains this:

$ bbolt page db 2898
Page ID:    2898
Page Type:  branch
Total Size: 4096 bytes
Overflow pages: 0
Item Count: 7

00000000000000055f0000000000000000: <pgid=1929>
00000000000003585f0000000000000000: <pgid=1930>
000000000000241c5f0000000000000000: <pgid=1837>
0000000000023c7a5f0000000000000000: <pgid=2472>
000000000007be125f0000000000000000: <pgid=2348>
0000000000081b685f0000000000000000: <pgid=2733> <-- cycle
0000000000081da25f0000000000000000: <pgid=2897>

As you can see, 2733 is actually a parent page some levels higher in the tree. I'll add more information here as I get to it, but this certainly shouldn't happen in a power-down scenario.

Here's the full path from root page 921:


$ bbolt page db 921
Page ID:    921
Page Type:  branch
Total Size: 4096 bytes
Overflow pages: 0
Item Count: 7

00000000000000055f0000000000000000: <pgid=1929>
00000000000003585f0000000000000000: <pgid=1930>
000000000000241c5f0000000000000000: <pgid=1837>
0000000000023c7a5f0000000000000000: <pgid=2472>
000000000007be125f0000000000000000: <pgid=2348>
0000000000081b685f0000000000000000: <pgid=2733>
0000000000081da25f0000000000000000: <pgid=2799>

$ bbolt page db 2733
Page ID:    2733
Page Type:  branch
Total Size: 4096 bytes
Overflow pages: 0
Item Count: 111

0000000000081b685f0000000000000000: <pgid=2869>
0000000000081b6e5f0000000000000000: <pgid=2461>
0000000000081b745f0000000000000000: <pgid=2309>
0000000000081b795f0000000000000000: <pgid=1260>
0000000000081b7e5f0000000000000000: <pgid=1967>
0000000000081b845f0000000000000000: <pgid=2729>
0000000000081b895f0000000000000000: <pgid=2213>
0000000000081b8f5f0000000000000000: <pgid=2332>
0000000000081b945f0000000000000000: <pgid=1246>
0000000000081b995f0000000000000000: <pgid=2404>
0000000000081b9f5f0000000000000000: <pgid=766>
0000000000081ba45f0000000000000000: <pgid=1815>
0000000000081baa5f0000000000000000: <pgid=1874>
0000000000081bb05f0000000000000000: <pgid=1316>
0000000000081bb55f0000000000000000: <pgid=2744>
0000000000081bbb5f0000000000000000: <pgid=2795>
0000000000081bc05f0000000000000000: <pgid=2387>
0000000000081bc55f0000000000000000: <pgid=2745>
0000000000081bcb5f0000000000000000: <pgid=2583>
0000000000081bd05f0000000000000000: <pgid=2584>
0000000000081bd55f0000000000000000: <pgid=1375>
0000000000081bdb5f0000000000000000: <pgid=1054>
0000000000081be15f0000000000000000: <pgid=2821>
0000000000081be65f0000000000000000: <pgid=2820>
0000000000081beb5f0000000000000000: <pgid=1899>
0000000000081bf05f0000000000000000: <pgid=993>
0000000000081bf55f0000000000000000: <pgid=1900>
0000000000081bfa5f0000000000000000: <pgid=2440>
0000000000081bff5f0000000000000000: <pgid=768>
0000000000081c045f0000000000000000: <pgid=2519>
0000000000081c095f0000000000000000: <pgid=871>
0000000000081c0e5f0000000000000000: <pgid=872>
0000000000081c145f0000000000000000: <pgid=1046>
0000000000081c195f0000000000000000: <pgid=1045>
0000000000081c1e5f0000000000000000: <pgid=873>
0000000000081c245f0000000000000000: <pgid=2409>
0000000000081c295f0000000000000000: <pgid=1999>
0000000000081c2e5f0000000000000000: <pgid=2747>
0000000000081c335f0000000000000000: <pgid=1998>
0000000000081c385f0000000000000000: <pgid=2798>
0000000000081c3d5f0000000000000000: <pgid=1938>
0000000000081c425f0000000000000000: <pgid=1981>
0000000000081c475f0000000000000000: <pgid=2241>
0000000000081c4c5f0000000000000000: <pgid=1058>
0000000000081c515f0000000000000000: <pgid=1980>
0000000000081c565f0000000000000000: <pgid=1936>
0000000000081c585f0000000000000000: <pgid=1937>
0000000000081c5a5f0000000000000000: <pgid=2090>
0000000000081c5c5f0000000000000000: <pgid=1346>
0000000000081c615f0000000000000000: <pgid=2095>
0000000000081c665f0000000000000000: <pgid=2007>
0000000000081c6b5f0000000000000000: <pgid=994>
0000000000081c705f0000000000000000: <pgid=2653>
0000000000081c755f0000000000000000: <pgid=2355>
0000000000081c7a5f0000000000000000: <pgid=2910>
0000000000081c7f5f0000000000000000: <pgid=1376>
0000000000081c835f0000000000000000: <pgid=2909>
0000000000081c885f0000000000000000: <pgid=2060>
0000000000081c8d5f0000000000000000: <pgid=875>
0000000000081c925f0000000000000000: <pgid=874>
0000000000081c975f0000000000000000: <pgid=1059>
0000000000081c9c5f0000000000000000: <pgid=1374>
0000000000081ca25f0000000000000000: <pgid=1262>
0000000000081ca75f0000000000000000: <pgid=2792>
0000000000081cac5f0000000000000000: <pgid=767>
0000000000081cb15f0000000000000000: <pgid=2205>
0000000000081cb65f0000000000000000: <pgid=1949>
0000000000081cbb5f0000000000000000: <pgid=2596>
0000000000081cc15f0000000000000000: <pgid=1896>
0000000000081cc65f0000000000000000: <pgid=1966>
0000000000081ccc5f0000000000000000: <pgid=1952>
0000000000081cd15f0000000000000000: <pgid=2336>
0000000000081cd65f0000000000000000: <pgid=2743>
0000000000081cdc5f0000000000000000: <pgid=1898>
0000000000081ce15f0000000000000000: <pgid=2388>
0000000000081ce65f0000000000000000: <pgid=2654>
0000000000081ceb5f0000000000000000: <pgid=2794>
0000000000081cf05f0000000000000000: <pgid=2405>
0000000000081cf55f0000000000000000: <pgid=1814>
0000000000081cfa5f0000000000000000: <pgid=2513>
0000000000081cff5f0000000000000000: <pgid=1873>
0000000000081d045f0000000000000000: <pgid=2868>
0000000000081d095f0000000000000000: <pgid=2746>
0000000000081d0e5f0000000000000000: <pgid=1245>
0000000000081d145f0000000000000000: <pgid=812>
0000000000081d195f0000000000000000: <pgid=1315>
0000000000081d1e5f0000000000000000: <pgid=1257>
0000000000081d245f0000000000000000: <pgid=1060>
0000000000081d295f0000000000000000: <pgid=2518>
0000000000081d2e5f0000000000000000: <pgid=1055>
0000000000081d345f0000000000000000: <pgid=2598>
0000000000081d395f0000000000000000: <pgid=1875>
0000000000081d3e5f0000000000000000: <pgid=2107>
0000000000081d435f0000000000000000: <pgid=2008>
0000000000081d485f0000000000000000: <pgid=1052>
0000000000081d4e5f0000000000000000: <pgid=2516>
0000000000081d535f0000000000000000: <pgid=1982>
0000000000081d575f0000000000000000: <pgid=2515>
0000000000081d5c5f0000000000000000: <pgid=1053>
0000000000081d625f0000000000000000: <pgid=1317>
0000000000081d675f0000000000000000: <pgid=2009>
0000000000081d6d5f0000000000000000: <pgid=3>
0000000000081d725f0000000000000000: <pgid=2797>
0000000000081d775f0000000000000000: <pgid=2>
0000000000081d7d5f0000000000000000: <pgid=2796>
0000000000081d825f0000000000000000: <pgid=2108>
0000000000081d885f0000000000000000: <pgid=2408>
0000000000081d8e5f0000000000000000: <pgid=995>
0000000000081d935f0000000000000000: <pgid=1983>
0000000000081d985f0000000000000000: <pgid=2096>
0000000000081d9d5f0000000000000000: <pgid=1838>


$ bbolt page db 2213
Page ID:    2213
Page Type:  branch
Total Size: 4096 bytes
Overflow pages: 0
Item Count: 7

00000000000000055f0000000000000000: <pgid=1929>
00000000000003585f0000000000000000: <pgid=1930>
000000000000241c5f0000000000000000: <pgid=1837>
0000000000023c7a5f0000000000000000: <pgid=2472>
000000000007be125f0000000000000000: <pgid=2349>
0000000000081c5a5f0000000000000000: <pgid=2109>
0000000000081da25f0000000000000000: <pgid=1815>


$ bbolt page db 1815
Page ID:    1815
Page Type:  branch
Total Size: 4096 bytes
Overflow pages: 0
Item Count: 39

0000000000081da25f0000000000000000: <pgid=1968>
0000000000081da85f0000000000000000: <pgid=2407>
0000000000081dad5f0000000000000000: <pgid=2010>
0000000000081db25f0000000000000000: <pgid=2865>
0000000000081db85f0000000000000000: <pgid=1057>
0000000000081dbd5f0000000000000000: <pgid=1056>
0000000000081dc35f0000000000000000: <pgid=813>
0000000000081dc85f0000000000000000: <pgid=2866>
0000000000081dcd5f0000000000000000: <pgid=1061>
0000000000081dd35f0000000000000000: <pgid=2885>
0000000000081dd85f0000000000000000: <pgid=1839>
0000000000081ddd5f0000000000000000: <pgid=939>
0000000000081de35f0000000000000000: <pgid=2393>
0000000000081de85f0000000000000000: <pgid=2242>
0000000000081dee5f0000000000000000: <pgid=2705>
0000000000081df35f0000000000000000: <pgid=814>
0000000000081df85f0000000000000000: <pgid=2521>
0000000000081dfd5f0000000000000000: <pgid=2520>
0000000000081e025f0000000000000000: <pgid=1063>
0000000000081e075f0000000000000000: <pgid=2424>
0000000000081e0d5f0000000000000000: <pgid=2037>
0000000000081e135f0000000000000000: <pgid=1984>
0000000000081e155f0000000000000000: <pgid=2220>
0000000000081e175f0000000000000000: <pgid=1340>
0000000000081e1d5f0000000000000000: <pgid=1062>
0000000000081e225f0000000000000000: <pgid=2369>
0000000000081e275f0000000000000000: <pgid=2706>
0000000000081e2d5f0000000000000000: <pgid=2368>
0000000000081e315f0000000000000000: <pgid=769>
0000000000081e365f0000000000000000: <pgid=1264>
0000000000081e3c5f0000000000000000: <pgid=2231>
0000000000081e415f0000000000000000: <pgid=1064>
0000000000081e465f0000000000000000: <pgid=1065>
0000000000081e4b5f0000000000000000: <pgid=2898>
0000000000081e505f0000000000000000: <pgid=2348>
0000000000081e565f0000000000000000: <pgid=1965>
0000000000081e5b5f0000000000000000: <pgid=2641>
0000000000081e605f0000000000000000: <pgid=2729>
0000000000081e655f0000000000000000: <pgid=1953>


$ bbolt page db 2898
Page ID:    2898
Page Type:  branch
Total Size: 4096 bytes
Overflow pages: 0
Item Count: 7

00000000000000055f0000000000000000: <pgid=1929>
00000000000003585f0000000000000000: <pgid=1930>
000000000000241c5f0000000000000000: <pgid=1837>
0000000000023c7a5f0000000000000000: <pgid=2472>
000000000007be125f0000000000000000: <pgid=2348>
0000000000081b685f0000000000000000: <pgid=2733>
0000000000081da25f0000000000000000: <pgid=2897>


EDIT3: I'm removing the surgery for now, as breaking the cycle does not entirely fix the issue. You can see there's a fairly big chunk of page 2898 duplicated in page 2213, which eventually makes its surgery a little more complicated than I initially thought.

PS: Thanks for the surgery tools, even though I had to load it into a debugger to find the cycle <3

@ahrtr
Member

ahrtr commented Feb 27, 2024

after breaking the cycle I was able to restore the file again

edit2: also it seems that removing the cycle does not fix initially reported issue on multiple references:

The first sentence suggests that you have already repaired the corrupted db file, but the second suggests that you haven't. Can you clarify whether you have fixed the corrupted db file or not?

@tjungblu
Contributor Author

tjungblu commented Feb 27, 2024

Sorry, I was entirely fooled by the stats output; check still complains about multiple references:

EDIT3: I'm removing the surgery for now, as breaking the cycle does not entirely fix the issue. You can see there's a fairly big chunk of page 2898 duplicated in page 2213, which eventually makes its surgery a little more complicated than I initially thought.

The page duplication is also something I just noticed: most of the page introducing the cycle seems to be repeated from the root page:

00000000000000055f0000000000000000: <pgid=1929>
00000000000003585f0000000000000000: <pgid=1930>
000000000000241c5f0000000000000000: <pgid=1837>
0000000000023c7a5f0000000000000000: <pgid=2472>
000000000007be125f0000000000000000: <pgid=2348>

which leads me to suspect a torn write during a tx?
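
To put a number on the duplication, here is a small sketch that counts how many leading child page IDs two branch pages share. The slices are copied from the `bbolt page` dumps earlier in this thread; `sharedPrefix` is a hypothetical helper, not part of bbolt.

```go
package main

import "fmt"

// sharedPrefix returns how many leading child page IDs two branch pages
// have in common.
func sharedPrefix(a, b []uint64) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

func main() {
	// Child page IDs as listed in the dumps of pages 921, 2898 and 2213.
	root921 := []uint64{1929, 1930, 1837, 2472, 2348, 2733, 2799}
	page2898 := []uint64{1929, 1930, 1837, 2472, 2348, 2733, 2897}
	page2213 := []uint64{1929, 1930, 1837, 2472, 2349, 2109, 1815}

	fmt.Println(sharedPrefix(root921, page2898)) // 6
	fmt.Println(sharedPrefix(root921, page2213)) // 4
}
```

Pages deeper in the tree that start with the same entries as the root look like stale copies of a root page, which would fit a page landing in the wrong place.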

@ahrtr
Member

ahrtr commented Feb 27, 2024

The OS is Windows, correct?

@tjungblu
Contributor Author

Nope, the OS is RHEL, but it's virtualized on a Windows host via Hyper-V.

@ahrtr
Member

ahrtr commented Feb 27, 2024

So the host is Windows, and the VM on top of the host is RHEL.

Can you please dump the two meta pages?

@tjungblu
Contributor Author

tjungblu commented Feb 27, 2024

$ bbolt page db 0
Page ID:    0
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=2342>
Freelist:   <pgid=18446744073709551615>
HWM:        <pgid=2921>
Txn ID:     39234
Checksum:   c69c456bc6a1959f

$ bbolt page db 1
Page ID:    1
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=2748>
Freelist:   <pgid=18446744073709551615>
HWM:        <pgid=2921>
Txn ID:     39233
Checksum:   53258dd4fd8613af

just going through the roots now:

$ bbolt page db 2342
Page ID:    2342
Page Type:  leaf
Total Size: 4096 bytes
Overflow pages: 0
Item Count: 10

"alarm": <pgid=0,seq=0>
"auth": <pgid=0,seq=0>
"authRoles": <pgid=0,seq=0>
"authUsers": <pgid=0,seq=0>
"cluster": <pgid=0,seq=0>
"key": <pgid=921,seq=0>
"lease": <pgid=0,seq=0>
"members": <pgid=0,seq=0>
"members_removed": <pgid=0,seq=0>
"meta": <pgid=0,seq=0>

$ bbolt page db 2748
Page ID:    2748
Page Type:  branch
Total Size: 4096 bytes
Overflow pages: 0
Item Count: 7

00000000000000055f0000000000000000: <pgid=1929>
00000000000003585f0000000000000000: <pgid=1930>
000000000000241c5f0000000000000000: <pgid=1837>
0000000000023c7a5f0000000000000000: <pgid=2472>
000000000007be125f0000000000000000: <pgid=2349>
0000000000081c5a5f0000000000000000: <pgid=2109>
0000000000081da25f0000000000000000: <pgid=1307>
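
For reading the two meta dumps: my understanding is that bbolt uses the valid meta page with the higher Txn ID, which would make page 0 (root 2342) the active one here. A minimal sketch of that selection rule, assuming both pages already passed validation; the `meta` type and `activeMeta` function are hypothetical and carry only the two fields needed.

```go
package main

import "fmt"

// meta holds only the fields from the dumps that matter for this sketch.
type meta struct {
	Root uint64
	TxID uint64
}

// activeMeta returns the meta page with the higher transaction ID,
// assuming both pages already passed checksum validation.
func activeMeta(m0, m1 meta) meta {
	if m1.TxID > m0.TxID {
		return m1
	}
	return m0
}

func main() {
	m0 := meta{Root: 2342, TxID: 39234} // meta page 0 from the dump
	m1 := meta{Root: 2748, TxID: 39233} // meta page 1 from the dump
	fmt.Println(activeMeta(m0, m1).Root)
}
```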

@ahrtr
Member

ahrtr commented Feb 27, 2024

Please run bbolt surgery meta validate db-file.

@tjungblu
Contributor Author

$ bbolt surgery meta validate db
The meta page 0 is valid!
The meta page 1 is valid!

:-)

@ahrtr
Member

ahrtr commented Feb 27, 2024

Could you also try the bbolt surgery revert-meta-page db-file --output new-db-file command?

@tjungblu
Contributor Author

Sure:


$ bbolt surgery revert-meta-page db --output db_reverted
The meta page is reverted.

$ bbolt page db_reverted 0
Page ID:    0
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=2748>
Freelist:   <pgid=18446744073709551615>
HWM:        <pgid=2921>
Txn ID:     39233
Checksum:   53258dd4fd8613af

$ bbolt page db_reverted 1
Page ID:    1
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=2748>
Freelist:   <pgid=18446744073709551615>
HWM:        <pgid=2921>
Txn ID:     39233
Checksum:   53258dd4fd8613af

$ bbolt check db_reverted 
panic: freepages: failed to get all reachable pages (page 1929: multiple references (stack: [2748 1307 2898 1929]))

goroutine 6 [running]:
go.etcd.io/bbolt.(*DB).freepages.func2()
	/home/tjungblu/git/bbolt/db.go:1196 +0x8d
created by go.etcd.io/bbolt.(*DB).freepages in goroutine 1
	/home/tjungblu/git/bbolt/db.go:1194 +0x1e5

Seems even the previous transaction is already in bad shape.

@ahrtr
Member

ahrtr commented Feb 27, 2024

panic: freepages: failed to get all reachable pages (page 1929: multiple references (stack: [2748 1307 2898 1929]))

You need to run bbolt surgery freelist abandon ... on the db_reverted file

@tjungblu
Contributor Author

Sorry, I forgot to mention that I tried this; it doesn't help either. Unfortunately, I don't think the freelist is really the culprit here.


$ bbolt surgery freelist abandon db_reverted --output db_reverted_freed
The freelist was abandoned in both meta pages.
It may cause some delay on next startup because bbolt needs to scan the whole db to reconstruct the free list.

$ bbolt check db_reverted_freed 
panic: freepages: failed to get all reachable pages (page 1929: multiple references (stack: [2748 1307 2898 1929]))

goroutine 6 [running]:
go.etcd.io/bbolt.(*DB).freepages.func2()
	/home/tjungblu/git/bbolt/db.go:1196 +0x8d
created by go.etcd.io/bbolt.(*DB).freepages in goroutine 1
	/home/tjungblu/git/bbolt/db.go:1194 +0x1e5


@ahrtr
Member

ahrtr commented Feb 27, 2024

Also, what's the Linux kernel version? FYI: #675

@ahrtr
Member

ahrtr commented Feb 27, 2024

In order to fix the corrupted db file, you need to remove the entry below from page 2733. Obviously it may lose some data.

0000000000081b895f0000000000000000: <pgid=2213>

Please let me know whether it works. Please do not forget to run bbolt surgery freelist abandon ....

Please also report the exact Linux kernel version I asked about above.

@ahrtr
Member

ahrtr commented Feb 27, 2024

Commands:

./bbolt surgery clear-page-elements ./db --from-index 6 --to-index 7 --output new1.db --pageId 2733
./bbolt surgery freelist abandon ./new1.db --output ./new2.db

@tjungblu
Contributor Author

Thanks, I'm in a couple of meetings now. I've asked our support folks for more information on kernel and fs, will get back to you on that and with the clear pages too.

@tjungblu
Contributor Author

So the kernel version fits the one described in #675 (Linux version 5.14.0-284.30.1.el9_2.x86_64).

I've tried something similar already by removing the loop with:

$ bbolt surgery clear-page-elements --pageId 2898 --from-index 5 --to-index 6 --output db_fixed1 db

This makes stats work again because forEachPageInternal no longer recurses infinitely and runs out of stack space. Check still fails with the same multiple-references error, even after abandoning the freelist.

As for:

./bbolt surgery clear-page-elements ./db --from-index 6 --to-index 7 --output new1.db --pageId 2733

which effectively removes the link to page 2213, that causes the same headaches:


$ bbolt surgery clear-page-elements db --from-index 6 --to-index 7 --output db_rm_2733 --pageId 2733
WARNING: The clearing has abandoned some pages that are not yet referenced from free list.
Please consider executing `./bbolt surgery abandon-freelist ...`
All elements in [6, 7) in page 2733 were cleared

$ bbolt surgery freelist abandon db_rm_2733 --output db_rm_2733_fl 
The freelist was abandoned in both meta pages.
It may cause some delay on next startup because bbolt needs to scan the whole db to reconstruct the free list.

$ bbolt check db_rm_2733_fl 
panic: freepages: failed to get all reachable pages (page 1929: multiple references (stack: [921 2733 2332 1929]))

goroutine 19 [running]:
go.etcd.io/bbolt.(*DB).freepages.func2()
	/home/tjungblu/git/bbolt/db.go:1196 +0x8d
created by go.etcd.io/bbolt.(*DB).freepages in goroutine 1
	/home/tjungblu/git/bbolt/db.go:1194 +0x1e5
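
To illustrate what this panic is reporting, here is a rough sketch of a reachability walk that flags the first page reached a second time. It is not bbolt's actual implementation: `firstMultiRef` and the page map are hypothetical, and the edges are inferred from the panic stack above and the dump of page 921.

```go
package main

import "fmt"

// firstMultiRef walks the tree from root and returns the path to the first
// page that is reached a second time. ok is false if every reachable page
// is referenced exactly once.
func firstMultiRef(pages map[uint64][]uint64, root uint64) (path []uint64, ok bool) {
	seen := map[uint64]bool{}
	var stack []uint64
	var walk func(id uint64) bool
	walk = func(id uint64) bool {
		stack = append(stack, id)
		if seen[id] {
			return true // second reference: stack holds the offending path
		}
		seen[id] = true
		for _, c := range pages[id] {
			if walk(c) {
				return true
			}
		}
		stack = stack[:len(stack)-1]
		return false
	}
	if walk(root) {
		return stack, true
	}
	return nil, false
}

func main() {
	// Edges inferred from the panic stack [921 2733 2332 1929] and from the
	// dump of page 921, whose first child is also 1929.
	pages := map[uint64][]uint64{
		921:  {1929, 2733},
		2733: {2332},
		2332: {1929},
	}
	fmt.Println(firstMultiRef(pages, 921))
}
```

Because a revisited page ends the walk immediately, a check of this shape also terminates on a cyclic file.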

@ahrtr
Member

ahrtr commented Feb 27, 2024

So the kernel version fits the one described in #675 (Linux version 5.14.0-284.30.1.el9_2.x86_64).

Thanks for the feedback. You should upgrade the Linux kernel to 5.17+ or the latest patch release of 5.15.x. FYI: #695

Usually when we see such an issue, there are lots of unsynced pages. Please also run ./bbolt pages db and provide the output as a file, something like https://gist.github.com/fyfyrchik/4aafec23d9dfc487fb4a4cd7f5560730

I've tried something similar already by removing the loop with:

$ bbolt surgery clear-page-elements --pageId 2898 --from-index 5 --to-index 6 --output db_fixed1 db

The data has already broken the B+tree invariant at a higher level, at page 2213: its key in the branch page (page 2733) is larger than the first key in the page it points to. So it doesn't make sense to fix it at a lower level.

0000000000081b895f0000000000000000: <pgid=2213>
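
The mismatch can be checked directly from the dumps. The sketch below compares the key that page 2733 stores for page 2213 with the first key listed in page 2213, assuming the usual B+tree rule that a child's keys must not sort below the key its parent holds for it. `childStartsBelowBranchKey` is a hypothetical helper, not a bbolt API.

```go
package main

import (
	"bytes"
	"encoding/hex"
	"fmt"
)

// childStartsBelowBranchKey reports whether the child's first key sorts
// before the key its parent branch element advertises for it.
func childStartsBelowBranchKey(branchKeyHex, childFirstKeyHex string) (bool, error) {
	bk, err := hex.DecodeString(branchKeyHex)
	if err != nil {
		return false, err
	}
	ck, err := hex.DecodeString(childFirstKeyHex)
	if err != nil {
		return false, err
	}
	return bytes.Compare(ck, bk) < 0, nil
}

func main() {
	// Key of the element in page 2733 that points to page 2213.
	branchKey := "0000000000081b895f0000000000000000"
	// First key listed in the dump of page 2213.
	childFirst := "00000000000000055f0000000000000000"

	bad, err := childStartsBelowBranchKey(branchKey, childFirst)
	fmt.Println(bad, err)
}
```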

$ bbolt check db_rm_2733_fl
panic: freepages: failed to get all reachable pages (page 1929: multiple references (stack: [921 2733 2332 1929]))

You have to do a similar analysis and run bbolt surgery clear-page-elements accordingly. Again, there may be some data loss. :(


Actually, I have been suspecting that there are some bugs, such as:

  • a page that is in normal use is mistakenly treated as a free page and reused, so its data may be overwritten;
  • or some potential memory stomping bug.

But we have kept the robustness tests running for 4+ weeks and have never seen any issue.

@tjungblu
Contributor Author

Thanks for your input. I've relayed the kernel information back to the issue where the teams owning our single-node SKUs participate.

Here's the pages dump. As mentioned, this has the looping entry in page 2898 removed, and I've replaced the cmd panic with output of all errors encountered. There are several invariant issues, indeed:

pages_dump.txt

I suspect it's also related to the host's BSoD, maybe some RAM issues, and not necessarily related to our use of the kernel at all.
I'll keep you posted as I gather more information, but I reckon we can close this one here, since it's too diffuse an issue to fix from bbolt's perspective. If we get a repro in our lab, I can also ask those folks to run our robustness tests against it.

@ahrtr
Member

ahrtr commented Feb 27, 2024

but I reckon we can close this one here

OK. Please feel free to ping me if you have any clue or thoughts on the data corruption issue.

@tjungblu
Contributor Author

tjungblu commented Mar 7, 2024

Just a quick update, because we got some more information for repro in the meantime.

  • this was reported by two devs running MicroShift in their VMs on their laptops
  • they use Windows 10 (with SSDs)
  • one uses Hyper-V, the other VMware Workstation
  • it's very rare, one VM we have logs from went through about 50 reboots before it hit this case
  • in this case, XFS was crashing and had to repair itself (Dirty bit is set. Fs was not properly unmounted and some data may be corrupt., kernel: XFS (nvme0n1p2): Starting recovery (logdev: internal))

The trigger was either laptop hibernation (closing the lid) or a VM restart; no crash or hardware issue was detected.


Not sure where the BSoD report came from, but I'll revise the title to be less sensationalist.

tjungblu changed the title from "forced power off (Hyper-V BSOD) creates page cycle" to "forced power off (Hyper-V/VMware Workstation) creates page cycle" on Mar 7, 2024