mtd/w25q: ensure the correct behavior if erase sector fails #15812

michallenc · 2025-02-11T10:43:20Z

Summary

Function w25qxxxjv_erase wasn't correctly handling an error in w25qxxxjv_erase_sector call and was returning success even on failure. Moreover this change does not return EBUSY but waits for the previous operation to finish. Returning EBUSY doesn't make much sense here as we approximately know how long to wait based on flash's parameters.

Impact

This fixes wrong error handling and ensures the erase waits for previous operation to finish.

Testing

Tested on SAMv7 custom board with W25Q512 and W25Q01 flashes (512 Mbits and 1 Gbit, the latter has two dies, so slightly different behavior).

nuttxpr · 2025-02-11T11:07:41Z

[Experimental Bot, please feedback here]

Yes, this PR appears to meet the NuttX requirements, although it could be improved with more specific information.

Here's a breakdown of what's good and what could be better:

Strengths:

Clear Summary: The summary explains the problem, the solution, and why the previous behavior was undesirable.
Impact Description: Addresses the user impact (correct error handling) and implicitly acknowledges the lack of other impacts.
Testing Information: Provides target architecture, board, and specific flash chips used for testing. Mentions different behavior between single and dual-die flashes, suggesting thorough testing.

Weaknesses:

Missing References: No NuttX issue references are provided. Even if there isn't a directly related issue, it's good practice to create one to track bugs or feature requests.
Vague Testing Logs: "Testing logs before/after change" sections are empty. Include actual log output demonstrating the incorrect behavior before the change and the correct behavior afterward. Be specific about what commands/tests were run.
Limited Build Host Info: Only the target is specified. Include details about the build host OS, CPU architecture, and compiler version. This is crucial for reproducibility.
Lack of Explicit "NO" Answers: For the impact section, explicitly state "NO" for the categories where there's no impact. This improves clarity and shows you considered each point. Example: "Impact on build (will build process change)? NO"
Missing Detail on "How it Works": The summary briefly mentions the fix, but the "How does the change exactly work" part of the requirements could be expanded. Briefly describe the code changes made to fix the error handling and wait for operation completion (e.g., "Modified w25qxxxjv_erase to check the return code of w25qxxxjv_erase_sector and retry until success or timeout. Removed the EBUSY return code and implemented a wait loop using the flash's timing parameters.").

Recommendation:

Add the missing information and details described above to strengthen the PR. Clear, concise, and complete information makes it much easier for reviewers to understand and approve the changes.

acassis · 2025-02-11T13:30:49Z

drivers/mtd/w25qxxxjv.c

@@ -976,10 +976,10 @@ static int w25qxxxjv_erase_sector(FAR struct w25qxxxjv_dev_s *priv,
    }

  status = w25qxxxjv_read_status(priv);
-  if ((status & STATUS_BUSY_MASK) != STATUS_READY)
+  while ((status & STATUS_BUSY_MASK) != STATUS_READY)


@michallenc should be nice to have a timeout here. Could be just a counter, it will avoid the system get blocked forever case something bad happens

@acassis Probably yes. The advantage is we know the longest possible delay (entire flash erase time), so I put a counter there with this time.

Function w25qxxxjv_erase wasn't correctly handling an error in w25qxxxjv_erase_sector call and was returning success even on failure. Moreover this change does not immediately return EBUSY but waits for the previous operation to finish. If the timeout is significant (more than erase time of the entire flash), then it returns EBUSY. Signed-off-by: Michal Lenc <[email protected]>

michallenc · 2025-02-12T13:44:03Z

@acassis any more concerns?

cederom · 2025-02-13T02:07:10Z

@anchao did you see there is ongoing discussion in progress before you merged?

lupyuen · 2025-02-13T02:39:49Z

Sorry I approved the PR too quickly. Next time I'll mark it as "Change Requested" until all pending concerns are resolved.

anchao · 2025-02-13T02:48:13Z

@anchao did you see there is ongoing discussion in progress before you merged?

I saw @acassis approved this PR, this is a valid response

anchao · 2025-02-13T02:53:37Z

IMHO that maintainers should spend more time on reading and learning code and software architecture instead of commenting, which will make it more efficient. Nuttx is already a very big project. If the pressure is put on contributors blindly, then the management method of this project is problematic.

cederom · 2025-02-13T03:32:28Z

@cederom: @anchao did you see there is ongoing discussion in progress before you merged?
@anchao: I saw @acassis approved this PR, this is a valid response

Okay, I will mark "changes requested" each time I have questions.

@anchao: IMHO that maintainers should spend more time on reading and learning code and software architecture instead of commenting, which will make it more efficient. Nuttx is already a very big project. If the pressure is put on contributors blindly, then the management method of this project is problematic.

@anchao so after reading the code you saw that not only return fix is there but also introduction of new blocking behavior with arbitrary delay with no polling for success and still may return error on incomplete operation and you are okay with that? I know this is the simplest possible implementation, just wanted to make sure we are aware of the risks.

Its not that I don't understand the code change, questions are asked simple way so everyone that comes in understands what is the problem and context ;-)

Recent discussions on mailing lists reveal the problem we have is not with blind pressuring the contributors, quite the opposite, blind pushing and merging insufficiently validated breaking changes. We are trying to fix that, maybe it will work, maybe not.

anchao · 2025-02-13T04:36:44Z

@anchao so after reading the code you saw that not only return fix is there but also introduction of new blocking behavior with arbitrary delay with no polling for success and still may return error on incomplete operation and you are okay with that? I know this is the simplest possible implementation, just wanted to make sure we are aware of the risks.

Its not that I don't understand the code change, questions are asked simple way so everyone that comes in understands what is the problem and context ;-)

This fix includes the number of retries, indicating that there may be state transition issues under extreme conditions, which I think is acceptable.

Recent discussions on mailing lists reveal the problem we have is not with blind pressuring the contributors, quite the opposite, blind pushing and merging insufficiently validated breaking changes. We are trying to fix that, maybe it will work, maybe not.

Thanks for all yours efforts. I'm just stating my opinion, and it doesn't need to be accepted by everyone. Perhaps you will give some consideration to my view in certain decision - making processes.

michallenc · 2025-02-13T07:52:54Z

@cederom @anchao @lupyuen @acassis Hey guys, I am not quite sure what's the problem here. The discussion was resolved, Alan just forgot to mark it, but he approved the PR after the changes, which should be sufficient.

Regarding the possible blocking. I am aware it may be an issue under extreme circumstances, but write to the flash is currently ALWAYS blocking if you have to flush the buffer and erase page. We wait for QSPI transfer to finish, we wait for the erase to finish, there is a QSPI locking mechanism, so other thread accessing the flash may wait on the lock. BCH layer also has the lock.

Returning -EBUSY immediately is not good, you expect the write to wait if opened as blocking. And looking at the specification, this errno is not even specified in POSIX for write call. As for the non-blocking access, the BCH/flash interface doesn't seem to handle it anyway, even without this change.

Yes, we can discuss some refactor to BCH, FTL, MTD layers to correctly handle non-blocking access, but I don't think this PR breaks something. The code won't enter the while loop in most of the cases, it is a fix for probably bad timing on some flashes. I experienced it on W25Q512, but not on W25Q01. And even on W25Q512 it's just one or two retries.

github-actions bot added Area: Drivers Drivers issues Size: S The size of the change in this PR is small labels Feb 11, 2025

acassis reviewed Feb 11, 2025

View reviewed changes

michallenc force-pushed the w25-busy-fix branch from 55494dc to 4db9b6c Compare February 11, 2025 14:02

michallenc force-pushed the w25-busy-fix branch from 4db9b6c to b9c1941 Compare February 11, 2025 14:03

acassis approved these changes Feb 12, 2025

View reviewed changes

acassis requested review from btashton, liuguo09, adamfeuer, anchao, yamt, davids5, cederom, antmerlino, yf13, pussuw, raiden00pl, jerpelea, lupyuen, pkarashchenko, xiaoxiang781216, masayuki2009, tmedicci, GUIDINGLI and hartmannathan February 12, 2025 13:56

lupyuen approved these changes Feb 12, 2025

View reviewed changes

yf13 approved these changes Feb 13, 2025

View reviewed changes

anchao approved these changes Feb 13, 2025

View reviewed changes

anchao merged commit 99c7b64 into apache:master Feb 13, 2025
39 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mtd/w25q: ensure the correct behavior if erase sector fails #15812

mtd/w25q: ensure the correct behavior if erase sector fails #15812

michallenc commented Feb 11, 2025

nuttxpr commented Feb 11, 2025

acassis Feb 11, 2025

michallenc Feb 11, 2025

michallenc commented Feb 12, 2025

cederom commented Feb 13, 2025

lupyuen commented Feb 13, 2025

anchao commented Feb 13, 2025

anchao commented Feb 13, 2025

cederom commented Feb 13, 2025

anchao commented Feb 13, 2025 •

edited

Loading

michallenc commented Feb 13, 2025 •

edited

Loading

mtd/w25q: ensure the correct behavior if erase sector fails #15812

mtd/w25q: ensure the correct behavior if erase sector fails #15812

Conversation

michallenc commented Feb 11, 2025

Summary

Impact

Testing

nuttxpr commented Feb 11, 2025

acassis Feb 11, 2025

Choose a reason for hiding this comment

michallenc Feb 11, 2025

Choose a reason for hiding this comment

michallenc commented Feb 12, 2025

cederom commented Feb 13, 2025

lupyuen commented Feb 13, 2025

anchao commented Feb 13, 2025

anchao commented Feb 13, 2025

cederom commented Feb 13, 2025

anchao commented Feb 13, 2025 • edited Loading

michallenc commented Feb 13, 2025 • edited Loading

anchao commented Feb 13, 2025 •

edited

Loading

michallenc commented Feb 13, 2025 •

edited

Loading