Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mtd/w25q: ensure the correct behavior if erase sector fails #15812

Merged
merged 1 commit into from
Feb 13, 2025

Conversation

michallenc
Copy link
Contributor

Summary

Function w25qxxxjv_erase wasn't correctly handling an error in w25qxxxjv_erase_sector call and was returning success even on failure. Moreover this change does not return EBUSY but waits for the previous operation to finish. Returning EBUSY doesn't make much sense here as we approximately know how long to wait based on flash's parameters.

Impact

This fixes wrong error handling and ensures the erase waits for previous operation to finish.

Testing

Tested on SAMv7 custom board with W25Q512 and W25Q01 flashes (512 Mbits and 1 Gbit, the latter has two dies, so slightly different behavior).

@github-actions github-actions bot added Area: Drivers Drivers issues Size: S The size of the change in this PR is small labels Feb 11, 2025
@nuttxpr
Copy link

nuttxpr commented Feb 11, 2025

[Experimental Bot, please feedback here]

Yes, this PR appears to meet the NuttX requirements, although it could be improved with more specific information.

Here's a breakdown of what's good and what could be better:

Strengths:

  • Clear Summary: The summary explains the problem, the solution, and why the previous behavior was undesirable.
  • Impact Description: Addresses the user impact (correct error handling) and implicitly acknowledges the lack of other impacts.
  • Testing Information: Provides target architecture, board, and specific flash chips used for testing. Mentions different behavior between single and dual-die flashes, suggesting thorough testing.

Weaknesses:

  • Missing References: No NuttX issue references are provided. Even if there isn't a directly related issue, it's good practice to create one to track bugs or feature requests.
  • Vague Testing Logs: "Testing logs before/after change" sections are empty. Include actual log output demonstrating the incorrect behavior before the change and the correct behavior afterward. Be specific about what commands/tests were run.
  • Limited Build Host Info: Only the target is specified. Include details about the build host OS, CPU architecture, and compiler version. This is crucial for reproducibility.
  • Lack of Explicit "NO" Answers: For the impact section, explicitly state "NO" for the categories where there's no impact. This improves clarity and shows you considered each point. Example: "Impact on build (will build process change)? NO"
  • Missing Detail on "How it Works": The summary briefly mentions the fix, but the "How does the change exactly work" part of the requirements could be expanded. Briefly describe the code changes made to fix the error handling and wait for operation completion (e.g., "Modified w25qxxxjv_erase to check the return code of w25qxxxjv_erase_sector and retry until success or timeout. Removed the EBUSY return code and implemented a wait loop using the flash's timing parameters.").

Recommendation:

Add the missing information and details described above to strengthen the PR. Clear, concise, and complete information makes it much easier for reviewers to understand and approve the changes.

@@ -976,10 +976,10 @@ static int w25qxxxjv_erase_sector(FAR struct w25qxxxjv_dev_s *priv,
}

status = w25qxxxjv_read_status(priv);
if ((status & STATUS_BUSY_MASK) != STATUS_READY)
while ((status & STATUS_BUSY_MASK) != STATUS_READY)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michallenc should be nice to have a timeout here. Could be just a counter, it will avoid the system get blocked forever case something bad happens

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@acassis Probably yes. The advantage is we know the longest possible delay (entire flash erase time), so I put a counter there with this time.

Function w25qxxxjv_erase wasn't correctly handling an error in
w25qxxxjv_erase_sector call and was returning success even on failure.
Moreover this change does not immediately return EBUSY but waits for
the previous operation to finish. If the timeout is significant (more
than erase time of the entire flash), then it returns EBUSY.

Signed-off-by: Michal Lenc <[email protected]>
@michallenc
Copy link
Contributor Author

@acassis any more concerns?

@anchao anchao merged commit 99c7b64 into apache:master Feb 13, 2025
39 checks passed
@cederom
Copy link
Contributor

cederom commented Feb 13, 2025

@anchao did you see there is ongoing discussion in progress before you merged?

@lupyuen
Copy link
Member

lupyuen commented Feb 13, 2025

Sorry I approved the PR too quickly. Next time I'll mark it as "Change Requested" until all pending concerns are resolved.

@anchao
Copy link
Contributor

anchao commented Feb 13, 2025

@anchao did you see there is ongoing discussion in progress before you merged?

I saw @acassis approved this PR, this is a valid response

@anchao
Copy link
Contributor

anchao commented Feb 13, 2025

IMHO that maintainers should spend more time on reading and learning code and software architecture instead of commenting, which will make it more efficient. Nuttx is already a very big project. If the pressure is put on contributors blindly, then the management method of this project is problematic.

@cederom
Copy link
Contributor

cederom commented Feb 13, 2025

@cederom: @anchao did you see there is ongoing discussion in progress before you merged?
@anchao: I saw @acassis approved this PR, this is a valid response

Okay, I will mark "changes requested" each time I have questions.

@anchao: IMHO that maintainers should spend more time on reading and learning code and software architecture instead of commenting, which will make it more efficient. Nuttx is already a very big project. If the pressure is put on contributors blindly, then the management method of this project is problematic.

@anchao so after reading the code you saw that not only return fix is there but also introduction of new blocking behavior with arbitrary delay with no polling for success and still may return error on incomplete operation and you are okay with that? I know this is the simplest possible implementation, just wanted to make sure we are aware of the risks.

Its not that I don't understand the code change, questions are asked simple way so everyone that comes in understands what is the problem and context ;-)

Recent discussions on mailing lists reveal the problem we have is not with blind pressuring the contributors, quite the opposite, blind pushing and merging insufficiently validated breaking changes. We are trying to fix that, maybe it will work, maybe not.

@anchao
Copy link
Contributor

anchao commented Feb 13, 2025

@anchao so after reading the code you saw that not only return fix is there but also introduction of new blocking behavior with arbitrary delay with no polling for success and still may return error on incomplete operation and you are okay with that? I know this is the simplest possible implementation, just wanted to make sure we are aware of the risks.

Its not that I don't understand the code change, questions are asked simple way so everyone that comes in understands what is the problem and context ;-)

This fix includes the number of retries, indicating that there may be state transition issues under extreme conditions, which I think is acceptable.

Recent discussions on mailing lists reveal the problem we have is not with blind pressuring the contributors, quite the opposite, blind pushing and merging insufficiently validated breaking changes. We are trying to fix that, maybe it will work, maybe not.

Thanks for all yours efforts. I'm just stating my opinion, and it doesn't need to be accepted by everyone. Perhaps you will give some consideration to my view in certain decision - making processes.

@michallenc
Copy link
Contributor Author

michallenc commented Feb 13, 2025

@cederom @anchao @lupyuen @acassis Hey guys, I am not quite sure what's the problem here. The discussion was resolved, Alan just forgot to mark it, but he approved the PR after the changes, which should be sufficient.

Regarding the possible blocking. I am aware it may be an issue under extreme circumstances, but write to the flash is currently ALWAYS blocking if you have to flush the buffer and erase page. We wait for QSPI transfer to finish, we wait for the erase to finish, there is a QSPI locking mechanism, so other thread accessing the flash may wait on the lock. BCH layer also has the lock.

Returning -EBUSY immediately is not good, you expect the write to wait if opened as blocking. And looking at the specification, this errno is not even specified in POSIX for write call. As for the non-blocking access, the BCH/flash interface doesn't seem to handle it anyway, even without this change.

Yes, we can discuss some refactor to BCH, FTL, MTD layers to correctly handle non-blocking access, but I don't think this PR breaks something. The code won't enter the while loop in most of the cases, it is a fix for probably bad timing on some flashes. I experienced it on W25Q512, but not on W25Q01. And even on W25Q512 it's just one or two retries.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: Drivers Drivers issues Size: S The size of the change in this PR is small
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants