-
-
Notifications
You must be signed in to change notification settings - Fork 30.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-116810: fix memory leak in ssl module #123249
Conversation
PySSL_get_session leaks a session object each time it is called. For programs like cherrypy (cheroot), a leak occurs with each handled web request.
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
Misc/NEWS.d/next/Library/2024-08-23-15-49-10.gh-issue-116810.QLBUU8.rst
Outdated
Show resolved
Hide resolved
Modules/_ssl.c
Outdated
session = SSL_get1_session(self->ssl); | ||
if (session == NULL) { | ||
Py_RETURN_NONE; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Judging by no test failures, this probably does not have any coverage. Any chance of including the tests in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can check the tests this week. I doubt I will be able to finish that work today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this change still results in returning a valid copy of the ssl session, I doubt tests would fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, agreeing there. I think it would be difficult to test this. With that being said, are there any existing tests for checking that the session
attribute is valid? If not, then this PR should add them (otherwise, I think this looks good).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had some life issues come up. I will try to get back to this in the next few weeks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a cacophony of mistakes here on the CPython side. Let me try to correct things:
I don't really understand how this code was nor always a memory leak in the past.
The memory leak seems to have been introduced by 39258d3. _ssl_session_dup
returns an SSL_SESSION
that needs to be freed by someone and then you all just immediately clobber the variable with something else. But test_ssl.py
clearly accesses session
. Does CPython have no way to catch memory leaks in its tests? Might I suggest you all look into LeakSanitizer?
Anyway, the problem with that commit is that you all turned #if ... #else ... #endif
into including both branches instead of just one.
That commit not only introduced a memory leak, but it also mostly broke your workaround. I haven't tried it, but I suspect you would see this if you uncleanly destroyed the other connection before calling set_session
instead of after. Now, I see uncleanly because it seems CPython never followed up on openssl/openssl#1550 (comment) because you all have been misunderstanding the nature of the workaround. It was not working around an OpenSSL bug, but a CPython bug / broken expectation.
Based on my understanding of https://docs.openssl.org/master/man3/SSL_get_session/ I believe this will work, but it does raise the question of why we have the SSL_get0_session + _ssl_session_dup combo in the first place; with modern OpenSSL 3+ why would this SSL_get1_session not just do what we wanted?
That is not what is going on here. There were, as far as I know, no changes in OpenSSL 3+ relating to this.
SSL_get0_session + _ssl_session_dup
and SSL_get1_session
do not and have never same thing in any version of OpenSSL. SSL_get0_session
and SSL_get1_session
are the exact same accessor. Just one of them increments the refcount for you and the other doesn't. _ssl_session_dup
is a (misguided) CPython function to make a deep copy of the SSL_SESSION
object.
Although the code was, prior to the regression, gated on a OPENSSL_VERSION_1_1
ifdef, that does not imply was specific to OpenSSL 1.1.x. OPENSSL_VERSION_1_1
is a CPython-invented macro (that squats OpenSSL's namespace) which means > 1.1. That is, OpenSSL 3 was also OPENSSL_VERSION_1_1
:
#if (OPENSSL_VERSION_NUMBER >= 0x10100000L) && !defined(LIBRESSL_VERSION_NUMBER)
# define OPENSSL_VERSION_1_1 1
# define PY_OPENSSL_1_1_API 1
#endif
While I have not tested it (have you all?), based on openssl/openssl#1550 (comment), I highly suspect that CPython "needs" the workaround in OpenSSL 3.x too. I say "needs" because...
I think the issue @.davidben was saying Python has in there was describing more the way Python is interacting with the available APIs (which at the time had to span a wide range of OpenSSL versions) and not using some other features instead.
No, that's not what I was saying in openssl/openssl#1550 (comment). See the immediately preceding comment:
FYI, I poked at this briefly and don't see any evidence that SSL_shutdown is being called by Python. It seems the pseudocode in the bug report is wrong and this is a Python bug.
The bug report was very confused because the minimal reproducer that CPython provided did not actually reflect what CPython does. With all the meandering with the bad reproducer, I suspect it was very confusing. But the CPython folks on the bug never followed up, so my past self assumed you all understood, were planning on fixing it, and I didn't have time to check on you all. 😞
What is actually going on here is that OpenSSL will invalidate sessions on connection destruction if you haven't called SSL_shutdown
first. To be honest, I don't think this is good behavior, but it has been OpenSSL's intentional behavior since day 1. I believe this is descended from this guidance:
close_notify
This message notifies the recipient that the sender will not send
any more messages on this connection. The session becomes
unresumable if any connection is terminated without proper
close_notify messages with level equal to warning.
https://www.rfc-editor.org/rfc/rfc2246.html#section-7.2.1
Now, I personally think this behavior is unhelpful. If a connection just got closed, that doesn't make the old key material broken or anything. Also if we were getting any security properties out of this, it wouldn't work anyway. There are loads of cases where you cannot reliably knock out the other session.
- Stateless resumption ("session tickets" in TLS 1.2) cannot possibly invalidate the session on the server
- Maybe your application serializes the session, e.g., on disk, and then deserializes it later
Interestingly, OpenSSL doesn't actually look for a full bidirectional shutdown to preserve your session. It only looks for whether you attempted to send close_notify
. Kind of pointless, to be honest, but IMO this whole thing is pointless.
TLS 1.1, incidentally, already softened this recommendation:
close_notify
This message notifies the recipient that the sender will not send
any more messages on this connection. Note that as of TLS 1.1,
failure to properly close a connection no longer requires that a
session not be resumed. This is a change from TLS 1.0 to conform
with widespread implementation practice.
https://www.rfc-editor.org/rfc/rfc4346.html#section-7.2.1
But OpenSSL retains this requirement, for better or worse. This requirement was not new in 1.1.x, nor was it removed in 3.x. The only reason you all needed to work around it later on is that, due to other quirks of how CPython was using the API (arguably incorrectly), as detailed in my comment on the OpenSSL issue, CPython was sneaking through a gap in this OpenSSL policy. OpenSSL fixed that gap, and now CPython was affected.
Now, contrary to the claim in openssl/openssl#1550 (comment), CPython did not call SSL_shutdown
. You all have a separate method to exchange close_notify and weren't calling it in that test.
So, what to do given this? I think you have two options:
- Agree with OpenSSL's policy and pass it up to the callers. Without calling
SSLSocket.shutdown()
, sessions get invalidated. Callers have to call the shutdown method if they care. Fix your tests to reflect this. - Disagree with OpenSSL's policy and effectively turn it off. To do that, call
SSL_shutdown
when destroying anSSL
. Now, that will cause OpenSSL to do I/O, so you might want toSSL_set_quiet_shutdown
. Alternatively, there is anSSL_set_shutdown
API to just fake a shutdown state.
In either case, you can remove the workaround because this workaround has been self-inflicted from day one.
If you pick option 2, please write some clear comments in the code so the next time this happens, I don't have to write another essay for you all. :-P
The OpenSSL 1.1.1 fork's I do think we should care somewhat about are BoringSSL and it's AWS-LC derivative. @WillChilds-Klein and @davidben can represent what those might need for these specific APIs.
So, per the above, this question is based on several incorrect premises:
- BoringSSL is not an OpenSSL 1.1.1 fork. We actually diverged from OpenSSL 1.0.2. We just report a version of 1.1.1 because that's what's most convenient for compatibility.
- This issue has nothing to do with OpenSSL 1.1.1. I expect you all still "need" this workaround in OpenSSL 3.x too.
- This issue has nothing to do with these specific APIs
Since this "workaround" is really a workaround for CPython's own mishaps, if you all do the thing I suggest above, it should work just fine in BoringSSL.
But also this is all moot in BoringSSL because BoringSSL does not invalidate sessions on SSL_free
in the first place. We removed it because it's just kinda silly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where OpenSSL documents their behavior, by the way:
A session will be automatically removed from the session cache and marked as non-resumable if the connection is not closed down cleanly, e.g. if a fatal error occurs on the connection or SSL_shutdown(3) is not called prior to SSL_free(3).
For completeness, there were two different bits of spec text that led to this OpenSSL behavior, and I only cited one of them. The bits I cited were about unclean shutdown and removed very early. Fatal error lasted all the way to TLS 1.2:
Error handling in the TLS Handshake protocol is very simple. When an
error is detected, the detecting party sends a message to the other
party. Upon transmission or receipt of a fatal alert message, both
parties immediately close the connection. Servers and clients MUST
forget any session-identifiers, keys, and secrets associated with a
failed connection. Thus, any connection terminated with a fatal
alert MUST NOT be resumed.
https://www.rfc-editor.org/rfc/rfc5246.html#section-7.2.2
It wasn't removed until TLS 1.3, in tlswg/tls13-spec@dc668dc, for the same reasons. In practice, you cannot actually do this consistently, for the reasons I listed above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
the problem with that commit is that you all turned #if ... #else ... #endif into including both branches instead of just one.
Hah, that explains the code smell here. I'd only been glancing at current version of the code in git blame view to try and understand what this was even for without following the individual commits (git blame doing its job and not exposing history of merely deleted lines)... I agree completely on getting rid of that _ssl_session_dup
hack. (and more, but we need to examine the SOTW and pick an amenable way out of the past API sins in here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, definitely makes sense to just start with something minimal (probably this PR) to fix the immediate leak. I think probably the thing to do is:
- This PR
- Remove
_ssl_session_dup
and confirm tests indeed now fail in OpenSSL 3.x. My expectation is that they'll still fail, but I've no idea if there were other changes that further complicated this mess. - No-op the goofy OpenSSL behavior. Probably something like
SSL_set_quiet_shutdown(ssl); SSL_shutdown(ssl); SSL_free(ssl);
would work? Confirm the tests now pass again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling SSL_set_shutdown
in the destructor sounds fine to me. Is there a reason to believe that the SSL_set_quiet_shutdown(ssl); SSL_shutdown(ssl);
dance would be better?
Oh and thanks for the detailed explanation @davidben .
I confirm that the change works as expected: it fixes a significant memory leak. I modified the reproducer to dump the RSS memory usage on Linux: import os
import socket
import ssl
import gc
host = '142.251.163.99' # some server we can connect with https as example
port = 443
session = None
context = ssl._create_unverified_context(protocol=ssl.PROTOCOL_TLSv1_2)
with socket.create_connection((host, port)) as sock:
ssock = context.wrap_socket(sock, server_hostname=host, session=session)
with ssock:
for i in range(300000):
session = ssock.session
if (i % 1000) == 0:
gc.collect()
os.system(f"grep ^VmRSS /proc/{os.getpid()}/status") I can easily reproduce the bug:
With the PR, there is no leak anymore:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Yes, I worked on |
Misc/NEWS.d/next/Library/2024-08-23-15-49-10.gh-issue-116810.QLBUU8.rst
Outdated
Show resolved
Hide resolved
Modules/_ssl.c
Outdated
session = SSL_get1_session(self->ssl); | ||
if (session == NULL) { | ||
Py_RETURN_NONE; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's possible to test anything, other than manually checking the leak is fixed as @vstinner did.
The issue was apparently that we were first duplicating the session using _ssl_session_dup
, but then we immediately forgot the duplicated session by calling SSL_get1_session
.
Modules/_ssl.c
Outdated
session = SSL_get1_session(self->ssl); | ||
if (session == NULL) { | ||
Py_RETURN_NONE; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That said, a potential alternate fix would be to remove the SSL_get0_session
+ _ssl_session_dup
dance and keep the SSL_get1_session
call. @jeffvanvoorst Can you try to see if it works?
Ok, I've tried to removed the session duplication entirely and it works fine locally. I've pushed the change and we can wait for CI. |
Here as well. The session duplication hack was serializing the session object to ASN1, which is apparently quite expensive (looking at the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
I'm not an OpenSSL expert but I trust other developers who were involved in this large change :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're still in "2. Disagree with OpenSSL's policy and effectively turn it off." state with this PR as is due to SSL_set_shutdown
being used to mark it as "shutdown" so that the OpenSSL implementation does cache it? There's a nice comment at least. :)
My main goal was get rid of the obvious memory leak. I left the set_session code and obviously gross slow serialize/deserialized stuff in place last night because simply removing that without figuring out the shutdown bits did lead to test failures. Thanks for picking that up and doing the rest!
Lets wait until @sethmlarson is back online (currently traveling) and can look at this before merging. It isn't going to make 3.13.0 as it clearly is not a blocker (given 3.10 and 3.11 have always had this flaw and it was only reported this year). I gather session caching for clients is not super common. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @davidben for all the explanation, this change LGTM.
Thanks @jeffvanvoorst for the PR, and @gpshead for merging it 🌮🎉.. I'm working now to backport this PR to: 3.12, 3.13. |
Resolve a memory leak introduced in CPython 3.10's :mod:`ssl` when the :attr:`ssl.SSLSocket.session` property was accessed. Speeds up read and write access to said property by no longer unnecessarily cloning session objects via serialization. (cherry picked from commit 7e7223e) Co-authored-by: Jeffrey R. Van Voorst <[email protected]> Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com> Co-authored-by: Peter Bierma <[email protected]> Co-authored-by: Gregory P. Smith <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]>
GH-124800 is a backport of this pull request to the 3.13 branch. |
Resolve a memory leak introduced in CPython 3.10's :mod:`ssl` when the :attr:`ssl.SSLSocket.session` property was accessed. Speeds up read and write access to said property by no longer unnecessarily cloning session objects via serialization. (cherry picked from commit 7e7223e) Co-authored-by: Jeffrey R. Van Voorst <[email protected]> Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com> Co-authored-by: Peter Bierma <[email protected]> Co-authored-by: Gregory P. Smith <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]>
GH-124801 is a backport of this pull request to the 3.12 branch. |
gh-116810: fix memory leak in ssl module (GH-123249) Resolve a memory leak introduced in CPython 3.10's :mod:`ssl` when the :attr:`ssl.SSLSocket.session` property was accessed. Speeds up read and write access to said property by no longer unnecessarily cloning session objects via serialization. (cherry picked from commit 7e7223e) Co-authored-by: Jeffrey R. Van Voorst <[email protected]> Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com> Co-authored-by: Peter Bierma <[email protected]> Co-authored-by: Gregory P. Smith <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]>
Thanks everyone. I appreciate the cpython community support. |
gh-116810: fix memory leak in ssl module (GH-123249) Resolve a memory leak introduced in CPython 3.10's :mod:`ssl` when the :attr:`ssl.SSLSocket.session` property was accessed. Speeds up read and write access to said property by no longer unnecessarily cloning session objects via serialization. (cherry picked from commit 7e7223e) Co-authored-by: Jeffrey R. Van Voorst <[email protected]> Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com> Co-authored-by: Peter Bierma <[email protected]> Co-authored-by: Gregory P. Smith <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]>
PySSL_get_session leaks a session object each time it is called. For programs like cherrypy (cheroot), a leak occurs with each handled web request.