bpo-34010: Fix tarfile read performance regression #8020
Conversation
During buffered read, use a list followed by a join instead of extending a bytes object. This is how it was done before it was changed in commit b506dc3.
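As a hedged illustration (not code from the PR), the difference between the two patterns looks roughly like this; `chunks` is a hypothetical stand-in for the blocks returned by the underlying stream:

```python
# Hypothetical stand-in for blocks read from a stream.
chunks = [b"x" * 1024 for _ in range(2_000)]

# Pattern restored by this PR: append to a list, join once.
# join allocates the result a single time, so this is O(total size).
t = []
for buf in chunks:
    t.append(buf)
data = b"".join(t)

# Pattern from commit b506dc3 (the regression): repeated bytes
# concatenation may copy the accumulated buffer on every step,
# which can degrade to quadratic time over many chunks.
data2 = b""
for buf in chunks:
    data2 += buf

assert data == data2
```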
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA). Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA (this might simply be due to a missing "GitHub Name" entry in your b.p.o account settings). This is necessary for legal reasons before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue. When your account is ready, please add a comment in this pull request. Thanks again for your contribution; we look forward to reviewing it!
Signed the CLA.
```diff
@@ -525,7 +525,7 @@ def read(self, size=None):
                 if not buf:
                     break
                 t.append(buf)
-            buf = "".join(t)
+            buf = b"".join(t)
```
nice catch.
It never caused a problem, since this line is never called; `size` is never None in the function call. But still, it should be fixed, I guess.
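For illustration (not from the PR): joining bytes chunks with a str separator raises, which is why this line would have failed had it ever been reached:

```python
chunks = [b"abc", b"def"]
try:
    "".join(chunks)  # str separator with bytes items: TypeError
except TypeError as exc:
    print(exc)  # e.g. "sequence item 0: expected str instance, bytes found"
assert b"".join(chunks) == b"abcdef"  # bytes separator works
```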
```python
def __read(self, size):
    """Return size bytes from stream. If internal buffer is empty,
    read another block from the stream.
    """
    c = len(self.buf)
    t = [self.buf]
```
I don't think this optimization is needed. In every caller of `__read()`, `size` is small or equal to `bufsize`, so `while c < size:` doesn't actually loop twice.
For a large `size`, a chunked read is not needed anyway:

```python
rem = size - len(self.buf)
if rem > 0:
    self.buf += self.fileobj.read(max(self.bufsize, rem))
t = self.buf[:size]
self.buf = self.buf[size:]
return t
```
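Wrapped into a self-contained sketch (the `BufferedStream` class and its names are illustrative, not from tarfile), the suggestion behaves like this:

```python
import io

class BufferedStream:
    """Illustrative reader using the single-read approach above."""

    def __init__(self, fileobj, bufsize=16 * 1024):
        self.fileobj = fileobj
        self.bufsize = bufsize
        self.buf = b""

    def read(self, size):
        # Cover the shortfall with at most one read(); no chunk loop.
        rem = size - len(self.buf)
        if rem > 0:
            self.buf += self.fileobj.read(max(self.bufsize, rem))
        t = self.buf[:size]
        self.buf = self.buf[size:]
        return t

stream = BufferedStream(io.BytesIO(b"a" * 100_000))
assert stream.read(10) == b"a" * 10
assert len(stream.read(50_000)) == 50_000
```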
```diff
@@ -538,6 +538,7 @@ def _read(self, size):
             return self.__read(size)
 
         c = len(self.dbuf)
+        t = [self.dbuf]
         while c < size:
             buf = self.__read(self.bufsize)
             if not buf:
```
How about bypassing `self.__read()`?

```python
while c < size:
    if self.buf:
        buf = self.buf
        self.buf = b""
    else:
        buf = self.fileobj.read(self.bufsize)
    if not buf:
        break
```
For compressed streams your suggestion works. However, there is one case where `self.__read()` is called with a large `size`: when you open the stream uncompressed with `"r|"`. For this special case, you still need the optimization in `self.__read()`.
Lines 537 to 538 in 3c45240:

```python
if self.comptype == "tar":
    return self.__read(size)
```
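A rough sketch of that special case (the archive name is hypothetical): `"r|"` reads an uncompressed, non-seekable stream, so member reads funnel through `_Stream.__read()` with potentially large sizes:

```python
import tarfile

with tarfile.open("archive.tar", mode="r|") as tf:  # hypothetical archive
    for member in tf:
        f = tf.extractfile(member)
        if f is not None:
            data = f.read()  # can reach __read() with a large size
```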
Regardless of the optimization in `self.__read()`, the double (and unaligned) buffering is totally useless.
Yes, I agree. It is a bit twisted to handle the different cases of decompressing versus direct read. So far, this is just a minimal fix of the performance regression.
You're right. To backport this, fix only the repeated `bytes += bytes` and keep the other behavior as-is.
Thanks @hajoscher for the PR, and @methane for merging it.
GH-8082 is a backport of this pull request to the 3.7 branch.
GH-8083 is a backport of this pull request to the 3.6 branch.
During buffered read, use a list followed by a join, instead of extending a bytes object.
This is how it was done before it was changed in commit b506dc3.
See how to test in bpo-34010.
https://bugs.python.org/issue34010
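bpo-34010 has the actual test instructions; a timing sketch of that general shape (the archive name is hypothetical) might look like:

```python
import tarfile
import time

start = time.perf_counter()
with tarfile.open("big.tar.gz", mode="r|gz") as tf:  # hypothetical archive
    for member in tf:
        f = tf.extractfile(member)
        if f is not None:
            f.read()
print(f"elapsed: {time.perf_counter() - start:.2f}s")
```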