
bpo-34010: Fix tarfile read performance regression #8020

Merged

Conversation

hajoscher
Contributor

@hajoscher hajoscher commented Jun 30, 2018

During a buffered read, use a list followed by a join instead of repeatedly extending a bytes object.
This is how it was done before it was changed in commit b506dc3.
See bpo-34010 for how to test.

https://bugs.python.org/issue34010

During buffered read, use a list followed by join
instead of extending a bytes object.
This is how it was done before but changed in commit b506dc3.
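The pattern the commit restores can be sketched in isolation (hypothetical helper names; not the actual tarfile code, which accumulates into `self.buf`):

```python
def read_by_concat(chunks):
    """The regressed pattern: each += may copy everything read so far,
    making the whole loop quadratic in the total size."""
    buf = b""
    for chunk in chunks:
        buf += chunk
    return buf

def read_by_join(chunks):
    """The restored pattern: O(1) appends and a single final copy."""
    parts = []
    for chunk in chunks:
        parts.append(chunk)
    return b"".join(parts)

data = [b"x" * 1024] * 100
assert read_by_concat(data) == read_by_join(data) == b"x" * 102400
```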
@the-knights-who-say-ni

the-knights-who-say-ni commented Jun 30, 2018

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA (this might be simply due to a missing "GitHub Name" entry in your b.p.o account settings). This is necessary for legal reasons before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

When your account is ready, please add a comment in this pull request
and a Python core developer will remove the "CLA not signed" label
to make the bot check again.

Thanks again for your contribution, we look forward to reviewing it!

@hajoscher
Contributor Author

hajoscher commented Jun 30, 2018

Signed the CLA.

@@ -525,7 +525,7 @@ def read(self, size=None):
if not buf:
break
t.append(buf)
buf = "".join(t)
buf = b"".join(t)
Member

@methane methane Jul 1, 2018

nice catch.

Contributor Author

@hajoscher hajoscher Jul 4, 2018

It never caused a problem, since this line is never reached: size is never None when the function is called. But it should still be fixed, I guess.
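The dead line was still worth fixing: `str.join` rejects bytes items, so it would raise if it were ever reached. A minimal demonstration (not the tarfile code itself):

```python
chunks = [b"ab", b"cd"]

# The corrected line: a bytes separator joining bytes chunks.
assert b"".join(chunks) == b"abcd"

# The old line: a str separator over bytes items raises TypeError.
try:
    "".join(chunks)
except TypeError:
    print("str.join rejects bytes items")
```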


def __read(self, size):
"""Return size bytes from stream. If internal buffer is empty,
read another block from the stream.
"""
c = len(self.buf)
t = [self.buf]
Member

@methane methane Jul 1, 2018

I don't think this optimization is needed.
In all callers of __read(), size is small or == bufsize,
so `while c < size:` doesn't actually loop twice.

Member

@methane methane Jul 4, 2018

For large size, chunked read is not needed anyway.

rem = size - len(self.buf)
if rem > 0:
    self.buf += self.fileobj.read(max(self.bufsize, rem))
t = self.buf[:size]
self.buf = self.buf[size:]
return t
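The variant suggested above can be exercised standalone with io.BytesIO (a simplified sketch with hypothetical names; the real code lives in tarfile._Stream):

```python
import io

class StreamSketch:
    """Simplified stand-in for tarfile's stream reader: read the whole
    remainder in one fileobj.read() call instead of looping block by block."""

    def __init__(self, fileobj, bufsize=16 * 1024):
        self.fileobj = fileobj
        self.bufsize = bufsize
        self.buf = b""

    def read(self, size):
        # Fetch only what is missing, but never less than one block.
        rem = size - len(self.buf)
        if rem > 0:
            self.buf += self.fileobj.read(max(self.bufsize, rem))
        t = self.buf[:size]
        self.buf = self.buf[size:]
        return t

s = StreamSketch(io.BytesIO(b"abcdefgh" * 1000))
assert s.read(5) == b"abcde"
assert len(s.read(100_000)) == 8000 - 5   # short read at end of stream
```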

@@ -538,6 +538,7 @@ def _read(self, size):
return self.__read(size)

c = len(self.dbuf)
t = [self.dbuf]
while c < size:
buf = self.__read(self.bufsize)
if not buf:
Member

@methane methane Jul 1, 2018

How about bypassing self.__read()?

        while c < size:
            if self.buf:
                buf = self.buf
                self.buf = b""
            else:
                buf = self.fileobj.read(self.bufsize)
                if not buf:
                    break

Contributor Author

@hajoscher hajoscher Jul 4, 2018

For compressed streams your suggestion works. However, there is one case where self.__read() is called with a large size: when the stream is opened uncompressed with "r|". For this special case, you still need the optimization in self.__read().

cpython/Lib/tarfile.py

Lines 537 to 538 in 3c45240

if self.comptype == "tar":
return self.__read(size)
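The uncompressed streaming path can be reproduced with the public API: mode "r|" reads the archive as a non-seekable stream, which is the case discussed here where __read() sees large sizes.

```python
import io
import tarfile

# Build a small tar archive in memory.
payload = b"hello tarfile " * 100
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w") as tf:
    info = tarfile.TarInfo(name="member.txt")
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))

# Read it back as an uncompressed stream ("r|").
archive.seek(0)
with tarfile.open(fileobj=archive, mode="r|") as tf:
    for member in tf:
        data = tf.extractfile(member).read()
        assert data == payload
```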

Member

@methane methane Jul 4, 2018

Regardless of the optimization in self.__read(), the double (and unaligned) buffering is totally useless.

Contributor Author

@hajoscher hajoscher Jul 4, 2018

Yes, I agree. It is a bit twisted to handle the different cases of decompressing versus direct read. So far, this is just a minimal fix of the performance regression.

Member

@methane methane Jul 4, 2018

You're right.
To backport this, fix only the repeated `bytes += bytes` and keep the other behavior as-is.

@methane methane changed the title bpo-34010: improve tarfile stream read performance bpo-34010: Improve tarfile stream read performance Jul 4, 2018
@methane methane changed the title bpo-34010: Improve tarfile stream read performance bpo-34010: Fix tarfile read performance regression Jul 4, 2018
@methane methane added type-bug An unexpected behavior, bug, or error needs backport to 3.6 performance Performance or resource usage needs backport to 3.7 labels Jul 4, 2018
@methane methane merged commit 12a08c4 into python:master Jul 4, 2018
@miss-islington
Contributor

miss-islington commented Jul 4, 2018

Thanks @hajoscher for the PR, and @methane for merging it 🌮🎉. I'm working now to backport this PR to: 3.6, 3.7.
🐍🍒🤖

@bedevere-bot

bedevere-bot commented Jul 4, 2018

GH-8082 is a backport of this pull request to the 3.7 branch.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Jul 4, 2018
During buffered read, use a list followed by join instead of extending a bytes object.
This is how it was done before but changed in commit b506dc3.
(cherry picked from commit 12a08c4)

Co-authored-by: hajoscher <hajoscher@gmail.com>
@bedevere-bot

bedevere-bot commented Jul 4, 2018

GH-8083 is a backport of this pull request to the 3.6 branch.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Jul 4, 2018
During buffered read, use a list followed by join instead of extending a bytes object.
This is how it was done before but changed in commit b506dc3.
(cherry picked from commit 12a08c4)

Co-authored-by: hajoscher <hajoscher@gmail.com>
miss-islington added a commit that referenced this pull request Jul 4, 2018
During buffered read, use a list followed by join instead of extending a bytes object.
This is how it was done before but changed in commit b506dc3.
(cherry picked from commit 12a08c4)

Co-authored-by: hajoscher <hajoscher@gmail.com>
miss-islington added a commit that referenced this pull request Jul 4, 2018
During buffered read, use a list followed by join instead of extending a bytes object.
This is how it was done before but changed in commit b506dc3.
(cherry picked from commit 12a08c4)

Co-authored-by: hajoscher <hajoscher@gmail.com>
yahya-abou-imran pushed a commit to yahya-abou-imran/cpython that referenced this pull request Nov 2, 2018
During buffered read, use a list followed by join instead of extending a bytes object.
This is how it was done before but changed in commit b506dc3.