bpo-34010: Fix tarfile read performance regression #8020
Conversation
During buffered read, use a list followed by a join instead of extending a bytes object. This is how it was done before it was changed in commit b506dc3.
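As a hedged illustration (not code from the PR), the difference between the two patterns looks roughly like this; `chunks` is a hypothetical stand-in for the blocks returned by the underlying stream:

```python
# Hypothetical stand-in for blocks read from a stream.
chunks = [b"x" * 1024 for _ in range(2_000)]

# Pattern restored by this PR: append to a list, join once.
# join allocates the result a single time, so this is O(total size).
t = []
for buf in chunks:
    t.append(buf)
data = b"".join(t)

# Pattern from commit b506dc3 (the regression): repeated bytes
# concatenation may copy the accumulated buffer on every step,
# which can degrade to quadratic time over many chunks.
data2 = b""
for buf in chunks:
    data2 += buf

assert data == data2
```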
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA). Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA (this might simply be due to a missing "GitHub Name" entry in your b.p.o account settings). This is necessary for legal reasons before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue. When your account is ready, please add a comment in this pull request. Thanks again for your contribution; we look forward to reviewing it!
Signed the CLA.
```diff
@@ -525,7 +525,7 @@ def read(self, size=None):
                 if not buf:
                     break
                 t.append(buf)
-            buf = "".join(t)
+            buf = b"".join(t)
```
nice catch.
It never caused a problem, since this line is never called; `size` is never None in the function call. But still, it should be fixed, I guess.
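For illustration (not from the PR): joining bytes chunks with a str separator raises, which is why this line would have failed had it ever been reached:

```python
chunks = [b"abc", b"def"]
try:
    "".join(chunks)  # str separator with bytes items: TypeError
except TypeError as exc:
    print(exc)  # e.g. "sequence item 0: expected str instance, bytes found"
assert b"".join(chunks) == b"abcdef"  # bytes separator works
```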
```python
def __read(self, size):
    """Return size bytes from stream. If internal buffer is empty,
    read another block from the stream.
    """
    c = len(self.buf)
    t = [self.buf]
```
I don't think this optimization is needed. In every caller of `__read()`, `size` is small or equal to `bufsize`, so `while c < size:` doesn't actually loop twice.
For a large `size`, a chunked read is not needed anyway:

```python
rem = size - len(self.buf)
if rem > 0:
    self.buf += self.fileobj.read(max(self.bufsize, rem))
t = self.buf[:size]
self.buf = self.buf[size:]
return t
```
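Wrapped into a self-contained sketch (the `BufferedStream` class and its names are illustrative, not from tarfile), the suggestion behaves like this:

```python
import io

class BufferedStream:
    """Illustrative reader using the single-read approach above."""

    def __init__(self, fileobj, bufsize=16 * 1024):
        self.fileobj = fileobj
        self.bufsize = bufsize
        self.buf = b""

    def read(self, size):
        # Cover the shortfall with at most one read(); no chunk loop.
        rem = size - len(self.buf)
        if rem > 0:
            self.buf += self.fileobj.read(max(self.bufsize, rem))
        t = self.buf[:size]
        self.buf = self.buf[size:]
        return t

stream = BufferedStream(io.BytesIO(b"a" * 100_000))
assert stream.read(10) == b"a" * 10
assert len(stream.read(50_000)) == 50_000
```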
```diff
@@ -538,6 +538,7 @@ def _read(self, size):
             return self.__read(size)
 
         c = len(self.dbuf)
+        t = [self.dbuf]
         while c < size:
             buf = self.__read(self.bufsize)
             if not buf:
```
How about bypassing `self.__read()`?

```python
while c < size:
    if self.buf:
        buf = self.buf
        self.buf = b""
    else:
        buf = self.fileobj.read(self.bufsize)
    if not buf:
        break
```
For compressed streams your suggestion works. However, there is one case where `self.__read()` is called with a large `size`: when you open the stream uncompressed with `"r|"`. For this special case, you still need the optimization in `self.__read()`.
Lines 537 to 538 in 3c45240:

```python
if self.comptype == "tar":
    return self.__read(size)
```
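A rough sketch of that special case (the archive name is hypothetical): `"r|"` reads an uncompressed, non-seekable stream, so member reads funnel through `_Stream.__read()` with potentially large sizes:

```python
import tarfile

with tarfile.open("archive.tar", mode="r|") as tf:  # hypothetical archive
    for member in tf:
        f = tf.extractfile(member)
        if f is not None:
            data = f.read()  # can reach __read() with a large size
```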
Regardless of the optimization in `self.__read()`, the double (and unaligned) buffering is totally useless.
Yes, I agree. It is a bit twisted to handle the different cases of decompressing versus direct read. So far, this is just a minimal fix of the performance regression.
You're right. To backport this, fix only the repeated `bytes += bytes` and keep the other behavior as-is.
Thanks @hajoscher for the PR, and @methane for merging it.
GH-8082 is a backport of this pull request to the 3.7 branch.
GH-8083 is a backport of this pull request to the 3.6 branch.
During buffered read, use a list followed by a join, instead of extending a bytes object.
This is how it was done before it was changed in commit b506dc3.
See how to test in bpo-34010.
https://bugs.python.org/issue34010
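bpo-34010 has the actual test instructions; a timing sketch of that general shape (the archive name is hypothetical) might look like:

```python
import tarfile
import time

start = time.perf_counter()
with tarfile.open("big.tar.gz", mode="r|gz") as tf:  # hypothetical archive
    for member in tf:
        f = tf.extractfile(member)
        if f is not None:
            f.read()
print(f"elapsed: {time.perf_counter() - start:.2f}s")
```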