Improve performance of pathlib.scandir() #82116

ShaiAvr · 2019-08-24T06:31:41Z

BPO	37935
Nosy	@brettcannon, @gpshead, @pitrou, @serhiy-storchaka, @hongweipeng, @ShaiAvr
PRs	bpo-37935: Improve performance of pathlib.scandir() #15331 bpo-37935: Added tests for os.walk(), glob.iglob() and Path.glob() #15956 [3.8] bpo-37935: Added tests for os.walk(), glob.iglob() and Path.glob() (GH-15956) #16043

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/gpshead'
closed_at = None
created_at = <Date 2019-08-24.06:31:41.093>
labels = ['3.8', 'library', '3.9', 'performance']
title = 'Improve performance of pathlib.scandir()'
updated_at = <Date 2019-09-12.15:07:49.880>
user = 'https://github.com/ShaiAvr'

bugs.python.org fields:

activity = <Date 2019-09-12.15:07:49.880>
actor = 'gregory.p.smith'
assignee = 'gregory.p.smith'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2019-08-24.06:31:41.093>
creator = 'Shai'
dependencies = []
files = []
hgrepos = []
issue_num = 37935
keywords = ['patch']
message_count = 13.0
messages = ['350354', '350589', '350604', '350605', '350608', '350613', '351706', '351735', '351736', '351756', '351901', '352148', '352194']
nosy_count = 6.0
nosy_names = ['brett.cannon', 'gregory.p.smith', 'pitrou', 'serhiy.storchaka', 'hongweipeng', 'Shai']
pr_nums = ['15331', '15956', '16043']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue37935'
versions = ['Python 3.8', 'Python 3.9']

ShaiAvr · 2019-08-24T06:31:41Z

I recently have taken a look in the source code of the pathlib module, and I saw something weird there:

when the module used the scandir function, it first converted its iterator into a list and then used it in a for loop. The list wasn't used anywhere else, so I think the conversion to list is just a waste of performance.

In addition, I noticed that the scandir iterator is never closed (it's not used in a with statement and its close method isn't called). I know that the iterator is closed automatically when it's garbaged collected, but according to the docs, it's advisable to close it explicitly.

I've created a pull request that fixes these issues:
PR 15331

In the PR, I changed the code so the scandir iterator is used directly instead of being converted into a list and I wrapped its usage in a with statement to close resources properly.

hongweipeng · 2019-08-27T03:45:28Z

Scandir() will be close when it iteration is over.You can see ScandirIterator_iternext:

static PyObject *
ScandirIterator_iternext(ScandirIterator *iterator)
{
    while (1) {
        ...
    }
    /* Error or no more files */
    ScandirIterator_closedir(iterator);
    return NULL;
}

So, entries = list(scandir(parent_path)) resources in this code can be properly closed.

ShaiAvr · 2019-08-27T06:42:32Z

From the docs (https://docs.python.org/3/library/os.html#os.scandir.close):

"This is called automatically when the iterator is exhausted or garbage collected, or when an error happens during iterating. However it is advisable to call it explicitly or use the with statement.".

The iterator is indeed closed properly, but the docs state that it's still advisable to close it explicitly, which is why I wrapped it in a with statement.

However, the more important change is that the iterator is no longer converted into a list, which should reduce the iterations from 2N to N, when N is the number of entries in the directory (one N when converting to list and another one when iterating it). This should enhance the performance of the functions that use scandir.

serhiy-storchaka · 2019-08-27T06:47:03Z

Could you please provide any microbenchmarks that show the performance improvement?

ShaiAvr · 2019-08-27T06:56:24Z

I'm new to contributing here. I've never done benchmarking before.

I'd appreciate it if you could provide a guide to benchmarking.
You could look at the changes I made in the pull request (PR 15331). They're easy to follow and I think that removing a useless call to list() should enhance the performance, but I'd like to have benchmarking to back this up, so if someone more experienced could do this or at least provide a link to a guide, I'd really appreciate it.

serhiy-storchaka · 2019-08-27T07:27:44Z

I think using the timeit module is enough. For more precise benchmarking you may need to use the pyperf module, but I think this is not a case.

For example, something like:

./python -m timeit -s "from pathlib import Path; p = Patch('...')" "for x in p.rglob('...'): pass"

gpshead · 2019-09-10T15:41:10Z

Shai, please open the 'Your Details' link in the bugs.python.org sidebar and make sure you have your github username filled in. it needs to say ShaiAvr for our CLA automation to understand.

Avoiding calling list() on the output of scandir() is always desirable. :)

serhiy-storchaka · 2019-09-10T18:14:24Z

The title is misleading. There is no pathlib.scandir().

serhiy-storchaka · 2019-09-10T18:20:11Z

Any optimization can be accepted only when we have any prove that the change actually has measurable effect, and that it is large enough.

Avoiding calling list() on the output of scandir() may be not harmless. For example it will left open a file descriptor. This can cause issues if walk a deep tree, especially if do this in few threads simultaneously. list() guaranties that it is open in very limited time (although using a context manager is desirable).

gpshead · 2019-09-10T23:03:27Z

For some context here, the pathlib scandir code was written by Serhiy in https://bugs.python.org/issue26032 and related commit 680cb15.

Any optimization can be accepted only when we have any prove that the change actually has measurable effect, and that it is large enough.

While often good advice, there is no such blanket policy. Saying that comes across as dismissive. A better way to phrase such a statement is as a question:

In what practical scenarios does this change provide a demonstrable benefit?

The improvement in this case could be avoiding an explosion of memory use in some circumstances. Depending on the particular filesystem. (the original reason based on real world production experience that I pushed to have our os.scandir API created in the first place)

Avoiding calling list() on the output of scandir() may be not harmless.

Keeping it may not be harmless either. :)

One key point of os.scandir() is to be iterated over instead of turned into a list to avoid using too much memory. Code that requires a list could've just called listdir() (as the previous code did in both of these places) - if that is intentional, we should include an explicit comment stating why a preloaded list was intentional.

Either these list calls go away as is natural with scandir, or comments should be added explaining why they are important (related: unittests exercising the situations they are important for would be ideal)

As you're the original author of the pathlib scandir code, do you remember what the intent of these list(scandir) calls was?

One potential difference without them: If someone is selecting via pathlib and modifying directory file contents while iterating over the results, the lists may be important. (what happens in this case presumably depends on the underlying os fs implementation)

It sounds like we don't have test cases covering such scenarios.

serhiy-storchaka · 2019-09-11T14:25:37Z

Yes, I am the author of the code that uses os.scandir() in os.walk(), os.fwalk(), glob.iglob() and Path.glob() (see bpo-23605, bpo-25996, bpo-25596, bpo-26032). And it was always in mind to not keep many file descriptors open when traverse directories recursively. It was also a reason of rejecting my earlier patch for speeding up os.walk() by using os.fwalk() (see bpo-15200).

It is safe to iterated over os.scandir() without turning it into a list if we do not do this recursively.

Unfortunately there were no special tests for this. PR 15956 adds them. You can easily break tests for pathlib if remove any of list(). It is harder to break tests for glob, because fnmatch.filter() consumes the iterator and implicitly closes the scandir iterator. You need to replace it with a generator and fnmatch.fnmatch() called in a loop. Breaking the tests for os.walk() is difficult. The code of os.walk() is so complex because it needs to return lists of files and subdirectories, and therefore it consumes the scandir iterator and closes it. But with some tricks it is possible to break tests for os.walk(). It is unlikely somebody will do this unintentionally.

gpshead · 2019-09-12T12:54:51Z

New changeset f9dc2ad by Gregory P. Smith (Serhiy Storchaka) in branch 'master':
bpo-37935: Added tests for os.walk(), glob.iglob() and Path.glob() (GH-15956)
f9dc2ad

gpshead · 2019-09-12T15:07:50Z

New changeset 98a4a71 by Gregory P. Smith (Miss Islington (bot)) in branch '3.8':
bpo-37935: Added tests for os.walk(), glob.iglob() and Path.glob() (GH-15956) (GH-16043)
98a4a71

…athlib.

GH-94939) Automerge-Triggered-By: GH:brettcannon

barneygale · 2022-07-21T15:06:20Z

I've added a couple of comments to pathlib.py explaining the current behaviour:

# We must close the scandir() object before proceeding to
# avoid exhausting file descriptors when globbing deep trees.
with scandir(parent_path) as scandir_it:
    entries = list(scandir_it)
for entry in entries:
    ...

I think this issue can now be resolved.

ShaiAvr mannequin added 3.9 stdlib Python modules in the Lib dir performance Performance or resource usage labels Aug 24, 2019

gpshead added the 3.8 label Sep 10, 2019

gpshead self-assigned this Sep 10, 2019

ezio-melotti transferred this issue from another repository Apr 10, 2022

gpshead removed their assignment Jun 17, 2022

gpshead removed 3.9 3.8 labels Jun 17, 2022

brettcannon added the expert-pathlib label Jun 21, 2022

barneygale added a commit to barneygale/cpython that referenced this issue Jul 17, 2022

pythongh-82116: add comment explaining use of list(scandir_it) in p…

5116d9f

…athlib.

bedevere-bot mentioned this issue Jul 17, 2022

gh-82116: add comment explaining use of list(scandir_it) in pathlib. #94939

Merged

miss-islington pushed a commit that referenced this issue Jul 20, 2022

gh-82116: add comment explaining use of list(scandir_it) in pathlib. (

fd4a42d

GH-94939) Automerge-Triggered-By: GH:brettcannon

brettcannon closed this as completed Jul 21, 2022

Improve performance of pathlib.scandir() #82116

Improve performance of pathlib.scandir() #82116

ShaiAvr mannequin commented Aug 24, 2019

ShaiAvr mannequin commented Aug 24, 2019

hongweipeng mannequin commented Aug 27, 2019

ShaiAvr mannequin commented Aug 27, 2019

serhiy-storchaka commented Aug 27, 2019

ShaiAvr mannequin commented Aug 27, 2019

serhiy-storchaka commented Aug 27, 2019

gpshead commented Sep 10, 2019

serhiy-storchaka commented Sep 10, 2019

serhiy-storchaka commented Sep 10, 2019

gpshead commented Sep 10, 2019

serhiy-storchaka commented Sep 11, 2019

gpshead commented Sep 12, 2019

gpshead commented Sep 12, 2019

barneygale commented Jul 21, 2022

Improve performance of pathlib.scandir() #82116

Improve performance of pathlib.scandir() #82116

Comments

ShaiAvr mannequin commented Aug 24, 2019

ShaiAvr mannequin commented Aug 24, 2019

hongweipeng mannequin commented Aug 27, 2019

ShaiAvr mannequin commented Aug 27, 2019

serhiy-storchaka commented Aug 27, 2019

ShaiAvr mannequin commented Aug 27, 2019

serhiy-storchaka commented Aug 27, 2019

gpshead commented Sep 10, 2019

serhiy-storchaka commented Sep 10, 2019

serhiy-storchaka commented Sep 10, 2019

gpshead commented Sep 10, 2019

serhiy-storchaka commented Sep 11, 2019

gpshead commented Sep 12, 2019

gpshead commented Sep 12, 2019

barneygale commented Jul 21, 2022