Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential regression in Python 3.11 (multiprocess shutdown?) #97641

Open
carltongibson opened this issue Sep 29, 2022 · 18 comments
Open

Potential regression in Python 3.11 (multiprocess shutdown?) #97641

carltongibson opened this issue Sep 29, 2022 · 18 comments
Assignees

Comments

@carltongibson
Copy link

carltongibson commented Sep 29, 2022

Running the Django test suite against the Python 3.11 pre-releases, we have hit
a potential regression.

Steps to reproduce

git clone https://github.com/django/django.git
python3.11 -m venv env
.\env\Scripts\Activate.ps1
python -m pip install -e django
cd django\tests
python runtests.py --parallel

Bug report

On Python 3.8, 3.9, and 3.10 this runs without problem.

On Python 3.11 the following error in seen after the test suite completes,
during shutdown:

Traceback (most recent call last):
  File "C:\Users\carlt\src\django\tests\runtests.py", line 773, in <module>
    failures = django_tests(
               ^^^^^^^^^^^^^
  File "C:\Users\carlt\src\django\tests\runtests.py", line 432, in django_tests
    failures = test_runner.run_tests(test_labels)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\carlt\src\django\django\test\runner.py", line 1057, in run_tests
    self.teardown_databases(old_config)
  File "C:\Users\carlt\src\django\django\test\runner.py", line 980, in teardown_databases
    _teardown_databases(
  File "C:\Users\carlt\src\django\django\test\utils.py", line 378, in teardown_databases
    connection.creation.destroy_test_db(
  File "C:\Users\carlt\src\django\django\db\backends\base\creation.py", line 307, in destroy_test_db
    self._destroy_test_db(test_database_name, verbosity)
  File "C:\Users\carlt\src\django\django\db\backends\sqlite3\creation.py", line 111, in _destroy_test_db
    os.remove(test_database_name)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'other_1.sqlite3'

This looked similar to us to open issue #95027 but we were asked to report it
separately.

I've tested all the way back to a1 where (along with other issues now resolved)
this error still occurs:

(django-py311) PS C:\Users\carlt\src\django\tests> python --version
Python 3.11.0rc2
...
(django-py311) PS C:\Users\carlt\src\django\tests> python --version
Python 3.11.0a1

This is somewhat frustrating as we've tried to test on all platforms since the
first releases.
Our test suite would only run on Windows with Python 3.11 very recently as
there was a third-party dependency that was not compatible with Windows. We
will try to adjust to test without dependencies as well on Windows for future
versions. (Sorry about that.)

Please do let us know if we can provide further info. I imagine though the
easiest thing is for you to run this yourself.

Thanks.

//cc @felixxm

Your environment

  • CPython versions tested on: 3.11 pre-releases from a1 to rc2.
  • Operating system and architecture: Windows 11.
@carltongibson carltongibson added the type-bug An unexpected behavior, bug, or error label Sep 29, 2022
@sobolevn
Copy link
Member

sobolevn commented Sep 29, 2022

Based on C:\Users\carlt\src\django\django\db\backends\sqlite3\creation.py I think it might be sqlite related 🤔

@corona10
Copy link
Member

corona10 commented Oct 2, 2022

cc @erlend-aasland

@erlend-aasland
Copy link
Contributor

erlend-aasland commented Oct 3, 2022

I don't have access to a PC with Windows 11, so unfortunately I am unable debug this :( If someone with Windows 11 could do a bisect, that would be super helpful.

A lot of changes happened between 3.10 entered the beta phase and the first 3.11 alpha was released:

$ git log --oneline --until=2021-10-05 --since=2021-05-03 Modules/_sqlite | wc -l
56

@carltongibson
Copy link
Author

carltongibson commented Oct 4, 2022

If someone with Windows 11 could do a bisect, that would be super helpful.

Hi @erlend-aasland I've given this a try but I'm hitting various (A, for earlier commits) tool-chain and (B, for later commits) linker errors trying to build previous versions that (frankly) I'm not familiar enough with Windows, or building CPython for that matter, to be able to progress with in a reasonable timeframe.

I was able to reproduce the issue at 0474d06, but I'm getting build errors by a3c11ce. A Windows expert would likely have more success.

(Sorry to not be of more help. Hopefully I'm a few steps along the road for the future.)

@carltongibson
Copy link
Author

carltongibson commented Oct 4, 2022

If someone with Windows 11 could do a bisect, that would be super helpful.

Hi @erlend-aasland I've given this a try but I'm hitting various (A, for earlier commits) tool-chain and (B, for later commits) linker errors trying to build previous versions that I'm not familiar enough with Windows, or building CPython for that matter, to be able to progress with in a reasonable timeframe.

I was able to reproduce the issue at 0474d06, but I'm getting build errors by a3c11ce. A Windows expert would likely have more success.

I'm happy to keep trying, but I've reached a bit of a roadblock with building Python failing

(Sorry to not be of more help. Hopefully I'm a few steps along the road for the future.)

@felixxm
Copy link
Contributor

felixxm commented Oct 5, 2022

Can someone add OS-Windows label? 🪟

@felixxm
Copy link
Contributor

felixxm commented Oct 5, 2022

I'm not sure if it's related 🤔 but after forcing spawn on Linux, I'm getting:

Exception ignored in: <function Pool.__del__ at 0x7f5548c9d580>
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 268, in __del__
ResourceWarning: unclosed running multiprocessing pool <multiprocessing.pool.Pool state=RUN pool_size=8>

/usr/local/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

@tkanhe
Copy link

tkanhe commented Oct 5, 2022

Just checked on following PC specifications:
Edition: Windows 11 Home Single Language
Version: 21H2
OS build: 22000.1042

with Python 3.11.0rc2, Django-4.2.dev20221005120449

Output:

`PS C:\Users\TK\Desktop> cd django\testsPS C:\Users\TK\Desktop\django\tests> python runtests.py --parallel
Testing against Django installed in 'C:\Users\TK\Desktop\django\django' with up to 16 processes
Found 16023 test(s).
Creating test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Cloning test database for alias 'default'...
Creating test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
Cloning test database for alias 'other'...
System check identified no issues (17 silenced).
ssssssssssssssssssssssssssssssssssssssssssssss.................................................................ss.........s.....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................s..............................................s..................................................................................................................s.....................s.....sss..ss.....................................................................................................................................................................................................................................................................................................................................................................................................s.....................................................................................................................................................................................s.................................................................s................sss...s.ss.....s..ss.sss..s......ss.s.................s........s.....................ss.....................s..s....s...s.....................s..............s......s...s...ss............s.....................

Ran 16021 tests in 74.720s

OK (skipped=1828, expected failures=5)
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'default'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...
Destroying test database for alias 'other'...`

*truncked some of .......sss........ output.

Let me know any other info/tests needed; I will be happy to help in the future regarding testing on Windows 11.

@stic
Copy link

stic commented Oct 5, 2022

Couldn't replicate on auto (16 cores laptop).
python runtests.py --parallel
Bumping it up to ~40 started hitting the sharing violation on the removal of the 'other_1.sqlite3' (weirdly only the 1st i.e. other_1).
python runtests.py --parallel 40

So confirmed on latest 3.11.0rc2.

I've disabled Windows Defender on the folder under test (django) as well as %TEMP% to make sure Windows Defender it's not causing issues. That's due to fact that I've got quite a few issues with pickling (i.e. cleaning after that) e.g.

  File "c:\users\<username>\AppData\Local\Programs\Python\Python311\Lib\shutil.py", line 623, in _rmtree_unsafe
    os.rmdir(path)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\<username>\\AppData\\Local\\Temp\\django_jpnyol3z\\django_16co2dfg\\tmpcaxkg2x2\\test_project'

Not 100% sure if that's a good troubleshooting approach, but a few observations:
Pass that passes manages to call CloseFile from the first process, that is the process creating the other_1.sqlite3 and it's the same process that will try to delete the db file (clean-up). Crashing test run doesn't make such CloseFile call, last syscalls seen on system level are to unlock the file. In itself that shouldn't be a problem I guess (as it's the same PID trying to close / remove the file later on). However, there is quite a gap (up to 2sec on my box) between syscalls to unlock the file and close. Can't really figure it out yet (first time with Django test suit), but if you compare vs. a run on 3.10 there is a visible wait whilst: Creating test database for alias 'other'...
Working theory - we are coming back on the same PID (process) to clean-up a file, that we are still keeping open (for close operation) elsewhere?

@erlend-aasland
Copy link
Contributor

erlend-aasland commented Oct 10, 2022

FTR, I'm in the process of installing a Windows 11 development environment now. Will hopefully be up and running by this evening, or at least tomorrow. I'll need to delve into how to debug CPython on Windows using Visual Studio, since I have little experience with both Windows and Visual Studio.

@erlend-aasland
Copy link
Contributor

erlend-aasland commented Oct 10, 2022

cc. release-manager @pablogsal

@erlend-aasland
Copy link
Contributor

erlend-aasland commented Oct 17, 2022

Couldn't replicate on auto (16 cores laptop). python runtests.py --parallel

Ditto (Win 11, 16 cores)

Bumping it up to ~40 started hitting the sharing violation on the removal of the 'other_1.sqlite3' (weirdly only the 1st i.e. other_1). python runtests.py --parallel 40

So confirmed on latest 3.11.0rc2.

Ditto. Bumped to 40, hit the PermissionError for other_1.sqlite3.

@erlend-aasland
Copy link
Contributor

erlend-aasland commented Oct 18, 2022

Thanks so much for your input Eryk. AFAICS, this is not an sqlite3 bug; it is an issue with the test runner itself.

@erlend-aasland erlend-aasland added pending The issue will be closed if no feedback is provided and removed expert-sqlite3 type-bug An unexpected behavior, bug, or error labels Oct 18, 2022
@erlend-aasland
Copy link
Contributor

erlend-aasland commented Oct 18, 2022

I'm for some reason unable to install Visual Studio 2015, so I'm unable to compile 3.10.0 beta1 at the moment. Thus, I'm once again unable to bisect this :(

@carltongibson
Copy link
Author

carltongibson commented Oct 25, 2022

This still looks to be failing with 3.11.0, so still looks like a regression from 3.10.

@erlend-aasland erlend-aasland removed the pending The issue will be closed if no feedback is provided label Oct 25, 2022
@felixxm
Copy link
Contributor

felixxm commented Oct 26, 2022

I was able to fix this by adding an explicit connection close() (see django/django#16225). A similar issue occurred in sqlite-utils, see simonw/sqlite-utils#503. I'm not sure why it started showing up in Python 3.11 but I don't mind closing the issue.

@eryksun
Copy link
Contributor

eryksun commented Oct 26, 2022

Thanks so much for your input Eryk. AFAICS, this is not an sqlite3 bug; it is an issue with the test runner itself.

Sorry, that comment wasn't meant for this issue. I cannot reproduce this issue to diagnose it. For me, there's a significant window of time between when the worker processes access the "*.sqlite3" files and when the main process deletes them at shutdown. If something else in the main process sometimes has the file open without delete sharing, I haven't been able to reproduce it in 3.11. (Note that any locking/unlocking of the database file that may be observed in monitoring tools has nothing to do with a sharing violation. Byte-range locks can lead to locking violations for read and write operations, but not to a sharing violation.)

That said, for projects that are using multiprocessing for parallel testing, I think it would help if multiprocessing introduced optional support for process groups and, in Windows, job objects, in order to do a better job ensuring that the entire process tree has been given a chance to gracefully exit, and subsequently to forcefully kill any remaining processes.

@eryksun
Copy link
Contributor

eryksun commented Oct 26, 2022

I'm not sure why it started showing up in Python 3.11

3.11 switched to using functools.lru_cache for the statement cache. See #87028. This introduces a reference cycle involving the connection object, so the connection object no longer gets immediately deallocated, not until, for example, gc.collect() is called to break the cycle.

@eryksun eryksun added extension-modules C modules in the Modules dir 3.12 expert-sqlite3 labels Oct 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants