Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stack overflow collecting PGO data on Windows #113655

Open
mdboom opened this issue Jan 2, 2024 · 11 comments
Open

Stack overflow collecting PGO data on Windows #113655

mdboom opened this issue Jan 2, 2024 · 11 comments
Assignees
Labels
3.12 bugs and security fixes 3.13 new features, bugs and security fixes build The build process and cross-build OS-windows performance Performance or resource usage release-blocker type-bug An unexpected behavior, bug, or error

Comments

@mdboom
Copy link
Contributor

mdboom commented Jan 2, 2024

Bug report

Bug description:

During a PGO build on Windows on main (471aa75), a test in test_functools fails with a stack overflow.

Full log of build

PCbuild\build.bat --pgo
0:02:50 load avg: 0.35 [19/44] test_functools
Windows fatal exception: stack overflow

Thread 0x00002b38 (most recent call first):
  File "C:\actions-runner\_work\benchmarking\benchmarking\cpython\Lib\test\libregrtest\win_utils.py", line 43 in _update_load

Current thread 0x00000e3c (most recent call first):
  File "C:\actions-runner\_work\benchmarking\benchmarking\cpython\Lib\test\test_functools.py", line 1[875](https://github.com/faster-cpython/benchmarking/actions/runs/7367240496/job/20050307979#step:10:876) in fib
  File "C:\actions-runner\_work\benchmarking\benchmarking\cpython\Lib\test\test_functools.py", line 1875 in fib
  ...

C:\actions-runner\_work\benchmarking\benchmarking\cpython>"C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Current\Bin\msbuild.exe" "C:\actions-runner\_work\benchmarking\benchmarking\cpython\PCbuild\\pythoncore.vcxproj" /t:KillPython /nologo /v:m /clp:summary /p:Configuration=PGInstrument /p:Platform=x64 /p:KillPython=true 

  Killing any running python.exe instances...

C:\actions-runner\_work\benchmarking\benchmarking\cpython\PCbuild\pyproject.props(184,5): error : PGO run did not succeed (no python313!*.pgc files) and there is no data to merge [C:\actions-runner\_work\benchmarking\benchmarking\cpython\PCbuild\pythoncore.vcxproj]

Build FAILED.

CPython versions tested on:

CPython main branch

Operating systems tested on:

Windows

Linked PRs

@mdboom mdboom added type-bug An unexpected behavior, bug, or error performance Performance or resource usage labels Jan 2, 2024
@mdboom mdboom self-assigned this Jan 2, 2024
@itamaro itamaro added build The build process and cross-build OS-windows labels Jan 2, 2024
@gvanrossum
Copy link
Member

Trying to repro, I can't even manage to complete a PGO build (first step above).

@mdboom
Copy link
Contributor Author

mdboom commented Jan 2, 2024

It sounds like you are repro'ing (I probably wasn't clear enough). It fails during the PGO build while running the tests that generate PGO data.

git bisect tells me the first bad commit is: 45e09f9 (which increased the recursion limit, which at least seems related).

@mdboom
Copy link
Contributor Author

mdboom commented Jan 2, 2024

It looks like 45e09f9 introduced the failing test, so it could just be that the test is incorrect (rather than a test going from passing to failing).

@gvanrossum
Copy link
Member

Ah, you're right, the output similar to yours scrolled off my screen. :-(

You should be able to experiment with some of the numbers changed by the PR to see which one affects the Windows PGO build. The stack limit is definitely an ongoing discussion.

@mdboom
Copy link
Contributor Author

mdboom commented Jan 2, 2024

Windows release buildbots are similarly failing: https://buildbot.python.org/all/#/builders/914/builds/3274/steps/4/logs/stdio

@gvanrossum
Copy link
Member

So it is crashing in this addition to test_functools.py:

    @support.skip_on_s390x
    @unittest.skipIf(support.is_wasi, "WASI has limited C stack")
    def test_lru_recursion(self):

        @self.module.lru_cache
        def fib(n):
            if n <= 1:
                return n
            return fib(n-1) + fib(n-2)

        if not support.Py_DEBUG:
            with support.infinite_recursion():
                fib(2500)

My conclusion is that the increased default C recursion limit is too large for the platform and we hit a stack overflow. Maybe try setting Py_C_RECURSION_LIMIT to something smaller than 8000? (I am trying now with 4000.)

@mdboom
Copy link
Contributor Author

mdboom commented Jan 2, 2024

My conclusion is that the increased default C recursion limit is too large for the platform and we hit a stack overflow. Maybe try setting Py_C_RECURSION_LIMIT to something smaller than 8000? (I am trying now with 4000.)

Yes -- my experimentation arrived at that too. 5,000 is too large, but 4,000 seems to work. See #113668.

@mdboom mdboom added the 3.12 bugs and security fixes label Jan 2, 2024
@gvanrossum
Copy link
Member

Also, this docstring (deleted by that PR) suggests a possible cause:

    """Set a lower limit for tests that interact with infinite recursions
    (e.g test_ast.ASTHelpers_Test.test_recursion_direct) since on some
    debug windows builds, due to not enough functions being inlined the
    stack size might not handle the default recursion limit (1000). See
    bpo-11105 for details."""

Maybe we are running into the inlining problem even in non-debug mode?

@gvanrossum
Copy link
Member

Let's put this on tomorrow's agenda.

@zooba
Copy link
Member

zooba commented Jan 5, 2024

Since Windows 8, we've had GetCurrentThreadStackLimits, which should be able to give us a better estimate of how many native frames will fit in various configurations. We can also pass in the initial stack size as a preprocessor variable, though it's very possible to create new threads with different sizes, and I believe we might use a smaller stack for new threads already?

If we have any references to stack variables that are passed between frames (e.g. a pointer to a local in EvalFrame that's available in the next EvalFrame) then we can get the size of each recursion from the current build. Of course, that'll vary based on how it recurses (and how many of our own native APIs it goes through), but it's likely better than guessing. At the very least, it might be helpful for some assertions to detect when the guesses are invalidated.

A true solution would be to use structured exception handling and catch the stack overflow in EvalFrame. That's going to leave things in a pretty messy state though (leaked references at a minimum, and likely worse), so I think it's better to just crash.

@zooba
Copy link
Member

zooba commented Jan 5, 2024

Since Windows 8, we've had GetCurrentThreadStackLimits

We could also just check this against the address of a local in a function and raise if we're within some distance from the extent of the stack. It won't be a predictable number of recursions, and it may still be possible to use native recursion to cause a crash, but it should be fairly simple to implement and isn't really any worse than the alternatives.

@mdboom mdboom changed the title Python stack overflow collecting PGO data on Windows Stack overflow collecting PGO data on Windows Jan 5, 2024
@zooba zooba added release-blocker 3.13 new features, bugs and security fixes labels Jan 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.12 bugs and security fixes 3.13 new features, bugs and security fixes build The build process and cross-build OS-windows performance Performance or resource usage release-blocker type-bug An unexpected behavior, bug, or error
Projects
Development

No branches or pull requests

4 participants