
bpo-40255: Implement Immortal Instances #19474

Draft
wants to merge 83 commits into main

Conversation

Contributor

@eduardo-elizondo commented Apr 11, 2020

Motivation

This PR introduces the ability to immortalize instances in CPython, which bypasses reference counting on them. Tagging objects as immortal allows us to skip certain operations when we know that the object will be around for the entire execution of the runtime.

Note that this by itself will bring a performance regression to the runtime due to the extra reference count checks. However, this opens up future optimization opportunities that might end up making the runtime faster overall.

Implementation Details

The implementation relies on setting a bit in the reference count word. The 3 MSBs are already spoken for: the 1st MSB (the sign bit) can't be used because, due to two's complement, a check like if (Py_REFCNT(o) > 0) would fail on immortal instances, and the 2nd and 3rd MSBs are used for the reduced GC head (GH-7043). Therefore, we use the 4th MSB, kImmortalBit, to mark an immortal instance. This also means that if an object's reference count ever reaches kImmortalBit, it will automatically be immortalized.
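For illustration, here is a minimal sketch of that layout, assuming a 64-bit Py_ssize_t; kImmortalBit follows the PR's naming, while the *_Sketch helpers are illustrative only and not part of the patch:

#include "Python.h"

/* Bits 63-61 (sign bit plus reduced GC head) are reserved, so bit 60,
   the 4th MSB, marks immortality. */
#define kImmortalBit (((Py_ssize_t)1) << (8 * sizeof(Py_ssize_t) - 4))

static inline int _Py_IsImmortal_Sketch(PyObject *op)
{
    return (op->ob_refcnt & kImmortalBit) != 0;
}

static inline void _Py_SetImmortal_Sketch(PyObject *op)
{
    /* Once set, the refcount fast paths can check this bit and skip work. */
    op->ob_refcnt |= kImmortalBit;
}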

Also, immortal instances will not interfere with the reference leak tooling; that is, they won't show up as leaks. This is done by guaranteeing parity between _Py_RefTotal++ and _Py_RefTotal--, performing them only on non-immortal instances.
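A minimal sketch of how that parity can be preserved in a Py_REF_DEBUG build, reusing the bit check sketched above (the function name is illustrative):

static inline void _Py_IncRef_Sketch(PyObject *op)
{
    if ((op->ob_refcnt & kImmortalBit) != 0) {
        return;                 /* immortal: no refcount write, no _Py_RefTotal bump */
    }
#ifdef Py_REF_DEBUG
    _Py_RefTotal++;             /* only mortal objects count toward leak totals */
#endif
    op->ob_refcnt++;
}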

https://bugs.python.org/issue40255

Contributor Author

@eduardo-elizondo commented Apr 11, 2020

This is ready to review; the CI is finally green. I really have no idea why the newly added GC tests are failing on Windows, and unfortunately I don't have a Windows machine to debug this.

@eduardo-elizondo changed the title from [WIP] bpo-40255: Implement Immortal Instances to bpo-40255: Implement Immortal Instances on Apr 11, 2020
Contributor Author

@eduardo-elizondo commented Apr 11, 2020

Looping in @carljm and @DinoV, who have pointed out some of the issues with immortal instances in the permanent generation participating in a GC collection (i.e., dicts). Let me know if you have any other thoughts or ideas on this!

Contributor Author

@eduardo-elizondo commented Apr 11, 2020

Also looping in @vstinner. I finally got around to upstreaming this patch, since you recently wrote about this in your C-API Improvement Docs.

Include/object.h Outdated

static inline int _Py_IsImmortal(PyObject *op)
{
    return (op->ob_refcnt & kImmortalBit) != 0;
}
Member

@pablogsal Apr 14, 2020
Could you move these functions to the private API as well (Include/internal/pycore_gc.h)?

Contributor Author

@eduardo-elizondo Apr 14, 2020

Can do!

Member

@nascheme commented Apr 14, 2020

My first reaction is that this shouldn't become part of the default build, because most Python users will not make use of it and then it becomes pure extra overhead. However, I know that for some people it is a useful feature (e.g. a pre-fork server architecture that exploits copy-on-write OS memory management). I would use it myself, since I write web applications in that style.

Would it be okay to make this a compile time option, disabled by default? I think in general it is a bad idea to have too many of those types of build options. It makes code maintenance and testing more difficult. Some example build variations from the past that caused issues: thread/no-threads, Unicode width, various debug options (@vstinner removed some of those). So, I'm not super excited about introducing a new build option.
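As a rough illustration of the disabled-by-default idea (Py_IMMORTAL_INSTANCES is an assumed macro name here, not necessarily what this PR defines):

/* In the default build the check folds away, so there is no extra branch
   in Py_INCREF/Py_DECREF; with the option enabled it becomes the
   immortal-bit test. */
#ifdef Py_IMMORTAL_INSTANCES
#  define _Py_MAYBE_IMMORTAL(op) ((((PyObject *)(op))->ob_refcnt & kImmortalBit) != 0)
#else
#  define _Py_MAYBE_IMMORTAL(op) 0
#endif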

Is it possible we can leverage this extra status bit on objects to recover the lost performance somehow? A couple of years ago I did a "tagged pointer" experiment that used a similar bit. In that case, small integers became one machine word in size and also became immortal.

Another thought: when you did your testing, were any objects made immortal? I would imagine that, by default, you could make everything immortal after initial interpreter startup. You are paying for an extra test+branch in INCREF and DECREF but for many objects (e.g. None, True, False, types) you avoid dirtying the memory/cache with writes to the reference count.
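A sketch of that idea, purely illustrative (the helper simply sets the kImmortalBit from the sketch above; it is not an API added by this PR):

static void immortalize_startup_singletons(void)
{
    /* These objects live for the whole runtime; once immortal, INCREF/DECREF
       no longer dirty their pages with refcount writes. */
    PyObject *objs[] = { Py_None, Py_True, Py_False, Py_Ellipsis };
    for (size_t i = 0; i < sizeof(objs) / sizeof(objs[0]); i++) {
        objs[i]->ob_refcnt |= kImmortalBit;
    }
}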

Contributor Author

@eduardo-elizondo commented Apr 14, 2020

@nascheme you should definitely join the conversation happening in the bug report of this PR https://bugs.python.org/issue40255

However, I know for some people that it is a useful feature

Exactly, this change might be a feature for CPython power users

Would it be okay to make this a compile time option, disabled by default?

Yeah, that's probably the best option. That's also the consensus in the bug report thread (if the change is approved)

I think in general it is a bad idea to have too many of those types of build options.

Yeah, that's one of the drawbacks. That being said, I can help with setting up the Travis build to integrate this change if needed (cc @vstinner).

Is it possible we can leverage this extra status bit on objects to recover the lost performance somehow?

We can indeed; I think somebody also mentioned that in the bug report. A potentially good place could be main.c:pymain_main. Let me explore that and push the change if it looks like a performance improvement!

In theory we could optimize even further to reduce the perf cost. By leveraging saturated adds and conditional moves we could remove the branching instruction. I haven't explored this further since the current PR was good enough. Personally, I favor the current PR, but this could be changed to:

/* Branch-less incref saturated at PY_SSIZE_T_MAX (x86-64 GCC/Clang inline asm) */
#define _Py_INC_REF(op) ({                                          \
    __asm__ (                                                       \
        "addq $0x1, %[refcnt]\n\t"                                  \
        "cmovoq %[refcnt_max], %[refcnt]"                           \
        : [refcnt] "+r" (((PyObject *)op)->ob_refcnt)               \
        : [refcnt_max] "r" (PY_SSIZE_T_MAX)                         \
        : "cc"                                                      \
    );})

/* Branch-less decref saturated at PY_SSIZE_T_MAX (x86-64 GCC/Clang inline asm) */
#define _Py_DEC_REF(op) ({                                          \
    Py_ssize_t tmp = ((PyObject *)op)->ob_refcnt;                   \
    __asm__ (                                                       \
        "subq $0x1, %[refcnt]\n\t"                                  \
        "addq $0x1, %[tmp]\n\t"                                     \
        "cmovoq %[refcnt_max], %[refcnt]"                           \
        : [refcnt] "+r" (((PyObject *)op)->ob_refcnt), [tmp] "+r" (tmp) \
        : [refcnt_max] "r" (PY_SSIZE_T_MAX)                         \
        : "cc"                                                      \
    );})
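For comparison, a portable branch-less sketch in plain C (no inline asm; whether the compiler actually emits a conditional move is not guaranteed):

static inline void _Py_IncRefSaturated_Sketch(PyObject *op)
{
    Py_ssize_t rc = op->ob_refcnt;
    /* Saturate at PY_SSIZE_T_MAX instead of overflowing. */
    op->ob_refcnt = (rc == PY_SSIZE_T_MAX) ? rc : rc + 1;
}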

Member

@pablogsal commented Apr 14, 2020

Yeah that's one of the drawbacks. That being said, I can help with setting up the travis build to integrate this change if needed (cc @vstinner).

Not only that, we would need specialized buildbots to test the code base with this option activated on a bunch of supported platforms, and that raises the maintenance costs.

Member

@vstinner left a comment

This feature sounds controversial, so I block it until a consensus can be reached.

@bedevere-bot commented Apr 14, 2020

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@ydshieh commented Feb 27, 2022

If we didn't find this solution, or if this solution were not provided here, we would need to have 3000 GB of RAM, which we just couldn't afford as a very early-stage startup.

You can simply re-fork the worker process after a configured number of requests. Or you can store your 150 GB of data in more efficient storage. So "need to have 3000 GB RAM" is an overstatement.

But thank you for your feedback. This issue does not affect only YouTube or Instagram.

I still want to see sample applications in order to measure/estimate the effect.

150 GB * 20 = 3000 GB is too rough an estimation. For example, data like docstrings are never touched except at finalization. Dicts are also a big RAM eater in Python, but a dict's hashtable is separated from its PyObject, so INCREF/DECREF on the dict doesn't cause a huge CoW of the hashtable. Only updating the dict causes CoW, and this pull request doesn't save that.

That's why I requested a sample application many times. I don't want to "optimize without measuring."

Hi @methane, let me provide a few more details. I am not able to describe the full process, because I am obliged to keep some business secrets from my previous role.

  • First, we have 2 kinds of data: one is a dictionary, the other is a trie (for which we use FlashText), a kind of nested dictionary. Both are quite large.

  • Each worker holds 150 GB of data (which we want to share via CoW), but also contains other things, like TensorFlow/PyTorch models. These models had problems (they froze) when shared across processes (say, forked) - at least at that time (I am not sure about newer versions). There are other objects we could not share, just like the TensorFlow models. If we re-fork the process for each request, we also need to re-initialize such objects, and that is just too slow (it depends on the requirements, but for us this was simply not an option).

  • store your 150GB data in more efficient storage: For each client request, our application generates (potentially many) combinations (in a lazy way) and tests each one to see whether it is in a dictionary. Quite often we need to generate > 5000 combinations for a single client request.

    • I am not a database expert, but if we stored the data in a database and had to send > 5000 database queries per client request to check for a match, it seems this would be much slower than testing against data in memory. (If 1 client request only invoked 1 database request, I would store the data in a database!) The database requests also cost $.

    • It is not clear to me whether a database approach would work for the trie structure.

    • Our application data was subject to change quite often (re-generated a few times per day - we were still in the experimentation stage). Writing the huge data set to a database in this case was also difficult (time- and money-consuming).

    • The keys of our dictionaries are usually quite long (> 128 characters). The keys of our trie data are short (about 10 characters), but it is a very deeply nested dictionary (10 ~ 20 levels). I am not able to comment on "its hashtable is separated from its PyObject, so INCREF/DECREF on the dict doesn't cause a huge CoW of the hashtable", but we did face the CoW issue. And once we ported @eduardo-elizondo's solution to Python 3.8.5, the issue was completely solved!

That is all I remember. I am no longer at that company; I could contact them if you want to hear more details. Currently they have to stay on Python 3.8 with @eduardo-elizondo's (modified) solution - support for Python 3.8 ends in 2024. I just want to let the people reviewing this PR know that there are real-world needs for this solution. (Even if, in some cases, there are sophisticated ways to arrange our huge data, it is not always obvious, and it could complicate the whole codebase a lot.)

Member

@methane commented Feb 27, 2022

@ydshieh Thank you for your detailed information.

Although I think building a 100 GB trie on top of Python dicts is a bad idea, it is one kind of Python application.

If we re-fork the process for each request

Of course, re-forking on each request is a bad idea. Common web applications recreate the worker process every 100~1000 requests. (And loading the application after fork is much safer, e.g. the --lazy-apps option of uWSGI.)

I am not able to comment on "its hashtable is separated from its PyObject, so INCREF/DECREF on the dict doesn't cause a huge CoW of the hashtable".

Dicts have two memory blocks. One is the PyObject part, which is 48 bytes. The other is the variable-sized hashtable. When a dict is very small (but not empty), the hashtable is 160 bytes (len(d) <= 5) or 280 bytes (5 < len(d) <= 10). INCREF/DECREF changes only the PyObject part. Assuming every dict in your trie is very small, at most 64/(64+160) = 29% of the memory would be copied into each process.
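For reference, a simplified sketch of those two blocks (abridged from CPython's dictobject.h; sizes assume a 64-bit build):

typedef struct {
    PyObject_HEAD                   /* ob_refcnt + ob_type: 16 bytes; CoW hits here */
    Py_ssize_t ma_used;             /* number of items */
    uint64_t ma_version_tag;
    PyDictKeysObject *ma_keys;      /* pointer to the separately allocated hashtable */
    PyObject **ma_values;           /* split-table values, or NULL */
} PyDictObject;                     /* 48 bytes, preceded by a 16-byte PyGC_Head */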

Again, thank you for your information. Such information is needed to make this kind of estimation.

FWIW, there is another pull request about memory usage.
https://bugs.python.org/issue46845
This reduces the hashtable size to about 3/4 ~ 2/3 of its current size (160 -> 120, 280 -> 200 bytes) when all keys are strings. This will save a lot of RAM regardless of whether pre-fork is used or not.

Member

@methane commented Feb 27, 2022

@eduardo-elizondo Would you merge the main branch into all these pull requests?

I am worried about memory leaks and performance regressions. That's why I suggested making this a configure option on the mailing list.

@vstinner added a leak test recently, so merging the main branch will address my worry. I am OK with merging this without a configure option if there are no leaks.

(I am not an SC member, so this is just my personal request.)

@ydshieh commented Feb 27, 2022

@methane Thank you for the reply. I would like to provide a bit more information and some remarks, plus a few questions (just taking this chance to learn a bit more about the internals), if you have time to answer.

  • I think building 100GB trie on top of Python dict is bad idea --> When I joined the team, the trie structure was still small, and it is the core component of our application (processing) code. Then it grew much larger, while our focus was still on the processing logic rather than on finding a better way to deal with the data. I do agree that another approach is required if they want to scale the system further.

  • Assuming every dict in your trie is very small: this is not our case. Our trie (denoted T below) looks like this:

    • Imagine a vocabulary vocab of 5000 words.
    • Consider a language consisting of sentences (i.e. sequences) built on top of vocab (of course not all possible sequences, but still a large collection).
    • Our trie structure is used to store this language, with a word as the unit.
    • So the top-level keys are the set of possible 1st words of a sentence. Suppose k = dog is the first word; then T[k] is a dictionary (more precisely, still a trie) whose keys are the set of possible 2nd words of sentences that start with *dog*.
    • So the fact is: almost all dictionaries in T are large.

My questions (if you have time to give a CPython newbie some lessons)

  • 64/(64+160): what does 64 represent here, and what does this formula compute?
  • One is the PyObject part. The other is the variable-sized hashtable.: What are the roles of these 2 parts?
  • INCREF/DECREF changes only the PyObject part. What happens when a read op is performed, i.e. when we call a_dict[key], in terms of these 2 parts? In particular, what gets copied?

Final personal thoughts

There might be alternative solutions to avoid this CoW issue, but:

  • An engineer must first be able to identify the cause of their system's memory issue in order to act on it. In our case, we were just 2 ordinary Python developers (5 people in the company, including the CEO), and we didn't know much about CPython. It indeed took us months to identify the actual cause (under the stress of the production server crashing every few days).
  • Even when one knows the cause, changing the fundamental architecture is always difficult. I don't mean that it is not important enough to change, but there are other constraints to consider. If the team is large enough and has more experienced developers (CPython, WSGI, databases, etc.), it might be easier, but that was not our case.

Thank you again!

Member

@methane commented Feb 28, 2022

Assuming every dict in your trie is very small: this is not our case.

If the average dict size is larger, the fraction of memory copied would be smaller than 29%.

  • 64/(64+160): what does 64 represent here, and what does this formula compute?

sizeof(PyDictObject) + sizeof(PyGC_Head), i.e. 48 + 16 = 64 bytes.

  • One is the PyObject part. The other is the variable-sized hashtable.: What are the roles of these 2 parts?

The PyObject part contains the fixed-size data and metadata that all Python objects have. The refcount lives here, so CoW happens here.

The other part contains the variable-sized data, including the pointers to keys and values. This part doesn't hold the Python object's refcount, so this pull request doesn't affect it, and CoW won't happen here until you modify the dict.

  • INCREF/DECREF changes only the PyObject part. What happens when a read op is performed, i.e. when we call a_dict[key], in terms of these 2 parts? In particular, what gets copied?

When getting an item, the value's PyObject* is retrieved from the hashtable, and the refcount of the value (which is not stored in the hashtable) is incremented. In your case, the value is very likely another dict.
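A small sketch of that read path using the public C API (not code from this PR):

/* Look up dict[key]; the lookup only reads the hashtable, but taking
   ownership increments the value's refcount, dirtying the value's
   PyObject header rather than the dict's hashtable. */
PyObject *lookup_and_own(PyObject *dict, PyObject *key)
{
    PyObject *value = PyDict_GetItemWithError(dict, key);  /* borrowed reference */
    if (value == NULL) {
        return NULL;            /* key missing, or an error is set */
    }
    Py_INCREF(value);
    return value;               /* new reference */
}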

Let's stop here. It is getting too off-topic.

Member

@methane commented Feb 28, 2022

@eduardo-elizondo Thank you for updating this.
Would you update #31491 and fix its failing tests too?

@JimJJewett commented Mar 9, 2022
