
bpo-40255: Implement Immortal Instances #19474

Draft
wants to merge 83 commits into main

Conversation

Contributor

@eduardo-elizondo commented Apr 11, 2020

Motivation

This PR introduces the ability to immortalize instances in CPython, which bypasses reference counting on them. Tagging objects as immortal allows us to skip certain operations when we know that the object will be around for the entire execution of the runtime.

Note that this by itself will bring a performance regression to the runtime due to the extra reference count checks. However, this opens up future optimization opportunities that might end up making the runtime faster overall.

Implementation Details

The implementation relies on setting a bit in the reference count word. The 3 MSBs are already spoken for: the 1st MSB (the sign bit) can't be used because, due to two's complement, a check like if (Py_REFCNT(o) > 0) would fail on immortal instances, and the 2nd and 3rd MSBs are used for the reduced GC head (GH-7043). Therefore, we use the 4th MSB, kImmortalBit, to mark an immortal instance. This also means that if an object's reference count ever reaches kImmortalBit, it will automatically be immortalized.
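For illustration, here is a minimal sketch of that layout, assuming a 64-bit Py_ssize_t; kImmortalBit follows the PR's naming, while the *_Sketch helpers are illustrative only and not part of the patch:

#include "Python.h"

/* Bits 63-61 (sign bit plus reduced GC head) are reserved, so bit 60,
   the 4th MSB, marks immortality. */
#define kImmortalBit (((Py_ssize_t)1) << (8 * sizeof(Py_ssize_t) - 4))

static inline int _Py_IsImmortal_Sketch(PyObject *op)
{
    return (op->ob_refcnt & kImmortalBit) != 0;
}

static inline void _Py_SetImmortal_Sketch(PyObject *op)
{
    /* Once set, the refcount fast paths can check this bit and skip work. */
    op->ob_refcnt |= kImmortalBit;
}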

Also, immortal instances will not interfere with the reference leak tooling; that is, they won't show up as leaks. This is done by guaranteeing parity between _Py_RefTotal++ and _Py_RefTotal--, performing them only on non-immortal instances.
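A minimal sketch of how that parity can be preserved in a Py_REF_DEBUG build, reusing the bit check sketched above (the function name is illustrative):

static inline void _Py_IncRef_Sketch(PyObject *op)
{
    if ((op->ob_refcnt & kImmortalBit) != 0) {
        return;                 /* immortal: no refcount write, no _Py_RefTotal bump */
    }
#ifdef Py_REF_DEBUG
    _Py_RefTotal++;             /* only mortal objects count toward leak totals */
#endif
    op->ob_refcnt++;
}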

https://bugs.python.org/issue40255

Contributor Author

@eduardo-elizondo commented Apr 11, 2020

This is ready to review; the CI is finally green. I really have no idea why the newly added GC tests are failing on Windows, and unfortunately I don't have a Windows machine to debug this.

@eduardo-elizondo changed the title from [WIP] bpo-40255: Implement Immortal Instances to bpo-40255: Implement Immortal Instances on Apr 11, 2020
Contributor Author

@eduardo-elizondo commented Apr 11, 2020

Looping in @carljm and @DinoV, who have pointed out some of the issues with immortal instances in the permanent generation participating in a GC collection (i.e., dicts). Let me know if you have any other thoughts or ideas on this!

Contributor Author

@eduardo-elizondo commented Apr 11, 2020

Also looping in @vstinner. I finally got around to upstreaming this patch, since you recently wrote about this in your C-API Improvement Docs.

Include/object.h Outdated

static inline int _Py_IsImmortal(PyObject *op)
{
    return (op->ob_refcnt & kImmortalBit) != 0;
}
Member

@pablogsal Apr 14, 2020
Could you move these functions to the private API as well (Include/internal/pycore_gc.h)?

Contributor Author

@eduardo-elizondo Apr 14, 2020

Can do!

Member

@nascheme commented Apr 14, 2020

My first reaction is that this shouldn't become part of the default build, because most Python users will not make use of it and then it becomes pure extra overhead. However, I know that for some people it is a useful feature (e.g. a pre-fork server architecture that exploits copy-on-write OS memory management). I would use it myself, since I write web applications in that style.

Would it be okay to make this a compile time option, disabled by default? I think in general it is a bad idea to have too many of those types of build options. It makes code maintenance and testing more difficult. Some example build variations from the past that caused issues: thread/no-threads, Unicode width, various debug options (@vstinner removed some of those). So, I'm not super excited about introducing a new build option.
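As a rough illustration of the disabled-by-default idea (Py_IMMORTAL_INSTANCES is an assumed macro name here, not necessarily what this PR defines):

/* In the default build the check folds away, so there is no extra branch
   in Py_INCREF/Py_DECREF; with the option enabled it becomes the
   immortal-bit test. */
#ifdef Py_IMMORTAL_INSTANCES
#  define _Py_MAYBE_IMMORTAL(op) ((((PyObject *)(op))->ob_refcnt & kImmortalBit) != 0)
#else
#  define _Py_MAYBE_IMMORTAL(op) 0
#endif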

Is it possible we can leverage this extra status bit on objects to recover the lost performance somehow? A couple of years ago I did a "tagged pointer" experiment that used a similar bit. In that case, small integers became one machine word in size and also became immortal.

Another thought: when you did your testing, were any objects made immortal? I would imagine that, by default, you could make everything immortal after initial interpreter startup. You are paying for an extra test+branch in INCREF and DECREF but for many objects (e.g. None, True, False, types) you avoid dirtying the memory/cache with writes to the reference count.
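A sketch of that idea, purely illustrative (the helper simply sets the kImmortalBit from the sketch above; it is not an API added by this PR):

static void immortalize_startup_singletons(void)
{
    /* These objects live for the whole runtime; once immortal, INCREF/DECREF
       no longer dirty their pages with refcount writes. */
    PyObject *objs[] = { Py_None, Py_True, Py_False, Py_Ellipsis };
    for (size_t i = 0; i < sizeof(objs) / sizeof(objs[0]); i++) {
        objs[i]->ob_refcnt |= kImmortalBit;
    }
}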

Contributor Author

@eduardo-elizondo commented Apr 14, 2020

@nascheme you should definitely join the conversation happening in the bug report of this PR https://bugs.python.org/issue40255

However, I know for some people that it is a useful feature

Exactly, this change might be a feature for CPython power users

Would it be okay to make this a compile time option, disabled by default?

Yeah, that's probably the best option. That's also the consensus in the bug report thread (if the change is approved)

I think in general it is a bad idea to have too many of those types of build options.

Yeah, that's one of the drawbacks. That being said, I can help with setting up the Travis build to integrate this change if needed (cc @vstinner).

Is it possible we can leverage this extra status bit on objects to recover the lost performance somehow?

We can indeed; I think somebody also mentioned that in the bug report. A potentially good place could be main.c:pymain_main. Let me explore that and push the change if it looks like a performance improvement!

In theory we could optimize even further to reduce the perf cost. By leveraging saturated adds and conditional moves we could remove the branching instruction. I haven't explored this further since the current PR was good enough. Personally, I favor the current PR, but this could be changed to:

/* Branch-less incref saturated at PY_SSIZE_T_MAX (x86-64 GCC/Clang inline asm) */
#define _Py_INC_REF(op) ({                                          \
    __asm__ (                                                       \
        "addq $0x1, %[refcnt]\n\t"                                  \
        "cmovoq %[refcnt_max], %[refcnt]"                           \
        : [refcnt] "+r" (((PyObject *)op)->ob_refcnt)               \
        : [refcnt_max] "r" (PY_SSIZE_T_MAX)                         \
        : "cc"                                                      \
    );})

/* Branch-less decref saturated at PY_SSIZE_T_MAX (x86-64 GCC/Clang inline asm) */
#define _Py_DEC_REF(op) ({                                          \
    Py_ssize_t tmp = ((PyObject *)op)->ob_refcnt;                   \
    __asm__ (                                                       \
        "subq $0x1, %[refcnt]\n\t"                                  \
        "addq $0x1, %[tmp]\n\t"                                     \
        "cmovoq %[refcnt_max], %[refcnt]"                           \
        : [refcnt] "+r" (((PyObject *)op)->ob_refcnt), [tmp] "+r" (tmp) \
        : [refcnt_max] "r" (PY_SSIZE_T_MAX)                         \
        : "cc"                                                      \
    );})
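For comparison, a portable branch-less sketch in plain C (no inline asm; whether the compiler actually emits a conditional move is not guaranteed):

static inline void _Py_IncRefSaturated_Sketch(PyObject *op)
{
    Py_ssize_t rc = op->ob_refcnt;
    /* Saturate at PY_SSIZE_T_MAX instead of overflowing. */
    op->ob_refcnt = (rc == PY_SSIZE_T_MAX) ? rc : rc + 1;
}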

Member

@pablogsal commented Apr 14, 2020

Yeah that's one of the drawbacks. That being said, I can help with setting up the travis build to integrate this change if needed (cc @vstinner).

Not only that, we would need specialized buildbots to test the code base with this option activated on a bunch of supported platforms, and that raises the maintenance costs.

Member

@vstinner left a comment

This feature sounds controversial, so I block it until a consensus can be reached.

@bedevere-bot commented Apr 14, 2020

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@ydshieh commented Feb 27, 2022

If we didn't find this solution, or if this solution were not provided here, we would need to have 3000 GB of RAM, which we just couldn't afford as a very early-stage startup.

You can simply re-fork the worker process after a configured number of requests. Or you can store your 150 GB of data in more efficient storage. So "need to have 3000 GB RAM" is an overstatement.

But thank you for your feedback. This issue does not affect only YouTube or Instagram.

I still want to see sample applications in order to measure/estimate the effect.

150 GB * 20 = 3000 GB is too rough an estimation. For example, data like docstrings are never touched except at finalization. Dicts are also a big RAM eater in Python, but a dict's hashtable is separated from its PyObject, so INCREF/DECREF on the dict doesn't cause a huge CoW of the hashtable. Only updating the dict causes CoW, and this pull request doesn't save that.

That's why I requested a sample application many times. I don't want to "optimize without measuring."

Hi @methane, let me provide a few more details. I am not able to describe the full process, because I am obliged to keep some business secrets from my previous role.

  • First, we have 2 kinds of data: one is a dictionary, the other is a trie (for which we use FlashText), a kind of nested dictionary. Both are quite large.

  • Each worker holds 150 GB of data (which we want to share via CoW), but also contains other things, like TensorFlow/PyTorch models. These models had problems (they froze) when shared across processes (say, forked) - at least at that time (I am not sure about newer versions). There are other objects we could not share, just like the TensorFlow models. If we re-fork the process for each request, we also need to re-initialize such objects, and that is just too slow (it depends on the requirements, but for us this was simply not an option).

  • store your 150GB data in more efficient storage: For each client request, our application generates (potentially many) combinations (in a lazy way) and tests each one to see whether it is in a dictionary. Quite often we need to generate > 5000 combinations for a single client request.

    • I am not a database expert, but if we stored the data in a database and had to send > 5000 database queries per client request to check for a match, it seems this would be much slower than testing against data in memory. (If 1 client request only invoked 1 database request, I would store the data in a database!) The database requests also cost $.

    • It is not clear to me whether a database approach would work for the trie structure.

    • Our application data was subject to change quite often (re-generated a few times per day - we were still in the experimentation stage). Writing the huge data set to a database in this case was also difficult (time- and money-consuming).

    • The keys of our dictionaries are usually quite long (> 128 characters). The keys of our trie data are short (about 10 characters), but it is a very deeply nested dictionary (10 ~ 20 levels). I am not able to comment on "its hashtable is separated from its PyObject, so INCREF/DECREF on the dict doesn't cause a huge CoW of the hashtable", but we did face the CoW issue. And once we ported @eduardo-elizondo's solution to Python 3.8.5, the issue was completely solved!

That is all I remember. I am no longer at that company; I could contact them if you want to hear more details. Currently they have to stay on Python 3.8 with @eduardo-elizondo's (modified) solution - support for Python 3.8 ends in 2024. I just want to let the people reviewing this PR know that there are real-world needs for this solution. (Even if, in some cases, there are sophisticated ways to arrange our huge data, it is not always obvious, and it could complicate the whole codebase a lot.)

Member

@methane commented Feb 27, 2022

@ydshieh Thank you for your detailed information.

Although I think building a 100 GB trie on top of Python dicts is a bad idea, it is one kind of Python application.

If we re-fork the process for each request

Of course, re-forking on each request is a bad idea. Common web applications recreate the worker process every 100~1000 requests. (And loading the application after fork is much safer, e.g. the --lazy-apps option of uWSGI.)

I am not able to comment on "its hashtable is separated from its PyObject, so INCREF/DECREF on the dict doesn't cause a huge CoW of the hashtable".

Dicts have two memory blocks. One is the PyObject part, which is 48 bytes. The other is the variable-sized hashtable. When a dict is very small (but not empty), the hashtable is 160 bytes (len(d) <= 5) or 280 bytes (5 < len(d) <= 10). INCREF/DECREF changes only the PyObject part. Assuming every dict in your trie is very small, at most 64/(64+160) = 29% of the memory would be copied into each process.
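For reference, a simplified sketch of those two blocks (abridged from CPython's dictobject.h; sizes assume a 64-bit build):

typedef struct {
    PyObject_HEAD                   /* ob_refcnt + ob_type: 16 bytes; CoW hits here */
    Py_ssize_t ma_used;             /* number of items */
    uint64_t ma_version_tag;
    PyDictKeysObject *ma_keys;      /* pointer to the separately allocated hashtable */
    PyObject **ma_values;           /* split-table values, or NULL */
} PyDictObject;                     /* 48 bytes, preceded by a 16-byte PyGC_Head */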

Again, thank you for your information. Such information is needed to make this kind of estimation.

FWIW, there is another pull request about memory usage.
https://bugs.python.org/issue46845
This reduces the hashtable size to about 3/4 ~ 2/3 of its current size (160 -> 120, 280 -> 200 bytes) when all keys are strings. This will save a lot of RAM regardless of whether pre-fork is used or not.

Member

@methane commented Feb 27, 2022

@eduardo-elizondo Would you merge the main branch into all these pull requests?

I am worried about memory leaks and performance regressions. That's why I suggested making this a configure option on the mailing list.

@vstinner added a leak test recently, so merging the main branch will address my worry. I am OK with merging this without a configure option if there are no leaks.

(I am not an SC member, so this is just my personal request.)

@ydshieh commented Feb 27, 2022

@methane Thank you for the reply. I would like to provide a bit more information and some remarks, plus a few questions (just taking this chance to learn a bit more about the internals), if you have time to answer.

  • I think building 100GB trie on top of Python dict is bad idea --> When I joined the team, the trie structure was still small, and it is the core component of our application (processing) code. Then it grew much larger, while our focus was still on the processing logic rather than on finding a better way to deal with the data. I do agree that another approach is required if they want to scale the system further.

  • Assuming every dict in your trie is very small: this is not our case. Our trie (denoted T below) looks like this:

    • Imagine a vocabulary vocab of 5000 words.
    • Consider a language consisting of sentences (i.e. sequences) built on top of vocab (of course not all possible sequences, but still a large collection).
    • Our trie structure is used to store this language, with a word as the unit.
    • So the top-level keys are the set of possible 1st words of a sentence. Suppose k = dog is the first word; then T[k] is a dictionary (more precisely, still a trie) whose keys are the set of possible 2nd words of sentences that start with *dog*.
    • So the fact is: almost all dictionaries in T are large.

My questions (if you have time to give a CPython newbie some lessons)

  • 64/(64+160): what does 64 represent here, and what does this formula compute?
  • One is the PyObject part. The other is the variable-sized hashtable.: What are the roles of these 2 parts?
  • INCREF/DECREF changes only the PyObject part. What happens when a read op is performed, i.e. when we call a_dict[key], in terms of these 2 parts? In particular, what gets copied?

Final personal thoughts

There might be alternative solutions to avoid this CoW issue, but:

  • An engineer must first be able to identify the cause of their system's memory issue in order to act on it. In our case, we were just 2 ordinary Python developers (5 people in the company, including the CEO), and we didn't know much about CPython. It indeed took us months to identify the actual cause (under the stress of the production server crashing every few days).
  • Even when one knows the cause, changing the fundamental architecture is always difficult. I don't mean that it is not important enough to change, but there are other constraints to consider. If the team is large enough and has more experienced developers (CPython, WSGI, databases, etc.), it might be easier, but that was not our case.

Thank you again!

Member

@methane commented Feb 28, 2022

Assuming every dict in your trie is very small: this is not our case.

If the average dict size is larger, the fraction of memory copied would be smaller than 29%.

  • 64/(64+160): what does 64 represent here, and what does this formula compute?

sizeof(PyDictObject) + sizeof(PyGC_Head), i.e. 48 + 16 = 64 bytes.

  • One is the PyObject part. The other is the variable-sized hashtable.: What are the roles of these 2 parts?

The PyObject part contains the fixed-size data and metadata that all Python objects have. The refcount lives here, so CoW happens here.

The other part contains the variable-sized data, including the pointers to keys and values. This part doesn't hold the Python object's refcount, so this pull request doesn't affect it, and CoW won't happen here until you modify the dict.

  • INCREF/DECREF changes only the PyObject part. What happens when a read op is performed, i.e. when we call a_dict[key], in terms of these 2 parts? In particular, what gets copied?

When getting an item, the value's PyObject* is retrieved from the hashtable, and the refcount of the value (which is not stored in the hashtable) is incremented. In your case, the value is very likely another dict.
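A small sketch of that read path using the public C API (not code from this PR):

/* Look up dict[key]; the lookup only reads the hashtable, but taking
   ownership increments the value's refcount, dirtying the value's
   PyObject header rather than the dict's hashtable. */
PyObject *lookup_and_own(PyObject *dict, PyObject *key)
{
    PyObject *value = PyDict_GetItemWithError(dict, key);  /* borrowed reference */
    if (value == NULL) {
        return NULL;            /* key missing, or an error is set */
    }
    Py_INCREF(value);
    return value;               /* new reference */
}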

Let's stop here. It is getting too off-topic.

Member

@methane commented Feb 28, 2022

@eduardo-elizondo Thank you for updating this.
Would you update #31491 and fix its failing tests too?

@JimJJewett commented Mar 9, 2022
