
bpo-25457: Allow json.encode() to handle mixed keys when sort_keys=True #8011

Status: Open. Wants to merge 9 commits into base: main.

Conversation


@jheiv jheiv commented Jun 29, 2018

Note: This causes 3 regression tests to fail. The first two fail because the ordering of items with sort_keys=True differs from the hard-coded expectation. The third fails because specifying sort_keys=True with mixed keys no longer raises the expected TypeError:

BPO Issue: https://bugs.python.org/issue25457
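For reference, the behavior under discussion can be reproduced on stock (unpatched) CPython, where the encoder sorts the original keys before coercing them to strings:

```python
import json

# On unpatched CPython, sort_keys=True with mixed str/int keys fails,
# because sorting compares the uncoerced keys directly.
try:
    json.dumps({'a': 1, 1: 'a'}, sort_keys=True)
except TypeError as exc:
    print(type(exc).__name__)  # TypeError
```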

python.bat -m test.regrtest test_json -l  -v
...
======================================================================
FAIL: test_encode_truefalse (test.test_json.test_dump.TestCDump)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\james\Source\Repos\cpython\lib\test\test_json\test_dump.py", line 21, in test_encode_truefalse
    '{"false": 1, "2": 3.0, "4.0": 5, "6": true}')
AssertionError: '{"2": 3.0, "4.0": 5, "6": true, "false": 1}' != '{"false": 1, "2": 3.0, "4.0": 5, "6": true}'
- {"2": 3.0, "4.0": 5, "6": true, "false": 1}
+ {"false": 1, "2": 3.0, "4.0": 5, "6": true}


======================================================================
FAIL: test_encode_truefalse (test.test_json.test_dump.TestPyDump)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\james\Source\Repos\cpython\lib\test\test_json\test_dump.py", line 21, in test_encode_truefalse
    '{"false": 1, "2": 3.0, "4.0": 5, "6": true}')
AssertionError: '{"2": 3.0, "4.0": 5, "6": true, "false": 1}' != '{"false": 1, "2": 3.0, "4.0": 5, "6": true}'
- {"2": 3.0, "4.0": 5, "6": true, "false": 1}
+ {"false": 1, "2": 3.0, "4.0": 5, "6": true}


======================================================================
FAIL: test_unsortable_keys (test.test_json.test_speedups.TestEncode)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\james\Source\Repos\cpython\lib\test\test_json\test_speedups.py", line 73, in test_unsortable_keys
    self.json.encoder.JSONEncoder(sort_keys=True).encode({'a': 1, 1: 'a'})
AssertionError: TypeError not raised

----------------------------------------------------------------------
Ran 153 tests in 3.518s

FAILED (failures=3, skipped=1)


abadger added a commit to abadger/ansible that referenced this issue Dec 5, 2018
…keys that aren't strings

This fixes one of the problems reported in ansible#49343

Upstream Python3 bug for the json traceback:
https://bugs.python.org/issue25457

and PR that may fix it:
python/cpython#8011
abadger added a commit to ansible/ansible that referenced this issue Dec 5, 2018
…keys that aren't strings

This fixes one of the problems reported in #49343

Upstream Python3 bug for the json traceback:
https://bugs.python.org/issue25457

and PR that may fix it:
python/cpython#8011
abadger added a commit to abadger/ansible that referenced this issue Dec 19, 2018
…s dictionary keys that aren't strings

This fixes one of the problems reported in ansible#49343

Upstream Python3 bug for the json traceback:
https://bugs.python.org/issue25457

and PR that may fix it:
python/cpython#8011.
(cherry picked from commit c817bef)

Co-authored-by: Toshio Kuratomi <a.badger@gmail.com>
kbreit pushed a commit to kbreit/ansible that referenced this issue Jan 11, 2019
…keys that aren't strings

This fixes one of the problems reported in ansible#49343

Upstream Python3 bug for the json traceback:
https://bugs.python.org/issue25457

and PR that may fix it:
python/cpython#8011
@aeros aeros added the stale label Sep 8, 2019

@aeros aeros commented Sep 8, 2019

An updated PR has been opened to address this issue (#15691). All of the CI tests are passing on that one and it's been a while since this one was last updated, so I'll label this as stale and recommend for it to be closed once the author fixes the title for #15691.

@serhiy-storchaka serhiy-storchaka (Member) left a comment

In principle this looks good to me, but the drawback of this PR is that it makes the common case, where all keys are strings, slower.


@aeros aeros commented Sep 8, 2019

@serhiy-storchaka:

In principle this looks good to me, but the drawback of this PR is that it makes the common case, where all keys are strings, slower.

How significant is the performance loss? I guess the question is whether providing the functionality to properly sort mixed keys is worth slowing down the sorting process in the common case.

Would it be reasonable to give users a kwarg for json.encode that provides this functionality optionally, without slowing down the sorting process for the more common cases? The option would be False by default and could be called mixed_keys or something along those lines.
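For illustration, here is a rough pure-Python sketch (a hypothetical helper, not the PR's actual code) of sorting by the coerced JSON key form, which produces the ordering the updated tests expect:

```python
def mixed_sort_items(d):
    """Sort dict items by the JSON string form of each key, so mixed
    str/int/float/bool/None keys become mutually comparable.
    Sketch only; the real patch does this inside the encoders."""
    def as_json_key(k):
        # bool must be checked before int, since True/False are ints too
        if k is True:
            return 'true'
        if k is False:
            return 'false'
        if k is None:
            return 'null'
        if isinstance(k, (int, float)):
            return str(k)  # simplified; the json module uses repr() for floats
        return k           # already a string
    return sorted(d.items(), key=lambda kv: as_json_key(kv[0]))

print([k for k, _ in mixed_sort_items({False: 1, 2: 3.0, 4.0: 5, 6: True})])
# [2, 4.0, 6, False]
```

This yields the key order "2", "4.0", "6", "false" seen in the updated test_encode_truefalse expectation quoted in the PR description.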


@aeros aeros commented Sep 8, 2019

Removing the stale label since #15691 was closed. This also seems to be more of an enhancement than a bugfix, but I think we should look into updating this PR and providing users an option to accurately sort mixed keys.

@aeros aeros added type-feature and removed stale labels Sep 8, 2019

@yoann9344 yoann9344 commented Sep 9, 2019

I'm the author of #15691. My modification of the source seemed to work for my special case, but my code didn't call json.dumps directly; it went through Flask's jsonify, which in turn goes through itsdangerous. Let's trace the chain upstream.

The dict definition:

BuildingInfo_Names = {
    'Wood': 1,
    'Clay': 2,
    'Iron': 3,
    'Crop': 4,
    ...}

Then reversed keys are added for code simplification, in a module's __init__ (maybe not the best way ^^"):

for name, b_type in list(BuildingInfo_Names.items()): 
    BuildingInfo_Names[b_type] = name

The line of my code causing the issue: return jsonify(BuildingInfo_Names)

jsonify from Flask (Pallets):

from itsdangerous import json as _json
...
def dumps(obj, app=None, **kwargs):
    _dump_arg_defaults(kwargs, app=app)
    encoding = kwargs.pop("encoding", None)
    rv = _json.dumps(obj, **kwargs)
    if encoding is not None and isinstance(rv, text_type):
        rv = rv.encode(encoding)
    return rv
...
def _dump_arg_defaults(kwargs, app=None):
    """Inject default arguments for dump functions."""
    if app is None:
        app = current_app

    if app:
        bp = app.blueprints.get(request.blueprint) if request else None
        kwargs.setdefault(
            "cls", bp.json_encoder if bp and bp.json_encoder else app.json_encoder
        )

        if not app.config["JSON_AS_ASCII"]:
            kwargs.setdefault("ensure_ascii", False)

        kwargs.setdefault("sort_keys", app.config["JSON_SORT_KEYS"])
    else:
        kwargs.setdefault("sort_keys", True)
        kwargs.setdefault("cls", JSONEncoder)
...
def jsonify(*args, **kwargs):
    indent = None
    separators = (",", ":")

    if current_app.config["JSONIFY_PRETTYPRINT_REGULAR"] or current_app.debug:
        indent = 2
        separators = (", ", ": ")

    if args and kwargs:
        raise TypeError("jsonify() behavior undefined when passed both args and kwargs")
    elif len(args) == 1:  # single args are passed directly to dumps()
        data = args[0]
    else:
        data = args or kwargs

    return current_app.response_class(
        dumps(data, indent=indent, separators=separators) + "\n",
        mimetype=current_app.config["JSONIFY_MIMETYPE"],
    )

itsdangerous (Pallets), in its __init__: from ._json import json

try:
    import simplejson as json
except ImportError:
    import json


class _CompactJSON(object):
    """Wrapper around json module that strips whitespace."""

    @staticmethod
    def loads(payload):
        return json.loads(payload)

    @staticmethod
    def dumps(obj, **kwargs):
        kwargs.setdefault("ensure_ascii", False)
        kwargs.setdefault("separators", (",", ":"))
        return json.dumps(obj, **kwargs)

I hope the water wasn't too cold and that no one drowned!

NB: I applied the modification I submitted and it worked, but in the end I opted to set app.config["JSON_SORT_KEYS"] to True for compatibility with the default behavior.
PS: note how a long chain of dependencies can be painful ^^
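Putting the chain above together, a minimal standalone reproduction (no Flask required), assuming sort_keys ends up True as in jsonify's defaults:

```python
import json

BuildingInfo_Names = {'Wood': 1, 'Clay': 2, 'Iron': 3, 'Crop': 4}

# Adding the reverse lookups gives the dict both str and int keys.
for name, b_type in list(BuildingInfo_Names.items()):
    BuildingInfo_Names[b_type] = name

try:
    # jsonify ultimately calls json.dumps with sort_keys=True by default,
    # which fails on the mixed keys.
    json.dumps(BuildingInfo_Names, sort_keys=True)
except TypeError as exc:
    print(exc)
```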


@csabella csabella commented Feb 4, 2020

@jheiv, please resolve the merge conflict and respond to the code reviews. Thank you!


@jheiv jheiv commented Mar 24, 2020

@csabella Sorry for the delay -- merge conflicts are now resolved and the failing tests have been updated to account for mixed-type keys.

Re @serhiy-storchaka's concern about slowing down the common case: I agree, and I will follow up with some profiling tomorrow/this week so you can get an idea of the cost.


@jheiv jheiv commented Mar 24, 2020

Here are the results of some benchmarks:

   object_cache         imp       sort_keys       pre -> post perf_counter
---------------------------------------------------------------------------------------
   100k-1000-0.1.pkl    c         True      138696600 ->  135319600    2.435 % faster
   100k-1000-0.1.pkl    c         False     136588000 ->  135399100    0.870 % faster
   100k-1000-0.5.pkl    c         True      137782000 ->  135307000    1.796 % faster
   100k-1000-0.5.pkl    c         False     136207500 ->  134547300    1.219 % faster
   100k-1000-0.8.pkl    c         True      137624500 ->  135440500    1.587 % faster
   100k-1000-0.8.pkl    c         False     136098100 ->  134688400    1.036 % faster
    10k-0010-0.1.pkl    c         True       13736300 ->   13545600    1.388 % faster
    10k-0010-0.1.pkl    c         False      13614400 ->   13429000    1.362 % faster
    10k-0010-0.5.pkl    c         True       13804300 ->   13574100    1.668 % faster
    10k-0010-0.5.pkl    c         False      13738200 ->   13726700    0.084 % faster
    10k-0010-0.8.pkl    c         True       13764000 ->   13537200    1.648 % faster
    10k-0010-0.8.pkl    c         False      13644500 ->   13458200    1.365 % faster
    10k-0100-0.1.pkl    c         True       13841400 ->   13456100    2.784 % faster
    10k-0100-0.1.pkl    c         False      13647000 ->   13772900    0.923 % slower
    10k-0100-0.5.pkl    c         True       13793600 ->   13421800    2.695 % faster
    10k-0100-0.5.pkl    c         False      13605500 ->   13498100    0.789 % faster
    10k-0100-0.8.pkl    c         True       13716400 ->   13459500    1.873 % faster
    10k-0100-0.8.pkl    c         False      13579900 ->   13539500    0.297 % faster
    10k-1000-0.1.pkl    c         True       13748200 ->   13480800    1.945 % faster
    10k-1000-0.1.pkl    c         False      13622000 ->   13449200    1.269 % faster
    10k-1000-0.5.pkl    c         True       13771900 ->   13452800    2.317 % faster
    10k-1000-0.5.pkl    c         False      13598600 ->   13649500    0.374 % slower
    10k-1000-0.8.pkl    c         True       13744100 ->   13553500    1.387 % faster
    10k-1000-0.8.pkl    c         False      13646000 ->   13431300    1.573 % faster
   100k-1000-0.1.pkl    python    True      396553200 ->  398647100    0.528 % slower
   100k-1000-0.1.pkl    python    False     396115400 ->  400246100    1.043 % slower
   100k-1000-0.5.pkl    python    True      397332200 ->  399220700    0.475 % slower
   100k-1000-0.5.pkl    python    False     398192100 ->  398421100    0.058 % slower
   100k-1000-0.8.pkl    python    True      399345500 ->  399795800    0.113 % slower
   100k-1000-0.8.pkl    python    False     395161600 ->  398907400    0.948 % slower
    10k-0010-0.1.pkl    python    True       38967100 ->   39088400    0.311 % slower
    10k-0010-0.1.pkl    python    False      39113900 ->   39128600    0.038 % slower
    10k-0010-0.5.pkl    python    True       38796600 ->   39046200    0.643 % slower
    10k-0010-0.5.pkl    python    False      39129000 ->   38929200    0.511 % faster
    10k-0010-0.8.pkl    python    True       39188900 ->   39143000    0.117 % faster
    10k-0010-0.8.pkl    python    False      38901400 ->   39334300    1.113 % slower
    10k-0100-0.1.pkl    python    True       38851800 ->   39085200    0.601 % slower
    10k-0100-0.1.pkl    python    False      39042700 ->   38743900    0.765 % faster
    10k-0100-0.5.pkl    python    True       39044100 ->   38872200    0.440 % faster
    10k-0100-0.5.pkl    python    False      39196300 ->   39073100    0.314 % faster
    10k-0100-0.8.pkl    python    True       38734600 ->   38963900    0.592 % slower
    10k-0100-0.8.pkl    python    False      38613000 ->   39179800    1.468 % slower
    10k-1000-0.1.pkl    python    True       38911700 ->   39270300    0.922 % slower
    10k-1000-0.1.pkl    python    False      39057800 ->   39241600    0.471 % slower
    10k-1000-0.5.pkl    python    True       39380600 ->   39411700    0.079 % slower
    10k-1000-0.5.pkl    python    False      38666500 ->   39381500    1.849 % slower
    10k-1000-0.8.pkl    python    True       39263000 ->   38900400    0.924 % faster
    10k-1000-0.8.pkl    python    False      39197900 ->   38836400    0.922 % faster
  • object_cache - a pickled list of randomly generated dictionaries. The three hyphen-separated parts of the filename are (# of records, # of items per record, fraction of keys that are non-strings).
  • imp - JSON Encoder implementation -- either cjson or pyjson as defined in https://github.com/python/cpython/blob/master/Lib/test/test_json/__init__.py#L9-L10
  • sort_keys - the sort_keys value passed to the encoder constructor.
  • pre -> post perf_counter - I used time.perf_counter_ns to time encoding of the unpickled list:
    def time_dumps(self, module, sort_keys, runs=5, runs_per_run=10, now=time.perf_counter_ns):
        encoder = module.encoder.JSONEncoder(sort_keys=sort_keys)
        encode  = encoder.encode

        times = []
        for _ in range(runs):
            t0 = now()
            for _ in range(runs_per_run):
                for obj in self.objects:
                    _ = encode(obj)
            td = now() - t0
            times.append(td)
        return min(times)  # keep the minimum of the run deltas
For each pickle file I kept the minimum of the run deltas (though the mean showed similar results).
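For context, a hypothetical sketch of how such a cache of random mixed-key dicts could be generated (names, key shapes, and distributions are assumptions, not taken from the actual benchmark):

```python
import random
import string

def make_records(n_records, n_items, frac_nonstring, seed=0):
    """Build n_records dicts of up to n_items entries each, where a key is
    a non-string (int or float) with probability frac_nonstring."""
    rng = random.Random(seed)

    def key():
        if rng.random() < frac_nonstring:
            # non-string key: a random int or a random float
            return rng.choice([rng.randrange(10**6), rng.random()])
        return ''.join(rng.choices(string.ascii_lowercase, k=8))

    return [{key(): rng.randrange(100) for _ in range(n_items)}
            for _ in range(n_records)]
```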

I ran the test (loading the same pickled lists of dictionaries) for both the upstream master (without patch) and remote master (my fork, with patch).

The results are... confusing -- the C implementation seems slightly faster with the changes and the Python implementation slightly slower, but all differences are within roughly +/- 2%.

If there's another method of benchmarking that's more appropriate here, I'd be happy to run them -- please let me know.

Review comment context (from the C encoder changes):

    goto bail;
    }

    coerced_items = PyList_New(0);
@aeros aeros (Member) commented Mar 25, 2020


This seems to introduce a few possible refleaks for coerced_items, effectively in each goto bail location where coerced_items isn't NULL, prior to the Py_DECREF(coerced_items). IMO, the easiest solution would be adding a Py_XDECREF(coerced_items) to bail since there are multiple paths where this could potentially occur.

There should probably also be the following error condition, in case the list can't be instantiated:

Suggested change:

    - coerced_items = PyList_New(0);
    + coerced_items = PyList_New(0);
    + if (coerced_items == NULL) {
    +     goto bail;
    + }

What do you think @serhiy-storchaka? My experience with the C-API is fairly limited, so I'm not entirely certain about the above.


@rhettinger rhettinger commented May 10, 2022

-1 from me.

IMO sorting mixed keys is an anti-pattern that shouldn't be encouraged. This is doubly true on the encoding side, when the user has complete control over the data going in.

For the most part, it is a feature that sorting fails here. It is an indication that something is seriously wrong with the data modeling.
