Include column offsets for bytecode instructions #88116

pablogsal · 2021-04-27T10:58:38Z

BPO	43950
Nosy	@terryjreedy, @nedbat, @aroberge, @markshannon, @serhiy-storchaka, @ammaraskar, @pablogsal, @miss-islington, @brandtbucher, @isidentical, @sweeneyde, @sobolevn
PRs	#26955 #26958 #26997 #27011 #27015 #27023 #27023 #27023 #27037 #27047 #27117 #27126 #27217 #27313 #27336 #27337 #27339 #28142 #28150

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2021-04-27.10:58:38.225>
labels = []
title = 'Include column offsets for bytecode instructions'
updated_at = <Date 2022-01-18.12:31:06.245>
user = 'https://github.com/pablogsal'

bugs.python.org fields:

activity = <Date 2022-01-18.12:31:06.245>
actor = 'sobolevn'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = []
creation = <Date 2021-04-27.10:58:38.225>
creator = 'pablogsal'
dependencies = []
files = []
hgrepos = []
issue_num = 43950
keywords = ['patch']
message_count = 39.0
messages = ['392054', '392062', '392081', '392082', '392520', '392540', '392541', '392555', '392557', '396866', '396874', '396950', '396957', '396962', '397109', '397352', '397368', '397589', '397666', '397667', '397669', '397673', '397829', '397837', '397844', '397882', '397909', '397910', '397911', '397914', '397915', '398089', '398144', '398169', '398170', '398172', '398197', '400998', '410856']
nosy_count = 12.0
nosy_names = ['terry.reedy', 'nedbat', 'aroberge', 'Mark.Shannon', 'serhiy.storchaka', 'ammar2', 'pablogsal', 'miss-islington', 'brandtbucher', 'BTaskaya', 'Dennis Sweeney', 'sobolevn']
pr_nums = ['26955', '26958', '26997', '27011', '27015', '27023', '27023', '27023', '27037', '27047', '27117', '27126', '27217', '27313', '27336', '27337', '27339', '28142', '28150']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = None
url = 'https://bugs.python.org/issue43950'
versions = []

pablogsal · 2021-04-27T10:58:38Z

If we could include column offsets from the AST nodes for every bytecode instructions we could leverage these to offer much better tracebacks and a lot more information to debuggers and similar tools. For instance, in an expression such as:

z['aaaaaa']['bbbbbb']['cccccc']['dddddd].sddfsdf.sdfsdf

where one of these elements is None, we could tell exactly what of these pieces is the actual None and we could do some cool highlighting on tracebacks.

Similarly, coverage tools and debuggers could also make this distinction and therefore could offer reports with more granularity.

The cost is not 0: it would be two integers per bytecode instruction, but I think it may be worth the effort.

pablogsal · 2021-04-27T12:33:32Z

I'm going to prepare a PEP since the discussion regarding if the two integers per bytecode are worth enough is going to be eternal.

markshannon · 2021-04-27T14:40:46Z

The additional cost will not only be the line number table, but we need to store the line for exceptions that are reraised after cleanup.
Adding a column will mean more stack consumption.

pablogsal · 2021-04-27T14:44:13Z

Yup, but I still think is worth the cost, giving that debugging improvements are usually extremely popular among users.

terryjreedy · 2021-04-30T22:28:13Z

Specific examples of current messages and proposed improvements would help focus discussion.

If you are willing to only handle code lines up to 256 chars, only 2 bytes should be needed. (0,0) or (255,255) could mean 'somewhere beyond the 256th char'.

pablogsal · 2021-04-30T23:25:22Z

Specific examples of current messages and proposed improvements would help focus discussion.

Yeah, I am proposing going from:

>>> x['aaaaaa']['bbbbbb']['cccccc']['dddddd'].sddfsdf.sdfsdf
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'NoneType' object is not subscriptable

to

>>> x['aaaaaa']['bbbbbb']['cccccc']['dddddd'].sddfsdf.sdfsdf
                         ^^^^^^^^^^
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'NoneType' object is not subscriptable

pablogsal · 2021-04-30T23:26:53Z

Basically, to highlight in all exceptions the range in the displayed line where the error ocurred. For instance:

>>> foo(a, b/z+2, c, 132432 /x, d /y)
                     ^^^^^^^^^
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: division by zero

terryjreedy · 2021-05-01T00:51:48Z

Marking what expression evaluated to None would be extremely helpful. So would marking the 0 denominator when there is more than one candidate: "e = a/b + c/d". It should be easy to revise IDLE Shell's print_exception to tag the span. In some cases, the code that goes from a traceback line to a line in a file and marks it could do the same.

What would you do when the expression is not the last line?

try:
x/y
...
except Exception as e:
...
raise e

The except and raise might even be in a separate module.
I look forward to the PEP and discussion.

pablogsal · 2021-05-01T00:59:23Z

So would marking the 0 denominator when there is more than one candidate: "e = a/b + c/d".

No, it will mark the offset of the bytecode that was getting executed when the exception was raised. Is just a way to mark what is raising the particular exception

What would you do when the expression is not the last line?

There is some logic needed to re-raise exceptions. But basically it boils down to comparate the line numbers to decide if you want to propagate or not the offsets.

pablogsal · 2021-07-02T14:10:15Z

New changeset 98eee94 by Pablo Galindo in branch 'main':
bpo-43950: Add code.co_positions (PEP-657) (GH-26955)
98eee94

miss-islington · 2021-07-02T19:04:35Z

New changeset ec8759b by Batuhan Taskaya in branch 'main':
bpo-43950: optimize column table assembling with pre-sizing object (GH-26997)
ec8759b

miss-islington · 2021-07-04T18:02:32Z

New changeset 44f91fc by Batuhan Taskaya in branch 'main':
bpo-43950: use 0-indexed column offsets for bytecode positions (GH-27011)
44f91fc

miss-islington · 2021-07-04T19:05:12Z

New changeset 693cec0 by Batuhan Taskaya in branch 'main':
bpo-43950: include position in dis.Instruction (GH-27015)
693cec0

pablogsal · 2021-07-04T23:14:40Z

New changeset 5644c7b by Ammar Askar in branch 'main':
bpo-43950: Print columns in tracebacks (PEP-657) (GH-26958)
5644c7b

pablogsal · 2021-07-07T19:07:21Z

New changeset 4823d9a by Ammar Askar in branch 'main':
bpo-43950: Add option to opt-out of PEP-657 (GH-27023)
4823d9a

pablogsal · 2021-07-12T19:32:48Z

New changeset 1890dd2 by Batuhan Taskaya in branch 'main':
bpo-43950: Specialize tracebacks for subscripts/binary ops (GH-27037)
1890dd2

pablogsal · 2021-07-13T00:29:47Z

New changeset 9c3eaf8 by Ammar Askar in branch 'main':
bpo-43950: Add documentation for PEP-657 (GH-27047)
9c3eaf8

pablogsal · 2021-07-15T23:38:19Z

New changeset 919ad53 by Batuhan Taskaya in branch 'main':
bpo-43950: make BinOp specializations more reliable (GH-27126)
919ad53

ammaraskar · 2021-07-16T20:51:11Z

As reported by Will McGugan on twitter:

Printing may be wrong for some unicode characters like:

Traceback (most recent call last):
  File "test.py", line 1, in <module>
    x = ("该" / 0) + (1 / 2)
         ~~~~~~^
TypeError: unsupported operand type(s) for /: 'str' and 'int'

and

Traceback (most recent call last):
  File "test.py", line 3, in <module>
    x = ("💩" / 0) + (1 / 2)
         ~~~~~~~
TypeError: unsupported operand type(s) for /: 'str' and 'int'

pablogsal · 2021-07-16T20:54:39Z

Ah, this is our friend _PyPegen_byte_offset_to_character_offset. I am glad we refactor it to be used by the Parser and the traceback mechanism. Ammar, would you like to take a look?

ammaraskar · 2021-07-16T20:58:56Z

Aah, I won't have time to investigate until Monday. I'll take a look then unless someone gets to it first :)

pablogsal · 2021-07-16T21:51:23Z

We can probably leverage the unicodedata_UCD_east_asian_width_impl function in the unicodedata extension module to get the width of every character as emojis normally take 2 characters in the terminal. We could expose the necessary functions from there and use them in the _PyPegen_byte_offset_to_character_offset function

ammaraskar · 2021-07-19T18:24:00Z

Had some time to look into this. Just to summarize this problem, it deals with unicode points that are single characters but take up more than the width of a single character, even with a monospace font [1].

In the examples from above, the Chinese character itself counts as one character in a Python string. However, notice that it needs two carets:

>>> x = "该"
>>> print(x)
该
>>> len(x)
1
>>> print(x + '\n' + '^^')
该
^^

This issue is somewhat font dependent, in the case of the emoji I know that windows sometimes renders emojis as single-character wide black-and-white glyphs or colorful ones depending on the program.

As Pablo alluded to, unicodedata.east_asian_width is probably the best solution we can implement. For these wide characters it provides:

>>> unicodedata.east_asian_width('💩')
'W'
>>> unicodedata.east_asian_width('该')
'W'

W corresponding to Wide. Whereas for regular width characters:

>>> unicodedata.east_asian_width('b')
'Na'
>>> unicodedata.east_asian_width('=')
'Na'

we get Neutral (Not East Asian). This can be used to count the "displayed width" of the characters and hence the carets. However, organization is going to be a bit tricky since we're currently using _PyPegen_byte_offset_to_character_offset to get offsets to use for string slicing in the ast segment parsing code. We might have to make a separate function that gets the font display-width.

-------------

[1] Way more details on this issue here: https://denisbider.blogspot.com/2015/09/when-monospace-fonts-arent-unicode.html and an example of a Python library that tries to deal with this issue here: https://github.com/jquast/wcwidth

terryjreedy · 2021-07-19T21:03:09Z

The effort to match caret lines to general unicode is similar to a previous issue that was closed as futile. (But I could not find it.) It has a downside that should be considered.

The fundamental problem is that there is no fixed pitch font for unicode. (Let alone any font covering all of unicode.) Nor is there a single-double width definition, or font, for all of unicode. Some character sets are not amenable to such treatment.

To see the problem easier, open, for instance, IDLE's option/settings dialog, showing the fonts tab and a multi-script sample. On Windows, select what I believe is the most 'fixed' font -- Courier New. ASCII, Latin1, IPA, Greek, Cyrillic, Hebrew, and Arabic are all rendered in the same fixed pitch. But the last 4 Cyrillic characters of "...ЪъЭэѠѤѬӜ" are extremely cramped and may be rendered differently from the rest. The East Asian characters are in a different fixed pitch, about 1.6 times the Ascii, etc. fixed pitch. (So the double-wide 2 is 1.6 rounded up. And with some fonts, the East Asian scripts are not all the same pitch.) The South Asian script are variable pitch and for the sample chars, average wider than 1 (they have 20 chars, like the Ascii, etc, lines). Tamil, especially, has a wide range of widths, with the widest as wide as the East Asian chars.

On Windows, on my machine, pasting the sample text between quotes results in the Greek chars, the last 4 Cyrillic chars, and all Asian chars (including Hebrew and Arabic) being replaced by replacement chars. (I thought that this was better once, but maybe I mis-remember.) While one can get script-specific fonts, the fixed-pitch South Asian fonts I tried on Mac were hardly readable. My conclusion is that people using certain scripts and anyone wanting a wide variety of scripts needs to use a GUI-based editor and shell rather than a fixed-pitch terminal/console.

As long as the caret line has 1 char per code char, a GUI program can use it to mark code characters, and do so differently for '~' and '^'. If some of these chars are doubled, exact character information is lost. If you go ahead with this, please use a third character, such as '-', for additions. GUI programs could then ignore these, given that they can otherwise can get the start character information.

ammaraskar · 2021-07-19T22:47:07Z

I think this is the previous issue you are talking about Terry. https://bugs.python.org/issue24665

The correct algorithm is a little more complex than just using east_asian_widths. Ideally we would like to use the wcwidth function (https://man7.org/linux/man-pages/man3/wcwidth.3.html) which implements all the logic to somewhat properly figure out widths.

Given Terry's comments from above and the previous issues I'm starting to think the fix for this issue might not be worth the added complexity.

pablogsal · 2021-07-20T15:42:23Z

New changeset fbc349f by Batuhan Taskaya in branch 'main':
bpo-43950: Distinguish errors happening on character offset decoding (GH-27217)
fbc349f

pablogsal · 2021-07-20T21:06:29Z

I don't know, seems that rust can deal with this just fine:

❯ cargo run
Compiling rtest v0.1.0 (/home/pablogsal/rtest)
error[E0308]: mismatched types
--> src/main.rs:7:10
|
7 | test("hellooooooo: 🥳 😏 😒 😞 😔 some emojis");
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| |
| expected struct String, found &str
| help: try using a conversion method: "hellooooooo: 🥳 😏 😒 😞 😔 some emojis".to_string()

pablogsal · 2021-07-20T21:07:09Z

In bpo this looks bad, but test this small program yourself and check the error in your terminal:

fn test(number: String) -> () {
println!("Hello, world: {}", number);
}

fn main() {
test("hellooooooo: 🥳 😏 😒 😞 😔 some emojis");
}

pablogsal · 2021-07-20T21:09:53Z

They seem to be checking the unicode width as we are proposing here. Check the calls to unicode_width::UnicodeWidthChar::width:

https://github.com/rust-lang/rust/blob/master/compiler/rustc_errors/src/emitter.rs#L729-L733

ammaraskar · 2021-07-20T21:45:10Z

Indeed, and the unicode_width package seems to implement the wcwidth algorithm:

https://bugs.python.org/issue12568 is an issue from the past that proposed adding wcwidth to the standard library.

pablogsal · 2021-07-20T21:51:10Z

Even if this doesn't cover all cases it seems that everyone else is going with this solution for what I can see, and currently this looks suboptimal on python because we get this wrong on the vast majority of cases so my suggestion is to try to fix this the same way rust on others are doing it even if there is some subset of cases where it won't fully work

sweeneyde · 2021-07-23T18:58:19Z

bpo-44719 managed to make the assert(source_line); fail

pablogsal · 2021-07-24T12:50:36Z

New changeset c836231 by Batuhan Taskaya in branch 'main':
bpo-43950: ensure source_line is present when specialising the traceback (GH-27313)
c836231

isidentical · 2021-07-24T17:49:36Z

New changeset ef8b853 by Batuhan Taskaya in branch 'main':
bpo-43950: check against the raw string, not the pyobject (GH-27337)
ef8b853

isidentical · 2021-07-24T17:50:44Z

New changeset 4f5980a by Batuhan Taskaya in branch 'main':
bpo-43950: support long lines in traceback.py (GH-27336)
4f5980a

isidentical · 2021-07-24T18:28:59Z

There is a subtle difference between traceback.py and Python/traceback.c which makes the latter (C version) output the line without trimming the trailing whitespace. The Python version simply uses .strip() which strips both the left (starting) and right (ending) side. I'd personally go with adapting the C version to use full strip, since it makes more sense but I am open to comments on this.

Here is an example test (needs GH 27339 first) (func( call has 2 trailing whitespace characters);
https://gist.github.com/isidentical/37ab1693d551b678a52626298d99e582

miss-islington · 2021-07-25T22:01:51Z

New changeset 3e235e0 by Batuhan Taskaya in branch 'main':
bpo-43950: support some multi-line expressions for PEP-657 (GH-27339)
3e235e0

isidentical · 2021-09-03T15:29:17Z

New changeset 85ea2d6 by Batuhan Taskaya in branch 'main':
bpo-43950: support positions for dis.Instructions created through dis.Bytecode (GH-28142)
85ea2d6

sobolevn · 2022-01-18T12:31:06Z

I've opened a new issue about unused dis.Positions here: https://bugs.python.org/issue46422

JustAnotherArchivist · 2022-04-13T23:14:58Z

As far as I can see, this is not yet supported by inspect.trace and inspect.stack. That would be very useful for debugging.

pablogsal · 2022-04-14T10:29:50Z

As far as I can see, this is not yet supported by inspect.trace and inspect.stack. That would be very useful for debugging.

I can try to prepare a PR this week

pablogsal · 2022-04-14T11:29:50Z

Hummm, is unclear to me if we can actually update these interfaces. As they return a (named) tuple, users unpacking the tuple will be broken if we add more items to it.

JustAnotherArchivist · 2022-04-14T14:14:12Z

Yeah, I noticed that as well. The unpacking could be solved with a custom class that provides an __iter__ method for compatibility. For example:

import dataclasses
import typing

@dataclasses.dataclass
class X:
	a: int
	b: str
	c: float

	def __iter__(self):
		return iter((self.a, self.b))

x = X(42, 'foo', 3.1416)
a, b = x
print(a, b)
print(x.a, x.b)

This would still break other, more obscure uses of the named tuple specifics though, e.g. isinstance(x, tuple) or x._asdict().

pablogsal · 2022-04-14T15:16:06Z

Yeah, I noticed that as well. The unpacking could be solved with a custom class that provides an __iter__ method for compatibility. For example:
import dataclasses
import typing

@dataclasses.dataclass
class X:
	a: int
	b: str
	c: float

	def __iter__(self):
		return iter((self.a, self.b))

x = X(42, 'foo', 3.1416)
a, b = x
print(a, b)
print(x.a, x.b)
This would still break other, more obscure uses of the named tuple specifics though, e.g. isinstance(x, tuple) or x._asdict().

That's likely too convoluted and it will certainly be very surprising in a bad way to new users.

JustAnotherArchivist · 2022-04-14T15:36:11Z

I agree that it's more of an awful workaround than an actual solution.

Perhaps a better option might be adding a kwarg to trace and stack that makes them return a different data structure with the extra info. It would then default to False for backward compatibility. There's precedent for that as well: typing.get_type_hints and its include_extras argument. Although the question would still be what data type would be returned there; if we don't anticipate further additions anytime soon, maybe just an extended named tuple would do, but a dataclass or similar structure would be more future-proof.

pablogsal · 2022-04-14T15:42:55Z

. Although the question would still be what data type would be returned there; if we don't anticipate further additions anytime soon, maybe just an extended named tuple would do, but a dataclass or similar structure would be more future-proof.

Unfortunately adding a data class to the inspect module will have an impact on startup time and size, which is very undesirable as is on the hot path. I don't dislike the keyword argument, though, but we need to see what others think. @ammaraskar @isidentical

JelleZijlstra · 2022-04-14T16:35:05Z

We can make a namedtuple subclass that adds additional attributes. There is precedent for this in the form of os.stat_result, although that's a structseq (the C equivalent of namedtuple). It's a little confusing, but it should preserve compatibility and avoid convoluting the API.

pablogsal · 2022-04-17T17:14:01Z

We can make a namedtuple subclass that adds additional attributes. There is precedent for this in the form of os.stat_result, although that's a structseq (the C equivalent of namedtuple). It's a little confusing, but it should preserve compatibility and avoid convoluting the API.

Yeah, I recall that one. I am not a fan of that approach as I find it very confusing. Maybe we should open a thread in python-dev about this.

pablogsal · 2022-04-17T17:21:04Z

I have sent an email about this to python-dev

gpshead · 2022-04-17T18:58:40Z

As awkward of a beast as a tuple subclass emulating the previous fixed size namedtuple while adding additional attributes seems, it is very practical.

Users don't care about the specifics. They just need existing code to keep working and will want access to new values when updating this code or writing new code. Which means it should still be a tuple subclass for typing reasons known as inspect.Traceback and inspect.FrameInfo, support the existing indices and unpacking semantics, and support named field access. That this would be adding additional fields that indexing does not provide is a mere curiosity that most users would not notice as modern code should be field based anyways.

Nobody should care if the implementation behind the returned instances is unusual. The behavior will be what they expect. It avoids the need to expand into multiple different return types via an arg flag or the creation of parallel APIs.

This is a good practical example of why we should not design APIs to return tuples if they could conceivably change in the future. And why people unpacking on API calls that return tuples might find it wise to always [:slice] the return value as a style idiom.

All this said... we could also just add another field to the namedtuple. Some code inspection among existing API users is required, but I expect most users either immediately unpack upon return value assignment or use field names. Mix and matching of indexing and named access on a single return value is probably rare. The easy modification to existing code using unpacking to deal with this kind of API change upon version upgrade is to append a [:5] or similar to the API call. We have had other tuple returning APIs add fields in the past where this was the workaround though I've forgotten what they were off the top of my head.

This would be the more disruptive option. It is nice to avoid this if possible as it delays people's ability to update to and even test their code on 3.11. So I still lean towards the hybrid "half namedtuple" class.

…ts. (GH-91666) * Stores all location info in linetable to conform to PEP 626. * Remove column table from code objects. * Remove end-line table from code objects. * Document new location table format

…pty less dangerous. (GH-91790)

… information (GH-91531)

ezio-melotti transferred this issue from another repository Apr 10, 2022

JelleZijlstra mentioned this issue Apr 14, 2022

gh-88116: Enhance the inspect frame APIs to use the extended position information #91531

Merged

markshannon mentioned this issue Apr 18, 2022

GH-88116: Use a compact format to represent end line and column offsets. #91666

Merged

markshannon mentioned this issue Apr 21, 2022

GH-88116: Document that PyCodeNew is dangerous, and make PyCode_NewEmpty less dangerous. #91790

Merged

markshannon added a commit that referenced this issue Apr 21, 2022

GH-88116: Document that PyCodeNew is dangerous, and make PyCode_NewEm…

d44815c

…pty less dangerous. (GH-91790)

pablogsal added a commit that referenced this issue Apr 23, 2022

gh-88116: Enhance the inspect frame APIs to use the extended position…

0daa99f

… information (GH-91531)

python / cpython Public

Include column offsets for bytecode instructions #88116

Include column offsets for bytecode instructions #88116

Comments

pablogsal commented Apr 27, 2021

pablogsal commented Apr 27, 2021

pablogsal commented Apr 27, 2021

markshannon commented Apr 27, 2021

pablogsal commented Apr 27, 2021

terryjreedy commented Apr 30, 2021

pablogsal commented Apr 30, 2021

pablogsal commented Apr 30, 2021

terryjreedy commented May 1, 2021

pablogsal commented May 1, 2021

pablogsal commented Jul 2, 2021

miss-islington commented Jul 2, 2021

miss-islington commented Jul 4, 2021

miss-islington commented Jul 4, 2021

pablogsal commented Jul 4, 2021

pablogsal commented Jul 7, 2021

pablogsal commented Jul 12, 2021

pablogsal commented Jul 13, 2021

pablogsal commented Jul 15, 2021

ammaraskar commented Jul 16, 2021

pablogsal commented Jul 16, 2021

ammaraskar commented Jul 16, 2021

pablogsal commented Jul 16, 2021

ammaraskar commented Jul 19, 2021

terryjreedy commented Jul 19, 2021

ammaraskar commented Jul 19, 2021

pablogsal commented Jul 20, 2021

pablogsal commented Jul 20, 2021

pablogsal commented Jul 20, 2021

pablogsal commented Jul 20, 2021

ammaraskar commented Jul 20, 2021

pablogsal commented Jul 20, 2021

sweeneyde commented Jul 23, 2021

pablogsal commented Jul 24, 2021

isidentical commented Jul 24, 2021

isidentical commented Jul 24, 2021

isidentical commented Jul 24, 2021

miss-islington commented Jul 25, 2021

isidentical commented Sep 3, 2021

sobolevn commented Jan 18, 2022

JustAnotherArchivist commented Apr 13, 2022

pablogsal commented Apr 14, 2022

pablogsal commented Apr 14, 2022

JustAnotherArchivist commented Apr 14, 2022

pablogsal commented Apr 14, 2022

JustAnotherArchivist commented Apr 14, 2022

pablogsal commented Apr 14, 2022

JelleZijlstra commented Apr 14, 2022

pablogsal commented Apr 17, 2022

pablogsal commented Apr 17, 2022

gpshead commented Apr 17, 2022