Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-94526: getpath_dirname() no longer encodes the path #97645

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

vstinner
Copy link
Member

@vstinner vstinner commented Sep 29, 2022

Fix the Python path configuration used to initialized sys.path at
Python startup. getpath_basename() and getpath_dirname() functions no
longer encode the path to UTF-8/strict to avoid encoding errors if it
contains surrogate characters (created by decoding a bytes path with
the surrogateescape error handler).

The functions now use PyUnicode_FindChar() and PyUnicode_Substring()
on the Unicode path, rather than strrchr() on the encoded bytes
string.

@vstinner
Copy link
Member Author

vstinner commented Sep 29, 2022

Fix building Python in a non-ASCII path

Well. In fact, the issue is broader: no only _bootstrap_python is affected, any python program is affected since the Modules/getpath.c code is used by all Python executables.

@vstinner vstinner force-pushed the getpath_unicode branch 2 times, most recently from 51ee026 to f2235ae Compare Sep 29, 2022
@vstinner
Copy link
Member Author

vstinner commented Sep 29, 2022

I rebased and updated the PR to clarify that this issue affects the Python path configuration (sys.path creation).

@vstinner
Copy link
Member Author

vstinner commented Sep 29, 2022

Sadly, Modules/getpath.c is not a regular extension module, it cannot be loaded in test_getpath to easily write unit tests.

There are getpath_methods which are injected inside a namespace (dict) by funcs_to_dict() function.

It may be interesting to convert it to a regular extension module (_getpath?).

const char *path;
if (!PyArg_ParseTuple(args, "s", &path)) {
PyObject *path;
if (!PyArg_ParseTuple(args, "U", &path)) {
Copy link
Member

@serhiy-storchaka serhiy-storchaka Sep 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I would use METH_O and PyArg_Parse() in these functions, but this is another issue.

Why cannot they be implemented in Python?

Copy link
Member Author

@vstinner vstinner Sep 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I would use METH_O and PyArg_Parse() in these functions, but this is another issue.

I tried to minimize the changes.

Why cannot they be implemented in Python?

Ask @zooba who designed this. Maybe it can be changed?

Fix the Python path configuration used to initialized sys.path at
Python startup. Paths are no longer encoded to UTF-8/strict to avoid
encoding errors if it contains surrogate characters (bytes paths are
decoded with the surrogateescape error handler).

getpath_basename() and getpath_dirname() functions no longer encode
the path to UTF-8/strict, but work directly on Unicode strings. These
functions now use PyUnicode_FindChar() and PyUnicode_Substring() on
the Unicode path, rather than strrchr() on the encoded bytes string.
@vstinner
Copy link
Member Author

vstinner commented Sep 29, 2022

@serhiy-storchaka:

The NEWS entry is meant to be read by a common Python user, not a core dev. getpath_basename and getpath_dirname are not Python function. Could you rewrite this?

I rephrased the NEWS entry to omit function names. Is it better? I only named functions in the commit message.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants