Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR: Adjust read excel behavior for xlrd >= 2.0 #38571

Merged
merged 19 commits into from Dec 23, 2020

Conversation

@rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented Dec 18, 2020

  • closes #38424
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Alternative to #38522. I've been testing this locally using both xlrd 1.2.0 and 2.0.1.

One test fails because we used to default to xlrd but now default to openpyxl, it's not clear to me if this test should be passing with openpyxl.

cc @cjw296, @jreback, @jorisvandenbossche

@rhshadrach
Copy link
Member Author

@rhshadrach rhshadrach commented Dec 19, 2020

Windows failures are related, will be investigating.

else:
peek = buf
except FileNotFoundError:
# File may be a url, return the extension

This comment has been minimized.

@cjw296

cjw296 Dec 19, 2020
Contributor

Why provide degraded inference just because the source is a URL?
It appears this code goes out of its way to avoid using seek, whereas the ODS inference code that was there before, and xlrd which is being hit in many of the existing code paths already does seek without issue.

This comment has been minimized.

@rhshadrach

rhshadrach Dec 19, 2020
Author Member

Previous ODS code read at most 84 bytes; current code needs the entire file. I'll do a partial revert here and utilize the previous ODS code, but I have reservations about downloading the entire file here. Would like to hear others' thoughts.

This comment has been minimized.

@rhshadrach

rhshadrach Dec 19, 2020
Author Member

Ah; I think I have falsely assumed we need to get the entire contents of the file. I think it should be possible to get BufferedIOBase/RawIOBase into the proper form for ZipFile without reading.

This comment has been minimized.

@cjw296

cjw296 Dec 19, 2020
Contributor

I did put quite a lot of effort into #38424; the code in that PR is the way it is from, as carefully as I could, following through the various code paths and ensuring the behaviour was as simple and robust as it could be, in spite of less automated testing than I was expecting. It's tough to see these kind of comments which sort of imply that I hadn't thought any of this through...

This comment has been minimized.

@rhshadrach

rhshadrach Dec 19, 2020
Author Member

In no way shape or form did I have any intention of implying such a thing. As my comment above says, I was mistaken.

@cjw296
Copy link
Contributor

@cjw296 cjw296 commented Dec 19, 2020

I believe this would also replace #38456.

I find the desire to support xlrd 1.2 at a time when pandas users are already upgrading a package to be both surprising and disappointing, given the potential security issues and poor parsing experience associated with sticking with xlrd 1.2.

This PR has code that is more complex than #38522, even ignoring that which is in place to support the convoluted deprecation process it advocates for. I haven't seen code coverage metrics as part of the pandas PR process, but are you sure all the new conditional branches you've introduced are covered by sufficient tests?

In case it wasn't clear in #38522, I believe the most robust approach in this area is to:

  • obtain the stream we're going to end up with anyway
  • peek the minimum number of bytes from it to do content-based inference
  • .seek(0) to get the stream back to its initial state.
@rhshadrach
Copy link
Member Author

@rhshadrach rhshadrach commented Dec 19, 2020

Thanks for the comments @cjw296, I was able to simplify/improve a lot from them.

@cjw296
Copy link
Contributor

@cjw296 cjw296 commented Dec 19, 2020

@rhshadrach - I'm confused as to why you didn't just start with #38522 and add the fallback logic you're advocating for. (corrected)

@rhshadrach
Copy link
Member Author

@rhshadrach rhshadrach commented Dec 19, 2020

@cjw296

I'm confused as to why you didn't just start with #38424 and add the fallback logic you're advocating for.

I think you mean your PRs #38456/#38522. I had started this work 7 days ago, prior to the existence of them.

@rhshadrach
Copy link
Member Author

@rhshadrach rhshadrach commented Dec 19, 2020

2 linux tests failed to start, all other checks passed. Rerunning.

@rhshadrach
Copy link
Member Author

@rhshadrach rhshadrach commented Dec 19, 2020

/azp run

@azure-pipelines
Copy link
Contributor

@azure-pipelines azure-pipelines bot commented Dec 19, 2020

Azure Pipelines successfully started running 1 pipeline(s).
@rhshadrach
Copy link
Member Author

@rhshadrach rhshadrach commented Dec 19, 2020

Update: Failure if from one of the builds using zh_CN.utf8

Failure is related, though I don't understand it:

>               handle = open(handle, ioargs.mode)
E               FileNotFoundError: [Errno 2] 没有那个文件或目录: 'foo.xlsm'

[snip]

with pytest.raises(FileNotFoundError, match="No such file or directory"):
>           pd.read_excel(bad_file)
E           AssertionError: Regex pattern 'No such file or directory' does not match "[Errno 2] 没有那个文件或目录: 'foo.xlsm'".
@jreback jreback added this to the 1.2 milestone Dec 21, 2020
Copy link
Contributor

@jreback jreback left a comment

pandas/io/excel/_base.py Outdated Show resolved Hide resolved
@jreback jreback mentioned this pull request Dec 21, 2020
rhshadrach added 2 commits Dec 21, 2020
@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Dec 21, 2020

Regardless of which PR was started first, it's also an option to update #38522 to use the behaviour we want (fallback to xlrd if openpyxl is not installed, for xlrd < 2.0, see also #38522 (review), I think it should be fairly straightforward to add that to that PR)

@jreback
Copy link
Contributor

@jreback jreback commented Dec 21, 2020

Regardless of which PR was started first, it's also an option to update #38522 to use the behaviour we want (fallback to xlrd if openpyxl is not installed, for xlrd < 2.0, see also #38522 (review), I think it should be fairly straightforward to add that to that PR)

ok sure. let's try to get this in today and cut the release. i view better inference as good but shouldn't hold this up.

@rhshadrach
Copy link
Member Author

@rhshadrach rhshadrach commented Dec 21, 2020

@jreback @jorisvandenbossche Is there room for improving the inference done here? As far as I can tell, the inference done here is that of #38522 but also handles bytes and urls (which I believe is causing failures in #38522).

@rhshadrach
Copy link
Member Author

@rhshadrach rhshadrach commented Dec 21, 2020

Two of the travis builds failed to start, the third one passed. Restarted manually.

Copy link
Contributor

@jreback jreback left a comment

very minor comment. pls commetn / update

pandas/tests/io/excel/test_writers.py Outdated Show resolved Hide resolved
pandas/io/excel/_base.py Outdated Show resolved Hide resolved
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

@rhshadrach thanks for the update! Additional changes look good

There is one case that also changed behaviour in this PR (in a good way), which raises on master, and that is reading from a xlsx file-like object without xlrd:

with open("test.xlsx", "rb") as f:
    df = pd.read_excel(f)

On master, this would raise an error if you have xlrd 2.0 installed (because it always uses xlrd for file-likes on master), but with this PR it will now correctly choose openpyxl (and raise a proper warning when using xlrd).
Do you know if this improvement is covered by a test?

@rhshadrach
Copy link
Member Author

@rhshadrach rhshadrach commented Dec 23, 2020

@rhshadrach

https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=50726&view=logs&j=404760ec-14d3-5d48-e580-13034792878f&t=f81e4cc8-d61a-5fb8-36be-36768e5c561a looks like a legit failure

@jreback - Agreed, it is, but I don't understand it. The warning is emitted when openpyxl is not installed but xlrd < 2.0 is and a non-xls file is used. Replicating this in my environment, the test passes with the warning emitted with the right stacklevel (checked manually). So maybe this is a Windows vs Linux issue?

What really doesn't make sense is that actual_warning.filename is pandas\io\excel\_base.py where the Warning is emitted, but the stacklevel is very clearly being set to either be 2 or 4. I don't see how it's possible that _base.py is then the filename.

Attempting to debug via CI now.

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Dec 23, 2020

If it only fails on windows, I think it is also fine to add a check_stacklevel=False to the particular test ..

@rhshadrach
Copy link
Member Author

@rhshadrach rhshadrach commented Dec 23, 2020

If it only fails on windows, I think it is also fine to add a check_stacklevel=False to the particular test ..

Will do, is this a known issue with Windows? Couldn't find an issue on github, will add one if it is.

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Dec 23, 2020

Actually, it's a test asserting warnings about positional arguments, not about the engine. So that might interfere with the test, because it now also raises another warning than the one it is testing?

@rhshadrach
Copy link
Member Author

@rhshadrach rhshadrach commented Dec 23, 2020

@jorisvandenbossche Yes, that appears to me to be the reason why it's failing now. However, it passes locally for me and stepping through the assert code, I don't see any reason why it might not work on Windows. But most likely I'm missing some peculiarity about windows..

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Dec 23, 2020

I think it is certainly fine to skip the check for stacklevel here (or mayble only skip it on windows, so it's still tested generally)

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Dec 23, 2020

@rhshadrach sorry, I merged #38456, which will give some merge conflict pains. I can also deal with it if you like

…rd_warnings

� Conflicts:
�	doc/source/whatsnew/v1.2.0.rst
�	pandas/io/excel/_base.py
@rhshadrach
Copy link
Member Author

@rhshadrach rhshadrach commented Dec 23, 2020

@jorisvandenbossche - Not a problem. The merge was fairly straightforward, but your review would be appreciated to make sure I didn't mess any of it up. (I am also double checking it now)

The stacklevel issue was a bug in one of my previous PRs where the file path was hardcoded using '/'. Should be fixed now. Also your requested test has been added.

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Dec 23, 2020

Doc changes look good after the merge!


def test_corrupt_bytes_raises(self, read_ext, engine):
bad_stream = b"foo"
with pytest.raises(BadZipFile, match="File is not a zip file"):

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Dec 23, 2020
Member

Could also be left for a follow-up (time to get this merged ;-)), but this is not a super clear error message in case you accidentally pass a non-excel file to read_excel

We should probably check in the inspect function if it is a zip file (like is also done here: https://github.com/pandas-dev/pandas/pull/38522/files#diff-63200ddb7f5656b8ee868a28d9cb7720ffe50689b0e3fb0b4e15cc5c0ae80dd7R942), and if not raise an error like "file is not recognized as an excel file" or so.

@jreback
Copy link
Contributor

@jreback jreback commented Dec 23, 2020

@rhshadrach unfortunately a couple of legit looking failures. note happy to xfail those tests for those versions is fine.

rhshadrach added 2 commits Dec 23, 2020
@rhshadrach
Copy link
Member Author

@rhshadrach rhshadrach commented Dec 23, 2020

@jreback - the test suggested by @jorisvandenbossche detected that we were not defaulting to pyxlsb for xlsb files when engine is None; opened #38667 and xfailed the test.

@jreback
Copy link
Contributor

@jreback jreback commented Dec 23, 2020

@jreback - the test suggested by @jorisvandenbossche detected that we were not defaulting to pyxlsb for xlsb files when engine is None; opened #38667 and xfailed the test.

excellent, ping on green-ish

@jreback jreback merged commit 263e1ee into pandas-dev:master Dec 23, 2020
17 of 18 checks passed
17 of 18 checks passed
Checks
Details
pre-commit
Details
Web and docs
Details
continuous-integration/travis-ci/pr The Travis CI build could not complete due to an error
Details
pandas-dev.pandas Build #20201223.83 succeeded
Details
pandas-dev.pandas (Linux py37) Linux py37 succeeded
Details
pandas-dev.pandas (Linux py37_locale_slow) Linux py37_locale_slow succeeded
Details
pandas-dev.pandas (Linux py37_minimum_versions) Linux py37_minimum_versions succeeded
Details
pandas-dev.pandas (Linux py37_slow) Linux py37_slow succeeded
Details
pandas-dev.pandas (Linux py38) Linux py38 succeeded
Details
pandas-dev.pandas (Linux py38_locale) Linux py38_locale succeeded
Details
pandas-dev.pandas (Linux py38_np_dev) Linux py38_np_dev succeeded
Details
pandas-dev.pandas (Linux py38_slow) Linux py38_slow succeeded
Details
pandas-dev.pandas (Linux py39) Linux py39 succeeded
Details
pandas-dev.pandas (Windows py37_np16) Windows py37_np16 succeeded
Details
pandas-dev.pandas (Windows py38_np18) Windows py38_np18 succeeded
Details
pandas-dev.pandas (macOS py37_macos) macOS py37_macos succeeded
Details
pandas-dev.pandas (py37_32bit) py37_32bit succeeded
Details
@jreback
Copy link
Contributor

@jreback jreback commented Dec 23, 2020

thanks @rhshadrach amazing job here.

and thank you @cjw296 for your PR and the input.

@jreback
Copy link
Contributor

@jreback jreback commented Dec 23, 2020

@meeseeksdev backport 1.2.x

meeseeksmachine added a commit to meeseeksmachine/pandas that referenced this pull request Dec 23, 2020
@rhshadrach rhshadrach deleted the rhshadrach:xlrd_warnings branch Dec 23, 2020
jreback pushed a commit that referenced this pull request Dec 24, 2020
…38670)

Co-authored-by: Richard Shadrach <45562402+rhshadrach@users.noreply.github.com>
@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Dec 24, 2020

Thanks @rhshadrach !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

4 participants
You can’t perform that action at this time.