Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Efficient reverse line iterator #44691

Open
marktrussell mannequin opened this issue Mar 10, 2007 · 13 comments
Open

Efficient reverse line iterator #44691

marktrussell mannequin opened this issue Mar 10, 2007 · 13 comments
Labels
3.13 bugs and security fixes performance Performance or resource usage stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@marktrussell
Copy link
Mannequin

marktrussell mannequin commented Mar 10, 2007

BPO 1677872
Nosy @rhettinger, @abalkin, @pitrou, @giampaolo, @tiran, @benjaminp, @merwok, @florentx, @james-emerton
Files
  • reverse_file_iterator.diff: Reverse file iterator implementation
  • reverse-file-iterator-20071209.diff
  • reverse-file-iterator-20071210.diff
  • readprevline-20140415.diff: Implementation of BufferedReader.readprevline()
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2007-03-10.13:47:37.000>
    labels = ['library', 'performance']
    title = 'Efficient reverse line iterator'
    updated_at = <Date 2014-04-15.20:32:03.033>
    user = 'https://bugs.python.org/marktrussell'

    bugs.python.org fields:

    activity = <Date 2014-04-15.20:32:03.033>
    actor = 'jemerton'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2007-03-10.13:47:37.000>
    creator = 'mark_t_russell'
    dependencies = []
    files = ['7854', '8902', '8913', '34894']
    hgrepos = []
    issue_num = 1677872
    keywords = ['patch']
    message_count = 12.0
    messages = ['52152', '52153', '57266', '57269', '58326', '58361', '58370', '110532', '110533', '111041', '115651', '216381']
    nosy_count = 11.0
    nosy_names = ['rhettinger', 'belopolsky', 'pitrou', 'giampaolo.rodola', 'christian.heimes', 'mark_t_russell', 'benjamin.peterson', 'eric.araujo', 'flox', 'jcon', 'jemerton']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'needs patch'
    status = 'open'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue1677872'
    versions = ['Python 3.3']

    Linked PRs

    @marktrussell
    Copy link
    Mannequin Author

    marktrussell mannequin commented Mar 10, 2007

    This is an implementation of __reversed__ for the TextIOWrapper type from the new IO interface (see http://docs.google.com/Doc?id=dfksfvqd_1cn5g5m). It is used as:

        import io
        for line in reversed(io.open(filename)):
              ...

    It is efficient (only reads a block at a time) but can handle arbitrary length lines. It is useful for scanning backwards through big log files, but my main reason for submitting it is as a demonstration of the usefulness of the new IO layers - it works by putting a new buffering layer round the RawIOBase object from the open() call.

    It's just a proof of concept, but if there's interest in this I'm happy to write unit tests and documentation.

    The patch also makes io.BufferedReader support buffering, and adds a very minimal implementation of io.TextIOBase and io.TextIOWrapper (needed to make io.open() work).

    @marktrussell marktrussell mannequin added the stdlib Python modules in the Lib dir label Mar 10, 2007
    @rhettinger
    Copy link
    Contributor

    FWIW, I've wanted something like this for a long time (for scanning log files in reverse).

    @gvanrossum gvanrossum changed the title Efficient reverse line iterator Efficient reverse line iterator Aug 30, 2007
    @tiran
    Copy link
    Member

    tiran commented Nov 8, 2007

    Some people like the feature, Guido isn't against the feature ... It
    looks as you have a good chance to get it into Python 3.0. :)

    Can you come up with a new patch and unit tests? The io module has
    changed a lot since your initial patch.

    @marktrussell
    Copy link
    Mannequin Author

    marktrussell mannequin commented Nov 8, 2007

    Sure - I'll do an updated patch at the weekend.

    @marktrussell
    Copy link
    Mannequin Author

    marktrussell mannequin commented Dec 9, 2007

    Here's an updated version of the patch. Changes:

    - Updated to work against current py3k branch (r59441)
    - Added support for universal newlines
    - Added unit tests
    - Added docs
    

    The patch includes documentation for reversed() and __reversed__() (in the
    library and reference manuals respectively) which are independent of the
    reverse lines iterator - I can split those out to separate patch if needed.

    I also updated the expected output from test_profile and test_cProfile,
    although I think a better fix would be to filter out the stdlib-related stuff
    from the expected output, as currently these tests break whenever io.py is
    changed.

    @gvanrossum
    Copy link
    Member

    I'd like to see the doc patches separated out and applied to 2.6 --
    they'll automatically merge into 3.0 then. Make that a separate bug please.

    I like the idea, haven't had time to carefully review the code, but
    noticed one oddity:

    >> for line in reversed(open("/etc/passwd")): print(line, end="")

    immediately raises

    ValueError: I/O operation on closed file

    @marktrussell
    Copy link
    Mannequin Author

    marktrussell mannequin commented Dec 10, 2007

    As Guido requested I've split off the generic reversed() and __reversed__()
    doc additions to this patch against 2.6: http://bugs.python.org/issue1582

    The I/O error from reversed(open("/etc/passwd")) was caused by the inner
    TextIOWrapper calling close() (via the inherited IOBase.__del__() method).
    I've fixed it by having TextIOReverseIterator keep a reference to the file
    object, and added a test case for the bug.

    I think it's at least questionable that TextIOWrapper.close() is calling
    buffer.close() on a buffer that it did not create. I assumed that keeping
    a reference to the buffer object would be enough to keep the buffer open,
    and I suspect this is likely to trip up others in future. I think
    TextIOWrapper.close() should probably just set a flag (for the use of its
    own closed() method) and rely on reference counting to call close()
    on the buffer object. If that sounds on the right lines I'm happy to think
    about it a bit more and submit a patch.

    @devdanzin devdanzin mannequin added the performance Performance or resource usage label Apr 26, 2009
    @BreamoreBoy
    Copy link
    Mannequin

    BreamoreBoy mannequin commented Jul 17, 2010

    The OP has done everything asked of him. There are a lot of positive comments about this request. Snag is the patch is in python, I understand that io is now written in C. Could we at this late stage get this into 3.2, or even a minor release of 3.2, or will it have to be deferred until 3.3?

    @abalkin
    Copy link
    Member

    abalkin commented Jul 17, 2010

    The OP has done everything asked of him.

    Not quite. He split out the documentation part of his patch and it was accepted and committed.

    Guido raised an issue with the code. OP raised more questions. The code was never updated.

    Now the patch is out of date because io module has been reimplemented in C. The patch is still good as a prototype/proof of concept but needs to be updated to apply to _pyio.

    The idea is good, but the implementation is not ready.

    @marktrussell
    Copy link
    Mannequin Author

    marktrussell mannequin commented Jul 21, 2010

    I'll do a C version of the patch (hopefully in the next week or so).

    @pitrou
    Copy link
    Member

    pitrou commented Sep 5, 2010

    Suggestions:

    • do it on BufferedReader, rather than TextIOWrapper: if you want full-speed scanning of log files, you probably want to open them in binary mode

    • rather than implementing a full-blown iterator, you can start with simple primitives: e.g. a method called readprevline() (or even scan_back() if you want to make the stop character(s) configurable)

    @james-emerton
    Copy link
    Mannequin

    james-emerton mannequin commented Apr 15, 2014

    Attached is an implementation of BufferedReader.readprevline(), as suggested by Antoine.

    At this point, it seems to be working but I would like to improve the tests when a single result spans multiple chunks. I would particularly like to ensure correct behaviour when a newline ends up as the last byte of a new chunk.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @iritkatriel iritkatriel added the type-feature A feature request or enhancement label Aug 18, 2022
    @gvanrossum
    Copy link
    Member

    Last activity was 2014. Anyone interested in getting this over the finish line?

    @gvanrossum gvanrossum added the pending The issue will be closed if no feedback is provided label Sep 3, 2022
    @iritkatriel iritkatriel added 3.12 bugs and security fixes and removed pending The issue will be closed if no feedback is provided labels Sep 13, 2022
    @erlend-aasland erlend-aasland added 3.13 bugs and security fixes and removed 3.12 bugs and security fixes labels Jan 5, 2024
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.13 bugs and security fixes performance Performance or resource usage stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    7 participants