
Fix misleading hint for iterparse #93618

Open
@Prometheus3375

Documentation

From docs:

Because it’s so flexible, XMLPullParser can be inconvenient to use for simpler use-cases. 
If you don’t mind your application blocking on reading XML data 
but would still like to have incremental parsing capabilities, take a look at iterparse(). 
It can be useful when you’re reading a large XML document and don’t want to hold it wholly in memory.

The last sentence is wrong. iterparse eventually holds the entire document in memory, because it builds the XML tree incrementally and never discards any part of it. In other words, iterparse alone does not solve the problem of high memory consumption; some extra code is required. To process a large file with a small memory footprint, one must at least repeatedly remove completed children from the root element (the exact approach depends on the XML structure). Therefore, I suggest removing the last sentence from the docs.
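To illustrate the point, here is a minimal sketch that compares how many children the root still holds after a full iterparse pass, with and without removing completed children. The sample document and the helper name `children_retained` are made up for this illustration; child count is used as a rough stand-in for memory use, not as an actual measurement.

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical sample document: a root element with 1000 small children.
xml = "<root>" + "".join(f"<item>{i}</item>" for i in range(1000)) + "</root>"


def children_retained(clear_children):
    """Parse the sample and return how many children the root still holds."""
    root = None
    source = io.BytesIO(xml.encode())
    for event, elem in iterparse(source, events=("start", "end")):
        if event == "start":
            # The first 'start' event belongs to the root element.
            if root is None:
                root = elem
        elif clear_children and len(root) > 0 and elem is root[0]:
            # A direct child of the root is complete: detach it from the tree.
            del root[0]
    return len(root)


retained_plain = children_retained(False)    # plain iterparse, nothing removed
retained_cleared = children_retained(True)   # completed children removed
```

Without the removal step, every child is still attached to the root when parsing ends, which is why a plain iterparse loop does not bound memory on its own.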


The code below handles each child of the root as soon as it is complete: the child is processed and deleted from the root, so it can be freed from memory (provided nothing else references it, of course). This makes it possible to process a 7GB XML file with memory usage of up to 10MB (when the root has a great number of children).

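from xml.etree.ElementTree import XMLPullParser

# `file` is the path to the XML document, defined elsewhere
parser = XMLPullParser(['start', 'end'])  # can be replaced with iterparse as well
root = None
with open(file) as f:
    for line in f:
        parser.feed(line)
        for event, obj in parser.read_events():
            match event:
                case 'start':
                    # the first 'start' event belongs to the root element
                    if root is None:
                        root = obj
                case 'end':
                    # a direct child of the root has been fully parsed
                    if len(root) > 0 and obj is root[0]:
                        # detach it from the tree; `obj` keeps it alive for now
                        del root[0]
                        # process obj

parser.close()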
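Since the parser can be replaced with iterparse, the same idea might look roughly like the sketch below. The generator name `iter_root_children` and the sample input are made up for this illustration, and it assumes that the elements of interest are direct children of the root.

```python
import io
from xml.etree.ElementTree import iterparse


def iter_root_children(source):
    """Yield each completed direct child of the root, then drop it from the tree."""
    root = None
    for event, elem in iterparse(source, events=("start", "end")):
        if event == "start":
            # The first 'start' event belongs to the root element.
            if root is None:
                root = elem
        elif len(root) > 0 and elem is root[0]:
            yield elem   # let the caller process the finished child
            del root[0]  # then detach it so it can be garbage-collected


# Usage with a tiny in-memory document (hypothetical sample data).
sample = io.BytesIO(b"<root><a>1</a><b><c/></b><a>2</a></root>")
tags = [child.tag for child in iter_root_children(sample)]
```

Nested elements such as `<c/>` are not yielded on their own; they stay attached to their parent and are released together with it.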

Labels: docs (Documentation in the Doc dir)
