csv.Sniffer uses a heuristic to determine doublequote that is often false-negative #109339

Open
@loskutov

Bug report

Bug description:

The current heuristic is as follows:

dq_regexp = re.compile(...) # some regex that matches more than two quotation marks between delimiters

if dq_regexp.search(data):
    doublequote = True
else:
    doublequote = False

That is, if the sample contains no doubled quotation marks, the sniffer assumes the whole file contains none either. This assumption is often false: many CSV files have only a few quotation marks that need escaping, and a small sample is likely to include none of them. Moreover, all the built-in dialects (excel, excel-tab, and unix) have doublequote = True, so I believe it would make sense to fall back to the less restrictive True when unsure. The improved code might look roughly like:

bq_regexp = re.compile(...) # some regex that matches a backslash-followed-by-quotation-mark

if bq_regexp.search(data):
    # an attempt was made to escape a quotation mark with escapechar,
    # so the writer clearly was not relying on doublequotes
    doublequote = False
else:
    doublequote = True # don't know, but consider True a safe default
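
To make the proposal concrete, here is a minimal runnable sketch. The helper name guess_doublequote, its parameters, and the concrete regex (escapechar immediately followed by quotechar) are assumptions made for illustration; they are not taken from the csv module. The sketch also compares the result with the current Sniffer on a sample that happens to contain no doubled quotes, and shows why a wrong False matters for a later row.

```python
import csv
import re


def guess_doublequote(sample, quotechar='"', escapechar="\\"):
    """Hypothetical helper: return False only if the sample shows an escaped quote."""
    # Assumed stand-in for the elided bq_regexp: escapechar immediately
    # followed by quotechar, anywhere in the sample.
    bq_regexp = re.compile(re.escape(escapechar) + re.escape(quotechar))
    return bq_regexp.search(sample) is None


# A sample whose quoted fields happen to contain no quotation marks.
plain_sample = 'a,"b c",d\ne,"f g",h\n'
# A sample in which a quotation mark is escaped with a backslash.
escaped_sample = r'Powerful Engine,"20\" Wheels",Awful Price'

# What the current Sniffer reports for the plain sample.
current = csv.Sniffer().sniff(plain_sample).doublequote
# What the sketched heuristic reports for both samples.
proposed_plain = guess_doublequote(plain_sample)
proposed_escaped = guess_doublequote(escaped_sample)

# A later row that relies on doubled quotes is parsed differently
# depending on the doublequote setting.
later_row = ['x,"say ""hi""",y']
with_dq = list(csv.reader(later_row, doublequote=True))
without_dq = list(csv.reader(later_row, doublequote=False))

print(current, proposed_plain, proposed_escaped)
print(with_dq)
print(without_dq)
```

With doublequote=True the later row yields the field say "hi"; with doublequote=False the extra quotation marks end up in the field as literal characters, which is the kind of misparse a false-negative guess can cause.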

Also, the current heuristic can already yield false-positive results as well:

>>> import csv
>>> sample = r"""Powerful Engine,"20\" Wheels",Awful Price"""
>>> csv.Sniffer().sniff(sample).doublequote
True

In fact, nothing in this sample indicates that the writer was doubling quotation marks. This false positive is harmless, though: the sample can be parsed with either value of doublequote, yielding the same result.
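
The claim that either value gives the same result can be checked with a short sketch. It assumes the reader is told explicitly that the backslash is the escape character (escapechar="\\"); the helper name parse is made up for this example.

```python
import csv

sample = r'''Powerful Engine,"20\" Wheels",Awful Price'''


def parse(doublequote):
    # Assumption: the backslash is passed explicitly as escapechar.
    return list(csv.reader([sample], escapechar="\\", doublequote=doublequote))


rows_true = parse(True)
rows_false = parse(False)

print(rows_true)
print(rows_false)
```

Both calls produce the same three fields, because the only quotation mark inside the quoted field is consumed by the escape character before the doublequote setting comes into play.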

CPython versions tested on:

3.11

Operating systems tested on:

No response

Labels: stdlib (Python modules in the Lib dir), type-bug (An unexpected behavior, bug, or error)
