Bug report
Bug description:
The current heuristic that csv.Sniffer uses to guess doublequote is as follows:
dq_regexp = re.compile(...)  # some regex that matches more than two quotation marks between delimiters
if dq_regexp.search(data):
    doublequote = True
else:
    doublequote = False
That is, if the sample doesn't contain doubled quotation marks, the whole file is assumed not to contain any either. However, this assumption is likely to be false: many CSV files have only a few quotation marks that need escaping, and a small sample is likely to include none of them. Moreover, all the built-in dialects (excel, excel-tab, and unix) have doublequote = True, so I believe it would make sense to also fall back to the less restrictive True when unsure. The improved code might look roughly like:
bq_regexp = re.compile(...)  # some regex that matches a backslash-followed-by-quotation-mark
if bq_regexp.search(data):
    # an attempt was made to escape a quotation mark with escapechar,
    # so the writer clearly was not relying on doublequotes
    doublequote = False
else:
    doublequote = True  # don't know, but consider True a safe default
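To make the idea concrete, here is a minimal runnable sketch of that fallback. The function name guess_doublequote, its parameters, and the regex itself are assumptions made for illustration; this is not the code in the csv module, and the regex is only one possible way to fill in the elided pattern.

```python
import re


def guess_doublequote(sample, quotechar='"', escapechar="\\"):
    """Hypothetical helper: guess the doublequote setting for a CSV sample.

    Returns False only when the sample contains an escapechar immediately
    followed by a quotechar; otherwise falls back to True.
    """
    # Assumed pattern: escapechar directly followed by quotechar.
    escaped_quote = re.compile(re.escape(escapechar) + re.escape(quotechar))
    if escaped_quote.search(sample):
        # The writer escaped a quotation mark with escapechar,
        # so it was evidently not relying on doubled quotes.
        return False
    # No evidence either way: use the same default as the built-in dialects.
    return True


print(guess_doublequote(r'Powerful Engine,"20\" Wheels",Awful Price'))
print(guess_doublequote('a,"say ""hi""",b'))
print(guess_doublequote("a,b,c"))
```

One known limitation of this sketch: an escaped backslash directly before a closing quote would also match the pattern, so a real implementation would probably need a stricter regex.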
The current heuristic can also yield false positives:
>>> sample = r"""Powerful Engine,"20\" Wheels",Awful Price"""
>>> csv.Sniffer().sniff(sample).doublequote
True
In fact, nothing in this sample indicates that the writer was doubling quotation marks. However, the wrong guess doesn't hurt here: the sample can be parsed with either value of doublequote, yielding the same result.
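The claim can be checked with a short script. It assumes the reader is explicitly given escapechar="\\", which is what the backslash in the sample implies; the helper parse and the second input line are made up for illustration.

```python
import csv


def parse(line, doublequote):
    """Parse one CSV line with an explicit escapechar (an assumption here)."""
    reader = csv.reader(
        [line],
        delimiter=",",
        quotechar='"',
        escapechar="\\",
        doublequote=doublequote,
    )
    return next(reader)


# The sample from the report: a quote escaped with a backslash.
escaped = r'Powerful Engine,"20\" Wheels",Awful Price'
print(parse(escaped, True) == parse(escaped, False))

# A made-up line that really does use doubled quotes.
doubled = 'a,"say ""hi""",b'
print(parse(doubled, True) == parse(doubled, False))
```

The first comparison holds because the closing quote is followed directly by the delimiter, so both settings end the field at the same place. The second comparison fails: with doublequote=False the doubled quotes are not collapsed. This is the asymmetry behind preferring True as the fallback.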
CPython versions tested on:
3.11
Operating systems tested on:
No response