I have been doing lots of parsing of Apache log files recently with Python. Speed is of the essence, so I was interested to know if the speed issues mentioned in http://bugs.python.org/issue7471 are still relevant. So I set out to time it, under Python (v2.6.4 and v3.1.2, both on Fedora 13).
My test data consists of 950,000 lines in 5 gzipped Apache log files. I compared the times using the Python gzip module, and the zcat method suggested in http://bugs.python.org/issue7471.
A quick summary of results: yes, the issue is still relevant. The average times are:
For Python 2.6.4
- gzip.open – 9.5 seconds
- zcat and pipe – 2.3 seconds
For Python 3.1.2
- gzip.open – 11.6 seconds
- zcat and pipe – 2.7 seconds
Below are the details. First, the source code to time the tests:
import os
import sys
if sys.version.startswith("3"):
import io
io_method = io.BytesIO
else:
import cStringIO
io_method = cStringIO.StringIO
import gzip
import subprocess
import time
dirname = "test"
fl = os.listdir(dirname)
fl.sort()
ttime = [0, 0]
runs = 5
for i in range(2 * runs):
st = time.time()
for fn in fl:
if not fn.endswith(".gz"):
continue
cc = 0
lc = 0
sz = 0
fullfn = os.path.join(dirname, fn)
sz += os.stat(fullfn)[6]
if i % 2 == 0:
fh = gzip.GzipFile(fullfn, "r")
else:
p = subprocess.Popen(["zcat", fullfn], stdout = subprocess.PIPE)
fh = io_method(p.communicate()[0])
assert p.returncode == 0
for line in fh:
lc += 1
cc += len(line)
et = time.time()
dt = et - st
ttime[i % 2] += dt
print("time-taken = %0.2f seconds, 1000 characters per second = %0.0f, file size per second = %0.0f, character count=%s, line count=%s file size = %s" % (dt, 0.001 * cc / dt, 0.001 * sz / dt, cc, lc, sz))
print("\nAverages")
print(" gzip.open - %0.1f seconds" % (ttime[0] / runs))
print(" zcat and pipe - %0.1f seconds" % (ttime[1] / runs))
Timings for Python 2.6.4:
time-taken = 9.71 seconds, 1000 characters per second = 5237, file size per second = 480, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.28 seconds, 1000 characters per second = 22300, file size per second = 2044, character count=50859394, line count=194151 file size = 4661207
time-taken = 9.48 seconds, 1000 characters per second = 5363, file size per second = 492, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.24 seconds, 1000 characters per second = 22750, file size per second = 2085, character count=50859394, line count=194151 file size = 4661207
time-taken = 9.44 seconds, 1000 characters per second = 5389, file size per second = 494, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.30 seconds, 1000 characters per second = 22156, file size per second = 2031, character count=50859394, line count=194151 file size = 4661207
time-taken = 9.42 seconds, 1000 characters per second = 5397, file size per second = 495, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.26 seconds, 1000 characters per second = 22516, file size per second = 2064, character count=50859394, line count=194151 file size = 4661207
time-taken = 9.61 seconds, 1000 characters per second = 5293, file size per second = 485, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.31 seconds, 1000 characters per second = 22033, file size per second = 2019, character count=50859394, line count=194151 file size = 4661207
Averages
gzip.open - 9.5 seconds
zcat and pipe - 2.3 seconds
Timings for Python 3.1.2:
time-taken = 11.65 seconds, 1000 characters per second = 4364, file size per second = 400, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.87 seconds, 1000 characters per second = 17691, file size per second = 1621, character count=50859394, line count=194151 file size = 4661207
time-taken = 11.66 seconds, 1000 characters per second = 4362, file size per second = 400, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.56 seconds, 1000 characters per second = 19834, file size per second = 1818, character count=50859394, line count=194151 file size = 4661207
time-taken = 11.54 seconds, 1000 characters per second = 4408, file size per second = 404, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.64 seconds, 1000 characters per second = 19295, file size per second = 1768, character count=50859394, line count=194151 file size = 4661207
time-taken = 11.58 seconds, 1000 characters per second = 4393, file size per second = 403, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.59 seconds, 1000 characters per second = 19659, file size per second = 1802, character count=50859394, line count=194151 file size = 4661207
time-taken = 11.68 seconds, 1000 characters per second = 4354, file size per second = 399, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.97 seconds, 1000 characters per second = 17123, file size per second = 1569, character count=50859394, line count=194151 file size = 4661207
Averages
gzip.open - 11.6 seconds
zcat and pipe - 2.7 seconds