Python zipfile speedup tips

I have been working on a Django project that requires large zip files to be unzipped.

At first I was just Popen’ing unzip, but it's hard to track the progress of the extraction that way, especially with large files.

So I decided to use Python's zipfile module and override extractall with a progress callback.
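
A minimal sketch of what such an override might look like. The class name ProgressZipFile and the callback parameter (called with bytes done and bytes total) are hypothetical names picked for illustration, and counting uncompressed bytes is just one reasonable way to measure progress.

```python
from zipfile import ZipFile, ZipInfo


class ProgressZipFile(ZipFile):
    """ZipFile whose extractall() reports progress (hypothetical helper)."""

    def extractall(self, path=None, members=None, pwd=None, callback=None):
        # Default to every entry in the archive, as ZipFile.extractall does.
        if members is None:
            members = self.namelist()
        # Members may be given as names or as ZipInfo objects.
        infos = [m if isinstance(m, ZipInfo) else self.getinfo(m)
                 for m in members]
        total = sum(info.file_size for info in infos)
        done = 0
        for info in infos:
            self.extract(info, path, pwd)
            done += info.file_size
            if callback is not None:
                # callback(bytes_done, bytes_total) is an assumed signature.
                callback(done, total)
```

Calling ProgressZipFile(path).extractall(dest, callback=fn) would then invoke fn once per extracted entry.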

However, I was very disappointed with the performance: it was a major slowdown compared to the unzip binary.

Here is how unzip performed:
time unzip -q /mnt/files/test.zip

real 0m8.880s
user 0m1.560s
sys 0m0.570s

Just under 9 seconds, not bad.

This was my test script:

from zipfile import ZipFile
zf = ZipFile("/mnt/files/test.zip")
zf.extractall()

time python test.py

real 6m50.938s
user 0m2.990s
sys 0m1.010s

Nearly 7 minutes! What was going on? I scratched my head and tried a few different things.
Then I ran it under strace, and it all became clear.

If you pass a filename to ZipFile, it doesn't just open the file once in the constructor and reuse it. Oh no.

It actually saves the filename and, on each extract operation, opens the file and then closes it again, once for every file in the archive.

On a local filesystem this isn’t a big problem. On a remote CIFS filesystem, however, opening a file is a lot more expensive, hence the slowdown.

So an easy optimisation is to open the file yourself and pass ZipFile the open file object.

from zipfile import ZipFile
zf = ZipFile(open("/mnt/files/test.zip", "rb"))  # binary mode: a zip archive is not text
zf.extractall()

time python test.py

real 0m10.071s
user 0m2.550s
sys 0m0.690s

Bingo, only about 13% slower than unzip.
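
For reuse, the same idea can be wrapped in a small helper. This is only a sketch: the function name extract_with_single_open and its parameters are made up for illustration, and the with blocks are there so the archive is closed even if extraction fails.

```python
from zipfile import ZipFile


def extract_with_single_open(archive_path, dest_dir):
    """Extract archive_path into dest_dir through one open file object."""
    # Open the archive ourselves, in binary mode, and hand ZipFile the
    # file object instead of the filename.
    with open(archive_path, "rb") as fh:
        with ZipFile(fh) as zf:
            zf.extractall(dest_dir)
            # Return the entry names so the caller can see what was extracted.
            return zf.namelist()
```

For example, extract_with_single_open("/mnt/files/test.zip", "/tmp/out") would extract the test archive from above into /tmp/out (a placeholder destination).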

If you are using Python 2.6, an easy optimisation is to use zipfile.py from Python 2.7, which has many optimisations for large files in the archive.



Copyright © 2018 All Rights Reserved.

dmarkey.com