Archive for October, 2011

Python zipfile speedup tips

Saturday, October 15th, 2011

I have been working on a Django project that requires large zip files to be unzipped.

At first I was just Popen’ing unzip, but it’s hard to track the progress of extraction for large files.

So I decided to use Python’s zipfile module and override extractall with a progress callback.
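
For reference, such an override can look roughly like the following. This is a minimal sketch, not the code from the project: the ProgressZipFile name, the progress keyword and the demo() helper are made up for illustration, and it is written for Python 3.

```python
import io
import os
import tempfile
from zipfile import ZipFile


class ProgressZipFile(ZipFile):
    """ZipFile whose extractall() reports progress (hypothetical helper)."""

    def extractall(self, path=None, members=None, pwd=None, progress=None):
        # Default to every member in the archive, like the base class.
        if members is None:
            members = self.namelist()
        members = list(members)
        total = len(members)
        for done, member in enumerate(members, 1):
            self.extract(member, path, pwd)
            # Report "done out of total" after each extracted member.
            if progress is not None:
                progress(done, total)


def demo():
    """Extract a small in-memory archive and record the callback calls."""
    buf = io.BytesIO()
    with ZipFile(buf, "w") as zf:
        zf.writestr("a.txt", "alpha")
        zf.writestr("b.txt", "beta")
    buf.seek(0)

    seen = []
    target = tempfile.mkdtemp()
    with ProgressZipFile(buf) as zf:
        zf.extractall(target, progress=lambda done, total: seen.append((done, total)))
    return seen, sorted(os.listdir(target))
```

The callback here counts members; for archives with a few huge files, reporting bytes (summing file_size from infolist()) would probably give a smoother progress bar.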

However, I was very disappointed with the performance: a major slowdown compared to the unzip binary.

Here was unzip performance:
time unzip -q /mnt/files/test.zip

real 0m8.880s
user 0m1.560s
sys 0m0.570s

Just under 9 seconds, not bad.

This was my test script:

from zipfile import ZipFile
zf = ZipFile("/mnt/files/test.zip")
zf.extractall()

time python test.py

real 6m50.938s
user 0m2.990s
sys 0m1.010s

Nearly 7 minutes! What is going on? I scratched my head and tried different things.
Then I tried an strace, and it all became clear.

If you pass a filename to ZipFile, it doesn’t keep reusing one open file handle for extraction. Oh no.

It saves the filename and, on each extract operation, opens the file again and then closes it, once for each file in the archive.

Now, on a local filesystem, this isn’t a big problem. However, with a remote CIFS filesystem, opening a file is a lot more expensive, hence the slowdown.

So, an easy optimisation is to open the file yourself and pass ZipFile the open file object.

from zipfile import ZipFile
zf = ZipFile(open("/mnt/files/test.zip", "rb"))
zf.extractall()

time python test.py

real 0m10.071s
user 0m2.550s
sys 0m0.690s

Bingo, only about 13% slower than unzip.

If you are using Python 2.6, an easy optimisation is to use zipfile.py from Python 2.7; it has many optimisations for large files in the archive.

SafeZipFile module

Tuesday, October 11th, 2011

This checks each extracted file for a leading slash or “..” in the path, and makes sure it doesn’t exceed a file size limit that you set in the constructor.

from zipfile import ZipFile, ZipInfo
import os

class NotSafeFileException(Exception):
    pass


class SafeZipFile(ZipFile):
    def __init__(self, *args, **kwargs):
        self.max_size = kwargs.pop('max_size', None)
        ZipFile.__init__(self, *args, **kwargs)

    def extract(self, member, path=None, pwd=None):
        if not isinstance(member, ZipInfo):
            member = self.getinfo(member)
        if path is None:
            path = os.getcwd()
        self.safety_check(member)
        return self._extract_member(member, path, pwd)

    def safety_check(self, zipinfo):
        """Make sure that the file/dir:
            * Doesn't start with a slash in the path
            * Doesn't have ".." in the path
            * If max_size is passed, make sure the file isn't bigger than that threshold
        """
        # Reject absolute paths, parent-directory references and oversized members.
        if zipinfo.filename.startswith("/"): raise NotSafeFileException("%s starts with a slash" % zipinfo.filename)
        if ".." in zipinfo.filename: raise NotSafeFileException("%s contains '..'" % zipinfo.filename)
        if self.max_size and self.max_size < zipinfo.file_size: raise NotSafeFileException("%s is too big" % zipinfo.filename)
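
The checks can be tried out without touching the disk. The sketch below is a stand-alone Python 3 version of the same three rules applied to an in-memory archive; check_member, rejected_members and build_sample are hypothetical helpers written for this illustration, not part of the module above.

```python
import io
from zipfile import ZipFile


class NotSafeFileException(Exception):
    pass


def check_member(info, max_size=None):
    """Apply the three rules to a single ZipInfo (stand-alone sketch)."""
    if info.filename.startswith("/"):
        raise NotSafeFileException("%s starts with a slash" % info.filename)
    if ".." in info.filename:
        raise NotSafeFileException("%s contains '..'" % info.filename)
    if max_size and max_size < info.file_size:
        raise NotSafeFileException("%s is too big" % info.filename)


def rejected_members(data, max_size=None):
    """Return the member names of a zip (given as bytes) that fail a check."""
    bad = []
    with ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            try:
                check_member(info, max_size)
            except NotSafeFileException:
                bad.append(info.filename)
    return bad


def build_sample():
    """Build a small archive with one good, one escaping and one large member."""
    buf = io.BytesIO()
    with ZipFile(buf, "w") as zf:
        zf.writestr("ok.txt", "fine")
        zf.writestr("../escape.txt", "nope")
        zf.writestr("big.bin", "x" * 100)
    return buf.getvalue()
```

With max_size=50 this sample should flag both ../escape.txt and big.bin. Note that the plain substring test also rejects harmless names such as notes..txt; that is the price of keeping the check this simple.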

SafeTarFile module

Tuesday, October 11th, 2011

This module is a drop-in replacement for TarFile that makes sure each file in a tarfile is safe according to the following criteria:

* Doesn’t start with a slash in the path
* Doesn’t have “..” in the path
* Is either a normal file or a directory (no fifos, symlinks)
* If max_size is passed, isn’t bigger than that threshold

Tested on 2.7

from tarfile import TarFile

class NotSafeFileException(Exception):
    pass

class SafeTarFile(TarFile):
    def __init__(self, *args, **kwargs):
        self.max_size = kwargs.pop('max_size', None)
        super(SafeTarFile,self).__init__(*args, **kwargs)

    def safety_check(self, tarinfo, max_size=None):
        """Make sure that the file/dir:
            * Doesn't start with a slash in the path
            * Doesn't have ".." in the path
            * Is either a normal file or a directory (no fifos, symlinks)
            * If max_size is passed, make sure the file isn't bigger than that threshold
        """

        if tarinfo.path.startswith("/"): raise NotSafeFileException("%s starts with a slash" % tarinfo.path)
        if ".." in tarinfo.path: raise NotSafeFileException("%s contains '..'" % tarinfo.path)
        if not tarinfo.isfile() and not tarinfo.isdir(): raise NotSafeFileException("%s is a strange filetype" % tarinfo.name)
        if self.max_size and self.max_size < tarinfo.size: raise NotSafeFileException("%s is too big" % tarinfo.name)

    def extract(self, member, path=""):
        self._check("r")

        if isinstance(member, basestring):
            tarinfo = self.getmember(member)
        else:
            tarinfo = member
        self.safety_check(tarinfo)
        super(SafeTarFile, self).extract(tarinfo, path)
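
To see the four rules in action, here is a small stand-alone Python 3 sketch that applies them to an in-memory tarball containing a regular file and a symlink. The check_tarinfo, rejected_members and build_sample names are hypothetical helpers for this illustration only; the class above targets Python 2.7.

```python
import io
import tarfile


class NotSafeFileException(Exception):
    pass


def check_tarinfo(tarinfo, max_size=None):
    """Apply the four rules to a single TarInfo (stand-alone sketch)."""
    if tarinfo.name.startswith("/"):
        raise NotSafeFileException("%s starts with a slash" % tarinfo.name)
    if ".." in tarinfo.name:
        raise NotSafeFileException("%s contains '..'" % tarinfo.name)
    if not tarinfo.isfile() and not tarinfo.isdir():
        raise NotSafeFileException("%s is a strange filetype" % tarinfo.name)
    if max_size and max_size < tarinfo.size:
        raise NotSafeFileException("%s is too big" % tarinfo.name)


def rejected_members(data, max_size=None):
    """Return the member names of a tarball (given as bytes) that fail a check."""
    bad = []
    with tarfile.open(fileobj=io.BytesIO(data), mode="r") as tf:
        for tarinfo in tf.getmembers():
            try:
                check_tarinfo(tarinfo, max_size)
            except NotSafeFileException:
                bad.append(tarinfo.name)
    return bad


def build_sample():
    """Build a tarball with one regular file and one symlink."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        payload = b"fine"
        regular = tarfile.TarInfo("ok.txt")
        regular.size = len(payload)
        tf.addfile(regular, io.BytesIO(payload))

        link = tarfile.TarInfo("link")
        link.type = tarfile.SYMTYPE
        link.linkname = "/etc/passwd"
        tf.addfile(link)
    return buf.getvalue()
```

Without a size limit only the symlink should be rejected; with a very small max_size the regular file is flagged as well.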


Copyright © 2018 All Rights Reserved.

dmarkey.com