BOBOBK

Python zipfile Module Instantiation and Parsing


Introduction

The zipfile module in Python is used for compressing and decompressing files in the ZIP format. As ZIP is a very common format, this module is used quite frequently.

Here, I’ll document some usage methods for zipfile, which will be convenient for both myself and others.

To work with an archive, you first instantiate a ZipFile object. Its required first argument is a string naming the compressed archive. The optional second argument is the open mode, analogous to ordinary file operations: 'r' for read, 'w' for write, and 'a' for append. The default is 'r', read mode.
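As a minimal sketch of the three modes, assuming Python 3 and a throwaway directory (the archive name demo.zip and the member names are hypothetical), the following creates an archive, appends to it, and reads it back:

```python
import os
import tempfile
import zipfile

# Work in a temporary directory so the sketch does not touch real files.
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "demo.zip")  # hypothetical archive name

# 'w' creates the archive (or overwrites an existing one).
with zipfile.ZipFile(archive, "w") as z:
    z.writestr("a.txt", "first file")

# 'a' appends a member to the existing archive.
with zipfile.ZipFile(archive, "a") as z:
    z.writestr("b.txt", "second file")

# 'r' is the default mode, so it can be omitted.
with zipfile.ZipFile(archive) as z:
    names = z.namelist()

print(names)
```

Using ZipFile as a context manager closes the archive automatically, which matters in 'w' and 'a' modes because the archive's index is only written on close.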

There are two very important classes in zipfile: ZipFile and ZipInfo. In most cases, we only need to use these two classes. ZipFile is the main class used to create and read ZIP files, while ZipInfo stores information about each file within the ZIP archive.


I. Basic Operations of These Two Classes

For example, to read a ZIP archive, let’s assume filename is the path to the archive:

import zipfile
z = zipfile.ZipFile(filename, 'r')
# The second parameter 'r' means reading a zip file, 'w' means creating a zip file
for f in z.namelist():
    print(f)

The code above reads the names of all files in a ZIP archive. z.namelist() will return a list of all filenames within the archive.

Let’s look at another example:

import zipfile
z = zipfile.ZipFile(filename, 'r')
for i in z.infolist():
    print(i.file_size, i.header_offset)

Here, z.infolist() is used, which returns information about all files within the compressed archive as a list of ZipInfo objects. A ZipInfo object describes one file inside the archive. Its commonly used attributes are filename, file_size, and header_offset: the file's name, its uncompressed size, and the byte offset of its file header within the archive, respectively. In fact, the z.namelist() call shown earlier simply collects the filename of each ZipInfo object into a list.
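The relationship between namelist() and infolist() can be checked with a small in-memory archive. This is only a sketch, assuming Python 3; the member names are made up for illustration:

```python
import io
import zipfile

# Build a small archive in memory (hypothetical member names).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("one.txt", "1")
    z.writestr("dir/two.txt", "22")

with zipfile.ZipFile(buf) as z:
    infos = z.infolist()
    # Collect the filename attribute of every ZipInfo object.
    names_from_infos = [info.filename for info in infos]
    # Map each member name to its uncompressed size in bytes.
    sizes = {info.filename: info.file_size for info in infos}
    same = names_from_infos == z.namelist()

print(same, sizes)
```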

To extract a file from a compressed archive, use the read method of ZipFile:

import zipfile
z = zipfile.ZipFile(filename, 'r')
print(z.read(z.namelist()[0]))

This reads the first file in z.namelist() and prints it to the screen. Of course, you can also save it to a file. Below is the method for creating a ZIP archive, which is quite similar to the reading method:

import zipfile, os
z = zipfile.ZipFile(filename, 'w')
# Note that the second parameter here is 'w', and filename is the name of the compressed archive

Note that the second parameter here is w, and filename is the name of the compressed archive. Suppose testdir holds the path of a directory whose files you want to add to the archive (only its top-level entries are added here; subdirectories are not traversed):

if os.path.isdir(testdir):
    for d in os.listdir(testdir):
        z.write(testdir+os.sep+d)
# Close the archive once, after all entries have been added
z.close()

The code above is very simple. Now, consider a problem: if I add test/111.txt to the compressed archive, but I want it to be placed at test22/111.txt inside the archive, what should I do? This is where the second parameter of ZipFile’s write method comes into play. You just need to call it like this:

z.write("test/111.txt", "test22/111.txt")
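To see the effect of the second argument, here is a small self-contained sketch, assuming Python 3. It creates a stand-in for 111.txt in a temporary directory (the paths are hypothetical) and confirms the name stored in the archive:

```python
import os
import tempfile
import zipfile

workdir = tempfile.mkdtemp()

# Create a stand-in source file on disk.
src = os.path.join(workdir, "111.txt")
with open(src, "w") as fh:
    fh.write("hello")

archive = os.path.join(workdir, "demo.zip")
with zipfile.ZipFile(archive, "w") as z:
    # The second argument (arcname) is the name used inside the archive.
    z.write(src, "test22/111.txt")

with zipfile.ZipFile(archive) as z:
    stored_names = z.namelist()

print(stored_names)
```

Without the second argument, the member name is derived from the path on disk, which is rarely what you want for absolute or temporary paths.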

II. Basic Operations of ZipFile and ZipInfo Classes

1. class zipfile.ZipFile(file[, mode[, compression[, allowZip64]]])

Creates a ZipFile object, representing a ZIP file. The file parameter indicates the path of the file or a file-like object; the mode parameter specifies the mode for opening the ZIP file.

The default value is 'r', which means reading an existing ZIP file. It can also be 'w' or 'a'. 'w' means creating a new ZIP document or overwriting an existing one.

import zipfile
f = zipfile.ZipFile(filename, 'r') # The second parameter 'r' means reading a zip file, 'w' creates one, 'a' appends to one

for f_name in f.namelist(): # f.namelist() returns a list of all filenames within the archive.
    print(f_name)
# The code above reads the names of all files in a zip archive.

'a' means appending data to an existing ZIP document. The compression parameter indicates the compression method used when writing the ZIP document, and its value can be zipfile.ZIP_STORED or zipfile.ZIP_DEFLATED (Python 3.3 and later also accept zipfile.ZIP_BZIP2 and zipfile.ZIP_LZMA). If the ZIP file to be operated on exceeds 2GB, allowZip64 should be set to True; since Python 3.4 this is already the default.
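The difference between the two compression methods is easy to measure. The sketch below, assuming Python 3 with zlib available, writes the same repetitive payload with each method and compares the compressed sizes; the helper compressed_size and the member name are hypothetical:

```python
import io
import zipfile

payload = b"zipfile " * 1000  # highly repetitive, so it compresses well

def compressed_size(method):
    """Write payload with the given method and return its compressed size."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", method) as z:
        z.writestr("payload.bin", payload)
    with zipfile.ZipFile(buf) as z:
        return z.getinfo("payload.bin").compress_size

stored = compressed_size(zipfile.ZIP_STORED)
deflated = compressed_size(zipfile.ZIP_DEFLATED)
print(stored, deflated)
```

With ZIP_STORED the compressed size equals the original size, while ZIP_DEFLATED should yield a much smaller number for data like this.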

ZipFile also provides the following commonly used methods and properties:

ZipFile.getinfo(name) Retrieves information about a specified file within the ZIP document. Returns a zipfile.ZipInfo object, which includes detailed file information.
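One detail worth knowing is that getinfo() raises KeyError when the name is not in the archive. A minimal sketch, assuming Python 3 (the helper size_or_none and the member names are hypothetical):

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("present.txt", "data")

def size_or_none(zf, name):
    """Return the uncompressed size of a member, or None if it is absent."""
    try:
        return zf.getinfo(name).file_size
    except KeyError:
        # getinfo() raises KeyError for names that are not in the archive.
        return None

with zipfile.ZipFile(buf) as z:
    found = size_or_none(z, "present.txt")
    missing = size_or_none(z, "absent.txt")

print(found, missing)
```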

ZipFile.infolist() Retrieves information about all files within the ZIP document, returning a list of zipfile.ZipInfo objects.

ZipFile.namelist() Retrieves a list of all file names within the ZIP document.

ZipFile.extract(member[, path[, pwd]]) Extracts the specified file from the ZIP document, by default to the current directory. The member parameter specifies the name of the file to be extracted or its corresponding ZipInfo object; the path parameter specifies the folder where the extracted file will be saved; pwd is the decompression password. The following example extracts all files from duoduo.zip, located in the current working directory, to the D:/Work directory:

import zipfile, os
f = zipfile.ZipFile(os.path.join(os.getcwd(), 'duoduo.zip')) # Concatenate to form a path
for file in f.namelist():
    f.extract(file, r'd:/Work') # Extract files to d:/Work
f.close()


ZipFile.extractall([path[, members[, pwd]]]) Extracts all files from the ZIP document to the current directory. The default value for the members parameter is a list of all file names within the ZIP document, but you can also set it yourself to select specific files to extract.
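As a sketch of the members parameter, assuming Python 3, the following builds a small archive in memory and extracts only the .txt members into a temporary directory (all names are made up for illustration):

```python
import io
import os
import tempfile
import zipfile

# Build an archive with three hypothetical members.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    for name in ("a.txt", "b.txt", "c.log"):
        z.writestr(name, name)

target = tempfile.mkdtemp()
with zipfile.ZipFile(buf) as z:
    # Select a subset of namelist() and pass it as members.
    wanted = [n for n in z.namelist() if n.endswith(".txt")]
    z.extractall(target, members=wanted)

extracted = sorted(os.listdir(target))
print(extracted)
```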

ZipFile.printdir() Prints information about the ZIP document to the console.

ZipFile.setpassword(pwd) Sets the default password used to extract encrypted files from the ZIP document.
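On Python 3 the password must be a bytes object, not a str. The sketch below illustrates this with an unencrypted in-memory archive, since the zipfile module itself cannot create encrypted archives; the password value is just an example:

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("a.txt", "not encrypted")

with zipfile.ZipFile(buf) as z:
    try:
        z.setpassword("123456")   # a str is rejected on Python 3
        rejected = False
    except TypeError:
        rejected = True
    z.setpassword(b"123456")      # bytes are accepted
    # The default password is simply unused for unencrypted members.
    data = z.read("a.txt")

print(rejected, data)
```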

ZipFile.read(name[, pwd]) Retrieves the binary data of the specified file within the ZIP document. The following example demonstrates the use of read(). The ZIP document contains a text file duoduo.txt. read() is used to read its binary data, which is then saved to D:/duoduo.txt.

import zipfile, os
zipFile = zipfile.ZipFile(os.path.join(os.getcwd(), 'duoduo.zip'))
data = zipFile.read('duoduo.txt')
# (lambda f, d: (f.write(d), f.close()))(open(r'd:/duoduo.txt', 'wb'), data) # One-line statement to write the file. Think about it! ~_~
# Write the bytes in one call; iterating over bytes yields integers in Python 3
with open(r'd:/duoduo.txt','wb') as f:
    f.write(data)
zipFile.close()

ZipFile.write(filename[, arcname[, compress_type]]) Adds a specified file to the ZIP document. filename is the file path, arcname is the name to be saved within the ZIP document, and the compress_type parameter indicates the compression method, whose value can be zipfile.ZIP_STORED or zipfile.ZIP_DEFLATED. The following example demonstrates how to create a ZIP document and add the file D:/test.doc to the compressed document.

import zipfile, os
zipFile = zipfile.ZipFile(r'D:/test.zip', 'w')
zipFile.write(r'D:/test.doc', 'name_to_save', zipfile.ZIP_DEFLATED)
zipFile.close()

ZipFile.writestr(zinfo_or_arcname, bytes) writestr() supports directly writing binary data to the compressed document.
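A short sketch of writestr(), assuming Python 3: the first argument can be either a member name or a ZipInfo object, and the data can be a str (encoded as UTF-8) or bytes. The member names and the timestamp are hypothetical:

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    # A plain archive name with str data.
    z.writestr("notes/hello.txt", "hello from a str")
    # A plain archive name with raw bytes.
    z.writestr("raw.bin", b"\x00\x01\x02")
    # A ZipInfo object lets you control metadata such as the timestamp.
    info = zipfile.ZipInfo("dated.txt", date_time=(2020, 1, 2, 3, 4, 6))
    z.writestr(info, "entry with explicit metadata")

with zipfile.ZipFile(buf) as z:
    members = z.namelist()
    raw = z.read("raw.bin")
    stamp = z.getinfo("dated.txt").date_time

print(members, raw, stamp)
```

This is convenient when the content is generated in memory and never exists as a file on disk.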

2. Class ZipInfo

  • The ZipFile.getinfo(name) method returns a ZipInfo object, representing information about the corresponding file in the ZIP document. It supports the following attributes:

  • ZipInfo.filename: Get file name.

  • ZipInfo.date_time: Get last modification time of the file. Returns a tuple containing 6 elements: (year, month, day, hour, minute, second).

  • ZipInfo.compress_type: Compression type.

  • ZipInfo.comment: Comment for the individual archive member.

  • ZipInfo.extra: Extra field data.

  • ZipInfo.create_system: Get the system that created this ZIP document.

  • ZipInfo.create_version: Get the PKZIP version that created the ZIP document.

  • ZipInfo.extract_version: Get the PKZIP version required to extract the ZIP document.

  • ZipInfo.reserved: Reserved field, current implementation always returns 0.

  • ZipInfo.flag_bits: ZIP flag bits.

  • ZipInfo.volume: Volume number of the file header.

  • ZipInfo.internal_attr: Internal attributes.

  • ZipInfo.external_attr: External attributes.

  • ZipInfo.header_offset: File header offset.

  • ZipInfo.CRC: CRC-32 of the uncompressed file.

  • ZipInfo.compress_size: Get compressed size.

  • ZipInfo.file_size: Get uncompressed file size.

The following simple example illustrates the meaning of these attributes:

import zipfile, os
zipFile = zipfile.ZipFile(os.path.join(os.getcwd(), 'duoduo.zip'))
zipInfo = zipFile.getinfo('file_in_archive.txt')
print ('filename:', zipInfo.filename) # Get file name
print ('date_time:', zipInfo.date_time) # Get last modification time of the file. Returns a tuple containing 6 elements: (year, month, day, hour, minute, second)
print ('compress_type:', zipInfo.compress_type) # Compression type
print ('comment:', zipInfo.comment) # Document comment
print ('extra:', zipInfo.extra) # Extra field data
print ('create_system:', zipInfo.create_system) # Get the system that created this ZIP document.
print ('create_version:', zipInfo.create_version) # Get the PKZIP version that created the ZIP document.
print ('extract_version:', zipInfo.extract_version) # Get the PKZIP version required to extract the ZIP document.
print ('reserved:', zipInfo.reserved) # Reserved field, current implementation always returns 0.
print ('flag_bits:', zipInfo.flag_bits) # ZIP flag bits.
print ('volume:', zipInfo.volume) # Volume label of the file header.
print ('internal_attr:', zipInfo.internal_attr) # Internal attributes.
print ('external_attr:', zipInfo.external_attr) # External attributes.
print ('header_offset:', zipInfo.header_offset) # File header offset.
print ('CRC:', zipInfo.CRC) # CRC-32 of the uncompressed file.
print ('compress_size:', zipInfo.compress_size) # Get compressed size.
print ('file_size:', zipInfo.file_size) # Get uncompressed file size.
zipFile.close()
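Two of these attributes are often post-processed: date_time can be turned into a datetime object, and compress_size together with file_size gives the compression ratio. A minimal sketch, assuming Python 3 with zlib available (the member name and timestamp are hypothetical):

```python
import datetime
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    info = zipfile.ZipInfo("log.txt", date_time=(2021, 5, 6, 7, 8, 10))
    # When a ZipInfo is passed to writestr, its own compress_type is used.
    info.compress_type = zipfile.ZIP_DEFLATED
    z.writestr(info, "line\n" * 500)

with zipfile.ZipFile(buf) as z:
    zi = z.getinfo("log.txt")
    # Unpack the 6-element tuple directly into datetime.
    modified = datetime.datetime(*zi.date_time)
    # Fraction of the original size that the compressed data occupies.
    ratio = zi.compress_size / zi.file_size

print(modified, zi.file_size, ratio)
```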

III. Python Example: Using In-Memory Zipfile Objects to Package Files

import io
import zipfile

class InMemoryZip(object):
    def __init__(self):
        # BytesIO holds the archive bytes in memory (Python 3 replacement for StringIO.StringIO)
        self.in_memory_zip = io.BytesIO()

    def append(self, filename_in_zip, file_contents):
        # Get a handle to the in-memory zip in append mode;
        # leaving the with block closes it and writes the archive index
        with zipfile.ZipFile(self.in_memory_zip, "a", zipfile.ZIP_DEFLATED, False) as zf:
            # Write the file to the in-memory zip
            zf.writestr(filename_in_zip, file_contents)
            # Mark the files as having been created on Windows so that
            # Unix permissions are not inferred as 0000
            for zfile in zf.filelist:
                zfile.create_system = 0
        return self

    def read(self):
        self.in_memory_zip.seek(0)
        return self.in_memory_zip.read()

    def writetofile(self, filename):
        # The archive is binary data, so open the target in 'wb' mode
        with open(filename, "wb") as f:
            f.write(self.read())

if __name__ == "__main__":
    # Run a test
    imz = InMemoryZip()
    imz.append("test.txt", "Another test").append("test2.txt", "Still another")
    imz.writetofile("test.zip")

Python Reading Zip Files

The following code demonstrates how to read a ZIP file using Python, print all files within the compressed archive, and read the first file from the compressed archive.

import zipfile
z = zipfile.ZipFile("zipfile.zip", "r")
# Print list of files in the zip file
for filename in z.namelist():
    print('File:', filename)
# Read the first file in the zip file
first_file_name = z.namelist()[0]
content = z.read(first_file_name)
print(first_file_name)
print(content)

Python Writing/Creating Zip Files

Python primarily uses the write function of ZipFile to write ZIP files.

import zipfile
z = zipfile.ZipFile('test.zip', 'w', zipfile.ZIP_DEFLATED)
z.write('test.html')
z.close()

When creating a ZipFile instance, there are 2 points to note:

  1. Use 'w' or 'a' mode to open the ZIP file in a writable manner.
  2. Compression modes are ZIP_STORED and ZIP_DEFLATED. ZIP_STORED is merely a storage mode and does not compress files (this is the default value). If you need to compress files, you must use ZIP_DEFLATED mode.

IV. Python Method for Cracking Encrypted Zip Files

First, let’s create a file on the desktop. We create a text file named q.txt, then compress it, remembering to set a password during compression. I’ll set the password to 123456. Using Python’s zipfile module, we’ll write a password cracker for this ZIP file. The key piece is the extractall method of the ZipFile class, which takes an optional password parameter.

After importing the library, instantiate a ZipFile object with the filename of the password-protected ZIP file. To decompress this ZIP file, we use the extractall method and provide the password in the optional pwd parameter. Create a .py file in the project’s root directory, then place our compressed file in the same directory.

Our .py file code:

import zipfile
zipFile = zipfile.ZipFile("q.zip","r") # This is our compressed file
zipFile.extractall(pwd="123456") # This is our password

This code essentially tries to decompress our compressed file with the given password. Most online tutorials write it this way, but when I use Python 3.6, I encounter an error during execution:

The error roughly means that the pwd parameter expects a bytes type, but it received a str type, so it’s a type mismatch. Let’s convert the password to bytes type. Our .py file code will be as follows:

import zipfile
zipFile = zipfile.ZipFile("q.zip","r")
password = '123456'
zipFile.extractall(pwd=str.encode(password))

Now, let’s run the project again.

This time, there’s no error.

We can see that a new file, the one we compressed earlier, has appeared in our project’s root directory.


Next, let’s continue to refactor. What happens if we execute this script with an incorrect password? Let’s add some error handling code to the script to display the error message.

import zipfile
zipFile = zipfile.ZipFile("q.zip","r")
try:
    password = '123s456' # Incorrect password
    zipFile.extractall(pwd=str.encode(password))
except Exception as ex:
    print(ex)

Now, let’s look at our .py file code, and we’ll deliberately write an incorrect password to test it and see the execution result.

Here, we can see the error message, which tells us the password is incorrect.

We can use the exception thrown due to an incorrect password to test whether our dictionary file (the zidian.txt that follows) contains the ZIP file’s password. After instantiating a ZipFile object, we open the dictionary file, iterate over it, and test each word in the dictionary. If the extractall() function executes without error, we print a message with the correct password. If extractall() throws a password error exception instead, we ignore the exception and continue with the next password in the dictionary.

First, let’s create a zidian.txt file.

Next, we’ll write our password dictionary in the zidian.txt file, one password per line. Make sure the correct password is among them.

Then, place our password dictionary into the project.

Next, we’ll continue to modify our script.

import zipfile
zipFile = zipfile.ZipFile("q.zip","r")
# Open our dictionary file
passFile = open('zidian.txt')
for line in passFile.readlines():
    # Read each line of data (each password)
    password = line.strip('\n')
    try:
        zipFile.extractall(pwd=str.encode(password))
        print('=========Password is:'+password+'\n')
        # If the password is correct, exit the program
        exit(0)
    except Exception as ex:
        # Skip
        pass

Next, let’s look at the execution result.

Haha, we have successfully cracked the ZIP file password! From here, it’s easy to see that as long as our dictionary contains the password, we can crack it.

We continue to optimize our project:

import zipfile
def extractFile(zFile,password):
    try:
        zFile.extractall(pwd=str.encode(password))
        # If successful, return the password
        return password
    except Exception:
        return
def main():
    zFile = zipfile.ZipFile("q.zip","r")
    # Open our dictionary file
    passFile = open('zidian.txt')
    for line in passFile.readlines():
        # Read each line of data (each password)
        password = line.strip('\n')
        guess = extractFile(zFile,password)
        if (guess):
            print("=========Password is:"+password+"\n")
            exit(0)
if __name__=='__main__':
    main()

This is much better! Next, I’ll provide code to generate all six-digit numeric passwords:

f = open('zidian.txt','w')
for id in range(1000000):
    password = str(id).zfill(6)+'\n'
    f.write(password)
f.close()

After successful execution, we can see that our zidian.txt has been generated with numbers from 000000 to 999999. This means we can now crack any 6-digit numeric password for ZIP files!
