
Recursively Download Files with Python


Recently, I wanted to back up a website, but PHP imposes file size limits on downloads, and I was too lazy to set up FTP to pull the files. So I temporarily created a subdomain site and then used Python 3 with the requests library to download all files and folders under the site's root directory, which served as the backup.

1. Install the requests library

pip install requests
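
If the install succeeded, the library should import cleanly; a quick sanity check (the printed version will of course vary):

import requests
print(requests.__version__)  # e.g. 2.31.0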

2. Download all files and folders under a directory

The main work here is handling folders: we check whether a link points to a folder, and if so, we create the matching local folder and recurse into it; otherwise the link is a file, and we download it directly with requests.get. Without further ado, here is the code:

import requests
import re
import os
import sys

def help(script):  # Print usage: source URL, then optional local destination directory
    text = 'python3 %s https://www.bobobk.com ./' % script
    print(text)

def get_file(url, path):  # File download function
    content = requests.get(url, stream=True)  # stream so large files are not read into memory at once
    print("write %s in %s" % (url, path))
    with open(path + url.split("/")[-1], 'wb') as filew:
        for chunk in content.iter_content(chunk_size=512 * 1024):
            if chunk:  # filter out keep-alive new chunks
                filew.write(chunk)

def get_dir(url, path):  # Folder handling logic
    content = requests.get(url).text
    if "<title>Index of" in content:
        sub_url = re.findall('href="(.*?)"', content)
        print(sub_url)
        for i in sub_url:
            if not i or i.startswith("?") or i.startswith("/"):
                continue  # skip the index page's sort links and absolute links
            if "/" in i:  # a trailing slash marks a subdirectory
                i = i.split("/")[0]
                print(i)
                if i != "." and i != "..":
                    if not os.path.exists(path + i):
                        os.mkdir(path + i)
                    print("url: " + url + "/" + i + "\nurl_path: " + path + i + "/")
                    get_dir(url + "/" + i, path + i + "/")
            else:
                get_file(url + "/" + i, path)
    else:
        get_file(url, path)

if __name__ == '__main__':
    if len(sys.argv) <= 1:
        help(sys.argv[0])
        sys.exit(0)
    dest = sys.argv[2] if len(sys.argv) > 2 else "./"
    if not dest.endswith("/"):
        dest += "/"
    get_dir(sys.argv[1].rstrip("/"), dest)
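
Assuming the script is saved as backup.py (the filename is arbitrary), a run looks like this; the second argument, the destination directory, is optional and defaults to the current directory:

python3 backup.py https://www.bobobk.com ./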

At this point, the entire directory structure and files of the original website have been fully downloaded and restored locally.
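
To double-check the result, a quick walk over the local copy lists everything that was written (a minimal sketch, assuming the download ran into the current directory):

import os

# Walk the downloaded tree and print every file path
for root, dirs, files in os.walk("."):
    for name in files:
        print(os.path.join(root, name))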
