BOBOBK

Using Docker to Render JavaScript Web Pages in a Browser from the Command Line

When crawling with Scrapy, you will find that many websites render their content with JavaScript, so fetching the raw page source does not return the content you need. Driving a browser with Selenium solves this, because the browser runs the scripts before you read the page. The drawback is that it requires a browser installed locally, and that browser must be run as a non-root user. Running Chrome as a service in Docker and driving it with Selenium to get the rendered content avoids both constraints.

Running the Chrome Docker Container

A search of Docker Hub turns up the selenium/standalone-chrome image. If Docker is already installed locally, you can start it with the service published on host port 14444. For security, the port is bound to 127.0.0.1 so that only the local machine can reach it.

docker run -itd --name=chrome -p 127.0.0.1:14444:4444 --shm-size="2g" selenium/standalone-chrome

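The parameters are simple: -d runs the container in the background, -p maps local port 14444 to port 4444 inside the container, and --shm-size sets the shared memory size (Chrome can crash when shared memory is too small).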
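The container needs a few seconds before it accepts sessions, so a script that connects immediately after docker run may fail. The following is a minimal sketch of a readiness check, not part of the original setup. It assumes the server answers the standard WebDriver status request at http://127.0.0.1:14444/status with a JSON body of the form {"value": {"ready": true, ...}}. The names STATUS_URL, is_ready, fetch_status and wait_until_ready are hypothetical helpers introduced here for illustration.

```python
import json
import time
import urllib.request

# Assumption: the host port published by the docker run command above.
STATUS_URL = "http://127.0.0.1:14444/status"


def is_ready(payload):
    """Return True if a WebDriver status payload reports that it is ready."""
    if not isinstance(payload, dict):
        return False
    value = payload.get("value")
    return isinstance(value, dict) and value.get("ready") is True


def fetch_status(url=STATUS_URL, timeout=2.0):
    """Fetch the status JSON; return {} while the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response)
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts,
        # ValueError covers a body that is not valid JSON.
        return {}


def wait_until_ready(fetch=fetch_status, attempts=30, delay=1.0):
    """Poll the status endpoint until it reports ready or attempts run out."""
    for _ in range(attempts):
        if is_ready(fetch()):
            return True
        time.sleep(delay)
    return False
```

Calling wait_until_ready() before creating the driver returns True once the service reports ready, and False if it never does within roughly 30 seconds under the default settings.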

Using Selenium to Call the Remote Service for Web Crawling

Selenium’s webdriver module provides a Remote class that takes the address of a remote WebDriver server.

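from selenium import webdriver
from scrapy.selector import Selector

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # example option; other Chrome flags can be added the same way

# Connect to the Chrome container through the locally mapped port.
driver = webdriver.Remote(command_executor="http://127.0.0.1:14444/wd/hub", options=options)
try:
    driver.get("https://www.bobobk.com")
    # page_source is the page as the browser currently has it, after scripts have run.
    html = driver.page_source
finally:
    driver.quit()  # always end the session so the container is free for the next one

# Parse the rendered HTML with Scrapy's Selector and pull out the article links.
hrefs = Selector(text=html).xpath("//article/header/h1/a/@href").extract()
for url in hrefs:
    print(url)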
# Example output:
# https://www.bobobk.com/833.html
# https://www.bobobk.com/621.html
# https://www.bobobk.com/852.html
# https://www.bobobk.com/731.html
# https://www.bobobk.com/682.html
# https://www.bobobk.com/671.html
# https://www.bobobk.com/523.html
# https://www.bobobk.com/521.html
# https://www.bobobk.com/823.html
# https://www.bobobk.com/512.html

In this example, the target is this website itself. In testing, JavaScript-rendered sites were crawled successfully with this method.
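Some pages fill in their content a moment after loading, in which case reading page_source straight after driver.get() can return HTML that is still incomplete. The sketch below shows one way to wait for an element before reading the page. wait_for_xpath is a hypothetical helper written for this illustration, with a hand-rolled polling loop so that it can be read on its own. Selenium ships WebDriverWait in selenium.webdriver.support.ui for the same purpose, and that is likely the better choice in real code. The sketch assumes only that the driver offers find_elements(by, value) and page_source, as the webdriver.Remote instance above does.

```python
import time


def wait_for_xpath(driver, xpath, timeout=10.0, poll=0.5):
    """Poll until xpath matches at least one element, then return the page source.

    Raises TimeoutError if nothing matches within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        # "xpath" is the string value of Selenium's By.XPATH locator strategy.
        if driver.find_elements("xpath", xpath):
            return driver.page_source
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no element matched {xpath!r} within {timeout} s")
        time.sleep(poll)
```

Called after driver.get(...) and before the session is closed, for example as html = wait_for_xpath(driver, "//article/header/h1/a"), it returns HTML that can be passed to Selector(text=html) in the same way as in the example above.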

Summary

Providing a browser service through Docker is an effective way to crawl pages whose content is rendered dynamically by JavaScript and is therefore missing from the raw page source.
