December 8, 2021December 8, 2021

Using Docker to Render JavaScript Web Pages via Browser from Command Line

When using scrapy to crawl web pages, many websites render content using JavaScript, so directly fetching the source code does not retrieve the needed page content. At this point, using selenium to drive a browser to get the web content is very appropriate. However, one problem is that this requires a browser installed locally and must be run as a non-root user. Therefore, using Docker to provide a Chrome service, driven by selenium, to get the rendered web content is a good solution.

Running Chrome Docker Container

By searching, we know that the container on Docker Hub is selenium/standalone-chrome. If Docker is already installed locally, you can run this service on port 14444. For security reasons, only allow local access.

docker run -itd --name=chrome -p 127.0.0.1:14444:4444 --shm-size="2g" selenium/standalone-chrome

The parameters are very simple, only configuring backend running, port mapping, and shared memory size.

Using Selenium to Call Remote Service for Web Crawling

Selenium’s webdriver has a Remote parameter to specify the remote address.

from selenium import webdriver
from scrapy.selector import Selector

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # example

driver = webdriver.Remote("http://127.0.0.1:14444/wd/hub", options=options)
driver.get("https://www.bobobk.com")

hrefs = Selector(text=driver.page_source).xpath("//article/header/h1/a/@href").extract()
for url in hrefs:
    print(url)
# Example output:
# https://www.bobobk.com/833.html
# https://www.bobobk.com/621.html
# https://www.bobobk.com/852.html
# https://www.bobobk.com/731.html
# https://www.bobobk.com/682.html
# https://www.bobobk.com/671.html
# https://www.bobobk.com/523.html
# https://www.bobobk.com/521.html
# https://www.bobobk.com/823.html
# https://www.bobobk.com/512.html

In the example, the target site is this website itself. In actual tests, JavaScript-rendered sites can be perfectly crawled using this method.

Summary

Providing browser services via Docker can effectively solve the problem where web pages rendered dynamically by JavaScript cause the failure to fetch required web content.

July 1, 2019July 1, 2019

Serialization and Deserialization in Python

TECHNOLOGY

Serialization and Deserialization in Python

Sometimes you may need to temporarily store data so that it can be called directly the next time the program runs, or exchanged between different threads. Serialization is a way to store data in this way, and here we explain Python's serialization and deserialization using the pickle package.

December 6, 2021December 6, 2021

Accelerating pip and anaconda in China

TECHNOLOGY

Since the default addresses for pip and anaconda are very slow to access in China, adding domestic mirrors for acceleration is necessary.

December 28, 2020December 28, 2020

Hands-on Implementation of Random Forest Algorithm with Python

TECHNOLOGY

Hands-on Implementation of Random Forest Algorithm with Python

This article will guide you through a hands-on implementation of a powerful random forest machine learning model. It aims to complement my conceptual explanation of random forests, but as long as you have a basic understanding of decision trees and random forests, you can fully read it. Later, we will discuss how to improve the model built here.

September 22, 2020September 22, 2020

Python Script to Snatch Recently Expired Domains

TECHNOLOGY

Python Script to Snatch Recently Expired Domains

'Many domain enthusiasts scour forums and websites frantically searching for and snatching up suitable domains, even spending heavily to buy desired domains from their owners. International domain management bodies adopt a "first-to-apply, first-to-register, first-to-use" policy. Since domains only require a small annual registration fee, continuous registration grants you the right to use the domain. Because of this, many domain resellers (commonly known as "domaining pros") often spend heavily on short, easy-to-remember domains. I used to think about buying shorter domains for building scraping sites, but unfortunately, both snatching and buying from others were very expensive. Since it"s first-come, first-served, we can also acquire good domains by registering them before the current owner forgets to renew.'

Google Advertisement

July 29, 2020July 29, 2020

python3 solution to LeeCode medium problem

TECHNOLOGY

python3 solution to LeeCode medium problem

This is an article analyzing a problem from the coding practice site LeeCode.

April 28, 2020April 28, 2020

Efficient Ways to Check If a List or Tuple Is Empty in Python

TECHNOLOGY

Efficient Ways to Check If a List or Tuple Is Empty in Python

In Python, to check whether an array or tuple is empty, there are three methods: comparing with an empty list, checking the length, and using an if statement.

January 8, 2020January 8, 2020

Application of Python Implementation of Gradient Descent in Practice

TECHNOLOGY

Application of Python Implementation of Gradient Descent in Practice

Gradient descent is a first-order optimization algorithm, commonly called the steepest descent method. To find a local minimum of a function using gradient descent, one must iteratively move from the current point in the opposite direction of the gradient (or approximate gradient) by a specified step size.

November 11, 2019November 11, 2019

Solving Expert-Level Sudoku Puzzles Quickly Using Python's Backtracking Algorithm

TECHNOLOGY

Solving Expert-Level Sudoku Puzzles Quickly Using Python's Backtracking Algorithm

I often play Sudoku in my leisure time as a form of relaxation. My usual method involves eliminating duplicates and filling in unique numbers first, then proceeding step by step. However, it's inevitable to guess numbers and adjust based on feedback. So, is there a better algorithm to solve Sudoku puzzles? Here, I will use the backtracking method in Python to solve 9x9 expert-level Sudoku puzzles.

Google Advertisement

June 25, 2019June 25, 2019

Extracting Free High Anonymity Proxies with Python3

MISCELLANEOUS

Extracting Free High Anonymity Proxies with Python3

June 17, 2019June 17, 2019

Sharing These Python Tips

MISCELLANEOUS

June 3, 2019June 3, 2019

Python Class Inheritance and Polymorphism

MISCELLANEOUS

Python Class Inheritance and Polymorphism

June 2, 2019June 2, 2019

Process Pools, Thread Pools, and Coroutines in Python

MISCELLANEOUS

Process Pools, Thread Pools, and Coroutines in Python

Neither threads nor processes can be opened indefinitely; they will always consume and occupy resources. Hardware has limited capacity. While ensuring high-efficiency work, hardware resource utilization should also be guaranteed. Therefore, an upper limit needs to be set for hardware to alleviate its pressure, which led to the concept of pools.

Running Chrome Docker Container

Using Selenium to Call Remote Service for Web Crawling

Summary

Related