BOBOBK

Drawing a Stunning "Dream of the Red Chamber" Word Cloud with Python 3


You have probably seen word clouds before. In Python they are commonly generated with wordcloud, a popular library. This article walks through using wordcloud to create a word cloud for “Dream of the Red Chamber,” one of China’s Four Great Classical Novels.


1. Preparation

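This involves three parts:

1. The text of “Dream of the Red Chamber” as a .txt file.

2. The wordcloud and jieba libraries, which can be installed using pip install wordcloud and pip install jieba.

3. A Chinese font file, because wordcloud’s default font cannot render Chinese characters.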

The .txt file and the font file are bundled together (download link at the end of this article) so that you can reproduce this tutorial’s example.


2. Drawing the “Dream of the Red Chamber” Word Cloud

Here’s the code directly:

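    from wordcloud import WordCloud
    import jieba
    import matplotlib.pyplot as plt

    # Read the novel; UTF-8 is assumed here, adjust if your copy differs.
    with open("红楼梦.txt", encoding="utf-8") as f:
        raw_text = f.read()

    # Segment with jieba and join the tokens with spaces so wordcloud can split them.
    text = " ".join(jieba.cut(raw_text))
    wordcloud = WordCloud(font_path="kaibold.ttf").generate(text)

    # Display the generated image:
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.margins(x=0, y=0)
    plt.show()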

In the example above, we first import the necessary libraries, then read the text file and perform Chinese word segmentation using jieba’s cut function. cut returns a generator of tokens rather than a string, so we join the tokens with spaces to meet the input requirements of the word cloud tool, which expects whitespace-separated words as in English text. Finally, we specify the font file and generate the image.
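The joining step can be tried in isolation. The sketch below does not call jieba at all: fake_cut is a hypothetical stand-in that yields hand-picked tokens the way a generator would, so the example runs without any third-party package. It also shows one thing worth keeping in mind with generators, namely that they can only be consumed once.

```python
def fake_cut(tokens):
    """Hypothetical stand-in for a segmenter: yields tokens lazily, one by one."""
    for token in tokens:
        yield token


# Hand-picked tokens for illustration, not real segmenter output.
tokens = fake_cut(["贾宝玉", "笑", "道", "林妹妹", "来", "了"])

# Join with single spaces so that each token becomes a separate "word".
text = " ".join(tokens)
print(text)

# The generator is now exhausted, so a second join yields an empty string.
second_pass = " ".join(tokens)
print(repr(second_pass))
```

If the tokens are needed more than once, storing them in a list first avoids the empty second pass.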

The word cloud is generated successfully, but there are still some obvious issues. For instance, the word “道” (dào) appears with a very high frequency and dominates the image, so it needs to be removed. The code in the next section removes such words using a list read from remove.txt.
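To decide which words belong in a removal list, it can help to count token frequencies first. The following is a minimal sketch using only the standard library; the helper names top_tokens and drop_tokens and the sample string are made up for illustration and are not part of wordcloud or jieba.

```python
from collections import Counter


def top_tokens(text, n=3):
    """Return the n most frequent whitespace-separated tokens with their counts."""
    return Counter(text.split()).most_common(n)


def drop_tokens(text, unwanted):
    """Remove whole tokens listed in `unwanted`, leaving longer words untouched."""
    unwanted = set(unwanted)
    return " ".join(t for t in text.split() if t not in unwanted)


# A tiny hand-written sample standing in for the segmented novel.
sample = "宝玉 笑 道 黛玉 道 宝玉 听 了 道 知道"

most_common = top_tokens(sample, n=2)
print(most_common)

cleaned = drop_tokens(sample, ["道"])
print(cleaned)
```

Filtering token by token, rather than replacing substrings, keeps a word such as “知道” intact even though it contains “道”.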


3. Word Cloud in a Specific Shape

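In addition to direct plotting, wordcloud can also draw word clouds inside a user-defined shape. This only requires passing an image array as the mask parameter when creating the WordCloud object. The code below also filters out the unwanted words listed in remove.txt, a plain-text file with one word per line. Here’s the code: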

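    from io import BytesIO

    from wordcloud import WordCloud
    import jieba, requests
    import matplotlib.pyplot as plt
    from PIL import Image
    import numpy as np

    # UTF-8 is assumed for both text files; adjust if your copies differ.
    tokens = jieba.cut(open("红楼梦.txt", encoding="utf-8").read())
    # remove.txt lists one unwanted word per line; blank lines are skipped.
    remove_word = {i.strip() for i in open("remove.txt", encoding="utf-8") if i.strip()}
    # Filter whole tokens so that removing "道" does not damage words such as "知道".
    text = " ".join(t for t in tokens if t not in remove_word)

    # Download the mask image and convert it to a NumPy array.
    wave_mask = np.array(Image.open(BytesIO(requests.get(
            "https://www.bobobk.com/wp-content/uploads/2018/11/butter.jpg").content)))

    # Make the figure (the font path is machine-specific; point it at your copy of kaibold.ttf)
    wordcloud = WordCloud(mask=wave_mask, background_color="lightblue", font_path="/Library/Fonts/kaibold.ttf").generate(text)

    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.margins(x=0, y=0)
    plt.show()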

The resulting word cloud takes the shape of the butterfly curve image from this site.
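A mask does not have to come from a downloaded picture. As a rough sketch, the array can also be built directly with NumPy. The function name circle_mask and its size and margin parameters are assumptions made for this example. It relies on wordcloud’s convention that pure white (255) entries are masked out and all other entries are free to draw on.

```python
import numpy as np


def circle_mask(size=400, margin=10):
    """Build a (size, size) uint8 array: 0 inside a centered circle, 255 outside."""
    yy, xx = np.ogrid[:size, :size]
    center = (size - 1) / 2
    radius = size / 2 - margin
    outside = (xx - center) ** 2 + (yy - center) ** 2 > radius ** 2
    mask = np.zeros((size, size), dtype=np.uint8)
    mask[outside] = 255  # white: wordcloud leaves these pixels empty
    return mask


mask = circle_mask()
print(mask.shape, mask[0, 0], mask[200, 200])
```

The result can be passed as mask=mask in place of wave_mask in the code above, which should confine the words to a circle instead of the butterfly shape.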


Summary

Using the open-source Python library wordcloud in conjunction with the Chinese word segmentation tool jieba, we’ve successfully created a word cloud for the complete text of “Dream of the Red Chamber.”

Download link for font and text files:

Link: https://pan.baidu.com/s/1Wi8sdpj9tva0pglDyfv8gA Extraction Code: pq6t
