Introduction to WeTextProcessing: Chinese Text Normalization

Ng Wai Foong
3 min read · May 8

Normalize Chinese text for downstream tasks

Image by the author

By reading this piece, you will learn to perform Chinese text normalization and inverse text normalization. Text normalization is one of the most important preprocessing steps in natural language processing (NLP).

Most of the time, raw text should not be fed directly into downstream tasks, as it might hurt performance and accuracy. For example, numbers written as digits can be spelled out as follows:

|-------------------|----------------------------|
|Raw text | Normalized text |
|-------------------|----------------------------|
|共465篇,约315万字 |共四百六十五篇,约三百一十五万字 |
|-------------------|----------------------------|
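
To make the transformation in the table concrete, here is a minimal pure-Python sketch that spells out digit runs as Chinese numerals. It is not how WeTextProcessing works internally (that library is built on OpenFst and Pynini); the names `int_to_chinese` and `normalize_numbers` are hypothetical, and the sketch assumes non-negative integers below 10,000, leaving longer digit runs untouched.

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]  # ones, tens, hundreds, thousands


def int_to_chinese(n: int) -> str:
    """Spell out an integer in the range 0..9999 as Chinese numerals."""
    if not 0 <= n < 10000:
        raise ValueError("this sketch only supports 0 <= n < 10000")
    if n == 0:
        return DIGITS[0]
    digits = str(n)
    parts = []
    pending_zero = False
    for i, ch in enumerate(digits):
        d = int(ch)
        unit = UNITS[len(digits) - 1 - i]
        if d == 0:
            # Remember the gap; a single 零 is emitted only if a
            # non-zero digit follows (e.g. 405 -> 四百零五).
            pending_zero = True
            continue
        if pending_zero:
            parts.append(DIGITS[0])
            pending_zero = False
        parts.append(DIGITS[d] + unit)
    text = "".join(parts)
    # 10..19 are conventionally written 十, 十一, ... rather than 一十, 一十一, ...
    if text.startswith("一十"):
        text = text[1:]
    return text


def normalize_numbers(text: str) -> str:
    """Replace each run of ASCII digits below 10000 with its spelled-out form."""
    def spell(match):
        value = int(match.group())
        return int_to_chinese(value) if value < 10000 else match.group()

    return re.sub(r"[0-9]+", spell, text)


print(normalize_numbers("共465篇,约315万字"))  # 共四百六十五篇,约三百一十五万字
```

A real normalizer also has to handle dates, decimals, phone numbers, and units, which is why a rule-based library is preferable to a hand-written function like this one.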

Besides that, text normalization may remove filler words and unnecessary punctuation, symbols, and characters from the raw text.

|-------------------|----------------------------|
|Raw text | Normalized text |
|-------------------|----------------------------|
|呃这个呃啊我不知道!!! | 这个我不知道! |
|-------------------|----------------------------|
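
The cleanup shown in the table can be approximated with two regular expressions. This is a rough sketch, not WeTextProcessing's actual rules: the filler set (`呃`, `啊`) and the function name `remove_fillers` are assumptions made for illustration, and deleting these characters blindly would also damage sentences where they carry meaning.

```python
import re

# Assumed filler characters for this illustration only.
FILLERS = "呃啊"


def remove_fillers(text: str) -> str:
    """Drop filler characters and collapse repeated punctuation marks."""
    # Remove every occurrence of a filler character.
    text = re.sub(f"[{FILLERS}]", "", text)
    # Collapse runs of the same punctuation mark (ASCII or full-width) into one.
    text = re.sub(r"([!?,。!?,])\1+", r"\1", text)
    return text


print(remove_fillers("呃这个呃啊我不知道!!!"))  # 这个我不知道!
```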

This tutorial covers the WeTextProcessing module, a Chinese text normalization Python module built on top of other useful modules.

Note that WeTextProcessing uses OpenFst and Pynini as its foundational libraries, which are only installable on Linux (x86) and macOS platforms.

Setup

It is highly recommended to create a new virtual environment before continuing with the installation. Activate it and run the following command to install WeTextProcessing:

pip install WeTextProcessing

Alternatively, you can install it using the following requirements file:

Cython==0.29.34
importlib-resources==5.12.0
pynini==2.1.5
WeTextProcessing==0.1.0

Save it as requirements.txt and install it as follows:

pip install -r requirements.txt
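
After installing, it can be useful to confirm that the environment matches the pinned versions. The sketch below uses only `importlib.metadata` from the standard library (Python 3.8+); `parse_pins` and `check_pins` are hypothetical helper names, not part of WeTextProcessing, and the parser only understands simple `name==version` lines.

```python
from importlib import metadata

REQUIREMENTS = """\
Cython==0.29.34
importlib-resources==5.12.0
pynini==2.1.5
WeTextProcessing==0.1.0
"""


def parse_pins(requirements_text: str) -> dict:
    """Map package name -> pinned version for each 'name==version' line."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not an exact pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def check_pins(pins: dict) -> dict:
    """Map package name -> (pinned version, installed version or None)."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (wanted, installed)
    return report


for name, (wanted, installed) in check_pins(parse_pins(REQUIREMENTS)).items():
    status = "ok" if installed == wanted else "mismatch"
    print(f"{name}: pinned {wanted}, installed {installed} ({status})")
```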

Usage

Normalization

There are 3 components in the text normalization (TN) pipeline:

  • pre-processing (before tagger)
  • non-standard word normalization

Ng Wai Foong

Senior AI Engineer @ Yoozoo | Content Writer #NLP #datascience #programming #machinelearning | LinkedIn: https://www.linkedin.com/in/wai-foong-ng-694619185/