Introduction to WeTextProcessing: Chinese Text Normalization

Ng Wai Foong
3 min read · May 8, 2023

Normalize Chinese text for downstream tasks


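By reading this piece, you will learn to perform Chinese text normalization and inverse text normalization. Text normalization converts written forms such as digits into their spoken forms, and inverse text normalization does the reverse. It is one of the most important preprocessing steps in natural language processing (NLP).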

Most of the time, raw text should not be fed directly into downstream tasks, as it might hurt performance or accuracy. For example, numbers written as digits can be spelled out in Chinese characters:

|-------------------|----------------------------|
|Raw text | Normalized text |
|-------------------|----------------------------|
|共465篇,约315万字 |共四百六十五篇,约三百一十五万字 |
|-------------------|----------------------------|
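To make the transformation in the table concrete, here is a minimal pure-Python sketch that spells out plain integers in Chinese. It is only an illustration and not how WeTextProcessing works internally. The names `int_to_chinese` and `spell_out_integers` are hypothetical, and the sketch ignores decimals, dates, leading zeros and other cases a real normalizer has to handle.

```python
import re

DIGITS = "零一二三四五六七八九"
SMALL_UNITS = ["", "十", "百", "千"]   # units inside a 4-digit section
BIG_UNITS = ["", "万", "亿"]           # units between 4-digit sections


def _section_to_chinese(n):
    """Spell out 1..9999, writing a single 零 for skipped interior places."""
    out = ""
    pending_zero = False
    for power in range(3, -1, -1):
        digit = (n // 10 ** power) % 10
        if digit == 0:
            # Only remember a zero once a non-zero digit has been written.
            pending_zero = bool(out)
        else:
            if pending_zero:
                out += "零"
                pending_zero = False
            out += DIGITS[digit] + SMALL_UNITS[power]
    return out


def int_to_chinese(n):
    """Spell out a non-negative integer below 10**12 (illustrative only)."""
    if n == 0:
        return DIGITS[0]
    sections = []
    while n > 0:
        sections.append(n % 10000)
        n //= 10000
    if len(sections) > len(BIG_UNITS):
        raise ValueError("number too large for this sketch")
    out = ""
    for idx in range(len(sections) - 1, -1, -1):
        section = sections[idx]
        if section == 0:
            continue
        if out and section < 1000:
            out += "零"  # bridge a gap such as 100005 -> 一十万零五
        out += _section_to_chinese(section) + BIG_UNITS[idx]
    return out


def spell_out_integers(text):
    """Replace every run of ASCII digits in text with Chinese numerals."""
    return re.sub(r"[0-9]+", lambda m: int_to_chinese(int(m.group())), text)


if __name__ == "__main__":
    print(spell_out_integers("共465篇,约315万字"))
```

Note that the sketch writes 10 as 一十 rather than the more colloquial 十, which matches the 一十 used inside 三百一十五 in the table but may not be what you want for standalone numbers.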

Besides that, text normalization may remove unnecessary punctuation, symbols and characters from the raw text, such as filler interjections and repeated exclamation marks:

|-------------------|----------------------------|
|Raw text | Normalized text |
|-------------------|----------------------------|
|呃这个呃啊我不知道!!! | 这个我不知道! |
|-------------------|----------------------------|
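The cleanup in the table can be approximated with two regular expressions. This is a rough sketch under stated assumptions: the filler list `FILLERS` and the function name `strip_fillers` are made up for illustration, and blindly deleting characters such as 啊 can damage legitimate words, so a real normalizer needs more careful rules.

```python
import re

# Hypothetical filler list, just enough for the example in the table.
FILLERS = "呃啊"

# Punctuation marks (ASCII and full-width) whose repeats are collapsed.
REPEATABLE_PUNCT = r"([!?!?。,,])\1+"


def strip_fillers(text):
    """Drop filler characters, then collapse repeated punctuation marks."""
    text = re.sub(f"[{FILLERS}]", "", text)
    return re.sub(REPEATABLE_PUNCT, r"\1", text)


if __name__ == "__main__":
    print(strip_fillers("呃这个呃啊我不知道!!!"))
```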

This tutorial covers WeTextProcessing, a Python module for Chinese text normalization built on top of other useful modules.
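As a rough preview of what using the module might look like, the sketch below wraps both directions behind a small helper. The import paths and the `normalize` method are assumptions based on the project's README, so check them against the version you install. The helpers `normalize_all` and `load_wetext_normalizers` are hypothetical names introduced here, and the import is guarded so the script still runs when the package is missing.

```python
def normalize_all(normalizer, texts):
    """Apply any object exposing normalize(str) -> str to a list of texts."""
    return [normalizer.normalize(text) for text in texts]


def load_wetext_normalizers():
    """Return (tn, itn) normalizers, or None if the package is unavailable."""
    try:
        # Import paths assumed from the project's README; verify them
        # against the version of WeTextProcessing you have installed.
        from tn.chinese.normalizer import Normalizer
        from itn.chinese.inverse_normalizer import InverseNormalizer
    except ImportError:
        return None
    return Normalizer(), InverseNormalizer()


if __name__ == "__main__":
    loaded = load_wetext_normalizers()
    if loaded is None:
        print("WeTextProcessing does not appear to be installed")
    else:
        tn, itn = loaded
        # Text normalization: written form to spoken form.
        print(normalize_all(tn, ["共465篇,约315万字"]))
        # Inverse text normalization: spoken form back to written form.
        print(normalize_all(itn, ["共四百六十五篇"]))
```

Keeping `normalize_all` independent of the library makes it easy to swap in a different normalizer, or a stub in tests, without touching the rest of the pipeline.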

Note that WeTextProcessing uses…
