Introduction to WeTextProcessing: Chinese Text Normalization
Normalize Chinese text for downstream tasks
By reading this piece, you will learn to perform Chinese text normalization and inverse text normalization. Text normalization is one of the most important preprocessing steps in natural language processing (NLP).
Raw text should usually not be fed directly into downstream tasks, as it might affect performance or accuracy. For example, numerals in the raw text can be rewritten in their spoken form:
|-------------------|----------------------------|
|Raw text | Normalized text |
|-------------------|----------------------------|
|共465篇,约315万字 |共四百六十五篇,约三百一十五万字 |
|-------------------|----------------------------|
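To make the number rewrite in the table concrete, here is a minimal pure-Python sketch that spells out short digit runs in Chinese. It is not how WeTextProcessing works internally (the library builds on OpenFst and Pynini). The names read_integer and verbalize_numbers are hypothetical, and the sketch only handles integers from 0 to 9999, leaving longer digit runs untouched.

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]  # ones, tens, hundreds, thousands


def read_integer(n: int) -> str:
    """Spell out an integer in the range 0..9999 in Chinese (illustrative only)."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only handles 0..9999")
    if n == 0:
        return DIGITS[0]
    text = str(n)
    parts = []
    skipped_zero = False
    for position, char in enumerate(text):
        digit = int(char)
        if digit == 0:
            # Remember the gap; a single 零 is emitted before the next non-zero digit.
            skipped_zero = True
            continue
        if skipped_zero:
            parts.append(DIGITS[0])
            skipped_zero = False
        parts.append(DIGITS[digit] + UNITS[len(text) - 1 - position])
    return "".join(parts)


def verbalize_numbers(text: str) -> str:
    """Replace every run of 1 to 4 ASCII digits with its Chinese reading."""
    def replace(match):
        run = match.group()
        if len(run) > 4:
            return run  # out of scope for this sketch, keep as-is
        return read_integer(int(run))

    return re.sub(r"[0-9]+", replace, text)


print(verbalize_numbers("共465篇,约315万字"))
```

A real normalizer also has to decide between readings by context (dates, phone numbers, decimals, measures), which is why a grammar-based approach is used in practice.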
Besides that, text normalization may remove unnecessary punctuation, symbols and characters from the raw text.
|-------------------|----------------------------|
|Raw text | Normalized text |
|-------------------|----------------------------|
|呃这个呃啊我不知道!!! | 这个我不知道! |
|-------------------|----------------------------|
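The cleanup in the table can be approximated with two regular expressions: one that drops filler characters and one that collapses repeated punctuation. This is a rough sketch under stated assumptions, not the library's implementation. The filler set (呃, 啊) is taken only from the example above, the punctuation set is an arbitrary choice, and the function name strip_fillers is hypothetical.

```python
import re

# Assumed filler characters, taken from the example row above.
FILLER_PATTERN = re.compile(r"[呃啊]")
# Any of these punctuation marks repeated two or more times in a row.
REPEATED_PUNCT_PATTERN = re.compile(r"([!?!?。,,])\1+")


def strip_fillers(text: str) -> str:
    """Remove filler characters, then collapse repeated punctuation to one mark."""
    text = FILLER_PATTERN.sub("", text)
    return REPEATED_PUNCT_PATTERN.sub(r"\1", text)


print(strip_fillers("呃这个呃啊我不知道!!!"))
```

Deleting characters blindly is lossy: 啊 is also a legitimate sentence-final particle, so a production system needs context to decide when it is filler.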
This tutorial covers the WeTextProcessing module, a Chinese text normalization Python module built on top of other useful modules.
Note that WeTextProcessing uses OpenFst and Pynini as its foundational libraries, which are only installable on Linux (x86) and macOS platforms.
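If you want to fail early on an unsupported machine, a small check with the standard platform module is one option. The sketch below encodes only the constraint stated above (Linux on x86, or macOS). The helper name is_supported and the list of x86 machine strings are assumptions, not something WeTextProcessing provides.

```python
import platform

# Assumed spellings of x86 architectures as reported by platform.machine().
X86_MACHINES = {"x86_64", "amd64", "i386", "i686"}


def is_supported(system=None, machine=None):
    """Return True for Linux on x86 or for macOS, per the note above."""
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    if system == "Darwin":  # platform.system() reports macOS as "Darwin"
        return True
    if system == "Linux":
        return machine.lower() in X86_MACHINES
    return False


if not is_supported():
    print("OpenFst/Pynini may not be installable on this platform.")
```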
Setup
It is highly recommended to create a new virtual environment before continuing with the installation. Activate it and run the following command to install WeTextProcessing:
pip install WeTextProcessing
Alternatively, you can install it using the following requirements file:
Cython==0.29.34
importlib-resources==5.12.0
pynini==2.1.5
WeTextProcessing==0.1.0
Save it as requirements.txt and install it as follows:
pip install -r requirements.txt
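After installing, it can be useful to confirm that the environment really contains the pinned versions. The following sketch parses the pins listed above and compares them with what importlib.metadata reports. The helper names parse_pins and check_pins are hypothetical, and the parser only understands simple name==version lines.

```python
from importlib import metadata

REQUIREMENTS = """\
Cython==0.29.34
importlib-resources==5.12.0
pynini==2.1.5
WeTextProcessing==0.1.0
"""


def parse_pins(text):
    """Map package name to pinned version for each 'name==version' line."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Return {name: (wanted, installed, matches)}; installed is None if missing."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (wanted, installed, installed == wanted)
    return report


for name, (wanted, installed, ok) in check_pins(parse_pins(REQUIREMENTS)).items():
    print(f"{name}: wanted {wanted}, installed {installed}, ok={ok}")
```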
Usage
Normalization
There are three components in the text normalization (TN) pipeline:
- pre-processing (before tagger)
- non-standard word normalization