Hi Jimmy Liu,
I trained BERT-Base (English and Chinese) on a single GeForce RTX 2080 Ti with 11 GB of memory. Tested in a Jupyter Notebook on the same GPU, prediction takes around 2.75 seconds per call, which is not suitable for real-time prediction. For online serving, I deployed it via Flask on a Windows 10 machine with an Intel(R) Core i7-8550U CPU and 16 GB RAM, where prediction takes roughly 5 to 6 seconds per call. If you intend to speed it up, you will need to modify the underlying code in the convert_examples_to_features and input_fn_builder functions. Thanks a lot.
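P.S. In case it helps, here is a rough sketch of one way the speed-up could look. It assumes (I have not profiled this) that most of the per-call time comes from the Estimator rebuilding the graph and reloading the checkpoint on every predict call. The idea is to start predict once and keep feeding it examples through a queue-backed generator. The names PersistentPredictor and fake_predict are made up for illustration, and fake_predict only stands in for something like estimator.predict so the snippet runs without TensorFlow. In the real code, the generator would probably have to be wrapped with tf.data.Dataset.from_generator inside input_fn_builder.

```python
import queue
import threading


class PersistentPredictor:
    """Hypothetical helper: keeps one predict() generator alive across calls."""

    def __init__(self, predict_fn):
        # predict_fn stands in for something like estimator.predict: it takes
        # an input_fn and lazily yields one prediction per input example.
        self._inputs = queue.Queue()
        self._lock = threading.Lock()
        self._predictions = predict_fn(self._input_fn)

    def _input_fn(self):
        # Blocks until a new example arrives, so the model stays loaded.
        while True:
            item = self._inputs.get()
            if item is None:  # sentinel: stop feeding
                return
            yield item

    def predict(self, features):
        # One request at a time, so each answer matches its own input.
        with self._lock:
            self._inputs.put(features)
            return next(self._predictions)

    def close(self):
        self._inputs.put(None)


# Stand-in model for the demo (an assumption, not the real BERT estimator).
load_count = {"n": 0}


def fake_predict(input_fn):
    load_count["n"] += 1  # simulates the expensive one-time model load
    for features in input_fn():
        yield len(features)


predictor = PersistentPredictor(fake_predict)
first = predictor.predict("abc")
second = predictor.predict("hello")
predictor.close()
```

In this demo the simulated load runs only once for both predictions. In a Flask app you would create the predictor once at startup and call predictor.predict inside the request handler.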