After running the prepro, I ran the train file. The train file printed: "pnyn vocabulary size is 42" and "hanzi vocabulary size is 5072". I have not been able to figure out why the output of the prepro results in a different vocabulary size.