sacremoses repository: https://github.com/alvations/sacremoses
sacremoses is a Python implementation of Moses's tokenizer, truecaser, and normalizer, and is quite convenient to use.
1. Installation
Install with:
pip install sacremoses
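To check that the install worked, a one-line sanity test:

import sacremoses
print(sacremoses.MosesTokenizer(lang='en').tokenize('Hello, world!'))
# ['Hello', ',', 'world', '!']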
2. Tokenizer and Detokenizer
Tokenizer
The tokenize() function
tokenize(text, aggressive_dash_splits=False, return_str=False, escape=True, protected_patterns=None)
text: the sentence to tokenize, type str
aggressive_dash_splits: aggressively split on dashes (hyphens become separate @-@ tokens), type bool, default False
return_str: return the result as a single string instead of a list, type bool, default False
escape: whether to escape special characters as XML entities (e.g. ' becomes &apos;), type bool, default True
protected_patterns: patterns to protect from being split, type list, default None
Tokenizer example
import sacremoses as moses
tokenizer = moses.MosesTokenizer(lang='en')
text = u"It's a sentence with 'weird-symbols'."
result_1 = tokenizer.tokenize(text)
# ['It', '&apos;s', 'a', 'sentence', 'with', '&apos;', 'weird-symbols', '&apos;', '.']
result_2 = tokenizer.tokenize(text, return_str=True)
# It &apos;s a sentence with &apos; weird-symbols &apos; .
result_3 = tokenizer.tokenize(text, return_str=True, escape=False)
# It 's a sentence with ' weird-symbols ' .
result_4 = tokenizer.tokenize(text, aggressive_dash_splits=True, return_str=True, escape=False)
# It 's a sentence with ' weird @-@ symbols ' .
result_5 = tokenizer.tokenize(text, aggressive_dash_splits=True, return_str=True, escape=False, protected_patterns=["It's"])
# It's a sentence with ' weird @-@ symbols ' .
print(result_5)
In most cases, the fourth variant (the result_4 call above) is the one used for tokenization.
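As an illustration, here is a small sketch that applies that variant to a whole file, one sentence per line (the file names are just examples):

import sacremoses as moses

tokenizer = moses.MosesTokenizer(lang='en')

# Tokenize test_file.txt line by line with the result_4 settings.
with open('test_file.txt', encoding='utf-8') as fin, \
     open('test_file.tok.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(tokenizer.tokenize(line.strip(),
                                      aggressive_dash_splits=True,
                                      return_str=True,
                                      escape=False) + '\n')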
Detokenizer
The detokenize() function
detokenize(tokens, return_str=True, unescape=True)
tokens: the tokenized sentence to detokenize, type list
return_str: return the result as a string, type bool, default True
unescape: unescape special characters (convert XML entities such as &apos; back to the original characters), type bool, default True
Detokenizer example
import sacremoses as moses
detokenizer = moses.MosesDetokenizer(lang='en')
tokens = ['It', "'s", 'a', 'sentence', 'with', "'", 'weird', '@-@', 'symbols', "'", '.']
result = detokenizer.detokenize(tokens)
# It's a sentence with 'weird-symbols'.
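Note that the tokenizer's escape=True and the detokenizer's unescape=True are meant to pair up, so a plain tokenize/detokenize round trip recovers the original text. A minimal sketch:

import sacremoses as moses

tokenizer = moses.MosesTokenizer(lang='en')
detokenizer = moses.MosesDetokenizer(lang='en')

text = u"It's a sentence with 'weird-symbols'."
tokens = tokenizer.tokenize(text)      # escape=True by default: ' becomes &apos;
print(detokenizer.detokenize(tokens))  # unescape=True by default: turns it back
# It's a sentence with 'weird-symbols'.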
3. Truecaser
Truecasing a sentence generally involves two steps:
- Train a truecaser model (skip this step if you already have a trained model)
- Apply the trained model to truecase sentences
Training a truecaser model
The sacremoses.MosesTruecaser class provides three methods for training a truecaser model: MosesTruecaser.train(), MosesTruecaser.train_from_file(), and MosesTruecaser.train_from_file_object().
MosesTruecaser.train()
MosesTruecaser.train(documents, save_to=None, possibly_use_first_token=False, processes=1, progress_bar=False)
documents: the documents used to train the truecaser model, type list(list(str))
save_to: file path the model is saved to, type str, default None
possibly_use_first_token: also use the first token of each sentence (by default the first token is excluded, since its capitalization is dictated by sentence position), type bool, default False
processes: number of processes to train with, type int, default 1
progress_bar: show a progress bar during training, type bool, default False
Note: for multiprocess training, setting processes greater than 1 was actually slower in my tests; I am not sure why.
import sacremoses as moses

truecaser = moses.MosesTruecaser()
tokenizer = moses.MosesTokenizer(lang='en')
tokenized_docs = [tokenizer.tokenize(line) for line in open('test_file.txt', encoding='utf-8')]
truecaser.train(tokenized_docs, save_to='truecasemodel.txt')
MosesTruecaser.train_from_file()
MosesTruecaser.train_from_file(filename, save_to=None, possibly_use_first_token=False, processes=1, progress_bar=False)
filename: path of the file used to train the truecaser model, type str
save_to: file path the model is saved to, type str, default None
possibly_use_first_token: also use the first token of each sentence (by default the first token is excluded), type bool, default False
processes: number of processes to train with, type int, default 1
progress_bar: show a progress bar during training, type bool, default False
import sacremoses as moses

truecaser = moses.MosesTruecaser()
truecaser.train_from_file('test_file.txt', save_to='truecasemodel.txt')
MosesTruecaser.train_from_file_object()
MosesTruecaser.train_from_file_object(file_object, save_to=None, possibly_use_first_token=False, processes=1, progress_bar=False)
file_object: the file object used to train the truecaser model, type file object
save_to: file path the model is saved to, type str, default None
possibly_use_first_token: also use the first token of each sentence (by default the first token is excluded), type bool, default False
processes: number of processes to train with, type int, default 1
progress_bar: show a progress bar during training, type bool, default False
import sacremoses as moses

truecaser = moses.MosesTruecaser()
with open('test_file.txt', 'r', encoding='utf-8') as file:
    truecaser.train_from_file_object(file, save_to='truecasemodel.txt')
Personally, I prefer the second method, train_from_file(), for training a truecaser model.
Truecasing sentences with the model
# Load a truecaser model when constructing the MosesTruecaser object
import sacremoses as moses

truecaser = moses.MosesTruecaser('truecasemodel.txt')
text = 'Hello World!'
truecaser.truecase(text)
# If no model was loaded at construction time, the MosesTruecaser object
# must first train a truecaser model
import sacremoses as moses

truecaser = moses.MosesTruecaser()
truecaser.train_from_file('test_file.txt')
text = 'Hello World!'
truecaser.truecase(text)
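Note that truecase() returns a list of tokens by default; pass return_str=True to get the truecased sentence back as a single string. A minimal sketch, reusing the model file saved above:

import sacremoses as moses

truecaser = moses.MosesTruecaser('truecasemodel.txt')
print(truecaser.truecase('Hello World!'))                   # a list of tokens
print(truecaser.truecase('Hello World!', return_str=True))  # one plain string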
4. Normalizer
The normalizer standardizes punctuation and symbols, e.g. converting full-width characters such as １２３ into their half-width equivalents 123.
import sacremoses as moses

normalizer = moses.MosesPunctNormalizer(lang='en', penn=True, norm_quote_commas=True, norm_numbers=True, pre_replace_unicode_punct=True, post_remove_control_chars=False)
result = normalizer.normalize('【】“”０,１,２,３,４,５,６,７,８,９')
# result: []""0,1,2,3,4,5,6,7,8,9
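Putting the three tools together: a typical Moses-style preprocessing pipeline normalizes first, then tokenizes, then truecases. A minimal sketch, assuming the truecasemodel.txt file trained above:

import sacremoses as moses

normalizer = moses.MosesPunctNormalizer(lang='en', pre_replace_unicode_punct=True)
tokenizer = moses.MosesTokenizer(lang='en')
truecaser = moses.MosesTruecaser('truecasemodel.txt')

def preprocess(line):
    # normalize punctuation -> tokenize -> truecase, returning one string
    normalized = normalizer.normalize(line)
    tokenized = tokenizer.tokenize(normalized, aggressive_dash_splits=True,
                                   return_str=True, escape=False)
    return truecaser.truecase(tokenized, return_str=True)

print(preprocess('【Hello】 World!'))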