Using the sacremoses Package in Python


sacremoses repository: https://github.com/alvations/sacremoses
sacremoses is a pure-Python implementation of the Moses tokenizer, truecaser, and normalizer, and is quite convenient to use.

1. Installation

Install with pip install sacremoses.

2. Tokenizer and Detokenizer

Tokenizer

The tokenize() function

tokenize(text, aggressive_dash_splits=False, return_str=False, escape=True, protected_patterns=None)
text: the sentence to tokenize, type str
aggressive_dash_splits: split aggressively on dashes, type bool, default False
return_str: return the result as a string, type bool, default False
escape: escape special characters as XML entities, type bool, default True
protected_patterns: patterns to protect from tokenization, type list, default None

Tokenizer example

import sacremoses as moses
tokenizer = moses.MosesTokenizer(lang='en')
text = u"It's a sentence with 'weird-symbols'."
result_1 = tokenizer.tokenize(text)
# ['It', '&apos;s', 'a', 'sentence', 'with', '&apos;', 'weird-symbols', '&apos;', '.']
result_2 = tokenizer.tokenize(text, return_str=True)
# It &apos;s a sentence with &apos; weird-symbols &apos; .
result_3 = tokenizer.tokenize(text, return_str=True, escape=False)
# It 's a sentence with ' weird-symbols ' .
result_4 = tokenizer.tokenize(text, aggressive_dash_splits=True, return_str=True, escape=False)
# It 's a sentence with ' weird @-@ symbols ' .
result_5 = tokenizer.tokenize(text, aggressive_dash_splits=True, return_str=True, escape=False, protected_patterns=["It's"])
# It's a sentence with ' weird @-@ symbols ' .

In most cases, the fourth form (aggressive_dash_splits=True, return_str=True, escape=False) is the one to use.

Detokenizer

The detokenize() function

detokenize(tokens, return_str=True, unescape=True)
tokens: the tokenized sentence to detokenize, type list
return_str: return the result as a string, type bool, default True
unescape: convert escaped XML entities back to the original characters, type bool, default True

Detokenizer example

import sacremoses as moses
detokenizer = moses.MosesDetokenizer(lang='en')
tokens = ['It', "'s", 'a', 'sentence', 'with', "'", 'weird', '@-@', 'symbols', "'", '.']
result = detokenizer.detokenize(tokens)
# It's a sentence with 'weird-symbols'.

3. Truecaser

Truecasing a sentence generally involves two steps:

  1. Train a truecaser model (skip this step if you already have a trained model)
  2. Apply the model to truecase sentences

Training a truecaser model

The sacremoses.MosesTruecaser class provides three methods for training a truecaser model: MosesTruecaser.train(), MosesTruecaser.train_from_file(), and MosesTruecaser.train_from_file_object().

MosesTruecaser.train()

MosesTruecaser.train(documents, save_to=None, possibly_use_first_token=False, processes=1, progress_bar=False)

documents: the documents to train the truecaser model on, type list(list(str))
save_to: file to save the model to, type str, default None
possibly_use_first_token: also use the first word of each sentence (discarded by default), type bool, default False
processes: number of training processes, type int, default 1
progress_bar: show a progress bar during training, type bool, default False
Note: in my experience, setting processes greater than 1 was actually slower; I am not sure why.


import sacremoses as moses
truecaser = moses.MosesTruecaser()
tokenizer = moses.MosesTokenizer(lang='en')
tokenized_docs = [tokenizer.tokenize(line) for line in open('test_file.txt')]
truecaser.train(tokenized_docs, save_to='truecasemodel.txt')

MosesTruecaser.train_from_file()

MosesTruecaser.train_from_file(filename, save_to=None, possibly_use_first_token=False, processes=1, progress_bar=False)

filename: path of the file to train the truecaser model on, type str
save_to: file to save the model to, type str, default None
possibly_use_first_token: also use the first word of each sentence (discarded by default), type bool, default False
processes: number of training processes, type int, default 1
progress_bar: show a progress bar during training, type bool, default False


import sacremoses as moses
truecaser = moses.MosesTruecaser()
truecaser.train_from_file('test_file.txt', save_to='truecasemodel.txt')

MosesTruecaser.train_from_file_object()

MosesTruecaser.train_from_file_object(file_object, save_to=None, possibly_use_first_token=False, processes=1, progress_bar=False)

file_object: the file object to train the truecaser model on, type file object
save_to: file to save the model to, type str, default None
possibly_use_first_token: also use the first word of each sentence (discarded by default), type bool, default False
processes: number of training processes, type int, default 1
progress_bar: show a progress bar during training, type bool, default False


import sacremoses as moses
truecaser = moses.MosesTruecaser()
with open('test_file.txt', 'r', encoding='utf-8') as file:
    truecaser.train_from_file_object(file, save_to='truecasemodel.txt')

Personally, I prefer the second method, train_from_file(), for training a truecaser model.

Truecasing sentences

# Load a truecaser model when initializing the MosesTruecaser object
import sacremoses as moses
truecaser = moses.MosesTruecaser('truecasemodel.txt')
text = 'Hello World!'
truecaser.truecase(text)

# If no model was loaded at initialization, the MosesTruecaser object must train one first
import sacremoses as moses
truecaser = moses.MosesTruecaser()
truecaser.train_from_file('test_file.txt')
text = 'Hello World!'
truecaser.truecase(text)

4. Normalizer

Normalizing punctuation

import sacremoses as moses
normalizer = moses.MosesPunctNormalizer(lang='en', penn=True, norm_quote_commas=True, norm_numbers=True, pre_replace_unicode_punct=True, post_remove_control_chars=False)
result = normalizer.normalize('【】“”0,1,2,3,4,5,6,7,8,9')
# result : []""0,1,2,3,4,5,6,7,8,9
