python - Tokenization in Pig (using a Python UDF)

Tags: python hadoop apache-pig nltk

Here is my tokenization work in Pig.
My Pig script:

-- set the debug mode
SET debug 'off';
-- Register the Python UDF
REGISTER '/home/hema/phd/work1/coding/myudf.py' USING streaming_python AS myudf;

RAWDATA = LOAD '/home/hema/temp' USING TextLoader() AS content;
LOWERCASE_DATA = FOREACH RAWDATA GENERATE LOWER(content) AS con;
TOKENIZED_DATA = FOREACH LOWERCASE_DATA GENERATE myudf.special_tokenize(con) AS conn;
DUMP TOKENIZED_DATA;

My Python UDF:
from pig_util import outputSchema
import nltk

@outputSchema('word:chararray')
def special_tokenize(input):
    tokens=nltk.word_tokenize(input)
    return tokens
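
For reference, nltk.word_tokenize on its own returns a plain Python list, so the stray characters below are not produced by NLTK itself; a minimal local check (assuming the punkt tokenizer model is installed):

import nltk
# nltk.download('punkt')  # one-time model download, if missing

print(nltk.word_tokenize("additionalcontext in namefinder"))
# ['additionalcontext', 'in', 'namefinder']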

The code runs fine, but the output is messy. How do I get rid of the unwanted underscores and pipes? The output looks like this:
(|{_|(_additionalcontext|)_|,_|(_in|)_|,_|(_namefinder|)_|}_)
(|{_|(_is|)_|,_|(_there|)_|,_|(_any|)_|,_|(_possibility|)_|,_|(_to|)_|,_|(_use|)_|,_|(_additionalcontext|)_|,_|(_with|)_|,_|(_the|)_|,_|(_namefinderme.train|)_|,_|(_?|)_|,_|(_if|)_|,_|(_so|)_|,_|(_,|)_|,_|(_how|)_|,_|(_?|)_|,_|(_if|)_|,_|(_there|)_|,_|(_is|)_|,_|(_n't|)_|,_|(_maybe|)_|,_|(_this|)_|,_|(_should|)_|,_|(_be|)_|,_|(_an|)_|,_|(_issue|)_|,_|(_to|)_|,_|(_be|)_|,_|(_added|)_|,_|(_in|)_|,_|(_the|)_|,_|(_future|)_|,_|(_releases|)_|,_|(_?|)_|}_)
(|{_|(_i|)_|,_|(_would|)_|,_|(_really|)_|,_|(_greatly|)_|,_|(_appreciate|)_|,_|(_if|)_|,_|(_someone|)_|,_|(_can|)_|,_|(_help|)_|,_|(_(|)_|,_|(_give|)_|,_|(_me|)_|,_|(_some|)_|,_|(_sample|)_|,_|(_code/show|)_|,_|(_me|)_|,_|(_)|)_|,_|(_how|)_|,_|(_to|)_|,_|(_add|)_|,_|(_pos|)_|,_|(_tag|)_|,_|(_features|)_|,_|(_while|)_|,_|(_training|)_|,_|(_and|)_|,_|(_testing|)_|,_|(_namefinder|)_|,_|(_.|)_|}_)
(|{_|(_if|)_|,_|(_the|)_|,_|(_incoming|)_|,_|(_data|)_|,_|(_is|)_|,_|(_just|)_|,_|(_tokens|)_|,_|(_with|)_|,_|(_no|)_|,_|(_pos|)_|,_|(_tag|)_|,_|(_information|)_|,_|(_,|)_|,_|(_where|)_|,_|(_is|)_|,_|(_the|)_|,_|(_information|)_|,_|(_taken|)_|,_|(_then|)_|,_|(_?|)_|,_|(_a|)_|,_|(_new|)_|,_|(_file|)_|,_|(_?|)_|,_|(_run|)_|,_|(_a|)_|,_|(_pos|)_|,_|(_tagging|)_|,_|(_model|)_|,_|(_before|)_|,_|(_training|)_|,_|(_?|)_|,_|(_or|)_|,_|(_?|)_|}_)
(|{_|(_and|)_|,_|(_what|)_|,_|(_is|)_|,_|(_the|)_|,_|(_purpose|)_|,_|(_of|)_|,_|(_the|)_|,_|(_resources|)_|,_|(_(|)_|,_|(_i.e|)_|,_|(_.|)_|,_|(_collection.|)_|,_|(_<|)_|,_|(_string|)_|,_|(_,|)_|,_|(_object|)_|,_|(_>|)_|,_|(_emptymap|)_|,_|(_(|)_|,_|(_)|)_|,_|(_)|)_|,_|(_in|)_|,_|(_the|)_|,_|(_namefinderme.train|)_|,_|(_method|)_|,_|(_?|)_|,_|(_what|)_|,_|(_should|)_|,_|(_be|)_|,_|(_ideally|)_|,_|(_included|)_|,_|(_in|)_|,_|(_there|)_|,_|(_?|)_|}_)
(|{_|(_i|)_|,_|(_just|)_|,_|(_ca|)_|,_|(_n't|)_|,_|(_get|)_|,_|(_these|)_|,_|(_things|)_|,_|(_from|)_|,_|(_the|)_|,_|(_java|)_|,_|(_doc|)_|,_|(_api|)_|,_|(_.|)_|}_)
(|{_|(_in|)_|,_|(_advance|)_|,_|(_!|)_|}_)
(|{_|(_best|)_|,_|(_,|)_|}_)
(|{_|(_svetoslav|)_|}_)

The raw data:
AdditionalContext in NameFinder 
Is there any possibility to use additionalContext with the NameFinderME.train? If so, how? If there isn't maybe this should be an issue to be added in the future releases?
I would REALLY greatly appreciate if someone can help (give me some sample code/show me)  how to add POS tag features while training and testing NameFinder.
If the incoming data is just tokens with NO POS tag information, where is the information taken then? A new file? Run a POS tagging model before training? Or?
And what is the purpose of the resources (i.e. Collection.<String,Object>emptyMap()) in the NameFinderME.train method? What should be ideally included in there?
I just can't get these things from the Java doc API.
 in advance!
Best,
Svetoslav

I want a list of tokens as my final output.

Best Answer

Use REPLACE to strip the '_' and '|' characters, then apply TOKENIZE to get the tokens. Note that REPLACE's second argument is a Java regular expression, so the pipe must be escaped ('\\|').

NEW_TOKENIZED_DATA = FOREACH TOKENIZED_DATA GENERATE REPLACE(REPLACE($0, '_', ''), '\\|', '');
TOKENS = FOREACH NEW_TOKENIZED_DATA GENERATE TOKENIZE($0);
DUMP TOKENS;
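
The escape matters because an unescaped '|' is a regex alternation of two empty strings: it matches only empty strings, so the pipes survive the substitution. The same pitfall can be reproduced locally with Python's re module on one of the mangled lines above:

import re

line = "(|{_|(_in|)_|,_|(_advance|)_|,_|(_!|)_|}_)"

# Unescaped '|' matches only the empty string, so nothing is removed.
print(re.sub('|', '', line))      # output is unchanged

# A character class strips both delimiters in one pass.
print(re.sub(r'[_|]', '', line))  # ({(in),(advance),(!)})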
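
Alternatively, the underscores and pipes look like streaming_python's serialization delimiters leaking through because the UDF returns a Python list while its schema declares a scalar chararray. Declaring a bag-of-tuples output schema should let Pig deserialize the token list natively, making the REPLACE/TOKENIZE cleanup unnecessary. A minimal, untested sketch (the name special_tokenize_bag and the exact schema string are assumptions, not taken from the question):

from pig_util import outputSchema
import nltk

# Declare a bag of one-field tuples so Pig reconstructs the token
# list instead of flattening it into a delimiter-riddled chararray.
@outputSchema('tokens:{(token:chararray)}')
def special_tokenize_bag(text):
    # Wrap each token in a one-element tuple to match the bag schema.
    return [(token,) for token in nltk.word_tokenize(text)]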

For this question on python - Tokenization in Pig (using a Python UDF), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/38585671/
