I am trying to remove stop words with Spark. My code is as follows:
```python
from nltk.corpus import stopwords
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext('local')
spark = SparkSession(sc)

word_list = ["ourselves", "out", "over", "own", "same", "shan't", "she", "she'd",
             "what", "the", "fuck", "is", "this", "world", "too", "who", "who's",
             "whom", "yours", "yourself", "yourselves"]
wordlist = spark.createDataFrame([word_list]).rdd

def stopwords_delete(word_list):
    filtered_words = []
    print(word_list)
    for word in word_list:
        print(word)
        if word not in stopwords.words('english'):
            filtered_words.append(word)
    return filtered_words

filtered_words = wordlist.map(stopwords_delete)
print(filtered_words)
```
I get the following error:

pickle.PicklingError: args[0] from __newobj__ args has the wrong class

I don't know why. Can anyone help me? Thanks in advance.
Best answer
The problem is related to shipping the stopwords module to the workers. As a workaround, import the stopwords corpus inside the function itself. See the similar question linked below; I ran into the same problem and this workaround fixed it:
```python
def stopwords_delete(word_list):
    # Importing here means the nltk module itself is never pickled
    # with the closure that Spark sends to the workers.
    from nltk.corpus import stopwords
    filtered_words = []
    print(word_list)
    for word in word_list:
        if word not in stopwords.words('english'):
            filtered_words.append(word)
    return filtered_words
```
As a permanent fix, I would recommend from pyspark.ml.feature import StopWordsRemover.
A similar question about python - pickle.PicklingError: args[0] from __newobj__ args has the wrong class with hadoop python can be found on Stack Overflow: https://stackoverflow.com/questions/44911539/