I have a Spark DataFrame where each row contains a review.
+--------------------+
| reviewText|
+--------------------+
|Spiritually and m...|
|This is one my mu...|
|This book provide...|
|I first read THE ...|
+--------------------+
Here is what I have tried:

SplitSentences = df.withColumn("split_sent", sentencesplit_udf(col('reviewText')))
SplitSentences = SplitSentences.select(SplitSentences.split_sent)

Then I created this function:

def word_count(text):
    return len(text.split())

wordcount_udf = udf(lambda x: word_count(x))

df2 = SplitSentences.withColumn("word_count",
    wordcount_udf(col('split_sent')).cast(IntegerType()))

I want to count the number of words in each sentence of every review (row), but it doesn't work.
Best Answer
You can use the built-in split function to split the text into words, and the built-in size function to get the length of the resulting array:

df.withColumn("word_count", F.size(F.split(df['reviewText'], ' '))).show(truncate=False)

This way you don't need an expensive udf function.
For example, suppose you have the following single-sentence dataframe:
+-----------------------------+
|reviewText |
+-----------------------------+
|this is text testing spliting|
+-----------------------------+
After applying the size and split functions above, you should get:
+-----------------------------+----------+
|reviewText |word_count|
+-----------------------------+----------+
|this is text testing spliting|5 |
+-----------------------------+----------+
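The split-then-count logic can be checked locally with plain Python, since F.split with a single-space pattern behaves like Python's str.split(' ') (this is a minimal sketch, assuming words are separated by single spaces; consecutive spaces would produce empty tokens in both cases):

```python
# Plain-Python check of the split-then-size logic:
# splitting on a single space mirrors F.split(col, ' '),
# and len() of the resulting list mirrors F.size().
text = "this is text testing spliting"
tokens = text.split(' ')
word_count = len(tokens)
print(tokens)       # ['this', 'is', 'text', 'testing', 'spliting']
print(word_count)   # 5
```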
If a row contains multiple sentences, as below:
+----------------------------------------------------------------------------------+
|reviewText |
+----------------------------------------------------------------------------------+
|this is text testing spliting. this is second sentence. And this is the third one.|
+----------------------------------------------------------------------------------+
then you have to write a udf function, as follows:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

def countWordsInEachSentences(array):
    return [len(x.split()) for x in array]

# Split each review into sentences on '. ', then count the words per sentence.
# Declaring the return type as ArrayType(IntegerType()) keeps the column typed
# as an integer array instead of the udf's default StringType.
countWordsSentences = F.udf(lambda x: countWordsInEachSentences(x.split('. ')), ArrayType(IntegerType()))

df.withColumn("word_count", countWordsSentences(df['reviewText'])).show(truncate=False)
which should give you:
+----------------------------------------------------------------------------------+----------+
|reviewText |word_count|
+----------------------------------------------------------------------------------+----------+
|this is text testing spliting. this is second sentence. And this is the third one.|[5, 4, 6] |
+----------------------------------------------------------------------------------+----------+
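Since the body of the udf is plain Python, its per-row logic can be verified without a Spark session (a local check, not part of the original answer):

```python
# Local check of the udf body: split a review into sentences on '. ',
# then count the words in each sentence.
def countWordsInEachSentences(array):
    return [len(x.split()) for x in array]

review = "this is text testing spliting. this is second sentence. And this is the third one."
print(countWordsInEachSentences(review.split('. ')))  # [5, 4, 6]
```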
I hope my answer helps.
Regarding python - Spark Dataframes count number of words in each sentence, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49267331/