I am using Spark 2.0.1 and Python 2.7. I am running the following code:
# This will return a new DF with all the columns + id
data1 = data.withColumn("id", monotonically_increasing_id()) # Create an integer index
data1.show()

def create_indexes(df,
                   fields=['country', 'state_id', 'airport', 'airport_id']):
    """ Create indexes for the different element ids
        for CMRs. This allows us to select CMRs that match
        a given element and element value very quickly.
    """
    if fields == None:
        print("No fields specified, returning")
        return
    for field in fields:
        if field not in df.columns:
            print('field: ', field, " is not in the data...")
            return
    indexes = {}
    for field in fields:
        print(field)
        res = df.groupby(field)
        index = {label: np.array(vals['id'], np.int32) for label, vals in res}
        indexes[field] = index
    return indexes

# Create indexes. Some of them take a lot of time!
#Changed dom_client_id by gbl_buy_grp_id as it was changed in Line Number
indexes = create_indexes(data1, fields=['country', 'state_id', 'airport', 'airport_id'])
print type(indexes)
Running this code produces the following error message:
TypeError: 'GroupedData' object is not iterable
Can you help me solve this problem?
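Best answer
You have to perform an aggregation on the GroupedData and collect the result before you can iterate over it. Unlike a pandas groupby, df.groupby(field) in PySpark returns a GroupedData object that only describes the grouping; it holds no rows until an aggregation turns it back into a DataFrame. For example, to count the items in each group: res = df.groupby(field).count().collect()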
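Applied to the question's create_indexes, the same idea might look like the sketch below. It is only one possible approach, not necessarily what the asker ended up using: it assumes the DataFrame already has the "id" column from the question, uses collect_list to gather the ids of each group, and introduces a hypothetical helper, rows_to_index, that is not part of PySpark. It also assumes the collected result fits in driver memory.

```python
def rows_to_index(rows):
    """Hypothetical helper: turn collected (label, ids) rows into {label: list of ids}.

    pyspark Row objects are tuples, so a two-column Row unpacks like a pair.
    """
    return {label: list(ids) for label, ids in rows}


def create_indexes(df, fields):
    """Build {field: {label: [ids]}} from a Spark DataFrame that has an 'id' column."""
    # Imported here so rows_to_index can be tried out without Spark installed.
    from pyspark.sql import functions as F

    missing = [f for f in fields if f not in df.columns]
    if missing:
        raise ValueError("fields not in the data: %s" % missing)

    indexes = {}
    for field in fields:
        # agg() turns the GroupedData back into a DataFrame with one row per
        # group; collect() brings those rows to the driver as an iterable list.
        rows = (df.groupBy(field)
                  .agg(F.collect_list("id").alias("ids"))
                  .collect())
        indexes[field] = rows_to_index(rows)
    return indexes
```

If NumPy arrays are needed as in the question, each list can be wrapped with np.array(ids, np.int64). Note that monotonically_increasing_id produces 64-bit values, so np.int32 may be too small once the data spans more than one partition.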
Source: a similar question on Stack Overflow, "python - TypeError: 'GroupedData' object is not iterable in pyspark": https://stackoverflow.com/questions/46791254/