python - Data manipulation in Pandas/Python

This seems like a simple data manipulation task, but I'm stuck on it.

I have a recommendation dataset for an event.

Masteruserid content 

1             100
1             101
1             102
2             100
2             101
2             110

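Now we want to recommend at least 5 content items to each user. For example, Masteruserid 1 has three recommendations, so I want to pick the remaining two at random from the globally viewed content, which is a separate dataset (a list). I also have to check for duplicates, in case a randomly selected item already exists for that user in the original dataset.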

global_content
100
300
301
101

In reality I have more than 4,000 Masteruserids. I need help with how to approach this problem.

Best answer

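import numpy as np
import pandas as pd

# df and gc are the two datasets from the question; gc is assumed to be
# a one-column DataFrame or a Series holding the global content ids.
def add_content(df, gc, k=5):
    n = len(df)  # recommendations this user already has
    if n >= k:
        return df  # already enough rows; return the group unchanged
    # Global items this user does not have yet (this avoids duplicates)
    choices = list(set(gc.squeeze()).difference(df.content))
    # Draw the missing items without replacement; requires len(choices) >= k - n
    mc = np.random.choice(choices, k - n, replace=False)
    ids = np.repeat(df.Masteruserid.iloc[-1], k - n)
    data = dict(Masteruserid=ids, content=mc)
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    return pd.concat([df, pd.DataFrame(data)], ignore_index=True)


# Iterating over the groups keeps the Masteruserid column in each sub-frame
parts = [add_content(g, gc) for _, g in df.groupby('Masteruserid')]
result = pd.concat(parts, ignore_index=True)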
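
To try the idea on the sample data from the question, here is a minimal self-contained sketch. The function name `top_up`, the `seed` parameter, and the fallback for a pool that is too small are assumptions made for illustration and are not part of the original answer. It pads each user to `k` rows with global items that user does not already have, and adds as many as are available when the pool runs short.

```python
import numpy as np
import pandas as pd


def top_up(df, global_content, k=5, seed=None):
    """Pad each user's recommendations to k rows using unseen global items."""
    rng = np.random.default_rng(seed)
    # De-duplicate the global list once, keeping it as a NumPy array
    pool = pd.unique(pd.Series(global_content))
    parts = []
    for uid, grp in df.groupby("Masteruserid", sort=False):
        parts.append(grp)
        missing = k - len(grp)
        if missing <= 0:
            continue
        # Candidates are global items this user does not have yet
        candidates = pool[~np.isin(pool, grp["content"].to_numpy())]
        # Assumption: if the pool is too small, add as many as are available
        take = min(missing, len(candidates))
        if take == 0:
            continue
        picks = rng.choice(candidates, size=take, replace=False)
        parts.append(pd.DataFrame({"Masteruserid": uid, "content": picks}))
    return pd.concat(parts, ignore_index=True)


# Sample data copied from the question
df = pd.DataFrame({
    "Masteruserid": [1, 1, 1, 2, 2, 2],
    "content": [100, 101, 102, 100, 101, 110],
})
global_content = [100, 300, 301, 101]

result = top_up(df, global_content, k=5, seed=0)
print(result)
```

With this sample data each user has exactly two unseen global items (300 and 301), so both are added regardless of the seed. With about 4,000 users a plain loop over the groups should be fast enough, since each iteration only touches a handful of rows.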

For "python - Data manipulation in Pandas/Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39103090/
