python - Load a random sample from a CSV using pandas

I have a CSV in the following format:

Team, Player

What I want to do is filter on the Team field and then take a random subset of 3 players from each team.

For example, my CSV looks like:

Man Utd, Ryan Giggs
Man Utd, Paul Scholes
Man Utd, Paul Ince
Man Utd, Danny Pugh
Liverpool, Steven Gerrard
Liverpool, Kenny Dalglish
...

I want to end up with an XLS containing 3 random players from each team, or only 1 or 2 if a team has fewer than 3 players, e.g.:

Man Utd, Paul Scholes
Man Utd, Paul Ince
Man Utd, Danny Pugh
Liverpool, Steven Gerrard
Liverpool, Kenny Dalglish

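I started out using XLRD; my original post is here.

I am now trying to use Pandas instead, as I believe it will be more flexible in the future.

So, in pseudocode, what I want to do is: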

foreach(team in csv)
   print random 3 players + team they are assigned to

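I have been looking through Pandas trying to find the best way to do this, but I can't find anything similar to what I want to do (it's a difficult thing to Google!). Here is my attempt so far: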

import pandas as pd
from collections import defaultdict
import csv as csv


columns = defaultdict(list) # each value in each column is appended to a list

with open('C:\\Users\\ADMIN\\Desktop\\CSV_1.csv') as f:
    reader = csv.DictReader(f) # read rows into a dictionary format
    for row in reader: # read a row as {column1: value1, column2: value2,...}
        print(row)
        #for (k,v) in row.items(): # go over each column name and value
        #    columns[k].append(v) # append the value into the appropriate list
                                 # based on column name k

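I commented out the last few lines because I wasn't sure whether I needed them. I am now printing each row, so I just need to select 3 random rows per football team (or 1 or 2 if there are fewer).

How can I do this? Any tips/tricks?

Thanks.

Best answer

First, use the better-optimized read_csv: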

import pandas as pd

df = pd.read_csv('DataFrame')
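As a small sketch of this loading step, the snippet below reads data shaped like the question's sample and filters on the Team column. The inline CSV text and the variable names are assumptions made for illustration; with a real file you would pass its path to read_csv instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the file on disk (header plus a few rows).
csv_text = (
    "Team, Player\n"
    "Man Utd, Ryan Giggs\n"
    "Man Utd, Paul Scholes\n"
    "Liverpool, Steven Gerrard\n"
)

# skipinitialspace=True drops the blank that follows each comma.
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
df.columns = df.columns.str.strip()  # guard against stray spaces in the header

# Filter on the Team column with a boolean mask.
man_utd = df[df["Team"] == "Man Utd"]
print(man_utd)
```

The spaces after the commas in the sample data are why skipinitialspace is used here; without it the second column would be named " Player" with a leading blank.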

Now, as a random example, get a random subset by indexing the dataframe with random integers (replace 'x' with LivFC, for example):

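In []
import numpy as np

df = pd.DataFrame()
df['x'] = np.arange(0, 10, 1)
df['y'] = np.arange(0, 10, 1)
df['x'] = df['x'].astype(str)
df['y'] = df['y'].astype(str)

# Draw 10 random row positions (repeats possible), then keep the first 3.
# randint excludes the upper bound, so every position is a valid row.
df['x'].iloc[np.random.randint(0, len(df), 10)][:3]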

Out [382]:
0    0
3    3
7    7
Name: x, dtype: object
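Drawing random integers this way can pick the same row more than once. If distinct rows are wanted, one possible variant (a sketch, not part of the original answer) is to draw positions without replacement using NumPy's Generator.choice. The seed and the variable names here are assumptions chosen only to make the example repeatable.

```python
import numpy as np
import pandas as pd

# Same toy frame as above: two string columns holding "0".."9".
df = pd.DataFrame({
    "x": np.arange(10).astype(str),
    "y": np.arange(10).astype(str),
})

rng = np.random.default_rng(seed=0)  # seed is arbitrary, for repeatability
positions = rng.choice(len(df), size=3, replace=False)  # 3 distinct positions
subset = df["x"].iloc[positions]
print(subset)
```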

This should make you more familiar with pandas; however, starting with version 0.16.x there is now a built-in DataFrame.sample method:

df = pd.DataFrame(data)

# Randomly sample 70% of your dataframe
df_percent = df.sample(frac=0.7)

# Randomly sample 7 elements from your dataframe
df_7 = df.sample(n=7)

For either approach above, you can get the rest of the rows by doing:

df_rest = df.loc[~df.index.isin(df_percent.index)]
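To connect DataFrame.sample back to the question (up to 3 random players per team), one possible sketch is to shuffle the whole frame and then keep the first few rows of each team. The inline CSV text, the function name sample_per_team and the seed are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the CSV file from the question.
CSV_TEXT = """Team, Player
Man Utd, Ryan Giggs
Man Utd, Paul Scholes
Man Utd, Paul Ince
Man Utd, Danny Pugh
Liverpool, Steven Gerrard
Liverpool, Kenny Dalglish
"""


def sample_per_team(df, n=3, seed=None):
    """Return up to n randomly chosen rows for each team."""
    shuffled = df.sample(frac=1, random_state=seed)  # shuffle all rows
    picked = shuffled.groupby("Team").head(n)  # first n shuffled rows per team
    return picked.sort_values("Team", kind="stable").reset_index(drop=True)


# skipinitialspace drops the blank after each comma ("Team, Player").
players = pd.read_csv(io.StringIO(CSV_TEXT), skipinitialspace=True)
players.columns = players.columns.str.strip()

result = sample_per_team(players, n=3, seed=0)
print(result)
```

A plain groupby("Team").sample(n=3) would raise an error for teams with fewer than 3 players, which is why this sketch shuffles first and then uses head(n). The result could then be written out with result.to_excel(...), which needs an Excel writer package such as openpyxl to be installed.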

Regarding python - Load a random sample from a CSV using pandas, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42486085/
