python - How to create a filtered DataFrame with minimal code

Tags: python pandas indexing dataframe conditional-statements

There are four cars: bmw, geo, vw and porsche:

import pandas as pd
df = pd.DataFrame({
    'car':      ['bmw','geo','vw','porsche'],
    'warranty': ['yes','yes','yes','no'], 
    'dvd':      ['yes','yes','no','yes'], 
    'sunroof':  ['yes','no','no','no']})


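I would like to create a filtered DataFrame that lists only the cars that have all three features: DVD player, sunroof and warranty (we know that here it is the BMW that has all features set to 'yes').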

I can do it one column at a time:

cars_with_warranty = df['car'][df['warranty']=='yes']
print(cars_with_warranty)


Then I need to repeat the same selection for the dvd and sunroof columns:

cars_with_dvd = df['car'][df['dvd']=='yes']
cars_with_sunroof = df['car'][df['sunroof']=='yes']

I wonder if there is a clever way to create the filtered DataFrame.

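Edited later:

The solution posted works well, but the resulting cars_with_all_three is a plain list. We need a DataFrame object with the single 'bmw' car as its only row and all three columns in place: dvd, sunroof and warranty (with all three values set to 'yes').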

cars_with_all_three = []
for ind, car in enumerate(df['car']):
    if df['dvd'][ind] == df['warranty'][ind] == df['sunroof'][ind] == 'yes':
        cars_with_all_three.append(car)

Best answer

You can use boolean indexing:

print ((df.dvd == 'yes') & (df.sunroof == 'yes') & (df.warranty == 'yes'))
0     True
1    False
2    False
3    False
dtype: bool

print (df[(df.dvd == 'yes') & (df.sunroof == 'yes') & (df.warranty == 'yes')])
   car  dvd sunroof warranty
0  bmw  yes     yes      yes

# to select only the 'car' column (.loc replaces .ix, which was removed in pandas 1.0)
print (df.loc[(df.dvd == 'yes')&(df.sunroof == 'yes')&(df.warranty == 'yes'), 'car'])
0    bmw
Name: car, dtype: object
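
As a minimal sketch of the same boolean-indexing idea, the mask can be built once and reused for both the full-row and the single-column selection. The variable names mask, filtered and cars are illustrative only and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'car':      ['bmw', 'geo', 'vw', 'porsche'],
    'warranty': ['yes', 'yes', 'yes', 'no'],
    'dvd':      ['yes', 'yes', 'no', 'yes'],
    'sunroof':  ['yes', 'no', 'no', 'no']})

# Each comparison yields a boolean Series; the parentheses are required
# because & binds more tightly than == in Python.
mask = (df['dvd'] == 'yes') & (df['sunroof'] == 'yes') & (df['warranty'] == 'yes')

filtered = df[mask]         # DataFrame with every column, matching rows only
cars = df.loc[mask, 'car']  # Series holding just the 'car' column

print(filtered)
print(cars)
```

Indexing with df[mask] keeps the result a DataFrame, which is what the question's later edit asks for.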

Another solution: compare the selected columns with 'yes', then use all to check that every value in a row is True:

print ((df[[ u'dvd', u'sunroof', u'warranty']] == "yes").all(axis=1))
0     True
1    False
2    False
3    False
dtype: bool

print (df[(df[[ u'dvd', u'sunroof', u'warranty']] == "yes").all(axis=1)])
   car  dvd sunroof warranty
0  bmw  yes     yes      yes

print (df.loc[(df[[ u'dvd', u'sunroof', u'warranty']] == "yes").all(axis=1), 'car'])
0    bmw
Name: car, dtype: object
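
The all(axis=1) approach generalizes to any list of feature columns. The helper below is a hypothetical sketch (the name rows_where_all_equal and its parameters are assumptions, not from the original answer) that wraps the comparison so the column list is passed in rather than hard-coded.

```python
import pandas as pd

def rows_where_all_equal(df, columns, value='yes'):
    """Return the rows of df in which every listed column equals value.

    Hypothetical helper for illustration only.
    """
    # .eq(value) is the method form of ==; .all(axis=1) reduces each row
    # to a single True/False.
    mask = df[columns].eq(value).all(axis=1)
    return df[mask]

df = pd.DataFrame({
    'car':      ['bmw', 'geo', 'vw', 'porsche'],
    'warranty': ['yes', 'yes', 'yes', 'no'],
    'dvd':      ['yes', 'yes', 'no', 'yes'],
    'sunroof':  ['yes', 'no', 'no', 'no']})

result = rows_where_all_equal(df, ['dvd', 'sunroof', 'warranty'])
print(result)
```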

If the DataFrame has only these 4 columns, as in the sample, the solution with the least code is:

print (df[(df.set_index('car') == 'yes').all(1).values])
   car  dvd sunroof warranty
0  bmw  yes     yes      yes
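
A short sketch of why the one-liner above ends in .values: after set_index('car') the boolean result is labelled by car name, while df itself is still labelled 0 to 3, so the mask is converted to a plain array and applied by position. This version uses to_numpy(), which gives the same array as .values; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'car':      ['bmw', 'geo', 'vw', 'porsche'],
    'warranty': ['yes', 'yes', 'yes', 'no'],
    'dvd':      ['yes', 'yes', 'no', 'yes'],
    'sunroof':  ['yes', 'no', 'no', 'no']})

# Moving 'car' into the index leaves only the three feature columns,
# so the comparison needs no explicit column list.
mask = (df.set_index('car') == 'yes').all(axis=1)

# The mask is labelled by car name, not by df's integer index.
print(mask.index.tolist())

# Dropping the labels lets the mask be applied positionally to df.
filtered = df[mask.to_numpy()]
print(filtered)
```

The shortcut only works when every column other than car is a feature to test; any additional column would also be compared with 'yes'.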

Timings:

In [44]: %timeit ([car for ind, car in enumerate(df['car']) if df['dvd'][ind] == df['warranty'][ind] == df['sunroof'][ind] == 'yes'])
10 loops, best of 3: 120 ms per loop

In [45]: %timeit (df[(df.dvd == 'yes')&(df.sunroof == 'yes')&(df.warranty == 'yes')])
The slowest run took 4.39 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.09 ms per loop

In [46]: %timeit (df[(df[[ u'dvd', u'sunroof', u'warranty']] == "yes").all(axis=1)])
1000 loops, best of 3: 1.53 ms per loop

In [47]: %timeit (df[(df.ix[:, [u'dvd', u'sunroof', u'warranty']] == "yes").all(axis=1)])
The slowest run took 4.46 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.51 ms per loop

In [48]: %timeit (df[(df.set_index('car') == 'yes').all(1).values])
1000 loops, best of 3: 1.64 ms per loop

In [49]: %timeit (mer(df))
The slowest run took 4.17 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 3.85 ms per loop

Code used for the timings:

df = pd.DataFrame({
    'car':      ['bmw','geo','vw','porsche'],
    'warranty': ['yes','yes','yes','no'], 
    'dvd':      ['yes','yes','no','yes'], 
    'sunroof':  ['yes','no','no','no']})

print (df)
df = pd.concat([df]*1000).reset_index(drop=True)

def mer(df):
    df = df.set_index('car')
    return df[df[[ u'dvd', u'sunroof', u'warranty']] == "yes"].dropna().reset_index()
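
Outside IPython, where %timeit is not available, a comparable measurement can be sketched with the standard-library timeit module. The function names chained_masks and all_rowwise and the repetition count are assumptions chosen for illustration; absolute numbers will differ from those shown above depending on hardware and pandas version.

```python
import timeit
import pandas as pd

df = pd.DataFrame({
    'car':      ['bmw', 'geo', 'vw', 'porsche'],
    'warranty': ['yes', 'yes', 'yes', 'no'],
    'dvd':      ['yes', 'yes', 'no', 'yes'],
    'sunroof':  ['yes', 'no', 'no', 'no']})

# Same enlargement as in the timing setup: 4 rows repeated 1000 times.
big = pd.concat([df] * 1000).reset_index(drop=True)

features = ['dvd', 'sunroof', 'warranty']

def chained_masks(frame):
    # Three separate comparisons combined with &
    return frame[(frame.dvd == 'yes') & (frame.sunroof == 'yes') & (frame.warranty == 'yes')]

def all_rowwise(frame):
    # One comparison on a column subset, reduced per row with all
    return frame[(frame[features] == 'yes').all(axis=1)]

repeats = 20  # arbitrary; raise it for steadier numbers
results = {}
for fn in (chained_masks, all_rowwise):
    total = timeit.timeit(lambda: fn(big), number=repeats)
    results[fn.__name__] = total / repeats
    print(f"{fn.__name__}: {results[fn.__name__] * 1e3:.3f} ms per call")
```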

Regarding python - How to create a filtered DataFrame with minimal code, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39240820/
