python - Filter DataFrame rows that are lists

Tags: python pandas dataframe filter

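Data structures:

  • A pandas DataFrame (business_df) with a column of interest (categories) whose values are lists

  • A list of restaurant categories (restaurant_categories_list)

What I am trying to do:

Filter the businesses in business_df on the categories column (which holds lists), classifying a business as a restaurant if at least one of its listed categories matches at least one restaurant category.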

I checked these two questions, but they do not answer my problem:

Filter dataframe rows if value in column is in a set list of values

use a list of values to select rows from a pandas dataframe

This is the code I am currently running:

restaurant_categories_list = ['Soup', 'Sandwiches', 'Salad', 'Restaurants', 'Burgers', 'Breakfast & Brunch']
print(business_df.loc[business_df['categories'].isin(restaurant_categories_list)])

This is the column I am interested in:

0                          ['Fast Food', 'Restaurants']
1                                         ['Nightlife']
2                         ['Auto Repair', 'Automotive']
3                  ['Active Life', 'Mini Golf', 'Golf']
4     ['Shopping', 'Home Services', 'Internet Servic...
5     ['Bars', 'American (New)', 'Nightlife', 'Loung...
6     ['Active Life', 'Trainers', 'Fitness & Instruc...
7     ['Bars', 'American (Traditional)', 'Nightlife'...
8                ['Auto Repair', 'Automotive', 'Tires']
9                          ['Active Life', 'Mini Golf']
10                     ['Home Services', 'Contractors']
11                            ['Veterinarians', 'Pets']
12        ['Libraries', 'Public Services & Government']
13              ['Automotive', 'Auto Parts & Supplies']
14    ['Burgers', 'Breakfast & Brunch', 'American (T...

So, if I were working with only these rows, my expected DataFrame should contain only rows 0 and 14.

Best answer

Update:

This version uses ast.literal_eval() to deserialize the lists from their string form, and it appears to work:

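import ast
import pandas as pd

restaurant_categories_list = ['Soup', 'Sandwiches', 'Salad', 'Restaurants', 'Burgers', 'Breakfast & Brunch']

df_orig = pd.read_csv('yelp_academic_dataset_business.csv', low_memory=False)

# Drop rows with no categories; literal_eval cannot parse NaN
df = df_orig[pd.notnull(df_orig['categories'])]

# Parse each string into a list, spread the lists across columns,
# then flag rows that contain at least one restaurant category
mask = df['categories'].apply(ast.literal_eval).apply(pd.Series).isin(restaurant_categories_list).sum(axis=1) > 0

# .loc replaces .ix, which was deprecated and later removed from pandas
print(df.loc[mask, ['categories']])
df[mask].to_csv('result.csv', index=False)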
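
To make the deserialization step concrete, here is a minimal sketch of what ast.literal_eval() does to a single cell. The cell value is copied from row 0 of the sample column; the assumption is that read_csv returns it as a plain string rather than a list.

```python
import ast

# One cell as it would come out of the CSV: a string, not a list (assumption).
raw_cell = "['Fast Food', 'Restaurants']"

# literal_eval parses Python literals only; it does not execute arbitrary code.
parsed = ast.literal_eval(raw_cell)

print(type(raw_cell).__name__)  # str
print(type(parsed).__name__)    # list
print(parsed)                   # ['Fast Food', 'Restaurants']
```

Only after this step can the individual categories be compared one by one.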

But as @CorleyBrigman already said, working with data structured like this in pandas is difficult and very inefficient...
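
One possible way to avoid building the wide intermediate DataFrame is to test each list against a set directly. This is only a sketch, not part of the original answer: the three-row DataFrame is a hypothetical stand-in for the already parsed categories column, and I have not benchmarked it against the version above.

```python
import pandas as pd

# A set makes each membership check cheap.
restaurant_categories = {'Soup', 'Sandwiches', 'Salad', 'Restaurants',
                         'Burgers', 'Breakfast & Brunch'}

# Hypothetical sample standing in for the parsed 'categories' column.
df = pd.DataFrame({'categories': [
    ['Fast Food', 'Restaurants'],
    ['Nightlife'],
    ['Burgers', 'Breakfast & Brunch', 'American (Traditional)'],
]})

# True when the row's list shares at least one element with the set.
mask = df['categories'].apply(
    lambda cats: not restaurant_categories.isdisjoint(cats)
)

restaurants = df[mask]
print(restaurants.index.tolist())  # [0, 2]
```

The mask is still computed row by row in Python, so the caveat about list-valued columns being awkward in pandas still applies.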

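Old answer, based on the sample data:

You can expand the lists into columns and then use DataFrame.isin() to produce a matrix of True/False values, which can be summed row by row (because in Python False == 0 and True == 1):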

mask = df['business'].apply(pd.Series).isin(restaurant_categories_list).sum(axis=1) > 0
print(df[(mask)])

Explanation:

print(df['business'].apply(pd.Series))

                0                             1                  2      3
0       Fast Food                   Restaurants                NaN    NaN
1       Nightlife                           NaN                NaN    NaN
2     Auto Repair                    Automotive                NaN    NaN
3     Active Life                     Mini Golf               Golf    NaN
4        Shopping                 Home Services    Internet Servic    NaN
5            Bars                American (New)          Nightlife  Loung
6     Active Life                      Trainers  Fitness & Instruc    NaN
7            Bars        American (Traditional)          Nightlife    NaN
8     Auto Repair                    Automotive              Tires    NaN
9     Active Life                     Mini Golf                NaN    NaN
10  Home Services                   Contractors                NaN    NaN
11  Veterinarians                          Pets                NaN    NaN
12      Libraries  Public Services & Government                NaN    NaN
13     Automotive         Auto Parts & Supplies                NaN    NaN
14        Burgers            Breakfast & Brunch           American    NaN

Then:

print(df['business'].apply(pd.Series).isin(restaurant_categories_list))

Output:

        0      1      2      3
0   False   True  False  False
1   False  False  False  False
2   False  False  False  False
3   False  False  False  False
4   False  False  False  False
5   False  False  False  False
6   False  False  False  False
7   False  False  False  False
8   False  False  False  False
9   False  False  False  False
10  False  False  False  False
11  False  False  False  False
12  False  False  False  False
13  False  False  False  False
14   True   True  False  False

Then:

mask = df['business'].apply(pd.Series).isin(restaurant_categories_list).sum(axis=1) > 0
print(mask)

Output:

0      True
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14     True
dtype: bool

Finally:

print(df[(mask)])

Output:

                                   business
0                  [Fast Food, Restaurants]
14  [Burgers, Breakfast & Brunch, American]
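
The steps above can be reproduced end to end with a self-contained sketch. It uses only five of the sample rows, so the matching row that has index 14 in the full sample gets index 4 here; the column name 'business' follows the answer, and the intermediate names expanded and hits are introduced only for illustration.

```python
import pandas as pd

restaurant_categories_list = ['Soup', 'Sandwiches', 'Salad', 'Restaurants',
                              'Burgers', 'Breakfast & Brunch']

# A few rows taken from the sample data above.
df = pd.DataFrame({'business': [
    ['Fast Food', 'Restaurants'],
    ['Nightlife'],
    ['Auto Repair', 'Automotive'],
    ['Active Life', 'Mini Golf', 'Golf'],
    ['Burgers', 'Breakfast & Brunch', 'American'],
]})

# Step 1: spread each list across columns (shorter lists are padded with NaN).
expanded = df['business'].apply(pd.Series)

# Step 2: element-wise membership test against the category list.
hits = expanded.isin(restaurant_categories_list)

# Step 3: a row matches if it contains at least one True.
mask = hits.sum(axis=1) > 0

print(df[mask])
```

Writing `hits.any(axis=1)` in step 3 would give the same mask and states the intent a little more directly.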

Regarding python - Filter DataFrame rows that are lists, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35971335/
