I have a dataframe like the one below. I want to group the rows by url and status and split the records by date. Is there a more efficient way to do this?
def transform_to_unique(df):
    test = []
    counter = 0
    # first row: compare against the second row
    if df.loc[0, 'status'] != df.loc[1, 'status']:
        counter = counter + 1
    test.append(counter)
    for i in range(1, len(df)):
        if df.loc[i-1, 'url'] != df.loc[i, 'url']:
            counter = 0
        if df.loc[i-1, 'status'] != df.loc[i, 'status']:
            counter = counter + 1
        test.append(counter)
    df['test'] = pd.Series(test)
    return df
df = transform_to_unique(frame)
df_g = df.groupby(['url', 'status', 'test'])['date_scraped'].agg(['min', 'max'])
Here is the dataframe:
1000,20191109,active
1000,20191108,inactive
2000,20191109,active
2000,20191101,inactive
351,20191109,active
351,20191102,active
351,20191026,active
351,20191019,active
351,20191012,active
351,20191005,active
351,20190928,inactive
351,20190921,inactive
351,20190914,inactive
351,20190907,active
351,20190831,active
351,20190615,inactive
3000,20200101,active
import pandas as pd
frame = pd.read_clipboard(sep=",", header=None)
frame.columns = ['url', 'date_scraped', 'status']
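If the sample data is not on the clipboard, the same frame can be built directly from the pasted text with `io.StringIO` (a sketch using a subset of the records above; the column names follow the question):

```python
import io
import pandas as pd

# A few of the sample records from the question, one per line
data = """1000,20191109,active
1000,20191108,inactive
2000,20191109,active
2000,20191101,inactive
351,20191109,active
3000,20200101,active"""

# read_csv accepts any file-like object, so StringIO stands in for the clipboard
frame = pd.read_csv(io.StringIO(data), header=None,
                    names=['url', 'date_scraped', 'status'])
print(frame.shape)  # (6, 3)
```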
Best answer
I'm not sure I fully understood the purpose of the test column, but is this what you are trying to achieve (based on the sample data you posted)?
import numpy as np
df.sort_values(["url", "date_scraped"], axis=0, ascending=True, inplace=True)
df["date_scraped_till"] = np.where(df["url"] == df["url"].shift(-1),
                                   df["date_scraped"].shift(-1), 0).astype(np.int32)
Output:
url date_scraped status date_scraped_till
15 351 20190615 inactive 20190831
14 351 20190831 active 20190907
13 351 20190907 active 20190914
12 351 20190914 inactive 20190921
11 351 20190921 inactive 20190928
10 351 20190928 inactive 20191005
9 351 20191005 active 20191012
8 351 20191012 active 20191019
7 351 20191019 active 20191026
6 351 20191026 active 20191102
5 351 20191102 active 20191109
4 351 20191109 active 0
1 1000 20191108 inactive 20191109
0 1000 20191109 active 0
3 2000 20191101 inactive 20191109
2 2000 20191109 active 0
16 3000 20200101 active 0
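The `shift(-1)` comparison works because shifting by -1 aligns each row with the one below it, so each scrape date can be paired with the next scrape date of the same url. A minimal sketch on toy data (not the full frame from the question):

```python
import numpy as np
import pandas as pd

# Toy frame: two urls, already sorted by url and date_scraped
df = pd.DataFrame({'url': [351, 351, 1000],
                   'date_scraped': [20190615, 20190831, 20191108]})

# shift(-1) moves the column up one row, so each row sees its successor;
# the last row of each url has no successor and falls back to 0
same_url = df['url'] == df['url'].shift(-1)
df['date_scraped_till'] = np.where(same_url,
                                   df['date_scraped'].shift(-1), 0).astype(np.int32)
print(df['date_scraped_till'].tolist())  # [20190831, 0, 0]
```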
Edit
If by "split" you actually meant "collapse", then this should do the trick (it is essentially a more efficient way of computing the test column):
import numpy as np
df.sort_values(["url", "date_scraped"], axis=0, ascending=True, inplace=True)
df["test"] = np.where((df["url"] == df["url"].shift(1)) & (df["status"] == df["status"].shift(1)), 0, 1)
df["test"] = df.groupby(["url", "status", "test"])["test"].cumsum().replace(to_replace=0, method='ffill')
df_g = df.groupby(['url', 'status', 'test'])['date_scraped'].agg(['max', 'min'])
Output:
max min
url status test
351 active 1 20190907 20190831
2 20191109 20191005
inactive 1 20190615 20190615
2 20190928 20190914
1000 active 1 20191109 20191109
inactive 1 20191108 20191108
2000 active 1 20191109 20191109
inactive 1 20191101 20191101
3000 active 1 20200101 20200101
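A common variant of this vectorized approach is the "run id" idiom: flag every row where (url, status) differs from the previous row, then take a cumulative sum so each block of consecutive equal rows gets one label. A sketch on toy data (not the answer's exact code, which spreads the labeling over two steps):

```python
import pandas as pd

# Toy frame, already sorted by url and date_scraped
df = pd.DataFrame({
    'url':          [351, 351, 351, 351, 1000],
    'status':       ['active', 'active', 'inactive', 'active', 'active'],
    'date_scraped': [20190831, 20190907, 20190914, 20191005, 20191109],
})

# True wherever a new run starts (url or status changed vs. the previous row);
# cumsum turns those flags into an increasing run id
change = (df['url'] != df['url'].shift()) | (df['status'] != df['status'].shift())
df['run'] = change.cumsum()

# Collapse each run to its first and last scrape date
out = df.groupby(['url', 'status', 'run'])['date_scraped'].agg(['min', 'max'])
print(out)
```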
A similar Stack Overflow question about iterating over dataframe rows (comparing values between them) and producing grouped output: https://stackoverflow.com/questions/59460167/