Python DataFrame: iterate over rows (comparing values between them) and prepare a group as output

Tags: python pandas dataframe

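I have a data frame like the one below. I want to group the rows by url and status and split the records by date. Is there a more efficient way to do this?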

import pandas as pd

def transform_to_unique(df):
    test = []
    counter = 0

    # first row
    if df.loc[0, 'status'] != df.loc[1, 'status']:
        counter = counter + 1
    test.append(counter)

    for i in range(1, len(df)):

        # reset the counter when the url changes
        if df.loc[i-1, 'url'] != df.loc[i, 'url']:
            counter = 0

        # bump the counter when the status changes
        if df.loc[i-1, 'status'] != df.loc[i, 'status']:
            counter = counter + 1
        test.append(counter)

    df['test'] = pd.Series(test)

    return df

df = transform_to_unique(frame)

# a list of function names gives a stable column order (a set does not)
df_g = df.groupby(['url', 'status', 'test'])['date_scraped'].agg(['min', 'max'])

output from script

Here is the data frame:

1000,20191109,active
1000,20191108,inactive
2000,20191109,active
2000,20191101,inactive
351,20191109,active
351,20191102,active
351,20191026,active
351,20191019,active
351,20191012,active
351,20191005,active
351,20190928,inactive
351,20190921,inactive
351,20190914,inactive
351,20190907,active
351,20190831,active
351,20190615,inactive
3000,20200101,active

import pandas as pd
frame = pd.read_clipboard(sep=",", header=None)
frame.columns = ['url', 'date_scraped', 'status']
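
For anyone who wants to reproduce this without copying the rows to the clipboard, here is a minimal sketch that loads the same sample from an inline string. The variable name raw and the use of io.StringIO are my own choices, not part of the original question.

```python
import io
import pandas as pd

# Sample rows copied from the question: url, date_scraped, status.
raw = """1000,20191109,active
1000,20191108,inactive
2000,20191109,active
2000,20191101,inactive
351,20191109,active
351,20191102,active
351,20191026,active
351,20191019,active
351,20191012,active
351,20191005,active
351,20190928,inactive
351,20190921,inactive
351,20190914,inactive
351,20190907,active
351,20190831,active
351,20190615,inactive
3000,20200101,active"""

# read_csv accepts any file-like object, so StringIO stands in for a file.
frame = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["url", "date_scraped", "status"],
)
print(frame.shape)  # (17, 3)
```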

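Best answer

I'm not sure I understood the purpose of the test column correctly, but is this what you are trying to achieve (based on the sample data you posted)?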

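import numpy as np

df.sort_values(["url", "date_scraped"], axis=0, ascending=True, inplace=True)

# next scrape date for the same url; 0 marks the last record of each url
# (np.nan cannot be cast to int32 reliably, so 0 is used as the fill value)
df["date_scraped_till"] = np.where(df["url"] == df["url"].shift(-1),
                                   df["date_scraped"].shift(-1), 0).astype(np.int32)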

Output:

     url  date_scraped    status  date_scraped_till
15   351      20190615  inactive           20190831
14   351      20190831    active           20190907
13   351      20190907    active           20190914
12   351      20190914  inactive           20190921
11   351      20190921  inactive           20190928
10   351      20190928  inactive           20191005
9    351      20191005    active           20191012
8    351      20191012    active           20191019
7    351      20191019    active           20191026
6    351      20191026    active           20191102
5    351      20191102    active           20191109
4    351      20191109    active                  0
1   1000      20191108  inactive           20191109
0   1000      20191109    active                  0
3   2000      20191101  inactive           20191109
2   2000      20191109    active                  0
16  3000      20200101    active                  0
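
If a 0 placeholder for "no later scrape" is not desirable, one possible variation is to shift within each url group and keep the gap as a missing value. This is only a sketch under my own assumptions: it uses a five-row subset of the sample data, resets the index for readability, and stores the result in pandas' nullable Int64 dtype.

```python
import io
import pandas as pd

# Subset of the question's sample data, for brevity.
raw = """1000,20191109,active
1000,20191108,inactive
2000,20191109,active
2000,20191101,inactive
3000,20200101,active"""

df = pd.read_csv(io.StringIO(raw), header=None,
                 names=["url", "date_scraped", "status"])
df = df.sort_values(["url", "date_scraped"]).reset_index(drop=True)

# Next scrape date within the same url. The last row of each url has
# no successor, so it becomes <NA> instead of 0.
df["date_scraped_till"] = (
    df.groupby("url")["date_scraped"].shift(-1).astype("Int64")
)
print(df)
```

The grouped shift removes the need for the url == url.shift(-1) comparison, because values never leak across url boundaries.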

Edit

If you meant "collapse" rather than "split", then this should do the trick (it is basically a more efficient way of computing the test column):

import numpy as np

df.sort_values(["url", "date_scraped"], axis=0, ascending=True, inplace=True)

# 1 where a new run starts (url or status differs from the previous row), else 0
df["test"] = np.where((df["url"] == df["url"].shift(1)) & (df["status"] == df["status"].shift(1)), 0, 1)

# number the runs within each (url, status) pair; rows inside a run keep the id of the row that started it
df["test"] = df.groupby(["url", "status"])["test"].cumsum()

df_g = df.groupby(['url', 'status', 'test'])['date_scraped'].agg(['max', 'min'])

Output:

                    max       min
url  status   test
351  active   1     20190907  20190831
              2     20191109  20191005
     inactive 1     20190615  20190615
              2     20190928  20190914
1000 active   1     20191109  20191109
     inactive 1     20191108  20191108
2000 active   1     20191109  20191109
     inactive 1     20191101  20191101
3000 active   1     20200101  20200101
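
For comparison, the same collapse can be sketched with the common "compare with shift, then cumsum" idiom and named aggregation. Note the assumptions: the column names run, first_seen and last_seen are mine, and run is a global run counter rather than the per-(url, status) counter used in the test column above.

```python
import io
import pandas as pd

# Sample rows copied from the question: url, date_scraped, status.
raw = """1000,20191109,active
1000,20191108,inactive
2000,20191109,active
2000,20191101,inactive
351,20191109,active
351,20191102,active
351,20191026,active
351,20191019,active
351,20191012,active
351,20191005,active
351,20190928,inactive
351,20190921,inactive
351,20190914,inactive
351,20190907,active
351,20190831,active
351,20190615,inactive
3000,20200101,active"""

df = pd.read_csv(io.StringIO(raw), header=None,
                 names=["url", "date_scraped", "status"])
df = df.sort_values(["url", "date_scraped"])

# A new run starts whenever url or status differs from the previous row.
new_run = (df["url"] != df["url"].shift()) | (df["status"] != df["status"].shift())
df["run"] = new_run.cumsum()  # global run id (hypothetical column name)

# One row per run with its first and last scrape date.
df_g = (
    df.groupby(["url", "status", "run"], sort=False)["date_scraped"]
      .agg(first_seen="min", last_seen="max")
      .reset_index()
)
print(df_g)
```

Named aggregation fixes the output column names and their order, which a set of functions passed to agg does not guarantee.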

Regarding "Python DataFrame: iterate over rows (comparing values between them) and prepare a group as output", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/59460167/
