python - Is there dynamic code, using a for loop or any other loop, for big data?

Tags: python pandas numpy for-loop

Data.csv file (sample data)

Taluka  Crop    Village Area
T1  C1  V1  11
T1  C1  V2  15
T1  C1  V3  3
T1  C1  V4  1
T1  C1  V5  2
T1  C2  V1  12
T1  C2  V2  16
T1  C2  V3  4
T1  C2  V4  100
T1  C2  V5  52
T1  C3  V1  47
T1  C3  V2  15
T1  C3  V3  21
T1  C3  V4  5
T1  C3  V5  7
T1  C4  V1  20
T1  C4  V2  14
T1  C4  V3  18
T1  C4  V4  5
T1  C4  V5  24
T2  C1  V1  21
T2  C1  V2  20
T2  C1  V3  14
T2  C1  V4  7
T2  C1  V5  8
T2  C2  V1  18
T2  C2  V2  3
T2  C2  V3  12
T2  C2  V4  78
T2  C2  V5  56
T2  C3  V1  16
T2  C3  V2  11
T2  C3  V3  15
T2  C3  V2  45
T2  C3  V3  2
T2  C4  V1  3
T2  C4  V2  12
T2  C4  V3  12
T2  C4  V4  44
T2  C4  V5  10

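I want to know, for a particular crop in a particular taluka, which villages fall into the high-risk, medium-risk, and low-risk areas.

I have 500 talukas in total. Across them there are 10 to 14 crops, and each taluka has 100 to 200 villages.

So, for Taluka-1 (i.e. Thane) and Crop-1 (i.e. paddy), I want to find out which villages are at high, medium, and low risk, using the percentile method.

I have done some of the work, but my code is not dynamic: I have to type in every taluka and crop combination by hand, and there are a great many combinations. I need to do this dynamically with some kind of loop (e.g. a for loop or if statements), but I am stuck on this part.

Please see my code.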

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df=pd.read_csv("/home/desktop/Data.csv")


df.head()

## Part 1: partition the data by taluka
T1= df[df['Taluka'] == 'T1']
T2= df[df['Taluka'] == 'T2']


## Part 2: partition each taluka by crop

T1_C1= T1[T1['Crop'] == 'C1']
T1_C2= T1[T1['Crop'] == 'C2']
T1_C3= T1[T1['Crop'] == 'C3']
T1_C4= T1[T1['Crop'] == 'C4']

T2_C1= T2[T2['Crop'] == 'C1']
T2_C2= T2[T2['Crop'] == 'C2']
T2_C3= T2[T2['Crop'] == 'C3']
T2_C4= T2[T2['Crop'] == 'C4']


## Sort each subset by Area in descending order
## (DataFrame.sort was removed from pandas; sort_values is its replacement)
T1_C1 = T1_C1.sort_values('Area', ascending=False)
T1_C2 = T1_C2.sort_values('Area', ascending=False)
T1_C3 = T1_C3.sort_values('Area', ascending=False)
T1_C4 = T1_C4.sort_values('Area', ascending=False)

T2_C1 = T2_C1.sort_values('Area', ascending=False)
T2_C2 = T2_C2.sort_values('Area', ascending=False)
T2_C3 = T2_C3.sort_values('Area', ascending=False)
T2_C4 = T2_C4.sort_values('Area', ascending=False)


##### Add risk levels for each crop in each taluka

T1_C1['Level'] = pd.qcut(T1_C1['Area'], 3, ['Low Risk','Medium Risk','High Risk'])
T1_C2['Level'] = pd.qcut(T1_C2['Area'], 3, ['Low Risk','Medium Risk','High Risk'])
T1_C3['Level'] = pd.qcut(T1_C3['Area'], 3, ['Low Risk','Medium Risk','High Risk'])
T1_C4['Level'] = pd.qcut(T1_C4['Area'], 3, ['Low Risk','Medium Risk','High Risk'])

T2_C1['Level'] = pd.qcut(T2_C1['Area'], 3, ['Low Risk','Medium Risk','High Risk'])
T2_C2['Level'] = pd.qcut(T2_C2['Area'], 3, ['Low Risk','Medium Risk','High Risk'])
T2_C3['Level'] = pd.qcut(T2_C3['Area'], 3, ['Low Risk','Medium Risk','High Risk'])
T2_C4['Level'] = pd.qcut(T2_C4['Area'], 3, ['Low Risk','Medium Risk','High Risk'])


print(T1_C1)

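So here, for crop C1 in taluka T1, I get which villages are in the high-risk area, which are in the low-risk area, and so on.

How can I do this in a loop? Is there a way to reduce the code? It will have to run for 500 talukas.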

Best answer

I think you need groupby with apply and a custom function:

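def f(x):
    # Called once per (Taluka, Crop) group; x is that group's sub-DataFrame.
    labels = ['Low Risk','Medium Risk','High Risk']
    # Split the group's Area values into three quantile bins and label them.
    x['Level'] = pd.qcut(x['Area'].sort_values(ascending=False), 3, labels = labels)
    return x


df1 = df.groupby(['Taluka','Crop']).apply(f)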

print (df1)
   Taluka Crop Village  Area        Level
0      T1   C1      V1    11    High Risk
1      T1   C1      V2    15    High Risk
2      T1   C1      V3     3  Medium Risk
3      T1   C1      V4     1     Low Risk
4      T1   C1      V5     2     Low Risk
5      T1   C2      V1    12     Low Risk
6      T1   C2      V2    16  Medium Risk
7      T1   C2      V3     4     Low Risk
8      T1   C2      V4   100    High Risk
9      T1   C2      V5    52    High Risk
10     T1   C3      V1    47    High Risk
11     T1   C3      V2    15  Medium Risk
12     T1   C3      V3    21    High Risk
13     T1   C3      V4     5     Low Risk
14     T1   C3      V5     7     Low Risk
15     T1   C4      V1    20    High Risk
16     T1   C4      V2    14     Low Risk
17     T1   C4      V3    18  Medium Risk
18     T1   C4      V4     5     Low Risk
19     T1   C4      V5    24    High Risk
20     T2   C1      V1    21    High Risk
21     T2   C1      V2    20    High Risk
22     T2   C1      V3    14  Medium Risk
23     T2   C1      V4     7     Low Risk
24     T2   C1      V5     8     Low Risk
25     T2   C2      V1    18  Medium Risk
26     T2   C2      V2     3     Low Risk
27     T2   C2      V3    12     Low Risk
28     T2   C2      V4    78    High Risk
29     T2   C2      V5    56    High Risk
30     T2   C3      V1    16    High Risk
31     T2   C3      V2    11     Low Risk
32     T2   C3      V3    15  Medium Risk
33     T2   C3      V2    45    High Risk
34     T2   C3      V3     2     Low Risk
35     T2   C4      V1     3     Low Risk
36     T2   C4      V2    12  Medium Risk
37     T2   C4      V3    12  Medium Risk
38     T2   C4      V4    44    High Risk
39     T2   C4      V5    10     Low Risk
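The behaviour of `groupby(...).apply` regarding the result index and the grouping columns has changed across pandas releases, so the printed index above may look different on a newer version. As a hedged alternative (not the answerer's code), the sketch below uses `transform` on the `Area` column, which returns a result aligned with the original rows. The small frame is an assumption for illustration, holding only the T1/C1 and T2/C4 rows from the sample data.

```python
import pandas as pd

# Illustrative subset of the sample data: groups T1/C1 and T2/C4 only.
df = pd.DataFrame({
    "Taluka":  ["T1"] * 5 + ["T2"] * 5,
    "Crop":    ["C1"] * 5 + ["C4"] * 5,
    "Village": ["V1", "V2", "V3", "V4", "V5"] * 2,
    "Area":    [11, 15, 3, 1, 2, 3, 12, 12, 44, 10],
})

labels = ["Low Risk", "Medium Risk", "High Risk"]

# labels=False makes qcut return the bin number (0, 1 or 2) for each row;
# transform keeps the result aligned with the original row order of df.
codes = df.groupby(["Taluka", "Crop"])["Area"].transform(
    lambda s: pd.qcut(s, 3, labels=False)
)

# Translate the bin numbers into readable labels.
df["Level"] = codes.map(dict(enumerate(labels)))
print(df)
```

One caveat: if a group contains many tied `Area` values, `pd.qcut` can raise a "Bin edges must be unique" error. A common workaround is to bin the ranks instead, for example `pd.qcut(s.rank(method='first'), 3, labels=False)`, at the cost of splitting tied values across bins.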

Edit: you can add sort_values at the end:

df1 = df1.sort_values(['Taluka','Crop', 'Area'], ascending=[True, True, False])
print (df1)
   Taluka Crop Village  Area        Level
1      T1   C1      V2    15    High Risk
0      T1   C1      V1    11    High Risk
2      T1   C1      V3     3  Medium Risk
4      T1   C1      V5     2     Low Risk
3      T1   C1      V4     1     Low Risk
8      T1   C2      V4   100    High Risk
9      T1   C2      V5    52    High Risk
6      T1   C2      V2    16  Medium Risk
5      T1   C2      V1    12     Low Risk
7      T1   C2      V3     4     Low Risk
10     T1   C3      V1    47    High Risk
12     T1   C3      V3    21    High Risk
11     T1   C3      V2    15  Medium Risk
14     T1   C3      V5     7     Low Risk
13     T1   C3      V4     5     Low Risk
19     T1   C4      V5    24    High Risk
15     T1   C4      V1    20    High Risk
17     T1   C4      V3    18  Medium Risk
16     T1   C4      V2    14     Low Risk
18     T1   C4      V4     5     Low Risk
20     T2   C1      V1    21    High Risk
21     T2   C1      V2    20    High Risk
22     T2   C1      V3    14  Medium Risk
24     T2   C1      V5     8     Low Risk
23     T2   C1      V4     7     Low Risk
28     T2   C2      V4    78    High Risk
29     T2   C2      V5    56    High Risk
25     T2   C2      V1    18  Medium Risk
27     T2   C2      V3    12     Low Risk
26     T2   C2      V2     3     Low Risk
33     T2   C3      V2    45    High Risk
30     T2   C3      V1    16    High Risk
32     T2   C3      V3    15  Medium Risk
31     T2   C3      V2    11     Low Risk
34     T2   C3      V3     2     Low Risk
38     T2   C4      V4    44    High Risk
36     T2   C4      V2    12  Medium Risk
37     T2   C4      V3    12  Medium Risk
39     T2   C4      V5    10     Low Risk
35     T2   C4      V1     3     Low Risk
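Once every row carries a `Level`, the original question ("which villages are high, medium, and low risk for a given taluka and crop") becomes a filter plus a grouping. The following is a minimal sketch; `villages_by_level` is a hypothetical helper name, and the frame is rebuilt by hand from the T1/C1 and T1/C2 rows of the output above so that the block runs on its own.

```python
import pandas as pd

# Rows for T1/C1 and T1/C2, copied from the labelled, sorted output above.
df1 = pd.DataFrame({
    "Taluka":  ["T1"] * 10,
    "Crop":    ["C1"] * 5 + ["C2"] * 5,
    "Village": ["V2", "V1", "V3", "V5", "V4", "V4", "V5", "V2", "V1", "V3"],
    "Area":    [15, 11, 3, 2, 1, 100, 52, 16, 12, 4],
    "Level":   ["High Risk", "High Risk", "Medium Risk", "Low Risk", "Low Risk",
                "High Risk", "High Risk", "Medium Risk", "Low Risk", "Low Risk"],
})

def villages_by_level(frame, taluka, crop):
    """Return {level: [villages]} for one taluka/crop pair (hypothetical helper)."""
    subset = frame[(frame["Taluka"] == taluka) & (frame["Crop"] == crop)]
    # observed=True only matters when Level is a categorical column:
    # it skips levels that have no rows in this subset.
    grouped = subset.groupby("Level", observed=True)["Village"]
    return grouped.apply(list).to_dict()

result = villages_by_level(df1, "T1", "C1")
print(result)
```

The same helper can be called for any other pair, for example `villages_by_level(df1, "T1", "C2")`, without writing a separate variable per combination.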

Or (slower), sort inside each group:

def f(x):
    labels = ['Low Risk','Medium Risk','High Risk']
    x = x.sort_values('Area', ascending=False)
    x['Level'] = pd.qcut(x['Area'], 3, labels = labels)
    return x

df1 = df.groupby(['Taluka','Crop']).apply(f).reset_index(drop=True)
print (df1)
   Taluka Crop Village  Area        Level
0      T1   C1      V2    15    High Risk
1      T1   C1      V1    11    High Risk
2      T1   C1      V3     3  Medium Risk
3      T1   C1      V5     2     Low Risk
4      T1   C1      V4     1     Low Risk
5      T1   C2      V4   100    High Risk
6      T1   C2      V5    52    High Risk
7      T1   C2      V2    16  Medium Risk
8      T1   C2      V1    12     Low Risk
9      T1   C2      V3     4     Low Risk
10     T1   C3      V1    47    High Risk
11     T1   C3      V3    21    High Risk
12     T1   C3      V2    15  Medium Risk
13     T1   C3      V5     7     Low Risk
14     T1   C3      V4     5     Low Risk
15     T1   C4      V5    24    High Risk
16     T1   C4      V1    20    High Risk
17     T1   C4      V3    18  Medium Risk
18     T1   C4      V2    14     Low Risk
19     T1   C4      V4     5     Low Risk
20     T2   C1      V1    21    High Risk
21     T2   C1      V2    20    High Risk
22     T2   C1      V3    14  Medium Risk
23     T2   C1      V5     8     Low Risk
24     T2   C1      V4     7     Low Risk
25     T2   C2      V4    78    High Risk
26     T2   C2      V5    56    High Risk
27     T2   C2      V1    18  Medium Risk
28     T2   C2      V3    12     Low Risk
29     T2   C2      V2     3     Low Risk
30     T2   C3      V2    45    High Risk
31     T2   C3      V1    16    High Risk
32     T2   C3      V3    15  Medium Risk
33     T2   C3      V2    11     Low Risk
34     T2   C3      V3     2     Low Risk
35     T2   C4      V4    44    High Risk
36     T2   C4      V2    12  Medium Risk
37     T2   C4      V3    12  Medium Risk
38     T2   C4      V5    10     Low Risk
39     T2   C4      V1     3     Low Risk
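Since the question asks specifically for a loop, here is a rough sketch of the same per-group logic written as an explicit `for` loop over the groups. It is an illustration rather than the answerer's code: the small frame is assumed for the example (groups T1/C1 and T2/C4 from the sample data), and the `groupby` plus `apply` form above is usually the more compact choice.

```python
import pandas as pd

# Illustrative subset of the sample data: groups T1/C1 and T2/C4 only.
df = pd.DataFrame({
    "Taluka":  ["T1"] * 5 + ["T2"] * 5,
    "Crop":    ["C1"] * 5 + ["C4"] * 5,
    "Village": ["V1", "V2", "V3", "V4", "V5"] * 2,
    "Area":    [11, 15, 3, 1, 2, 3, 12, 12, 44, 10],
})

labels = ["Low Risk", "Medium Risk", "High Risk"]
parts = []

# Iterating over a groupby yields one (key, sub-frame) pair per
# Taluka/Crop combination, so no combination is typed in by hand.
for (taluka, crop), group in df.groupby(["Taluka", "Crop"]):
    group = group.sort_values("Area", ascending=False).copy()
    group["Level"] = pd.qcut(group["Area"], 3, labels=labels)
    parts.append(group)

# Stitch the labelled groups back into a single frame.
result = pd.concat(parts, ignore_index=True)
print(result)
```

The loop scales to any number of talukas and crops because the combinations come from the data itself rather than from hard-coded variable names.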

This question, "python - Is there dynamic code, using a for loop or any other loop, for big data?", comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/45320721/
