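I'm new to pandas. So far I have been learning it with CSV files and Excel spreadsheets.
Now I need to convert a text file into a data frame. The text file holds what I would call sequential data. Its format is: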
State Name
City Name
State Name
City Name
City Name
City Name
...
All 50 states and the US territories are listed, but the number of cities per state varies. I need to convert this into a data frame like
[[State Name, City Name1],[State Name, City Name2],...]
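Using the pandas read_table() method I was at least able to read the file into a data frame, but I'm not sure how to get from there to the proper state name / city name layout.
I also have a dictionary mapping two-letter state abbreviations to state names. The dictionary has the form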
{'OH':'OHIO', 'KY':'Kentucky',...}
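Is there a way to use this dictionary to iterate over the file and separate the states from the cities? Or is there a simpler way to do this?
Thanks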
Edit: sample of the text file
A sample of the text file is listed below. Also, please note that I cannot change the file.
Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
Suppose your column is named A. First, find the state rows like this:
df.A.str.contains(r'\[edit\]')
Out[25]:
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 True
10 False
11 True
12 False
13 False
14 False
Use cumsum to define a group index for each state and its cities:
csum = df.A.str.contains(r'\[edit\]').cumsum()
csum
Out[26]:
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 2
10 2
11 3
12 3
13 3
14 3
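To see why the cumulative sum works as a group label, here is a minimal sketch on a five-line series. The sample values and the names `lines`, `is_state`, and `group_id` are made up for illustration. Each True in the mask (a state header) bumps the running total by one, and the False rows (cities) inherit the total of the most recent header.

```python
import pandas as pd

# Hypothetical miniature of the file: two state headers, three cities.
lines = pd.Series(
    ["Alabama[edit]", "Auburn", "Troy", "Alaska[edit]", "Fairbanks"]
)

# True on state header rows, False on city rows.
is_state = lines.str.contains(r"\[edit\]")

# Booleans sum as 0/1, so the running total increases only at headers.
group_id = is_state.cumsum()

print(group_id.tolist())  # [1, 1, 1, 2, 2]
```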
Now you can get the states and the cities:
states = df.groupby(csum).first()
states
Out[38]:
A
A
1 Alabama[edit]
2 Alaska[edit]
3 Arizona[edit]
cities = df.groupby(csum).apply(lambda g: g[1:])
cities
Out[39]:
A
A
1 1 Auburn (Auburn University)[1]
2 Florence (University of North Alabama)
3 Jacksonville (Jacksonville State University)[2]
4 Livingston (University of West Alabama)[2]
5 Montevallo (University of Montevallo)[2]
6 Troy (Troy University)[2]
7 Tuscaloosa (University of Alabama, Stillman Co...
8 Tuskegee (Tuskegee University)[5]
2 10 Fairbanks (University of Alaska Fairbanks)[2]
3 12 Flagstaff (Northern Arizona University)[6]
13 Tempe (Arizona State University)
14 Tucson (University of Arizona)
Now join the data frames:
states.join(cities, rsuffix='_cities')
Out[49]:
A A_cities
A
1 1 Alabama[edit] Auburn (Auburn University)[1]
2 Alabama[edit] Florence (University of North Alabama)
3 Alabama[edit] Jacksonville (Jacksonville State University)[2]
4 Alabama[edit] Livingston (University of West Alabama)[2]
5 Alabama[edit] Montevallo (University of Montevallo)[2]
6 Alabama[edit] Troy (Troy University)[2]
7 Alabama[edit] Tuscaloosa (University of Alabama, Stillman Co...
8 Alabama[edit] Tuskegee (Tuskegee University)[5]
2 10 Alaska[edit] Fairbanks (University of Alaska Fairbanks)[2]
3 12 Arizona[edit] Flagstaff (Northern Arizona University)[6]
13 Arizona[edit] Tempe (Arizona State University)
14 Arizona[edit] Tucson (University of Arizona)
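The joined result still carries the `[edit]` markers, the bracketed footnote numbers, and a two-level index. If a flat two-column frame is what you are after, one possible sketch is below. It is a variation rather than the groupby/join shown above: it uses the same `[edit]` mask, but spreads each state name down to its cities with `where` and `ffill`. The column names `State` and `City` are my own choice, the in-memory `raw` buffer stands in for your real file, and dropping the parenthesized university names is an assumption about what you want to keep.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file, using a few lines from the sample.
raw = io.StringIO(
    "Alabama[edit]\n"
    "Auburn (Auburn University)[1]\n"
    "Florence (University of North Alabama)\n"
    "Alaska[edit]\n"
    "Fairbanks (University of Alaska Fairbanks)[2]\n"
)

# One line per row in a single column named A (no header row in the file).
df = pd.read_table(raw, header=None, names=["A"])

# True on state header rows.
is_state = df["A"].str.contains(r"\[edit\]")

# Keep the text only on header rows, then forward-fill it onto the city rows.
df["State"] = df["A"].where(is_state).ffill()

# Drop the header rows themselves; what remains is one row per city.
out = df.loc[~is_state].copy()

# Strip the "[edit]" marker from the state names.
out["State"] = out["State"].str.replace(r"\[edit\]", "", regex=True)

# Assumption: keep only the city name, cutting at the first "(" or "[".
out["City"] = out["A"].str.replace(r"\s*[\(\[].*$", "", regex=True)

out = out[["State", "City"]].reset_index(drop=True)
print(out)
```

Calling `out.values.tolist()` then gives the nested-list shape from the question, for example `[['Alabama', 'Auburn'], ...]`.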