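I have a pandas DataFrame that holds people's locations over time. It has more than 300 million rows.
Here is a sample in which each Name is assigned a unique index with groupby and the rows are sorted by Name and Year: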
import pandas as pd
inp = [{'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Orange county'}, {'Name': 'John', 'Year':2019, 'Address':'New York'}, {'Name': 'Steve', 'Year':2018, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2020, 'Address':'California'}, {'Name': 'Steve', 'Year':2020, 'Address':'Canada'}, {'Name': 'John', 'Year':2020, 'Address':'Canada'}, {'Name': 'John', 'Year':2021, 'Address':'Canada'}, {'Name': 'John', 'Year':2021, 'Address':'Beverly hills'}, {'Name': 'Steve', 'Year':2021, 'Address':'California'}, {'Name': 'Steve', 'Year':2022, 'Address':'California'}, {'Name': 'Steve', 'Year':2018, 'Address':'NewYork'}, {'Name': 'Steve', 'Year':2018, 'Address':'California'}, {'Name': 'Steve', 'Year':2022, 'Address':'NewYork'}]
df = pd.DataFrame(inp)
df['Author_Grouped_Index'] = df.groupby(['Name']).ngroup()
df = df.sort_values(['Name', 'Year'], ascending=[False, True])  # sort_values returns a new DataFrame, so keep the result
print(df)
Output:
+-------+-------+------+---------------+----------------------+
| Index | Name | Year | Address | Author_Grouped_Index |
+-------+-------+------+---------------+----------------------+
| 5 | Steve | 2018 | Canada | 1 |
+-------+-------+------+---------------+----------------------+
| 15 | Steve | 2018 | NewYork | 1 |
+-------+-------+------+---------------+----------------------+
| 16 | Steve | 2018 | California | 1 |
+-------+-------+------+---------------+----------------------+
| 6 | Steve | 2019 | Canada | 1 |
+-------+-------+------+---------------+----------------------+
| 7 | Steve | 2019 | Canada | 1 |
+-------+-------+------+---------------+----------------------+
| 8 | Steve | 2020 | California | 1 |
+-------+-------+------+---------------+----------------------+
| 9 | Steve | 2020 | Canada | 1 |
+-------+-------+------+---------------+----------------------+
| 13 | Steve | 2021 | California | 1 |
+-------+-------+------+---------------+----------------------+
| 14 | Steve | 2022 | California | 1 |
+-------+-------+------+---------------+----------------------+
| 17 | Steve | 2022 | NewYork | 1 |
+-------+-------+------+---------------+----------------------+
| 0 | John | 2018 | Beverly hills | 0 |
+-------+-------+------+---------------+----------------------+
| 1 | John | 2018 | Beverly hills | 0 |
+-------+-------+------+---------------+----------------------+
| 2 | John | 2019 | Beverly hills | 0 |
+-------+-------+------+---------------+----------------------+
| 3 | John | 2019 | Orange county | 0 |
+-------+-------+------+---------------+----------------------+
| 4 | John | 2019 | New York | 0 |
+-------+-------+------+---------------+----------------------+
| 10 | John | 2020 | Canada | 0 |
+-------+-------+------+---------------+----------------------+
| 11 | John | 2021 | Canada | 0 |
+-------+-------+------+---------------+----------------------+
| 12 | John | 2021 | Beverly hills | 0 |
+-------+-------+------+---------------+----------------------+
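I would like to get a network graph matrix (adjacency matrix) to see the total number of changes between addresses. In other words, for example, how many times people moved from "Canada" to "California" in 2018.
Ideal output:
1) A directed graph built from the Address column. Technically, this means converting the Address column into two columns, "Source" and "Target", where each "Target" value is the "Source" of the next row. It would be better still to count the pairs in another column, "Weight", instead of repeating identical pairs.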
+------------+------------+------+--------+
| Source | Target | Year | Weight |
+------------+------------+------+--------+
| Canada | NewYork | 2018 | |
+------------+------------+------+--------+
| NewYork | California | 2018 | |
+------------+------------+------+--------+
| California | Canada | 2019 | |
+------------+------------+------+--------+
| Canada | Canada | 2019 | |
+------------+------------+------+--------+
| Canada | California | 2020 | |
+------------+------------+------+--------+
| California | Canada | 2020 | |
+------------+------------+------+--------+
| Canada | California | 2021 | |
+------------+------------+------+--------+
| California | California | 2022 | |
+------------+------------+------+--------+
| California | NewYork | 2022 | |
+------------+------------+------+--------+
or
2) A matrix showing the total number of changes between addresses.
+---------------+--------+---------+------------+---------------+---------------+
| From \ To | Canada | NewYork | California | Beverly hills | Orange county |
+---------------+--------+---------+------------+---------------+---------------+
| Canada | 2 | 2 | 2 | 2 | 0 |
+---------------+--------+---------+------------+---------------+---------------+
| NewYork | 1 | 0 | 1 | 0 | 0 |
+---------------+--------+---------+------------+---------------+---------------+
| California | 2 | 1 | 1 | 0 | 0 |
+---------------+--------+---------+------------+---------------+---------------+
| Beverly hills | 0 | 0 | 0 | 2 | 1 |
+---------------+--------+---------+------------+---------------+---------------+
| Orange county | 0 | 1 | 0 | 0 | 0 |
+---------------+--------+---------+------------+---------------+---------------+
Best answer
This is not the prettiest code, but at least you can follow each step. I went with the second option because you can easily build a graph from this connection matrix. Do you need help making the networkx graph? The rows and columns of the matrix follow the order of the places list that the code prints. That list holds six addresses (Canada, NewYork, California, Beverly hills, Orange county, New York): you spelled New York differently for each person, so it appears twice.
import numpy as np
import pandas as pd

inp = [{'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Orange county'}, {'Name': 'John', 'Year':2019, 'Address':'New York'}, {'Name': 'Steve', 'Year':2018, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2020, 'Address':'California'}, {'Name': 'Steve', 'Year':2020, 'Address':'Canada'}, {'Name': 'John', 'Year':2020, 'Address':'Canada'}, {'Name': 'John', 'Year':2021, 'Address':'Canada'}, {'Name': 'John', 'Year':2021, 'Address':'Beverly hills'}, {'Name': 'Steve', 'Year':2021, 'Address':'California'}, {'Name': 'Steve', 'Year':2022, 'Address':'California'}, {'Name': 'Steve', 'Year':2018, 'Address':'NewYork'}, {'Name': 'Steve', 'Year':2018, 'Address':'California'}, {'Name': 'Steve', 'Year':2022, 'Address':'NewYork'}]
df = pd.DataFrame(inp)
df['Author_Grouped_Index'] = df.groupby(['Name']).ngroup()
# sort_values returns a new DataFrame, so assign the result back;
# otherwise the loop below walks the rows in their original, unsorted order
df = df.sort_values(['Name', 'Year'], ascending=[False, True])
print(df)

dictionary_ = {}  # person index -> list of addresses, in visiting order
places = []       # all distinct places, in order of first appearance
for index, row in df.iterrows():
    dictionary_.setdefault(row['Author_Grouped_Index'], []).append(row['Address'])
    if row['Address'] not in places:
        places.append(row['Address'])
print(dictionary_)

new_dictionary = {}  # (source, target) -> number of moves between the two places
for key, value in dictionary_.items():
    for x in range(len(value) - 1):
        # a tuple key avoids ambiguity if an address itself contains "-"
        move = (value[x], value[x + 1])
        new_dictionary[move] = new_dictionary.get(move, 0) + 1
print(new_dictionary)
print(places)

# rows are the place moved from, columns the place moved to
array = np.zeros((len(places), len(places)), dtype=int)
for x, place in enumerate(places):
    for y, place_2 in enumerate(places):
        # pairs that never occur stay at 0
        array[x, y] = new_dictionary.get((place, place_2), 0)
print(array)
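With around 300 million rows, a row-by-row loop like the one above may be slow. A vectorized variant could look roughly like the sketch below. It is not the code from the answer: it assumes that the previous address of each person can be taken with groupby(...).shift(), and the names Row, Source, Target, Weight, edges, weighted and matrix are my own choices for illustration. The explicit Row tie-breaker is also an assumption, added so that records within the same year keep their input order.

```python
import pandas as pd

# Same sample records as in the question (both spellings of New York are kept).
records = [
    ("John", 2018, "Beverly hills"), ("John", 2018, "Beverly hills"),
    ("John", 2019, "Beverly hills"), ("John", 2019, "Orange county"),
    ("John", 2019, "New York"), ("Steve", 2018, "Canada"),
    ("Steve", 2019, "Canada"), ("Steve", 2019, "Canada"),
    ("Steve", 2020, "California"), ("Steve", 2020, "Canada"),
    ("John", 2020, "Canada"), ("John", 2021, "Canada"),
    ("John", 2021, "Beverly hills"), ("Steve", 2021, "California"),
    ("Steve", 2022, "California"), ("Steve", 2018, "NewYork"),
    ("Steve", 2018, "California"), ("Steve", 2022, "NewYork"),
]
df = pd.DataFrame(records, columns=["Name", "Year", "Address"])

# Keep the input position as a column (hypothetical name "Row") and use it
# as the last sort key, so ties within a year stay in input order.
df = df.rename_axis("Row").reset_index()
df = df.sort_values(["Name", "Year", "Row"], ascending=[False, True, True])

# Option 1: each row becomes an edge from the person's previous address
# (Source) to the current one (Target). Year is the year of the Target row,
# as in the question's table. The first row of each person has no Source.
edges = df.rename(columns={"Address": "Target"})
edges["Source"] = edges.groupby("Name")["Target"].shift()
edges = edges.dropna(subset=["Source"])

# Count identical (Source, Target, Year) pairs instead of repeating them.
weighted = (
    edges.groupby(["Source", "Target", "Year"], sort=False)
    .size()
    .reset_index(name="Weight")
)

# Option 2: adjacency matrix over all years, with every place present on
# both axes even if nobody moved from or to it.
places = list(pd.unique(df["Address"]))
matrix = pd.crosstab(edges["Source"], edges["Target"])
matrix = matrix.reindex(index=places, columns=places, fill_value=0)

print(weighted)
print(matrix)
```

Here weighted corresponds to option 1 and matrix to option 2. If a networkx graph is needed, weighted can most likely be passed to networkx.from_pandas_edgelist with source='Source', target='Target', edge_attr='Weight' and create_using=networkx.DiGraph; I have not tried that on the full data.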
Regarding python - converting a pandas DataFrame column to a networkx graph with source and target, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/61307877/