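I have a dataframe with a column that holds the references cited by papers, and I want to find every reference that is repeated anywhere in that column.
Here are some rows of the dataframe: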
In [1]:
df4.iloc[0:2]
Out[2]:
cit2ref reference _id
0 NaN All about depression: Diagnosis. (2013). Retrieved December 7, 2016,from All About Depression, http://www.allaboutdepression.com/dia_03.html Y17-1020
0 NaN American Psychological Association. (2016). Center for epidemiological studies depression (CESD). Retrieved December 7, 2016, from American Psychological Association, http://www.apa.org/pi/ about/publications/caregivers/practice-settings/ assessment/tools/depression-scale.aspx Y17-1020
More rows:
cit2ref reference _id
0 NaN All about depression: Diagnosis. (2013). Retrieved December 7, 2016, from All About Depression, http://www.allaboutdepression.com/dia_03.html Y17-1020
0 NaN American Psychological Association. (2016). Center for epidemiological studies depression (CESD). Retrieved December 7, 2016, from American Psychological Association, http://www.apa.org/pi/ about/publications/caregivers/practice-settings/ assessment/tools/depression-scale.aspx Y17-1020
0 NaN American Psychological Association. (2016). Patient health questionnaire (PHQ-9 %27 PHQ-2). Retrieved December 09, 2016, from http://www.apa.org/pi/ about/publications/caregivers/practice-settings/ assessment/tools/patient-health.aspx Y17-1020
0 NaN Beattie, G.S. (2005, November). Social Causes of Depression. Retrieved May 31, 2017, from http:// www.personalityresearch.org/papers/beattie.html Y17-1020
0 Burton (2012) Burton, N. (2012, June 5). Depressive Realism. Retrieved May 31, 2017, from https:// www.psychologytoday.com/blog/hide-and-seek/ 201206/depressive-realism Y17-1020
0 NaN Clark, P., Niblett, T. (1988, October 25). The CN2 induction Algorithm. Retrieved May 10, 2017, from https://pdfs.semanticscholar.org/766f/ e3586bda3f36cbcce809f5666d2c2b96c98c.pdf Y17-1020
0 Choudhury, 2014 De Choudhury, M., Counts, S., Horvits, E., %27 Hoff, A. (2014). Characterizing and Predicting Postpartum Depression from Shared Facebook Data. Y17-1020
0 NaN De Choudhury, M., Gamon, M., Couns, S., %27 Horvitz, E. (2013). Predicting Depression via Social Media. Y17-1020
0 Gotlib and Joormann (2010) Gotlib IH, Kasch KL, Traill S, Joormann J, Arnow BA, Johnson SL. (2010) Coherence and specificity of information-processing biases in depression and social phobia. J Abnorm Psychol. 2004;113(3): 386-98. Y17-1020
0 NaN Gotlib, I. H., %27 Hammen, C. L. (1992). Psychological aspects of depression: Toward a cognitive- interpersonal integration. New York: Wiley. Y17-1020
0 NaN Gotlib IH, Joormann J. Cognition and depression: current status and future directions. Annu Rev Clin Psychol. 2010;6:285-312. Y17-1020
0 NaN Hu, Quan, Ang Li, Fei Heng, Jianpeng Li, and Tingshao Zhu. "Predicting Depression of Social Media User on Different Observation Windows." 2015 IEEE/ WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI- IAT) (2015): n. pag. Web. Y17-102
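Here "0" is the index of the first paper, which has many references. There are 40k papers, each with about 20 references.
I want to find any reference that is reused in other papers (each paper has a different index), along with the indices where it appears and the number of repetitions.
I tried regular expressions and pandas sorting methods such as value_counts(sort=True).sort_index() and sort_values(), but that did not help.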
Here is the screenshot of the dataframe with 2 papers as indexed '0' and '1'
Best answer
IIUC, use pandas.Index.groupby (called here as df.index.groupby).
Using a dummy dataframe, df (note that I added the last three rows for demonstration):
print(df)
cit2ref reference _id
0 NaN All about depression: Diagnosis. (2013). Retri... Y17-1020
0 NaN American Psychological Association. (2016). Ce... Y17-1020
0 NaN American Psychological Association. (2016). Pa... Y17-1020
0 NaN Beattie, G.S. (2005, November). Social Causes ... Y17-1020
0 NaN Burton (2012) Burton, N. (2012, June 5). D... Y17-1020
0 NaN Clark, P., Niblett, T. (1988, October 25). The... Y17-1020
0 NaN Choudhury, 2014 De Choudhury, M., Counts, ... Y17-1020
0 NaN De Choudhury, M., Gamon, M., Couns, S., %27 Ho... Y17-1020
0 NaN Gotlib and Joormann (2010) Gotlib IH, Kasch K... Y17-1020
0 NaN Gotlib, I. H., %27 Hammen, C. L. (1992). Psych... Y17-1020
0 NaN Gotlib IH, Joormann J. Cognition and depressio... Y17-1020
0 NaN Hu, Quan, Ang Li, Fei Heng, Jianpeng Li, and T... Y17-102
1 NaN All about depression: Diagnosis. (2013). Retri... Y17-1020
1 NaN American Psychological Association. (2016). Ce... Y17-1020
1 NaN StackOverflow. Not to be grouped-by Y17-102
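Then groupby:
import pandas as pd

# Index.groupby returns a dict mapping each reference to the index labels (papers) where it occurs
df.index.groupby(df['reference'])
# or, with the index labels as plain lists, which looks prettier when printed
d = {k: list(v) for k, v in df.index.groupby(df['reference']).items()}
# A Series keeps each list in a single cell; DataFrame.from_dict would spread lists of unequal length across several columns
new_df = pd.Series(d).reset_index()
print(new_df)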
index 0
0 All about depression: Diagnosis. (2013). Retri... [0, 1]
1 American Psychological Association. (2016). Ce... [0, 1]
2 American Psychological Association. (2016). Pa... [0]
3 Beattie, G.S. (2005, November). Social Causes ... [0]
4 Burton (2012) Burton, N. (2012, June 5). D... [0]
5 Choudhury, 2014 De Choudhury, M., Counts, ... [0]
6 Clark, P., Niblett, T. (1988, October 25). The... [0]
7 De Choudhury, M., Gamon, M., Couns, S., %27 Ho... [0]
8 Gotlib IH, Joormann J. Cognition and depressio... [0]
9 Gotlib and Joormann (2010) Gotlib IH, Kasch K... [0]
10 Gotlib, I. H., %27 Hammen, C. L. (1992). Psych... [0]
11 Hu, Quan, Ang Li, Fei Heng, Jianpeng Li, and T... [0]
12 StackOverflow. Not to be grouped-by [1]
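For anyone who wants to try this without the original data, here is a minimal sketch on a toy frame. The reference strings "Ref A", "Ref B", "Ref C" and the name papers_by_ref are made up for illustration; the only assumption carried over from the question is that the dataframe index is the paper number.

```python
import pandas as pd

# Hypothetical toy data: the index is the paper number, as in the question.
df = pd.DataFrame(
    {"reference": ["Ref A", "Ref B", "Ref A", "Ref C"]},
    index=[0, 0, 1, 1],
)

# Index.groupby returns a dict: reference string -> Index of paper numbers.
groups = df.index.groupby(df["reference"])

# Convert each Index to a plain list so the mapping is easy to read.
papers_by_ref = {ref: list(idx) for ref, idx in groups.items()}
print(papers_by_ref)
```

Here "Ref A" maps to both papers, while the other two references map to a single paper each.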
The grouped result shows, for each reference, the paper indices it appears under. If you want the count instead, use len rather than list:
d = {k: len(v) for k, v in df.index.groupby(df['reference']).items()}
new_df = pd.DataFrame.from_dict(d, orient='index').reset_index()
print(new_df)
Output:
index 0
0 All about depression: Diagnosis. (2013). Retri... 2
1 American Psychological Association. (2016). Ce... 2
2 American Psychological Association. (2016). Pa... 1
3 Beattie, G.S. (2005, November). Social Causes ... 1
4 Burton (2012) Burton, N. (2012, June 5). D... 1
5 Choudhury, 2014 De Choudhury, M., Counts, ... 1
6 Clark, P., Niblett, T. (1988, October 25). The... 1
7 De Choudhury, M., Gamon, M., Couns, S., %27 Ho... 1
8 Gotlib IH, Joormann J. Cognition and depressio... 1
9 Gotlib and Joormann (2010) Gotlib IH, Kasch K... 1
10 Gotlib, I. H., %27 Hammen, C. L. (1992). Psych... 1
11 Hu, Quan, Ang Li, Fei Heng, Jianpeng Li, and T... 1
12 StackOverflow. Not to be grouped-by 1
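Since the question asks only for references that are reused across papers, one possible follow-up is to filter the grouped result. This is a rough sketch under assumptions, not part of the original answer: the toy data and the names paper, pairs, summary, n_papers and reused are all hypothetical, and a reference listed twice in the same paper is counted once.

```python
import pandas as pd

# Hypothetical toy data; the index is the paper number, as in the question.
df = pd.DataFrame(
    {"reference": ["Ref A", "Ref B", "Ref A", "Ref C", "Ref A"]},
    index=[0, 0, 1, 1, 1],
)

# Move the paper number into a regular column and drop repeats of the
# same reference within the same paper.
pairs = (
    df.rename_axis("paper")
    .reset_index()
    .drop_duplicates(["paper", "reference"])
)

# For each reference: the papers citing it, and how many papers that is.
grouped = pairs.groupby("reference")["paper"]
summary = pd.DataFrame({"papers": grouped.agg(list), "n_papers": grouped.size()})

# Keep only references cited by more than one paper.
reused = summary[summary["n_papers"] > 1]
print(reused)
```

Both this sketch and the groupby above compare reference strings exactly, so variants such as "2016,from" and "2016, from" in the question's rows would be treated as different references.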
Regarding "python - Finding specific text in a pandas dataframe column", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59942216/