python - Finding specific text in a pandas DataFrame column

Tags: python regex pandas dataframe

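I have a dataframe with a column containing the references cited by papers, and I want to find every reference that is repeated anywhere in that column.

Here are some rows from the dataframe: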

In [1]:

df4.iloc[0:2]

Out[2]:

    cit2ref     reference                                                                                                        _id
0   NaN     All about depression: Diagnosis. (2013). Retrieved December 7, 2016,from All About Depression,
            http://www.allaboutdepression.com/dia_03.html                                                                   Y17-1020
0   NaN     American Psychological Association. (2016). Center for epidemiological studies depression (CESD). 
            Retrieved December 7, 2016, from American Psychological Association, 
            http://www.apa.org/pi/ about/publications/caregivers/practice-settings/ assessment/tools/depression-scale.aspx  Y17-1020

More rows:

    cit2ref     reference                                                                                                                                     _id

0   NaN     All about depression: Diagnosis. (2013). Retrieved December 7, 2016, from All About Depression, http://www.allaboutdepression.com/dia_03.html   Y17-1020
0   NaN     American Psychological Association. (2016). Center for epidemiological studies depression (CESD). Retrieved December 7, 2016, from American Psychological Association, http://www.apa.org/pi/ about/publications/caregivers/practice-settings/ assessment/tools/depression-scale.aspx   Y17-1020
0   NaN     American Psychological Association. (2016). Patient health questionnaire (PHQ-9 %27 PHQ-2). Retrieved December 09, 2016, from http://www.apa.org/pi/ about/publications/caregivers/practice-settings/ assessment/tools/patient-health.aspx  Y17-1020
0   NaN     Beattie, G.S. (2005, November). Social Causes of Depression. Retrieved May 31, 2017, from http:// www.personalityresearch.org/papers/beattie.html   Y17-1020
0   Burton (2012)   Burton, N. (2012, June 5). Depressive Realism. Retrieved May 31, 2017, from https:// www.psychologytoday.com/blog/hide-and-seek/ 201206/depressive-realism  Y17-1020
0   NaN     Clark, P., Niblett, T. (1988, October 25). The CN2 induction Algorithm. Retrieved May 10, 2017, from https://pdfs.semanticscholar.org/766f/ e3586bda3f36cbcce809f5666d2c2b96c98c.pdf    Y17-1020
0   Choudhury, 2014     De Choudhury, M., Counts, S., Horvits, E., %27 Hoff, A. (2014). Characterizing and Predicting Postpartum Depression from Shared Facebook Data.  Y17-1020
0   NaN     De Choudhury, M., Gamon, M., Couns, S., %27 Horvitz, E. (2013). Predicting Depression via Social Media.     Y17-1020
0   Gotlib and Joormann (2010)  Gotlib IH, Kasch KL, Traill S, Joormann J, Arnow BA, Johnson SL. (2010) Coherence and specificity of information-processing biases in depression and social phobia. J Abnorm Psychol. 2004;113(3): 386-98.  Y17-1020
0   NaN     Gotlib, I. H., %27 Hammen, C. L. (1992). Psychological aspects of depression: Toward a cognitive- interpersonal integration. New York: Wiley.   Y17-1020
0   NaN     Gotlib IH, Joormann J. Cognition and depression: current status and future directions. Annu Rev Clin Psychol. 2010;6:285-312.   Y17-1020
0   NaN     Hu, Quan, Ang Li, Fei Heng, Jianpeng Li, and Tingshao Zhu. "Predicting Depression of Social Media User on Different Observation Windows." 2015 IEEE/ WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI- IAT) (2015): n. pag. Web.   Y17-102

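Here "0" is the index of the first paper, which has many references. There are 40k papers, each with about 20 references.

I want to find any reference that is used again in other papers (each paper has a different index here), together with the indices where it appears and the number of times it is repeated.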

I tried regular expressions and pandas sorting methods, such as

value_counts(sort=True).sort_index()

sort_values()

but that did not help.

Here is a screenshot of the dataframe with two papers, indexed '0' and '1'.

Best answer

IIUC, use df.index.groupby (that is, pandas.Index.groupby).

Using a dummy dataframe, df (note that I added the last three rows for demonstration):

print(df)
   cit2ref                                          reference       _id
0      NaN  All about depression: Diagnosis. (2013). Retri...  Y17-1020
0      NaN  American Psychological Association. (2016). Ce...  Y17-1020
0      NaN  American Psychological Association. (2016). Pa...  Y17-1020
0      NaN  Beattie, G.S. (2005, November). Social Causes ...  Y17-1020
0      NaN  Burton   (2012)   Burton, N. (2012, June 5). D...  Y17-1020
0      NaN  Clark, P., Niblett, T. (1988, October 25). The...  Y17-1020
0      NaN  Choudhury, 2014     De Choudhury, M., Counts, ...  Y17-1020
0      NaN  De Choudhury, M., Gamon, M., Couns, S., %27 Ho...  Y17-1020
0      NaN  Gotlib and Joormann (2010)  Gotlib IH, Kasch K...  Y17-1020
0      NaN  Gotlib, I. H., %27 Hammen, C. L. (1992). Psych...  Y17-1020
0      NaN  Gotlib IH, Joormann J. Cognition and depressio...  Y17-1020
0      NaN  Hu, Quan, Ang Li, Fei Heng, Jianpeng Li, and T...   Y17-102
1      NaN  All about depression: Diagnosis. (2013). Retri...  Y17-1020
1      NaN  American Psychological Association. (2016). Ce...  Y17-1020
1      NaN                StackOverflow. Not to be grouped-by   Y17-102

Then group by the reference column:

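import pandas as pd

# Index.groupby returns a dict-like mapping: reference -> Index of paper indices
df.index.groupby(df['reference'])
# or, converted to a DataFrame, which looks prettier:
d = {k: list(v) for k, v in df.index.groupby(df['reference']).items()}
# a Series keeps each list in a single cell, matching the output below
new_df = pd.Series(d).reset_index()
print(new_df)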

                                                index       0
0   All about depression: Diagnosis. (2013). Retri...  [0, 1]
1   American Psychological Association. (2016). Ce...  [0, 1]
2   American Psychological Association. (2016). Pa...     [0]
3   Beattie, G.S. (2005, November). Social Causes ...     [0]
4   Burton   (2012)   Burton, N. (2012, June 5). D...     [0]
5   Choudhury, 2014     De Choudhury, M., Counts, ...     [0]
6   Clark, P., Niblett, T. (1988, October 25). The...     [0]
7   De Choudhury, M., Gamon, M., Couns, S., %27 Ho...     [0]
8   Gotlib IH, Joormann J. Cognition and depressio...     [0]
9   Gotlib and Joormann (2010)  Gotlib IH, Kasch K...     [0]
10  Gotlib, I. H., %27 Hammen, C. L. (1992). Psych...     [0]
11  Hu, Quan, Ang Li, Fei Heng, Jianpeng Li, and T...     [0]
12                StackOverflow. Not to be grouped-by     [1]
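
If only the references shared between papers are of interest, the mapping can be filtered by the number of distinct paper indices. The following is a minimal sketch and not part of the original answer; the tiny frame and its placeholder strings ("ref A" and so on) are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical stand-in data: the index is the paper number,
# 'reference' holds shortened placeholder strings.
df = pd.DataFrame(
    {"reference": ["ref A", "ref B", "ref C", "ref A", "ref B", "ref D"]},
    index=[0, 0, 0, 1, 1, 1],
)

# Map each reference to the paper indices it appears under.
groups = {k: list(v) for k, v in df.index.groupby(df["reference"]).items()}

# Keep only references that occur under more than one distinct paper index.
shared = {ref: papers for ref, papers in groups.items() if len(set(papers)) > 1}
print(shared)
```

Using len(set(papers)) rather than len(papers) means a reference listed twice inside the same paper is not counted as shared.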

The groupby result shows which paper indices each reference appears under. If you want counts instead, use len in place of list:

d = {k: len(v) for k, v in df.index.groupby(df['reference']).items()}
new_df = pd.DataFrame.from_dict(d, orient='index').reset_index()
print(new_df)

Output:

                                                index  0
0   All about depression: Diagnosis. (2013). Retri...  2
1   American Psychological Association. (2016). Ce...  2
2   American Psychological Association. (2016). Pa...  1
3   Beattie, G.S. (2005, November). Social Causes ...  1
4   Burton   (2012)   Burton, N. (2012, June 5). D...  1
5   Choudhury, 2014     De Choudhury, M., Counts, ...  1
6   Clark, P., Niblett, T. (1988, October 25). The...  1
7   De Choudhury, M., Gamon, M., Couns, S., %27 Ho...  1
8   Gotlib IH, Joormann J. Cognition and depressio...  1
9   Gotlib and Joormann (2010)  Gotlib IH, Kasch K...  1
10  Gotlib, I. H., %27 Hammen, C. L. (1992). Psych...  1
11  Hu, Quan, Ang Li, Fei Heng, Jianpeng Li, and T...  1
12                StackOverflow. Not to be grouped-by  1
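
Both pieces of information (which papers cite a reference, and how many) can also be collected in one pass with DataFrame.groupby. Grouping matches strings exactly, so two copies of a citation that differ only in spacing (compare "2016,from" and "2016, from" in the question's rows) would fall into separate groups. The sketch below is one possible approach under stated assumptions, not the answer's method: the `paper` and `key` column names, the normalization rule (lowercase, whitespace removed), and the sample strings are all illustrative choices.

```python
import pandas as pd

# Hypothetical stand-in data; the index is the paper number.
df = pd.DataFrame(
    {
        "reference": [
            "Doe, J. (2013). Retrieved December 7, 2016,from Example",
            "Roe, R. (2016). Another reference.",
            "Doe, J. (2013). Retrieved December 7, 2016, from  Example",
            "Poe, E. (1992). Cited once.",
        ]
    },
    index=[0, 0, 1, 1],
)

# Move the paper number out of the index into an ordinary column.
flat = df.rename_axis("paper").reset_index()

# Assumed normalization: lowercase and strip all whitespace so that
# spacing differences do not split one citation into several groups.
flat["key"] = flat["reference"].str.lower().str.replace(r"\s+", "", regex=True)

summary = flat.groupby("key").agg(
    reference=("reference", "first"),            # one readable spelling per group
    papers=("paper", lambda s: sorted(set(s))),  # distinct paper indices
    n_papers=("paper", "nunique"),               # number of papers citing it
)

# References cited by more than one paper.
repeated = summary[summary["n_papers"] > 1].reset_index(drop=True)
print(repeated)
```

How aggressive the normalization should be depends on the data; stripping whitespace and case is a cautious starting point, and anything looser risks merging citations that are actually different.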

Regarding "python - Finding specific text in a pandas DataFrame column", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/59942216/
