python - How do I find the summary text for a given URL in python/Django?

Tags: python django


How do I find the summary text for a given URL?

What do I mean by summary text?

Merck $41.1 Billion Schering-Plough Bid Seeks Science


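Merck & Co.'s $41.1 billion purchase of Schering-Plough Corp. adds experimental drugs for blood clots, infections and schizophrenia, and allows the companies to speed up research on biotechnology medicines.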

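For the URL above, the three lines quoted here are the summary text.
That is, a short 2 to 3 line description of the URL, which we usually obtain by fetching the page, examining its content, and then working out a short description from the HTML markup.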
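
One common way to get such a description is to read the page's own `<title>` and `<meta name="description">` tags before falling back to summarizing the body text. The following is a minimal sketch using only the standard library; the names `DescriptionParser`, `describe_html` and `describe_url` are made up for illustration, and it assumes the page actually provides a description tag.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class DescriptionParser(HTMLParser):
    """Collects the <title> text and the first description <meta> tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Accept both the classic and the Open Graph description tags.
            name = (attrs.get("name") or attrs.get("property") or "").lower()
            if name in ("description", "og:description") and self.description is None:
                self.description = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def describe_html(html):
    """Return (title, description) found in an HTML string."""
    parser = DescriptionParser()
    parser.feed(html)
    return parser.title.strip(), parser.description


def describe_url(url):
    """Fetch a page and return its (title, description)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return describe_html(html)
```

When `description` comes back as `None`, the page has no such tag, and the body text would have to be summarized instead.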

Is there a good algorithm for this? Or
is there a good library in python/django that does this?

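Best answer

I had the same need. lemur has summarization capabilities, but I found it buggy to the point of being unusable. Over the weekend I used nltk to write a summarize module in python: https://github.com/thavelick/summarize

I took the algorithm from the Java library Classifier4J (http://classifier4j.sourceforge.net/), but used nltk and python wherever possible.

Here is the basic usage: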

>>> import summarize

SimpleSummarizer (currently the only summarizer) builds a summary from the sentences that contain the most frequent words:

>>> ss = summarize.SimpleSummarizer()
>>> input = "NLTK is a python library for working human-written text. Summarize is a package that uses NLTK to create summaries."
>>> ss.summarize(input, 1)
'NLTK is a python library for working human-written text.'
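
The selection rule described above can be sketched in plain Python. This is only an illustration of the idea (pick the first sentence containing each of the most frequent words), not the code of the summarize module or of Classifier4J; `summarize_text`, the regex-based sentence splitter, and the tiny stop-word list are all assumptions made for the example.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real one (e.g. NLTK's) is much longer.
STOP_WORDS = {"a", "an", "and", "are", "for", "is", "of", "that", "the", "to"}


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(text):
    """Lowercase word tokens; hyphenated words are split into their parts."""
    return re.findall(r"[a-z']+", text.lower())


def summarize_text(text, num_sentences):
    """Keep the first sentence containing each of the most frequent words."""
    sentences = split_sentences(text)
    counts = Counter(w for w in words(text) if w not in STOP_WORDS)
    chosen = set()
    # Walk the words from most to least frequent.
    for word, _ in counts.most_common():
        for index, sentence in enumerate(sentences):
            if word in words(sentence):
                chosen.add(index)
                break
        if len(chosen) >= num_sentences:
            break
    # Emit the chosen sentences in their original order.
    return " ".join(sentences[i] for i in sorted(chosen))
```

With the example input above and `num_sentences=1`, this sketch also returns the first sentence, because `nltk` is the only non-stop word that occurs twice.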

You can ask for as many sentences in the summary as you like:

>>> input = "NLTK is a python library for working human-written text. Summarize is a package that uses NLTK to create summaries. A Summariser is really cool. I don't think there are any other python summarisers."
>>> ss.summarize(input, 2)
"NLTK is a python library for working human-written text.  I don't think there are any other python summarisers."

Unlike the original algorithm from Classifier4J, this summarizer works correctly with punctuation other than periods:

>>> input = "NLTK is a python library for working human-written text! Summarize is a package that uses NLTK to create summaries."
>>> ss.summarize(input, 1)
'NLTK is a python library for working human-written text!'
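
To see why the sentence terminator matters, compare a splitter that only recognizes periods with one that also accepts `!` and `?`. Both functions below are hypothetical helpers written for this comparison; neither is taken from Classifier4J or from the summarize module.

```python
import re

text = ("NLTK is a python library for working human-written text! "
        "Summarize is a package that uses NLTK to create summaries.")


def split_on_periods(text):
    """Naive splitter: only '.' ends a sentence."""
    return [part.strip() + "." for part in text.split(".") if part.strip()]


def split_on_terminators(text):
    """Split after '.', '!' or '?' followed by whitespace, keeping the mark."""
    return [part for part in re.split(r"(?<=[.!?])\s+", text.strip()) if part]


# The naive splitter sees a single long sentence; the second one sees two.
print(len(split_on_periods(text)))
print(len(split_on_terminators(text)))
```

A period-only splitter would hand the whole input to the summarizer as one sentence, so a one-sentence summary would simply repeat the input.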

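Update

I have now (finally!) released this under the Apache 2.0 license, the same license as nltk, and put the module on github (see above). Any contributions or suggestions are welcome.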

Regarding "python - How do I find the summary text for a given URL in python/Django?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/626754/
