java - Is there a way to return only the (clean) text of a Wikipedia article?

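My overall goal is to return only clean sentences from a Wikipedia article, without any markup. Obviously, there are ways to return JSON, XML, etc., but those are full of markup. My best approach so far is to return what Wikipedia calls raw. For example, the following link returns the raw format of the page "Iron Man":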

http://en.wikipedia.org/w/index.php?title=Iron%20Man&action=raw

Here is a snippet of what is returned:

...//I am truncating some markup at the beginning here. 
|creative_team_month =
|creative_team_year =
|creators_series =
|TPB =
|ISBN =
|TPB# =
|ISBN# =
|nonUS =
}}
'''Iron Man''' is a fictional character, a [[superhero]] that appears in
[[comic book]]s published by [[Marvel Comics]]. 
...//I am truncating here everything until the end.
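For reference, the raw wikitext above can also be fetched from Java. The following is only a minimal sketch, assuming Java 11+ for `java.net.http.HttpClient`; the class name `RawWikiFetcher`, the helpers `rawUrl` and `fetchRaw`, and the User-Agent string are hypothetical names made up for illustration, and the URL uses `https` rather than the `http` link shown earlier.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class RawWikiFetcher {

    // Builds the action=raw URL for a page title (hypothetical helper).
    // URLEncoder produces form encoding, so '+' is turned back into "%20".
    static String rawUrl(String title) {
        String encoded = URLEncoder.encode(title, StandardCharsets.UTF_8)
                                   .replace("+", "%20");
        return "https://en.wikipedia.org/w/index.php?title=" + encoded + "&action=raw";
    }

    // Downloads the raw wikitext of a page; throws if the status is not 200.
    static String fetchRaw(String title) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(rawUrl(title)))
                .header("User-Agent", "raw-wiki-example/0.1") // placeholder value
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Unexpected status: " + response.statusCode());
        }
        return response.body();
    }
}
```

Calling `RawWikiFetcher.fetchRaw("Iron Man")` would then return the same kind of text as the snippet above, ready to be passed to a cleaning method.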

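I have stuck with the raw format because I have found it the easiest to clean up. Although what I have written in Java so far handles this fairly well, many cases still slip through. These include Wikipedia timeline markup, Wikipedia images, and other Wikipedia properties that do not appear in every article. Again, I am working in Java (specifically, I am developing a Tomcat web application).

Question: Is there a better way to get clean, human-readable sentences from Wikipedia articles? Perhaps someone has already built a library for this that I just can't find?

If anything is unclear, I am happy to edit my question to elaborate on what I mean by clean and human-readable.

My current Java method for cleaning the raw-format text is as follows: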

public String cleanRaw(String input){
    //Next two lines attempt to get rid of references. Self-closing refs go
    //first so <ref name="a"/> is not paired with a later </ref>; (?s) lets
    //. match newlines inside multi-line references.
    input= input.replaceAll("<ref\\b[^>]*/>","");
    input= input.replaceAll("(?s)<ref\\b[^>]*>.*?</ref>","");

    //Remove section headings such as ==History==.
    input= input.replaceAll("==[^=]*==", "");
    //I found that anything between curly braces is not needed.
    //Each pass strips the innermost {{...}}; stop when nothing changes.
    while (input.indexOf("{{") >= 0){
        int prevLength= input.length();
        input= input.replaceAll("\\{\\{[^{}]*\\}\\}", "");
        if (prevLength == input.length()){
            break;
        }
    }
    //Next line gets rid of links to other Wikipedia pages.
    input= input.replaceAll("\\[\\[([^]]*[|])?([^]]*?)\\]\\]", "$2");
    //Remove HTML comments, including multi-line ones.
    input= input.replaceAll("(?s)<!--.*?-->","");
    //Keep only letters, digits, periods, commas and spaces.
    input= input.replaceAll("[^A-Za-z0-9., ]", "");

    return input;
}
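Two of the cases mentioned above that slip through, timeline markup and images, could perhaps be handled by a pre-pass that runs before the link-stripping step. The following is only a rough sketch and not part of the method above: the class `WikiPrePass` and its helpers are hypothetical names, and it assumes timelines are wrapped in `<timeline>...</timeline>` tags and images are written as `[[File:...]]` or `[[Image:...]]` links whose captions may contain nested `[[links]]`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WikiPrePass {

    // (?i) ignores tag case; (?s) lets '.' span the lines inside the block.
    private static final Pattern TIMELINE =
            Pattern.compile("(?is)<timeline\\b.*?</timeline>");

    // Matches only the opening of a media link, e.g. "[[File:" or "[[Image:".
    private static final Pattern MEDIA_START =
            Pattern.compile("(?i)\\[\\[\\s*(?:File|Image)\\s*:");

    static String stripTimelines(String input) {
        return TIMELINE.matcher(input).replaceAll("");
    }

    // Removes whole media links, counting brackets so that nested
    // [[links]] inside a caption do not end the match too early.
    static String stripMediaLinks(String input) {
        StringBuilder out = new StringBuilder();
        Matcher m = MEDIA_START.matcher(input);
        int pos = 0;
        while (m.find(pos)) {
            int end = findClosing(input, m.start());
            if (end < 0) {
                break; // unbalanced brackets: leave the rest untouched
            }
            out.append(input, pos, m.start());
            pos = end;
        }
        out.append(input.substring(pos));
        return out.toString();
    }

    // Returns the index just past the "]]" that closes the "[[" at start,
    // or -1 if the brackets never balance.
    private static int findClosing(String s, int start) {
        int depth = 0;
        int i = start;
        while (i < s.length() - 1) {
            if (s.startsWith("[[", i)) {
                depth++;
                i += 2;
            } else if (s.startsWith("]]", i)) {
                depth--;
                i += 2;
                if (depth == 0) {
                    return i;
                }
            } else {
                i++;
            }
        }
        return -1;
    }
}
```

A regular expression alone cannot reliably match nested brackets, which is why the media links are handled with a small bracket counter instead. Running `stripMediaLinks` and `stripTimelines` on the input before `cleanRaw` would keep image captions and timeline data out of the final text.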

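Best answer

I found a couple of projects that might help. You can run the first one by embedding a JavaScript engine in your Java code.

txtwiki.js: a JavaScript library that converts MediaWiki markup to plain text. https://github.com/joaomsa/txtwiki.js

Wikipedia Extractor: a Python script that extracts and cleans text from a Wikipedia database dump. http://medialab.di.unipi.it/wiki/Wikipedia_Extractor

Source: http://www.mediawiki.org/wiki/Alternative_parsers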

A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/20339240/
