php - Getting information from Wikipedia - how do I get it as HTML?

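I am using curl to retrieve information from Wikipedia. So far I have successfully retrieved the basic text, but I really want to retrieve it as HTML.

Here is my code: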

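$s = curl_init();

// Step 1: ask the Yahoo BOSS search API for the matching Wikipedia article.
$url = 'http://boss.yahooapis.com/ysearch/web/v1/site:en.wikipedia.org+'.$article_name.'?appid=myID';
curl_setopt($s, CURLOPT_URL, $url);
curl_setopt($s, CURLOPT_HEADER, false);
curl_setopt($s, CURLOPT_RETURNTRANSFER, 1);

$rs = curl_exec($s);

$rs = Zend_Json::decode($rs);

$rs = ($rs['ysearchresponse']['resultset_web']);

// Take the first search hit and strip the URL prefix to get the article title.
$rs = array_shift($rs);
$article = str_replace('http://en.wikipedia.org/wiki/', '', $rs['url']);

// Step 2: request the latest revision of that article from the MediaWiki API.
$url = 'http://en.wikipedia.org/w/api.php?';
$url .= 'format=json';
$url .= sprintf('&action=query&titles=%s&rvprop=content&prop=revisions&redirects=1', $article);

curl_setopt($s, CURLOPT_URL, $url);
curl_setopt($s, CURLOPT_HEADER, false);
curl_setopt($s, CURLOPT_RETURNTRANSFER, 1);

$rs = curl_exec($s);
//curl_close( $s );
$rs = Zend_Json::decode($rs);

// array_pop() takes its argument by reference, so each intermediate
// result is stored in a variable instead of nesting the calls.
$query = array_pop($rs);
$pages = array_pop($query);
$page = array_pop($pages);
$rs = array_shift($page['revisions']);
$articleText = $rs['*'];   // raw wikitext of the revision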

However, the text retrieved this way is not suitable for display :( It all comes back in this format:

'''Aix-les-Bains''' is a [[Communes of France|commune]] in the [[Savoie]] [[Departments of France|department]] in the [[Rhône-Alpes]] [[regions of France|region]] in southeastern [[France]].

It lies near the [[Lac du Bourget]], {{convert|9|km|mi|abbr=on}} by rail north of [[Chambéry]].

==History== ''Aix'' derives from [[Latin]] ''Aquae'' (literally, "waters"; ''cf'' [[Aix-la-Chapelle]] (Aachen) or [[Aix-en-Provence]]), and Aix was a bath during the [[Roman Empire]], even before it was renamed ''Aquae Gratianae'' to commemorate the [[Emperor Gratian]], who was assassinated not far away, in [[Lyon]], in [[383]]. Numerous Roman remains survive. [[Image:IMG 0109 Lake Promenade.jpg|thumb|left|Lac du Bourget Promenade]]

How do I get the HTML of a Wikipedia article?

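Update: Thanks, but I am a bit new to this. I am now trying to run an XPath query (for the first time) and cannot seem to get any results. There are actually a few things I need to know: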

How do I request only part of an article?
How do I get the HTML of the requested article?
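
One way to address both questions at once is the MediaWiki API's action=parse module, which returns rendered HTML and accepts a section number. The following is only a minimal sketch: the helper names (buildParseUrl, extractParsedHtml, fetchSectionHtml), the https endpoint, and the User-Agent string are assumptions for illustration, and the response layout assumed is the default JSON format, where the HTML sits under parse, text, '*'.

```php
<?php
// Hypothetical helper: build an action=parse URL for one section of a page.
function buildParseUrl(string $title, int $section = 0): string
{
    $params = [
        'action'    => 'parse',
        'format'    => 'json',
        'page'      => $title,
        'prop'      => 'text',    // only return the rendered HTML
        'section'   => $section,  // 0 is the lead section
        'redirects' => 1,         // follow redirects to the target page
    ];
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query($params, '', '&');
}

// Hypothetical helper: pull the HTML out of a decoded API response.
// Returns null when the response has no parsed text (e.g. an error reply).
function extractParsedHtml(array $response): ?string
{
    return $response['parse']['text']['*'] ?? null;
}

// Hypothetical helper: fetch one section of an article as HTML.
function fetchSectionHtml(string $title, int $section = 0): ?string
{
    $ch = curl_init(buildParseUrl($title, $section));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Placeholder User-Agent; replace with something identifying your app.
    curl_setopt($ch, CURLOPT_USERAGENT, 'CityInfoDemo/0.1 (contact@example.com)');
    $body = curl_exec($ch);
    if ($body === false) {
        return null;
    }
    $decoded = json_decode($body, true);
    return is_array($decoded) ? extractParsedHtml($decoded) : null;
}
```

Calling fetchSectionHtml('Aix-les-Bains') would then return only the lead section as HTML, which could be sent straight back to the AJAX caller.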

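I went through this URL about data mining Wikipedia. It suggests making a second request to the Wikipedia API with the retrieved wikitext as a parameter, which should return the HTML, although so far it does not seem to work :( I do not want to just grab the whole article as a pile of HTML and dump it.

Basically, my application places markers for some locations and cities on a map. You click a city marker, and it requests details about that city via AJAX, to be shown in an adjacent div. I want to fetch this information from Wikipedia dynamically. I will worry later about handling cities that have no article; for now I just need to make sure this works.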

Does anyone know of a good working example that does what I am looking for, i.e. reading and parsing selected parts of a Wikipedia article?

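According to the URL provided, I should POST the wikitext to the Wikipedia API endpoint so that it returns the parsed HTML. The problem is that when I POST the data I get no response, only an access denied error. If I instead pass the wikitext as a GET parameter, it is parsed without any problem, but of course that fails when there is too much text to parse.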
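
For reference, a POST to action=parse with the wikitext in the text parameter might look like the sketch below. The helper names and the User-Agent string are hypothetical. Wikimedia's User-Agent policy asks clients to send a descriptive User-Agent and may reject requests without one; whether that is what caused the access denied error here is an assumption, not something the question confirms.

```php
<?php
// Hypothetical helper: build the url-encoded POST body for action=parse.
function buildParsePostBody(string $wikitext): string
{
    $fields = [
        'action'       => 'parse',
        'format'       => 'json',
        'prop'         => 'text',
        'contentmodel' => 'wikitext',
        'text'         => $wikitext,  // the raw wikitext to render
    ];
    return http_build_query($fields, '', '&');
}

// Hypothetical helper: POST wikitext to the API and return the rendered HTML.
function parseWikitext(string $wikitext): ?string
{
    $ch = curl_init('https://en.wikipedia.org/w/api.php');
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, buildParsePostBody($wikitext));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Assumption: a descriptive User-Agent avoids the rejected request.
    curl_setopt($ch, CURLOPT_USERAGENT, 'CityInfoDemo/0.1 (contact@example.com)');
    $body = curl_exec($ch);
    if ($body === false) {
        return null;
    }
    $decoded = json_decode($body, true);
    return $decoded['parse']['text']['*'] ?? null;
}
```

Sending the text in the POST body rather than the query string sidesteps the URL length limit that makes the GET variant fail on long articles.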

Is this a problem with the Wikipedia API? I have been hacking at this for two days without any luck :(

Best answer

The simplest solution is probably to fetch the page itself (for example http://en.wikipedia.org/wiki/Combination) and then extract the contents of <div id="content">, possibly with an XPath query.
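
A minimal sketch of that extraction step, using PHP's DOMDocument and DOMXPath, could look like this. The function name extractContentDiv is hypothetical, and the sketch assumes the page HTML has already been downloaded (for example with curl as in the question).

```php
<?php
// Hypothetical helper: return the outer HTML of <div id="..."> or null.
function extractContentDiv(string $html, string $id = 'content'): ?string
{
    $doc = new DOMDocument();

    // Real-world pages are rarely valid HTML 4; collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // $id is interpolated as-is, so it must not contain a double quote.
    $nodes = $xpath->query(sprintf('//div[@id="%s"]', $id));
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // Serialize just the matched node, not the whole document.
    return $doc->saveHTML($nodes->item(0));
}
```

If the query returns nothing, it is worth checking that the id in the downloaded markup really is content, since the page structure may differ from what the answer assumes.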

Regarding php - Getting information from Wikipedia - how do I get it as HTML?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/853450/
