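python - How to use Beautiful Soup to get plaintext and URLs from an HTML document?

I have been using Python and regular expressions to find things in an HTML document and, contrary to what most people say, it has worked perfectly well, even though things could go wrong. Still, I figure Beautiful Soup would be faster and easier, but I don't really know how to make it do what I did with regex, which was fairly simple but messy.

I am using this page's HTML: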

http://www.locationary.com/places/duplicates.jsp?inPID=1000000001

EDIT:

Here is the HTML for the main place:

<tr>
<td class="Large Bold" nowrap="nowrap">Riverside Tower Hotel&nbsp;</td>
<td class="Large Bold" width="100%">80 Riverside Drive, New York, New York, United States</td>
<td class="Large Bold" nowrap="nowrap" width="55">&nbsp;<input name="selectCheckBox" type="checkbox" checked="checked" disabled="disabled" />Yes
</td>
</tr>

Example of the first similar place:

<td class="" nowrap="nowrap"><a href="http://www.locationary.com/place/en/US/New_York/New_York/54_Riverside_Dr_Owners_Corp-p1009633680.jsp" target="_blank">54 Riverside Dr Owners Corp</a></td>
<td width="100%">&nbsp;54 Riverside Dr, New York, New York, United States</td>
<td nowrap="nowrap" width="55">

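When my program fetches the page and I use Beautiful Soup to make it more readable, the resulting HTML comes out slightly different from Firefox's "view source"... I don't know why.

These are my regular expressions: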

PlaceName = re.findall(r'"nowrap">(.*)&nbsp;</td>', main)

PlaceAddress = re.findall(r'width="100%">(.*)</td>\n<td class="Large Bold"', main)

cNames = re.findall(r'target="_blank">(.*)</a></td>\n<td width="100%">&nbsp;', main)

cAddresses = re.findall(r'<td width="100%">&nbsp;(.*)</td>\n<td nowrap="nowrap" width="55">', main)

cURLs = re.findall(r'<td class="" nowrap="nowrap"><a href="(.*)" target="_blank">', main)

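The first two are for the main place and its address. The rest are for the information about the other places. After running these, I decided I only want the first 5 results for cNames, cAddresses, and cURLs, because I don't need 91 or however many there are.

I don't know how to find this kind of information with BS. All I can do with BS is find specific tags and do things with them. This HTML is a bit complicated because of all the information. What I want is in tables, and the table tags are kind of messy too...

How do I get that information, and limit it to just the first 5 results or so?

Thanks.

Best answer

There are reasons that people say you can't parse HTML with regular expressions, but here is a simple one that applies to your regular expressions: you have \n and &nbsp; in your regexes, and those can and will change randomly on the pages you are trying to parse. When that happens, your regex will not match and your code will stop working.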
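To make that failure mode concrete, here is a minimal sketch using the asker's own pattern for candidate names. The two markup strings are made-up samples modeled on the question's snippet (the short `/p1.jsp` link and the indented second version are assumptions for illustration, not output from the real page).

```python
import re

# The asker's pattern for candidate names, copied from the question.
pattern = r'target="_blank">(.*)</a></td>\n<td width="100%">&nbsp;'

# Hypothetical markup laid out exactly as the pattern expects.
original = (
    '<td><a href="/p1.jsp" target="_blank">54 Riverside Dr Owners Corp</a></td>\n'
    '<td width="100%">&nbsp;54 Riverside Dr, New York</td>'
)

# Same data, but the second cell is now indented and the &nbsp; is gone.
reformatted = (
    '<td><a href="/p1.jsp" target="_blank">54 Riverside Dr Owners Corp</a></td>\n'
    '    <td width="100%">54 Riverside Dr, New York</td>'
)

matches_original = re.findall(pattern, original)
matches_reformatted = re.findall(pattern, reformatted)

print(matches_original)     # the name is captured
print(matches_reformatted)  # nothing is captured, although the name is still there
```

The visible content of both strings is the same; only whitespace and an entity changed, which is enough to make the pattern stop matching.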

But what you are trying to accomplish is quite simple:

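# Python 2, using the legacy BeautifulSoup 3 package
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(open('this-stackoverflow-page.html'))

# Calling the soup object is shorthand for findAll: every <a> tag in the document
for anchor in soup('a'):
    print anchor.contents, anchor.get('href')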

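yields all the anchor tags, no matter where they occur in the deeply nested structure of that page. Here are a few lines excerpted from the output of that three-line script: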

[u'Stack Exchange'] http://stackexchange.com
[u'msw'] /users/282912/msw
[u'faq'] /faq
[u'Stack Overflow'] /
[u'Questions'] /questions
[u'How to use Beautiful Soup to get plaintext and URLs from an HTML document?'] /questions/11902974/how-to-use-beautiful-soup-to-get-plaintext-and-urls-from-an-html-document
[u'http://www.locationary.com/places/duplicates.jsp?inPID=1000000001'] http://www.locationary.com/places/duplicates.jsp?inPID=1000000001
[u'python'] /questions/tagged/python
[u'beautifulsoup'] /questions/tagged/beautifulsoup
[u'Marcus Johnson'] /users/1587751/marcus-johnson

It is hard to imagine much less code doing that much work for you.
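The same idea can be pointed at the question's table to pull out name, address, and URL, capped at the first 5 results. What follows is only a sketch under several assumptions: it uses the newer `bs4` package on Python 3 rather than the BeautifulSoup 3 import shown above, the sample rows are modeled on the question's snippet (the second row, its `example.com` URL, and the closing tags are invented), and `similar_places` is a hypothetical helper name. The real page may need a different selector.

```python
from bs4 import BeautifulSoup

# Sample rows modeled on the question's snippet. The second row and the
# closing tags are made up so that the fragment forms a complete table.
html = """
<table>
<tr>
<td class="" nowrap="nowrap"><a href="http://www.locationary.com/place/en/US/New_York/New_York/54_Riverside_Dr_Owners_Corp-p1009633680.jsp" target="_blank">54 Riverside Dr Owners Corp</a></td>
<td width="100%">&nbsp;54 Riverside Dr, New York, New York, United States</td>
<td nowrap="nowrap" width="55"></td>
</tr>
<tr>
<td class="" nowrap="nowrap"><a href="http://example.com/place-2.jsp" target="_blank">Example Place</a></td>
<td width="100%">&nbsp;1 Example St, New York, New York, United States</td>
<td nowrap="nowrap" width="55"></td>
</tr>
</table>
"""


def similar_places(markup, limit=5):
    """Return up to `limit` (name, address, url) tuples (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    places = []
    # limit= makes find_all stop after that many matching anchors.
    for anchor in soup.find_all("a", target="_blank", limit=limit):
        name_cell = anchor.find_parent("td")
        # The address sits in the next <td> after the cell holding the link.
        address_cell = name_cell.find_next_sibling("td") if name_cell else None
        address = ""
        if address_cell is not None:
            # &nbsp; is parsed as U+00A0; turn it into a space, then trim.
            address = address_cell.get_text().replace("\xa0", " ").strip()
        places.append((anchor.get_text(strip=True), address, anchor.get("href")))
    return places


places = similar_places(html)
for name, address, url in places:
    print(name, "|", address, "|", url)
```

Because the lookup follows the tag structure (anchor, enclosing cell, next cell) instead of the exact characters between tags, changes in line breaks or indentation should not affect it. Passing `limit=5` is one way to keep only the first few results; slicing the returned list would work as well.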

Source: the Stack Overflow question "How to use Beautiful Soup to get plaintext and URLs from an HTML document?": https://stackoverflow.com/questions/11902974/
