python - BeautifulSoup - Scraping everything except the table data

Tags: python html css beautifulsoup

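Hello, I'm new to Python and am currently trying to download data from a table on a website (http://www.pa.org.mt/AppList?ReceivedDate=2016-8-31).

I have tried many different solutions, but everything I try returns an empty list. I read that the problem might be that the table is loaded with JavaScript, but the table is still there when I turn JavaScript off, and when I view the page source I can clearly see the data I want.

I am using Python 2.7.

When I run this code: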

from bs4 import BeautifulSoup
import urllib2

url = 'http://www.pa.org.mt/AppList?ReceivedDate=2016-8-31'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
print soup

this is what I get where the table should be:

<link href="appsearch/main.css" rel="stylesheet" type="text/css" />
        <TABLE id="Table1" cellSpacing="1" cellPadding="1" width="100%" border="0">
            <TR>
                <TD align=center>

                    <br />
                    <br />
                </TD>
            </TR>
            <TR>
                <TD>
                        </TD>
            </TR>
        </TABLE>

When I view the page source, I can see the information I want (I have copied and pasted a small part of it):

<TD align=center>
         <p align='center' class='H1'><u>Planning Authority      Applications Received (Planning Applications Outside Development Zone)</u></p><p align='center'>Result For Date 2016-8-31</p><p align='center'>Result output on 03/09/2016 23:23:29</p><strong><i>Disclaimer</strong>: The information ....in accordance with the Development Planning Act.</i>
                    <br />
                    <br />
                </TD>
            </TR>
            <TR>
                <TD>
                        <table class='formTable'><tr><td class='sectionHeading' colspan=2>Application Details</td></tr></table><table class='formTable'><TR><td class='sectionHeading'>Case Number</td><td class='sectionHeading'>Location</td><td class='sectionHeading'>Proposal</td><td class='sectionHeading'>Applicant</td><td class='sectionHeading'>Architect</td><td class='sectionHeading'>Case Category</td><td class='sectionHeading'>Local Council</td></tr><TR><td class='fieldData'><a href='SearchPA?Systemkey=166837&CaseFullRef=PA/05054/1

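If you could give me any advice or point me towards any material that might help, I would be very grateful.

As I said, I am new to both Python and Stack Overflow, so I apologize if a similar question has already been answered or if I haven't provided the right information.

Thanks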

Best answer

If you clear your cache and go directly to http://www.pa.org.mt/appsreceived?month=01/08/2016, you will see no data at all, just as in your own output.


You need to use a session and first visit the page that comes before the one you want:

import requests
head = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"}

with requests.Session() as s:
    s.headers.update(head)
    s.get("http://www.pa.org.mt/appsreceived?CaseType=PA&Category=PAI")

    r2 = (s.get("http://www.pa.org.mt/AppList?ReceivedDate=2016-8-31"))
    print(r2.content)

The next problem is that the HTML is broken, so html.parser will not give you what you want:

In [4]: with requests.Session() as s:
   ...:         s.headers.update(head)
   ...:         r= s.get("http://www.pa.org.mt/appsreceived?CaseType=PA&Category=PAI")
   ...:         page = (s.get("http://www.pa.org.mt/AppList?ReceivedDate=2016-8-31").content)
   ...:         soup = BeautifulSoup(page, 'html.parser')
   ...:         print(soup.select_one("#Table1"))
   ...:     
<table border="0" cellpadding="1" cellspacing="1" id="Table1" width="100%">
<tr>
<td align="center">
<p align="center" class="H1"><u>Planning Authority Applications Received (Planning Applications Within Development Zone)</u></p><p align="center">Result For Date 2016-8-31</p><p align="center">Result output on 04/09/2016 01:56:44</p><strong><i>Disclaimer</i></strong>: The information below has been extracted from an on-line database and is meant only for your general guidance.The Planning Authority disclaims any responsibility for any inaccuracies there may be on this site. If you wish to verify the correctness of any information then you are advised to contact us directly. Furtheremore, in the event of any discrepancies between the information contained on this site and official printed communication then the latter is to prevail, in accordance with the Development Planning Act.</td></tr></table>

lxml or html5lib will. I won't add the output because it is large, but using either parser will give you the full table data.
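As a rough sketch of that last step, assuming `page` holds the bytes fetched through the session above and that lxml (or html5lib) is installed, the rows of the `formTable` tables could be collected like this. The helper name `extract_rows` and the `SAMPLE` markup are made up for illustration; only the `formTable` class name comes from the page source quoted in the question.

```python
from bs4 import BeautifulSoup

# Made-up, well-formed stand-in for the real page; not actual site data.
SAMPLE = """
<table class='formTable'>
<tr><td class='sectionHeading'>Case Number</td><td class='sectionHeading'>Location</td></tr>
<tr><td class='fieldData'>PA/00000/16</td><td class='fieldData'>Example Street</td></tr>
</table>
"""


def extract_rows(html, parser="lxml"):
    """Return one list of cell texts per row of every table.formTable.

    `parser` is passed straight to BeautifulSoup; "lxml" and "html5lib"
    are the parsers the answer recommends for the broken markup.
    """
    soup = BeautifulSoup(html, parser)
    rows = []
    for table in soup.find_all("table", class_="formTable"):
        for tr in table.find_all("tr"):
            cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
            if cells:  # skip rows without any cells
                rows.append(cells)
    return rows
```

With the session code above, `extract_rows(page)` or `extract_rows(page, parser="html5lib")` should then return the heading rows followed by one list per application, assuming the live page still has the structure shown in the question.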

Source: Stack Overflow, https://stackoverflow.com/questions/39311702/
