python - Unable to scrape main_container from a specific page

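So I am trying to scrape the content from this url. If you inspect it, you will see a lot of detail sitting under a div with the class main_container. But whenever I try to scrape it, that part never ends up in the soup.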

<div class="main_container o-hidden" id="tfullview">

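So I did some research and learned there may be two approaches:

The page may be loaded on the client side, since the content could be injected by a script, so I used PyQt4 to scrape the site. The code is at the end.

That code prints None, meaning the tag was not found.

I also tried the selenium approach, which basically loads the page first and then scrapes the data from it. That also returned None. I don't have that code ready to post.

The div also carries o-hidden in its class attribute; would that prevent it from loading? The div is the one shown above.

PyQt code: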

    import sys
    from PyQt4.QtGui import QApplication
    from PyQt4.QtCore import QUrl
    from PyQt4.QtWebKit import QWebPage
    import bs4 as bs
    import requests

    class Client(QWebPage):
        """Loads a URL in a QWebPage and blocks until loading has finished."""

        def __init__(self, url):
            self.app = QApplication(sys.argv)
            QWebPage.__init__(self)
            self.loadFinished.connect(self.on_page_load)
            self.mainFrame().load(QUrl(url))
            self.app.exec_()  # run the event loop until on_page_load quits it

        def on_page_load(self):
            self.app.quit()

    url = 'https://eprocure.gov.in/cppp/tendersfullview/MjMyODQwA13h1OGQ2NzAxYTMwZTJhNTIxMGNiNmEwM2EzNmNhYWZhODk=A13h1OGQ2NzAxYTMwZTJhNTIxMGNiNmEwM2EzNmNhYWZhODk=A13h1MTU1MzU4MDQwNQ==A13h1NzIxMTUvODUwOCA4NTA5LzE4L0NPVy9PV0M=A13h1MjAxOV9JSFFfNDU4NjEzXzE='
    client_response = Client(url)
    source = client_response.mainFrame().toHtml()
    soup = bs.BeautifulSoup(source, 'lxml')
    test = soup.find("div", class_="main_container")
    print(test)  # prints None: the div is not in the parsed source

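Best answer

So, I re-wrote this using requests. A Session is needed to allow the connection to be reused. You can easily adapt it to loop over all the URLs in allLinks; I show the first one.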

    from io import StringIO

    import requests
    from bs4 import BeautifulSoup as bs
    import pandas as pd

    url = 'https://eprocure.gov.in/cppp/latestactivetendersnew/cpppdata?page=1'

    with requests.Session() as s:

        r = s.get(url)
        soup = bs(r.content, 'lxml')

        ## all table links to individual tenders
        titles, allLinks = zip(*[(item.text, item['href']) for item in soup.select('td:nth-of-type(5) a')])

        r = s.get(allLinks[0])  # choose first link from table
        soup = bs(r.content, 'lxml')
        # container = soup.select_one('#tender_full_view')
        # wrap in StringIO: newer pandas versions deprecate passing literal HTML
        tables = pd.read_html(StringIO(r.text))

        for table in tables:
            print(table.fillna(''))
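The loop over allLinks mentioned above could be sketched roughly as follows. This is only an illustration, not part of the original answer: the helper name `fetch_all_tables` and its `fetch`, `parse` and `limit` parameters are hypothetical, introduced so the loop can be tried without network access.

```python
from typing import Callable, Dict, Iterable, List, Optional


def fetch_all_tables(
    links: Iterable[str],
    fetch: Callable[[str], str],
    parse: Callable[[str], List],
    limit: Optional[int] = None,
) -> Dict[str, List]:
    """Fetch each tender page and parse its tables, keyed by URL.

    fetch(url) returns the page HTML; parse(html) returns a list of tables.
    Both are hypothetical hooks, injected so the loop can be tried offline.
    """
    results: Dict[str, List] = {}
    for i, link in enumerate(links):
        if limit is not None and i >= limit:
            break
        try:
            results[link] = parse(fetch(link))
        except ValueError:
            # pandas.read_html raises ValueError when a page has no tables
            results[link] = []
    return results


# Possible wiring with the session `s` from the answer (assumes requests and pandas):
#   from io import StringIO
#   all_tables = fetch_all_tables(
#       allLinks,
#       fetch=lambda u: s.get(u).text,
#       parse=lambda html: pd.read_html(StringIO(html)),
#   )
```

Passing the fetch and parse steps in as callables is a design choice made for this sketch; calling `s.get` and `pd.read_html` directly inside the loop would work just as well.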


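If Selenium is an option, you can do the following to collect all the tender links from the page 1 landing. You can then index into the list of URLs to visit any individual tender. I also collect the link titles in case you want to search by title and then use the index.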

    from io import StringIO

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    import pandas as pd

    d = webdriver.Chrome()
    url = 'https://eprocure.gov.in/cppp/latestactivetendersnew/cpppdata?page=1'

    d.get(url)
    ## all table links to individual tenders
    links = WebDriverWait(d, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'td:nth-of-type(5) a')))
    titles, allLinks = zip(*[(item.text, item.get_attribute('href')) for item in links])

    d.get(allLinks[0])  # choose first link from table

    container = WebDriverWait(d, 5).until(EC.presence_of_element_located((By.CSS_SELECTOR, '#tender_full_view')))
    html = container.get_attribute('innerHTML')
    # wrap in StringIO: newer pandas versions deprecate passing literal HTML
    tables = pd.read_html(StringIO(html))

    for table in tables:
        print(table.fillna(''))

    d.quit()  # close the browser when done
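Searching by title and then using the index could look something like the sketch below. The function name `find_link_by_title` and the sample data are made up for illustration; in practice `titles` and `allLinks` would be the parallel tuples built above.

```python
from typing import Optional, Sequence


def find_link_by_title(
    titles: Sequence[str], links: Sequence[str], query: str
) -> Optional[str]:
    """Return the link whose title contains `query` (case-insensitive), or None."""
    needle = query.casefold()
    for index, title in enumerate(titles):
        if needle in title.casefold():
            # titles and links are parallel, so the same index applies to both
            return links[index]
    return None


# Made-up sample data standing in for the scraped titles and links
titles = ("Supply of cables", "Repair of road")
allLinks = ("https://example.com/tender/1", "https://example.com/tender/2")
print(find_link_by_title(titles, allLinks, "road"))
```

The first matching title wins here; if several tenders can share a keyword, returning a list of all matching links may be the better fit.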

Regarding python - Unable to scrape main_container from a specific page, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/55352117/
