python - Scraping data from a page of links with no class and no id, using Selenium or BeautifulSoup in Python

Tags: python selenium web-scraping beautifulsoup

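I would like to know how to scrape this website: https://1997-2001.state.gov/briefings/statements/2000/2000_index.html

It contains only 'a' tags with 'href' attributes, no classes or IDs, and the structure is very simple. I want to run a script that scrapes the content behind every link on the page.

I tried the following code with chromedriver, but it only prints a list of the links (I'm fairly new to web scraping). Any help would be great.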

    >>> elems = driver.find_elements_by_xpath("//a[@href]")
    >>> for elem in elems:
    ...     print(elem.get_attribute("href"))

Best answer

I hope I've understood your question correctly: this script iterates over each link, opens it, and prints the document it contains:

import requests 
from bs4 import BeautifulSoup


url = 'https://1997-2001.state.gov/briefings/statements/2000/2000_index.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for a in soup.select('td[width="580"] img + a'):
    u = 'https://1997-2001.state.gov/briefings/statements/2000/' + a['href']
    print(u)
    s = BeautifulSoup(requests.get(u).content, 'html.parser')
    t = s.select_one('td[width="580"]').get_text(strip=True, separator='\n')
    print( t.split('[end of document]')[0] )
    print('-' * 80)

Prints:

https://1997-2001.state.gov/briefings/statements/2000/ps001227.html
Statement by Philip T. Reeker, Deputy Spokesman
December 27, 2000
China - LUOYANG Fire
We were saddened to learn of the terrible fire that killed hundreds of people in the Chinese city of Luoyang.  The United States offers its sincerest condolences to the families of the victims of the tragic December 25 blaze.  We also offer our best wishes for a speedy recovery to the survivors.

--------------------------------------------------------------------------------
https://1997-2001.state.gov/briefings/statements/2000/ps001226.html
Media Note
December 26, 2000
Renewal of the Secretary of State's Advisory Committee
on
Private International Law
The Department of State has renewed the Charter of the Secretary of State's Advisory Committee on Private International Law (ACPIL), effective as of November 20, 2000.   The Under Secretary for Management has determined that ACPIL is necessary and in the public interest.
ACPIL enables the Department to obtain the expert and considered view of the private sector organizations and interests most knowledgeable of, as well as most affected by, international activities to unify private law.  The committee consists of members from private sector organizations, bar associations, national legal organizations, and federal and state government agency and judicial interests concerned with private international law.  ACPIL will follow the procedures prescribed by the Federal Advisory Committee Act (FACA) (Public Law 92-463).  Meetings will be open to the public unless a determination is made in accordance with Section 10(d) of the FACA, 5 U.S.C. 552b(c)(1) and (4), that a meeting or a portion of the meeting should be closed to the public.
Any questions concerning this committee should be referred to the Executive Director, Harold Burman, at 202-776-8420.

--------------------------------------------------------------------------------
https://1997-2001.state.gov/briefings/statements/2000/ps001225.html
Statement by Philip T. Reeker, Deputy Spokesman
December 25, 2000
Parliamentary Elections in Serbia
The United States congratulates the Democratic Opposition of Serbia on their victory in Saturday's election for the Serbia parliament. Official results indicate that the United Democratic Opposition (DOS) won with 64 percent of the vote to just 13 percent for the Socialist Party.
We also congratulate the Serbian people for their widespread participation in what international observers have stated was a free and fair election.  This is the first time the Serbian people have had a free and fair election in over a decade. As such, it is an important milestone in the ongoing democratic transition that began with Milosevic's defeat in September's federal presidential elections. The Democratic Opposition is now in a stronger position to carry out the reforms needed to fully integrate Serbia into the international community.
We look forward to working with the new Serbian government in the same amicable and cooperative spirit we now enjoy with the federal Yugoslav government.

--------------------------------------------------------------------------------

...and so on.

Edit: corrected code:

import requests 
from bs4 import BeautifulSoup


url = 'https://1997-2001.state.gov/briefings/statements/2000/2000_index.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for a in soup.select('td[width="580"] img + a'):
    u = 'https://1997-2001.state.gov/briefings/statements/2000/' + a['href']
    print(u)
    s = BeautifulSoup(requests.get(u).content, 'html.parser')
    t = s.select_one('td[width="580"], td[width="600"], table[width="580"]:has(td[colspan="2"])').get_text(strip=True, separator='\n')
    print( t.split('[end of document]')[0] )
    print('-' * 80)

Edit 2 (version for 1999):

import requests 
from bs4 import BeautifulSoup


url = 'https://1997-2001.state.gov/briefings/statements/1999/1999_index.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

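# Each statement link on the index page directly follows an <img> inside the 580px-wide cell.
for a in soup.select('td[width="580"] img + a'):
    # Some hrefs are already absolute; prefix only the relative ones.
    if 'http' not in a['href']:
        u = 'https://1997-2001.state.gov/briefings/statements/1999/' + a['href']
    else:
        u = a['href']

    print(u)
    s = BeautifulSoup(requests.get(u).content, 'html.parser')
    # The pages use several layouts, so accept any of the known containers.
    tag = s.select_one('td[width="580"], td[width="600"], table[width="580"]:has(td[colspan="2"]), blockquote')
    if tag:  # skip pages where none of the selectors match
        t = tag.get_text(strip=True, separator='\n')
        print(t.split('[end of document]')[0])
    print('-' * 80)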
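
As a side note, the `'http' not in a['href']` check in Edit 2 could probably be replaced by `urllib.parse.urljoin` from the standard library, which resolves relative hrefs against a base URL and leaves absolute ones untouched. A minimal sketch, assuming the index URL as the base; the helper name `resolve` and the file name `ps991231.html` are made up for illustration and are not taken from the site:

```python
from urllib.parse import urljoin

# Base taken from the 1999 index URL used in Edit 2.
INDEX_URL = 'https://1997-2001.state.gov/briefings/statements/1999/1999_index.html'


def resolve(href, base=INDEX_URL):
    """Return an absolute URL for href, whether it is relative or already absolute."""
    return urljoin(base, href)


# 'ps991231.html' is a hypothetical relative href, used only as an example.
print(resolve('ps991231.html'))
# An href that is already absolute is returned as-is.
print(resolve('https://example.com/other.html'))
```

Inside the loop this would become `u = resolve(a['href'])`, which also avoids hard-coding the directory prefix a second time.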

Regarding "python - Scraping data from a page of links with no class and no id, using Selenium or BeautifulSoup in Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63891393/
