Python -> BeautifulSoup -> Web scraping -> Loop over URLs (1 to 53) and save the results

Tags: python web-scraping beautifulsoup

Here is the website I am trying to scrape: http://livingwage.mit.edu/

The specific URLs are:

http://livingwage.mit.edu/states/01

http://livingwage.mit.edu/states/02

http://livingwage.mit.edu/states/04 (For some reason they skipped 03)

...all the way to...

http://livingwage.mit.edu/states/56

and on each of these URLs, I need the last row of the second table:

Example for http://livingwage.mit.edu/states/01

Required annual income before taxes $20,260 $42,786 $51,642 $64,767 $34,325 $42,305 $47,345 $53,206 $34,325 $47,691 $56,934 $66,997

Desired output:

Alabama $20,260 $42,786 $51,642 $64,767 $34,325 $42,305 $47,345 $53,206 $34,325 $47,691 $56,934 $66,997

Alaska $24,070 $49,295 $60,933 $79,871 $38,561 $47,136 $52,233 $61,531 $38,561 $54,433 $66,316 $82,403

...

...

Wyoming $20,867 $42,689 $52,007 $65,892 $34,988 $41,887 $46,983 $53,549 $34,988 $47,826 $57,391 $68,424

After 2 hours of fiddling, this is what I have so far (I am a beginner):

import requests, bs4

res = requests.get('http://livingwage.mit.edu/states/01')
res.raise_for_status()
# Pass an explicit parser so bs4 does not warn about the default.
states = bs4.BeautifulSoup(res.text, 'html.parser')

state_name = states.select('h1')

table = states.find_all('table')[1]
rows = table.find_all('tr', 'odd')[4:]

result = []
result.append(state_name)
result.append(rows)

When I look at state_name and rows in the Python console, it gives me the HTML elements:

[<h1>Living Wag...Alabama</h1>]

[<tr class = "odd...   </td> </tr>]

Question 1: These are the results I want, but how do I get Python to give them to me as strings rather than as HTML elements like the above?

Question 2: How do I loop through requests.get(url01 to url56)?

Thanks for your help.

Also, if you could suggest a more efficient way of getting the rows variable in my code, I would appreciate it, as my approach is not very Pythonic.
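Both questions can be sketched without touching the site: .get_text() turns a tag into a plain string, and the numbered URLs can be built by zero-padding a counter. The HTML snippet below is a hypothetical stand-in for one of the state pages:

```python
import bs4

# A tiny inline snippet standing in for a state page's heading
# (hypothetical markup, just to demonstrate .get_text()).
html = '<h1>Living Wage Calculation for Alabama</h1>'
soup = bs4.BeautifulSoup(html, 'html.parser')
state_name = soup.select_one('h1').get_text()  # a plain str, not a Tag
print(state_name)  # Living Wage Calculation for Alabama

# Building the numbered URLs: zero-pad the counter to two digits.
# Some numbers (e.g. 03) don't exist, so a real loop should check
# res.status_code before parsing each page.
urls = ['http://livingwage.mit.edu/states/{:02d}'.format(i)
        for i in range(1, 57)]
print(urls[0])  # http://livingwage.mit.edu/states/01
```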

Best Answer

Just get all the states from the initial page, then you can select the second table and use the css classes odd results to get the tr you need; no slicing is required, as those class names are unique:

import requests
from bs4 import BeautifulSoup
from urllib.parse import  urljoin # python2 -> from urlparse import urljoin 


base = "http://livingwage.mit.edu"
res = requests.get(base)

res.raise_for_status()
states = []
# Get all state urls and state names from the anchor tags on the base page:
# select all the anchors inside each li that are children of the
# ul with the css classes "states list-unstyled".
for a in BeautifulSoup(res.text, "html.parser").select("ul.states.list-unstyled li a"):
    # The hrefs look like "/states/51/locations".
    #  We want everything before /locations so we split on / from the right -> /states/51/
    # and join to the base url. The anchor text also holds the state name,
    # so we return the full url and the state, i.e "http://livingwage.mit.edu/states/01 "Alabama".
    states.append((urljoin(base, a["href"].rsplit("/", 1)[0]), a.text))


def parse(soup):
    # Get the second table, indexing in css starts at 1, so table:nth-of-type(2)" gets the second table.
    table = soup.select_one("table:nth-of-type(2)")
    # To get the text, we just need to find all the tds and call .text on each.
    # The row we want has the css classes "odd results"; "td + td" starts
    # from the second td, as we don't want the first.
    return [td.text.strip() for td in table.select_one("tr.odd.results").select("td + td")]


# Unpack the url and state from each tuple in our states list. 
for url, state in states:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print(state, parse(soup))

If you run the code, you will see output like:

Alabama ['$21,144', '$43,213', '$53,468', '$67,788', '$34,783', '$41,847', '$46,876', '$52,531', '$34,783', '$48,108', '$58,748', '$70,014']
Alaska ['$24,070', '$49,295', '$60,933', '$79,871', '$38,561', '$47,136', '$52,233', '$61,531', '$38,561', '$54,433', '$66,316', '$82,403']
Arizona ['$21,587', '$47,153', '$59,462', '$78,112', '$36,332', '$44,913', '$50,200', '$58,615', '$36,332', '$52,483', '$65,047', '$80,739']
Arkansas ['$19,765', '$41,000', '$50,887', '$65,091', '$33,351', '$40,337', '$45,445', '$51,377', '$33,351', '$45,976', '$56,257', '$67,354']
California ['$26,249', '$55,810', '$64,262', '$81,451', '$42,433', '$52,529', '$57,986', '$68,826', '$42,433', '$61,328', '$70,088', '$84,192']
Colorado ['$23,573', '$51,936', '$61,989', '$79,343', '$38,805', '$47,627', '$52,932', '$62,313', '$38,805', '$57,283', '$67,593', '$81,978']
Connecticut ['$25,215', '$54,932', '$64,882', '$80,020', '$39,636', '$48,787', '$53,857', '$61,074', '$39,636', '$60,074', '$70,267', '$82,606']

You could loop over the range 1-53, but extracting the anchors from the base page also gives us the state names in a single step; using the h1 from each page would instead give you output like Living Wage Calculation for Alabama, which you would then have to parse to get the name, and that is not trivial considering some states have multi-word names.
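If you did go the h1 route, one workable (if fragile) approach is to split the heading once on " for ", which copes with multi-word state names. The titles below are hypothetical examples of what the page headings look like:

```python
# Hypothetical h1 texts as they appear on the state pages.
titles = [
    'Living Wage Calculation for Alabama',
    'Living Wage Calculation for New Hampshire',
]
# Split once on ' for ' and keep the right-hand side; this handles
# multi-word names, though it would break if a state name ever
# contained the substring ' for '.
names = [t.split(' for ', 1)[1] for t in titles]
print(names)  # ['Alabama', 'New Hampshire']
```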

This question and answer are based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/38896893/
