python - Extracting certain columns from a table using BeautifulSoup

Tags: python html xml web-scraping beautifulsoup

Hello, I'm trying to use an HTML table to determine the dates on which an item was purchased on eBay:


import requests
from bs4 import BeautifulSoup

def soup_creator(url):
    # Download the eBay page for processing
    res = requests.get(url)
    # Raise an exception if the download failed
    res.raise_for_status()
    # Create a BeautifulSoup object for HTML parsing
    return BeautifulSoup(res.text, 'lxml')

soup = soup_creator(item_link)
purchases = soup.find('div', attrs={'class': 'BHbidSecBorderGrey'})
purchases = purchases.findAll('tr', attrs={'bgcolor': '#ffffff'})
for purchase in purchases:
    # The date is expected in the third left-aligned cell of each row
    date = purchase.findAll("td", {"align": "left"})
    date = date[2].get_text()
    print(date)

When I run the print statement it doesn't output anything, which I think means nothing was found. I'd like it to print something like this:

Jul-02-19 18:22:28 PDT
Jun-27-19 16:12:59 PDT
Jun-23-19 06:46:23 PDT


Pandas:

This is very simple with pandas: index into the right table and slice out the column.

import pandas as pd

table = pd.read_html('')[4]
table['Date of Purchase']
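
The hard-coded index [4] depends on how many tables the page happens to contain. As a rough sketch of a less brittle variant, the match argument of pd.read_html can pick out only the tables that contain a given piece of text. The inline HTML, the user IDs and the prices below are made up for illustration; only the header name and the dates come from the question.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the purchase-history page (not real eBay markup)
html = """
<table width="100%">
  <tr><th>User ID</th><th>Price</th><th>Quantity</th><th>Date of Purchase</th></tr>
  <tr><td>a***b</td><td>US $10.00</td><td>1</td><td>Jul-02-19 18:22:28 PDT</td></tr>
  <tr><td>c***d</td><td>US $10.00</td><td>1</td><td>Jun-27-19 16:12:59 PDT</td></tr>
  <tr><td>e***f</td><td>US $10.00</td><td>1</td><td>Jun-23-19 06:46:23 PDT</td></tr>
</table>
"""

# match keeps only the tables whose text contains the given string
tables = pd.read_html(StringIO(html), match='Date of Purchase')
dates = tables[0]['Date of Purchase'].tolist()
print(dates)
```

With a real page you would pass the URL (or the downloaded HTML) instead of the inline string.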

bs4 method 1:

Since you know the column number, you can use nth-of-type on the table of interest.

from bs4 import BeautifulSoup as bs
import requests

r = requests.get('')
soup = bs(r.content, 'lxml')
# if the column number is known
purchases = [item.text for item in soup.select('table[width] td:nth-of-type(5)')]
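
To see what the selector does without hitting the network, here is a small sketch that runs the same idea on an inline table. The markup, user IDs and prices are invented for illustration, and the date sits in column 4 of this toy table rather than column 5; I also use the built-in 'html.parser' so the snippet doesn't depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the purchase-history table (not real eBay markup)
html = """
<table width="100%">
  <tr><th>User ID</th><th>Price</th><th>Quantity</th><th>Date of Purchase</th></tr>
  <tr><td>a***b</td><td>US $10.00</td><td>1</td><td>Jul-02-19 18:22:28 PDT</td></tr>
  <tr><td>c***d</td><td>US $10.00</td><td>1</td><td>Jun-27-19 16:12:59 PDT</td></tr>
  <tr><td>e***f</td><td>US $10.00</td><td>1</td><td>Jun-23-19 06:46:23 PDT</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# td:nth-of-type(4) matches the fourth <td> within each row;
# the header row only has <th> cells, so it is skipped automatically
dates = [td.get_text(strip=True) for td in soup.select('table[width] td:nth-of-type(4)')]
print(dates)
```

Note that table[width] matches any table carrying a width attribute, so on a page with several such tables you may need a more specific selector.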

bs4 method 2 (less ideal, for when the column number is not known):

from bs4 import BeautifulSoup as bs
import requests

r = requests.get('')
soup = bs(r.content, 'lxml')
# if the column number is not known, look it up from the header row
headers = [item.text.strip() for item in soup.select('table[width] th')]
desired_header = 'Date of Purchase'

if desired_header in headers:
    # nth-of-type is 1-based, list indices are 0-based
    column = headers.index(desired_header) + 1
    print([item.text for item in soup.select(f'table[width] td:nth-of-type({column})')])
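
If you need this for more than one column, the header lookup could be wrapped in a small helper. This is only a sketch: column_by_header and its table_selector parameter are names I made up, the inline HTML is invented test data, and it assumes the selector matches a single table whose header row consists solely of th cells.

```python
from bs4 import BeautifulSoup

def column_by_header(soup, header, table_selector='table[width]'):
    """Return the text of every cell in the column titled `header`.

    Hypothetical helper; returns an empty list if the header is not found.
    """
    headers = [th.get_text(strip=True) for th in soup.select(f'{table_selector} th')]
    if header not in headers:
        return []
    # nth-of-type is 1-based, list indices are 0-based
    position = headers.index(header) + 1
    cells = soup.select(f'{table_selector} td:nth-of-type({position})')
    return [cell.get_text(strip=True) for cell in cells]

# Invented sample markup (not real eBay HTML)
html = """
<table width="100%">
  <tr><th>User ID</th><th>Price</th><th>Date of Purchase</th></tr>
  <tr><td>a***b</td><td>US $10.00</td><td>Jul-02-19 18:22:28 PDT</td></tr>
  <tr><td>c***d</td><td>US $10.00</td><td>Jun-27-19 16:12:59 PDT</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
found = column_by_header(soup, 'Date of Purchase')
missing = column_by_header(soup, 'Shipping')
print(found)
print(missing)
```

Returning an empty list for an unknown header keeps the caller's loop simple; raising an exception instead would be just as reasonable if a missing column should be treated as an error.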

