python - Scraping from multiple URLs using BeautifulSoup

My code works. Now I want to modify it slightly to fetch data from multiple URLs, where the URLs differ by only one word.

Here is my code, which fetches from a single URL only.

from string import punctuation, whitespace
import urllib2
import datetime
import re
from bs4 import BeautifulSoup as Soup
import csv
today = datetime.date.today()
html = urllib2.urlopen("http://www.99acres.com/property-in-velachery-chennai-south-ffid").read()

soup = Soup(html)
print "INSERT INTO `property` (`date`,`Url`,`Rooms`,`place`,`PId`,`Phonenumber1`,`Phonenumber2`,`Phonenumber3`,`Typeofperson`,` Nameofperson`,`typeofproperty`,`Sq.Ft`,`PerSq.Ft`,`AdDate`,`AdYear`)"
print 'VALUES'
re_digit = re.compile('(\d+)')
properties = soup.findAll('a', title=re.compile('Bedroom'))

for eachproperty in soup.findAll('div', {'class':'sT'}):
  a      = eachproperty.find('a', title=re.compile('Bedroom'))
  pdate  = eachproperty.find('i', {'class':'pdate'})
  pdates = re.sub('(\s{2,})', ' ', pdate.text)
  div    = eachproperty.find('div', {'class': 'sT_disc grey'})
  try:
    project = div.find('span').find('b').text.strip()
  except:
    project = 'NULL'        
  area = re.findall(re_digit, div.find('i', {'class': 'blk'}).text.strip())
  print ' ('
  print today,","+ (a['href'] if a else '`NULL`')+",", (a.string if a else 'NULL, NULL')+ "," +",".join(re.findall("'([a-zA-Z0-9,\s]*)'", (a['onclick'] if a else 'NULL, NULL, NULL, NULL, NULL, NULL')))+","+ ", ".join([project] + area),","+pdates+""
  print ' ), '

These are the URLs I want to fetch in the same run:

http://www.99acres.com/property-in-velachery-chennai-south-ffid
http://www.99acres.com/property-in-thoraipakkam-chennai-south-ffid
http://www.99acres.com/property-in-madipakkam-chennai-south-ffid

As you can see, only one word differs between the URLs.

I am trying to create an array like the following:

for locality in areas (http://www.99acres.com/property-in-velachery-chennai-south-ffid
, http://www.99acres.com/property-in-thoraipakkam-chennai-south-ffid,    http://www.99acres.com/property-in-madipakkam-chennai-south-ffid):
link = "str(locality)"
html = urllib2.urlopen(link)
soup = Soup(html)

This does not seem to work. What I actually want is to pass just one word into the URL, like this:

for locality in areas(madipakkam, thoraipakkam, velachery):
    link = “http://www.99acres.com/property-in-+ str(locality)+-chennai-south-ffid"
    html= urllib2.urlopen(link)
    soup = BeautifulSoup(html)

I hope I have made myself clear.

Best answer

This:

for locality in areas (http://www.99acres.com/property-in-velachery-chennai-south-ffid, http://www.99acres.com/property-in-thoraipakkam-chennai-south-ffid,    http://www.99acres.com/property-in-madipakkam-chennai-south-ffid):
link = "str(locality)"

… won't work, for a number of reasons.

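First, you're calling an areas function that you never defined anywhere, and I'm not sure what you'd expect that function to do.

Second, you're trying to pass http://www.99acres.com/property-in-velachery-chennai-south-ffid as if it were a meaningful Python expression, when it doesn't even parse. If you want to pass a string, you have to put it in quotes.

Third, "str(locality)" is the literal string str(locality). If you want to call the str function on the locality variable, don't put quotes around it. But really, there's no reason to call str at all; locality is already a string.

Finally, you didn't indent the body of the for loop. You have to indent the link = line, and everything you were previously doing at the top level, so that it sits under the for. That way it happens once for each value in the loop, instead of once in total after the loop has finished.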
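The third point can be checked directly. The following is only a small sketch; the variable names quoted, called and plain are made up for illustration and do not appear in the question.

```python
locality = "velachery"

quoted = "str(locality)"   # a string literal: nothing inside the quotes is called
called = str(locality)     # actually calls str(), which returns an equal string
plain = locality           # already a str, so no conversion is needed

print(quoted)
print(called)
print(called == plain)
```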

Try this:

for link in ("http://www.99acres.com/property-in-velachery-chennai-south-ffid",
             "http://www.99acres.com/property-in-thoraipakkam-chennai-south-ffid",
             "http://www.99acres.com/property-in-madipakkam-chennai-south-ffid"):
    # all the stuff you do for each URL
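To illustrate why the indentation matters, here is a minimal sketch. The process function is a hypothetical placeholder that only records each URL; in the real script its place would be taken by the urlopen, Soup and print steps.

```python
urls = (
    "http://www.99acres.com/property-in-velachery-chennai-south-ffid",
    "http://www.99acres.com/property-in-thoraipakkam-chennai-south-ffid",
    "http://www.99acres.com/property-in-madipakkam-chennai-south-ffid",
)

processed = []

def process(link):
    # Hypothetical stand-in for the real per-URL work (download, parse, print).
    processed.append(link)

for link in urls:
    process(link)      # indented: runs once for each URL

last_only = link       # not indented: runs once, after the loop has finished

print(len(processed))
print(last_only)
```

Anything left at the top level after the loop, like last_only here, only ever sees the final URL.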


You're on the right track here:

for locality in areas(madipakkam, thoraipakkam, velachery):
link = “http://www.99acres.com/property-in-+ str(locality)+-chennai-south-ffid"

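Using a "template string" to avoid repeating yourself is almost always a good idea.

However, there are still a number of problems.

First, you're again calling an areas function that doesn't exist, and trying to use bare strings without quotes.

Second, you have the opposite of the previous problem: you've put the + and str(locality), which you want evaluated as expressions, in the middle of a string. You need to break that into two separate strings that can be operands of the + expression.

And again, you didn't indent the loop body, and you're calling str unnecessarily.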

So:

for locality in "velachery", "thoraipakkam", "madipakkam":
    link = "http://www.99acres.com/property-in-" + locality + "-chennai-south-ffid"
    # all the stuff you do for each URL
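The difference between the two versions can be seen by printing both. This is just a sketch; the names wrong and right are illustrative, and the opening quote of the question's line is written here as a plain ASCII quote so that it parses.

```python
locality = "madipakkam"

# Everything between the quotes is literal text, so the + and str(...)
# are never evaluated and the locality never ends up in the URL.
wrong = "http://www.99acres.com/property-in-+ str(locality)+-chennai-south-ffid"

# Two separate string literals, joined to the variable by the + operator.
right = "http://www.99acres.com/property-in-" + locality + "-chennai-south-ffid"

print(wrong)
print(right)
```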


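Code is usually easier to read, and easier to check for mistakes, when you use a formatting function instead of trying to concatenate strings together. For example: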

for locality in "velachery", "thoraipakkam", "madipakkam":
    link = "http://www.99acres.com/property-in-{}-chennai-south-ffid".format(locality)
    # all the stuff you do for each URL

Here, it's immediately obvious where each locality fits into the string, what the string will look like, where the hyphens go, and so on.
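Putting the pieces together, one possible arrangement is sketched below. URL_TEMPLATE, build_url, scrape_localities and fake_fetch are all hypothetical names introduced for this example, and the download step is passed in as a function so the sketch runs without network access. It is written for Python 3, whereas the question's script targets Python 2.

```python
URL_TEMPLATE = "http://www.99acres.com/property-in-{}-chennai-south-ffid"

def build_url(locality):
    # Fill the single varying word into the shared URL pattern.
    return URL_TEMPLATE.format(locality)

def scrape_localities(localities, fetch):
    # Call fetch(url) once per locality and collect the results by locality.
    results = {}
    for locality in localities:
        link = build_url(locality)
        results[locality] = fetch(link)
    return results

def fake_fetch(link):
    # Stand-in for a real download, used here so nothing touches the network.
    return "<html>" + link + "</html>"

pages = scrape_localities(["velachery", "thoraipakkam", "madipakkam"], fake_fetch)
for locality in sorted(pages):
    print(locality, len(pages[locality]))
```

In a real run, fetch could be a small function that wraps urllib2.urlopen(link).read() on Python 2 (urllib.request.urlopen on Python 3), with the parsing and printing from the original script applied to each returned page.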

Regarding python - Scraping from multiple URLs using BeautifulSoup, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/18821388/
