python - How to scrape two URLs and put the elements of each URL into one table?

Tags: python pandas url beautifulsoup urlparse

I want to scrape two URLs from the same site to get room prices in New York City. I use BeautifulSoup to get the address, price and availability of each room. After that, I build a dictionary so that I can create a DataFrame.

I currently get two separate DataFrames, one per URL, but I would like the information from both URLs to end up in a single DataFrame.

After getting the information I need, I append it to the lists that I later use for the dictionary:

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urlparse

def getRoomInfo(startingPage):
    html = requests.get(startingPage)
    bs1 = BeautifulSoup(html.text, "html.parser")
    # scheme + host of the listing page, used to turn relative hrefs into full URLs
    url = "{}://{}".format(urlparse(startingPage).scheme, urlparse(startingPage).netloc)

    # collect the relative links to the individual room pages
    href_links = []
    for link in bs1.find_all("a", href=re.compile(r"/new-york-apartment/roommate-share/\d+")):
        href_links.append(link["href"])

    room_link = []
    for links in href_links:
        room_link.append(url + links)

    addressList = []
    priceList = []
    availabilityList = []

    # visit each room page and pull out address, price and availability
    for page in room_link:
        html_page = requests.get(page)
        bs_page = BeautifulSoup(html_page.text, "html.parser")

        address = bs_page.find("div", {"class": "ap-info-address"})
        addressList.append(address.get_text())

        price = bs_page.find("div", {"class": "apt-price price-cur-1"})
        priceList.append(price.get_text())

        availability = bs_page.find("td")
        availabilityList.append(availability.get_text())

    infoDataFrame = pd.DataFrame(
        {"Address": addressList,
         "Price": priceList,
         "Availability": availabilityList,
         })

    print(infoDataFrame)

links_rooms = ("https://www.nyhabitat.com/new-york-apartment/roommate-share",
               "https://www.nyhabitat.com/new-york-apartment/list.php?page=2&dep=SH&lev=3&price=400;2400&guest=1&sort=new&cll=1&searchMapLeft=40.60484725779594&searchMapTop=-73.81336257537379&searchMapRight=40.90185344223534&searchMapBottom=-74.14810226043238&searchMapZoom=11&div_code=ny&lang=en")
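For reference, the merging step the question asks about can be done with pandas.concat. This is only a minimal sketch, and it assumes getRoomInfo is changed to return infoDataFrame instead of printing it:

frames = [getRoomInfo(link) for link in links_rooms]   # one DataFrame per starting URL
all_rooms = pd.concat(frames, ignore_index=True)       # a single table covering both URLs
print(all_rooms)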

Best Answer

The strip() string method removes all leading and trailing whitespace from a string, and is used below to clean up the scraped text.
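For example (a quick illustration, not part of the original answer):

raw = "  $1,395 \n"
print(raw.strip())   # prints "$1,395" with the surrounding whitespace removed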

# room_link is the list of room-page URLs built in the question's code;
# requests, BeautifulSoup and pd are imported there as well
rooms = []
for page in room_link:
    html_page = requests.get(page)
    soup = BeautifulSoup(html_page.text, "html.parser")

    # each listing card sits in a div whose class contains "grid-col"
    for row in soup.select('div[class*="grid-col"]'):
        room = {}
        priceDiv = row.find("div", {'class': 'slider-descr-wrap'})
        room['price'] = priceDiv.find("div", {"class": "apt-price price-cur-1"}).text.strip()
        addrDiv = row.find("div", {'class': 'slider-descr-bottom'})
        room['address'] = addrDiv.find("span", {"class": "slider-descr-2-row"}).text.strip()
        room['availability'] = addrDiv.find("span", {'class': 'search-aval'}).text.strip()
        rooms.append(room)

print(rooms)
df = pd.DataFrame(rooms, columns=['price', 'address', 'availability'])
print(df)

Output:

[{'price': '$1,395', 'address': 'Bushwick, Brooklyn', 'availability': 'Available Aug 01 2019'}, {'price': '$1,350', 'address': 'Fort Greene, Brooklyn', 'availability': 'Available Jun 15 2019'}, {'price': '$1,055', 'address': 'Kips Bay, Manhattan', 'availability': 'Available Jun 30 2019'}, {'price': '$1,350', 'address': 'Duplex, Brooklyn', 'availability': 'Available Jun 08 2019'}, {'price': '$900', 'address': 'Flatbush, Brooklyn', 'availability': 'Available Aug 10 2019'}, {'price': '$1,100', 'address': 'Flatbush, Brooklyn', 'availability': 'Available Aug 10 2019'}, {'price': '$615', 'address': 'Washington Heights, Manhattan', 'availability': 'Available Aug 31 2019'}, {'price': '$900', 'address': 'Duplex, Ridgewood, Queens', 'availability': 'Available Jun 08 2019'}, {'price': '$663', 'address': 'Washington Heights, Manhattan', 'availability': 'Available Jun 12 2020'}, {'price': '$1,150', 'address': 'Triplex, Ridgewood, Queens', 'availability': 'Available Jun 08 2019'}, {'price': '$1,317', 'address': 'Stuyvesant Town, Manhattan', 'availability': 'Available Dec 31 2019'}, {'price': '$750', 'address': 'Jamaica, Queens', 'availability': 'Available Jun 08 2019'}, {'price': '$1,700', 'address': 'Chelsea, Manhattan', 'availability': 'Available Sep 01 2019'}, {'price': '$950', 'address': 'Astoria, Queens', 'availability': 'Available Jul 22 2019'}, {'price': '$1,750', 'address': 'Chelsea, Manhattan', 'availability': 'Available Jun 08 2019'}, {'price': '$1,375', 'address': 'Harlem, Manhattan', 'availability': 'Available Oct 01 2019'}, {'price': '$531', 'address': 'Forest Hills, Queens', 'availability': 'Available Aug 01 2019'}, {'price': '$950', 'address': 'Brooklyn', 'availability': 'Available Jun 08 2019'}, {'price': '$938', 'address': 'Washington Heights, Manhattan', 'availability': 'Available Jun 08 2019'}, {'price': '$1,200', 'address': 'Flatbush, Brooklyn', 'availability': 'Available Dec 01 2019'}]
     price                        address           availability
0   $1,395             Bushwick, Brooklyn  Available Aug 01 2019
1   $1,350          Fort Greene, Brooklyn  Available Jun 15 2019
2   $1,055            Kips Bay, Manhattan  Available Jun 30 2019
3   $1,350               Duplex, Brooklyn  Available Jun 08 2019
4     $900             Flatbush, Brooklyn  Available Aug 10 2019
5   $1,100             Flatbush, Brooklyn  Available Aug 10 2019
6     $615  Washington Heights, Manhattan  Available Aug 31 2019
7     $900      Duplex, Ridgewood, Queens  Available Jun 08 2019
8     $663  Washington Heights, Manhattan  Available Jun 12 2020
9   $1,150     Triplex, Ridgewood, Queens  Available Jun 08 2019
10  $1,317     Stuyvesant Town, Manhattan  Available Dec 31 2019
11    $750                Jamaica, Queens  Available Jun 08 2019
12  $1,700             Chelsea, Manhattan  Available Sep 01 2019
13    $950                Astoria, Queens  Available Jul 22 2019
14  $1,750             Chelsea, Manhattan  Available Jun 08 2019
15  $1,375              Harlem, Manhattan  Available Oct 01 2019
16    $531           Forest Hills, Queens  Available Aug 01 2019
17    $950                       Brooklyn  Available Jun 08 2019
18    $938  Washington Heights, Manhattan  Available Jun 08 2019
19  $1,200             Flatbush, Brooklyn  Available Dec 01 2019

Regarding "python - How to scrape two URLs and put the elements of each URL into one table?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56506169/
