python - Pandas to_csv only writes data from a particular page

Tags: python pandas beautifulsoup

I am trying to scrape data from TripAdvisor, but across the several pages I try to scrape, when I export the results to CSV only 1 row of data shows up, and I get an error message like this:

AttributeError: 'NoneType' object has no attribute 'text'
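This error usually means that one of the find() calls returned None because no element matched the given class, so accessing .text on the result fails. A minimal sketch of how that happens, with a made-up class name:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div>hello</div>', 'html.parser')
tag = soup.find('a', {'class': 'no-such-class'})  # no match, so find() returns None
print(tag)   # None
# tag.text   # would raise AttributeError: 'NoneType' object has no attribute 'text'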
Here is my code:
import requests
import pandas as pd
from requests import get
from bs4 import BeautifulSoup

URL = 'https://www.tripadvisor.com/Attraction_Review-g469404-d3780963-Reviews-oa'

for offset in range(0, 30, 10):
    
    url = URL + str(offset) + '-Double_Six_Beach-Seminyak_Kuta_District_Bali.html'
    headers = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"}
    
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, "html.parser")    
    
    container = soup.find_all('div', {'class':'_2rspOqPP'})
    
    for r in container:
        reviews = r.find_all('div', {'class': None})

        #The container that holds the elements I want to scrape has no attributes, so I access the div with the _2rspOqPP class first and then the attribute-less divs inside it.

        records = []
        for review in reviews:
            user = review.find('a', {'class':'_7c6GgQ6n _37QDe3gr WullykOU _3WoyIIcL'}).text
            country = review.find('div', {'class' : 'DrjyGw-P _26S7gyB4 NGv7A1lw _2yS548m8 _2cnjB3re _1TAWSgm1 _1Z1zA2gh _2-K8UW3T _1dimhEoy'}).span.text
            date = review.find('div', {'class' : '_3JxPDYSx'}).text
            content = review.find('div', {'class' : 'DrjyGw-P _26S7gyB4 _2nPM5Opx'}).text

            records.append((user, country, date, content))
            df = pd.DataFrame(records, columns=['Name', 'Country', 'Date', 'Content'])
            df.to_csv('doublesix_.csv', index=False, encoding='utf-8')
Code update:
for r in container:
    reviews = r.find_all('div', {'class': None})
    records = []
    for review in reviews:
        try:
            user = review.find('a', {'class':'_7c6GgQ6n _37QDe3gr WullykOU _3WoyIIcL'}).text
            country = review.find('div', {'class' : 'DrjyGw-P _26S7gyB4 NGv7A1lw _2yS548m8 _2cnjB3re _1TAWSgm1 _1Z1zA2gh _2-K8UW3T _1dimhEoy'}).span.text
            date = review.find('div', {'class' : '_3JxPDYSx'}).text
            content = review.find('div', {'class' : 'DrjyGw-P _26S7gyB4 _2nPM5Opx'}).text
            
            records.append((user, country, date, content))
        except:
            pass
        

print(records)
df = pd.DataFrame(records, columns=['Name', 'Country', 'Date', 'Content'])
df.to_csv('doublesix_.csv', index=False, encoding='utf-8')

Best Answer

You should move records out of the for loops and un-indent the last few lines.
See this:

import pandas as pd
import requests
from bs4 import BeautifulSoup

main_url = 'https://www.tripadvisor.com/Attraction_Review-g469404-d3780963-Reviews-oa'

country_class = "DrjyGw-P _26S7gyB4 NGv7A1lw _2yS548m8 _2cnjB3re _1TAWSgm1 _1Z1zA2gh _2-K8UW3T _1dimhEoy"
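# Collect rows from every page here; the DataFrame is built only once, after all the loops finish.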
records = []

for offset in range(0, 30, 10):
    url = main_url + str(offset) + '-Double_Six_Beach-Seminyak_Kuta_District_Bali.html'
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
    }

    soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
    container = soup.find_all('div', {'class': '_2rspOqPP'})
    for r in container:
        reviews = r.find_all('div', {'class': None})
        for review in reviews:
            try:
                user = review.find('a', {'class': '_7c6GgQ6n _37QDe3gr WullykOU _3WoyIIcL'}).text
                country = review.find('div', {'class': country_class}).span.text
                date = review.find('div', {'class': '_3JxPDYSx'}).text
                content = review.find('div', {'class': 'DrjyGw-P _26S7gyB4 _2nPM5Opx'}).text
                records.append((user, country, date, content))
            except AttributeError:
                pass

df = pd.DataFrame(records, columns=['Name', 'Country', 'Date', 'Content'])
df.to_csv('doublesix_.csv', index=False, encoding='utf-8')
Output of the resulting .csv file (screenshot omitted).
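If you would rather write to disk as each page is scraped, an alternative (not part of the original answer) is to append each page's rows to the CSV instead of writing the whole file once at the end. A minimal sketch, assuming the same column layout and a hypothetical append_page() helper:

import os
import pandas as pd

def append_page(records, path='doublesix_.csv'):
    # Append this page's rows; write the header only when the file does not exist yet.
    df = pd.DataFrame(records, columns=['Name', 'Country', 'Date', 'Content'])
    df.to_csv(path, mode='a', index=False, encoding='utf-8',
              header=not os.path.exists(path))

Either way, the key point is the same as in the accepted answer: do not call to_csv inside the inner loop with a freshly created records list, because each call then overwrites the file with only the most recent rows.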

Regarding "python - Pandas to_csv only writes data from a particular page", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/67310756/
