python - Scraping forum posts in Python with Beautiful Soup and lxml: cannot get all posts

Tags: python web-scraping beautifulsoup lxml

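I've run into a problem that is driving me crazy. I'm new to web scraping, and I'm practicing by trying to scrape the content of forum posts (that is, the actual posts people write). I've isolated the posts to the element I believe contains the text: div id="post_message_2793649" (see the attached Screenshot_1 for a better view of the HTML).

The example above is just one of many posts. Each post has its own unique identifier number, but the rest stays consistent: div id="post_message_.

This is where I'm currently stuck: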

import requests
from bs4 import BeautifulSoup
import lxml

r = requests.get('http://www.catforum.com/forum/43-forum-fun/350938-count-one-    billion-2016-a-120.html')

soup = BeautifulSoup(r.content)

data = soup.find_all("td", {"class": "alt1"})

for link in data:
    print(link.find_all('div', {'id': 'post_message'}))

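The code above just produces a bunch of empty lists running down the page, which is very frustrating. (See Screenshot_2 for the code I ran and the output next to it.) What am I missing?

The end result I'm looking for is simply everything people said, contained in one long string, without any of the HTML clutter.

I'm using Beautiful Soup 4 with the lxml parser.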

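Best answer

You have a couple of problems. The first is that your URL contains multiple spaces, so you are not reaching the page you think you are: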

In [49]: import requests

In [50]: r = requests.get('http://www.catforum.com/forum/43-forum-fun/350938-count-one-    billion-2016-a-120.html')

In [51]: r.url # with spaces
Out[51]: 'http://www.catforum.com/forum/'

In [52]: r = requests.get('http://www.catforum.com/forum/43-forum-fun/350938-count-one-billion-2016-a-120.html')

In [53]: r.url # without spaces
Out[53]: 'http://www.catforum.com/forum/43-forum-fun/350938-count-one-billion-2016-a-120.html'
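One way to catch this kind of silent redirect early is to strip whitespace from a pasted URL and to check whether the response went through any redirects. The following is only a minimal sketch: `clean_url` and `was_redirected` are hypothetical helper names, not part of requests, and the only requests feature assumed is the `history` attribute of a response.

```python
def clean_url(url):
    """Remove all whitespace from a URL (hypothetical helper, not part of requests)."""
    return "".join(url.split())


def was_redirected(response):
    """Return True if the response passed through at least one redirect.

    Assumes a requests-style response object whose `history` attribute
    lists the intermediate redirect responses.
    """
    return bool(response.history)


# The URL from the question, with the stray spaces still in it.
pasted = ('http://www.catforum.com/forum/43-forum-fun/'
          '350938-count-one-    billion-2016-a-120.html')
print(clean_url(pasted))
```

With requests, this could be used as `r = requests.get(clean_url(pasted))` followed by `was_redirected(r)`; a `True` result would suggest that the page being parsed is not the one that was asked for.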

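The next problem is that the ids start with post_message; none of them is exactly equal to post_message. You can use a CSS selector that matches ids starting with post_message to pull all the divs you want, then extract the text: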

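import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.catforum.com/forum/43-forum-fun/350938-count-one-billion-2016-a-120.html')

# Name the parser explicitly; the question states that lxml is in use.
soup = BeautifulSoup(r.text, "lxml")

# [id^=post_message] matches every element whose id starts with "post_message".
for div in soup.select('[id^=post_message]'):
    print(div.get_text("\n", strip=True))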

This gives you:

11311301
Did you get the cortisone shots? Will they have to remove it?
My Dad and stepmom got a new Jack Russell! Her name's Daisy. She's 2 years old, and she's a rescue(d) dog. She was rescued from an abusive situation. She can't stand noise, and WILL NOT allow herself  to be picked up. They're working on that. Add to that the high-strung, hyper nature of a Jack Russell... But they love her. When I called last night, Pat was trying to teach her 'sit'!
11302
Well, I tidied, cleaned, and shopped. Rest of the list isn't done and I'm too tired and way too hot to care right now.
Miss Luna is howling outside the Space Kitten's room because I let her out and gave them their noms. SHE likes to gobble their food.....little oink.
11303
Daisy sounds like she has found a perfect new home and will realize it once she feels safe.
11304
No, Kurt, I haven't gotten the cortisone shot yet.  They want me to rest it for three weeks first to see if that helps.  Then they would try a shot and remove it if the shot doesn't work.  It might feel a smidge better today but not much.
So have you met Daisy in person yet?  She sounds like a sweetie.
And Carrie, Amelia is a piggie too.  She eats the dog food if I don't watch her carefully!
11305
I had a sore neck yesterday morning after turning it too quickly. Applied heat....took an anti-inflammatory last night. Thought I'd wake up feeling better....nope....still hurts. Grrrrrrrr.
11306
MM- Thanks for your welcome to the COUNTING thread. Would have been better if I remembered to COUNT. I've been a long time lurker on the thread but happy now to get involved in the chat.
Hope your neck is feeling better. Lily and Lola are reminding me to say 'hello' from them too.
11307
Welcome back anniegirl and Lily and Lola! We didn't scare you away! Yeah!
Nightmare afternoon. My SIL was in a car accident and he car pools with my daughter. So, in rush hour, I have to drive an hour into Vancouver to get them (I hate rush hour traffic....really hate it). Then an hour back to their place.....then another half hour to get home. Not good for the neck or the nerves (I really hate toll bridges and driving in Vancouver and did I mention rush hour traffic). At least he is unharmed. Things we do for love of our children!
11308. Hi annegirl! None of us can count either - you'll fit right in.
MM, yikes how scary. Glad he's ok, but that can't have been fun having to do all that driving, especially with an achy neck.
I note that it's the teachers on this thread whose bodies promptly went down...coincidentally once the school year was over...
DebS, how on earth are you supposed to rest your foot for 3 weeks, short of lying in bed and not moving?
MM, how is your shoulder doing? And I missed the whole goodbye to Pyro.
Gah, I hope it slowly gets easier over time as you remember that they're going to families who will love them.
I'm finally not constantly hungry, just nearly constantly.
My weight had gone under 100 lbs
so I have quite a bit of catching up to do. Because of the partial obstruction I had after the surgery, the doctor told me to try to stay on a full liquid diet for a week. I actually told him no, that I was hungry, lol. So he told me to just be careful. I have been, mostly (bacon has entered the picture 3 times in the last 3 days
) and the week expired today, so I'm off to the races.
11309
Welcome to you, annegirl, along with Lily and Lola!  We always love having new friends on our counting thread.
And Spirite, good to hear from you and I'm glad you are onto solid foods.
11310
DebS and Spirite thank you too for the Welcome. Oh MM what an ordeal with your daughter but glad everyone us on.
DevS - hope your foot is improving Its so horrible to be in pain.
Spirite - go wild on the  bacon and whatever else you fancy. I'm making a chocolate orange cheese cake to bring to a dinner party this afternoon. It has so much marscapone in it you put on weight just looking at it.
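The difference between an exact id match and a prefix match can be checked offline. The sketch below uses made-up HTML that only imitates the markup described in the question (the ids and text are assumptions), and it uses the standard-library html.parser so that lxml is not required:

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the forum markup described in the question.
html = """
<table><tr>
<td class="alt1"><div id="post_message_1">First post</div></td>
<td class="alt1"><div id="post_message_2">Second<br>post</div></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# An exact id match finds nothing: no id is literally "post_message".
exact = soup.find_all("div", {"id": "post_message"})

# The attribute-prefix selector matches every id starting with "post_message".
prefix = soup.select('[id^=post_message]')

texts = [div.get_text("\n", strip=True) for div in prefix]
print(len(exact), len(prefix))
print(texts)
```

The empty `exact` list corresponds to the empty lists seen in the question, while `prefix` picks up both divs.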

If you want to use find_all, you need to use a regular expression:

import re
r = requests.get('http://www.catforum.com/forum/43-forum-fun/350938-count-one-billion-2016-a-120.html')
soup = BeautifulSoup(r.text, "lxml")
# A compiled regex as the id filter matches ids that start with "post_message".
for div in soup.find_all(id=re.compile("^post_message")):
    print(div.get_text("\n", strip=True))

The result is the same.
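Since the question asks for everything in one long string, the matched divs can be joined rather than printed one by one. This is a rough sketch on made-up HTML (the ids and text are assumptions, and html.parser stands in for lxml so the example runs without it):

```python
import re

from bs4 import BeautifulSoup

# Made-up HTML: two post divs and one unrelated div.
html = """
<div id="post_message_1">  Hello there  </div>
<div id="post_message_2">Second <b>post</b></div>
<div id="sidebar">Not a post</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only elements whose id starts with "post_message".
posts = soup.find_all(id=re.compile("^post_message"))

# Flatten each post to plain text, then join all posts with single spaces.
all_text = " ".join(div.get_text(" ", strip=True) for div in posts)
print(all_text)
```

Using a space as the separator keeps words from adjacent tags from running together; a newline separator would work equally well if one post per line is preferred.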

Regarding "python - Scraping forum posts in Python with Beautiful Soup and lxml: cannot get all posts", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38745080/
