python - BeautifulSoup - scraping a forum page

Tags: python beautifulsoup screen-scraping

I am trying to scrape a forum discussion and export it as a csv file, with rows such as "thread title", "user", and "post", where the latter is the actual forum post from each individual.

I'm a complete beginner with Python and BeautifulSoup, so I'm really struggling!

My current problem is that all of the text in the csv file is split up with one character per row. Is there anyone out there who can help me? It would be fantastic if someone could!

Here's the code I've been using:

from bs4 import BeautifulSoup
import csv
import urllib2

f = urllib2.urlopen("https://silkroad5v7dywlc.onion.to/index.php?action=printpage;topic=28536.0")

soup = BeautifulSoup(f)

b = soup.get_text().encode("utf-8").strip() #the posts contain non-ascii words, so I had to do this

writer = csv.writer(open('silkroad.csv', 'w'))
writer.writerows(b)

Best answer

Okay, here we go. Not quite sure what I'm helping you do here, but hopefully you have a good reason to be analyzing Silk Road posts.

You have a few problems here, the biggest being that you aren't parsing the data at all. What you're essentially doing with .get_text() is going to the page, highlighting the whole thing, and copying and pasting the entire thing into a csv file.
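To see why that produces one character per row: csv.writer.writerows() expects a sequence of rows, but you handed it one big string. Iterating a string yields individual characters, so every character becomes its own row. A minimal demonstration (the file name is just for illustration):

import csv

b = "abc"  # a plain string, not a list of rows

writer = csv.writer(open('demo.csv', 'w'))
# writerows() iterates its argument; iterating a string
# yields single characters, so each character becomes a row
writer.writerows(b)

# demo.csv now contains:
# a
# b
# c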

So here's what you should be trying to do:

  1. Read the page source
  2. Use soup to break it up into the sections you want
  3. Save the sections for author, date, time, post, etc. in parallel arrays
  4. Write the data to the csv file row by row

I wrote some code to show you what that looks like; it should do the job:

from bs4 import BeautifulSoup
import csv
import urllib2

# get page source and create a BeautifulSoup object based on it
print "Reading page..."
page = urllib2.urlopen("https://silkroad5v7dywlc.onion.to/index.php?action=printpage;topic=28536.0")
soup = BeautifulSoup(page)

# if you look at the HTML all the titles, dates, 
# and authors are stored inside of <dt ...> tags
metaData = soup.find_all("dt")

# likewise the post data is stored
# under <dd ...>
postData = soup.find_all("dd")

# define where we will store info
titles = []
authors = []
times = []
posts = []

# now we iterate through the metaData and parse it
# into titles, authors, and dates
print "Parsing data..."
for html in metaData:
    text = BeautifulSoup(str(html).strip()).get_text().encode("utf-8").replace("\n", "") # convert the html to text
    titles.append(text.split("Title:")[1].split("Post by:")[0].strip()) # get Title:
    authors.append(text.split("Post by:")[1].split(" on ")[0].strip()) # get Post by:
    times.append(text.split(" on ")[1].strip()) # get date

# now we go through the actual post data and extract it
for post in postData:
    posts.append(BeautifulSoup(str(post)).get_text().encode("utf-8").strip())

# now we write data to csv file
# ***in Python 2, csv files MUST be opened with the 'b' flag***
csvfile = open('silkroad.csv', 'wb')
writer = csv.writer(csvfile)

# create template
writer.writerow(["Time", "Author", "Title", "Post"])

# iterate through and write all the data
for time, author, title, post in zip(times, authors, titles, posts):
    writer.writerow([time, author, title, post])


# close file
csvfile.close()

# done
print "Operation completed successfully."

EDIT: Included a solution that can read files from a directory and use the data from those

Okay, so you have your HTML files in a directory. You need to get a list of the files in the directory, iterate through them, and append to your csv file for each file in the directory.

This is the basic logic of our new program.

If we had a function called processData() that takes a file path as an argument and appends the data from the file to your csv file, here's what it would look like:

# the directory where we have all our HTML files
dir = "myDir"

# our csv file
csvFile = "silkroad.csv"

# insert the column titles to csv
csvfile = open(csvFile, 'wb')
writer = csv.writer(csvfile)
writer.writerow(["Time", "Author", "Title", "Post"])
csvfile.close()

# get a list of files in the directory
fileList = os.listdir(dir)

# define variables we need for status text
totalLen = len(fileList)
count = 1

# iterate through files and read all of them into the csv file
for htmlFile in fileList:
    path = os.path.join(dir, htmlFile) # get the file path
    processData(path) # process the data in the file
    print "Processed '" + path + "'(" + str(count) + "/" + str(totalLen) + ")..." # display status
    count = count + 1 # increment counter

As it happens, our processData() function is more or less the same as what we did before, with a few changes.

So this is very similar to our last program, with a few small changes:

  1. We write the column headers first
  2. Then we open the csv with the 'ab' flag to append
  3. We import os to get a list of the files

Here's what that looks like:

from bs4 import BeautifulSoup
import csv
import urllib2
import os # added this import to process files/dirs

# ** define our data processing function
def processData( pageFile ):
    ''' take the data from an html file and append to our csv file '''
    f = open(pageFile, "r")
    page = f.read()
    f.close()
    soup = BeautifulSoup(page)

    # if you look at the HTML all the titles, dates, 
    # and authors are stored inside of <dt ...> tags
    metaData = soup.find_all("dt")

    # likewise the post data is stored
    # under <dd ...>
    postData = soup.find_all("dd")

    # define where we will store info
    titles = []
    authors = []
    times = []
    posts = []

    # now we iterate through the metaData and parse it
    # into titles, authors, and dates
    for html in metaData:
        text = BeautifulSoup(str(html).strip()).get_text().encode("utf-8").replace("\n", "") # convert the html to text
        titles.append(text.split("Title:")[1].split("Post by:")[0].strip()) # get Title:
        authors.append(text.split("Post by:")[1].split(" on ")[0].strip()) # get Post by:
        times.append(text.split(" on ")[1].strip()) # get date

    # now we go through the actual post data and extract it
    for post in postData:
        posts.append(BeautifulSoup(str(post)).get_text().encode("utf-8").strip())

    # now we write data to csv file
    # ***in Python 2, csv files MUST be opened with the 'b' flag***
    csvfile = open('silkroad.csv', 'ab')
    writer = csv.writer(csvfile)

    # iterate through and write all the data
    for time, author, title, post in zip(times, authors, titles, posts):
        writer.writerow([time, author, title, post])

    # close file
    csvfile.close()
# ** start our process of going through files

# the directory where we have all our HTML files
dir = "myDir"

# our csv file
csvFile = "silkroad.csv"

# insert the column titles to csv
csvfile = open(csvFile, 'wb')
writer = csv.writer(csvfile)
writer.writerow(["Time", "Author", "Title", "Post"])
csvfile.close()

# get a list of files in the directory
fileList = os.listdir(dir)

# define variables we need for status text
totalLen = len(fileList)
count = 1

# iterate through files and read all of them into the csv file
for htmlFile in fileList:
    path = os.path.join(dir, htmlFile) # get the file path
    processData(path) # process the data in the file
    print "Processed '" + path + "'(" + str(count) + "/" + str(totalLen) + ")..." # display status
    count = count + 1 # increment counter
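One caveat: os.listdir() returns every entry in the directory, hidden files included, so if the directory holds anything other than the saved forum pages you may want to filter the list first. A simple filter would do (the extensions here are an assumption about how your files are named):

# keep only files that look like saved HTML pages
fileList = [f for f in os.listdir(dir) if f.endswith((".html", ".htm"))]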

Regarding python - BeautifulSoup - scraping a forum page, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21972690/
