python - Extract related links and store them as a .csv file

import urllib2
from datetime import datetime
from bs4 import BeautifulSoup


page1 = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)

events = soup.find('span', id='20th_century').parent.find_next_sibling('ul')
for event in events.find_all('li'):
    try:
        date_string, rest = event.text.split(':', 1)
        print datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y')
    except ValueError:
        print event.text

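Using the approach above, I can extract the dates from the <li> tags. I would also like to extract the citation links. The problem is that each <li> tag contains many links. Although the citations are marked with a "cite" class, I still cannot get the complete link. My end goal is to store the results as a table in which each row contains a date and a citation link (.csv format). This question follows on from: Web crawler to extract from list elements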

Best answer

You can use the following as a starting point. It creates a csv file whose rows have the following format:

date,link

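If the date component of a list item cannot be extracted, that item is skipped. For now, as an example, it only processes the "20th century" section: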

import csv
import urllib2
from datetime import datetime
from urlparse import urljoin
from bs4 import BeautifulSoup

base_url = 'http://en.wikipedia.org'
page = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page)

# build a list of references
references = {}
for item in soup.select('ol.references li[id]'):
    links = [a['href'] if a['href'].startswith('http') else urljoin(base_url, a['href'])
             for a in item.select('span.reference-text a[href]')]
    references[item['id']] = links


events = soup.find('span', id='20th_century').parent.find_next_siblings()
with open('output.csv', 'wb') as f:
    writer = csv.writer(f)
    for tag in events:
        if tag.name == 'h2':
            break

        for event in tag.find_all('li'):
            # extract text
            try:
                date_string, _ = event.text.split(':', 1)
                date = datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y')
            except ValueError:
                continue

            # extract links and write data
            # the filter receives None for <a> tags without an href, so guard against it
            links = event.find_all('a', href=lambda x: x and x.startswith('#cite_note-'))
            if links:
                for link in links:
                    # .get() avoids a KeyError if a citation id is missing from the reference list
                    for ref in references.get(link['href'][1:], []):
                        writer.writerow([date, ref])
            else:
                writer.writerow([date, ''])

Contents of output.csv after running the script:

19/09/1902,
30/12/1903,
11/01/1908,
24/12/1913,
23/10/1942,http://www.ferrovieinrete.com/doc_storici/GalleriaGrazie.pdf
09/03/1946,
01/01/1956,
02/01/1971,
...
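
The date handling and row layout described above can also be tried in isolation, without fetching the page. The following is only a minimal sketch, written for Python 3 (so it uses urllib.parse instead of urlparse) and limited to the standard library. The helper names format_event_date and rows_for_event and the sample items are hypothetical and are not part of the original answer. It also relies on urljoin leaving absolute URLs unchanged, so the startswith('http') check is not needed here.

```python
import csv
import io
from datetime import datetime
from urllib.parse import urljoin

BASE_URL = 'http://en.wikipedia.org'


def format_event_date(text):
    """Return the date before the first ':' as dd/mm/YYYY, or None if it does not parse."""
    date_string = text.split(':', 1)[0].strip()
    try:
        return datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y')
    except ValueError:
        return None


def rows_for_event(text, hrefs):
    """Build (date, link) rows for one list item; hrefs are its raw reference links."""
    date = format_event_date(text)
    if date is None:
        return []                # unparseable date: skip the item
    if not hrefs:
        return [(date, '')]      # keep the date even when there is no reference
    # urljoin resolves relative links and leaves absolute ones unchanged
    return [(date, urljoin(BASE_URL, href)) for href in hrefs]


# Hypothetical sample items, not scraped from the live page
samples = [
    ('October 23, 1942: An example event.', ['http://example.com/report.pdf']),
    ('September 19, 1902: Another example event.', []),
    ('1913: Year only, so the date format does not match.', ['/wiki/Example']),
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
for text, hrefs in samples:
    writer.writerows(rows_for_event(text, hrefs))
print(buffer.getvalue())
```

With these samples, the third item is dropped because "1913" does not match the '%B %d, %Y' format, which mirrors the skip behaviour of the script. When writing to a real file in Python 3, open it with open('output.csv', 'w', newline='') rather than 'wb'.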

A similar question on this topic can be found on Stack Overflow: https://stackoverflow.com/questions/27885171/
