python - BeautifulSoup 页码

标签 python pandas beautifulsoup

我正在尝试从 The Wealth of Nations 的在线版本中提取文本并创建一个数据框,其中每个观察结果都是书中的一页。我以一种迂回的方式来做,试图模仿我在 R 中所做的类似的事情,但我想知道是否有一种方法可以直接在 BeautifulSoup 中做到这一点。

我所做的是首先从页面中获取整个文本:

import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
r = requests.get('https://www.gutenberg.org/files/38194/38194-h/38194-h.htm')
soup = BeautifulSoup(r.text,'html.parser')

但从现在开始,我只使用正则表达式和文本。我找到了本书文本的开头和结尾:

beginning = [a.start() for a in re.finditer(r"BOOK I\.",soup.text)]
beginning
end = [a.start() for a in re.finditer(r"FOOTNOTES",soup.text)]
book = soup.text[beginning[1]:end[0]]

然后我删除回车符和换行符并分割成“[Pg digits]”形式的字符串并将所有内容放入 pandas 数据框中。

book = book.replace('\r',' ').replace('\n',' ')
l = re.compile('\[[P|p]g\s?\d{1,3}\]').split(book)
df = pd.DataFrame(l,columns=['col1'])
df['page'] = range(2,df.shape[0]+2)

HTML 代码中有页码指示器,形式为 <span class='pagenum'><a name="Page_vii" id="Page_vii">[Pg vii]</a></span> .有没有一种方法可以通过在这些“跨度”之间搜索文本来在 BeautifulSoup 中进行文本提取?我知道如何使用 findall 搜索页面标记,但我想知道如何提取这些标记之间的文本。

最佳答案

要获取页面标记和与之关联的文本,您可以使用 bs4re。为了匹配两个标记之间的文本,可以使用 itertools.groupby:

from bs4 import BeautifulSoup as soup
import requests
import re
import itertools
page_data = requests.get('https://www.gutenberg.org/files/38194/38194-h/38194-h.htm').text
final_data = [(i.find('a', {'name':re.compile('Page_\w+')}), i.text) for i in soup(page_data, 'html.parser').find_all('p')]
new_data = [list(b) for a, b in itertools.groupby(final_data, key=lambda x:bool(x[0]))][1:]
final_data = {new_data[i][0][0].text:'\n'.join(c for _, c in new_data[i+1]) for i in range(0, len(new_data), 2)}

输出(样本,SO格式实际结果太长):

{'[Pg vi]': "'In recompense for so many mortifying things, which nothing but truth\r\ncould have extorted from me, and which I could easily have multiplied to a\r\ngreater number, I doubt not but you are so good a christian as to return good\r\nfor evil, and to flatter my vanity, by telling me, that all the godly in Scotland\r\nabuse me for my account of John Knox and the reformation.'\nMr. Smith having completed, and given to the world his system of\r\nethics, that subject afterwards occupied but a small part of his lectures.\r\nHis attention was now chiefly directed to the illustration of\r\nthose other branches of science which he taught; and, accordingly, he\r\nseems to have taken up the resolution, even at that early period, of\r\npublishing an investigation into the principles of what he considered\r\nto be the only other branch of Moral Philosophy,—Jurisprudence, the\r\nsubject of which formed the third division of his lectures. At the\r\nconclusion of the Theory of Moral Sentiments, after treating of the\r\nimportance of a system of Natural Jurisprudence, and remarking that\r\nGrotius was the first, and perhaps the only writer, who had given any\r\nthing like a system of those principles which ought to run through,\r\nand be the foundation of the law of nations, Mr. Smith promised, in\r\nanother discourse, to give an account of the general principles of law\r\nand government, and of the different revolutions they have undergone\r\nin the different ages and periods of society, not only in what concerns\r\njustice, but in what concerns police, revenue, and arms, and whatever\r\nelse is the object of law.\nFour years after the publication of this work, and after a residence\r\nof thirteen years in Glasgow, Mr. Smith, in 1763, was induced to relinquish\r\nhis professorship, by an invitation from the Hon. Mr. Townsend,\r\nwho had married the Duchess of Buccleugh, to accompany the\r\nyoung Duke, her son, in his travels. Being indebted for this invitation\r\nto his own talents alone, it must have appeared peculiarly flattering\r\nto him. Such an appointment was, besides, the more acceptable,\r\nas it afforded him a better opportunity of becoming acquainted with\r\nthe internal policy of other states, and of completing that system of\r\npolitical economy, the principles of which he had previously delivered\r\nin his lectures, and which it was then the leading object of his studies\r\nto perfect.\nMr. Smith did not, however, resign his professorship till the day\r\nafter his arrival in Paris, in February 1764. He then addressed the\r\nfollowing letter to the Right Honourable Thomas Millar, lord advocate\r\nof Scotland, and then rector of the college of Glasgow:—", '[Pg vii]': "His lordship having transmitted the above to the professors, a meeting\r\nwas held; on which occasion the following honourable testimony\r\nof the sense they entertained of the worth of their former colleague\r\nwas entered in their minutes:—\n'The meeting accept of Dr. Smith's resignation in terms of the above letter;\r\nand the office of professor of moral philosophy in this university is therefore\r\nhereby declared to be vacant. The university at the same time, cannot\r\nhelp expressing their sincere regret at the removal of Dr. Smith, whose distinguished\r\nprobity and amiable qualities procured him the esteem and affection\r\nof his colleagues; whose uncommon genius, great abilities, and extensive\r\nlearning, did so much honour to this society. His elegant and ingenious\r\nTheory of Moral Sentiments having recommended him to the esteem of men\r\nof taste and literature throughout Europe, his happy talents in illustrating\r\nabstracted subjects, and faithful assiduity in communicating useful knowledge,\r\ndistinguished him as a professor, and at once afforded the greatest pleasure,\r\nand the most important instruction, to the youth under his care.'\nIn the first visit that Mr. Smith and his noble pupil made to Paris,\r\nthey only remained ten or twelve days; after which, they proceeded\r\nto Thoulouse, where, during a residence of eighteen months, Mr. Smith\r\nhad an opportunity of extending his information concerning the internal\r\npolicy of France, by the intimacy in which he lived with some of\r\nthe members of the parliament. After visiting several other places in\r\nthe south of France, and residing two months at Geneva, they returned\r\nabout Christmas to Paris. Here Mr. Smith ranked among his\r\nfriends many of the highest literary characters, among whom were\r\nseveral of the most distinguished of those political philosophers who\r\nwere denominated Economists."}

关于python - BeautifulSoup 页码,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/49348263/

相关文章:

python - Sqlalchemy:如何使用filter_rule过滤多个条件

python - nltk:如何将周围的单词融入上下文中进行词形还原?

python - Python-Suds 联系 Sharepoint 时出现 403 禁止错误

python - 如何组合两个具有不同类型索引的数据帧(一个是DatetimeIndex,另一个是PeriodIndex)?

python - BeautifulSoup 标签的出现顺序

python - 如何处理for循环中跳过的项目

Python正则表达式字符串排除

python - 如何更改 Pandas 数据框中的特定行标签?

python - Column DataFrame Python 中的计算器数组

Python statsmodels lib 弃用警告