html - Python 3.5 BeautifulSoup4: getting text from 'p' inside a div

Tags: html python-3.x beautifulsoup python-requests

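I am trying to extract all the text from the div with class "caselawcontent searchable-content". This code prints only the HTML, not the text from the web page. What am I missing to get the text?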

The following link is in the file "filteredcasesdoc.txt":
http://caselaw.findlaw.com/mo-court-of-appeals/1021163.html

import requests
from bs4 import BeautifulSoup

with open('filteredcasesdoc.txt', 'r') as openfile1:
    for line in openfile1:
        rulingpage = requests.get(line).text
        soup = BeautifulSoup(rulingpage, 'html.parser')
        doctext = soup.find('div', class_='caselawcontent searchable-content')
        print(doctext)

Best answer

from bs4 import BeautifulSoup
import requests

url = 'http://caselaw.findlaw.com/mo-court-of-appeals/1021163.html'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

I used the dictionary form of .find (attribute: value), which I find more reliable:

whole_section = soup.find('div',{'class':'caselawcontent searchable-content'})


the_title = whole_section.center.h2
#e.g. Missouri Court of Appeals,Southern District,Division Two.
second_title = whole_section.center.h3.p
#e.g. STATE of Missouri, Plaintiff-Appellant v....
number_text = whole_section.center.h3.next_sibling.next_sibling
#e.g.
the_date = number_text.next_sibling.next_sibling
#authors
authors = whole_section.center.next_sibling
para = whole_section.find_all('p')[1:]
# Skip the first match because we don't want the paragraph inside h3 (h3.p).
# We could also use find_all('p', recursive=False), which doesn't pick up nested children.
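To make the comment about recursive=False concrete, here is a minimal offline sketch. The HTML string is hypothetical markup loosely modelled on the structure described above, not the real page, so the actual FindLaw document may behave differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption for illustration, not the real page).
html = """
<div class="caselawcontent searchable-content">
<center><h3><p>STATE v. DOE</p></h3></center>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
section = soup.find("div", {"class": "caselawcontent searchable-content"})

# Option 1: take every <p> in document order and drop the first (the one in h3).
all_but_first = section.find_all("p")[1:]

# Option 2: take only <p> tags that are direct children of the div.
direct_only = section.find_all("p", recursive=False)

texts_a = [p.get_text() for p in all_but_first]
texts_b = [p.get_text() for p in direct_only]
print(texts_a)
print(texts_b)
```

On this sample both options give the same paragraphs. The slice depends on the unwanted paragraph coming first, while recursive=False depends on the body paragraphs being direct children of the div.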

Basically, I dissected the whole section. As for the paragraphs (the main text, the variable para), you have to loop over them.

print(authors)

# You can add .text (e.g. print(authors.text)) to get the text without the tags,
# or use a simple function that returns only the text:
def rettext(something):
    return something.text
# Usage: print(rettext(authors))
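The loop over the paragraphs mentioned above could look roughly like the sketch below. The helper name paragraph_texts and the sample HTML are hypothetical and only for illustration. It uses get_text() with a separator and strip=True, and returns an empty list when the div is not found.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page will contain more elements.
html = """
<div class="caselawcontent searchable-content">
<p>First <b>bold</b> paragraph.</p>
<p>Second paragraph.</p>
</div>
"""

def paragraph_texts(markup):
    """Return the text of each <p> inside the target div, without tags."""
    soup = BeautifulSoup(markup, "html.parser")
    section = soup.find("div", {"class": "caselawcontent searchable-content"})
    if section is None:
        # soup.find returns None when nothing matches.
        return []
    # Join the text pieces of each paragraph with a space and trim whitespace.
    return [p.get_text(" ", strip=True) for p in section.find_all("p")]

texts = paragraph_texts(html)
for text in texts:
    print(text)
```

With a live page, the same function could be given requests.get(url).text, assuming the page still uses that class name.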

Regarding "html - Python 3.5 BeautifulSoup4: getting text from 'p' inside a div", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43991012/
