python - Reading level script throws "IndexError: string index out of range" on some sites

Tags: python string indexing beautifulsoup range

Total beginner here. The following Python code is meant to analyze the p tags of a website and display the site's reading level.

#import both BS4 and the new URLLIB using the added .request
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

#credit to AbigailB (https://stackoverflow.com/users/1798848/abigailb) 
def syllables(word):
    count = 0
    vowels = 'aeiouy'
    word = word.lower().strip(".:;?!")
    if word[0] in vowels:
        count += 1
    for index in range(1,len(word)):
        if word[index] in vowels and word[index-1] not in vowels:
            count += 1
    if word.endswith('e'):
        count -= 1
    if word.endswith('le'):
        count+=1
    if count == 0:
        count += 1
    return count

#site prompt, to be replaced by active tab browser address
#site = input("Enter the website to find out its reading level:")
#my_url = "{}".format(site)

#default site for testing
my_url = "https://en.wikipedia.org/wiki/Jane_Austen"

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

#empty variables to be pushed w/ extracted, looped text
senNum = []
wordNum = []
syllNum = []

page_soup = soup(page_html, "html.parser")
page_soup.findAll("p")
paragraphs = page_soup.findAll("p")

#loop through every paragraph, do magic
for para in paragraphs:
    para = para.text.strip()
    paraSen = int(len(para.split('.')) - 1)
    paraWord = int(len(para.split()))
    paraSyll = syllables(para)
    intParaSen = int(paraSen)
    intParaWord = int(paraWord)
    intParaSyll = int(paraSyll)

    #append stripped values into empty variables
    senNum.append(intParaSen)
    wordNum.append(intParaWord)
    syllNum.append(intParaSyll)

#sums of all previously empty values
sumSenNum = sum(senNum)
sumWordNum = sum(wordNum)
sumSyllNum = sum(syllNum)

#averages for Flesch–Kincaid ease
avgWordsPerSen = sumWordNum/sumSenNum
avgSyllPerWord = sumSyllNum/sumWordNum

#final parts for Flesch–Kincaid ease
calcOne = avgWordsPerSen * 0.39
calcTwo = avgSyllPerWord * 11.8
finalCalc = calcOne + calcTwo - 15.59

print(finalCalc)

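It relies heavily on a piece of code I found, the def syllables(word) function above (credited in the comment), which returns the number of syllables in a string. It works on some sites, but when I run the code on others, I get the following error: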

Traceback (most recent call last):
  File "C:\Users\Waves\Desktop\gradeLevel.py", line 48, in <module>
    paraSyll = syllables(para)
  File "C:\Users\Waves\Desktop\gradeLevel.py", line 10, in syllables
    if word[0] in vowels:
IndexError: string index out of range

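As I understand it, this probably has something to do with [0] being the first object in the array, and I believe the original author meant to imply "if there is no vowel separator...", but I'm not sure. If you have any unrelated critiques of the code, feel free to share them. Thanks in advance!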

Best answer

Some p elements contain only empty text:

for para in paragraphs:
    print(para)
    # <p class="mw-empty-elt">   </p>
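
To see why such a paragraph triggers the error, here is a minimal sketch that does not touch the network. The sample string and the helper name `first_char_error` are made up for illustration; the point is only that stripping whitespace-only text leaves an empty string, and indexing its first character raises the same IndexError as in the traceback.

```python
# Hypothetical text of an empty <p> element, e.g. <p class="mw-empty-elt">   </p>
raw_text = "   "

# strip() removes the surrounding whitespace and leaves an empty string
stripped = raw_text.strip()


def first_char_error(text):
    """Return the message of the IndexError raised by text[0], or None if indexing works."""
    try:
        text[0]
    except IndexError as exc:
        return str(exc)
    return None


error_message = first_char_error(stripped)
print(repr(stripped), error_message)
```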

Just skip them:

for para in paragraphs:
    para = para.text.strip()
    if not para:
        continue
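
Put together, a self-contained sketch of the guarded loop might look like the following. It uses hard-coded sample strings in place of the scraped paragraphs (an assumption, so the example runs offline) and reuses the question's `syllables` function. It also goes a step beyond the skip shown above: syllables are counted word by word rather than once per paragraph, and `syllables` returns 0 for a token that is empty after punctuation is stripped. The helper name `count_paragraph` is hypothetical.

```python
def syllables(word):
    # Syllable heuristic from the question, plus a guard for empty input
    count = 0
    vowels = 'aeiouy'
    word = word.lower().strip(".:;?!")
    if not word:
        # Nothing left after stripping punctuation, so there is nothing to index
        return 0
    if word[0] in vowels:
        count += 1
    for index in range(1, len(word)):
        if word[index] in vowels and word[index - 1] not in vowels:
            count += 1
    if word.endswith('e'):
        count -= 1
    if word.endswith('le'):
        count += 1
    if count == 0:
        count += 1
    return count


def count_paragraph(text):
    """Return (sentences, words, syllables) for one paragraph, or None if it is empty."""
    text = text.strip()
    if not text:
        return None
    sentences = len(text.split('.')) - 1
    words = text.split()
    syllable_total = sum(syllables(word) for word in words)
    return sentences, len(words), syllable_total


# Sample data standing in for para.text of each scraped <p> element
sample_paragraphs = ["   ", "The cat sat. It ran home."]
results = [count_paragraph(p) for p in sample_paragraphs]
print(results)  # [None, (2, 6, 6)]
```

In the original script, a paragraph for which `count_paragraph` returns None would simply be skipped before appending to senNum, wordNum and syllNum.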

This question, "Reading level script throws 'IndexError: string index out of range' on some sites", originally appeared on Stack Overflow: https://stackoverflow.com/questions/54299366/
