python - How do I parse the text of each chapter in an epub?

Tags: python parsing lxml epub

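I am trying to parse the contents of books in epub format and convert them into my own structure, but I am having trouble detecting and extracting all of the text between one chapter heading and the next. How can I do this?

These are the two epub files I want it to work on, and eventually it should work on other files too: http://www.gutenberg.org/ebooks/11.epub.noimages?session_id=f5b366deca86ee5e978d79f53f4fcaf1e0ac32ca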

http://www.gutenberg.org/ebooks/98.epub.noimages?session_id=f5b366deca86ee5e978d79f53f4fcaf1e0ac32ca

I can get the title of each chapter into a dictionary, like this:

{'ALICE’S ADVENTURES IN WONDERLAND': [], 'THE MILLENNIUM FULCRUM EDITION 3.0': [], 'Contents': [], 'CHAPTER I. Down the Rabbit-Hole': [], 'CHAPTER II. The Pool of Tears': [], 'CHAPTER III. A Caucus-Race and a Long Tale': [], 'CHAPTER IV. The Rabbit Sends in a Little Bill': [], 'CHAPTER V. Advice from a Caterpillar': [], 'CHAPTER VI. Pig and Pepper': [], 'CHAPTER VII. A Mad Tea-Party': [], 'CHAPTER VIII. The Queen’s Croquet-Ground': [], 'CHAPTER IX. The Mock Turtle’s Story': [], 'CHAPTER X. The Lobster Quadrille': [], 'CHAPTER XI. Who Stole the Tarts?': [], 'CHAPTER XII. Alice’s Evidence': []}

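I want to put the text between one chapter heading and the next into those lists, but I am having a lot of trouble.

Here is how I get the chapters: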

import sys
import lxml
import ebooklib
from ebooklib import epub
from ebooklib.utils import debug
from lxml import etree
from io import StringIO, BytesIO
import csv, json

bookJSON = {}
chapterNav = {}
chapterTitle = {}
chapterCont = {}
def parseNAV(xml):
    """
    Parse the xml
    """

    root = etree.fromstring(xml)

    for appt in root.getchildren():
        for elem in appt.getchildren():
            #print(elem.tag)
            for child in elem.getchildren():
                #print(child.tag)
                if("content" in child.tag):
                    srcTag = child.get("src")
                    #print(child.tag + " src: " + srcTag)
                    contentList = srcTag.split("#")
                    #print(contentList[1])
                    chapterNav[contentList[1]] = text
                    chapterTitle[text.strip()] = []
                    chapterCont[text.strip()] = []
                for node in child.getchildren():
                    if not node.text:
                        text = "None"
                    else:
                        text = node.text
                    #print(node.tag + " => " + text)
            #print(elem.tag + " CLOSED"  + "\n")

def parseContent(xml):
    """
    Parse the xml
    """

    root = etree.fromstring(xml)
    chaptText = []
    chapter= ''
    for appt in root.getchildren():
        for elem in appt.getchildren():
            if(elem.text != None and stringify_children(elem) != None):
                if("h2" in elem.tag):
                    print(stringify_children(elem))
                if (elem.text).strip() in chapterTitle.keys():
                    chapterCont[elem.text.strip()] = chaptText
                    chaptText = []
                else:
                    chaptText.append(stringify_children(elem))
def stringify_children(node):
    return (''.join(node.itertext()).strip()).replace("H2 anchor","")

book = epub.read_epub(sys.argv[1])

# debug(book.metadata)

def getData(id,book,bookJSON):
    data = list(book.get_metadata('DC', id))
    if(len(data) != 0):
        bookJSON[id] = []
        for x in data:
            dataTuple = x
            bookJSON[id].append(str(dataTuple[0]))
        return bookJSON
    return bookJSON


bookJSON =  getData('title',book,bookJSON)
bookJSON = getData('creator',book,bookJSON)
bookJSON = getData('identifier',book,bookJSON)
bookJSON = getData('description',book,bookJSON)
bookJSON = getData('language',book,bookJSON)
bookJSON = getData('subject',book,bookJSON)
nav = list(book.get_items_of_type(ebooklib.ITEM_NAVIGATION))
navXml = etree.XML(nav[0].get_content())
#print(nav[0].get_content().decode("utf-8"))


parseNAV(etree.tostring(navXml))
print(bookJSON)

bookContent = list(book.get_items_of_type(ebooklib.ITEM_DOCUMENT))
for cont in bookContent:
    contentXml = etree.XML(cont.get_content())

    parseContent(etree.tostring(contentXml))
# print(chapterCont)
# print(chapterNav)
# print(chapterTitle)

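parseContent is the function I am trying to use. Right now it works for the first few chapters and then fails badly. I just want to be able to put all of the text of each chapter into its corresponding list. Thank you very much. I will keep working on it, and any help or advice would be greatly appreciated.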

Best answer

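I found a solution: I built an index from the chapter titles, recording the position where each chapter begins, and saved it in a tuple. I then used that tuple to iterate over the content and append everything to the corresponding chapter. Hopefully this helps the next person who wants to parse an epub. If anyone has a better suggestion, please let me know. There is not much information online about parsing epubs.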
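
The answer does not include its code, so the following is only a minimal sketch of the approach as described, not the answerer's actual implementation. It assumes the book has already been flattened into `blocks`, a list of text strings in reading order (for example, the result of calling `stringify_children` on each heading and paragraph element of every document item), and that `titles` holds the chapter titles collected from the navigation file. The names `index_chapters`, `split_by_chapter`, and the placeholder paragraph strings are hypothetical.

```python
def index_chapters(blocks, titles):
    """Return (position, title) tuples for every block that is a chapter title."""
    wanted = set(titles)
    return [(i, b.strip()) for i, b in enumerate(blocks) if b.strip() in wanted]


def split_by_chapter(blocks, titles):
    """Map each title to the text blocks between it and the next title."""
    index = index_chapters(blocks, titles)
    chapters = {title: [] for title in titles}
    # Pair each chapter start with the start of the following chapter;
    # the last chapter runs to the end of the block list.
    ends = [pos for pos, _ in index[1:]] + [len(blocks)]
    for (start, title), end in zip(index, ends):
        # extend (not assign) so a title that appears twice, e.g. once in a
        # contents page and once as the real heading, keeps all of its text.
        chapters[title].extend(
            b.strip() for b in blocks[start + 1:end] if b.strip()
        )
    return chapters


# Placeholder data: the titles come from the question, the paragraphs are made up.
titles = ["CHAPTER I. Down the Rabbit-Hole", "CHAPTER II. The Pool of Tears"]
blocks = [
    "CHAPTER I. Down the Rabbit-Hole",
    "first paragraph of chapter one",
    "second paragraph of chapter one",
    "CHAPTER II. The Pool of Tears",
    "first paragraph of chapter two",
]
chapters = split_by_chapter(blocks, titles)
```

Because the index is built over the whole flattened list rather than one file at a time, a chapter that spans several document items still ends up in a single list. Any text that comes before the first matched title is dropped in this sketch.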

This page is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/56410564/