python - How to skip the header when processing a .txt file?

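In Allen Downey's Think Python, Exercise 13-2 asks you to process any .txt file from gutenberg.org and skip the header, which ends with a line beginning with "Produced by". This is the solution the author gives: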

import string


def process_file(filename, skip_header):
    """Makes a dict that counts the words in a file.

    filename: string
    skip_header: boolean, whether to skip the Gutenberg header

    returns: map from each word to the number of times it appears
    """
    res = {}

    # The with block closes the file once processing is done.
    with open(filename) as fp:
        if skip_header:
            skip_gutenberg_header(fp)

        for line in fp:
            process_line(line, res)

    return res


def process_line(line, res):
    """Adds the words in line to the counts in res.

    line: string
    res: dict mapping each word to its count so far
    """
    for word in line.split():
        # Normalize case and strip surrounding punctuation before counting.
        word = word.lower().strip(string.punctuation)
        if word.isalpha():
            res[word] = res.get(word, 0) + 1


def skip_gutenberg_header(fp):
    """Reads from fp until it finds the line that ends the header.

    fp: open file object
    """
    for line in fp:
        if line.startswith('Produced by'):
            break

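I really don't understand the flow of execution in this code. Once the code starts reading the file with skip_gutenberg_header(fp), which contains "for line in fp:", it finds the desired line and breaks. However, the next loop picks up right where the break statement left off. But why? The way I see it, there are two independent iterations here, both containing "for line in fp:", so shouldn't the second one start from the beginning?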

Best answer

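No, it shouldn't start from the beginning. An open file object maintains a file position indicator, which moves as you read from (or write to) the file. You can also move the position indicator with the file's .seek method and query it with the .tell method.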
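A minimal sketch of the position indicator at work. It assumes io.StringIO as an in-memory stand-in for a file opened in text mode, and the sample lines are made up for illustration. It uses readline rather than a for loop, because text files in Python 3 do not allow .tell while they are being iterated.

```python
import io

# io.StringIO stands in for a file opened in text mode (an assumption
# made so the example needs no file on disk).
fp = io.StringIO("first line\nsecond line\nthird line\n")

start = fp.tell()        # position before anything has been read
first = fp.readline()    # reading advances the position
after_first = fp.tell()  # the position has moved past the first line

second = fp.readline()   # the next read continues where the last one stopped

fp.seek(start)           # move the position back to where we started
reread = fp.readline()   # the first line is returned again

print(repr(first), repr(second), repr(reread))
```

The value returned by .tell on a text file is best treated as an opaque token that is only meaningful when passed back to .seek.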

So if you break out of a for line in fp: loop, you can use another for line in fp: loop to continue reading from where you left off.
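The two-loop behaviour can be sketched as follows. The helper name skip_header, the sample text, and the use of io.StringIO in place of a real Gutenberg file are all assumptions made for illustration.

```python
import io


def skip_header(fp, marker="Produced by"):
    """Consume lines from fp up to and including the first line starting with marker."""
    for line in fp:
        if line.startswith(marker):
            break


# Made-up sample text standing in for a Gutenberg file.
sample = (
    "Title: Some Book\n"
    "Produced by A. Volunteer\n"
    "Chapter 1\n"
    "It was a dark night.\n"
)

fp = io.StringIO(sample)
skip_header(fp)

# A second loop over the same file object resumes after the header.
body = [line.rstrip("\n") for line in fp]
print(body)  # ['Chapter 1', 'It was a dark night.']

# To really start from the beginning, rewind explicitly.
fp.seek(0)
everything = fp.readlines()
print(len(everything))  # 4
```

If you do want the second loop to start over, either call fp.seek(0) as above or open the file a second time.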

By the way, this behaviour of files isn't specific to Python: all modern languages that have inherited C's concept of streams and files work this way.

The .seek and .tell methods are briefly mentioned in the tutorial.

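For a more in-depth treatment of file/stream handling in Python, see the documentation for the io module. There's a lot of information in that document, and some of it is mainly intended for advanced coders. You'll probably need to read it several times and write a few test programs to absorb what it says, so feel free to skim through it the first time you try to read it... or the first few times. ;)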

Regarding "python - How to skip the header when processing a .txt file?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/45743273/
