java - How to get headers and footers from a PDF file using Apache Tika in Java

Tags: java pdfbox apache-tika

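I am using Apache Tika to extract content from a PDF file. The extracted text also contains the page headers and footers. My requirement is to get the text without the headers and footers. Below is the sample code I use to extract the content.

Sample code: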

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class test {

    public static void main(String[] args) throws Exception {

        String file = "C://Sample.pdf";
        File file1 = new File(file);
        Metadata metadata = new Metadata();
        // Raise the handler's write limit to 10 * 1024 * 1024 characters
        BodyContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);
        AutoDetectParser parser = new AutoDetectParser();

        // try-with-resources closes the input stream even if parsing fails
        try (InputStream input = new FileInputStream(file1)) {
            parser.parse(input, handler, metadata);
        }

        // The extracted plain text (headers and footers included)
        String content = handler.toString();

        String path = "C://AUG7th".concat("/").concat(file1.getName())
                .concat(".txt");
        File file2 = new File(path);

        // Write the text next to the other outputs and close the writer
        try (BufferedWriter bw = new BufferedWriter(
                new FileWriter(file2.getAbsoluteFile()))) {
            bw.write(content);
        }
    }

}

Please suggest how I can do this. Thanks.

Best answer

I have not found a way to parse the headers or footers of a PDF using Tika. You need another API to do this, for example PDFTextStream.

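Edit: OK. Tika will (try to) extract the raw text and the metadata from the PDF. You need to parse and analyze that raw text yourself in order to remove the headers and footers.

I suggest using PDFTextStream instead of Tika, because it simplifies the task of implementing an algorithm for this purpose. When you parse a PDF with PDFTextStream, you extract TextUnits, which are not simple characters but also "carry" other information. You can also select text regions and, in addition, choose to preserve the visual layout of each page.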
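As an illustration of what "parse and analyze the raw text" could mean, here is a minimal sketch of one common heuristic: a line that shows up at the top (or bottom) of most pages is probably a running header (or footer). This is not something Tika or PDFTextStream provides. The class name `HeaderFooterFilter`, the `window` and `minShare` parameters, and the assumption that the text has already been split into pages and lines are all hypothetical choices made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: drops lines that repeat at the page edges.
class HeaderFooterFilter {

    // Replace digit runs so that "Page 3" and "Page 12" compare as equal.
    static String normalize(String line) {
        return line.trim().replaceAll("\\d+", "#");
    }

    // Count how often each normalized line occurs within `window` lines
    // of the top (or bottom) edge, summed over all pages.
    static Map<String, Integer> countEdgeLines(List<List<String>> pages,
                                               int window, boolean top) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> page : pages) {
            int n = Math.min(window, page.size());
            for (int i = 0; i < n; i++) {
                String line = top ? page.get(i) : page.get(page.size() - 1 - i);
                counts.merge(normalize(line), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Remove edge lines whose normalized form occurs on at least
    // `minShare` of the pages (and on at least two pages).
    static List<List<String>> strip(List<List<String>> pages,
                                    int window, double minShare) {
        Map<String, Integer> topCounts = countEdgeLines(pages, window, true);
        Map<String, Integer> bottomCounts = countEdgeLines(pages, window, false);
        int needed = Math.max(2, (int) Math.ceil(pages.size() * minShare));

        List<List<String>> result = new ArrayList<>();
        for (List<String> page : pages) {
            List<String> kept = new ArrayList<>();
            for (int i = 0; i < page.size(); i++) {
                String key = normalize(page.get(i));
                boolean header = i < window
                        && topCounts.getOrDefault(key, 0) >= needed;
                boolean footer = i >= page.size() - window
                        && bottomCounts.getOrDefault(key, 0) >= needed;
                if (!header && !footer) {
                    kept.add(page.get(i));
                }
            }
            result.add(kept);
        }
        return result;
    }
}
```

A call such as `HeaderFooterFilter.strip(pages, 1, 0.6)` would inspect only the first and last line of every page. The heuristic is crude: it can remove genuine body text that happens to repeat at a page edge, and it misses headers that alternate between odd and even pages unless `minShare` is lowered.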

@Gagravarr Here is the XHTML output for the PDF:

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="dcterms:modified" content="2012-11-21T16:08:42Z"/>
<meta name="meta:creation-date" content="2010-06-22T07:00:09Z"/>
<meta name="meta:save-date" content="2012-11-21T16:08:42Z"/>
<meta name="Content-Length" content="702419"/>
<meta name="Last-Modified" content="2012-11-21T16:08:42Z"/>
<meta name="dcterms:created" content="2010-06-22T07:00:09Z"/>
<meta name="date" content="2012-11-21T16:08:42Z"/>
<meta name="modified" content="2012-11-21T16:08:42Z"/>
<meta name="xmpTPg:NPages" content="20"/>
<meta name="Creation-Date" content="2010-06-22T07:00:09Z"/>
<meta name="created" content="Tue Jun 22 09:00:09 CEST 2010"/>
<meta name="producer" content="Atypon Systems, Inc."/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="xmp:CreatorTool" content="PDFplus"/>
<meta name="resourceName" content="Lessons from a High-Impact Observatory The Hubble Space Telescope.pdf"/>
<meta name="Last-Save-Date" content="2012-11-21T16:08:42Z"/>
<meta name="dc:title" content="Lessons from a High-Impact Observatory: The &lt;italic&gt;Hubble Space Telescopes&lt;/italic&gt; Science Productivity between 1998 and 2008"/>
<title>Lessons from a High-Impact Observatory: The &lt;italic&gt;Hubble Space Telescopes&lt;/italic&gt; Science Productivity between 1998 and 2008</title>
</head>
<body><div class="page"><p/>
<p>Lessons from a High-Impact Observatory: The Hubble Space Telescope’s Science Productivity
between 1998 and 2008
Author(s): Dániel Apai, Jill Lagerstrom, Iain Neill Reid, Karen L. Levay, Elizabeth Fraser,
Antonella Nota, and Edwin Henneken
Reviewed work(s):
Source: Publications of the Astronomical Society of the Pacific, Vol. 122, No. 893 (July 2010),
pp. 808-826
Published by: The University of Chicago Press on behalf of the Astronomical Society of the Pacific
Stable URL: http://www.jstor.org/stable/10.1086/654851 .
Accessed: 21/11/2012 11:08
</p>
<p>Your use of the JSTOR archive indicates your acceptance of the Terms &amp; Conditions of Use, available at .
http://www.jstor.org/page/info/about/policies/terms.jsp
</p>
<p> .
</p>
<p>JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of
content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms
of scholarship. For more information about JSTOR, please contact support@jstor.org.
</p>................</body>

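In the head, Tika gives us the metadata it found. In the body, it gives us the text split into paragraphs (which also looks a bit clumsy), and it can also give us the annotation links. So I don't think it helps much.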
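One thing the XHTML does preserve is the page boundary: in the output above each page starts with `<div class="page">`. The following is a minimal sketch, assuming that structure holds for the whole document, of a SAX handler that collects the text of each page separately. `PageTextHandler` is a hypothetical name; the demo method feeds it an XHTML string through the JDK's SAX parser. Because it is an ordinary SAX `ContentHandler`, it could presumably be passed to `parser.parse(input, handler, metadata)` in place of `BodyContentHandler`, but that has not been verified here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: gathers the character data of each <div class="page">.
class PageTextHandler extends DefaultHandler {

    private final List<String> pages = new ArrayList<>();
    private StringBuilder current;  // null while outside a page div
    private int depth;              // div nesting depth inside the current page

    // Works for both namespace-aware and non-namespace-aware event sources.
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if (!"div".equals(name(localName, qName))) {
            return;
        }
        if (current == null && "page".equals(atts.getValue("class"))) {
            current = new StringBuilder();  // a new page begins
            depth = 1;
        } else if (current != null) {
            depth++;                        // nested div inside a page
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && "div".equals(name(localName, qName))
                && --depth == 0) {
            pages.add(current.toString());  // the page div is closed
            current = null;
        }
    }

    List<String> getPages() {
        return pages;
    }

    // Demo entry point: parse an XHTML string with the JDK SAX parser.
    static List<String> parse(String xhtml) throws Exception {
        PageTextHandler handler = new PageTextHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)),
                handler);
        return handler.getPages();
    }
}
```

With one string per page, each page can be split into lines and its first and last lines compared across pages, which is the kind of analysis needed to spot repeated headers and footers.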

This question, "java - How to get headers and footers from a PDF file using Apache Tika in Java", comes from Stack Overflow: https://stackoverflow.com/questions/18186476/
