java - Tika parsing fails with a "write limit reached" error

Tags: java apache-tika

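I am using Apache Tika to extract the content of a PDF file. When I run it, I get the error below. I could not find this error documented anywhere, so it came as an unpleasant surprise.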

org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
    at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
    at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
    at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
    at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
    at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:279)
    at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:306)
    at org.apache.tika.parser.pdf.PDF2XHTML.writeWordSeparator(PDF2XHTML.java:318)
    at org.apache.pdfbox.text.PDFTextStripper.writeLine(PDFTextStripper.java:1741)
    at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:672)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:141)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:150)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)

I would like to know how to get rid of this error so that the file can be parsed again, or how to remove the limit altogether.

Best answer

You can set the limit with writeLimit, or even disable it entirely, by using the following constructor:

public BodyContentHandler(int writeLimit)

The docs describe the parameter as follows:

writeLimit - maximum number of characters to include in the string, or -1 to disable the write limit
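As a minimal sketch of how that constructor might be used, the example below passes -1 to BodyContentHandler and hands the handler to AutoDetectParser, the parser that appears in the stack trace. The class name UnlimitedExtract and the helper extractText are made up for illustration, and the code assumes the Tika parser libraries are on the classpath. Exact behavior may vary between Tika versions.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical helper class, not part of Tika.
public class UnlimitedExtract {

    // Parses the stream and returns the body text with the write limit disabled.
    static String extractText(InputStream in) throws Exception {
        // -1 disables the write limit (the default is 100000 characters).
        BodyContentHandler handler = new BodyContentHandler(-1);
        AutoDetectParser parser = new AutoDetectParser();
        parser.parse(in, handler, new Metadata(), new ParseContext());
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is assumed to be the path of the document to parse.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String text = extractText(in);
            System.out.println("Extracted " + text.length() + " characters");
        }
    }
}
```

With the limit disabled, the whole extracted text is held in memory, so for very large documents a larger positive limit (for example new BodyContentHandler(10 * 1024 * 1024)) may be the safer choice.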

A similar question about this Tika write limit error can be found on Stack Overflow: https://stackoverflow.com/questions/42392145/
