I am extracting text with the WordExtractor class (Apache POI), but some .doc files raise an error. After some debugging, I found that the failing line is the last one here:
HWPFDocument docx = new HWPFDocument(new FileInputStream(file));
WordExtractor we = new WordExtractor(docx);
String T = we.getText().replaceAll("\\n", " ").replaceAll("\\r", " ");
It works fine for most .docx and .doc files.
The error message is:
Exception in thread "main" java.lang.RuntimeException:
java.lang.IllegalArgumentException: The end (4958) must not be before the start (4990)
How can I fix it?
Best answer
From the docs for XWPFWordExtractor:
Helper class to extract text from an OOXML Word file
So that is your problem :) And the solution, from their docs:
For .doc files from Word 97 - Word 2003, in scratchpad there is org.apache.poi.hwpf.extractor.WordExtractor, which will return text for your document.
Those using POI 3.7 can also extract simple textual content from older Word 6 and Word 95 files, using the scratchpad class org.apache.poi.hwpf.extractor.Word6Extractor.
For .docx files, the relevant class is org.apache.poi.xwpf.extractor.XWPFWordExtractor
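To make the split between the two extractors concrete, here is a minimal sketch that uses only the JDK to guess which container a file uses before handing it to POI. The class WordFormatSniffer and its methods are hypothetical names made up for this illustration; they are not part of Apache POI. The check relies on the well-known file signatures (OLE2 compound files start with D0 CF 11 E0 A1 B1 1A E1, ZIP-based OOXML files start with "PK" 03 04). It only identifies the container, so other Office formats sharing the same container will match too.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for illustration only; not an Apache POI class.
final class WordFormatSniffer {

    enum WordFormat { OLE2, OOXML, UNKNOWN }

    // Signature of an OLE2 compound document (.doc, but also .xls, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Signature of a ZIP local file header (.docx, but also any other ZIP).
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    // Classifies a file by its leading bytes.
    static WordFormat detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return WordFormat.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return WordFormat.OOXML;
        }
        return WordFormat.UNKNOWN;
    }

    // Reads up to 8 bytes from the file and classifies them.
    static WordFormat detect(Path file) throws IOException {
        byte[] header = new byte[OLE2_MAGIC.length];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (filled < header.length) {
                int n = in.read(header, filled, header.length - filled);
                if (n < 0) {
                    break; // file is shorter than the signature
                }
                filled += n;
            }
        }
        return detect(Arrays.copyOf(header, filled));
    }

    // Names the POI extractor class that the quoted docs recommend for each format.
    static String extractorFor(WordFormat format) {
        switch (format) {
            case OLE2:
                return "org.apache.poi.hwpf.extractor.WordExtractor";
            case OOXML:
                return "org.apache.poi.xwpf.extractor.XWPFWordExtractor";
            default:
                return "none";
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java WordFormatSniffer <file>");
            return;
        }
        WordFormat format = detect(Paths.get(args[0]));
        System.out.println(format + " -> " + extractorFor(format));
    }
}
```

Checking the content rather than the file extension also covers files that were renamed, for example a .docx saved with a .doc suffix.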
Regarding java - getText().replaceAll() error in Java, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/41488362/