java - How do I list all embedded files in a Microsoft Office document using Apache POI?

Tags: java apache-poi text-extraction embedded-object

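Is there a way to list all embedded objects (doc, ..., txt) in an Office file (doc, docx, xls, xlsx, ppt, pptx, ...)?

I am using the Apache POI library (Java) to extract text from Office files. I don't need to extract all the text from the embedded objects; a log file containing the file names of all embedded documents would be enough (for example: String objectFileNames = getEmbeddedFileNames(fileInputStream)).

Example: I have a Word document "test.doc" that contains another file named "excel.xls". I want to write the file name of excel.xls (in this case) to a log file.

I tried some of the sample code from the Apache home page ( https://poi.apache.org/text-extraction.html ). But my code always returns the same output ("Footer text: Header text").

Here is what I tried: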

// Imports as of POI 3.x, where the extractor classes live in these packages
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import org.apache.poi.POIOLE2TextExtractor;
import org.apache.poi.POITextExtractor;
import org.apache.poi.extractor.ExtractorFactory;
import org.apache.poi.hdgf.extractor.VisioTextExtractor;
import org.apache.poi.hslf.extractor.PowerPointExtractor;
import org.apache.poi.hssf.extractor.ExcelExtractor;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

private static void test(String inputfile, String outputfile) throws Exception {

    // Collected text blocks; a list avoids overflowing a fixed-size array
    List<String> extractedText = new ArrayList<>();

    System.out.println("Embedded search started. Input file: " + inputfile);

    // try-with-resources closes both streams even if extraction fails
    try (InputStream fis = new FileInputStream(inputfile);
         PrintWriter out = new PrintWriter(outputfile)) { // results go to a text file

        // Based on the Apache sample code
        POIFSFileSystem emb_fileSystem = new POIFSFileSystem(fis);
        // First, get an extractor for the container document
        POIOLE2TextExtractor oleTextExtractor =
            ExtractorFactory.createExtractor(emb_fileSystem);
        // Then an array of extractors for any Excel, Word, PowerPoint
        // or Visio objects embedded in it
        POITextExtractor[] embeddedExtractors =
            ExtractorFactory.getEmbededDocsTextExtractors(oleTextExtractor);

        for (POITextExtractor textExtractor : embeddedExtractors) {
            // The embedded object is an Excel spreadsheet
            if (textExtractor instanceof ExcelExtractor) {
                ExcelExtractor excelExtractor = (ExcelExtractor) textExtractor;
                extractedText.add(excelExtractor.getText());
            }
            // A Word document
            else if (textExtractor instanceof WordExtractor) {
                WordExtractor wordExtractor = (WordExtractor) textExtractor;
                // Keep every paragraph instead of overwriting with the last one
                extractedText.add(String.join("", wordExtractor.getParagraphText()));
                // Display the document's header and footer text
                System.out.println("Footer text: " + wordExtractor.getFooterText());
                System.out.println("Header text: " + wordExtractor.getHeaderText());
            }
            // A PowerPoint presentation
            else if (textExtractor instanceof PowerPointExtractor) {
                PowerPointExtractor powerPointExtractor =
                    (PowerPointExtractor) textExtractor;
                extractedText.add(powerPointExtractor.getText());
                extractedText.add(powerPointExtractor.getNotes());
            }
            // A Visio drawing
            else if (textExtractor instanceof VisioTextExtractor) {
                VisioTextExtractor visioTextExtractor =
                    (VisioTextExtractor) textExtractor;
                extractedText.add(visioTextExtractor.getText());
            }
        }

        // Write the results to the console and to the text file
        for (String text : extractedText) {
            if (text != null) {
                System.out.println(text);
                out.println(text);
            }
        }
    }
}

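The input file is an xls that contains a doc file as an embedded object; the output file is a txt.

Thanks to anyone who can help.

Best answer

I don't think embedded OLE objects retain their original file names, so I don't think what you want is actually possible.

I believe that what Microsoft writes about embedded images also applies to OLE objects: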

You might notice that the file name of the image file has been changed from Eagle1.gif to image1.gif. This is done to address privacy concerns, in that a malicious person could derive a competitive advantage from the name of parts in a document, such as an image file. For example, an author might choose to protect the contents of a document by encrypting the textual part of the document file. However, if two images are inserted named old_widget.gif and new_reenforced_widget.gif, even though the text is protected, a malicious person could learn the fact that the widget is being upgraded. Using generic image file names such as image1 and image2 adds another layer of protection to Office Open XML Formats files.

However, you can try the following (for Word 2007 files, a.k.a. XWPFDocument, a.k.a. ".docx"; other MS Office files work similarly):

import java.io.FileInputStream;
import java.util.List;
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.poi.openxml4j.opc.PackagePart;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

// Open the document and list its embedded parts
try (FileInputStream fis = new FileInputStream("mydoc.docx")) {
    XWPFDocument document = new XWPFDocument(fis);
    listEmbeds(document);
}


// Prints the part name and content type of every embedded part
private static void listEmbeds(XWPFDocument doc) throws OpenXML4JException {
    List<PackagePart> embeddedDocs = doc.getAllEmbedds();
    if (embeddedDocs != null) {
        for (PackagePart pPart : embeddedDocs) {
            System.out.println(pPart.getPartName() + ", " + pPart.getContentType());
        }
    }
}

pPart.getPartName() is the closest thing to the embedded file's name that I could find.
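
To see what those part names look like without going through POI at all, here is a minimal sketch that relies only on the fact that .docx, .xlsx and .pptx files are ZIP packages. It assumes that embedded objects are stored in an "embeddings" folder inside the package (for example word/embeddings/), which is where Office normally puts them. The class name EmbeddedPartLister and the method listEmbeddedParts are hypothetical names chosen for this illustration, and the approach does not apply to the binary formats (doc, xls, ppt).

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class EmbeddedPartLister {

    /**
     * Returns the names of all ZIP entries stored in an "embeddings" folder.
     * Assumption: the file is an OOXML package (docx, xlsx, pptx).
     */
    static List<String> listEmbeddedParts(Path ooxmlFile) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(ooxmlFile.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                // Keep only real entries located below .../embeddings/
                if (!entry.isDirectory() && entry.getName().contains("/embeddings/")) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java EmbeddedPartLister <file.docx|xlsx|pptx>");
            return;
        }
        for (String name : listEmbeddedParts(Paths.get(args[0]))) {
            System.out.println(name);
        }
    }
}
```

The names printed this way are the generic ones Office assigned when the object was embedded, not the original file names, which matches the limitation described above.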

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/36131787/
