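java - How to determine the file extension from fileBytes

My application lets users download files. When building the headers, I use Tika to set the extension, as shown below. This works well for PDF files but fails for DOC and EXCEL files.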

private HttpHeaders getHeaderData(byte[] fileBytes) throws IOException, MimeTypeException {
    final HttpHeaders headers = new HttpHeaders();

    TikaInputStream tikaStream = TikaInputStream.get(fileBytes);
    Tika tika = new Tika();
    String mimeType = tika.detect(tikaStream);
    headers.setContentType(MediaType.valueOf(mimeType));

    MimeTypes defaultMimeTypes = MimeTypes.getDefaultMimeTypes();
    String extension = defaultMimeTypes.forName(mimeType).getExtension();
    headers.add("file-ext", extension);

    return headers;
}

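I found that the mimeType for PDF files resolves to "application/pdf", but for Excel and Word files it resolves to "application/x-tika-ooxml", and that is the problem. If all I have is the file as bytes, how can I identify the Word (.docx) and Excel (.xls, .xlsx) formats?

Why does this work for PDF?

Best answer

Summary

The short answer: you have to use Tika's detectors and its MediaType class, rather than MimeTypes.

The slightly longer answer: even that does not fully solve the problem, because older MS-Office files are structured differently. For those, you also have to parse the file and inspect its metadata.

The term "media type" has replaced the term "MIME type". From the IANA media types registry: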

[RFC2046] specifies that Media Types (formerly known as MIME types) and Media Subtypes will be assigned and listed by the IANA.

Office 97-2003

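When Tika uses its detector to examine Excel and Word 97-2003 files, it returns the media type application/x-tika-msoffice. I assume (possibly wrongly) that this is how it handles a group of file types when the detector cannot determine the specific flavor of MS-Office 97-2003 file from its analysis. This is similar to the application/x-tika-ooxml in your question.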
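One way to see why PDF is unambiguous while the Office formats come back as a group type is to look at the leading bytes of each file. The sketch below is a simplified illustration, not how Tika's detectors are implemented; the class MagicSniffer, the enum Kind and the method sniff are hypothetical names. It relies only on the well-known file signatures: PDF files start with "%PDF-", Office 97-2003 files share the OLE2 compound-file header, and .docx/.xlsx files are ZIP archives.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: classifies a file by its leading "magic" bytes only. */
class MagicSniffer {

    enum Kind { PDF, OLE2_CONTAINER, ZIP_CONTAINER, UNKNOWN }

    // PDF files begin with the ASCII text "%PDF-".
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // OLE2 compound-file signature, shared by .doc, .xls and other Office 97-2003 files.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local-file-header signature, shared by .docx, .xlsx and every other ZIP archive.
    private static final byte[] ZIP = { 'P', 'K', 0x03, 0x04 };

    /** Returns true if data begins with the given prefix. */
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Classifies the bytes by signature; containers need a deeper look to tell Word from Excel. */
    static Kind sniff(byte[] data) {
        if (startsWith(data, PDF)) {
            return Kind.PDF;
        }
        if (startsWith(data, OLE2)) {
            return Kind.OLE2_CONTAINER;
        }
        if (startsWith(data, ZIP)) {
            return Kind.ZIP_CONTAINER;
        }
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        byte[] pdfHeader = "%PDF-1.7".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(pdfHeader));
        System.out.println(sniff(OLE2));
        System.out.println(sniff(ZIP));
    }
}
```

A .doc and an .xls file both land in OLE2_CONTAINER here, and a .docx and an .xlsx both land in ZIP_CONTAINER, so the first few bytes alone cannot separate Word from Excel. Telling them apart requires looking inside the container.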

Expected results

Based on the IANA list and the Mozilla list, these are the media types we expect for the following file types:

.pdf::application/pdf
.xls::application/vnd.ms-excel
.doc::application/msword
.xlsx::application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
.docx::application/vnd.openxmlformats-officedocument.wordprocessingml.document
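Since the question ultimately needs an extension for the "file-ext" header, the list above can be turned into a small lookup once a specific media type is known. This is a minimal sketch, assuming Java 9+ for Map.of; ExtensionLookup and extensionFor are hypothetical names, and the table covers only the five types listed here.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: maps the media types listed above to file extensions. */
class ExtensionLookup {

    private static final Map<String, String> EXTENSIONS = Map.of(
        "application/pdf", ".pdf",
        "application/vnd.ms-excel", ".xls",
        "application/msword", ".doc",
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", ".xlsx",
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document", ".docx");

    /** Returns the extension for a media type, or empty if the type is not in the table. */
    static Optional<String> extensionFor(String mediaType) {
        if (mediaType == null) {
            return Optional.empty();
        }
        // Drop any parameters (for example "; charset=UTF-8") and normalise case.
        String base = mediaType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return Optional.ofNullable(EXTENSIONS.get(base));
    }

    public static void main(String[] args) {
        System.out.println(extensionFor("application/msword").orElse("(unknown)"));
        System.out.println(extensionFor("application/x-tika-ooxml").orElse("(unknown)"));
    }
}
```

Group types such as application/x-tika-ooxml and application/x-tika-msoffice are deliberately absent from the table, so the caller gets an empty result and can fall back to a deeper inspection of the file.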

The program

The program shown below uses the following Maven dependencies:

    <dependencies>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-core</artifactId>
            <version>1.23</version>
        </dependency>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers</artifactId>
            <version>1.23</version>
        </dependency>
        <dependency>
            <groupId>javax.ws.rs</groupId>
            <artifactId>javax.ws.rs-api</artifactId>
            <version>2.1.1</version>
        </dependency>
    </dependencies>

The program (for this demo only, not production-ready) is shown below. In particular, look at the tikaDetect() and tikaParse() methods.

import java.io.IOException;
import java.io.File;
import java.io.FileInputStream;
import java.io.BufferedInputStream;
import java.util.Set;
import java.util.HashSet;
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MimeTypeException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.detect.Detector;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.SAXException;
import org.xml.sax.ContentHandler;

public class Main {

    // Files the detector could only classify as the generic application/x-tika-msoffice group.
    private final Set<File> msOfficeFiles = new HashSet<>();

    public static void main(String[] args) throws IOException, MimeTypeException,
            SAXException, TikaException {
        Main main = new Main();
        main.doFileDetection();
    }

    private void doFileDetection() throws IOException, MimeTypeException, SAXException, TikaException {
        File file1 = new File("C:/tmp/foo.pdf");
        File file2 = new File("C:/tmp/baz.xlsx");
        File file3 = new File("C:/tmp/bat.docx");
        // Excel 97-2003 format:
        File file4 = new File("C:/tmp/bar.xls");
        // Word 97-2003 format:
        File file5 = new File("C:/tmp/daz.doc");
        Set<File> files = new HashSet<>();
        files.add(file1);
        files.add(file2);
        files.add(file3);
        files.add(file4);
        files.add(file5);

        for (File file : files) {
            try (BufferedInputStream bis = new BufferedInputStream(
                    new FileInputStream(file))) {
                tikaDetect(file, bis);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        for (File file : msOfficeFiles) {
            tikaParse(file);
        }
    }

    private void tikaDetect(File file, BufferedInputStream bis)
            throws IOException, SAXException, TikaException {
        Detector detector = new DefaultDetector();
        Metadata metadata = new Metadata();
        MediaType mediaType = detector.detect(bis, metadata);
        if (mediaType.toString().equals("application/x-tika-msoffice")) {
            msOfficeFiles.add(file);
        } else {
            System.out.println("Media Type for " + file.getName()
                    + " is: " + mediaType.toString());
        }
    }

    private void tikaParse(File file) throws SAXException, TikaException {
        Parser parser = new AutoDetectParser();
        ContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        try (BufferedInputStream bis = new BufferedInputStream(
                new FileInputStream(file))) {
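            // Parsing fills in the metadata, including the specific Content-Type.
            // The stream has been read by the parser at this point, so it is not passed to tikaDetect() again.
            parser.parse(bis, handler, metadata, context);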
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println("Media Type for " + file.getName()
                + " is: " + metadata.get("Content-Type"));
    }
}

Actual results

The program produces some warning and info messages. If we ignore those for this exercise, we get the following print statements:

Media Type for bat.docx is: application/vnd.openxmlformats-officedocument.wordprocessingml.document
Media Type for baz.xlsx is: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Media Type for foo.pdf is: application/pdf
Media Type for bar.xls is: application/vnd.ms-excel
Media Type for daz.doc is: application/msword

These match the expected official media (MIME) types.

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/60535930/