java - 使用 PDFBox 拆分大型 Pdf 文件会得到大型结果文件

我正在使用 pdfbox 处理一些大型 pdf 文件(最多 100MB，大约 2000 页)。有些页面包含二维码，我想将这些文件拆分成更小的文件，页面从一个二维码到下一个。我明白了，但结果文件大小与源文件相同。我的意思是，如果我将一个 100MB 的 pdf 文件切成十个文件，我将得到十个文件，每个文件大小为 100MB。

这是代码:

 PDDocument documentoPdf = 
        PDDocument.loadNonSeq(new File("myFile.pdf"), 
                           new RandomAccessFile(new File("./tmp/temp"), "rw"));

    int numPages = documentoPdf.getNumberOfPages();
    List pages = documentoPdf.getDocumentCatalog().getAllPages();

    int previusQR = 0;
    for(int i =0; i<numPages; i++){
       PDPage page = (PDPage) pages.get(i);
       BufferedImage firstPageImage =    
           page.convertToImage(BufferedImage.TYPE_USHORT_565_RGB , 200);

       String qrText = readQRWithQRCodeMultiReader(firstPageImage, hintMap);

       if(qrText != null and i!=0){
         PDDocument outputDocument = new PDDocument();
         for(int j = previusQR; j<i; j++){
           outputDocument.importPage((PDPage)pages.get(j));
          }
         File f = new File("./splitting_files/"+previusQR+".pdf");
         outputDocument.save(f);
         outputDocument.close();
         documentoPdf.close();
    }

我还尝试了以下代码来存储新文件:

PDDocument outputDocument = new PDDocument();

for(int j = previusQR; j<i; j++){
 PDStream src = ((PDPage)pages.get(j)).getContents();
 PDStream streamD = new PDStream(outputDocument);
 streamD.addCompression();

 PDPage newPage = new PDPage(new   
           COSDictionary(((PDPage)pages.get(j)).getCOSDictionary()));
 newPage.setContents(streamD);

 byte[] buf = new byte[10240];
 int amountRead = 0;
 InputStream is = null;
 OutputStream os = null;
 is = src.createInputStream();
 os = streamD.createOutputStream();
 while((amountRead = is.read(buf,0,10240)) > -1) {
    os.write(buf, 0, amountRead);
  }

 outputDocument.addPage(newPage);
}

File f = new File("./splitting_files/"+previusQR+".pdf");

outputDocument.save(f);
outputDocument.close();

但这段代码创建的文件缺少一些内容，而且大小与原始文件相同。

如何从较大的 pdf 文件创建较小的 pdf 文件？ PDFBox 可行吗？是否有任何其他库可以将单个页面转换为图像(用于 qr 识别)，并且还允许我将一个大的 pdf 文件拆分为多个较小的文件？

谢谢!

最佳答案

谢谢! Tilman 你是对的，PDFSplit 命令生成较小的文件。我检查了 PDFSplit 代码，发现它删除了页面链接以避免不需要的资源。

从 Splitter.class 中提取的代码:

private void processAnnotations(PDPage imported) throws IOException
    {
        List<PDAnnotation> annotations = imported.getAnnotations();
        for (PDAnnotation annotation : annotations)
        {
            if (annotation instanceof PDAnnotationLink)
            {
                PDAnnotationLink link = (PDAnnotationLink)annotation;   
                PDDestination destination = link.getDestination();
                if (destination == null && link.getAction() != null)
                {
                    PDAction action = link.getAction();
                    if (action instanceof PDActionGoTo)
                    {
                        destination = ((PDActionGoTo)action).getDestination();
                    }
                }
                if (destination instanceof PDPageDestination)
                {
                    // TODO preserve links to pages within the splitted result  
                    ((PDPageDestination) destination).setPage(null);
                }
            }
            else
            {
                // TODO preserve links to pages within the splitted result  
                annotation.setPage(null);
            }
        }
    }

所以最终我的代码看起来像这样:

PDDocument documentoPdf = 
        PDDocument.loadNonSeq(new File("docs_compuestos/50.pdf"), new RandomAccessFile(new File("./tmp/t"), "rw"));

        int numPages = documentoPdf.getNumberOfPages();
        List pages = documentoPdf.getDocumentCatalog().getAllPages();


        int previusQR = 0;
        for(int i =0; i<numPages; i++){
            PDPage firstPage = (PDPage) pages.get(i);
            String qrText ="";


            BufferedImage firstPageImage = firstPage.convertToImage(BufferedImage.TYPE_USHORT_565_RGB , 200);


            firstPage =null;

            try {
                qrText = readQRWithQRCodeMultiReader(firstPageImage, hintMap);
            } catch (NotFoundException e) {
                e.printStackTrace();
            } finally {
                firstPageImage = null;
            }


        if(i != 0 && qrText!=null){
                    PDDocument outputDocument = new PDDocument();
                    outputDocument.setDocumentInformation(documentoPdf.getDocumentInformation());
                    outputDocument.getDocumentCatalog().setViewerPreferences(
                            documentoPdf.getDocumentCatalog().getViewerPreferences());


                    for(int j = previusQR; j<i; j++){
                        PDPage importedPage = outputDocument.importPage((PDPage)pages.get(j));

                        importedPage.setCropBox( ((PDPage)pages.get(j)).findCropBox() );
                        importedPage.setMediaBox( ((PDPage)pages.get(j)).findMediaBox() );
                        // only the resources of the page will be copied
                        importedPage.setResources( ((PDPage)pages.get(j)).getResources() );
                        importedPage.setRotation( ((PDPage)pages.get(j)).findRotation() );

                        processAnnotations(importedPage);


                    }


                    File f = new File("./splitting_files/"+previusQR+".pdf");

                    previusQR = i;

                    outputDocument.save(f);
                    outputDocument.close();
                }
            }


        }

非常感谢!!

关于java - 使用 PDFBox 拆分大型 Pdf 文件会得到大型结果文件，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/35457952/

java - 使用 PDFBox 拆分大型 Pdf 文件会得到大型结果文件

上一篇：java - 单个查询中的ManyToOne Outer Join(1对0关系)

下一篇：java - 如果代理类具有包私有(private)默认构造函数，则代理实例化失败