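pdf - iText - Cleaning up text in a rectangle without cleaning the whole line

I am trying to clean up the text inside a rectangle in a PDF document using iText.

Here is the code snippet I am using: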

PdfReader pdfReader = null;
PdfStamper stamper = null;
try 
{
    int pageNo = 1;

    List<Float> linkBounds = new ArrayList<Float>();
    linkBounds.add(0, (float) 202.3);
    linkBounds.add(1, (float) 588.6);
    linkBounds.add(2, (float) 265.8);
    linkBounds.add(3, (float) 599.7);

    pdfReader = new PdfReader("Test1.pdf");
    stamper = new PdfStamper(pdfReader, new FileOutputStream("Test2.pdf"));

    Rectangle linkLocation = new Rectangle(linkBounds.get(0), linkBounds.get(1), linkBounds.get(2), linkBounds.get(3));

    List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
    cleanUpLocations.add(new PdfCleanUpLocation(pageNo, linkLocation, BaseColor.GRAY));
    PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
    cleaner.cleanUp();
}
catch (Exception e)
{
    e.printStackTrace();
}
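finally
{
    // Guard against a NullPointerException if opening the files failed
    // before stamper or pdfReader was assigned.
    if (stamper != null) {
        try {
            stamper.close();
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
    if (pdfReader != null) {
        pdfReader.close();
    }
}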

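After this code runs, it clears the whole line of text instead of only the text inside the given rectangle.

To explain this better, I have attached the PDF documents.

In the input PDF, I highlighted the text to show the rectangle I specified for clean-up.

In the output PDF, you can clearly see the gray rectangle, but you will also notice that the whole line of text has been cleaned.

Any help would be appreciated.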

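Best answer

The files input.pdf and output.pdf originally provided by the OP did not allow the issue to be reproduced; they did not even seem to match each other. Hence the original answer, which essentially states that the issue could not be reproduced.

The second set of files, Test1.pdf and Test2.pdf, did allow the issue to be reproduced, which led to the updated answer.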

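Updated answer, referring to the OP's second set of sample files

There is indeed an issue in the current (up to 5.5.8) iText clean-up code: for tagged files, some of the PdfContentByte methods used here introduce extra instructions into the content stream. These instructions effectively corrupt the stream and, in PDF viewers that ignore the corruption, reposition some of the text.

In more detail:

PdfCleanUpContentOperator.writeTextChunks uses canvas.setCharacterSpacing(0) and canvas.setWordSpacing(0) to initially set the character and word spacing to 0. Unfortunately, for tagged files these methods check whether the canvas under construction is currently inside a text object and, if it is not, start one. This check depends on a local flag set by beginText, but during clean-up text objects are not started using that method. Thus, writeTextChunks inserts an extra "BT 1 0 0 1 0 0 Tm" sequence here, which breaks the stream and repositions the text that follows.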

private void writeTextChunks(Map<Integer, Float> structuredTJoperands, List<PdfCleanUpContentChunk> chunks, PdfContentByte canvas,
                             float characterSpacing, float wordSpacing, float fontSize, float horizontalScaling) throws IOException {
    canvas.setCharacterSpacing(0);
    canvas.setWordSpacing(0);
    ...
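
The mechanism described above can be illustrated with a small stand-alone model. This is only a sketch under assumptions: ModelCanvas, appendRaw and run are hypothetical names invented for illustration, and the class is not iText code. It merely mimics the behavior described in this answer, where a flag set only by beginText decides whether setCharacterSpacing opens a new text object.

```java
public class TextStateModel {

    /** Hypothetical stand-in for a content stream writer; NOT an iText class. */
    static class ModelCanvas {
        final StringBuilder content = new StringBuilder();
        final boolean tagged;
        // Set only by beginText(), like the local flag described in the answer.
        boolean inText = false;

        ModelCanvas(boolean tagged) {
            this.tagged = tagged;
        }

        void beginText() {
            inText = true;
            content.append("BT\n");
        }

        /** Copies operators verbatim without updating the flag, as a clean-up pass would. */
        void appendRaw(String operators) {
            content.append(operators).append('\n');
        }

        void setCharacterSpacing(float value) {
            if (tagged && !inText) {
                // The flag says no text object is open, so one is started,
                // although the raw stream already contains an open BT.
                beginText();
                content.append("1 0 0 1 0 0 Tm\n");
            }
            content.append(number(value)).append(" Tc\n");
        }

        private static String number(float v) {
            return v == (long) v ? Long.toString((long) v) : Float.toString(v);
        }
    }

    /** Replays a short text sequence and returns the resulting model stream. */
    static String run(boolean tagged) {
        ModelCanvas canvas = new ModelCanvas(tagged);
        canvas.appendRaw("BT");                 // opened behind the flag's back
        canvas.appendRaw("45 0 0 45 0 0 Tm");
        canvas.setCharacterSpacing(0);
        canvas.appendRaw("[...] TJ");
        return canvas.content.toString();
    }

    public static void main(String[] args) {
        System.out.println(run(true));
    }
}
```

In this model, run(true) yields a second BT followed by "1 0 0 1 0 0 Tm" inside the already open text object, while run(false) only adds "0 Tc".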

PdfCleanUpContentOperator.writeTextChunks should instead use hand-written Tc and Tw instructions so that this side effect is not triggered.

private void writeTextChunks(Map<Integer, Float> structuredTJoperands, List<PdfCleanUpContentChunk> chunks, PdfContentByte canvas,
                             float characterSpacing, float wordSpacing, float fontSize, float horizontalScaling) throws IOException {
    if (Float.compare(characterSpacing, 0.0f) != 0 && Float.compare(characterSpacing, -0.0f) != 0) {
        new PdfNumber(0).toPdf(canvas.getPdfWriter(), canvas.getInternalBuffer());
        canvas.getInternalBuffer().append(Tc);
    }
    if (Float.compare(wordSpacing, 0.0f) != 0 && Float.compare(wordSpacing, -0.0f) != 0) {
        new PdfNumber(0).toPdf(canvas.getPdfWriter(), canvas.getInternalBuffer());
        canvas.getInternalBuffer().append(Tw);
    }
    canvas.getInternalBuffer().append((byte) '[');
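
The zero check in the patch above can be tried in isolation. The following is a minimal sketch, not part of iText: SpacingReset, isNonZero and resetOperators are hypothetical names, and the operators are collected in a String instead of the canvas buffer. It shows that both comparisons are needed, because Float.compare treats 0.0f and -0.0f as different values.

```java
public class SpacingReset {

    /** True unless the value is positive or negative zero (same test as in the patch). */
    static boolean isNonZero(float value) {
        return Float.compare(value, 0.0f) != 0 && Float.compare(value, -0.0f) != 0;
    }

    /**
     * Returns the raw operators needed to reset spacing to 0.
     * Writing them as plain text bypasses any "are we inside a text object" bookkeeping.
     */
    static String resetOperators(float characterSpacing, float wordSpacing) {
        StringBuilder out = new StringBuilder();
        if (isNonZero(characterSpacing)) {
            out.append("0 Tc\n");
        }
        if (isNonZero(wordSpacing)) {
            out.append("0 Tw\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Character spacing taken from the input.pdf excerpt quoted in this answer.
        System.out.print(resetOperators(0.0116f, 0f));
    }
}
```

A reset operator is emitted only for a spacing value that is currently non-zero, so streams that already use zero spacing are left unchanged.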

With this change, the OP's new sample file "Test1.pdf" is redacted correctly by the sample code:

@Test
public void testRedactJavishsTest1() throws IOException, DocumentException
{
    try (   InputStream resource = getClass().getResourceAsStream("Test1.pdf");
            OutputStream result = new FileOutputStream(new File(OUTPUTDIR, "Test1-redactedJavish.pdf")) )
    {
        PdfReader reader = new PdfReader(resource);
        PdfStamper stamper = new PdfStamper(reader, result);

        List<Float> linkBounds = new ArrayList<Float>();
        linkBounds.add(0, (float) 202.3);
        linkBounds.add(1, (float) 588.6);
        linkBounds.add(2, (float) 265.8);
        linkBounds.add(3, (float) 599.7);

        Rectangle linkLocation1 = new Rectangle(linkBounds.get(0), linkBounds.get(1), linkBounds.get(2), linkBounds.get(3));
        List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
        cleanUpLocations.add(new PdfCleanUpLocation(1, linkLocation1, BaseColor.GRAY));

        PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
        cleaner.cleanUp();

        stamper.close();
        reader.close();
    }
}

( RedactText.java )

Original answer, referring to the OP's original sample files

I just tried to reproduce your issue using this test method:

@Test
public void testRedactJavishsText() throws IOException, DocumentException
{
    try (   InputStream resource = getClass().getResourceAsStream("input.pdf");
            OutputStream result = new FileOutputStream(new File(OUTPUTDIR, "input-redactedJavish.pdf")) )
    {
        PdfReader reader = new PdfReader(resource);
        PdfStamper stamper = new PdfStamper(reader, result);

        List<Float> linkBounds = new ArrayList<Float>();
        linkBounds.add(0, (float) 200.7);
        linkBounds.add(1, (float) 547.3);
        linkBounds.add(2, (float) 263.3);
        linkBounds.add(3, (float) 558.4);

        Rectangle linkLocation1 = new Rectangle(linkBounds.get(0), linkBounds.get(1), linkBounds.get(2), linkBounds.get(3));
        List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
        cleanUpLocations.add(new PdfCleanUpLocation(1, linkLocation1, BaseColor.GRAY));

        PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
        cleaner.cleanUp();

        stamper.close();
        reader.close();
    }
}

( RedactText.java )

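Your source PDF looks like this:

The result is:

and not like yours:

I even retested with iText versions 5.5.5 and 5.5.4, which you mentioned in a comment, but in all cases I got the correct result.

Thus, I cannot reproduce your issue.

I took a closer look at your output.pdf. It is somewhat peculiar; in particular, it does not contain certain blocks that are typical of PDFs created or manipulated by current iText versions. Furthermore, the content streams look very different.

Thus, I assume that after iText redacted your file, some other tool post-processed it and damaged it in the process.

In particular, the page content instructions that prepare for inserting the redacted line look like this in your input.pdf: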

q
0.24 0 0 0.24 113.7055 548.04 cm
BT
0.0116 Tc
45 0 0 45 0 0 Tm
/TT5 1 Tf
[...] TJ

and like this in the version I get directly from iText:

q
0.24 0 0 0.24 113.7055 548.04 cm
BT
0.0116 Tc
45 0 0 45 0 0 Tm
/TT5 1 Tf
0 Tc
0 Tw 
[...] TJ

But the corresponding lines in your output.pdf look like this:

BT
1 0 0 1 113.3 548.5 Tm
0 Tc
BT
1 0 0 1 0 0 Tm
0 Tc 
[...] TJ

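Here, the instructions in your output.pdf

- are invalid, because a text object BT ... ET must not contain another text object, yet you have two BT operations with no ET in between;
- effectively position the text at 0, 0 if a PDF viewer ignores the error above.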
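
The first point can be checked mechanically. Below is a minimal sketch under simplifying assumptions: TextObjectCheck and firstNestedBT are hypothetical names, and the content stream is split on whitespace only. A real content stream may contain "BT" inside string literals or inline image data, so a proper content stream parser would be needed for anything beyond a quick check.

```java
public class TextObjectCheck {

    /**
     * Returns the index of the first whitespace-separated token that opens a
     * text object while another one is still open, or -1 if there is none.
     */
    static int firstNestedBT(String content) {
        boolean inText = false;
        String[] tokens = content.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals("BT")) {
                if (inText) {
                    return i; // second BT without an ET in between
                }
                inText = true;
            } else if (tokens[i].equals("ET")) {
                inText = false;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Excerpt from output.pdf as quoted in this answer.
        String damaged = "BT\n1 0 0 1 113.3 548.5 Tm\n0 Tc\nBT\n1 0 0 1 0 0 Tm\n0 Tc\n[...] TJ";
        System.out.println(firstNestedBT(damaged));
    }
}
```

For the output.pdf excerpt, the check reports the second BT; for a sequence in which every BT is closed by an ET before the next one, it returns -1.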

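In fact, if you look at the bottom of the page in your output.pdf, you will see:

Thus, if I am right in assuming that some other program post-processes the iText result, you should fix that post-processor.

If there is no such post-processor, you do not seem to be using an officially released iText version but something entirely different.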

Source: the corresponding question on Stack Overflow, https://stackoverflow.com/questions/35374912/
