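pdf - How to replace/remove text in a PDF file

Tags: pdf pdf-generation

How do I replace/remove text from a PDF file?

I have a PDF file that I got from somewhere, and I would like to be able to replace some of the text in it.

Or, I have a PDF file and I want to obscure (redact) some of its text so that it is no longer visible [and so that it looks cool, like those CIA files].

Or, I have a PDF that contains global JavaScript, and I want to stop it from interfering with my use of the PDF.

Best answer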

This can be done, in a limited way, with iText/iTextSharp. It only works on the Tj/TJ operators (i.e. standard text, not text embedded in images or text drawn with shapes).
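To illustrate why TJ is the harder of the two operators, here is a minimal sketch that models a TJ array with plain .NET objects instead of iTextSharp types. The content stream fragment in the comment is made up for illustration, and `TjArrayDemo` with its two methods is a hypothetical helper, not part of iTextSharp or of the repository mentioned in this answer.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Illustrative content stream fragment (not taken from a real file):
//   (Hello World) Tj                     % one string operand
//   [(He) -20 (llo W) 15 (orld)] TJ      % strings mixed with kerning offsets
public static class TjArrayDemo
{
    // Concatenates only the string elements of a modeled TJ array,
    // skipping the numeric kerning offsets between them.
    public static string JoinStrings(IEnumerable<object> tjArray)
    {
        return string.Concat(tjArray.OfType<string>());
    }

    // True if at least one string element matches the pattern on its own.
    public static bool AnyFragmentMatches(IEnumerable<object> tjArray, string pattern)
    {
        return tjArray.OfType<string>().Any(s => Regex.IsMatch(s, pattern));
    }
}
```

With the array `{ "He", -20, "llo W", 15, "orld" }`, `AnyFragmentMatches(..., "Hello")` is false because no single fragment contains the word, while matching against `JoinStrings(...)` succeeds. A replacement that spans fragments would then still have to be written back into the array, which this sketch does not attempt.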

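You need to override the default PdfContentStreamProcessor so that it acts on the page content stream, as shown by mkl in "Removing Watermark from PDF iTextSharp". Inherit from that editor class (PdfContentStreamEditor below), and in your new class look for the Tj/TJ operators. The operand is usually the text element (for TJ it may not be plain text, and all of the operands may need further parsing).

This GitHub repository, https://github.com/bevanweiss/PdfEditor, contains a very basic example of some of the flexibility available around iTextSharp. (Code excerpts follow below.)

Note: this uses the AGPL version of iTextSharp (and is therefore also AGPL), so if you distribute executables derived from this code, or let others interact with those executables in any way, you must also provide your modified source code. There is also no warranty, implied or express, attached to this code. Use at your own risk.

PdfContentStreamEditor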

using System.Collections.Generic;

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDFCleaner
{
    public class PdfContentStreamEditor : PdfContentStreamProcessor
    {
        /**
         * This method edits the immediate contents of a page, i.e. its content stream.
         * It explicitly does not descend into form xobjects, patterns, or annotations.
         */
        public void EditPage(PdfStamper pdfStamper, int pageNum)
        {
            var pdfReader = pdfStamper.Reader;
            var page = pdfReader.GetPageN(pageNum);
            var pageContentInput = ContentByteUtils.GetContentBytesForPage(pdfReader, pageNum);
            page.Remove(PdfName.CONTENTS);
            EditContent(pageContentInput, page.GetAsDict(PdfName.RESOURCES), pdfStamper.GetUnderContent(pageNum));
        }

        /**
         * This method processes the content bytes and outputs to the given canvas.
         * It explicitly does not descend into form xobjects, patterns, or annotations.
         */
        public virtual void EditContent(byte[] contentBytes, PdfDictionary resources, PdfContentByte canvas)
        {
            this.Canvas = canvas;
            ProcessContent(contentBytes, resources);
            this.Canvas = null;
        }

        /**
         * This method writes content stream operations to the target canvas. The default
         * implementation writes them as they come, so it essentially generates identical
         * copies of the original instructions the {@link ContentOperatorWrapper} instances
         * forward to it.
         *
         * Override this method to achieve some fancy editing effect.
         */

        protected virtual void Write(PdfContentStreamProcessor processor, PdfLiteral operatorLit, List<PdfObject> operands)
        {
            var index = 0;

            foreach (var pdfObject in operands)
            {
                pdfObject.ToPdf(null, Canvas.InternalBuffer);
                Canvas.InternalBuffer.Append(operands.Count > ++index ? (byte) ' ' : (byte) '\n');
            }
        }


        //
        // constructor giving the parent a dummy listener to talk to 
        //
        public PdfContentStreamEditor() : base(new DummyRenderListener())
        {
        }

        //
        // constructor passing a caller-supplied render listener to the parent
        //
        public PdfContentStreamEditor(IRenderListener renderListener) : base(renderListener)
        {
        }

        //
        // Overrides of PdfContentStreamProcessor methods
        //

        public override IContentOperator RegisterContentOperator(string operatorString, IContentOperator newOperator)
        {
            var wrapper = new ContentOperatorWrapper();
            wrapper.SetOriginalOperator(newOperator);
            var formerOperator = base.RegisterContentOperator(operatorString, wrapper);
            return (formerOperator is ContentOperatorWrapper operatorWrapper ? operatorWrapper.GetOriginalOperator() : formerOperator);
        }

        public override void ProcessContent(byte[] contentBytes, PdfDictionary resources)
        {
            this.Resources = resources; 
            base.ProcessContent(contentBytes, resources);
            this.Resources = null;
        }

        //
        // members holding the output canvas and the resources
        //
        protected PdfContentByte Canvas = null;
        protected PdfDictionary Resources = null;

        //
        // A content operator class to wrap all content operators to forward the invocation to the editor
        //
        class ContentOperatorWrapper : IContentOperator
        {
            public IContentOperator GetOriginalOperator()
            {
                return _originalOperator;
            }

            public void SetOriginalOperator(IContentOperator op)
            {
                this._originalOperator = op;
            }

            public void Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List<PdfObject> operands)
            {
                if (_originalOperator != null && !"Do".Equals(oper.ToString()))
                {
                    _originalOperator.Invoke(processor, oper, operands);
                }
                ((PdfContentStreamEditor)processor).Write(processor, oper, operands);
            }

            private IContentOperator _originalOperator = null;
        }

        //
        // A dummy render listener to give to the underlying content stream processor to feed events to
        //
        class DummyRenderListener : IRenderListener
        {
            public void BeginTextBlock() { }

            public void RenderText(TextRenderInfo renderInfo) { }

            public void EndTextBlock() { }

            public void RenderImage(ImageRenderInfo renderInfo) { }
        }
    }
}

TextReplaceStreamEditor

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDFCleaner
{
    public class TextReplaceStreamEditor : PdfContentStreamEditor
    {
        public TextReplaceStreamEditor(string MatchPattern, string ReplacePattern)
        {
            _matchPattern = MatchPattern;
            _replacePattern = ReplacePattern;
        }

        private string _matchPattern;
        private string _replacePattern;

        protected override void Write(PdfContentStreamProcessor processor, PdfLiteral oper, List<PdfObject> operands)
        {
            var operatorString = oper.ToString();
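            // Tj carries a single string operand. For TJ the operand is an array of
            // strings and kerning numbers, so IsString() is false for it and the loop
            // below leaves TJ text unchanged unless the array elements are parsed too.
            if ("Tj".Equals(operatorString) || "TJ".Equals(operatorString))
            {
                for (var i = 0; i < operands.Count; i++)
                {
                    if (!operands[i].IsString())
                        continue;

                    var text = operands[i].ToString();
                    if (Regex.IsMatch(text, _matchPattern))
                    {
                        operands[i] = new PdfString(Regex.Replace(text, _matchPattern, _replacePattern));
                    }
                }
            }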

            base.Write(processor, oper, operands);
        }
    }
}
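Both constructor arguments above are passed straight to Regex.IsMatch and Regex.Replace, so literal text that contains regex metacharacters (parentheses, dots, dollar signs) would be interpreted as a pattern. A small sketch of one way to guard against that is shown here; `LiteralPatterns` is a hypothetical helper name and not part of the original code.

```csharp
using System.Text.RegularExpressions;

public static class LiteralPatterns
{
    // Escapes regex metacharacters so the text is matched literally.
    public static string Match(string literalText) => Regex.Escape(literalText);

    // Doubles '$' so Regex.Replace does not read it as a substitution marker.
    public static string Replacement(string literalText) => literalText.Replace("$", "$$");
}
```

It could be used as `new TextReplaceStreamEditor(LiteralPatterns.Match("(USD) $5.00"), LiteralPatterns.Replacement("(EUR) $4.60"))` when the intent is a plain find-and-replace rather than a pattern match.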

TextRedactStreamEditor

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDFCleaner
{
    public class TextRedactStreamEditor : PdfContentStreamEditor
    {
        public TextRedactStreamEditor(string MatchPattern) : base(new RedactRenderListener(MatchPattern))
        {
            _matchPattern = MatchPattern;
        }

        private string _matchPattern;

        protected override void Write(PdfContentStreamProcessor processor, PdfLiteral oper, List<PdfObject> operands)
        {
            base.Write(processor, oper, operands);
        }

        public override void EditContent(byte[] contentBytes, PdfDictionary resources, PdfContentByte canvas)
        {
            ((RedactRenderListener)base.RenderListener).SetCanvas(canvas);
            base.EditContent(contentBytes, resources, canvas);
        }
    }

    //
    // A pretty simple render listener, all we care about is text stuff.
    // We listen out for text blocks, look for our text, and then put a
    // black box over it.. text 'redacted'
    //
    class RedactRenderListener : IRenderListener
    {
        private PdfContentByte _canvas;
        private string _matchPattern;

        public RedactRenderListener(string MatchPattern)
        {
            _matchPattern = MatchPattern;
        }

        public RedactRenderListener(PdfContentByte Canvas, string MatchPattern)
        {
            _canvas = Canvas;
            _matchPattern = MatchPattern;
        }

        public void SetCanvas(PdfContentByte Canvas)
        {
            _canvas = Canvas;
        }

        public void BeginTextBlock() { }

        public void RenderText(TextRenderInfo renderInfo)
        {
            var text = renderInfo.GetText();

            var match = Regex.Match(text, _matchPattern);
            if (!match.Success || match.Length == 0)
                return;

            // One entry per character of this text chunk
            var chars = renderInfo.GetCharacterRenderInfos();

            // Index of the last matched character. Index + Length would point one
            // past the match and run out of range when the match ends the chunk.
            var first = match.Index;
            var last = match.Index + match.Length - 1;
            if (last >= chars.Count)
                return;

            // Corners of the box: top-left, top-right, bottom-right, bottom-left
            var p1 = chars[first].GetAscentLine().GetStartPoint();
            var p2 = chars[last].GetAscentLine().GetEndPoint();
            var p3 = chars[last].GetDescentLine().GetEndPoint();
            var p4 = chars[first].GetDescentLine().GetStartPoint();

            _canvas.SaveState();
            _canvas.SetColorStroke(BaseColor.BLACK);
            _canvas.SetColorFill(BaseColor.BLACK);
            _canvas.MoveTo(p1[Vector.I1], p1[Vector.I2]);
            _canvas.LineTo(p2[Vector.I1], p2[Vector.I2]);
            _canvas.LineTo(p3[Vector.I1], p3[Vector.I2]);
            _canvas.LineTo(p4[Vector.I1], p4[Vector.I2]);
            _canvas.ClosePathFillStroke();
            _canvas.RestoreState();
        }

        public void EndTextBlock() { }

        public void RenderImage(ImageRenderInfo renderInfo) { }
    }
}
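The listener above uses Regex.Match, so it covers only the first match in each text chunk. One possible way to handle every match is to compute all character index ranges first and draw a box per range. The sketch below shows only that index arithmetic; `MatchRanges.Find` is a hypothetical helper, and `charCount` is assumed to be the number of per-character entries available for the chunk.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchRanges
{
    // Returns inclusive (first, last) character index pairs for every non-empty
    // match, skipping any match whose last index falls outside charCount.
    public static List<Tuple<int, int>> Find(string text, string pattern, int charCount)
    {
        var ranges = new List<Tuple<int, int>>();
        foreach (Match m in Regex.Matches(text, pattern))
        {
            if (m.Length == 0)
                continue;

            var last = m.Index + m.Length - 1;
            if (last >= charCount)
                continue;

            ranges.Add(Tuple.Create(m.Index, last));
        }
        return ranges;
    }
}
```

Each returned pair could then be used in place of the single first/last pair when building the box corners. Matches that span more than one text chunk are still not found by this approach.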

Using them with iTextSharp

using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using PDFCleaner;

var reader = new PdfReader("SRC FILE PATH GOES HERE");
var dstFile = File.Open("DST FILE PATH GOES HERE", FileMode.Create);

var pdfStamper = new PdfStamper(reader, dstFile, reader.PdfVersion, false);

// We don't need to auto-rotate, as the PdfContentStreamEditor will already deal with pre-rotated space..
// if we enable this we will inadvertently rotate the content.
pdfStamper.RotateContents = false;

// Note: EditPage removes the page's original content entry, so a second editor run on the
// same stamper appears to find nothing left to process. Use one of the two blocks below per pass.

// This is for the Text Replace
var replaceTextProcessor = new TextReplaceStreamEditor(
    "TEXT TO REPLACE HERE",
    "TEXT TO SUBSTITUTE IN HERE");

for (int i = 1; i <= reader.NumberOfPages; i++)
    replaceTextProcessor.EditPage(pdfStamper, i);


// This is for the Text Redact
var redactTextProcessor = new TextRedactStreamEditor(
    "TEXT TO REDACT HERE");
for (int i = 1; i <= reader.NumberOfPages; i++)
    redactTextProcessor.EditPage(pdfStamper, i);
// Since our redacting just puts a box over the top, we should secure the document a bit...
// just to prevent people copying/pasting the text behind the box.. we also prevent text to
// speech processing of the file, otherwise the 'hidden' text will be spoken
pdfStamper.Writer.SetEncryption(null,
    Encoding.UTF8.GetBytes("ownerPassword"),
    PdfWriter.AllowDegradedPrinting | PdfWriter.AllowPrinting,
    PdfWriter.ENCRYPTION_AES_256);

// hey, lets get rid of Javascript too, because it's annoying
pdfStamper.Javascript = "";


// and then finally we close our files (saving it in the process)
pdfStamper.Close();
reader.Close();
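Because the redaction above only draws a box over the text, the text itself stays in the content stream. A rough way to confirm what is still extractable is to pull the text of each page and test it against the same pattern. The sketch below covers only the checking step; `RedactionCheck` is a hypothetical helper, and the page strings are assumed to have been collected beforehand, for example with PdfTextExtractor.GetTextFromPage(reader, pageNumber).

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RedactionCheck
{
    // Returns the 1-based numbers of pages whose extracted text still matches
    // the pattern. pageTexts is assumed to hold one string per page, in page order.
    public static List<int> PagesStillMatching(IList<string> pageTexts, string pattern)
    {
        var pages = new List<int>();
        for (var i = 0; i < pageTexts.Count; i++)
        {
            if (Regex.IsMatch(pageTexts[i] ?? string.Empty, pattern))
                pages.Add(i + 1);
        }
        return pages;
    }
}
```

A non-empty result means the "redacted" text can still be read by anything that is able to open and parse the file, so the box alone should not be treated as true removal.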

Source: the Stack Overflow question "How to replace/remove text in a PDF file", https://stackoverflow.com/questions/49490199/
