c# - Extract text from a PDF between two dividers with ITextSharp

Tags: c# pdf itextsharp

I have a PDF file with more than 1,500 pages of 'random' text, and I have to extract some of that text...
I can identify the blocks like this:

bla bla bla bla bla 
...
...
...
-------------------------- (separator blue image)
XXX: TEXT TEXT TEXT
TEXT TEXT TEXT TEXT
...
-------------------------- (separator blue image)
bla bla bla bla
...
...
-------------------------- (separator blue image)
XXX: TEXT2 TEXT2 TEXT2
TEXT2 TEXT2 TEXT TEXT2
...
-------------------------- (separator blue image)

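I need to extract all the text between the separators (all blocks).
"XXX" appears at the start of every block, but I have no way to detect the end of a block. Is it possible to use the image separator in the parser? How?
Is there any other way?
Edit: more information
There is no background, and the text can be copied and pasted.
Sample PDF: 1
For example, page 320
Thanks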

Best answer

Theory
In your sample PDF the dividers are created using vector graphics:

0.58 0.17 0 0.47 K
q 1 0 0 1 56.6929 772.726 cm
0 0 m
249.118 0 l
S
Q
q 1 0 0 1 56.6929 690.9113 cm
0 0 m
249.118 0 l
S 

etc.
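To see where such a divider ends up on the page, the path coordinates have to be multiplied by the matrix set by the preceding cm operator. The following is a minimal sketch in plain Java, assuming the current transformation matrix is the identity before the cm instruction (this is an assumption, not something verified against the sample file). DividerCoordinates and its methods are hypothetical names used only for illustration.

```java
// Sketch: map the first divider's path coordinates into user space.
// The numbers are copied from the content stream excerpt above;
// the class and method names are hypothetical.
final class DividerCoordinates {
    /** Applies a PDF matrix [a b c d e f] to the point (x, y). */
    static double[] transform(double[] m, double x, double y) {
        return new double[] {
            m[0] * x + m[2] * y + m[4],
            m[1] * x + m[3] * y + m[5]
        };
    }

    /** Returns {x1, y1, x2, y2} of the first divider in user space. */
    static double[] firstDivider() {
        double[] cm = {1, 0, 0, 1, 56.6929, 772.726}; // "1 0 0 1 56.6929 772.726 cm"
        double[] from = transform(cm, 0, 0);          // "0 0 m"
        double[] to = transform(cm, 249.118, 0);      // "249.118 0 l"
        return new double[] {from[0], from[1], to[0], to[1]};
    }
}
```

Under that assumption the first divider is a horizontal line at y = 772.726, running from x = 56.6929 to x = 305.8109.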
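Parsing vector graphics is a fairly new addition to iText(Sharp), and the API in this area is still subject to change. Currently (version 5.5.6) you can parse vector graphics using an implementation of the interface ExtRenderListener (Java) / IExtRenderListener (.NET).
You now have several approaches to your task:
(Multi-pass) You can implement the interface above in a way that merely collects the lines. From these lines you can derive rectangles enclosing each section, and for each rectangle you can extract the text by applying region text filtering.
(Two-pass) Just like above, you can implement the interface in a way that merely collects the lines, and derive rectangles enclosing each section from these lines. Then you parse the page using the LocationTextExtractionStrategy and request the text of each rectangle with an appropriate ITextChunkFilter using the GetResultantText(ITextChunkFilter) overload.
(One-pass) You can implement the interface in a way that collects the lines, collects the text pieces, derives rectangles from the lines, and arranges the text pieces located in those rectangles.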
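All three approaches share the step of deriving section areas from the collected lines. Here is a minimal sketch of that step for a single text column, in plain Java without any iText types. SectionBands, Band, fromDividers and bandIndex are hypothetical names, and the sketch assumes the dividers of one column are already known by their y coordinates.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the "derive rectangles from lines" step for one column.
// All names are hypothetical; no iText types are used here.
final class SectionBands {
    /** A horizontal band [bottom, top] in user space y coordinates. */
    static final class Band {
        final double bottom, top;
        Band(double bottom, double top) { this.bottom = bottom; this.top = top; }
        boolean contains(double y) { return bottom <= y && y <= top; }
    }

    /**
     * Builds one band between each pair of neighbouring dividers, plus an
     * open band towards the top margin and one towards the bottom margin.
     */
    static List<Band> fromDividers(List<Double> dividerYs, double topMargin, double bottomMargin) {
        List<Double> ys = new ArrayList<>(dividerYs);
        ys.sort(Collections.reverseOrder()); // top of the page first
        List<Band> bands = new ArrayList<>();
        double upper = topMargin;
        for (double y : ys) {
            bands.add(new Band(y, upper));
            upper = y;
        }
        bands.add(new Band(bottomMargin, upper));
        return bands;
    }

    /** Returns the index of the first band containing y, or -1 if none does. */
    static int bandIndex(List<Band> bands, double y) {
        for (int i = 0; i < bands.size(); i++) {
            if (bands.get(i).contains(y)) return i;
        }
        return -1;
    }
}
```

Each text piece can then be assigned to a band by its baseline y coordinate. A multi-column page would need one such band list per column.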
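Sample implementation
(Because I am more fluent in Java than in C#, I implemented this sample in Java for iText. It should be easy to port to C# and iTextSharp.)
This implementation attempts to extract text sections separated by dividers as in the sample PDF.
It is a one-pass solution which at the same time reuses the existing LocationTextExtractionStrategy capabilities by deriving from that strategy.
In the same pass this strategy collects the text chunks (thanks to its parent class) and the divider lines (because it implements the extra methods of ExtRenderListener).
Having parsed a page, the strategy offers a list of Section instances via the method getSections(), each representing a section of the page delimited by a divider line above and/or below. The topmost and bottommost sections of each text column are open at the top or the bottom, implicitly delimited by the matching margin line.
Section implements the TextChunkFilter interface and can therefore be used to retrieve the text in the respective part of the page using the parent class method getResultantText(TextChunkFilter).
This is merely a POC. It is designed to extract sections from documents that use dividers exactly like the sample document does, i.e. horizontal lines drawn using moveTo-lineTo-stroke, as wide as the section, appearing in the content stream sorted column-wise. There may be more implicit assumptions that happen to hold for the sample PDF.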
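import java.util.ArrayList;
import java.util.List;

import com.itextpdf.text.pdf.parser.ExtRenderListener;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.Path;
import com.itextpdf.text.pdf.parser.PathConstructionRenderInfo;
import com.itextpdf.text.pdf.parser.PathPaintingRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public class DividerAwareTextExtrationStrategy extends LocationTextExtractionStrategy implements ExtRenderListener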
{
    //
    // constructor
    //
    /**
     * The constructor accepts top and bottom margin lines in user space y coordinates
     * and left and right margin lines in user space x coordinates.
     * Text outside those margin lines is ignored. 
     */
    public DividerAwareTextExtrationStrategy(float topMargin, float bottomMargin, float leftMargin, float rightMargin)
    {
        this.topMargin = topMargin;
        this.bottomMargin = bottomMargin;
        this.leftMargin = leftMargin;
        this.rightMargin = rightMargin;
    }

    //
    // Divider derived section support
    //
    public List<Section> getSections()
    {
        List<Section> result = new ArrayList<Section>();
        // TODO: Sort the array columnwise. In case of the OP's document, the lines already appear in the
        // correct order, so there was no need for sorting in the POC. 

        LineSegment previous = null;
        for (LineSegment line : lines)
        {
            if (previous == null)
            {
                result.add(new Section(null, line));
            }
            else if (Math.abs(previous.getStartPoint().get(Vector.I1) - line.getStartPoint().get(Vector.I1)) < 2) // 2 is a magic number... 
            {
                result.add(new Section(previous, line));
            }
            else
            {
                result.add(new Section(previous, null));
                result.add(new Section(null, line));
            }
            previous = line;
        }

        return result;
    }

    public class Section implements TextChunkFilter
    {
        LineSegment topLine;
        LineSegment bottomLine;

        final float left, right, top, bottom;

        Section(LineSegment topLine, LineSegment bottomLine)
        {
            float left, right, top, bottom;
            if (topLine != null)
            {
                this.topLine = topLine;
                top = Math.max(topLine.getStartPoint().get(Vector.I2), topLine.getEndPoint().get(Vector.I2));
                right = Math.max(topLine.getStartPoint().get(Vector.I1), topLine.getEndPoint().get(Vector.I1));
                left = Math.min(topLine.getStartPoint().get(Vector.I1), topLine.getEndPoint().get(Vector.I1));
            }
            else
            {
                top = topMargin;
                left = leftMargin;
                right = rightMargin;
            }

            if (bottomLine != null)
            {
                this.bottomLine = bottomLine;
                bottom = Math.min(bottomLine.getStartPoint().get(Vector.I2), bottomLine.getEndPoint().get(Vector.I2));
                right = Math.max(bottomLine.getStartPoint().get(Vector.I1), bottomLine.getEndPoint().get(Vector.I1));
                left = Math.min(bottomLine.getStartPoint().get(Vector.I1), bottomLine.getEndPoint().get(Vector.I1));
            }
            else
            {
                bottom = bottomMargin;
            }

            this.top = top;
            this.bottom = bottom;
            this.left = left;
            this.right = right;
        }

        //
        // TextChunkFilter
        //
        @Override
        public boolean accept(TextChunk textChunk)
        {
            // TODO: This code only checks the text chunk starting point. One should take the 
            // whole chunk into consideration
            Vector startlocation = textChunk.getStartLocation();
            float x = startlocation.get(Vector.I1);
            float y = startlocation.get(Vector.I2);

            return (left <= x) && (x <= right) && (bottom <= y) && (y <= top);
        }
    }

    //
    // ExtRenderListener implementation
    //
    /**
     * <p>
     * This method stores targets of <code>moveTo</code> in {@link #moveToVector}
     * and targets of <code>lineTo</code> in {@link #lineToVector}. Any unexpected
     * contents or operations result in clearing of the member variables.
     * </p>
     * <p>
     * So this method is implemented for files with divider lines exactly like in
     * the OP's sample file.
     * </p>
     *  
     * @see ExtRenderListener#modifyPath(PathConstructionRenderInfo)
     */
    @Override
    public void modifyPath(PathConstructionRenderInfo renderInfo)
    {
        switch (renderInfo.getOperation())
        {
        case PathConstructionRenderInfo.MOVETO:
        {
            float x = renderInfo.getSegmentData().get(0);
            float y = renderInfo.getSegmentData().get(1);
            moveToVector = new Vector(x, y, 1);
            lineToVector = null;
            break;
        }
        case PathConstructionRenderInfo.LINETO:
        {
            float x = renderInfo.getSegmentData().get(0);
            float y = renderInfo.getSegmentData().get(1);
            if (moveToVector != null)
            {
                lineToVector = new Vector(x, y, 1);
            }
            break;
        }
        default:
            moveToVector = null;
            lineToVector = null;
        }
    }

    /**
     * This method adds the current path to {@link #lines} if it consists
     * of a single line, the operation is not a no-op, and the line is
     * approximately horizontal.
     *  
     * @see ExtRenderListener#renderPath(PathPaintingRenderInfo)
     */
    @Override
    public Path renderPath(PathPaintingRenderInfo renderInfo)
    {
        if (moveToVector != null && lineToVector != null &&
            renderInfo.getOperation() != PathPaintingRenderInfo.NO_OP)
        {
            Vector from = moveToVector.cross(renderInfo.getCtm());
            Vector to = lineToVector.cross(renderInfo.getCtm());
            Vector extent = to.subtract(from);

            if (Math.abs(20 * extent.get(Vector.I2)) < Math.abs(extent.get(Vector.I1)))
            {
                LineSegment line;
                if (extent.get(Vector.I1) >= 0)
                    line = new LineSegment(from, to);
                else
                    line = new LineSegment(to, from);
                lines.add(line);
            }
        }

        moveToVector = null;
        lineToVector = null;
        return null;
    }

    /* (non-Javadoc)
     * @see com.itextpdf.text.pdf.parser.ExtRenderListener#clipPath(int)
     */
    @Override
    public void clipPath(int rule)
    {
    }

    //
    // inner members
    //
    final float topMargin, bottomMargin, leftMargin, rightMargin;
    Vector moveToVector = null;
    Vector lineToVector = null;
    final List<LineSegment> lines = new ArrayList<LineSegment>();
}

DividerAwareTextExtrationStrategy.java
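The TODO in Section.accept notes that only the start point of a text chunk is tested. A stricter variant could require both ends of the chunk to lie inside the section. The following is a minimal sketch of that test in plain Java; ChunkBounds and its methods are hypothetical names. It assumes the end point of a chunk is available alongside its start point (in iText 5 the TextChunk class should offer getEndLocation() next to getStartLocation(), but check the version you use).

```java
// Sketch of a stricter chunk test; class and method names are hypothetical.
final class ChunkBounds {
    final float left, right, bottom, top;

    ChunkBounds(float left, float right, float bottom, float top) {
        this.left = left;
        this.right = right;
        this.bottom = bottom;
        this.top = top;
    }

    /** True if the point (x, y) lies inside the rectangle, borders included. */
    boolean contains(float x, float y) {
        return left <= x && x <= right && bottom <= y && y <= top;
    }

    /** True only if both the start and the end point of a chunk lie inside. */
    boolean containsChunk(float startX, float startY, float endX, float endY) {
        return contains(startX, startY) && contains(endX, endY);
    }
}
```

Note the trade-off: this variant drops chunks that straddle a section border, whereas the start-point test in the POC assigns them to the section in which they begin.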
It can be used like this:
String extractAndStore(PdfReader reader, String format, int from, int to) throws IOException
{
    StringBuilder builder = new StringBuilder();

    for (int page = from; page <= to; page++)
    {
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        DividerAwareTextExtrationStrategy strategy = parser.processContent(page, new DividerAwareTextExtrationStrategy(810, 30, 20, 575));

        List<Section> sections = strategy.getSections();
        int i = 0;
        for (Section section : sections)
        {
            String sectionText = strategy.getResultantText(section);
            Files.write(Paths.get(String.format(format, page, i)), sectionText.getBytes("UTF8"));

            builder.append("--\n")
                   .append(sectionText)
                   .append('\n');
            i++;
        }
        builder.append("\n\n");
    }

    return builder.toString();
}

DividerAwareTextExtraction.java, method extractAndStore
Applying this method to pages 319 and 320 of the sample PDF
PdfReader reader = new PdfReader("20150211600.PDF");
String content = extractAndStore(reader, new File(RESULT_FOLDER, "20150211600.%s.%s.txt").toString(), 319, 320);

DividerAwareTextExtraction.java, test test20150211600_320
results in
--
do(s) bem (ns) exceder o seu crédito, depositará, no prazo de 3 (três) 
dias, a diferença, sob pena de ser tornada sem efeito a arrematação 
[...]
EDITAL DE INTIMAÇÃO DE ADVOGADOS
RELAÇÃO Nº 0041/2015
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0033473-16.2010.8.24.0023 (023.10.033473-6) - Ação Penal
Militar - Procedimento Ordinário - Militar - Autor: Ministério Público 
do Estado de Santa Catarina - Réu: João Gabriel Adler - Publicada a 
sentença neste ato, lida às partes e intimados os presentes. Registre-se.
A defesa manifesta o interesse em recorrer da sentença.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC), CARLOS ROBERTO PEREIRA (OAB 29179/SC), ROBSON 
LUIZ CERON (OAB 22475/SC)
Processo 0025622-86.2011.8.24.0023 (023.11.025622-3) - Ação
[...]
1, NIVAEL MARTINS PADILHA, Mat. 928313-7, ANDERSON
VOGEL e ANTÔNIO VALDEMAR FORTES, no ato deprecado.


--

--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0006958-36.2013.8.24.0023 (023.13.006958-5) - Ação Penal
Militar - Procedimento Ordinário - Crimes Militares - Autor: Ministério
Público do Estado de Santa Catarina - Réu: Pedro Conceição Bungarten
- Ficam intimadas as partes, da decisão de fls. 289/290, no prazo de 
05 (cinco) dias.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC), ROBSON LUIZ CERON (OAB 22475/SC)
Processo 0006967-95.2013.8.24.0023 (023.13.006967-4) - Ação Penal
[...]
a presença dos réus no ato deprecado.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0016809-02.2013.8.24.0023 - Ação Penal Militar -
[...]
prazo de 05 (cinco) dias.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC), ELIAS NOVAIS PEREIRA (OAB 30513/SC), ROBSON LUIZ 
CERON (OAB 22475/SC)
Processo 0021741-33.2013.8.24.0023 - Ação Penal Militar -
[...]
a presença dos réus no ato deprecado.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0024568-17.2013.8.24.0023 - Ação Penal Militar -
[...]
do CPPM
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0034522-87.2013.8.24.0023 - Ação Penal Militar -
[...]
diligências, consoante o art. 427 do CPPM
--
ADV: SANDRO MARCELO PEROTTI (OAB 8949/SC), NOEL 
ANTÔNIO BARATIERI (OAB 16462/SC), RODRIGO TADEU 
PIMENTA DE OLIVEIRA (OAB 16752/SC)
Processo 0041634-10.2013.8.24.0023 - Ação Penal Militar -
Procedimento Ordinário - Crimes Militares - Autor: M. P. E. - Réu: J. P. 
D. - Defiro a juntada dos documentos de pp. 3214-3217. Oficie-se com
urgência à Comarca de Porto União (ref. Carta Precatória n. 0000463-
--
15.2015.8.24.0052), informando a habilitação dos procuradores. Intime-
se, inclusive os novos constituídos da designação do ato.
--
ADV: SANDRO MARCELO PEROTTI (OAB 8949/SC), NOEL 
ANTÔNIO BARATIERI (OAB 16462/SC), RODRIGO TADEU 
PIMENTA DE OLIVEIRA (OAB 16752/SC)
Processo 0041634-10.2013.8.24.0023 - Ação Penal Militar -
[...]
imprescindível a presença dos réus no ato deprecado.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0043998-52.2013.8.24.0023 - Ação Penal Militar -
[...]
de parcelas para desconto remuneratório. Intimem-se.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0049304-02.2013.8.24.0023 - Ação Penal Militar -
[...]
Rel. Ângela Maria Silveira).
--
ADV: ROBSON LUIZ CERON (OAB 22475/SC)
Processo 0000421-87.2014.8.24.0023 - Ação Penal Militar -
[...]
prazo de 05 (cinco) dias.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0003198-45.2014.8.24.0023 - Ação Penal Militar -
[...]
de 05 (cinco) dias.
--
ADV: ISAEL MARCELINO COELHO (OAB 13878/SC), ROBSON 
LUIZ CERON (OAB 22475/SC)
Processo 0010380-82.2014.8.24.0023 - Ação Penal Militar -
Procedimento Ordinário - Crimes Militares - Autor: Ministério Público
Estadual - Réu: Vilson Diocimar Antunes - HOMOLOGO o pedido 
de desistência. Intime-se a defesa para o que preceitua o artigo 417, 
§2º, do Código de Processo Penal Militar.

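(shortened a bit for obvious reasons)
Splitting at colored headers
In a comment the OP wrote:
One more thing: how can I identify a font size / color change inside a section? I need that for some cases where there is no separator (only a bigger title) (e.g. on page 346, "armazém" should end the section)
As an example, I extended the DividerAwareTextExtrationStrategy above to add the ascender lines of text in a given color to the divider lines already found: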
public class DividerAndColorAwareTextExtractionStrategy extends DividerAwareTextExtrationStrategy
{
    //
    // constructor
    //
    public DividerAndColorAwareTextExtractionStrategy(float topMargin, float bottomMargin, float leftMargin, float rightMargin, BaseColor headerColor)
    {
        super(topMargin, bottomMargin, leftMargin, rightMargin);
        this.headerColor = headerColor;
    }

    //
    // DividerAwareTextExtrationStrategy overrides
    //
    /**
     * As the {@link DividerAwareTextExtrationStrategy#lines} are not
     * properly sorted anymore (the additional lines come after all
     * divider lines of the same column), we have to sort that {@link List}
     * first.
     */
    @Override
    public List<Section> getSections()
    {
        Collections.sort(lines, new Comparator<LineSegment>()
        {
            @Override
            public int compare(LineSegment o1, LineSegment o2)
            {
                Vector start1 = o1.getStartPoint();
                Vector start2 = o2.getStartPoint();

                float v1 = start1.get(Vector.I1), v2 = start2.get(Vector.I1);
                if (Math.abs(v1 - v2) < 2)
                {
                    v1 = start2.get(Vector.I2);
                    v2 = start1.get(Vector.I2);
                }

                return Float.compare(v1, v2);
            }
        });

        return super.getSections();
    }

    /**
     * The ascender lines of text rendered using a fill color approximately
     * like the given header color are added to the divider lines.
     */
    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
        if (approximates(renderInfo.getFillColor(), headerColor))
        {
            lines.add(renderInfo.getAscentLine());
        }

        super.renderText(renderInfo);
    }

    /**
     * This method checks whether two colors are approximately equal. As the
     * sample document only uses CMYK colors, only this comparison has been
     * implemented yet.
     */
    boolean approximates(BaseColor colorA, BaseColor colorB)
    {
        if (colorA == null || colorB == null)
            return colorA == colorB;
        if (colorA instanceof CMYKColor && colorB instanceof CMYKColor)
        {
            CMYKColor cmykA = (CMYKColor) colorA;
            CMYKColor cmykB = (CMYKColor) colorB;
            float c = Math.abs(cmykA.getCyan() - cmykB.getCyan());
            float m = Math.abs(cmykA.getMagenta() - cmykB.getMagenta());
            float y = Math.abs(cmykA.getYellow() - cmykB.getYellow());
            float k = Math.abs(cmykA.getBlack() - cmykB.getBlack());
            return c+m+y+k < 0.01;
        }
        // TODO: Implement comparison for other color types
        return false;
    }

    final BaseColor headerColor;
}

DividerAndColorAwareTextExtractionStrategy.java
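In renderText we recognize text in the headerColor and add its ascender line to the lines list.
Beware: we add the ascender line of every chunk in the given color. Strictly speaking, we should join the ascender lines of all text chunks that form a single header line. As each blue header line in the sample document consists of just one chunk, this is not needed in this sample code. A generic solution would have to be extended accordingly.
As lines is no longer properly sorted (the additional ascender lines come after all divider lines of the same column), we have to sort that list first.
Note that the Comparator used here is not really correct: it ignores certain differences in the x coordinate, which means it is not truly transitive. It only works if the lines of one column have approximately the same starting x coordinate and that coordinate differs clearly from those of the other columns.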
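One way around the transitivity problem is to group the lines into columns first and only then sort within each column. The following is a minimal sketch in plain Java under that idea, not the approach used in the answer's code. ColumnwiseOrder, Line and sort are hypothetical names, and Line is a stand-in that carries only the start point of a LineSegment.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch: column-wise ordering without a tolerance-based comparator.
// All names are hypothetical.
final class ColumnwiseOrder {
    /** A line reduced to its start point. */
    static final class Line {
        final float x, y;
        Line(float x, float y) { this.x = x; this.y = y; }
    }

    /** Orders lines column by column (left to right), top to bottom inside a column. */
    static List<Line> sort(List<Line> lines, float tolerance) {
        List<Line> byX = new ArrayList<>(lines);
        byX.sort(Comparator.comparingDouble((Line l) -> l.x));

        List<Line> result = new ArrayList<>();
        List<Line> column = new ArrayList<>();
        for (Line line : byX) {
            // A jump in x larger than the tolerance starts a new column.
            if (!column.isEmpty() && line.x - column.get(column.size() - 1).x > tolerance) {
                flush(column, result);
            }
            column.add(line);
        }
        flush(column, result);
        return result;
    }

    /** Appends the column to the result, highest y first, and empties it. */
    private static void flush(List<Line> column, List<Line> result) {
        column.sort(Comparator.comparingDouble((Line l) -> l.y).reversed());
        result.addAll(column);
        column.clear();
    }
}
```

Both comparators used here are plain total orders, so the sort contract holds. The grouping still relies on columns being separated by more than the tolerance.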
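In a test run (see DividerAndColorAwareTextExtraction.java, method test20150211600_346) the sections found are also split at the blue headers "armazém" and "balneário camboriú".
Please keep in mind the limitations I pointed out above. If, for example, you want to split at the grey headers in your sample document, you will have to improve the methods above, because those headers do not come as a single chunk.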
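For headers that come as several chunks, the ascender lines of the chunks would have to be joined into one header line before they are added to the dividers. Here is a minimal sketch of such a merge in plain Java. HeaderLineMerger, Seg and merge are hypothetical names, and the sketch assumes the chunks of one header arrive from left to right, which a real content stream does not guarantee.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: join the ascender lines of neighbouring chunks into one line.
// All names are hypothetical.
final class HeaderLineMerger {
    /** A horizontal segment from left to right at height y. */
    static final class Seg {
        final float left, right, y;
        Seg(float left, float right, float y) { this.left = left; this.right = right; this.y = y; }
    }

    /**
     * Merges consecutive segments whose heights differ by at most yTolerance
     * and whose horizontal gap is at most maxGap.
     */
    static List<Seg> merge(List<Seg> segs, float yTolerance, float maxGap) {
        List<Seg> result = new ArrayList<>();
        Seg current = null;
        for (Seg s : segs) {
            boolean sameLine = current != null
                    && Math.abs(s.y - current.y) <= yTolerance
                    && s.left - current.right <= maxGap;
            if (sameLine) {
                // Extend the current line so that it covers both segments.
                current = new Seg(Math.min(current.left, s.left),
                                  Math.max(current.right, s.right), current.y);
            } else {
                if (current != null) result.add(current);
                current = s;
            }
        }
        if (current != null) result.add(current);
        return result;
    }
}
```

The merged segments could then take the place of the per-chunk ascender lines. Suitable values for the two tolerances depend on the font size and spacing of the headers in the document.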

Regarding "c# - Extract text from a PDF between two dividers with ITextSharp", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31730278/