Parse PDF Documents

GcPdf lets you parse PDF documents by recognizing their logical text and document structure. Content elements such as plain text, tables, paragraphs, and the elements of tagged PDF documents can be extracted using the GcPdf API, as explained below.

Extract Text

To extract text from a PDF:

Load a PDF document using the Load method of the GcPdfDocument class.
Extract text from the last page of the PDF using the GetText method of the Page class.
Add the extracted text to another PDF document using the Graphics.DrawString method.
Save the document using the Save method of the GcPdfDocument class.

GcPdfDocument doc = new GcPdfDocument();
string text;

// Keep the source stream open while reading from the loaded document
using (FileStream fs = new FileStream("GcPdf.pdf", FileMode.Open, FileAccess.Read))
{
    doc.Load(fs);

    // Extract the text present on the last page
    text = doc.Pages.Last.GetText();
}

// Add the extracted text to a new PDF
GcPdfDocument doc1 = new GcPdfDocument();
PointF textPt = new PointF(72, 72);
doc1.NewPage().Graphics.DrawString(text,
    new TextFormat() { FontName = "ARIAL", FontItalic = true }, textPt);

doc1.Save("NewDocument.pdf");

Console.WriteLine("Press any key to exit");
Console.ReadKey();

Similarly, you can extract all the text from a document by using the GetText method of the GcPdfDocument class.
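As a minimal sketch of the whole-document variant, the snippet below loads a file and prints the text of all its pages. It assumes the same GcPdf.pdf sample file as above and that the GcPdf types live in the GrapeCity.Documents.Pdf namespace; the class name ExtractAllText is only illustrative.

```csharp
using System;
using System.IO;
using GrapeCity.Documents.Pdf; // assumed namespace of the GcPdf package

class ExtractAllText
{
    static void Main()
    {
        // "GcPdf.pdf" is the sample file name used above; replace it with your own PDF
        using (var fs = new FileStream("GcPdf.pdf", FileMode.Open, FileAccess.Read))
        {
            var doc = new GcPdfDocument();
            doc.Load(fs);

            // GetText on the document (rather than on a single page) returns the text of the whole document
            string allText = doc.GetText();
            Console.WriteLine(allText);
        }
    }
}
```

The document is read inside the using block so that the source stream is still open when the text is extracted.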

Extract Text using ITextMap

GcPdf provides the ITextMap interface, which represents the text map of a page in a GcPdf document. It helps you find the geometric positions of the text lines on a page and extract the text at a specific position.

The text map for a specific page in the document can be retrieved using the GetTextMap method of the Page class, which returns an object of type ITextMap. ITextMap provides four overloads of the GetFragment method, which retrieve a text range and the text within that range. The text range is represented by the TextMapFragment class, and each line of text in the range is represented by the TextLineFragment class.

The example code below uses the GetFragment(out TextMapFragment range, out string text) overload to retrieve the geometric positions of all the text lines on a page and the GetFragment(MapPos startPos, MapPos endPos, out TextMapFragment range, out string text) overload to retrieve the text from a specific position in the page.

// Open an arbitrary PDF, load it into a temp document and use the map to find some texts.
// 'tl' is a TextLayout created beforehand that collects the output lines.
using (var fs = new FileStream("Test.pdf", FileMode.Open, FileAccess.Read))
{
    var doc1 = new GcPdfDocument();
    doc1.Load(fs);
    var tmap = doc1.Pages[0].GetTextMap();

    // We retrieve the text at a specific (known to us) geometric location on the page:
    float tx0 = 2.1f, ty0 = 3.37f, tx1 = 3.1f, ty1 = 3.5f;
    HitTestInfo htiFrom = tmap.HitTest(tx0 * 72, ty0 * 72);
    HitTestInfo htiTo = tmap.HitTest(tx1 * 72, ty1 * 72);
    tmap.GetFragment(htiFrom.Pos, htiTo.Pos, out TextMapFragment range1, out string text1);
    tl.AppendLine($"Looked for text inside rectangle x={tx0:F2}\", y={ty0:F2}\", " +
        $"width={tx1 - tx0:F2}\", height={ty1 - ty0:F2}\", found:");
    tl.AppendLine(text1);
    tl.AppendLine();

    // Get all text fragments and their locations on the page:
    tl.AppendLine("List of all texts found on the page");
    tmap.GetFragment(out TextMapFragment range, out string text);
    foreach (TextLineFragment tlf in range)
    {
        var coords = tmap.GetCoords(tlf);
        tl.Append($"Text at ({coords.B.X / 72:F2}\",{coords.B.Y / 72:F2}\"):\t");
        tl.AppendLine(tmap.GetText(tlf));
    }
    // Print the results:
    tl.PerformLayout(true);
}

Extract Text Paragraphs

GcPdf allows you to extract text paragraphs from a PDF document by using the Paragraphs property of the ITextMap interface. It returns a collection of ITextParagraph objects associated with the text map.

PDF documents sometimes contain repeated text (for example, the same text overlaid on itself to make it appear bold). GcPdf extracts such text without returning the redundant lines. Tables with multi-line text in cells are also correctly recognized as text paragraphs.

The example code below shows how to extract all text paragraphs of a PDF document:

GcPdfDocument doc = new GcPdfDocument();
var page = doc.NewPage();
var tl = page.Graphics.CreateTextLayout();
tl.MaxWidth = doc.PageSize.Width;
tl.MaxHeight = doc.PageSize.Height;

//Text split options for widow/orphan control
TextSplitOptions to = new TextSplitOptions(tl)
{
    MinLinesInFirstParagraph = 2,
    MinLinesInLastParagraph = 2,
};

//Open a PDF, load it into a temp document and get all page texts
using (var fs=new FileStream("Wetlands.pdf", FileMode.Open, FileAccess.Read))
{
    var doc1 = new GcPdfDocument();
    doc1.Load(fs);

    for (int i = 0; i < doc1.Pages.Count; ++i)
    {
        tl.AppendLine(string.Format("Paragraphs from page {0} of the original PDF:", i + 1));

        var pg = doc1.Pages[i];
        var pars = pg.GetTextMap().Paragraphs;
        foreach (var par in pars)
        {
            tl.AppendLine(par.GetText());
        }
    }

    tl.PerformLayout(true);
    while (true)
    {
        //'rest' will accept the text that did not fit
        var splitResult = tl.Split(to, out TextLayout rest);
        doc.Pages.Last.Graphics.DrawTextLayout(tl, PointF.Empty);
        if (splitResult != SplitResult.Split)
            break;
        tl = rest;
        doc.NewPage();
    }
    //Append the original document for reference
    doc.MergeWithDocument(doc1, new MergeDocumentOptions());
}
//Save the document ('stream' is the output stream supplied by the enclosing method)
doc.Save(stream);
return doc.Pages.Count;

Limitations

The structure elements of a PDF are not taken into account.
The order of paragraphs can sometimes be wrong, especially in complex layouts such as nested tables.
The text paragraphs found by GetTextMap() cannot span pages, so a page break always ends the current paragraph even if it logically continues on the next page.
Graphics elements, particularly table borders, are not considered, so text in a table layout may sometimes be parsed incorrectly.
In some situations, the paragraphs found by GcPdf may not correspond to the logical paragraphs that a human reader would recognize.
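Because paragraphs never span pages, a caller may want to stitch them back together after extraction. The sketch below is one possible heuristic and is not part of GcPdf: the ParagraphJoiner class and its method names are hypothetical, and the rule it applies (join the first paragraph of a page to the previous one when that one does not end with sentence punctuation) is an assumption that will misfire on headings, lists, and similar content.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the GcPdf API.
static class ParagraphJoiner
{
    // 'pages' holds, for each page in order, the paragraph texts extracted from that page.
    public static List<string> JoinAcrossPages(IEnumerable<IEnumerable<string>> pages)
    {
        var result = new List<string>();
        foreach (var page in pages)
        {
            bool firstOnPage = true;
            foreach (var par in page)
            {
                string text = par.Trim();
                if (text.Length == 0)
                    continue; // skip empty paragraphs

                // Only the first paragraph of a page can continue the previous page's last paragraph
                if (firstOnPage && result.Count > 0 && !EndsSentence(result[result.Count - 1]))
                    result[result.Count - 1] += " " + text;
                else
                    result.Add(text);
                firstOnPage = false;
            }
        }
        return result;
    }

    // Assumed rule: a paragraph is complete if it ends with '.', '!', '?' or ':'
    static bool EndsSentence(string s)
    {
        char last = s[s.Length - 1];
        return last == '.' || last == '!' || last == '?' || last == ':';
    }
}
```

With GcPdf, the per-page lists could be filled from page.GetTextMap().Paragraphs by calling GetText() on each paragraph, as in the sample above.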

Extract Data from Tables

GcPdf allows you to extract data from tables in PDF documents. The GetTable method of the Page class extracts data from the area specified as a table. The method takes the table area as a parameter, parses that area, and returns the rows, columns, and cells along with their textual content. You can also pass TableExtractOptions as a parameter to specify table formatting options such as column width, row height, and the distance between rows or columns.

The example code below shows how to extract data from a table in a PDF document:

const float DPI = 72;
const float margin = 36;
var doc = new GcPdfDocument();
var tf = new TextFormat()
{
    Font = Font.FromFile(Path.Combine("segoeui.ttf")),
    FontSize = 9,
    ForeColor = Color.Black
};

var tfRed = new TextFormat(tf) { ForeColor = Color.Red };
using (var fs = File.OpenRead(Path.Combine("zugferd-invoice.pdf")))
{
    // The approx table bounds:
    var tableBounds = new RectangleF(0, 3 * DPI, 8.5f * DPI, 3.75f * DPI);

    var page = doc.NewPage();
    page.Landscape = true;
    var g = page.Graphics;

    var tl = g.CreateTextLayout();
    tl.MaxWidth = page.Bounds.Width;
    tl.MaxHeight = page.Bounds.Height;
    tl.MarginAll = margin;
    tl.DefaultTabStops = 150;
    tl.LineSpacingScaleFactor = 1.2f;

    var docSrc = new GcPdfDocument();
    docSrc.Load(fs);

    var itable = docSrc.Pages[0].GetTable(tableBounds);

    if (itable == null)
    {
        tl.AppendLine($"No table was found at the specified coordinates.", tfRed);
    }
    else
    {
        tl.Append($"\nThe table has {itable.Cols.Count} column(s) and {itable.Rows.Count} row(s), table data is:", tf);
        tl.AppendParagraphBreak();
        for (int row = 0; row < itable.Rows.Count; ++row)
        {
            var tfmt = tf; // the same text format is used for header and data rows
            for (int col = 0; col < itable.Cols.Count; ++col)
            {
                var cell = itable.GetCell(row, col);
                if (col > 0)
                    tl.Append("\t", tfmt);
                if (cell == null)
                    tl.Append("<no cell>", tfRed);
                else
                    tl.Append(cell.Text, tfmt);
            }
            tl.AppendLine();
        }
    }
    TextSplitOptions to = new TextSplitOptions(tl) { RestMarginTop = margin, MinLinesInFirstParagraph = 2, MinLinesInLastParagraph = 2 };
    tl.PerformLayout(true);
    while (true)
    {
        var splitResult = tl.Split(to, out TextLayout rest);
        doc.Pages.Last.Graphics.DrawTextLayout(tl, PointF.Empty);
        if (splitResult != SplitResult.Split)
            break;
        tl = rest;
        doc.NewPage().Landscape = true;
    }
    // Append the original document for reference
    doc.MergeWithDocument(docSrc);
    doc.Save(stream);
}

Note: The font files used in the above sample can be downloaded from the Get Table Data demo.

Limitation

GcPdf cannot locate tables in a PDF document automatically. The table area must be specified by the caller.
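Since the table area has to be supplied by hand, it can help to measure it in inches and convert to the point units used by the sample above (72 points per inch). The following is a small sketch of such a conversion; the TableBounds class and its FromInches method are hypothetical helpers, not part of GcPdf.

```csharp
using System;
using System.Drawing;

// Hypothetical helper, not part of the GcPdf API.
static class TableBounds
{
    const float PointsPerInch = 72f;

    // Converts a rectangle measured in inches into a RectangleF in points,
    // the unit used for tableBounds in the sample above.
    public static RectangleF FromInches(float left, float top, float width, float height)
    {
        return new RectangleF(
            left * PointsPerInch,
            top * PointsPerInch,
            width * PointsPerInch,
            height * PointsPerInch);
    }
}
```

For example, TableBounds.FromInches(0, 3, 8.5f, 3.75f) produces the same rectangle as the tableBounds value in the sample, and the result can be passed to GetTable in the same way.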

Extract Content from Tagged PDF

GcPdf can recognize the logical structure of the source document from which a PDF document was generated. This structure recognition is then used to extract content elements from tagged PDF documents.

Following the PDF specification, GcPdf exposes the logical structure through the LogicalStructure class. It represents the parsed logical structure of a PDF document, built from the tags in the PDF structure tree. The StructElement property of the Element class gives access to the element type (StructElement.Type), such as TR for a table row, H for a heading, and P for a paragraph.

The example code below shows how to extract headings, tables and TOC elements from a tagged PDF document:

static void ShowTable(Element e)
{
    List<List<IList<ITextParagraph>>> table = new List<List<IList<ITextParagraph>>>();

    // select all nested rows, elements with type TR
    void SelectRows(IList<Element> elements)
    {
        foreach (Element ec in elements)
        {
            if (ec.HasChildren)
            {
                if (ec.StructElement.Type == "TR")
                {
                    var cells = ec.Children.FindAll((e_) => e_.StructElement.Type == "TD").ToArray();
                    List<IList<ITextParagraph>> tableCells = new List<IList<ITextParagraph>>();
                    foreach (var cell in cells)
                        tableCells.Add(cell.GetParagraphs());
                    table.Add(tableCells);
                }
                else
                    SelectRows(ec.Children);
            }
        }
    }
    SelectRows(e.Children);

    // show table
    // Max throws on an empty sequence, so guard against tables without rows
    int colCount = table.Count == 0 ? 0 : table.Max((r_) => r_.Count);
    Console.WriteLine();
    Console.WriteLine();
    Console.WriteLine($"Table: {table.Count}x{colCount}");
    Console.WriteLine($"------");
    foreach (var r in table)
    {
        foreach (var c in r)
        {
            var s = c == null || c.Count <= 0 ? string.Empty : c[0].GetText();
            Console.Write(s);
            Console.Write("\t");
        }
        Console.WriteLine();
    }
}

static void Main(string[] args)
{
    
    GcPdfDocument doc = new GcPdfDocument();

    using (var s = new FileStream("C1Olap QuickStart.pdf", FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        doc.Load(s);

        // get the LogicalStructure and top parent element
        LogicalStructure ls = doc.GetLogicalStructure();
        Element root = ls.Elements[0];

        // select all headings
        Console.WriteLine("TOC:");
        Console.WriteLine("----");
        // iterate over elements and select all heading elements
        foreach (Element e in root.Children)
        {
            string type = e.StructElement.Type;
            if (string.IsNullOrEmpty(type) || !type.StartsWith("H"))
                continue;
            int headingLevel;
            if (!int.TryParse(type.Substring(1), out headingLevel))
                continue;
            // get the element text
            string text = e.GetText();
            if (string.IsNullOrEmpty(text))
                text = "H" + headingLevel.ToString();
            text = new string(' ', (headingLevel - 1) * 2) + text;
            Console.WriteLine(text);
            
        }

        // select all tables
        var tables = root.Children.FindAll((e_) => e_.StructElement.Type == "Table").ToArray();
        foreach (var t in tables)
        {
            ShowTable(t);
        }
    }
}
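The heading test used in the loop above can be factored into a small helper. The sketch below is plain C# with no GcPdf dependency; the HeadingTags class and its TryGetHeadingLevel method are hypothetical names. It is slightly stricter than the sample because it accepts only digits after the leading "H".

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of the GcPdf API.
static class HeadingTags
{
    // Returns true and the numeric level for types such as "H1" or "H2".
    // Returns false for a bare "H", for other tags such as "P" or "TR", and for null or empty input.
    public static bool TryGetHeadingLevel(string type, out int level)
    {
        level = 0;
        if (string.IsNullOrEmpty(type) || !type.StartsWith("H", StringComparison.Ordinal))
            return false;

        // NumberStyles.None allows digits only (no sign, no whitespace)
        return int.TryParse(type.Substring(1), NumberStyles.None,
            CultureInfo.InvariantCulture, out level);
    }
}
```

In the sample, the check could then read: if (!HeadingTags.TryGetHeadingLevel(e.StructElement.Type, out int headingLevel)) continue;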

The example code below shows how to extract all paragraphs from a PDF document and save them to a Word document:

// restore word document from pdf
GcPdfDocument doc = new GcPdfDocument();
using (var s = new FileStream("CharacterFormatting.pdf", FileMode.Open, FileAccess.Read, FileShare.Read))
{
    doc.Load(s);

    // get the LogicalStructure and top parent element
    LogicalStructure ls = doc.GetLogicalStructure();
    Element root = ls.Elements[0];

    GcWordDocument wdoc = new GcWordDocument();

    // iterate over elements and select all paragraphs
    foreach (Element e in root.Children)
    {
        if (e.StructElement.Type != "P")
            continue;
        var tps = e.GetParagraphs();
        if (tps == null)
            continue;

        foreach (var tp in tps)
        {
            // build a Word paragraph from a ITextParagraph
            Paragraph p = wdoc.Body.Paragraphs.Add();
            foreach (var tr in tp.Runs)
            {
                var range = p.GetRange();
                var run = range.Runs.Add(tr.GetText());
                run.Font.Size = tr.Attrs.FontSize;
                if (tr.Attrs.NonstrokeColor.HasValue)
                    run.Font.Color.RGB = tr.Attrs.NonstrokeColor.Value;

                tr.Attrs.Font.GetFontAttributes(out string fontFamily,
                    out FontWeight? fontWeight,
                    out FontStretch? fontStretch,
                    out bool? fontItalic);
                if (!string.IsNullOrEmpty(fontFamily))
                    run.Font.Name = fontFamily;
                if (fontWeight.HasValue)
                    run.Font.Bold = fontWeight.Value >= FontWeight.Bold;
                if (fontItalic.HasValue)
                    run.Font.Italic = fontItalic.Value;
            }
        }
    }
    wdoc.Save("CharacterFormatting.docx");
}

Refer to Tagged PDF to learn how to create tagged PDF files using GcPdf.