Export comments from PDF file and import them as spreadsheet or text file

Solution Partner Valued Contributor

Hello Administrators,

 

Have you ever had to export the comments from a PDF file and save them into a Word or Excel file?

I am thinking of doing this as a workflow that selects only the PDFs from the targets and reads them.

The next steps are to read only the comments and save them as a new dataset, which is something new for me.

Does anyone have any ideas or directions, or maybe a past implementation where you did this or something similar?

 

Thank you.

5 REPLIES

Re: Export comments from PDF file and import them as spreadsheet or text file

Solution Partner Honored Contributor
I have not tried extracting comments from a PDF but my favorite tool will likely be able to do it. Look at iText.
https://itextpdf.com/
https://sourceforge.net/projects/itext/

Randy Ellsworth, Teamcenter Architect, Applied CAx, LLC
NX 11 | SW 2016 | Creo 4 | TcUA 11.4
Evaluating: AW 3.4

Re: Export comments from PDF file and import them as spreadsheet or text file

Solution Partner Valued Contributor

Thank you Randy.

 

I found some examples that read PDFs and export some data from them.

I think this tool is exactly what I needed.

I'll keep this thread updated while I study the power of this magician you gave me.

 

 

Re: Export comments from PDF file and import them as spreadsheet or text file

Phenom

Good afternoon, I'm also interested in this topic. Tell me, what tool do you use to create comments in the PDF? Is it the built-in PDF comment feature or a third-party tool?

For commenting on drawings in PDF format in the workflow process, I'm thinking of using the built-in Visualization tool, 2D markup.

IonutBilibou, it would be interesting to see what result you got.

Re: Export comments from PDF file and import them as spreadsheet or text file

Solution Partner Valued Contributor

Hello, 

 

I have managed to create the translator for extracting the comments from the PDFs using iText (thank you, Randy).

Next I have to define it as a custom translator for Dispatcher.

 

using System;
using System.IO;
using iTextSharp.text.pdf;

namespace Itextpdftoexcel
{
    class Program
    {
        static void Main(string[] args)
        {
            PdfReader reader = new PdfReader("pdfname.pdf");

            // Open the output file once. Opening it inside the page loop
            // would overwrite it for every page that has annotations.
            using (StreamWriter file = new StreamWriter(@"E:\location\WriteLines.txt"))
            {
                for (int i = 1; i <= reader.NumberOfPages; i++)
                {
                    // A page without annotations has no /Annots array.
                    PdfArray array = reader.GetPageN(i).GetAsArray(PdfName.ANNOTS);
                    if (array == null) continue;

                    for (int j = 0; j < array.Size; j++)
                    {
                        PdfDictionary annot = array.GetAsDict(j);
                        if (annot == null) continue;

                        PdfString CreationText = annot.GetAsString(PdfName.CONTENTS);
                        PdfString CrName = annot.GetAsString(PdfName.T);
                        PdfString CreationDate = annot.GetAsString(PdfName.CREATIONDATE);

                        // Any of these entries can be missing, so fall back to an empty string.
                        string CText = CreationText == null ? "" : CreationText.ToString();
                        string Cname = CrName == null ? "" : CrName.ToString();
                        string rawDate = CreationDate == null ? "" : CreationDate.ToString();

                        // Keep the leading "D:YYYYMMDD" part of the PDF date string.
                        string CDate = rawDate.Length >= 10 ? rawDate.Substring(0, 10) : rawDate;

                        file.WriteLine(CText + "    " + Cname + "  " + CDate);
                    }
                }
            }
            reader.Close();
        }
    }
}
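
The creation dates read above are PDF date strings of the form D:YYYYMMDDHHmmSS, optionally followed by a time zone offset. As a minimal sketch, assuming only the day is needed, a small helper could turn such a string into dd-MM-yyyy. The class and method names (PdfDates, ToDayMonthYear) are hypothetical and not part of iTextSharp.

```csharp
using System;
using System.Globalization;

static class PdfDates
{
    // Converts a PDF date string such as "D:20180530104553+03'00'" to "dd-MM-yyyy".
    // Returns an empty string when the value is missing or not in the expected form.
    public static string ToDayMonthYear(string pdfDate)
    {
        if (string.IsNullOrEmpty(pdfDate)) return "";

        // Strip the "D:" prefix only when it is present.
        string digits = pdfDate.StartsWith("D:", StringComparison.Ordinal)
            ? pdfDate.Substring(2)
            : pdfDate;
        if (digits.Length < 8) return "";

        DateTime parsed;
        bool ok = DateTime.TryParseExact(
            digits.Substring(0, 8), "yyyyMMdd",
            CultureInfo.InvariantCulture, DateTimeStyles.None, out parsed);
        return ok ? parsed.ToString("dd-MM-yyyy", CultureInfo.InvariantCulture) : "";
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(PdfDates.ToDayMonthYear("D:20180530104553+03'00'")); // prints 30-05-2018
    }
}
```

Parsing with TryParseExact instead of fixed Substring offsets means a malformed or missing date gives an empty field rather than an exception.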

Re: Export comments from PDF file and import them as spreadsheet or text file

Solution Partner Valued Contributor

Hello,

I have managed to achieve my task.

I edited the service.properties file located in Dispatcher Client/Config to add the Prepare (TaskPrep) and Load (DatabaseOperation) entries for the new translator.

 

Translator.SIEMENS.pdfextractcomments.Prepare=com.teamcenter.ets.translator.ugs.basic.TaskPrep
Translator.SIEMENS.pdfextractcomments.Load=com.teamcenter.ets.translator.ugs.basic.DatabaseOperation

Then I created new preferences in Teamcenter to configure this prepare and load step.

 

 

<preference name="SIEMENSpdfextractcomments_PDF_ets_dst_ds_type" type="String" array="false" disabled="false" protectionScope="User" envEnabled="false">
  <preference_description>aa</preference_description>
  <context name="Teamcenter">
    <value>MSWord</value>
  </context>
</preference>
<preference name="SIEMENSpdfextractcomments_PDF_ets_dst_nr_type" type="String" array="false" disabled="false" protectionScope="User" envEnabled="false">
  <preference_description>aaa</preference_description>
  <context name="Teamcenter">
    <value>word</value>
  </context>
</preference>
<preference name="SIEMENSpdfextractcomments_PDF_ets_dst_relation_to_src" type="Logical" array="false" disabled="false" protectionScope="User" envEnabled="false">
  <preference_description>aaa</preference_description>
  <context name="Teamcenter">
    <value>false</value>
  </context>
</preference>
<preference name="SIEMENSpdfextractcomments_PDF_ets_dst_relation_type" type="String" array="false" disabled="false" protectionScope="User" envEnabled="false">
  <preference_description>aaa</preference_description>
  <context name="Teamcenter">
    <value>IMAN_manifestation</value>
  </context>
</preference>
<preference name="SIEMENSpdfextractcomments_PDF_ets_nr_types" type="String" array="false" disabled="false" protectionScope="User" envEnabled="false">
  <preference_description>aa</preference_description>
  <context name="Teamcenter">
    <value>PDF_Reference</value>
  </context>
</preference>
<preference name="SIEMENSpdfextractcomments_ets_ds_types" type="String" array="false" disabled="false" protectionScope="User" envEnabled="false">
  <preference_description>aa</preference_description>
  <context name="Teamcenter">
    <value>PDF</value>
  </context>
</preference>
<preference name="ETS.PRIORITY.SIEMENS.PDFEXTRACTCOMMENTS" type="String" array="false" disabled="false" protectionScope="User" envEnabled="false">
  <preference_description>aa</preference_description>
  <context name="Teamcenter">
    <value>3</value>
  </context>
</preference>
<preference name="ETS.TRANSLATOR_ARGS.SIEMENS.PDFEXTRACTCOMMENTS" type="String" array="true" disabled="false" protectionScope="Site" envEnabled="false" lsd="30-May-2018 10:45:53">
  <preference_description>aaa</preference_description>
</preference>
<preference name="ETS.DATASETTYPES.SIEMENS.PDFEXTRACTCOMMENTS" type="String" array="true" disabled="false" protectionScope="Site" envEnabled="false">
  <preference_description>aaa</preference_description>
  <context name="Teamcenter">
    <value>PDF</value>
  </context>
</preference>
<preference name="ETS.TRANSLATORS.SIEMENS" type="String" array="true" disabled="false" protectionScope="User" envEnabled="false">
  <preference_description>This preference lists the available translators for a given provider for the translation services.</preference_description>
  <context name="Teamcenter">
    <value>pdfextractcomments</value>
  </context>
</preference>



The executable is this new code, which can also take its arguments from the Dispatcher server.

 

 

using System;
using System.IO;
using iTextSharp.text.pdf;

namespace Itextpdftoexcel
{
    class Program
    {
        static void Main(string[] args)
        {
            // Dispatcher passes two arguments: the input file path and the output directory.
            if (args.Length < 2)
            {
                Console.Error.WriteLine("Usage: Itextpdftoexcel.exe <inputpath> <outputdir>");
                Environment.ExitCode = 1;
                return;
            }
            string _input = args[0];
            string _output = args[1];

            // Name the result after the input PDF instead of relying on a fixed path length.
            string realfilname = Path.GetFileNameWithoutExtension(_input);

            bool header = true;
            PdfReader reader = new PdfReader(_input);

            using (StreamWriter file = new StreamWriter(Path.Combine(_output, realfilname + ".doc")))
            {
                for (int i = 1; i <= reader.NumberOfPages; i++)
                {
                    // A page without annotations has no /Annots array.
                    PdfArray array = reader.GetPageN(i).GetAsArray(PdfName.ANNOTS);
                    if (array == null) continue;

                    for (int j = 0; j < array.Size; j++)
                    {
                        PdfDictionary annot = array.GetAsDict(j);
                        if (annot == null) continue;

                        PdfString CreationText = annot.GetAsString(PdfName.CONTENTS);
                        PdfString CreationName = annot.GetAsString(PdfName.T);
                        PdfString CreationDate = annot.GetAsString(PdfName.CREATIONDATE);

                        // Any of these entries can be missing, so fall back to an empty string.
                        string CText = CreationText == null ? "" : CreationText.ToString();
                        string Cname = CreationName == null ? "" : CreationName.ToString();
                        string rawDate = CreationDate == null ? "" : CreationDate.ToString();

                        // PDF dates look like "D:YYYYMMDDHHmmSS"; rebuild the day as DD-MM-YYYY.
                        string CRealDate = "";
                        if (rawDate.Length >= 10)
                        {
                            string cAn = rawDate.Substring(2, 4);   // year
                            string cLuna = rawDate.Substring(6, 2); // month
                            string cZi = rawDate.Substring(8, 2);   // day
                            CRealDate = cZi + "-" + cLuna + "-" + cAn;
                        }

                        if (header)
                        {
                            // Header row: page, comment, author, date.
                            file.WriteLine("Pagina, Comentariul, Autorul, Data");
                            header = false;
                        }
                        file.WriteLine(i + "," + CText + "," + Cname + "," + CRealDate);
                    }
                }
            }
            reader.Close();
        }
    }
}
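
One thing to watch in the loop above: the fields are joined with plain commas, so a comment that itself contains a comma, a quote, or a line break would shift the columns when the result is opened as a spreadsheet. A minimal sketch of the usual CSV quoting rule follows, assuming the output is meant to be read as comma-separated values. The Csv class and its Escape and Row methods are hypothetical helpers, not part of iTextSharp or Dispatcher.

```csharp
using System;

static class Csv
{
    // Quotes a field when it contains a comma, a double quote, or a line break,
    // and doubles any embedded double quotes.
    public static string Escape(string field)
    {
        if (field == null) return "";
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        if (!needsQuotes) return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Builds one comma-separated row from the given fields.
    public static string Row(params string[] fields)
    {
        return string.Join(",", Array.ConvertAll<string, string>(fields, Escape));
    }
}

class Program
{
    static void Main()
    {
        // prints 1,"Check radius, then hole",ionut,30-05-2018
        Console.WriteLine(Csv.Row("1", "Check radius, then hole", "ionut", "30-05-2018"));
    }
}
```

With a helper like this, the WriteLine call in the loop could pass the page number, comment text, author, and date to Csv.Row instead of concatenating them by hand.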

The next step is to edit the translator.xml located in the Module folder and define a new translator with only two arguments, since that is what my code expects.

<pdfextractcomments provider="SIEMENS" service="pdfextractcomments" isactive="true">
    <TransExecutable dir="&MODULEBASE;/Translators/pdfextractcomments" name="Itextpdftoexcel.exe"/>
    <Options>
        <Option name="inputpath" string="" description="Full path to the input file which will contain user input for complete dispatcher service."/>
        <Option name="outputdir" string="" description="Full path to the output directory."/>
    </Options>
</pdfextractcomments>

When I tested it in the Teamcenter rich client, it worked: it extracted the contents of all the comment boxes, with author and creation date, into a new MSWord dataset in Teamcenter.

This could be modified and used for different purposes. For example, instead of sending the data to a new dataset, it could be sent to a collector database.

In case something is unclear, feel free to contact me.

 

Have fun, fellow admins.