What is pdf2xml?
  Compiling pdf2xml
  Running pdf2xml
  Output file format
Licensing and distribution
Download sources

What is pdf2xml?

pdf2xml is an Open Source project based on the Xpdf 3.01 project. It is a command line utility that converts Portable Document Format (PDF) files into XML files. pdf2xml was compiled and tested on Win32 but it should compile and run on many other POSIX compliant OSes (UNIX, VMS, OS/2, Linux, MacOS X...).

The main goal of pdf2xml is to convert PDF files into files that are a lot easier to manipulate. To do this, pdf2xml:

What pdf2xml DOES NOT do:


Compiling pdf2xml

Download the zipped sources. Unzip all files.

You probably want to define the two macros PNG_NO_READ_SUPPORTED and PNG_NO_MNG_FEATURES using the -D or /D compiler option to minimize the code generated by libpng.

You have to add the following include paths using -I or /I compiler options:


Once that is set up, simply compile and link all cpp files.

You may also want to download the latest versions of xpdf and libpng and compile pdf2xml.cpp against them.

Running pdf2xml

On the command line, type:

pdf2xml file

If the file is of the form my_file.pdf, then a file named my_file.xml is created. The XML file and the image files are created in the current directory, regardless of the directory that contains the PDF file.

Image files are named my_file_pic0001.png, my_file_pic0002.jpg, etc.
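
The naming rule above can be sketched in a few lines of Python. This is only an illustration of the convention described here, not code taken from pdf2xml; the helper names xml_name and image_name are hypothetical, and the 1-based, four-digit image counter is inferred from the example names.

```python
from pathlib import PurePosixPath


def xml_name(pdf_path: str) -> str:
    """Name of the XML file written to the current directory (hypothetical helper)."""
    # Only the base name of the PDF is kept; its directory is ignored.
    return PurePosixPath(pdf_path).stem + ".xml"


def image_name(pdf_path: str, index: int, ext: str) -> str:
    """Name of the index-th extracted image (hypothetical helper).

    Assumes a 1-based counter padded to four digits, as in my_file_pic0001.png.
    """
    return f"{PurePosixPath(pdf_path).stem}_pic{index:04d}.{ext}"


if __name__ == "__main__":
    print(xml_name("docs/my_file.pdf"))
    print(image_name("docs/my_file.pdf", 1, "png"))
    print(image_name("docs/my_file.pdf", 2, "jpg"))
```

A script that post-processes the output could use helpers like these to locate the generated files next to the place where pdf2xml was invoked.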

Output file format

XML files produced by pdf2xml have a very simple format:

<?xml version="1.0" encoding="utf-8" ?>
<pdf2xml pages="3">
  <title>My Title</title>
  <page width="780" height="1152">
    <font size="10" face="MHCJMH+FuturaT-Bold" color="#FF0000">
      <text x="324" y="37" width="132" height="10">Friday, September 27, 2002</text>
      <img x="324" y="232" width="277" height="340" src="text_pic0001.png"/>
      <link x="324" y="232" width="277" height="340" dest_page="2" dest_x="141" dest_y="187"/>
    </font>
    <font size="12" face="AGaramond-Regular" italic="true" bold="true">
      <text x="509" y="68" width="121" height="12">This is a test PDF file</text>
      <link x="509" y="68" width="121" height="12" href="www.mobipocket.com"/>
    </font>
  </page>
</pdf2xml>
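
Because the output is plain XML, it can be read with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module to list the text runs of each page together with their font. It assumes that the font, page and pdf2xml elements are properly closed and that font elements are direct children of page; the function name text_runs and the embedded sample are illustrative only.

```python
import xml.etree.ElementTree as ET

# Small sample in the format shown above (closing tags are an assumption).
SAMPLE = """<?xml version="1.0" encoding="utf-8" ?>
<pdf2xml pages="3">
  <title>My Title</title>
  <page width="780" height="1152">
    <font size="10" face="MHCJMH+FuturaT-Bold" color="#FF0000">
      <text x="324" y="37" width="132" height="10">Friday, September 27, 2002</text>
    </font>
    <font size="12" face="AGaramond-Regular" italic="true" bold="true">
      <text x="509" y="68" width="121" height="12">This is a test PDF file</text>
    </font>
  </page>
</pdf2xml>
"""


def text_runs(xml_data: bytes):
    """Return one dict per <text> element: page number, font face, position, content."""
    root = ET.fromstring(xml_data)
    runs = []
    for page_no, page in enumerate(root.findall("page"), start=1):
        for font in page.findall("font"):
            for text in font.findall("text"):
                runs.append({
                    "page": page_no,
                    "face": font.get("face"),
                    "size": int(font.get("size")),
                    "x": int(text.get("x")),
                    "y": int(text.get("y")),
                    "text": text.text or "",
                })
    return runs


if __name__ == "__main__":
    for run in text_runs(SAMPLE.encode("utf-8")):
        print(run["page"], run["face"], run["text"])
```

The same loop can be extended to img and link elements, which carry their position in the same x, y, width and height attributes.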

Licensing and distribution

pdf2xml is licensed under the GNU General Public License (GPL), version 2, as is the Xpdf project.
pdf2xml is Copyright © 2005 Mobipocket.com. As stated in the GPL:

pdf2xml program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

pdf2xml uses the open source project xpdf.
xpdf is licensed under the GNU General Public License (GPL).

Copyright © 1996-2005 Glyph & Cog, LLC.

pdf2xml uses the open source project libpng

Copyright © 1998-2004 Glenn Randers-Pehrson
Copyright © 1996-1997 Andreas Dilger
Copyright © 1995-1996 Guy Eric Schalnat, Group 42, Inc.

libpng uses the open source project zlib

Copyright © 1995-2003 Jean-loup Gailly and Mark Adler

PDF is a registered trademark of Adobe Systems, Inc.

Download sources

Download pdf2xml here

