PDF2XML

Contents
What is pdf2xml?
Usage
Compiling pdf2xml
Running pdf2xml
Output file format
Licensing and distribution
Download sources
What is pdf2xml?
pdf2xml is an Open Source project based on the Xpdf 3.01 project. It is a command-line utility that converts Portable Document Format (PDF) files into XML files. pdf2xml was compiled and tested on Win32, but it should compile and run on many other POSIX-compliant OSes (UNIX, VMS, OS/2, Linux, Mac OS X...).
The main goal of pdf2xml is to convert PDF files into files that are much easier to manipulate. To do this, pdf2xml:
[...] <text> tag, so that separate small text elements (usually a few characters at a time) are grouped into complete lines of text.
[...] <font> tag.
What pdf2xml DOES NOT do:
Usage
Compiling pdf2xml
Download the zipped sources and unzip all files.
You probably want to define the two macros PNG_NO_READ_SUPPORTED and PNG_NO_MNG_FEATURES
using the -D or /D compiler options to minimize the code generated by libpng.
You have to add the following include paths using -I or /I compiler options:
./xpdf
./xpdf/fofi
./xpdf/goo
./xpdf/xpdf
./image/zlib
./image/png
Once that is set up, simply compile and link all cpp files.
You may also want to download the latest versions of xpdf
and libpng and compile pdf2xml.cpp against them.
Running pdf2xml
On the command line, type:
pdf2xml file
If the file is of the form my_file.pdf, then a file named my_file.xml is created.
The XML file and the image files are created in the current directory, regardless of the directory
that contains the PDF file.
Image files are named my_file_pic0001.png, my_file_pic0002.jpg, etc.
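The naming rules above can be sketched in a few lines of Python. This is only an illustration of the rules as stated here, not code taken from pdf2xml: the helper names expected_xml_path and image_name are hypothetical, and the four-digit counter starting at 1 is an assumption inferred from the two sample names.

```python
from pathlib import Path


def expected_xml_path(pdf_path, cwd="."):
    """Return where pdf2xml is expected to write the XML for pdf_path."""
    # Only the base name of the PDF is kept: the output goes to the
    # current directory, not to the directory that contains the PDF.
    return Path(cwd) / (Path(pdf_path).stem + ".xml")


def image_name(pdf_path, index, extension):
    """Build an image file name such as my_file_pic0001.png."""
    # Assumption: a zero-padded, four-digit counter, as in the samples.
    return f"{Path(pdf_path).stem}_pic{index:04d}.{extension}"


if __name__ == "__main__":
    print(expected_xml_path("docs/my_file.pdf"))
    print(image_name("docs/my_file.pdf", 1, "png"))
```

The extension is passed in explicitly because the text only shows that both .png and .jpg files can be produced, not how the format is chosen for a given image.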
Output file format
XML files produced by pdf2xml have a very simple format:
<?xml version="1.0" encoding="utf-8" ?>
<pdf2xml> with a pages attribute that contains the number of pages.
<title> with the title in the PDF, if the PDF file has a title.
<page> nodes which have width and height attributes.
<font> tags can appear in <page> tags. <font> tags should always have a face attribute and a size attribute. The color attribute is a 6-digit hexadecimal value and is present only if the color is not black. The italic and bold attributes are present only if their value is true.
<text> tags appear in <font> tags. They always have x, y, width and height attributes. The text is contained in the tag itself.
<img> tags can appear either in <font> tags or directly in <page> tags. <img> tags have x, y, width, height and src attributes.
<link> tags can appear either in <font> tags or directly in <page> tags. <link> tags always have x, y, width and height attributes. They have either dest_page, dest_x and dest_y attributes for internal links, or an href attribute that is a URL.

Licensing and distribution
pdf2xml is licensed under the
GNU General Public License (GPL), version 2,
as is the Xpdf project.
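Returning briefly to the output file format described earlier: the sketch below shows one way such a file might be read with Python's standard xml.etree.ElementTree module. The sample document is hand-written from the description above and is not real pdf2xml output. All values in it are invented, the helper names text_lines and link_targets are hypothetical, and the attribute name dest_page is an assumption (by analogy with dest_x and dest_y).

```python
import xml.etree.ElementTree as ET

# Hand-written sample that follows the described format.
# It is NOT real pdf2xml output; every value is invented.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8" ?>
<pdf2xml pages="1">
  <title>Example</title>
  <page width="612" height="792">
    <font face="Times" size="12">
      <text x="72" y="90" width="200" height="14">Hello world</text>
      <link x="72" y="110" width="80" height="14" href="http://www.example.com/"/>
    </font>
    <img x="72" y="200" width="100" height="50" src="my_file_pic0001.png"/>
    <link x="72" y="300" width="80" height="14" dest_page="1" dest_x="0" dest_y="0"/>
  </page>
</pdf2xml>
"""


def text_lines(root):
    """Collect (page number, font face, text) for every <text> tag."""
    lines = []
    for page_no, page in enumerate(root.iter("page"), start=1):
        for font in page.iter("font"):
            # <text> tags are direct children of <font> tags.
            for text in font.findall("text"):
                lines.append((page_no, font.get("face"), text.text or ""))
    return lines


def link_targets(root):
    """Classify each <link> as external (href) or internal (dest_*)."""
    links = []
    for link in root.iter("link"):
        href = link.get("href")
        if href is not None:
            links.append(("external", href))
        else:
            dest = (link.get("dest_page"), link.get("dest_x"), link.get("dest_y"))
            links.append(("internal", dest))
    return links


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(root.get("pages"), root.findtext("title"))
    print(text_lines(root))
    print(link_targets(root))
```

Attribute values are left as strings because the description does not say whether coordinates are integers or decimals. Convert them only once that is confirmed against a real output file.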
pdf2xml is Copyright © 2005 Mobipocket.com. As stated in the GPL:
pdf2xml program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
pdf2xml uses the open source project xpdf.
xpdf is licensed under the GNU General Public License (GPL).
Copyright © 1996-2005 Glyph & Cog, LLC.
derekn@foolabs.com
http://www.foolabs.com/xpdf/
pdf2xml uses the open source project libpng.
Copyright © 1998-2004 Glenn Randers-Pehrson
Copyright © 1996-1997 Andreas Dilger
Copyright © 1995-1996 Guy Eric Schalnat, Group 42, Inc.
glennrp@users.sourceforge.net
http://www.libpng.org/
libpng uses the open source project zlib.
Copyright © 1995-2003 Jean-loup Gailly and Mark Adler
jloup@gzip.org
madler@alumni.caltech.edu
http://www.zlib.org/
PDF is a registered trademark of Adobe Systems, Inc.
© Copyright 2000-2007 Mobipocket.com