What is pdf2xml?
Output file format
Licensing and distribution
pdf2xml is an Open Source project based on the Xpdf 3.01 project. It is a command-line utility that converts Portable Document Format (PDF) files into XML files. pdf2xml was compiled and tested on Win32, but it should compile and run on many other POSIX-compliant operating systems (UNIX, VMS, OS/2, Linux, MacOS X...).
The main goal of pdf2xml is to convert PDF files into files that are a lot easier to manipulate. To do this, pdf2xml:
<text> tag, so that separate small text elements (usually a few characters at a time) are grouped into complete lines of text.
What pdf2xml DOES NOT do:
Download the zipped sources. Unzip all files.
You probably want to define the two macros
/D compiler options to minimize the code generated by libpng.
You have to add the following include paths using
/I compiler options:
Once that is set up, simply compile and link all .cpp files.
On the command line, type: pdf2xml file
If the file is of the form
my_file.pdf, then a file named
my_file.xml is created.
The XML file and the image files are created in the current directory, regardless of the directory
that contains the PDF file.
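The naming rule above can be sketched in a few lines of Python. This is only an illustration, not part of pdf2xml: the helper names `expected_xml_path` and `convert` are hypothetical, the script assumes a `pdf2xml` executable is on the PATH, and it assumes the input name ends in `.pdf` (the behavior for other file names is not described here).

```python
import subprocess
from pathlib import Path


def expected_xml_path(pdf_path, cwd=None):
    """Return the path where the XML output is expected to appear.

    Per the description above, my_file.pdf yields my_file.xml in the
    current directory, wherever the PDF itself is stored.
    """
    base = Path(cwd) if cwd is not None else Path.cwd()
    return base / (Path(pdf_path).stem + ".xml")


def convert(pdf_path, out_dir="."):
    """Run pdf2xml with out_dir as the current directory (hypothetical wrapper).

    The PDF path is made absolute first, because changing the working
    directory would otherwise break a relative input path.
    """
    pdf = Path(pdf_path).resolve()
    subprocess.run(["pdf2xml", str(pdf)], cwd=out_dir, check=True)
    return expected_xml_path(pdf, cwd=out_dir)
```

Running the tool from a chosen working directory is one way to control where the XML and image files end up, since the output location follows the current directory rather than the location of the PDF.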
Image files are named
XML files produced by pdf2xml have a very simple format:
<?xml version="1.0" encoding="utf-8" ?>
pages attribute that contains the number of pages.
<title> with the title in the PDF, if the PDF file has a title.
<page> nodes which have
<font> tags can appear in
<font> tags should always have a
face attribute and a size attribute. The
color attribute is a 6-digit hexadecimal value and is present only if the color is not black.
bold attributes are present only if their value is
<text> tags appear in
<font> tags. They always have
height attributes. The text is contained in the tag itself.
<img> tags can appear in either
<font> tags or directly in
<link> tags can appear in either
<font> tags or directly in
<img> tags always have
height attributes. They either have
dest_y attributes for internal links or an
href attribute that is a URL.
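As an illustration of how such a file might be consumed, here is a minimal Python sketch that walks the <page>, <font> and <text> tags and collects each line of text together with its font. Several things in it are assumptions rather than facts about pdf2xml: the root element name `document` in the sample is made up, the sample lists only attributes named in the description above (real files carry more), the color is assumed to be written without a leading `#`, and black is represented as `000000` when the color attribute is absent.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. The root element name "document" and the attribute
# values are invented for illustration; only the tag and attribute names
# described above (page, font, text, face, size, color, height) are used.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8" ?>
<document pages="1">
  <title>Sample</title>
  <page>
    <font face="Times" size="12">
      <text height="12">Hello world</text>
      <text height="12">Second line</text>
    </font>
    <font face="Helvetica" size="10" color="ff0000">
      <text height="10">Red text</text>
    </font>
  </page>
</document>
"""


def text_runs(root):
    """Yield (page_number, face, size, color, text) for every <text> tag.

    The color attribute is present only when the color is not black, so a
    missing attribute is reported as "000000" (an assumed representation).
    """
    for page_number, page in enumerate(root.iter("page"), start=1):
        for font in page.iter("font"):
            color = font.get("color", "000000")
            # <text> tags appear directly inside <font> tags.
            for text in font.findall("text"):
                content = "".join(text.itertext())
                yield (page_number, font.get("face"), font.get("size"),
                       color, content)


if __name__ == "__main__":
    for run in text_runs(ET.fromstring(SAMPLE)):
        print(run)
```

Because the text is already grouped into complete lines, a consumer like this can rebuild a page line by line without having to merge character fragments itself.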
pdf2xml is licensed under the
GNU General Public License (GPL), version 2
as is the Xpdf project.
pdf2xml is Copyright © 2005 Mobipocket.com. As stated in the GPL:
The pdf2xml program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
pdf2xml uses the open source project xpdf,
xpdf is licensed under the GNU General Public License (GPL)
Copyright © 1996-2005 Glyph & Cog, LLC.
pdf2xml uses the open source project libpng
Copyright © 1998-2004 Glenn Randers-Pehrson
Copyright © 1996-1997 Andreas Dilger
Copyright © 1995-1996 Guy Eric Schalnat, Group 42, Inc.
libpng uses the open source project zlib
Copyright © 1995-2003 Jean-loup Gailly and Mark Adler
PDF is a registered trademark of Adobe Systems, Inc.
© Copyright 2000-2007 Mobipocket.com