- The Zoef approach to PDF-

Norbert Ligterink, Department of Physics and Astronomy, University of Pittsburgh, USA, cr:17 Feb. 2003, ch:27 Feb. 2003

The Zoef approach to PDF is not a manual, or tutorial, or accurate, or anything honorable. It is my ranting notes on trying to understand PDF and hack it like I used to with Postscript. You might find things here you cannot find anywhere else, because, after futile searches on the web I started to edit PDF myself and see what one can do with the monster.

The Zoef approach is named after the Dutch folk hero Zoef de Haas, fast as lightning but a bit sloppy.

Why PDF?

The PDF specifications are publicly available, and using it is license free (more or less). And slowly it seems to become the standard. But most importantly, it is not Microsoft. Ever tried living in a fascist society? Microsoft is the modern fascism. Sending me or requesting a *.doc, *.ppt, or a *.xls is just another form of oppression of free speech.

PDF Problems

There are several things that make dealing with PDF hard.

1) The BYTE COUNT: PDF seems to be designed by somebody with a tape drive for a brain. At the bottom there is a reference table by byte address for random access of the file. However, the count starts at the beginning of the file, and the first byte is 0, in C logic. So if you try to edit PDF by hand, make sure the byte count remains the same, or face updating the reference table.

2) BINARY CODE: PDF allows for stream data, which is unformatted and generally binary, although people warn against this bad practice. So when you perform global replacements and the like, be sure not to touch the stream data.

3) CROSS PLATFORM: since PDF is made and edited on UNIX, Mac, and the Unspeakable, it has all combinations of LF (\n) and CR (\r). A simple replacement destroys the integrity, and deleting CR changes the BYTE COUNT. I use at the moment:
s/\r/\n/g
s/\n\n/ \n/g

4) IMPLICIT CROSS REFERENCES and SHARED RESOURCES: PDF is made out of objects, which are made out of other objects, and most of the time trying to read PDF code you will spend tracing the objects, and the dependencies. PDF has no hash table or dictionary to aid this. Actually the reference table is just a piece of tape-drive-brain junk at the bottom.

5) REFERENCE MANUAL: it is a lyrical piece, going on about typesetting and clipping, but poor on actual information. The examples are infuriatingly ambiguous. And for people who give such a high priority to annotations, they can't be bothered to provide any comments. To figure out the markers I had to count the bytes myself.

6) ANNOTATIONS: The actual content (i.e. text and images) is only the third sublayer in PDF. You start at the second line from the bottom, which has a f**** BYTE COUNT to the head of the reference table "xref", which has a BYTE COUNT for each object, including "Root", which has references to the "obj" numbers of "Pages" and "Annotations", and each of these has a "Type" variable with an argument. However, which object is "Root" you only know by reading the trailer, which sits after the "xref" and before the end. It seems somebody wanted to implement Knuthy stuff (linked lists etc.) to the point of being religious about it.

7) MONEY: everybody wants to make money with PDF. The only decent freeware seems to be pdflatex, which I adore. But I want to be able to do more. I wasted money on buying Adobe Acrobat, which allows you to do next to nothing and just seems to be promoting the new features of higher releases of PDF. Most of the shareware stuff I tried seems pretty crap, and tuned towards end users. Most annoyingly, hardly any system READS PDF in a decent way, except for minimal alterations.

8) FILTERS: Please, anybody, tell me how to program the Filters in a decent way, to make stuff readable instead of encoded; manipulation starts with reading.
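
The chain in item 6 (bottom of the file, then "xref", then the trailer, then "Root") can be followed mechanically. Below is a minimal sketch in Python rather than Perl; the function name find_root and the tiny hand-made file are made up for illustration, and it assumes a classic xref table (not a compressed one) with the trailer right after it.

```python
import re

def find_root(pdf: bytes):
    """Return (xref_offset, root_number, root_generation) read from the file tail."""
    # The number after the last "startxref" is the byte offset of "xref".
    hits = re.findall(rb"startxref\s+(\d+)\s+%%EOF", pdf)
    if not hits:
        raise ValueError("no startxref found")
    xref_offset = int(hits[-1])
    if pdf[xref_offset:xref_offset + 4] != b"xref":
        raise ValueError("startxref does not point at an xref table")
    # The trailer dictionary after the table names the Root object.
    trailer_at = pdf.find(b"trailer", xref_offset)
    if trailer_at < 0:
        raise ValueError("no trailer after the xref table")
    root = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", pdf[trailer_at:])
    if root is None:
        raise ValueError("trailer has no /Root entry")
    return xref_offset, int(root.group(1)), int(root.group(2))

# Hypothetical one-object file, just enough to exercise the function.
body = b"%PDF-1.3\n1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
tail = (b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
        b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
        b"startxref\n%d\n%%%%EOF\n" % len(body))
print(find_root(body + tail))
```

The check that the offset really lands on "xref" is the cheap way to notice that a hand edit has shifted the BYTE COUNT.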

PERL? PERL! PERL!!!

I think that perl would be great to tackle PDF. The objects stand on their own, except for the BYTE COUNT and SHARED RESOURCES. So read-write most of the stuff, and change the things one can, and update the tables.
I see a number of little programs:
oinfo.pl: print object info and dependencies.
updat.pl: makes a new xref table and trailer.
indel.pl: insert and delete pages.
chnbb.pl: changes the MediaBox (bounding box).
insim.pl: insert an image.
extri.pl: extract an image.
clean.pl: change to unix format.
rotat.pl: rotate an obj.
trans.pl: translate non-binary parts into ascii.

Someday I will get around to it. Here is a little example of chnbb.pl without updating the tables; it just pads the file with blanks, but fails when the output file is longer than the input file: (unix format, otherwise more than one object might be strung together, with CR (\r) between them.)

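#!/usr/bin/perl
# chnbb.pl: overwrite every MediaBox in place, padding with blanks so
# that the byte count (and hence the xref table) stays valid.
if($#ARGV != 5){
print "TO CHANGE the bounding box\n";
print "USAGE: chnbb.pl xl yl xu yu infile.pdf outfile.pdf\n";
}else{
$infile = $ARGV[4];
open(Lista,"<$infile")||die "Can't open the file";
$outfile = $ARGV[5];
open(OUT,">$outfile")||die "Can't open the file";
# the file may hold binary stream data, so no newline translation
binmode(Lista);
binmode(OUT);
#
##################################
#
$count = 0;
@listind = <Lista>;
select(OUT);
for $i (@listind){
$ll = length($i);
# running byte offset; not used further in this little example
$count += $ll;
$i =~ s/MediaBox \[[^\]]*\]/MediaBox \[$ARGV[0] $ARGV[1] $ARGV[2] $ARGV[3]\]/;
# pad a shortened line back to its old length; the blanks land after
# the newline, which PDF treats as plain whitespace
if(length($i) < $ll){
$i = $i." " x ($ll - length($i));}
print $i;
}
close(Lista);
close(OUT);
}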
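
For comparison, here is the same padding trick as a small Python sketch (Python rather than Perl only so it can carry a few checks). The name set_mediabox is made up; unlike chnbb.pl it puts the blanks directly after the closing bracket instead of after the newline, and it refuses outright when the new box is longer than the old one.

```python
import re

def set_mediabox(line: bytes, box) -> bytes:
    """Replace the MediaBox in one line without changing the line's length."""
    m = re.search(rb"MediaBox \[[^\]]*\]", line)
    if m is None:
        return line  # nothing to do on this line
    old = m.group(0)
    new = b"MediaBox [%d %d %d %d]" % tuple(box)
    if len(new) > len(old):
        # A longer entry would shift every later byte offset in the xref table.
        raise ValueError("new MediaBox is longer than the old one")
    # Blanks are whitespace to a PDF reader, so padding is harmless.
    padded = new + b" " * (len(old) - len(new))
    return line[:m.start()] + padded + line[m.end():]

line = b"/MediaBox [0 0 612 792]\n"
print(set_mediabox(line, (0, 0, 595, 842)))
print(set_mediabox(line, (0, 0, 10, 10)))
```

Both results have exactly the length of the input line, so the reference table can stay as it is.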

clpdf.pl cleans out the CR, \r, or ^M from the file, so that it can more easily be attacked by brute force perlocity.

#!/usr/bin/perl
if($#ARGV != 1){
print "TO replace CR with LF outside stream\n";
print "USAGE: clpdf.pl infile.pdf outfile.pdf\n";
}else{
$infile = $ARGV[0];
$outfile = $ARGV[1];
#
open(Lista,"< $infile")||die "Can't open the file";
open(OUT,"> $outfile")||die "Can't open the file";
#
# according to the pdf spec the keyword
# "stream" should be followed by a CR \r and a LF \n or
# just a LF. So there can only be one stream and/or one
# endstream on a line.
#
##################################
#
@listind = <Lista>;
#
$off = 1;
#
select(OUT);
for $i (@listind){
if($i =~ /(.*)endstream(.*)/){
#
# $st is stream, $as is ascii
#
$st = $1;
$as = $2;
$as =~ s/\r/\n/g;
#
# substitution patterns remove the \n
# (don't use chop because \n\n should be replaced with a blank\n
# in for example the xref)
#
$as = $as."\n";
$as =~ s/\n\n/ \n/g;
$i = $st."endstream".$as;
#
# if a new stream appears turn off the global substitution
#
if($as =~ /stream/i){
$off = 0;
}else{
$off = 1;
}
}else{
#
# in sentences without "stream" rely on $off and check for "stream"
#
if($off eq 1){
$i =~ s/\r/\n/g;
$i =~ s/\n\n/ \n/g;
}
if(($i =~ /[^d]stream/i) || ($i =~ /^stream/i)){
$off = 0;
}
}
print $i;
}
close(Lista);
close(OUT);
}
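
The reason the two substitutions keep the BYTE COUNT intact is that each one swaps text for text of the same length: CR becomes LF, and the doubled LF left over from CR LF becomes blank plus LF. A minimal Python sketch of just that step (the function name is made up, and it must only be fed the non-stream parts of the file):

```python
def normalize_newlines(text: bytes) -> bytes:
    """Turn CR and CR LF into LF without changing the number of bytes."""
    out = text.replace(b"\r", b"\n")     # CR -> LF, one byte for one byte
    out = out.replace(b"\n\n", b" \n")   # doubled LF -> blank + LF, two for two
    return out

mixed = b"1 0 obj\r\n<< /Type /Catalog >>\rendobj\n"
clean = normalize_newlines(mixed)
print(clean)
print(len(mixed) == len(clean))
```

Note that a genuine empty line also turns into a line holding a single blank, which a PDF reader does not care about.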

Quick and dirty:
extracting pages from PDF:
acroread -toPostScript -start $1 -end $2 < $3 > tmp.ps; ps2pdf tmp.ps $4;rm tmp.ps

uppdf.pl
This is the program that does all the counting. If you hack a PDF in a text editor (remove a page, swap pages, change a bounding box), run this program to update the xref table and the startxref address at the end. If there is more than one xref, this program might fail without warning; however, it attempts to construct a single xref table out of multiple tables.
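
The counting itself is not hard. The following is not uppdf.pl, just a rough Python sketch of the idea under loud assumptions: the old xref and trailer have already been cut off, the objects are numbered 1..n without gaps, every "n g obj" header starts a line, and no binary stream happens to contain such a line. The name rebuild_tail and the fixed "/Root 1 0 R" default are made up.

```python
import re

# An object header at the start of a line: "number generation obj".
OBJ_RE = re.compile(rb"^(\d+) (\d+) obj\b", re.M)

def rebuild_tail(body: bytes, root: str = "1 0 R") -> bytes:
    """Append a fresh xref table, trailer and startxref to the object section."""
    offsets = {}
    for m in OBJ_RE.finditer(body):
        # Byte offset of the first digit of the object number.
        offsets[int(m.group(1))] = (m.start(), int(m.group(2)))
    if not offsets:
        raise ValueError("no objects found")
    size = max(offsets) + 1
    if sorted(offsets) != list(range(1, size)):
        raise ValueError("object numbers have gaps; free entries not handled here")
    lines = [b"xref\n", b"0 %d\n" % size, b"0000000000 65535 f \n"]
    for num in range(1, size):
        off, gen = offsets[num]
        lines.append(b"%010d %05d n \n" % (off, gen))  # 20 bytes per entry
    tail = b"".join(lines)
    tail += b"trailer\n<< /Size %d /Root %s >>\n" % (size, root.encode())
    tail += b"startxref\n%d\n%%%%EOF\n" % len(body)    # xref starts where body ends
    return body + tail

body = (b"%PDF-1.3\n"
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
        b"2 0 obj\n<< /Type /Pages /Kids [ ] /Count 0 >>\nendobj\n")
print(rebuild_tail(body).decode())
```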

Here is some stuff you might find inside a PDF file, and what it means.

$number1 $number2 obj
.....
endobj
$number1 is the object number, $number2 the generation number (in a fresh PDF it is usually 0). Together they identify the object that follows, up to "endobj" at the end.
The position of the first byte of $number1 is the BYTE COUNT address of the object.

$number1 $number2 R
a reference to the object above, basically: the INSERT obj HERE command.
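
Tracing these references by eye gets old quickly. Here is a small Python sketch of what something like oinfo.pl could do for a single object; the name object_refs is made up, and it assumes the object holds no binary stream that happens to contain "endobj" or something that looks like a reference.

```python
import re

def object_refs(pdf: bytes, num: int, gen: int = 0):
    """List the (number, generation) pairs that one object refers to."""
    # Everything between "num gen obj" at the start of a line and the next "endobj".
    pattern = rb"^%d %d obj\b(.*?)endobj" % (num, gen)
    m = re.search(pattern, pdf, re.M | re.S)
    if m is None:
        raise KeyError("object %d %d not found" % (num, gen))
    found = re.findall(rb"(\d+) (\d+) R\b", m.group(1))
    return [(int(a), int(b)) for a, b in found]

page = (b"3 0 obj\n<<\n/Type /Page\n/Parent 2 0 R\n"
        b"/Resources << /Font << /F0 5 0 R>> /XObject << /Im0 6 0 R >> >>\n"
        b"/Contents 4 0 R\n>>\nendobj\n")
print(object_refs(page, 3))
```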

stream
...................
endstream
Some stream data, usually everything useful about the contents (text, images, fonts) encoded as binary. It is important to know that the keyword "stream" should always be followed by a \n, possibly as \r\n.
It appears inside an object like:
$obj_number 0 obj
<< .... >>
stream
.......
endstream
endobj
where "<< ...>>" should contain useful information like the "/Length" and the type of "/Filter" or "/Encoding".
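
For the common case of a "/FlateDecode" filter the stream is plain zlib data, so reading it takes only a few lines. A Python sketch, with the caveats that "/Length" must be a direct number (not a reference to another object), that only this one filter is handled, and that the name read_stream and the sample object are made up:

```python
import re
import zlib

def read_stream(obj: bytes) -> bytes:
    """Return the stream data of one object, inflated if it is FlateDecode."""
    start = re.search(rb"stream\r?\n", obj)
    if start is None:
        raise ValueError("object has no stream")
    header = obj[:start.start()]
    # Accept "/Length 123" but not "/Length 12 0 R" (an indirect reference).
    length = re.search(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R)", header)
    if length is None:
        raise ValueError("no direct /Length in the dictionary")
    raw = obj[start.end():start.end() + int(length.group(1))]
    if b"/FlateDecode" in header:
        raw = zlib.decompress(raw)
    return raw

text = b"BT /F0 12 Tf (Zoef) Tj ET"
packed = zlib.compress(text)
obj = (b"4 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(packed)
       + packed + b"\nendstream\nendobj\n")
print(read_stream(obj))
```

Using "/Length" rather than searching for "endstream" matters, because the binary data itself may contain anything.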

%PDF-1.3
0226 0227 0207 0211
The header of a PDF file, where the numbers on the second line stand for the character codes of four binary (non-ascii) bytes.

<< $X1 $Y1 $X2 $Y2 >> a dictionary, generally a set of pairs with multiple functions, e.g., NEWCOMMAND: $X1 $Y1 can be a "/newname argument" pair.

[ x y ... z ] an array, for example for a composite argument, or a list of widths for fonts.

xref
0000000000 65535 f
$nn $nn
0000025190 00000 n
$nnnnnnnn1 $nnn2 n
$nnnnnnnn3 $nnn4 f
trailer
<<
/Size number_of_objs
/ID bla_bla
/Root obj_number_root obj_gen_number R
....
>>
startxref
reference_point_xref_of_root_BYTE_COUNT
%%EOF

The tail of the file. For a "functioning object" the identifier is "n" at the end, with a trailing blank! $nnnnnnnn1 is the BYTE COUNT location, $nnn2 the generation number. For an empty object number the identifier is "f"; $nnnnnnnn3 and $nnn4 form a linked list with 0000000000 65535 f as the top element with fixed format, $nnnnnnnn3 being the pointer to the next free object and $nnn4 the generation number. Sometimes there are $nn $nn numbers present; these are subsection headers, giving the first object number and the number of entries in that subsection.
The twenty-character lines (including LF (\n)) list all the objects, starting with object 0 (empty), object 1, etc.
Note, there might be several tails in the file; amendments can be added to the end. (MicroSods Word produces such stuff.)
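
Reading such a table back is mostly a matter of counting along from each subsection header. A Python sketch (the name parse_xref is made up; it handles one table and stops at "trailer", so amended files with several tails would need it called once per tail):

```python
def parse_xref(table: bytes):
    """Map object number -> (offset or next free object, generation, 'n' or 'f')."""
    entries = {}
    num = None
    for line in table.splitlines():
        parts = line.split()
        if not parts or parts == [b"xref"]:
            continue
        if parts == [b"trailer"]:
            break
        if len(parts) == 2:
            # Subsection header: first object number and number of entries.
            num = int(parts[0])
        elif len(parts) == 3 and num is not None:
            entries[num] = (int(parts[0]), int(parts[1]), parts[2].decode())
            num += 1
    return entries

table = (b"xref\n0 3\n"
         b"0000000000 65535 f \n"
         b"0000000009 00000 n \n"
         b"0000000059 00000 n \n"
         b"trailer\n<< /Size 3 /Root 1 0 R >>\n")
for number, entry in parse_xref(table).items():
    print(number, entry)
```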

Changing pages is quite simple:
The structure of a PDF starts with a root object, to which the trailer points: /Root 1 0 R:

1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj

The object Pages contains references to all the pages:
2 0 obj
<<
/Type /Pages
/Kids [ 3 0 R 6 0 R .....@objnumber[$pagenumber] @gennumber[$pagenumber] R ]
/Count $number
>>
endobj
Swapping two entries $objnumber $gennumber R will swap the respective pages.
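
A nice side effect of swapping is that the BYTE COUNT survives: both entries stay in the array, so the total length does not change even when the object numbers have different numbers of digits. A Python sketch of the swap, with a made-up function name and zero-based page positions:

```python
import re

def swap_kids(pages_obj: bytes, i: int, j: int) -> bytes:
    """Swap the i-th and j-th entries (counting from 0) of the /Kids array."""
    kids = re.search(rb"/Kids\s*\[(.*?)\]", pages_obj, re.S)
    if kids is None:
        raise ValueError("no /Kids array found")
    inner = kids.group(1)
    spans = [m.span() for m in re.finditer(rb"\d+\s+\d+\s+R", inner)]
    refs = [inner[a:b] for a, b in spans]
    refs[i], refs[j] = refs[j], refs[i]
    # Put the references back into their old slots, keeping all spacing as it was.
    out, pos = b"", 0
    for (a, b), ref in zip(spans, refs):
        out += inner[pos:a] + ref
        pos = b
    out += inner[pos:]
    return pages_obj[:kids.start(1)] + out + pages_obj[kids.end(1):]

pages = b"2 0 obj\n<<\n/Type /Pages\n/Kids [ 3 0 R 16 0 R 7 0 R ]\n/Count 3\n>>\nendobj\n"
print(swap_kids(pages, 0, 1).decode())
```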

The simplest JPEG image in PDF form is given by:
%PDF-1.2
1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj
2 0 obj
<<
/Type /Pages
/Kids [ 3 0 R ]
/Count 1
>>
endobj
3 0 obj
<<
/Type /Page
/Parent 2 0 R
/Resources <<
/Font << /F0 5 0 R>>
/XObject << /Im0 6 0 R >>
/ProcSet [ /PDF /Text /ImageC ] >>
/MediaBox [0 0 612 792] USletter
/CropBox [ x y (x+w) (y+h)] "position"
/Contents 4 0 R
>>
endobj
4 0 obj
<<
/Length 35
>>
stream
q
w 0 0 h x y cm "width + position"
/Im0 Do
Q
endstream
endobj
5 0 obj
<<
/Type /Font
/Subtype /Type1
/Name /F0
/BaseFont /Helvetica
/Encoding /MacRomanEncoding
>>
endobj
6 0 obj
<<
/Type /XObject
/Subtype /Image
/Name /Im0
/Filter [ /DCTDecode ]
/Width w "image width"
/Height h "image height"
/ColorSpace /DeviceRGB
/BitsPerComponent 8
/Length "the length of the jpg file"
>>
stream
... here goes the whole jpg file as a stream ...
endstream
endobj
xref
0 7
0000000000 65535 f
0000000010 00000 n
0000000059 00000 n
0000000118 00000 n
0000000329 00000 n
0000000413 00000 n
0000000521 00000 n
trailer
<<
/Root 1 0 R
/Size 7
>>
startxref
697 + length jpg (+/- a few bytes)
%%EOF

Make sure you update the image size and the xref table.
Even an image requires font resources, it seems. I edited this PDF from a slightly longer one generated by ImageMagick
(shell# convert image.jpg image.pdf).
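
The two "/Length" values are the easiest things to get wrong when filling in this template by hand. A Python sketch that produces the bodies of objects 4 and 6 with the lengths counted for you; the function names are made up, the width and height have to be supplied by the caller (nothing here reads them out of the JPEG), and the surrounding "n 0 obj ... endobj" lines and the xref table still have to be written around them:

```python
def content_stream(w: int, h: int, x: int = 0, y: int = 0) -> bytes:
    """Body of the contents object: scale and place /Im0, then draw it."""
    ops = b"q\n%d 0 0 %d %d %d cm\n/Im0 Do\nQ\n" % (w, h, x, y)
    return b"<<\n/Length %d\n>>\nstream\n" % len(ops) + ops + b"endstream"

def image_object(jpeg: bytes, w: int, h: int) -> bytes:
    """Body of the image object, with the whole JPEG file as its stream."""
    head = (b"<<\n/Type /XObject\n/Subtype /Image\n/Name /Im0\n"
            b"/Filter [ /DCTDecode ]\n/Width %d\n/Height %d\n"
            b"/ColorSpace /DeviceRGB\n/BitsPerComponent 8\n"
            b"/Length %d\n>>\n" % (w, h, len(jpeg)))
    return head + b"stream\n" + jpeg + b"\nendstream"

# Stand-in bytes, not a real image; a real script would read image.jpg instead.
fake_jpeg = b"\xff\xd8 not really a jpeg \xff\xd9"
print(content_stream(100, 50, 10, 20).decode())
print(image_object(fake_jpeg, 100, 50)[:60])
```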