Analyzing PDF Streams

Published: 2024-05-09
Last Updated: 2024-05-09 15:02:37 UTC
by Didier Stevens (Version: 1)
1 comment(s)

Occasionaly, Xavier and Jim will ask me specific students' questions about my tools when they teach FOR610: Reverse-Engineering Malware.

Recently, a student wanted to know if my pdf-parser.py tool can extract all the PDF streams with a single command.

Since version 0.7.9, it can.

A stream is (binary) data, part of an object (optional), and can be compressed, or otherwise transformed. To view a single stream with pdf-parser, one selects the object of interest and uses option -f to apply the filters (like zlib decompression) to the stream:

 

I added a feature that is present in several of my tools, like oledump.py and zipdump.py: extract al of the "stored items" into a single JSON document.

When you use pdf-parser's option -j (--jsonoutput), all objects with a stream, will have the raw data (e.g., unfiltered) extracted and put into a JSON document that is sent to stdout:

To have the filtered (e.g., decompressed data), use option -f together with option -j:

What can you do with this JSON data? It depends on what your goals are. I have several tools that can take this JSON data as input, like file-magic.py and strings.py.

Here I use file-magic.py to identify the type of each raw data stream:

From this we can learn, for example, that object 143's stream contains a JPEG image.

And here I use file-magic.py to identify the type of each filtered data stream:

From this we can learn, for example, that object 881's stream contains a compressed TrueType Font file.

What if you want to write all stream data to disk, in individual files, for further analysis (that's what the student wanted to do, I guess)?

Then you can use my tool myjson-filter.py. It's a tool designed to filter JSON data produced by my tools, but it can also write items to disk.

When you use option -l, this tool will just produce a listing of the items contained in de JSON data:

And you can use option -W to write the streams to disk. -W takes a value that specifies what aming convention must be used to write the file to disk. vir will write items to disk with their sanitized name and extension .vir:

hashvir will write items to disk with their sha256 value as name and extension .vir:

Didier Stevens
Senior handler
Microsoft MVP
blog.DidierStevens.com

Keywords:
1 comment(s)

Comments


Diary Archives