some small fixes

This commit is contained in:
Mari Wahl 2014-10-07 14:44:17 -04:00
parent b2447582cc
commit 73da4eeecd
56 changed files with 8326 additions and 0 deletions

View file

@ -0,0 +1,171 @@
# Forensics-200: Obscurity
The third forensics problem starts with the following text:
> see or do not see
>
> Written by marc
>
> [pdf.pdf]
>
Hacking PDFs, what fun!
In general, when dealing with reverse-engineering malicious documents, we follow these steps:
1. We search for malicious embedded code (shell code, JavaScript).
2. We extract any suspicious code segments
3. If we see shell code, we disassemble or debug it. If we see JavaScript (or ActionScript or VB macro code), we try to examine it.
However, this problem turned out to be very simple...
---
## Finding the Flag in 10 Seconds
Yeap, this easy:
1. Download the PDF file.
2. Open it in any PDF viewer.
3. CTRL+A (select all the contend).
4. You see the flag!
![](http://i.imgur.com/b03EehK.png)
OK, we were luck. Keep reading if you think this was too easy.
## Analysing the ID and the Streams in a PDF File
Let's suppose we had no clue that the flag would just be a text in the file. In this case, we would want to examine the file's structure. For this task we use the [PDF Tool] suite, which is written in Python.
#### pdfid
We start with *pdfid.py*, which parses the PDF looking for certain keywords. We download and unzip that script, and then we make it an executable:
```sh
$ unzip pdfid_v0_1_2.zip
$ chmod a+x pdfid.py
```
Running over our file gives:
```sh
$ ./pdfid.py pdf.pdf
PDFiD 0.1.2 pdf.pdf
PDF Header: %PDF-1.3
obj 20
endobj 19
stream 10
endstream 10
xref 1
trailer 1
startxref 1
/Page 1
/Encrypt 0
/ObjStm 0
/JS 0
/JavaScript 0
/AA 0
/OpenAction 0
/AcroForm 0
/JBIG2Decode 0
/RichMedia 0
/Launch 0
/EmbeddedFile 0
/XFA 0
/Colors > 2^24 0
```
All right, no funny stuff going on here. We need to look deeper into each of the these streams.
#### pdf-parser
We download *pdf-parser.py*, which is used to search for all the fundamental elements in a PDF file. Let's take a closer look:
```sh
$ unzip pdf-parser_V0_4_3.zip
$ chmod a+x pdf-parser.py
$ ./pdf-parser.py
Usage: pdf-parser.py [options] pdf-file|zip-file|url
pdf-parser, use it to parse a PDF document
Options:
--version show program's version number and exit
-s SEARCH, --search=SEARCH
string to search in indirect objects (except streams)
-f, --filter pass stream object through filters (FlateDecode,
ASCIIHexDecode, ASCII85Decode, LZWDecode and
RunLengthDecode only)
-o OBJECT, --object=OBJECT
id of indirect object to select (version independent)
-r REFERENCE, --reference=REFERENCE
id of indirect object being referenced (version
independent)
-e ELEMENTS, --elements=ELEMENTS
type of elements to select (cxtsi)
-w, --raw raw output for data and filters
-a, --stats display stats for pdf document
-t TYPE, --type=TYPE type of indirect object to select
-v, --verbose display malformed PDF elements
-x EXTRACT, --extract=EXTRACT
filename to extract malformed content to
-H, --hash display hash of objects
-n, --nocanonicalizedoutput
do not canonicalize the output
-d DUMP, --dump=DUMP filename to dump stream content to
-D, --debug display debug info
-c, --content display the content for objects without streams or
with streams without filters
--searchstream=SEARCHSTREAM
string to search in streams
--unfiltered search in unfiltered streams
--casesensitive case sensitive search in streams
--regex use regex to search in streams
```
Very interesting! We run it with our file, searching for the string */ProcSet*:
```sh
$ ./pdf-parser.py pdf.pdf | grep /ProcSet
/ProcSet [ /ImageC /Text /PDF /ImageI /ImageB ]
```
Awesome! Even though we don't see any text in the file (when we opened it in the PDF viewer), there is text somewhere!
## Getting Text from PDF
A good way to extract text from a pdf is using [pdftotext]:
```sh
$ pdftotext pdf.pdf
```
You should get a ```pdf.txt``` file. Reading it with Linux's commands ```cat``` or ```strings```gives you the flag:
```sh
$ strings pdf.txt
flag{security_through_obscurity}
```
As a note, there are several other PDF forensics tools that are worth to be mentioned: [Origami] (pdfextract extracts JavaScript from PDF files), [PDF Stream Dumper] (several PDF analysis tools), [Peepdf] (command-line shell for examining PDF), [PDF X-RAY Lite] (creates an HTML report with decoded file structure and contents), [SWF mastah] (extracts SWF objects), [Pyew](for examining and decoding structure and content of PDF files).
**Hack all the things!**
[PDF Tool]:http://blog.didierstevens.com/programs/pdf-tools/
[Origami]: http://esec-lab.sogeti.com/pages/Origami
[PDF Stream Dumper]: http://blog.zeltser.com/post/3235995383/pdf-stream-dumper-malicious-file-analysis
[Peepdf]: http://blog.zeltser.com/post/6780160077/peepdf-malicious-pdf-analysis
[SWF mastah]: http://blog.zeltser.com/post/12615013257/extracting-swf-from-pdf-using-swf-mastah
[PDF X-RAY Lite]: https://github.com/9b/pdfxray_lite
[Pyew]: http://code.google.com/p/pyew/wiki/PDFAnalysis
[this website]: http://blog.didierstevens.com/programs/pdf-tools/
[pdf-tools]: https://apps.fedoraproject.org/packages/pdf-tools
[pdf.pdf]: https://ctf.isis.poly.edu/static/uploads/883c7046854e04138c55680ffde90a61/pdf.pdf
[pdftotext]: http://en.wikipedia.org/wiki/Pdftotext

File diff suppressed because it is too large Load diff

Binary file not shown.

View file

@ -0,0 +1,714 @@
#!/usr/bin/env python
__description__ = 'Tool to test a PDF file'
__author__ = 'Didier Stevens'
__version__ = '0.1.2'
__date__ = '2013/03/13'
"""
Tool to test a PDF file
Source code put in public domain by Didier Stevens, no Copyright
https://DidierStevens.com
Use at your own risk
History:
2009/03/27: start
2009/03/28: scan option
2009/03/29: V0.0.2: xml output
2009/03/31: V0.0.3: /ObjStm suggested by Dion
2009/04/02: V0.0.4: added ErrorMessage
2009/04/20: V0.0.5: added Dates
2009/04/21: V0.0.6: added entropy
2009/04/22: added disarm
2009/04/29: finished disarm
2009/05/13: V0.0.7: added cPDFEOF
2009/07/24: V0.0.8: added /AcroForm and /RichMedia, simplified %PDF header regex, extra date format (without TZ)
2009/07/25: added input redirection, option --force
2009/10/13: V0.0.9: added detection for CVE-2009-3459; added /RichMedia to disarm
2010/01/11: V0.0.10: relaxed %PDF header checking
2010/04/28: V0.0.11: added /Launch
2010/09/21: V0.0.12: fixed cntCharsAfterLastEOF bug; fix by Russell Holloway
2011/12/29: updated for Python 3, added keyword /EmbeddedFile
2012/03/03: added PDFiD2JSON; coded by Brandon Dixon
2013/02/10: V0.1.0: added http/https support; added support for ZIP file with password 'infected'
2013/03/11: V0.1.1: fixes for Python 3
2013/03/13: V0.1.2: Added error handling for files; added /XFA
Todo:
- update XML example (entropy, EOF)
- code review, cleanup
"""
import optparse
import os
import re
import xml.dom.minidom
import traceback
import math
import operator
import os.path
import sys
import json
import zipfile
try:
import urllib2
urllib23 = urllib2
except:
import urllib.request
urllib23 = urllib.request
#Convert 2 Bytes If Python 3
def C2BIP3(string):
if sys.version_info[0] > 2:
return bytes([ord(x) for x in string])
else:
return string
class cBinaryFile:
def __init__(self, file):
self.file = file
if file == '':
self.infile = sys.stdin
elif file.lower().startswith('http://') or file.lower().startswith('https://'):
try:
if sys.hexversion >= 0x020601F0:
self.infile = urllib23.urlopen(file, timeout=5)
else:
self.infile = urllib23.urlopen(file)
except urllib23.HTTPError:
print('Error accessing URL %s' % file)
print(sys.exc_info()[1])
sys.exit()
elif file.lower().endswith('.zip'):
try:
self.zipfile = zipfile.ZipFile(file, 'r')
self.infile = self.zipfile.open(self.zipfile.infolist()[0], 'r', C2BIP3('infected'))
except:
print('Error opening file %s' % file)
print(sys.exc_info()[1])
sys.exit()
else:
try:
self.infile = open(file, 'rb')
except:
print('Error opening file %s' % file)
print(sys.exc_info()[1])
sys.exit()
self.ungetted = []
def byte(self):
if len(self.ungetted) != 0:
return self.ungetted.pop()
inbyte = self.infile.read(1)
if not inbyte or inbyte == '':
self.infile.close()
return None
return ord(inbyte)
def bytes(self, size):
if size <= len(self.ungetted):
result = self.ungetted[0:size]
del self.ungetted[0:size]
return result
inbytes = self.infile.read(size - len(self.ungetted))
if inbytes == '':
self.infile.close()
if type(inbytes) == type(''):
result = self.ungetted + [ord(b) for b in inbytes]
else:
result = self.ungetted + [b for b in inbytes]
self.ungetted = []
return result
def unget(self, byte):
self.ungetted.append(byte)
def ungets(self, bytes):
bytes.reverse()
self.ungetted.extend(bytes)
class cPDFDate:
def __init__(self):
self.state = 0
def parse(self, char):
if char == 'D':
self.state = 1
return None
elif self.state == 1:
if char == ':':
self.state = 2
self.digits1 = ''
else:
self.state = 0
return None
elif self.state == 2:
if len(self.digits1) < 14:
if char >= '0' and char <= '9':
self.digits1 += char
return None
else:
self.state = 0
return None
elif char == '+' or char == '-' or char == 'Z':
self.state = 3
self.digits2 = ''
self.TZ = char
return None
elif char == '"':
self.state = 0
self.date = 'D:' + self.digits1
return self.date
elif char < '0' or char > '9':
self.state = 0
self.date = 'D:' + self.digits1
return self.date
else:
self.state = 0
return None
elif self.state == 3:
if len(self.digits2) < 2:
if char >= '0' and char <= '9':
self.digits2 += char
return None
else:
self.state = 0
return None
elif len(self.digits2) == 2:
if char == "'":
self.digits2 += char
return None
else:
self.state = 0
return None
elif len(self.digits2) < 5:
if char >= '0' and char <= '9':
self.digits2 += char
if len(self.digits2) == 5:
self.state = 0
self.date = 'D:' + self.digits1 + self.TZ + self.digits2
return self.date
else:
return None
else:
self.state = 0
return None
def fEntropy(countByte, countTotal):
x = float(countByte) / countTotal
if x > 0:
return - x * math.log(x, 2)
else:
return 0.0
class cEntropy:
def __init__(self):
self.allBucket = [0 for i in range(0, 256)]
self.streamBucket = [0 for i in range(0, 256)]
def add(self, byte, insideStream):
self.allBucket[byte] += 1
if insideStream:
self.streamBucket[byte] += 1
def removeInsideStream(self, byte):
if self.streamBucket[byte] > 0:
self.streamBucket[byte] -= 1
def calc(self):
self.nonStreamBucket = map(operator.sub, self.allBucket, self.streamBucket)
allCount = sum(self.allBucket)
streamCount = sum(self.streamBucket)
nonStreamCount = sum(self.nonStreamBucket)
return (allCount, sum(map(lambda x: fEntropy(x, allCount), self.allBucket)), streamCount, sum(map(lambda x: fEntropy(x, streamCount), self.streamBucket)), nonStreamCount, sum(map(lambda x: fEntropy(x, nonStreamCount), self.nonStreamBucket)))
class cPDFEOF:
def __init__(self):
self.token = ''
self.cntEOFs = 0
def parse(self, char):
if self.cntEOFs > 0:
self.cntCharsAfterLastEOF += 1
if self.token == '' and char == '%':
self.token += char
return
elif self.token == '%' and char == '%':
self.token += char
return
elif self.token == '%%' and char == 'E':
self.token += char
return
elif self.token == '%%E' and char == 'O':
self.token += char
return
elif self.token == '%%EO' and char == 'F':
self.token += char
return
elif self.token == '%%EOF' and (char == '\n' or char == '\r' or char == ' ' or char == '\t'):
self.cntEOFs += 1
self.cntCharsAfterLastEOF = 0
if char == '\n':
self.token = ''
else:
self.token += char
return
elif self.token == '%%EOF\r':
if char == '\n':
self.cntCharsAfterLastEOF = 0
self.token = ''
else:
self.token = ''
def FindPDFHeaderRelaxed(oBinaryFile):
bytes = oBinaryFile.bytes(1024)
index = ''.join([chr(byte) for byte in bytes]).find('%PDF')
if index == -1:
oBinaryFile.ungets(bytes)
return ([], None)
for endHeader in range(index + 4, index + 4 + 10):
if bytes[endHeader] == 10 or bytes[endHeader] == 13:
break
oBinaryFile.ungets(bytes[endHeader:])
return (bytes[0:endHeader], ''.join([chr(byte) for byte in bytes[index:endHeader]]))
def Hexcode2String(char):
if type(char) == int:
return '#%02x' % char
else:
return char
def SwapCase(char):
if type(char) == int:
return ord(chr(char).swapcase())
else:
return char.swapcase()
def HexcodeName2String(hexcodeName):
return ''.join(map(Hexcode2String, hexcodeName))
def SwapName(wordExact):
return map(SwapCase, wordExact)
def UpdateWords(word, wordExact, slash, words, hexcode, allNames, lastName, insideStream, oEntropy, fOut):
if word != '':
if slash + word in words:
words[slash + word][0] += 1
if hexcode:
words[slash + word][1] += 1
elif slash == '/' and allNames:
words[slash + word] = [1, 0]
if hexcode:
words[slash + word][1] += 1
if slash == '/':
lastName = slash + word
if slash == '':
if word == 'stream':
insideStream = True
if word == 'endstream':
if insideStream == True and oEntropy != None:
for char in 'endstream':
oEntropy.removeInsideStream(ord(char))
insideStream = False
if fOut != None:
if slash == '/' and '/' + word in ('/JS', '/JavaScript', '/AA', '/OpenAction', '/JBIG2Decode', '/RichMedia', '/Launch'):
wordExactSwapped = HexcodeName2String(SwapName(wordExact))
fOut.write(C2BIP3(wordExactSwapped))
print('/%s -> /%s' % (HexcodeName2String(wordExact), wordExactSwapped))
else:
fOut.write(C2BIP3(HexcodeName2String(wordExact)))
return ('', [], False, lastName, insideStream)
class cCVE_2009_3459:
def __init__(self):
self.count = 0
def Check(self, lastName, word):
if (lastName == '/Colors' and word.isdigit() and int(word) > 2^24): # decided to alert when the number of colors is expressed with more than 3 bytes
self.count += 1
def PDFiD(file, allNames=False, extraData=False, disarm=False, force=False):
"""Example of XML output:
<PDFiD ErrorOccured="False" ErrorMessage="" Filename="test.pdf" Header="%PDF-1.1" IsPDF="True" Version="0.0.4" Entropy="4.28">
<Keywords>
<Keyword Count="7" HexcodeCount="0" Name="obj"/>
<Keyword Count="7" HexcodeCount="0" Name="endobj"/>
<Keyword Count="1" HexcodeCount="0" Name="stream"/>
<Keyword Count="1" HexcodeCount="0" Name="endstream"/>
<Keyword Count="1" HexcodeCount="0" Name="xref"/>
<Keyword Count="1" HexcodeCount="0" Name="trailer"/>
<Keyword Count="1" HexcodeCount="0" Name="startxref"/>
<Keyword Count="1" HexcodeCount="0" Name="/Page"/>
<Keyword Count="0" HexcodeCount="0" Name="/Encrypt"/>
<Keyword Count="1" HexcodeCount="0" Name="/JS"/>
<Keyword Count="1" HexcodeCount="0" Name="/JavaScript"/>
<Keyword Count="0" HexcodeCount="0" Name="/AA"/>
<Keyword Count="1" HexcodeCount="0" Name="/OpenAction"/>
<Keyword Count="0" HexcodeCount="0" Name="/JBIG2Decode"/>
</Keywords>
<Dates>
<Date Value="D:20090128132916+01'00" Name="/ModDate"/>
</Dates>
</PDFiD>
"""
word = ''
wordExact = []
hexcode = False
lastName = ''
insideStream = False
keywords = ('obj',
'endobj',
'stream',
'endstream',
'xref',
'trailer',
'startxref',
'/Page',
'/Encrypt',
'/ObjStm',
'/JS',
'/JavaScript',
'/AA',
'/OpenAction',
'/AcroForm',
'/JBIG2Decode',
'/RichMedia',
'/Launch',
'/EmbeddedFile',
'/XFA',
)
words = {}
dates = []
for keyword in keywords:
words[keyword] = [0, 0]
slash = ''
xmlDoc = xml.dom.minidom.getDOMImplementation().createDocument(None, 'PDFiD', None)
att = xmlDoc.createAttribute('Version')
att.nodeValue = __version__
xmlDoc.documentElement.setAttributeNode(att)
att = xmlDoc.createAttribute('Filename')
att.nodeValue = file
xmlDoc.documentElement.setAttributeNode(att)
attErrorOccured = xmlDoc.createAttribute('ErrorOccured')
xmlDoc.documentElement.setAttributeNode(attErrorOccured)
attErrorOccured.nodeValue = 'False'
attErrorMessage = xmlDoc.createAttribute('ErrorMessage')
xmlDoc.documentElement.setAttributeNode(attErrorMessage)
attErrorMessage.nodeValue = ''
oPDFDate = None
oEntropy = None
oPDFEOF = None
oCVE_2009_3459 = cCVE_2009_3459()
try:
attIsPDF = xmlDoc.createAttribute('IsPDF')
xmlDoc.documentElement.setAttributeNode(attIsPDF)
oBinaryFile = cBinaryFile(file)
if extraData:
oPDFDate = cPDFDate()
oEntropy = cEntropy()
oPDFEOF = cPDFEOF()
(bytesHeader, pdfHeader) = FindPDFHeaderRelaxed(oBinaryFile)
if disarm:
(pathfile, extension) = os.path.splitext(file)
fOut = open(pathfile + '.disarmed' + extension, 'wb')
for byteHeader in bytesHeader:
fOut.write(C2BIP3(chr(byteHeader)))
else:
fOut = None
if oEntropy != None:
for byteHeader in bytesHeader:
oEntropy.add(byteHeader, insideStream)
if pdfHeader == None and not force:
attIsPDF.nodeValue = 'False'
return xmlDoc
else:
if pdfHeader == None:
attIsPDF.nodeValue = 'False'
pdfHeader = ''
else:
attIsPDF.nodeValue = 'True'
att = xmlDoc.createAttribute('Header')
att.nodeValue = repr(pdfHeader[0:10]).strip("'")
xmlDoc.documentElement.setAttributeNode(att)
byte = oBinaryFile.byte()
while byte != None:
char = chr(byte)
charUpper = char.upper()
if charUpper >= 'A' and charUpper <= 'Z' or charUpper >= '0' and charUpper <= '9':
word += char
wordExact.append(char)
elif slash == '/' and char == '#':
d1 = oBinaryFile.byte()
if d1 != None:
d2 = oBinaryFile.byte()
if d2 != None and (chr(d1) >= '0' and chr(d1) <= '9' or chr(d1).upper() >= 'A' and chr(d1).upper() <= 'F') and (chr(d2) >= '0' and chr(d2) <= '9' or chr(d2).upper() >= 'A' and chr(d2).upper() <= 'F'):
word += chr(int(chr(d1) + chr(d2), 16))
wordExact.append(int(chr(d1) + chr(d2), 16))
hexcode = True
if oEntropy != None:
oEntropy.add(d1, insideStream)
oEntropy.add(d2, insideStream)
if oPDFEOF != None:
oPDFEOF.parse(d1)
oPDFEOF.parse(d2)
else:
oBinaryFile.unget(d2)
oBinaryFile.unget(d1)
(word, wordExact, hexcode, lastName, insideStream) = UpdateWords(word, wordExact, slash, words, hexcode, allNames, lastName, insideStream, oEntropy, fOut)
if disarm:
fOut.write(C2BIP3(char))
else:
oBinaryFile.unget(d1)
(word, wordExact, hexcode, lastName, insideStream) = UpdateWords(word, wordExact, slash, words, hexcode, allNames, lastName, insideStream, oEntropy, fOut)
if disarm:
fOut.write(C2BIP3(char))
else:
oCVE_2009_3459.Check(lastName, word)
(word, wordExact, hexcode, lastName, insideStream) = UpdateWords(word, wordExact, slash, words, hexcode, allNames, lastName, insideStream, oEntropy, fOut)
if char == '/':
slash = '/'
else:
slash = ''
if disarm:
fOut.write(C2BIP3(char))
if oPDFDate != None and oPDFDate.parse(char) != None:
dates.append([oPDFDate.date, lastName])
if oEntropy != None:
oEntropy.add(byte, insideStream)
if oPDFEOF != None:
oPDFEOF.parse(char)
byte = oBinaryFile.byte()
(word, wordExact, hexcode, lastName, insideStream) = UpdateWords(word, wordExact, slash, words, hexcode, allNames, lastName, insideStream, oEntropy, fOut)
# check to see if file ended with %%EOF. If so, we can reset charsAfterLastEOF and add one to EOF count. This is never performed in
# the parse function because it never gets called due to hitting the end of file.
if byte == None and oPDFEOF != None:
if oPDFEOF.token == '%%EOF':
oPDFEOF.cntEOFs += 1
oPDFEOF.cntCharsAfterLastEOF = 0
oPDFEOF.token = ''
except SystemExit:
sys.exit()
except:
attErrorOccured.nodeValue = 'True'
attErrorMessage.nodeValue = traceback.format_exc()
if disarm:
fOut.close()
attEntropyAll = xmlDoc.createAttribute('TotalEntropy')
xmlDoc.documentElement.setAttributeNode(attEntropyAll)
attCountAll = xmlDoc.createAttribute('TotalCount')
xmlDoc.documentElement.setAttributeNode(attCountAll)
attEntropyStream = xmlDoc.createAttribute('StreamEntropy')
xmlDoc.documentElement.setAttributeNode(attEntropyStream)
attCountStream = xmlDoc.createAttribute('StreamCount')
xmlDoc.documentElement.setAttributeNode(attCountStream)
attEntropyNonStream = xmlDoc.createAttribute('NonStreamEntropy')
xmlDoc.documentElement.setAttributeNode(attEntropyNonStream)
attCountNonStream = xmlDoc.createAttribute('NonStreamCount')
xmlDoc.documentElement.setAttributeNode(attCountNonStream)
if oEntropy != None:
(countAll, entropyAll , countStream, entropyStream, countNonStream, entropyNonStream) = oEntropy.calc()
attEntropyAll.nodeValue = '%f' % entropyAll
attCountAll.nodeValue = '%d' % countAll
attEntropyStream.nodeValue = '%f' % entropyStream
attCountStream.nodeValue = '%d' % countStream
attEntropyNonStream.nodeValue = '%f' % entropyNonStream
attCountNonStream.nodeValue = '%d' % countNonStream
else:
attEntropyAll.nodeValue = ''
attCountAll.nodeValue = ''
attEntropyStream.nodeValue = ''
attCountStream.nodeValue = ''
attEntropyNonStream.nodeValue = ''
attCountNonStream.nodeValue = ''
attCountEOF = xmlDoc.createAttribute('CountEOF')
xmlDoc.documentElement.setAttributeNode(attCountEOF)
attCountCharsAfterLastEOF = xmlDoc.createAttribute('CountCharsAfterLastEOF')
xmlDoc.documentElement.setAttributeNode(attCountCharsAfterLastEOF)
if oPDFEOF != None:
attCountEOF.nodeValue = '%d' % oPDFEOF.cntEOFs
attCountCharsAfterLastEOF.nodeValue = '%d' % oPDFEOF.cntCharsAfterLastEOF
else:
attCountEOF.nodeValue = ''
attCountCharsAfterLastEOF.nodeValue = ''
eleKeywords = xmlDoc.createElement('Keywords')
xmlDoc.documentElement.appendChild(eleKeywords)
for keyword in keywords:
eleKeyword = xmlDoc.createElement('Keyword')
eleKeywords.appendChild(eleKeyword)
att = xmlDoc.createAttribute('Name')
att.nodeValue = keyword
eleKeyword.setAttributeNode(att)
att = xmlDoc.createAttribute('Count')
att.nodeValue = str(words[keyword][0])
eleKeyword.setAttributeNode(att)
att = xmlDoc.createAttribute('HexcodeCount')
att.nodeValue = str(words[keyword][1])
eleKeyword.setAttributeNode(att)
eleKeyword = xmlDoc.createElement('Keyword')
eleKeywords.appendChild(eleKeyword)
att = xmlDoc.createAttribute('Name')
att.nodeValue = '/Colors > 2^24'
eleKeyword.setAttributeNode(att)
att = xmlDoc.createAttribute('Count')
att.nodeValue = str(oCVE_2009_3459.count)
eleKeyword.setAttributeNode(att)
att = xmlDoc.createAttribute('HexcodeCount')
att.nodeValue = str(0)
eleKeyword.setAttributeNode(att)
if allNames:
keys = sorted(words.keys())
for word in keys:
if not word in keywords:
eleKeyword = xmlDoc.createElement('Keyword')
eleKeywords.appendChild(eleKeyword)
att = xmlDoc.createAttribute('Name')
att.nodeValue = word
eleKeyword.setAttributeNode(att)
att = xmlDoc.createAttribute('Count')
att.nodeValue = str(words[word][0])
eleKeyword.setAttributeNode(att)
att = xmlDoc.createAttribute('HexcodeCount')
att.nodeValue = str(words[word][1])
eleKeyword.setAttributeNode(att)
eleDates = xmlDoc.createElement('Dates')
xmlDoc.documentElement.appendChild(eleDates)
dates.sort(key=lambda x: x[0])
for date in dates:
eleDate = xmlDoc.createElement('Date')
eleDates.appendChild(eleDate)
att = xmlDoc.createAttribute('Value')
att.nodeValue = date[0]
eleDate.setAttributeNode(att)
att = xmlDoc.createAttribute('Name')
att.nodeValue = date[1]
eleDate.setAttributeNode(att)
return xmlDoc
def PDFiD2String(xmlDoc, force):
result = 'PDFiD %s %s\n' % (xmlDoc.documentElement.getAttribute('Version'), xmlDoc.documentElement.getAttribute('Filename'))
if xmlDoc.documentElement.getAttribute('ErrorOccured') == 'True':
return result + '***Error occured***\n%s\n' % xmlDoc.documentElement.getAttribute('ErrorMessage')
if not force and xmlDoc.documentElement.getAttribute('IsPDF') == 'False':
return result + ' Not a PDF document\n'
result += ' PDF Header: %s\n' % xmlDoc.documentElement.getAttribute('Header')
for node in xmlDoc.documentElement.getElementsByTagName('Keywords')[0].childNodes:
result += ' %-16s %7d' % (node.getAttribute('Name'), int(node.getAttribute('Count')))
if int(node.getAttribute('HexcodeCount')) > 0:
result += '(%d)' % int(node.getAttribute('HexcodeCount'))
result += '\n'
if xmlDoc.documentElement.getAttribute('CountEOF') != '':
result += ' %-16s %7d\n' % ('%%EOF', int(xmlDoc.documentElement.getAttribute('CountEOF')))
if xmlDoc.documentElement.getAttribute('CountCharsAfterLastEOF') != '':
result += ' %-16s %7d\n' % ('After last %%EOF', int(xmlDoc.documentElement.getAttribute('CountCharsAfterLastEOF')))
for node in xmlDoc.documentElement.getElementsByTagName('Dates')[0].childNodes:
result += ' %-23s %s\n' % (node.getAttribute('Value'), node.getAttribute('Name'))
if xmlDoc.documentElement.getAttribute('TotalEntropy') != '':
result += ' Total entropy: %s (%10s bytes)\n' % (xmlDoc.documentElement.getAttribute('TotalEntropy'), xmlDoc.documentElement.getAttribute('TotalCount'))
if xmlDoc.documentElement.getAttribute('StreamEntropy') != '':
result += ' Entropy inside streams: %s (%10s bytes)\n' % (xmlDoc.documentElement.getAttribute('StreamEntropy'), xmlDoc.documentElement.getAttribute('StreamCount'))
if xmlDoc.documentElement.getAttribute('NonStreamEntropy') != '':
result += ' Entropy outside streams: %s (%10s bytes)\n' % (xmlDoc.documentElement.getAttribute('NonStreamEntropy'), xmlDoc.documentElement.getAttribute('NonStreamCount'))
return result
def Scan(directory, allNames, extraData, disarm, force):
try:
if os.path.isdir(directory):
for entry in os.listdir(directory):
Scan(os.path.join(directory, entry), allNames, extraData, disarm, force)
else:
result = PDFiD2String(PDFiD(directory, allNames, extraData, disarm, force), force)
print(result)
logfile = open('PDFiD.log', 'a')
logfile.write(result + '\n')
logfile.close()
except:
pass
#function derived from: http://blog.9bplus.com/pdfidpy-output-to-json
def PDFiD2JSON(xmlDoc, force):
#Get Top Layer Data
errorOccured = xmlDoc.documentElement.getAttribute('ErrorOccured')
errorMessage = xmlDoc.documentElement.getAttribute('ErrorMessage')
filename = xmlDoc.documentElement.getAttribute('Filename')
header = xmlDoc.documentElement.getAttribute('Header')
isPdf = xmlDoc.documentElement.getAttribute('IsPDF')
version = xmlDoc.documentElement.getAttribute('Version')
entropy = xmlDoc.documentElement.getAttribute('Entropy')
#extra data
countEof = xmlDoc.documentElement.getAttribute('CountEOF')
countChatAfterLastEof = xmlDoc.documentElement.getAttribute('CountCharsAfterLastEOF')
totalEntropy = xmlDoc.documentElement.getAttribute('TotalEntropy')
streamEntropy = xmlDoc.documentElement.getAttribute('StreamEntropy')
nonStreamEntropy = xmlDoc.documentElement.getAttribute('NonStreamEntropy')
keywords = []
dates = []
#grab all keywords
for node in xmlDoc.documentElement.getElementsByTagName('Keywords')[0].childNodes:
name = node.getAttribute('Name')
count = int(node.getAttribute('Count'))
if int(node.getAttribute('HexcodeCount')) > 0:
hexCount = int(node.getAttribute('HexcodeCount'))
else:
hexCount = 0
keyword = { 'count':count, 'hexcodecount':hexCount, 'name':name }
keywords.append(keyword)
#grab all date information
for node in xmlDoc.documentElement.getElementsByTagName('Dates')[0].childNodes:
name = node.getAttribute('Name')
value = node.getAttribute('Value')
date = { 'name':name, 'value':value }
dates.append(date)
data = { 'countEof':countEof, 'countChatAfterLastEof':countChatAfterLastEof, 'totalEntropy':totalEntropy, 'streamEntropy':streamEntropy, 'nonStreamEntropy':nonStreamEntropy, 'errorOccured':errorOccured, 'errorMessage':errorMessage, 'filename':filename, 'header':header, 'isPdf':isPdf, 'version':version, 'entropy':entropy, 'keywords': { 'keyword': keywords }, 'dates': { 'date':dates} }
complete = [ { 'pdfid' : data} ]
result = json.dumps(complete)
return result
def Main():
oParser = optparse.OptionParser(usage='usage: %prog [options] [pdf-file|zip-file|url]\n' + __description__, version='%prog ' + __version__)
oParser.add_option('-s', '--scan', action='store_true', default=False, help='scan the given directory')
oParser.add_option('-a', '--all', action='store_true', default=False, help='display all the names')
oParser.add_option('-e', '--extra', action='store_true', default=False, help='display extra data, like dates')
oParser.add_option('-f', '--force', action='store_true', default=False, help='force the scan of the file, even without proper %PDF header')
oParser.add_option('-d', '--disarm', action='store_true', default=False, help='disable JavaScript and auto launch')
(options, args) = oParser.parse_args()
if len(args) == 0:
if options.disarm:
print('Option disarm not supported with stdin')
options.disarm = False
print(PDFiD2String(PDFiD('', options.all, options.extra, options.disarm, options.force), options.force))
elif len(args) == 1:
if options.scan:
Scan(args[0], options.all, options.extra, options.disarm, options.force)
else:
print(PDFiD2String(PDFiD(args[0], options.all, options.extra, options.disarm, options.force), options.force))
else:
oParser.print_help()
print('')
print(' %s' % __description__)
print(' Source code put in the public domain by Didier Stevens, no Copyright')
print(' Use at your own risk')
print(' https://DidierStevens.com')
return
if __name__ == '__main__':
Main()