PDFassassin is a module for SpamAssassin that scans PDF files attached to email messages. Incoming messages are checked for PDF attachments. Text is extracted from each PDF via pdftotext and scanned by SpamAssassin. Should the PDF contain images, the gocr program is called to extract their text content. The total spam score of the PDF is compared against the global required_score setting; if it’s higher, a score equal to the one specified in pdf.cf is added to the overall score of the email message.
In response to the recent torrent of PDF spam, we created a module for SpamAssassin that scans PDF attachments. The module, linked at the end of this post, works in the following way:
- Email bodies are scanned upon connection and checked for PDF attachments.
- Text is extracted from the PDF via pdftotext and scanned by SpamAssassin.
- Should the PDF contain images, the gocr binary is called to extract the text content.
- The total spam score of the PDF is compared against the global required_score setting; if it’s higher, a score equal to the one specified in pdf.cf (default of 10) is added to the overall score of the email message.
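The steps above can be sketched roughly as follows. This is only an illustration in Python, not the module's actual code (SpamAssassin plugins are written in Perl). The function names, the `score_text` callback standing in for the SpamAssassin engine, and the choice to apply the penalty once per PDF are all assumptions made for the example. The `required_score` default of 5.0 mirrors SpamAssassin's stock setting, and 10.0 is the pdf.cf default mentioned above.

```python
import subprocess
from typing import Callable, Iterable


def extract_pdf_text(pdf_path: str) -> str:
    """Extract text with pdftotext; the '-' argument sends output to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_image_text(image_path: str) -> str:
    """Run gocr on an image file (assumed to be already extracted from the PDF)."""
    result = subprocess.run(
        ["gocr", "-i", image_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def pdf_penalty(pdf_score: float, required_score: float = 5.0,
                pdf_cf_score: float = 10.0) -> float:
    """Return the pdf.cf score if the PDF scores higher than required_score, else 0."""
    return pdf_cf_score if pdf_score > required_score else 0.0


def score_message(base_score: float,
                  pdf_texts: Iterable[str],
                  score_text: Callable[[str], float],
                  required_score: float = 5.0,
                  pdf_cf_score: float = 10.0) -> float:
    """Add the penalty to the message score for each PDF judged to be spam.

    score_text is a hypothetical stand-in for running text through SpamAssassin.
    Applying the penalty once per PDF is an assumption of this sketch.
    """
    total = base_score
    for text in pdf_texts:
        total += pdf_penalty(score_text(text), required_score, pdf_cf_score)
    return total
```

For instance, with a message that scores 1.0 on its own and carries one PDF whose extracted text scores 8.0, `score_message(1.0, [text], scorer)` would return 11.0 under these assumptions, pushing the message well past the threshold.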
This approach departs from the usual method: it runs the extracted content through the SpamAssassin engine rather than matching it against a word-list filter.
To install the module, download it from: http://atmail.com/members/Pdf.tgz
Installation directions are in the README file inside the archive.
PDFassassin forum: http://forum.atmail.com/viewforum.php?id=10