Improvising the Malware Detection Accuracy in Portable Document Format (PDFs) through Machine Learning Classifiers

Authors

  • Muhammad Ahmad Shahid, Department of Computer Science, Government College University, Lahore, Pakistan
  • Muhammad Safyan, Department of Computer Science, Government College University, Lahore, Pakistan
  • Zeeshan Pervez, Department of Computer Science, University of Wolverhampton, UK

DOI:

https://doi.org/10.47067/ramss.v7i4.373

Keywords:

Malware detection, supervised learning, feature extraction, PDF

Abstract

Each spike in cyber-attacks brings heavy financial losses, which by some estimates have surpassed $2 trillion. Beyond the monetary setbacks, the damage to credibility and quality of service is far more difficult to recover. PDF documents have long been considered an innocuous means of sharing data, carrying only static meta-tags. However, the structure of a PDF file contains objects, features and characteristics that make it easier for attackers to intrude by attaching malicious code to tags. This research analyzes the structure of PDF files and the data structures that are exploited, and extracts feature tags with their associated feature values. A total of 17,884 documents were used, split evenly into two categories: 8,942 malicious documents and 8,942 benign files.
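As a rough illustration of what extracting feature tags with associated values can look like, the following minimal Python sketch counts occurrences of a few PDF name tags in a file's raw bytes. The tag list, the function name `count_tags` and the sample bytes are assumptions chosen for illustration; the abstract does not state which tags or tools the authors used.

```python
import re

# Hypothetical tag list for illustration only; the study's actual
# feature set is not specified in the abstract.
TAGS = [b"/JS", b"/JavaScript", b"/OpenAction", b"/AA",
        b"/Launch", b"/EmbeddedFile", b"/Page", b"/ObjStm"]


def count_tags(data: bytes) -> dict:
    """Return a {tag: count} feature mapping for the raw bytes of one PDF."""
    counts = {}
    for tag in TAGS:
        # Reject matches followed by a letter or digit so that
        # "/Page" does not also count "/Pages".
        pattern = re.escape(tag) + rb"(?![A-Za-z0-9])"
        counts[tag.decode()] = len(re.findall(pattern, data))
    return counts


# Tiny hand-written fragment standing in for a real file.
sample = (b"1 0 obj << /Type /Catalog /Pages 2 0 R /OpenAction 3 0 R >> endobj "
          b"3 0 obj << /S /JavaScript /JS (app.alert(1)) >> endobj")
features = count_tags(sample)
```

Each document then becomes one row of tag counts that a supervised classifier can consume. A raw byte scan like this does not see tags hidden inside compressed object streams or written with hex-escaped names, so it should be read as a starting point rather than a complete extractor.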

Published

2024-10-15

How to Cite

Shahid, M. A., Safyan, M., & Pervez, Z. (2024). Improvising the Malware Detection Accuracy in Portable Document Format (PDFs) through Machine Learning Classifiers. Review of Applied Management and Social Sciences, 7(4), 201-221. https://doi.org/10.47067/ramss.v7i4.373