University Presentation Showcase: Undergraduate Division
MapReduce based string-matching algorithms for malicious domain detection
Presenter Hometown
Manchester, KY
Major
Computer Science
Department
Computer Science
Degree
Undergraduate
Mentor
Dae Wook Kim
Mentor Department
Computer Science
Recommended Citation
Roberts, Allen C. and Rice, Trevor D., "MapReduce based string-matching algorithms for malicious domain detection" (2021). University Presentation Showcase Event. 25.
https://encompass.eku.edu/swps/2021/undergraduate/25
Abstract
The existence of malicious websites scattered throughout the Internet calls for a method to prevent network users from involuntarily accessing them. Every day, there are innocent users that come into contact with malicious websites and material, some without even realizing they have. In order to protect the users on a given network, an intrusion detection system (IDS) that performs well in real-time constraints and determines if a domain is malicious must be created. The signature-based IDS system needs to accept a domain as input, compare it to potentially millions of malicious domains in the database to see if the domain exists in the malicious domain set. However, computing this measure for massive malicious domain dataset is expensive in both memory and time. To handle heavy computation cost, distributed computing methods such as MapReduce need to be developed to make the string-matching algorithms scalable. In this study, we present a prototype implementation of MapReduce-based system which provides the performance comparison of the common string matching algorithms including Naïve string-matching, Levenshtein distance, Hamming distance, Rabin-Karp, and Knuth-Morris-Pratt algorithm for distributed computations on large scale data. Our prototype implementation shows the efficiency of run times to detect a malicious domain from a dataset of publicly accessible domains containing 10,000 records.
Presentation format
Poster
MapReduce based string-matching algorithms for malicious domain detection
The existence of malicious websites scattered throughout the Internet calls for a method to prevent network users from involuntarily accessing them. Every day, there are innocent users that come into contact with malicious websites and material, some without even realizing they have. In order to protect the users on a given network, an intrusion detection system (IDS) that performs well in real-time constraints and determines if a domain is malicious must be created. The signature-based IDS system needs to accept a domain as input, compare it to potentially millions of malicious domains in the database to see if the domain exists in the malicious domain set. However, computing this measure for massive malicious domain dataset is expensive in both memory and time. To handle heavy computation cost, distributed computing methods such as MapReduce need to be developed to make the string-matching algorithms scalable. In this study, we present a prototype implementation of MapReduce-based system which provides the performance comparison of the common string matching algorithms including Naïve string-matching, Levenshtein distance, Hamming distance, Rabin-Karp, and Knuth-Morris-Pratt algorithm for distributed computations on large scale data. Our prototype implementation shows the efficiency of run times to detect a malicious domain from a dataset of publicly accessible domains containing 10,000 records.