Dae Wook Kim
The existence of malicious websites scattered throughout the Internet calls for a method to prevent network users from involuntarily accessing them. Every day, there are innocent users that come into contact with malicious websites and material, some without even realizing they have. In order to protect the users on a given network, an intrusion detection system (IDS) that performs well in real-time constraints and determines if a domain is malicious must be created. The signature-based IDS system needs to accept a domain as input, compare it to potentially millions of malicious domains in the database to see if the domain exists in the malicious domain set. However, computing this measure for massive malicious domain dataset is expensive in both memory and time. To handle heavy computation cost, distributed computing methods such as MapReduce need to be developed to make the string-matching algorithms scalable. In this study, we present a prototype implementation of MapReduce-based system which provides the performance comparison of the common string matching algorithms including Naïve string-matching, Levenshtein distance, Hamming distance, Rabin-Karp, and Knuth-Morris-Pratt algorithm for distributed computations on large scale data. Our prototype implementation shows the efficiency of run times to detect a malicious domain from a dataset of publicly accessible domains containing 10,000 records.