BitShred: Feature Hashing Malware for Scalable Triage and Semantic Analysis
Cited by 46 (2 self)
The sheer volume of new malware found each day is growing at an exponential pace. This growth has created a need for automatic malware triage techniques that determine what malware is similar, what malware is unique, and why. In this paper, we present BitShred, a system for large-scale malware similarity analysis and clustering, and for automatically uncovering semantic inter- and intra-family relationships within clusters. The key idea behind BitShred is using feature hashing to dramatically reduce the high-dimensional feature spaces that are common in malware analysis. Feature hashing also allows us to mine correlated features between malware families and samples using co-clustering techniques. Our evaluation shows that BitShred speeds up typical malware triage tasks by up to 2,365x and uses up to 82x less memory on a single CPU, all with comparable accuracy to previous approaches. We also develop a parallelized version of BitShred, and demonstrate scalability within the Hadoop framework.
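The feature-hashing idea in this abstract can be illustrated with a minimal sketch. The abstract does not say which features are hashed or how similarity is scored, so the byte n-grams, the MD5-based index, the 32,768-bit vector width and the Jaccard comparison below are all assumptions made for illustration, not a description of the BitShred implementation. The function names `fingerprint` and `jaccard` are hypothetical.

```python
import hashlib


def fingerprint(data: bytes, n: int = 4, num_bits: int = 1 << 15) -> int:
    """Hash every byte n-gram of `data` to one position in a fixed-width bit vector.

    The bit vector is stored as a Python int. The n-gram size and the
    vector width are illustrative assumptions.
    """
    bits = 0
    for i in range(len(data) - n + 1):
        digest = hashlib.md5(data[i:i + n]).digest()
        index = int.from_bytes(digest[:8], "big") % num_bits
        bits |= 1 << index
    return bits


def jaccard(a: int, b: int) -> float:
    """Jaccard index of two bit vectors: shared set bits over all set bits."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # two empty fingerprints are treated as identical
    return bin(a & b).count("1") / union
```

Whatever the number of distinct n-grams in a sample, its fingerprint occupies `num_bits` bits, which is how hashing bounds the size of an otherwise high-dimensional feature space. Comparing two samples then costs a few bitwise operations.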
McBoost: Boosting Scalability in Malware Collection and Analysis Using Statistical Classification of Executables
Cited by 24 (0 self)
In this work, we propose Malware Collection Booster (McBoost), a fast statistical malware detection tool that is intended to improve the scalability of existing malware collection and analysis approaches. Given a large collection of binaries that may contain both hitherto unknown malware and benign executables, McBoost reduces the overall time of analysis by classifying and filtering out the least suspicious binaries and passing only the most suspicious ones to a detailed binary analysis process for signature extraction. The McBoost framework consists of a classifier specialized in detecting whether an executable is packed or not, a universal unpacker based on dynamic binary analysis, and a classifier specialized in distinguishing between malicious or benign code. We developed a proof-of-concept version of McBoost and evaluated it on 5,586 malware and 2,258 benign programs. McBoost has an accuracy of 87.3%, and an Area Under the ROC curve (AUC) equal to 0.977. Our evaluation also shows that McBoost reduces the overall time of analysis to only a fraction (e.g., 13.4%) of the computation time that would otherwise be required to analyze large sets of mixed malicious and benign executables.
PE-Miner: Mining Structural Information to Detect Malicious Executables in Realtime
- in: Proceedings of the 12th International Symposium on Recent Advances in Intrusion Detection, 2009
Cited by 15 (2 self)
In this paper, we present an accurate and realtime PE-Miner framework that automatically extracts distinguishing features from portable executables (PE) to detect zero-day (i.e. previously unknown) malware. The distinguishing features are extracted using the structural information standardized by the Microsoft Windows operating system for executables, DLLs and object files. We follow a threefold research methodology: (1) identify a set of structural features for PE files which is computable in realtime, (2) use an efficient preprocessor for removing redundancy in the feature set, and (3) select an efficient data mining algorithm for final classification between benign and malicious executables. We have evaluated PE-Miner on two malware collections, the VX Heavens and Malfease datasets, which contain about 11 and 5 thousand malicious PE files respectively. The results of our experiments show that PE-Miner achieves more than 99% detection rate with less than 0.5% false alarm rate for distinguishing between benign and malicious executables. PE-Miner has low processing overheads and takes only 0.244 seconds on average to scan a given PE file. Finally, we evaluate the robustness and reliability of PE-Miner under several regression tests. Our results show that the extracted features are robust to different packing techniques and that PE-Miner is also resilient to the majority of crafty evasion strategies.
Opcode sequences as representation of executables for data-mining-based unknown malware detection
- Information Sciences 227, 2013
Cited by 12 (0 self)
Malware can be defined as any type of malicious code that has the potential to harm a computer or network. The volume of malware is growing faster every year and poses a serious global security threat. Consequently, malware detection has become a critical topic in computer security. Currently, signature-based detection is the most widespread method used in commercial antivirus. In spite of the broad use of this method, it can detect malware only after the malicious executable has already caused damage and provided the malware is adequately documented. Therefore, the signature-based method consistently fails to detect new malware. In this paper, we propose a new method to detect unknown malware families. This model is based on the frequency of the appearance of opcode sequences. Furthermore, we describe a technique to mine the relevance of each opcode and assess the frequency of each opcode sequence. In addition, we provide empirical validation that this new method is capable of detecting unknown malware.
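As a rough illustration of representing an executable by the frequency of its opcode sequences, the sketch below counts opcode n-grams over a list of mnemonics. It assumes the disassembly has already been done elsewhere and leaves out the relevance weighting the abstract mentions. The function name `opcode_sequence_frequencies` and the default sequence length of 2 are hypothetical choices, not taken from the paper.

```python
from collections import Counter
from typing import Dict, List, Tuple


def opcode_sequence_frequencies(opcodes: List[str], n: int = 2) -> Dict[Tuple[str, ...], float]:
    """Return the relative frequency of each length-n opcode sequence.

    `opcodes` is assumed to be the mnemonic stream of an already
    disassembled executable, in program order.
    """
    sequences = [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]
    if not sequences:
        return {}
    counts = Counter(sequences)
    total = len(sequences)
    return {seq: count / total for seq, count in counts.items()}
```

The resulting dictionary can serve as a sparse feature vector for a standard classifier.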
Malware detection using statistical analysis of byte-level file content
- In ACM SIGKDD Workshop CyberSecurity and Intelligence Informatics, 2009
Cited by 11 (0 self)
Commercial anti-virus software is unable to provide protection against newly launched (a.k.a. “zero-day”) malware. In this paper, we propose a novel malware detection technique which is based on the analysis of byte-level file content. The novelty of our approach, compared with existing content-based mining schemes, is that it does not memorize specific byte-sequences or strings appearing in the actual file content. Our technique is non-signature based and therefore has the potential to detect previously unknown and zero-day malware. We compute a wide range of statistical and information-theoretic features in a block-wise manner to quantify the byte-level file content. We leverage standard data mining algorithms to classify the file content of every block as normal or potentially malicious. Finally, we correlate the block-wise classification results of a given file to categorize it as benign or malware. Since the proposed scheme operates on byte-level file content, it does not require any a priori information about the filetype. We have tested our proposed technique using a benign dataset comprising six different filetypes (DOC, EXE, JPG, MP3, PDF and ZIP) and a malware dataset comprising six different malware types (backdoor, trojan, virus, worm, constructor and miscellaneous). We also perform a comparison with existing data mining based malware detection techniques. The results of our experiments show that the proposed non-signature based technique surpasses the existing techniques and achieves more than 90% detection accuracy.
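One way to picture a block-wise, information-theoretic feature is the Shannon entropy of each fixed-size block of a file. The abstract does not list the specific features used, so entropy is chosen here only as a familiar example. The 1024-byte block size and the function names are likewise assumptions.

```python
import math
from collections import Counter
from typing import List


def shannon_entropy(block: bytes) -> float:
    """Shannon entropy of a byte block, in bits per byte (0.0 to 8.0)."""
    if not block:
        return 0.0
    total = len(block)
    return -sum((c / total) * math.log2(c / total) for c in Counter(block).values())


def blockwise_entropy(data: bytes, block_size: int = 1024) -> List[float]:
    """Split `data` into consecutive blocks and return one entropy value per block."""
    return [shannon_entropy(data[i:i + block_size]) for i in range(0, len(data), block_size)]
```

Because the computation works on raw bytes, it needs no knowledge of the file type, which matches the property the abstract highlights.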
A static, packer-agnostic filter to detect similar malware samples
- 2010
Cited by 8 (4 self)
The steadily increasing number of malware variants is becoming a significant problem, clogging the input queues of automated analysis tools and polluting malware repositories. The generation of malware variants is made easy by automatic packers and polymorphic engines, which can produce many distinct versions of a single executable using compression and encryption. Malware analysis tools and repositories rely on executable digests (hashes) for indexing malware programs and discarding duplicates. Unfortunately, these executable digests are different for each malware variant. Thus, a great deal of time and resources are wasted by analyzing, running, and storing numerous instances of almost identical programs. To address this problem, we require a more robust similarity measure that can quickly identify and filter these variants, avoiding repeated (costly) analyses that provide no additional insights to a malware analyst. In this paper, we present a robust filter to quickly determine when a malware program is similar to a previously-seen sample. Compared to previous work, our similarity measure is efficient because it does not require the costly task of preliminary unpacking, but instead, operates directly on packed code. Our approach exploits the fact that current packers use compression and weak encryption schemes that do not break all connections between the original programs and their transformed version (that is, some indicators of similarity between two original programs can still be extracted from their packed version). In addition, we introduce a packer detection technique that is able to distinguish between different levels of protection, such as unpacked, compressed, encrypted, and multi-layer encrypted code. This allows us to configure (optimize) the sensitivity parameter for the similarity computation. We performed experiments on a large malware repository containing 795 thousand samples. Our results show that the similarity measure is highly effective in filtering out malware variants obtained by simple re-packing or re-encryption, and can reduce the number of samples that need to be analyzed by a factor of three to five.
A close look on N-grams in intrusion detection: Anomaly detection vs. classification
- in Proceedings of the 2013 ACM Workshop on Artificial Intelligence and Security, ser. AISec ’13
Cited by 4 (0 self)
Detection methods based on n-gram models have been widely studied for the identification of attacks and malicious software. These methods usually build on one of two learning schemes: anomaly detection, where a model of normality is constructed from n-grams, or classification, where a discrimination between benign and malicious n-grams is learned. Although successful in many security domains, previous work falls short of explaining why a particular scheme is used and more importantly what renders one favorable over the other for a given type of data. In this paper we provide a close look on n-gram models for intrusion detection. We specifically study anomaly detection and classification using n-grams and develop criteria for data being used in one or the other scheme. Furthermore, we apply these criteria in the scope of web intrusion detection and empirically validate their effectiveness with different learning-based detection methods for client-side and service-side attacks.
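The anomaly-detection scheme described in this abstract, a model of normality built from n-grams, can be sketched in a few lines. This is a generic illustration under assumed choices (byte 3-grams, a plain set as the model, and the fraction of unseen n-grams as the score), not the method evaluated in the paper. All function names are hypothetical.

```python
from typing import Iterable, Set


def ngrams(data: bytes, n: int = 3) -> Set[bytes]:
    """Set of distinct byte n-grams occurring in `data`."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def train_normal_model(samples: Iterable[bytes], n: int = 3) -> Set[bytes]:
    """Model of normality: every n-gram observed in the benign training samples."""
    model: Set[bytes] = set()
    for sample in samples:
        model |= ngrams(sample, n)
    return model


def anomaly_score(data: bytes, model: Set[bytes], n: int = 3) -> float:
    """Fraction of the n-grams in `data` that were never seen during training."""
    grams = ngrams(data, n)
    if not grams:
        return 0.0
    return len(grams - model) / len(grams)
```

A classification scheme would instead need labelled malicious samples as well, so that n-grams from both classes can be weighed against each other.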
Collective classification for packed executable identification
- In ACM CEAS, 2011
Cited by 3 (0 self)
Malware is any software designed to harm computers. Commercial anti-virus are based on signature scanning, which is a technique effective only when the malicious executables have been previously analysed and identified. Malware writers employ several techniques in order to hide their actual behaviour. Executable packing consists in encrypting or hiding the real payload of the executable. Generic unpacking techniques do not depend on the packer used, as they execute the binary within an isolated environment (namely ‘sandbox’) to gather the real code of the packed executable. However, this approach is slow and, therefore, a filter step is required to determine when an executable has been packed. To this end, supervised machine learning approaches trained with static features from the executables have been proposed. Notwithstanding, supervised learning methods need the identification and labelling of a high number of packed and not packed executables. In this paper, we propose a new method for packed executable detection that adopts a collective learning approach to reduce the labelling requirements of completely supervised approaches. We performed an empirical validation demonstrating that the system maintains a high accuracy rate while the labelling efforts are lower than when using supervised learning.
Lines of Malicious Code: Insights Into the Malicious Software Industry
Cited by 3 (0 self)
Malicious software installed on infected computers is a fundamental component of online crime. Malware development thus plays an essential role in the underground economy of cyber-crime. Malware authors regularly update their software to defeat defenses or to support new or improved criminal business models. A large body of research has focused on detecting malware, defending against it and identifying its functionality. In addition to these goals, however, the analysis of malware can provide a glimpse into the software development industry that develops malicious code. In this work, we present techniques to observe the evolution of a malware family over time. First, we develop techniques to compare versions of malicious code and quantify their differences. Furthermore, we use behavior observed from dynamic analysis to assign semantics to binary code and to identify functional components within a malware binary. By combining these techniques, we are able to monitor the evolution of a malware’s functional components. We implement these techniques in a system we call BEAGLE, and apply it to the observation of 16 malware strains over several months. The results of these experiments provide insight into the effort involved in updating malware code, and show that BEAGLE can identify changes to individual malware components.
SigMal: A Static Signal Processing Based Malware Triage
- 2013
Cited by 2 (2 self)
In this work, we propose SigMal, a fast and precise malware detection framework based on signal processing techniques. SigMal is designed to operate with systems that process large amounts of binary samples. It has been observed that many samples received by such systems are variants of previously-seen malware, and they retain some similarity at the binary level. Previous systems used this notion of malware similarity to detect new variants of previously-seen malware. SigMal improves the state-of-the-art by leveraging techniques borrowed from signal processing to extract noise-resistant similarity signatures from the samples. SigMal uses an efficient nearest-neighbor search technique, which is scalable to millions of samples. We evaluate SigMal on 1.2 million recent samples, both packed and unpacked, observed over a duration of three months. In addition, we also used a constant dataset of known benign executables. Our results show that SigMal can classify 50% of the recent incoming samples with above 99% precision. We also show that SigMal could have detected, on average, 70 malware samples per day before any antivirus vendor detected them.