Automated classification and analysis of internet malware
- In Proceedings of Recent Advances in Intrusion Detection (RAID’07), 2007
Cited by 41 (5 self)
Numerous attacks, such as worms, phishing, and botnets, threaten the availability of the Internet, the integrity of its hosts, and the privacy of its users. A core element of defense against these attacks is anti-virus (AV) software, a service that detects, removes, and characterizes these threats. The ability of these products to successfully characterize these threats has far-reaching effects, from facilitating sharing across organizations, to detecting the emergence of new threats, to assessing risk in quarantine and cleanup. In this paper, we examine the ability of existing host-based anti-virus products to provide semantically meaningful information about the malicious software and tools (or malware) used by attackers. Using a large, recent collection of malware that spans a variety of attack vectors (e.g., spyware, worms, spam), we show that different AV products characterize malware in ways that are inconsistent across AV products, incomplete across malware, and that fail to be concise in their semantics. To address these limitations, we propose a new classification technique that describes malware behavior in terms of system state changes (e.g., files written, processes created) rather than in sequences or patterns of system calls. To address the sheer volume of malware and diversity of its behavior, we provide a method for automatically categorizing these profiles of malware into groups that reflect similar classes of behaviors and demonstrate how behavior-based clustering provides a more direct and effective way of classifying and analyzing Internet malware.
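The idea of describing each sample by its system state changes and then grouping similar profiles can be sketched as follows. This is a minimal illustration, not the paper's method: the abstract does not name a distance measure or clustering algorithm, so Jaccard distance and threshold-based single-linkage grouping are assumptions chosen for brevity, and the sample names and state-change strings are hypothetical.

```python
from typing import Dict, FrozenSet, List

# A behavior profile is modeled here as the set of state changes a sample
# caused while running (hypothetical string encoding, not from the paper).
Profile = FrozenSet[str]


def jaccard_distance(a: Profile, b: Profile) -> float:
    """Return 1 - |a & b| / |a | b|; 0.0 means identical profiles."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_profiles(profiles: Dict[str, Profile], threshold: float) -> List[List[str]]:
    """Group samples so that any pair within `threshold` shares a cluster.

    This is single-linkage grouping implemented with union-find; it is an
    assumed stand-in for whatever clustering the authors actually used.
    """
    names = sorted(profiles)
    parent = {name: name for name in names}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard_distance(profiles[a], profiles[b]) <= threshold:
                parent[find(a)] = find(b)

    clusters: Dict[str, List[str]] = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


# Hypothetical profiles: two variants that drop and run the same file,
# and one unrelated sample.
profiles = {
    "sample_a": frozenset({"file_write:x.exe", "proc_create:x.exe", "reg_set:Run"}),
    "sample_b": frozenset({"file_write:x.exe", "proc_create:x.exe"}),
    "sample_c": frozenset({"net_connect:irc", "proc_create:bot.exe"}),
}
clusters = cluster_profiles(profiles, threshold=0.5)
```

With these toy profiles, `sample_a` and `sample_b` share two of three state changes and fall into one group, while `sample_c` stays alone. The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch but not for large collections.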
Learning and Classification of Malware Behavior
- In Fifth Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA 08), 2008
Cited by 20 (2 self)
Malicious software in the form of Internet worms, computer viruses, and Trojan horses poses a major threat to the security of networked systems. The diversity and amount of its variants severely undermine the effectiveness of classical signature-based detection. Yet variants of malware families share typical behavioral patterns reflecting their origin and purpose. We aim to exploit these shared patterns for classification of malware and propose a method for learning and discrimination of malware behavior. Our method proceeds in three stages: (a) behavior of collected malware is monitored in a sandbox environment, (b) based on a corpus of malware labeled by an anti-virus scanner, a malware behavior classifier is trained using learning techniques, and (c) discriminative features of the behavior models are ranked for explanation of classification decisions. Experiments with different heterogeneous test data collected over several months using honeypots demonstrate the effectiveness of our method, especially in detecting novel instances of malware families previously not recognized by commercial anti-virus software.
Language Models for Detection of Unknown Attacks in Network Traffic
- 2006
Cited by 18 (7 self)
In this paper we propose a method for network intrusion detection based on language models. Our method proceeds by extracting language features such as n-grams and words from connection payloads and applying unsupervised anomaly detection – without prior learning phase or presence of labeled data. The essential part of this procedure is linear-time computation of similarity measures between language models of connection payloads. Particular patterns in these models decisive for differentiation of attacks and normal data can be traced back to attack semantics and utilized for automatic generation of attack signatures. Results of experiments conducted on two datasets of network traffic demonstrate the importance of higher-order n-grams and variable-length language models for detection of unknown network attacks. An implementation of our system achieved detection accuracy of over 80 % with no false positives on instances of recent remote-to-local attacks in HTTP, FTP and SMTP traffic.
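To make the notion of comparing language models of payloads concrete, here is a small sketch that extracts byte n-grams and compares two payloads. The cosine distance and the hash-map representation are assumptions for illustration; the abstract does not say which similarity measures or data structures the authors use, and the example payloads are made up.

```python
from collections import Counter
from math import sqrt


def ngram_counts(payload: bytes, n: int) -> Counter:
    """Count every contiguous n-byte substring of the payload."""
    return Counter(payload[i:i + n] for i in range(len(payload) - n + 1))


def cosine_distance(p: Counter, q: Counter) -> float:
    """Return 1 - cosine similarity of two n-gram count vectors.

    Only n-grams present in both payloads contribute to the dot product,
    so we iterate over the smaller table and look up in the larger one.
    """
    if len(p) > len(q):
        p, q = q, p
    dot = sum(count * q[gram] for gram, count in p.items())
    norm = sqrt(sum(c * c for c in p.values())) * sqrt(sum(c * c for c in q.values()))
    return 1.0 - dot / norm if norm else 1.0


# Hypothetical payloads: two ordinary requests and one with a long filler run.
normal = ngram_counts(b"GET /index.html HTTP/1.1", 3)
similar = ngram_counts(b"GET /about.html HTTP/1.1", 3)
odd = ngram_counts(b"GET /" + b"\x90" * 40, 3)

d_similar = cosine_distance(normal, similar)
d_odd = cosine_distance(normal, odd)
```

Here `d_similar` comes out well below `d_odd`, since the two ordinary requests share most of their 3-grams. An anomaly detector could flag payloads whose distance to typical traffic is unusually large; how that threshold is set is outside this sketch.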
Scalable, Behavior-Based Malware Clustering
Cited by 17 (3 self)
Anti-malware companies receive thousands of malware samples every day. To process this large quantity, a number of automated analysis tools were developed. These tools execute a malicious program in a controlled environment and produce reports that summarize the program’s actions. Of course, the problem of analyzing the reports still remains. Recently, researchers have started to explore automated clustering techniques that help to identify samples that exhibit similar behavior. This allows an analyst to discard reports of samples that have been seen before, while focusing on novel, interesting threats. Unfortunately, previous techniques do not scale well and frequently fail to generalize the observed activity well enough to recognize related malware. In this paper, we propose a scalable clustering approach to identify and group malware samples that exhibit similar behavior. For this, we first perform dynamic analysis to obtain the execution traces of malware programs. These execution traces are then generalized into behavioral profiles, which characterize the activity of a program in more abstract terms. The profiles serve as input to an efficient clustering algorithm that allows us to handle sample sets that are an order of magnitude larger than previous approaches. We have applied our system to real-world malware collections. The results demonstrate that our technique is able to recognize and group malware programs that behave similarly, achieving a better precision than previous approaches. To underline the scalability of the system, we clustered a set of more than 75 thousand samples in less than three hours.
Large-Scale Malware Indexing Using Function-Call Graphs
Cited by 9 (0 self)
A major challenge of the anti-virus (AV) industry is how to effectively process the huge influx of malware samples they receive every day. One possible solution to this problem is to quickly determine if a new malware sample is similar to any previously-seen malware program. In this paper, we design, implement and evaluate a malware database management system called SMIT (Symantec Malware Indexing Tree) that can efficiently make such determination based on malware’s function-call graphs, which is a structural representation known to be less susceptible to instruction-level obfuscations commonly employed by malware writers to evade detection of AV software. Because each malware program is represented as a graph, the problem of searching for the most similar malware program in a database to a given malware sample is cast into a nearest-neighbor search problem in a graph database. To speed
Apolo: Making Sense of Large Network Data by Combining Rich User Interaction and Machine Learning
Cited by 6 (3 self)
Extracting useful knowledge from large network datasets has become a fundamental challenge in many domains, from scientific literature to social networks and the web. We introduce Apolo, a system that uses a mixed-initiative approach, combining visualization, rich user interaction and machine learning, to guide the user to incrementally and interactively explore large network data and make sense of it. Apolo engages the user in bottom-up sensemaking to gradually build up an understanding over time by starting small, rather than starting big and drilling down. Apolo also helps users find relevant information by specifying exemplars, and then using a machine learning method called Belief Propagation to infer which other nodes may be of interest. We evaluated Apolo with twelve participants in a between-subjects study, with the task being to find relevant new papers to update an existing survey paper. Using expert judges, participants using Apolo found significantly more relevant papers. Subjective feedback of Apolo was also very positive.
Improving Malware Detection by Applying Multi-Inducer Ensemble
Cited by 3 (3 self)
Detection of malicious software (malware) using machine learning methods has been explored extensively to enable fast detection of newly released malware. The performance of these classifiers depends on the induction algorithms being used. In order to benefit from multiple different classifiers and exploit their strengths, we suggest using an ensemble method that will combine the results of the individual classifiers into one final result to achieve overall higher detection accuracy. In this paper we evaluate several combining methods using five different base inducers (C4.5 Decision Tree, Naïve Bayes, KNN, VFI and OneR) on five malware datasets. The main goal is to find the best combining method for the task of detecting malicious files in terms of accuracy, AUC and execution time.
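As an illustration of what combining individual classifiers into one final result can look like, the sketch below shows two common combiners: majority voting over hard labels and averaging of predicted probabilities. The abstract does not list which combining methods were evaluated, so these two are assumptions, and the toy rule-based classifiers and feature vectors are hypothetical stand-ins for trained inducers.

```python
from collections import Counter
from typing import Callable, List, Sequence

# A base classifier maps a feature vector to a label: 1 = malicious, 0 = benign.
Classifier = Callable[[Sequence[float]], int]
# A probabilistic classifier returns the estimated probability of "malicious".
ProbClassifier = Callable[[Sequence[float]], float]


def majority_vote(classifiers: List[Classifier], features: Sequence[float]) -> int:
    """Label a sample malicious only if a strict majority votes for it.

    Ties (possible with an even number of classifiers) resolve to benign.
    """
    votes = Counter(clf(features) for clf in classifiers)
    return 1 if 2 * votes[1] > len(classifiers) else 0


def average_probability(classifiers: List[ProbClassifier],
                        features: Sequence[float],
                        cutoff: float = 0.5) -> int:
    """Average the per-classifier probabilities and compare with a cutoff."""
    mean = sum(clf(features) for clf in classifiers) / len(classifiers)
    return 1 if mean >= cutoff else 0


# Toy threshold rules standing in for trained base inducers.
toy_classifiers: List[Classifier] = [
    lambda x: int(x[0] > 0.5),
    lambda x: int(x[1] > 0.5),
    lambda x: int(x[0] + x[1] > 1.2),
]
toy_prob_classifiers: List[ProbClassifier] = [
    lambda x: x[0],
    lambda x: x[1],
]

vote_low = majority_vote(toy_classifiers, [0.9, 0.2])
vote_high = majority_vote(toy_classifiers, [0.9, 0.6])
avg_label = average_probability(toy_prob_classifiers, [0.9, 0.2])
```

For the first vector only one of three rules fires, so the vote is benign; for the second all three fire. Averaging can disagree with voting on the same input, which is one reason comparing combiners empirically is worthwhile.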
McBoost: Boosting Scalability in Malware Collection and Analysis Using Statistical Classification of Executables
Cited by 2 (0 self)
In this work, we propose Malware Collection Booster (McBoost), a fast statistical malware detection tool that is intended to improve the scalability of existing malware collection and analysis approaches. Given a large collection of binaries that may contain both hitherto unknown malware and benign executables, McBoost reduces the overall time of analysis by classifying and filtering out the least suspicious binaries and passing only the most suspicious ones to a detailed binary analysis process for signature extraction. The McBoost framework consists of a classifier specialized in detecting whether an executable is packed or not, a universal unpacker based on dynamic binary analysis, and a classifier specialized in distinguishing between malicious and benign code. We developed a proof-of-concept version of McBoost and evaluated it on 5,586 malware and 2,258 benign programs. McBoost has an accuracy of 87.3%, and an Area Under the ROC curve (AUC) equal to 0.977. Our evaluation also shows that McBoost reduces the overall time of analysis to only a fraction (e.g., 13.4%) of the computation time that would otherwise be required to analyze large sets of mixed malicious and benign executables.
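The three-component flow described above (packed-or-not classifier, unpacker, malicious-or-benign classifier) can be outlined as a simple filter. This is only a structural sketch under stated assumptions: the function names, the score threshold, and the toy stand-ins below are hypothetical and do not reflect McBoost's actual classifiers or unpacker.

```python
from typing import Callable, Iterable, List


def triage(binaries: Iterable[bytes],
           is_packed: Callable[[bytes], bool],
           unpack: Callable[[bytes], bytes],
           malice_score: Callable[[bytes], float],
           threshold: float) -> List[bytes]:
    """Return only the binaries worth sending to detailed analysis.

    Each binary is unpacked first if the packing classifier says so, then
    scored; binaries scoring below `threshold` are filtered out.
    """
    suspicious = []
    for binary in binaries:
        code = unpack(binary) if is_packed(binary) else binary
        if malice_score(code) >= threshold:
            suspicious.append(binary)
    return suspicious


# Toy stand-ins (assumptions for the demo, not real detection logic):
# a made-up marker denotes "packed", and a made-up byte string denotes "malicious".
MARKER = b"PACKED:"


def toy_is_packed(binary: bytes) -> bool:
    return binary.startswith(MARKER)


def toy_unpack(binary: bytes) -> bytes:
    return binary[len(MARKER):]


def toy_score(code: bytes) -> float:
    return 1.0 if b"evil" in code else 0.0


samples = [b"hello world", b"PACKED:evil payload", b"plain evil", b"PACKED:benign"]
flagged = triage(samples, toy_is_packed, toy_unpack, toy_score, threshold=0.5)
```

In this toy run two of the four samples are passed on, so the expensive analysis stage sees half the input. The reported time savings in the abstract come from the same principle: cheap classification up front, costly analysis only for what remains.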
Extracting Compiler Provenance from Program Binaries
Cited by 2 (2 self)
We present a novel technique that identifies the source compiler of program binaries, an important element of program provenance. Program provenance answers fundamental questions of malware analysis and software forensics, such as whether programs are generated by similar tool chains; it also can allow development of debugging, performance analysis, and instrumentation tools specific to particular compilers. We formulate compiler identification as a structured learning problem, automatically building models to recognize sequences of binary code generated by particular compilers. We evaluate our techniques on a large set of real-world test binaries, showing that our models identify the source compiler of binary code with over 90 % accuracy, even in the presence of interleaved code from multiple compilers. A case study demonstrates the use of inferred compiler provenance to augment stripped binary parsing, reducing parsing errors by 18%.
Polonium: Tera-Scale Graph Mining and Inference for Malware Detection
- SIAM International Conference on Data Mining (SDM), 2011
Cited by 2 (2 self)
We present Polonium, a novel Symantec technology that detects malware through large-scale graph inference. Based on the scalable Belief Propagation algorithm, Polonium infers every file’s reputation, flagging files with low reputation as malware. We evaluated Polonium with a billion-node graph constructed from the largest file submissions dataset ever published (60 terabytes). Polonium attained a high true positive rate of 87 % in detecting malware; in the field, Polonium lifted the detection rate of existing methods by 10 absolute percentage points. We detail Polonium’s design and implementation features instrumental to its success. Polonium has served 120 million people and helped answer more than one trillion queries for file reputation.

