Drift Forensics of Malware Classifiers

Repository containing code for our AISec23 paper:

Theo Chow, Zeliang Kan, Lorenz Linhardt, Lorenzo Cavallaro, Daniel Arp, and Fabio Pierazzi, Drift Forensics of Malware Classifiers, In Proc. of the ACM Workshop on Artificial Intelligence and Security (AISec), 2023

If you use this repository in your own research, please cite our AISec23 paper as follows:

@inproceedings{chow2023driftforensics,
  title = {Drift Forensics of Malware Classifiers},
  author = {Chow, Theo and Kan, Zeliang and Linhardt, Lorenz and Cavallaro, Lorenzo and Arp, Daniel and Pierazzi, Fabio},
  booktitle = {Proc. of the {ACM} Workshop on Artificial Intelligence and Security ({AISec})},
  year = {2023},
}

Link to dataset can be found Here

Getting Started

Installation

This project requires Python 3 and the statistical learning stack: NumPy, SciPy, scikit-learn, and secml.

First, install package dependencies using the listing in requirements.txt.

pip install -r requirements.txt
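After installing, a quick check that the packages named above can be found may save a failed run later. The snippet below is a minimal sketch, not part of the repository; the import names (in particular `sklearn` for scikit-learn) and the helper `missing_modules` are assumptions made for illustration.

```python
from importlib.util import find_spec

# Import names assumed for the packages listed above
# (scikit-learn is imported as "sklearn").
REQUIRED_MODULES = ["numpy", "scipy", "sklearn", "secml"]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```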

Run experiments

To reproduce the paper results, run

python paper_results.py

Load dataset

First, load the desired dataset to obtain the feature matrix X, the labels y, the timestamps t, the family labels f, the feature names feature_names, and the MD5 hashes md5.

PATH = "../Datasets/extended-features/"
X, y, t, f, feature_names, md5 = load_transcend(f"{PATH}extended-features-X-updated.json",
                                                f"{PATH}extended-features-y-updated.json",
                                                f"{PATH}extended-features-meta-updated.json",
                                                f"{PATH}meta_info_file.tsv")

Reduce feature space

Reduce the feature space to a manageable size and save the selected feature indexes to a pkl file.

X, feature_names = util.feature_reduction(X, y, feature_names, "pkl_files/feature_index_1000.pkl", feature_size=1000)
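To make the idea of feature reduction concrete, here is a small self-contained sketch. It is not the repository's util.feature_reduction: the selection criterion (absolute difference in per-class feature means) and the helper names `select_top_features` and `save_indexes` are assumptions chosen only for illustration, and the real function may rank features differently.

```python
import pickle


def select_top_features(X, y, feature_names, k):
    """Keep the k features whose mean differs most between classes 1 and 0.

    Illustrative criterion only; X is a list of equal-length rows.
    """
    n_features = len(feature_names)
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]

    def mean(rows, j):
        return sum(r[j] for r in rows) / len(rows) if rows else 0.0

    scores = [abs(mean(pos, j) - mean(neg, j)) for j in range(n_features)]
    # Rank by score (highest first), keep k, then restore column order.
    ranked = sorted(range(n_features), key=lambda j: -scores[j])
    keep = sorted(ranked[:k])
    X_reduced = [[row[j] for j in keep] for row in X]
    names_reduced = [feature_names[j] for j in keep]
    return X_reduced, names_reduced, keep


def save_indexes(indexes, path):
    """Persist the kept column indexes so the same reduction can be reapplied."""
    with open(path, "wb") as fh:
        pickle.dump(indexes, fh)
```

Saving the indexes rather than the reduced matrix means the same columns can be selected again on data loaded later.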

Dataset class

Put the data into a Dataset class, which gives us flexibility when selecting samples. Currently the class has two main functions: splitting the dataset into time-aware splits for analysis, and finding occurrences of features in the dataset.

dataset = Dataset(X, y, t, f, feature_names, md5)

Look up feature IDs by name

ids = dataset.get_feature_id_from_name("android")

Select samples by feature ID within the given families

dataset.sample_select_from_feature_id(families=['Dowgin', 'Dnotua', 'Kuguo', 'Airpush', 'Revmob'], ids=ids, contains=True, year=2015, month=1)
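The two lookups above can be pictured with a toy example. The sketch below is a guess at the behaviour, not the Dataset implementation: it assumes name lookup is a substring match, that a sample "contains" the IDs when any of those features is non-zero, and that timestamps are datetime objects. The function names `feature_ids_from_name` and `select_samples` are hypothetical.

```python
from datetime import datetime


def feature_ids_from_name(feature_names, query):
    """Return the indexes of all features whose name contains `query`."""
    return [i for i, name in enumerate(feature_names) if query in name]


def select_samples(X, f, t, families, ids, contains=True, year=None, month=None):
    """Return indexes of samples in `families` (optionally in one year/month)
    that do (contains=True) or do not (contains=False) have any feature in `ids`."""
    selected = []
    for idx, (row, family, stamp) in enumerate(zip(X, f, t)):
        if family not in families:
            continue
        if year is not None and stamp.year != year:
            continue
        if month is not None and stamp.month != month:
            continue
        has_feature = any(row[j] != 0 for j in ids)
        if has_feature == contains:
            selected.append(idx)
    return selected


# Toy data: three features, four samples.
names = ["android.permission.INTERNET", "java.lang.Runtime", "android.app.Activity"]
rows = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
fams = ["Dowgin", "Dowgin", "Kuguo", "Other"]
stamps = [datetime(2015, 1, 5), datetime(2015, 1, 9),
          datetime(2015, 2, 1), datetime(2015, 1, 3)]

android_ids = feature_ids_from_name(names, "android")
hits = select_samples(rows, fams, stamps, ["Dowgin", "Kuguo"], android_ids,
                      contains=True, year=2015, month=1)
```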

Split the dataset and return time-aware indexes for training and testing

train, test = dataset.time_aware_split_index('month', 6, 1)
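A time-aware split keeps every training sample earlier than every test sample, so the classifier is never evaluated on data older than what it was trained on. The sketch below illustrates this under the assumption that the arguments ('month', 6, 1) mean monthly granularity, a six-month training window, and one-month test windows; that reading, the function name `monthly_time_aware_split`, and the return format are assumptions rather than the repository's documented behaviour.

```python
from datetime import datetime


def month_index(stamp):
    """Map a datetime to a running month counter."""
    return stamp.year * 12 + (stamp.month - 1)


def monthly_time_aware_split(t, train_months, test_months):
    """Split sample indexes by time.

    Returns (train, tests): `train` holds the indexes that fall in the first
    `train_months` months; `tests` is a list of index lists, one per
    consecutive window of `test_months` months after the training period.
    """
    if not t:
        return [], []
    start = min(month_index(stamp) for stamp in t)
    train, windows = [], {}
    for idx, stamp in enumerate(t):
        offset = month_index(stamp) - start
        if offset < train_months:
            train.append(idx)
        else:
            window = (offset - train_months) // test_months
            windows.setdefault(window, []).append(idx)
    if not windows:
        return train, []
    # Empty windows are kept so that position in the list reflects elapsed time.
    tests = [windows.get(w, []) for w in range(max(windows) + 1)]
    return train, tests


timestamps = [datetime(2015, 1, 10), datetime(2015, 3, 2), datetime(2015, 6, 30),
              datetime(2015, 7, 1), datetime(2015, 7, 20), datetime(2015, 9, 5)]
train_idx, test_idx = monthly_time_aware_split(timestamps, 6, 1)
```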

Analysis

The Analysis class runs the experiments outlined in the paper. Currently, there are three main experiments: base, half, and snoop. Each run is logged in a MySQL database and its results are saved in a pkl file. By default, a directory named pkl_files needs to be created beforehand.
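Since pkl_files has to exist before any results are saved, one way to create it up front is sketched below. This assumes pkl_files is a directory relative to the current working directory, as the path pkl_files/feature_index_1000.pkl used earlier suggests; the helper name `ensure_output_dir` is hypothetical.

```python
from pathlib import Path


def ensure_output_dir(path="pkl_files"):
    """Create the output directory if it is missing and return it as a Path."""
    out = Path(path)
    out.mkdir(parents=True, exist_ok=True)
    return out
```

Calling `ensure_output_dir()` once before running an experiment is enough; calling it again is harmless.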

analyse = Analysis(X, y, t, f, feature_names, train, test)

training_family = ['Dowgin','Dnotua','Kuguo','Airpush','Revmob']
testing_family = ['Dowgin','Dnotua','Kuguo','Airpush','Revmob']

analyse.run(training_family=training_family, testing_family=testing_family, experiment='snoop', dataset='Transcend')

Visualizing data

To visualize the results, first load the data for the experiments of interest. The ResultsLoader() class provides an easy way to access saved experiments.

training_families = ['Dowgin', 'Dnotua', 'Kuguo', 'Airpush', 'Revmob']
testing_families = ['Dowgin', 'Dnotua', 'Kuguo', 'Airpush', 'Revmob']
ResultsLoader().query_database_for_ID('half', training_families, testing_families, 'Transcend')

Load in the desired data using the ID returned by ResultsLoader()

result1 = ResultsLoader().load_file_from_id(5)
result2 = ResultsLoader().load_file_from_id(6)

To produce the performance, distribution, and difference plots:

Viz(result1,result2).plot_performance_distribution()
Viz(result1,result2).plot_single('difference')
