Date Approved

2023

Degree Type

Open Access Senior Honors Thesis

Department or School

College of Engineering and Technology

First Advisor

Omar Darwish, Ph.D.

Second Advisor

James M. Banfield, Ph.D.

Third Advisor

Sean Che, Ph.D.

Abstract

The rapid growth of data science has given nearly every field the opportunity to leverage machine learning. Computer network traffic classification has traditionally relied on static, pre-written rules, which are easily rendered ineffective when the applications or protocols underlying a category of network traffic change, whether legitimately or not. This paper explores the problem of network traffic classification and analyzes the viability of performing it with a range of classical machine learning techniques that operate on significant statistical similarities between classes of network traffic rather than on traditional static traffic identifiers.

To accomplish this, network data was captured, processed, and evaluated for 10 application labels under the categories of video conferencing, video streaming, video gaming, and web browsing, as described later in Table 1. Flow-based statistical features for the dataset were derived from the network captures in accordance with the “Flow Data Feature Creation” section and were evaluated with nearest centroid, k-nearest neighbors, Gaussian naïve Bayes, support vector machine, decision tree, random forest, and multi-layer perceptron classifiers. Tools and techniques broadly available to organizations and enthusiasts were used. Observations were made on working with network data in a machine learning context, on the strengths and weaknesses of different models on such data, and on the overall efficacy of the tested models.
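As a rough illustration of what "flow-based statistical features" can look like, the sketch below groups packets into flows and summarizes their sizes. It is not the thesis's feature pipeline: the 5-tuple flow key, the packet record layout, the sample values, and the names `PACKETS` and `flow_features` are all assumptions made for this example. The actual features are those defined in the “Flow Data Feature Creation” section.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical packet records (not data from the thesis):
# (src_ip, dst_ip, src_port, dst_port, protocol, size_in_bytes)
PACKETS = [
    ("10.0.0.2", "93.184.216.34", 51000, 443, "TCP", 1500),
    ("10.0.0.2", "93.184.216.34", 51000, 443, "TCP", 60),
    ("10.0.0.2", "93.184.216.34", 51000, 443, "TCP", 900),
    ("10.0.0.2", "198.51.100.7", 40000, 3478, "UDP", 200),
    ("10.0.0.2", "198.51.100.7", 40000, 3478, "UDP", 240),
]


def flow_features(packets):
    """Group packets by an assumed 5-tuple key and summarize sizes per flow."""
    sizes_by_flow = defaultdict(list)
    for src, dst, sport, dport, proto, size in packets:
        sizes_by_flow[(src, dst, sport, dport, proto)].append(size)

    # One feature row per flow; a classifier would consume these rows.
    return {
        flow: {
            "packet_count": len(sizes),
            "min_size": min(sizes),
            "max_size": max(sizes),
            "mean_size": mean(sizes),
        }
        for flow, sizes in sizes_by_flow.items()
    }


features = flow_features(PACKETS)
for flow, stats in features.items():
    print(flow, stats)
```

Each resulting row describes a whole flow rather than a single packet, so a row can be labeled with an application and passed to any of the classifiers listed above.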

Ultimately, it was found that simple models freely available to anyone can achieve high accuracy, recall, and F1 scores in network traffic classification. The best-performing model, random forest, reached 89% accuracy, a macro-average F1 score of 0.77, and a macro-average recall of 76%. The feature most commonly associated with successful classification was related to the maximum packet size in a network flow.
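For readers unfamiliar with the reported metrics, the following minimal sketch computes accuracy, macro-average recall, and macro-average F1 from scratch. The labels and predictions are invented toy data, and `macro_scores` is a hypothetical helper written for this example, not code from the thesis. Macro averaging weights every class equally, so a weak minority class can pull the macro scores below overall accuracy. Whether that explains the gap in the figures above is not stated in the abstract.

```python
from statistics import mean


def macro_scores(y_true, y_pred):
    """Return (accuracy, macro-average recall, macro-average F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    recalls, f1_scores = [], []

    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)

        # Guard against division by zero for classes never predicted or never seen.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)

        recalls.append(recall)
        f1_scores.append(f1)

    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Macro average: unweighted mean over classes, regardless of class size.
    return accuracy, mean(recalls), mean(f1_scores)


# Toy, imbalanced example: 8 "web" flows and 2 "game" flows.
y_true = ["web"] * 8 + ["game"] * 2
y_pred = ["web"] * 8 + ["web", "game"]

accuracy, macro_recall, macro_f1 = macro_scores(y_true, y_pred)
print(f"accuracy={accuracy:.2f} macro_recall={macro_recall:.2f} macro_f1={macro_f1:.2f}")
```

In this toy case one of the two "game" flows is misclassified, so accuracy stays high while the macro-average recall drops, because the minority class counts as much as the majority class.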

Recommended Citation

Dunham, Isaac M., "Applying machine learning to categorize distinct categories of network traffic" (2023). Senior Honors Theses and Projects. 785.
https://commons.emich.edu/honors/785

Included in

Information Security Commons

Senior Honors Theses and Projects

Applying machine learning to categorize distinct categories of network traffic
