
System activity and network traffic to train a machine learning model

What?

As part of my MSc degree, I am studying how applicable and how useful machine learning is in cybersecurity. While almost every cybersecurity product uses machine learning to improve its detection capabilities, few products apply it to "active" containment and countermeasures at the network layer. A possible cause is a lack of confidence in the machine learning model, especially in an enterprise environment.

What if machine learning could detect anomalies in network traffic and apply active measures, from predicting an impending attack to containing one? As part of one of my dissertations, I am therefore trying to measure how useful machine learning is at the network layer (firewall / IPS / layers 4-7) and to find applications for it there. I am hoping for real-world datasets. I will try to combine network traffic with system activity to train the model; however, for now I am limiting the scope to the network layer.

How can you assist?

I am inviting anyone, in cybersecurity or otherwise, to try out their cracking abilities (go, red teamers, skiddies and crackers) on my IP: 116.72.137.140. It does not matter whether you use default tools or customised versions from your cracking arsenal. What matters is that I get traffic that I can feed to my machine learning model.

Many online scanners, such as shodan.io, carry out periodic scans. I want to use traffic from such scanners together with real-world attack attempts to mature the model, so that it can distinguish between an actual attack and a harmless scan whose results may later be used for an attack.
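As a rough illustration of the kind of per-source features that might help separate a broad scan from a focused attack attempt, here is a minimal sketch. It is not the pipeline used in this project: the `Packet` fields, the feature names and the sample traffic are all assumptions made up for the example, and the IP addresses come from the reserved documentation ranges.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """One observed packet. Hypothetical, simplified fields for illustration."""
    src_ip: str
    dst_port: int
    payload_bytes: int


def per_source_features(packets):
    """Summarise traffic per source IP into a small feature dictionary."""
    ports = defaultdict(set)
    payload = defaultdict(int)
    count = defaultdict(int)
    for p in packets:
        ports[p.src_ip].add(p.dst_port)
        payload[p.src_ip] += p.payload_bytes
        count[p.src_ip] += 1
    return {
        ip: {
            "distinct_ports": len(ports[ip]),          # many ports hints at a scan
            "packets": count[ip],
            "mean_payload": payload[ip] / count[ip],   # empty probes vs. real requests
        }
        for ip in count
    }


# Made-up sample: one source probing several ports, one sending data to port 80.
packets = [
    Packet("198.51.100.7", 22, 0),
    Packet("198.51.100.7", 80, 0),
    Packet("198.51.100.7", 443, 0),
    Packet("203.0.113.9", 80, 512),
    Packet("203.0.113.9", 80, 1024),
]
features = per_source_features(packets)
for ip, row in features.items():
    print(ip, row)
```

Features like these would only be a starting point; a trained model would decide how much weight each one deserves.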

I already have datasets from Symantec (I will miss the original Symantec :( ) and a few others from GitHub. I have been running 30+ honeypots on AWS, Azure, Google Cloud and Linode. However, honeypot traffic is limited to the service being emulated.

 

You may choose my IP to play around with your cracking abilities. There are no penalties if you end up breaking something :) - so go for it! Refer to ROE rule 1, though.

ROE (Rules Of Engagement):

  1. No DoS or DDoS. I am testing this on my home internet connection; I do not have the budget to host this in the cloud (I need the credits to compute the model). This is the only internet connection I have. If it is D/DoSed, my wife will be furious about the lack of Netflix and Instagram. If that doesn't convince you, be advised that I use the same connection to play StarCraft II, CS:GO and Dota 2. Don't be that person who causes a D/DoS to prove a point. It's lame. Finally, my ISP may give up on me, and that would cause severe issues for my research and education. So no D/DoS, please.

  2. To have a comprehensive dataset, I have put a file inside the systemroot directory of the host machine. If you can get its contents, message me and ask for the bounty. Filename: hackme.txt. I have set a scheduled task that checks the file's last access time, just in case :).

  3. This isn't a purpose-built "hackthebox." I am looking for real-world traffic. There is an OS that hasn't been patched for about four months, with three web-facing applications running as default installs without any hardening. What I am interested in is what an attacker would do and how I can use that to train a model. From access to this very webpage to scanning, the system is collecting data from all possible locations.

  4. The system is interactive; once you break into any of the web-facing applications, you will get interactive access. At this point, the commands you type will be recorded and responded to. This is the crucial part, as I will be recording system output alongside network traffic. The fun starts when these two layers are combined. Don't be disheartened if you don't get to systemroot in one go. Remember, there is a real possibility that others will be trying simultaneously, and the system can only handle five concurrent connections.

  5. Use any generic tool you prefer and, if possible, let me know your IP address. In a few cases, I may contact you to identify the tool used, along with its parameters.
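To give an idea of what combining the two layers from rule 4 could look like, here is a minimal sketch that attaches nearby network flows to each recorded command by timestamp and source IP. The record layout (`ts`, `src_ip`, `command`, `dst_port`, `bytes`), the two-second window and the sample data are all assumptions for illustration; the actual collection format may differ.

```python
from bisect import bisect_left, bisect_right


def correlate(commands, flows, window=2.0):
    """Attach to each command the flows from the same IP within +/- window seconds."""
    flows = sorted(flows, key=lambda f: f["ts"])
    times = [f["ts"] for f in flows]
    joined = []
    for cmd in commands:
        # Binary search for the slice of flows inside the time window.
        lo = bisect_left(times, cmd["ts"] - window)
        hi = bisect_right(times, cmd["ts"] + window)
        nearby = [f for f in flows[lo:hi] if f["src_ip"] == cmd["src_ip"]]
        joined.append({**cmd, "flows": nearby})
    return joined


# Made-up system-layer records (commands typed after gaining access).
commands = [
    {"ts": 100.0, "src_ip": "203.0.113.9", "command": "whoami"},
    {"ts": 130.0, "src_ip": "203.0.113.9", "command": "dir"},
]
# Made-up network-layer records (flow summaries).
flows = [
    {"ts": 99.5, "src_ip": "203.0.113.9", "dst_port": 80, "bytes": 310},
    {"ts": 100.4, "src_ip": "198.51.100.7", "dst_port": 22, "bytes": 60},
    {"ts": 131.0, "src_ip": "203.0.113.9", "dst_port": 80, "bytes": 420},
]
records = correlate(commands, flows)
for r in records:
    print(r["command"], [f["bytes"] for f in r["flows"]])
```

Each joined record then carries both what was typed and what crossed the wire around that moment, which is the kind of combined sample a model could be trained on.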

What is in it for you?

As a self-funded student, I don't have a plentiful bounty to give out. I can, however, offer you a brand new Raspberry Pi 4 Model B (RPi) in the configuration of your choosing (although the 8 GB RAM version would be the most prudent choice). Should you decide not to take the RPi, I suggest something more rudimentary, like alcohol for the spirited hacker inside you (yup, a pun), or an Oxford University hoodie. I am willing to offer anything equal to the monetary value of an RPi (approximately 54 GBP).

I will include your name in the citation of my paper.

 

Thank you very much for your assistance and may the force be with you.
