
Analyzing Cyber Attacks and Scams with WACY-COM Dataset in R Programming

Cyber-attacks and scams continue to evolve, posing serious threats to individuals, organizations, and governments worldwide. Detecting and understanding these threats requires access to reliable data and effective analytical tools. The WA Cyber Command (WACY-COM) dataset offers a rich source of information on cyber incidents, enabling researchers and analysts to identify patterns and trends. Using R programming, a powerful tool for data analysis, we can uncover insights that help improve cybersecurity defenses.



Understanding the WACY-COM Dataset


The WACY-COM dataset is a collection of cyber-attack and scam records compiled by the Western Australia Cyber Command. It includes detailed information about various types of cyber incidents such as phishing, malware infections, ransomware attacks, and social engineering scams. The dataset typically contains fields like:


  • Incident date and time

  • Attack type

  • Target sector (e.g., government, finance, healthcare)

  • Attack vector (e.g., email, website, network)

  • Severity level

  • Response actions


This structured data allows analysts to track how cyber threats change over time and which sectors are most vulnerable.


Preparing the Dataset for Analysis in R


Before diving into analysis, it is essential to clean and prepare the WACY-COM dataset. Common steps include:


  • Loading the data: Import the dataset into R using functions like `read.csv()` or `readr::read_csv()`.

  • Handling missing values: Identify and address missing or incomplete records to avoid skewed results.

  • Converting data types: Ensure date fields are in date format, categorical variables are factors, and numerical data is correctly typed.

  • Filtering relevant data: Focus on specific time frames or attack types depending on the analysis goal.


For example, to load and prepare the data, you might use:


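```r
library(dplyr)
library(lubridate)

# Load the raw incident records
wacy_data <- read.csv("wacy_com_dataset.csv")

wacy_data <- wacy_data %>%
  # Parse timestamps; ymd_hms() expects "YYYY-MM-DD HH:MM:SS" style strings
  mutate(IncidentDate = ymd_hms(IncidentDate)) %>%
  # Drop records with no recorded attack type
  filter(!is.na(AttackType))
```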
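The snippet above loads the data and parses the dates, but the missing-value audit and type conversion from the list are not shown. A minimal base-R sketch of those two steps follows. The data frame `toy_raw` and its values are invented for illustration, and the column names are the ones assumed throughout this article, not confirmed names from the real file.

```r
# Toy stand-in for freshly imported data (values are invented)
toy_raw <- data.frame(
  IncidentDate  = c("2023-01-05 10:00:00", "2023-01-06 11:30:00", NA),
  AttackType    = c("Phishing", NA, "Malware"),
  SeverityLevel = c("3", "5", "2"),
  stringsAsFactors = FALSE
)

# Count missing values in each column before deciding how to handle them
missing_per_column <- colSums(is.na(toy_raw))
print(missing_per_column)

# Convert each column to the type the later analysis expects
toy_typed <- toy_raw
toy_typed$IncidentDate  <- as.POSIXct(toy_typed$IncidentDate, tz = "UTC")
toy_typed$AttackType    <- factor(toy_typed$AttackType)
toy_typed$SeverityLevel <- as.integer(toy_typed$SeverityLevel)

str(toy_typed)
```

Auditing the counts first makes it easier to choose between dropping rows and imputing values, since the right choice depends on how much data each column is missing.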


Identifying Patterns in Cyber Attacks


Once the data is ready, R’s powerful packages like `ggplot2` and `dplyr` help uncover patterns. Here are some key analyses:


Frequency of Attacks Over Time


Plotting the number of attacks by month or year reveals trends and spikes. For instance, a sudden increase in phishing attacks during a certain period might indicate a targeted campaign.


```r
library(ggplot2)

wacy_data %>%
  # Bucket each incident into its calendar month (lubridate::floor_date)
  group_by(month = floor_date(IncidentDate, "month")) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = month, y = count)) +
  geom_line(color = "red") +
  labs(title = "Monthly Cyber Attacks Recorded in WACY-COM Dataset",
       x = "Month",
       y = "Number of Attacks")
```


Attack Types and Their Distribution


Understanding which types of attacks are most common helps prioritize defense strategies. A bar chart showing the frequency of each attack type can highlight the most prevalent threats.
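As a rough sketch of such a bar chart, the example below counts incidents per attack type and plots them in descending order. The `toy_attacks` data frame is invented for illustration, and the `AttackType` column name is an assumption carried over from the earlier examples.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the prepared data (values are invented)
toy_attacks <- data.frame(
  AttackType = c("Phishing", "Malware", "Phishing", "Ransomware",
                 "Phishing", "Malware", "Scam")
)

# Count incidents per attack type, most frequent first
type_counts <- toy_attacks %>%
  count(AttackType, name = "incidents", sort = TRUE)

# Bar chart with bars ordered from most to least frequent
type_plot <- ggplot(type_counts,
                    aes(x = reorder(AttackType, -incidents), y = incidents)) +
  geom_col(fill = "steelblue") +
  labs(title = "Incidents by Attack Type",
       x = "Attack type",
       y = "Number of incidents")

print(type_plot)
```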


Target Sectors and Vulnerability


Analyzing which sectors are targeted most often reveals vulnerabilities. For example, if the finance sector faces the highest number of ransomware attacks, organizations in this sector should strengthen their defenses accordingly.
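One simple way to look at this is a cross-tabulation of sector against attack type. The sketch below uses base R only. The data and the `TargetSector` and `AttackType` column names are illustrative assumptions.

```r
# Toy incidents (values are invented)
toy_incidents <- data.frame(
  TargetSector = c("Finance", "Finance", "Healthcare",
                   "Government", "Finance", "Healthcare"),
  AttackType   = c("Ransomware", "Phishing", "Ransomware",
                   "Phishing", "Ransomware", "Phishing")
)

# Rows are sectors, columns are attack types
sector_by_type <- table(toy_incidents$TargetSector, toy_incidents$AttackType)
print(sector_by_type)

# Share of each sector's incidents by attack type (each row sums to 1)
print(round(prop.table(sector_by_type, margin = 1), 2))

# Sector with the most ransomware incidents
ransomware_counts <- sector_by_type[, "Ransomware"]
top_ransomware_sector <- names(ransomware_counts)[which.max(ransomware_counts)]
print(top_ransomware_sector)
```

Raw counts favour sectors that simply report more incidents, so the row proportions are often the more informative view when sectors differ a lot in size.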


Detecting Scam Patterns


Scams often follow recognizable patterns. Using clustering techniques in R, such as k-means clustering, analysts can group similar scam incidents based on features like attack vector, target, and timing. This grouping helps identify common tactics scammers use.


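```r
# Rows that are scams and have both clustering features present;
# kmeans() stops with an error if the input contains NA
scam_idx <- which(wacy_data$AttackType == "Scam" &
                    complete.cases(wacy_data[, c("AttackVector", "SeverityLevel")]))

# kmeans() needs numeric input. AttackVector is nominal, so one-hot encode it
# rather than using arbitrary integer codes. SeverityLevel is assumed numeric.
scam_features <- cbind(
  model.matrix(~ factor(AttackVector) - 1, data = wacy_data[scam_idx, ]),
  Severity = as.numeric(scale(wacy_data$SeverityLevel[scam_idx]))
)

set.seed(123)  # k-means starts from random centres
kmeans_result <- kmeans(scam_features, centers = 3, nstart = 25)

# Write cluster labels back to the scam rows only
wacy_data$Cluster <- NA_integer_
wacy_data$Cluster[scam_idx] <- kmeans_result$cluster
```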


Visualizing Cyber Threats for Better Insight


Visualizations make complex data easier to understand. Heatmaps, line graphs, and bar charts can show attack trends, peak times, and vulnerable sectors. For example, a heatmap of attacks by day and hour can reveal when cybercriminals are most active.
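A day-by-hour heatmap of this kind could be sketched as follows. The timestamps are invented for illustration. The weekday is derived from `as.POSIXlt()` rather than from formatted day names so that the result does not depend on the system locale.

```r
library(ggplot2)

# Invented incident timestamps, for illustration only
toy_times <- as.POSIXct(c("2023-03-06 02:15:00", "2023-03-06 02:40:00",
                          "2023-03-07 14:05:00", "2023-03-11 02:30:00",
                          "2023-03-12 23:55:00"), tz = "UTC")

# POSIXlt exposes wday (0 = Sunday) and hour (0 to 23) directly
day_names <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
lt <- as.POSIXlt(toy_times, tz = "UTC")

# Using factors with full level sets keeps empty day/hour cells in the grid
heat_counts <- as.data.frame(
  table(weekday = factor(day_names[lt$wday + 1], levels = day_names),
        hour    = factor(lt$hour, levels = 0:23)),
  responseName = "incidents"
)
heat_counts$hour <- as.integer(as.character(heat_counts$hour))

heat_plot <- ggplot(heat_counts, aes(x = hour, y = weekday, fill = incidents)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "grey95", high = "darkred") +
  labs(title = "Incidents by Day of Week and Hour (UTC)",
       x = "Hour of day",
       y = "Day of week")

print(heat_plot)
```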


Practical Example: Detecting Phishing Campaigns


Phishing remains one of the most common cyber threats. By filtering the dataset for phishing incidents and analyzing their timing and targets, analysts can detect coordinated campaigns.


  • Extract phishing incidents

  • Group by target sector and date

  • Identify clusters of high activity


This approach helps cybersecurity teams prepare for and respond to phishing waves more effectively.
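The three steps above can be sketched with `dplyr` as shown below. The data are invented, and the rule used to flag a campaign (at least three incidents in a day and at least twice the sector's median daily count) is an arbitrary illustrative threshold, not an established detection method.

```r
library(dplyr)

# Toy incidents (values are invented)
toy_incidents <- data.frame(
  IncidentDate = as.Date(c("2023-05-01", "2023-05-01", "2023-05-01", "2023-05-01",
                           "2023-05-02", "2023-05-03", "2023-05-03", "2023-05-04")),
  AttackType   = c("Phishing", "Phishing", "Phishing", "Phishing",
                   "Phishing", "Malware", "Phishing", "Phishing"),
  TargetSector = c("Finance", "Finance", "Finance", "Finance",
                   "Finance", "Finance", "Healthcare", "Finance")
)

# Steps 1 and 2: keep phishing incidents, then count per sector and day
daily_phishing <- toy_incidents %>%
  filter(AttackType == "Phishing") %>%
  count(TargetSector, day = IncidentDate, name = "incidents")

# Step 3: flag sector-days well above that sector's typical daily level
campaign_days <- daily_phishing %>%
  group_by(TargetSector) %>%
  mutate(typical = median(incidents)) %>%
  ungroup() %>%
  filter(incidents >= 3, incidents >= 2 * typical)

print(campaign_days)
```

On real data the threshold would need tuning, because a fixed multiple of the median is sensitive to sectors with very few recorded days.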


Using R to Automate Cyber Threat Monitoring


R scripts can be scheduled to run regularly, automatically updating reports and visualizations as new data arrives. This automation supports continuous monitoring and quick response to emerging threats.
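A minimal sketch of a script body suited to scheduled runs is shown below. The function name `refresh_report` and the file layout are hypothetical. The scheduling itself happens outside R, for example by having cron or Windows Task Scheduler invoke the script with `Rscript`.

```r
# Hypothetical helper: rebuild a summary table from the latest incident file
refresh_report <- function(input_csv, output_csv) {
  incidents <- read.csv(input_csv, stringsAsFactors = FALSE)

  # Incident counts per attack type
  summary_tbl <- as.data.frame(table(AttackType = incidents$AttackType),
                               responseName = "incidents",
                               stringsAsFactors = FALSE)

  # Record when the report was generated so stale output is easy to spot
  summary_tbl$generated_at <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")

  write.csv(summary_tbl, output_csv, row.names = FALSE)
  invisible(summary_tbl)
}

# Demo with temporary files standing in for the real data feed
input_path  <- tempfile(fileext = ".csv")
output_path <- tempfile(fileext = ".csv")
write.csv(data.frame(AttackType = c("Phishing", "Scam", "Phishing")),
          input_path, row.names = FALSE)

report <- refresh_report(input_path, output_path)
print(report)
```

Keeping the logic in a function that takes file paths as arguments means the same code can be run by hand for testing and by the scheduler in production.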


Challenges and Considerations


Working with cyber-attack data has challenges:


  • Data quality: Incomplete or inconsistent records can affect analysis accuracy.

  • Evolving threats: Attack methods change rapidly, requiring ongoing dataset updates.

  • Privacy concerns: Sensitive information must be handled carefully to protect identities.


Despite these challenges, the WACY-COM dataset combined with R programming offers a strong foundation for understanding cyber threats.


---------------------------------

Sample Assessment

Scenario

WA Cyber Command (WACY-COM) has acquired aggregate data about 200,000 identified cyber-attacks and scans. The data are sourced from a honeypot project which places fake servers across the globe and records attacker activity and techniques. Because honeypots are simulated networks and devices, they allow researchers to safely monitor malicious traffic without endangering real computers or networks.


When analysing cyber-attacks, the level of sophistication of attackers can range from low-level scammers right up to Advanced Persistent Threats (APTs), which are often associated with state-sponsored cyber-attacks. The attacker tools and techniques generally vary depending on the sophistication of the attacker.


A research project has been undertaken by WACY-COM to determine what patterns exist in state-sponsored APT attacks.


Typically, a complex attack can involve multiple attacking computers (with different source-IP addresses) and different payloads and targets. By coordinating attacks from multiple devices, the attacks can become more difficult to detect and stop.


Note: The scenario and data are loosely based on real-world cyber threats and attacks. However, this data set has been curated entirely to help you understand the types of data, correlations and issues that you may experience when handling real-world cyber security data.


Data description

The aggregated data available to WACY-COM are described by the following features (with data types given in square brackets):

[Categorical] Port – The port or service that was being attacked on the honeypot network. Well-known ports include 80/443 (web traffic), 25 (email reception) and 993 (email collection).

[Categorical] Protocol – The Internet protocol used to conduct the attack.

[Numeric] Hits – How many ‘hits’ the attacker made against the network.

[Numeric] Average Request Size (Bytes) – Average ‘payload’ sent by the attacker.

[Numeric] Attack Window (Seconds) – Duration of the attack.

[Numeric] Average Attacker Payload Entropy (Bits) – An attempt to qualify whether payload data were encrypted (higher Shannon entropy may indicate random data, data obfuscation or encryption).

[Categorical] Target Honeypot Server OS – The operating system of the simulated server.

[Numeric] Attack Source IP Address Count – How many unique IP addresses were used in the attack.

[Numeric] Average ping to attacking IP (milliseconds) – Used to detect ‘distance’ to the attacker. The average ping time ‘back’ to the attacker’s IP addresses was calculated.

[Numeric] Average ping variability (st.dev) – High-variability pings can indicate a saturated or unreliable link.

[Numeric] Individual URLs requested – How many different URLs were probed or attacked (only relevant for web server ports).

[Categorical] Source OS (Detected) – The detected operating system of the attacking IP address, acquired by scanning and fingerprinting the IP address of the attacking server.

[Categorical] Source Port Range – What range of source ports was used by the attacker. Typically, ‘low’ ports are reserved for system services. Higher ports are used by end-user applications.

[Categorical] Source IP Type (Detected) – Whether the IP of the attacker can be linked to known proxies/VPNs or TOR (technologies that can be used to hide the real source of the attack), or to likely ISP traffic (which may indicate the attacker is leveraging compromised end-user computers).

[Numeric] IP Range Trust Score – A trust score generated by an existing WACY-COM system. This system integrates with open-source intelligence (OS-Int) databases to identify potentially compromised or malicious IP addresses.

[Binary] APT – Whether the attack was conducted by a known Advanced Persistent Threat (APT) actor.


The raw data for the above variables are contained in the WACY-COM.csv file.
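To make the payload entropy feature more concrete, the sketch below computes Shannon entropy in bits per byte, H = -sum(p_i * log2(p_i)), where p_i is the relative frequency of byte value i. How the dataset's own entropy values were produced is not stated, so this is the standard textbook definition rather than the project's method, and the function name `shannon_entropy` is illustrative.

```r
# Shannon entropy of a raw (byte) vector, in bits per byte (range 0 to 8)
shannon_entropy <- function(bytes) {
  stopifnot(is.raw(bytes), length(bytes) > 0)
  # Frequency of each possible byte value 0..255
  counts <- tabulate(as.integer(bytes) + 1L, nbins = 256L)
  # Drop unseen values so that 0 * log2(0) never occurs
  p <- counts[counts > 0] / length(bytes)
  -sum(p * log2(p))
}

repetitive <- charToRaw(strrep("A", 64))            # a single repeated byte
text_like  <- charToRaw("GET /index.html HTTP/1.1") # plain-text request line
all_bytes  <- as.raw(0:255)                         # every byte value once

entropy_repetitive <- shannon_entropy(repetitive)   # 0 bits: fully predictable
entropy_text       <- shannon_entropy(text_like)
entropy_uniform    <- shannon_entropy(all_bytes)    # 8 bits: the maximum

print(c(repetitive = entropy_repetitive,
        text       = entropy_text,
        uniform    = entropy_uniform))
```

Plain-text payloads fall somewhere between the two extremes, which is why values close to 8 bits are treated as a hint of encryption, compression or obfuscation rather than proof of it.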


Objectives

You have been brought on as part of a data analysis team to determine if APT activity can be inferred from other attack parameters.


Task

You are to train your selected supervised machine learning algorithms using the master dataset provided, and compare their performance to each other.


Part 1 – General data preparation and cleaning.

a) Import the WACY-COM.csv (same version as Assignment 1) into RStudio.

b) Write the appropriate code in RStudio to prepare and clean the WACY-COM master dataset as follows:

i. Clean the whole dataset based on the feedback received for Assignment 1.

ii. For the feature Source.OS.Detected, merge its categories Windows 10 and Windows Server 2008 together to form a new category, say Windows_All. Similarly for Target.Honeypot.Server.OS, merge its categories Windows (Desktops) and Windows (Servers) to form the new category named Windows_DeskServ. Further, combine Linux and MacOS (All) to form the category MacOS_Linux. Hint: use the forcats::fct_collapse(.) function.

iii. Log-transform Average.ping.variability using the log(.) function, and remove the original Average.ping.variability column from the dataset (unless you have overwritten it with the log-transformed data). Similarly, transform the following features using the square root, i.e. sqrt(.), function instead.

1. Hits;

2. Attack.Source.IP.Address.Count;

3. Average.ping.to.attacking.IP.milliseconds;

4. Individual.URLs.requested.

iv. Select only the complete cases using the na.omit(.) function, and name the dataset WACY-COM_cleaned.


Briefly outline the preparation and cleaning process in your report and why you believe the above steps were necessary.
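The mechanics of steps ii to iv can be illustrated on a small invented data frame, as sketched below. Only the column names and category labels come from the task description. The values are made up, only one of the four square-root features is shown, and applying the Linux/MacOS merge to Target.Honeypot.Server.OS is an assumption, since the task does not say which feature it refers to.

```r
library(forcats)

# Toy data using the task's column names (values are invented)
toy <- data.frame(
  Source.OS.Detected        = factor(c("Windows 10", "Windows Server 2008",
                                       "Linux", "Windows 10")),
  Target.Honeypot.Server.OS = factor(c("Windows (Desktops)", "Windows (Servers)",
                                       "Linux", "MacOS (All)")),
  Hits                      = c(4, 9, 16, NA),
  Average.ping.variability  = c(1, exp(1), exp(2), 10)
)

# Step ii: merge factor levels, written as new_level = c(old levels)
toy$Source.OS.Detected <- fct_collapse(
  toy$Source.OS.Detected,
  Windows_All = c("Windows 10", "Windows Server 2008")
)
toy$Target.Honeypot.Server.OS <- fct_collapse(
  toy$Target.Honeypot.Server.OS,
  Windows_DeskServ = c("Windows (Desktops)", "Windows (Servers)"),
  MacOS_Linux      = c("Linux", "MacOS (All)")
)

# Step iii: overwrite in place, so no separate column needs removing
toy$Average.ping.variability <- log(toy$Average.ping.variability)
toy$Hits <- sqrt(toy$Hits)

# Step iv: keep complete cases only (the row with a missing Hits value is dropped)
toy_cleaned <- na.omit(toy)
print(toy_cleaned)
```

Two details are worth checking on the real data. `log(0)` returns `-Inf` rather than `NA`, so `na.omit()` will not remove such rows. Also, a name containing a hyphen, such as WACY-COM_cleaned, is not a syntactic R name and has to be wrapped in backticks when used as an object name.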


c) Write the appropriate code in RStudio to partition the data into training and test sets using a 30/70 split. Be sure to set the randomisation seed using your student ID. Export both the training and test datasets as csv files. You may be asked to provide these for verification purposes.


Note that the training set is typically larger than the test set in practice. However, given the size of this dataset, you are asked to use only 30% of the data to train your ML models, to save time.
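A simple random 30/70 partition can be sketched in base R as below. The seed is a placeholder for the student ID, the data frame is a toy stand-in for the cleaned dataset, and the output file names are hypothetical. The files are written to a temporary directory here so the example leaves no clutter behind.

```r
set.seed(12345678)  # placeholder: the task asks for your own student ID

# Toy stand-in for the cleaned dataset (values are invented)
toy_cleaned <- data.frame(id = 1:100, APT = rep(c(0, 1), times = 50))

# Draw 30% of the row indices at random for training; the rest form the test set
n <- nrow(toy_cleaned)
train_idx <- sample(n, size = round(0.30 * n))

train_set <- toy_cleaned[train_idx, ]
test_set  <- toy_cleaned[-train_idx, ]

# Export both partitions as CSV files
train_path <- file.path(tempdir(), "training_set.csv")
test_path  <- file.path(tempdir(), "test_set.csv")
write.csv(train_set, train_path, row.names = FALSE)
write.csv(test_set,  test_path,  row.names = FALSE)

print(c(train = nrow(train_set), test = nrow(test_set)))
```

This draw is not stratified, so the proportion of APT cases can differ slightly between the two sets. Where that matters, a stratified split such as the one offered by `caret::createDataPartition()` is a common alternative.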


Part 2 – Compare the performances of different ML algorithms

a) Determine your THREE randomly selected supervised learning modelling algorithms to test against one another by running the following code. Make sure you enter your student ID into the command set.seed(.). Your 3 ML approaches are given by myModels.



e) Provide a brief statement on your final recommended model and why you chose that model over the others. Parsimony and, to a lesser extent, interpretability may be taken into account if the decision is close. You may outline the coefficients of your penalised logistic regression model (which you can place in the appendix) if it helps your argument.


What to submit

Gather your findings into a report (maximum of 4 pages), citing sources where applicable. You may include an appendix (maximum of 2 pages) if appropriate. The minimum required font size is 11.


Outline how and why the data was manipulated, how the ML models were tuned and finally how they performed against each other. You may use graphs and tables where appropriate to help your reader understand your findings.


Make a final recommendation on which ML modelling approach is the best for this task.


Your final report should look professional, include appropriate headings and subheadings, and cite facts and reference source materials in APA 7th format.


