Automobile accidents Data Analysis in the United States | Sample Work| Realcode4you
- realcode4you
The file accidents.csv contains information on automobile accidents in the United States that involved one of three levels of injury: NO INJURY, INJURY, or FATALITY. For each accident, additional information is recorded, such as day of week, weather conditions, and road type. A firm might be interested in developing a system for quickly classifying the severity of an accident based on initial reports and associated data in the system (some of which rely on GPS-assisted reporting).
Our goal here is to predict whether an accident just reported will involve an injury (MAX_SEV_IR = 1 or 2) or will not (MAX_SEV_IR = 0). For this purpose, create a dummy variable called INJURY that takes the value “yes” if MAX_SEV_IR = 1 or 2, and otherwise “no.”
Assuming that no information or initial reports about the accident itself are available at the time of prediction (only location characteristics, weather conditions, etc.), which predictors can we include in the analysis?
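Only fields describing pre-existing conditions (time, road, weather) are usable at prediction time, while columns describing the crash outcome clearly are not. A minimal sketch of one defensible selection follows (the column names come from the dataset listing below; whether accident-specific fields such as MANCOL_I_R, PED_ACC_R, or VEH_INVL should also be dropped is a judgment call, not settled here):

```python
# All 24 columns in accidents.csv >>
all_cols = ['HOUR_I_R', 'ALCHL_I', 'ALIGN_I', 'STRATUM_R', 'WRK_ZONE',
            'WKDY_I_R', 'INT_HWY', 'LGTCON_I_R', 'MANCOL_I_R', 'PED_ACC_R',
            'RELJCT_I_R', 'REL_RWY_R', 'PROFIL_I_R', 'SPD_LIM', 'SUR_COND',
            'TRAF_CON_R', 'TRAF_WAY', 'VEH_INVL', 'WEATHER_R',
            'INJURY_CRASH', 'NO_INJ_I', 'PRPTYDMG_CRASH', 'FATALITIES',
            'MAX_SEV_IR']

# Outcome columns record the result of the crash, so they cannot be known
# at the time of the initial report and must be excluded >>
outcome_cols = ['INJURY_CRASH', 'NO_INJ_I', 'PRPTYDMG_CRASH',
                'FATALITIES', 'MAX_SEV_IR']
predictors = [c for c in all_cols if c not in outcome_cols]
print(predictors)
```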
Run a naive Bayes classifier on the complete training set with the relevant predictors (and INJURY as the response). Note that all predictors are categorical. Show the confusion matrix.
What is the overall error for the validation set?
What is the percent improvement relative to the naive rule (using the validation set)?
Examine the conditional probabilities in the pivot tables. Why do we get a probability of zero for P(INJURY = No ∣ SPD_LIM = 5)?
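A conditional probability of zero means simply that no training record with SPD_LIM = 5 had INJURY = no: every 5 mph accident in the data involved an injury, so the empirical estimate P(INJURY = no | SPD_LIM = 5) = 0/count = 0. A toy illustration with made-up records, where pd.crosstab plays the role of the pivot table:

```python
import pandas as pd

# Toy data (made-up records): every accident with SPD_LIM = 5 involves an
# injury, so the row-normalized pivot table gives P(no | SPD_LIM = 5) = 0.
toy = pd.DataFrame({
    'SPD_LIM': [5, 5, 35, 35, 35, 55, 55],
    'INJURY':  ['yes', 'yes', 'no', 'yes', 'no', 'no', 'yes'],
})
pivot = pd.crosstab(toy['SPD_LIM'], toy['INJURY'], normalize='index')
print(pivot)
print("[$] P(INJURY = no | SPD_LIM = 5) >> ", pivot.loc[5, 'no'])
```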


Code Implementation
# Importing all required ML modules >>
# Importing numpy for mathematical calculations >>
import numpy as np
import warnings
# Importing Pandas for data dealing >>
import pandas as pd
# Importing matplotlib and seaborn for data visualization >>
import seaborn as sns
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')
# Importing Sklearn modules for splitting data >>
from sklearn.model_selection import train_test_split
# Importing LabelEncoder for encoding data >>
from sklearn.preprocessing import LabelEncoder
# Importing CategoricalNB module for classification >>
from sklearn.naive_bayes import CategoricalNB
# Importing sklearn accuracy metrics >>
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

#Dataset Loading
# Loading Dataset into df variable >>
df = pd.read_csv('./accidents.csv')

#Explore Dataset
# Explore the Dataset take a look at top 5 Rows >>
df.head()
# Check the shape of Dataset rows and columns >>
print("[$] Rows of dataset >> ",df.shape[0])
print("[$] Columns of dataset >> ",df.shape[1])

output:
[$] Rows of dataset >> 42183
[$] Columns of dataset >> 24

# Check the information of Dataset >>
df.info()

output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42183 entries, 0 to 42182
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 HOUR_I_R 42183 non-null int64
1 ALCHL_I 42183 non-null int64
2 ALIGN_I 42183 non-null int64
3 STRATUM_R 42183 non-null int64
4 WRK_ZONE 42183 non-null int64
5 WKDY_I_R 42183 non-null int64
6 INT_HWY 42183 non-null int64
7 LGTCON_I_R 42183 non-null int64
8 MANCOL_I_R 42183 non-null int64
9 PED_ACC_R 42183 non-null int64
10 RELJCT_I_R 42183 non-null int64
11 REL_RWY_R 42183 non-null int64
12 PROFIL_I_R 42183 non-null int64
13 SPD_LIM 42183 non-null int64
14 SUR_COND 42183 non-null int64
15 TRAF_CON_R 42183 non-null int64
16 TRAF_WAY 42183 non-null int64
17 VEH_INVL 42183 non-null int64
18 WEATHER_R 42183 non-null int64
19 INJURY_CRASH 42183 non-null int64
20 NO_INJ_I 42183 non-null int64
21 PRPTYDMG_CRASH 42183 non-null int64
22 FATALITIES 42183 non-null int64
23 MAX_SEV_IR 42183 non-null int64
dtypes: int64(24)
memory usage: 7.7 MB

# Check for any null values >>
print("[$] Null values in dataset >> ",df.isnull().sum().sum())
print("[$] Duplicate values in dataset >> ",df.duplicated().sum())

output:
[$] Null values in dataset >> 0
[$] Duplicate values in dataset >> 18366

# Dropping the duplicate values >>
df.drop_duplicates(inplace=True)

# Check the statistical description of dataset >>
df.describe().T

output:
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| HOUR_I_R | 42183.0 | 0.429344 | 0.494988 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| ALCHL_I | 42183.0 | 1.912832 | 0.282084 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| ALIGN_I | 42183.0 | 1.131546 | 0.338000 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| STRATUM_R | 42183.0 | 0.491620 | 0.499936 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| WRK_ZONE | 42183.0 | 0.022616 | 0.148677 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| WKDY_I_R | 42183.0 | 0.771614 | 0.419797 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| INT_HWY | 42183.0 | 0.150321 | 0.418952 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 |
| LGTCON_I_R | 42183.0 | 1.492521 | 0.789874 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| MANCOL_I_R | 42183.0 | 1.337079 | 0.929756 | 0.0 | 0.0 | 2.0 | 2.0 | 2.0 |
| PED_ACC_R | 42183.0 | 0.040514 | 0.197164 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| RELJCT_I_R | 42183.0 | 0.557926 | 0.496639 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| REL_RWY_R | 42183.0 | 0.766541 | 0.423037 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| PROFIL_I_R | 42183.0 | 0.243226 | 0.429035 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| SPD_LIM | 42183.0 | 43.547875 | 12.948396 | 5.0 | 35.0 | 40.0 | 55.0 | 75.0 |
| SUR_COND | 42183.0 | 1.290710 | 0.780524 | 1.0 | 1.0 | 1.0 | 1.0 | 9.0 |
| TRAF_CON_R | 42183.0 | 0.516322 | 0.749417 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| TRAF_WAY | 42183.0 | 1.477491 | 0.584851 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| VEH_INVL | 42183.0 | 1.816964 | 0.684843 | 1.0 | 1.0 | 2.0 | 2.0 | 23.0 |
| WEATHER_R | 42183.0 | 1.142783 | 0.349855 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| INJURY_CRASH | 42183.0 | 0.497736 | 0.500001 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| NO_INJ_I | 42183.0 | 0.778702 | 1.035169 | 0.0 | 0.0 | 1.0 | 1.0 | 31.0 |
| PRPTYDMG_CRASH | 42183.0 | 0.491217 | 0.499929 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| FATALITIES | 42183.0 | 0.011047 | 0.104524 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| MAX_SEV_IR | 42183.0 | 0.519830 | 0.521256 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 |
Summary:
Injuries occur in nearly half (49.8%) of all crashes, showing an even distribution between injury and non-injury crashes.
Alcohol involvement (ALCHL_I, coded 1 = alcohol, 2 = no alcohol) has a mean of 1.91, so only a small fraction of crashes are alcohol-related.
Crashes on weekdays (WKDY_I_R) are significantly more common (77.2%) than on weekends.
Only 2.26% of crashes occur in work zones (WRK_ZONE), indicating a low frequency of construction-related incidents.
Speed limits range from 5 to 75 mph, with a median of 40 mph, showing that most crashes occur in mid-range speed zones.
Most crashes (77%) occurred on the roadway itself (REL_RWY_R) rather than off it.
Interstate highways (INT_HWY) account for 15% of crashes, indicating moderate risk on these roads.
Multi-vehicle crashes are more common (mean vehicles involved: 1.82, max: 23), with most crashes involving two vehicles.
Fatalities are rare (only 1.1% of crashes result in fatalities), while injury severity (MAX_SEV_IR) varies across cases.
Property damage crashes (PRPTYDMG_CRASH) occur at nearly the same rate as injury crashes (49.1%), showing a balanced impact between damage and injuries.
Straight road crashes: 86.8% of accidents happened on straight roads rather than curves.
Road conditions: the majority of accidents occurred on dry roads in clear weather.
#Data Visualization
# Check the distribution of 0=no injury, 1=non-fatal inj., 2=fatal inj. >>
df['MAX_SEV_IR'].value_counts().plot(kind='pie',autopct='%1.0f%%',figsize=(6,6))
# >> We can see the dataset is roughly balanced between injury and non-injury, approximately 50/50 >>

output:

(Q) Our goal here is to predict whether an accident just reported will involve an injury (MAX_SEV_IR = 1 or 2) or will not (MAX_SEV_IR = 0). For this purpose, create a dummy variable called INJURY that takes the value “yes” if MAX_SEV_IR = 1 or 2, and otherwise “no.”
# Convert MAX_SEV_IR into a binary target variable 'INJURY' >>
df['INJURY'] = np.where(df['MAX_SEV_IR'].isin([1, 2]), 'yes', 'no')
# Drop MAX_SEV_IR as it is now encoded into INJURY >>
df.drop(columns=['MAX_SEV_IR'], inplace=True)
# Convert injury categorical to numerical values > yes:1 and no:0 >>
df['INJURY'] = df['INJURY'].replace({'yes':1,'no':0})
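With INJURY built, the classifier the questions ask for can be sketched as follows. This is a minimal, self-contained example on a synthetic stand-in for the data (the column names and category codes are placeholders); in the real run, X would be df with the chosen predictor columns and y would be df['INJURY']:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import confusion_matrix, accuracy_score

# Synthetic stand-in for the accident data >>
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    'WEATHER_R':  rng.integers(0, 2, n),
    'SPD_LIM':    rng.integers(0, 4, n),   # speed-limit bucket code, not mph
    'TRAF_CON_R': rng.integers(0, 3, n),
})
# Injury label loosely tied to the predictors so the model can learn >>
y = ((X.sum(axis=1) + rng.integers(0, 2, n)) >= 3).astype(int)

# Hold out 40% as a validation set, stratified on the target >>
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)

nb = CategoricalNB()          # naive Bayes for categorical predictors
nb.fit(X_train, y_train)
pred = nb.predict(X_valid)

print(confusion_matrix(y_valid, pred))
print("[$] Validation accuracy >> ", accuracy_score(y_valid, pred))
print("[$] Validation error    >> ", 1 - accuracy_score(y_valid, pred))
```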
#Categorical Distribution
# Select all our columns >>
categorical_vars = df.columns.tolist()
# Set up the grid layout >>
num_vars = len(categorical_vars)
cols = 4
rows = 6
fig, axes = plt.subplots(rows, cols, figsize=(20, 5 * rows))
axes = axes.flatten()
# Plot each categorical variable >>
for i, var in enumerate(categorical_vars):
    sns.countplot(x=df[var], ax=axes[i], palette='viridis',hue=df['INJURY'])
    axes[i].set_title(var)
    axes[i].set_ylabel('Count')
    axes[i].set_xlabel('')
plt.tight_layout()
plt.show()
output:

Summary of Distribution:
Injuries occur in 54.5% of all crashes, with a higher proportion on weekends (56.9%) than weekdays (54.5%).
Head-on collisions have the highest injury rate (71.2%), while rear-end crashes have the lowest (43.6%).
Pedestrian and cyclist crashes result in injuries 99.1% of the time, making them the most dangerous.
Speed limits between 35-55 mph account for nearly half (47.1%) of all injury crashes.
Alcohol-related crashes lead to injuries 59.9% of the time, higher than non-alcohol-related crashes (53.8%).
Dark, unlit roads have a higher injury severity risk, despite fewer crashes compared to well-lit areas.
Intersections are involved in 47% of injury cases, making them high-risk zones.
Most injuries happen in daylight (58.6%) and on dry roads (67.3%), despite better driving conditions.
Multi-vehicle crashes (61.2%) cause more injuries than single-vehicle crashes (51%).
Work zone crashes are rare but have a high injury rate (53.6%).
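The improvement over the naive rule (always predict the majority class of the validation set) can be computed as below. The labels here are made-up placeholders; in practice y_valid and pred would come from the fitted CategoricalNB model:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Placeholder validation labels and model predictions >>
y_valid = np.array([1, 0, 1, 1, 0, 1, 0, 1])
pred    = np.array([1, 0, 0, 1, 0, 1, 1, 1])

# Naive rule: always predict the majority class of the validation set >>
majority = np.bincount(y_valid).argmax()
naive_error = 1 - (y_valid == majority).mean()
model_error = 1 - accuracy_score(y_valid, pred)
improvement = (naive_error - model_error) / naive_error * 100

print(f"[$] naive rule error >> {naive_error:.3f}")
print(f"[$] model error      >> {model_error:.3f}")
print(f"[$] improvement      >> {improvement:.1f}%")
```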


