Automobile accidents Data Analysis in the United States | Sample Work| Realcode4you
- realcode4you
The file accidents.csv contains information on automobile accidents in the United States that involved one of three levels of injury: NO INJURY, INJURY, or FATALITY. For each accident, additional information is recorded, such as day of week, weather conditions, and road type. A firm might be interested in developing a system for quickly classifying the severity of an accident based on initial reports and associated data in the system (some of which rely on GPS-assisted reporting).
Our goal here is to predict whether an accident just reported will involve an injury (MAX_SEV_IR = 1 or 2) or will not (MAX_SEV_IR = 0). For this purpose, create a dummy variable called INJURY that takes the value “yes” if MAX_SEV_IR = 1 or 2, and otherwise “no.”
Assuming that no information or initial reports about the accident itself are available at the time of prediction (only location characteristics, weather conditions, etc.), which predictors can we include in the analysis?
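Only fields describing pre-existing conditions (time, road, weather) are usable at prediction time, while columns describing the crash outcome clearly are not. A minimal sketch of one defensible selection follows (the column names come from the dataset listing below; whether accident-specific fields such as MANCOL_I_R, PED_ACC_R, or VEH_INVL should also be dropped is a judgment call, not settled here):

```python
# All 24 columns in accidents.csv >>
all_cols = ['HOUR_I_R', 'ALCHL_I', 'ALIGN_I', 'STRATUM_R', 'WRK_ZONE',
            'WKDY_I_R', 'INT_HWY', 'LGTCON_I_R', 'MANCOL_I_R', 'PED_ACC_R',
            'RELJCT_I_R', 'REL_RWY_R', 'PROFIL_I_R', 'SPD_LIM', 'SUR_COND',
            'TRAF_CON_R', 'TRAF_WAY', 'VEH_INVL', 'WEATHER_R',
            'INJURY_CRASH', 'NO_INJ_I', 'PRPTYDMG_CRASH', 'FATALITIES',
            'MAX_SEV_IR']

# Outcome columns record the result of the crash, so they cannot be known
# at the time of the initial report and must be excluded >>
outcome_cols = ['INJURY_CRASH', 'NO_INJ_I', 'PRPTYDMG_CRASH',
                'FATALITIES', 'MAX_SEV_IR']
predictors = [c for c in all_cols if c not in outcome_cols]
print(predictors)
```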
Run a naive Bayes classifier on the complete training set with the relevant predictors (and INJURY as the response). Note that all predictors are categorical. Show the confusion matrix.
What is the overall error for the validation set?
What is the percent improvement relative to the naive rule (using the validation set)?
Examine the conditional probabilities in the pivot tables. Why do we get a probability of zero for P(INJURY = No ∣ SPD_LIM = 5)?
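A conditional probability of zero means simply that no training record with SPD_LIM = 5 had INJURY = no: every 5 mph accident in the data involved an injury, so the empirical estimate P(INJURY = no | SPD_LIM = 5) = 0/count = 0. A toy illustration with made-up records, where pd.crosstab plays the role of the pivot table:

```python
import pandas as pd

# Toy data (made-up records): every accident with SPD_LIM = 5 involves an
# injury, so the row-normalized pivot table gives P(no | SPD_LIM = 5) = 0.
toy = pd.DataFrame({
    'SPD_LIM': [5, 5, 35, 35, 35, 55, 55],
    'INJURY':  ['yes', 'yes', 'no', 'yes', 'no', 'no', 'yes'],
})
pivot = pd.crosstab(toy['SPD_LIM'], toy['INJURY'], normalize='index')
print(pivot)
print("[$] P(INJURY = no | SPD_LIM = 5) >> ", pivot.loc[5, 'no'])
```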


Code Implementation
# Importing all required ML modules >>
# Importing numpy for mathematical calculations >>
import numpy as np
import warnings
# Importing Pandas for data dealing >>
import pandas as pd
# Importing matplotlib and seaborn for data visualization >>
import seaborn as sns
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')
# Importing Sklearn modules for splitting data >>
from sklearn.model_selection import train_test_split
# Importing LabelEncoder for encoding data >>
from sklearn.preprocessing import LabelEncoder
# Importing CategoricalNB module for classification >>
from sklearn.naive_bayes import CategoricalNB
# Importing sklearn accuracy metrics >>
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

#Dataset Loading
# Loading Dataset into df variable >>
df = pd.read_csv('./accidents.csv')

#Explore Dataset
# Explore the Dataset take a look at top 5 Rows >>
df.head()
# Check the shape of Dataset rows and columns >>
print("[$] Rows of dataset >> ",df.shape[0])
print("[$] Columns of dataset >> ",df.shape[1])

output:
[$] Rows of dataset >> 42183
[$] Columns of dataset >> 24

# Check the information of Dataset >>
df.info()

output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42183 entries, 0 to 42182
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 HOUR_I_R 42183 non-null int64
1 ALCHL_I 42183 non-null int64
2 ALIGN_I 42183 non-null int64
3 STRATUM_R 42183 non-null int64
4 WRK_ZONE 42183 non-null int64
5 WKDY_I_R 42183 non-null int64
6 INT_HWY 42183 non-null int64
7 LGTCON_I_R 42183 non-null int64
8 MANCOL_I_R 42183 non-null int64
9 PED_ACC_R 42183 non-null int64
10 RELJCT_I_R 42183 non-null int64
11 REL_RWY_R 42183 non-null int64
12 PROFIL_I_R 42183 non-null int64
13 SPD_LIM 42183 non-null int64
14 SUR_COND 42183 non-null int64
15 TRAF_CON_R 42183 non-null int64
16 TRAF_WAY 42183 non-null int64
17 VEH_INVL 42183 non-null int64
18 WEATHER_R 42183 non-null int64
19 INJURY_CRASH 42183 non-null int64
20 NO_INJ_I 42183 non-null int64
21 PRPTYDMG_CRASH 42183 non-null int64
22 FATALITIES 42183 non-null int64
23 MAX_SEV_IR 42183 non-null int64
dtypes: int64(24)
memory usage: 7.7 MB

# Check for any null values >>
print("[$] Null values in dataset >> ",df.isnull().sum().sum())
print("[$] Duplicate values in dataset >> ",df.duplicated().sum())

output:
[$] Null values in dataset >> 0
[$] Duplicate values in dataset >> 18366

# Dropping the duplicate values >>
df.drop_duplicates(inplace=True)

# Check the statistical description of dataset >>
df.describe().T

output:
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| HOUR_I_R | 42183.0 | 0.429344 | 0.494988 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| ALCHL_I | 42183.0 | 1.912832 | 0.282084 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| ALIGN_I | 42183.0 | 1.131546 | 0.338000 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| STRATUM_R | 42183.0 | 0.491620 | 0.499936 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| WRK_ZONE | 42183.0 | 0.022616 | 0.148677 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| WKDY_I_R | 42183.0 | 0.771614 | 0.419797 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| INT_HWY | 42183.0 | 0.150321 | 0.418952 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 |
| LGTCON_I_R | 42183.0 | 1.492521 | 0.789874 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| MANCOL_I_R | 42183.0 | 1.337079 | 0.929756 | 0.0 | 0.0 | 2.0 | 2.0 | 2.0 |
| PED_ACC_R | 42183.0 | 0.040514 | 0.197164 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| RELJCT_I_R | 42183.0 | 0.557926 | 0.496639 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| REL_RWY_R | 42183.0 | 0.766541 | 0.423037 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| PROFIL_I_R | 42183.0 | 0.243226 | 0.429035 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| SPD_LIM | 42183.0 | 43.547875 | 12.948396 | 5.0 | 35.0 | 40.0 | 55.0 | 75.0 |
| SUR_COND | 42183.0 | 1.290710 | 0.780524 | 1.0 | 1.0 | 1.0 | 1.0 | 9.0 |
| TRAF_CON_R | 42183.0 | 0.516322 | 0.749417 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| TRAF_WAY | 42183.0 | 1.477491 | 0.584851 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| VEH_INVL | 42183.0 | 1.816964 | 0.684843 | 1.0 | 1.0 | 2.0 | 2.0 | 23.0 |
| WEATHER_R | 42183.0 | 1.142783 | 0.349855 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| INJURY_CRASH | 42183.0 | 0.497736 | 0.500001 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| NO_INJ_I | 42183.0 | 0.778702 | 1.035169 | 0.0 | 0.0 | 1.0 | 1.0 | 31.0 |
| PRPTYDMG_CRASH | 42183.0 | 0.491217 | 0.499929 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| FATALITIES | 42183.0 | 0.011047 | 0.104524 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| MAX_SEV_IR | 42183.0 | 0.519830 | 0.521256 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 |
Summary:
Injuries occur in nearly half (49.8%) of all crashes, showing an even distribution between injury and non-injury crashes.
Alcohol involvement (ALCHL_I, coded 1 = alcohol, 2 = no alcohol) has a mean of 1.91, so only a small fraction of crashes are alcohol-related.
Crashes on weekdays (WKDY_I_R) are significantly more common (77.2%) than on weekends.
Only 2.26% of crashes occur in work zones (WRK_ZONE), indicating a low frequency of construction-related incidents.
Speed limits range from 5 to 75 mph, with a median of 40 mph, showing that most crashes occur in mid-range speed zones.
Most crashes (77%) occurred on the roadway itself (REL_RWY_R) rather than off it.
Interstate highways (INT_HWY) account for 15% of crashes, indicating moderate risk on these roads.
Multi-vehicle crashes are more common (mean vehicles involved: 1.82, max: 23), with most crashes involving two vehicles.
Fatalities are rare (only 1.1% of crashes result in fatalities), while injury severity (MAX_SEV_IR) varies across cases.
Property damage crashes (PRPTYDMG_CRASH) occur at nearly the same rate as injury crashes (49.1%), showing a balanced impact between damage and injuries.
Straight road crashes: 86.8% of accidents happened on straight roads rather than curves.
Road conditions: the majority of accidents occurred on dry roads in clear weather.
#Data Visualization
# Check the distribution of 0=no injury, 1=non-fatal inj., 2=fatal inj. >>
df['MAX_SEV_IR'].value_counts().plot(kind='pie',autopct='%1.0f%%',figsize=(6,6))
# >> We can see the dataset is roughly balanced between injury and non-injury, approximately 50/50 >>

output:

(Q) Our goal here is to predict whether an accident just reported will involve an injury (MAX_SEV_IR = 1 or 2) or will not (MAX_SEV_IR = 0). For this purpose, create a dummy variable called INJURY that takes the value “yes” if MAX_SEV_IR = 1 or 2, and otherwise “no.”
# Convert MAX_SEV_IR into a binary target variable 'INJURY' >>
df['INJURY'] = np.where(df['MAX_SEV_IR'].isin([1, 2]), 'yes', 'no')
# Drop MAX_SEV_IR as it is now encoded into INJURY >>
df.drop(columns=['MAX_SEV_IR'], inplace=True)
# Convert injury categorical to numerical values > yes:1 and no:0 >>
df['INJURY'] = df['INJURY'].replace({'yes':1,'no':0})
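With INJURY built, the classifier the questions ask for can be sketched as follows. This is a minimal, self-contained example on a synthetic stand-in for the data (the column names and category codes are placeholders); in the real run, X would be df with the chosen predictor columns and y would be df['INJURY']:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import confusion_matrix, accuracy_score

# Synthetic stand-in for the accident data >>
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    'WEATHER_R':  rng.integers(0, 2, n),
    'SPD_LIM':    rng.integers(0, 4, n),   # speed-limit bucket code, not mph
    'TRAF_CON_R': rng.integers(0, 3, n),
})
# Injury label loosely tied to the predictors so the model can learn >>
y = ((X.sum(axis=1) + rng.integers(0, 2, n)) >= 3).astype(int)

# Hold out 40% as a validation set, stratified on the target >>
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)

nb = CategoricalNB()          # naive Bayes for categorical predictors
nb.fit(X_train, y_train)
pred = nb.predict(X_valid)

print(confusion_matrix(y_valid, pred))
print("[$] Validation accuracy >> ", accuracy_score(y_valid, pred))
print("[$] Validation error    >> ", 1 - accuracy_score(y_valid, pred))
```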
#Categorical Distribution
# Select all our columns >>
categorical_vars = df.columns.tolist()
# Set up the grid layout >>
num_vars = len(categorical_vars)
cols = 4
rows = 6
fig, axes = plt.subplots(rows, cols, figsize=(20, 5 * rows))
axes = axes.flatten()
# Plot each categorical variable >>
for i, var in enumerate(categorical_vars):
    sns.countplot(x=df[var], ax=axes[i], palette='viridis',hue=df['INJURY'])
    axes[i].set_title(var)
    axes[i].set_ylabel('Count')
    axes[i].set_xlabel('')
plt.tight_layout()
plt.show()
output:

Summary of Distribution:
Injuries occur in 54.5% of all crashes, with a higher proportion on weekends (56.9%) than weekdays (54.5%).
Head-on collisions have the highest injury rate (71.2%), while rear-end crashes have the lowest (43.6%).
Pedestrian and cyclist crashes result in injuries 99.1% of the time, making them the most dangerous.
Speed limits between 35-55 mph account for nearly half (47.1%) of all injury crashes.
Alcohol-related crashes lead to injuries 59.9% of the time, higher than non-alcohol-related crashes (53.8%).
Dark, unlit roads have a higher injury severity risk, despite fewer crashes compared to well-lit areas.
Intersections are involved in 47% of injury cases, making them high-risk zones.
Most injuries happen in daylight (58.6%) and on dry roads (67.3%), despite better driving conditions.
Multi-vehicle crashes (61.2%) cause more injuries than single-vehicle crashes (51%).
Work zone crashes are rare but have a high injury rate (53.6%).
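The improvement over the naive rule (always predict the majority class of the validation set) can be computed as below. The labels here are made-up placeholders; in practice y_valid and pred would come from the fitted CategoricalNB model:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Placeholder validation labels and model predictions >>
y_valid = np.array([1, 0, 1, 1, 0, 1, 0, 1])
pred    = np.array([1, 0, 0, 1, 0, 1, 1, 1])

# Naive rule: always predict the majority class of the validation set >>
majority = np.bincount(y_valid).argmax()
naive_error = 1 - (y_valid == majority).mean()
model_error = 1 - accuracy_score(y_valid, pred)
improvement = (naive_error - model_error) / naive_error * 100

print(f"[$] naive rule error >> {naive_error:.3f}")
print(f"[$] model error      >> {model_error:.3f}")
print(f"[$] improvement      >> {improvement:.1f}%")
```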


