Automobile Accidents Data Analysis in the United States | Sample Work | Realcode4you

The file accidents.csv contains information on automobile accidents in the United States that involved one of three levels of injury: NO INJURY, INJURY, or FATALITY. For each accident, additional information is recorded, such as day of week, weather conditions, and road type. A firm might be interested in developing a system for quickly classifying the severity of an accident based on initial reports and associated data in the system (some of which rely on GPS-assisted reporting).


  • Our goal here is to predict whether an accident just reported will involve an injury (MAX_SEV_IR = 1 or 2) or will not (MAX_SEV_IR = 0). For this purpose, create a dummy variable called INJURY that takes the value “yes” if MAX_SEV_IR = 1 or 2, and otherwise “no.”

  • Assuming that no information or initial reports about the accident itself are available at the time of prediction (only location characteristics, weather conditions, etc.), which predictors can we include in the analysis?

  • Run a naive Bayes classifier on the complete training set with the relevant predictors (and INJURY as the response). Note that all predictors are categorical. Show the confusion matrix.

  • What is the overall error for the validation set?

  • What is the percent improvement relative to the naive rule (using the validation set)?

  • Examine the conditional probabilities in the pivot tables. Why do we get a probability of zero for P(INJURY = No ∣ SPD_LIM = 5)?
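The last bullet is easiest to see with a toy example. Naive Bayes estimates each conditional probability from raw category counts, so if no training record with SPD_LIM = 5 has INJURY = no, then P(SPD_LIM = 5 | INJURY = no) = 0, and the posterior P(INJURY = no | SPD_LIM = 5), which is proportional to that likelihood, collapses to zero. A minimal sketch with made-up counts (not from accidents.csv):

```python
import pandas as pd

# Toy training data (made-up): every SPD_LIM = 5 record happens to be an injury.
toy = pd.DataFrame({
    "SPD_LIM": [5, 5, 40, 40, 55, 55],
    "INJURY":  ["yes", "yes", "no", "yes", "no", "no"],
})

# Naive Bayes estimates P(SPD_LIM = 5 | INJURY = no) from raw counts:
no_records = toy[toy["INJURY"] == "no"]
p_spd5_given_no = (no_records["SPD_LIM"] == 5).mean()
print(p_spd5_given_no)  # 0.0 -> forces P(INJURY = no | SPD_LIM = 5) to 0

# Laplace smoothing (alpha = 1, as in CategoricalNB's default) avoids the hard zero:
n_categories = toy["SPD_LIM"].nunique()
smoothed = ((no_records["SPD_LIM"] == 5).sum() + 1) / (len(no_records) + n_categories)
print(smoothed)  # small but non-zero (1/6 here)
```

This is why a zero appears in the pivot tables: the training data simply contains no no-injury accidents at that speed limit.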



Code Implementation

# Importing all required ML modules >>
# Importing numpy for numerical calculations >>
import numpy as np
import warnings
# Importing pandas for data handling >>
import pandas as pd
# Importing matplotlib and seaborn for data visualization >>
import seaborn as sns
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')
# Importing sklearn module for splitting data >>
from sklearn.model_selection import train_test_split
# Importing LabelEncoder for encoding data >>
from sklearn.preprocessing import LabelEncoder
# Importing CategoricalNB module for classification >>
from sklearn.naive_bayes import CategoricalNB
# Importing sklearn accuracy metrics >>
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

#Dataset Loading

# Loading Dataset into df variables >>
df = pd.read_csv('./accidents.csv')

#Explore Dataset

# Explore the Dataset take a look at top 5 Rows >>
df.head()
# Check the shape of Dataset rows and columns >>
print("[$] Rows of dataset >> ",df.shape[0])
print("[$] Columns of dataset >> ",df.shape[1])

output:

[$] Rows of dataset >>  42183
[$] Columns of dataset >>  24

# Check the information of Dataset >>
df.info()

output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42183 entries, 0 to 42182
Data columns (total 24 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   HOUR_I_R        42183 non-null  int64
 1   ALCHL_I         42183 non-null  int64
 2   ALIGN_I         42183 non-null  int64
 3   STRATUM_R       42183 non-null  int64
 4   WRK_ZONE        42183 non-null  int64
 5   WKDY_I_R        42183 non-null  int64
 6   INT_HWY         42183 non-null  int64
 7   LGTCON_I_R      42183 non-null  int64
 8   MANCOL_I_R      42183 non-null  int64
 9   PED_ACC_R       42183 non-null  int64
 10  RELJCT_I_R      42183 non-null  int64
 11  REL_RWY_R       42183 non-null  int64
 12  PROFIL_I_R      42183 non-null  int64
 13  SPD_LIM         42183 non-null  int64
 14  SUR_COND        42183 non-null  int64
 15  TRAF_CON_R      42183 non-null  int64
 16  TRAF_WAY        42183 non-null  int64
 17  VEH_INVL        42183 non-null  int64
 18  WEATHER_R       42183 non-null  int64
 19  INJURY_CRASH    42183 non-null  int64
 20  NO_INJ_I        42183 non-null  int64
 21  PRPTYDMG_CRASH  42183 non-null  int64
 22  FATALITIES      42183 non-null  int64
 23  MAX_SEV_IR      42183 non-null  int64
dtypes: int64(24)
memory usage: 7.7 MB

# Check for any null values >>
print("[$] Null values in dataset >> ",df.isnull().sum().sum())
print("[$] Duplicate values in dataset >> ",df.duplicated().sum())

output:

[$] Null values in dataset >>  0
[$] Duplicate values in dataset >>  18366
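The 18,366 fully duplicated rows are dropped next. As a reminder of the semantics, `drop_duplicates` keeps the first occurrence of each repeated row; a toy illustration (made-up values):

```python
import pandas as pd

# Toy frame (made-up) with one fully duplicated row >>
toy = pd.DataFrame({"A": [1, 1, 2], "B": ["x", "x", "y"]})
print(toy.duplicated().sum())   # 1 duplicate row detected
toy.drop_duplicates(inplace=True)
print(toy.shape[0])             # 2 rows remain (first copy kept)
```

Applied to this dataset, the working frame should shrink from 42,183 to 23,817 rows after the drop.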

# Dropping the duplicate Values >>
df.drop_duplicates(inplace=True)
# Check the statistical description of dataset >>
df.describe().T

output:

                  count       mean        std  min   25%   50%   75%   max
HOUR_I_R        42183.0   0.429344   0.494988  0.0   0.0   0.0   1.0   1.0
ALCHL_I         42183.0   1.912832   0.282084  1.0   2.0   2.0   2.0   2.0
ALIGN_I         42183.0   1.131546   0.338000  1.0   1.0   1.0   1.0   2.0
STRATUM_R       42183.0   0.491620   0.499936  0.0   0.0   0.0   1.0   1.0
WRK_ZONE        42183.0   0.022616   0.148677  0.0   0.0   0.0   0.0   1.0
WKDY_I_R        42183.0   0.771614   0.419797  0.0   1.0   1.0   1.0   1.0
INT_HWY         42183.0   0.150321   0.418952  0.0   0.0   0.0   0.0   9.0
LGTCON_I_R      42183.0   1.492521   0.789874  1.0   1.0   1.0   2.0   3.0
MANCOL_I_R      42183.0   1.337079   0.929756  0.0   0.0   2.0   2.0   2.0
PED_ACC_R       42183.0   0.040514   0.197164  0.0   0.0   0.0   0.0   1.0
RELJCT_I_R      42183.0   0.557926   0.496639  0.0   0.0   1.0   1.0   1.0
REL_RWY_R       42183.0   0.766541   0.423037  0.0   1.0   1.0   1.0   1.0
PROFIL_I_R      42183.0   0.243226   0.429035  0.0   0.0   0.0   0.0   1.0
SPD_LIM         42183.0  43.547875  12.948396  5.0  35.0  40.0  55.0  75.0
SUR_COND        42183.0   1.290710   0.780524  1.0   1.0   1.0   1.0   9.0
TRAF_CON_R      42183.0   0.516322   0.749417  0.0   0.0   0.0   1.0   2.0
TRAF_WAY        42183.0   1.477491   0.584851  1.0   1.0   1.0   2.0   3.0
VEH_INVL        42183.0   1.816964   0.684843  1.0   1.0   2.0   2.0  23.0
WEATHER_R       42183.0   1.142783   0.349855  1.0   1.0   1.0   1.0   2.0
INJURY_CRASH    42183.0   0.497736   0.500001  0.0   0.0   0.0   1.0   1.0
NO_INJ_I        42183.0   0.778702   1.035169  0.0   0.0   1.0   1.0  31.0
PRPTYDMG_CRASH  42183.0   0.491217   0.499929  0.0   0.0   0.0   1.0   1.0
FATALITIES      42183.0   0.011047   0.104524  0.0   0.0   0.0   0.0   1.0
MAX_SEV_IR      42183.0   0.519830   0.521256  0.0   0.0   1.0   1.0   2.0

Summary:

  1. Injuries occur in nearly half (49.8%) of all crashes, showing an even distribution between injury and non-injury crashes.

  2. Alcohol involvement (ALCHL_I) is recorded in a small fraction of cases: with its 1–2 coding, a mean of 1.91 means roughly 91% of records take the higher value, i.e. most crashes are non-alcohol-related.

  3. Crashes on weekdays (WKDY_I_R) are significantly more common (77.2%) than on weekends.

  4. Only 2.26% of crashes occur in work zones (WRK_ZONE), indicating a low frequency of construction-related incidents.

  5. Speed limits range from 5 to 75 mph, with a median of 40 mph, showing that most crashes occur in mid-range speed zones.

  6. Most crashes (77%) are coded 1 on REL_RWY_R (relation to roadway), i.e. the accident occurred on or along the roadway itself.

  7. Interstate highways (INT_HWY) account for about 15% of crashes, indicating moderate risk on these roads.

  8. Multi-vehicle crashes are more common (mean vehicles involved: 1.82, max: 23), with most crashes involving two vehicles.

  9. Fatalities are rare (only 1.1% of crashes result in fatalities), while injury severity (MAX_SEV_IR) varies across cases.

  10. Property damage crashes (PRPTYDMG_CRASH) occur at nearly the same rate as injury crashes (49.1%), showing a balanced impact between damage and injuries.

  11. Straight-road crashes: 86.8% of accidents happened on straight roads (ALIGN_I = 1) rather than curves.

  12. Road conditions: Majority of accidents occurred on dry roads with clear weather conditions.
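Most of the shares quoted above fall straight out of the 0/1 coding: the mean of a binary indicator is the fraction of records coded 1. A toy illustration (made-up values, not the real WKDY_I_R column):

```python
import pandas as pd

# Toy 0/1 weekday indicator: the mean of a binary column is the share of 1s >>
wkdy = pd.Series([1, 1, 1, 0])
print(wkdy.mean())  # 0.75 -> 75% weekday crashes in this toy sample
```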


#Data Visualization

# Check the distribution of 0=no injury, 1=non-fatal inj., 2=fatal inj. >>
df['MAX_SEV_IR'].value_counts().plot(kind='pie',autopct='%1.0f%%',figsize=(6,6))
# >> We can see balance in the dataset: injury and non-injury are each approximately 50% >>

output:

(pie chart of MAX_SEV_IR class shares; injury and non-injury classes are roughly balanced)

(Q) Our goal here is to predict whether an accident just reported will involve an injury (MAX_SEV_IR = 1 or 2) or will not (MAX_SEV_IR = 0). For this purpose, create a dummy variable called INJURY that takes the value “yes” if MAX_SEV_IR = 1 or 2, and otherwise “no.”


# Convert MAX_SEV_IR into a binary target variable 'INJURY' >>
df['INJURY'] = np.where(df['MAX_SEV_IR'].isin([1, 2]), 'yes', 'no')

# Drop MAX_SEV_IR as it is now encoded into INJURY >>
df.drop(columns=['MAX_SEV_IR'], inplace=True)

# Convert INJURY from categorical to numerical values > yes:1 and no:0 >>
df['INJURY'] = df['INJURY'].replace({'yes':1,'no':0})
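A quick sanity check of the MAX_SEV_IR → INJURY recode on toy severity codes (the values below are made up to cover all three levels):

```python
import numpy as np
import pandas as pd

# Toy severity codes covering all three levels: 0, 1, 2 >>
sev = pd.Series([0, 1, 2, 0, 2])

# Same recode as above: 1 or 2 -> 'yes', otherwise 'no' >>
injury = np.where(sev.isin([1, 2]), "yes", "no")
print(list(injury))  # ['no', 'yes', 'yes', 'no', 'yes']

# After the yes/no -> 1/0 recode, the mean gives the injury rate >>
injury_num = pd.Series(injury).map({"yes": 1, "no": 0})
print(injury_num.mean())  # 0.6
```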



#Categorical Distribution

# Select all our columns >>
categorical_vars = df.columns.tolist()

# Set up the grid layout >>
num_vars = len(categorical_vars)
cols = 4
rows = 6

fig, axes = plt.subplots(rows, cols, figsize=(20, 5 * rows))
axes = axes.flatten()

# Plot each categorical variable >>
for i, var in enumerate(categorical_vars):
    sns.countplot(x=df[var], ax=axes[i], palette='viridis', hue=df['INJURY'])
    axes[i].set_title(var)
    axes[i].set_ylabel('Count')
    axes[i].set_xlabel('')

plt.tight_layout()
plt.show()


output:

(grid of count plots, one per variable, split by INJURY)

Summary of Distribution:

  1. Injuries occur in 54.5% of all crashes, with a higher proportion on weekends (56.9%) than weekdays (54.5%).

  2. Head-on collisions have the highest injury rate (71.2%), while rear-end crashes have the lowest (43.6%).

  3. Pedestrian and cyclist crashes result in injuries 99.1% of the time, making them the most dangerous.

  4. Speed limits between 35-55 mph account for nearly half (47.1%) of all injury crashes.

  5. Alcohol-related crashes lead to injuries 59.9% of the time, higher than non-alcohol-related crashes (53.8%).

  6. Dark, unlit roads have a higher injury severity risk, despite fewer crashes compared to well-lit areas.

  7. Intersections are involved in 47% of injury cases, making them high-risk zones.

  8. Most injuries happen in daylight (58.6%) and on dry roads (67.3%), despite better driving conditions.

  9. Multi-vehicle crashes (61.2%) cause more injuries than single-vehicle crashes (51%).

  10. Work zone crashes are rare but have a high injury rate (53.6%).
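The excerpt ends before the classifier itself. As a sketch only (not the article's actual code), the remaining bullets — fitting CategoricalNB, the confusion matrix, the validation error, and the improvement over the naive rule — could be wired up as below, with a made-up toy dataset standing in for accidents.csv; with the real data, X would hold the pre-accident predictors and y the INJURY column:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy stand-in for the accidents data (made-up categorical codes) >>
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "WKDY_I_R":  rng.integers(0, 2, 500),
    "WEATHER_R": rng.integers(1, 3, 500),
    "SPD_LIM":   rng.choice([5, 25, 40, 55], 500),
})
y = (X["SPD_LIM"] > 25).astype(int)  # made-up injury rule for the toy data

# CategoricalNB expects each column coded as 0..k-1 >>
X_coded = X.apply(lambda c: c.astype("category").cat.codes)

# Split into training and validation sets >>
X_tr, X_va, y_tr, y_va = train_test_split(X_coded, y, test_size=0.4, random_state=1)

# alpha > 0 is the Laplace smoothing that avoids the zero-probability issue >>
nb = CategoricalNB(alpha=1.0)
nb.fit(X_tr, y_tr)
pred = nb.predict(X_va)

# Confusion matrix and overall validation error >>
print(confusion_matrix(y_va, pred))
err = 1 - accuracy_score(y_va, pred)

# Naive rule: always predict the majority class of the validation set >>
naive_err = 1 - max(y_va.mean(), 1 - y_va.mean())
print(err, naive_err, (naive_err - err) / naive_err)  # improvement over naive rule
```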


