Introduction
In 1912, the RMS Titanic struck an iceberg on its maiden voyage and sank, killing most of its passengers and crew. In this introductory project, we will explore a subset of the RMS Titanic passenger manifest to determine which features best predict whether a passenger survived. To complete this project, you will need to implement several conditional predictions and answer the questions below.
Getting Started
To begin working with the RMS Titanic passenger data, we'll first need to import the functionality we need, and load our data into a pandas DataFrame.
Run the code cell below to load our data and display the first few entries (passengers) for examination using the .head() method.
Import Libraries
import numpy as np
import pandas as pd
# RMS Titanic data visualization code
from titanic_visualizations import survival_stats
%matplotlib inline
Read Data
# Load the dataset
in_file = 'titanic_data.csv'
full_data = pd.read_csv(in_file)
# Print the first few entries of the RMS Titanic data
full_data.head()
Output
From a sample of the RMS Titanic data, we can see the various features present for each passenger on the ship:
Survived: Outcome of survival (0 = No; 1 = Yes)
Pclass: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
Name: Name of passenger
Sex: Sex of the passenger
Age: Age of the passenger (Some entries contain NaN)
SibSp: Number of siblings and spouses of the passenger aboard
Parch: Number of parents and children of the passenger aboard
Ticket: Ticket number of the passenger
Fare: Fare paid by the passenger
Cabin: Cabin number of the passenger (Some entries contain NaN)
Embarked: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)
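Because Age and Cabin contain NaN entries, it can be useful to count the missing values per feature before making predictions. The following is a minimal sketch that uses a small hypothetical DataFrame named toy in place of full_data; its values are made up for illustration and are not taken from titanic_data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy stand-in for full_data; values are illustrative only
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, 'C123'],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

# isnull() marks missing entries; sum() then counts them per column
missing_counts = toy.isnull().sum()
print(missing_counts)
```

Running the same two calls on full_data would report how many Age and Cabin entries are missing in the real dataset.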
Since we're interested in the outcome of survival for each passenger or crew member, we can remove the Survived feature from this dataset and store it as its own separate variable outcomes. We will use these outcomes as our prediction targets. Run the code cell below to remove Survived as a feature of the dataset and store it in outcomes.
outcomes = full_data['Survived']
# df.drop() drops columns (axis = 1) or rows (axis = 0)
data = full_data.drop('Survived', axis = 1)
# Show the new dataset with 'Survived' removed
data.head()
Output
Accuracy Score
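def accuracy_score(truth, pred):
    """ Returns the accuracy score (in percent) for input truth and predictions. """
    # Ensure that the number of predictions matches the number of outcomes
    if len(truth) == len(pred):
        # The mean of the element-wise comparison is the fraction predicted correctly
        return "Predictions have an accuracy of {:.2f}.".format((truth == pred).mean()*100)
    else:
        return "Number of predictions does not match number of outcomes!"

# Test the function: predict that each of the first five passengers survived
predictions = pd.Series(np.ones(5, dtype = int))
print(accuracy_score(outcomes[:5], predictions))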
Output
Predictions have an accuracy of 60.00.
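The score above comes from comparing the two Series element by element and averaging the resulting booleans. The sketch below walks through that arithmetic on assumed outcome values (chosen to give the same 60.00 result, not read from titanic_data.csv).

```python
import numpy as np
import pandas as pd

# Illustrative outcomes (assumed values, not read from titanic_data.csv)
truth = pd.Series([0, 1, 1, 1, 0])
# Predict survival (1) for all five passengers
pred = pd.Series(np.ones(5, dtype=int))

# Element-wise comparison gives a boolean Series: True where the prediction is right
matches = (truth == pred)
# The mean of booleans is the fraction of correct predictions
accuracy = matches.mean() * 100
print("{:.2f}".format(accuracy))
```

Here three of the five predictions match, so the mean is 0.6 and the reported accuracy is 60.00. Note that the comparison relies on both Series sharing the same index labels.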
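Prediction
# Baseline model: predict that no passenger survived.
# Its accuracy is the null accuracy that later models should beat.
def predictions_0(data):
    predictions = []
    for index, passenger in data.iterrows():
        # Predict 0 (did not survive) for every passenger
        predictions.append(0)
    # Return the predictions as a pandas Series
    return pd.Series(predictions)

# Make the predictions
predictions = predictions_0(data)
predictions.head()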
Output
0    0
1    0
2    0
3    0
4    0
dtype: int64
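The loop over data.iterrows() keeps the structure that later conditional predictions will need, but for a constant prediction a loop is not strictly required. As a possible alternative, the sketch below builds the same all-zero Series directly; predictions_0_vectorized and toy are hypothetical names introduced only for this illustration.

```python
import pandas as pd

def predictions_0_vectorized(data):
    """Hypothetical helper: predict 0 (did not survive) for every row."""
    # A scalar combined with an index is repeated for every index label
    return pd.Series(0, index=data.index)

# Toy stand-in for the passenger data (illustrative values only)
toy = pd.DataFrame({'Sex': ['male', 'female', 'male']})
toy_predictions = predictions_0_vectorized(toy)
print(toy_predictions.tolist())
```

Reusing data.index keeps the predictions aligned with the rows of the input, which matters when they are later compared against outcomes.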
Accuracy Score
print(accuracy_score(outcomes, predictions))
Output
Predictions have an accuracy of 61.62.
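Always predicting 0 is correct exactly when a passenger did not survive, so this score is the share of the majority class in outcomes, often called the null accuracy. A minimal sketch of computing it directly, using hypothetical outcome values rather than the real data:

```python
import pandas as pd

# Hypothetical outcomes for illustration (1 = survived, 0 = did not survive)
toy_outcomes = pd.Series([0, 0, 1, 0, 1, 0, 0, 1])

# Fraction of passengers in each class
class_shares = toy_outcomes.value_counts(normalize=True)
# The most frequent class and its share (in percent) give the null accuracy
majority_class = class_shares.idxmax()
null_accuracy = class_shares.max() * 100
print("Majority class: {}, null accuracy: {:.2f}".format(majority_class, null_accuracy))
```

Any model built from the passenger features should score above this baseline to be considered useful.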