
Predicting Titanic Survival Using Python and Machine Learning


In 1912, the RMS Titanic struck an iceberg on its maiden voyage and sank, resulting in the deaths of most of its passengers and crew. In this introductory project, we will explore a subset of the RMS Titanic passenger manifest to determine which features best predict whether a passenger survived. To complete this project, you will need to implement several conditional predictions and answer the questions below.

Getting Started

To begin working with the RMS Titanic passenger data, we'll first need to import the functionality we need, and load our data into a pandas DataFrame.

Run the code cell below to load our data and display the first few entries (passengers) for examination using the .head() method.

Import Libraries

import numpy as np
import pandas as pd

# RMS Titanic data visualization code 
from titanic_visualizations import survival_stats
%matplotlib inline

Read Data

# Load the dataset
in_file = 'titanic_data.csv'
full_data = pd.read_csv(in_file)

# Print the first few entries of the RMS Titanic data
full_data.head()


From a sample of the RMS Titanic data, we can see the various features present for each passenger on the ship:

  • Survived: Outcome of survival (0 = No; 1 = Yes)

  • Pclass: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)

  • Name: Name of passenger

  • Sex: Sex of the passenger

  • Age: Age of the passenger (Some entries contain NaN)

  • SibSp: Number of siblings and spouses of the passenger aboard

  • Parch: Number of parents and children of the passenger aboard

  • Ticket: Ticket number of the passenger

  • Fare: Fare paid by the passenger

  • Cabin: Cabin number of the passenger (Some entries contain NaN)

  • Embarked: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)
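Because Age and Cabin contain missing entries, it can help to count them before building any predictions. The following is a minimal sketch on a small made-up DataFrame (the name toy and its values are illustrative, not taken from titanic_data.csv); on the real data the same call would be full_data.isnull().sum().

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the passenger data
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, 'C123'],
    'Fare': [7.25, 71.28, 7.93, 53.10],
})

# Count missing (NaN) entries in each column
missing_counts = toy.isnull().sum()
print(missing_counts)
```

Columns with a non-zero count need some care later, since a condition such as an age threshold evaluates to False for NaN values.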

Since we're interested in the outcome of survival for each passenger or crew member, we can remove the Survived feature from this dataset and store it as its own separate variable outcomes. We will use these outcomes as our prediction targets. Run the code cell below to remove Survived as a feature of the dataset and store it in outcomes.

outcomes = full_data['Survived']

# df.drop() drops columns or rows; axis = 1 selects columns
data = full_data.drop('Survived', axis = 1)

# Show the new dataset with 'Survived' removed
data.head()


Accuracy Score

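def accuracy_score(truth, pred):
    """Return the accuracy score for the given truth and predictions."""
    # Ensure that the number of predictions matches the number of outcomes
    if len(truth) == len(pred):
        return "Predictions have an accuracy of {:.2f}.".format((truth == pred).mean()*100)
    else:
        return "Number of predictions does not match number of outcomes!"

# Test the function: predict that the first five passengers all survived
predictions = pd.Series(np.ones(5, dtype = int))
print(accuracy_score(outcomes[:5], predictions))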


Predictions have an accuracy of 60.00.
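To see where a figure like 60.00 comes from, the comparison inside accuracy_score can be unpacked step by step. This sketch uses two short hand-written Series (illustrative values, not read from the dataset) in which three of the five predictions match.

```python
import pandas as pd

# Illustrative values only: five true outcomes and five predictions of "survived"
truth = pd.Series([0, 1, 1, 1, 0])
pred = pd.Series([1, 1, 1, 1, 1])

matches = (truth == pred)          # element-wise comparison gives a boolean Series
accuracy = matches.mean() * 100    # True counts as 1, so the mean is the fraction correct

print(matches.tolist())            # [False, True, True, True, False]
print("{:.2f}".format(accuracy))   # 60.00
```

Note that pandas compares two Series by index label, so both need to carry the same index for the comparison to work; here both use the default 0 to 4.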


# We can calculate the null accuracy from this

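def predictions_0(data):
    """Model with no features: always predict that a passenger did not survive."""
    predictions = []
    for index, passenger in data.iterrows():
        # Predict 0 (did not survive) for every passenger
        predictions.append(0)
    return pd.Series(predictions)

predictions = predictions_0(data)
predictions.head()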


0    0
1    0
2    0
3    0
4    0
dtype: int64

Accuracy Score

print(accuracy_score(outcomes, predictions))


Predictions have an accuracy of 61.62.
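The null accuracy mentioned above is the score obtained by always predicting the most common outcome. A minimal sketch, using a made-up Series (toy_outcomes is a hypothetical name and its values are not the real data), suggests why it equals the share of the majority class.

```python
import pandas as pd

# Hypothetical outcomes: 5 non-survivors (0) and 3 survivors (1)
toy_outcomes = pd.Series([0, 0, 1, 0, 1, 0, 1, 0])

# Share of each class; the largest share is the null accuracy
class_shares = toy_outcomes.value_counts(normalize=True)
majority_class = class_shares.idxmax()
null_accuracy = class_shares.max() * 100

# Always predicting the majority class reproduces that score
constant_pred = pd.Series([majority_class] * len(toy_outcomes))
constant_accuracy = (toy_outcomes == constant_pred).mean() * 100

print(majority_class, null_accuracy, constant_accuracy)
```

Read the same way, the 61.62 reported for the real data is the percentage of passengers in outcomes who did not survive, and it serves as the baseline that any feature-based prediction should beat.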

