Project 2: Titanic (Part 1)

Workflow

  1. Defining the Question/Problem
  2. Acquire the training and testing data; analyse it to identify patterns and explore its structure
  3. Wrangle, prepare, cleanse the data
  4. Model, predict and solve the problem
  5. Visualise, report, and present the problem-solving steps and final solution

1. Defining the Question/Problem

Titanic: Machine Learning from Disaster

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

Objective

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

For each PassengerId in the test set, you must predict a 0 or 1 value for the Survived variable.
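Concretely, the submission is a two-column CSV keyed by PassengerId. A minimal sketch of its shape (the ids 892-894 are the first ids in test.csv; the Survived values here are placeholders, not real predictions):

In [ ]:
import pandas as pd

# Illustrative submission shape only; actual predictions come from the
# models trained later in this series
example = pd.DataFrame({'PassengerId': [892, 893, 894],   # first ids in test.csv
                        'Survived':    [0, 1, 0]})        # placeholder 0/1 predictions
example.to_csv('submission.csv', index=False)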

Metric

Your score is the percentage of passengers you correctly predict. This is known simply as “accuracy”.
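As a worked illustration of the metric, using hypothetical labels (not competition data):

In [ ]:
import numpy as np

y_true = np.array([0, 1, 1, 0, 1])      # hypothetical ground-truth labels
y_pred = np.array([0, 1, 0, 0, 1])      # hypothetical predictions
accuracy = (y_pred == y_true).mean()    # 4 of 5 correct -> accuracy = 0.8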

2. Acquire & Analyse training and testing data

I downloaded the training and testing datasets from Kaggle and saved them in the same folder as the Jupyter notebook.

In [63]:
import numpy as np
import pandas as pd
import random as rnd

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.linear_model import LogisticRegression  # Logistic Regression
from sklearn.svm import SVC, LinearSVC  # Support Vector Machines
from sklearn.ensemble import RandomForestClassifier  # Random Forest
from sklearn.neighbors import KNeighborsClassifier  # K-Nearest Neighbours
from sklearn.tree import DecisionTreeClassifier  # Decision Tree
from sklearn.linear_model import Perceptron  # Perceptron
In [2]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
combine = [train_df, test_df]
In [3]:
train_df.head()
Out[3]:
  PassengerId  Survived  Pclass  Name                                              Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0           1         0       3  Braund, Mr. Owen Harris                           male    22.0  1      0      A/5 21171         7.2500   NaN    S
1           2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th…  female  38.0  1      0      PC 17599          71.2833  C85    C
2           3         1       3  Heikkinen, Miss. Laina                            female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3           4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)     female  35.0  1      0      113803            53.1000  C123   S
4           5         0       3  Allen, Mr. William Henry                          male    35.0  0      0      373450            8.0500   NaN    S

Which features are categorical?

  • Survived
  • Sex
  • Embarked
  • Pclass

Which features are numerical?

  • Age
  • Fare
  • SibSp
  • Parch

Which features are mixed data types?

  • Ticket (mix of numeric and alphanumeric)
  • Cabin (alphanumeric)

Which features contain blank, null or empty values?

As the heatmap below shows, the Cabin, Age and Embarked features contain null values.

In [4]:
sns.heatmap(train_df.isnull(),cmap='viridis',cbar=False,yticklabels=False)
Out[4]:
[Figure: heatmap of missing values in train_df; gaps visible in Age, Cabin and Embarked]
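For exact counts, pandas can tally the nulls directly (a quick check, not in the original notebook; the numbers match the info() output below):

In [ ]:
# Count missing values per column; complements the heatmap above
train_df.isnull().sum()
# Age 177, Cabin 687, Embarked 2; all other columns are complete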

In the training data there are 7 integer/float features and 5 string (object) features.

In the testing data there are 6 integer/float features and 5 string (object) features.

In [5]:
train_df.info()
print('-'*40)
test_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
----------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

What is the distribution of numerical feature values across the samples?

In [6]:
train_df.describe(percentiles=[.1,.2,.3,.4,.5,.6,.7,.8,.9,.99])
Out[6]:
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
10%      90.000000    0.000000    1.000000   14.000000    0.000000    0.000000    7.550000
20%     179.000000    0.000000    1.000000   19.000000    0.000000    0.000000    7.854200
30%     268.000000    0.000000    2.000000   22.000000    0.000000    0.000000    8.050000
40%     357.000000    0.000000    2.000000   25.000000    0.000000    0.000000   10.500000
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
60%     535.000000    0.000000    3.000000   31.800000    0.000000    0.000000   21.679200
70%     624.000000    1.000000    3.000000   36.000000    1.000000    0.000000   27.000000
80%     713.000000    1.000000    3.000000   41.000000    1.000000    1.000000   39.687500
90%     802.000000    1.000000    3.000000   50.000000    1.000000    2.000000   77.958300
99%     882.100000    1.000000    3.000000   65.870000    5.000000    4.000000  249.006220
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
  • The 891 training samples represent 40% of the actual 2,224 passengers and crew on board the Titanic (given in the context above)
  • Survived is a categorical feature with 0 or 1 values
  • The training samples indicate a 38% survival rate, close to the actual rate of 32%
  • Nearly 30% of the passengers had siblings and/or spouses aboard
  • Fares varied significantly, with a few passengers (<1%) paying as much as $512
  • Few passengers (<1%) were elderly, in the 65-80 age range
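A few of these figures can be spot-checked directly (a quick check, not part of the original notebook):

In [ ]:
print(train_df['Survived'].mean())      # ~0.384 survival rate
print((train_df['SibSp'] > 0).mean())   # ~0.32 had siblings/spouses aboard
print((train_df['Fare'] > 500).sum())   # 3 passengers paid more than $500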

What is the distribution of categorical features?

In [7]:
train_df.describe(include=['O'])
Out[7]:
        Name                             Sex   Ticket  Cabin    Embarked
count   891                              891   891     204      889
unique  891                              2     681     147      3
top     Homer, Mr. Harry (“Mr E Haven”)  male  1601    B96 B98  S
freq    1                                577   7       4        644
  • Each name in the training set is unique
  • 65% of passengers are male (577/891)
  • Cabin values are duplicated across samples, i.e. several passengers shared a cabin
  • Embarked takes on 3 possible values; port S was used by most passengers
  • The Ticket feature has a high ratio of duplicate values (23%, 210/891)
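The same summary can be verified with value counts (again, not in the original notebook):

In [ ]:
print(train_df['Sex'].value_counts())         # male 577, female 314
print(train_df['Embarked'].value_counts())    # S 644, C 168, Q 77
print(train_df['Ticket'].duplicated().sum())  # 210 duplicate tickets (891 - 681 unique)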

Assumptions based on data analysis

Data Correlation

  • We want to know how well each feature correlates with Survival.

Data Completion

  • We may want to complete the Age feature as it is likely correlated with Survival.
  • We may want to complete the Embarked feature as it may also correlate with Survival or with another important feature.

Data Correction

  • The Ticket feature may be dropped from our analysis as it contains a high ratio of duplicates (23%) and there may be no correlation between Ticket and Survival.
  • The Cabin feature may be dropped as it is highly incomplete, containing many null values in both the training and test datasets.
  • PassengerId may be dropped from the training dataset as it does not contribute to Survival.
  • The Name feature is relatively non-standard and may not contribute directly to Survival, so it may be dropped.

Data Creation

  • We may want to create a new feature called Family based on Parch and SibSp to get the total count of family members on board.
  • We may want to engineer the Name feature to extract Title as a new feature.
  • We may want to create a new feature for Age bands. This turns a continuous numerical feature into an ordinal categorical feature.
  • We may also want to create a Fare range feature if it helps our analysis. (These ideas are sketched in code after this list.)
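A minimal sketch of these candidate features (FamilySize, Title, AgeBand and FareBand are illustrative names; the actual feature engineering happens later in the series):

In [ ]:
for df in combine:
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1  # passenger plus relatives aboard
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)  # e.g. Mr, Mrs, Miss
train_df['AgeBand'] = pd.cut(train_df['Age'], 5)      # 5 equal-width age bands (NaN ages stay NaN)
train_df['FareBand'] = pd.qcut(train_df['Fare'], 4)   # 4 fare quartiles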

Data Classification

  • Women (Sex=female) were more likely to have survived.
  • Children (Age<?) were more likely to have survived.
  • The upper-class passengers (Pclass=1) were more likely to have survived.

Assumption verification

In [8]:
g = sns.FacetGrid(train_df,col='Survived')
g.map(plt.hist,'Age',bins=40)
Out[8]:
[Figure: histograms of Age, one panel per Survived value]

Observations

  • Infants (Age <= 4) had a high survival rate.
  • The oldest passenger (Age = 80) survived.
  • A large number of 15-25 year olds did not survive.
  • Most passengers are in the 15-35 age range.

This confirms that we should consider Age in our model training, fill in its null values, and band the ages into groups.
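One possible imputation strategy is sketched below (median Age per Sex/Pclass group; this is an assumption for illustration, not necessarily the strategy adopted later):

In [ ]:
# Fill missing ages with the median age of each (Sex, Pclass) group
age_medians = train_df.groupby(['Sex', 'Pclass'])['Age'].transform('median')
train_df['Age'] = train_df['Age'].fillna(age_medians)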

In [9]:
sns.barplot(x='Sex',y='Survived',data=train_df)
Out[9]:
[Figure: bar plot of survival rate by Sex]

The plot above confirms our assumption that females were more likely to survive.

In [10]:
sns.barplot(x='Pclass',y='Survived',data=train_df)
Out[10]:
[Figure: bar plot of survival rate by Pclass]

The plot above confirms our assumption that upper-class passengers were more likely to survive.
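For reference, both comparisons can be summarised numerically (a quick check, not part of the original notebook):

In [ ]:
print(train_df.groupby('Sex')['Survived'].mean())     # female ~0.74, male ~0.19
print(train_df.groupby('Pclass')['Survived'].mean())  # 1: ~0.63, 2: ~0.47, 3: ~0.24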

In [11]:
grid = sns.FacetGrid(train_df,row='Pclass',col='Survived',aspect=1.5, size=2.3)
grid.map(plt.hist,'Age',bins=30)
Out[11]:
[Figure: Age histograms faceted by Pclass (rows) and Survived (columns)]

Observations

  • Pclass 3 had the most passengers, yet most of them did not survive. Consistent with classifying assumption 3
  • Infant passengers in Pclass 2 and Pclass 3 mostly survived. Further qualifies classifying assumption 2
  • Most passengers in Pclass 1 survived. Confirms classifying assumption 3
  • The Age distribution of passengers varies across Pclass
In [12]:
embarked = sns.FacetGrid(train_df, row='Embarked', size=2.3, aspect=1.5)
embarked.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep')
embarked.add_legend()
Out[12]:
[Figure: point plots of survival rate vs Pclass, split by Sex, one panel per port of embarkation]

Observations

  • Female passengers had a much better survival rate than males. Confirms classifying assumption 1
  • The exception is Embarked=C, where males had a higher survival rate. This could reflect a correlation between Pclass and Embarked, and in turn between Pclass and Survived, rather than a direct correlation between Embarked and Survived
  • Males had a better survival rate in Pclass=3 than in Pclass=2 for the C and Q ports. Supports completing assumption 2 (Embarked)
  • Ports of embarkation have varying survival rates for Pclass=3 and among male passengers. Supports correlating assumption 1
In [13]:
fare = sns.FacetGrid(train_df,row='Embarked',col='Survived',size = 2.3,aspect=1.5)
fare.map(sns.barplot,'Sex','Fare',ci=None)
fare.add_legend()
Out[13]:
[Figure: mean Fare by Sex, faceted by Embarked (rows) and Survived (columns)]

Observations

  • Higher fare-paying passengers had better survival rates. Confirms data-creation assumption 4 (Fare ranges)
  • Port of embarkation correlates with survival rates. Confirms correlating assumption 1 and completing assumption 2
