- Defining the Question/Problem
- Acquire training and testing data and Analyse, identify patterns, and explore the data
- Wrangle, prepare, cleanse the data
- Model, predict and solve the problem
- Visualise, report, and present the problem solving steps and final solution
1. Defining the Question/Problem
Titanic: Machine Learning from Disaster
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy. It is your job to predict if a passenger survived the sinking of the Titanic or not.
For each PassengerId in the test set, you must predict a 0 or 1 value for the Survived variable.
Your score is the percentage of passengers you correctly predict. This is known simply as “accuracy”.
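As a quick illustration of the accuracy metric, here is a minimal sketch with made-up predictions for five passengers (the values are hypothetical, not from the competition):

```python
# Accuracy = number of correct predictions / total predictions.
actual    = [0, 1, 1, 0, 1]   # hypothetical true Survived values
predicted = [0, 1, 0, 0, 1]   # hypothetical model predictions

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(accuracy)  # 4 of 5 correct -> 0.8
```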
2. Acquire & Analyse training and testing data
I downloaded the training and testing datasets from Kaggle and saved them in the same folder as the Jupyter notebook.
```python
import numpy as np
import pandas as pd
import random as rnd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.linear_model import LogisticRegression   # Logistic Regression
from sklearn.svm import SVC, LinearSVC                # Support Vector Machines
from sklearn.ensemble import RandomForestClassifier   # Random Forest
from sklearn.neighbors import KNeighborsClassifier    # KNN
from sklearn.tree import DecisionTreeClassifier       # Decision Tree
from sklearn.linear_model import Perceptron           # Perceptron
```
```python
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
combine = [train_df, test_df]
train_df.head()   # preview the first rows
```
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Which features are categorical?
- Survived, Sex, and Embarked are categorical; Pclass is ordinal
Which features are numerical?
- Age and Fare are continuous; SibSp and Parch are discrete
Which features are mixed data types?
- Ticket (mix of numeric and alphanumeric)
- Cabin (alphanumeric)
Which features contain blank, null or empty values?
A heatmap of the null values shows that the Cabin, Age, and Embarked features contain nulls.
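The missing-value heatmap can be produced along these lines; the inline three-row frame is only a stand-in for train_df, mimicking the null pattern in Age, Cabin and Embarked:

```python
import pandas as pd
import seaborn as sns

# Tiny stand-in for train_df; in the notebook this is the real DataFrame.
df = pd.DataFrame({
    'Age':      [22.0, None, 26.0],
    'Cabin':    [None, 'C85', None],
    'Embarked': ['S', 'C', None],
})
# Light cells mark missing values in each column.
ax = sns.heatmap(df.isnull(), cbar=False)
```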
In the training data we have 7 features that are integer/floats and 5 strings (object)
In the testing data we have 6 features that are integer/floats and 5 strings (object)
```python
train_df.info()
print('-' * 40)
test_df.info()
```
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
----------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
```
What is the distribution of numerical feature values across the samples?
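The observations that follow come from pandas' describe(); here is a sketch on an inline sample (the real statistics require the full 891-row train.csv):

```python
import pandas as pd

# Inline sample standing in for train_df.
train_df = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Age':      [22.0, 38.0, 26.0, 35.0],
    'Fare':     [7.25, 71.28, 7.93, 8.05],
})
# count, mean, std, min, quartiles and max for each numeric column
print(train_df.describe())
```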
- The 891 training samples represent about 40% of the 2,224 passengers and crew on board the Titanic (given in the context above)
- Survived is a categorical feature with 0 or 1 values
- The training samples indicate a 38% survival rate, close to the actual survival rate of about 32%
- Nearly 30% of the passengers had siblings and/or spouses aboard
- Fares varied significantly with few passengers (<1%) paying as high as $512
- Few elderly passengers (<1%) within age range 65 – 80
What is the distribution of categorical features?
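For object columns, describe(include=['O']) reports count, unique, top and freq; a sketch on an inline sample (values illustrative, not the full data):

```python
import pandas as pd

# Inline sample standing in for train_df's object columns.
train_df = pd.DataFrame({
    'Sex':      ['male', 'female', 'male'],
    'Embarked': ['S', 'C', 'S'],
})
# 'top' is the most frequent value, 'freq' is its count.
print(train_df.describe(include=['O']))
```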
| | Name | Sex | Ticket | Cabin | Embarked |
|---|---|---|---|---|---|
| top | Homer, Mr. Harry (“Mr E Haven”) | male | 1601 | B96 B98 | S |
- Each passenger on board has a unique name
- 65% are male (577/891)
- Cabin values have several duplicates across samples, i.e. several passengers shared a cabin
- Embarked takes on 3 possible values. S port used by most passengers
- Ticket feature has a high ratio of duplicate values (about 24%: 210 of 891)
Assumptions based on the data analysis
- We want to know how well each feature correlates with Survival.
- We may want to complete the Age feature as it appears strongly correlated with Survival.
- We may want to complete the Embarked feature as it may also correlate with survival or another important feature.
- Ticket feature may be dropped from our analysis as it contains a high ratio of duplicates (about 24%) and there may not be a correlation between Ticket and survival.
- Cabin feature may be dropped as it is highly incomplete or contains many null values both in training and test dataset.
- PassengerId may be dropped from training dataset as it does not contribute to survival.
- The Name feature is relatively non-standard and may not contribute directly to survival, so it may be dropped.
- We may want to create a new feature called Family based on Parch and SibSp to get total count of family members on board.
- We may want to engineer the Name feature to extract Title as a new feature.
- We may want to create a new feature for Age bands. This turns a continuous numerical feature into an ordinal categorical feature.
- We may also want to create a Fare range feature if it helps our analysis.
- Women (Sex=female) were more likely to have survived.
- Children (Age<?) were more likely to have survived.
- The upper-class passengers (Pclass=1) were more likely to have survived.
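The Family and Title ideas above can be sketched as follows; the two-row frame and the `FamilySize`/`Title` column names are illustrative choices, not part of the original data:

```python
import pandas as pd

# Two sample rows with the Titanic column names.
df = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris',
             'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'],
    'SibSp': [1, 1],
    'Parch': [0, 0],
})
# Passenger plus siblings/spouses plus parents/children aboard.
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
# Pull the title (the word ending in '.') out of the name.
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
print(df[['FamilySize', 'Title']].to_string(index=False))
```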
```python
g = sns.FacetGrid(train_df, col='Survived')
g.map(plt.hist, 'Age', bins=40)
```
- Infants (Age <=4) had high survival rate.
- Oldest passengers (Age = 80) survived.
- Large number of 15-25 year olds did not survive.
- Most passengers are in 15-35 age range.
Confirms that we should consider Age in our model training and fill in the null values. We should also band age groups.
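Filling the nulls and banding the ages could look like this; the median imputation and the band edges are one possible choice, not the only one:

```python
import pandas as pd

# Illustrative ages with one missing value.
ages = pd.Series([22.0, None, 26.0, 35.0, 80.0])
ages = ages.fillna(ages.median())           # impute nulls with the median
bands = pd.cut(ages, bins=[0, 16, 32, 48, 64, 80],
               labels=[0, 1, 2, 3, 4])      # ordinal age bands
print(bands.tolist())
```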
Plotting survival rate by Sex confirms our assumption that females were more likely to survive.
Plotting survival rate by Pclass confirms our assumption that upper-class passengers were more likely to survive.
```python
grid = sns.FacetGrid(train_df, row='Pclass', col='Survived', aspect=1.5, size=2.3)
grid.map(plt.hist, 'Age', bins=30)
```
- Pclass 3 had the most passengers; however, most did not survive. Confirms our classifying assumption 2
- Infant passengers in Pclass 2 and Pclass 3 mostly survived. Further qualifies our classifying assumption 2
- Most passengers in Pclass 1 survived. Confirms our classifying assumption 3
- Pclass varies in terms of Age distribution of passengers
```python
embarked = sns.FacetGrid(train_df, row='Embarked', size=2.3, aspect=1.5)
embarked.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep')
embarked.add_legend()
```
- Female passengers had a much better survival rate than males. Confirms classifying 1
- An exception appears at Embarked=C, where males had a higher survival rate. This could reflect a correlation between Pclass and Embarked, and in turn between Pclass and Survived, rather than a direct correlation between Embarked and Survived
- Males had a better survival rate in Pclass=3 than in Pclass=2 for the C and Q ports. Completing 2
- Ports of embarkation have varying survival rates for Pclass=3 and among male passengers. Correlating 1
```python
fare = sns.FacetGrid(train_df, row='Embarked', col='Survived', size=2.3, aspect=1.5)
fare.map(sns.barplot, 'Sex', 'Fare', ci=None)
fare.add_legend()
```
- Higher fare-paying passengers had better survival rates. Confirms our creating assumption 4 (fare ranges)
- Port of embarkation correlates with survival rates. Confirms correlating 1 and completing 2