The Titanic sank on 15 April 1912 after hitting an iceberg during its maiden voyage. There were not enough lifeboats on board, and 1502 of the 2224 passengers and crew died. When the Titanic set out, it was considered the best ship in the world and unsinkable.

Later analysis of who died and who survived has shown that passengers with certain characteristics had a higher chance of survival. This publication builds on these findings and aims to create a decision tree classification model that predicts whether someone on board survived or died. Survival is therefore the dependent variable. The independent variables include title, age, sex, social class, and port of embarkation. The publication is split into two parts: part one covers data preparation, and part two covers data analysis and model creation.

2) Packages

library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
library(DT)
library(purrr)
library(corrplot)
library(randomForest)
library(caret)
library(rpart)
library(rpart.plot)

3) Loading the Data

Two data sets will be used - the first to train the model, and the second to test it.

The general idea of machine learning classification is to use the training data to find relationships between the independent variables and the dependent variable. These relationships are captured in a model.

The model is then tested against the second data set, which contains unseen data: all of the independent variables are present, but the dependent variable is hidden. The model looks for the patterns it learned from the training data and uses them to predict the dependent variable.
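
The train-then-predict workflow described above can be sketched as follows. This is only a minimal illustration on hypothetical simulated data (the `toy` data frame and its survival probabilities are assumptions, not the Titanic data), using `rpart` from the package list below.

```r
library(rpart)

# Hypothetical toy data: survival depends mostly on sex (not the real Titanic data)
set.seed(42)
n <- 200
toy <- data.frame(
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Pclass = sample(1:3, n, replace = TRUE)
)
# Assumed probabilities: women survive with probability 0.8, men with 0.2
toy$Survived <- factor(rbinom(n, 1, ifelse(toy$Sex == "female", 0.8, 0.2)))

# Training data keeps the dependent variable; test data has it hidden
toy_train <- toy[1:150, ]
toy_test  <- toy[151:200, c("Sex", "Pclass")]

# Learn relationships from the training data, then predict the unseen rows
fit  <- rpart(Survived ~ Sex + Pclass, data = toy_train, method = "class")
pred <- predict(fit, newdata = toy_test, type = "class")
table(pred)
```

The predictions come back as a factor with one entry per test row, which is the shape needed to compare against the true outcomes once they are revealed.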

Both data sets can be downloaded from the following link.

setwd("~/Documents/Machine Learning/15. Hugo/academic-kickstart-master/content/en/post/Titanic")

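# stringsAsFactors = TRUE reproduces the factor columns shown in the output
# further on (the read.csv default changed to FALSE in R 4.0.0)
test <- read.csv("test.csv", stringsAsFactors = TRUE)

train <- read.csv("train.csv", stringsAsFactors = TRUE)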

3.1) Combining Test and Train

To allow a preliminary analysis of all passengers, the two data sets will be combined. For this to be possible, a Survived variable is first added to the test data set so that both data sets have the same 12 variables. Combined, the two data sets contain 1309 observations of 12 variables. The variables are:

Variable      Description
PassengerId   Unique passenger identifier
Survived      Survived (1) or died (0)
Pclass        Social class of passenger
Name          Name of passenger
Sex           Sex of passenger
Age           Age of passenger
SibSp         Number of siblings or spouses on the Titanic
Parch         Number of parents or children on the Titanic
Ticket        Ticket number
Fare          Cost of ticket
Cabin         Cabin number
Embarked      Port of embarkation
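
# The test data has no Survived column, so add it as NA
test$Survived <- NA

# Put the test columns in the same order as the train columns
test <- test[, names(train)]

# Stack the two data sets (891 train rows followed by 418 test rows)
total <- rbind(train, test)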
str(total)
## 'data.frame':    1309 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 929 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 187 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
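
The combining step can be checked on a small example. The sketch below uses two hypothetical toy data frames (`train_toy` and `test_toy` are assumptions for illustration only). Note that `rbind` on data frames matches columns by name, so the reordering mainly keeps the column layout consistent.

```r
# Hypothetical toy versions of the train and test data
train_toy <- data.frame(PassengerId = 1:2, Survived = c(0L, 1L), Pclass = c(3L, 1L))
test_toy  <- data.frame(PassengerId = 3:4, Pclass = c(2L, 3L))

# Add the missing dependent variable, then align the column order by name
test_toy$Survived <- NA
test_toy <- test_toy[, names(train_toy)]

total_toy <- rbind(train_toy, test_toy)
str(total_toy)
```

Selecting columns by name avoids hard-coding positions, which would silently break if the column order of the input files changed.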

4) Feature Engineering

In this section the data is cleaned so that it can be analysed later on.

4.1) Missing Values

There are five variables with missing values: Cabin, Survived, Age, Embarked, Fare. It is necessary to treat these missing values as they can impact the effectiveness of a classification model. The only variable that will not be treated for missing values is Survived as this is the dependent variable and the missing values correspond to the hidden test data.

checkAllCols(total)
##            col   class  num numMissing numInfinite      avgVal minVal maxVal
## 1  PassengerId integer 1309          0           0 655.0000000      1   1309
## 2     Survived integer  891        418           0   0.3838384      0      1
## 3       Pclass integer 1309          0           0   2.2948816      1      3
## 4         Name  factor 1309          0          NA          NA     NA     NA
## 5          Sex  factor 1309          0          NA          NA     NA     NA
## 6          Age numeric 1046        263           0  29.8811377      0     80
## 7        SibSp integer 1309          0           0   0.4988541      0      8
## 8        Parch integer 1309          0           0   0.3850267      0      9
## 9       Ticket  factor 1309          0          NA          NA     NA     NA
## 10        Fare numeric 1308          1           0  33.2954793      0    512
## 11       Cabin  factor  295       1014          NA          NA     NA     NA
## 12    Embarked  factor 1307          2          NA          NA     NA     NA
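
`checkAllCols` does not appear to come from any of the packages loaded above, so it is presumably a user-defined helper. As a rough substitute, the sketch below counts missing values per column in base R, treating both `NA` and empty strings as missing (Cabin and Embarked store missing values as `""`). The function name `count_missing` and the toy data are hypothetical.

```r
# Hypothetical helper: count NA values and empty strings in every column
count_missing <- function(df) {
  sapply(df, function(col) sum(is.na(col) | as.character(col) == ""))
}

# Toy data mimicking the two kinds of missing value in the Titanic data
toy <- data.frame(
  Age      = c(22, NA, 26),
  Cabin    = c("", "C85", ""),
  Embarked = c("S", "", "C")
)

missing_counts <- count_missing(toy)
missing_counts
```

Applied to the combined data, a call such as `count_missing(total)` should give counts comparable to the numMissing column above, under the assumption that the helper treats empty strings the same way.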

4.2) Age

There are 263 missing values for the age variable. The average age of passengers will be inserted for these missing values.

4.2.1) Average Age

The code below calculates the average age of passengers and inserts it wherever the age is missing. Additionally, a new variable is created which groups the passengers by age into the following classes:

  • Age < 13
  • Age >= 13 & Age < 18
  • Age >= 18 & Age < 40
  • Age >= 40 & Age < 60
  • Age >= 60

# Replace missing ages with the mean age, then bin the ages into groups
total <- total %>%
  mutate(Age = ifelse(is.na(Age), mean(Age, na.rm = TRUE), Age),
         `Age Group` = case_when(Age < 13 ~ "Age.0012",
                                 Age >= 13 & Age < 18 ~ "Age.1317",
                                 Age >= 18 & Age < 40 ~ "Age.1839",
                                 Age >= 40 & Age < 60 ~ "Age.4059",
                                 Age >= 60 ~ "Age.60Ov"))
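
The same imputation and grouping can be tried on a handful of values. The sketch below is an alternative formulation using base R's `cut` rather than the `case_when` call above; the `ages` vector is made up for illustration.

```r
# Hypothetical ages, including one missing value and the boundary cases
ages <- c(5, 13, 17.5, NA, 40, 60, 80)

# Mean imputation: the mean of the observed ages replaces the NA
ages[is.na(ages)] <- mean(ages, na.rm = TRUE)

# right = FALSE gives intervals of the form [lower, upper),
# matching the conditions Age >= lower & Age < upper
age_group <- cut(ages,
                 breaks = c(-Inf, 13, 18, 40, 60, Inf),
                 labels = c("Age.0012", "Age.1317", "Age.1839", "Age.4059", "Age.60Ov"),
                 right = FALSE)
data.frame(ages, age_group)
```

Here the imputed value is about 35.9, so the formerly missing passenger lands in the Age.1839 group. On the real data this is worth keeping in mind: all 263 imputed passengers fall into the same group.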

4.3) Port

Only two observations have a missing port of embarkation. Both will be assigned Southampton (S), as it is the most frequent port in the data.

levels(total$Embarked)
## [1] ""  "C" "Q" "S"
table(total$Embarked)
## 
##       C   Q   S 
##   2 270 123 914
levels(total$Embarked)[1] <- c("S")
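
Renaming the empty factor level to an existing level merges the two levels. The sketch below shows this on a hypothetical `embarked` factor and selects the empty level by name instead of by position, which is one way to avoid relying on `""` being the first level.

```r
# Hypothetical ports, with "" standing in for a missing value
embarked <- factor(c("S", "C", "", "S", "Q", "S", ""))

# Most frequent port among the non-missing observations
counts <- table(embarked[embarked != ""])
most_common <- names(which.max(counts))

# Renaming the "" level to an existing level merges the two levels
levels(embarked)[levels(embarked) == ""] <- most_common
table(embarked)
```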

4.4) Fare

There is one missing value for this variable, located in row 1044. Class and port of embarkation are likely to have influenced the fare. Passenger 1044, ‘Mr Thomas Storey’, was a third-class passenger who boarded the ship in Southampton. The missing value is therefore replaced with the average fare of third-class passengers who boarded in Southampton, which is £14.44.

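# Third-class passengers who boarded in Southampton, excluding passenger 1044
mean_fare_calculation <- total %>% filter(Pclass == 3, Embarked == "S", PassengerId != 1044)

mean(mean_fare_calculation$Fare)
## [1] 14.43542
# Select the cell by passenger and column name rather than by position [1044, 10]
total$Fare[total$PassengerId == 1044] <- 14.43542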
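
The idea of filling a missing fare with the mean of its class and port group can be sketched more generally with base R's `ave`. This is an alternative formulation on a hypothetical `fares` data frame, not the method used above.

```r
# Hypothetical fares with one missing value in the (class 3, port S) group
fares <- data.frame(
  PassengerId = 1:5,
  Pclass      = c(3, 3, 3, 1, 1),
  Embarked    = c("S", "S", "S", "C", "C"),
  Fare        = c(10, 20, NA, 100, 80)
)

# Within each Pclass/Embarked group, replace NA with the group mean
fares$Fare <- ave(fares$Fare, fares$Pclass, fares$Embarked,
                  FUN = function(v) ifelse(is.na(v), mean(v, na.rm = TRUE), v))
fares
```

With several missing fares this would fill each one from its own group, whereas the hard-coded approach handles only the single known case.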

4.5) Name

The title of each passenger will be extracted into a new variable, Title. The following table shows that the most common titles were Master, Miss, Mr, and Mrs, which together account for 97.40% of the passengers.

Some of the less common titles will be grouped together in a new class called Rare Title. Additionally, the titles Mlle and Ms will be added to the class Miss, and the title Mme will be added to the class Mrs.

total$Title <- gsub('(.*, )|(\\..*)', '', total$Name)

table_titles_total <- table(total$Sex, total$Title)

table_titles_total
##         
##          Capt Col Don Dona  Dr Jonkheer Lady Major Master Miss Mlle Mme  Mr Mrs
##   female    0   0   0    1   1        0    1     0      0  260    2   1   0 197
##   male      1   4   1    0   7        1    0     2     61    0    0   0 757   0
##         
##           Ms Rev Sir the Countess
##   female   2   0   0            1
##   male     0   8   1            0
rare_title <- c('Dona', 'Lady', 'the Countess','Capt', 'Col', 'Don', 
                'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer')

total$Title[total$Title %in% rare_title] <- 'Rare Title'
total$Title[total$Title == 'Mlle'] <- 'Miss'
total$Title[total$Title == 'Ms'] <- 'Miss'
total$Title[total$Title == 'Mme'] <- 'Mrs'

table(total$Sex, total$Title)
##         
##          Master Miss  Mr Mrs Rare Title
##   female      0  264   0 198          4
##   male       61    0 757   0         25
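
The regular expression used for the title extraction can be checked on a few strings. In the sketch below, `passenger_names` holds illustrative names written in the data set's "Surname, Title. Given names" format; the last entry is made up. The pattern removes everything up to and including the comma and space, and everything from the first remaining full stop onwards.

```r
# Illustrative names in the "Surname, Title. Given names" format
passenger_names <- c(
  "Braund, Mr. Owen Harris",
  "Heikkinen, Miss. Laina",
  "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
  "Example, Mlle. Anne"  # hypothetical entry
)

# "(.*, )" strips the surname part, "(\\..*)" strips the full stop and the rest
titles <- gsub("(.*, )|(\\..*)", "", passenger_names)
titles

# Regroup as in the text: Mlle becomes Miss, uncommon titles become Rare Title
grouped <- titles
grouped[grouped == "Mlle"] <- "Miss"
grouped[grouped %in% c("the Countess")] <- "Rare Title"
grouped
```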

4.6) Family

The variables SibSp and Parch show whether a passenger had family on the Titanic. SibSp counts siblings and spouses, and Parch counts parents and children. A new variable will be created for family size, which adds the two counts plus one for the passenger themselves.

total$Family_size <- total$SibSp + total$Parch + 1

4.7) Survival

As already explained, Survived is the dependent variable, and its missing values are the observations from the test data. It is therefore not necessary to treat these missing values. However, the survival rate for the available data will be analysed.

The table below shows that in the train data 61.62% of the passengers died and 38.38% survived.

# Group labels are in Spanish: "entrenar" = train, "ensayar" = test
total$group <- ifelse(total$PassengerId <= 891, "entrenar", "ensayar")
total %>%
  filter(group == "entrenar") %>%
  group_by(Survived) %>%
  count() %>%
  mutate(percentage_all = (n / 1309) * 100,
         percentage_entrenar = (n / 891) * 100)
## # A tibble: 2 x 4
## # Groups:   Survived [2]
##   Survived     n percentage_all percentage_entrenar
##      <int> <int>          <dbl>               <dbl>
## 1        0   549           41.9                61.6
## 2        1   342           26.1                38.4
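
As a quick cross-check of the percentages quoted in the text, the counts from the table (549 died, 342 survived) can be turned into proportions with base R. The `survived` vector below is rebuilt from those counts rather than read from the data.

```r
# Rebuild the Survived column of the train data from the counts in the table
survived <- rep(c(0, 1), times = c(549, 342))

# Share of each outcome among the 891 train passengers, in percent
pct <- round(100 * prop.table(table(survived)), 2)
pct
```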

4.8) Cabin

Cabin is the variable with the most missing values: 1014 of the 1309 observations are missing. This variable will therefore not be included in the model.

Conclusion

This publication has introduced a data analysis project which aims to create a decision tree classification model to predict whether someone on board the Titanic survived or died. Part one has focused on data preparation, leaving the data ready for the analysis and model creation in part two.