The Titanic sank on 15 April 1912 after hitting an iceberg during its maiden voyage. There were not enough lifeboats on board, and 1502 of the 2224 passengers and crew died. When the Titanic set out, it was considered the finest ship in the world and was believed to be unsinkable.
Later analysis of who died and who survived has shown that passengers with certain characteristics had a higher chance of survival. This publication builds on those findings by creating a decision tree classification model to predict whether someone on board survived or died. Survival is therefore the dependent variable. The independent variables include title, age, sex, social class, and port of embarkation. The publication is split into two parts: part one covers data preparation, and part two covers data analysis and model creation.
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
library(DT)
library(purrr)
library(corrplot)
library(randomForest)
library(caret)
library(rpart)
library(rpart.plot)
Two data sets will be used: the first to train the model and the second to test it.
The general idea of machine learning classification is to use the training data to find relationships between the independent variables and the dependent variable. These relationships are captured in a model.
The model is then tested against the second data set, which contains unseen data: it holds all of the independent variables, but the dependent variable is hidden. The model applies the patterns it found in the training data to the independent variables of the test data in order to predict the dependent variable.
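The train-then-predict workflow described above can be sketched on a toy example. The data frames `toy_train` and `toy_test` and their values are invented for illustration and are not the Titanic data. `rpart` is used because it is among the packages loaded above, and the control settings (`minsplit`, `minbucket`, `cp`, `xval`) are assumptions chosen only so that a tree can be grown on eight rows.

```r
library(rpart)

# Hypothetical training data: independent variable Sex, dependent variable Survived
toy_train <- data.frame(
  Sex      = factor(c("female", "female", "female", "female",
                      "male", "male", "male", "male")),
  Survived = factor(c(1, 1, 1, 1, 0, 0, 0, 0))
)

# Hypothetical unseen data: same independent variable, dependent variable hidden
toy_test <- data.frame(
  Sex = factor(c("male", "female"), levels = levels(toy_train$Sex))
)

# Learn the relationship between Sex and Survived from the training data
toy_model <- rpart(Survived ~ Sex, data = toy_train, method = "class",
                   control = rpart.control(minsplit = 2, minbucket = 1,
                                           cp = 0, xval = 0))

# Apply the learned pattern to the unseen data
toy_pred <- predict(toy_model, newdata = toy_test, type = "class")
print(toy_pred)
```

The prediction has one value per row of `toy_test`, which is the shape the real model in part two is expected to produce for the hidden `Survived` values.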
Both data sets can be downloaded from the following link.
setwd("~/Documents/Machine Learning/15. Hugo/academic-kickstart-master/content/en/post/Titanic")
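# stringsAsFactors = TRUE reproduces the factor columns shown below on R >= 4.0.0,
# where read.csv() no longer converts strings to factors by default
test <- read.csv("test.csv", stringsAsFactors = TRUE)
train <- read.csv("train.csv", stringsAsFactors = TRUE)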
To carry out preliminary analysis on all of the passengers, the two data sets will be combined. For this to be possible, a Survived variable is first added to the test data set so that both data sets have the same 12 variables. Combined, the data sets contain 1309 observations of 12 variables. The variables are:
Variable | Description |
---|---|
PassengerId | Unique identifier of passenger |
Survived | Survived (1) or died (0) |
Pclass | Passenger class (1, 2 or 3), used as a measure of social class |
Name | Name of passenger |
Sex | Sex of passenger |
Age | Age of passenger |
SibSp | Number of siblings or spouses on the Titanic |
Parch | Number of parents or children on the Titanic |
Ticket | Ticket number |
Fare | Cost of ticket |
Cabin | Cabin number |
Embarked | Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |
test$Survived <- NA
test <- test[,c(1,12,2,3,4,5,6,7,8,9,10,11)]
total <- rbind(train, test)
str(total)
## 'data.frame': 1309 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 929 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 187 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
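The column reordering above relies on hard-coded positions. A minimal sketch of the same combination step that orders the columns by name instead is given here. The data frames `toy_train` and `toy_test` are hypothetical miniature stand-ins for the real data sets.

```r
# Hypothetical miniature versions of the two data sets
toy_train <- data.frame(PassengerId = 1:2, Survived = c(0L, 1L), Pclass = c(3L, 1L))
toy_test  <- data.frame(PassengerId = 3:4, Pclass = c(2L, 3L))

# Add the hidden dependent variable, then order the columns as in the training data
toy_test$Survived <- NA
toy_test <- toy_test[, names(toy_train)]

# Stack the two data sets; the test rows keep NA for Survived
toy_total <- rbind(toy_train, toy_test)
str(toy_total)
```

Selecting with `names(toy_train)` keeps working if a column is added or moved, whereas a vector of positions would need to be rewritten.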
In the following section the data will be cleaned so that it can be analysed later.
There are five variables with missing values: Cabin, Survived, Age, Embarked, and Fare. Missing values need to be treated because they can reduce the effectiveness of a classification model. The exceptions are Survived, which is the dependent variable and whose missing values correspond to the hidden test data, and Cabin, which has too many missing values and will be left out of the model.
checkAllCols(total)
## col class num numMissing numInfinite avgVal minVal maxVal
## 1 PassengerId integer 1309 0 0 655.0000000 1 1309
## 2 Survived integer 891 418 0 0.3838384 0 1
## 3 Pclass integer 1309 0 0 2.2948816 1 3
## 4 Name factor 1309 0 NA NA NA NA
## 5 Sex factor 1309 0 NA NA NA NA
## 6 Age numeric 1046 263 0 29.8811377 0 80
## 7 SibSp integer 1309 0 0 0.4988541 0 8
## 8 Parch integer 1309 0 0 0.3850267 0 9
## 9 Ticket factor 1309 0 NA NA NA NA
## 10 Fare numeric 1308 1 0 33.2954793 0 512
## 11 Cabin factor 295 1014 NA NA NA NA
## 12 Embarked factor 1307 2 NA NA NA NA
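In the summary above, Cabin and Embarked are counted as missing even though their factor levels show that the gaps are stored as empty strings rather than `NA`. A base R sketch of a per-column count that treats both as missing follows. It is independent of `checkAllCols`, and the data frame `toy` and the helper `count_missing` are hypothetical names introduced for illustration.

```r
# Hypothetical miniature data frame with both kinds of missing value
toy <- data.frame(
  Age      = c(22, NA, 26),
  Cabin    = c("", "C85", ""),
  Embarked = c("S", "", "C"),
  stringsAsFactors = TRUE
)

# Count NA values and empty strings in a single column
count_missing <- function(x) sum(is.na(x) | as.character(x) == "")

# Apply the count to every column
missing_per_column <- sapply(toy, count_missing)
print(missing_per_column)
```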
There are 263 missing values for the Age variable. These will be replaced with the average age of the passengers.
The syntax below calculates the average age and inserts it into the data. It also creates a new variable that groups the passengers by age into five classes: under 13, 13 to 17, 18 to 39, 40 to 59, and 60 or over.
total <- total %>%
  mutate(
    # Replace missing ages with the mean of the observed ages
    Age = ifelse(is.na(Age), mean(Age, na.rm = TRUE), Age),
    `Age Group` = case_when(
      Age < 13 ~ "Age.0012",
      Age >= 13 & Age < 18 ~ "Age.1317",
      Age >= 18 & Age < 40 ~ "Age.1839",
      Age >= 40 & Age < 60 ~ "Age.4059",
      Age >= 60 ~ "Age.60Ov"
    )
  )
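The same imputation and grouping can also be sketched in base R with `cut()`, shown here as an alternative rather than as the method used in this publication. The vector `age` holds hypothetical values chosen to sit on the class boundaries.

```r
# Hypothetical ages, including one missing value and the class boundaries
age <- c(4, 13, 17.5, NA, 40, 60)

# Replace the missing age with the mean of the observed ages
age[is.na(age)] <- mean(age, na.rm = TRUE)

# right = FALSE makes each interval closed on the left, e.g. [13, 18)
age_group <- cut(age,
                 breaks = c(-Inf, 13, 18, 40, 60, Inf),
                 labels = c("Age.0012", "Age.1317", "Age.1839",
                            "Age.4059", "Age.60Ov"),
                 right = FALSE)
print(data.frame(age, age_group))
```

With `right = FALSE`, an age of exactly 13 falls in `Age.1317` and an age of exactly 60 falls in `Age.60Ov`, matching the `>=` and `<` comparisons used above.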
As only two observations have a missing value for Embarked, Southampton (S) will be inserted for both, as it is the most frequent port in the data.
levels(total$Embarked)
## [1] "" "C" "Q" "S"
table(total$Embarked)
##
## C Q S
## 2 270 123 914
levels(total$Embarked)[1] <- c("S")
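Renaming the empty factor level to "S" merges it into the existing "S" level. A sketch of the same most-frequent-value imputation on a plain character vector, where the port is looked up rather than typed in, could look as follows. The vector `embarked` holds hypothetical values.

```r
# Hypothetical ports of embarkation; "" marks a missing value
embarked <- c("S", "C", "", "S", "Q", "", "S")

# Find the most frequent port among the non-missing values
port_counts <- table(embarked[embarked != ""])
most_common <- names(port_counts)[which.max(port_counts)]

# Replace the missing values with that port
embarked[embarked == ""] <- most_common
print(table(embarked))
```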
There is one missing value for the Fare variable, located in row 1044. Class and port of embarkation are likely to have influenced the fare. Passenger 1044, ‘Mr Thomas Storey’, was a third-class passenger who boarded the ship in Southampton. The missing value will therefore be replaced with the average fare of the other third-class passengers who boarded in Southampton, which is £14.44.
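# Third-class passengers who boarded in Southampton, excluding passenger 1044
mean_fare_calculation <- total %>%
  filter(Pclass == 3, Embarked == 'S', PassengerId != 1044)
mean(mean_fare_calculation$Fare)
## [1] 14.43542
# Insert the mean fare for passenger 1044 (column 10 is Fare)
total[1044, 10] <- 14.43542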
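The idea of filling a missing fare with the mean of comparable passengers can be sketched without a hard-coded row, column, or value. This is a base R illustration on hypothetical data, not the code used above; `toy`, `group_key`, and `group_mean` are names introduced for the example.

```r
# Hypothetical passengers; the third one has no fare recorded
toy <- data.frame(
  Pclass   = c(3, 3, 3, 1, 1),
  Embarked = c("S", "S", "S", "S", "C"),
  Fare     = c(8, 10, NA, 50, 80)
)

# Mean fare within each class and port combination, ignoring missing values
group_key  <- paste(toy$Pclass, toy$Embarked)
group_mean <- ave(toy$Fare, group_key, FUN = function(v) mean(v, na.rm = TRUE))

# Fill only the missing fares with the mean of their group
toy$Fare <- ifelse(is.na(toy$Fare), group_mean, toy$Fare)
print(toy)
```

Here the missing fare is replaced by 9, the mean of the two other third-class passengers who boarded at "S".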
The title of each passenger will be extracted from Name into a new variable, Title. The following table shows that the most common titles were Master, Miss, Mr, and Mrs, which together account for 97.40% of the passengers.
Some of the less common titles will be grouped into a new class called Rare Title. Additionally, the titles Mlle and Ms will be added to the class Miss, and the title Mme will be added to the class Mrs.
total$Title <- gsub('(.*, )|(\\..*)', '', total$Name)
table_titles_total <- table(total$Sex, total$Title)
table_titles_total
##
## Capt Col Don Dona Dr Jonkheer Lady Major Master Miss Mlle Mme Mr Mrs
## female 0 0 0 1 1 0 1 0 0 260 2 1 0 197
## male 1 4 1 0 7 1 0 2 61 0 0 0 757 0
##
## Ms Rev Sir the Countess
## female 2 0 0 1
## male 0 8 1 0
rare_title <- c('Dona', 'Lady', 'the Countess','Capt', 'Col', 'Don',
'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer')
total$Title[total$Title %in% rare_title] <- 'Rare Title'
total$Title[total$Title == 'Mlle'] <- 'Miss'
total$Title[total$Title == 'Ms'] <- 'Miss'
total$Title[total$Title == 'Mme'] <- 'Mrs'
table(total$Sex, total$Title)
##
## Master Miss Mr Mrs Rare Title
## female 0 264 0 198 4
## male 61 0 757 0 25
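To see how the regular expression used above isolates the title, it can be applied to a few names written in the data set's `Surname, Title. Given names` format. The three names below are used purely as illustrative input. The pattern removes everything up to and including the comma and space, and everything from the first full stop onwards.

```r
# Illustrative names in the "Surname, Title. Given names" format
passenger_names <- c("Braund, Mr. Owen Harris",
                     "Heikkinen, Miss. Laina",
                     "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)")

# '(.*, )' strips the surname; '(\\..*)' strips the full stop and what follows
titles_raw <- gsub('(.*, )|(\\..*)', '', passenger_names)
print(titles_raw)

# Group an uncommon title into a shared class, as done above
titles <- titles_raw
titles[titles %in% c('the Countess')] <- 'Rare Title'
print(titles)
```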
The variables SibSp and Parch show whether a passenger had family on the Titanic. SibSp counts siblings and spouses, and Parch counts parents and children. A new variable, Family_size, will be created that adds these two counts together, plus one for the passenger themselves.
total$Family_size <- total$SibSp + total$Parch + 1
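As a small worked example of the calculation above, consider three hypothetical passengers: one travelling with a spouse, one travelling alone, and one travelling with three siblings and two parents. The vectors `sib_sp` and `parch` are invented values.

```r
# Hypothetical counts of siblings/spouses and parents/children
sib_sp <- c(1, 0, 3)
parch  <- c(0, 0, 2)

# Add one so that the passenger is counted as part of their own family
family_size <- sib_sp + parch + 1
print(family_size)
```

A passenger travelling alone therefore has a family size of 1 rather than 0.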
As already explained, the Survived variable has missing values because it is the dependent variable, and the missing values are the observations from the test data. These missing values do not need to be treated; however, the survival rate in the available data will be analysed.
The table below shows that, in the training data, 61.62% of the passengers died and 38.38% survived.
# "entrenar" marks training rows (PassengerId 1 to 891); "ensayar" marks test rows
total$group <- ifelse(total$PassengerId <= 891, "entrenar", "ensayar")
total %>% filter(group == "entrenar") %>% group_by(Survived) %>% count() %>%
  mutate(percentage_all = (n/1309) * 100, percentage_entrenar = (n/891) * 100)
## # A tibble: 2 x 4
## # Groups: Survived [2]
## Survived n percentage_all percentage_entrenar
## <int> <int> <dbl> <dbl>
## 1 0 549 41.9 61.6
## 2 1 342 26.1 38.4
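The percentages in the table can be checked directly from the two counts it reports. This sketch uses base R's `prop.table()`; the vector name `survival_counts` is introduced only for illustration.

```r
# Counts taken from the table above: 549 died and 342 survived
survival_counts <- c(died = 549, survived = 342)

# Share of the 891 training passengers, in percent
percentage_train <- round(prop.table(survival_counts) * 100, 2)
print(percentage_train)

# Share of all 1309 passengers in the combined data, in percent
percentage_all <- round(survival_counts / 1309 * 100, 2)
print(percentage_all)
```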
Cabin is the variable with the most missing values: 1014 of the 1309 observations. It will therefore not be included in the model.
This publication has introduced a data analysis project that aims to create a decision tree classification model to predict whether someone on board the Titanic survived or died. Part one has focused on preparing the data for the subsequent analysis and model creation.