Surviving the Titanic: Using Machine Learning for Disaster Analysis

Abhishek Wadhwani
5 min read · Mar 10, 2023

As per the Kaggle Tutorial

About the Disaster:

The Titanic was a British passenger liner that sank on its maiden voyage from Southampton, England, to New York City on April 15, 1912. The ship hit an iceberg in the North Atlantic Ocean and sank, resulting in the deaths of over 1,500 people. Despite its numerous safety features, the ship could not avoid the disaster. The sinking of the Titanic had a profound impact on public opinion and led to major changes in maritime safety regulations.

These included the requirement for all ships to carry enough lifeboats to accommodate all passengers and crew, the implementation of regular safety drills, and the creation of an international ice patrol to monitor icebergs in the North Atlantic. The story of the Titanic has been immortalized in popular culture and continues to fascinate people around the world.

Challenge:

The aim of this contest is to utilize a pre-existing dataset to develop a predictive model that can determine which individuals would have survived or perished in the Titanic catastrophe. Essentially, participants will construct their own Machine Learning (ML) algorithm that, with the aid of the provided dataset, will be able to forecast passenger survival.

Submission details:

Kaggle Titanic Tutorial: https://www.kaggle.com/code/alexisbcook/titanic-tutorial/notebook

Kaggle Titanic Disaster Competition: https://www.kaggle.com/competitions/titanic

Github: https://github.com/Abhismoothie/Data-Mining/tree/main

Dataset Used:

Two datasets are provided to us for this competition: the training dataset (train.csv) and the test dataset (test.csv). The former is used to train our ML model, and the latter is used to test it.

PassengerId: IDs from 1 to 891 that uniquely identify each passenger.

Survived: Binary values, 1 = survived and 0 = died. (This column is missing from test.csv; it is what our ML model will predict.)

Pclass: Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)

Name: Name of the Passenger

Sex: Enumerated values {male - 65%, female - 35%} identifying the gender of a passenger.

Age: Age of passenger (0.42 to 80 years)

SibSp: Number of siblings or spouses of the passenger aboard the ship (0 to 8)

Parch: Number of parents or children of the passenger aboard the ship (0 to 6)

Ticket: Ticket number of the passenger

Fare: Cost of the ticket.

Cabin: Cabin in which the passenger stayed. Values include null, G6, and others.

Embarked: Ship port where passenger boarded. This has enumerated values as S, C, Q indicating Southampton, Cherbourg and Queenstown respectively.
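To make the schema concrete, here is a small hand-built sample in pandas mirroring the columns described above. The two rows are illustrative stand-ins, not guaranteed extracts from the real files; in the actual competition the frames are loaded with pd.read_csv as shown in the comment.

```python
import pandas as pd

# Hypothetical two-row sample mirroring the train.csv schema described above.
sample = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley"],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
    "SibSp": [1, 1],
    "Parch": [0, 0],
    "Ticket": ["A/5 21171", "PC 17599"],
    "Fare": [7.25, 71.2833],
    "Cabin": [None, "C85"],   # Cabin is frequently null
    "Embarked": ["S", "C"],
})

# With the real competition files the data would instead be loaded as:
# train = pd.read_csv("train.csv"); test = pd.read_csv("test.csv")

print(sample.dtypes)           # inspect column types
print(sample.isnull().sum())   # count missing values per column
```

Inspecting dtypes and null counts this way is a quick first check that the columns match the description above before any cleaning begins.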

Our output should be a CSV file with the following headers. (An example was given in gender_submission.csv; the tutorial generates its submission using a RandomForestClassifier.)

PassengerId: IDs that uniquely identify each passenger in test.csv (892 to 1309).

Survived: Binary values, 1 = survived and 0 = died, as predicted by our ML model.

First Submission:

This submission was made by following the Kaggle Titanic tutorial linked above, and the original score received was 0.77511.

My Contribution:

This code loads the Titanic dataset, removes irrelevant columns, handles missing values, corrects errors and inconsistencies, converts categorical variables, normalizes numerical variables, removes outliers, and validates the cleaned data. Gradient Boosting, a popular machine learning algorithm, is then applied to predict the survival of passengers.

The data is loaded from two CSV files into pandas DataFrames. The irrelevant columns are removed using the drop() method. Missing values in the ‘Age’ column are replaced with the mean age of the passengers using the fillna() method. The ‘Sex’ column is converted to binary values using the replace() method. Categorical variables are converted to dummy variables using the get_dummies() function. Numerical variables are normalized using scikit-learn’s MinMaxScaler. Outliers are removed from the ‘Fare’ column using numpy.abs(), the mean, and the standard deviation. Finally, the cleaned data is validated using the describe() method, and the column names are printed to the console. The GradientBoostingClassifier from the scikit-learn library is then used to build a model and predict the survival of passengers based on their features.

  • The code reads Titanic train and test datasets and drops irrelevant columns from them such as ‘Name’, ‘Ticket’, and ‘Cabin’.
  • The missing values in ‘Age’ and ‘Fare’ columns are handled by filling them with their mean and median respectively. Any rows with missing data are removed from the training set.
  • The ‘Sex’ column is corrected and transformed to numerical data by replacing ‘male’ with 0 and ‘female’ with 1.
  • The categorical variable ‘Embarked’ is converted into numerical data using one-hot encoding with ‘pd.get_dummies’ function.
  • The numerical variables ‘Age’ and ‘Fare’ are normalized using the ‘MinMaxScaler’ from the ‘sklearn.preprocessing’ module.
  • Outliers are removed from the ‘Fare’ column of the training set.
  • The features and target for the model are defined. The model is trained using Gradient Boosting Classifier from the ‘sklearn.ensemble’ module with parameters n_estimators=100, learning_rate=0.1, max_depth=3, and random_state=42.
  • The trained model is used to make predictions on the test set.
  • The predictions are saved to a file named ‘submission.csv’ using ‘pd.DataFrame’ and ‘to_csv’ functions.
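Putting the last three bullets together, a minimal sketch of training the model and writing the submission file might look like this. The feature frames here are synthetic stand-ins for the cleaned competition data, and the toy target is fabricated for illustration; only the classifier parameters match those stated above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)
n = 80

# Synthetic stand-in for the cleaned training features (illustrative only).
X_train = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.integers(0, 2, n),
    "Age": rng.random(n),
    "Fare": rng.random(n),
})
y_train = (X_train["Sex"] == 1).astype(int)  # toy target for the sketch

# Stand-in test set with hypothetical test.csv-style PassengerIds.
X_test = X_train.sample(10, random_state=0).reset_index(drop=True)
test_ids = pd.Series(range(892, 902), name="PassengerId")

# Train with the parameters stated above.
model = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
model.fit(X_train, y_train)

# Predict on the test set and save the submission file.
preds = model.predict(X_test)
submission = pd.DataFrame({"PassengerId": test_ids, "Survived": preds})
submission.to_csv("submission.csv", index=False)
print(submission.head())
```

The resulting submission.csv has exactly the two headers required by the competition: PassengerId and Survived.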

The score I received after altering the code was 0.7799.
