Titanic Survival Prediction Using Machine Learning

Comparative analysis of 5 ML algorithms achieving 76% accuracy with Naive Bayes classifier

Role Data Scientist & ML Engineer
Timeline 2024
Tools Python, scikit-learn, Pandas
76%
Best Accuracy
Naive Bayes
5
ML Algorithms
Compared
891
Training Samples
Passengers
100%
Python
Jupyter Notebook

Project Overview

This project analyzes the famous Titanic passenger dataset to predict survival outcomes using various machine learning classification algorithms. The analysis includes comprehensive exploratory data analysis, feature engineering, model training, and comparative evaluation of multiple ML techniques.

Challenge

The Titanic dataset presents a classic binary classification problem: predicting whether a passenger survived the disaster based on features like age, sex, ticket class, fare, and family relationships. The challenge involves handling missing data, engineering meaningful features, and selecting the optimal algorithm for maximum accuracy.

Titanic Survival Prediction Analysis - Data visualization showing survival patterns and model performance metrics

Technical Approach

1

Exploratory Data Analysis (EDA)

Analyzed survival patterns across different demographics and passenger classes. Visualized relationships between features like gender, age groups, ticket class, and survival rates to identify key predictive factors.

  • Survival rate analysis by gender and class
  • Age distribution and correlation studies
  • Feature correlation heatmaps
  • Missing data assessment and patterns

Port of Embarkation Distribution

Pie chart showing distribution of passengers by port of embarkation: Cherbourg 43.24%, Queenstown 30.43%, Southampton 26.32%

Analysis revealed that most passengers embarked from Cherbourg (43.24%), followed by Queenstown (30.43%) and Southampton (26.32%). This feature showed correlation with survival rates due to socioeconomic differences between ports.

Survival Status Distribution

Bar chart showing survival status: 342 passengers survived, 549 passengers died

The dataset shows a class imbalance with 549 passengers who died (61.6%) versus 342 who survived (38.4%). This imbalance was considered during model training to prevent bias toward the majority class.

2

Data Preprocessing & Feature Engineering

Cleaned the dataset and engineered new features to improve model performance. Handled missing values using statistical imputation and created derived features from existing data.

  • Missing value imputation for Age and Cabin
  • Created family size feature (SibSp + Parch)
  • Extracted titles from passenger names
  • One-hot encoding for categorical variables
  • Feature scaling and normalization
3

Model Training & Evaluation

Trained and evaluated five different machine learning algorithms on the preprocessed data. Used cross-validation and multiple metrics to assess model performance comprehensively.

  • Naive Bayes: 76% accuracy (best performer)
  • Logistic Regression: 75% accuracy
  • Decision Tree: 74% accuracy
  • Support Vector Machine: 66% accuracy
  • K-Nearest Neighbors: 66% accuracy
4

Model Comparison & Selection

Compared all models based on accuracy, precision, recall, and F1-score. Selected Naive Bayes as the final model due to its superior performance and computational efficiency for this dataset.

  • Performance benchmarking across all models
  • Confusion matrix analysis
  • ROC curve and AUC evaluation
  • Cross-validation for generalization testing

Model Performance Comparison

Model Accuracy Performance Best Use Case
Naive Bayes 76%
Probabilistic classification with independent features
Logistic Regression 75%
Linear decision boundaries, interpretable coefficients
Decision Tree 74%
Non-linear patterns, feature importance analysis
Support Vector Machine 66%
Complex decision boundaries, kernel methods
K-Nearest Neighbors 66%
Instance-based learning, similarity metrics

πŸ† Key Findings

  • Naive Bayes emerged as the top performer with 76% accuracy, demonstrating the effectiveness of probabilistic approaches for this classification problem.
  • Gender was the strongest predictor of survival, with female passengers having significantly higher survival rates.
  • Passenger class (Pclass) strongly correlated with survivalβ€”1st class passengers had much better survival odds.
  • Age groups showed distinct patterns: Children had higher survival rates, while middle-aged passengers had lower rates.
  • Family size influenced outcomes: Passengers with small families (1-3 members) had better survival chances than solo travelers or large families.

Data Visualizations

The project includes comprehensive visualizations to understand survival patterns:

πŸ“Š Survival by Gender

Bar charts showing dramatic survival rate differences between male and female passengers

πŸ“ˆ Class Distribution

Visualization of survival rates across 1st, 2nd, and 3rd class passengers

πŸ”₯ Feature Correlation Heatmap

Correlation matrix showing relationships between all features and survival

πŸ“‰ Age Distribution Analysis

Histograms and box plots analyzing survival patterns by age groups

🎯 Model Performance Metrics

Comparative charts showing accuracy, precision, recall, and F1-scores

πŸ“ Confusion Matrices

Detailed confusion matrices for each model showing true/false positives and negatives

Technology Stack

Programming & Analysis

Python 3.11+ Jupyter Notebook Pandas NumPy

Machine Learning

scikit-learn Naive Bayes Logistic Regression Decision Trees SVM KNN

Visualization

Matplotlib Seaborn

Data Processing

Feature Engineering Data Imputation One-Hot Encoding Cross-Validation

Lessons Learned & Insights

πŸ’‘ Model Selection Matters

Different algorithms excel at different tasks. Naive Bayes' simplicity and assumption of feature independence worked surprisingly well for this dataset, outperforming more complex models like SVM and KNN.

🎯 Feature Engineering Impact

Creating derived features like family size and extracting titles from names significantly improved model accuracy. Good features matter more than complex algorithms.

πŸ“Š EDA is Critical

Thorough exploratory data analysis revealed crucial patterns (gender, class, age) that guided feature selection and model interpretation.

βš–οΈ Handling Missing Data

Strategic imputation of missing values (Age, Cabin) preserved data integrity while maximizing the training set size for better model performance.

Interested in Machine Learning Solutions?

Let's discuss how classification models and predictive analytics can solve your business challenges.