Titanic Disaster

Kaggle
Completed
Author

Chris Kornaros

Published

October 15, 2024

General Overview

The Kaggle Titanic dataset and ML competition is one many people are familiar with; if they're like me, it was also their first ML project. I redid it after three years to get comfortable with my current workflow of git, notebooks, venvs, and so on. Below are the notes I made for myself. Although it was overkill, I used a Dockerized environment for this project, purely to build familiarity with containers and the Docker toolset.

Notes on the Titanic model and the process for using the Jupyter Kernel on the EC2 Server

  1. First, navigate to Local-Scripts/.AWS/.EC2_Scripts and run the ec2_start.sh script: zsh ec2_start.sh
  2. Next, execute the unix_test_dns.sh script to store the EC2 public DNS in the /etc/hosts file: ./unix_test_dns.sh
  3. Use the ssh -i command to connect to the EC2 server, then run the Jupyter kernel image: docker run -p 8888:8888 titanic-env (TODO: look into adding a volume mount here to persist model/file changes on the EC2 instance)
  4. Now that it's running, use a different terminal window (or the VS Code IDE) and test the DNS name: ping unix_test.local (TODO: make this DNS entry dynamic)
  5. In VS Code, open the .ipynb file in the model folder and continue work.
    1. If you need to reconnect to a kernel, use the Titanic preset.
    2. VS Code connects to the EC2 IPv4 address, even though the kernel reports 127.0.0.1.
  6. The format for connecting to the public IPv4 is http://IPv4:8888/
  7. To pull a file out of the container and store it on the EC2 server:
    1. docker cp 786853360d97:/home/files/titanic_submission.csv files/titanic_submission.csv
    2. General form: docker cp container-id:/path/to/file local/path

If you need to modify the container

  • Do so locally, or anywhere, and then push the change to the GitHub repository
  • Then, pull the changes into the EC2 server
  • Clear the Docker library/cache, then rebuild the image from scratch with: docker build -t titanic-env -f .config/Dockerfile . (a Python alternative is sketched after this list)
  • Ensure this is done from the main project folder and uses those flags
    • For Titanic, this is the admin/Kaggle/Titanic folder in the EC2 instance
    • --no-cache ensures a fresh build (this takes a while and isn't worth it in the smaller environment; rebuild with the cache instead)
    • -t sets the name of the image
    • -f lets you specify the Dockerfile location
    • . tells Docker that your current working directory is the build context
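
If you would rather script the rebuild in Python to match the rest of this project, a minimal sketch using the Docker SDK for Python (the docker pip package, which is not part of this project's environment) could look like the following. It assumes the Docker daemon is reachable and that you run it from the main project folder.

import docker

client = docker.from_env()
image, build_logs = client.images.build(
    path=".",                         # build context: the current working directory
    dockerfile=".config/Dockerfile",  # same location the -f flag points to
    tag="titanic-env",                # same name the -t flag sets
    nocache=True,                     # fresh build; drop this to reuse the cache
)
for chunk in build_logs:              # stream the build output, as the CLI would
    print(chunk.get("stream", ""), end="")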

The Code Portion of this notebook

Caution

When I most recently completed this competition, I didn't have a polished write-up in mind. This is really just an amalgamation of the notes I made to myself on using and modifying the Docker container I ran my model in, plus the actual notebook, code, and notes.

import numpy as np
import pandas as pd
import duckdb
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, KFold, RandomizedSearchCV
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score
# Initialize a connection and create a persistent database
# Worth noting that due to the workflow I'm using, the database was/should be created externally and then built into the Docker container
# This way, the raw database files are saved locally and the container image size stays under control
con = duckdb.connect("files/titanic.duckdb")
# DROP and recreate train_raw/test_raw, ensuring a fresh process (IF EXISTS keeps this re-runnable on a fresh database)
con.sql("DROP TABLE IF EXISTS train_raw;")
con.sql("DROP TABLE IF EXISTS test_raw;")
con.sql("CREATE TABLE train_raw AS SELECT * FROM 'files/train.csv'")
con.sql("CREATE TABLE test_raw AS SELECT * FROM 'files/test.csv'")
# Create working tables (CREATE OR REPLACE keeps these re-runnable without manual drops)
con.sql("CREATE OR REPLACE TABLE train AS SELECT * FROM train_raw")
con.sql("CREATE OR REPLACE TABLE test AS SELECT * FROM test_raw")
# Verify the proper tables are loaded
con.sql("SELECT * FROM duckdb_tables()")
# Generate summary statistics
con.sql("SUMMARIZE train")
#con.sql("SUMMARIZE test")
# Examine NULLs in the Age, Cabin, and Embarked columns (do this for test as well)
con.sql("SELECT * FROM train WHERE Age IS NULL") # Imputing the average age seems to make the most sense here
con.sql("SELECT * FROM train WHERE Cabin IS NULL") # Cabins were likely recorded less strictly for lower-class passengers; probably unnecessary for the model
con.sql("SELECT * FROM train WHERE Embarked IS NULL") # Only 2 records, and it's unclear whether they boarded at all; too small a share of first-class survivors to be worth keeping
# Update the Age column, replacing NULL values with the average Age
con.sql("""UPDATE train
        SET Age = (
            SELECT
                avg(raw.Age) AS cleanAge
            FROM train AS raw
            WHERE raw.Age IS NOT NULL
        )
        WHERE Age IS NULL""")
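# Sanity check (sketch): the imputation above should leave no NULL ages behind
con.sql("SELECT count(*) AS remaining_age_nulls FROM train WHERE Age IS NULL").show()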
# Update the Sex column, change the VARCHAR type to BOOLEAN
con.sql("""ALTER TABLE train ALTER Sex 
        SET DATA TYPE BOOLEAN USING CASE
            WHEN Sex = 'female' THEN 1 ELSE 0 END
        """)
# Update the Age column in the test dataset
con.sql("""UPDATE test
        SET Age = (
            SELECT
                avg(raw.Age) AS cleanAge
            FROM test AS raw
            WHERE raw.Age IS NOT NULL
        )
        WHERE Age IS NULL""")
# Update the Sex column, change the VARCHAR type to BOOLEAN
con.sql("""ALTER TABLE test ALTER Sex 
        SET DATA TYPE BOOLEAN USING CASE
            WHEN Sex = 'female' THEN 1 ELSE 0 END
        """)
# Remove the PassengerId, Name, Cabin, Embarked, Fare, and Ticket columns
con.sql("ALTER TABLE train DROP PassengerId") # Has no bearing on the outcome of the model
con.sql("ALTER TABLE train DROP Name") # Has to be numeric data
con.sql("ALTER TABLE train DROP Cabin")
con.sql("ALTER TABLE train DROP Embarked")
con.sql("ALTER TABLE train DROP Fare") # Dropping because there are nulls in the test file
con.sql("ALTER TABLE train DROP Ticket") # Dropping because of inconsistent values
# Remove the Name, Cabin, Embarked, Fare, and Ticket columns (PassengerId is kept for the submission file)
con.sql("ALTER TABLE test DROP Name") # Model features must be numeric; Name is free text
con.sql("ALTER TABLE test DROP Cabin")
con.sql("ALTER TABLE test DROP Embarked")
con.sql("ALTER TABLE test DROP Fare") # Dropping because there are nulls in the test file
con.sql("ALTER TABLE test DROP Ticket") # Dropping because of inconsistent values
# Create dataframes for training and testing; I'll be using sklearn here, which needs both
train = con.sql("SELECT * FROM train").df()
test = con.sql("SELECT * FROM test").df()
# Create features and target
X = train.drop("Survived", axis = 1).values
y = train["Survived"].values

X_test = test.drop("PassengerId", axis = 1).values
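# Quick shape check (sketch): X and X_test should have the same number of feature columns
print(X.shape, y.shape, X_test.shape)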
# Initialize the regression object and split the data
# These hyperparameter values match points on the search grids defined below, so they
# presumably came from an earlier RandomizedSearchCV run
logreg = LogisticRegression(penalty = 'l2', tol = np.float64(0.083425), C = np.float64(0.43061224489795924), class_weight = 'balanced')
# Note: test_size = 0.7 holds out 70% of the data for validation, training on only 30%
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.7, random_state = 123)
# Fit and predict
logreg.fit(X_train, y_train)
LogisticRegression(C=np.float64(0.43061224489795924), class_weight='balanced',
                   tol=np.float64(0.083425))
# Predict and measure output
y_pred = logreg.predict(X_val)
y_pred_probs = logreg.predict_proba(X_val)[:, 1]
print(roc_auc_score(y_val, y_pred_probs))
0.8149262043998886
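# confusion_matrix and classification_report were imported above but unused so far;
# a quick sketch of both on the validation split (default 0.5 threshold)
print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred))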
# Create the parameter dictionary for model tuning
# Note: LogisticRegression's default lbfgs solver only supports l2, so the "l1" draws
# will fail to fit unless a solver such as liblinear or saga is set on the estimator
kf = KFold(n_splits = 5, shuffle = True, random_state = 123)
params = {
    "penalty": ["l1", "l2"],
    "tol": np.linspace(0.0001, 1.0, 25),
    "C": np.linspace(0.1, 1.0, 50),
    "class_weight": ["balanced", {0:0.8, 1:0.2}]
}
# Run the parameter search, fit the object, print the output
logreg_cv = RandomizedSearchCV(logreg, params, cv=kf)
logreg_cv.fit(X_train, y_train)
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Tuned Logistic Regression Best Accuracy Score: {}".format(logreg_cv.best_score_))
# Apply the model to the test set (logreg was already fit with the tuned hyperparameters above)
predictions = logreg.predict(X_test)

submission = pd.DataFrame({
    'PassengerId': test['PassengerId'],
    'Survived': predictions
})

submission.head()
   PassengerId  Survived
0          892         0
1          893         1
2          894         0
3          895         0
4          896         1
# Write the file to .csv and submit
# DuckDB's replacement scan lets SELECT * FROM submission read the in-scope pandas DataFrame
con.sql("SELECT * FROM submission").write_csv("files/titanic_submission.csv")