Taking the emphasis off of modeling with Potosnail

Spencer Holley
7 min read · Mar 7, 2021

A User’s Guide

photo by Zdeněk Macháček on Unsplash

About two months ago I released a Python package called Potosnail. It started out as a collection of helper functions I built for my Wikipedia capstone project (https://spencerholley.medium.com/how-i-built-an-ai-app-with-heroku-c321f36cc371), but I've since added higher-level functions that can automate the bulk of the modeling process. There is no substitute for domain knowledge and intuition in Machine Learning, but Potosnail can shortcut a large part of this process. What Potosnail does is take emphasis off of the modeling step in the Data Science pipeline, allowing data scientists to shift their focus toward the data itself. That time can then be spent collecting data, engineering features, and seeking to understand the industry the project is based in, as well as carrying out more thorough Exploratory Data Analysis (EDA).

Like many people starting out in Data Science (I know I did), beginners often come in with the opposite approach. They grab an easy-to-work-with dataset from Kaggle and then spend all their time building models. This leads people to think that building models is the most important part, which creates an artificial sense of confidence when a simple .fit() returns high accuracy. Then they get frustrated when they actually look at job postings and see that they only have experience in a small chunk of the job. This is where Potosnail can help! With Potosnail there is no need to spend all that time on the modeling process. Instead, you can pass in your data, get a model that's roughly tuned, and evaluate it. From there you can manipulate the data to get a better output. The time saved also leaves more room for EDA, which will help you create better features, and it lets beginners shift their focus to learning data collection, data storytelling, and presenting to nontechnical audiences.

Enough talk! I'm going to walk you through an example of an initial Potosnail setup that you can iterate on. In this tutorial I'm using the SyriaTel customer churn dataset, which is available on Kaggle. I will also note that I've already cleaned this dataset and am working from the cleaned version. You can read in my data from https://raw.githubusercontent.com/spe301/dsc-phase-3-project/main/Data/ChurnData_ForML.csv, but I encourage you to clean the data on your own.
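To make the setup concrete, here is a minimal sketch of reading the cleaned data and carving off a hold-out set (the `val` set used later in this post). The target column name `churn` and the 80/20 split are my assumptions, not something the original walkthrough specifies:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Cleaned SyriaTel churn data from the post; read it when you have a connection.
URL = "https://raw.githubusercontent.com/spe301/dsc-phase-3-project/main/Data/ChurnData_ForML.csv"

def load_and_split(df, target="churn", holdout=0.2, seed=42):
    """Split a cleaned DataFrame into a working set and a hold-out set (val).

    `target="churn"` and the 20% hold-out are illustrative assumptions.
    """
    train, val = train_test_split(
        df, test_size=holdout, random_state=seed, stratify=df[target]
    )
    return train, val

# df = pd.read_csv(URL)
# train, val = load_and_split(df)
```

Stratifying on the target keeps the churn rate similar in both splits, which matters because the classes are imbalanced.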

Install / import

You can install the latest version of Potosnail via `pip install Potosnail==0.2.1`. From there, you can import it with `import potosnail`.

Machine Learning

Once we have a cleaned dataset in hand, we can dive into modeling. Any features we want to add, drop, combine, or change should be handled before this step. We use WrapML from the Wrapper class. This function performs grid searches, scales data, filters out multicollinearity for regression problems, handles imbalanced data for classification, and uses SelectKBest to do feature selection over the existing features. The output is a list that contains the model, feature combination, dataset, and scaling method that yield the best results given the input data. The function takes three arguments: data, target, and task. We also set quiet equal to False, so the function prints scores along the way and you can watch the process. Note that these scores aren't all the same metric: for classification problems it's a combination of AUC with accuracy, precision, or recall, while for regression problems it's accuracy. There is also an fn parameter, set to True by default; fn stands for false negative and indicates that false negatives are worse than false positives.
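To show the mechanics, here is a rough scikit-learn approximation of what WrapML automates: scale, select features, then grid-search several candidate models and keep the best. This is my sketch of the idea, not Potosnail's actual implementation, and the candidate models and grids are illustrative:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def mini_wrap(X, y):
    """Return the best (model_name, fitted_search) pair over a small grid.

    An approximation of WrapML's scale -> select -> search loop; the
    candidates and hyperparameter grids below are placeholder choices.
    """
    candidates = {
        "logreg": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1.0]}),
        "forest": (RandomForestClassifier(random_state=0),
                   {"clf__n_estimators": [50, 100]}),
    }
    best_name, best_search = None, None
    for name, (clf, grid) in candidates.items():
        pipe = Pipeline([
            ("scale", StandardScaler()),
            ("select", SelectKBest(f_classif, k=min(5, X.shape[1]))),
            ("clf", clf),
        ])
        search = GridSearchCV(pipe, grid, cv=3, scoring="roc_auc").fit(X, y)
        if best_search is None or search.best_score_ > best_search.best_score_:
            best_name, best_search = name, search
    return best_name, best_search
```

WrapML does more than this (multicollinearity filtering, imbalance handling, multiple scalers), but the pipeline-plus-grid-search skeleton is the core idea.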

The next step is to inspect the ‘kit’. By doing this we can see the best model, see how the data has changed, and find out which scaling methods, if any, were used. We then send our hold-out dataset, val, through the same scaler and select the same features from it, so that both datasets stay compatible.
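The key point is that the hold-out set must pass through the same fitted transformers as the training data, never refit on its own. A minimal sketch of that step, with scikit-learn standing in for whatever scaler and selector the ‘kit’ reports:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

def align_holdout(X_train, y_train, X_val, k=3):
    """Fit scaler and feature selector on train only, then apply both to val.

    `k=3` is an illustrative number of features to keep.
    """
    scaler = StandardScaler().fit(X_train)            # fit on train only
    selector = SelectKBest(f_classif, k=k).fit(
        scaler.transform(X_train), y_train
    )
    transform = lambda X: selector.transform(scaler.transform(X))
    return transform(X_train), transform(X_val)
```

Fitting the scaler or selector on val would leak hold-out information into the evaluation.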

Great! We can see the best model was a Random Forest. If we want, we can tune it further with ml.Optimize() once we create a parameter grid.
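The parameter grid and the search that Optimize() wraps look roughly like this; the grid values below are hypothetical, and I'm sketching the search with plain scikit-learn rather than Potosnail's own interface:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical grid for further Random Forest tuning; pick values that make
# sense for your data.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
    "min_samples_split": [2, 5],
}

def tune_forest(X, y, grid=param_grid):
    """Grid-search a Random Forest, scoring on recall (our churn metric)."""
    search = GridSearchCV(
        RandomForestClassifier(random_state=0), grid, cv=3, scoring="recall"
    ).fit(X, y)
    return search.best_estimator_, search.best_params_
```

Scoring on recall matches the evaluation choice made in the next section.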

Best Model

Once we feel good about the model, we can evaluate it. There are many ways to do this; since we have a binary classification problem on our hands, we'll build a confusion matrix with BuildConfusion() from the Evaluator class. As you can see, we are getting ~82% recall and ~96% overall accuracy. Recall is our metric because labeling churned customers as not churned is worse than labeling happy customers as churned.
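For readers who want the numbers behind the matrix, recall and accuracy fall directly out of the four confusion-matrix cells (this is standard arithmetic, not Potosnail-specific code):

```python
from sklearn.metrics import confusion_matrix

def churn_metrics(y_true, y_pred):
    """Recall and accuracy from a binary confusion matrix (churn = 1)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    recall = tp / (tp + fn)        # missed churners (fn) are the costly error
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return recall, accuracy
```

A model can have high accuracy and poor recall when churners are rare, which is exactly why recall is the metric here.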

This is a good starting point, but it's important to have a more real-world success metric. In an industry or business setting nobody cares about these terms, so we will do a cost-benefit analysis. The cost of the model in this case is dollars lost from giving out unnecessary discounts (false positives), and the benefit is dollars saved by keeping customers who would have churned (true positives). The reason the benefit is so high is that preventing churn also saves the company the cost of acquiring that customer, which is $350 in this case!
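The arithmetic behind a cost-benefit analysis like this is simple. The $350 acquisition cost comes from the post; the per-customer discount amount below is a hypothetical placeholder you would replace with the real figure:

```python
def cost_benefit(tp, fp, acquisition_cost=350, discount=50):
    """Net dollars from the retention campaign, with churn as the positive class.

    tp: churners correctly flagged and retained (saves re-acquisition cost)
    fp: happy customers given an unnecessary discount
    `discount=50` is an illustrative assumption, not a figure from the data.
    """
    benefit = tp * acquisition_cost   # churn prevented -> acquisition cost avoided
    cost = fp * discount              # discounts wasted on non-churners
    return benefit - cost
```

For example, retaining 10 churners while wasting discounts on 2 happy customers nets 10 × $350 − 2 × $50 = $3,400 under these assumptions.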

Wrapping it up

All of the above is a starting point to iterate on. Once you've got this set up, you can keep researching the industry and doing EDA to improve the quality of the data. Then you can keep going: improve the data, improve the model, repeat! If you find that a certain feature isn't important in industry X, then drop it; if you find that feature Y is really important in industry X, you can do what you need to capture it. I will say that WrapML takes a long time to run, but that'll just give you more time to explore your data, make some visualizations in Tableau, research the industry, or maybe just play with your dog ;)

If you wish, you can keep reading to learn about Deep Learning with Potosnail!

Deep Learning

Please note that this dataset probably isn't great for Deep Learning since it's quite small; this is more of a demo. If you want to see this used in the wild, check out this repo: https://github.com/spe301/Wikipedia-Capstone. Before we do anything, we need to get the data and labels into the right format: they must be NumPy arrays, and the labels must be one-hot encoded, [0, 1] instead of 0 and [1, 0] instead of 1.
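That label conversion is a one-liner with NumPy. The helper below follows the exact mapping stated above (0 → [0, 1], 1 → [1, 0]); the encoding direction only matters insofar as it stays consistent between training and evaluation:

```python
import numpy as np

def one_hot(labels):
    """Encode 0 -> [0, 1] and 1 -> [1, 0], matching the convention above."""
    labels = np.asarray(labels)
    return np.stack([labels, 1 - labels], axis=1)
```

Feature matrices can be converted with a plain `df.to_numpy()` call.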

Once our data is right, we can build a quick model with FastNN. We set output_dim to 2 and use binary_crossentropy as our loss function because we are working on a binary classification problem.
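FastNN's internals are Potosnail's own and build a Keras network. As a rough, dependency-light stand-in for the same feed-forward idea, here is the equivalent with scikit-learn's MLPClassifier (the hidden-layer sizes are placeholders, and log-loss plays the role of binary cross-entropy):

```python
from sklearn.neural_network import MLPClassifier

def fast_nn(X, y, hidden=(32, 16)):
    """Quick feed-forward classifier; a sketch of what FastNN produces.

    `hidden=(32, 16)` is an illustrative architecture, not FastNN's default.
    """
    model = MLPClassifier(hidden_layer_sizes=hidden, max_iter=500,
                          random_state=0)
    return model.fit(X, y)
```

If you have TensorFlow installed, the real FastNN model is the better tool; this sketch just shows the shape of the approach.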

We use ViewLoss and ViewAccuracy from the Evaluator class to check on training. This model is doing fine: there is no overfitting, and it converges at near-100% accuracy.

At only 73% recall, it's not quite as good as the Random Forest's 82%. However, I must note that this model hasn't been tuned in any way.

Next we use TestDL, a function that facilitates grid searches for Keras models. We use DeepTabularClassification as our model-building function; all it does is build feed-forward networks for classification tasks. TestDL builds models in accordance with the parameters given in the grid below. At the end we use .best_estimator_.model to see the best model, which we can then inspect with .summary() or .best_params_.
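The original grid was shown as a screenshot, so here is an illustrative one, again sketched with scikit-learn's MLPClassifier rather than Keras so the example runs anywhere (TestDL itself searches over real Keras models, and its parameter names will differ):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Hypothetical hyperparameter grid; TestDL's grid keys are Keras-specific
# (layer counts, activations, etc.), these are MLPClassifier equivalents.
grid = {
    "hidden_layer_sizes": [(8,), (16, 8)],
    "alpha": [1e-4, 1e-3],
}

def tune_network(X, y):
    """Grid-search the small network, scoring on recall like the rest of the post."""
    search = GridSearchCV(
        MLPClassifier(max_iter=300, random_state=0), grid,
        cv=3, scoring="recall",
    ).fit(X, y)
    return search.best_estimator_, search.best_params_
```

With TestDL the same pattern applies: define a grid, search, then pull the winner out of the fitted search object.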

Our tuned model did 3% better on recall, but with better feature engineering, and maybe more tuning, it could certainly do better.

Why the heck is it called Potosnail?

You’ve read all the way to the end at this point and are probably wondering how Potosnail got its name. Since you’ve read this whole blog, you certainly deserve an explanation!

When I was a kid I had an imaginary friend named Potosnail, and I wanted to make something imaginary real. In tech we’re all about turning the imaginary, theoretical, and nonexistent into reality, so naming my library after my imaginary friend continues that theme of making ideas real.
