
Data Science and Business Analytics

I am passionate about telling stories with numbers. It's my art; you could call it painting with numbers. (I would later adopt this as the title for a talk.)

For individuals like me, with no organizational affiliation, the easiest way to get hold of large real-world datasets to practice on is a site called Kaggle. I went through three separate datasets and produced three separate models.

For insight into my data processes, check out the CRISP-DM papers, as well as the OSEMN (Awesome) Model. I used R for all data analyses.

Categorization - Titanic Dataset

This is the canonical Kaggle intro dataset. My first question was, "What COULD have made the difference between life and death?" and I started getting elbow-deep in exploring potential answers.

The first two ideas I had were that class and gender would influence survival rates, so I started poking around and graphing survival rates as a function of various columns, and saw that there was a third one I'd missed - age. This makes sense, as it wasn't just women first, it was women and children first.
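To give a flavor of that poking around, here is a minimal R sketch of the kind of thing I mean, assuming the standard Kaggle train.csv column names (Survived, Pclass, Sex, Age) - illustrative rather than my exact code:

    # Survival rates by a few candidate columns, using the Kaggle train.csv
    train <- read.csv("train.csv", stringsAsFactors = FALSE)

    aggregate(Survived ~ Pclass, data = train, FUN = mean)   # by class
    aggregate(Survived ~ Sex,    data = train, FUN = mean)   # by gender

    # Children vs. adults; rows with a missing Age are simply dropped here
    train$AgeGroup <- ifelse(train$Age < 18, "child", "adult")
    aggregate(Survived ~ AgeGroup, data = train, FUN = mean)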

The tricky bit about age is that the Age column had a 20% NA rate, so I had to learn strategies for imputation. Had fewer than 1 in 5 cases been missing, I probably would have just dropped those rows, but the dataset is small enough that I couldn't afford to lose that much of it, so I ended up imputing the median age of each passenger's class (first class skewed older than third).
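The imputation itself only takes a few lines; continuing the sketch above, the class-median approach looks roughly like this:

    # Median age per passenger class, then fill missing ages from that lookup
    class_medians <- tapply(train$Age, train$Pclass, median, na.rm = TRUE)

    missing <- is.na(train$Age)
    train$Age[missing] <- class_medians[as.character(train$Pclass[missing])]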

Once I had made some reasonable deductions about what kind of passenger tended to survive, I encoded those ideas as features and fed them into a Random Forest model for categorization.
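A sketch of that final step, using the randomForest package on the features discussed above (again illustrative, not the exact model I submitted):

    # Fit a Random Forest classifier on class, gender, and (imputed) age
    library(randomForest)

    train$Survived <- factor(train$Survived)
    train$Sex      <- factor(train$Sex)
    train$Pclass   <- factor(train$Pclass)

    set.seed(42)
    fit <- randomForest(Survived ~ Pclass + Sex + Age, data = train,
                        ntree = 500, importance = TRUE)

    print(fit)        # out-of-bag error estimate
    varImpPlot(fit)   # which features mattered most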

Regression - Bike Rental Dataset

Moving from discrete binary variables to continuous numeric ones, this dataset gives you the past two years of rental data - what the weather was like on a given day, whether it was a weekend, holiday, or weekday, plus other variables - and asks you to predict the next seven days of rentals. What complicated things (and made it slightly unrealistic) was that the data was not provided as a time series, which cut off several common methods of analysis.

This followed a lot of the same initial steps you'd take with any dataset - examine days of high rentals and see how they differed from days of low rentals. Unsurprisingly, sunny weekends and holidays outperformed rainy weekdays, but that's not the important bit; the important bit is putting a day into a certain bin of expected rentals so staffing can be allocated.
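That comparison is essentially one line per variable. A sketch, assuming a daily bike-sharing file with hypothetical column names cnt (rentals), weathersit, and workingday - the names in the file you download may differ:

    # Mean rentals by weather situation and by working day vs. weekend/holiday
    bikes <- read.csv("day.csv", stringsAsFactors = FALSE)

    aggregate(cnt ~ weathersit, data = bikes, FUN = mean)
    aggregate(cnt ~ workingday, data = bikes, FUN = mean)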

You could fit a complicated polynomial regression for this, which would give you an exact figure for how many bikes are predicted to be rented on a given day, but in my opinion that would give you a false sense of security in your model's accuracy. After all, one of its key inputs is the weather, so why stress over mathematical precision when that input is itself an uncertain forecast?

In the end, I decided to treat it as a binned categorization problem, with the output being the probability that a day falls into each range of rentals. If you wanted, you could multiply each probability by the mean of its range and calculate an expected value for that day, but the model would not do it for you and hide the raw probabilities.
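A sketch of that binned setup, with the same assumed column names, illustrative bin boundaries, and a random forest standing in for whatever probabilistic classifier you prefer:

    # Turn the continuous rental count into bins, then predict bin probabilities
    library(randomForest)

    breaks <- c(0, 2000, 4000, 6000, Inf)   # illustrative cut points
    bikes$demand_bin <- cut(bikes$cnt, breaks = breaks,
                            labels = c("low", "medium", "high", "very high"))
    bikes$weathersit <- factor(bikes$weathersit)
    bikes$workingday <- factor(bikes$workingday)

    set.seed(42)
    fit <- randomForest(demand_bin ~ weathersit + workingday + temp + hum,
                        data = bikes, ntree = 500)

    # Per-day bin probabilities; multiplying by each bin's midpoint gives a
    # rough expected value, should you want a single number after all
    probs     <- predict(fit, newdata = bikes, type = "prob")
    midpoints <- c(1000, 3000, 5000, 7000)
    expected  <- probs %*% midpoints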

Text Analytics - Random Acts of Pizza Dataset

This was likely the most technically challenging of the three, which is why I saved it for last. There is a section of the popular website Reddit called "Random Acts of Pizza". Users are allowed to post requests for a free pizza, and other users can decide to fulfill them or not (the rise of online ordering has made things like this very easy).

In addition to the text of the request, some quantitative data was provided with each request, such as the age and activity level of the account. This data was explored (older, more active accounts were more likely to receive generosity) and set aside for later inclusion in the model.

The model chosen was bag-of-words, so the first order of business was to clean and normalize the text - it was fed through functions to transform everything to lower-case, strip punctuation and stop words, and then split on spaces. Thus the input was tokenized and ready for aggregation.
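In base R, that cleanup might look roughly like the following, with requests standing in for a hypothetical character vector of request texts and a deliberately tiny stop-word list (a fuller list would be used in practice):

    # Lower-case, strip punctuation and stop words, split on whitespace
    clean_text <- function(x) {
      x <- tolower(x)
      x <- gsub("[[:punct:]]", " ", x)
      tokens <- strsplit(x, "\\s+")
      stop_words <- c("the", "a", "an", "and", "or", "of", "to", "i", "my")
      lapply(tokens, function(t) t[nzchar(t) & !(t %in% stop_words)])
    }

    tokenized <- clean_text(requests)   # requests: character vector of post texts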

Then it was just a matter of more categorization - which words appeared more frequently in successful requests than in unsuccessful ones? This result, along with the account data set aside earlier, was fed into the model, and success was had. I considered rewarding myself with a pizza for knocking out this tough challenge.
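A sketch of that frequency comparison, reusing the tokenized list above and a hypothetical logical vector got_pizza marking which requests were fulfilled:

    # Word counts in successful vs. unsuccessful requests
    count_words <- function(token_lists) table(unlist(token_lists))

    freq_success <- count_words(tokenized[got_pizza])
    freq_failure <- count_words(tokenized[!got_pizza])

    # Align the two count tables on a shared vocabulary
    words <- union(names(freq_success), names(freq_failure))
    s <- freq_success[words]; s[is.na(s)] <- 0
    f <- freq_failure[words]; f[is.na(f)] <- 0

    # Ratio of smoothed relative frequencies: values well above 1 lean "successful"
    n_s <- sum(s); n_f <- sum(f)
    ratio <- ((as.numeric(s) + 1) / (n_s + 1)) / ((as.numeric(f) + 1) / (n_f + 1))
    names(ratio) <- words
    head(sort(ratio, decreasing = TRUE), 20)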

Going Forward

Data science and business analytics, in existence since the early days of the Roman census, have finally hit the big time. I'm excited to learn more and more as I go forward.