For individuals like me, with no organizational affiliations, the easiest way to get hold of large real-world datasets to practice on is a site called Kaggle. I worked through three separate datasets and produced three separate models.
For insight into my data process, check out the CRISP-DM papers, as well as the OSEMN (pronounced "awesome") model. I used R for all data analyses.
The first two ideas I had were that class and gender would influence survival rates, so I started poking around, graphing survival rate as a function of various columns, and found a third factor I'd missed - age. That makes sense: it wasn't just women first, it was women and children first.
The tricky bit was that the Age column had a 20% NA rate, so I had to learn strategies for imputation. Had fewer than one case in five been missing, I probably would have just thrown out the incomplete rows, but at that rate I ended up imputing with the median age of the passenger's class (first class skewed older than third).
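The actual analysis was done in R, but the class-median imputation idea can be sketched in Python with pandas. The `Age` and `Pclass` column names follow the Kaggle Titanic schema; the tiny frame here is purely illustrative:

```python
import pandas as pd
import numpy as np

# Toy stand-in for the Titanic data (the real work used the Kaggle CSV).
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Age":    [38.0, np.nan, 27.0, 22.0, np.nan, 19.0],
})

# Fill each missing Age with the median Age of that passenger's class.
df["Age"] = df.groupby("Pclass")["Age"].transform(lambda s: s.fillna(s.median()))
```

Grouping before filling is the whole trick: a single global median would drag first-class ages down and third-class ages up.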
Once I had some reasonable deductions about what kind of passenger tended to survive, I fed these features into a Random Forest model for classification.
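In R this would typically go through the `randomForest` package; an equivalent Python sketch with scikit-learn looks like the following. The feature frame and labels are made up for illustration (`Sex` encoded numerically, `Age` already imputed):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical post-cleaning features: class, encoded sex, imputed age.
X = pd.DataFrame({
    "Pclass": [1, 3, 3, 1, 2, 3],
    "Sex":    [0, 1, 0, 0, 1, 1],   # 0 = female, 1 = male
    "Age":    [38.0, 22.0, 4.0, 35.0, 27.0, 30.0],
})
y = [1, 0, 1, 1, 0, 0]  # Survived

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
preds = clf.predict(X)
```

A forest of decision trees handles the mixed categorical/numeric features here without any scaling, which is part of why it's such a common first model for this dataset.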
This followed a lot of the same initial steps you'd take with any dataset: examine days of high demand and see how they differed from days of low demand. Unsurprisingly, sunny weekends and holidays outperformed rainy weekdays, but that's not the important bit. The important bit is putting a day into a certain demand bin, so staffing can be allocated accordingly.
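The binning step itself is mechanical. A Python sketch of one way to do it (the thresholds and daily totals below are invented, not from the actual dataset):

```python
import pandas as pd

# Hypothetical daily rental totals; the real data came from Kaggle.
daily = pd.Series([120, 480, 950, 210, 760, 1500])

# Assign each day to a low / medium / high demand bin for staffing purposes.
bins = [0, 300, 800, 2000]
labels = ["low", "medium", "high"]
demand = pd.cut(daily, bins=bins, labels=labels)
```

Once every historical day carries a bin label, predicting the label for a future day becomes an ordinary classification problem.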
You could fit a complicated polynomial regression for this, which would give you an exact figure for how many bikes are predicted to be rented on a given day, but in my opinion that would give you a false sense of security in your model's accuracy. After all, one of its key inputs is the weather forecast, so why stress over mathematical precision when you might be feeding the model false data?
In the end, I decided to treat it as a binned categorization problem, with the output being the probability that a day's rentals fall within a certain range. If you wanted, you could multiply each probability by the mean of its range and calculate an expected value for that day, but the model would not do it for you and hide the raw distribution.
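That optional expected-value step is a one-liner. A sketch with made-up bin probabilities and bin midpoints (not the model's actual output):

```python
# Hypothetical model output: probability a given day lands in each rental bin.
bin_probs = {"0-300": 0.2, "301-800": 0.5, "801-2000": 0.3}
bin_means = {"0-300": 150, "301-800": 550, "801-2000": 1400}

# Collapse the distribution into a single expected rental count.
expected = sum(p * bin_means[b] for b, p in bin_probs.items())
# 0.2*150 + 0.5*550 + 0.3*1400 = 725
```

The point in the text stands: the full probability distribution stays visible, and collapsing it to one number is the consumer's choice, not the model's.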
In addition to the text of each request, some quantitative data was provided, such as the age and activity level of the requesting account. This data was explored (older, more active accounts were more likely to receive generosity) and set aside for later inclusion in the model.
The model chosen was bag-of-words, so the first order of business was to clean and normalize the text: it was fed through functions to transform everything to lower case, strip punctuation and stop words, and then split on spaces. Thus was the input tokenized and ready for aggregation.
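The original pipeline was built in R (the `tm` package is the usual choice there), but the same cleaning steps can be sketched in plain Python. The stop-word list here is a tiny illustrative stand-in, not the one actually used:

```python
import string

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "i", "my"}  # illustrative only

def tokenize(text):
    # Lower-case, strip punctuation, split on whitespace, drop stop words.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in text.split() if w not in STOP_WORDS]

tokens = tokenize("Lost my job, and I'd love a pizza!")
```

Note that stripping punctuation before splitting means contractions like "I'd" collapse to "id"; a more careful pipeline would handle those separately.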
Then it was just a matter of more classification: which words appeared more frequently in successful requests than in unsuccessful ones? This result was fed into the model, and success was had. I considered rewarding myself with a pizza for knocking off this tough challenge.
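The frequency comparison at the heart of that step can be sketched like so; the token lists are invented examples, not actual Random Acts of Pizza requests:

```python
from collections import Counter

# Toy pre-tokenized requests, grouped by outcome.
successful   = [["lost", "job", "kids"], ["paycheck", "late", "kids"]]
unsuccessful = [["bored", "pizza"], ["craving", "pizza"]]

succ_counts = Counter(w for req in successful for w in req)
fail_counts = Counter(w for req in unsuccessful for w in req)

# Relative-frequency difference per word: positive means the word shows up
# more often in successful requests, a crude signal of "words that work".
vocab = set(succ_counts) | set(fail_counts)
n_succ = sum(succ_counts.values())
n_fail = sum(fail_counts.values())
signal = {w: succ_counts[w] / n_succ - fail_counts[w] / n_fail for w in vocab}
```

Scores like these (or the raw per-class counts) are exactly the kind of per-word feature a bag-of-words classifier aggregates over a whole request.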