Stats1: Basics with NumPy, Matplotlib, scikit-learn

Our #DST4L class on October 22nd was a whirlwind tour of probability, statistics, and machine learning in three hours.  Before diving into that, I want to make sure you all know that the Python programming language was named after Monty Python. You may have figured this out if you’ve clicked on a Google result and found a picture of a lumberjack along with the cool Python information you sought.  So it seems the perfect segue to begin our review with “Run away! Run away!”

Okay, wait.  Probability, statistics and machine learning are not as scary as a killer rabbit.  At the end of class, Rahul Dave (@rahuldave on Twitter) reminded us that these are complicated topics.  We shouldn’t get discouraged if we struggle.  He suggested some books for review*.

In this class, we shifted focus to matplotlib using a new IPython notebook.  Like our earlier teachers Lynn Cherny and Tom Morris, Rahul reminded us to start by exploring our data.  He used a scatter plot with marginal histograms to show us how we can see the relationship between two variables and, along the margins, the distribution of each variable on its own.  He also suggested that to have good graphs, we should use good colors.
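To jog my own memory, here is a minimal sketch of a scatter plot with marginal histograms. It is my own illustration, not the code from Rahul's notebook: the data are made-up random numbers, and the variable names, bin counts, and colors are all assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up data for illustration: two loosely related variables.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(scale=0.8, size=500)

# A 2x2 grid: big scatter plot at bottom left, thin histograms on top and right.
fig = plt.figure(figsize=(6, 6))
grid = fig.add_gridspec(
    2, 2, width_ratios=(4, 1), height_ratios=(1, 4), wspace=0.05, hspace=0.05
)
ax_scatter = fig.add_subplot(grid[1, 0])
ax_histx = fig.add_subplot(grid[0, 0], sharex=ax_scatter)
ax_histy = fig.add_subplot(grid[1, 1], sharey=ax_scatter)

# The scatter plot shows the relationship between x and y.
ax_scatter.scatter(x, y, s=10, alpha=0.6, color="tab:blue")
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("y")

# Each marginal histogram shows one variable on its own.
ax_histx.hist(x, bins=30, color="tab:blue")
ax_histy.hist(y, bins=30, orientation="horizontal", color="tab:blue")

# Hide the tick labels that would duplicate the scatter plot's axes.
ax_histx.tick_params(labelbottom=False)
ax_histy.tick_params(labelleft=False)
```

In a notebook the figure appears on its own; from a script, `plt.show()` or `fig.savefig("scatter.png")` (any filename you like) will display or save it.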

I’ve had to review a lot of the statistical terms mentioned in class this week, including these:

  • Gaussian distribution: I was happy to see this referred to as the familiar “normal distribution” and happier still to discover in Wikipedia that it is one of several distributions that are bell-shaped (i.e., a bell curve).
  • Bernoulli distribution: This was introduced with the example of a single coin toss, where p represents the probability that it comes up heads and 1-p represents the probability that it comes up tails.
  • Sample size: The importance of this was detailed in Howard Wainer’s very readable article, “The Most Dangerous Equation.”  With five examples, the article clearly explains how small samples can mislead us: their averages vary much more than the averages of large samples, so extreme results show up more often when sample sizes are too small.
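To convince myself about the coin toss and the sample size point, I tried a small NumPy simulation. This is only a sketch under my own assumptions (a fair coin with p = 0.5, 10,000 repeated experiments, and sample sizes of 10, 100, and 1000); none of these numbers come from class. The "theory" column uses the standard formula for the spread of a sample proportion, sqrt(p(1-p)/n).

```python
import numpy as np

rng = np.random.default_rng(42)

p = 0.5                 # assumed probability of heads (a fair coin)
n_experiments = 10_000  # how many times we repeat each experiment
spread = {}             # observed spread of the proportion of heads, per sample size

for n in (10, 100, 1000):
    # Each row is one experiment of n Bernoulli tosses (True means heads).
    tosses = rng.random((n_experiments, n)) < p
    # Proportion of heads in each experiment.
    proportions = tosses.mean(axis=1)
    # How much that proportion varies from one experiment to the next.
    spread[n] = proportions.std()
    theory = np.sqrt(p * (1 - p) / n)
    print(f"n={n:>4}: observed spread {spread[n]:.4f}, theory {theory:.4f}")
```

The smaller the sample, the more the proportion of heads bounces around, even though the coin never changes.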

It’s critical to have a well-curated data set.  We need to train a model on part of our data in order to make predictions, and then test those predictions on the rest.  Train using 60% of your data.  When you have a hypothesis, you can test it using the other 40% of your data set.  Testing will be repeated many, many times, and this is where machine learning and scikit-learn fit in.  The more data we have, the better off we’ll be.
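Here is a rough sketch of what a 60/40 split can look like in scikit-learn. The choices are mine, not from class: I assume the iris data set that ships with scikit-learn, a logistic regression model, and a fixed `random_state` so the split is repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed example data: 150 iris flowers, 4 measurements each.
X, y = load_iris(return_X_y=True)

# Keep 60% of the rows for training; the remaining 40% are held back for testing.
# stratify=y keeps the mix of species the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=0
)

# Fit the model on the training rows only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Score the model on rows it has never seen.
accuracy = model.score(X_test, y_test)
print(f"train rows: {len(X_train)}, test rows: {len(X_test)}")
print(f"accuracy on held-out data: {accuracy:.2f}")
```

Changing `random_state` gives a different split, which is one simple way to repeat the test many times.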

We’ll continue with scikit-learn, cross-validation, Bayes’ rule and more in our next class.

See you soon.

* Here are the suggested readings from class:

Python for Data Analysis

Doing Data Science: Straight Talk from the Frontline

Machine Learning for Hackers

Books by Edward Tufte

Here’s the link to the Etherpad notes in PDF

Don’t forget to read Rahul’s emails, which contain some reading assignments.
