This is an introductory tutorial on machine learning using the scikit-learn and Keras Python packages. Prerequisites are minimal; chiefly, I assume that the reader has a little prior programming experience, preferably in Python. A passing familiarity with basic inferential statistical methods (primarily linear regression) is also helpful, but isn’t essential. Some of the material in this tutorial is borrowed from Jake Vanderplas’s excellent scikit-learn tutorial. The main differences between the present tutorial and most others out there are that (a) this tutorial is more verbose than most (i.e., the emphasis is on conceptual understanding rather than just on learning the scikit-learn API), and (b) most of the examples are drawn from less commonly used datasets and include applications to real-world problems.
All of the code in this tutorial is written in Python. There is nothing intrinsically special about Python in the machine learning context; in principle, all of the examples and simulations in these pages could have been written in other languages (R, Matlab, etc.). Indeed, there are plenty of machine learning tutorials out there written in other languages. That said, Python does have a number of practical advantages over other languages. Chief among these is the fact that it’s currently the most widely used language in the data science and machine learning community. This means there are exceptional tools written in Python for virtually every domain of machine learning. Exhibit A is the scikit-learn package for machine learning. Scikit-learn is the world’s most widely used machine learning package, and some of the reasons for its popularity will hopefully soon become clear. Scikit-learn is itself built on the numpy numerical computing library, which we’ll also use fairly regularly. Exhibit B is the Keras package for deep learning. Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow. It was developed with a focus on enabling fast experimentation.
Now that we know what machine learning is, let’s turn to the scikit-learn package. Scikit-learn is the most widely used machine learning package in Python (and probably the most widely used ML package, period). Its popularity stems from its simple, elegant API, stellar documentation, and comprehensive support for many of the most widely used machine learning algorithms (the main exception being deep learning, for which we will use Keras). Scikit-learn provides well-organized, high-quality tools for virtually all aspects of the typical machine learning workflow, including data loading and preprocessing, feature extraction and feature selection, dimensionality reduction, model selection and evaluation, and so on.
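To make the workflow described above a little more concrete, here is a minimal sketch that chains a few of those stages together. The choice of the built-in iris dataset, the particular steps (scaling, PCA, logistic regression), and the parameter values are my own assumptions for illustration, not examples taken from later in this tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Data loading: a small built-in dataset (assumed here purely for illustration).
X, y = load_iris(return_X_y=True)

# Preprocessing -> dimensionality reduction -> model, chained into one estimator.
model = make_pipeline(
    StandardScaler(),                   # rescale each feature to zero mean, unit variance
    PCA(n_components=2),                # keep the first two principal components
    LogisticRegression(max_iter=1000),  # a simple classifier
)

# Model evaluation: accuracy under 5-fold cross-validation.
scores = cross_val_score(model, X, y, cv=5)
print(f"mean accuracy: {scores.mean():.3f}")
```

Every step in the pipeline exposes the same fit/transform/predict style of interface, which is a large part of what makes the API feel so uniform.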
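First you will need to install Python. The approach depends on which operating system you are using. Of the commands below, the first two are for macOS (they install the Homebrew package manager and then use it to install Python 3), and the last two are for Debian- or Ubuntu-based Linux (they install Python 3.6 and the pip package installer with apt).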
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
brew install python3
sudo apt-get install python3.6
sudo apt install python3-pip
Several Python integrated development environments (IDEs) exist to make writing Python code easier. The one I use the most, and highly recommend, is the Spyder IDE. Spyder is a powerful scientific environment written in Python, for Python, and designed by and for scientists, engineers and data analysts. It offers a unique combination of the advanced editing, analysis, debugging, and profiling functionality of a comprehensive development tool with the data exploration, interactive execution, deep inspection, and beautiful visualization capabilities of a scientific package.
Once you have installed Python, installing Spyder is straightforward. Open a terminal or command prompt and type the following:
pip install spyder
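Once the install finishes, it can be handy to confirm from within Python that the packages this tutorial relies on are actually visible to your interpreter. The following is a small sketch using only the standard library; the helper name is_installed is hypothetical, and the list of packages to check is my own assumption based on the tools mentioned so far.

```python
import importlib.util


def is_installed(package_name):
    """Return True if `package_name` can be found by this Python interpreter."""
    return importlib.util.find_spec(package_name) is not None


# Note: these are import names, so scikit-learn appears as "sklearn".
for name in ["numpy", "sklearn", "keras", "spyder"]:
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Anything reported as missing can usually be installed with pip in the same way as Spyder above.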