Data Loading
To learn how to do machine learning, we’re going to need some data to work with. To facilitate learning and experimentation, scikit-learn includes a `datasets` module containing a number of widely used toy datasets. Here’s how we could load the (in)famous Iris dataset:
```python
from sklearn import datasets

# Load a dictionary (technically, a Bunch) containing the data
iris = datasets.load_iris()

# 'data' and 'target' contain the feature data and classes, respectively
X, y = iris['data'], iris['target']
```
`X` contains feature information for 150 individual iris flowers drawn from 3 different species, and `y` contains the true class label for each flower.
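We can verify this directly from the arrays (a quick sanity check; numpy is assumed to be available, which it always is when scikit-learn is installed):

```python
import numpy as np

X.shape       # (150, 4): 150 flowers, 4 features each
y.shape       # (150,): one class label per flower
np.unique(y)  # array([0, 1, 2]): the 3 species, encoded as integers
```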
If we want to inspect the features in tabular form, we can easily load the data into a pandas `DataFrame`:
```python
# Here we're importing the pandas package, which we'll use extensively
# for data manipulation. In future sections, we'll put the core imports
# at the top of the notebook, which is the convention in Python.
import pandas as pd

# Initialize a new pandas DataFrame from the X matrix and the feature names
data = pd.DataFrame(X, columns=iris['feature_names'])
data.head()
```
|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
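If we also want to see the class labels alongside the features, we could attach `y` as an extra column. A minimal sketch (it uses the Bunch's `target_names` field, which maps the integer codes in `y` to species names):

```python
# Map each integer class code in y to its species name and store it
# as a categorical column alongside the features
data['species'] = pd.Categorical.from_codes(y, iris['target_names'])
data.head()
```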
In principle, we could use the iris dataset (or one of the other datasets bundled with scikit-learn) for many of the examples we’ll work through. But the iris dataset has some notable limitations: it’s fairly small (only 150 rows and 4 features), and it has little to do with real-world applications. Instead, we’ll use data that should be of interest to many readers: a house pricing dataset and time-series stock prices. The house pricing dataset consists of various features of each house along with its sale price. The time-series stock price data will be harvested from Yahoo Finance.
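As a preview of the second dataset, historical stock prices can be pulled down in just a couple of lines. Here’s a sketch assuming the third-party `yfinance` package (our choice for illustration; the text doesn’t commit to a particular tool):

```python
import yfinance as yf  # assumed third-party package (pip install yfinance)

# Download daily price history for a single ticker from Yahoo Finance;
# the result comes back as a pandas DataFrame indexed by date
prices = yf.download('AAPL', start='2015-01-01', end='2020-01-01')
prices.head()
```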
We will first make use of the house pricing dataset to learn the basics of machine learning. We’ll use pandas, the reference data analysis library in Python, to do this. Pandas provides us with a fairly magical `read_csv` function that can read in almost any kind of tabular data.
```python
# read_csv is a workhorse function that can read almost any kind of
# plain-text format. The returned object is a pandas DataFrame.
all_data = pd.read_csv('data/house_prices.csv', sep=',', index_col=0).reset_index(drop=True)
```
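It’s a good habit to confirm the dimensions of whatever we just read in. As we’ll see below, this dataset has 1460 samples and 80 columns:

```python
all_data.shape  # (1460, 80): 1460 houses, 80 variables each
```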
Representing the data
Once the data have been read in, we can take a look at the first few rows:
```python
# head() displays the first few rows of the dataset.
all_data.head()
```
|   | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 80 columns
As we can see, the data are tabular: every row represents a different house, and every column represents a different variable. In machine learning terminology, we typically refer to the rows and columns as samples and features, respectively. We can thus think of our data as a two-dimensional n (samples) × p (features) matrix. The vast majority of algorithms implemented in the scikit-learn and keras packages expect to receive numerical matrices of this kind as their primary inputs. Note, however, that some of the columns in our dataset (e.g., “MSZoning” and “LotShape”) contain strings or categorical values, so we need to pre-process them before we can make proper use of them. One option would be to recode these columns into numerical form, for example by assigning a number to each distinct level. The other option would be to simply remove them. Since the dataset already has 80 columns and 1460 samples, we’ll just remove the non-numeric columns for now.
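Here’s a minimal sketch of that removal step using pandas’ `select_dtypes` (the exact dtype of each column depends on how `read_csv` parsed the file, so the surviving column count is indicative rather than guaranteed):

```python
# Keep only the numeric columns; string/categorical columns such as
# 'MSZoning' and 'LotShape' are dropped for now.
# (If we wanted to keep them instead, pd.get_dummies(all_data) would
# recode each categorical level as its own 0/1 indicator column.)
numeric_data = all_data.select_dtypes(include='number')
numeric_data.shape
```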