Data Loading

Last updated on Sep 30, 2021

To learn how to do machine learning we’re going to need some data to work with. To facilitate learning and experimentation, scikit-learn includes a datasets module containing a number of widely-used toy datasets. Here’s how we could load the (in)famous Iris dataset:

from sklearn import datasets

# Load a dictionary (technically, a Bunch) containing the data
iris = datasets.load_iris()

# 'data' and 'target' contain the feature data and class labels, respectively
X, y = iris['data'], iris['target']

X contains four feature measurements for each of 150 individual Iris flowers drawn from 3 different species. y contains the true class (species) label for each flower. If we want to inspect the features in tabular form, we can easily load the data into a pandas DataFrame:

# Here we're importing the pandas package, which we'll use extensively
# for data manipulation. In future sections, we'll put the core imports
# at the top of the notebook, which is the convention in Python.
import pandas as pd

# Initialize a new pandas DataFrame from the X matrix and the feature names
data = pd.DataFrame(X, columns=iris['feature_names'])
data.head()

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2
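The description of X and y above can be checked directly. The following is a minimal sketch that relies only on keys the Bunch returned by load_iris exposes ('data', 'target', 'target_names', 'feature_names'); the variable name labeled and the column name 'species' are our own choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris['data'], iris['target']

# X is an (n_samples, n_features) array; y holds one integer class label per row
print(X.shape)
print(y.shape)

# Count how many flowers belong to each class
classes, counts = np.unique(y, return_counts=True)
print(dict(zip(iris['target_names'], counts)))

# Attach the species name to each row ('species' is a column name we chose)
labeled = pd.DataFrame(X, columns=iris['feature_names'])
labeled['species'] = iris['target_names'][y]
print(labeled['species'].value_counts())
```

Keeping the class label next to the features in one DataFrame makes it easier to group or filter rows by species when exploring the data.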

In principle, we could use the iris dataset (or one of the other datasets bundled with scikit-learn) for many of the examples we’ll work through. But the iris dataset has some limitations: most notably, it’s fairly small (only 150 rows and 4 features) and has little to do with real-world applications. Instead, we’ll use data that should be of interest to many people: a house pricing dataset and time-series stock prices. The house pricing dataset consists of various house features along with the sale price of each home. The time-series stock price datasets will be harvested from Yahoo Finance.

We will first use the house pricing dataset to learn the basics of machine learning. To load it, we’ll use pandas, the de facto standard data analysis library in Python. Pandas provides a versatile read_csv function that can read almost any kind of delimited plain-text tabular data.

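# read_csv is a workhorse function that can read almost any kind of
# delimited plain-text format. The returned object is a pandas DataFrame.
# index_col=0 treats the first column as the row index; reset_index(drop=True)
# then discards that index in favor of a default 0-based one.
all_data = pd.read_csv('data/house_prices.csv', sep=',', index_col=0).reset_index(drop=True)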
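To see what the index_col and reset_index arguments in the call above do, here is a small sketch that reads an in-memory CSV instead of a file. The CSV text, including the 'Id' column, is a hypothetical stand-in; the contents of the real file are not shown here.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk (values are illustrative only)
csv_text = """Id,LotArea,Street,SalePrice
1,8450,Pave,208500
2,9600,Pave,181500
3,11250,Pave,223500
"""

# index_col=0 uses the first column ('Id' here) as the row index
with_id_index = pd.read_csv(io.StringIO(csv_text), sep=',', index_col=0)
print(with_id_index.index.tolist())   # [1, 2, 3]

# reset_index(drop=True) discards that index and renumbers the rows from 0
renumbered = with_id_index.reset_index(drop=True)
print(renumbered.index.tolist())      # [0, 1, 2]
print(renumbered.columns.tolist())    # ['LotArea', 'Street', 'SalePrice']
```

The net effect of combining the two is that the first column is dropped from the table entirely, which is convenient when it is only a row identifier with no predictive value.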

Representing the data

Once the data have been read in, we can take a look at the first few rows:

# head() displays the first few rows of the dataset.
all_data.head()

	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	LotConfig	...	PoolQC	Fence	MiscFeature	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	60	RL	65.0	8450	Pave	NaN	Reg	Lvl	AllPub	Inside	...	NaN	NaN	NaN	2	2008	WD	Normal	208500
1	20	RL	80.0	9600	Pave	NaN	Reg	Lvl	AllPub	FR2	...	NaN	NaN	NaN	5	2007	WD	Normal	181500
2	60	RL	68.0	11250	Pave	NaN	IR1	Lvl	AllPub	Inside	...	NaN	NaN	NaN	9	2008	WD	Normal	223500
3	70	RL	60.0	9550	Pave	NaN	IR1	Lvl	AllPub	Corner	...	NaN	NaN	NaN	2	2006	WD	Abnorml	140000
4	60	RL	84.0	14260	Pave	NaN	IR1	Lvl	AllPub	FR2	...	NaN	NaN	NaN	12	2008	WD	Normal	250000

5 rows × 80 columns

As we can see, the data are tabular. Every row represents a different house, and every column represents a different variable. In machine learning terminology, we typically refer to the rows and columns as samples and features, respectively. We can thus think of our data as a two-dimensional n (samples) x p (features) matrix; the original dataset consists of 1460 samples and 80 columns. The vast majority of algorithms implemented in the scikit-learn and keras packages expect to receive numerical matrices of this kind as their primary inputs.

Note, however, that some of the columns in our dataset (e.g., “MSZoning” and “LotShape”) contain strings or categorical values, so we need to pre-process them before we can make proper use of them. One option is to recode these columns into numerical form by defining numeric codes for their different levels. The other option is to simply remove them. Since we have 80 columns, we will just remove them for now.
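The two options for handling non-numeric columns can be sketched on a tiny table. The DataFrame below is a hypothetical miniature of the housing data, and one-hot encoding via pd.get_dummies is just one common way of recoding categories; the text above does not commit to a particular encoding.

```python
import pandas as pd

# Hypothetical miniature of the housing table (values are illustrative only)
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250],
    'MSZoning': ['RL', 'RL', 'RM'],
    'LotShape': ['Reg', 'Reg', 'IR1'],
    'SalePrice': [208500, 181500, 223500],
})

# Option 1: recode each categorical column as indicator (one-hot) columns
encoded = pd.get_dummies(toy, columns=['MSZoning', 'LotShape'])
print(encoded.columns.tolist())

# Option 2: keep only the numeric columns and drop everything else
numeric_only = toy.select_dtypes(include='number')
print(numeric_only.columns.tolist())
```

Dropping the columns is the simpler route and keeps the matrix small, at the cost of discarding whatever information the categorical variables carry.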

Greydon Gilmore

Intraoperative Neurophysiologist Biomedical Engineer
