Introduction to Python for Statistics
Introduction
Python is a powerful programming language that is very popular for data analysis, as well as many other real-world applications. This portfolio will go over some intermediate techniques in Python 3.7.3, before implementing basic statistical models, including linear models for regression and classification.
Python Overview
Functions and Modules
Similar to R, functions can be defined and called in the same script. The function definition syntax is as follows
def function_name(input1, input2):
    variable = function_operation
    second_variable = second_function_operation
    return output
There are a couple of things to note here. Firstly, indentation is essential to defining the function. Unlike R or MATLAB, there is nothing to write that marks the beginning or end of the function; this is all handled by where the indentation begins and ends. Secondly, the function name comes after the def keyword, instead of the other way around (as in R). Finally, the colon (:) is necessary, as it is in loops and if statements.
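As a small concrete illustration of this syntax (a hypothetical example, not one used later in this portfolio), the following function rescales a list of numbers and returns both the rescaled list and its sum:
def rescale(values, factor):
    # Multiply every element of the input list by a common factor.
    scaled = [v * factor for v in values]
    total = sum(scaled)
    return scaled, total

scaled, total = rescale([1, 2, 3], 10)   # scaled is [10, 20, 30], total is 60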
Functions can be written in one script and called in another; this is achieved through modules. A module is a collection of code that you can load from another file, similar to how source and library work in R. You load modules with any of these lines:
import module_name
import module_name as short_name
from module_name import function_name
from module_name import *
All of these commands load the module (or parts of it), but do so slightly differently. In Python, a loaded module's functions have to be accessed through the module's name. For example, to use a function from NumPy, you would have to write numpy.function_name. This name can be shortened if you used the second line above to load the module; commonly, NumPy is abbreviated to np, so that you only have to type np.function_name to access the function.
You can also load specific functions from a module by using the third and fourth lines above, where the asterisk * indicates that all functions should be loaded from the module. Note that if the module is loaded this way, you do not need to prefix each function with the module name. However, it is considered bad practice to load functions with *, as keeping the module name in front of a function removes the chance of name clashes between functions from different modules. A short illustration of these import styles is given below.
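As a minimal sketch of the different import styles (using NumPy's mean function purely as an example):
import numpy                 # access functions as numpy.mean(...)
import numpy as np           # access functions as np.mean(...)
from numpy import mean       # access mean(...) directly, no prefix needed
# from numpy import *        # loads everything; risks clashing names

values = [1.0, 2.0, 3.0]
print(numpy.mean(values), np.mean(values), mean(values))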
Example: Morse Code
As an example of basic operations in Python, consider writing two functions to encode and decode morse code respectively. The encoding function will take a string, or a piece of text, and convert it into morse code. The decoding function does the opposite: it converts morse code back into plain English text. To write these functions, we first must define a dictionary:
letter_to_morse = {'a':'.-', 'b':'-...', 'c':'-.-.', 'd':'-..', 'e':'.',
'f':'..-.', 'g':'--.', 'h':'....', 'i':'..', 'j':'.---', 'k':'-.-',
'l':'.-..', 'm':'--', 'n':'-.', 'o':'---', 'p':'.--.', 'q':'--.-',
'r':'.-.', 's':'...', 't':'-', 'u':'..-', 'v':'...-', 'w':'.--',
'x':'-..-', 'y':'-.--', 'z':'--..', '0':'-----', '1':'.----',
'2':'..---', '3':'...--', '4':'....-','5':'.....', '6':'-....',
'7':'--...', '8':'---..', '9':'----.', ' ':'/'}
So this dictionary can be accessed by indexing with the letter that we want a translation for. For example
letter_to_morse["6"]
Likewise, we need to define the reverse: a dictionary that takes a morse code symbol and outputs an English letter. We can build this reversed dictionary with a loop.
morse_to_letter = {}
for letter in letter_to_morse:
    morse = letter_to_morse[letter]
    morse_to_letter[morse] = letter
morse_to_letter[".-"]
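As an aside, the same reversal can be written more compactly with a dictionary comprehension; this is an equivalent alternative to the loop above rather than something the later code relies on:
# Reverse the dictionary in one line: keys become values and vice versa
morse_to_letter = {morse: letter for letter, morse in letter_to_morse.items()}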
Now we can define the two functions encode and decode that use these two dictionaries. For the encoding function:
def encode(x):
    morse = []
    for letter in x:
        letter = letter.lower()
        morse.append(letter_to_morse[letter])
    morse_message = " ".join(morse)
    return morse_message
This function initialises a list morse, loops over all letters in the input x, and appends the morse code translation of each letter to the morse list. After the loop, the elements of morse are joined together with spaces. For the decoding function:
def decode(x):
    english = []
    morse_letters = x.split(" ")
    for letter in morse_letters:
        english.append(morse_to_letter[letter])
    english_message = "".join(english)
    return english_message
This follows a similar process to encode. First the input x is split on spaces, and then each morse symbol is looped over and converted to English using the morse_to_letter dictionary. Before we run these functions, we can save them in a separate file called morse.py and put it in our working directory. Then, using another script (which will be called run_morse.py), we can import the morse.py functions as a module.
So we can import morse and run the decode function.
import morse
morse_message = ".... . .-.. .--. / .. -- / - .-. .- .--. .--. . -.. / .. -. / ... --- -- . / -- --- .-. ... . / -.-. --- -.. ."
morse.decode(morse_message)
Likewise, we can encode a secret message using morse.encode:
morse.encode("please dont turn me into morse code")
One thing to note here is that the dictionaries morse_to_letter and letter_to_morse were defined outside of the functions. Since this module was imported, we can actually access them directly:
morse.letter_to_morse
Although oftentimes we would want such variables to remain hidden, this is not possible when we import all the contents of a script. Importing Python files as modules is usually done so that functions can be shared across the working directory, as opposed to having one big file that contains all the code.
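As an aside, Python has no truly private variables, but a common convention (not something the morse.py module above uses) is to prefix internal names with an underscore and to list the public names in __all__, which controls what from module import * exposes. A minimal sketch, assuming a hypothetical variant of morse.py:
# morse.py (hypothetical variant)
__all__ = ["encode"]   # only this name is exported by "from morse import *"

_letter_to_morse = {"a": ".-", "b": "-..."}   # leading underscore signals an internal variable

def encode(x):
    return " ".join(_letter_to_morse[letter] for letter in x.lower())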
Classes
Object oriented programming in Python is done through classes, the general form of which is
class class_name:
    def __init__(self):
        self.variable_I_want_to_define = variable
        self.another_variable_I_want_to_define = variable_2

    def some_class_function(self, x):
        something_I_can_do_with_my_variables = x
        return some_output

    def some_other_function(self):
        self.an_internal_variable = 3
        return another_output
The functions can be run as members of the class, and the __init__(self) section of the class is a function that is run when the class is initialised. Any variables that are set within the __init__ section of the class become defined on the object, sometimes depending on the inputs to the __init__ function. These could be variables such as initial conditions or initial estimates. We could add more inputs to the __init__ function after self, if needed.
The other functions can be called, and these could change some of the internal self. variables, e.g. updating the initial conditions, running a certain loop or adding additional variables. A minimal sketch of defining and using a small class is given below.
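For instance, a small hypothetical class (not used elsewhere in this portfolio) that stores a starting value and updates it through a method could look like this:
class running_total:
    def __init__(self, start=0):
        # Variables defined here become attributes of the object.
        self.total = start
        self.history = []

    def add(self, x):
        # Methods can read and update the internal self. variables.
        self.total = self.total + x
        self.history.append(x)
        return self.total

counter = running_total(10)
counter.add(5)      # returns 15
counter.total       # 15
counter.history     # [5]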
Example: Morse Code Again
We can update the morse code example to write a class called morse_code_translator that has an encode and a decode function as part of the class. This is written as
class morse_code_translator:
    def __init__(self):
        self.letter_to_morse = {'a':'.-', 'b':'-...', 'c':'-.-.', 'd':'-..',
            'e':'.', 'f':'..-.', 'g':'--.', 'h':'....', 'i':'..', 'j':'.---',
            'k':'-.-', 'l':'.-..', 'm':'--', 'n':'-.', 'o':'---', 'p':'.--.',
            'q':'--.-', 'r':'.-.', 's':'...', 't':'-', 'u':'..-', 'v':'...-',
            'w':'.--', 'x':'-..-', 'y':'-.--', 'z':'--..', '0':'-----',
            '1':'.----', '2':'..---', '3':'...--', '4':'....-', '5':'.....',
            '6':'-....', '7':'--...', '8':'---..', '9':'----.', ' ':'/'}
        self.morse_to_letter = {}
        for letter in self.letter_to_morse:
            morse = self.letter_to_morse[letter]
            self.morse_to_letter[morse] = letter
        self.english_message_history = []
        self.morse_message_history = []

    def encode(self, x):
        morse = []
        for letter in x:
            letter = letter.lower()
            morse.append(self.letter_to_morse[letter])
        morse_message = " ".join(morse)
        self.english_message_history.append(x)
        self.morse_message_history.append(morse_message)
        return morse_message

    def decode(self, x):
        english = []
        morse_letters = x.split(" ")
        for letter in morse_letters:
            english.append(self.morse_to_letter[letter])
        english_message = "".join(english)
        self.english_message_history.append(english_message)
        self.morse_message_history.append(x)
        return english_message
So this has simply moved the decode and encode functions inside a class. The dictionaries are now written as self.letter_to_morse and self.morse_to_letter, with the self. prefix. Other than that, I have also added two more variables, english_message_history and morse_message_history, to exemplify how variables can be updated by calling the functions. Every time encode or decode is run, these lists keep track of all messages that have passed through either function. We can test whether this works by importing the class from morse_class.py.
from morse_class import morse_code_translator
translator = morse_code_translator()
translator.encode("please not again")
translator.decode(".--. .-.. . .- ... . / -.. --- -. - / - ..- .-. -. / -- . / .. -. - --- / . -. --. .-.. .. ... ....")
So this has worked as expected, and the translator object can now both decode and encode morse code messages. Now that these two messages have passed through the class, we can check whether the history has been recorded.
translator.english_message_history
translator.morse_message_history
So the internal self variables have been updated with the two messages that were passed through the class.
Statistical Models
Python has a lot of support for fitting statistical models and machine learning, mostly through the sklearn package. Other modules must also be imported to provide different functionality: statistical models are handled by sklearn, data frames are handled by pandas and numpy, randomness is handled by numpy and random, and plots are handled by matplotlib.
import sklearn as sk
import sklearn.linear_model as lm
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
Linear Regression
A basic linear regression model in Python is provided as part of the sklearn.linear_model module. The use of a linear model in Python can be explained through an example. Some simulated data can be set up as follows
$$
\boldsymbol{x} \sim \text{Unif}(0,1), \qquad \boldsymbol{y} \sim N(\boldsymbol{\mu}, 0.5), \qquad \boldsymbol{\mu} = \beta_0 + \beta_1 \boldsymbol{x},
$$
where both $\boldsymbol{x}$ and $\boldsymbol{y}$ are of length $n$. Since this is simulated data, we pre-define $\beta_0 = 5$ and $\beta_1 = -2$ as the intercept and the gradient. We can use numpy to sample this data, and the linear_model module in sklearn to fit a linear model to it.
n = 200
beta0 = 5
beta1 = -2
x = np.random.uniform(0, 1, n)
mu = beta0 + beta1*x
y = np.random.normal(mu, .5, n)
This data can be converted into a data frame with the DataFrame function in pandas.
data = pd.DataFrame({"x":x, "y":y})
To set up the linear model, a model object must be created first, and then the fitting routine can be called on it. This can be done in one line by chaining the operations.
model = lm.LinearRegression(fit_intercept=True).fit(data[["x"]], data["y"])
Here, the model has been fit by first creating a linear model object with lm.LinearRegression (and specifying that we do want an intercept), and then calling the .fit() method on that object, passing it the covariate x and the response variable y, to obtain the parameter estimates of $\beta_0$ and $\beta_1$. Now we can see whether these estimates bear resemblance to the true parameter values used to simulate the data.
print("Intercept:", model.intercept_, ",", "Gradient:", model.coef_)
Intercept: 4.89676478340633 , Gradient: [-1.90845212]
This is reasonably close to the true values. Note that here there is only one covariate vector, but this method is applicable to multiple covariates, which would give a gradient estimate for each covariate. Now we can plot the data and the fit: a pandas data frame has an inbuilt method for plotting, which produces a scatter plot using matplotlib.
data.plot.scatter("x", "y", figsize=(14, 11));
plt.plot(x, model.predict(data[["x"]]), 'r');
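As an extra check on the fit (an addition here for illustration rather than part of the original analysis), the .score() method of the fitted model returns the $R^2$ of the fit, and .predict() returns fitted values for any covariate values we supply:
print("R squared:", model.score(data[["x"]], data["y"]))
new_x = pd.DataFrame({"x": [0.0, 0.5, 1.0]})
print("Predictions:", model.predict(new_x))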
Multiple Linear Regression
In the previous section, we fit a linear model with a single covariate vector. This is good for exemplary purposes, as a two-dimensional linear model is easy to visualise. Often in data analysis we have multiple explanatory variables $\boldsymbol{x}_1, \dots, \boldsymbol{x}_p$ that give information about the response variable $\boldsymbol{y}$.
For an example of multiple linear regression, we will be using the Boston house prices dataset, which is one of the freely provided datasets in sklearn. This dataset describes the median value of owner-occupied homes (in $1000s), as well as 13 predictors that could provide information about the house value. We load this dataset and convert it to a pandas data frame.
from sklearn.datasets import load_boston
data = load_boston()
boston = pd.DataFrame(data.data, columns = data.feature_names)
boston["MEDV"] = data.target
boston.head()
|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
The variable we are interested in predicting is MEDV, and all other variables will be used to give information for predicting MEDV.
Before fitting any models, we can perform some exploratory data analysis to help decide which predictor variables should be included in the model fitting. Luckily, there are easy methods available for inspecting how strongly the variables in a pandas data frame are correlated with one another. Firstly, we will use the inbuilt corr function and plot the result as a heatmap, using the seaborn module.
import seaborn
corr = boston.corr()
cmap = seaborn.diverging_palette(-500, -720, as_cmap=True)
plt.figure(figsize=(14, 11))
seaborn.heatmap(corr, cmap=cmap);
There are two things to infer from this plot: how correlated the predictor variables are with one another, and how correlated the predictor variables are with the response. We are mostly interested in how correlated the predictors are with MEDV, as predictors that strongly influence the value of MEDV are likely to be important to include in a model. Another plot we can use to explore the data is a multi-variable scatter plot, that is, a scatter plot for each pair of variables. pandas has a function for this, as part of the plotting submodule.
from pandas.plotting import scatter_matrix
scatter_matrix(boston, figsize=(16, 16));
There is a lot to process in this plot, but you can restrict your attention to the final column on the right. This column shows scatter plots of every predictor variable against the response MEDV. We are looking for obvious relationships between the two variables: CRIM seems to have an effect at lower values, ZN seems to have a small effect, INDUS has a non-linear relationship, RM has a clear linear relationship, AGE seems to have a significant effect at lower values, B has an effect at larger values, and LSTAT seems to have a quadratic relationship. These variables will be important to consider, but that does not mean all others are not. Since some of these predictors are categorical (even binary), it can be hard to infer their relationship with MEDV from a scatter plot.
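As a quick numerical complement to these plots (an extra check added here, not part of the original analysis), the correlation of each predictor with MEDV can also be ranked directly from the correlation matrix computed above:
# Correlation of every predictor with the response, sorted from most negative to most positive
corr["MEDV"].drop("MEDV").sort_values()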
We can now fit the model using the information above. The predictors we want to include are CRIM, ZN, INDUS, RM, AGE, B and LSTAT. We use the linear model from sklearn again, but first create a new data frame containing the reduced set of predictors. Additionally, we can include a quadratic effect for certain predictors by creating new columns in the data frame containing the squared values.
boston_x = boston[["CRIM", "ZN", "INDUS", "RM", "AGE", "B", "LSTAT"]].copy()  # .copy() avoids modifying a view of the original data frame
boston_x.insert(len(boston_x.columns), "LSTAT2", [i**2 for i in boston_x["LSTAT"] ])
boston_x.insert(len(boston_x.columns), "AGE2", [i**2 for i in boston_x["AGE"] ])
boston_x.insert(len(boston_x.columns), "B2", [i**2 for i in boston_x["B"] ])
Now we use the linear model with this modified data frame to fit the full model.
mv_model = lm.LinearRegression(fit_intercept=True).fit(boston_x, data.target)
print("Intercept: \n", mv_model.intercept_, "\n")
print("Coefficients:")
print(" \n".join([str(boston_x.columns[i]) + " " + str(mv_model.coef_[i]) for i in np.arange(len(mv_model.coef_))]))
Intercept:
5.916804277809813
Coefficients:
CRIM -0.11229848683846888
ZN 0.0037476008537828498
INDUS -0.0021805395766587208
RM 4.0621316253156285
AGE 0.07875634757295602
B 0.042639346099976716
LSTAT -2.063750521016358
LSTAT2 0.042240400122337596
AGE2 -0.0002509463690499019
B2 -7.788232883181183e-05
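To get a rough sense of how well this model fits (a check added here for illustration, using the same .score() method as in the simple regression example), we can compute the $R^2$ on the data and compare a few fitted values to the observed ones:
print("R squared:", mv_model.score(boston_x, data.target))
fitted = mv_model.predict(boston_x)
print("First five fitted values:  ", fitted[:5])
print("First five observed values:", data.target[:5])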
Penalised Regression
The linear_model submodule of sklearn doesn't just contain the LinearRegression object; it contains many other variations of linear models. Some examples of these are penalised regression models, such as the Lasso model or the ridge regression model. You can view the documentation for these online, and they work similarly to LinearRegression (a short sketch of fitting them directly is given at the end of this section). Below I will briefly go through an example of using a Lasso path algorithm on the dataset provided above. Firstly, the coefficient path can be computed directly with the linear_model.lars_path function.
_, _, coefs = lm.lars_path(data.data, data.target, method='lasso')  # method='lasso' computes the Lasso path rather than the plain LAR path
This is using the full dataset provided in data.data, as the Lasso path plot can be used for feature selection, similar to the scatter matrix approach above.
xx = np.sum(np.abs(coefs.T), axis=1)
xx /= xx[-1]
plt.figure(figsize=(16, 9))
plt.plot(xx, coefs.T)
ymin, ymax = plt.ylim()
plt.vlines(xx, ymin, ymax, linestyle='dashed')
plt.xlabel('|coef| / max|coef|')
plt.ylabel('Coefficients')
plt.title('LASSO Path')
plt.axis('tight')
plt.legend(data.feature_names)
plt.show()
The coefficients of covariates that provide a lot of information about the response variable are likely to require a very large value of the regularisation parameter to shrink them to zero. These are the lines that are more separated in the plot above, i.e. CRIM, RM, NOX and CHAS. Interestingly, these are not all the same variables that were selected in the scatter matrix approach earlier in this report, highlighting how variable selection can depend on the method used.
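For completeness, here is a brief sketch of fitting penalised models directly with the reduced predictor set from earlier. This is an illustration added here rather than part of the original analysis, and the regularisation strength alpha is an arbitrary choice that would normally be tuned (for example by cross-validation with lm.LassoCV or lm.RidgeCV):
# Fit a Lasso and a ridge regression model; larger alpha means stronger shrinkage
lasso_model = lm.Lasso(alpha=0.1, max_iter=10000).fit(boston_x, data.target)
ridge_model = lm.Ridge(alpha=1.0).fit(boston_x, data.target)
print("Lasso coefficients:", lasso_model.coef_)
print("Ridge coefficients:", ridge_model.coef_)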
Classification
The final statistical model introduced will be one for classification. Classification is concerned with predicting a (usually non-numeric) class, i.e. assigning the output to a pre-defined class based on some inputs. The classification model explained here will be logistic regression (despite the name, it is a classification method rather than a regression method).
Logistic regression is so called because it uses the logistic function to map fitted values to probabilities. The basic form of a classification model is the same as that of a regression method, except that the output from the model corresponds to latent, unobserved fitted values which decide the class of each set of inputs. The logistic function that performs this mapping is shown below.
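For a single latent value $z = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$, the logistic function maps $z$ to a probability in $(0, 1)$; in the multi-class setting used below, a softmax generalisation of this mapping is used instead.
$$
\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^{z}}{1 + e^{z}}.
$$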
We will start by simulating some data $\boldsymbol{y}$ that need to be classified, which will be generated by a mathematical combination of some inputs $\boldsymbol{x}$. Each element of $\boldsymbol{y}$ will be given one of the class labels $\mathcal{C}_1$, $\mathcal{C}_2$ or $\mathcal{C}_3$. The goal of the statistical model is parameter estimation, similar to the regression methods.
# Simulate x
n = 1000
x1 = np.random.uniform(0, 1, n)
x2 = np.random.uniform(0, 1, n)
# Create parameters and mean
beta0 = 3
beta1 = 1.5
beta2 = -4.8
beta3 = 0.5
mu = beta0 + beta1*x1 + beta2*x2 + beta3*x1**2
# Set class based on value of mean
y = np.repeat("Class 1", n)
y[mu > 0] = "Class 2"
y[mu > 2] = "Class 3"
# Combine into DataFrame
clsf_dat = pd.DataFrame({"y":y,"x1":x1,"x2":x2 })
colors = y.copy()  # copy the labels so that y itself is not overwritten with colour names
colors[y == "Class 1"] = "red"
colors[y == "Class 2"] = "blue"
colors[y == "Class 3"] = "green"
clsf_dat.plot.scatter("x1", "x2", c=colors, figsize=(14, 11));
Now that the data is set up, we can fit the logistic regression model to it. LogisticRegression
is a function as part of the linear_model
submodule of sklearn
, just as before, so this next step should be familiar to you.
lr_model = lm.LogisticRegression(fit_intercept=True, solver='newton-cg', multi_class = 'auto')
lr_model = lr_model.fit(clsf_dat[["x1", "x2"]], clsf_dat["y"])
print("Intercept: \n", lr_model.intercept_, "\n")
print("Coefficients:")
print(" \n".join([str(clsf_dat.columns[i]) + " " + str(lr_model.coef_[i]) for i in np.arange(len(lr_model.coef_))]))
Intercept:
[-4.45432353 0.92640783 3.52791569]
Coefficients:
Class 1 [-4.06694321 9.66044839]
Class 2 [0.18794764 0.76440687]
Class 3 [ 3.87899558 -10.42485527]
Note that there is a set of coefficients for each class, rather than a single coefficient per input! This is because, even though the data were simulated using one parameter per input, the logistic regression model fits a separate intercept and a separate coefficient for each input for every class. Each row of coefficients printed above corresponds to one class, and for a given set of inputs the class whose combination of intercept and coefficients gives the largest fitted value is the class that is predicted.
We can roughly evaluate the model’s performance by inspecting a plot of its predictions.
predictions = lr_model.predict(clsf_dat[["x1", "x2"]])
colors = predictions.copy()  # copy so that the predicted labels are not overwritten with colour names
colors[predictions == "Class 1"] = "red"
colors[predictions == "Class 2"] = "blue"
colors[predictions == "Class 3"] = "green"
clsf_dat.plot.scatter("x1", "x2", c=colors, figsize=(14, 11));
Here we can see an extremely similar plot to the one above, although it is not quite identical! Near the overlap between classes, some points are assigned to the wrong class, due to uncertainty around the decision boundaries.
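For a more quantitative check (added here for illustration), the proportion of correctly classified points can be computed directly from the predictions, or equivalently through the model's .score() method:
accuracy = np.mean(predictions == clsf_dat["y"])
print("Training accuracy:", accuracy)
print("Accuracy via score():", lr_model.score(clsf_dat[["x1", "x2"]], clsf_dat["y"]))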
More complex models exist for classification and regression, but that’s for another time.