In this tutorial we are going to discuss linear regression, then actually apply regression in our analysis through Python code via Quantopian. The goal is not to rehash everything the folks at Quantopian have already covered, but to give you a more detailed explanation of each step, aimed at folks who are new to Python. So let's begin!
Link to the Quantopian notebook that this tutorial is based on.
Let's look at some code:
# Import libraries
import numpy as np
from statsmodels import regression
import statsmodels.api as sm
import matplotlib.pyplot as plt
import math
Ok, so here we see the word "import" a lot; what is going on? It means we are bringing something into our coding session: each import statement loads a library of functions and commands into our Python environment. You can think of Python as a car frame that the axles, engine, seats, and body connect to; the libraries we import are those parts. Think of numpy as the engine: it is a library of functions and commands used to crunch numbers. We want that, so we tell Python to import it into our session, and we call it np rather than numpy because it's shorter and easier to type. The same goes for statsmodels and matplotlib. Math is short enough that we don't really need to abbreviate it, so we just leave it as math. The from statsmodels import regression line tells Python to import only the regression module from the statsmodels library. statsmodels.api is the main interface to statsmodels; it is a module that provides classes and functions for estimating many different statistical models. The same style of explanation applies to most of the imports you will see.
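To make the aliasing concrete, here is a tiny made-up example (not part of the Quantopian notebook): once numpy is imported as np, every numpy function is reached through that shorter name.

```python
import numpy as np  # "np" is just a nickname for the numpy library

print(np.sqrt(16.0))        # 4.0
print(np.mean([1, 2, 3]))   # 2.0
```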
Ok, we now know what each function does and why we have them (need them for calculations!), now we need to define them and plot the results.
def linreg(X, Y):
    # Running the linear regression
    X = sm.add_constant(X)
    model = regression.linear_model.OLS(Y, X).fit()
    a = model.params[0]  # intercept (alpha)
    b = model.params[1]  # slope (beta)
    X = X[:, 1]
Looking at this one line at a time: X = sm.add_constant(X) adds a column of ones to our X data, which lets the regression fit an intercept (the constant term of the line). The sm part comes from the previously imported statsmodels.api, which we aliased as sm. Remember? Next we have model = regression.linear_model.OLS(Y, X).fit(), which fits a simple ordinary least squares model; OLS stands for 'ordinary least squares', it takes the dependent variable Y and the regressors X, and .fit() estimates the parameters, which we store as a (alpha, the intercept) and b (beta, the slope) in the code above. Finally, X = X[:, 1] slices the array: it takes all the rows (:) but keeps only the second column (index 1), dropping the constant column we just added.
    # Return summary of the regression and plot results
    X2 = np.linspace(X.min(), X.max(), 100)
    Y_hat = X2 * b + a
    plt.scatter(X, Y, alpha=0.3)        # Plot the raw data
    plt.plot(X2, Y_hat, 'r', alpha=0.9)  # Add the regression line, colored in red
    plt.xlabel('X Value')
    plt.ylabel('Y Value')
    return model.summary()
X2 = np.linspace(X.min(), X.max(), 100) returns evenly spaced numbers over a specified interval: X.min() is the starting point, X.max() is the ending point, and 100 is how many points to generate between them. Y_hat (the conventional name for the predicted values of Y) is then computed as X2 * b + a, i.e. our fitted line evaluated at each of those 100 points.
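To see exactly what np.linspace does, here is a small example with an easy interval (my own numbers, just for illustration):

```python
import numpy as np

# np.linspace(start, stop, num) returns num evenly spaced points
# from start to stop, with both endpoints included
x = np.linspace(0.0, 1.0, 5)
print(x)  # [0.   0.25 0.5  0.75 1.  ]
```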
start = '2014-01-01'
end = '2015-01-01'
asset = get_pricing('TSLA', fields='price', start_date=start, end_date=end)
benchmark = get_pricing('SPY', fields='price', start_date=start, end_date=end)
# We have to take the percent changes to get to returns
# Get rid of the first (0th) element because it is NAN
r_a = asset.pct_change()[1:]
r_b = benchmark.pct_change()[1:]
linreg(r_b.values, r_a.values)
Here we have our start and end dates, which define the time window for the stock data we want to obtain. The asset line calls the get_pricing function, whose parameters include the ticker of the stock, in this case 'TSLA' (Tesla's stock), followed by fields='price', start_date=start, and end_date=end. We then set this against a benchmark, as any security should be; in this case we use 'SPY' with the same inputs.
r_a = asset.pct_change()[1:] and r_b = benchmark.pct_change()[1:] compute the percentage change from one price to the next. Because the prices come at a daily frequency by default, these are the daily returns of the asset and the benchmark, and it is these daily-return time series we want to regress.
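Here is a toy illustration of pct_change on a made-up price series (get_pricing only exists inside Quantopian, but pct_change is standard pandas). Note the first element is NaN because there is no prior day to compare against, which is why the tutorial code drops it with [1:]:

```python
import pandas as pd

# Toy price series: 100 -> 150 (+50%) -> 75 (-50%)
prices = pd.Series([100.0, 150.0, 75.0])
returns = prices.pct_change()  # first element is NaN (no prior price)
returns = returns[1:]          # drop the NaN, as in the tutorial code
print(returns.tolist())        # [0.5, -0.5]
```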
The linreg(r_b.values, r_a.values) is our command to begin our linear regression of our asset and benchmark daily returns.
Now, from our code you will see a table spit out. Most of the statistics you needn't worry about, but you should pay attention to the F-statistic, which tells us how predictive our model actually is. Strictly speaking, the number to check is the p-value of the F-test, shown as Prob (F-statistic) in the summary: you want this value to be less than 0.05. If it is much higher than 0.05, the model is pretty much useless.
The rest of the code and its description is available through the link above. If you have any questions or concerns, let me know!
The last part to making our graph is the following code:
# Generate ys correlated with xs by adding normally-distributed errors
X = np.random.rand(100)  # the xs must be defined first (random values between 0 and 1)
Y = X + 0.2 * np.random.randn(100)
linreg(X, Y)
This "makes Y dependent on X plus some random noise".
And to help with interpreting your data, you can use the following code below to help identify your 95% confidence interval for the regression line:
import seaborn
start = '2014-01-01'
end = '2015-01-01'
asset = get_pricing('TSLA', fields='price', start_date=start, end_date=end)
benchmark = get_pricing('SPY', fields='price', start_date=start, end_date=end)
# We have to take the percent changes to get to returns
# Get rid of the first (0th) element because it is NAN
r_a = asset.pct_change()[1:]
r_b = benchmark.pct_change()[1:]
seaborn.regplot(r_b.values, r_a.values);
I like their clear explanation here:
The regression model relies on several assumptions:
- The independent variable is not random.
- The variance of the error term is constant across observations. This is important for evaluating the goodness of the fit.
- The errors are not autocorrelated. The Durbin-Watson statistic detects this; if it is close to 2, there is no autocorrelation.
- The errors are normally distributed. If this does not hold, we cannot use some of the statistics, such as the F-test.