Friday, March 11, 2016

Quantopian Tutorials: Linear Regression

This, like all of the tutorials I am creating, comes from the Quantopian website. I am simply rehashing them, trying to simplify them and give more detailed explanations of the "why" and the "when".

Here we are going to discuss linear regression, followed by its actual application in our analysis through Python code on Quantopian. The goal of this tutorial is not to rehash everything the folks at Quantopian have already covered, but to give you a better and more detailed explanation of each step. This tutorial is aimed at folks who are new to Python. So let's begin!

Link to the Quantopian notebook which is what this tutorial is based upon.

Let's look at some code:

# Import libraries
import numpy as np
from statsmodels import regression
import statsmodels.api as sm
import matplotlib.pyplot as plt
import math

Ok, so here we see the word "import" a lot. What is going on? It means we are bringing something into our coding session; in this instance, import is used to load a set of commands and libraries into our environment. You can think of Python as a car frame that the axles, engine, seats, and body connect to; the engine, seats, axles, and body are represented by what we are importing. Think of numpy as the engine: it is a library of functions and commands used to crunch numbers. We want that, so we tell Python to import it into our session, and we call it np rather than numpy because it's shorter and easier to type. The same goes for statsmodels and matplotlib. Math is short enough that we don't really need to abbreviate it, so we just leave it as math. The from statsmodels import regression line tells Python to import the regression module from the statsmodels library. statsmodels.api is the main interface to statsmodels; it is a module that provides classes and functions for the estimation of many different statistical models. The same style of explanation applies to most everything we import.
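To make the aliasing point concrete, here is a tiny illustration of my own (not from the notebook): once a library has been imported under a shorter name, you reach all of its functions through that name.

import numpy as np   # alias numpy as np

np.sqrt(2.0)         # exactly the same function as numpy.sqrt(2.0)
np.mean([1, 2, 3])   # another numpy function reached through the alias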

Ok, we now know what each library does and why we import them (we need them for calculations!). Now we need to define our regression function and plot the results.

def linreg(X,Y):
    # Running the linear regression
    X = sm.add_constant(X)
    model = regression.linear_model.OLS(Y, X).fit()
    a = model.params[0]
    b = model.params[1]
    X = X[:, 1]

Looking at this one line at a time. X = sm.add_constant(X) adds a column of ones to our X data; the sm part comes from statsmodels.api, which we imported earlier and named sm. Remember? That column of ones is what lets the regression estimate an intercept rather than forcing the line through the origin. Next we have model = regression.linear_model.OLS(Y, X).fit(), which fits a simple ordinary least squares model. OLS stands for 'ordinary least squares'; it takes the dependent variable Y and the independent variable X (with its constant column), and .fit() estimates the parameters. Those parameters are pulled out just below: a = model.params[0] is the intercept (alpha) and b = model.params[1] is the slope (beta). Finally, X = X[:, 1] slices the array, taking all the rows (:) but keeping only the second column (1), which is our original X data without the constant column.
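Here is a quick illustration of my own (not from the Quantopian notebook) of what sm.add_constant actually does to an array of data:

import numpy as np
import statsmodels.api as sm

X = np.array([1.0, 2.0, 3.0])
print(sm.add_constant(X))
# [[1. 1.]
#  [1. 2.]
#  [1. 3.]]  <- a column of ones added in front of our original data

With that column of ones in place, OLS can estimate both an intercept and a slope. Continuing with the rest of the linreg function: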

    # Return summary of the regression and plot results
    X2 = np.linspace(X.min(), X.max(), 100)
    Y_hat = X2 * b + a
    plt.scatter(X, Y, alpha=0.3) # Plot the raw data
    plt.plot(X2, Y_hat, 'r', alpha=0.9);  # Add the regression line, colored in red
    plt.xlabel('X Value')
    plt.ylabel('Y Value')
    return model.summary()

X2 = np.linspace(X.min(), X.max(), 100) returns evenly spaced numbers over a specified interval: X.min() is the starting point, X.max() is the ending point, and 100 is the number of points generated between them. Y_hat (denoted ŷ in statistics) represents the predicted values along the line of best fit, which here is Y_hat = X2 * b + a, i.e. the slope times X plus the intercept. plt.scatter plots the raw data points on a scatter plot. The rest of the code consists of customizations to the plot that you can modify as you like, described by the # comments.
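For example (my own illustration, not from the notebook), asking linspace for five points between 0 and 1 gives you the two endpoints plus three evenly spaced values in between:

import numpy as np

print(np.linspace(0.0, 1.0, 5))   # [0.   0.25 0.5  0.75 1.  ]

Now let's pull in some real data: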

start = '2014-01-01'
end = '2015-01-01'
asset = get_pricing('TSLA', fields='price', start_date=start, end_date=end)
benchmark = get_pricing('SPY', fields='price', start_date=start, end_date=end)

# We have to take the percent changes to get to returns
# Get rid of the first (0th) element because it is NAN
r_a = asset.pct_change()[1:]
r_b = benchmark.pct_change()[1:]

linreg(r_b.values, r_a.values)

Here we have our start and end dates, which define the time-line over which we want to obtain our stock data. The asset variable is filled by the get_pricing function, which takes the name of the stock, in this case 'TSLA' (Tesla), followed by fields='price', start_date=start, and end_date=end. We also set this against a benchmark, as any security should be, and in this case the benchmark is 'SPY', fetched with the same inputs.

r_a = asset.pct_change()[1:] and r_b = benchmark.pct_change()[1:] give us the percentage change from one day to the next, i.e. the daily returns, since get_pricing returns daily prices by default. The [1:] drops the first (0th) element, which is NaN because there is no prior day to compare against. These daily returns are the time-series we want to work with.
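Here is a small illustration of my own (not from the notebook) of why that first element gets dropped:

import pandas as pd

prices = pd.Series([100.0, 102.0, 101.0])
print(prices.pct_change())
# 0         NaN        <- no previous price to compare against
# 1    0.020000        <- (102 - 100) / 100
# 2   -0.009804        <- (101 - 102) / 102
print(prices.pct_change()[1:])    # the NaN in position 0 is dropped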

The linreg(r_b.values, r_a.values) call then runs our linear regression of the asset's daily returns against the benchmark's daily returns.

Now, from our code a table of statistics will be printed. You needn't worry about most of them, but you should pay attention to Prob (F-statistic), the p-value of the F-test, which tells us whether the model has real predictive power or whether the relationship could easily be noise. You want this value to be small, typically less than 0.05. If your p-value is much higher than 0.05, the model is pretty much useless.
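If you would rather grab these numbers in code instead of reading them off the printed summary table, here is a minimal sketch of my own (the data below is made up purely for illustration):

import numpy as np
import statsmodels.api as sm

X = np.random.rand(100)
Y = X + 0.2 * np.random.randn(100)

model = sm.OLS(Y, sm.add_constant(X)).fit()
print(model.f_pvalue)   # p-value of the F-test -- we want this below 0.05
print(model.rsquared)   # fraction of the variance in Y explained by X
print(model.params)     # [intercept (alpha), slope (beta)]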

The rest of the code and its descriptions are available through the link. If you have any questions or concerns, let me know!

The last part of making our graph is the following code. Note that X here is assumed to be an array of 100 values defined earlier; the random-number definition below is my placeholder to make the snippet runnable:
# X is assumed to be an array of 100 values defined earlier, e.g.:
X = np.random.rand(100)
# Generate Y correlated with X by adding normally-distributed errors
Y = X + 0.2*np.random.randn(100)
linreg(X, Y)

This "makes Y dependent on X plus some random noise".

And to help with interpreting your data, you can use the code below to identify the 95% confidence interval for the regression line:

import seaborn

start = '2014-01-01'
end = '2015-01-01'
asset = get_pricing('TSLA', fields='price', start_date=start, end_date=end)
benchmark = get_pricing('SPY', fields='price', start_date=start, end_date=end)

# We have to take the percent changes to get to returns
# Get rid of the first (0th) element because it is NAN
r_a = asset.pct_change()[1:]
r_b = benchmark.pct_change()[1:]

seaborn.regplot(r_b.values, r_a.values);
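One note of my own (not in the original): newer versions of seaborn require the data to be passed as keyword arguments, so on a recent install the call would look more like this:

seaborn.regplot(x=r_b.values, y=r_a.values);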

I like their clear explanation here:
The regression model relies on several assumptions:
  • The independent variable is not random.
  • The variance of the error term is constant across observations. This is important for evaluating the goodness of the fit.
  • The errors are not autocorrelated. The Durbin-Watson statistic detects this; if it is close to 2, there is no autocorrelation.
  • The errors are normally distributed. If this does not hold, we cannot use some of the statistics, such as the F-test.
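If you want to check a couple of these assumptions yourself, here is a hedged sketch of my own using statsmodels (it assumes model is a fitted OLS result like the one created inside linreg):

from statsmodels.stats.stattools import durbin_watson, jarque_bera

dw = durbin_watson(model.resid)    # close to 2 -> little autocorrelation in the errors
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(model.resid)
print(dw, jb_pvalue)               # a tiny jb_pvalue suggests the errors are not normally distributed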

Thursday, March 10, 2016

Quantopian Tutorial 6

So here we will talk about order management. By default on Quantopian there is no limit to how much money you can borrow and invest in your algorithm, but that isn't realistic. So here we will take a look at how to control the amount of money you invest.

def initialize(context):
    # Typing 'xtl' inside sid() in the Quantopian IDE resolves to the ID 40768
    context.stock = sid(40768)

def before_trading_start(context, data):
    pass

# Called every minute
def handle_data(context, data):
    open_orders = get_open_orders()
    if context.stock not in open_orders:
        order_target_percent(context.stock, 1.00)
    record(cash=context.portfolio.cash)

What you see in order_target_percent(context.stock, 1.00) is an order targeting 100% (1.00) of our portfolio value in the stock. Then we use the record function to plot the cash in our portfolio at the end of each day (even though we are running in minute mode).

After you have written the code above, go ahead and hit the 'build' button.

When the build is finished, you will notice that the cash dips below zero; this is because we are investing borrowed money.

For this example, let's say we don't want to borrow any money; we just want to invest what is in our capital base.

Now, you might be wondering, 'I already ordered 100% of my portfolio value, so why is it ordering more?' The answer is that sometimes an order takes more than one bar to fill, which is exactly what is happening here. Remember, we are ordering more than a million dollars of XTL shares, and in every bar after that we are placing similar-sized orders because our original order has not yet filled. Each of these orders stacks on top of the others, so by the time all of the orders are resolved, you end up with a much bigger position in XTL than you had originally planned.

To prevent this, we can use the function get_open_orders to see which securities we have placed orders for that have not yet been filled. get_open_orders gives us a dictionary of our open orders keyed by security. If we hit the 'build' button again after inserting this check, we will notice that we are no longer borrowing a lot of money. The cash value still does not land exactly at zero (which is what we want); that is due to slippage, which we will discuss later.
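As a rough sketch of my own (this exact snippet is not from the tutorial, and the order.amount attribute and log.info call are my reading of the standard Quantopian API), here is how you could peek inside that dictionary from handle_data:

def handle_data(context, data):
    open_orders = get_open_orders()            # dict of open orders keyed by security
    if context.stock in open_orders:
        # There are unfilled orders for this stock -- don't stack a new one on top
        for order in open_orders[context.stock]:
            log.info("Unfilled order for %s shares of XTL" % order.amount)
    else:
        order_target_percent(context.stock, 1.00)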

So there we have it: an algorithm on ticker XTL that uses order_target_percent to order 100% of our portfolio in XTL, checking minute by minute whether the order still needs to be placed!

I hope you enjoyed this lesson about managing orders in Quantopian, as always, if you have any questions or comments please feel free to leave feedback!