Monthly Archives: March 2014

Linear regression in R: Interpreting the summary

When performing a linear regression in R, the program outputs a lot of relevant information when you call summary(). In this post we'll go through all the figures and discuss how to interpret them. An Excel sheet containing all the calculations is available here.

To get started, we perform a regression on the Boston data set, which is part of the MASS package:

> library(MASS)
> names(Boston)
[1] "crim" "zn" "indus" "chas" "nox" "rm" "age" 
[8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
> ?Boston

We will try to predict the median value of owner-occupied homes (medv) based only on the crime rate (crim):

> summary(lm(medv ~ crim, data = Boston))

Call:
lm(formula = medv ~ crim, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max
-16.957  -5.449  -2.007   2.512  29.800

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.03311    0.40914   58.74

Based on this output, let's go through all the values and see what they tell us. First let's have a look at a plot of this data:

There's definitely a trend here (a higher crime rate corresponds to lower house prices), as expected, but only up to a point. Above a crime rate of about 40, the median house prices remain roughly constant. There is also a more significant problem with our regression: we predict negative house prices for areas with very high crime rates. We will see how to address these two issues in subsequent posts.

Call
This line shows the call made to calculate the regression. In this case it is not really helpful, but when the call is more complicated and includes higher-order and interaction terms, it is useful to have it stored somewhere.

Residuals
The residuals are defined as the difference between the actual and the predicted value. If our prediction showed no bias, these residuals would be distributed evenly around zero (just as many predictions would be too low as too high). The R output gives a good first impression of this distribution. A good rule of thumb is that the median should be close to zero and the absolute values of 1Q and 3Q should be approximately equal. Here the median is -2.007 and the absolute value of 1Q (5.449) is more than twice 3Q (2.512), so the residuals are not distributed evenly around zero, indicating problems with the fit.
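To make the five-number residual summary concrete, here is a minimal Python sketch. The toy numbers are hypothetical and are not taken from the Boston data set; the "inclusive" quantile method is chosen because it interpolates the same way as R's default quantile().

```python
import statistics


def residual_summary(actual, predicted):
    """Return (min, 1Q, median, 3Q, max) of the residuals actual - predicted."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    # "inclusive" uses the same interpolation as R's default quantile type
    q1, median, q3 = statistics.quantiles(residuals, n=4, method="inclusive")
    return min(residuals), q1, median, q3, max(residuals)


# Hypothetical toy data, only for illustration
actual = [2.1, 3.9, 6.2, 7.8, 10.1]
predicted = [2.04, 4.03, 6.02, 8.01, 10.0]

summary = residual_summary(actual, predicted)
print(["%.2f" % value for value in summary])
```

For a well-behaved fit the median is close to zero and the first and third quartiles have similar absolute values.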

Coefficients
These are the estimated coefficients for the intercept and the predictor variables along with their standard errors, fitted using least squares. If the absolute value of the estimate is not much larger than the standard error, there's a good chance that the real coefficient is actually zero. The t value measures exactly that; it is defined as

 t value = Estimate / Std.Error.

As a rule of thumb, the absolute value of the t value should be larger than 2, and the bigger, the better. The p value gives the probability of observing a t value at least this far from zero if the real coefficient were zero. If this probability is very small, it is very unlikely that there is no relationship between the predictor and the dependent variable.
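As a quick check, the t value of the intercept can be recomputed from the estimate and standard error printed above. The p value in this sketch uses a normal approximation to the t distribution, which is an assumption on my part: it is reasonable with several hundred degrees of freedom, but R itself uses the exact t distribution.

```python
from statistics import NormalDist


def t_value(estimate, std_error):
    """t value = Estimate / Std. Error."""
    return estimate / std_error


def approx_two_sided_p(t):
    """Two-sided p value, using a normal approximation to the t distribution."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


# Intercept estimate and standard error from the summary output
t_intercept = t_value(24.03311, 0.40914)
print(round(t_intercept, 2))

# The rule-of-thumb threshold |t| = 2 corresponds to a p value just below 0.05
print(round(approx_two_sided_p(2.0), 4))
```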

Signif. codes
The symbols show the significance level. As a general rule of thumb, the p value should be at most 0.05. R shows this by attaching one (*) to three (***) stars next to the predictor variable to show the significance.
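The mapping from p value to symbol follows the legend R prints under the coefficient table (0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1). A small sketch of that lookup, where the function name is my own:

```python
def signif_code(p_value):
    """Return the significance symbol R would print for a given p value."""
    if p_value <= 0.001:
        return "***"
    if p_value <= 0.01:
        return "**"
    if p_value <= 0.05:
        return "*"
    if p_value <= 0.1:
        return "."
    return " "


for p in (0.0005, 0.005, 0.03, 0.08, 0.5):
    print(p, repr(signif_code(p)))
```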

Residual standard error and degrees of freedom
The degrees of freedom are the number of observations minus the number of parameters. In our case there are 506 observations and 2 parameters (one for the intercept and one for the crim variable). The more degrees of freedom, the less likely we are to overfit.
The residual standard error is defined as

 RSE = sqrt(RSS/df),

where the RSS is the Residual Sum of Squares (the sum of the squares of the residuals). The residual standard error gives an indication of "how wrong" the prediction is, on average.
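The formula is easy to reproduce by hand. The sketch below uses a hypothetical list of five residuals from a fit with two parameters (intercept and slope), so there are 5 - 2 = 3 degrees of freedom; for the Boston fit the same calculation would use 506 - 2 = 504.

```python
import math


def residual_standard_error(residuals, n_params):
    """Return (RSE, degrees of freedom) with RSE = sqrt(RSS / df)."""
    rss = sum(r * r for r in residuals)
    df = len(residuals) - n_params
    return math.sqrt(rss / df), df


# Hypothetical residuals, only for illustration
residuals = [0.06, -0.13, 0.18, -0.21, 0.10]
rse, df = residual_standard_error(residuals, n_params=2)
print(df, round(rse, 4))
```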

(Adjusted) R-squared
The R-squared gives the fraction of the total variation which is explained by the model, i.e. R^2 = 1 - RSS/TSS. RSS is the Residual Sum of Squares, as above. TSS is the Total Sum of Squares, that is, the sum of the squared differences between the dependent variable and its mean. TSS measures the variability present in the data.
Adjusted R^2 also incorporates the number of parameters:

 Adj.R^2 = 1 - (RSS/df) / (TSS/(N-1)),

where N is the number of observations (in our case 506). If there is no danger of overfitting, the Adjusted R^2 should be very close to the R^2.
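Both quantities can be computed directly from the actual and fitted values. This is a sketch on hypothetical toy data; n_params counts the intercept as well, so df = N - n_params as in the section on degrees of freedom.

```python
def r_squared(actual, predicted, n_params):
    """Return (R^2, adjusted R^2) for a fitted model."""
    n = len(actual)
    mean = sum(actual) / n
    rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    tss = sum((a - mean) ** 2 for a in actual)
    df = n - n_params
    r2 = 1.0 - rss / tss
    adj_r2 = 1.0 - (rss / df) / (tss / (n - 1))
    return r2, adj_r2


# Hypothetical toy data, only for illustration
actual = [2.1, 3.9, 6.2, 7.8, 10.1]
predicted = [2.04, 4.03, 6.02, 8.01, 10.0]

r2, adj_r2 = r_squared(actual, predicted, n_params=2)
print(round(r2, 4), round(adj_r2, 4))
```

The adjusted value is always a little lower than the plain R^2, and the gap grows as parameters are added without a matching reduction in RSS.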

F-statistic
The F-statistic tests the hypothesis that all coefficients except the intercept are zero. This is more useful in the case of multiple regression, where it gives us a general indication of whether the complete model is any good. The F statistic is calculated with the following formula:

 F = ((TSS - RSS)/k) / (RSS/(N - k - 1)),

where k is the number of predictors (in our case 1) and N is the number of observations.
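As a sketch of the calculation: the explained sum of squares (TSS - RSS) is divided by the number of predictors k, and the residual sum of squares by its degrees of freedom N - k - 1. The input numbers below are hypothetical toy values, not the Boston results.

```python
def f_statistic(tss, rss, n_obs, n_predictors):
    """F = ((TSS - RSS) / k) / (RSS / (N - k - 1))."""
    explained = (tss - rss) / n_predictors
    unexplained = rss / (n_obs - n_predictors - 1)
    return explained / unexplained


# Hypothetical sums of squares for a fit with 5 observations and 1 predictor
f_value = f_statistic(tss=39.708, rss=0.107, n_obs=5, n_predictors=1)
print(round(f_value, 1))
```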

Pig & mongo-hadoop on a local ubuntu cluster

I had a surprisingly hard time getting pig and mongo-hadoop to work on my local ubuntu machine. In this post I'll go through the steps of installing pig-0.12.0 and MongoDB 2.2.3 locally. Code which I used to make sure everything is running correctly can be found at

I will install everything in $HOME/hadoop.

Installing pig

tar xzf pig-0.12.0.tar.gz
cd pig-0.12.0
cd contrib/piggybank/java
cd ~/hadoop/pig-0.12.0
ant clean jar-all -Dhadoopversion=23

You also need to add the pig-0.12.0 directory to your PATH. Add this line to ~/.bashrc:

export PATH=$PATH:[path to pig-0.12.0]

and type

source ~/.bashrc

Installing MongoDB

For this I followed the guide at

sudo apt-get install mongodb-10gen=2.2.3
echo "mongodb-10gen hold" | sudo dpkg --set-selections

Make sure that everything is running OK by typing "mongo", which should get you to the MongoDB shell.

Downloading the MongoDB Java driver

cd hadoop

Installing mongo-hadoop

cd hadoop
git clone
cd mongo-hadoop
./sbt package

To push data from Pig to MongoDB, you need to register the three .jars by adding the following three lines at the beginning of your .pig script:

REGISTER [.../]hadoop/mongo-2.10.1.jar 
REGISTER [.../]hadoop/mongo-hadoop/core/target/mongo-hadoop-core_2.2.0-1.2.0.jar
REGISTER [.../]hadoop/mongo-hadoop/pig/target/mongo-hadoop-pig_2.2.0-1.2.0.jar

Email data analysis

In order to learn some more about data analytics, I am working through Russell Jurney's book "Agile Data Science". In it, he uses data downloaded from Gmail to illustrate the principles, and helpfully he set up a GitHub repository for all code used throughout his book. I will be analysing data obtained from Microsoft Outlook instead. Since getting the data prepared was a bit of a hassle, I will document the steps here.

1) Export all emails into a .pst file.

2) Get readpst and transform the data into mbox format. On Ubuntu, this should be as easy as typing

sudo apt-get install pst-utils
readpst -r emails.pst

This creates an mbox file for each folder in the pst file containing all emails in the folder.

3) Reading the mbox file in Python is pretty easy once you know about the mailbox module:

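import mailbox

# mboxfile is the path to one of the mbox files written by readpst
m = []
for message in mailbox.mbox(mboxfile):
    # collect every message of the folder in a list
    m.append(message)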

The next step will be getting all those mails from the mbox files into an avro schema.
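As a possible intermediate step, each message can be flattened into a plain dict of header fields plus the text body, which is a shape that is easy to map onto a schema later. This is only a sketch: the field names, the helper functions and the demo mailbox written to a temporary folder are my own assumptions, not something taken from the book.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def message_to_record(message):
    """Flatten one message into a dict of a few headers plus the text body."""
    body = ""
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            if payload is not None:
                charset = part.get_content_charset() or "utf-8"
                body = payload.decode(charset, errors="replace")
            break
    return {
        "from": message["From"],
        "to": message["To"],
        "subject": message["Subject"],
        "date": message["Date"],
        "body": body,
    }


def read_mbox(path):
    """Read all messages of one mbox file into a list of records."""
    box = mailbox.mbox(path)
    try:
        return [message_to_record(message) for message in box]
    finally:
        box.close()


def write_demo_mbox(path):
    """Create a tiny mbox file so the sketch runs without real emails."""
    demo = EmailMessage()
    demo["From"] = "alice@example.com"
    demo["To"] = "bob@example.com"
    demo["Subject"] = "Hello"
    demo.set_content("Just testing.")
    box = mailbox.mbox(path)
    box.add(demo)
    box.flush()
    box.close()


with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "Inbox.mbox")
    write_demo_mbox(demo_path)
    records = read_mbox(demo_path)

print(records[0]["subject"])
```

With real data, read_mbox would be called once per mbox file produced by readpst.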

Tutorial: Buying XCP for BTC

This post is somewhat offtopic, but in the last couple of days I've been playing around with counterpartyd, the reference implementation of the counterparty protocol. It took me a while to figure everything out, so I thought I'd post a short tutorial to make it easier for anyone out there trying to do the same.

Counterparty is a protocol for a distributed, open source financial marketplace. With it, you can issue and trade assets, make broadcasts and make bets. Everything is decentralized, open source and built on top of the bitcoin blockchain. Sounds pretty good so far!

The only downside is that in order to use the exchange, you need an altcoin called XCP. It is used for issuing assets and making bets as well as for payouts. Although it can be exchanged for BTC directly through the counterpartyd client, that process is not exactly straightforward.

At the moment, the implementation is very much at an alpha stage. There is no GUI, so everything must be done via the terminal, and sometimes it throws not-so-helpful errors. But it works: today I bought some XCP, and I will explain every step in this post.

First install bitcoind and counterpartyd. This part is very well explained in the docs and I had no problems worth mentioning getting everything ready. Once everything is installed, start the bitcoind server and the counterpartyd server:

$ bitcoind
Bitcoin server starting
$ counterpartyd
Block: 288486
Block: 288487
Block: 288488

You might have to wait until the whole blockchain is indexed by counterpartyd. As soon as the number of blocks is the same as the one given by "bitcoind getinfo | grep blocks", you are good to go. Open a second terminal window and leave the server running in the background. It'll give you useful information regarding order matches and so on later.

Next, send the bitcoins you want to trade to a new address. You can check whether they have arrived by typing "bitcoind listreceivedbyaddress". To make sure the counterpartyd client is in sync, type "counterpartyd balances <address>".

Figure out how many XCP you would like to receive for your BTC. You can check out the last price on blockscan. When I made the trade, the price was 0.01 BTC/XCP, and since I was going to spend 0.5 BTC, I'd expect to get 50 XCP. On that site you can also see the latest transactions, the current order book and so on. To place the order, type

counterpartyd order --source <address> --get-quantity 50 --get-asset XCP --give-quantity 0.5 --give-asset BTC --expiration 100 --fee-fraction-provided 0.001

The expiration is given in blocks (i.e. 100 blocks translates to roughly 16 hours). When paying with BTC, you need to provide a fee via --fee-fraction-provided. I tried 0.0001, which resulted in an error, so I increased it to 0.001 and everything worked nicely.
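The arithmetic behind the order can be sketched in a few lines of Python. The helper names are my own, the 10 minutes per block is only the approximate average, and the argument list simply mirrors the command shown above; Decimal is used so that 0.5 / 0.01 comes out as exactly 50.

```python
from decimal import Decimal

MINUTES_PER_BLOCK = 10  # approximate average time between bitcoin blocks


def xcp_for_btc(btc_amount, price_btc_per_xcp):
    """How many XCP to ask for when giving btc_amount at the given price."""
    return Decimal(btc_amount) / Decimal(price_btc_per_xcp)


def expiration_hours(blocks):
    """Rough lifetime of an order that expires after the given number of blocks."""
    return blocks * MINUTES_PER_BLOCK / 60


def order_args(address, give_btc, price, expiration_blocks, fee_fraction):
    """Assemble the arguments of the order command as a list of strings."""
    get_xcp = xcp_for_btc(give_btc, price)
    return [
        "counterpartyd", "order",
        "--source", address,
        "--get-quantity", format(get_xcp, "f"),
        "--get-asset", "XCP",
        "--give-quantity", give_btc,
        "--give-asset", "BTC",
        "--expiration", str(expiration_blocks),
        "--fee-fraction-provided", fee_fraction,
    ]


print(format(xcp_for_btc("0.5", "0.01"), "f"))
print(round(expiration_hours(100), 1))
print(" ".join(order_args("<address>", "0.5", "0.01", 100, "0.001")))
```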

After entering the order, you need to wait a little until it gets included in the next block (approximately 10 minutes). After that time, it should show up both in your server window and on blockscan. There you can also click on "[View Order Matches]" to see whether it (or parts of it) has been matched. If there are matches, you will get the XCP as soon as you have paid the required BTC amount.

To do this, look in your counterpartyd server terminal window for a line starting with "Order Match: " and compare it with the information on blockscan. It shows the amounts (both in BTC and XCP) and the transaction hash. Copy the transaction hash to the clipboard.

In the other window, enter the command

counterpartyd btcpay --order-match-id <transaction hash>

Finally, to make sure the XCP have been credited to you, enter "counterpartyd wallet" to see the balances and the address.