Implementing the Pearson correlation algorithm using Ruby and the GNU Scientific Library

November 24, 2008

At Ruby Manor on Saturday Alex MacCaw gave a great introduction to his acts_as_recommendable plugin for Rails. acts_as_recommendable simplifies collaborative filtering for Rails models, automatically generating recommended items, at an on-line store for example, based on a database of user preferences.

At its heart, acts_as_recommendable uses a statistical measure known as the Pearson correlation coefficient to calculate the "nearness" of items to one another. Alex talked about the performance issues he encountered when implementing the algorithm in pure ruby. To allow recommendations to be calculated for the entire database he had to switch to making calculations offline and reimplementing the algorithm in C using RubyInline.

The GNU Scientific Library has an implementation of the Pearson algorithm, and in this post I'd like to show how the Ruby code, or its inline-C equivalent can be replaced with GSL code using Ruby bindings to the GSL.

My naive pure Ruby version of the algorithm looks like this:

def ruby_pearson(x,y)
  n=x.length

  sumx=x.inject(0) {|r,i| r + i}
  sumy=y.inject(0) {|r,i| r + i}

  sumxSq=x.inject(0) {|r,i| r + i**2}
  sumySq=y.inject(0) {|r,i| r + i**2}

  prods=[]; x.each_with_index{|this_x,i| prods << this_x*y[i]}
  pSum=prods.inject(0){|r,i| r + i}

  # Calculate Pearson score
  num=pSum-(sumx*sumy/n)
  den=((sumxSq-(sumx**2)/n)*(sumySq-(sumy**2)/n))**0.5
  if den==0
    return 0
  end
  r=num/den
  return r
end

Here I should note that the actsasrecommendable code is considerably more complicated, however the heart of the calculation looks something like the above. We can replace that with an inline C version using something like:

require 'rubygems'
require 'inline'

inline do |builder|
  builder.c '
    #include <math.h>
    double inline_pearson(int n, VALUE x, VALUE y) {
    double sum1 = 0.0;
    double sum2 = 0.0;
    double sum1Sq = 0.0;
    double sum2Sq = 0.0;
    double pSum = 0.0;

    VALUE *x_a = RARRAY(x)->ptr;
    VALUE *y_a = RARRAY(y)->ptr;

    int i;
    for(i=0; i<n; i++) {
      double this_x;
      double this_y;
      this_x = NUM2DBL(x_a[i]);
      this_y = NUM2DBL(y_a[i]);
      sum1 += this_x;
      sum2 += this_y;
      sum1Sq += pow(this_x, 2);
      sum2Sq += pow(this_y, 2);
      pSum += this_y * this_x;
    }
    double num;
    double den;
    num = pSum - ( ( sum1 * sum2 ) / n );
    den = sqrt( ( sum1Sq - ( pow(sum1, 2) ) / n ) *
          ( sum2Sq - ( pow(sum2, 2) ) / n ) );
    if(den == 0){
      return 0.0;
    } else {
      return num / den;
    }
   }'
end

Which is a considerable amount of code. I love the fact that this code can be embedded directly in the Ruby source code but, as was pointed out to me, for some people this would be seen as something of a maintenance nightmare. If you're prepared to install the pre-requisite GSL library and its Ruby bindings you can actually replace all of the above code with the simple:

require 'gsl'

def gsl_pearson(x,y)
  GSL::Stats::correlation(
    GSL::Vector.alloc(x),GSL::Vector.alloc(y)
  )
end

In this code, the <verb>GSL::Vector.alloc()</verb> method converts a Ruby Array to a GSL Vector class. We then call the correlation method, helpfully provided by the Statistics portion of the GSL.

So, how does this perform ? A quick benchmark :

require 'benchmark'

n = 100000
x = []; n.times{x << rand}
y = []; n.times{y << rand}

Benchmark.bm() do |bm|
  bm.report("Ruby:") {ruby_pearson(x,y)}
  bm.report("GSL:") {gsl_pearson(x,y)}
  bm.report("Inline:") {inline_pearson(n,x,y)}
end

Gives some indicative results :

usersystemtotalreal
Ruby:1.5300000.0200001.550000(1.544765)
GSL:0.0100000.0000000.010000(0.015925)
Inline:0.0000000.0000000.000000(0.004115)

While the Inline version in this example outperforms the GSL version, both offer considerable savings on the Ruby version. One of the real advantages of the GSL is it allows you to rapidly experiment with alternative implementations of an algorithm. Also you can be safe in the knowledge that your code is based on a well-tested library of scientific functions.

Published