[go: up one dir, main page]

Converting C code to C++ code: An example from plyr

Hadley Wickham and Dirk Eddelbuettel — written Dec 2, 2013 — source

The plyr package uses a couple of small C functions to optimise a number of particularly bad bottlenecks. Recently, two functions were converted to C++. This was mostly stimulated by a segmentation fault caused by some large inputs to the split_indices() function: rather than figuring out exactly what was going wrong with the complicated C code, it was easier to rewrite with simple, correct C++ code.

The job of split_indices() is simple: given a vector of integers, x, it returns a list where the i-th element of the list is an integer vector containing the positions of x equal to i. This is a useful building block for many of the functions in plyr.

It is fairly easy to see what is going on the in the C++ code:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::exports]]
std::vector<std::vector<int> > split_indices(IntegerVector x, int n = 0) {
    if (n < 0) 
      stop("n must be a positive integer");
  
    std::vector<std::vector<int> > ids(n);
  
    int nx = x.size();
    for (int i = 0; i < nx; ++i) {
        if (x[i] > ids.size()) {
           ids.resize(x[i]);
        }
    
        ids[x[i] - 1].push_back(i + 1);
    }
  
    return ids;
}
  • We create a std::vector containing std::vector (of type integer) called ids. It will grow efficiently as we add new values, and Rcpp will automatically convert to a list of integer vectors when returned to R.

  • The loop iterates through each element of x, adding its index to the end of ids. It also makes sure that ids is long enough. (The plus and minus ones are needed because C++ uses 0 based indices and R uses 1 based indices.)

The code is simple, easy to understand (if one is a little familiar with the STL), and performant. The most awkward aspect of the code is switching between R’s 1-based indexing and C++ 0-based indexing (and indeed this was the source of a bug in a previous version of this code).

Compare it to the original C code:

#include <R.h>
#include <Rdefines.h>

SEXP split_indices(SEXP group, SEXP n) {
    SEXP vec;
    int i, j, k;

    int nlevs = INTEGER(n)[0];
    int nobs = Rf_length(group);  
    int *pgroup = INTEGER(group);
  
    // Count number of cases in each group
    int counts[nlevs];
    for (i = 0; i < nlevs; i++)
        counts[i] = 0;
    for (i = 0; i < nobs; i++) {
        j = pgroup[i];
        if (j > nlevs) error("n smaller than largest index");
        counts[j - 1]++;
    }

    // Allocate storage for results
    PROTECT(vec = allocVector(VECSXP, nlevs));
    for (i = 0; i < nlevs; i++) {
        SET_VECTOR_ELT(vec, i, allocVector(INTSXP, counts[i]));
    }

    // Put indices in groups
    for (i = 0; i < nlevs; i++) {
        counts[i] = 0;
    }
    for (i = 0; i < nobs; i++) {
        j = pgroup[i] - 1;
        k = counts[j];
        INTEGER(VECTOR_ELT(vec, j))[k] = i + 1;
        counts[j]++;
    }
    UNPROTECT(1);
    return vec;
}

This function is almost three times as long, and has a bug in it. It is substantially more complicated because it:

  • has to take care of memory management with PROTECT and UNPROTECT; Rcpp takes care of this for us

  • needs an additional loop through the data to determine how long each vector should be; the std::vector grows efficiently and eliminates this problem

Conversion to C++ can make code shorter and easier to understand and maintain, while remaining just as performant.

tags: stl  basics 

Related Articles