Hadley Wickham and Dirk Eddelbuettel — written Dec 2, 2013 — source
The plyr package uses a couple of
small C functions to optimise a number of particularly bad bottlenecks.
Recently, two functions were converted to C++. This was mostly stimulated by a
segmentation fault caused by some large inputs to the split_indices()
function: rather than figuring out exactly what was going wrong with the
complicated C code, it was easier to rewrite with simple, correct C++ code.
The job of split_indices()
is simple: given a vector of integers, x
,
it returns a list where the i
-th element of the list is an integer vector
containing the positions of x
equal to i
. This is a useful building block
for many of the functions in plyr.
It is fairly easy to see what is going on the in the C++ code:
We create a std::vector
containing std::vector
(of type integer
)
called ids
. It will grow efficiently as we add new values, and Rcpp will
automatically convert to a list of integer vectors when returned to R.
The loop iterates through each element of x
, adding its index to the end
of ids
. It also makes sure that ids
is long enough. (The plus and minus
ones are needed because C++ uses 0 based indices and R uses 1 based
indices.)
The code is simple, easy to understand (if one is a little familiar with the STL), and performant. The most awkward aspect of the code is switching between R’s 1-based indexing and C++ 0-based indexing (and indeed this was the source of a bug in a previous version of this code).
Compare it to the original C code:
This function is almost three times as long, and has a bug in it. It is substantially more complicated because it:
has to take care of memory management with PROTECT
and UNPROTECT
;
Rcpp takes care of this for us
needs an additional loop through the data to determine how long each
vector should be; the std::vector
grows efficiently and eliminates this
problem
Conversion to C++ can make code shorter and easier to understand and maintain, while remaining just as performant.
Tweet