David Zimmermann, Leonardo Silvestri, Dirk Eddelbuettel — written May 5, 2020 — source
Rcpp provides the DataFrame class which enables us to pass data.frame objects between C++ and R.
data.frame objects are key to R and used very widely. They also form the basis that two key
packages extend. One of these, the tibble package, operates in a similar fashion to data.frame
and treats the data as immutable. Upon a change to the data, a new version is created
internally; this is frequently referred to as “copy-on-write”. Immutable data structures have some
desirable properties in terms of reasoning with and about data, but the copying comes at a price in
terms of performance, especially once data becomes sizeable.
The other package, data.table, offers a contrasting approach. It generally modifies the data
in-place; this is often referred to as “by reference”. As (generally) no copies of the data are
needed, the runtime is often reduced. This design difference is (along with a lot of attention to
performance and optimisations in general) one of the reasons why data.table looks so attractive
in benchmarks.
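To make the contrast concrete, here is a minimal sketch in plain C++ (no R or Rcpp involved). It is only an analogy: `Table`, `scaled_copy`, and `scale_in_place` are hypothetical names invented for this illustration, and neither tibble nor data.table is implemented this way. The first function mimics the copying style by returning a modified deep copy; the second mimics the by-reference style by touching the existing buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a table: one vector<double> per column.
using Table = std::vector<std::vector<double>>;

// Copying style: the caller's table stays untouched; a modified deep copy
// of all columns is returned. The cost grows with the size of the data.
Table scaled_copy(const Table& input, std::size_t col, double factor) {
    assert(col < input.size());
    Table result = input;  // deep copy of every column
    for (double& x : result[col]) x *= factor;
    return result;
}

// By-reference style: the existing column buffer is modified in place,
// so no additional memory is allocated for the data.
void scale_in_place(Table& table, std::size_t col, double factor) {
    assert(col < table.size());
    for (double& x : table[col]) x *= factor;
}
```

Calling `scaled_copy(t, 0, 10.0)` leaves `t` as it was and hands back a second table, whereas `scale_in_place(t, 0, 10.0)` changes `t` itself and leaves the address of each column buffer unchanged.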
As objects of type data.table also inherit from data.frame, we can use all data.frame
functions on a data.table. Internally, however, the two are very different. The question now is,
how can we create a data.table efficiently inside of Rcpp without the need for a deep copy of
the data?
This question is important if the data generation process takes place in C++, e.g. when parsing,
simulating, or otherwise generating data. One example where this question came up is the
RITCH package, which parses (large) binary ITCH files (financial messages) in Rcpp and needs a
fast data conversion to data.table.
We first provide a very short answer, mostly for future reference. A more detailed answer is provided below.
Essentially two steps have to be taken:
1. Set the class attribute to "data.table", "data.frame" in Rcpp.
2. Call the data.table::setalloccol(df) function on the “partial” data.table.
A look inside the data.table::setalloccol() function shows that it itself calls only a few
data.table functions needed to do some housekeeping and optimisation on internal state. It would
be useful to access these functions directly from another package at the C level, and we plan to
discuss this with the data.table team.
A first thought would be to use the Rcpp::DataFrame or the Rcpp::List class in Rcpp to return
the data to R and then call data.table::setDT() on the object to properly convert it to a
data.table. To give an example, let's consider creating a dataset in Rcpp and returning it to R
as a data.frame.
Using setDT(), we can convert the list to a data.table:
Classes 'data.table' and 'data.frame': 10000 obs. of 10 variables:
 $ col_0: num -0.5605 -0.2302 1.5587 0.0705 0.1293 ...
 $ col_1: num 2.371 -0.167 0.927 -0.568 0.225 ...
 $ col_2: num -0.836 -0.221 -2.104 -1.668 -1.098 ...
 $ col_3: num -0.194 0.258 -0.538 -1.179 0.901 ...
 $ col_4: num 0.4825 0.7214 -0.5078 -0.0647 1.3021 ...
 $ col_5: num 0.26 0.918 -0.722 -0.808 -0.141 ...
 $ col_6: num -0.3883 0.0274 -0.2761 -0.0867 2.1477 ...
 $ col_7: num 0.414 -0.641 0.281 -0.694 -0.367 ...
 $ col_8: num 0.783 -0.424 -0.844 0.876 1.125 ...
 $ col_9: num 0.5732 0.0183 -0.022 -0.4278 -0.4776 ...
 - attr(*, ".internal.selfref")=<externalptr>
Inspecting the data.table code reveals that internally, data.table uses the setalloccol()
function to convert a data.frame to data.table after doing some checks. This can be leveraged
in the following way:
Set the class attribute of the list to "data.table", "data.frame" and then use setalloccol()
to reallocate the data by reference.
Classes 'data.table' and 'data.frame': 10000 obs. of 10 variables:
 $ col_0: num -0.5605 -0.2302 1.5587 0.0705 0.1293 ...
 $ col_1: num 2.371 -0.167 0.927 -0.568 0.225 ...
 $ col_2: num -0.836 -0.221 -2.104 -1.668 -1.098 ...
 $ col_3: num -0.194 0.258 -0.538 -1.179 0.901 ...
 $ col_4: num 0.4825 0.7214 -0.5078 -0.0647 1.3021 ...
 $ col_5: num 0.26 0.918 -0.722 -0.808 -0.141 ...
 $ col_6: num -0.3883 0.0274 -0.2761 -0.0867 2.1477 ...
 $ col_7: num 0.414 -0.641 0.281 -0.694 -0.367 ...
 $ col_8: num 0.783 -0.424 -0.844 0.876 1.125 ...
 $ col_9: num 0.5732 0.0183 -0.022 -0.4278 -0.4776 ...
 - attr(*, ".internal.selfref")=<externalptr>
[1] TRUE
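The idea behind these two steps can be mimicked in plain C++ as a rough analogy. This is not data.table's implementation and uses no R or Rcpp API: `ColumnList`, `promote_to_table`, and `spare_slots` are hypothetical names made up for this sketch. It tags a list of columns with the two class names and reserves spare slots in the vector of column handles, while the column buffers themselves are never copied.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Each column is a shared handle to its data buffer.
using Column = std::shared_ptr<std::vector<double>>;

// Hypothetical stand-in for an R list that carries a class attribute.
struct ColumnList {
    std::vector<Column> columns;
    std::vector<std::string> class_attr;
};

// Step 1: set the class attribute. Step 2: over-allocate the vector of
// column handles so that further columns can be appended cheaply later.
// Only handles are moved around; the column data stays where it is.
ColumnList promote_to_table(ColumnList input, std::size_t spare_slots) {
    for (const Column& c : input.columns) assert(c != nullptr);
    input.class_attr = {"data.table", "data.frame"};
    input.columns.reserve(input.columns.size() + spare_slots);
    return input;
}
```

Passing the list with `std::move` avoids even copying the handles. The point of the sketch is that the work done depends on the number of columns rather than the number of rows, which is why a conversion of this kind can stay cheap for large data.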
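Comparing the two methods, we see that the second implementation is a lot faster on smaller
datasets: in the first benchmark below it needs about half the elapsed time of the naive version.
On larger datasets the difference is less pronounced (roughly 1.5% in the second benchmark), which
hints that data.table is able to use the data in place and has no need to copy it.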
                         test replications elapsed relative user.self sys.self user.child sys.child
2 create_dt_correct(15, 1000)           10   0.007    1.000     0.008        0          0         0
1   create_dt_naive(15, 1000)           10   0.013    1.857     0.013        0          0         0

                          test replications elapsed relative user.self sys.self user.child sys.child
2 create_dt_correct(15, 1e+06)           10   4.202    1.000     4.035    0.167          0         0
1   create_dt_naive(15, 1e+06)           10   4.265    1.015     4.147    0.118          0         0
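The relative column in these tables is each elapsed time divided by the fastest elapsed time in the same run. As a small worked example of that arithmetic, the following plain C++ helper (the name `relative_timings` is hypothetical and not part of any benchmarking package) reproduces the ratios from the elapsed values above.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Divide every elapsed time by the smallest one, so the fastest entry
// gets 1.0 and all others show how many times slower they are.
std::vector<double> relative_timings(const std::vector<double>& elapsed) {
    assert(!elapsed.empty());
    const double fastest = *std::min_element(elapsed.begin(), elapsed.end());
    assert(fastest > 0.0);
    std::vector<double> relative;
    relative.reserve(elapsed.size());
    for (double t : elapsed) relative.push_back(t / fastest);
    return relative;
}
```

With the elapsed times 0.007 and 0.013 this gives 1 and about 1.857; with 4.202 and 4.265 it gives 1 and about 1.015, matching the tables.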
R 4.0.0 brings a new function list2DF() which we could consider as well.
Note: This post draws upon, and extends, an earlier writeup at his repo.
tags: data.table