While investigating ways to make some R code more performant I came across the compiler package. It lets you compile individual functions or entire files into bytecode, or alternatively enable just-in-time (JIT) compilation. In my use case with Rserve the JIT options actually made performance worse, because each session recompiled everything rather than sharing compiled code. The way forward for me was to compile the files to bytecode ahead of time, so precompiled bytecode files were available to every session.

Just-in-time Compile

library(compiler)
enableJIT(3)

The options for JIT are 0 through 3, indicating no JIT up through the maximum level, where everything possible is compiled before first use.
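
One thing worth knowing: enableJIT() returns the previous JIT level, so you can raise it temporarily and put it back afterwards:

library(compiler)

oldlevel <- enableJIT(3)   # 3 = compile as much as possible before first use
# ... run the code you want JIT-compiled ...
enableJIT(oldlevel)        # restore the previous level (0 disables JIT)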

Compiling functions

library(compiler)
myfoo <- function(){
    seq(1:5)
}

compiledfoo <- cmpfun(myfoo)

That example is not going to yield any performance benefit because seq is already compiled, but it demonstrates HOW to compile a function. In my experience compiled functions run roughly 2-3 times faster, and compiling is most helpful for a function that gets called many times.
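
If you want to measure the gain yourself, a rough benchmark along these lines works; the loop-heavy function below is only an illustration and your speedup will vary:

library(compiler)

slowsum <- function(x) {
    total <- 0
    for (i in seq_along(x)) total <- total + x[i]
    total
}
fastsum <- cmpfun(slowsum)

x <- runif(1e6)
system.time(for (i in 1:10) slowsum(x))   # interpreted
system.time(for (i in 1:10) fastsum(x))   # bytecode compiled

(Note that recent versions of R enable the JIT compiler by default, so the difference may be much smaller there.)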

Compiling an entire R file

library(compiler)
cmpfile(infile='<path_to_file>.R')

This will compile the entire R file down to bytecode in the same location, with an extension of .Rc. The compiled file may then be loaded from other R files, and the performance benefits I saw were highly dependent on the types of routines being called.

Sourcing your newly compiled R file

library(compiler)

loadcmp('<path/to/compiled/file>.Rc')
# call all of your functions as you normally would

Recently, I was experimenting with a data set in R that ended up being more challenging than I first expected. It really wasn’t that large as far as size on disk or memory (for a single instance) was concerned, but it had over 3,000 columns.

At first I didn’t think it was that big of a deal, and it wouldn’t be if you only needed a single instance of the ~ 750MB data set in memory. But what if you have a web application that instantiates that data set on a server for each user, and you have many concurrent users? All of a sudden that 750MB, plus any operations on the data, can quickly exhaust available RAM.

I first looked at the bigmemory package because it seemed promising and uses some tech under the hood that I’ve worked with before: memory-mapped files and shared memory via the C++ Boost libraries. That combination allows a single instance of a matrix to be accessed by multiple processes. The problem was that my data set was not a matrix of uniform type but a mix of types. Enter the ff and ffbase packages!

The ff package provides data structures that are stored on disk but behave (almost) as if they were in RAM by transparently mapping only a section (pagesize) in main memory - the effective virtual memory consumption per ff object.
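
As a quick illustration of the idea (a toy example, not the data set discussed here):

library(ff)

x <- ff(vmode = "double", length = 1e8)   # backed by an on-disk file (~800MB when full)
object.size(x)                            # the in-memory object itself is tiny
filename(x)                               # where the data actually lives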

I loaded the original data frame and tried to cast it to an ffdf (ff data frame):

require(ffbase)
load('./35509-0001-Data.rda')
testffdf <- as.ffdf(da35509.0001, vmode = NULL)
save.ffdf(testffdf, dir="./testdf", relative=TRUE)

It failed silently. I looked in ./testdf and found nothing. Ugh, what appeared so promising now looked bleak!

After some additional tinkering, I received an actual error.

Error en  ff(initdata = initdata, length = length, levels = levels, ordered = ordered,  : 
   write error

Not very descriptive but it was something to go on. After some searching I came across this thread on Stack Overflow.

The answers were not entirely helpful, but they did point out that each column in an ff data frame is stored in a distinct file. Bingo! I’ve bumped into the maximum number of open files on a system before, and it is a very simple fix. Keep in mind the file limits are in place to prevent resource exhaustion on the system, but those defaults are set VERY conservatively. Increasing the limits is as simple as the following:

Linux:

Add the following to your /etc/security/limits.conf file.

youruserid  hard  nofile 100000 # you may enter whatever number you wish here
youruserid  soft  nofile  10000 # whatever you want the default to be for each shell or process you have running

OS X:

Add or edit the following in your /etc/sysctl.conf file.

kern.maxfilesperproc=166384
kern.maxfiles=8192

You’ll need to log out and log back in for the change to take effect. After that, you can work with an ff data frame with as many columns as your new limits.conf or sysctl.conf settings allow.
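
Once you’ve logged back in, you can confirm the soft limit your shell picked up:

ulimit -n   # prints the current soft limit on open file descriptors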

Now I’m able to load what was 750MB per data frame instance as an ffdf and only consume ~ 11MB of RAM per instance. Many parallel instances of the R routines that use this data can run without exhausting available RAM. Keep in mind that you’ll want to tweak your max open file settings to account for the expected concurrent use.

You can test this by opening up a new R session, changing directories to where you were working previously and loading the ffdf from disk:

require(ffbase)
load.ffdf('./testdf')

You should now have an instance of the original testffdf object.
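
A quick sanity check might look something like this; indexing an ffdf pulls only the requested chunk into RAM:

class(testffdf)       # "ffdf"
dim(testffdf)         # same dimensions as the original data frame
testffdf[1:5, 1:10]   # materializes just this small chunk as a regular data.frame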

A helpful presentation on the ff and ffbase packages is available here.

I’ve recently started exploring GlusterFS in Docker containers to use as persistent storage for the Dockerized services and applications I’ve been working on. If it’s performant enough for my purposes, it will close the gap and let me really treat the data center as one giant computer. Getting started was pretty straightforward.

Note: This is a quick example. Make sure you read up on security, change the default password, and review the original Dockerfile. I’ll be experimenting with running this out on AWS soon and should be able to further tighten up my example.

Get the latest Gluster container:

docker pull gluster/gluster-centos

Make the persistent data folders on each host:

sudo mkdir -p /gluster/logs /gluster/data /gluster/config /gluster/mnt

Start a GlusterFS container on each host:

docker run -d \
   --name gluster \
   --privileged \
   --net=host \
   -v /gluster/data:/gluster \
   -v /gluster/logs:/var/log/glusterfs \
   -v /gluster/config:/var/lib/glusterd \
   -v /gluster/mnt:/gluster/mnt \
   gluster/gluster-centos

Probe the hosts in the cluster:

For each container on each host you’ll want to execute this to make the nodes aware of the other peers (see the sketch below). If running out on AWS, these steps could be orchestrated through the init system on the hosts so you don’t have to log into each machine.

gluster peer probe 1.1.1.1
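
For example, from the host at 172.30.0.186, and assuming the container is named gluster as above, the probes and a quick verification could be scripted with docker exec using the other nodes’ internal IPs from the volume-create step below:

for peer in 172.30.0.185 172.30.0.30; do
   docker exec gluster gluster peer probe "$peer"
done
docker exec gluster gluster peer status   # confirm the peers are connected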

Now create your volume and start it:

gluster volume create media replica 3 transport tcp 172.30.0.185:/gluster/data  172.30.0.186:/gluster/data 172.30.0.30:/
gluster volume start media

In this example I’m replicating across all three servers, but depending on your needs you could go with distributed, striped, distributed striped, distributed replicated, distributed striped replicated, and so on. Know what you need and why before choosing.

Mount the volume

The docs made a big deal out of mounting the volume. I suspect that would become very important if you were doing anything other than replication.

You’ll want to do this on each host, using its internal IP:

mount -t glusterfs 172.30.0.186:/media /gluster/mnt

From one of the hosts, test with a write to the volume:

echo "testing, 1,2,3..." >> /gluster/mnt/test.txt

From another host you should then be able to read and write the same file. You could launch containers on any host with /gluster/mnt mounted as a volume, and they would have access to the data no matter which node they were launched on.
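
For instance (the image and container names here are just placeholders), a container started like this would see the replicated data regardless of which host it runs on:

docker run -d \
   --name myapp \
   -v /gluster/mnt:/data \
   myorg/myapp   # hypothetical image; /data inside it is backed by the Gluster volume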

This forces a manual update if one is available:

update_engine_client -update

You know it before you even try. You try anyway. Something like trying to bzip2 316,387 CSV files representing ~ 10TB of data. You know the result will be roughly 1/10th the size and R can handle the bzip2 files directly, so you call bzip2 with a glob of all the files and get the "argument list too long" error.

Find and xargs to the rescue! Also lbzip2, because you really don’t want to wait on bzip2 any more than you have to.

find path_to_files/ -name "*.csv" | xargs -P 5 lbzip2

Note: you’ll want to balance the number of parallel instances xargs launches (the -P 5 above, for lbzip2 or whatever you are running) against the fact that lbzip2 itself already runs multi-threaded.
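
If I remember correctly, lbzip2’s -n flag caps the number of worker threads per instance, which gives you a second knob for that balance (worth double-checking against your version’s man page):

find path_to_files/ -name "*.csv" | xargs -P 5 lbzip2 -n 2   # 5 instances x 2 threads each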

It still took a long time to compress 316,387 files, but it went a whole lot faster with parallel instances of the already-parallel lbzip2 via the xargs -P flag.
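
And since R reads bzip2-compressed files transparently, the compressed CSVs can then be used directly from R (the file name here is just a placeholder):

dat <- read.csv("path_to_files/some_file.csv.bz2")           # decompressed on the fly
# or explicitly through a connection:
dat <- read.csv(bzfile("path_to_files/some_file.csv.bz2"))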