Most people would agree that R is a popular language for data analysis. Perhaps less well known is that R has good support for parallel execution on a single CPU through packages like future. In this presentation we will talk about our experience scaling up R processes even further running R in parallel in docker containers using Kubernetes. Robots generate massive amounts of sensor and other data; extracting the right information and insights from this requires significant more processing than can be tackled on a single execution environment. Faced with a preprocessing job of several hundred GB of data of compressed json line files, we used Pachyderm to write data pipelines to run the data prep in parallel, using multicore containers on a kubernetes cluster.
By the end of the talk we will have dispelled the myth that R cannot be used in production at scale. Even if you do not use R, you will have seen a use case to scale up analysis regardless of your language of choice.