Most people would agree that R is a popular language for data analysis. Perhaps less well known is that R has good support for parallel execution on a single machine through packages like future. In this presentation we will talk about our experience scaling R processes even further, running R in parallel in Docker containers using Kubernetes. Robots generate massive amounts of sensor and other data; extracting the right information and insights from this requires significantly more processing than can be tackled in a single execution environment. Faced with a preprocessing job of several hundred GB of compressed JSON Lines files, we used Pachyderm to write data pipelines that run the data prep in parallel, using multicore containers on a Kubernetes cluster.
By the end of the talk we will have dispelled the myth that R cannot be used in production at scale. Even if you do not use R, you will have seen a use case for scaling up analysis, regardless of your language of choice.
Hello. Thank you very much for your interest in this talk. My name is Frans van den Eyn. I'm the chief data officer at Expantia, where we help organizations boost their data analytics capacity. This talk was prepared together with Florian Pistoni. Florian is the co-founder and CEO of InOrbit, a cloud robot management platform that is taking robot operations to global scale.

These robots are everywhere now, wherever you look, even more so with COVID. They're working alongside humans, and they're working autonomously, which means that the volume of data we're gathering from robots is increasing exponentially. Of course, this offers very interesting opportunities for us data scientists to work with data that has some special characteristics. The InOrbit community has accumulated more than 3.8 million hours of robot data in the last 12 months alone, and it's growing rapidly: right now they're adding about a year's worth of data each and every day.

Even though this is a lightning talk, I want to briefly highlight some of the issues we found when doing analysis on those data, and how we solved them. The main problem we found is that InOrbit offers its services to any fleet: we have no control over exactly how the data are gathered and how they're sent to the central service. In one of the POCs we did, we were faced with many robots sending millions of files, one file every five seconds for every agent, and each file contained data from multiple sources. Each of those sources was an in-robot or in-agent operation with its own timestamps; they were not directly related. So the first step we needed to do was data extraction, and this was a little more complex than we expected, especially because we needed to join observations by nearest timestamp.
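The talk does not show the extraction code itself, but a nearest-timestamp join of this kind can be sketched with data.table's rolling joins. The signal names, timestamps and values below are invented for illustration:

```r
library(data.table)

# Two hypothetical signals sampled on unrelated clocks.
speed <- data.table(
  ts    = c(1.0, 6.2, 11.1),   # seconds since start
  speed = c(0.4, 0.9, 0.2)
)
mission <- data.table(
  ts     = c(0.8, 5.9, 12.0),
  status = c("moving", "moving", "failed")
)

setkey(speed, ts)
setkey(mission, ts)

# For every mission observation, pick the speed reading with the
# nearest timestamp (roll = "nearest" does a nearest-match join).
joined <- speed[mission, roll = "nearest"]
```

In the real pipeline the same idea would be applied across all signals (mission, localization, speed, and so on) to collapse them into one observation per time unit.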
I will highlight that in one more slide, and then the feature engineering on top of it. What I mean by nearest-time joining: we have different signals about mission, localization, speed, and we need to find an interval where we can join each and every one of those signals into a single observation. We worked out how that could be done.

Then we needed to start the feature engineering. Once we have one line, one observation per time unit, we wanted to look back at what happened before a failure. So there's a failure at a given point, and we go back, say, 42 seconds, and we need to do that for each and every failure. Doing that while taking into account all the cases where we couldn't include the datum, for instance when there was another failure within the 42-second time frame, was absolutely possible. But then we were faced with an enormous volume, where our local computer simply said: no, this is not going to be possible.

So we immediately thought about farming this out to Kubernetes. We set up a bucket with the incoming data, packaged into one-day gzip archives to make it a little more workable and easier to transport, and farmed out the full data extraction to Kubernetes; then a second bucket with the intermediate result, farm out the feature engineering, and we have the result ready for analysis. What we found is that it is much easier to get the help of something called Pachyderm. Pachyderm is a product, and a company; they have an open-source version, which is what we used. And what we have there is not a bucket but a filing cabinet: a repository where we can version the data coming in and version the data coming out. Building this kind of data pipeline with versioning means that if there's a change at any point in the pipeline, the rest of the pipeline will respond and update the data automatically.
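Going back to the look-back step described above: one way to label which observations fall in a usable pre-failure window is sketched below. The one-second sampling, the failure times and the rule "exactly one failure in the window" are assumptions for illustration, not the talk's actual implementation:

```r
library(data.table)

window_s <- 42L  # look-back horizon from the talk

# Hypothetical one-observation-per-second stream with a failure flag.
obs <- data.table(ts = 0:120, failure = FALSE)
obs[ts %in% c(50L, 70L), failure := TRUE]
failure_ts <- obs[failure == TRUE, ts]

# Count failures in the 42 s ahead of each observation. A datum is a
# usable "pre-failure" example only when exactly one failure falls in
# its window; a second failure inside the window disqualifies it,
# matching the cases the talk says had to be excluded.
obs[, fails_ahead := vapply(
  ts,
  function(t) sum(failure_ts > t & failure_ts <= t + window_s),
  integer(1)
)]
obs[, pre_failure := fails_ahead == 1L]
```

With failures at t = 50 and t = 70, an observation at t = 40 is excluded (both failures fall in its window), while t = 20 and t = 60 are kept.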
So that prepares us to have all the heavy lifting ready once we bring this into production. Just a quick look at what this looks like: we create pipeline specifications that are very similar to Kubernetes configuration files. The key thing is that we can connect the data in the Pachyderm repository (PFS is the Pachyderm file system) to what we're running in our R script. Our R scripts were already parallelized, but parallelizing on one machine was not enough; now we were able to farm this out, making that connection and setting whatever data preparation step we were running next. And for each datum, which is roughly the Pachyderm term for a unit of data in the pipeline, we can see whether it succeeded or failed. This is a screen from Patrick Santamaria, with whom I did most of this work. Being able to monitor what is happening with your data at that level was, in practice, absolutely great.

So: parallelizing R code is easy. There's a package called furrr, built on Henrik Bengtsson's future package, and it makes parallelizing very easy. Going from parallel R code to a massively parallel pipeline is doable thanks to Pachyderm; for us it was much easier to work with Kubernetes through it. The other thing is that large clusters are surprisingly affordable: we worked on an 80-CPU, 520 GB cluster for under $10 an hour, which was something we hadn't experienced before, and we now use this more and more in our work.

For the InOrbit team this also has huge implications, because as fleets grow, they know that collecting and processing data becomes critical. They already understand that scaling their full platform doesn't need to be hard; now we know that scaling up the analysis doesn't need to be hard either.
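The furrr pattern mentioned above is simple enough to show in a few lines. The worker count and the per-file function here are invented; in the real pipeline each element would be a file path under the mounted PFS directory rather than a number:

```r
library(future)
library(furrr)

# Spin up parallel workers; inside a container this would typically
# match the CPUs requested for the Pachyderm worker.
plan(multisession, workers = 2)

# Hypothetical per-file preparation step.
prep_one <- function(x) x^2

# Drop-in parallel replacement for purrr::map_dbl().
results <- future_map_dbl(1:8, prep_one)

plan(sequential)  # shut the workers down again
```

The appeal of furrr is exactly this: code written with purrr-style mapping becomes parallel by changing the `plan()`, which is why the scripts were easy to drop into containers afterwards.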
And using AI at the robot fleet level unlocks many new opportunities that they are working hard to bring to their customers. Thank you very much. I hope this was useful. If you want any further information, please visit our websites, expantia.com or inorbit.ai. We both have blogs running and we like to write about this stuff, so we hope to be in touch sometime in the future. Thank you very much.