Scheduled R scripts on Google Compute Engine

Mark Edmondson

2019-05-04

Simple Scheduler

The workflow below lets you send any of your R scripts to a dedicated VM for scheduled execution. It takes advantage of containeRit to bundle up all the dependencies your R script may need.

In brief you need to:

  1. Use containeRit to create a Dockerfile that will run your R script with all its dependencies
  2. Build a Docker image using that Dockerfile, easiest is via GitHub and Build Triggers
  3. Launch a VM that will run your scheduled scripts with gce_vm_scheduler - keep this live
  4. Set up the scheduled cron for the Docker image using gce_schedule_docker

Example

The example below uses a script that ships with the package, which you can use as a template.

Demo script

A demo script is shown below; it may look like something you want to schedule:

library(googleAuthR)         ## authentication
library(googleCloudStorageR) ## google cloud storage

## set authentication details for non-cloud services
# options(googleAuthR.scopes.selected = "XXX",
#         googleAuthR.client_id = "",
#         googleAuthR.client_secret = "")

## download or do something
something <- tryCatch({
    gcs_get_object("schedule/test.csv", 
                   bucket = "mark-edmondson-public-files")
  }, error = function(ex) {
    NULL
  })
    
something_else <- data.frame(X1 = 1,
                             time = Sys.time(), 
                             blah = paste(sample(letters, 10, replace = TRUE), collapse = ""))
something <- rbind(something, something_else)

## authenticate on GCE for google cloud services
googleAuthR::gar_gce_auth()

tmp <- tempfile(fileext = ".csv")
on.exit(unlink(tmp))
write.csv(something, file = tmp, row.names = FALSE)
## upload something
gcs_upload(tmp, 
           bucket = "mark-edmondson-public-files", 
           name = "schedule/test.csv")

Note that it's best not to save data onto the scheduler itself - it's much better to load and save data via external storage such as Google Cloud Storage.

Schedule setup script

The above script can then be scheduled via the steps below.

We first create a Dockerfile that holds all of your script's dependencies. This is the magic of containeRit:

if(!require(containeRit)){
  devtools::install_github("MarkEdmondson1234/containerit") #use my fork until fix merged
  library(containeRit)
}

script <- system.file("schedulescripts", "schedule.R", package = "googleComputeEngineR")

## put the "schedule.R" script in the working directory
file.copy(script, getwd())


## it will run the script whilst making the dockerfile
container <- dockerfile("schedule.R",
                        copy = "script_dir",
                        cmd = CMD_Rscript("schedule.R"),
                        soft = TRUE)
write(container, file = "Dockerfile")

Now you have the Dockerfile, it can be used to create Docker images. There are several options here, including the docker_build function to build on another VM or locally, but the easiest is to use a code repository such as GitHub together with the Google Container Registry build service, Build Triggers.

This creates the Docker image for you on every GitHub push, and makes it available either publicly or privately to you.

The example does this via the public Container Registry created for googleComputeEngineR, which hosts the above script as "demo-docker-scheduler".
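
If you would rather not use Build Triggers, you can build the image yourself with docker_build. A minimal sketch, assuming the Dockerfile written above sits in your current working directory (the VM name is illustrative):

## launch a Docker-enabled VM to build on
build_vm <- gce_vm("build-vm", template = "r-base")

## build the image from the folder containing the Dockerfile
docker_build(build_vm, dockerfolder = getwd(), new_image = "demo-docker-scheduler")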

After the image has been built, you can schedule it to run as a crontab task via the new dedicated functions gce_vm_scheduler and gce_schedule_docker:

## Create a VM to run the schedule
vm <- gce_vm_scheduler("my_scheduler")

## setup any SSH settings if not using defaults
vm <- gce_ssh_setup(vm, username = "mark")

## get the name of the just built Docker image that runs your script
docker_tag <- gce_tag_container("demo-docker-scheduler", project = "gcer-public")

## Schedule the docker_tag to run every day at 0453AM
## schedule uses crontab syntax
gce_schedule_docker(docker_tag, schedule = "53 4 * * *", vm = vm)

The Docker image is downloaded the first time, then run on the schedule (04:53 AM in the above example).

The Dockerfile used need not be an R-related one; any Docker image can be scheduled.
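
For example, you could point the same scheduler at any image the VM can pull (the image name below is hypothetical):

## schedule a non-R Docker image (hypothetical image name)
gce_schedule_docker("gcr.io/my-project/my-python-job",
                    schedule = "0 6 * * *",
                    vm = vm)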

Master-Slave Scheduler

For bigger jobs or more separation, you can launch entire VMs dedicated to your scheduled task. This lets you tailor each VM individually.

Costs

$4.09 a month for the master + $1.52 a month per slave (a daily 30 min cron job on a 7.5GB RAM instance).

See the Google Cloud pricing calculator for up-to-date estimates.
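
As a rough check of the slave figure (the ~$0.10 hourly rate is an assumption for illustration - check the calculator for current prices):

## back-of-envelope: 30 minutes a day over a ~30.4 day month
0.5 * 30.4 * 0.10
#> [1] 1.52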

The master and slave templates

These have been set up via a public Google Container Registry via build triggers, tied to googleComputeEngineR’s repository on GitHub. You can see the Dockerfiles used at system.file("dockerfiles", "gceScheduler", package = "googleComputeEngineR")
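
For instance, to browse them from your local installation:

## list the Dockerfiles shipped with the package
list.files(system.file("dockerfiles", "gceScheduler", package = "googleComputeEngineR"),
           recursive = TRUE)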

Each time the GitHub repository is pushed, these Docker images are rebuilt, allowing for easy changes and versioning.

Setup the master VM

Now that the templates are saved to Container Registry, make a ‘Master’ VM that is small and will be on 24/7 to run cron. This costs ~$4.09 a month. Give it a strong password.

library(googleComputeEngineR)

username <- "mark"

## make the cron-master
master <- gce_vm("cron-master", 
                 predefined_type = "g1-small",
                 template = "rstudio", 
                 dynamic_image = gce_tag_container("gce-master-scheduler", project = "gcer-public"),
                 username = username, 
                 password = "mark1234")


## set up SSH from master to slaves with username 'master'
gce_ssh(master, "ssh-keygen -t rsa -f ~/.ssh/google_compute_engine -C master -N ''")

## copy SSH keys into the docker container 
## (probably more secure than keeping keys in Docker container itself)
docker_cmd(master, cmd = "cp", args = sprintf("~/.ssh/ rstudio:/home/%s/.ssh/", username)
docker_cmd(master, cmd = "exec", args = sprintf("rstudio chown -R %s /home/%s/.ssh/", username, username)

Setup slave instance

Create the larger slave instance, which can then be stopped, ready for the master to activate as needed. These will cost in total $1.52 a month if they run every day for 30 minutes. Here it's called slave-1, but a more descriptive name helps, such as a client name.

slave <- gce_vm("slave-1", 
                 predefined_type = "n1-standard-2",
                 template = "rstudio", 
                 dynamic_image = gce_tag_container("gce-slave-scheduler", project = "gcer-public"),
                 username = "mark", 
                 password = "mark1234")
                 
## wait for it to all install (e.g. RStudio login screen available)
## stop it ready for being started by master VM      
gce_vm_stop(slave)

If you want to use the latest version of the built Docker image, you need to recreate the instance - this also gives you a natural way to version your deployments.
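
A minimal sketch of recreating a slave so it pulls the freshly built image (the new name is illustrative, and doubles as a version marker):

## remove the old instance
gce_vm_delete(slave)

## recreate - pulls the latest image from Container Registry
slave <- gce_vm("slave-2",
                predefined_type = "n1-standard-2",
                template = "rstudio",
                dynamic_image = gce_tag_container("gce-slave-scheduler", project = "gcer-public"),
                username = "mark",
                password = "mark1234")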

Create scheduled script

Create the R script you want to schedule. Make sure it is self-sufficient, in that it can authenticate, perform its work, and upload the results to a safe repository such as Google Cloud Storage.

This script will be in turn uploaded itself to Google Cloud Storage, so the slave instance can call it via a handy googleCloudStorageR function that runs a script locally from a cloud storage file:

googleCloudStorageR::gcs_source('download.R', bucket = 'your-gcs-bucket')
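
To make the script available, first upload it to the bucket yourself (a sketch, assuming you are already authenticated to Google Cloud Storage):

## upload the scheduled script so the slave can source it later
googleCloudStorageR::gcs_upload("download.R", bucket = "your-gcs-bucket")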

The example script below authenticates with Google Cloud Storage, downloads a ga.httr-oauth file that carries the Google Analytics authentication, runs the download, then reauthenticates with Google Cloud Storage to upload the results. Modify it for your own expensive operation.

## download.R - called from slave VM
library(googleComputeEngineR) ## needed for the gce_global_* defaults below
library(googleCloudStorageR)
library(googleAnalyticsR)

## set defaults
gce_global_project("my-project")
gce_global_zone("europe-west1-b")
gcs_global_bucket("your-gcs-bucket")

## gcs can authenticate via GCE auth keys
googleAuthR::gar_gce_auth()

## use GCS to download auth key (that you have previously uploaded)
gcs_get_object("ga.httr-oauth", saveToDisk = "ga.httr-oauth")

auth_token <- readRDS("ga.httr-oauth")
options(googleAuthR.scopes.selected = c("https://www.googleapis.com/auth/analytics", 
                                        "https://www.googleapis.com/auth/analytics.readonly"),
        googleAuthR.httr_oauth_cache = "ga.httr-oauth")
googleAuthR::gar_auth(auth_token)

## fetch data

gadata <- google_analytics_4(81416156,
                             date_range = c(Sys.Date() - 8, Sys.Date() - 1),
                             dimensions = c("medium", "source", "landingPagePath"),
                             metrics = "sessions",
                             max = -1)

## back to Cloud Storage
googleAuthR::gar_gce_auth()
gcs_upload(gadata, name = "uploads/gadata_81416156.csv")
gcs_upload("ga.httr-oauth")

message("Upload complete", Sys.time())

Create master script

Create the script that will run on master VM. This will start the slave instance, run your scheduled script and stop the slave instance again.

## intended to be run on a small instance via cron
## use this script to launch other VMs with more expensive tasks
library(googleComputeEngineR)
library(googleCloudStorageR)
gce_global_project("my-project")
gce_global_zone("europe-west1-b")
gcs_global_bucket("your-gcs-bucket")

## auth to same project we're on
googleAuthR::gar_gce_auth()

## launch the premade VM
vm <- gce_vm("slave-1")

## set SSH to use 'master' username as configured before
vm <- gce_ssh_setup(vm, username = "master", ssh_overwrite = TRUE)

## run the script on the VM that will source from GCS
runme <- "Rscript -e \"googleAuthR::gar_gce_auth();googleCloudStorageR::gcs_source('download.R', bucket = 'your-gcs-bucket')\""
out <- docker_cmd(vm, 
                  cmd = "exec", 
                  args = c("rstudio", runme), 
                  wait = TRUE)

## once finished, stop the VM
gce_vm_stop(vm)

Add worker script to cron

Log in to the master VM and save the script, then schedule it via the cronR RStudio addin.
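
A minimal sketch using cronR directly instead of the addin (the script path is illustrative):

library(cronR)

## command that runs the saved master script
cmd <- cron_rscript("/home/master/master.R")

## run it every day at 04:30
cron_add(cmd, frequency = "daily", at = "04:30", id = "slave-scheduler")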