Brian Wagner | Blog

Google Cloud Functions for Automating Tasks

Feb 14, 2019 | Last edit: Feb 22, 2019

Cloud functions (Lambdas on AWS) are a quick way to get one's toes wet in the cloud computing world. Google, Amazon and Microsoft Azure all have competing products, and they market them avidly as a gateway to bring more customers into their deep base of services. All of them support Node.js, plus a mixed bag of other languages that varies by provider. I have been using Golang, which entered beta support on Google Cloud Platform in January 2019.

"Serverless" is a word often used to describe these types of operations. Of course, that doesn't mean there is no server involved. Rather it's not YOUR server to worry about. The cloud host maintains it and manages security, updates, load, logging, etc. All we want is a place to perform a runtime, and not worry about the server configuration and its lifecycle. I like to think of this as "Rent a Runtime" or "Runtime as a Service."

This article dives into some specifics about creating a cloud function, inspired by a real-world challenge at work. The task was converting an unwieldy CSV file into something more usable for a front-end application. I chose to export to JSON. Initially, this process (download CSV, convert, upload JSON to file server) ran on my local computer. Shifting this to a cloud function removes my local computer from the equation, and opens the door to more automation.

Why Golang?

I have been exploring Golang for a few months. As a developer working with dynamic language runtimes, I'm fascinated by the prospect of building a self-contained package and deploying it on a platform. I especially like the idea of being liberated from setting up language environments. No need to install Node.js, or verify which version of PHP the system is running. Once a Golang project is compiled, it should be ready to deploy on the platform of choice.

The Challenge

Each day we get a CSV file dump, an ugly beast. We want to clean up the data, sort it, and generate one or more files that can be more easily consumed by a web application. Finally we need to put those files somewhere that can support traffic.

Each of these steps is well suited to the cloud.

  • we can upload the original file and rely on bucket-triggers to fire the conversion process
  • we can use any programming language available to execute code -- not just the one we're using for our main web application. (Imagine using a Python function for number crunching when the site is in PHP, for example.)
  • we have easy access to cloud storage buckets where we want the output files to live

Background

Truth be told, this was a real-life task for me to untangle, and the solution of using a cloud function dawned on me gradually. (I wasn't looking for a problem to solve with cloud functions, although that is fine too!) Initially, what we had was a 3.2MB file on the server that was read, sorted and formatted by a PHP script on every page load. Not ideal. Eventually, I created a script to generate two separate JSON files and host them locally. Then I moved to upload the files to a cloud bucket for faster availability. The end goal was to automate as much as possible, and make sure these files were updated each day.

Creating a Cloud Function

Cloud functions can be triggered in several ways:

  • an HTTP request is made
  • a file is added, changed or removed in a cloud storage bucket
  • a cloud pub/sub message is published
  • an event is fired by another cloud service, e.g. a database

For each of these triggers, the function logic has access to the originating event, so it can capture critical details for the operation. In our case, we want to execute the function when a new file is uploaded to a bucket, so we will have access to that file and its contents during the operation. Each cloud platform will have its own way of exposing this, so read the documentation!

This official Google demo has some basics about creating a function that relies on bucket changes on Google Cloud Platform.

Most importantly, note the init() function that is required to instantiate the storage interface we use to read and write to buckets. Also note how we pass the event object into our function. Full documentation for the Google Cloud "storage" package is linked in the References below.

type GCSEvent struct {
    Bucket string `json:"bucket"`
    Name   string `json:"name"`
}

var (
    storageClient *storage.Client
)

func init() {
    // Declare a separate err variable to avoid shadowing the client variables.
    var err error

    storageClient, err = storage.NewClient(context.Background())
    if err != nil {
        log.Fatalf("storage.NewClient: %v", err)
    }
}

func BlurOffensiveImages(ctx context.Context, e GCSEvent) error { ... }

Create a Module

GCP expects our Golang code to exist as a module, so we enable modules (Go 1.11 or higher is required):

  • set GO111MODULE=on
  • go mod init (pass a module path as an argument if Go cannot infer one)
  • go mod tidy

Next we download the Google storage library:

  • go get cloud.google.com/go/storage
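
For reference, the go.mod that results looks roughly like the sketch below. The module path is a placeholder I picked for illustration, and the exact require line (module path and version) depends on whichever release of the library go get resolves at the time:

module example.com/creative

// Placeholder require line; "go get" records the real module path
// and version for the storage library when you run it.
require cloud.google.com/go/storage v1.0.0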

Program Logic

Maybe the hardest part of this exercise is deciding how to process the raw data. Do we split it up into different files, because that's how the application will use it? Do we discard some data because it's not exposed to the end user? Or clean up or convert other bits? All of that depends on the data you have.

For this demo, I went searching for CSV data and found this dataset on U.S. regional employment numbers in the 'creative fields' (linked in the References below).

"The creative class thesis—that towns need to attract engineers, architects, artists, and people in other creative occupations to compete in today's economy—may be particularly relevant to rural communities, which tend to lose much of their talent when young adults leave."

From this, we'll ignore everything but:

  • state name
  • county name
  • metro area (boolean)
  • total civilian employment
  • total employed in creative fields

To demonstrate writing to two separate files, we'll split the data by the "metro" field.

package creative
 
import (
    "bytes"
    "compress/gzip"
    "context"
    "encoding/csv"
    "encoding/json"
    "fmt"
    "io/ioutil"
    "log"
 
    "cloud.google.com/go/storage"
)
 
type GCSEvent struct {
    Bucket string `json:"bucket"`
    Name   string `json:"name"`
}
 
type region struct {
    State    string `json:"state"`
    County   string `json:"county"`
    Metro    bool   `json:"metro"`
    Employed string `json:"employed"`
    Creative string `json:"creative"`
}
 
var (
    storageClient *storage.Client
)
 
func init() {
    // Declare a separate err variable to avoid shadowing the client variables.
    var err error
 
    storageClient, err = storage.NewClient(context.Background())
    if err != nil {
        log.Fatalf("storage.NewClient: %v", err)
    }
}
 
func ConvertFile(ctx context.Context, e GCSEvent) error {
    uri := fmt.Sprintf("gs://%s/%s", e.Bucket, e.Name)
 
    if e.Name != "creativeclass200711.csv" {
        log.Printf("Ignoring this file: %v", e.Name)
        return nil
    }
 
    log.Printf("Received a file to convert: %s", uri)
 
    // Open a reader on the uploaded object.
    rc, err := storageClient.Bucket(e.Bucket).Object(e.Name).NewReader(ctx)
    if err != nil {
        return err
    }
    defer rc.Close()

    data, err := ioutil.ReadAll(rc)
    if err != nil {
        return err
    }
 
    // Convert to json.
    metro, notMetro := process(data)
 
    // Write files to another bucket to avoid re-running the function.
    dest := "DESTINATION_BUCKET_NAME"
 
    w := storageClient.Bucket(dest).Object("metro.json").NewWriter(ctx)
    writeFile(metro, w)
 
    ww := storageClient.Bucket(dest).Object("not_metro.json").NewWriter(ctx)
    writeFile(notMetro, ww)
 
    return nil
}
 
func makeRegion(d []string) region {
    // Column 4 holds the metro flag as "1" or "0".
    return region{
        State:    d[1],
        County:   d[3],
        Metro:    d[4] == "1",
        Employed: d[5],
        Creative: d[7],
    }
}
 
func writeFile(d []byte, w *storage.Writer) {
    w.ContentType = "application/json"
    w.ContentEncoding = "gzip"

    // Chain a gzip writer onto the storage writer.
    gzw := gzip.NewWriter(w)
    if _, err := gzw.Write(d); err != nil {
        log.Printf("Error writing gzip data: %v", err)
    }
    // Close the gzip writer first so its footer is flushed into the
    // storage writer before that writer is closed.
    if err := gzw.Close(); err != nil {
        log.Printf("Error closing gzip writer: %v", err)
    }

    if err := w.Close(); err != nil {
        log.Fatalf("Error writing file %v", err)
    }
}
 
func process(data []byte) ([]byte, []byte) {
    r := csv.NewReader(bytes.NewReader(data))
    r.Comma = ','
    records, err := r.ReadAll()
    if err != nil {
        log.Printf("Error reading file to process: %v", err)
        return nil, nil
    }

    totalRecords := len(records)

    metro := make([]region, 0)
    notMetro := make([]region, 0)

    for i, rec := range records {
        // Ignore header line.
        if i == 0 {
            continue
        }
        reg := makeRegion(rec)
        if reg.Metro {
            metro = append(metro, reg)
        } else {
            notMetro = append(notMetro, reg)
        }
    }

    log.Printf(
        "%v total records. Metro: %v. Not-metro: %v",
        totalRecords-1,
        len(metro),
        len(notMetro),
    )

    // Convert to byte strings.
    metroJson, err := json.Marshal(metro)
    if err != nil {
        log.Printf("Error marshaling metro data: %v", err)
    }
    notMetroJson, err := json.Marshal(notMetro)
    if err != nil {
        log.Printf("Error marshaling non-metro data: %v", err)
    }

    return metroJson, notMetroJson
}

Code Breakdown

The main logic here is inside the ConvertFile() function. We need to identify this function by name when we deploy to Google Cloud Platform. As mentioned above, the init() function is critical for setting up the storage client, so we can read from and write to the buckets.

  • Set up the storageClient, as the Google docs explain.
  • Build the gs:// URI from the event and check the filename of the newly uploaded file, skipping anything we don't expect.
  • Read the file contents and hand off to a 'process()' function that pushes the CSV rows into our 'region' struct.
  • process() also splits the data into chunks, based on metro status, and returns the JSON.
  • Create a storage bucket writer for each output file. Here the destination bucket is hard-coded, but it could be supplied as an environment variable for the function, which, I believe, can be changed easily through the Google Cloud console (a sketch of that follows the deploy command below).
  • The writeFile() function wraps the repeated logic of assigning the content type and encoding, and chaining a gzip writer to the storage writer.
  • Remember to close the writers, but no need to close the storageClient.
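
Before deploying, it's handy to exercise process() locally against a saved copy of the CSV. A quick test file in the same package (something like process_test.go) works well for that. This is just a rough sketch: the sample file path is a placeholder, and note that the package's init() will try to create a storage client, so you'll need application default credentials available locally (for example via gcloud auth application-default login) for go test to get past it.

package creative

import (
    "io/ioutil"
    "testing"
)

// TestProcess runs the conversion against a local copy of the raw CSV
// and writes the JSON output so it can be inspected by hand.
func TestProcess(t *testing.T) {
    data, err := ioutil.ReadFile("testdata/creativeclass200711.csv")
    if err != nil {
        t.Skipf("no sample CSV available: %v", err)
    }

    metro, notMetro := process(data)
    if len(metro) == 0 || len(notMetro) == 0 {
        t.Fatal("expected both metro and non-metro output")
    }

    if err := ioutil.WriteFile("testdata/metro.json", metro, 0644); err != nil {
        t.Fatalf("writing metro.json: %v", err)
    }
    if err := ioutil.WriteFile("testdata/not_metro.json", notMetro, 0644); err != nil {
        t.Fatalf("writing not_metro.json: %v", err)
    }
}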

Gzip and chaining writers

Since we're serving JSON to our theoretical application, we can use gzip compression to make it smaller over the line. In Go, that's as simple as chaining writers together.
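
As a standalone illustration of the idea (independent of the cloud code above), gzipping any output is just a matter of wrapping one writer in another:

package main

import (
    "bytes"
    "compress/gzip"
    "fmt"
    "log"
)

func main() {
    var buf bytes.Buffer

    // The gzip writer wraps the buffer; anything written to gzw
    // is compressed and then passed through to buf.
    gzw := gzip.NewWriter(&buf)
    if _, err := gzw.Write([]byte(`{"hello":"world"}`)); err != nil {
        log.Fatal(err)
    }
    // Closing the gzip writer flushes the compressed stream,
    // including the gzip footer, into the underlying buffer.
    if err := gzw.Close(); err != nil {
        log.Fatal(err)
    }

    fmt.Printf("compressed %d bytes\n", buf.Len())
}

In the cloud function, the storage writer simply takes the place of the buffer.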

Close

We always want to close our writers in Go. Here we close both the gzip writer and the storage writer. But according to the Google docs, we do not need to close the storageClient itself.

Access Control

Securing a cloud function can be tricky, especially when using the HTTP trigger method. That URL exists out in the world! From the little I've seen, AWS seems to have better tools for managing this.

When using the bucket trigger, we have total control over who can upload to that bucket. But we can take an added step and validate inside the function itself, based on the filename, for example. If you're writing the converted file back to the same bucket (as shown in the image example above) ... well, I think that's a bad idea. Every output file triggers the function again, and it's easy to imagine a minor change that results in an infinite loop of bucket writes!

My advice is to have one bucket for uploads, and another to receive the converted files and host them. That way we can also set different access controls on the two buckets.
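
For example, a small guard helper could be called at the top of ConvertFile before doing any work. This is only a sketch: shouldProcess is a name I made up, the checks are illustrative, and it needs the strings package imported.

// shouldProcess reports whether the uploaded object looks like something
// we want to convert.
func shouldProcess(e GCSEvent, destBucket string) bool {
    // Only handle CSV uploads.
    if !strings.HasSuffix(e.Name, ".csv") {
        return false
    }
    // Refuse to run if the trigger bucket is also the output bucket,
    // which would make every output file fire the function again.
    if e.Bucket == destBucket {
        return false
    }
    return true
}

Calling it with the event and the destination bucket name, and returning nil early when it says no, makes the accidental-loop scenario much harder to hit.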

Deploying

The Google Cloud SDK allows us to deploy the function from the command line. Deployment can also be done by uploading a ZIP file or tagging a Cloud Source repo, neither of which I've tried.

From the command line, the deploy command is big and ugly:

gcloud functions deploy creative --runtime go111 --trigger-bucket gs://WATCHED_BUCKET_NAME --entry-point ConvertFile

  • "creative" is the name we give the deployed function (it matches our package name here)
  • we must specify the runtime, as Node.js is the default (although GCP is phasing out a default value)
  • trigger type (others are --trigger-http, --trigger-topic ... more)
  • based on the trigger type, we need to identify the resource to watch
  • the entry point is the function in our code that should be called when the cloud function executes
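
As mentioned in the code breakdown, the hard-coded destination bucket could instead come from an environment variable set at deploy time. Here's a rough sketch of that idea: OUTPUT_BUCKET is a name I invented, the snippet needs the os package imported, and the --set-env-vars flag syntax is worth double-checking against the gcloud docs.

// Inside ConvertFile: read the destination bucket from the environment
// instead of hard-coding it. OUTPUT_BUCKET is a made-up variable name.
dest := os.Getenv("OUTPUT_BUCKET")
if dest == "" {
    return fmt.Errorf("OUTPUT_BUCKET environment variable is not set")
}

// At deploy time, the value would be supplied with something like:
//   gcloud functions deploy creative ... --set-env-vars OUTPUT_BUCKET=DESTINATION_BUCKET_NAME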

When successful, the terminal will report some info about the function. One important detail we get back is the version number. We can change the code and run the same deploy command over and over, and we'll see the version number increment. This helps track the code when we're deploying manually.

When deployment fails, we will see some log messages in the terminal and the cloud console that may be helpful, but not terribly so. If we were re-deploying, the platform should keep serving the last working version of the function. The cloud console shows this more readily, including a dropdown that suggests you can jump between versions. But it doesn't seem to work that way; there is no way to roll back from the latest working version to a previous one.

Next Steps

The next step would be to automate uploading this file, and then the whole process is hands-off! The Google Cloud SDK allows us to upload files easily from the command line. But I haven't researched what tools might exist for machine-to-machine file transfers like that.
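
For example, with the SDK installed, uploading the daily file could be a one-liner in a cron job or CI step, something along these lines (the bucket and file names are placeholders):

gsutil cp creativeclass200711.csv gs://WATCHED_BUCKET_NAME/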

Thanks!

I found some critical help on this, so kudos to these folks:

https://medium.com/google-cloud/google-cloud-functions-for-go-57e4af9b10da

https://dev.to/plutov/google-cloud-functions-in-go-43e0

https://medium.com/@skdomino/writing-on-the-train-chaining-io-writers-in-go-1b39e07f71c9

References

https://godoc.org/cloud.google.com/go/storage

https://cloud.google.com/appengine/docs/standard/go/googlecloudstorageclient/read-write-to-cloud-storage

https://cloud.google.com/functions/docs/tutorials/imagemagick

https://www.ers.usda.gov/data-products/creative-class-county-codes.aspx