Teaching Data Science in the cloud

Background

Kurt is a professor in the Industrial & Manufacturing Engineering (IME) department in the College of Engineering.  As a faculty member in Industrial Engineering, he wanted to introduce data science and analytics to students in his department.  We live in a world where data can help inform and guide our approach to solving problems, and more and more devices are collecting information in quantities that were unmanageable just a few years ago.  Recognizing the need for an understanding of data science, Kurt proposed a new experimental course in which students would collect and analyze a reasonably large set of data points.  Given this opportunity, Kurt wanted to explore what new technologies and options are available for this developing field.  Having heard about Cal Poly's partnership with AWS, he approached ITS to see whether the cloud could be used as a resource for these learning objectives.

Problem:

 
Students were guided in building a device with an Arduino that would record latitude and longitude GPS coordinates, a timestamp, and temperature.  Students were then encouraged to bring the device with them on their commute to campus and between classes, and the collected data was then downloaded into a format that could be analyzed.  Current campus resources require a hardware provisioning cycle and a software configuration that stays consistent and stable throughout the quarter.  Kurt had many potential ideas in his head but needed the flexibility to evolve the class as the quarter progressed.  He also wanted an environment that would isolate the work the students were doing, give them room to experiment and make mistakes, and adapt to his evolving ideas on how to teach these concepts.
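As a rough illustration, the sketch below shows how one of these commute logs might be loaded for analysis.  The file name, column order, and CSV layout are assumptions for the example; the actual Arduino output format used in the course may differ.

```python
# Minimal sketch of loading one commute log (timestamp, GPS fix, temperature).
# The CSV layout and column names are assumed, not the course's exact format.
import pandas as pd

def load_trip(path: str) -> pd.DataFrame:
    """Read one logger file and drop rows where the GPS module had no fix."""
    df = pd.read_csv(
        path,
        names=["timestamp", "latitude", "longitude", "temp_c"],
        parse_dates=["timestamp"],
    )
    # Hypothetical sentinel: the logger writes 0,0 when there is no GPS fix.
    return df[(df.latitude != 0) & (df.longitude != 0)]

trips = load_trip("commute_2018-10-02.csv")  # hypothetical file name
print(trips.describe())
```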

Solution:

Kurt began by working with ITS to create a virtual lab in which each student would be given access to a compute resource in AWS.  Working out how to use resources in the cloud was a learning experience for both ITS and IME.  Kurt knew he wanted the students not only to consume the data but also to share the results with peers in the class.  He decided a web framework would meet his requirements: a simple web application that could consume the GPS data files and output the formatted content.  As we began to architect the environment, we realized that AWS's automation layer would let students avoid the details of installing and configuring a web server, and would also eliminate a database install on top of an operating system install.  Using Amazon's CloudFormation templates, we were able to automate the build of each server in minutes.  Each student was given an EC2 instance provisioned with all the software needed to consume and display the data.  Each machine was left on for the duration of the quarter and then shut down after 10 weeks.  Running the 26 t2.small instances cost about $430 per month, and one larger instance was deployed to analyze the entire class dataset.  The total cost, including data transfer charges, EBS storage, and the larger instance for the class, was $1,215.78.
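A hedged sketch of the provisioning step follows: launching one CloudFormation stack per student so that each EC2 instance comes up with the web application pre-installed.  The template file name, parameter names, and tag values here are illustrative, not the exact ones used in the course build.

```python
# Sketch: create one CloudFormation stack per student from a shared template.
# StackName prefix, template file, and parameter keys are assumptions.
import boto3

cfn = boto3.client("cloudformation", region_name="us-west-2")

with open("student-webapp.yaml") as f:   # hypothetical template file
    template_body = f.read()

students = ["student01", "student02"]    # ...one entry per enrolled student

for name in students:
    cfn.create_stack(
        StackName=f"ime-datasci-{name}",
        TemplateBody=template_body,
        Parameters=[
            {"ParameterKey": "InstanceType", "ParameterValue": "t2.small"},
            {"ParameterKey": "StudentId", "ParameterValue": name},
        ],
        Tags=[{"Key": "Course", "Value": "IME-Data-Science"}],
    )
```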

Future Improvements:

The students also needed a workstation to run other data analytics tools, and for the Winter Quarter they all used lab computers.  Some of the configuration controls imposed on lab machines limited which data tools could be used after the machine image was created.  We will likely pilot Amazon WorkSpaces as an alternative work environment for students to interact with GUI-rich tools like Jupyter notebooks.  The EC2 instances also don't need to run 24/7 during the quarter.  We will implement a solution that powers instances off when they are not in use, and build a simple web application using a serverless solution in AWS that lets students start up stopped instances.
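One way the planned power-off could be wired up is a scheduled Lambda function that stops any course instance whose CPU has been near-idle for the past hour.  The tag key, idle threshold, and schedule below are assumptions about how this might work, not the deployed configuration.

```python
# Sketch of an idle-shutdown Lambda: stop tagged instances whose average CPU
# over the last hour is below a threshold. Tag name and threshold are assumed.
import datetime
import boto3

ec2 = boto3.client("ec2")
cw = boto3.client("cloudwatch")

IDLE_CPU_PERCENT = 2.0  # assumed definition of "idle"

def handler(event, context):
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Course", "Values": ["IME-Data-Science"]},  # hypothetical tag
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]

    now = datetime.datetime.utcnow()
    for res in reservations:
        for inst in res["Instances"]:
            stats = cw.get_metric_statistics(
                Namespace="AWS/EC2",
                MetricName="CPUUtilization",
                Dimensions=[{"Name": "InstanceId", "Value": inst["InstanceId"]}],
                StartTime=now - datetime.timedelta(hours=1),
                EndTime=now,
                Period=3600,
                Statistics=["Average"],
            )["Datapoints"]
            if stats and stats[0]["Average"] < IDLE_CPU_PERCENT:
                ec2.stop_instances(InstanceIds=[inst["InstanceId"]])
```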

Reusable Architecture:

Many of these pieces will begin to form the components of a virtual lab that can be spun up on demand and reused as needed.  Future improvements should drive the cost down so that a service like this is affordable and scalable.

Update:

In the Fall of 2018 we improved the design of the EC2 resources so they automatically power down after being idle for one hour.  This reduces the number of billable EC2 hours from roughly 4,000 per week to a few hundred per week.  Using a few AWS services (Cognito, Lambda, API Gateway, and an S3-hosted website), students can now power on an EC2 instance on demand based on its tag name.  We estimate this new setup will dramatically reduce the cost of this service.
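For the student-facing "power on" path, the handler could look roughly like the sketch below: API Gateway passes a tag value identifying the student's instance, and a Lambda function starts it if it is stopped.  The tag key, query parameter name, and event shape are assumptions about the deployed setup.

```python
# Illustrative Lambda handler behind API Gateway: start a stopped EC2 instance
# matching a tag value supplied by the S3-hosted web page (assumed interface).
import json
import boto3

ec2 = boto3.client("ec2")

def handler(event, context):
    # Hypothetical request shape: GET ...?name=<instance tag value>
    params = event.get("queryStringParameters") or {}
    name = params.get("name")

    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Name", "Values": [name]},
            {"Name": "instance-state-name", "Values": ["stopped"]},
        ]
    )["Reservations"]

    ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if ids:
        ec2.start_instances(InstanceIds=ids)
    return {"statusCode": 200, "body": json.dumps({"started": ids})}
```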

Questions?

 

We are happy to share this solution with others who are interested.

 

Contact: Darren Kraker dkraker@calpoly.edu
