We have been thinking of sharing some of our platform experiences with the developer community for a while and the announcement of AWS EC2 Spotathon 2012 provided just the right push to start it off. Watch this space for updates on how the Shufflr platform uses EC2 spot instances to its advantage in various subsystems and use cases.
Shufflr is a multi-screen, personalised social video discovery service powered by its platform on the AWS cloud (see shufflr.tv and products).
Over 4 million users across 170 countries discover online videos of their interest every day on Shufflr’s iPad, iPhone, Android, Windows 8 and Facebook apps.
Shufflr takes the ‘Daily Fix’ approach for delivering a personalised discovery experience to its users – with an algorithmic combination of various signals such as the users’ social graphs, hot topics / buzzing videos / celebrity activity on the web, popular content sources, live events etc.
The Shufflr Platform
The Shufflr platform on AWS comprises many independently scalable services such as user signup, authentication, authorisation, online video indexing, algorithmic filtering and classification of videos based on metadata, making and tracking social graph connections, social data fetching, per-user and per-group personalisation algorithms, creating and delivering activity streams, developer APIs for apps etc. The platform services communicate with each other over well-defined API contracts and SLAs.
AWS EC2 spot instance use cases in the platform
Shufflr platform services of a non-realtime, offline data-fetch / compute nature have flexible SLAs and are prime candidates for cost optimization with AWS EC2 spot instances (though the overall cost is planned and managed as a function of reserved, on-demand and spot instances). Some such spot-instance friendly services in Shufflr are social graph data fetchers, video indexers and filters, background (asynchronous) tasks, test (staging / pre-production) environments etc.
While it is typical to think of spot instances purely as cost savers, we see them as a reminder to ask sound architectural questions such as:
- Which tasks can be delayed without affecting the product requirements, and up to when can they be delayed? (Delaying them not only frees up web/app/DB servers to cater to low-latency requests but also has the positive side effect of giving us the flexibility to wait for the right spot price.)
- How much and how frequently should background tasks persist intermediate state? (Doing so lends itself to robust, fault-tolerant designs, which incidentally become amenable to the unpredictable and ephemeral nature of spot instances.)
- Should a given background task complete quicker (by launching more instances in parallel for the same cost), to help other services down the compute chain make their decisions faster?
Spot scaler code snippets
We also find that a probably-not-so-obvious aspect of spot instances is that they lend themselves to two conceptually different use cases:
- Doing a job at the lowest possible cost and
- Doing more jobs for a set cost.
The latter use case helps us innovate better, because it gets us thinking along the lines of:
“These instances cost less and we have budgeted for the worst-case of on-demand pricing. So, let’s see if we can infer any more patterns by fetching and mining a bit more social data / compute a few more iterations to improve recommendation accuracy.”
Use case details
Shufflr’s service oriented architecture lends itself to applying independent cloud deployment / configuration strategies and scaling up/down individual sub-systems as required. We will outline some examples of our spot usages in this context.
- Social graph data fetch with priorities & time-bounds: Users connect their Facebook and Twitter accounts to Shufflr so that they can get all the videos shared by their friends. When a new user signs up, Shufflr fetches such video posts from his/her social graphs ‘asap’ (so that the user can make an easy emotional connection with the app), indexes the video metadata from the source site into Shufflr and adds it to the user’s social stream – all asynchronously. In parallel, taste-inference algorithms run through the user’s social graph data to start making personalised video recommendations. The key here is to be able to scale all these asynchronous activities according to varying new user signup loads along with current daily, weekly and monthly active users (in decreasing order of priority for resource allocation as well as job-completion latency). For instance, it is good to fetch the social graph data of a new user within a few seconds of his/her signing up, so that job is scheduled at a higher priority than that of, say, a weekly active user. We have a custom, multi-level queue scheduling algorithm to address this. The job queues, scheduler and dispatcher run on on-demand instances, whereas large numbers of workers run on spot instances. The number of spots is a function of the job queue length, and a feedback loop assesses instance/worker health and launches replacements for terminated spots (due to a price spike, say). To maintain a minimum assured throughput, a few workers run on on-demand instances too. We initially used MIT StarCluster, but eventually switched to a custom scheduler-dispatcher-autoscaler implementation.
From the drawing board: Social Graph Video Fetchers with spot workers in 'red'
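To make the priority idea concrete, here is a minimal Python sketch of a strict-priority job queue with first-in-first-out ordering inside each level. It is not the actual Shufflr scheduler, whose algorithm is not described here; the class names, the user classes and their numeric priorities are all assumptions made for illustration.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Hypothetical priority levels (assumption): lower number is dispatched first.
PRIORITY = {
    "new_user": 0,
    "daily_active": 1,
    "weekly_active": 2,
    "monthly_active": 3,
}


@dataclass(order=True)
class FetchJob:
    priority: int
    seq: int                              # tie-breaker: keeps FIFO order within a level
    user_id: str = field(compare=False)   # payload, never used for ordering


class MultiLevelQueue:
    """Strict-priority queue: a level is served only when all higher levels are empty."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def enqueue(self, user_id, user_class):
        job = FetchJob(PRIORITY[user_class], next(self._counter), user_id)
        heapq.heappush(self._heap, job)

    def dispatch(self):
        """Return the user_id of the next job, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap).user_id

    def __len__(self):
        return len(self._heap)
```

With strict priority, a steady stream of new-user jobs could starve the lower levels; a production scheduler would likely add ageing or per-level quotas, which this sketch leaves out.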
- Per-geography, context sensitive, hot topic videos’ map reducer: Shufflr’s users are spread across 170+ countries. One of the video discovery anchors in Shufflr is videos about hot (trending) topics of the hour. For this, Shufflr uses Twitter trends, amongst other signals, to infer the hot topics of the hour. A Twitter trend, being just a keyword (say, ‘Obama’), does not come with any context (say, ‘US Election 2012’). Shufflr uses certain patent-pending algorithmic techniques to infer and attach context to trends and uses that information to fetch video links and metadata from various online video sources. Spot instances are used here in a custom map-reduce implementation that runs multi-threaded workers over a filtered data stream from the Twitter firehose. Portions of this system – a Redis server for sorted data sets, nginx-fronted raw-FS servers for tweet storage etc. – run on on-demand instances as they hold critical state. A pending area of parallelization is to launch one set of spots per geography (woeid) and cover all 170+ countries within the first few minutes of every hour.
Location & context aware, hot topic video fetchers; spot workloads shown in 'red'
- Video queue processors: Shufflr has an ever-growing index of over 110 million online videos. As time passes, some of the video URLs in this index become stale. So, periodically, these videos are queued to a ‘Stale video remover’, which launches a number of spot instances as a function of ‘job queue depth, instance type and worker throughput’. This is a pure background, best-effort, non-realtime task with no latency bounds and hence ideal for spot instances. The job queues, along with some additional state, are maintained on on-demand instances. Another use case along similar lines is the background task of fetching associated videos for a given video in the index. In this case, the video queue and the associated video fetch workers together create an ever-growing tree of indexed videos, recursively and in a breadth-first manner. Spot instances are used for running the workers, while a large MySQL DB holds the video index along with the video metadata. Since the queue depth increases exponentially here, a cap is placed on the number of spots.
From the drawing board: Associated Videos Fetchers; spot workers shown in 'red'
From the drawing board: Stale Videos Removers; spot workers shown in 'red'
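The idea of sizing the spot fleet from ‘job queue depth, instance type and worker throughput’, with a cap on the number of spots, can be sketched as below. This is only an illustration under stated assumptions: the function name, its parameters (including `target_hours`, the time in which we would like the queue drained) and the default cap are hypothetical and not taken from the Shufflr platform.

```python
import math


def spots_needed(queue_depth, jobs_per_worker_hour, workers_per_instance,
                 target_hours, on_demand_workers=0, max_spots=50):
    """Estimate how many spot instances to run for the current queue.

    queue_depth          -- jobs currently waiting
    jobs_per_worker_hour -- measured throughput of one worker
    workers_per_instance -- depends on the instance type
    target_hours         -- how quickly we would like to drain the queue (assumption)
    on_demand_workers    -- baseline workers that are always running
    max_spots            -- hard cap, useful when the queue grows exponentially
    """
    if jobs_per_worker_hour <= 0 or workers_per_instance <= 0 or target_hours <= 0:
        raise ValueError("throughput, workers per instance and target must be positive")
    if queue_depth <= 0:
        return 0

    # Jobs the always-on on-demand workers will finish within the target window.
    baseline_capacity = on_demand_workers * jobs_per_worker_hour * target_hours
    remaining = max(0, queue_depth - baseline_capacity)

    # Jobs one spot instance can finish within the target window.
    per_instance = jobs_per_worker_hour * workers_per_instance * target_hours
    return min(max_spots, math.ceil(remaining / per_instance))
```

For example, a queue of 100,000 jobs with 4 workers per instance, 500 jobs per worker-hour and a 2-hour target works out to 25 instances; with a cap of 10 the function returns 10 and the queue simply takes longer to drain.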
Our video queue tracking charts
Yes, we like drawing them by hand and sticking them in the aisle for all to see.
Daily video queue tracking charts - 1. // medium: sketch ink on paper, display: glass wall in the aisle.
Daily video queue tracking charts - 2. // medium: sketch ink on paper, display: glass wall in the aisle.
Spot workers on the video queue
(To illustrate the workers running in parallel, we need to get a better composite screenshot. We hope to post an update soon.)
Video queue jobs being processed by spot workers
Just this year, we have seen 75% cost savings to date over 51,000 spot instance hours (normalised to m1.small equivalents, across the various spot instance types we’ve used in multiple availability zones). This is only one slice of our AWS platform costs but, as they say, every thousand$ counts.
Notes on bidding
These cost savings came about through a combination of experiments on our bidding strategy. Initially (when there weren’t many AZs and regions), we used to bid for spots within our main AZ (us-east-1c) at on-demand prices. In general, we would obtain the instances at about one-third of that price. Sometimes we would notice that the spot prices were shooting higher than on-demand in our AZ, while other AZs continued to give lower prices (notably, us-east-1e).
We slightly altered the strategy to move across AZs when our bid wouldn’t succeed at on-demand pricing for a given timeout duration. That gave us a fairly good hold on the spot costs.
Of late, the AWS spot price history and current price APIs have reduced the guess-work by letting us bid in the lowest cost AZ. Someday, we will rig up the perfect price bidding strategy algorithm (and use the same to pick winning stocks in the stock market).
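A minimal sketch of the AZ selection step described in these notes: bid at the on-demand price, and place the bid in whichever AZ currently has the lowest spot price at or below that bid. The function name and the price figures in the usage note are made up for illustration; in practice the current prices would come from the AWS spot price APIs, which this sketch does not call.

```python
def choose_az(spot_prices, on_demand_price, preferred_az=None):
    """Pick the AZ to bid in.

    spot_prices     -- dict of AZ name -> current spot price (from a price feed)
    on_demand_price -- our bid ceiling; we bid at the on-demand price
    preferred_az    -- main AZ, used only to break price ties

    Returns (az, bid), or None when every AZ is priced above the bid.
    """
    affordable = {az: price for az, price in spot_prices.items()
                  if price <= on_demand_price}
    if not affordable:
        return None  # caller can wait and retry after a timeout

    # Lowest price wins; on a tie, stay in the preferred AZ, then sort by name.
    best_az = min(
        affordable,
        key=lambda az: (affordable[az], az != preferred_az, az),
    )
    return best_az, on_demand_price
```

With illustrative prices of 0.09 in us-east-1c, 0.03 in us-east-1e and 0.05 in us-east-1a, and a bid of 0.08, the function picks us-east-1e; if all three were above 0.08 it would return None and the caller would hold off.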
Notes on scaling down spots
While scaling down spots, our auto-scale strategies (a) turn down only those spots that are about to complete an instance hour and (b) utilise a spot-instance as long as possible for the current hour, before terminating.
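Rule (a) can be sketched as a small selection function. It assumes instances are billed per instance hour measured from launch time, which is what the strategy above implies; the function name, the five-minute window and the instance ids in the usage note are hypothetical.

```python
from datetime import datetime, timedelta


def spots_to_terminate(launch_times, now, excess, window_minutes=5):
    """Choose which surplus spots to turn down.

    launch_times   -- dict of instance id -> launch datetime
    now            -- current datetime
    excess         -- how many instances we no longer need
    window_minutes -- only instances this close to the end of their
                      current instance hour are eligible (assumption)
    """
    candidates = []
    for instance_id, launched in launch_times.items():
        elapsed = (now - launched).total_seconds()
        if elapsed < 0:
            continue  # launch time in the future: ignore bad data
        # Seconds left in the instance hour that is currently running.
        remaining = 3600 - (elapsed % 3600)
        if remaining <= window_minutes * 60:
            candidates.append((remaining, instance_id))

    # Closest to the hour boundary first, so paid-for time is used up.
    candidates.sort()
    return [instance_id for _, instance_id in candidates[:max(0, excess)]]
```

Run periodically, this leaves a surplus instance working until the last few minutes of its current hour, which also covers rule (b): an instance that is 20 minutes into its hour is never picked, however large `excess` is.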
Performance use cases
There are certain job queues in the Shufflr platform which are serviced by background workers, even though the nature of those jobs is different from the typical relaxed-latency jobs in other subsystems. These job queues are meant for asynchronous processing of elements in the web request chain (so that our app servers – Nginx+Passenger+Rails – are free to service the next web requests). Such asynchronous processing has stricter latency bounds in our use case and hence needs a very healthy ratio of workers to jobs to provide predictable throughput, even during user load spikes beyond the upper control limit. To maintain such a ratio and finish servicing these job queues in the shortest possible duration, an array of spots with workers is launched in parallel.
We have been using spot instances for a long while and we are very happy with the cost savings as well as the spot-friendly elements we have built (and continue to build) into our platform architecture. As soon as we can get some free cycles, we would love to share these with the open source community.