Evolving a Machine Learning & Analytics Platform in Python @ Full Stack Toronto Meetup
Our stack is a Python back-end with AngularJS for the front-end. We started with a pretty simple Django app using MySQL with scikit-learn for predictive modelling. Along the way we added Celery and Redis, statsmodels for forecasting, and we’re in the process of moving our data analysis into Cassandra with Spark. This talk will focus on technology choices and how we’ve grown the system from just a few stores to thousands in less than a year.
Vantage Analytics provides analytics for merchants on Shopify and other e-commerce platforms. It takes in records of customer purchases and outputs trends, predictions, and actions merchants can take to improve their business, such as suggested PPC ad campaigns. In short: data science for small and medium-sized merchants.
A timeline of growth & change
- Product is live and has a few customers
- Stack is simple (backend has Django, some Celery tasks, MySQL, only a few VMs)
- Cron job runs a few big calculation tasks to figure out the numbers merchants want to see
- Vantage is growing by a few merchants per week
- Growth spiked for a few days after being featured by Shopify. Get bigger servers!
- No major architecture changes
The team finds bottlenecks in architecture and infrastructure as features are added and the user base grows
- The cron job that crunches numbers (the most important part of Vantage) is getting slow and memory usage is growing
- Network or disk IO is becoming a bottleneck when importing data from Shopify’s API
Time for changes!
- Hire a backend developer
- Move some number crunching out of the DB and into Python (so DB can focus on writes?)
- Start caching DB reads with Redis to reduce DB IO further
- Break giant cron job into smaller tasks, then use Celery queue to manage them
- Scale up Celery from 1 machine to a cluster
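The move from one giant cron job to many small queued tasks can be sketched roughly as follows. This is only an illustration, not Vantage's code: it uses Python's standard-library `queue` and `threading` as a stand-in for Celery and RabbitMQ so that it runs without a broker, and `crunch_merchant`, `run_all`, the merchant IDs, and the order data are all hypothetical.

```python
import queue
import threading

# Hypothetical input: order totals per merchant (stand-in for rows read from the DB).
ORDERS = {
    "merchant-a": [20.0, 35.5, 12.0],
    "merchant-b": [99.0],
    "merchant-c": [5.0, 5.0, 7.5, 10.0],
}


def crunch_merchant(merchant_id):
    """One small unit of work: compute the numbers for a single merchant."""
    totals = ORDERS[merchant_id]
    return {"orders": len(totals), "revenue": sum(totals)}


def worker(tasks, results, lock):
    """Pull merchant IDs off the queue until it is empty."""
    while True:
        try:
            merchant_id = tasks.get_nowait()
        except queue.Empty:
            return
        stats = crunch_merchant(merchant_id)
        with lock:
            results[merchant_id] = stats


def run_all(num_workers=2):
    """Enqueue one task per merchant, then let a pool of workers drain the queue."""
    tasks = queue.Queue()
    for merchant_id in ORDERS:
        tasks.put(merchant_id)

    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(tasks, results, lock))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    for merchant_id, stats in sorted(run_all().items()):
        print(merchant_id, stats)
```

In a Celery setup, each call to `crunch_merchant` would presumably be a task sent to the broker, and scaling out would mean adding worker machines rather than threads. The point of the sketch is only that small, independent tasks can be spread over as many workers as are available, which one monolithic cron job cannot.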
The stack as of now
- Django for web app framework
- MySQL for DB
- Redis for DB caching
- Celery for long running asynchronous jobs
- RabbitMQ for message queue
- AngularJS frontend
- Infrastructure will be OK with thousands of merchants signed up
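As a rough illustration of the "Redis for DB caching" piece, here is a minimal cache-aside sketch. It is an assumption about how such caching might look, not the approach described in the talk. The `FakeRedis` class is a hypothetical in-memory stand-in so the example runs without a server; it mirrors the `get(name)` and `setex(name, time, value)` calls as they appear in redis-py 3.x and later. `cached_read`, the key name, and the loader function are likewise made up for the example.

```python
import json


class FakeRedis:
    """In-memory stand-in exposing the two calls used below (get, setex).

    Unlike real Redis it never expires keys; the TTL is only recorded.
    """

    def __init__(self):
        self.store = {}
        self.ttls = {}

    def get(self, key):
        return self.store.get(key)

    def setex(self, key, ttl_seconds, value):
        self.store[key] = value
        self.ttls[key] = ttl_seconds


def cached_read(cache, key, load_from_db, ttl_seconds=300):
    """Return the cached value for key, or load it, cache it, and return it."""
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)       # cache hit: no DB round trip
    value = load_from_db()              # cache miss: do the expensive read once
    cache.setex(key, ttl_seconds, json.dumps(value))
    return value


if __name__ == "__main__":
    cache = FakeRedis()
    db_calls = []

    def load_summary():
        # Hypothetical expensive query; here it only records that it ran.
        db_calls.append("query")
        return {"merchant_id": 42, "revenue": 1234.5}

    first = cached_read(cache, "summary:42", load_summary)
    second = cached_read(cache, "summary:42", load_summary)
    print(first == second, "DB reads:", len(db_calls))
```

The time-to-live bounds how stale a cached number can get, so it is a trade-off between DB load and freshness; the 300 seconds here is an arbitrary placeholder.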
Lessons from 2014
- Measure everything! Shout out to New Relic Pro, prices are negotiable
- Use queues to run as many long tasks asynchronously as possible (this is where Celery comes in)
- MySQL datatypes are hard to change later. Get them right before the data grows large. E.g. queries involving BLOB columns can force a trip to the hard disk, which is bad for performance
- When you have to change the schema of a big MySQL table, use pt-online-schema-change to minimize downtime
- Separate infrastructure into multiple servers early for easy scaling (queue manager on 1 box, queue workers on other boxes, web app on other boxes, DBs on other boxes)
- Hosting on VM Farms was a good choice for offloading system admin tasks, especially when the product team was tiny
- Data needs will outgrow what MySQL can do at some point. Cassandra? Cassandra + Spark? Not loving Spark yet; implementing it may be too big a task for a small team. But they aren’t in a rush to move away from MySQL yet
Remember FSTO Conf is coming up fast! Get your tickets for November 22 – 23, 2014.