The Best ML Tool I’ve Used
Why BigQuery is the great enabler in the field of machine learning.
About 18 months ago, the engineering team at Apteo switched from creating machine learning models in Python to using BigQuery ML.
That move has changed our lives.
Well, no, but it has made a huge impact on our ability to move quickly, train and serve models at scale, and get some nice features like AI explainability with minimal effort.
My ML origins
When I first started creating basic ML algorithms, I was a junior in college, and we were using a tool called “S” in an IDE that looked a bit like RStudio. A few years later, I started using MATLAB and R. Then I moved on to Python, which made life a lot easier.
A lot of the ML projects I’ve done have had to be productionized, so it made sense to create them in a widely used, object-oriented language that just happened to have some great ML tools.
During the early days of Apteo, we created a lot of reusable code for data pipelines, transformations, model creation, and model evaluation. While I’m still proud of some of the highly generic code we created to make wide recurrent networks that could understand natural language, it was a good amount of work for a small team to undertake.
And dealing with large amounts of data was a real problem.
After a few pivots, we found ourselves needing to create new ML models again, but this time we wanted to reduce the time it took us to iterate and deliver new product.
My cofounder found us BigQuery, Google’s amazing data warehouse that handles large amounts of data with minimal latency and, as an added bonus, has built-in ML capabilities.
We started building our new models on that and it has been a real boon to productivity.
Why I like it
There are a few really nice things about BigQuery ML. First, it’s made it easy to create new models using SQL. Using a standard language that I’m familiar with, I can easily spin up a new training session and create a model, fast.
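To give a feel for what "creating a model using SQL" looks like, here is a minimal sketch that builds a BigQuery ML `CREATE MODEL` statement from Python. The dataset, table, and column names (`my_dataset.customers`, `churned`, and so on) are made up for illustration, and the helper `create_model_sql` is hypothetical, not something from our codebase.

```python
def create_model_sql(model, source_table, label, features, model_type="LOGISTIC_REG"):
    """Build a BigQuery ML CREATE MODEL statement as a plain string.

    All names passed in are hypothetical; swap in your own dataset and columns.
    """
    columns = ",\n  ".join([*features, label])
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = '{model_type}', input_label_cols = ['{label}']) AS\n"
        f"SELECT\n  {columns}\n"
        f"FROM `{source_table}`"
    )


sql = create_model_sql(
    model="my_dataset.churn_model",          # assumed model name
    source_table="my_dataset.customers",     # assumed training table
    label="churned",                         # assumed label column
    features=["plan", "monthly_spend", "tenure_months"],
)
print(sql)

# To actually train, you would hand the string to the BigQuery client, e.g.:
#   from google.cloud import bigquery
#   bigquery.Client().query(sql).result()
```

The whole training job is that one statement: the `SELECT` defines the training data, and the `OPTIONS` clause picks the model type and the label column.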
That brings me to my second point: speed. Whereas it might have taken us 12 hours to train a model in a custom-built Python application, BigQuery can do the same thing in 35–45 minutes. Not having to wait to see the results of a model is huge, especially when you’re trying to iterate fast or, worst case, when there’s a bug in the dataset creation process and you have to start over.
It’s also nice because it has several built-in features that you’d normally have to handle yourself when training a new model: implicit data transformations (one-hot encoding, standardization, etc.), evaluation metrics, and explainability metrics.
It can also be used to serve models (as long as you don’t need sub-second latencies or real-time predictions). We use it to serve our models by creating batch prediction jobs and storing the results in new BigQuery datasets, which we can then use for aggregation, analysis, or serving key results to our end users.
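For a sense of what those implicit transformations save you, here is a rough sketch of one-hot encoding and standardization done by hand in plain Python. This is only an illustration of the manual steps, not BigQuery ML's internal implementation (for instance, I'm assuming population standard deviation here), and the sample values are invented.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Turn a categorical column into one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def standardize(values):
    """Rescale a numeric column to zero mean and unit (population) std dev."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no signal; map everything to 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


plans = ["basic", "pro", "basic"]   # made-up categorical feature
spend = [10.0, 20.0, 30.0]          # made-up numeric feature

encoded = one_hot(plans)
scaled = standardize(spend)
print(encoded)
print(scaled)
```

In a hand-built pipeline you also have to persist the categories, means, and standard deviations from training so you can apply the same transformation at prediction time, which is exactly the kind of bookkeeping it's nice not to own.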
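A batch prediction job along those lines can be sketched as a single query that runs `ML.PREDICT` and writes the output to a new table. As before, this is a hedged example rather than our production code: the helper `batch_predict_sql` and every dataset, table, and model name are assumptions.

```python
def batch_predict_sql(model, input_table, output_table):
    """Build a query that scores every row of input_table and stores the result.

    All names are hypothetical placeholders.
    """
    return (
        f"CREATE OR REPLACE TABLE `{output_table}` AS\n"
        f"SELECT *\n"
        f"FROM ML.PREDICT(\n"
        f"  MODEL `{model}`,\n"
        f"  (SELECT * FROM `{input_table}`)\n"
        f")"
    )


predict_sql = batch_predict_sql(
    model="my_dataset.churn_model",               # assumed trained model
    input_table="my_dataset.customers_to_score",  # assumed rows to score
    output_table="my_dataset.churn_predictions",  # assumed results table
)
print(predict_sql)

# Running it works the same way as any other query, e.g.:
#   from google.cloud import bigquery
#   bigquery.Client().query(predict_sql).result()
```

Because the predictions land in an ordinary table, the downstream aggregation and analysis are just more SQL.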
What I’m hoping for next
While it’s an awesome tool, and, as you probably guessed, the best all-around ML tool I’ve used, there are some things that I’m hoping to see in the future.
First, even though it scales to large datasets, it wasn’t able to handle datasets as large as I would have imagined. The last time I used it, it errored out on a dataset of 100M records, which, while a lot, isn’t really that much in the world of machine learning.
Second, building DNNs in it isn’t as flexible as using something like TensorFlow. It’s hard to configure each individual layer in the model (though you could reasonably argue that most models shouldn’t necessarily be configured with a ton of layers). It also doesn’t support wide recurrent networks at the time of this writing, nor does it have built-in embeddings for NLP. All of these would be nice to have, but they’re also things I can easily live without for now.
I’d also love to see the cost lowered. While it has saved us a ton of development time, we pay for it in terms of money, and there have been a few times where I’ve been told to use reserved instances of BQML (rather than the on-demand version, which is what we have now). Suffice it to say, the reservations they offer are much pricier than what I’d be looking to pay at the moment.
All in all, it’s awesome that Google has provided such a nice tool for data science, an area where we all know that better tooling is badly needed. I highly recommend it for your next project.