Big data for training models in the cloud

When you have so much data that you can't run training on your local machine, or the dataset is bigger than your hard drive, it's time to look at other options.

To the cloud

One concrete option is to move machine learning training to another computer with access to more storage, freeing up your hard drive space and allowing you to work on other things while training runs.

Let's break down the parts that need to move to the cloud. It is useful to think of our training as needing two primary resources: compute and storage.

The interesting thing here is that we don't have to tie these two together as tightly as you might expect. We can decouple them, which means we can take advantage of systems specialized for each. This can make a big difference in efficiency when dealing with big data.

Compute workloads are easy enough to move, but moving large datasets can be a bit more involved. However, if your data is really large, the results are worth the effort, because it allows multiple machines to access the data in parallel while working on your machine learning training.

Moving data to the cloud

Google Cloud Platform has some easy ways to work with these abstractions. First, we want to make sure our data is stored in Google Cloud Storage, or GCS. We can do this using a variety of tools.


gsutil

For small to medium datasets, use gsutil. It is a command-line tool designed specifically for interacting with Google Cloud Storage, and it supports a -m option that sends multiple streams in parallel, increasing transfer speeds.
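As a rough sketch, a parallel upload of a local dataset directory might look like this (the source directory and bucket name here are placeholders; substitute your own):

```shell
# Hypothetical paths: replace ./training-data and the bucket with your own.
# -m parallelizes the transfer across threads/processes;
# cp -r copies the directory recursively.
gsutil -m cp -r ./training-data gs://my-ml-datasets/
```

The -m flag helps most when the dataset is made up of many files, since each file can be sent on its own stream.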

Google Transfer Appliance

If your data is too large to practically send over the network, use the Google Transfer Appliance, a physical machine that can securely capture and transfer your dataset, up to a petabyte at a time.

With a typical network bandwidth of 100 megabits per second, it would take years to upload a petabyte of data over the network! Even if you have a 1 gigabit connection, it would still take months! Who wants to wait that long!? The Transfer Appliance, on the other hand, can capture an entire petabyte of data in 225 hours. That's crazy!
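A quick back-of-the-envelope check of those numbers (a sketch that assumes 1 PB = 8 × 10^15 bits and ignores protocol overhead):

```python
# Back-of-the-envelope transfer times for 1 petabyte of data.
PETABYTE_BITS = 8e15   # 1 PB = 1e15 bytes = 8e15 bits
SECONDS_PER_DAY = 86400

def transfer_days(bandwidth_bits_per_sec):
    """Days needed to move 1 PB at the given bandwidth (no overhead)."""
    return PETABYTE_BITS / bandwidth_bits_per_sec / SECONDS_PER_DAY

print(round(transfer_days(100e6)))  # 100 Mbps -> ~926 days, about 2.5 years
print(round(transfer_days(1e9)))    # 1 Gbps   -> ~93 days, about 3 months
```

Even at a full gigabit, a petabyte takes roughly a quarter of a year, which is why shipping a physical appliance wins at this scale.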


Now that our data is in the cloud, we're ready to scale our machine learning training. But that's a whole topic of its own! Fear not, we will cover it in the next episode.

Training machine learning models on big datasets can be challenging with limited compute and storage resources, but it doesn't have to be! By moving your data to the cloud, using either gsutil or the Transfer Appliance, you can train on large datasets without the wait.