Datalab: Running notebooks against large datasets


Streaming your big data down to your local environment is slow and expensive. In this episode of AI Adventures, we'll look at how to bring a notebook environment to your data instead!
What's better than an interactive Python notebook? An interactive Python notebook with fast and easy data connectivity, of course!

We've already seen how useful Jupyter notebooks are. This time, we'll take things further by running notebooks in the cloud, with some extra goodies thrown in.

Data, but big

As you work with larger and larger datasets in the cloud, it becomes increasingly impractical to interact with them from your local machine. It's hard to download a statistically representative sample of the data to test your code against, and relying on a stable connection to stream data for local training is fragile. So what's a data scientist to do?
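To see why a casual sample can mislead you, consider that a naive random sample of a skewed dataset can miss rare classes entirely. Here is a minimal pandas sketch (with made-up toy data standing in for a table far too big to download) of drawing a stratified sample that preserves class proportions:

```python
import pandas as pd

# Toy stand-in for a large table: 90% class "a", 10% class "b".
df = pd.DataFrame({
    "label": ["a"] * 90 + ["b"] * 10,
    "value": range(100),
})

# A plain df.sample(frac=0.1) could easily miss the rare "b" class;
# sampling within each group keeps the 90/10 split intact.
sample = df.groupby("label").sample(frac=0.1, random_state=0)

# The sample keeps each class's share: 9 rows of "a", 1 row of "b".
print(sample["label"].value_counts())
```

Even then, a carefully drawn sample only helps you check code; it doesn't solve training on the full dataset over an unreliable connection.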

If you can't bring your data to your computer, bring your computer to your data! Let's see how to run a notebook environment in the cloud, close to your dataset!

Google Cloud Datalab is built on top of the familiar Jupyter notebook, with some added capabilities, including easy authentication against your BigQuery datasets, fast operations on Google Cloud Storage, and SQL-query support. The toolkit is also open source on GitHub, so you can run it in your own environment.

We're going to create a Datalab environment and set it up to run our notebooks in the cloud.
Install Datalab via the gcloud CLI with `gcloud components install datalab`. You'll then have a new command-line tool called `datalab`.

Datalab installation is a single-command operation

Starting Datalab is a one-line command: `datalab create <instance-name>`

Datalab still connects through localhost!

This command spins up the virtual machine you'll use for your analysis, configures the network, and installs the libraries we'll rely on, including TensorFlow, pandas, NumPy, and more.

Once Datalab has started, it opens up a notebook environment that looks quite similar to what we get with Jupyter notebooks. However, instead of running locally, it's running on a virtual machine in the cloud. Datalab includes some sample notebooks by default, which makes it a good place to start exploring. Let's look at the Hello World notebook in the docs folder.

Here we can immediately start playing with the notebook, running and editing cells. This is very convenient, because there's no need to manage and configure Python libraries ourselves.
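As an illustration (this is not the sample notebook's own content, just a hypothetical first cell), you might run something like this to confirm the preinstalled scientific stack is working:

```python
import numpy as np
import pandas as pd

# Quick sanity check: the preinstalled libraries import and run.
s = pd.Series(np.arange(5))
print(s.sum())  # 0 + 1 + 2 + 3 + 4 = 10
```

Since the libraries ship with the environment, there's nothing to `pip install` before a cell like this runs.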

Let's look at some more of the tools that are built in. Under the account icon in the upper-right corner, there are a number of settings and useful pieces of information to note.

Note first that the notebook is running as a service account. The service account is already authenticated against the resources of our own project, but if we want to access resources from another project, we must grant access to the service account, not to our user account.

Since the virtual machine running the notebook is accessible to anyone who can access the project, we don't want to leave our personal account credentials sitting in a Datalab notebook.

Continuing down, we see that the notebook is running on a Google Compute Engine virtual machine called ai-adventures, and we can shut down the VM at any time by clicking this button.

By default, Datalab shuts down your virtual machine once it has been idle for 90 minutes. You can toggle this feature by clicking the timeout message.

The timeout can also be set to a custom value; head to the settings to see how. The value set here persists across virtual machine restarts, and if it's set to zero, the VM will not shut down automatically at all.

This is also where we can choose light or dark themes.

Now that we have our Datalab notebook set up and are familiar with the environment, let's see what we can do with Datalab!

An example of Datalab in action

Let's walk through an example that looks at the correlation between the programming languages used on GitHub. That is: "If you program in language A, are you likely to also program in language B?" The sample notebook is in the docs directory, and you can also check it out on GitHub.

This analysis uses only a small sample of the large GitHub public dataset. If you'd like to work with the full GitHub commit history, you can check out the dataset here, along with the guide to working with it.
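The heart of that question, stripped of the BigQuery plumbing, is counting which language pairs show up for the same user. Here is a self-contained pandas sketch with made-up toy data (the real notebook derives the per-user languages from the GitHub public dataset instead):

```python
import itertools
from collections import Counter

import pandas as pd

# Toy stand-in for "which languages does each user commit in".
commits = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2", "u2", "u3"],
    "language": ["Python", "JavaScript", "Python", "Go", "JavaScript", "Go"],
})

# For each user, count every unordered pair of languages they use.
pair_counts = Counter()
for _, langs in commits.groupby("user")["language"]:
    for pair in itertools.combinations(sorted(set(langs)), 2):
        pair_counts[pair] += 1

# Pairs seen for more users suggest stronger co-occurrence.
print(pair_counts.most_common())
```

With real data you'd normalize these raw counts (for example, by how common each language is overall) before reading anything into them, since popular languages co-occur with everything.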

Conclusions and next steps

Datalab's cloud-connected notebooks are a great way to get closer to your data, with convenient connections to services like BigQuery and easy authentication against your datasets in the cloud.

Give Datalab a try and see if it's the right option for you!