In this post you will learn an efficient way to use Colab on a project while keeping full control of the files locally on your own computer.
I will show you how to do it when the project is maintained in a git repository (even a private one), so that you can handle all the git actions from your computer the way you are already used to.
I will also share the code and file structure that I use, which allows a quick initialization of the Colab notebook.
The benefits that I see when using this proposed method are:
- It allows one-click re-initialization of the notebook
- It simplifies working in a team with Colab and git
- It allows running the same notebook locally, without any changes
I love Google Colab. I think Google did the ML/AI community an amazing service when they opened the option for anyone to work, test and play with Python code, ML models and concepts, totally free of charge, including GPUs, which are an important part of developing many of the architectures in common use today.
I know many, including myself, who would have thought twice and probably given up on their attempt to learn ML if they had to pay for gaining that first experience. Running GPU training can be costly, and being able to run it for free lowers the barrier for people who have the ambition but lack the resources to take the first steps.
That being said, working with Google Colab has its annoyances, which haunted me for quite some time. I don’t know about you, but in my experience it is very beneficial to stay organized when working on a project, so using version control like git is very important.
There are many problems when trying to combine Google Colab and Git.
- First, how do you load the project repository into Colab (and if the repo is private, how do you do it securely, without sharing your credentials)?
- On Colab, where do you store your notebook in relation to your repo?
- Once you make changes to your code, how do you push these changes from your Colab-hosted notebook to your repo?
- If you (or someone else in your team/group) had made changes to some files in the repo, and you want to use them in your notebook, how do you pull the updates from the repo so that they will be available to your notebook? And can you do that even without reloading the notebook?
- In Colab, each time we open the notebook it starts fresh. We need to reconnect to files stored on our gdrive, and how do we run the project when the files are stored somewhere deep inside the gdrive file system?
I believe the system that I came up with gives a good solution to all of these problems.
The Google Colab and Git system:
The first step of the solution is using Google Drive for desktop. If you haven’t checked it out yet, go ahead and do it right now, because what this application offers is a local folder on your computer that is synced to your Google Drive.
You can choose which exact folders in gdrive you would like to have on your local hard disk, so you don’t need to mirror the entire gdrive. Once you specify the folder, the sync of files between Colab and your local folder is very fast. Any change that you make in Colab is almost immediately transferred to your local files, and the other way around. So you can practically work locally with the IDE you are used to, which makes the workflow much more convenient.
This also solves the issue of how to get the repo into the gdrive folder in the first place. Once you’ve connected gdrive to your disk, create a folder that will hold your project. You can do that in gdrive or locally, it doesn’t matter: since the folder is tracked, the sync process will make sure it exists both on your computer and in the cloud. Then, from your local machine, open a shell/terminal in that folder and simply `git clone` the repo.
Since the entire repo sits in a local folder on your machine, you can use your favorite git manager to handle the repo (personally I use SourceTree).
This is the project folder structure that I use:
| —— notebooks
| —— src
|      | —— __init__.py
|      | —— other python files
You can store the project anywhere you want on your gdrive (no matter how many folder levels deep). The ‘trick’ is that we will change directory (`cd`) to the work folder once we open the notebook.
The notebook that we run in Colab is the one stored in the `notebooks` folder in the repo. This means that any change you make to the notebook in Colab will be synced to your local Google Drive folder, which lets you push the changes to the repo server from your local machine.
One important thing to remember: Even though the notebook is located in the ‘notebooks’ folder, the working folder when running the notebook doesn’t have to be!
In fact, using the root folder of the project as the working folder makes a lot more sense, since it gives access to the files in the `src` folder. The reason for the `__init__.py` file is that it marks `src` as a Python package, which allows using normal Python imports for any file located inside the folder.
To streamline the process of changing the directory to the root folder, I recommend using a config file (credentials.colab.ini). I store this file locally, and whenever I connect to a new Colab notebook I manually upload it to the `/content/` folder (the root folder of the Colab notebook instance), by simply dragging and dropping the file from my local folder onto the UI. See image below.
Pay attention that the actual credentials.colab.ini must not be pushed to the git repo! Depending on your project, you can also use this file to store any other important configuration that you may need.
An important role of the config file is to hold the root path of the project. I then use the awesome Python configparser module to read the config file, so that the configuration parameters are accessible anywhere in the notebook.
import configparser
from os import path
from importlib import reload

WANDB_enable = False

# Try the local config first, then the Colab one; the first file found wins.
creds_path_ar = ["../credentials.ini", "credentials.colab.ini"]
root_path = ""
data_path = ""

for creds_path in creds_path_ar:
    if path.exists(creds_path):
        config_parser = configparser.ConfigParser()
        config_parser.read(creds_path)
        root_path = config_parser['MAIN']["PATH_ROOT"]
        data_path = config_parser['MAIN']["PATH_DATA"]
        ENV = config_parser['MAIN']["ENV"]
        break
Other settings you might find beneficial to add to the configuration file are the data location (if you store the data outside of your repo) and the location of embedding files. NLP embedding files take up a lot of disk space, and since I have more than one project that uses them, I prefer to store them in a single location, so that I won’t have multiple giant copies of them in my gdrive, which is free only up to 15 GB.
For this reason, keep a `credentials.example.ini` in your repo, which holds only a template of the possible parameters. Each member of the team makes their own copy of the file and updates the values they need inside it. The actual credentials file goes into .gitignore so that it will not end up in the repo by mistake.
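The corresponding .gitignore entries, covering both config file names used in this post, would be:

```
credentials.ini
credentials.colab.ini
```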
Reason for this line:
creds_path_ar = ["../credentials.ini", "credentials.colab.ini"]
This line allows using the notebook from Colab and also running the exact same notebook with a locally hosted Jupyter engine. When you open the notebook locally, the working folder is the one where the notebook is located. So create another configuration file to be used when running the notebook locally, and put it in the root folder of the project.
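A minimal local `credentials.ini` might look like this (the relative paths are my assumption here; adjust them to your own layout — the paths are resolved relative to the `notebooks` folder, where the local notebook runs):

```ini
[MAIN]
PATH_ROOT = ..
PATH_DATA = ../data
ENV = LOCAL
```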
A typical configuration file looks like this:
[DEFAULT]
WANDB_ENABLE = FALSE
ENV = LOCAL

[MAIN]
WANDB_LOGIN = ……………………………
PATH_ROOT = /content/drive/My Drive/WORK/ML/MyProject
PATH_DATA = /content/drive/My Drive/WORK/ML/data
ENV = COLAB
As you can see, you can use configparser’s DEFAULT section if you need it. Another important parameter in the configuration file is the ENV variable, which lets you control code that should run only on Colab, or only locally.
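As a minimal sketch of how that gating looks in a notebook cell (the variable name matches the snippet above; the value is hard-coded here only so the sketch runs standalone):

```python
# ENV normally comes from the config file: "COLAB" on Colab, "LOCAL" otherwise.
ENV = "LOCAL"  # hypothetical value for this standalone sketch

if ENV == "COLAB":
    # Colab-only setup (e.g. mounting Google Drive) would go here
    print("Running on Colab")
else:
    print("Running locally, skipping Colab-only setup")
```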
One piece of code that definitely needs to run only on Colab is the following, which mounts gdrive onto the notebook instance.
from google.colab import drive
drive.mount('/content/drive')
When running the notebook for the first time after connecting to the Colab instance, you will need to authorize the access. But until you get completely disconnected from Colab, if you need to restart the instance you can simply rerun the cell, and the drive will already be mounted.
Finally, both locally and on Colab, we change to the work folder by running this cell:
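The cell isn't reproduced in the original; a minimal sketch, assuming `root_path` was read from the config file as shown earlier (the hard-coded path below is just the example value from the config, so this sketch runs standalone):

```python
import os

# Hypothetical value; in the notebook this comes from the config file.
root_path = "/content/drive/My Drive/WORK/ML/MyProject"

if os.path.isdir(root_path):  # the path only exists on Colab with gdrive mounted
    os.chdir(root_path)       # make the project root the working directory
print("Working directory:", os.getcwd())
```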
That concludes the setup. Running these cells at the beginning of the notebook gives an easy, one-click setup of the notebook directly from git!
If there is more than one person in the team, this method lets them all share the same notebook and run it however they want: on Colab, or even locally if they have access to better resources.
Additional tips to use this method effectively
A few more subtle productivity tricks worth mentioning will improve your workflow when using this system:
Immediate gdrive file open without the cumbersome navigation in gdrive UI.
Have you ever tried to navigate to a deep folder somewhere inside your gdrive? In the past I used to get really frustrated, because it is a slow process: you navigate the folders one by one, and each folder change takes a few seconds. You might think I’m crazy, but once you do it a few times a day it starts to add up. With Google Drive for Desktop you can jump directly to a folder/file location, which saves you these expensive seconds. The option is in the right-click context menu of your file explorer, see image below.
Working locally, pulling changed files, and reloading modules
Another benefit of having your files locally is that you can edit the normal Python files used by your notebook in an IDE. What I mean is that if you structure your project correctly, the most important functions should live inside a Python module that is imported into the notebook.
Stripping methods out of the notebook into Python files is good practice anyway, as it supports a more organized code structure and increases the reusability of the code. Combined with this Colab system, it also adds the ability to make changes to the files locally. I can open the project straight from my Google Drive local folder in VS Code, and any change I make to the files is synced to gdrive.
For example, I can have experiment_utils.py inside the src folder and I can import it with:
from src import experiment_utils as utils
and then call utils.myFunc() inside the notebook.
You can even make changes on the fly: you don’t need to restart the notebook on every change to your module files. To achieve that, put these lines at the beginning of your notebook:
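The lines aren't shown in the original; presumably they are the standard IPython autoreload extension, which re-imports changed modules before each cell executes:

```
%load_ext autoreload
%autoreload 2
```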
If you are still having trouble, you can use the reload method:
from importlib import reload
reload(utils)
(Just create a new code cell and execute it once for the module you are trying to reload. Rerunning the cell that imports the module can also help.)
Of course, if someone else changed the repo, you can pull the changes on your local computer, and the updates will be automatically synced to gdrive immediately. Google Drive for desktop shows a little green check-mark icon next to the folder/file name when the sync is complete.
I decided to call my Python file folder `src`. You can choose another name, but be careful: don’t name it `code`, since the Colab environment has a pre-existing code.py file that will give you a strange error message. (Took me some time to figure this one out.)
I created a template of the project that is explained in this post. You can access it from my github account: hershkoy/colab_git_template.
I really hope you have found this post informative. If you liked it, give it some likes and share it with whoever you think can benefit from it. If you find a way to improve this system further, let me know!