Notebooks are becoming the de facto standard for prototyping and analysis among Data Scientists. Many cloud providers offer machine learning and deep learning services in the form of Jupyter notebooks. Other players have now begun to offer cloud-hosted Jupyter environments with similar storage, compute and pricing structures. The main differences between them tend to be multi-language support and version control options, which allow Data Scientists to share their work in one place.
The Increasing Popularity of Jupyter Notebook Environments
Jupyter notebook environments are now becoming the first destination in the journey to productizing your data science project. The notebook environment allows us to keep track of errors and maintain clean code. One of its best features, although simple, is that a notebook runs code cell by cell and stops at the cell that raises an error, so the failure is easy to locate. When a long script is run end to end in a regular IDE, it can take more time to go back and work out where the error occurred.
Many cloud providers and other third-party services see the value of a Jupyter notebook environment, which is why many companies now offer cloud-hosted notebooks that are accessible to millions of people. Many Data Scientists do not have the necessary hardware for conducting large-scale Deep Learning, but with cloud-hosted environments the hardware and backend configuration are mostly taken care of, which leaves the user to configure only their desired parameters, such as CPU/GPU/TPU, RAM and cores.
- MatrixDS is a cloud platform that provides a social-network-type experience combined with GitHub, tailored for sharing your Data Science projects with peers. It provides some of the most used technologies, such as R, Python, Shiny, MongoDB, NGINX, Julia, MySQL and PostgreSQL.
- It offers both free and paid tiers. The paid tier is similar to what is offered on the major cloud platforms, whereby you pay by usage or time. The platform provides GPU support as needed, so that memory-heavy and compute-heavy tasks can be accomplished when a local machine is not sufficient.
To get started with a Jupyter notebook environment in MatrixDS:
- Sign up for the service to create an account. It should be a free account by default.
- You will then be taken to a Projects page. Here, click on the green button in the top right corner to start a new project. Give it a name and description and click CREATE.
- You will then be asked to set some configurations, such as the amount of RAM and the number of cores. Because it is a free account, you will be limited to 4 GB of RAM and a 1-core CPU.
- Once completed, you will be taken to a page where your tool of choice (a Jupyter Notebook instance) is configured and made ready.
- Once the set-up process has completed, click START. Once the instance is running, click OPEN and you will be taken to a new tab with your Jupyter Notebook instance.
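Once the notebook is open, it can be useful to confirm what resources the instance actually exposes. The following is a minimal sketch using only the Python standard library; `instance_resources` is a hypothetical helper name, and on containerised platforms the reported figures may reflect the host machine rather than the limits of your plan.

```python
import os


def instance_resources():
    """Return the CPU count and RAM (in GB) visible to this Python process."""
    # sched_getaffinity reflects CPU pinning on Linux; fall back to cpu_count elsewhere.
    if hasattr(os, "sched_getaffinity"):
        cpus = len(os.sched_getaffinity(0))
    else:
        cpus = os.cpu_count() or 1

    ram_gb = None
    try:
        # Available on most Unix-like systems; not on Windows.
        ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        if ram_bytes > 0:
            ram_gb = round(ram_bytes / 1024**3, 1)
    except (AttributeError, ValueError, OSError):
        pass  # leave ram_gb as None when the platform does not expose it

    return {"cpus": cpus, "ram_gb": ram_gb}


print(instance_resources())
```

Running this in the first cell of a new instance gives a quick sanity check before starting a long job.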
- Google Colab is a FREE Jupyter notebook environment provided by Google specifically for Deep Learning tasks. It runs completely in the cloud, enables you to share your work and save it directly to your Google Drive, and offers compute resources.
- One of the major advantages of Colab is that it offers free GPU support (with limits, of course; check their FAQ). See this great article by Anne Bommer on getting started with Google Colab.
- Besides GPU support, Colab also gives access to TPUs.
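To check whether a GPU runtime is actually attached, one option is to ask the NVIDIA driver tooling directly. This is a minimal sketch that assumes the `nvidia-smi` command is on the PATH whenever an NVIDIA GPU is present; `list_nvidia_gpus` is a hypothetical helper name, and the check says nothing about TPUs.

```python
import shutil
import subprocess


def list_nvidia_gpus():
    """Return one 'name, memory' string per GPU that nvidia-smi reports, or []."""
    if shutil.which("nvidia-smi") is None:
        return []  # no NVIDIA tooling on PATH: probably a CPU-only runtime
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []  # tool present but no usable driver or device
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_nvidia_gpus()
print(gpus if gpus else "No NVIDIA GPU detected")
```

An empty result usually means the runtime type needs to be switched to a GPU one in the notebook settings.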
A simple example of what Google Colab offers beyond the regular Jupyter Notebook is a replacement for the cv2.imshow() and cv.imshow() functions from the opencv-python package. The two functions are incompatible with the stand-alone Jupyter Notebook, so Google Colab offers a custom fix for this issue:

    from google.colab.patches import cv2_imshow
    import cv2

    # download a sample image ("!" runs a shell command)
    !curl -o logo.png https://colab.research.google.com/img/colab_favicon_256px.png

    # read the image unchanged and display it inline
    img = cv2.imread('logo.png', cv2.IMREAD_UNCHANGED)
    cv2_imshow(img)

Run the above code in a code cell to verify that it is indeed working and begin your image and video processing tasks.
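Outside Colab, the `google.colab.patches` helper is not available. One common alternative in a stand-alone Jupyter Notebook is to draw the image with Matplotlib after reordering the channels, since OpenCV stores colour images as BGR. The sketch below assumes NumPy and Matplotlib are installed; `to_display_order` and `show_inline` are hypothetical helper names, and a hand-made pixel stands in for the result of `cv2.imread`.

```python
import numpy as np


def to_display_order(image):
    """Reorder OpenCV's BGR / BGRA channels to RGB / RGBA for Matplotlib."""
    if image.ndim == 2:          # grayscale: nothing to reorder
        return image
    if image.shape[2] == 4:      # BGRA -> RGBA, alpha stays last
        return image[..., [2, 1, 0, 3]]
    return image[..., ::-1]      # BGR -> RGB


def show_inline(image):
    """Draw an OpenCV-style image array in the notebook output."""
    import matplotlib.pyplot as plt  # imported here so the converter works without it

    plt.imshow(to_display_order(image), cmap="gray" if image.ndim == 2 else None)
    plt.axis("off")
    plt.show()


# Stand-in for cv2.imread(...): a single pure-blue pixel in BGR order.
blue_bgr = np.array([[[255, 0, 0]]], dtype=np.uint8)
print(to_display_order(blue_bgr)[0, 0])  # blue is now the last channel
```

In a notebook, calling `show_inline(img)` on an array returned by `cv2.imread` would then take the place of `cv2_imshow(img)`.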
- Google Cloud offers integrated, managed JupyterLab instances that come pre-installed with the latest machine learning and deep learning libraries, such as TensorFlow, PyTorch, scikit-learn, pandas, NumPy, SciPy and Matplotlib.
- The notebook instance is integrated with BigQuery, Cloud Dataproc and Cloud Dataflow to offer a seamless experience from ingestion and preprocessing through exploration, training and deployment.
- The integrated services make it hassle free for users to scale up on demand by adding compute and storage capacity with a few clicks.
To begin your JupyterLab instance on GCP follow the steps in:
Run the following code with Keras to see how well a cloud environment and GPU support can speed up your analysis:
Here is the link to the dataset: Dataset CSV File (pima-indians-diabetes.csv). The dataset should be in the same working directory as your python file to make it simple.
Save it with the filename: pima-diabetes.csv
    from numpy import loadtxt
    from keras.models import Sequential
    from keras.layers import Dense

    # load the dataset: the first 8 columns are features, the last is the label
    dataset = loadtxt('pima-diabetes.csv', delimiter=',')
    X = dataset[:, 0:8]
    y = dataset[:, 8]

    # define the keras model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(8, activation='relu'))
    # sigmoid is chosen because it is a binary classification problem
    model.add(Dense(1, activation='sigmoid'))

    # compile the keras model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    # fit the keras model on the dataset
    model.fit(X, y, epochs=150, batch_size=10)

    # evaluate the keras model (on the training data, so not a held-out score)
    _, accuracy = model.evaluate(X, y)
    print('Accuracy: %.2f' % (accuracy * 100))

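To put a number on any speed-up, the training call can be wrapped in a simple wall-clock timer and the result compared between a local machine and a cloud instance. This is a minimal sketch using only the standard library; `timed` is a hypothetical helper name, and a small arithmetic loop stands in for the `model.fit(...)` call so that the snippet runs on its own.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Measure the wall-clock time of the enclosed block and print it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the value for later comparison
        print(f"{label}: {elapsed:.2f} s")


timings = {}
with timed("stand-in workload", timings):
    # In the notebook this would be: model.fit(X, y, epochs=150, batch_size=10)
    total = sum(i * i for i in range(100_000))
```

Wall-clock time includes data loading and any one-off start-up cost, so it is worth repeating the measurement a few times before drawing conclusions.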
- Saturn Cloud is a new cloud service that offers one-click Jupyter notebooks hosted on the cloud that can scale up to your compute and storage requirements using AWS in the backend. Here is a tutorial to get started: How to Effortlessly Create, Publish, and Even Share Cloud Hosted-Jupyter Notebooks with Saturn Cloud
- Saturn Cloud is designed to handle the DevOps side of Data Science and to make your analysis more reproducible by offering version control and collaboration features.
- Saturn Cloud offers Parallel Computing infrastructure with Dask (written in Python) instead of other big data tools such as Spark.
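For readers new to Dask, the following is a minimal sketch of its lazy, task-based style using `dask.delayed`. It assumes the `dask` package is installed and runs on Dask's default local scheduler; it does not show anything specific to Saturn Cloud's own cluster setup.

```python
from dask import delayed


def square(x):
    return x * x


# Build a lazy task graph: nothing runs until .compute() is called.
tasks = [delayed(square)(n) for n in range(10)]
total = delayed(sum)(tasks)

# Execute the graph; independent tasks can run in parallel.
result = total.compute()
print(result)  # 285
```

The same pattern scales from a laptop to a cluster because the graph is described first and scheduled afterwards.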
To get started with Saturn Cloud:
- Go to the login page and create an account: Saturn Cloud Login. The basic plan is free, which lets you get used to the environment.
- To create your notebook instance, specify:
  - A name for the notebook
  - The amount of storage
  - The GPU or CPU to be used
  - (Optional) The Python environment (e.g. Pip, Conda)
  - (Optional) Auto-shutdown
  - A requirements.txt to install the libraries for your project
- After the above parameters have been specified you can click CREATE to start the server and your notebook instance.
- Saturn Cloud also offers to host your notebook making it shareable. This is an example of Saturn Cloud taking care of the DevOps for a data science project so that the user need not worry.
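After the server starts, it can be worth confirming that the libraries listed in requirements.txt were actually installed. The sketch below uses only the standard library; `missing_packages` is a hypothetical helper name, and the parsing is deliberately simplified (it skips option lines and ignores version constraints rather than checking them).

```python
import re
from importlib import metadata


def missing_packages(requirements_text):
    """Return the names from a requirements listing that are not installed."""
    missing = []
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line or line.startswith("-"):   # skip blanks and pip options
            continue
        # Keep only the project name, before any version specifier or extras.
        name = re.split(r"[\s<>=!~;\[@]", line, maxsplit=1)[0]
        if not name:
            continue
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Illustrative requirements content; in practice, read your own requirements.txt.
sample = "pandas>=1.0\nmatplotlib\n"
print(missing_packages(sample))
```

An empty list means every named package was found in the current environment.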
Run the code below to verify your instance is running as intended.

    import pandas as pd
    import matplotlib.pyplot as plt
    %matplotlib inline

    # load the iris dataset from the UCI repository
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
    names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
    data = pd.read_csv(url, names=names)

    # plot pairwise scatter plots of the numeric columns
    pd.plotting.scatter_matrix(data)
    plt.show()

What is the Best Jupyter Notebook Environment?
We ranked the Jupyter Notebook environments from best to worst based on a number of different factors, such as analysis and visualization capabilities, data storage and database functionality. Each platform is different, with its own best and worst use cases and its own unique selling point.
All of the above services are made to cater to your deep learning requirements and to provide a reproducible environment in which to share your work and conduct your analysis with as little backend work as possible. As deep learning hits new milestones, the algorithms still require vast amounts of data, and most data scientists do not have the capacity on their local machine to handle it. This is where the above alternatives allow us to conduct our analysis with a seamless experience. The following is our best attempt at an objective view of which platform is best and which is worst:
- MatrixDS is unique in that it gives users a choice of different tools for different tasks. For analysis it provides Python, R, Julia, Tensorboard etc.; for visualization it offers Superset, Shiny, Flask, Bokeh etc.; and for storing data it provides PostgreSQL.
- Saturn Cloud provides parallel computing support and makes signing up and creating a Jupyter notebook as simple as possible compared to the other providers on this list. For users who just want to get started with minimal frills and only need a server that can handle big data, this is probably the best choice.
- The Google Cloud JupyterLab environment provides support for both Python and R. Data Science users may have a preferred language, and support for both on a major cloud provider is an attractive offer. It also gives access to GCP's other services, such as BigQuery, straight from the notebook itself, making querying data more efficient and powerful.
- While Google Colab is quite powerful and the only one to offer TPU support, it is not as feature rich for a relatively comprehensive data science workflow as the others. It only has Python support and functions similarly to a standard Jupyter Notebook with a different user interface. It lets you share your notebook through your Google Drive and can access your Google Drive data as well.