
Edouard Godfrey

Jupyter Notebook is a web-based development environment that allows programmers to combine code and documentation.

Hello World

Since its inception in 2015, it has become a de facto standard for the data science community, and it is now more popular than ever. As of October 2020, there were 9.7 million notebooks publicly hosted on GitHub. This is up from 1.2 million the year before, roughly an eightfold increase (about 700%) year over year.

People have been authoring code for decades, and the tools have matured greatly. For writing, engineers turn to their favorite editor or IDE: vim, emacs or one of the more modern options. They run and debug in the terminal and collaborate through Git. Data scientists write code too, so why are they turning to notebooks instead?

We find a hint to this riddle in their job title.

An engineer’s job is to create products. They are concerned with features, quality and performance. They collaborate in groups on large codebases.

A data scientist’s role is different. They create knowledge rather than products. They analyze data and extract principles. In an A/B experiment, they decide whether to pick the candidate model or to stick with the legacy one. They collaborate, but in a different manner. Instead of having multiple writers on a shared codebase, the typical data science program follows a one-writer, multiple-readers paradigm. This mimics how knowledge is spread, with one author and multiple consumers.

At their core, data scientists create and share knowledge. And it turns out that notebooks are great at these two tasks.

Notebooks are designed for exploratory programming. When you analyze a dataset, you do not really know where it is going to take you. As you run chunks of code, you incrementally build and store state, just as you would in a scientific notepad. From raw data, you create intermediate variables, observe them and then combine them into the next level of variables. That process is repeated until you have drained the dataset. This incremental exploration is natural. Moving the ball forward is as simple as creating a new cell, writing a few lines of code and pressing Shift-Enter to run it. The notebook environment stays out of the way, with no context switch. This facilitates a flow state, which makes it quite fun.
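To make that loop concrete, here is a minimal sketch of what a short exploratory session might look like, written as plain Python with comments marking where each cell would begin. The temperature readings and every variable name are made up for illustration; only the pattern of each cell building on the state left by the previous one is the point.

```python
# Cell 1: load the raw data (hypothetical daily temperature readings, in Celsius)
from statistics import mean

readings = [18.2, 19.1, 17.8, 21.4, 22.0, 20.3, 19.6]

# Cell 2: create a first intermediate variable and observe it
average = mean(readings)
print(f"average: {average:.2f}")

# Cell 3: build on the state left behind by the previous cells
deviations = [round(r - average, 2) for r in readings]
print(deviations)

# Cell 4: combine intermediate variables into the next level
warm_days = [r for r, d in zip(readings, deviations) if d > 0]
print(f"{len(warm_days)} of {len(readings)} days were above average")
```

In a notebook, each of these cells would be run on its own with Shift-Enter, and the variables from earlier cells would stay in memory while you decide what to look at next.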

But creating knowledge is only half the battle.

Imagine we are in 240 BC and a scientist tells you “the Earth’s circumference is 24,466 miles, trust me”. That is a great insight, but why would you trust him? It contradicts prior beliefs. Everyone has opinions, and most of them contradict one another. Without proof, you would be wise to just stay with the opinion of the majority.

Fundamentally, a thesis without an explanation is useless because it does not spread to skeptical minds. A better claim from the scientist would be "the Earth’s circumference is 24,466 miles, see for yourself".

Eratosthenes

Now that is compelling. I do not have to take this proposal at face value; I can look at it, audit it and refute or accept it.
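As an illustration of what "see for yourself" could look like in a notebook cell, here is a minimal sketch of the classic shadow-angle calculation attributed to Eratosthenes. The inputs are assumptions, not figures taken from this post: a shadow angle of 7.2 degrees, a distance of 5,000 stadia between Alexandria and Syene, and a stadion of 157.5 metres. The stadion length in particular is debated, and it is chosen here because it reproduces the 24,466 miles quoted above.

```python
# Inputs: commonly cited values, assumed here for illustration
shadow_angle_deg = 7.2       # Sun's angle from vertical at Alexandria at noon
distance_stadia = 5_000      # distance from Alexandria to Syene
metres_per_stadion = 157.5   # one of several proposed stadion lengths
METRES_PER_MILE = 1609.344   # international mile

# The shadow angle equals the share of the full circle separating the two cities
fraction_of_circle = shadow_angle_deg / 360

# Scale the known distance up to the whole circle, then convert to miles
circumference_stadia = distance_stadia / fraction_of_circle
circumference_miles = circumference_stadia * metres_per_stadion / METRES_PER_MILE

print(f"{circumference_miles:,.0f} miles")  # prints "24,466 miles"
```

A skeptical reader can change any input, rerun the cell and see how much the conclusion depends on it, which is exactly the kind of audit a bare claim does not allow.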

Notebooks are great for science because they combine code with rich documentation, images and links. They tell a story and walk the reader towards the conclusion.

The IDE/Terminal/Git paradigm is great for code read by machines. Notebooks, by contrast, are meant for humans, and that makes them great for science.