Collaborative bio-image analysis script editing with git

Posted by Robert Haase, on 4 September 2021

TL;DR: I’m a computer scientist who often collaborates with biologists on bio-image analysis scripts. We increasingly use git, a version control program, for working on code collaboratively. When using git, we speak about repositories, commits and pushing to the origin. We also make forks, send pull-requests and merge code. This blog post explains these terms and demonstrates what a typical collaborative bio-image analysis scripting project looks like.

We’ve all been in that situation: We wrote a little script that should count cells, but it doesn’t work and we can’t tell why. We need support from an expert. Maybe we would also like to help others in the future by fixing bugs in open source projects. In this blog post I demonstrate how to work on code collaboratively with colleagues. We will use the version control tool git to avoid a project folder full of files called script_v7_robert_final_v2.py. Spoiler: We will hardly use the command line for this.

In simple cases, we can just go to image.sc, start a new thread, copy our code and ask for help. However, if the code is quite long, involves multiple files, or the discussion about the issue becomes more complicated, the experts on image.sc may ask for a minimal working example (MWE). An MWE contains everything an expert needs to reproduce your issue on her computer. Preparing one may appear complicated at first. However, preparing the MWE really pays off, for multiple reasons:

Getting organized: Cleaning up code before sending it to the expert may resolve the original issue.
Better understanding how experts work: Experts reduce problematic code to MWEs every day in order to fix it. It’s like pipetting in the wet lab, a very common routine.
Divide and rule: Writing minimal code from the beginning, so that MWEs can easily be made out of it, leads to code that is easier to maintain in the long term.

Example: a script with an issue

Assume our problematic script is written in Python. We created a conda environment and started programming. The script loads an example image, segments it and counts the objects in the image. For the image segmentation we programmed a small library to keep our code well organized. Furthermore, for loading and processing the images, we are using scikit-image, a requirement that is listed in a requirements text file. Thus, for discussing the issue with others, we need to upload four files to the internet:

Our example project contains an image, a Python library for common functions, the actual image processing script and a text file with the required Python packages that need to be installed to make our script work on others’ computers.
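For illustration only, the library function and the script of such a project could look roughly like the following sketch. This is not the code from the example project: the function name `count_objects`, the use of Otsu thresholding and the synthetic stand-in image are assumptions, and library and script are shown in a single block to keep the example self-contained.

```python
import numpy as np
from skimage import filters, measure


def count_objects(image):
    """Hypothetical library function: segment bright objects and count them."""
    # Otsu's method picks a threshold separating foreground from background
    threshold = filters.threshold_otsu(image)
    binary = image > threshold
    # Connected-component labeling; return_num=True also returns the object count
    _, number_of_objects = measure.label(binary, return_num=True)
    return number_of_objects


# Stand-in for the example image: two bright squares on a dark background
image = np.zeros((64, 64), dtype=np.uint8)
image[10:20, 10:20] = 200
image[40:55, 30:45] = 200

print("Number of objects:", count_objects(image))
```

Keeping the segmentation in a small function like this makes it easy to copy just the failing part into an MWE later.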

We would also need to share the error message that we see when we execute the script. We will come back to this in a bit.

Sharing code on github.com

For sharing code and working on it collaboratively, many platforms exist. A quite common platform for open source projects is github.com, so I will demonstrate how to work collaboratively on this platform. The same procedures work with well-known open source projects such as ImageJ or napari. Thus, by learning the procedure here, you could help fix bugs in such open source projects. The first step for sharing your code with others is creating an account on github.com and creating a folder where you can upload your files. Software developers call these folders repositories. In your GitHub profile, you can click on repositories and create a new one:

For sharing scripts on github.com, you need to create a repository…

… for creating a repository, just click the *New* button…

Before you can upload files to this online folder, you need to make a local copy of the repository. This is how git repositories work in general: You make a local copy by cloning the repository, change something in it, commit the change and then push it to the origin of the repository, github.com in our example. For downloading the folder, we will use GitHub Desktop. Alternatively, one could use the command line.
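For readers curious about what happens behind the buttons, the clone, commit and push cycle can be sketched with plain git commands. The sketch below drives git from Python so that it runs as one self-contained script, and it assumes git is installed and available on the command line. The bare repository in a temporary folder is only a stand-in for the repository on github.com, and the file content, user name and commit summary are made up.

```python
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run a git command in the folder `cwd` and return what it printed."""
    result = subprocess.run(["git", *args], cwd=cwd, check=True,
                            capture_output=True, text=True)
    return result.stdout


workspace = Path(tempfile.mkdtemp())

# Stand-in for the empty repository created on github.com (the "origin")
origin = workspace / "origin.git"
git("init", "--bare", str(origin), cwd=workspace)

# Clone: make a local copy of the repository
git("clone", str(origin), "local_copy", cwd=workspace)
local_copy = workspace / "local_copy"

# Change something: here we add a placeholder for the analysis script
(local_copy / "my_script.py").write_text("print('count cells')\n")

# Commit: record the change locally together with a summary
git("add", "my_script.py", cwd=local_copy)
git("-c", "user.name=Example User", "-c", "user.email=user@example.com",
    "commit", "-m", "Add cell counting script", cwd=local_copy)

# Push: send the committed change to the origin
git("push", "origin", "HEAD", cwd=local_copy)

# The origin now knows about the commit
origin_log = git("log", "--all", "--oneline", cwd=origin)
print(origin_log)
```

With a real project, the path to the stand-in origin would be replaced by the address of the repository on github.com.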

When clicking on the *Set up in Desktop* button, GitHub Desktop will open.

GitHub Desktop will ask you where the folder for your repository should be created.

After the folder has been created, you can copy the files of your MWE into it.

When you switch back to GitHub Desktop, green plus signs highlight the added files. Enter a summary of what you did and commit the change.

The last step for uploading is publishing the files to the internet by clicking the *Publish branch* button.

Back on the repository on github.com, we can see the files were uploaded.

Making life easy for collaborators

Others can now navigate to your online repository and read your code. However, they can’t yet see the error we experienced. Of course, we could ask the expert to download the code, run it and read the error. However, this is an extra burden for them. As we are asking for help, we should make it as easy as possible for them to help us. Thus, we will create a Jupyter notebook that reproduces our error. To do so, we copy the code from my_script.py into a new Jupyter notebook and execute it.

The Jupyter notebook shows the error message we experience. We can then upload it to GitHub so that others see our error in its original habitat.

After creating the Jupyter notebook, you will see in GitHub Desktop that multiple files have been created. Folders named *.ipynb_checkpoints* and *__pycache__* contain temporary files you did not create intentionally. You should not upload them and instead ignore them. You can do this by right-clicking them and selecting *Ignore*.

This will create a .gitignore file that lists all the files which should not be uploaded. Gitignore files should also be clean, but that’s another topic. Today, we will be lazy and just upload it by adding a summary and committing it. Be careful: Untick your notebook before committing the gitignore file.
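To see what such a file does, here is a small sketch that sets up a throwaway repository, writes a .gitignore by hand and asks git which files it would still offer for committing. It assumes git is installed; the file names are placeholders, and the two patterns are an assumption about what a reasonable .gitignore for this project contains, not necessarily the exact lines GitHub Desktop writes.

```python
import subprocess
import tempfile
from pathlib import Path

project = Path(tempfile.mkdtemp())
subprocess.run(["git", "init"], cwd=project, check=True, capture_output=True)

# A notebook we want to share (placeholder content) ...
(project / "my_notebook.ipynb").write_text("{}\n")
# ... and temporary files that Jupyter and Python create on their own
(project / ".ipynb_checkpoints").mkdir()
(project / ".ipynb_checkpoints" / "my_notebook-checkpoint.ipynb").write_text("{}\n")
(project / "__pycache__").mkdir()
(project / "__pycache__" / "my_library.cpython-39.pyc").write_bytes(b"")

# Each line of .gitignore is a pattern; a trailing slash matches folders only
(project / ".gitignore").write_text(".ipynb_checkpoints/\n__pycache__/\n")

# List the files git would offer for committing
status = subprocess.run(["git", "status", "--porcelain"], cwd=project,
                        check=True, capture_output=True, text=True).stdout
print(status)
```

The ignored folders no longer show up in the list, while the notebook and the .gitignore file itself do.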

Finally, we can upload our notebook, first by committing it…

… and then by pushing it to the origin, the repository on github.com.

Asking for help

This is the point where we can ask for help. We can send a link to the notebook to collaborators or put it in a new thread on image.sc and explain details.

A minimal working example notebook for reproducing the error is a great starting point for collaborators from the computational side. They can read in the browser what the issue is, and they can download everything they need to reproduce and potentially fix it.

Switching perspective: The collaborators’ view

Now let’s change the perspective: The collaborator can explore your code in the GitHub repository and already get a glimpse of what might be going wrong. However, in order to fix the issue, the collaborator will need to make a copy of your repository, try out some potential solutions and then send the updated code back to you. The first step is to fork the repository. A fork is basically just a copy.

To copy a repository on github, just click the *Fork* button…

The collaborator can then use GitHub Desktop to download the *fork* of your project…

… and again select where to download, or “clone” it.

The collaborator should configure GitHub Desktop to contribute to your project.

The collaborator will then run the following command on the command line to install the requirements you specified. Make sure you listed them all in requirements.txt, otherwise the collaborator may struggle with this step:

pip install -r requirements.txt
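One way for the collaborator to double-check that this step worked is to compare the requirements file against the installed packages. The following is a minimal sketch using only the Python standard library. The helper names are made up for this post, the parsing only covers simple lines such as `scikit-image` or `scikit-image>=0.18`, and the example content, including the version number, is hypothetical.

```python
from importlib import metadata
import re


def required_packages(requirements_text):
    """Extract package names from the text of a requirements file."""
    names = []
    for line in requirements_text.splitlines():
        line = line.split("#")[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip empty lines and pip options
            continue
        # The name ends where a version specifier, extra or marker begins
        names.append(re.split(r"[<>=!~;\[\s]", line, maxsplit=1)[0])
    return names


def missing_packages(requirements_text):
    """Return the required packages that are not installed."""
    missing = []
    for name in required_packages(requirements_text):
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Hypothetical file content; the second name is made up and should be reported
example = "scikit-image>=0.18  # image processing\nsurely-not-installed-package-xyz\n"
print(required_packages(example))
print(missing_packages(example))
```

In a real project, the text would come from the file itself, for example via `Path("requirements.txt").read_text()`.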

After fixing the bug, the expert will upload her changes in the same way as you uploaded the notebook earlier.

The collaborator’s view shows the changes which were necessary to fix the bug. She will then provide a summary and commit the change.

Again, the changes need to be pushed to github.com, in this case to the fork of the original repository.
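The relationship between the original repository and the fork can be sketched with plain git commands, again driven from Python and assuming git is installed. Two bare repositories in a temporary folder stand in for your repository and the collaborator's fork on github.com. The remote name `upstream` is a common convention rather than a requirement, and all file contents, names and summaries are made up.

```python
import subprocess
import tempfile
from pathlib import Path

IDENTITY = ["-c", "user.name=Example User", "-c", "user.email=user@example.com"]


def git(*args, cwd):
    """Run a git command in the folder `cwd` and return what it printed."""
    result = subprocess.run(["git", *IDENTITY, *args], cwd=cwd, check=True,
                            capture_output=True, text=True)
    return result.stdout


workspace = Path(tempfile.mkdtemp())

# Stand-in for your repository on github.com, containing the buggy script
original = workspace / "original.git"
git("init", "--bare", str(original), cwd=workspace)
git("clone", str(original), "author_copy", cwd=workspace)
author_copy = workspace / "author_copy"
(author_copy / "my_script.py").write_text("print('bug')\n")
git("add", "my_script.py", cwd=author_copy)
git("commit", "-m", "Add script with an issue", cwd=author_copy)
git("push", "origin", "HEAD", cwd=author_copy)

# Fork: the collaborator's own copy of the repository (the *Fork* button)
fork = workspace / "fork.git"
git("clone", "--bare", str(original), str(fork), cwd=workspace)

# Clone the fork and remember the original repository under the name "upstream"
git("clone", str(fork), "collaborator_copy", cwd=workspace)
collaborator_copy = workspace / "collaborator_copy"
git("remote", "add", "upstream", str(original), cwd=collaborator_copy)

# Fix the bug, commit and push: the push goes to the fork ("origin" of this clone)
(collaborator_copy / "my_script.py").write_text("print('fixed')\n")
git("add", "my_script.py", cwd=collaborator_copy)
git("commit", "-m", "Fix the bug", cwd=collaborator_copy)
git("push", "origin", "HEAD", cwd=collaborator_copy)

fork_log = git("log", "--all", "--oneline", cwd=fork)
original_log = git("log", "--all", "--oneline", cwd=original)
print("fork:\n" + fork_log)
print("original:\n" + original_log)
```

After the push, the fix exists in the fork but not in the original repository.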

Filing a pull-request

The bug has been fixed, but you don’t know it yet. The collaborator needs to notify you and send you the code change that fixed the bug. To do so, the collaborator sends a pull-request, often abbreviated as PR. PRs are also necessary from a copyright point of view. If you just took the change from the collaborator, they could say you were stealing it. However, if they request that you pull their changes over, they officially allow you to copy the change. That’s why this procedure is called pull-request.

Pull requests can be opened from the web interface by clicking the *Contribute* button…

When writing short descriptions in PRs, please always be kind. Explain what you changed and why you think it’s a good idea to incorporate your change in the main project.

Merging changes

Back to our perspective of the repository, we will see that a pull-request arrived.

In our repository, we can see a new pull-request and read the comments from the collaborator.

We can also see which code lines were actually changed.

Back in the conversation tab, we can merge the pull request and thus take the changes over into our repository. On this side, too, we can be kind and reply afterwards with a final comment.

The merged pull-request will then be stored in the repository, so you can see in the long term who contributed to this project.

And voila, if the collaborator also updated the notebook, you will see that it now shows the number of objects in your image.

Summary

When collaborating in bio-image analysis scripting projects, it is important to share minimal working examples. A script that reproduces the error reduces the need to explain things in great detail. When providing these examples, don’t forget to put example images in the same place. A popular way of working together on code projects is git, a version control program. While git is often used by experts from the command line, it is also possible to use it via web interfaces such as github.com and in click-and-run software such as GitHub Desktop. When following common procedures for exchanging code, you create an empty git repository, clone it to your computer, put your files in it, commit them to the repository and push your changes to the origin of the repository. A collaborator can then fork the repository, clone it, commit changes, push them to her fork and send you a pull-request. I know these are a couple of terms which might be confusing at the beginning. These terms and the procedure involving them are so common in data science that I can highly recommend trying it out.

Feedback is highly welcome!