
Source Code Availability is not Reproducibility

Published on October 31, 2023

I recently tried to run some neural network code written by other researchers to understand how their method works. The code is publicly available on GitHub, with a well-written README on how to set up the environment and run the code. However, when I followed all of the instructions, I hit a roadblock: one of the dependencies the code relies on had deprecated some functions, causing the code to stop working altogether.

I managed to trace the issue down to a specific version of matplotlib that the code depended on. After installing the correct version, I reran the code, only for it to fail in a different place, this time somewhere inside PyTorch.

This is where I can already hear some software engineers screaming that we should pin all dependencies, and so on. The environment.yml file that came with the code did pin some of the dependencies, but apparently not all of them. I thought this was where Anaconda's dependency solving would have helped; after all, it spent quite some time trying to solve the environment (until I switched to mamba). And whatever happened to semantic versioning?
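
For what it's worth, a fully pinned environment.yml looks something like the sketch below. The package names and versions here are purely illustrative, not the ones from the repository I was fighting with; running "conda env export" once the environment actually works produces a file in this fully pinned form.

    name: paper-env
    channels:
      - pytorch
      - conda-forge
    dependencies:
      # Everything pinned to an exact version, including the "minor"
      # dependencies like matplotlib that are easy to overlook.
      - python=3.8.13
      - pytorch=1.9.0
      - matplotlib=3.3.4
      - numpy=1.21.6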

This is far too common an occurrence, especially in AI/deep learning, where almost everything is written in Python. Students often spend a lot of time and labor (mostly unpaid!) just figuring out how to run existing code, and then use the limited remaining time to try to improve on it. I would put most of the blame on Python's package management, which defaults to the newest version of a dependency it can find, but it is also a problem with software engineering as a whole.

Some have suggested that the solution is to put your research in a Docker container. This does solve the problem of mismatched dependencies, once the container is built properly and provided you still have the image lying around when you try to run it again. However, simply putting things in Docker doesn't help if you want to modify the code and build on top of it: when you rebuild the container with your modifications, it might pick up different dependencies (or the base image it was built on might no longer exist).
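
If you do go the Docker route, the least fragile version I know of is to pin the base image by digest rather than by tag, and to install Python packages from a fully pinned file. The sketch below is only an illustration: the image tag is just an example, the digest is a placeholder you would have to fill in yourself, and requirements.txt / train.py are hypothetical file names.

    # Pin the base image by digest so a rebuild starts from the exact same layers.
    # The digest below is a placeholder, not a real one.
    FROM pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime@sha256:<digest>

    WORKDIR /workspace

    # Install from a fully pinned requirements file so a rebuild resolves the
    # same versions instead of whatever happens to be newest that day.
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    COPY . .
    CMD ["python", "train.py"]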

There is an entire field of software reproducibility, known as build reproducibility. There are tools like NixOS that can help, but as far as I'm aware, there is still nothing close to a usable solution for the Python ecosystem.

One solution to this problem is to copy all dependencies into the repository, i.e. to vendor them. Python can import packages from relative paths, so this is a relatively easy fix. Unfortunately, this is generally discouraged by software engineering practice, and for some good reasons, but there are also times when the rule is worth breaking. I think this is one of them, but the rule has been cargo-culted into oblivion to the point that I feel vendoring is unlikely to ever be widely accepted.
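
As a rough sketch of what vendoring can look like in practice (the third_party directory and some_package are made-up names for illustration, not anything from the repository I mentioned):

    # At the top of the project's entry point, before any third-party imports.
    # Assumed layout for this sketch:
    #   project/
    #     main.py          <- this file
    #     third_party/     <- vendored copies of the exact dependency versions
    #       some_package/
    import os
    import sys

    # Put the vendored copies ahead of site-packages so they win at import time.
    VENDOR_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "third_party")
    sys.path.insert(0, VENDOR_DIR)

    import some_package  # now resolved from ./third_party/some_package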

On the other hand, well-known models like CNNs and LLMs, which are supported natively in PyTorch, have no such reproducibility issues. The Transformers package from Hugging Face also does its best when it comes to reproducibility; whenever an issue with running their code arises, they are very quick to fix it. This means that most students who can (or are willing to) make research progress tend to focus their time on these models.
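
To illustrate what "supported natively" means in practice, here is a small sketch using a torchvision classifier (my own example, assuming torchvision 0.13 or newer): everything comes from the framework itself, so there is no separate research repository whose environment can rot.

    import torch
    import torchvision.models as models

    # Pretrained weights ship with torchvision and are downloaded on first use.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    model.eval()

    # Run a dummy input through the network.
    with torch.no_grad():
        logits = model(torch.randn(1, 3, 224, 224))
    print(logits.shape)  # torch.Size([1, 1000])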

This creates a dynamic where the most popular models are the ones that see the most research progress, while lesser-known models (which could be just as promising) don't get enough attention. In a field as young as AI, concentrating on such a narrow range of research does not look like a healthy trend at all.

Most top conferences by now mandate that source code be made available once a paper is published. This is a way of addressing the reproducibility issue plaguing science in general, but the availability of source code alone is not reproducibility. Some software engineers use the term "code rot" for this, but that implies the problem lies in the code the researcher wrote. I prefer to place the blame squarely on Python packaging and on the convention that discourages copying dependencies into the repository.

With Python 4.0 on the horizon, this problem can only get worse. As the Python 2 to Python 3 conversion has shown, it will not be a painless transition. We also risk losing much of the progress made during the current boom if we can no longer run the code that produced all this research.


Have comments, or want to discuss the contents of this post? You can always contact me by email.
