Monday, May 6, 2013

Software engineering practices for graduate students

Recently I was talking with an Olin student who will start graduate school in the fall, and I suggested a few things I wish I had done in grad school.  And then I thought I should write them down.  So here is my list of Software Engineering Practices All Graduate Students Should Adopt:


Version Control

Every keystroke you type should be under version control from the time you initiate a project until you retire it.  Here are the reasons:

1) Everything you do will be backed up.  But instead of being organized by date (which is what most backup systems do), your backups are organized by revision.  So, for example, if you break something, you can roll back to an earlier working revision.

2) When you are collaborating with other people, you can share repositories.  Version control systems are well designed for managing this kind of collaboration.  If you are emailing documents back and forth, you are doing it wrong.

3) At various stages of the project, you can save a tagged copy of the repo.  For example, when you submit a paper for publication, make a tagged copy.  You can keep working on the trunk, and when you get reviewer comments (or a question 5 years later) you have something to refer back to.
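As a minimal sketch of the tagging step, assuming a Git repository (Subversion has its own convention for tags), a small Python wrapper might look like the following.  The helper names and the tag name are hypothetical; `git tag -a NAME -m MESSAGE` is the standard Git command for creating an annotated tag.

```python
import subprocess


def tag_command(name, message):
    """Build the git command that creates an annotated tag (hypothetical helper)."""
    return ["git", "tag", "-a", name, "-m", message]


def tag_submission(name, message, repo_dir="."):
    """Tag the current commit in repo_dir; raises CalledProcessError if git fails."""
    subprocess.run(tag_command(name, message), cwd=repo_dir, check=True)
```

Calling `tag_submission("paper-submitted-v1", "Version submitted for review")` from the repository root would mark the exact state of the project that went to the reviewers, so you can check it out again later.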

I use Subversion (SVN) primarily, so I keep many of my projects on Google Code (if they are open source) or on my own SVN server.  But these days it seems like all the cool kids are using Git and keeping their repositories on GitHub.

Either way, find a version control system you like, learn how to use it, and find someplace to host your repository.



Build Automation

This goes hand in hand with version control.  If someone checks out your repository, they should be able to rebuild your project by running a single command.  That means that everything someone needs to replicate your results should be in the repo, and you should have scripts that process the data, generate figures and tables, and integrate them into your papers, slides, and other documents.

One simple tool for automating the build is Make.  Every directory in your project should contain a Makefile.  The top-level directory should contain the Makefile that runs all the others.
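To make the idea concrete, here is a rough Python sketch of the core rule Make applies: a target is rebuilt when it is missing or older than any file it depends on.  This is an illustration, not a replacement for Make, and the file names and commands in `EXAMPLE_RULES` are hypothetical.

```python
import os
import subprocess


def needs_rebuild(target, sources):
    """Return True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_time for src in sources)


def build(rules):
    """Run each rule's command, in order, when its target is out of date.

    rules is a list of (target, sources, command) tuples, listed so that
    every target appears after the targets it depends on.
    Returns the list of targets that were rebuilt.
    """
    rebuilt = []
    for target, sources, command in rules:
        if needs_rebuild(target, sources):
            subprocess.run(command, check=True)
            rebuilt.append(target)
    return rebuilt


# Hypothetical pipeline: raw data -> cleaned data -> figure.
EXAMPLE_RULES = [
    ("results/clean.csv", ["data/raw.csv", "clean.py"], ["python", "clean.py"]),
    ("figures/fig1.pdf", ["results/clean.csv", "plot.py"], ["python", "plot.py"]),
]
```

Because the rules run in dependency order, fixing a bug in the first script causes everything downstream of it to be regenerated on the next build, which is the behavior you want the night before a deadline.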

If you use GUI-based tools to process data, it might not be easy to automate your build.  But it will be worth it.  The night before your paper is due, you will find a bug somewhere in your data flow.  If you've done things right, you should be able to rebuild the paper with just five keystrokes (m-a-k-e, and Enter).

Also, put a README in the top-level directory that documents the directory structure and the build process.  If your build depends on other software, include it in the repo if practical; otherwise provide a list of required packages.
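If the required packages happen to be Python modules (an assumption; the same idea applies to other environments), the list can double as a check that runs before the build.  The sketch below uses the standard library's `importlib.util.find_spec`; the function name and the contents of `REQUIRED` are hypothetical.

```python
import importlib.util

# Hypothetical list of top-level modules the build depends on.
REQUIRED = ["numpy", "matplotlib"]


def missing_modules(required):
    """Return the top-level module names in required that cannot be found."""
    return [name for name in required if importlib.util.find_spec(name) is None]
```

A build script could call `missing_modules(REQUIRED)` first and stop with a readable message if the result is not empty, rather than failing halfway through.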

Or, if your software environment is not easy to replicate, put your whole development environment in a virtual machine and ship the VM.



Agile Development

For many people, the most challenging part of grad school is time management.  If you are an undergraduate taking 4-5 classes, you can do deadline-driven scheduling; that is, you can work on whatever task is due next and you will probably get everything done on time.

In grad school, you have more responsibility for how you spend your time and fewer deadlines to guide you.  It is easy to lose track of what you are doing, waste time doing things that are not important (see Yak Shaving), and neglect the things that move you toward the goal of graduation.

One of the purposes of agile development tools is to help people decide what to do next.  They provide several features that apply to grad school as well as software development:

1) They encourage planners to divide large tasks into smaller tasks that have a clearly defined end condition.

2) They maintain a priority ranking of tasks so that when you complete one you can start work on the next, or one of the next few.

3) They provide mechanisms for collaborating with a team and for getting feedback from an adviser.

4) They involve planning on at least two time scales.  On a daily basis you decide what to work on by selecting tasks from the backlog.  On a weekly (or longer) basis, you create and reorder tasks, and decide which ones you should work on during the next cycle.
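The first two features can be sketched as a tiny data structure.  This is only an illustration of the idea, not how any particular tool works; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    title: str
    done_when: str       # the clearly defined end condition
    priority: int        # lower number means work on it sooner
    done: bool = False


class Backlog:
    """A priority-ranked list of tasks."""

    def __init__(self):
        self.tasks = []

    def add(self, title, done_when, priority):
        """Create a task, add it to the backlog, and return it."""
        task = Task(title, done_when, priority)
        self.tasks.append(task)
        return task

    def next_up(self, count=3):
        """Return up to count unfinished tasks, highest priority first."""
        pending = [t for t in self.tasks if not t.done]
        return sorted(pending, key=lambda t: t.priority)[:count]
```

In this sketch, daily planning amounts to calling `next_up()` and picking something from the result; weekly planning amounts to adding tasks and changing their `priority` values.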

If you use GitHub or Google Code for version control, you get an issue tracker as part of the deal.  You can use issue trackers for agile planning, but there are other tools, like Pivotal Tracker, that have more of the agile methodology built in.  I suggest you start with Pivotal Tracker because it has excellent documentation, but you might have to try out a few tools to find one you like.


Do these things -- Version Control, Build Automation, and Agile Development -- and you will get through grad school in less than the average time, with less than the average drama.

4 comments:

  1. Trello is also a nice tool for agile development and general project management. https://trello.com/

  2. Excellent post. I would add the following:

    - Practice convention over configuration. Picking up an old project is a lot easier if you've followed common conventions: place data in a /data folder, results in a /results folder, figures in /figures, AI files in /aifigures, functions with a lowercase first letter, script files with an uppercase first letter, etc.

    - Devote time to cleanup after project completion. After you submit a paper, devote a few minutes to a few hours to cleaning up your project base. I keep a folder /retired in each project folder for analyses that went nowhere; that way I don't have to sift through a hundred similarly named .m files to find the one that actually does the analysis once I get back reviews.

    - Also after project completion - identify scripts that will be of value to future projects, and stick them in a /toolbox folder in Matlab's path. That way you won't have to search for them for a project 6 months down the line.

    - Automated backup - goes hand in hand with version control

    - Take the time to write scripts that make figures as close to publication-ready as possible. That way, when the data changes, or a reviewer requests a new analysis, you don't have to redo all the steps of cleaning up the figure in Illustrator.

  3. Good post you shared here; I hope you will write more.
