Collaborative software development

Nathan Sheffield, PhD
www.databio.org/slides
## The three levels of collaboration

0- None
1- One-way communication
2- Conferencing
3- Coordination
## Why collaborate on software?

Because collective progress increases with increased collaboration.

But I don't develop software!

Yes you do.
Data analysis is software development

# Levels of collaboration
## 0. None

I write and use code for my project.

## 1. One-way communication

I give you my script and you run it. Analogy: TV.

## 2. Conferencing

Interactive work toward a shared goal; collecting bug reports and user feedback. Analogy: a brainstorming conference call.

## 3. Coordination

Interdependent work toward a shared goal. Analogy: a sports team. Everyone contributes, adjusts to others, and does something different.

How do we move toward coordination?

0- None
1- One-way communication
2- Conferencing
3- Coordination
Git

a distributed version-control system that tracks changes in software development

  • created by Linus Torvalds in 2005 for development of the Linux kernel
  • free and open-source (GPLv2)
GitHub

a web-based hosting service for version control using Git


  • company started Feb. 2008
  • purchased by Microsoft for $7.5 billion in 2018
# git/github ecosystem

## version control

- [centralized vs distributed](https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control)
- [git vs svn](https://trends.google.com/trends/explore?date=all&geo=US&q=git,svn)

## distribution

- [the octoverse](https://octoverse.github.com/)

## collaboration

- [dashboard](https://github.com/orgs/databio/dashboard)
# Git solves problems

## Version control

## Problem 1

### My computer crashed and I lost all my code.

Solution: Remote backup (S3?) *or* git + GitHub

## Problem 2

### I want to work on my code from my home and work computers.

Solution: Remote working copy (Dropbox?) *or* git + GitHub

## Problem 3

### My changes broke this function and I can't remember how it used to work.

Solution: Manual version control: "code1.R" and "code2.R"? *or* git + GitHub

## Problem 4

### I can't remember what code I used on this sample last year. Or, I want to note this particular version because I used it for the initial paper submission.

Solution: Version control + unstructured notes/logs? *or* git + GitHub tags

## Problem 5

### My remote backup crashed and I lost all my history.

Solution: More remote backups (*distributed* VCS)? *or* git + GitHub
# Git solves problems

## Distribution

## Problem 1

### I want to publish my code with my paper so others can find and use it. How should I do it?

Solution: Website? *or* git + GitHub

## Problem 2

### How can I get a permanent, fast URL for my software so I can build an automated container that will download and install it automatically?

Solution: A high-quality code hosting service? *or* git + GitHub
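One way to picture the "permanent URL" idea: GitHub can serve a source tarball for a given tag or commit, so a container build can download one exact version. The sketch below assumes the `https://github.com/<owner>/<repo>/archive/<ref>.tar.gz` URL pattern; the function name, the owner `mylab`, the repository `mytool`, and the tag `v1.0.0` are all hypothetical and used only for illustration.

```python
def github_archive_url(owner: str, repo: str, ref: str) -> str:
    """Build a download URL for a source tarball of one tag, branch, or commit.

    Assumes GitHub's archive URL pattern. Use a tag or a full commit SHA as
    `ref` if the URL should keep pointing at the same code; a branch name
    moves as new commits arrive.
    """
    return f"https://github.com/{owner}/{repo}/archive/{ref}.tar.gz"


# Hypothetical owner, repository, and tag, for illustration only.
print(github_archive_url("mylab", "mytool", "v1.0.0"))
```

A container recipe could then fetch that URL during the build, which pins the installed software to the tagged version.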
## Problem 3

### I'd like other people to be able to find and use my code. How can I advertise it?

Solution: Google adwords? *or* git + GitHub

## Problem 4

### How can I find software that people actually use that's relevant for my project?

Solution: Google? *or* git + GitHub
# Git solves problems

## Collaboration

## Problem 1

### Someone else found a bug in my code and wants to show me how to fix it.

Solution: E-mail? *or* User submits a pull request on GitHub. You can also [point to specific lines](https://github.com/databio/pypiper/blob/653216887cb2b2ad8e9119b76f40b39da58ec115/pypiper/ngstk.py#L72-L75).
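The "point to specific lines" link above follows a regular pattern: repository, full commit hash, file path, then a line range. Because it names a commit rather than a branch, it keeps pointing at the same lines after the file changes. A minimal sketch that rebuilds that link; the helper name `github_permalink` is made up for this example.

```python
from typing import Optional


def github_permalink(owner: str, repo: str, commit_sha: str, path: str,
                     start_line: int, end_line: Optional[int] = None) -> str:
    """Build a GitHub link to a file at one commit, highlighting a line range."""
    url = f"https://github.com/{owner}/{repo}/blob/{commit_sha}/{path}#L{start_line}"
    # Add the end of the range only when more than one line is highlighted.
    if end_line is not None and end_line != start_line:
        url += f"-L{end_line}"
    return url


# Rebuilds the link from this slide.
print(github_permalink(
    "databio", "pypiper",
    "653216887cb2b2ad8e9119b76f40b39da58ec115",
    "pypiper/ngstk.py", 72, 75,
))
```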
## Problem 2

### My friend and I are working on a similar problem. How can we share our code with one another, but not with anyone else?

Solution: E-mail? Dropbox? *or* GitHub collaborators or organizations

## Problem 3

### My collaborator wants to keep using my code for this current project while I develop and test a new feature.

Solution: Duplicate the code? *or* git branches + GitHub

## Problem 4

### A user is having trouble getting something to work. How do they know who I am and how to contact me?

Solution: An e-mail address on a web page? *or* git + public GitHub issues

## Problem 5

### I figured out how to adapt this published tool to work for my data. How can I contribute these changes back to the original authors?

Solution: E-mail? *or* git + GitHub pull request
## Problem 6

### Everyone in our lab/center needs to do a similar thing over and over, with slight differences. How can we share effort but also keep things separate?

Solution: Lots of duplicated scripts with minor tweaks? *or* git + GitHub branches and tags
# Key git/github concepts

## repository *vs* remote

## branch *vs* clone

## clone *vs* fork

## pull request *vs* merge

## commit *vs* push

## issue, tag, [stage](https://git-scm.com/book/en/v2/Git-Basics-Recording-Changes-to-the-Repository)
# How git works

## And things to avoid
## Do: commit text files

Git uses line-by-line comparison. See this [pull request on the `peppy` repository](https://github.com/pepkit/peppy/pull/238/files)

## Don't: commit binary files
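Why text and not binary? A line-oriented diff of a text file shows exactly which lines changed, while a binary file has no meaningful lines to compare. The sketch below uses Python's standard `difflib` module as a stand-in for the idea; it is not git's own diff implementation, and the two-version R script is invented for the example.

```python
import difflib

# Two versions of a small, made-up analysis script.
old = ["x <- read.csv('samples.csv')", "y <- mean(x$score)", "print(y)"]
new = ["x <- read.csv('samples.csv')", "y <- median(x$score)", "print(y)"]


def changed_lines(before, after):
    """Return only the removed ('- ') and added ('+ ') lines of a line diff."""
    return [line for line in difflib.ndiff(before, after)
            if line.startswith(("- ", "+ "))]


for line in changed_lines(old, new):
    print(line)
```

Only the one edited line shows up as a removal plus an addition; the two unchanged lines are left out, which is what makes a review of text changes readable.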
## Do: commit small versioned files

Git retains a copy of everything you've committed, even if you delete it.

## Don't: commit large static files
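Why large files hurt: git identifies each committed file version by a hash of its contents, so every distinct version becomes its own stored object and stays in the history. A minimal sketch of how that identifier is computed, assuming the default SHA-1 object format; the function name `git_blob_id` and the sample file contents are made up for this example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the object id git gives to file contents (SHA-1 object format).

    Git hashes a short header, "blob <size>\\0", followed by the raw bytes.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


v1 = b"threshold = 0.05\n"
v2 = b"threshold = 0.01\n"

# A one-character edit gives a different id, so it is a new object.
print(git_blob_id(v1) != git_blob_id(v2))
# The empty file always has the same well-known id.
print(git_blob_id(b""))
```

For a small text file, an extra object per version costs almost nothing. For a large binary file, each new version can add roughly its full size to the repository history.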
## Do: make commits frequently

Almost anything you commit can be undone. Frequent commits help you track your work.

## Don't: be scared to break something
## Do: learn to use branches

[Branches](https://git-scm.com/book/en/v2/Git-Branching-Branches-in-a-Nutshell) are a super useful organizational structure.

## Don't: be scared of using branches

## Do: use the command line

Write your own [aliases](https://github.com/nsheff/env/blob/master/alias_git.sh) for commands you use frequently.

## Don't: just rely on the web interface

## Do: use the issue tracker

Every project can enable a GitHub issue tracker, which links nicely to code.

## Don't: use e-mail to document problems and solutions
## Other niceties

- *GitHub Pages*: free hosting for static web pages
- *Jekyll*: the blog-aware static site generator used by GitHub Pages
- *Git hooks*: execute scripts before or after events
- *GitHub Wiki*: a no-frills wiki on every repository
- *GitHub project tracker*: integrates a simple kanban system
- *GitHub API*: provides programmatic access
- *Gists*: small code snippets
- Free private repositories for individuals
- Free private repositories for academic groups
## Git's utility transcends software

- analytical code, not just tools
- VCS/collaboration for writing grants, papers, CV/biosketch
- VCS and host for lab web page and all code documentation
- citation management database
- shared lab instructions
- Environments: modulefiles, Dockerfiles, config files
- a shared figure repository for lab members
- presentations
- communicating with groups of people, brainstorming
## Git is a single infrastructure that provides solutions to a huge number of problems
[peppy repository](https://github.com/pepkit/peppy)