Collaborative software development

Nathan Sheffield, PhD
www.databio.org/slides
## The three levels of collaboration
0- None
1- One-way communication
2- Conferencing
3- Coordination
## Why collaborate on software?
<img src="/images/presentations/collab/afgan.png" width=400 style="margin:0px; padding:0px"> <img src="/images/presentations/collab/morin.png" width=400 style="margin:0px; padding:0px"> <img src="/images/presentations/collab/nature.png" width=400 style="margin:0px; padding:0px"> <img src="/images/presentations/collab/groen.png" width=400 style="margin:0px; padding:0px">
<img src="/images/presentations/collab/garijo.png" width=400 style="margin:0px; padding:0px"> <img src="/images/presentations/collab/nbt.png" width=400 style="margin:0px; padding:0px"> <img src="/images/presentations/collab/stodden.png" width=400 style="margin:0px; padding:0px">
## Why collaborate on software? Because collective progress increases with increased collaboration. <img src="/images/presentations/collab/increase.svg" height=400>

But I don't develop software!

Yes you do.
Data analysis is software development

# Levels of collaboration
## 0. None <img src="/images/presentations/collab/none.svg" height=100> I write and use code for my project.
## 1. One-way Communication. <img src="/images/presentations/collab/communication.svg" height=100> I give you my script and you run it. Analogy: TV
## 2. Conferencing. <img src="/images/presentations/collab/conferencing2.svg" height=200> Interactive work toward a shared goal; collecting bug reports and user feedback. Analogy: Brainstorming conference call.
## 3. Coordination. <img src="/images/presentations/collab/coordination.svg" height=200> Interdependent work toward a shared goal. Analogy: a sports team. Everyone contributes, adjusts to others, and does something different.

How do we move toward coordination?

0- None
1- One-way communication
2- Conferencing
3- Coordination
<img src="/images/presentations/collab/git_logo_white.svg" height=400> <img src="/images/presentations/collab/github_bug_black.svg" height=400>
Git

a distributed version-control system that tracks changes in software development

  • created by Linus Torvalds in 2005 for development of the Linux kernel
  • free and open-source (GPLv2)

GitHub

a web-based hosting service for version control using Git

  • company started Feb. 2008
  • purchased by Microsoft for $7.5 billion in 2018

# git/github ecosystem ## version control [centralized vs distributed](https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control) [git vs svn](https://trends.google.com/trends/explore?date=all&geo=US&q=git,svn) ## distribution [the octoverse](https://octoverse.github.com/) ## collaboration [dashboard](https://github.com/orgs/databio/dashboard)
# Git solves problems ## Version control
## Problem 1: My computer crashed and I lost all my code. Solution: Remote backup
## Problem 2: I want to work on my code from my home and work computers. Solution: Remote working copy
## Problem 3: My changes broke this function and I can't remember how it used to work. Solution: a version control system (VCS)
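A minimal sketch of how version control answers Problem 3: the commands below build a throwaway repository, commit a working script, commit a breaking change, and then read the older version back out of history. The file name `analysis.sh`, its contents, and the identity settings are hypothetical, chosen only for illustration.

```shell
# Throwaway repository in a temporary directory (hypothetical example project)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"

# Version 1: the function works
printf 'normalize() { echo "scaled"; }\n' > analysis.sh
git add analysis.sh
git commit -q -m "Add working normalize function"

# Version 2: a change that breaks it
printf 'normalize() { echo "oops"; }\n' > analysis.sh
git commit -q -am "Rewrite normalize"

# Review the history, then read the file as it was one commit ago
git log --oneline -- analysis.sh
old_version=$(git show HEAD~1:analysis.sh)
echo "$old_version"

# Restore the old version into the working directory
git checkout -q HEAD~1 -- analysis.sh
```

`HEAD~1` means "one commit before the current one"; any commit hash shown by `git log` can be used in its place.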
## Problem 4: I can't remember what code I ran on this sample from last year. Solution: VCS + logs
## Problem 5: My remote backup crashed and I lost all my history. Solution: *Distributed* VCS
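One way to see why a distributed system protects history: every clone carries the full commit log, so any copy can rebuild a lost remote. In this sketch local directories stand in for a hosting service and a laptop; all paths and names are hypothetical.

```shell
work=$(mktemp -d)

# A bare repository standing in for the remote backup
git init -q --bare "$work/remote.git"

# A working copy that records two commits and pushes them
git init -q "$work/laptop"
cd "$work/laptop"
git config user.name "Example User"
git config user.email "user@example.com"
git remote add origin "$work/remote.git"
echo "step 1" > notes.txt
git add notes.txt
git commit -q -m "First commit"
echo "step 2" >> notes.txt
git commit -q -am "Second commit"
git push -q origin HEAD

# Simulate the remote crashing
rm -rf "$work/remote.git"

# The local copy still holds every commit
local_commits=$(git rev-list --count HEAD)
echo "commits kept locally: $local_commits"

# Rebuild the remote from the local copy
git init -q --bare "$work/remote.git"
git push -q origin HEAD
restored_commits=$(git --git-dir="$work/remote.git" rev-list --count --all)
echo "commits restored to remote: $restored_commits"
```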
## Problem 6: I want to keep note of this particular version of my analysis because it's what I used for the initial paper submission. Solution: tags
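A tag attaches a permanent, human-readable name to one commit. A minimal sketch, assuming a hypothetical tag name `submission-v1` and a toy parameter file:

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "threshold=0.05" > params.txt
git add params.txt
git commit -q -m "Analysis used for initial submission"

# Annotated tag marking the submitted version
git tag -a submission-v1 -m "Version used for the initial paper submission"

# Work continues after submission
echo "threshold=0.01" > params.txt
git commit -q -am "Tighten threshold for revision"

# List tags, then read the file exactly as it was at the tag
git tag --list
submitted=$(git show submission-v1:params.txt)
echo "$submitted"
```

Tags are local until pushed; `git push origin submission-v1` would publish this one to a remote named `origin`.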
# Git solves problems ## Distribution
## Problem 1: I want to publish my code with my paper so others can find and use it. How should I do it? Solution: Put it on the web
## Problem 2: I need a permanent, fast URL for my software because I want to build an automated container that will download and install my software automatically. How can I do that? Solution: Put it on a high-quality code hosting service
## Problem 3: I'd like other people to be able to find and use my code. How can I advertise it? Solution: Google AdWords?
## Problem 4: How can I find software that people actually use that's relevant for my project? Solution: Google/GitHub
# Git solves problems ## Collaboration
## Problem 1: Someone else found a bug in my code and wants to show me how to fix it. Solution: User submits a pull request. You can also [point to specific lines](https://github.com/databio/pypiper/blob/653216887cb2b2ad8e9119b76f40b39da58ec115/pypiper/ngstk.py#L72-L75).
## Problem 2: My friend and I are working on a similar problem. How can we share our code with one another, but not with anyone else? Solution: e-mail it back and forth?
## Problem 3: My collaborator wants to keep using my code for this current project while I develop and test a new feature. Solution: git branches
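A sketch of the branch workflow for Problem 3: the stable code stays on the default branch while the new feature lives on its own branch. The branch name `new-feature` and the file are hypothetical.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "stable pipeline" > pipeline.txt
git add pipeline.txt
git commit -q -m "Stable version my collaborator uses"
stable_branch=$(git symbolic-ref --short HEAD)   # default branch name, e.g. master or main

# Develop the new feature on its own branch
git checkout -q -b new-feature
echo "experimental step" >> pipeline.txt
git commit -q -am "Start experimental step"

# Switching back shows the stable branch is untouched
git checkout -q "$stable_branch"
stable_content=$(cat pipeline.txt)
echo "$stable_content"

# Once the feature is tested, merge it into the stable branch
git merge -q new-feature
cat pipeline.txt
```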
## Problem 4: A user is having trouble getting something to work. How do they know who to contact? And how can they contact me? Solution: E-mail? Or public GitHub issues?
## Problem 5: I figured out how to adapt this published tool to work for my data. How can I contribute these changes back to the original authors? Solution: pull request
## Problem 6: Our lab/center all need to do a similar thing over and over, with slight differences. How can we share effort but also keep things separate? Solution: lots of duplicated scripts with minor tweaks?
# Key git/github concepts ## repository *vs* remote ## branch *vs* clone ## clone *vs* fork ## pull request *vs* merge ## commit *vs* push ## issue, tag, [stage](https://git-scm.com/book/en/v1/Getting-Started-Git-Basics#The-Three-States)
# How git works ## And things to avoid
## Do: commit text files Git uses line-by-line comparison. See this [pull request on the `peppy` repository](https://github.com/pepkit/peppy/pull/238/files) ## Don't: commit binary files
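To illustrate the line-by-line comparison, this sketch changes one line of a small text file and asks Git for the difference. The file name and contents are hypothetical.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

printf 'sample,value\nA,1\nB,2\n' > table.csv
git add table.csv
git commit -q -m "Add table"

# Change a single line
printf 'sample,value\nA,1\nB,3\n' > table.csv

# Git reports exactly which lines were removed (-) and added (+)
git --no-pager diff table.csv

# Summary form; columns are: lines added, lines removed, file name
changed=$(git diff --numstat table.csv)
echo "$changed"
```

For a binary file the same command can only report that the two versions differ, which is why text formats are easier to review.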
## Do: commit small versioned files Git retains a copy of everything you've committed, even if you delete it. ## Don't: commit large static files
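A sketch of what "retains a copy" means in practice: a file deleted in a later commit is still stored in history and can be read back. The file name and contents are hypothetical.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "large static content" > big_file.dat
git add big_file.dat
git commit -q -m "Add data file"

# Delete the file and commit the deletion
git rm -q big_file.dat
git commit -q -m "Remove data file"

# The working directory no longer has the file...
ls

# ...but the previous commit still stores its full content
recovered=$(git show HEAD~1:big_file.dat)
echo "$recovered"
```

Because the content stays in history, a large file keeps taking up space in every clone even after it is deleted, unless the history itself is rewritten.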
## Do: make commits frequently Almost anything you have committed can be recovered. Frequent commits help you track your work. ## Don't: be scared to break something
## Do: learn to use branches [Branches](https://git-scm.com/book/en/v1/Git-Branching-What-a-Branch-Is) are a super useful organizational structure ## Don't: be scared of using branches
## Do: use the command line Write your own [aliases](https://github.com/nsheff/env/blob/master/alias_git.sh) for commands you use frequently. ## Don't: just rely on the web interface
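A sketch of two ways to define shortcuts. The alias names are illustrative and are not taken from the linked file. Shell aliases normally go in a startup file such as `~/.bashrc`; Git's own alias mechanism is set here in a throwaway repository, and adding `--global` to `git config` would apply it everywhere.

```shell
# Shell aliases (illustrative names), typically placed in ~/.bashrc
alias gs='git status --short'
alias gl='git log --oneline --graph --decorate'

# Git's built-in aliases, defined in a throwaway repository for this demo
repo=$(mktemp -d)
cd "$repo"
git init -q
git config alias.st "status --short"
git config alias.lg "log --oneline --graph --decorate"

# "git st" now expands to "git status --short"
echo "draft" > notes.txt
status_output=$(git st)
echo "$status_output"
```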
## Do: use the issue tracker Every project can enable a GitHub issue tracker, which links nicely to code. ## Don't: use e-mail to document problems and solutions
## Other niceties
- GitHub Pages: free hosting for static web pages
- Jekyll: the blog-aware static site generator behind GitHub Pages
- Free private repositories for individuals
- Free private repositories for academic groups
- Git hooks: execute scripts before or after events
- Built-in wiki system
- GitHub project tracker integrates a simple kanban system
- GitHub's API provides programmatic access
- Gists: small code snippets
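As one example from this list, a Git hook is an executable script in `.git/hooks/` that Git runs at a fixed event. The sketch below installs a hypothetical `pre-commit` hook that aborts a commit when a staged line contains the marker text `DO NOT COMMIT`; the marker and the file names are assumptions made for illustration.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Install a pre-commit hook; a non-zero exit status aborts the commit
mkdir -p .git/hooks
cat > .git/hooks/pre-commit <<'EOF'
#!/bin/sh
# Reject the commit if any staged added line contains the marker text
if git diff --cached | grep -q '^+.*DO NOT COMMIT'; then
    echo "pre-commit: remove the DO NOT COMMIT marker first" >&2
    exit 1
fi
EOF
chmod +x .git/hooks/pre-commit

# This commit is blocked by the hook
echo "password=1234  # DO NOT COMMIT" > config.txt
git add config.txt
if git commit -q -m "Add config"; then blocked=no; else blocked=yes; fi

# After removing the marker, the commit goes through
echo "password_file=secret.txt" > config.txt
git add config.txt
if git commit -q -m "Add config"; then allowed=yes; else allowed=no; fi
echo "blocked: $blocked, allowed: $allowed"
```

Hooks live inside `.git/`, so they are not copied by `git clone`; each collaborator has to install them separately.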
## Git's utility transcends software
- analytical code, not just tools
- VCS/collaboration for writing grants, papers, CV/biosketch
- VCS and host for lab web page and all code documentation
- citation management database
- shared lab instructions
- Environments: modulefiles, Dockerfiles, config files
- a shared figure repository for lab members
- presentations
- communicating with groups of people, brainstorming
## Git is a single infrastructure that provides solutions to a huge number of problems
[peppy repository](https://github.com/pepkit/peppy)