Collaborative software development
Nathan Sheffield, PhD
www.databio.org/slides
---
# The three levels of collaboration

---
# Why collaborate on software?

---
# Why collaborate on software?
Because collective progress increases with increased collaboration.

---
# But I don't develop software!
Yes you do. Data analysis is software development

---
# Levels of collaboration
I write and use code for my project. Analogy: Meditation

---
# Levels of collaboration
I give you my script and you run it. Analogy: TV

---
# Levels of collaboration
Interactive work toward a shared goal; collecting bug reports and user feedback. Analogy: Brainstorming conference call.

---
# Levels of collaboration
Interdependent work toward a shared goal. Everyone contributes, adjusts to others, and does something different. Analogy: a sports team.

---
# How do we move toward coordination?

---
# Do you recognize either of these logos?
---
Git
a distributed version-control system that tracks changes in software development
created by Linus Torvalds in 2005 for development of the Linux kernel
free and open-source (GPLv2)
GitHub
a web-based hosting service for version control using Git
company started Feb. 2008
purchased by Microsoft for $7.5 billion in 2018
---
# What is the git/GitHub ecosystem used for?
# It solves problems in...
1. version control. [centralized vs distributed](https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control) [git vs svn](https://trends.google.com/trends/explore?date=all&geo=US&q=git,svn)
2. distribution. [the octoverse](https://octoverse.github.com/)
3. collaboration. [dashboard](https://github.com/orgs/databio/dashboard)
---
# Git solves problems
## Version control

---
## Problem 1
### My computer crashed and I lost all my code.
Solution: Remote backup (S3?) *or* git + GitHub

---
## Problem 2
### I want to work on my code from my home and work computers.
Solution: Remote working copy (Dropbox?) *or* git + GitHub

---
## Problem 3
### My changes broke this function and I can't remember how it used to work.
Solution: Manual version control: "code1.R" and "code2.R"? *or* git + GitHub

---
## Problem 4
### I can't remember what code I used on this sample last year. Or, I want to note this particular version because I used it for the initial paper submission.
Solution: Version control + unstructured notes/logs? *or* git + GitHub tags

---
## Problem 5
### My remote backup crashed and I lost all my history.
Solution: More remote backups (*distributed* VCS)? *or* git + GitHub

---
# Git solves problems
## Distribution

---
## Problem 1
### I want to publish my code with my paper so others can find and use it. How should I do it?
Solution: Website? *or* git + GitHub

---
## Problem 2
### How can I get a permanent, fast URL for my software so I can build an automated container that will download and install it?
Solution: A high-quality code hosting service? *or* git + GitHub

---
## Problem 3
### I'd like other people to be able to find and use my code. How can I advertise it?
Solution: Google AdWords? *or* git + GitHub

---
## Problem 4
### How can I find software that people actually use that's relevant for my project?
Solution: Google? *or* git + GitHub

---
# Git solves problems
## Collaboration

---
## Problem 1
### Someone else found a bug in my code and wants to show me how to fix it.
Solution: E-mail? *or* the user submits a pull request on GitHub. You can also [point to specific lines](https://github.com/databio/pypiper/blob/653216887cb2b2ad8e9119b76f40b39da58ec115/pypiper/ngstk.py#L72-L75).

---
## Problem 2
### My friend and I are working on a similar problem. How can we share our code with one another, but not with anyone else?
Solution: E-mail? Dropbox? *or* GitHub collaborators or organizations

---
## Problem 3
### My collaborator wants to keep using my code for this current project while I develop and test a new feature.
Solution: Duplicate the code? *or* git branches + GitHub

---
## Problem 4
### A user is having trouble getting something to work. How do they know who I am and how to contact me?
Solution: An e-mail address on a web page? *or* git + public GitHub issues

---
## Problem 5
### I figured out how to adapt this published tool to work for my data. How can I contribute these changes back to the original authors?
Solution: E-mail? *or* git + GitHub pull request

---
## Problem 6
### Everyone in our lab/center needs to do a similar thing over and over, with slight differences. How can we share effort but also keep things separate?
Solution: Lots of duplicated scripts with minor tweaks? *or* git + GitHub branches and tags

---
# Key git/GitHub concepts
## repository *vs* remote
## branch *vs* clone
## clone *vs* fork
## pull request *vs* merge
## commit *vs* push
## issue, tag, [stage](https://git-scm.com/book/en/v2/Git-Basics-Recording-Changes-to-the-Repository)

---
# How git works
## And things to avoid

---
## Do: commit text files
Git uses line-by-line comparison. See this [pull request on the `peppy` repository](https://github.com/pepkit/peppy/pull/238/files)
## Don't: commit binary files

---
## Do: commit small versioned files
Git retains a copy of everything you've committed, even if you delete it.
## Don't: commit large static files

---
## Do: make commits frequently
Almost anything can be undone. Frequent commits help you track your work.
## Don't: be scared to break something

---
## Do: learn to use branches
[Branches](https://git-scm.com/book/en/v2/Git-Branching-Branches-in-a-Nutshell) are a super useful organizational structure.
## Don't: be scared of branches

---
## Do: use the command line
Write your own [aliases](https://github.com/nsheff/env/blob/master/alias_git.sh) for commands you use frequently.
## Don't: just rely on the web interface

---
## Do: use the issue tracker
Every project can enable a GitHub issue tracker, which links nicely to code.
## Don't: use e-mail to document problems and solutions

---
## Other niceties
- *GitHub Pages*: free hosting for static web pages
- *Jekyll*: the blog-aware static site generator behind GitHub Pages
- *Git hooks*: execute scripts before or after events
- *GitHub Wiki*: a no-frills wiki on every repository
- *GitHub project tracker*: integrates a simple kanban system
- *GitHub API*: provides programmatic access
- *GitHub Actions*: provides short-lived compute to build/deploy
- *Gists*: small code snippets
- Free private repositories for individuals
- Free private repositories for academic groups

---
## Git's utility transcends software
- analytical code, not just tools
- VCS/collaboration for writing grants, papers, CV/biosketch
- VCS and host for lab web page and all code documentation
- citation management database
- shared lab instructions
- Environments: modulefiles, Dockerfiles, config files
- a shared figure repository for lab members
- presentations
- communicating with groups of people, brainstorming

---
## Git is a single infrastructure that provides solutions to a huge number of problems

---
[peppy repository](https://github.com/pepkit/peppy)
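
---
## Sketch: line-by-line comparison

The "Do: commit text files" slide notes that Git uses line-by-line comparison. Here is a minimal sketch of that idea, using Python's standard `difflib` module rather than Git itself. The file contents, the `sample.yaml` name, and the `changed_lines` helper are all hypothetical, made up for illustration; Git's own diff machinery is more sophisticated, so treat this only as an intuition aid.

```python
import difflib

# Hypothetical "before" and "after" versions of a small text file.
old_version = """\
name: sample1
genome: hg19
protocol: ATAC
""".splitlines(keepends=True)

new_version = """\
name: sample1
genome: hg38
protocol: ATAC
""".splitlines(keepends=True)


def changed_lines(old, new):
    """Return only the removed ('-') and added ('+') lines of a unified diff."""
    diff = list(
        difflib.unified_diff(old, new, fromfile="a/sample.yaml", tofile="b/sample.yaml")
    )
    body = diff[2:]  # drop the two file-header lines ('---' and '+++')
    return [line.rstrip("\n") for line in body if line.startswith(("+", "-"))]


if __name__ == "__main__":
    for line in changed_lines(old_version, new_version):
        print(line)
    # -genome: hg19
    # +genome: hg38
```

A one-line edit to a text file shows up as one removed line and one added line, which is what makes text diffs easy to review. A binary file has no meaningful lines to compare, so a diff can only report that the file changed.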