Scientific writing:
conciseness, clarity, and cohesion

by Nathan Sheffield


Assistant Professor, Center for Public Health Genomics
University of Virginia

Let's start with an example

Example 1

What makes this sentence unclear?
The assumptions that all sites evolve at one of two evolutionary rates (conserved and nonconserved), that these rates are uniform across the genome, that sites evolve independently conditional on whether they are in conserved or nonconserved regions, and that the phylogenetic models for conserved and nonconserved regions have the same branch-length proportions, base compositions, and substitution patterns, all represent oversimplifications of the complex process of sequence evolution in eukaryotic genomes.

Example 1

What makes it unclear:
  • Distance between subject and verb
  • Complex subject
  • Verbs vs. implied actions (nominalizations)
  • List precedes its context

Conciseness
↑ ↓
Clarity
↑ ↓
Cohesion

The Four Problems

Things that make scientific writing unclear
  1. Subjects and verbs too far apart
  2. Overabundance of nominalizations
  3. Poor flow (misplacement of old vs new information)
  4. Excessive or unnecessary use of passive voice


NOT the complexity of the topic!

The Four Problems

Subjects and verbs too far apart
  • Who did it, and what did they do? English readers expect doers to be near their actions.
  • Complex subjects (subjects modified with essential clauses) can violate this expectation.

The Four Problems

Overabundance of nominalizations
  • English readers expect actions to be in verbs.
  • Nominalizations are actions that appear in parts of a sentence other than a verb (e.g. in nouns or adjectives). The word "nominalization" is a nominalization of the verb "to nominalize."
  • Nominalizations aren’t all created equal: some are clear, others reduce clarity.

The Four Problems

Overabundance of nominalizations
Actions in Nominalizations:
The assumption that all RNAs are polyadenylated is an oversimplification of the transcription process.
Actions in Verbs:
The model oversimplifies the transcription process because it assumes that all RNAs are polyadenylated.

The Four Problems

Poor flow (lack of cohesion)
A cohesive sentence links with neighboring sentences by starting with familiar ideas and ending with new ideas.

old → new

Disrupt flow by:
  • Starting with unfamiliar ideas
  • Ending with backwards-linking ideas


Cohesion matters at both sentence-level and paragraph-level.

The Four Problems

Poor flow (lack of cohesion)
(in a paper about farmers...) Farmers try to provide optimal growing conditions for crops by using soil additives to adjust soil pH. Garden lime, or agricultural limestone, is made from pulverized chalk, and can be used to raise the pH of the soil. Clay, which is a naturally acidic soil type, often requires addition of agricultural lime.

The Four Problems

old information vs new information
→ Farmers try to provide optimal growing conditions for crops by using soil additives to adjust soil pH. One way to raise the pH of the soil is an additive made from pulverized chalk called garden lime or agricultural limestone. Agricultural limestone is often added to naturally acidic soils, such as clay.

The Four Problems

Excessive or unnecessary use of passive voice
Passive voice is sometimes useful, but it has several side-effects:
  • It often increases length
  • It can eliminate the actor (causing ambiguity)
  • It reverses the order of the sentence (A-B vs. B-A):
    • I stole the money
    • The money was stolen by me

The Four Problems

Excessive or unnecessary use of passive voice
  • Consider cohesion: Don’t choose passive voice simply out of habit. Do choose passive voice when it improves cohesion by putting familiar ideas first.
  • Most scientific journals encourage authors to use active voice for the sake of clarity, conciseness, and cohesion.
  • Passive voice is NOT inherently scientific!

Passive voice

What do the journals say?
Science
Use active voice when suitable, particularly when necessary for correct syntax (e.g., ‘To address this possibility, we constructed a λZap library...,’ not ‘To address this possibility, a λZap library was constructed...’).

Passive voice

What do the journals say?
Nature
“Active voice has been Nature policy for as long as I can remember; it is enshrined in our style manual and is specifically recommended to all authors as part of our standard acceptance procedure. However, if an author insists on the passive, we would probably allow it...So you will see papers in Nature in the passive voice, but you can be assured that this is at the author's insistence rather than Nature policy.” -- Maxine Clarke, editor

Passive voice

What do the journals say?
Astronomical Society of the Pacific
Use active voice as much as possible, and avoid passive voice as you would avoid the Ebola virus. This means writing ‘Astronomers discovered a new planet’ (active voice) rather than ‘A new planet was discovered by astronomers’ (passive voice). You should write less than 10 percent of your sentences in passive voice.

Revision techniques

Ways to improve clarity, conciseness, and cohesion
  1. Omit needless words
  2. Put actions in verbs (avoid nominalizations)
  3. Place verbs near subjects
  4. Put familiar information first

Revision techniques

Omit needless words
  1. It is absolutely vital that...
    → We must...
  2. At the same time...
    → Simultaneously/furthermore...
  3. There were five mice receiving antibiotics...
    → Five mice received antibiotics.

Revision techniques

Put actions in verbs
  1. We performed an analysis...
    → We analyzed...
  2. The quantification of the atoms was done...
    → The atoms were quantified.
  3. The MS managed the measurement and identification of the proteins.
    → The MS measured and identified the proteins.

Revision techniques

Put actions in verbs
Nominalizations are useful when they summarize the action of the previous sentence:
Our analysis using regression and k-means clustering revealed...
→ We analyzed the data with regression and k-means clustering. This analysis revealed...

Revision techniques

Place verbs near subjects
DNA in repeat regions or small microsatellites or with long stretches of the same base causes problems for next-gen sequencers.
→ DNA causes problems for next-gen sequencers when it is in repeat regions or small microsatellites or has long stretches of the same base.

Revision techniques

Put familiar information first
We searched the database of sequences to look for similar structures. A protein involved in the regulation of the BRCA1 gene in humans was found by the search.
→ We searched the database of sequences to look for similar structures. This search found a protein involved in the regulation of the BRCA1 gene in humans.

Now for some practice

Example 2 - What would you do?
This component will chiefly involve a description and quantitative analysis of the study’s data collection process.

We suggest: put actions in verbs
→ This component describes and quantitatively analyzes the data collection process.

The sentence is more concise (10 vs 16 words).
The meaning is clearer.

Example 3 - What would you do?
Detailed analyses of the evolutionary features of different types of regulatory elements are an important area for future research.

We suggest: put actions in verbs

Consider implied actions vs. verb.
→ Future research should analyze the evolutionary features of different types of regulatory elements.
The sentence is more concise (13 vs 19 words).
The subject is clearer.
The subject and verb are closer together.

Example 4 - What would you do?
Improvements are expected in the predictive power of all the scores being computed on multispecies alignments.

We suggest: use active voice

→ [We expect to] improve the predictive power of our multispecies alignment scores.
The sentence is more concise (12 vs 16 words).
Prepositions no longer disrupt flow.
The sentence is more direct.

Example 5 - What would you do?
Some astonishing questions about the nature of the universe have been raised by scientists studying the nature of black holes in space. The collapse of a dead star into a point perhaps no larger than a marble creates a black hole.

We suggest: put familiar info first, omit needless words

→ Scientists studying black holes have raised some astonishing questions about the universe. A black hole is created by the collapse of a dead star into a point perhaps no larger than a marble.
The link is clearer; these sentences are more cohesive.

Example 6 - What would you do?
The second reaction is really the end result of a very large number of reactions. It is also worth noting that these two reactions form a simple linear chain whereby the product of the first reaction is the reactant for the second.

We suggest: omit needless words

→ The second reaction is the result of numerous reactions. Moreover, these two reactions form a simple linear chain whereby the product of the first reaction is the reactant for the second.

More concise (31 vs. 42 words)

Example 7 - What would you do?
Significant positive correlations were evident between the substitution rate and a nucleosome score from resting human T-cells.

We suggest: Put actions in verbs

→ In resting human T-cells, the substitution rate correlated with a nucleosome score.

More concise (12 vs. 17 words)
The verb is "correlated" rather than the nebulous "were evident."

Example 8 - What would you do?
We identified genes that are differentially expressed between species. A phylogenetic tree based on the number of differentially expressed genes between species recapitulates their known phylogeny.

We suggest: put familiar info first, place verbs near subjects

→ We identified genes that are differentially expressed between species. The number of differentially expressed genes can be used to build a phylogenetic tree that recapitulates the known phylogeny.
The second sentence now links back to the first at its beginning.
The subject and verb are now closer together in the second sentence.

Example 1 (again, in context) - What would you do?
The model used by the software is a fairly rich probabilistic model, but it is clearly not realistic in several respects. The assumptions that all sites evolve at one of two evolutionary rates (conserved and nonconserved), that these rates are uniform across the genome, that sites evolve independently conditional on whether they are in conserved or nonconserved regions, and that the phylogenetic models for conserved and nonconserved regions have the same branch-length proportions, base compositions, and substitution patterns, all represent oversimplifications of the complex process of sequence evolution in eukaryotic genomes.

Example 1 (again, in context)

We suggest: Put verbs near subjects

The gist of the sentence: Certain assumptions oversimplify the complex process of sequence evolution in eukaryotic genomes.

Should the gist of the sentence go first or last? Before the list of assumptions or after it?

Example 1 (again, in context)

A possible revision

→ [Our model admittedly] oversimplifies the complex process of sequence evolution in eukaryotic genomes by assuming that: (1) all sites evolve at one of two evolutionary rates (conserved and nonconserved), (2) these rates are uniform across the genome, (3) sites evolve independently, conditional on whether they are in conserved or nonconserved regions, and (4) the phylogenetic models for conserved and nonconserved regions have the same branch-length proportions, base compositions, and substitution patterns.

Example 1 (again, in context)

Positive consequences


The most important action (oversimplify) is now a verb
The verbs follow closely after the subjects
The sentence is more cohesive: familiar information links to the previous sentence at the beginning
The sentence contains cues for parsing information (by, [1, 2, 3, 4], however, etc.)

References and further reading
  • The Duke Scientific Writing Resource
  • Style: Toward clarity and grace (1990), Joseph Williams
  • Expectations (2004) and The Sense of Structure (2004), George Gopen
  • How to write consistently boring scientific literature (2007), Kaj Sand-Jensen
  • The infectiousness of pompous prose (1992), Martin W. Gregory
  • How we write about biology (1991), Randy Moore
  • Writing intelligible English prose for biomedical journals (2007), John Ludbrook
  • Whose literature is science? (2003), Judith A. Swan
  • What is the scientific literature? (1986), John Maddox
  • Scientific literature: Clear as mud (2003), Jonathan Knight
  • The science of scientific writing (1990), George Gopen, Judith Swan
  • The readability of marketing journals: are award-winning articles better written? (2008), Sawyer, Laran, & Xu

Thanks for listening!

Slides at http://databio.org/slides/scientific_writing.html