# Error Control Coding in Biology Implies Design, Part 3 (of 5)

Parts 1 and 2 of this series observed that biological genetic systems function as information-processing systems, and a case was made that coding techniques protect the genetic data. As a specific example, the genetic code appears designed to minimize the effects of errors in a way that is directly analogous to Gray codes. Gray codes are commonly used by engineers to protect data processed by many modern digital communications systems.

We now turn our attention to another analogy.

## Analogy: Complementary Base Pairing Parity Code

A useful concept for appreciating this analogy is Hamming distance.1 Good codes reduce the probability of error by increasing the minimum Hamming distance between codewords relative to the distance that would be obtained if no code (or a less powerful code) were used. This minimum distance is like the weakest link in a chain and therefore characterizes the code’s strength.
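As an illustrative sketch (the function names are my own, not from the series), Hamming distance and a code’s minimum distance can be computed directly:

```python
from itertools import combinations

def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length words differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def minimum_distance(code: list[str]) -> int:
    """Smallest pairwise Hamming distance over all codeword pairs."""
    return min(hamming_distance(a, b) for a, b in combinations(code, 2))

# A 3-bit repetition code has minimum distance 3, so a single flipped
# bit still leaves the received word closest to the intended codeword.
print(minimum_distance(["000", "111"]))  # → 3
```

A received word with one flipped bit, say "010", sits at distance 1 from "000" but distance 2 from "111", which is why a larger minimum distance makes the intended message “stand out.”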

As the code’s minimum distance is increased, it is easier for a recipient to detect a message with no errors. Consider a scavenger hunt game where you “hide” something for a toddler to find, for example, a large blue ball among a pile of smaller white balls. The intent is to make the target blue ball stand out among the others. This exercise is roughly analogous to high Hamming distance. The large blue ball (the intended message) among a number of smaller white balls (errors) exhibits a relatively large dissimilarity, and so the blue ball is easily visible and detectable.

On the other hand, as the code’s minimum distance is decreased, the recipient is more likely to make errors in message detection. Now consider an adult scavenger hunt where objects are “hidden” in plain sight because they blend in so well with their surroundings. It is difficult to “see” an object you are looking for, even if you are staring straight at it. The intent here is to make the object blend in with the objects around it. This is roughly analogous to low Hamming distance. The target object (the intended message) is hidden among very similar objects (errors) and exhibits a relatively high similarity to them. Thus, the desired object is not easily visible and detectable.

In digital communications, perhaps one of the simplest examples of an error-detecting code (see here and here) is an even parity code. This code operates on a binary message frame (i.e., a sequence of binary digits, 1s and 0s). One parity bit is added to the message frame, and its value is chosen to “round out” the frame’s “value,” making the message stand out more among the possibilities by increasing the code’s minimum distance (like the example with the blue ball). The recipient can then more easily detect that an error has occurred when it receives a “non-round” value (i.e., a white ball). “Round” and “non-round” values have precise mathematical definitions in coding theory. The main point is that all parity codes, and the even parity code in particular, impart a precise mathematical structure to the protected (coded) data. This structure increases the minimum distance between valid codewords and allows for more robust error detection. (See here for more information on parity codes used in engineering.)
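A minimal sketch of an even parity encoder and checker (the names are my own) shows how the added bit exposes any single-bit error:

```python
def add_even_parity(frame: str) -> str:
    """Append a parity bit so the codeword has an even number of 1s."""
    parity = frame.count("1") % 2
    return frame + str(parity)

def check_even_parity(codeword: str) -> bool:
    """Every valid codeword has an even number of 1s."""
    return codeword.count("1") % 2 == 0

sent = add_even_parity("1011")       # "10111": four 1s, an even count
print(check_even_parity(sent))       # → True: no error detected

# Flip one bit in transit; the count of 1s becomes odd.
corrupted = sent[:2] + ("0" if sent[2] == "1" else "1") + sent[3:]
print(check_even_parity(corrupted))  # → False: single-bit error detected
```

Note that the check only detects the error; it cannot say which bit flipped, which is why a parity code is an error-*detecting* rather than error-correcting code.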

Recall that DNA is a double-stranded structure, specifically a double helix, and that the four nucleotide bases in the DNA chemical alphabet are A, C, G, and T. Nucleotides A and T are complementary, as are G and C, and these pairings are the basis for the double-stranded structure, in which each strand carries the same information as the other. Research into the chemical bonds at work between these complementary base pairs reveals that the natural nucleotide alphabet has been chosen to minimize the probability that a given nucleotide on one strand will be incorrectly paired with a partner on the opposite strand. More specifically, a researcher found that the nucleotides used for the DNA chemical alphabet actually form an even parity code. (See research work here and here.)

A convention was used to consistently assign binary values (i.e., 1 or 0) to certain features of the four nucleotides that comprise the chemical alphabet in DNA. The relevant features are the relative size of a nucleotide and its donor-acceptor pattern. The donor-acceptor pattern governs hydrogen bonding: hydrogen bonds are formed between a nucleotide and its partner on the opposite strand. Careful observation of the resulting binary values reveals that they form an even parity code. To be more precise, the relative size of a nucleotide acts as a parity bit over its hydrogen donor-acceptor (D/A) pattern.2
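This parity relation can be made concrete in code. The bit assignments below follow the convention in endnote 2 (donor = 1, acceptor = 0; pyrimidine = 1, purine = 0), but the specific three-position D/A patterns are my reading of the cited research’s idealized Watson-Crick framework, so treat them as an illustrative assumption:

```python
# Assumed 4-bit encodings: three D/A bits at the hydrogen-bonding
# positions, followed by the size bit (per the endnote 2 convention).
NUCLEOTIDES = {
    "C": "1001",  # D/A pattern DAA (100) + pyrimidine size bit (1)
    "G": "0110",  # D/A pattern ADD (011) + purine size bit (0)
    "T": "0101",  # D/A pattern ADA (010) + pyrimidine size bit (1)
    "A": "1010",  # idealized D/A pattern DAD (101) + purine size bit (0)
}

def is_even_parity(bits: str) -> bool:
    return bits.count("1") % 2 == 0

def bitwise_complement(bits: str) -> str:
    return "".join("1" if b == "0" else "0" for b in bits)

# Every natural nucleotide falls in the even-parity set...
print(all(is_even_parity(bits) for bits in NUCLEOTIDES.values()))  # → True
# ...and Watson-Crick partners are bitwise complements of each other.
print(bitwise_complement(NUCLEOTIDES["A"]) == NUCLEOTIDES["T"])  # → True
print(bitwise_complement(NUCLEOTIDES["C"]) == NUCLEOTIDES["G"])  # → True
```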

Within this binary framework, there are 16 possible nucleotide letters. Why did nature settle on these specific four, and why only four? At first glance, one might charge a designer with inefficiency, because additional nucleotides could have been used to enlarge the alphabet, leading to a more efficient genetic code and protein-synthesis mechanism. In fact, there has been speculation along these lines. The researcher used the same binary convention to determine the binary representation of the other 12 nucleotides. Upon close inspection, he found that the 16 total nucleotides separate under this framework into eight belonging to the even-parity set and eight belonging to the odd-parity set, with the natural alphabet drawn from the even-parity nucleotides. “Nucleotide Hamming distance” is maximized as a result, as is typical for parity codes, leading to a robust mechanism for error minimization.
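The 8/8 split and its distance benefit can be checked by enumerating all 16 four-bit patterns in this framework (a sketch under the binary convention above):

```python
from itertools import combinations, product

# All 16 candidate 4-bit nucleotide patterns.
patterns = ["".join(bits) for bits in product("01", repeat=4)]

even_set = [p for p in patterns if p.count("1") % 2 == 0]
odd_set = [p for p in patterns if p.count("1") % 2 == 1]
print(len(even_set), len(odd_set))  # → 8 8

def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))

# Using all 16 patterns, two letters can differ in a single feature;
# restricting to one parity class doubles the minimum distance.
print(min(hamming(a, b) for a, b in combinations(patterns, 2)))  # → 1
print(min(hamming(a, b) for a, b in combinations(even_set, 2)))  # → 2
```

A minimum distance of 2 means any single-feature mismatch lands outside the valid set and can be rejected, which is the error-minimization property described here.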

Within this framework, the specific four natural nucleotides emerge as optimal. From this perspective, the genetic machinery is directly analogous to a 1-bit, even-parity decoder as used routinely in engineering applications. Such a decoder is the optimal way to recover the intended message when a parity code has been used.

The parity code model is an interpretation that readily flows from the relevant chemical bonds that bind the complementary nucleotide pairs. It is a way to mathematically represent or express what is happening at a chemical level. The researcher comments that:

The purine-pyrimidine and hydrogen donor-acceptor patterns governing nucleotide recognition are shown to correspond formally to a digital error-detecting (parity) code, suggesting that factors other than physiochemical issues alone shaped the natural nucleotide alphabet…When this error-coding approach is coupled with chemical constraints, the natural alphabet of A, C, G, and T emerges as the optimal solution for nucleotides.3

In summary, we have seen that an error-detecting code (a parity code) is at work to minimize incorrect bonding between nucleotide pairs on the complementary strands of DNA. For DNA replication to be accurate it is critical that the strands be the true complement of each other. We furthermore note that the specific code used by DNA, an even parity code, is a mainstay in modern communications systems.

The next article in this series will examine another coding analogy between modern digital communications systems and the genetic information-processing system.

## Endnotes
1. See here and here for more information on Hamming distance.

2. Refer to Figure 1 here. A and G are larger nucleotides called purines; C and T are smaller nucleotides called pyrimidines. Lone pairs are rich in electrons and participate in weak bonding with hydrogen atoms to form hydrogen bonds between complementary pairs. Hydrogen atoms are also referred to as hydrogen donors, and lone pairs as hydrogen acceptors. A binary “1” was assigned to hydrogen atoms (i.e., hydrogen donors, D), and a binary “0” to lone pairs (i.e., hydrogen acceptors, A). Likewise, a binary “1” was assigned to the smaller nucleotides (pyrimidines) and a binary “0” to the larger nucleotides (purines).

3. Dónall A. Mac Dónaill, “A Parity Code Interpretation of Nucleotide Alphabet Composition,” Chemical Communications, no. 18 (2002): 2062–63.