Error Control Coding in Biology Implies Design, Part 1 (of 5)

It’s no secret that we are highly dependent on electronic devices. We use them for managing schedules, communicating, staying “connected,” and storing, managing, and using various media such as music, photographs, and videos. These information processing devices have become an integral part of our everyday lives. It is vitally important that these devices use robust methods to protect the integrity of the data they store and process.

Information theory, information processing, and error control coding [1] are relevant fields behind the technological systems and devices of our time. Engineers work diligently to protect the integrity of data processed by various terrestrial and satellite communications systems in place today. These systems and associated machines enable reliable communications on a truly global scale. Critical to successful communications and reliable information processing are coding techniques [2] that engineers have discovered and developed. These techniques play an important role in maintaining the high reliability of data in spite of many error-inducing characteristics of a typical communications link. As remarkable as this technology is, it turns out that within our own cells there exists an even more elegant set of information processing miracles.

Strict and rigorous analogies of information processing systems and several man-made coding techniques occur in nature, specifically in genes. In this series, we will explore the genetic system and will see that it is actually an information processing system.

Genetic System Overview [3]

The so-called “central dogma of molecular biology” acknowledges that, at the most fundamental level, the construction of proteins involves a one-way flow of information. Proteins are the workhorses of the cell and are vital for all cell functions. Information for protein construction is stored in the cell’s DNA, which is contained in the nucleus. In the nucleus, the DNA is transcribed to mRNA; the mRNA then carries the information out of the nucleus into the cytoplasm for protein construction. At the ribosome, the process of translation reads the information copied into the mRNA and builds the sequence of amino acids, which will eventually fold to form the protein. This entire process is called protein synthesis. The flow of protein-building information from DNA, together with the copying of information in DNA replication, shows that the cell’s bio-machinery is an information-based system.

Every cell nucleus contains DNA composed of two long strands of nucleotides. The four nucleotide bases (adenine, guanine, cytosine, and thymine) make up a four-letter chemical alphabet (A, G, C, T). The combinations that arise from this alphabet describe how to construct each and every protein in the cell. Each strand of the DNA is complementary to the other strand, and they are intertwined to form the famous double helix. Coding portions of DNA describing the construction of proteins are called genes.
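The complementary relationship between the two strands can be sketched in a few lines of Python. This is only an illustration: it assumes the standard base pairing (A with T, G with C), and the function name and example sequence are hypothetical.

```python
# Standard base pairing: A pairs with T, and G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(strand: str) -> str:
    """Return the partner strand, written in its own reading direction.

    The two strands run in opposite directions, so the input is
    reversed before each base is swapped for its pairing partner.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand))


example = "ATGATCGGC"  # hypothetical sequence, for illustration only
partner = complement_strand(example)
print(partner)  # GCCGATCAT

# Either strand fully determines the other, so the same information
# is effectively stored twice.
assert complement_strand(partner) == example
```

Because each strand can be rebuilt from the other, the double helix already carries a simple form of redundancy.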

The structure of RNA is similar to that of DNA, with the main exception that RNA consists of a single strand, and so is not characterized by a double helix. Also, uracil replaces thymine, so the nucleotide alphabet for RNA is A, G, C, U. Three sequential nucleotides form a codon, the fundamental unit that specifies one amino acid in the sequence that forms the protein. As an example, the codon AUC codes for the amino acid isoleucine.
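As a rough sketch of how codons are read, the following Python builds the standard genetic code as a lookup table and translates a short message. The function names and the example sequence are hypothetical, and the model is deliberately simplified: it treats the mRNA as a copy of the coding strand with U in place of T, and it reads from the first letter rather than searching for a start signal.

```python
# Standard genetic code in compact form: first, second, and third bases
# each cycle through U, C, A, G. One-letter amino acid codes; '*' = stop.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_strand: str) -> str:
    """Simplified transcription: the mRNA matches the coding strand, with U for T."""
    return coding_strand.replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA three letters at a time until a stop codon is reached."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[pos:pos + 3]]
        if residue == "*":  # stop codon: end of the protein
            break
        protein.append(residue)
    return "".join(protein)


dna = "ATGATCGGCTAA"  # hypothetical gene fragment, for illustration only
mrna = transcribe(dna)
print(mrna)                # AUGAUCGGCUAA
print(CODON_TABLE["AUC"])  # I (isoleucine)
print(translate(mrna))     # MIG (methionine, isoleucine, glycine)
```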

Sequences of amino acids are constructed to form polypeptide chains. These chains fold into complex 3-D shapes determined in part by the chemical forces and bonds within the amino acid sequences. Polypeptide chains assemble to complete the construction of the protein. DNA has to specify one amino acid at each link in the chain. Out of more than 80 possible amino acids (each with right- and left-handed versions), only 20 left-handed ones are relevant for biological systems.

The genetic code refers to the mapping between the codons in the DNA and the 20 biologically relevant amino acids. Since there are 4^3 = 64 possible codons and only 20 amino acids, there is redundancy in the code (roughly threefold) and many mappings are possible. Moreover, since there are different degrees of similarity between the 20 relevant amino acids, the specific details of the mapping become very important. In the event that one of the nucleotides in the codon is in error, a good mapping chooses a replacement amino acid as similar as possible to the desired one, thereby maximizing the possibility of a functional protein in spite of the error.
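The counting argument above can be checked with a short Python sketch. It assumes the standard genetic code; the helper names are hypothetical. Note that it only counts substitutions that leave the amino acid exactly unchanged. It does not attempt to score how chemically similar a replacement amino acid is, which is the finer point made in the text.

```python
from collections import Counter

# Standard genetic code, compact form (bases cycle through U, C, A, G; '*' = stop).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# Four letters in three positions give 64 codons for 20 amino acids.
print(len(CODON_TABLE))                       # 64
print(len(set(CODON_TABLE.values()) - {"*"}))  # 20

# How many codons map to each amino acid (and to the stop signal).
degeneracy = Counter(CODON_TABLE.values())
print(degeneracy["L"], degeneracy["I"], degeneracy["W"])  # 6 3 1


def single_base_neighbors(codon: str) -> list[str]:
    """All 9 codons that differ from `codon` at exactly one position."""
    return [
        codon[:p] + b + codon[p + 1:]
        for p in range(3)
        for b in BASES
        if b != codon[p]
    ]


def synonymous_fraction() -> float:
    """Fraction of all single-base substitutions that leave the meaning unchanged."""
    same = total = 0
    for codon, meaning in CODON_TABLE.items():
        for neighbor in single_base_neighbors(codon):
            total += 1
            same += CODON_TABLE[neighbor] == meaning
    return same / total


# For AUC (isoleucine), two of the nine one-letter errors, AUU and AUA,
# still code for isoleucine.
print({n: CODON_TABLE[n] for n in single_base_neighbors("AUC")})
print(round(synonymous_fraction(), 3))
```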

Error Control Coding and Genetic Information Processing

Given that living systems process information, there exist several good reasons to expect error-correcting codes to be in operation:

  • Life itself depends on robust information transfer.
  • There are real impairments that must be overcome. For example, organisms remain functional even though their genomes already carry mutations.
  • The low error rates observed in the genetic system demand an adequate explanation.
  • Genetic information is inherently digital in nature (i.e., genetic information is specified using a finite set of discrete objects) and is characterized by redundancy.

Furthermore, in the words of leading researcher Gail Rosen,

“Since DNA is a finite, symbolic sequence, it is natural to extend the use of coding theory to sequence analysis.” [4]

Redundancy, the most basic property of any error-correction scheme, exists within the genetic system. All error-correction schemes require redundancy in the coded data protected by that scheme. At its root, the genetic mapping code exhibits such redundancy, and coded genetic sequences themselves also exhibit redundancy. Other leading researchers comment,

“All the methods of error-control coding are based on the adding of redundancy to the transmitted information. As the genetic information is redundant, and since the genetic code is also redundant itself, the possible existence of error-control mechanisms represents a somehow natural hypothesis related to the biological task of ensuring a high degree of reliability in the transmission and expression of genetic information.” [5]
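To see why redundancy is the starting point for error correction, consider the simplest textbook scheme from communications engineering, the repetition code. The Python sketch below is an engineering illustration only, with hypothetical function names; it is not a claim about the mechanism cells use. Each symbol is sent three times, and the receiver takes a majority vote.

```python
from collections import Counter


def encode(message: str, repeat: int = 3) -> str:
    """Add redundancy by sending every symbol `repeat` times."""
    return "".join(symbol * repeat for symbol in message)


def decode(received: str, repeat: int = 3) -> str:
    """Recover each symbol by majority vote within its block."""
    blocks = (received[i:i + repeat] for i in range(0, len(received), repeat))
    return "".join(Counter(block).most_common(1)[0][0] for block in blocks)


message = "GATC"
sent = encode(message)
print(sent)  # GGGAAATTTCCC

# One symbol is corrupted in transit (the middle A becomes C).
corrupted = "GGGACATTTCCC"
print(decode(corrupted))  # GATC
```

With three copies, any single error within a block is corrected, at the cost of sending three times as much data. Two errors in the same block can defeat the vote, which is why engineers use more efficient codes in practice.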

We have established that living beings are information-processing systems, and that the genetic communications system is well suited to digital information processing and error-control coding. In future articles, we will turn our attention to a few analogies between modern digital communications systems and genetic information processing.


Keith McPherson

Keith McPherson received his Master of Science in Electrical Engineering from Georgia Institute of Technology in 1993, and currently works as an electrical engineer in Melbourne, FL, in the fields of communications and signal processing.


Endnotes
  1. In general, the techniques we will discuss in this series fall under three broad categories: error-correcting codes, error-detecting codes, and Gray coding. “Forward error correction (FEC)” and “error-control codes” are terms also commonly used in the technical literature to refer to error-correcting codes.
  2. See note 1.

  3. See notes 6, 7, and 8 for complementary overviews of the genetic system and further evidence for design in the genetic system.

  4. Gail Rosen, “Examining Coding Structure and Redundancy in DNA,” IEEE Engineering in Medicine and Biology Magazine 25 (Jan.–Feb. 2006): 62–68.

  5. D. L. Gonzalez, S. Giannerini, R. Rosa, “Detecting Structure in Parity Binary Sequences,” IEEE Engineering in Medicine and Biology Magazine 25 (Jan.–Feb. 2006): 69–81.

  6. Fazale Rana, “Biochemical Synonyms Optimized, Part 1 (of 2),” Today’s New Reason To Believe, August 21, 2008.

  7. Fazale Rana, “Biochemical Synonyms Optimized, Part 2 (of 2),” Today’s New Reason To Believe, August 28, 2008.

  8. Fazale Rana, “FYI: I.D. in DNA; Deciphering Design in the Genetic Code,” Facts for Faith, Quarter 1, 2002, 14–23.