By Zachary Leung
Have you ever composed a text message on your smartphone and had the app suggest the next word? It is amazing that, many times, the app is correct. How does it work? By sheer chance? Or by reading your mind?
The fact that technology can often predict the next word you are going to type is possible only because your text message comes from your mind. The sentence you type is informative and meaningful because it is intelligently designed, not produced by chance.
Scientists have recently found similar predictability in biochemical systems. Does this similarity mean that biochemistry is also intelligently designed? Before answering this question, let us look at one way that mathematicians characterize
information in the sentences we type.
N-Gram Language Modeling
Language is a sophisticated human cognitive process, and N-gram modeling is just one of the language modeling techniques widely employed in a variety of artificial intelligence applications, such as autocorrect and speech recognition.
We will consider three main types of N-gram modeling: unigrams, bigrams, and trigrams.
Unigrams: We know that some English words are used more frequently than others. By analyzing millions of sentences, Oxford University Press reports that the top 10 most frequently used nouns include “time,” “person,” “year,” “way,” and “day.”1 There are also lists of the top 100 and top 10,000 most frequently used words. These lists agree with human intuition, in contrast to the most uncommon words, such as “futhorc” and “chaulmoogra.” Unigrams use these word-frequency lists to predict the next word in a sentence, without relying on previously written words.
Bigrams and Trigrams: Let us look at this partial sentence: “The quick brown fox…” We can probably make some good guesses about what the next word might be: “is,” “eats,”
and “jumps” are all possibilities. Similarly, bigrams and trigrams use the previous one or two words, respectively, to predict the probability of the next word, whereas unigrams use the frequency of the next word alone.
Thus, in our partial sentence, bigrams and trigrams would use “fox” and “brown fox,” respectively. Generally, the larger N is, the better the predictive power becomes.
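The counting behind a bigram model can be sketched in a few lines of Python. The toy corpus and word choices below are invented for illustration; real models are trained on millions of sentences:

```python
from collections import Counter, defaultdict

# A toy corpus; real N-gram models are trained on millions of sentences.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox eats the hen",
    "the slow brown fox jumps again",
]

# Count bigrams: how often word B follows word A.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigram_counts[prev][nxt] += 1

def predict_next(word):
    """Return the most likely next word given the previous word (bigram model)."""
    candidates = bigram_counts.get(word)
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("fox"))  # "jumps" follows "fox" twice, "eats" only once
```

A trigram model works the same way, except that the key is the previous two words rather than one.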
Perplexity is a mathematical measurement of how well a model predicts a result. Researchers use an information-theoretic metric known as word perplexity to quantify a language’s branching factor, which is the average number of possible
words that can follow any word. It is a measure of uncertainty. For example, studies show that the Wall Street Journal (WSJ) uses a vocabulary of 19,979 unique English words.2 If a writer who knew nothing about
English were to “unintelligently” compose a sentence, they would pick random words from this vocabulary. In this case, the word perplexity would always be 19,979, as shown in Table 1, reflecting a mere chance scenario when
the writer uses no intelligence to design the sentence.
Table 1: The use of N-grams shows that the perplexity in WSJ is two orders of magnitude lower than the perplexity in sentences of random WSJ words.3
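The “random writer” baseline can be verified with a short computation. Perplexity is the exponential of the average negative log-probability per word, so a writer who assigns every one of the 19,979 WSJ words equal probability has a perplexity equal to the vocabulary size; this sketch assumes only that definition:

```python
import math

def perplexity(probabilities):
    """Perplexity = exp of the average negative log-probability per word."""
    n = len(probabilities)
    return math.exp(-sum(math.log(p) for p in probabilities) / n)

# A "writer" picking uniformly at random from the WSJ vocabulary of 19,979
# words assigns every word probability 1/19979, so the perplexity of any
# sentence equals the vocabulary size, regardless of sentence length.
V = 19979
random_sentence_probs = [1 / V] * 10  # any 10-word random sentence
print(round(perplexity(random_sentence_probs)))  # 19979
```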
In reality, writers use intelligence to design sentences. By analyzing a WSJ corpus of 38 million words, we can compute the N-gram perplexity, which is found to be 962, 170, and 109 for unigrams, bigrams, and trigrams, respectively.4 The use of bigrams and trigrams indicates that the word perplexity of actual sentences in WSJ texts is two orders of magnitude lower than the word perplexity of sentences of random words (19,979). This significant
reduction shows that writers intelligently design sentences to carry information and meaning, rather than relying on random chance. Is there a similar perplexity reduction in nature?
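The direction of this reduction can be demonstrated on a toy example. The word counts below are made up for illustration (the WSJ study used a 38-million-word corpus); the point is only that a model which knows word frequencies is less “perplexed” by real text than a uniform, chance model:

```python
import math
from collections import Counter

def perplexity(probs):
    """Geometric-mean inverse probability per word."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Hypothetical toy counts; real studies use corpora of millions of words.
counts = Counter({"the": 60, "fox": 10, "jumps": 10, "dog": 10, "hen": 10})
total = sum(counts.values())

sentence = ["the", "fox", "the", "dog"]

uniform_pp = perplexity([1 / len(counts)] * len(sentence))      # chance baseline
unigram_pp = perplexity([counts[w] / total for w in sentence])  # frequency-aware

print(uniform_pp > unigram_pp)  # knowing word frequencies lowers perplexity
```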
Consider the following example from biochemistry that demonstrates the similarity between intelligent human language and proteins.
Proteins are polypeptides, biomolecules that form when the cellular machinery links amino acids together.5 Within the structure of proteins, biochemists have discovered compact, self-contained folded regions called protein domains, each possessing a unique biochemical function. Each protein, therefore, consists of a combination of domains.6
Resemblance to Human Languages
Researchers have discovered a remarkable resemblance between the information structure found in proteins and human languages, as summarized in Table 2.
Table 2: Analogy between biochemistry and human languages.
Recently, a team of scientists led by Yu used N-gram modeling to study protein architectures.7 They examined a dataset of 23 million protein domains across 4,794 species. Since most organisms, especially bacteria and archaea, have proteins composed of two or fewer domains, they used unigrams and bigrams only. These scientists found that (1) over 95% of all possible bigrams were absent, indicating that the protein sequences were far from random; and (2) there was a “quasi-universal grammar” imposed on protein domains, showing the parallel between proteins and languages. For creationists, this result resonates powerfully with the idea that life was created by an intelligent Designer.
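The first finding, the fraction of possible domain bigrams that never occur, can be sketched on a toy dataset. The protein and domain names below are made up for illustration; Yu’s team analyzed 23 million real domains:

```python
from itertools import product

# Hypothetical toy data: each protein written as a sequence of domain names.
proteins = [
    ["kinase", "SH2"],
    ["kinase", "SH3"],
    ["SH2", "SH3"],
    ["kinase", "SH2"],
]

# The domain "vocabulary" and every adjacent domain pair actually observed.
domains = {d for p in proteins for d in p}
observed = {(a, b) for p in proteins for a, b in zip(p, p[1:])}

# All bigrams that could occur if domains combined freely.
possible = set(product(domains, repeat=2))

missing_fraction = 1 - len(observed) / len(possible)
print(f"{missing_fraction:.0%} of possible domain bigrams never occur")
```

In this toy dataset, only 3 of the 9 possible pairs appear; in the real dataset, the missing fraction exceeds 95%.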
Protein Domain Perplexity
By analyzing the dataset used by Yu’s team, this author examined perplexity in protein domains. Table 3 shows that the average numbers of distinct protein domains for archaea, bacteria, and eukarya (the three domains of life) are 671, 917, and 2,434, respectively. If proteins were formed by naturalistic processes that link protein domains together at random, the perplexity for archaea, bacteria, and eukarya would always be 671, 917, and 2,434, respectively.
This is analogous to the perplexity of 19,979 for WSJ texts if sentences were unintelligently written with random words.
Table 3: Perplexity if proteins were formed by randomly linking protein domains.
However, as shown in Table 4, the unigram and bigram perplexities in eukarya are 42 and 16, respectively, two orders of magnitude lower than the perplexity in random sequences of eukaryote protein domains (2,434). Similar reductions
are found for archaea and bacteria. Analogous to the WSJ texts, N-gram modeling shows that protein domain sequences are far from random. Instead, just like newspaper articles or text messages, they carry information and meaning.
Table 4: Perplexity in actual proteins (as shown by unigrams and bigrams) is two orders of magnitude lower than the perplexity in “random” proteins. Perplexities
are directly calculated from data sets.9
Sentences are not sequences of random words; you and I write sentences with a perplexity significantly lower than the perplexity of gibberish. This is a hallmark characteristic of intelligent design. Similarly, the perplexity in proteins
is much lower than the perplexity in random sequences of protein domains. I see at least as much intelligent design in proteins as in my writing. Perplexity reduction alone builds a positive case that proteins harbor information,
and therefore have been intelligently designed. Proteins, like writing, are indeed information-rich. The analogy between proteins and writing, both bearing the hallmark characteristics of intelligent design, points to a Creator.