Quick, what’s the longest word in the English language?
If you watched my recent video, you’ll know that in descriptivist terms, that is, how people actually use language in practice, it’s a tie between transinstitutionalization and superincomprehensibleness (both 25 letters). If you allow technical terms, it’s laryngotracheobronchopneumonitis (32 letters). That’s the longest word that’s (normally) used in scholarly literature.
But that’s not the only way to answer the question. Check out this word:
Tryptophan synthase is an enzyme in bacteria that produces the amino acid tryptophan. Humans can’t synthesize it at all, and we have to get it from our food, but bacteria can. Like all enzymes, it’s made of a string of amino acids stuck together. In the case of tryptophan synthase in E. coli, or more specifically the alpha chain (also known as tryptophan synthase A), it’s a relatively small string of 268 amino acids.
(Functional proteins are often made out of multiple sub-proteins that are encoded by separate genes. The other subunit of tryptophan synthase, the beta chain, is made of 397 amino acids. Also, perhaps fittingly, out of the 20 different amino acids, the enzyme that makes tryptophan only contains one tryptophan in its structure, in the beta chain.)
Standard chemical naming rules state that when you attach an amino acid to another molecule, the amino acid’s substituent group name is added as a prefix to the name of the “base” molecule. For example, the prefix form of methionine is methionyl, and a methionine attached to a glutamine is called “methionylglutamine.” The prefix form of glutamine is glutaminyl (not glutamyl, which is the prefix for glutamic acid). If you attach that pair of amino acids to an arginine, you get “methionylglutaminylarginine.” And if you attach all 268 amino acids in tryptophan synthase A together, you get one very long word.
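To make the rule concrete, here is a minimal Python sketch of it. The function name chemical_name and the lookup table keyed by one-letter amino acid codes are assumptions made for this illustration, not necessarily how any published version of the word was produced. Every residue except the last contributes its prefix form, and the last one keeps its full name.

```python
# One-letter amino acid codes mapped to (prefix form, full name).
# The table layout is an illustrative assumption; the names themselves
# are the standard ones for the 20 common amino acids.
AMINO_ACIDS = {
    "A": ("alanyl", "alanine"),
    "R": ("arginyl", "arginine"),
    "N": ("asparaginyl", "asparagine"),
    "D": ("aspartyl", "aspartic acid"),
    "C": ("cysteinyl", "cysteine"),
    "Q": ("glutaminyl", "glutamine"),
    "E": ("glutamyl", "glutamic acid"),
    "G": ("glycyl", "glycine"),
    "H": ("histidyl", "histidine"),
    "I": ("isoleucyl", "isoleucine"),
    "L": ("leucyl", "leucine"),
    "K": ("lysyl", "lysine"),
    "M": ("methionyl", "methionine"),
    "F": ("phenylalanyl", "phenylalanine"),
    "P": ("prolyl", "proline"),
    "S": ("seryl", "serine"),
    "T": ("threonyl", "threonine"),
    "W": ("tryptophyl", "tryptophan"),
    "Y": ("tyrosyl", "tyrosine"),
    "V": ("valyl", "valine"),
}


def chemical_name(sequence: str) -> str:
    """Build the run-together name for a sequence of one-letter codes."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must contain at least one amino acid")
    # Prefix form for everything but the final residue...
    prefixes = [AMINO_ACIDS[code][0] for code in sequence[:-1]]
    # ...and the full name for the final residue.
    return "".join(prefixes) + AMINO_ACIDS[sequence[-1]][1]


print(chemical_name("MQ"))   # methionylglutamine
print(chemical_name("MQR"))  # methionylglutaminylarginine
```

Note that a sequence ending in aspartic acid or glutamic acid would end in a two-word name under this sketch; that edge case is left untouched here.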
Tryptophan synthase A is sometimes said to be the longest word in scientific literature, but Ross Pomeroy at RealClearScience showed that this is not the case. In no scientific paper was the “full chemical name” of this protein ever published. And why would it be? Scientists already have a perfectly serviceable name in “tryptophan synthase A.”
However, it turns out that in the very early days of genetic sequencing, there was a different protein whose full name was written in a scientific reference book. This was the capsid protein of the tobacco mosaic virus, the first virus to be discovered. This protein is only 159 amino acids long, and its name is 1,185 letters.
Even so, tryptophan synthase A may well be the longest word professionally published in any book. This very long word has been included for many years in The Dictionary Project’s A Student’s Dictionary, which is often given out for free to third graders by Rotary Clubs. This is probably meant to be a fun trivia fact to get kids interested in reading. But the thing is, if you actually look at their entry for the word…honestly, it’s a disaster.
You can find photographs of that page of the book online. Here’s one example, but they’re all pretty much the same except for varying degrees of readability:
If you look closely, there are a lot of problems here. Yes, that is a zero in the third-to-last line, and in between two hyphens at that. I honestly don’t know how that happened, much less how no editor caught it. Also, you’ll note that they say the protein is made of 267 amino acids, not 268. This is because it was based on outdated information. I’ll come back to that later. (It also names tryptophan synthase as “tryptophan synthetase,” but that is at least a valid alternate name.)
More troubling is the fact that they say the word has 1,909 letters. This is a problem because the book doesn’t even match its own claim! I transcribed the word into Scrivener as carefully as I could, and while I won’t swear I got everything right, the program told me that it only had 1,895 letters (not including the -0- bit). And while I was transcribing it, I spotted dozens of errors. Specifically, I saw 28 omitted letters, 10 inserted letters, and 16 incorrect letters that didn’t affect the length. Almost one out of every 30 letters was wrong. We need some quality control here.
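Counting by program rather than by eye is simple to reproduce. The sketch below is only an illustration: count_letters is a hypothetical helper, and the sample string is a made-up three-residue name with a line-break hyphen and a stray -0- added, not text from the dictionary. It counts alphabetic characters only, so hyphens, digits, and line breaks don’t inflate the total.

```python
def count_letters(text: str) -> int:
    """Count only alphabetic characters, skipping hyphens, digits, and whitespace."""
    return sum(1 for ch in text if ch.isalpha())


# Made-up sample: "methionylglutaminylarginine" broken across a line,
# with a stray "-0-" inserted to mimic the kind of debris described above.
sample = "methionyl-\nglutaminyl-0-arginine"

print(count_letters(sample))                         # 27
print(count_letters("methionylglutaminylarginine"))  # 27
```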
Even worse, other online sources don’t agree with either letter count. 1,909 is the number that is cited by Wikipedia, but here’s another reference that writes out the word and claims it has 1,846 letters. A little copy-and-paste shows that it actually counts up to 1,853. Given the number of errors I see in that one, I have to wonder if it was copied using a computer text recognition system. (I tried that myself on the images, and it was useless.) Other sources give still more letter counts.
Now, there’s a simple solution to this: look up the amino acid sequence and rewrite the name with the correct spelling. But this has the same problem of human error…so I took the human error out of it. I went to the listing for the protein on UniProt, a massive scientific database of proteins, and copied-and-pasted the amino acid sequence. Then, I wrote a Python script to convert it to a chemical name without having to trust my spelling the whole way through. (Even though I’m pretty sure I could do it better than A Student’s Dictionary did.)
When I did this, I got 1,921 letters, not 1,909. I also counted 268 amino acids, not 267.
What’s going on here?
It turns out that the entry in A Student’s Dictionary is based on the original 1967 paper that listed the full amino acid sequence for the protein. It’s an obvious place to start. In fact, I started there myself (well, actually this paper by the same authors that lists the same sequence in a more readable way) and only went to the UniProt entry later. But the 1967 paper had it wrong. They were short by one amino acid. The isoleucine (Ile) at position 36 is actually doubled.
And it turns out there are other errors, too. All but one of the red words in the video above are errors in the 1967 sequence. The glutamines (Gln) at positions 2 and 133 are supposed to be glutamic acid (Glu). And the asparagine (Asn)-isoleucine pair at positions 244 and 245 are actually reversed.
If you work these changes back, you get almost the exact amino acid sequence that appears in A Student’s Dictionary. Several websites list the word as being 1,913 letters, which turns out to be the correct length for the amino acid sequence they used. But my Python code returns 1,916 letters for the 1967 sequence. It turns out there was one more error in the dictionary, at position 155 in the 1967 sequence. Both the original and corrected sequences list it as asparagine (Asn), but the dictionary transcribed it as aspartic acid (Asp). Since “aspartyl” is three letters shorter than “asparaginyl,” this makes the word three letters shorter, yielding their sequence exactly (if they had spelled it correctly).
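The letter counts can be cross-checked with a little arithmetic on the prefix lengths. The sketch below is a sanity check under stated assumptions, not a reconstruction of the actual script: it takes the 1,916 figure for the 1967 sequence as given and assumes none of the affected positions is the final residue, so every change involves a prefix form. The variable names are invented for the example.

```python
# Prefix forms of the amino acids involved in the discrepancies.
PREFIX = {
    "Ile": "isoleucyl",
    "Gln": "glutaminyl",
    "Glu": "glutamyl",
    "Asn": "asparaginyl",
    "Asp": "aspartyl",
}

# Length reported for the 1967 sequence (taken as given from the text).
len_1967 = 1916

# Dictionary version: one Asn transcribed as Asp.
len_dictionary = len_1967 + len(PREFIX["Asp"]) - len(PREFIX["Asn"])

# Corrected sequence: one extra Ile, and two Gln replaced by Glu.
# The swapped Asn-Ile pair reorders letters without changing the count.
len_corrected = (
    len_1967
    + len(PREFIX["Ile"])
    + 2 * (len(PREFIX["Glu"]) - len(PREFIX["Gln"]))
)

# Rough check on the transcription tallies, assuming omitted letters
# need adding back and inserted letters need removing.
len_from_transcription = 1895 + 28 - 10

print(len_dictionary)          # 1913
print(len_corrected)           # 1921
print(len_from_transcription)  # 1913
```

On these assumptions the numbers line up: the dictionary’s sequence should spell out to 1,913 letters, and the corrected sequence to 1,921.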
Luckily, having the computer do the heavy lifting removes that ambiguity. The correct spelling of the correct sequence of amino acids in tryptophan synthase A is 1,921 letters, not 1,909. I’ve put a copy of the full text of the word and my Python script on my GitHub page.