Last time, I talked about the protein tryptophan synthase A and how its “full chemical name” may be the longest “English” word ever professionally published in a book. (And I corrected the spelling of it.) But that “word” was just constructed by applying chemical naming rules to the amino acid sequence of the protein. And there are much larger proteins out there, with correspondingly longer amino acid sequences. So, to construct a longer word, you just need to use a bigger protein.
This much is not a surprise. The most extreme claim for the “longest word”—one that has been circulating around the internet for years—is the “full chemical name” of the aptly-named titin, the largest known protein, which is a component of muscle tissues that gives them their passive elasticity. It is normally quoted as having 34,350 amino acids, more than 100 times longer than tryptophan synthase A, and its name is a staggering 189,819 letters. MrBeast even made a video a few years ago where he spent two hours reading the whole thing out loud…although he obviously didn’t have a clue how to pronounce it.
I didn’t pronounce it, but I did write it out, and I even managed to fit it in a single YouTube short.
At 72 letters per line, one line per frame, and 60 frames per second, it comes in at just under a minute…Except you may notice that my version is considerably longer than the one everybody quotes around the internet.
I started this video by downloading the official 34,350 amino acid sequence of titin from the UniProt database and writing a Python script to write the full name—like I did for tryptophan synthase A—just to make sure the letter count was correct. But I didn’t get 189,819 letters. In fact, it wasn’t even close. I got 241,577 letters. This is also not the number in the video, but bear with me.
(And if you think about it, it makes sense. 189,819 letters for 34,350 amino acids is only 5.53 letters per amino acid, which, if you’re familiar with amino acid names, seems improbably short.)
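To make the idea concrete, here is a minimal sketch of how a name-building script along these lines could work. It is not the actual script from this post. The two lookup tables, the function names `chemical_name` and `letter_count`, and the spelling choices (for example "tryptophyl" rather than "tryptophanyl") are assumptions for illustration, so it is not guaranteed to reproduce the exact letter counts quoted here. The convention it follows is that every amino acid except the last takes its "-yl" form, and the last one keeps its full name.

```python
# "-yl" residue names for the 20 standard amino acids, keyed by one-letter code.
# Spellings are an assumption; other conventions (e.g. "tryptophanyl") exist.
RESIDUE_NAMES = {
    "A": "alanyl", "R": "arginyl", "N": "asparaginyl", "D": "aspartyl",
    "C": "cysteinyl", "E": "glutamyl", "Q": "glutaminyl", "G": "glycyl",
    "H": "histidyl", "I": "isoleucyl", "L": "leucyl", "K": "lysyl",
    "M": "methionyl", "F": "phenylalanyl", "P": "prolyl", "S": "seryl",
    "T": "threonyl", "W": "tryptophyl", "Y": "tyrosyl", "V": "valyl",
}

# Full names, used only for the final amino acid in the chain.
FULL_NAMES = {
    "A": "alanine", "R": "arginine", "N": "asparagine", "D": "aspartic acid",
    "C": "cysteine", "E": "glutamic acid", "Q": "glutamine", "G": "glycine",
    "H": "histidine", "I": "isoleucine", "L": "leucine", "K": "lysine",
    "M": "methionine", "F": "phenylalanine", "P": "proline", "S": "serine",
    "T": "threonine", "W": "tryptophan", "Y": "tyrosine", "V": "valine",
}


def chemical_name(sequence):
    """Build the 'full chemical name' from a one-letter amino acid sequence.

    Raises KeyError for any letter outside the 20 standard amino acids.
    """
    sequence = sequence.strip().upper()
    if not sequence:
        raise ValueError("sequence must contain at least one amino acid")
    *body, last = sequence
    return "".join(RESIDUE_NAMES[aa] for aa in body) + FULL_NAMES[last]


def letter_count(name):
    """Count letters only, ignoring the space in names like 'aspartic acid'."""
    return sum(ch.isalpha() for ch in name)


name = chemical_name("MTTQ")
print(name)                # methionylthreonylthreonylglutamine
print(letter_count(name))  # 34
```

In this table the shortest residue names (lysyl, seryl, valyl) are 5 letters and the longest (phenylalanyl) is 12, so an average of 5.53 letters would require a protein made almost entirely of the three shortest names.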
What happened? Well, proteins can actually have multiple different amino acid sequences, called “isoforms.” This can happen if the gene segments that encode them get spliced together in different combinations, and different isoforms may be more useful in different circumstances. Isoform 1 of titin is the “canonical” 34,350 amino acid sequence. But the 189,819-letter name everyone circulates, for some inexplicable reason, is Isoform 3 (also known as the “small cardiac N2-B” form), which only has 26,926 amino acids (7.05 letters per amino acid). Isoform 3 is actually the shortest version of titin out there, except for Isoform 6, which is a much smaller fragment.
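The letters-per-amino-acid check is easy to reproduce. The short sketch below uses only the figures quoted so far; the dictionary labels and the helper name `letters_per_residue` are made up for illustration.

```python
def letters_per_residue(letters, residues):
    """Average name length contributed by each amino acid."""
    return letters / residues


# (letter count, amino acid count) pairs taken from the text above.
claims = {
    "quoted name vs. Isoform 1": (189_819, 34_350),
    "script output for Isoform 1": (241_577, 34_350),
    "quoted name vs. Isoform 3": (189_819, 26_926),
}

for label, (letters, residues) in claims.items():
    print(f"{label}: {letters_per_residue(letters, residues):.2f}")
# quoted name vs. Isoform 1: 5.53
# script output for Isoform 1: 7.03
# quoted name vs. Isoform 3: 7.05
```

The two ratios near 7 agree with each other, while 5.53 is the odd one out, which is what you would expect if the widely quoted name was built from the shorter isoform.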
If we want the actual largest known protein, we need to find the largest isoform of titin. That happens to be Isoform 12, with 35,991 amino acids. When I ran it through my Python script, it returned a word length of 252,176 letters, which is the word I wrote in the above video clip.
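As a quick sanity check on the "just under a minute" figure, the running time follows from the layout described earlier: 72 letters per line, one line per frame, 60 frames per second. The function name `short_duration` is hypothetical, and the sketch assumes the final partial line still takes up a full frame.

```python
import math


def short_duration(total_letters, letters_per_line=72, fps=60):
    """Return (frames, seconds) for one line of text per video frame."""
    frames = math.ceil(total_letters / letters_per_line)  # last line may be partial
    return frames, frames / fps


frames, seconds = short_duration(252_176)
print(frames, f"{seconds:.1f}")  # 3503 58.4
```

By the same arithmetic, the commonly quoted 189,819-letter version would need 2,637 frames, or about 44 seconds.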
I’m not the first person to figure this out. A commenter named Stephen Thomas put the pieces together in 2016. But to my knowledge, I’m the first to write out the full corrected version. I’ve posted the full text of this word along with the amino acid sequence and my Python script on my GitHub page. Now, all we need to do is get MrBeast to read this version…preferably learning a bit of organic chemistry first.
But as I said in the video, this isn’t actually the longest possible word under chemical naming rules (you know, if chemists actually used crazy names like this). Just like you can string together names of amino acids in proteins, you can do the same with DNA sequences. But that’s a project that’s a little bit bigger than a short.