Genes are like the cosmos: the more we discover, the more we have to explore

"I don't know how to keep the air in my chest thinking about the scale and size of the unknown"

It’s reassuring to think that scientists go to work in the lab, day after day, week after week, to work on Big, Important Problems: how life works; how each building block is laid and who lays it; how every protein, every gene, every everything is done.

But that continuous labor doesn’t add up to much, in comparison to all the things we don’t know. By some estimates there are 6,000 genes in the human genome alone that have essentially no known function, 20 years after the completion of the Human Genome Project. About 99.9 percent of bacteria on the planet can’t grow in a lab and are totally unknown to us. Each and every one of those bacteria has its own particular genes, proteins, lifestyles, interactions, and behaviors. And while just knowing the sequences of genes helps, by itself it doesn’t begin to tell us what new genes do or what they’re for.

With modern gene sequencing technologies, we fill huge databases with more information than anyone could ever sort. And with them, we can ask questions we weren’t capable of asking before, to the point where "sequence everything" might as well be the modern biologist’s personal motto. How are different types of cancer actually different? High-throughput sequencing can tell you. I saw someone sneeze on the bus; can we sequence what’s in their snot? We sure can. Are there a lot of bacteria living in your sponge? Yeah! Are they dangerous? Probably not.

We sequence everything because there’s always something there. In the same way that NASA can point a telescope at a blank speck in the sky and see a hundred billion stars, gene sequencing can see hidden, tiny worlds everywhere, right here on Earth. The genetic content of someone’s soggy tissue could reveal new organisms living in their nose, the kind of lives they lead, and the strategies they use to survive. But for every new question we ask, we stumble on 10,000 more.

If a cell is a book, its genes and proteins are the words, paragraphs, and sentences inside. As sequencing technology improves, we get better and better at reading the letters, but what the words mean is something else. A gene sequence is just a series of letters. Figuring out what each individual gene does can be a herculean effort encompassing many labs over multiple decades.

A CBP chemist reads a DNA profile

A group of scientists at the Gladstone Institutes in San Francisco, led by Stacia Wyman and Katherine Pollard, put an extremely fine point on this. Earlier this year, they looked at a database of approximately 272,000 families of proteins. They set aside any proteins that the database had any information on: possible functions, resemblances to other proteins, or anything else that would allow at least an educated guess about a protein. Of the families in the database, they found that more than half had not only no known function, but no relevant information known about them at all.

And those are families of proteins, which are groups of related proteins. But each individual protein has its own specific, unique constellation of functions, interactions, regulations, and feedbacks within each species, so even finding things out about whole families doesn’t answer questions about individual family members. Imagine looking at my dad, my uncle, and my grandfather; you would probably guess that I would go prematurely bald. But that information wouldn’t tell you who my friends are, what I do for a living, or where I live.

A tree of a CRISPR-associated protein family, where each branch is one member of the family: an extremely similar version of the same protein, found across a range of different species

Daniel H Haft, Jeremy Selengut, Emmanuel F Mongodin, Karen E Nelson

The research group even made a "most wanted" list of protein families that have unknown functions and occur across multiple species (and so might be of long-term value to the organisms that have them). That list alone is almost 7,000 families.

It can be stupefying to see data like this. I make a living figuring out how proteins work, and I don't know how to keep the air in my chest thinking about the scale and size of the unknown documented here. How do astronomers live like this? And while it's tempting (for me) to make fun of Carl Sagan and other astronomers when they get moony about the vastness and mystery of space, I think it's because I’m a little jealous of the poetry they draw from their science.

I secretly love their sincerity. It seems so easy, like it’s right there for the taking: the profound unknown, large numbers, mind-bending pictures. But it’s right there for me too, in that endless database of secret genes. When an astronomer looks through a telescope and a biologist through a microscope, they see the same thing.