What the Heck is a “Genome”?
Whenever I do one of my “Computational Biology for Computer Scientists” crash courses, we always cover the same crucial (simplified) background information. DNA molecules are stored in your cells, and they have an order to them. This order allows them to form a sequence, and that sequence encodes information that is then used to produce proteins, little biological robots that go about and perform specific actions (kind of like microservices). We can think of a gene as a unit that encodes a single protein in its entirety, and so the genome is a collection of all the DNA in someone, including these genes and parts of the DNA surrounding those genes. The proteins encoded in your genome underpin basically every single process that is currently undergoing in your body, and so understanding the genome would probably help you a lot in understanding why an organism works the way they work. This was actually the impetus of the human genome project, the idea was that if we are able to sequence and read the whole of the human genome, we could unlock secrets untold about how we humans work and how to treat a number of our illnesses.
Well as of the date of writing this, genetic diseases still exist and we have to eradicate diseases, so did the human genome project fail to yield useful information or was the genome just a lot harder to work with than initially anticipated? So if the genome is a collection of DNA sequences that code for proteins, why can’t we just compare and analyze strings of letters? How do we analyze it and why is it so difficult?
Analysis
Isn’t it Just Comparing Strings?
A few of the factors that make genome analysis difficult would be:
- The sheer size of the human genome (a sequence of ~3 Billion)
- The fact that even defining a gene is deceptively difficult .
- If you ask different geneticists, you’ll get different answers about the exact number of genes!
- How do we deal with variants? If we change a single character in a gene, is that a new gene?
- What happens when the sequence around the gene changes the way the gene is expressed? Is that a different gene?
- Your genome itself isn’t the sole thing responsible for the expression of genes, your genome actually interacts with environmental factors.
Just imagine starting off trying to analyze the first 2 sequenced genomes with no prior information. You would first notice that you have two strings of slightly different lengths $\approx 3 \times 10^{9} + \epsilon$, since biological factors can lead to insertions or deletions of parts of the sequence. The first thing you might try to do is align them, since both of your samples are from ostensibly living adult humans, their genomes can’t be that different right? Well, kind of. Humans share $\approx 99.9\%$ of the genome, and so most of the genetic variation in humans comes from that $\approx 0.1\%$ of the differences. But even then, that’s still $\approx 2-5$ million differences or so. So we have two strings of different lengths that are mostly the same, can we just ignore the differences? No actually, it turns out that those few differences are actually vitally important as they are one of the main components that leads to variation amongst humans! So we can ignore everything else and keep only the differences? Well sort of, the differences aren’t the only things that matter in the genome we also care about where the differences are and what the sequences adjacent to the differences are. So for some specific questions, sure we can only look at the differences, but in general we cannot ignore all the rest of the genome.
So what do we do?
Ok so what exactly are the important parts of the genome? Can we focus on those? Well that depends on your definition of important. We used to call parts of the genome that we didn’t know the purpose of “Junk DNA”. We later stopped using this name as we kept finding “Junk DNA” that did indeed have a purpose, albeit sometimes that purpose was very much not obvious. So we can’t ignore anything, we need to work with the whole unwieldy string? Not necessarily, as we keep studying the genome and identifying the functions of different portions of it, we get an understanding of the structure and patterns of the genome. We can get an idea for which genes tend to interact with each, which regions of DNA impact a given gene, and which pieces of DNA are unlikely to have any sort of impact. The more we study the genome, the more structure we find and so the better analysis we can do. This is what we do in genome analysis, we leverage certain structures or properties of the genome we know such as the rate of mutation of certain regions, the patterns of co-occurrence of certain genomic features, and the interactions of the regulatory elements of the genome with respect to each other and our genes of interest.
GWAS
One of the more popular and important methods for genome analysis is the aptly named Genome Wide Association Study (GWAS). This type of study works exactly as you might expect; we find two groups of people, a group of people showing a phenotype(trait) of interest and a control group. Assuming a sufficient population size, we can then compare both groups and see if there is a specific genomic feature that the people in the test group have that is different from the people in the control group. If we can identify some features like that, we can then say that the features we observed are correlated with the phenotype of interest, these features are “associated” with the phenotype of interest. If you’ve done some work in statistics, this may start to sound a little alarming. If we look at all 3-5 million differences for each sample, we might have to repeat our statistical analysis many times which may lead to either false positives or a very stringent test. And what happens if the phenotype I’m looking for has multiple possible causes? If I look at every single difference then, I’m splitting the signal I have amongst different features. In either case, this would also necessitate a large number of samples to be able to say anything with confidence, which might also balloon the cost of the study depending on the cost per sample. We happen to know the following though:
For certain locations of the genome, they are associated with other locations of the genome. Plainly this means that for certain locations of the genome, if I know what is at location $x$, then with high probability I also know what is at location $x+1, x+2, \ldots$.
This property is known as linkage disequilibrium (ld), or the non random association of different locations of the genome. Using this property, if we know that a group of $n$ locations are within LD of each other, then we don’t have to test correlation for each of the $n$, we can group all of them into a single test and if we find a correlation we know that something within that region is correlated. Using this property, instead of testing on the order of thousands of features as opposed to on the order of millions of features. And this is exactly what we do, and this has allowed the GWAS to be one of the staple methods of population genetics. Leveraging the structure of the genome allowed this.
New Methods
If I had to summarize some of the key characteristics of genome analysis, it would be:
- Lots of Data
- Lots of Structure (known or unknown)
- Difficulty in attributing causality
- Useful and pertinent applications
These properties lend themselves extremely well to machine learning approaches, where a large amount of quality data with underlying structure often leads to good results. In recent times, this has led to a significant number of advancents as researchers use machine learning based approaches to tasks in this space such as predicting the effects of changing certain portions of the genome, identifying and tagging certain portions segments of the genome, and even using unsupervised methods to identify patterns as of yet undiscovered. This is one of the areas of research that I am particularly interested in, developing new machine learning based methods for these types of analysis.