Why does the SARS-Cov2 genome end in aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa? (2020) (bioinformatics.stackexchange.com)
275 points by jdwg on Feb 11, 2023 | 92 comments


The top answer at the link explains it best:

Good observation! The 3' poly(A) tail is actually a very common feature of positive-strand RNA viruses, including coronaviruses and picornaviruses.

For coronaviruses in particular, we know that the poly(A) tail is required for replication, functioning in conjunction with the 3' untranslated region (UTR) as a cis-acting signal for negative strand synthesis and attachment to the ribosome during translation. Mutants lacking the poly(A) tail are severely compromised in replication.


Adding a poly(A) tail when engineering RNA plasmids is so common that it's part of the standard feature library of most plasmid editors.

Here I have a screenshot of the ApE editor displaying one plasmid I made for a neurobio experiment involving the overexpression of two chimeric proteins (actin and profilin, respectively linked to green and red fluorophores; note the editor has autotagged the polyA tail feature):

https://ibb.co/XSCSKC9


Looks almost like a hex editor.


It kind of is. We are not quite advanced enough to have the equivalent of true programming languages for genetics.


But we will. GitHub for genetics would be awesome


It reminds me of Wireshark


ApE is one of those tools that is so old-school but just works. I used to hate using it but it's really grown on me - SnapGene is just too darn expensive and Benchling isn't very power-user friendly


Can you add any number of A's, or just 1, or shorten it by 1, and it still is functional?


The number of adenine repeats that confer functional properties is quite variable but it definitely needs to be more than "just 1". I've seen anywhere from 25-250 used in designed plasmids. The exact number people use in their engineered sequence is based on a number of factors, not all of them scientific in nature (e.g. companies charge per base pair to synthesize a bespoke polynucleotide; e.g. you copied the sequence from a previous clone into ApE and that sequence used 30 repeats and worked fine).


IIRC there's a protein that adds 12-15 slowly, then another protein comes in and adds another 200+ when it detects the start of the tail. Not sure about the details of how or why it stops, though. At least that's the case for mRNA getting prepped for nuclear export


You are probably right, I'm not an expert on this process. But there's likely a difference between the innate eukaryotic cell polyadenylation process vs. how coronavirus accomplishes polyadenylation because coronavirus RNA never enters the nucleus.


In this case it probably works most reliably with around this number of As, and fewer/more would decrease reliability, but it would still be functional.

Most of genetics is like that.


With more As you risk running into the upper limit on sequence length for the virus shell and with less you run the risk of quicker degradation and not enough expression


The first point being only a concern if you are a virus. If instead you are engineering, say, an RNA vaccine, you could package your transcript into Lipofectamine instead of a coronavirus envelope ;) but point taken, there's always an upper limit.


Also a concern when engineering viruses for research or other treatments such as the AAVs


True. I should have said "if you are using a virus"


> attachment to the ribosome during translation

There's been a lot of analogies with NOP slides in the comments here and there, but if you look at how the process of reading the genome works, this section is more like the leader/trailer on a tape:

https://en.wiktionary.org/wiki/leader#Noun "A piece of material at the beginning or end of a reel or roll to allow the material to be threaded or fed onto something, as a reel of film onto a projector or a roll of paper onto a rotary printing press."

https://en.wiktionary.org/wiki/trailer#Noun "A short blank segment of film at the end of a reel, for convenient insertion of the film in a projector."


And this fits nicely because the genome replication really looks like a roll of film being played


Only if the film gets bent and twisted around itself, binding on itself to activate or deactivate sequences and with the read film actively getting spliced to create slightly different versions of the scenes

Honestly it's fuckin wild, there's a lot going on rather than just linear read->express


As a coder, I find it a bit frustrating how often explanations in biology contain fuzzy phrases like "plays a role in" or "is required for" or "works together with" or "modulates the activity of" - without ever explaining what role it plays, why it is required or how it works together.

So props to them for going a bit more into detail here - and also highlighting that the reason for the fuzzy phrases can often be that we literally don't know the details: The empirical basis may be "if this thing is removed then this other thing won't work", without us necessarily knowing why this is the case.


The other side of this is: what constitutes an explanation? What formal structure that you express observations in seems like knowing what's going on to you? Formal structures in fields tend to be matched to what the experimental abilities of the field are.

In programming, we designed our systems to give us what seem like hard bottoms in our formal models. Most programmers don't reason below the level of their structured programming language. Of the ones that do, most treat the processor instructions as a hard bottom. There are layers down and down until you have physicists working on semiconductor properties, but we have intentionally designed the layers so that you can comfortably rest on them.

In biology any formal structure you think in is logically poised over the abyss. What pins it in place is not that it is on philosophical bedrock, but the observations and experiments that the formal structure summarizes.


It isn't unknown, it's just complicated. There are numerous proteins involved, such as a protein that detects the poly(A) tail and, if it is not present, degrades the RNA by cleavage.


I think you'll find similar bullshittery in the details of every topic that isn't math. Think of it like finding "TODO" comments in old code.


I mean, historically mathematics has had areas full of "HACK:" comments, too--ones retroactively applied to the whole of Newtonian physics, even, but it's still useful enough that we keep it around!


Newtonian physics is not "a hack". It yields very precise results for simple gravitational interactions, at least up to the point where relativistic effects begin to dominate. Even Einstein's equations are not 100% "perfect". All mathematical models have their limitations. (Although some do produce better approximations than others.)


A hack is in the eye of the beholder, and many hacks are load-bearing and totally fine for the entire timespan for which the thing they're stashed within is expected to be useful. I agree with you that all mathematical models have their limitations--but the choice to use a good-enough one is, over sufficient time and distance, a tradeoff that can be described as such.


All I meant is that the approximations in math are in the assumptions, which are far easier to deal with in the grander scheme of things.


Assumptions? Don't worry; we've got those too! If I had a nickel for all the times I've seen somebody pick 1 for "0, 1, is N" input cases? ;)


Interestingly, vaccines also have a long repeating sequence at the end. It provides molecular stability.


Found a wikipedia article describing it:

https://en.wikipedia.org/wiki/Polyadenylation


The biological equivalent of an endline


The stop codon is more like the '\n' or '\0' in computing. Poly(A) tails protect against degradation and are used for nuclear export of RNA.


This is true, but perhaps worth noting that the nuclear transport function of polyA tails doesn't come into play for coronavirus. The payload of coronavirus is a positive-sense single-stranded RNA, which means it does not need to enter the nucleus for preprocessing and can basically just start replicating shortly after entering a cell. See diagram...

https://upload.wikimedia.org/wikipedia/commons/f/f4/Coronavi...

There might be a software analog to another polyA tail feature: the provision of a 'shelf-life'. Each replication cycle removes a few adenosines, and at a certain point the tail sequence is too short to recruit protection and the RNA is ushered into the degradation pathway.
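For the software-minded, the 'shelf-life' idea reads a bit like a TTL counter. Here's a toy sketch of that analogy only: the function name and every number in it (starting tail of 33, 3 adenosines lost per cycle, a protection threshold of 12) are placeholders I made up for illustration, not measured values.

```python
def cycles_until_degraded(tail_len: int, loss_per_cycle: int, min_protected_len: int) -> int:
    """Count cycles until the tail drops below the (hypothetical) protection threshold."""
    if loss_per_cycle <= 0:
        raise ValueError("loss_per_cycle must be positive")
    cycles = 0
    # While the tail is still long enough to recruit protection, it survives
    # another cycle and loses a few adenosines in the process.
    while tail_len >= min_protected_len:
        tail_len -= loss_per_cycle
        cycles += 1
    return cycles


# All three values are illustrative assumptions, not biology.
lifetime = cycles_until_degraded(tail_len=33, loss_per_cycle=3, min_protected_len=12)
print(lifetime)  # 8
```

The real process is stochastic and the tail can also be lengthened again, so treat this as a cartoon of the countdown, nothing more.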


I don't think the genome has a specific length of the tail (33 is a consensus length)

During genomic assays, the poly(A) tail will not be a specific length, but a single consensus sequence is still provided.

This was also posted in the first comment:

> Similar to eukaryotic mRNA, the positive-strand coronavirus genome of ~30 kilobases is 5’-capped and 3’-polyadenylated. It has been demonstrated that the length of the coronaviral poly(A) tail is not static but regulated during infection; however, little is known regarding the factors involved in coronaviral polyadenylation and its regulation. Here, we show that during infection, the level of coronavirus poly(A) tail lengthening depends on the initial length upon infection and that the minimum length to initiate lengthening may lie between 5 and 9 nucleotides. By mutagenesis analysis, it was found that (i) the hexamer AGUAAA and poly(A) tail are two important elements responsible for synthesis of the coronavirus poly(A) tail and may function in concert to accomplish polyadenylation and (ii) the function of the hexamer AGUAAA in coronaviral polyadenylation is position dependent. Based on these findings, we propose a process for how the coronaviral poly(A) tail is synthesized and undergoes variation. Our results provide the first genetic evidence to gain insight into coronaviral polyadenylation.

Peng Y-H, Lin C-H, Lin C-N, Lo C-Y, Tsai T-L, Wu H-Y (2016) Characterization of the Role of Hexamer AGUAAA and Poly(A) Tail in Coronavirus Polyadenylation. PLoS ONE 11(10): e0165077
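To make the "consensus length" point concrete, here is a rough sketch of how one might measure the tail on individual reads and then report a single number. The reads are made up (a dummy 5-base prefix plus tails of varying length), the helper name is mine, and taking the median is just one plausible way to summarize; I'm not claiming this is how the reference sequence was actually called.

```python
from statistics import median


def poly_a_tail_length(seq: str) -> int:
    """Length of the run of A's at the 3' end of a sequence (case-insensitive)."""
    s = seq.strip().upper()
    return len(s) - len(s.rstrip("A"))


# Hypothetical reads: same dummy 3' end, different tail lengths.
reads = ["ATGAC" + "A" * n for n in (28, 31, 33, 33, 36)]
lengths = [poly_a_tail_length(r) for r in reads]

# One way to collapse variable tails into a single reported length.
consensus_tail = int(median(lengths))
print(lengths, consensus_tail)
```

Each read has a different tail, yet the summary still comes out as one number, which is the situation described above.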


The polyA tail isn't even coded by the genome, it's added after transcription by a processive polyadenylation multiprotein complex with the final tally being the result of partially stochastic processes. So yeah, agreed, the number of adenines is variable.


So it is more like a protective casing? Or like a car bumper, to protect what's inside?


Can a drug target that sequence specifically?


That sequence is present everywhere in your body and living beings, so no.

https://en.wikipedia.org/wiki/Polyadenylation


Not if it's common to both viruses and human cells you can't.


IIRC adenine binds to thymine, and there are some viruses and bacteria that have alternative bases, and scientists have discovered 82 other possible ones.

If the virus could be bound with an artificial RNA strand that had a stronger bond than natural RNA, it could be denatured, and pooped out.

https://devries.chem.ucsb.edu/research/past/base-pairing


Poly-A binding proteins naturally exist. They are used to regulate translation and to sequester mRNA during heat shock stress, IIRC. This prevents mistranslation, again IIRC.

https://faseb.onlinelibrary.wiley.com/doi/10.1096/fasebj.31....


No


Is regex a drug feature we are building towards?


You can't parse DNA with regex.


Not with that attitude.


If Stack Overflow taught me anything, you will only summon Zalgo by trying.


They used to say that about email addresses. But hold my beer and I'll be back in like a month with some buggy half-assed crap that kinda does the job and only occasionally crashes the system!


In my experience genetics is a bit more complicated than an email address

Yes, that is an understatement


Is it that simple? From a lay perspective (and granted I’m a little foggy because I actually have covid right now), I’d expect the answer is yes but with huge unintended consequences.


Ok yes, theoretically you can make something targeting the polyA tail. But everything else your body makes will also get targeted because this is basically a marker of all RNA for translation.

Now making a drug that targets only viruses and not your body's RNA? Possible but it is so hard not much progress has been made.


If this sort of question fascinates you, you might like "Reverse Engineering the source code of the BioNTech/Pfizer SARS-CoV-2 Vaccine"[0], an article written with a tone that I've found to resonate with engineers and like-minded folk.

[0] https://berthub.eu/articles/posts/reverse-engineering-source...



What a fascinating article! Thanks for sharing.

I didn't realise there was so much crossover between embedded design and biology!


That was exceptionally informative and accessible. Thank you


Fantastic article!


It's like a NOP slide for viruses: https://en.wikipedia.org/wiki/NOP_slide

Just kidding...sort of!


User Zoe Sparks on that page covers why they don’t really feel it’s like a nop slide. I think that answer is a good supplement to the accepted one.


I think they miss the point entirely. The environment in the cell is mostly mechanical, but it's also dominated by random forces. If you are unable to guarantee where you are going to "enter the sled" or when the "tail hits the ribozyme" then the nop sled seems to be an equivalent feature.

So.. to me, it's odd they invoke "legitimate code." The comparison I'd consider would be "combative code." For example, the old game "core wars." Thinking in that mindset, I can see several uses for a "nop sled" in "legitimate code."


I don't think this is nearly as true for viral genomes, but larger species have lots of protective sections of DNA to protect from mutations. If you lose a non-protein-coding section of DNA to mutation, no harm to the species occurs. In humans, only about 1.5% of our DNA codes for protein that is actually generated. Viruses are physically extremely tiny in terms of cell size and must be very efficient in terms of storing the DNA within them so way more actually codes, but no doubt there are similar factors at play.


Viral genomes are very compressed. In fact, viruses usually have overlapping genes, where one genomic region can codify for more than one product. See the following review for more information: https://www.nature.com/articles/s41576-021-00417-w



It amazes me that the genome is only 29k long. If you were to write a computer virus now, it probably wouldn't be that short, let alone something that can infect and kill millions of people.


A computer virus that just replicates and causes damage can be a lot smaller than that.
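To put the 29k figure in bytes, here's a back-of-the-envelope calculation. It assumes a four-letter alphabet packed at 2 bits per base and uses the rounded 29,000 from the comment above rather than an exact genome length; the function name is just for illustration.

```python
def packed_size_bytes(n_bases: int, bits_per_base: int = 2) -> int:
    """Bytes needed to store n_bases at bits_per_base each, rounded up."""
    total_bits = n_bases * bits_per_base
    return (total_bits + 7) // 8  # ceiling division to whole bytes


size = packed_size_bytes(29_000)
print(size)  # 7250
```

So on that assumption the whole thing fits in a bit over 7 KB, which is small but not unheard of for self-replicating code.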


Very interesting. I first wondered how nature can randomly generate such a pattern, and then realized we are just falling for our "built in" pattern recognition: it would feel much more "natural" for the stop sequence to be the encoding of some specific protein without any clearly recognizable pattern... But it would actually be more unlikely to appear/survive mutation than "any long-enough sequence of A".

I also like how it is established that this has an effect on replication, but that as far as I understand we do not understand the underlying process. Humbling.


File formats are really easy to figure out and are a big advantage for moving data around. Even without an academic theory, pretty much everyone in software starts to figure out the same tricks as soon as reliable transmission becomes a goal. I assume that at least one reason for this is that genomes are data, data likes to live in structured formats, and file terminators are more reliable for biology to process than encoding the length of the genome (although, biology being messy, I wouldn't be shocked if both were done). Evolution has a good grasp of engineering principles.

Are there probably desirable chemical properties? Yes. Is nature overloading each part of a genome with uses? More than likely. Has it figured out how to terminate a sequence? Obviously.


So something to keep in mind when looking at biological systems, and especially when looking at genomics - evolution doesn't actually have a master plan or agency, it's just drift and reproduction. There's a lot of reuse and a lot of parsimony that can look elegant, but there's no 'design' process of evolution - it's a pile of things that looked enough like other stuff that over time with some tweaks they could take on dual roles. There's also no separation of concerns - DNA is a molecule, and it's acted upon by other molecules following the same rules of chemical and quantum interactions that affect everything else. Certain RNA sequences aren't transcoded but rather fold into functional molecules and enzymes, and protein folding and subsequent structure and function is affected by how fast the RNA is transcribed, which is affected by the population of available tRNA molecules. Genomics only looks like information - it's still chemistry.


I am entirely unqualified to answer, but I choose to believe it’s the equivalent of scratch (reserved stack) memory in an executable image. If I’m wrong, well at least I’m enjoying it.


I can't see that and not think: base64 null byte padding.


Yeah, I’d be worried if it was AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA===. ;)
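For anyone who wants to check the joke: null bytes really do base64-encode to runs of 'A'. A quick sketch using Python's standard `base64` module (the byte counts are arbitrary choices of mine):

```python
import base64

# 3 bytes map to 4 base64 characters, so 24 zero bytes give 32 'A's and no padding.
encoded = base64.b64encode(bytes(24)).decode("ascii")
print(encoded)

# One extra zero byte leaves a 1-byte remainder, which encodes as "AA==".
odd = base64.b64encode(bytes(25)).decode("ascii")
print(odd)

# Round trip: 32 'A's decode back to 24 null bytes.
decoded = base64.b64decode("A" * 32)
```

Standard base64 pads with at most two '=' characters and its output length is always a multiple of 4, so a run of 33 A's followed by '===' would not be valid output anyway.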


The virus printer adds padding to make the genome a multiple of 64


So upon multiple sequential infections, the poly(A) tail will shorten?



Trying to buffer overflow the cell.


Nominative determinism


Maybe the scientist who made it died before he finished? ATTACGAAAAAAAAAAAAAAAAAAAAAAAA……


Look, if he died while engineering a virus, he wouldn't bother to code "AAAAAAA" he'd just say it!


Clearly he died at his computer and his face landed on the keyboard.


Well that's what it says.



Dictated, not transcribed.


Dictated, but not read


Why the downvotes? Video is hilariously relevant.


Perhaps they were dictating?


Proof that god is dead?


[flagged]


[flagged]


Movie should have been named CATTAGA


This ending is the result of the Lisp closing parenthesis being important enough to be directly mapped to a dedicated nucleotide base.


[flagged]


"Maybe" isn't a good replacement for "I don't know".

Other people do know and you can find this information out.

The body uses it in RNA for denoting the life cycle of reuse to avoid degradation.

The linked article tells you


Of all the possible signatures of SARS-CoV-2 being lab-made, the polyA isn't one of them.

https://en.m.wikipedia.org/wiki/Polyadenylation


[flagged]


Had it started with aaaaaaaaaaaaa, I would've put my money on someone optimizing for the virus yellow pages.


Isn't the start and end of a genome rather arbitrary?


Doesn't the replication have a direction though? I.e. start and end.


It’s a good thing people aren’t more intelligent, otherwise they would be able to construct and propagate deeper and more advanced conspiracies like this. Most people thankfully don’t even know what a genome is.



