Promotor: Dr. Solon Pissis
Co-Promotor: Prof. Dr. L. Stougie
This thesis focuses on theoretical aspects of sequence analysis for pangenomes. We therefore start with a brief motivation covering the importance of strings in bioinformatics and the shift towards pangenomes. A possible starting point is 1871, when Friedrich Miescher first discovered the DNA molecule [162, 111]. However, the first accurate description of the DNA structure was published in 1953 by James Watson and Francis Crick [208], building on the earlier work of Rosalind Franklin. Present in the cells of every living organism, DNA is a molecule composed of two complementary strands of successive nucleotides. These nucleotides belong to one of four types: cytosine (C), guanine (G), adenine (A), or thymine (T). Hence, disregarding its molecular complexity, each DNA molecule can be abstracted to the sequence of its nucleotides, that is, to a simple string over a four-letter alphabet. The central importance of DNA in biology stems from its role as the template (via the genetic code) for the synthesis of proteins (through various complex molecular interactions), which in turn determine most of an organism’s characteristics. Similarly, other molecules such as RNA, or proteins themselves, can be described as sequences of letters over a finite alphabet.
Genomic data analysis has been facing important challenges that include analyzing an ever-increasing number of genome sequences and choosing which genome should be used as a reference. In recent years, these two challenges were merged into the powerful opportunity of using a pangenome, rather than a single genome, as a reference. According to [68], a pangenome is “any collection of genomic sequences to be analyzed jointly or to be used as a reference”.
As a consequence, the new -omics science of pangenomics imposed a paradigm shift: in several analysis tasks, and in particular for species like humans that enjoy a widespread availability of sequencing data as well as a growing awareness of genomic variants, the simple linear genomic sequence is being replaced by more complex graph-like structures [168, 87]. As opposed to a linear reference, a pangenome reference allows a simultaneous and compact representation of the variations and commonalities among the underlying sequences.
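To make the idea of a compact joint representation concrete, the following is a minimal Python sketch and not a construction taken from this thesis. It assumes a toy setting in which all sequences have the same length and differ only by single-letter substitutions. The function names `compact_pangenome` and `spelled_strings` and the example sequences are hypothetical, chosen for illustration only. Positions on which all sequences agree are merged into shared segments, while positions on which they differ are stored as sets of alternative letters.

```python
from itertools import groupby, product


def compact_pangenome(sequences):
    """Collapse equal-length sequences into shared and variant segments.

    Assumption of this sketch: the sequences are already aligned, have the
    same length, and differ only by substitutions (no insertions/deletions).
    """
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("this sketch assumes sequences of equal length")
    segments = []
    # Each column holds the letters found at one position across all sequences.
    columns = zip(*sequences)
    for is_shared, run in groupby(columns, key=lambda col: len(set(col)) == 1):
        if is_shared:
            # A run of positions where all sequences agree: one shared string.
            segments.append(["".join(col[0] for col in run)])
        else:
            # Each differing position lists its distinct letters, sorted.
            segments.extend(sorted(set(col)) for col in run)
    return segments


def spelled_strings(segments):
    """Return every string obtained by picking one option per segment."""
    return {"".join(choice) for choice in product(*segments)}


# Hypothetical toy collection of three aligned sequences.
genomes = ["ACGTAC", "ACCTAC", "ACGTAT"]
segments = compact_pangenome(genomes)
print(segments)
```

In this toy setting, every input sequence can be spelled by choosing one option per segment. The representation may also spell additional strings that combine variants from different inputs, which illustrates the trade-off between compactness and exactness that richer pangenome models have to manage.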