ECOM 6349 Lecture 5: Genome-on-Diet & Sparsified Genomics, Prof. Mohammed Alser

http://youtube.com/watch?v=D47E6g0BCGQ

An example of a read mapper. The goal of read mapping is to locate subsequences of the reference genome that are similar to the read sequence while allowing at most E edits, where E is the edit distance threshold. Tolerating a number of differences is essential for correctly finding the possible locations of each read, because reads differ from the reference due to sequencing errors and genetic variations. Read mapping includes four computational steps: indexing, seeding, pre-alignment filtering, and sequence alignment.

First, a read mapper builds a large index database from subsequences (called seeds) extracted from the reference genome, to enable quick and efficient querying of the reference genome. Second, the mapper uses the prepared index database to determine one or more regions of the reference genome that are likely to be similar to each read sequence, by matching subsequences extracted from each read against the subsequences stored in the index database. Third, the read mapper uses filtering heuristics to quickly examine the similarity between each read sequence and each potential matching segment of the reference genome identified during seeding. As only a few short subsequences are matched between each read sequence and each reference genome segment, there can still be a large number of differences between the two sequences. Hence, filtering heuristics aim to eliminate most of the dissimilar sequence pairs while performing minimal computation. Fourth, the mapper performs sequence alignment to check whether the remaining sequence pairs that pass the filter are actually similar. Due to potential differences, the similarity between a read and a reference sequence segment must be identified using an approximate string matching (ASM) algorithm.
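The indexing and seeding steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular read mapper: the function names `build_index` and `candidate_locations`, the toy reference string, and the seed length `k=4` are all assumptions made for this example, and real mappers use far more compact index structures and longer seeds.

```python
from collections import defaultdict


def build_index(reference, k):
    """Indexing: map every k-mer (seed) of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_locations(read, index, k):
    """Seeding: look up each k-mer of the read in the index.

    Every hit votes for the reference position at which the read would
    start if that seed were placed correctly (hit position minus the
    seed's offset inside the read).
    """
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            votes[pos - offset] += 1
    return dict(votes)


# Toy data (hypothetical): the read is reference[8:19] with one substitution.
reference = "ACGTACGTTAGCCGATAGGCTA"
read = "TAGCCGTTAGG"

index = build_index(reference, k=4)
hits = candidate_locations(read, index, k=4)
best = max(hits, key=hits.get)
print(hits, best)
```

In this toy case the true location (position 8) collects the most seed votes, but the single substitution also produces a second candidate supported by a few seeds. Candidates like that are what pre-alignment filtering and sequence alignment are meant to reject.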
ASM typically uses a computationally expensive dynamic programming (DP) algorithm to (1) examine all possible prefixes of the two sequences and track the prefixes that provide the highest possible alignment score (known as the optimal alignment), (2) identify the type of each difference (i.e., insertion, deletion, or substitution), and (3) locate each difference in one of the two given sequences. Such alignment information is typically output by the read mapper into a sequence alignment/map (SAM) file, or its compressed representation, BAM. The alignment score is a quantitative representation of the quality of aligning each base of one sequence to a base from the other sequence. It is calculated as the sum of the scores of all differences and matches along the alignment, as defined by a user-defined scoring function. DP-based approaches usually have quadratic time and space complexity (i.e., O(m^2) for a sequence length of m), but they avoid re-examining the same prefixes many times by storing the examination results in a DP table. The use of DP-based approaches is unavoidable when optimality of the alignment results is desired.

For more information, check the following papers:
• Alser, M., Lindegger, J., Firtina, C., Almadhoun, N., Mao, H., Singh, G., Gomez-Luna, J. and Mutlu, O., From molecules to genomic variations: Accelerating genome analysis via intelligent algorithms and architectures, Computational and Structural Biotechnology Journal, 2022
  https://www.sciencedirect.com/science...
• Alser, M., Rotman, J., Deshpande, D., Taraszka, K., Shi, H., Baykal, P.I., Yang, H.T., Xue, V., Knyazev, S., Singer, B.D. and Balliu, B., Technology dictates algorithms: recent developments in read alignment, Genome Biology, 2021
  https://genomebiology.biomedcentral.c...
• Alser, M., Eudine, J. and Mutlu, O., Genome-on-diet: taming large-scale genomic analyses via sparsified genomics, Accepted in Nature Communications, 2024
  https://arxiv.org/pdf/2211.08157
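The DP-based alignment step described before the reading list can be illustrated with a small Python sketch. It assumes the simplest possible scoring function (unit cost for every substitution, insertion, and deletion, i.e. plain edit distance) and aligns the two sequences end to end; the function name `edit_distance_alignment` and the one-letter operation codes are choices made for this example, not the output format of any specific tool.

```python
def edit_distance_alignment(read, ref):
    """Fill an (m+1) x (n+1) DP table, then trace back one optimal alignment.

    Returns (distance, ops), where ops has one letter per alignment column:
    M = match, X = substitution, I = base present only in the read,
    D = base present only in the reference.
    """
    m, n = len(read), len(ref)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i  # read prefix against empty reference
    for j in range(1, n + 1):
        dp[0][j] = j  # empty read against reference prefix

    # Each cell reuses the stored results for three shorter prefix pairs.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if read[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match / substitution
                           dp[i - 1][j] + 1,         # insertion in read
                           dp[i][j - 1] + 1)         # deletion from read

    # Traceback: recover the type and location of every difference.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (read[i - 1] != ref[j - 1]):
            ops.append("M" if read[i - 1] == ref[j - 1] else "X")
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append("I")
            i -= 1
        else:
            ops.append("D")
            j -= 1
    return dp[m][n], "".join(reversed(ops))


distance, ops = edit_distance_alignment("TAGCCGTTAGG", "TAGCCGATAGG")
print(distance, ops)
```

The nested loops make the quadratic time and space cost visible: every pair of prefixes gets exactly one table cell. A read mapper would accept the pair only if the resulting distance is at most the threshold E.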
