RU2806429C1

RU2806429C1 - Whole genome sequencing data processing system

Info

Publication number: RU2806429C1
Application number: RU2023123022A
Authority: RU
Inventors: Евгений Александрович Альберт; Валерий Александрович Павлов; Мария Алексеевна Сайганова; Геннадий Геннадьевич Федонин; Евгений Андреевич Карпулевич; Максим Сергеевич Беленикин; Екатерина Валерьевна Косова; Гаухар Юрьевна Зобкова
Original assignee: Общество С Ограниченной Ответственностью "Эвоген"
Filing date: 2023-09-05
Publication date: 2023-10-31

Abstract

FIELD: biotechnology.

SUBSTANCE: a method (pipeline) for processing whole-genome sequencing data is described. Raw reads are aligned to the reference genome: the reads are trimmed with the fastp tool and streamed into the minimap2 mapper, from which the data are streamed into the samtools tool, which marks duplicates and indexes the resulting bam file. The resulting bam file is recalibrated according to the quality metrics provided by the sequencer. Quality control of the initial data and of pipeline execution is assessed based on the results of trimming, read mapping and SNP/INDELS calling.

EFFECT: the invention extends the functionality available for analyzing whole human genome data while reducing the time needed to obtain genetic analysis results ready for interpretation, and allows germline samples with an average coverage of 30x to be analyzed from a set of raw reads to results ready for review by a clinical interpreter.

1 cl, 7 dwg

Description

The invention relates to biotechnology, namely to a method (pipeline) for processing whole-genome sequencing data, and is intended for automatic analysis of germline samples with an average coverage of 30x, from a set of raw reads to results ready for review by a clinical interpreter.

Sequencing is a method of determining the nucleotide sequence of DNA and RNA. It is used to identify genetic damage (mutations) in DNA that causes hereditary diseases, hereditary predispositions or individual traits. Over the past 10 years it has become a common procedure in molecular biology for studying the genome (the complete DNA sequence of all chromosomes of a cell) and the transcriptome (the totality of active RNAs synthesized by a cell).

Various sequencing technologies and data processing pipelines are known in the prior art. The former describe the sequencing methods themselves, for example sequencing by synthesis with reversible termination (Illumina), pyrosequencing (Roche), sequencing by ligation (SOLiD) and semiconductor sequencing (Ion Torrent); the latter describe pipelines designed for automatic analysis of germline samples. These tools can be said to describe approximately the same process from different sides. The present application relates to pipelines for processing whole-genome sequencing data.

The first attempts to read DNA sequences were made using Sanger sequencing. The disadvantages of this known method are low throughput and high cost when a large volume of data is studied. These disadvantages motivated the development and adoption of next-generation sequencing (NGS) technology.

The need to develop NGS was driven by the desire to automate analysis, increase the volume of information obtained and reduce the cost of a study. NGS technology is based on massively parallel sequencing of thousands of DNA fragments from prepared single-stranded libraries. The procedure includes three stages: library preparation, sequencing, and analysis of the data obtained.

The advantages of NGS are the lower cost of a study, automation of analysis and the large volume of information obtained. NGS methods have high throughput and allow billions of short nucleic acid fragments to be read simultaneously. In addition, NGS makes it possible to sequence several dozen genomes in a single run of the instrument.

Since the human genome consists of approximately 3.1 billion base pairs, and each sequence fragment, or read, is typically 100 to 500 or 1000 nucleotides long, the time and overhead required to construct such full-length genomic sequences and determine their variants are considerable, and often require several different computing resources running several different algorithms over long periods of time.
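To give a sense of the data volume implied here, the following minimal sketch estimates how many reads are needed for a given average coverage. The genome size (3.1 billion base pairs) and the 30x coverage come from the text; the 150-nucleotide read length is an assumption chosen from within the stated range, and the function name is hypothetical.

```python
def reads_needed(genome_bp: int, coverage: int, read_len_bp: int) -> int:
    """Number of reads whose combined length covers the genome `coverage` times."""
    total_bases = genome_bp * coverage
    # Ceiling division: a partial read still has to be sequenced as a whole read.
    return -(-total_bases // read_len_bp)


if __name__ == "__main__":
    # 3.1 Gbp genome, 30x coverage, 150 nt reads (read length is an assumption).
    print(reads_needed(3_100_000_000, 30, 150))  # prints 620000000
```

Under these assumptions a single 30x sample corresponds to roughly 620 million reads, which is why the mapping and variant calling steps dominate the run time.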

The prior art includes source AU2016226288 (G06F19/00; G06N3/00; G16B30/10; G16B50/30; published 14.09.2017), which discloses a system, method and device for executing a sequence analysis pipeline on genetic sequence data.

According to that invention, the system includes an integrated circuit formed by a set of hardwired digital logic circuits interconnected by physical electrical interconnects. The hardwired digital logic circuits are organized as a set of processing engines, each formed from a subset of the hardwired digital logic circuits to perform one or more steps of the sequence analysis pipeline on genomic read data. In various cases, each subset of the hardwired digital logic circuits may be arranged in a wired configuration to perform one or more steps of a variant calling operation.

AU2016226288 describes a bioinformatics pipeline capable of processing human sequencing data from raw reads to vcf files containing genome variations relative to a reference. The pipeline in AU2016226288 is implemented on specialized FPGA hardware, which makes the approach less adaptable to the development of, and changes in, the analysis than modular pipelines and their particular implementation described in the present application. By using a modular structure and general-purpose hardware, the pipeline described in the present application offers extended functionality compared to AU2016226288: it also detects genomic variations, haplogroups and HLA types, and performs other genomic analyses.

The MegaBOLT bioinformatics analysis accelerator is known. It accelerates algorithms such as SOAPnuke, Minimap2, BWA and GATK HaplotypeCaller + MuTect2 through a multi-threaded and highly parallelized computing architecture, which significantly speeds up the analysis of massive data.

MegaBOLT supports the analysis of whole-genome sequencing (WGS), whole-exome sequencing (WES) and panel sequencing of germline or somatic data. It is up to 300 times faster than the classic algorithm. Integrated with a proprietary multi-task scheduling system, MegaBOLT supports simultaneous multi-task computing on a single server. Compute efficiency is further improved from 16% to 28% (source: https://en.mgi-tech.com/products/software_info/2/).

The disadvantages of this known solution are low analysis speed, complex scheduling, difficulty of use and high computational cost. This solution is taken as the closest analogue.

The information obtained with high-throughput sequencing technologies consists of millions of short reads (on the order of 75-300 letters). In most cases, analysis of such data consists of mapping the reads to a reference genome and detecting differences from it.

Processing high-throughput sequencing data in a large-scale project presents three main difficulties. First, it can be performed with many different tools, both open source and proprietary. Choosing specific solutions requires expertise in biology and sequencing technologies, as well as an understanding of the problems each analysis solves. Second, the large volume of data requires substantial computation and time for processing, as well as an automated IT system that controls the analysis. Third, genomic data are sensitive, and their analysis may have to be performed locally, without resorting to cloud computing.

Existing analysis solutions based on academic developments are, for the most part, not designed for large volumes of data and cannot provide the necessary processing speed. Closed-source commercial solutions, in contrast, do not offer sufficient flexibility for adding new types of data analysis, and they rely on specialized hardware, which increases their price and restricts the user's choice of computing resources.

The problem solved by the invention is to modify the pipeline so that the data of one human WGS sample with 30x coverage are analyzed in approximately 5 hours. A further aim is to reduce the computation time additionally through parallel analysis of samples. Solving these problems achieves the technical result of extending the functionality available for analyzing whole human genome data while reducing the time needed to obtain genetic analysis results ready for interpretation.

Thanks to the invention, the throughput of the whole human genome data analysis system is 7.5 times higher than that of the closest analogue in analyses per day and 6.6 times higher in analyses per month:

                                Closest analogue (MegaBOLT)    Proposed solution
Capacity (analyses per day)     8                              60
Capacity (analyses per month)   250                            1650
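The ratios quoted above follow directly from the capacity figures. The short sketch below reproduces that arithmetic; the dictionary layout and function name are illustrative only, and the numbers are the ones stated in the comparison.

```python
# Capacity figures as stated in the comparison (analyses per period).
CAPACITY = {
    "per_day": {"closest_analogue": 8, "proposed": 60},
    "per_month": {"closest_analogue": 250, "proposed": 1650},
}


def speedup(period: str) -> float:
    """Ratio of the proposed solution's capacity to that of the closest analogue."""
    row = CAPACITY[period]
    return row["proposed"] / row["closest_analogue"]


if __name__ == "__main__":
    for period in CAPACITY:
        print(period, round(speedup(period), 1))
```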

The proposed solution offers more functionality for analyzing whole human genome data than the closest analogue, in particular:

                                                                                  Closest analogue (MegaBOLT)    Claimed solution
Quality control (QC), WGS                                                         +                              +
Mapping (read mapping > position sorting > duplicate marking > BQSR), WGS         +                              +
Germline variant calling (WGS)                                                    +                              +
Somatic variant calling (WGS)                                                     +                              -
HLA typing analysis                                                               -                              +
Determination of Mt and Y haplogroups                                             -                              +
WES analysis                                                                      +                              +
CNV calling (WGS)                                                                 -                              +
Structural variant calling (WGS)                                                  -                              +
Mobile element analysis (WGS)                                                     -                              +
Repeat expansion analysis (WGS)                                                   -                              +

The described pipeline is based on the analysis recommendations and the corresponding program code maintained by the Broad Institute (https://github.com/broadinstitute/warp/tree/develop/pipelines/broad/dna_seq/germline/single_sample/wgs). The pipeline provided in that repository processes one human WGS sample with 30x coverage in approximately 24 hours, whereas the proposed solution analyzes 5 samples in parallel at an average of 2.5 hours per sample.

As input, the pipeline accepts an arbitrary number of pairs of read files for one sample, as well as the set of reference data required for each analysis.

All files required by the pipeline are provided as a tar.gz archive, but they can also be downloaded automatically and formatted as needed by running the scripts hg19_dwl.sh or hg38_dwl.sh, which create the reference files for the hg19 and hg38 assemblies, respectively.

Certain files that were produced specifically for the pipeline, and that are difficult to create with simple scripts, are supplied together with the pipeline source code.

All output files produced by the pipeline share the same prefix, specified at launch in the “sample name” field of the WholeGenomeGermlineSingleSample.inputs.plumbing.json file. The output files are saved in the folders specified in the “WholeGenomeGermlineSingleSample.copy_path” fields. If these folders do not exist, they are created automatically as the pipeline runs. If necessary, copying of the output files from the temporary Cromwell-execution working directory can be disabled separately by setting the “WholeGenomeGermlineSingleSample.copy_output” field.
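The launch settings described above can be pictured as a small JSON inputs file. The sketch below is illustrative only: the `copy_path` and `copy_output` keys are taken from the text, while the exact spelling of the sample name key, the boolean type of `copy_output`, and the helper names are assumptions.

```python
import json
from pathlib import Path

WORKFLOW = "WholeGenomeGermlineSingleSample"


def build_inputs(sample_name: str, copy_path: str, copy_output: bool = True) -> dict:
    """Assemble a hypothetical inputs dictionary for one sample."""
    return {
        # Assumption: the text calls this field "sample name"; the exact key is not given.
        f"{WORKFLOW}.sample_name": sample_name,
        f"{WORKFLOW}.copy_path": copy_path,
        # Assumption: a boolean switch; the text only says the field can be set.
        f"{WORKFLOW}.copy_output": copy_output,
    }


def output_file(inputs: dict, suffix: str) -> Path:
    """Every output shares the sample prefix and lands in the copy_path folder."""
    prefix = inputs[f"{WORKFLOW}.sample_name"]
    return Path(inputs[f"{WORKFLOW}.copy_path"]) / f"{prefix}{suffix}"


if __name__ == "__main__":
    inputs = build_inputs("sample01", "/data/results")
    print(json.dumps(inputs, indent=2))
    print(output_file(inputs, ".vcf.gz"))
```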

The proposed solution is illustrated by the following figures.

Fig. 1 shows a general block diagram of the accelerated system for processing whole-genome sequencing data.

Fig. 2 shows the mapping of pairs of fq.gz files to the genome.

Fig. 3 shows the generation of the vcf and gvcf files.

Fig. 4 shows an example from the report.

Fig. 5 shows an implementation of the invention, where red frames mark the criteria whose assignment is automated by the proposed solution.

Fig. 6 shows examples of how a duplication (chr13) and a deletion (chr18) are presented in the results of the CNV search module.

Fig. 7 shows an example of reads mapped to a region of the ATXN10 gene containing repeats of the ATTCT motif.

The proposed system for processing whole-genome sequencing data consists of modules. The module that maps raw reads and genotypes SNP/INDELS is the base module on which all other types of analysis rely.

In the current implementation of the system, the “Annotation” and “Y and MT haplogroups” modules, as well as cohort calling, are launched separately from all the other modules, which together form a single pipeline.

The modules that make up the whole-genome sequencing data processing system are described in more detail below.

SNP/INDELS module

The SNP/INDELS module is the base module of the pipeline, since it maps the reads to the reference genome and genotypes SNPs and INDELs. The module consists of three blocks.

The first block maps the input reads, supplied as pairs of .fastq.gz files, to the human reference genome (hg19 or hg38, as selected) and produces a file in bam format (Fig. 2). Filtering, mapping, sorting and duplicate marking are performed by a chain of the programs fastp, minimap2 and samtools. The speed-up at this step comes from not writing intermediate data to disk. This stage parallelizes well across the server's cores, and its speed depends directly on their number. If several pairs of fq.gz files are supplied, the individual bam files are merged into a single file. The merge speed is limited by the speed of the disk on which the analysis runs. The resulting single bam file undergoes recalibration of the base quality scores assigned by the sequencer (BQSR), which increases the accuracy of SNP/INDELS detection in the downstream analysis. Parallelism at this stage is achieved by splitting the bam file into several smaller files and analyzing them independently. The limiting resource at this step is the amount of server RAM. For long-term storage, the final bam file is converted to cram format, losing the quality information for individual bases of each read but retaining the mapping information. This approach minimizes the space occupied by the file while preserving the basic information needed when working with such a file to verify the analysis results.
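The idea of chaining fastp, minimap2 and samtools without intermediate files can be sketched as a list of piped commands. This is an illustration under assumptions, not the pipeline's actual command line: the patent does not give the options used, so the flags below are ordinary ones from the tools' public documentation, and the function names, file names and thread count are hypothetical. The sketch only builds and prints the command; it does not execute anything.

```python
import shlex


def build_mapping_stages(r1, r2, reference, out_bam, threads=8):
    """Return the argv of each stage; every stage reads the previous one's stdout."""
    t = str(threads)
    return [
        # Trim/filter and write interleaved FASTQ to stdout instead of to disk.
        ["fastp", "--in1", r1, "--in2", r2, "--stdout", "--thread", t],
        # Short-read preset, SAM output, reads taken from stdin ("-").
        ["minimap2", "-a", "-x", "sr", "-t", t, reference, "-"],
        # Add mate score tags, which duplicate marking needs.
        ["samtools", "fixmate", "-m", "-", "-"],
        # Coordinate sort, still streaming.
        ["samtools", "sort", "-@", t, "-"],
        # Mark duplicates and write the final bam file.
        ["samtools", "markdup", "-", out_bam],
    ]


def as_shell(stages):
    """Render the stages as one shell pipeline string."""
    return " | ".join(shlex.join(stage) for stage in stages)


if __name__ == "__main__":
    stages = build_mapping_stages("s_R1.fq.gz", "s_R2.fq.gz", "hg38.mmi", "s.bam")
    print(as_shell(stages))
    print(shlex.join(["samtools", "index", "s.bam"]))  # indexing is a separate step
```

Only the last stage touches the disk, which is the property the text credits for the speed-up; each intermediate result exists only in a pipe buffer.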

The second logical block calls SNP/INDELS and generates the vcf and, optionally, gvcf files containing information on the polymorphisms of the analyzed sample (Fig. 3). Calling is performed by the HaplotypeCaller program from the GATK4 package. Parallelization at this stage is achieved by analyzing regions of the genome independently and is limited by the amount of server RAM. Polymorphisms in the vcf file are filtered by preset quality metric thresholds.
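Independent analysis of genome regions amounts to a scatter step: the reference is cut into intervals, each of which can be handed to a separate caller process. The sketch below shows one simple way to produce such intervals. The fixed chunk size, the function name and the toy contig lengths are assumptions for illustration; the text does not say how the pipeline chooses its regions.

```python
def scatter_intervals(contig_lengths, chunk_bp):
    """Split each contig into consecutive 1-based, inclusive intervals of at most chunk_bp."""
    intervals = []
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, chunk_bp):
            end = min(start + chunk_bp - 1, length)
            intervals.append((contig, start, end))
    return intervals


if __name__ == "__main__":
    # Toy contig lengths, not real chromosome sizes.
    toy_genome = {"chrA": 250, "chrB": 100}
    for contig, start, end in scatter_intervals(toy_genome, 100):
        print(f"{contig}:{start}-{end}")
```

Because every worker holds its own region in memory, the number of intervals processed at once is bounded by available RAM, which matches the limiting resource named in the text.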

The third logical block collects statistics on each step of the SNP/INDELS module and generates a separate report in html format for quality control of the analyzed samples. The collected statistics characterize:

- The input data (number of reads, nucleotide composition, adapter contamination, read quality, insert size)

- The quality of read mapping to the reference and the read depth (for one autosome and the sex chromosomes)

- The number of detected SNPs and INDELs, the percentage of polymorphisms not present in dbSNP, and the numbers of transversions and transitions
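One of the statistics listed above, the counts of transitions and transversions, is easy to illustrate. A transition swaps a purine for a purine (A, G) or a pyrimidine for a pyrimidine (C, T); every other single-base substitution is a transversion. The sketch below counts both for a list of (reference, alternate) base pairs; the function names and the input representation are assumptions, not the report module's code.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for A<->G or C<->T substitutions."""
    if ref == alt or not {ref, alt} <= PURINES | PYRIMIDINES:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def ti_tv(snps):
    """Return (transitions, transversions, ratio) for an iterable of (ref, alt) pairs."""
    snps = list(snps)
    ti = sum(is_transition(ref, alt) for ref, alt in snps)
    tv = len(snps) - ti
    ratio = ti / tv if tv else float("inf")
    return ti, tv, ratio


if __name__ == "__main__":
    print(ti_tv([("A", "G"), ("C", "T"), ("A", "C")]))  # prints (2, 1, 2.0)
```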

Cohort calling module

The cohort calling module is launched separately from the main pipeline. As input, the module takes a set of .gvcf files obtained by running the SNP/INDELS module on several samples (for example, a trio or a separate cohort of patients) and returns a single .vcf file containing the resulting set of SNP/INDELS for all samples.

Cohort calling genotypes the set of .gvcf files using the HaplotypeCaller program from the GATK4 package. Using this module simplifies SNP/INDELS analysis in related samples by standardizing allele representation and increasing calling accuracy (for cohorts of several dozen patients or more).

SNP/INDELS annotation module

In the current implementation of the pipeline, the SNP/INDELS annotation module is launched separately for each resulting .vcf file. For each variant recorded in the .vcf file, the module looks up information in various databases containing clinically relevant information about variants and aggregates it into an ACMG rank. The module is based on the open-source program OpenCravat (https://github.com/KarchinLab/open-cravat).

The annotation module can work with .vcf files obtained either from the SNP/INDELS module or from the cohort calling module. The module produces a .sqlite file intended for loading into the OpenCravat interface. In the current implementation of the module, annotation uses the following databases:

Frequency:

- GnomadV3

- Internal frequencies of the Evogen sample set

Clinical:

- ClinVar

- OMIM

- HPO

Structural:

- InterPro

- Gene structures

- Splice site information, dbscsnv11 (from dbNSFP)

- Information on repetitive sequences in the genome (rmsk)

In silico variant annotation:

- metaSVM (from dbNSFP)

- Gerp++ (from dbNSFP)

In addition, assignment of ACMG pathogenicity criteria is implemented (Fig. 5), based on the InterVar program integrated into OpenCravat, with updated databases.

ACMG criteria

Modified from InterVar: Clinical Interpretation of Genetic Variants by the 2015 ACMG-AMP Guidelines, 2017. Complete list of ACMG criteria. Criteria whose assignment is automated in the current pipeline are outlined in red boxes.

CNV search module

The copy number variation search module can optionally be enabled by setting the corresponding parameter when launching the pipeline. The module estimates copy number changes in chromosome regions ranging in approximate size from 100 kb to a whole chromosome, and also identifies regions with loss of heterozygosity (LOH). The module is based on the open-source program Freec (https://github.com/BoevaLab/FREEC).

Freec compares the coverage of a small genomic region with the average coverage across the whole genome to estimate ordinary CNVs, and collects statistics on the B-allele frequency of variants to detect LOH. Detection of CNV and LOH relies on the bam and vcf files produced by the SNP/INDELS module. This approach performs well for detecting large rearrangements (genome regions larger than 1 Mb). To assess the reliability of rearrangements smaller than 1 Mb, it is recommended to take into account how often the rearrangement occurs in previously analyzed samples, the proximity of telomeric and centromeric regions, and the results of the structural variant search module (see below). The CNV search module outputs a table of detected variations with their statistical significance, together with a visual representation of B-allele frequencies and average coverage for each chromosome (Fig. 4). Examples of how a duplication (chr13) and a deletion (chr18) appear in the results of the CNV search module are shown in Fig. 6.
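The coverage comparison described above can be illustrated with a short sketch. It is a deliberate simplification, not Freec's algorithm: it assumes a diploid genome, uses toy per-window depths, and omits the normalization and segmentation steps a real caller performs. The function name `copy_number_by_window` is hypothetical.

```python
from statistics import mean


def copy_number_by_window(window_depths, ploidy=2):
    """Estimate copy number per window from its depth relative to the genome-wide mean.

    Simplified illustration: no GC correction, no segmentation.
    """
    genome_mean = mean(window_depths)
    return [round(ploidy * depth / genome_mean) for depth in window_depths]


# Toy depths for ten consecutive windows (not real data):
# two windows with elevated depth, two with reduced depth.
depths = [30, 31, 29, 45, 44, 30, 15, 16, 30, 30]
print(copy_number_by_window(depths))  # [2, 2, 2, 3, 3, 2, 1, 1, 2, 2]
```

Windows estimated at 3 copies would correspond to a duplication and windows at 1 copy to a heterozygous deletion.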

Structural variant search module

The structural variant search module can optionally be enabled by setting the corresponding parameter when launching the pipeline (see the section “Pipeline launch options”). The module analyzes so-called split reads and discordant reads to identify insertions, deletions, translocations, and complex rearrangements of genomic regions. Note that this approach is sensitive to the average coverage of the analyzed sample. The module is based on the open-source program smoove (https://github.com/brentp/smoove).

The module relies on the bam files produced by the SNP/INDELS module. Its output is a table of the most probable structural rearrangements, split into homozygous and heterozygous calls, listing the genes affected by each putative rearrangement. Rearrangements are associated with genes using the VEP tool (https://github.com/Ensembl/ensembl-vep). In addition to the final table, separate bam files containing the split and discordant reads are created for visual assessment, as well as a vcf file containing all variants detected by smoove without additional reliability filtering.
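To make the notions of split and discordant reads concrete, the sketch below labels simplified alignment records. It is an illustration only and does not reproduce the logic of smoove: the `Alignment` record, the `classify` function, and the insert-size parameters (mean 400, standard deviation 50, cutoff of 4 standard deviations) are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical, simplified view of one paired-end read alignment."""
    chrom: str
    mate_chrom: str
    template_length: int      # distance spanned by the read pair
    proper_orientation: bool  # mates face each other as expected
    is_split: bool            # read aligns in two pieces at different loci


def classify(aln, mean_insert=400, sd_insert=50, n_sd=4):
    """Label a read as 'split', 'discordant', or 'concordant'."""
    if aln.is_split:
        return "split"
    if aln.chrom != aln.mate_chrom or not aln.proper_orientation:
        return "discordant"
    if abs(abs(aln.template_length) - mean_insert) > n_sd * sd_insert:
        return "discordant"
    return "concordant"


examples = [
    Alignment("chr1", "chr1", 410, True, False),   # normal pair
    Alignment("chr1", "chr1", 5200, True, False),  # mates too far apart
    Alignment("chr1", "chr9", 0, True, False),     # mates on different chromosomes
    Alignment("chr2", "chr2", 395, True, True),    # read split across a breakpoint
]
print([classify(a) for a in examples])
# ['concordant', 'discordant', 'discordant', 'split']
```

Because both signals depend on how many reads cover a breakpoint, fewer supporting reads are available at low coverage, which is consistent with the sensitivity to average coverage noted in the text.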

Repeat expansion search module

The repeat expansion search module can optionally be enabled by setting the corresponding parameter when launching the pipeline. The module analyzes reads mapped to a set of predefined genomic regions containing repeats of short DNA sequences and determines the number of such repeats in the sample's haplotypes. The module is based on the open-source program ExpansionHunter (https://github.com/Illumina/ExpansionHunter).

The module relies on the bam files produced by the SNP/INDELS module. Its output is a .pdf document showing the genomic regions where the number of repeats differs from the reference value, together with a graphical representation of the reads remapped to each such region (Fig. 7). In addition, a bam file containing the reads remapped by ExpansionHunter to the repeat regions is saved, as well as a .vcf file containing the results for all analyzed repeats, including those whose copy number does not differ from the reference values.

Fig. 7 shows an example of reads mapped to a region of the ATXN10 gene containing repeats of the ATTCT motif.
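As a toy illustration of what "number of repeats" means for a motif such as ATTCT, the sketch below counts the longest uninterrupted tandem run of a motif in a single sequence. This is not how ExpansionHunter works (it analyzes many reads, including reads that only partly cover the repeat); the function name `longest_tandem_run` and the example sequence are made up for illustration.

```python
def longest_tandem_run(sequence, motif):
    """Return the largest number of back-to-back copies of motif found in sequence."""
    best = 0
    for start in range(len(sequence) - len(motif) + 1):
        run = 0
        pos = start
        # Count consecutive copies of the motif beginning at this position.
        while sequence.startswith(motif, pos):
            run += 1
            pos += len(motif)
        best = max(best, run)
    return best


# Made-up read: flanking bases around 12 copies of the ATTCT motif.
read = "ACGT" + "ATTCT" * 12 + "GGA"
print(longest_tandem_run(read, "ATTCT"))  # 12
```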

Mobile element insertion search module

The mobile element insertion search module can optionally be enabled by setting the corresponding parameter when launching the pipeline. The module detects insertions of mobile elements of the ALU, SVA, and LINE1 families in the genome and is based on the open-source program TEMP2 (https://github.com/weng-lab/TEMP2).

The module relies on the bam files produced by the SNP/INDELS module. Its output is two .tsv files. The first file contains the complete list of detected insertions and their association with genes, obtained using the VEP tool (https://github.com/Ensembl/ensembl-vep). The second file contains those insertions that, according to the IMPACT field reported by VEP, may affect the function of the associated genes.
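The relationship between the two .tsv files can be sketched as a filter on the IMPACT column. The text does not say which IMPACT values are retained, so the set used here (HIGH and MODERATE) is an assumption, as are the column names other than IMPACT, the gene names, and the helper name `filter_by_impact`.

```python
import csv
import io

# Assumption: which VEP IMPACT categories count as potentially affecting gene function.
IMPACTS_OF_INTEREST = {"HIGH", "MODERATE"}


def filter_by_impact(tsv_text, impacts=IMPACTS_OF_INTEREST):
    """Keep only the rows of a tab-separated table whose IMPACT value is in impacts."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row["IMPACT"] in impacts]


# Made-up "full list" table; columns other than IMPACT are hypothetical.
full_table = (
    "chrom\tpos\tfamily\tgene\tIMPACT\n"
    "chr1\t12345\tALU\tGENE_A\tMODIFIER\n"
    "chr2\t67890\tLINE1\tGENE_B\tHIGH\n"
    "chr7\t13579\tSVA\tGENE_C\tMODERATE\n"
)

for row in filter_by_impact(full_table):
    print(row["gene"], row["family"], row["IMPACT"])
```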

HLA typing module

The HLA typing module can optionally be enabled by setting the corresponding parameter when launching the pipeline. The module determines HLA haplotypes and is based on the open-source program Kourami (https://github.com/Kingsford-Group/kourami).

The module relies on the bam files produced by the SNP/INDELS module. Reads mapped to HLA regions, together with unmapped and partially mapped reads, are aggregated and remapped to a database of haplotype sequences of the HLA locus. The module outputs a .tsv file containing the predicted HLA allele types according to the established nomenclature.

Mt and Y haplogroup determination module

In the current pipeline implementation, the haplogroup annotation modules are launched separately for each resulting .vcf file. The Mt haplogroup annotation module uses an open-source program (https://github.com/seppinho/haplogrep-cmd). The Y haplogroup annotation module uses a modified version of the open-source program y-leaf (https://github.com/genid/Yleaf).

To annotate the mitochondrial (MT) haplogroup, Phylotree build 17, the phylogenetic tree of worldwide human mitochondrial DNA variation, is used. The program takes a vcf file as input, performs classification, and outputs the most probable MT haplogroup for the individual. To identify the Y haplogroup, y-leaf uses phylogenetically informative SNPs from the ISOGG (International Society of Genetic Genealogy) database. y-leaf takes as input a vcf file from a male individual, performs classification, and outputs the Y haplogroup.

The method for processing whole-genome sequencing data includes the following steps.

The first stage of data analysis is the alignment of raw reads to the reference genome. Unlike the academically recognized solution based on the best practices of the Broad Institute (GATK), all tools used at this stage are chained into a single pipeline that passes data between programs without writing intermediate files, and some tools are replaced with faster alternatives.

The reads are trimmed by the fastp tool and streamed to the minimap2 mapper, from which they are likewise streamed to the samtools tool, which marks duplicates and indexes the resulting bam file. The number of threads for this step is set by the user and depends on the server configuration and the number of samples analyzed in parallel. At this stage data is read from and written to an NVMe disk, which makes it possible to use up to 130 data-processing threads.
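The streaming hand-off described above can be sketched as a chain of processes connected by pipes. This is a minimal illustration, not the authors' implementation: the helper names `build_alignment_stages` and `run_streamed` are hypothetical, the file names and thread count are placeholders, and the `samtools fixmate` and `samtools sort` stages are an assumption added because `samtools markdup` expects mate-fixed, coordinate-sorted input.

```python
import subprocess
import sys


def build_alignment_stages(fq1, fq2, reference, out_bam, threads=8):
    """Return argv lists for a trim -> map -> mark-duplicates stream.

    Hypothetical helper: the exact options used in the pipeline are not given.
    """
    t = str(threads)
    return [
        # fastp trims both mates and writes interleaved reads to stdout
        ["fastp", "--in1", fq1, "--in2", fq2, "--stdout", "--thread", t],
        # minimap2 reads stdin ("-") with the short-read preset and emits SAM
        ["minimap2", "-a", "-x", "sr", "-t", t, reference, "-"],
        # assumption: prepare mate tags and coordinate order for markdup
        ["samtools", "fixmate", "-m", "-", "-"],
        ["samtools", "sort", "-@", t, "-"],
        ["samtools", "markdup", "-", out_bam],
    ]


def run_streamed(stages):
    """Run argv lists as one pipeline; no intermediate files are written."""
    procs = []
    upstream = None
    for argv in stages:
        proc = subprocess.Popen(argv, stdin=upstream, stdout=subprocess.PIPE)
        if upstream is not None:
            upstream.close()  # the child now owns the read end of the pipe
        upstream = proc.stdout
        procs.append(proc)
    output, _ = procs[-1].communicate()
    return output, [p.wait() for p in procs]


if __name__ == "__main__":
    # Stand-in stages so the chaining can be tried without the real tools.
    demo = [
        [sys.executable, "-c", "print('acgt')"],
        [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"],
    ]
    print(run_streamed(demo))
```

With the real tools installed, `run_streamed(build_alignment_stages(...))` would run the whole chain, and indexing would follow as a separate `samtools index` call on the output file.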

The resulting bam file is passed to the next processing stage, the recalibration of the quality metrics assigned by the sequencer (BQSR and ApplyBQSR in the standard GATK genomic data processing pipeline). In the pipeline described in this application, the use of the BQSR and ApplyBQSR tools is optimized by splitting the input data into equal genomic intervals rather than by individual chromosomes. This approach loads the server's computing resources evenly and lets processing of all genomic intervals finish at about the same time, which saves computing time.
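The benefit of equal intervals over per-chromosome splitting can be seen in a small sketch. The chromosome lengths below are toy values, the window size is arbitrary, and `equal_intervals` is a hypothetical helper; the actual interval size used in the pipeline is not stated.

```python
def equal_intervals(chrom_lengths, interval_size):
    """Split every chromosome into windows of at most interval_size bases.

    Coordinates are 1-based and inclusive.
    """
    intervals = []
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, interval_size):
            end = min(start + interval_size - 1, length)
            intervals.append((chrom, start, end))
    return intervals


# Toy lengths, not real chromosome sizes.
toy_genome = {"chr1": 1000, "chr2": 400, "chr3": 250}
windows = equal_intervals(toy_genome, 250)

largest_chrom_task = max(toy_genome.values())
largest_window_task = max(end - start + 1 for _, start, end in windows)
print(len(windows), largest_chrom_task, largest_window_task)  # 7 1000 250
```

When work is split by chromosome, the total run time is bounded by the largest chromosome (1000 bases here), while the workers handling small chromosomes sit idle. With windows, no task exceeds 250 bases, so the tasks finish at similar times. Each window can be written as `chrom:start-end`, the form accepted by the GATK `-L` argument.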

The resulting bam file is used as input for the other modules and is also converted to the cram format for long-term storage. The bam-to-cram conversion step is implemented with the crumble tool and generates a cram file that lacks base-calling quality metrics but is approximately 10 times smaller than the original bam file, which saves substantial space in file storage. The advantage of this approach is that such a cram file is suitable for visualizing the mapping and evaluating it manually. If a full bam file with quality metrics is needed for third-party analysis, it is usually needed only for a particular region of the genome and is restored from a combination of the cram file and the fastq files: the IDs of the reads in the region of interest are extracted from the cram file, the original reads with those IDs are extracted from the fastq file and, if necessary, remapped to the genome.
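The second half of the restoration step, pulling the original reads out of a fastq file by ID, can be sketched as follows. The set of read IDs is assumed to have been obtained from the cram file beforehand (for example with a region query); the function name `extract_reads` and the example records are made up for illustration.

```python
def extract_reads(fastq_lines, wanted_ids):
    """Yield 4-line FASTQ records whose read ID is in wanted_ids."""
    lines = iter(fastq_lines)
    # A FASTQ record is four consecutive lines: header, sequence, '+', qualities.
    for header, seq, plus, qual in zip(lines, lines, lines, lines):
        read_id = header[1:].split()[0]  # drop '@' and any trailing description
        if read_id in wanted_ids:
            yield header, seq, plus, qual


# Made-up records; qualities are preserved here, unlike in the reduced cram file.
fastq = [
    "@read1 1:N:0", "ACGT", "+", "IIII",
    "@read2 1:N:0", "TTGA", "+", "IIHH",
    "@read3 1:N:0", "GGCC", "+", "HHII",
]
region_ids = {"read1", "read3"}  # assumed to come from the cram file

for header, seq, _, qual in extract_reads(fastq, region_ids):
    print(header, seq, qual)
```

The same function works on an open file handle, since it only iterates over lines; the extracted records can then be remapped to obtain a bam file with full quality values for the region.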

The vcf and gvcf file generation block uses the standard HaplotypeCaller tool of the GATK package. At this stage the pipeline is sped up by splitting the genome into uniform intervals for analysis, as in the quality metric recalibration stage.

The quality of the input data is assessed from the results of trimming, read mapping, and SNP/INDELS calling. Collecting quality metrics is a rather time-consuming process. In the described pipeline, this step is accelerated by collecting metrics not over the whole genome but over representative regions of it, which reduces the required computation. Possible contamination with foreign genetic material is assessed using mitochondrial DNA, which is much faster than the usual estimation of the contamination level from all genomic DNA.

The additional analysis modules that work on the resulting bam files were likewise assembled with the need to save overall computing time in mind.

The CNV analysis and transposon insertion analysis blocks are configured in the same way as the vcf file generation and score recalibration blocks: the original bam file is divided into equal blocks containing equal amounts of information so that data processing can be parallelized.

The HLA typing and haplogroup determination blocks are further accelerated by taking as input bam files that contain only reads from the regions required for the analysis.

These process characteristics apply when the analysis is run on a server with 128 CPUs, 256 GB of RAM, and an NVMe disk of 2 TB or more.

The proposed invention enables the pipeline to perform the following types of analysis:

- Identification of SNP/INDELS in a sample

- Annotation (in particular, ACMG)

- Determination of Y and MT haplogroups

- Cohort variant calling

- Copy number changes in chromosome regions with an approximate size from 1 Mb to a whole chromosome

- Analysis for structural variants with an approximate size from 100 to 100 thousand bp

- Mobile element insertions (ALU, SVA, LINE1)

- Expansion of repetitive regions of the genome

- HLA typing of a sample

The developed configuration takes into account the need to optimize sample analysis time. The pipeline streamlines the required analyses and the issuing of reports by specialists when client samples are processed in a continuous flow.

Claims

A method for processing whole-genome sequencing data, characterized in that it includes a data analysis stage consisting of the alignment of raw reads to a reference genome, wherein the reads are trimmed by the fastp tool and streamed to the minimap2 mapper, from which they are likewise streamed to the samtools tool, which marks duplicates and indexes the resulting bam file; the resulting bam file is passed to the next processing stage, which is the recalibration of the quality metrics assigned by the sequencer; and wherein quality control of the input data, assessed from the results of trimming, read mapping, and SNP/INDELS calling, is accelerated by collecting metrics over representative regions of the genome.