CN114944197A - Automatic serotype analysis and identification method and system based on sequencing data - Google Patents

Publication number
CN114944197A
Authority
CN
China
Prior art keywords
serotype
key
sequencing data
database
allele
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210540274.4A
Other languages
Chinese (zh)
Other versions
CN114944197B (en)
Inventor
刘健
孙嘉良
陈娇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nankai University
Original Assignee
Nankai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nankai University
Priority to CN202210540274.4A
Publication of CN114944197A
Application granted
Publication of CN114944197B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application provides an automated serotype analysis and identification method and system based on sequencing data, relating to the technical field of gene sequencing data analysis. The method comprises the following steps: obtaining microbial genome sequencing data; comparing the microbial genome sequencing data with each key allele in a key allele database, and recording the key alleles whose similarity is greater than a preset threshold together with the corresponding alignment scores; determining the organism to which the microbial genome sequencing data belongs according to the key alleles and the corresponding alignment scores; determining a sequence type in a sequence type database using the key alleles of the organism; and searching a serotype database using the sequence type and determining the serotype of the microbial genome sequencing data according to the mapping relation between sequence types and serotypes. Bioinformatics analysis and identification are thereby automated. In addition, customized bioinformatics analysis can be performed on short-read and long-read sequencing data generated by different platforms to obtain accurate analysis results.

Description

Automatic serotype analysis and identification method and system based on sequencing data
Technical Field
The application belongs to the technical field of gene sequencing data analysis, and particularly relates to an automatic serotype analysis and identification method and system based on sequencing data.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Serotyping based on microbial genome sequencing data is widely used in settings that require automated analysis programs. Research into serotype analysis of single species has yielded some success.
The inventors have found that existing work on serotyping generally focuses on analyzing sequencing data from a single microbial species, in particular on identifying and analyzing Salmonella serotypes, and therefore cannot meet the analysis requirements of the rapidly growing volume of multi-species sequencing data. Moreover, existing work is usually not comprehensive: it focuses on a single aspect, either sequence type or serotype, and can neither cover multiple aspects nor link the information among them.
Disclosure of Invention
In order to overcome the defects of the prior art, the application provides an automatic serotype analysis and identification method and system based on sequencing data, which are used for identifying serotypes of multiple species and customized bioinformatics analysis and are beneficial to improving the accuracy of a serotype analysis and identification result.
The technical scheme adopted by the application is as follows:
in a first aspect, an embodiment of the present application provides an automated serotype analysis and identification method based on sequencing data, including:
obtaining microbial genome sequencing data;
comparing the microbial genome sequencing data with each key allele in a key allele database, and recording the key alleles with similarity greater than a preset threshold and corresponding comparison scores;
determining an organism to which the microbial genome sequencing data belongs based on the key alleles and corresponding alignment scores;
determining a sequence type in a sequence type database using key alleles of the organism; searching a serotype database using the sequence type, and determining the serotype of the sequencing data of the microbial genome according to the mapping relation between the sequence type and the serotype.
In one possible embodiment, before obtaining the microbial genome sequencing data, the method further comprises: constructing a key allele-sequence type-serotype association database (hereinafter referred to as the association database).
In one possible embodiment, the building process of the association database includes: collecting information on the relevant key alleles, sequence types and serotypes; mining the association between key alleles and sequence types, the association between sequence types and serotypes, and the information on the key alleles, sequence types and serotypes; and constructing the key allele-sequence type-serotype association database according to the associations and the information.
In one possible embodiment, the key allele-sequence type-serotype association database comprises a key allele database, a sequence type database and a serotype database, and the databases establish association relations through indexes; the sequence type database records the mapping relation of different combinations of key alleles to each organism sequence type; the serotype database records the association between sequence types and serotypes, and is used for serotype identification of various microorganisms.
In one possible embodiment, the frequency of each serotype of a sequence type of an organism in the association database is calculated according to the law of large numbers, and the probability that the serotype of the organism is a known serotype is determined from the frequency; and determining the association relation between the sequence types and the serotypes according to the probability.
In one possible embodiment, the organism to which the microbial genome sequencing data belongs is evaluated using a sigmoid scoring strategy.
In one possible embodiment, the organism to which the microbial genome sequencing data belongs is evaluated by:
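[Equation (1): the score s associated with one allele, expressed in terms of x and θ; formula image BDA0003650045320000031 not reproduced]
[Equation (2): the final score f of the organism, computed from the scores s over all key alleles; formula image BDA0003650045320000032 not reproduced]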
wherein x represents the number of different alleles in an allele locus of the organism, θ represents a weight associated with the marker, and s represents a score associated with one allele of the organism; allel represents the key allele of the organism, allels represents all the key alleles of the organism, f represents the final score of the organism; and determining the organism to which the microbial genome sequencing data belongs according to the final score.
In a second aspect, embodiments of the present application provide an automated serotype analysis and identification system based on sequencing data, comprising:
the acquisition module is used for acquiring microbial genome sequencing data;
the comparison module is used for comparing the sequencing data of the microbial genome with each key allele in a key allele database, and recording the key alleles with similarity greater than a preset threshold and corresponding comparison scores;
a determination module for determining an organism to which the microbial genome sequencing data belongs based on the key alleles and corresponding alignment scores;
an identification module for determining a sequence type in a sequence type database using key alleles of the organism; searching a serotype database using the sequence types, and identifying the serotypes of the sequencing data of the microbial genome according to the mapping relationship between the sequence types and the serotypes.
In a third aspect, an embodiment of the present application provides a computer device, including: a processor, a memory and a bus. The memory stores machine-readable instructions executable by the processor, and the processor and the memory communicate via the bus when the computer device is running. When executed by the processor, the machine-readable instructions perform the steps of the automated serotype analysis and identification method based on sequencing data as described in the first aspect and any one of the possible embodiments of the first aspect.
In a fourth aspect, the present embodiments provide a computer-readable storage medium having stored thereon a computer program, which when executed by a processor, performs the steps of the automated serotype analysis and identification method based on sequencing data as described in the first aspect and any one of the possible embodiments of the first aspect.
The beneficial effects of the present application are as follows:
The automated bioinformatics analysis comprises the following steps: obtaining microbial genome sequencing data; comparing the microbial genome sequencing data with each key allele in a key allele database, and recording the key alleles whose similarity is greater than a preset threshold together with the corresponding alignment scores; determining the organism to which the microbial genome sequencing data belongs based on the key alleles and the corresponding alignment scores; determining a sequence type in a sequence type database using the key alleles of the organism; and searching a serotype database using the sequence type and determining the serotype of the microbial genome sequencing data according to the mapping relation between sequence types and serotypes. In this way, customized bioinformatics analysis can be performed on short-read and long-read sequencing data generated by different platforms, and accurate analysis results are obtained.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a flow diagram of an automated serotype analysis identification method based on sequencing data as provided in an embodiment of the present application;
FIG. 2 is a flow diagram of an automated serotype analysis identification method based on sequencing data as provided in another embodiment of the present application;
FIG. 3 is a structural diagram of the key allele-sequence type-serotype association database as provided in the embodiments of the present application;
FIG. 4 is a block diagram of an automated serotype analysis and identification system based on sequencing data as provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a computer device provided in an embodiment of the present application.
Detailed Description
The present application will be further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
At present, serotyping work usually focuses on analyzing sequencing data from a single microbial species, especially on identifying and analyzing Salmonella serotypes, and cannot meet the analysis requirements of the rapidly growing volume of multi-species sequencing data. Moreover, existing work is usually not comprehensive: it focuses on a single aspect, either sequence type or serotype, and can neither cover multiple aspects nor link the information among them. More advanced analysis techniques or tools should provide versatile analysis for serotype identification, cover more species, and offer simpler configuration options so that they are more user friendly. On this basis, the present application provides an automated serotype analysis and identification method based on sequencing data, which performs sequence typing and serotype identification from microbial genome sequencing data and is applicable to multiple species.
Example one
As shown in FIG. 1 and FIG. 2, the automated serotype analysis and identification method based on sequencing data provided in the embodiments of the present application includes the following steps:
S101: Obtain microbial genome sequencing data.
The microbial genome sequencing data may be of different types, specifically short-read sequencing data, long-read sequencing data, and assembled data.
S102: Compare the microbial genome sequencing data with each key allele in a key allele database, and record the key alleles whose similarity is greater than a preset threshold together with the corresponding alignment scores.
In a specific implementation, this example first aligns the microbial genome sequencing data against the key allele-sequence type-serotype association database. From the alignment results, it extracts the scheme (one scheme is associated with each organism), allele_num, the input sequence length, the alignment length, and the number of identical matches. The list of possible organisms is then evaluated from the extracted information: the higher the coverage of an organism's alleles by the microbial genome sequencing data, the more likely it is that the sequences belong to that organism. Specifically, a high-score marker is assigned when the sequence length, the alignment length, and the number of identical matches are all equal; a medium-score marker is assigned when two of the three values (input sequence length, alignment length, and number of identical matches) are equal; and a low-score marker is assigned when none of them are equal.
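The marker assignment described above can be sketched as a small function. This is a minimal illustration, not the patented implementation: the function name `assign_marker` and the marker labels are hypothetical, and the reading of the medium case as "exactly two of the three values are equal" is an assumption drawn from the translated text.

```python
def assign_marker(input_len: int, align_len: int, identical: int) -> str:
    """Classify one alignment hit as 'high', 'medium', or 'low'.

    Assumption: 'medium' means that any two of the three values agree
    while the third differs; the source text is ambiguous on this point.
    """
    if input_len == align_len == identical:
        return "high"      # full-length, mismatch-free alignment
    if input_len == align_len or align_len == identical or input_len == identical:
        return "medium"    # two of the three values agree
    return "low"           # all three values differ


for hit in [(501, 501, 501), (501, 501, 499), (501, 498, 495)]:
    print(hit, assign_marker(*hit))
```

Keeping the rule in one pure function makes it easy to swap in a different interpretation of the medium case without touching the rest of the pipeline.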
For the organism with the highest score in the list of possible organisms, this embodiment obtains the corresponding scheme, allele and allele_num information, constructs equality search conditions to be executed against the key allele-sequence type-serotype association database, and on this basis generates the sequence type and the serotype of the input sequencing data, thereby completing the identification of the microorganism.
In the embodiment of the present application, as an alternative embodiment, before obtaining the sequencing data of the genome of the microorganism, the method further includes: constructing a key allele-sequence type-serotype association database.
In a specific implementation, the key allele-sequence type-serotype association database consists of three parts: a key allele database for searching key alleles, a sequence type database associated with key alleles, and a serotype database associated with sequence types.
In this embodiment, as an optional embodiment, the building process of the association database includes: collecting relevant key allele, sequence type and serotype information; mining the association between key alleles and sequence types, the association between sequence types and serotypes, and the information on the key alleles, sequence types and serotypes; and constructing the key allele-sequence type-serotype association database according to the associations and the information.
In a specific implementation, relevant key allele, sequence type, and serotype information is collected, the associations between sequence types and serotypes are mined to support serotype identification of multiple microorganisms, and the association database is constructed. The association database contains 1044 key alleles of 135 organisms, and each allelic locus contains tens or hundreds of different allele sequences. In addition, the association database contains 45898 associations between sequence types and serotypes. Table 1 lists several organisms and their alleles, where each organism may be associated with several different allelic loci (e.g., allele_1, allele_2, …), each of which may have a different number of alleles (e.g., pgi(17) indicates that the allelic locus pgi has 17 alleles).
TABLE 1 Examples of organisms and alleles in the association database
[Table 1 is provided as images (BDA0003650045320000071, BDA0003650045320000081) in the original publication and is not reproduced here.]
In the embodiment of the present application, as an optional embodiment, the key allele-sequence type-serotype association database includes a key allele database, a sequence type database, and a serotype database, and each database establishes an association relationship by an index; the sequence type database records the mapping relation of different combinations of key alleles to each organism sequence type; the serotype database records the association between sequence types and serotypes, and is used for serotype identification of various microorganisms.
In a specific implementation, the same allelic locus may be associated with multiple sequences. For example, the allelic locus aroC of Salmonella enterica is associated with 2106 sequences, as shown in Table 2. The key allele database stores the downloaded key alleles (the relationship between gene sequences and alleles is a one-to-one mapping, shown as 1:1 in FIG. 3). This example provides a local script that downloads the gene sequences and builds a BLAST index, so that similar key alleles can be found by fast alignment.
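The recording step of S102 can be illustrated by parsing alignment hits and keeping those above the similarity threshold. The sketch below assumes, hypothetically, that the alignment was run with BLAST tabular output using the column list `6 qseqid sseqid pident length qlen slen nident bitscore`; the `Hit` type, the function `parse_hits`, the 95% threshold, and the sample rows are illustrative and are not taken from the patent.

```python
import csv
import io
from typing import NamedTuple

# Column order assumed for the tabular alignment output (an assumption).
FIELDS = ["qseqid", "sseqid", "pident", "length", "qlen", "slen", "nident", "bitscore"]


class Hit(NamedTuple):
    query: str        # input sequence (read or contig) identifier
    allele: str       # key allele identifier, e.g. "aroC_1"
    identity: float   # percent identity of the alignment
    align_len: int    # alignment length
    allele_len: int   # length of the key allele sequence
    identical: int    # number of identical matches
    score: float      # alignment (bit) score


def parse_hits(tabular_text: str, min_identity: float = 95.0) -> list[Hit]:
    """Keep only hits whose identity exceeds the preset threshold."""
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        rec = dict(zip(FIELDS, row))
        if float(rec["pident"]) > min_identity:
            hits.append(Hit(rec["qseqid"], rec["sseqid"], float(rec["pident"]),
                            int(rec["length"]), int(rec["slen"]),
                            int(rec["nident"]), float(rec["bitscore"])))
    return hits


# Hypothetical sample rows, for illustration only.
sample = ("contig1\taroC_1\t100.000\t501\t48000\t501\t501\t926\n"
          "contig1\tdnaN_7\t91.200\t501\t48000\t501\t457\t640\n")
for hit in parse_hits(sample):
    print(hit.allele, hit.identity, hit.score)
```

The retained fields are the ones the description later uses for scoring: alignment length, number of identical matches, and the alignment score.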
TABLE 2 Salmonella allele aroC and corresponding sequence examples
Allele Sequence
aroC_1 …GTTTTTCGCCCGGGACACGCGGATTACACCTATGAGCAGA…
aroC_2 …CTGCGCGATTACCGTGGCGGTGGACGTTCTTCCGCGCGTG…
aroC_3 …CTTCCGCGCGTGAAACCGCGATGCGCGTAGCGGCAGGGGC…
aroC_4 …GATCGCCAAGAAATACCTGGCGGAAAAGTTCGGCATCGAA…
aroC_5 …GATATTCCGCTGGAGATTAAAGACTGGCGTCAGGTTGAGC…
Further, the key allele-sequence type associations are specifically:
The sequence type database records, for each organism, the mapping from different combinations of key alleles to sequence types. The tool provides a local script to collect the mappings and store them in SQLite. Table 3 shows an example of the Salmonella enterica scheme in the sequence type database, which consists of the sequence types of Salmonella enterica (column ST) and the corresponding alleles (columns aroC, dnaN, hemD, etc.). The allele values in Table 3 are the corresponding allele numbers. Different combinations of alleles are associated with different sequence types, and these combinations are used in this example to identify the sequence type of an organism.
TABLE 3 Salmonella sequence types and corresponding alleles
ST aroC dnaN hemD hisD purE sucA thrA
1 1 1 1 1 1 1 5
2 1 1 2 1 1 1 5
3 1 1 2 1 1 1 9
4 43 41 16 13 34 13 4
5 16 43 45 43 36 39 42
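The lookup from an allele combination to a sequence type can be sketched with an in-memory SQLite table holding the rows of Table 3. The table name `senterica_st` and the helper names `build_st_table` and `lookup_st` are hypothetical; the patent does not disclose the actual schema, so this only illustrates the equality search it describes.

```python
import sqlite3

LOCI = ["aroC", "dnaN", "hemD", "hisD", "purE", "sucA", "thrA"]

# Rows copied from Table 3: ST followed by one allele number per locus.
PROFILES = [
    (1, 1, 1, 1, 1, 1, 1, 5),
    (2, 1, 1, 2, 1, 1, 1, 5),
    (3, 1, 1, 2, 1, 1, 1, 9),
    (4, 43, 41, 16, 13, 34, 13, 4),
    (5, 16, 43, 45, 43, 36, 39, 42),
]


def build_st_table(conn: sqlite3.Connection) -> None:
    """Create and fill a sequence type table (hypothetical schema)."""
    cols = ", ".join(f"{locus} INTEGER" for locus in LOCI)
    conn.execute(f"CREATE TABLE senterica_st (ST INTEGER PRIMARY KEY, {cols})")
    marks = ", ".join("?" * (len(LOCI) + 1))
    conn.executemany(f"INSERT INTO senterica_st VALUES ({marks})", PROFILES)


def lookup_st(conn: sqlite3.Connection, profile: dict):
    """Return the ST whose alleles all equal the given profile, or None."""
    where = " AND ".join(f"{locus} = ?" for locus in LOCI)
    row = conn.execute(f"SELECT ST FROM senterica_st WHERE {where}",
                       [profile[locus] for locus in LOCI]).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
build_st_table(conn)
query = {"aroC": 1, "dnaN": 1, "hemD": 2, "hisD": 1, "purE": 1, "sucA": 1, "thrA": 9}
print(lookup_st(conn, query))
```

Column names are interpolated only from the fixed `LOCI` list, while the allele numbers are passed as bound parameters, which keeps the query safe for values derived from input data.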
Further, the sequence type-serotype association is specifically:
The serotype database records the associations between sequence types and serotypes, which are used for serotype identification of a variety of microorganisms. The tool provides a local script to construct the associations and store them in SQLite. The relationship between serotypes and sequence types is a many-to-many mapping (denoted as n:n in FIG. 3).
In the embodiment of the application, as an optional embodiment, the frequency of each serotype of an organism sequence type in the association database is calculated according to the law of large numbers, and the probability that the serotype of the organism is a known serotype is determined according to the frequency; and determining the association relationship between the sequence types and the serotypes according to the probability.
In a specific implementation, after calculating the frequency f_i of each serotype for a sequence type of an organism in the association database, this example takes f_i, in accordance with the law of large numbers, as the probability that an organism of that sequence type has that known serotype. Table 4 shows an example of the Salmonella enterica scheme in the serotype database, whose columns contain the sequence types and the associated serotypes. For example, if a given organism has an ST value of 1, the probabilities that its serotype is Typhi or Enteritidis are 0.9995 and 0.0005, respectively. With the association information in the serotype database, the present technique can identify the possible serotypes of multiple microorganisms from their sequence types.
TABLE 4 Salmonella sequence types and associated serotypes
ST Serotype
1 Typhi:0.9995;Enteritidis:0.0005
2 Typhi:0.9990;others:0.001
4 Montevideo:0.9286;others:0.0714
5 Newport:0.6667;others:0.3333
6 Enteritidis:1.0
8 Typhi:1.0
10 Dublin:0.84;Typhi:0.02;unknown:0.13;Naestved:0.01
11 Enteritidis:0.98;others:0.02
13 Agona:0.97;Derby:0.004;others:0.026
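The frequency-as-probability step can be sketched as follows. The function names `serotype_probabilities` and `most_likely` are hypothetical, and the observation counts are invented so that they reproduce the ST 10 row of Table 4; the patent does not state the underlying counts.

```python
from collections import Counter


def serotype_probabilities(observed: list[str]) -> dict[str, float]:
    """Relative frequency of each serotype among records sharing one ST."""
    counts = Counter(observed)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def most_likely(entry: str) -> tuple[str, float]:
    """Parse a Table 4 style entry such as 'Typhi:0.9995;Enteritidis:0.0005'
    and return the serotype with the highest probability."""
    pairs = (item.rsplit(":", 1) for item in entry.split(";"))
    return max(((name, float(p)) for name, p in pairs), key=lambda kv: kv[1])


# Hypothetical counts chosen to reproduce the ST 10 row of Table 4.
records = ["Dublin"] * 84 + ["Typhi"] * 2 + ["unknown"] * 13 + ["Naestved"]
print(serotype_probabilities(records))
print(most_likely("Dublin:0.84;Typhi:0.02;unknown:0.13;Naestved:0.01"))
```

The estimate is only as reliable as the number of records behind it, which is why the description appeals to the law of large numbers.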
S103: determining the organism to which the microbial genome sequencing data belongs according to the key allele and the corresponding alignment score.
As an alternative example, the organism to which the microbial genome sequencing data belongs is evaluated by:
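[Equation (1): the score s associated with one allele, expressed in terms of x and θ; formula image BDA0003650045320000101 not reproduced]
[Equation (2): the final score f of the organism, computed from the scores s over all key alleles; formula image BDA0003650045320000102 not reproduced]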
wherein x represents the number of different alleles in an allele locus of the organism, θ represents a weight associated with the marker, and s represents a score associated with one allele of the organism; allel represents the key allele of the organism, allels represents all the key alleles of the organism, f represents the final score of the organism; and determining the organism to which the microbial genome sequencing data belongs according to the final score.
In a specific implementation, this scoring process can be described by Algorithm 1. θ1 and θ2 are the weights associated with the markers, and org is a possible organism. The score table consists of the possible organisms and their corresponding final scores f. Lines 1-5 parse the organisms and obtain all key alleles of each possible organism; lines 6-10 use equation (1) to calculate the score associated with each allele of a possible organism; lines 11-12 calculate the final score f of each possible organism using equation (2).
[Algorithm 1: scoring of possible organisms; image BDA0003650045320000111 not reproduced]
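Because the formula and algorithm images are not reproduced in this text, the exact forms of equations (1) and (2) are unknown here. The sketch below shows one plausible reading only, under two stated assumptions: equation (1) is taken to be the logistic function s = 1 / (1 + e^(-θx)), and equation (2) is taken to be the sum of s over all key alleles of the organism. All function names, organism labels, and numbers are hypothetical.

```python
import math


def allele_score(x: float, theta: float) -> float:
    """Assumed form of equation (1): logistic score of one allele.

    x is the number of different alleles at the locus and theta the
    marker weight; both are expected to be non-negative.
    """
    return 1.0 / (1.0 + math.exp(-theta * x))


def final_score(alleles: list[tuple[float, float]]) -> float:
    """Assumed form of equation (2): sum of allele scores for one organism."""
    return sum(allele_score(x, theta) for x, theta in alleles)


def best_organism(candidates: dict[str, list[tuple[float, float]]]):
    """Build the score table and return the top organism with the table."""
    table = {org: final_score(alleles) for org, alleles in candidates.items()}
    return max(table, key=table.get), table


# Hypothetical candidates: each entry is (x, theta) for one matched allele.
candidates = {
    "orgA": [(17, 0.1), (30, 0.1)],
    "orgB": [(17, 0.01)],
}
best, table = best_organism(candidates)
print(best, table)
```

Under this reading, an organism with more matched key alleles and higher marker weights accumulates a larger final score, which matches the intent stated in the description even if the actual formulas differ.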
S104: determining a sequence type in a sequence type database using key alleles of the organism; searching a serotype database using the sequence type, and determining the serotype of the sequencing data of the microbial genome according to the mapping relation between the sequence type and the serotype.
In a specific implementation, after the most likely organism (the one with the highest final score) has been determined, the present example obtains the sequence type from the sequence type database using the key alleles of that organism, searches the serotype database using the sequence type, and obtains the possible serotypes according to the mapping relationship between sequence types and serotypes. Note that for serotype identification of Salmonella, this example can further enhance the identification capability of the present technique by utilizing the antigen-related gene sequences used in SeqSero2. At this point, the identification of the microbial serotype is complete.
For researchers and clinicians without specialized bioinformatics knowledge, this embodiment automates bioinformatics analysis and identification, including obtaining microbial genome sequencing data, aligning the data, scoring the candidate organisms, and identifying the sequence type and serotype across multiple microbial species. By constructing the key allele-sequence type-serotype association database, it also supports customized bioinformatics analysis of short-read and long-read sequencing data generated by different platforms, yielding accurate analysis results.
Example two
FIG. 4 is a block diagram of an automated serotype analysis and identification system based on sequencing data according to an embodiment of the present application. As shown in FIG. 4, the automated serotype analysis and identification system 400 based on sequencing data includes:
an obtaining module 410 for obtaining microbial genome sequencing data;
a comparison module 420, configured to compare the microbial genome sequencing data with each key allele in a key allele database, and record a key allele with a similarity greater than a preset threshold and a corresponding comparison score;
a determining module 430 for determining an organism to which the microbial genome sequencing data belongs based on the key alleles and corresponding alignment scores;
an identification module 440 for determining a sequence type in a sequence type database using key alleles of the organism; searching a serotype database using the sequence types, and identifying the serotypes of the sequencing data of the microbial genome according to the mapping relationship between the sequence types and the serotypes.
Example three
Referring to fig. 5, fig. 5 is a schematic diagram of a computer device according to an embodiment of the present disclosure. As shown in fig. 5, the computer device 500 includes a processor 510, a memory 520, and a bus 530.
The memory 520 stores machine-readable instructions executable by the processor 510. When the computer device 500 runs, the processor 510 and the memory 520 communicate through the bus 530, and when the machine-readable instructions are executed by the processor 510, the steps of the automated serotype analysis and identification method based on sequencing data in the method embodiment shown in FIG. 1 and FIG. 2 may be performed.
Example four
Based on the same inventive concept, the embodiments of the present application further provide a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program performs the steps of the automated serotype analysis and identification method based on sequencing data described in the above method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program, which may be stored in a computer readable storage medium and executed by a computer to implement the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made to the present application by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. An automated serotype analysis and identification method based on sequencing data, comprising:
obtaining microbial genome sequencing data;
comparing the microbial genome sequencing data with each key allele in a key allele database, and recording the key alleles with similarity greater than a preset threshold and corresponding comparison scores;
determining an organism to which the microbial genome sequencing data belongs based on the key alleles and corresponding alignment scores;
determining a sequence type in a sequence type database using the key alleles of the organism; searching a serotype database using the sequence type, and determining the serotype of the microbial genome sequencing data according to the mapping relationship between the sequence type and the serotype.
2. The automated serotype analysis and identification method of claim 1 further comprising, prior to obtaining the microbial genome sequencing data: constructing a key allele-sequence type-serotype association database.
3. The automated serotype analysis and identification method of claim 2 wherein said association database is constructed by a process comprising: collecting relevant key allele, sequence type and serotype information; mining the association between the key alleles and the sequence types, the association between the sequence types and the serotypes, and the information of the key alleles, the sequence types and the serotypes; and constructing the key allele-sequence type-serotype association database according to the associations and the information.
4. The automated serotype analysis and identification method of claim 3 wherein the key allele-sequence type-serotype association database comprises a key allele database, a sequence type database and a serotype database, the databases being associated with one another through indexes; the sequence type database records the mapping of different combinations of key alleles to the sequence types of each organism; the serotype database records the association between sequence types and serotypes, and is used for serotype identification of various microorganisms.
5. The automated serotype analysis and identification method of claim 4 wherein the frequency of each serotype of an organism sequence type in the key allele-sequence type-serotype association database is calculated according to the law of large numbers, and the probability that the serotype of the organism is a known serotype is determined from said frequency; and the association between the sequence types and the serotypes is determined according to the probability.
6. The automated serotype analysis and identification method of claim 1 wherein a sigmoid scoring strategy is used to evaluate the organism to which the microbial genome sequencing data belongs.
7. The automated serotype analysis and identification method of claim 6 wherein the organism to which the microbial genome sequencing data belongs is evaluated by:
Figure FDA0003650045310000021
Figure FDA0003650045310000022
wherein x represents the number of different alleles in an allele locus of the organism, θ represents a weight associated with the marker, and s represents a score associated with one allele of the organism; allel represents the key allele of the organism, allels represents all the key alleles of the organism, f represents the final score of the organism; and determining the organism to which the microbial genome sequencing data belongs according to the final score.
8. An automated serotype analysis and identification system based on sequencing data comprising:
an acquisition module for obtaining microbial genome sequencing data;
an alignment module for aligning the microbial genome sequencing data against each key allele in a key allele database, and recording the key alleles whose similarity is greater than a preset threshold together with their corresponding alignment scores;
a determining module for determining an organism to which the microbial genome sequencing data belongs based on the key alleles and corresponding alignment scores;
an identification module for determining a sequence type in a sequence type database using the key alleles of the organism; searching a serotype database using the sequence type, and identifying the serotype of the microbial genome sequencing data according to the mapping relationship between the sequence type and the serotype.
9. A computer device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the computer device runs, and the machine-readable instructions, when executed by the processor, performing the steps of the automated serotype analysis and identification method based on sequencing data of any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the automated serotype analysis and identification method based on sequencing data according to any one of claims 1 to 7.
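
The flow recited in claims 1 and 4 to 7 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. All names (pick_organism, lookup_serotype, the org_A/geneX placeholders, the dictionary layouts) are hypothetical. The formula images of claim 7 are not reproduced in this text, so the per-allele score used here, alignment score multiplied by sigmoid(theta * x), is an assumed stand-in rather than the claimed formula; only the roles of x, theta, s and f follow the claim wording. The serotype frequency is computed as a simple observed proportion, which is one reading of claim 5.

```python
import math
from collections import Counter, defaultdict


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def pick_organism(hits, alleles_per_locus, theta=0.01, threshold=0.9):
    """Return (organism, {locus: allele}) with the highest final score f.

    hits: alignment results as dicts with keys
          organism, locus, allele, similarity, score (layout is hypothetical).
    alleles_per_locus: {(organism, locus): x}, the number of different
          alleles known for that locus.
    """
    # Keep only hits above the preset similarity threshold,
    # and only the best-scoring allele per (organism, locus).
    best_hit = {}
    for h in hits:
        if h["similarity"] <= threshold:
            continue
        key = (h["organism"], h["locus"])
        if key not in best_hit or h["score"] > best_hit[key]["score"]:
            best_hit[key] = h

    totals = defaultdict(float)   # f per organism
    profiles = defaultdict(dict)  # key-allele profile per organism
    for (organism, locus), h in best_hit.items():
        x = alleles_per_locus[(organism, locus)]
        s = h["score"] * sigmoid(theta * x)  # ASSUMED form of s
        totals[organism] += s                # f = sum of s over all key alleles
        profiles[organism][locus] = h["allele"]

    if not totals:
        return None, {}
    winner = max(totals, key=totals.get)
    return winner, profiles[winner]


def lookup_serotype(organism, profile, st_db, serotype_counts):
    """Map a key-allele profile to a sequence type, then to a serotype.

    st_db: {(organism, sorted (locus, allele) tuple): sequence type}
    serotype_counts: {(organism, sequence type): Counter of observed serotypes}
    Returns (sequence_type, serotype, observed_frequency).
    """
    st = st_db.get((organism, tuple(sorted(profile.items()))))
    if st is None:
        return None, None, 0.0
    counts = serotype_counts.get((organism, st))
    if not counts:
        return st, None, 0.0
    serotype, n = counts.most_common(1)[0]
    return st, serotype, n / sum(counts.values())


# Toy data with placeholder names only.
hits = [
    {"organism": "org_A", "locus": "geneX", "allele": 1, "similarity": 0.99, "score": 200.0},
    {"organism": "org_A", "locus": "geneY", "allele": 3, "similarity": 0.97, "score": 180.0},
    {"organism": "org_B", "locus": "geneX", "allele": 7, "similarity": 0.92, "score": 90.0},
    {"organism": "org_B", "locus": "geneY", "allele": 2, "similarity": 0.80, "score": 60.0},
]
alleles_per_locus = {
    ("org_A", "geneX"): 50, ("org_A", "geneY"): 40,
    ("org_B", "geneX"): 30, ("org_B", "geneY"): 20,
}
st_db = {("org_A", (("geneX", 1), ("geneY", 3))): "ST-1"}
serotype_counts = {("org_A", "ST-1"): Counter({"sero_alpha": 9, "sero_beta": 1})}

organism, profile = pick_organism(hits, alleles_per_locus)
result = lookup_serotype(organism, profile, st_db, serotype_counts)
print(organism, profile, result)
```

Keeping the sequence type table and the serotype table as separate index-linked mappings mirrors the three-database layout of claim 4, so either table could be refreshed without rebuilding the other.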
CN202210540274.4A 2022-05-18 2022-05-18 Automatic serotype analysis and identification method and system based on sequencing data Active CN114944197B (en)

Publications (2)

Publication Number Publication Date
CN114944197A (en) 2022-08-26
CN114944197B (en) 2024-06-25

Family

ID=82906907

Country Status (1)

Country Link
CN (1) CN114944197B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant