WO2008000186A1 - A method for identifying novel gene and the resulting novel genes - Google Patents

A method for identifying novel gene and the resulting novel genes

Info

Publication number
WO2008000186A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
software
protein
sequences
database
Prior art date
Application number
PCT/CN2007/070153
Other languages
French (fr)
Chinese (zh)
Other versions
WO2008000186A8 (en)
Inventor
Zailin Yu
Zhihua Zheng
Y. Tom Tang
Genny Yan Yu Fu
Original Assignee
Beijing Bioway-Fortune Research Center For Gene Drugs Ltd.
Tianjin Sinobiotech Ltd.
Fortunerock, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Bioway-Fortune Research Center For Gene Drugs Ltd., Tianjin Sinobiotech Ltd., Fortunerock, Inc. filed Critical Beijing Bioway-Fortune Research Center For Gene Drugs Ltd.
Priority to CNA2007800202904A priority Critical patent/CN101460625A/en
Publication of WO2008000186A1 publication Critical patent/WO2008000186A1/en
Publication of WO2008000186A8 publication Critical patent/WO2008000186A8/en

Classifications

    • C CHEMISTRY; METALLURGY
    • C07 ORGANIC CHEMISTRY
    • C07K PEPTIDES
    • C07K14/00 Peptides having more than 20 amino acids; Gastrins; Somatostatins; Melanotropins; Derivatives thereof
    • C07K14/435 Peptides having more than 20 amino acids; Gastrins; Somatostatins; Melanotropins; Derivatives thereof from animals; from humans
    • C07K14/775 Apolipopeptides
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61P SPECIFIC THERAPEUTIC ACTIVITY OF CHEMICAL COMPOUNDS OR MEDICINAL PREPARATIONS
    • A61P9/00 Drugs for disorders of the cardiovascular system
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/92 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving lipids, e.g. cholesterol, lipoproteins, or their receptors
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61K PREPARATIONS FOR MEDICAL, DENTAL OR TOILETRY PURPOSES
    • A61K38/00 Medicinal preparations containing peptides
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/156 Polymorphic or mutational markers

Definitions

  • the present invention relates to the creation of new biological computer analysis methods and pathways for obtaining new functional genes.
  • the results demonstrate that this analytical method allows for the acquisition of new gene sequences that are consistent with the published human genome chromosomal DNA sequences.
  • This method can be used to analyze and acquire new genes that have biological functions and are related to the diagnosis and treatment of human health and disease, especially genes that are genetic drugs or drug targets.
  • Gene drugs are bioactive products based on functional genes, or the products of genes, found in genomics research. They are manufactured using technologies such as biology, molecular biology, biochemistry and bioengineering, with the quality of intermediate and finished products controlled by corresponding analytical techniques, and are clinically useful for the treatment, prevention and diagnosis of certain diseases.
  • Recombinant protein drugs, vaccines, DNA drugs, RNA drugs, and gene therapy drugs are all genetic drugs.
  • A gene drug target is a functional gene or gene product (functional protein) found in genomics research that serves as the starting point for obtaining an antagonist or inhibitor by biology, chemistry, physics, molecular biology, biochemistry, bioengineering and other related technologies.
  • For example, a specific antibody may be obtained that abolishes the biological activity of the functional protein through antigen-antibody binding, or a small-molecule compound that inhibits the biological activity of the gene product may be found by screening; the antibody or small-molecule compound then serves as a drug for the treatment and diagnosis of human diseases.
  • The "traditional" steps for discovering gene drugs and drug target genes start from the symptoms of a disease and look for differences in physiological and biochemical indicators between healthy people and patients. For example, human growth hormone was identified because patients were markedly shorter than healthy people and, through various analyses, were found to have an endogenous deficiency in human growth hormone secretion.
  • In the early stage, human growth hormone was extracted from human urine and then injected into the patient.
  • The isolated and purified natural protein is sequenced, the DNA sequence is deduced from the protein sequence, oligonucleotides are synthesized and used as DNA probes, and the gene fragment is "displayed" to obtain the complete sequence.
  • a foreign system such as E. coli
  • Recombinant proteins are prepared and purified, and genetically engineered drugs are established through preclinical testing (animal testing) and clinical trials. This process can be referred to as the "traditional" or "classical" gene drug discovery program.
  • the present invention provides a method of discovering a novel gene, the method comprising the steps of:
  • step 2) performing secretory signal peptide analysis on the protein sequence obtained in step 1), respectively obtaining a protein sequence containing a secreted signal peptide and a protein sequence not containing a secreted signal peptide;
  • step 3) performing a transmembrane region analysis on the protein sequence obtained in step 1), respectively obtaining a protein sequence containing a transmembrane region and a protein sequence containing no transmembrane region;
  • step 4) combining the results obtained in steps 2) and 3), roughly dividing the sequences into three categories: sequences containing a secreted signal peptide and no transmembrane region; sequences containing a secreted signal peptide and a transmembrane region; and sequences containing no secreted signal peptide and having 5 to 8 transmembrane regions;
  • the matching condition is: the sequence similarity is 15% to 95%, preferably 20% to 90%, more preferably more than 25% and less than 90%, and the mutation points are required to be distributed as uniformly as possible throughout the matched sequence;
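The categorization and matching condition in the steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the `Prediction` record, the category labels and the function names are hypothetical, the predictor outputs are assumed to have been parsed already, and the third category is read here as sequences without a secreted signal peptide that have 5 to 8 transmembrane regions. The preference for uniformly distributed mutation points is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Hypothetical per-protein summary of upstream predictor output."""
    seq_id: str
    has_signal_peptide: bool  # result of a secretory signal peptide analysis
    tm_regions: int           # number of predicted transmembrane regions


def classify(p: Prediction) -> str:
    """Assign one of the three categories described in the text, else 'other'."""
    if p.has_signal_peptide and p.tm_regions == 0:
        return "signal_peptide_no_tm"
    if p.has_signal_peptide and p.tm_regions > 0:
        return "signal_peptide_with_tm"
    # Assumed reading of the third category: no signal peptide, 5 to 8 TM regions.
    if not p.has_signal_peptide and 5 <= p.tm_regions <= 8:
        return "no_signal_peptide_5_to_8_tm"
    return "other"


def passes_similarity(percent_similarity: float,
                      low: float = 15.0, high: float = 95.0) -> bool:
    """Keep a match whose similarity lies inside the stated window (inclusive)."""
    return low <= percent_similarity <= high
```

The narrower preferred windows (20% to 90%, or above 25% and below 90%) can be tried by passing different `low` and `high` values.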
  • the method of discovering novel genes of the present invention is accomplished by a computer system platform, the computer system platform comprising:
  • the following software for sequence editing: software for converting sequences in fasta format to sequences in tabular format; software for converting sequences in GenBank format to sequences in fasta format; a reverse complement program for DNA sequences; a translation program for DNA sequences; software for obtaining the CDS sequences in a GenBank-format sequence file; software that merges two simple sequence fragments and filters out the duplicated parts between them;
  • the following software for database operations: software that deletes any sequence in the database; software that inserts sequences into a sequence database; software that acquires certain sequences from large databases, in batches or individually; software for obtaining DNA and protein sequences from temporary, unindexed databases; software for directly acquiring sequence data from GenBank via the network; software for directly acquiring sequence data from a local database over the local network; a program for indexing databases in Fasta format; a program for indexing databases in GenBank format; software for acquiring fragments of genomic sequences; software for conveniently acquiring a fragment of a sequence;
  • the computer system platform is preferably based on the Linux operating system.
  • the bioinformatics processing program and the established system platform technology of the present invention can be used to discover new genes and analyze their products, so that humans can more clearly understand the relationship between gene expression and diseases, and improve the level of disease treatment.
  • The present invention employs a "reverse" procedure, opposite to the "traditional" process described above, to perform functional genomics studies of gene drugs, with the aim of greatly accelerating the screening of novel gene drugs.
  • Compared with the conventional "traditional" gene drug search, the method is simpler, requires less computer equipment, is easier to operate and master, and can shorten the time needed to obtain results by several years.
  • the invention firstly compiles a novel computer program software processing system to specifically screen genetic drugs and drug target genes.
  • The self-written programs take published human genomic DNA sequences as input and run as a series of programs on a Linux system platform to predict new protein (gene) coding sequences (ORFs).
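As an illustration of what predicting coding sequences (ORFs) from genomic DNA can involve, the following Python sketch scans all six reading frames for stretches that run from an ATG to the next in-frame stop codon. It is a simplified assumption, not the program described in the patent: real gene prediction on human genomic DNA must also handle introns and splicing, which this sketch ignores, and the function names and the `min_codons` threshold are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna: str, min_codons: int = 30):
    """Return (strand, frame, start, end, sequence) for ATG...stop ORFs.

    Coordinates are 0-based, end-exclusive, and for the '-' strand they
    refer to the reverse-complemented sequence. min_codons counts the
    codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, seq[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs
```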
  • This software system combines information on disease types, disease occurrence and formation mechanisms, and genetic information with bioinformatics predictions of secreted peptides, signal peptides and transmembrane regions. Various existing functional genomics and computational tools are integrated and augmented with the new self-written software to achieve predictive screening and splicing of novel genes.
  • The possible ORF sequences predicted by the computer are then subjected to functional genomics studies, using high-throughput screening methods to screen for gene drugs at the cellular and animal levels.
  • the splicing, cloning and amplification of genes are accomplished using molecular biology techniques.
  • high-throughput screening methods such as quantitative PCR and gene chip technology
  • the computer bioinformatics discovery and analysis technology platform of the present invention can be used for new gene discovery in humans, and can also be used, but not limited to, for genetic discovery and analysis purposes of animals, plants, and microorganisms.
  • The bioinformatics analysis programs and the platform technology of the present invention combine existing published human genome research materials and information with programs designed and operated by the present invention to analyze large amounts of data and libraries and obtain newly predicted genes from them.
  • The aim is to overcome the technical and time limitations of obtaining new genes by traditional techniques.
  • The programs of the present invention have the following advantages: 1) rapid analysis and acquisition of possible new genes; 2) simple and efficient operating procedures; 3) the acquired new genes have biological functions and potential for clinical application as gene drugs and gene drug targets.
  • LDL low density lipoprotein
  • HDL high density lipoprotein
  • The protein component of lipoproteins is called apolipoprotein.
  • Lipoprotein combines with cholesterol to form lipoprotein cholesterol, which transports cholesterol into and out of the cell.
  • the clinical significance of the reduction of high-density lipoprotein cholesterol may indicate a predisposition to coronary heart disease.
  • the clinical significance of low-density lipoprotein cholesterol increase may indicate coronary heart disease and cerebrovascular disease caused by atherosclerosis.
  • the key step in the reverse transport of cholesterol is to transfer cholesterol from the cell to the extracellular lipoprotein.
  • An important component of various lipoproteins is apolipoprotein.
  • Apolipoproteins are responsible for transporting different lipoproteins to various parts of the body.
  • Apolipoprotein is a protein located on the surface of lipoproteins, which are composed of amino acids in a certain order. They are present in various types of lipoproteins in a variety of forms and in different ratios.
  • Various lipoproteins also have different functions and different metabolic pathways due to the different types of apolipoproteins they contain.
  • AI-BP apoA-I binding protein
  • The gene encoding AI-BP, APOA1BP, is located on chromosome 1q21 and consists of 6 exons and 5 introns (2.5 kb). Northern blot analysis confirmed that APOA1BP mRNA is ubiquitously expressed in kidney, heart, liver, and thyroid gland, with high expression in the adrenal gland and testis.
  • The AI-BP protein was not found in normal human serum, but high levels of AI-BP were present in serum samples from patients with septic syndrome. In healthy humans, significant amounts of AI-BP protein are present in cerebrospinal fluid and urine.
  • The present invention also discloses for the first time two novel genes, similar to apolipoprotein-related proteins, obtained by the procedures and methods of the present invention and located on human chromosome 19. These two genes differ from the known apolipoprotein-interacting protein gene in that they: 1) are located on a different chromosome; 2) have no secreted peptide; 3) encode proteins whose amino acid sequence has only 40.0% homology with that of the known gene.
  • The present invention describes the acquisition and local installation of various known or published bioinformatics resources and libraries. The libraries and materials obtained include, but are not limited to, the latest databases required for bioinformatics analysis, downloaded from the NCBI remote databases. These include the human expressed sequence tag (EST) database, the non-redundant protein sequence database, the nucleotide database, the patented protein sequence database, the human chromosomal sequence database, etc. All of the downloaded databases are formatted on the local computer and converted into sequence-format databases recognized by the local programs.
  • All of the databases and libraries used in the present invention come from publicly available data. They are validated and digitally processed on local computers so that they can be readily retrieved locally and combined with the programs of the present invention.
  • The bioinformatics analysis programs used in the present invention mainly include, but are not limited to, the following.
  • The software used for sequence alignment is: blastall: the NCBI (National Center for Biotechnology Information) blast package, which performs rough alignment of gene sequences; WU-blast: the blast package of Washington University, which performs well in retrieving and analyzing new genes; Fasta: EMBL (European Molecular Biology Laboratory) sequence alignment software package; clustalw: multiple sequence alignment analysis software; sim4: software for aligning expressed sequences with chromosomal genomic sequences;
  • the software used for database editing is:
  • Pressdb: nucleotide sequence database formatting software specific to the WU-blast program; im-index: mainly used to index sequence databases so that large databases can be operated on; setdb: protein sequence database formatting software specific to the WU-blast program;
  • The software used for sequence splicing is:
  • Cap4/Phrap: sequence splicing software from the University of Washington Genomics Research Center; merger: simple sequence splicing software;
  • Tmpred: predicts transmembrane regions of protein sequences
  • Signalp: predicts signal peptides of protein sequences
  • remap: sequence cleavage site analysis software
  • restrict: sequence cleavage information statistics software
  • showorf: DNA sequence translation software
  • pepinfo: graphical display of the content of the various amino acids in a protein sequence
  • pepstats: statistics on the content of the various amino acids in a protein sequence; also yields molecular weight, isoelectric point, charge and light absorption at 280 nm
  • pepwheel: graphical display of the helical wheel of the amino acid residues in a protein sequence
  • Proparam: mainly used to comprehensively determine the hydrophilicity/hydrophobicity of the protein
  • Tmap: graphical display of the transmembrane regions of the protein
  • ps_scan: protein active site/domain analysis software.
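Two of the simpler analyses in the list above, amino acid composition and overall hydrophilicity/hydrophobicity, can be sketched in a few lines of Python. The sketch uses the published Kyte-Doolittle hydropathy scale and reports the mean value per residue (often called GRAVY); whether the listed tools use exactly this scale and averaging is an assumption, and the function names are hypothetical.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(protein: str) -> dict:
    """Count how often each residue letter occurs in the sequence."""
    return dict(Counter(protein.upper()))


def gravy(protein: str) -> float:
    """Mean Kyte-Doolittle hydropathy; positive values are more hydrophobic.

    Letters outside the 20 standard residues are skipped.
    """
    values = [KYTE_DOOLITTLE[r] for r in protein.upper() if r in KYTE_DOOLITTLE]
    if not values:
        raise ValueError("sequence contains no standard amino acids")
    return sum(values) / len(values)
```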
  • the invention provides an independent bio-computer program for the prediction of new genes and analysis of the results of the examples.
  • the present invention also includes all programs that are run on a local computer and that form the new gene discovery and analysis technology system platform of the present invention.
  • sequence format conversion software that converts sequences in fasta format into table-format sequences; gb2fasta: sequence format conversion software, converting sequences in GenBank format to sequences in fasta format; tt-comp-dna: sequence editing software, DNA sequence reverse complement program; translate: sequence editing software, DNA sequence translation program; gb2cds: sequence editing software, obtains the CDS sequences from a sequence file in GenBank format; tt-zip-2: sequence editing software, mainly used to merge two simple sequence fragments and filter out the repeated parts between them;
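To make the sequence editing utilities above concrete, here is a small Python sketch of two of the operations: converting FASTA text to a tab-separated table (one record per line) and translating a DNA sequence with the standard genetic code. It is an illustrative assumption about how such tools might behave, not the source of the programs named in the text; the function names and the choice of "identifier, tab, sequence" as the table layout are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def read_fasta(text: str):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def fasta_to_table(text: str) -> str:
    """Emit one 'identifier<TAB>sequence' line per FASTA record."""
    rows = []
    for header, seq in read_fasta(text):
        fields = header.split()
        rows.append(f"{fields[0] if fields else ''}\t{seq}")
    return "\n".join(rows)


def translate(dna: str) -> str:
    """Translate frame 1 of a DNA string; '*' is stop, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))
```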
  • the software used for database operations is:
  • Im-del: database editing software that can delete any sequence in the database; im-insert: database editing software that adds sequences to a sequence database; im-retrieve: database editing software that obtains certain sequences from a large database, in batches or individually; tt_get: software for obtaining DNA and protein sequences from temporary, unindexed databases; rfetch: database operation software that directly obtains sequence data from GenBank via the network; lfetch: database operation software that directly obtains sequence data from a local database over the local network; biofaseqindex: database editing software, a program for indexing databases in Fasta format; biogbseqindex: database editing software, a program for indexing databases in GenBank format; tt_subseq_genome: software for acquiring fragments of genomic sequences; tt-sub-seq: sequence editing software that facilitates acquiring a fragment of a sequence;
  • The software used for plotting alignment results is: drawBlast: a blast result plotting program that gives a rough graphical comparison of blast results;
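Indexing a FASTA database and then retrieving a whole sequence or a fragment of it, as the indexing and retrieval tools above are described as doing, can be sketched as follows. The approach shown (a dictionary from sequence identifier to the byte offset of its header line) is one common technique and is assumed here for illustration; the actual index format of the named programs is not given in the text, and `build_index` and `fetch` are hypothetical names.

```python
import io


def build_index(handle) -> dict:
    """Map each sequence identifier to the byte offset of its '>' header line.

    `handle` must be a seekable file object opened in binary mode.
    """
    index = {}
    handle.seek(0)
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            fields = line[1:].split()
            if fields:
                index[fields[0].decode()] = offset
    return index


def fetch(handle, index: dict, seq_id: str, start=None, end=None) -> str:
    """Return a sequence, or its 1-based inclusive fragment start..end."""
    handle.seek(index[seq_id])
    handle.readline()  # skip the header line
    chunks = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break  # next record begins
        chunks.append(line.strip())
    seq = b"".join(chunks).decode()
    if start is None and end is None:
        return seq
    return seq[(start or 1) - 1:end]


# Usage with an in-memory stand-in for a database file.
demo = io.BytesIO(b">chr19 demo\nACGTACGT\nTTGG\n>p1\nMKF\n")
demo_index = build_index(demo)
fragment = fetch(demo, demo_index, "chr19", 3, 6)
```

Only the index has to be held in memory, so a lookup does not require rereading the whole database file.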
  • the software used for data analysis is:
  • Tt_tmpred_p: data parser software dedicated to parsing the analysis result data generated by tt-tmpred; parser-bx: parser software for the output of programs such as blastn, blastp and blastx; parser-fasta: parser software for the output of the fasty comparison program; ps_signalp: data parser software for the result data generated by the pepsigp program; tt_pblast: blastn result parsing software, for automatic machine analysis of large amounts of result output;
  • The software used to assist the running of other programs is: tt-cycle: auxiliary software, mainly used to automate programs that cannot otherwise be run automatically;
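As an example of the kind of machine parsing the result parsers above perform, the following Python sketch reads BLAST hits in the 12-column tab-separated layout (query, subject, percent identity, alignment length, mismatches, gap openings, query start and end, subject start and end, e-value, bit score) produced by the legacy blastall `-m 8` option and by BLAST+ `-outfmt 6`. That the patent's parsers consume this particular layout is an assumption, and the `Hit` record and function names are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Subset of the 12 tabular BLAST columns kept for screening."""
    query: str
    subject: str
    pident: float
    length: int
    evalue: float
    bitscore: float


def parse_tabular(text: str) -> list:
    """Parse tab-separated BLAST output, skipping blank and comment lines."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 12:
            continue  # not a complete tabular record
        hits.append(Hit(cols[0], cols[1], float(cols[2]), int(cols[3]),
                        float(cols[10]), float(cols[11])))
    return hits


def best_hit_per_query(hits) -> dict:
    """Keep the highest bit score hit for each query identifier."""
    best = {}
    for hit in hits:
        if hit.query not in best or hit.bitscore > best[hit.query].bitscore:
            best[hit.query] = hit
    return best
```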
  • the re-optimized software is:
  • Ed_cap4: recompiled Cap4 program that automatically completes the configuration of the cap4 runtime environment; extractcontigs: converts the score matrix data output by cap4 into a file in fasta format; pepsigp: recompiled Signalp software, in which the original program, which predicts the signal peptide of a single sequence only, is improved to allow automated batch prediction; primers-for-fulllength-clone: batch primer design software; tt_fasty_l: improved fasty program, mainly intended to make operation more convenient; tt-tmpred: recompiled protein sequence transmembrane region prediction program, improved to allow batch analysis.
  • The computer system platform of the present invention may also include other authoritative biological information data analysis software packages, such as Emboss (a biological sequence analysis software package, available at http://emboss.sourceforge.net).
  • Source code can be purchased at http://www.ch.embnet.org/ to obtain the Linux version of the source code
  • Signalp: secretory signal peptide analysis, available at http://www.cbs.dtu.
  • Predict Protein: basic protein information analysis and prediction, available for free download at http://www.predictprotein.org/
  • Clustalw: multiple sequence alignment analysis; the free Linux software can be downloaded at http://www.ebi.ac.uk/
  • Primer: primer design analysis, available at http://primer3.sourceforge.net/.
  • The preferred operating environment of the software of the present invention is a Linux system. To screen and filter large amounts of data, it is necessary to obtain the source code of the above software and to recompile its core functional parts into software suited to running on the platform.
  • nucleotide sequence libraries are downloaded from the NCBI remote database.
  • the process and method of the database are downloaded from the NCBI remote database.
  • The use of various bioinformatics analysis software and systems is enumerated; in particular, the computer analysis system platform of the present invention is built so that independent software analysis systems can work together to discover and analyze new genes.
  • a comprehensive process framework diagram of a computer operational analysis program for discovering new genes is presented.
  • The computer and software engineering completed by the present invention is an independent and complete biological information processing system platform, which can be copied and ported, and can be used for, but is not limited to, discovery and functional analysis of new genes, demonstration, teaching, business purposes, clinical treatment and medical diagnostic applications.
  • This information processing platform in the present invention is actually applied to the discovery and analysis of new protein sequences (see Example 3 for specific operations) to obtain 35 possible new protein sequences.
  • Two of them are similar to the known apolipoprotein A1BP gene and are disclosed as an example.
  • The two new genes, BFC06016 and BFC06104, have the nucleotide sequences shown in Seq ID No. 1 and Seq ID No. 3, respectively; their GenBank accession numbers are DQ778079 and DQ778080, respectively.
  • the amino acids encoded by the nucleotide sequence are the sequences shown by Seq ID No. 2 and Seq ID No. 4, respectively.
  • genes can be obtained by whole DNA sequence synthesis methods and used in biological and clinical application research and product development applications.
  • a method and technique for synthesizing a whole gene DNA sequence are exemplified in detail.
  • The main method is to use PCR to assemble synthetic DNA fragments step by step into the full gene sequence; the synthesis was confirmed by DNA sequencing.
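One way to prepare a sequence for PCR-based assembly from synthetic fragments is to tile it with oligonucleotides that overlap their neighbours, so that adjacent fragments can anneal and be extended. The Python sketch below only computes such a tiling; the oligo length and overlap are hypothetical parameters, the alternation of strands used in practice is omitted, and nothing here is taken from the synthesis protocol of the patent.

```python
def design_overlapping_oligos(seq: str, oligo_len: int = 60, overlap: int = 20):
    """Tile `seq` with fragments of `oligo_len` that share `overlap` bases.

    Returns (start, end, fragment) tuples with 0-based, end-exclusive
    coordinates. The last fragment may be shorter than `oligo_len`.
    """
    if not 0 < overlap < oligo_len:
        raise ValueError("overlap must be positive and smaller than oligo_len")
    step = oligo_len - overlap
    oligos = []
    start = 0
    while True:
        end = min(start + oligo_len, len(seq))
        oligos.append((start, end, seq[start:end]))
        if end == len(seq):
            break
        start += step
    return oligos


def reassemble(oligos, overlap: int) -> str:
    """Rebuild the full sequence by dropping each fragment's leading overlap."""
    first, rest = oligos[0], oligos[1:]
    return first[2] + "".join(fragment[overlap:] for _, _, fragment in rest)
```

Checking that `reassemble` returns the input sequence is a quick sanity test of a design before ordering oligonucleotides.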
  • The two new human genes were found by quantitative PCR to be expressed, with expression profiles that vary across human tissues and organs. These preliminary experimental results suggest that the two genes are biologically active.
  • The DNA and protein amino acid sequences of the present invention are applicable to new drug development and clinical diagnosis. Preferably, these genes are drug or drug target genes for diagnostic and therapeutic purposes related to cardiovascular diseases, more preferably gene drugs or gene therapy drug targets.
  • nucleotides of the novel gene of the present invention can be introduced into a host cell by recombinant cloning techniques to allow expression of the encoded protein.
  • The host cell is genetically engineered (transduced, transformed or transfected) with a vector containing the novel gene of the present invention; the vector may be in the form of a plasmid, a virus, a phage or the like.
  • The engineered host cells can be cultured in a medium containing conventional nutrients, modified as appropriate for activating the promoter, selecting transformants or amplifying the nucleotide sequences encoding these genes.
  • Culture conditions such as temperature and pH are controlled in a manner appropriate to the host cell selected for expression.
  • The recombinant vector carries a nucleotide sequence encoding the protein of a novel gene.
  • the recombinant vector may be an expression vector which expresses the fusion protein in a host cell by a nucleotide sequence encoding.
  • the form can be, but is not limited to, fusion or separate insertion.
  • Host organisms and somatic cells include, but are not limited to, vertebrates (such as humans, monkeys, rats, rabbits, etc.) fish, chickens, insects, plants, yeast, fungi, bacteria, and the like.
  • Nucleotides encoding the present invention can be expressed as proteins under the action of a suitable promoter.
  • Suitable promoters include, but are not limited to: adenovirus promoters, such as the adenovirus major late promoter; heterologous promoters, such as the CMV promoter and the RSV promoter; inducible promoters, such as the MMT promoter; heat shock promoters; the albumin promoter; the ApoAI promoter; the human globulin promoter; viral thymidine kinase promoters, such as the herpesvirus thymidine kinase promoter; retroviral LTR promoters, including modified LTR promoters; the beta-actin promoter; and the human growth hormone promoter.
  • a native promoter can also be used to control the expression of a protein encoded by a nucleotide.
  • recombinant cells have the ability to express a nucleic acid sequence encoding a protein as described herein.
  • the recombinant engineered cells can express the novel protein of the present invention continuously or in the presence or absence of an inducing agent.
  • Recombinant engineered cell forms include, but are not limited to, cells of vertebrates (e.g., humans, monkeys, mice, rabbits, fish, chickens, etc.), insects, plants, yeast, fungi, and bacteria.
  • Antibodies to novel proteins obtained according to the present invention and known techniques include, but are not limited to, polyclonal, monoclonal or humanized antibodies and the practical application of such antibodies.
  • the specific antibody is preferably produced by immunizing an animal.
  • the specific antibodies have important applications in clinical diagnosis, treatment, and as biological agents.
  • The present invention also provides a nucleic acid whose nucleotide sequence has at least 95%, preferably at least 96%, more preferably at least 97%, further preferably at least 98%, and still further preferably at least 99% sequence homology with Seq ID No. 1 or Seq ID No. 3, wherein the protein encoded by the nucleic acid has the same function as the protein encoded by Seq ID No. 1 or Seq ID No. 3, respectively.
  • The nucleic acid of the present invention also encodes a protein derived from the protein represented by Seq ID No. 2 or Seq ID No. 4 by substitution, deletion or addition of one or several amino acids in the amino acid sequence defined by Seq ID No. 2 or Seq ID No. 4, and having the same function as the protein represented by Seq ID No. 2 or Seq ID No. 4.
  • The present invention also provides a protein whose amino acid sequence has at least 90%, preferably at least 92%, more preferably at least 95%, further preferably at least 97%, and still more preferably at least 99% sequence homology with Seq ID No. 2 or Seq ID No. 4, wherein the protein has the same function as the protein shown by Seq ID No. 2 or Seq ID No. 4.
  • The protein of the present invention also includes a protein derived from the protein represented by Seq ID No. 2 or Seq ID No. 4 by substitution, deletion or addition of one or several amino acids in the amino acid sequence defined by Seq ID No. 2 or Seq ID No. 4, and having the same function as the protein represented by Seq ID No. 2 or Seq ID No. 4.

DRAWINGS

  • Figure 1 is a diagram of a computer network hardware connection framework for new gene discovery.
  • Figures 2A, 2B and 2C are, respectively, a flow chart of the method for novel gene discovery, a flow chart of protein sequence clustering, and a comprehensive process framework diagram of the computer operation analysis program.
  • Figure 3 shows the DNA nucleotide sequence (A1) and the corresponding amino acid sequence (A2) of BFC06016 (A), and the DNA nucleotide sequence (B1) and the corresponding amino acid sequence (B2) of BFC06104 (B), the two newly discovered apolipoprotein A1BP-like genes.
  • Figure 4 shows the protein hydrophobicity/hydrophilicity of BFC06016 (A) and BFC06104 (B) predicted by computer using ProParam software.
  • Figures 5A and 5B show the results of protein transmembrane region analysis using the Tmpred/tmap analysis software, which show that BFC06016 has no transmembrane region (Figure 5A) and that BFC06104 likewise has no transmembrane region (Figure 5B).
  • Figures 6A and 6B show the helical wheel of the amino acid residues in the protein sequence, drawn with pepwheel.
  • Figure 6A shows the helical wheel analysis of the BFC06016 protein and Figure 6B that of the BFC06104 protein.
  • Figures 7A and 7B use pepinfo to count the amino acid content and distribution of various amino acids in the protein sequence.
  • Figure 7A shows the results of the BFC06016 gene analysis
  • Figure 7B shows the results of the BFC06104 gene analysis.
  • Figure 8 shows that the BFC06016 (A) and BFC06104 (B) genes are located on the human chromosome 19 DNA sequence.
  • Figure 9 shows the amino acid homology between BFC06016 and BFC06104 and the known apolipoprotein A1BP gene.
  • An asterisk (*) marks an amino acid that is identical in all three genes; a blank ( ) marks a position where the amino acid differs among the three; a period (.) marks a semi-conservative substitution; a colon (:) marks a conservative substitution.
  • the amino acid homology between BFC06016 and apolipoprotein A1BP was 40.0%; the amino acid homology between BFC06104 and apolipoprotein A1BP was 41.5%.
  • Figure 10 is a flow chart showing the complete synthesis of a new gene nucleotide sequence predicted by a computer.
  • Figure 11 is a quantitative map of the expression of the two newly discovered genes in different human tissues and cell lines, measured with qPCR detection primers designed from the DNA sequence common to both genes.
  • Figure 12 shows the cloning of the newly discovered genes and the results of protein expression in bacteria.
  • DETAILED DESCRIPTION
  • The bioinformatics analysis programs used in the present invention come from public sources or commercial software, and mainly include:
  • blastall: the NCBI (National Center for Biotechnology Information) blast software package, which performs alignment of approximate gene sequences;
  • WU-Blast: the blast software package from Washington University, which performs well in the search for and analysis of new genes;
  • Fasta: the EMBL (European Molecular Biology Laboratory) sequence alignment software package;
  • cap4/Phrap: sequence assembly software from the University of Washington genome science research center;
  • Tmpred: predicts transmembrane regions of protein sequences;
  • Signalp: predicts signal peptides of protein sequences;
  • clustalw: multiple sequence alignment analysis software;
  • pressdb: database editing software, the nucleotide sequence database formatting software specific to the WU-blast programs;
  • sim4: software for aligning expressed sequences with chromosomal genome sequences;
  • im_index: database editing software, mainly used to index sequence databases so that large databases can be handled;
  • setdb: database editing software, WU-blast program
  • The computer programs written for carrying out the present invention are mainly the following:
  • tbl2fasta-n / fasta2tbl-n: sequence format conversion software, which converts sequences between fasta format and table format;
  • gb2fasta: sequence format conversion software, which converts sequences in GenBank format into fasta format;
  • drawBlast: blast result drawing program, which produces a rough alignment diagram from blast result data;
  • ed-cap4: recompiled cap4 program, which automatically completes the configuration of the cap4 runtime environment;
  • extractcontigs: converts the score matrix data output by cap4 into a file in fasta format;
  • im-del: database editing software, which deletes any sequence in the database;
  • im-insert: database editing software, which inserts sequences into the sequence database;
  • im-retrieve: database editing software, which retrieves particular sequences from large databases singly or in batch;
  • pepsigp: recompiled Signalp software, only
  • Example 3: New gene acquisition process
  • the flow chart of the novel gene discovery method of the present invention, the protein sequence clustering flow chart, and the computer operation analysis program integrated flow framework are shown in Figures 2A, 2B and 2C.
  • The patent protein database is parsed to obtain all protein sequences of 500 aa or less, preferably 400 aa or less, more preferably 300 aa or less (using the programs fasta2tbl-n and tbl2fasta_n), and all of these sequences are passed to pepsigp for secretory signal peptide analysis.
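The length filter in this step can be sketched as follows. This is a minimal illustration that assumes the database is available as FASTA text; the names `read_fasta` and `filter_by_length` are hypothetical and are not the fasta2tbl-n or tbl2fasta_n programs named above.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def filter_by_length(records, max_len=500):
    """Keep records whose sequence has at most max_len residues."""
    return [(h, s) for h, s in records if len(s) <= max_len]


# Toy input: one short protein and one 600-residue protein.
example = ">short\nMKT\nLLV\n>long\n" + "A" * 600 + "\n"
kept = filter_by_length(read_fasta(example), max_len=500)
print([h for h, _ in kept])
```

Lowering `max_len` to 400 or 300 gives the preferred and more preferred cut-offs mentioned in the text.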
  • The results are passed to ps-signal for analysis, yielding separately the protein sequences that contain a secreted signal peptide and those that do not. Through tt-cycle, all sequences containing a signal peptide undergo dynamic transmembrane region prediction by tt-tmpred, and the results are sent directly to the program tt-tmpred_p for analysis. The sequences are then divided into three classes: sequences containing a secreted signal peptide and no transmembrane region; sequences containing a secreted signal peptide and a transmembrane region; and sequences containing a secreted signal peptide in which the number of transmembrane regions is 5 to 8 (greater than or equal to 5 and less than or equal to 8) (see Figure 2B for details).
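The three-way grouping described here can be sketched as a small function. This is only an illustration: the label names are hypothetical, and it assumes the signal peptide call and the transmembrane region count have already been produced by the prediction programs.

```python
def classify(has_signal_peptide, tm_count):
    """Return the class labels that apply to one protein sequence.

    Hypothetical labels for the three groups in the text:
      'secreted_no_tm'  - signal peptide, no transmembrane region
      'secreted_tm'     - signal peptide, at least one transmembrane region
      'secreted_tm_5_8' - signal peptide, 5 to 8 transmembrane regions
    """
    labels = []
    if not has_signal_peptide:
        # Sequences without a signal peptide are outside the three classes.
        return labels
    if tm_count == 0:
        labels.append("secreted_no_tm")
    else:
        labels.append("secreted_tm")
        if 5 <= tm_count <= 8:
            labels.append("secreted_tm_5_8")
    return labels


print(classify(True, 6))
```

The third class is treated as a subset of the second, which is one reading of the text; a pipeline that wants disjoint classes would return only the most specific label.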
  • Using the amino acid sequence fragments obtained as templates, tblastn alignment is performed against the human expression
  • An expressed sequence tag that matches (with a sequence similarity of 15% to 95%, preferably 20% to 90%, more preferably greater than 25% and less than 90%, with the mutations preferably distributed as uniformly as possible throughout the matched sequence), for example
  • All sequence fragments that meet the preferred parameters are obtained and piped to tt_pblast for analysis, yielding all expressed sequence tags that meet the preferred conditions.
  • These tag sequences are passed to a script for filterA and polyT replacement filtering, and the environment that cap4 needs in order to run is set up.
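The text does not say how the filterA and polyT filtering step works. One common interpretation, shown here purely as an assumption, is trimming a trailing poly-A run and a leading poly-T run from each expressed sequence tag before assembly. The function name and the `min_run` threshold are hypothetical.

```python
import re


def trim_poly_tails(seq, min_run=10):
    """Remove a trailing poly-A run and a leading poly-T run of at least min_run bases."""
    seq = seq.upper()
    seq = re.sub(r"A{%d,}$" % min_run, "", seq)  # 3' poly-A tail
    seq = re.sub(r"^T{%d,}" % min_run, "", seq)  # 5' poly-T run (reverse-strand reads)
    return seq


print(trim_poly_tails("ACGT" + "A" * 12))
```

Short runs below `min_run` are left untouched, so genuine A-rich or T-rich sequence ends are not removed.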
  • These fasta format sequences are all converted to xml data exchange format files using fastaclust2caml.
  • cap4 and Phrap are then run to assemble these sequences completely, after which extractcontigs converts the contig data files back to FASTA format and the sequences are merged. These sequences are first aligned by blastx against the non-secreted protein database,
  • and parser-bx parsing excludes all perfectly matched sequences; the remaining sequences are then aligned against the patent protein sequence database with tt-fasty-1, and parser-fasta parses the output to obtain the sequences that remain.
  • The remaining sequences are aligned by blastn against the human chromosome sequence database for verification; aligned by blastn against the patent nucleotide sequence database to correct mutations or deletions in the sequence; aligned by blastn against the nucleotide sequence database and the human expressed sequence tag database to determine whether the problem of insufficient sequence length has been resolved; and checked by blastx against the non-redundant protein database to verify whether the sequence has already been discovered. By comparing the results of five repeated runs of this analysis, the full-length gene sequence can be obtained. The position of the full-length gene on the chromosome can be determined using Sim4 software.
  • ProParam can be used for predictive analysis of the hydrophobicity/hydrophilicity of the protein.
  • Signalp can be used for secretory signal peptide analysis.
  • Tmpred and tmap can be used for protein transmembrane region analysis.
  • The secondary structure of the protein can be analyzed using garnier; the helical wheel of the amino acid residues in the protein sequence can be displayed graphically using pepwheel; pepinfo can be used to count the content of amino acids of various properties in the protein sequence and to show their distribution; pepstat can be used to calculate the content of each amino acid in the protein sequence and to obtain information such as molecular weight, isoelectric point, charge and absorbance at 280 nm.
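One of the statistics listed above, the content of each amino acid, is simple enough to sketch directly. This is an illustrative function, not a reimplementation of pepstat; the name `composition` is hypothetical, and molecular weight, isoelectric point and absorbance are deliberately left out.

```python
from collections import Counter


def composition(protein):
    """Return {residue: percentage of the sequence} for a protein string."""
    protein = protein.upper()
    counts = Counter(protein)
    total = len(protein)
    return {aa: 100.0 * n / total for aa, n in sorted(counts.items())}


print(composition("MKKK"))
```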
  • PubMed literature search: a large number of related publications are collected through PubMed searches, and the biological functions of the discovered genes are predicted.
  • Example 4: Obtaining the new apolipoprotein A1BP-like genes BFC06016 and BFC06104. Following the new gene acquisition procedure above, the actual operations were performed at the terminal of the server, and the present invention obtained 35 computer-predicted protein sequences that are possible new gene candidates. Two new genes similar to the apolipoprotein A1BP gene are numbered BFC06016 and BFC06104, respectively. Seq ID No. 1 and Seq ID No. 2 are the DNA sequence and amino acid sequence of BFC06016 and are listed in Figure 3 (A). Seq ID No. 3 and Seq ID No. 4 are the DNA sequence and amino acid sequence of BFC06104 and are listed in Figure 3 (B).
  • FIG. 4 (B) also has no secretory signal peptide); protein transmembrane region analysis with the Tmpred/tmap analysis software showed that the BFC06016 protein has no transmembrane region (Fig. 5A) and, likewise, that BFC06104 has no transmembrane region (Fig. 5B); helical wheels of the amino acid residues in the protein sequences were drawn with pepwheel, and Fig. 6A and Fig. 6B show the helical wheel analysis of the BFC06016 and BFC06104 proteins, respectively; pepinfo was used to calculate the content and distribution of the various classes of amino acids in the protein sequences, with Fig. 7A showing the results for BFC06016 and Fig. 7B the results for BFC06104.
  • Example 5: Comparison of the human genes BFC06016 and BFC06104, which are similar to the apolipoprotein A1 binding protein gene, with the known apolipoprotein A1BP
  • Sim4 application software was used to determine the position of the full-length genes on the chromosome. The known human apolipoprotein A1BP gene is located on human chromosome 1 (see Ritter et al., Genomics 79: 693-702, 2002).
  • The BFC06016 and BFC06104 genes predicted by the computer analysis method designed in the present invention are both located on human chromosome 19, as shown in Fig. 8 (A) and Fig. 8 (B), respectively.
  • the full-length cDNA sequences of the BFC06016 and BFC06104 genes were obtained in a human cDNA library, which are Seq ID No. 5 and Seq ID No. 6, respectively.
  • Amino acid sequence comparisons with the known human apolipoprotein A1BP are shown in Figure 9. The amino acid homology of BFC06016 and BFC06104 with apolipoprotein A1BP was 40.0% and 41.5%, respectively.
  • An asterisk (*) marks an amino acid that is identical in all three genes; a blank ( ) marks a position where the amino acid differs among the three; a period (.) marks a semi-conservative substitution; a colon (:) marks a conservative substitution.
  • the amino acid homology between BFC06016 and apolipoprotein A1BP was 40.0%; the amino acid homology between BFC06104 and apolipoprotein A1BP was 41.5%.
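A percentage of this kind can be computed from a pairwise alignment as sketched below. The source does not state which denominator was used for the 40.0% and 41.5% figures, so the choice here (identical residues divided by aligned, non-gap columns) is an assumption, and the function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of aligned (non-gap) columns in which both sequences have the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Toy alignment: 4 identical residues over 5 gap-free columns.
print(percent_identity("MKT-LV", "MRTALV"))
```

Dividing by the full alignment length or by the shorter sequence length instead would give a different value for the same alignment, which is why the denominator should be reported alongside the percentage.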
  • Example 6: Brief description of molecular cloning techniques
  • coli was purchased from GIBCO/BRL. Purification of plasmid DNA, recovery of DNA fragments, and the like were carried out using a commercial Qiagen purification column. Pichia pastoris or BL21DE3 strains were used for protein expression and preparation.
  • Seq ID No. 7: 5'-CACATATGAGCAGCGCAGCCGGCCCAGACCCGTCGGAGGCGCCCGAAGAGCGGC-3' (positions 1-57, positive strand, 54 bases)
  • Seq ID No. 8: 5'-GGGCGGCTGCCTCCGCGGTGCTGAGGAAATGCCGCTCTTCGGGCGCCTCCG-3' (positions 37-87, reverse complement, 51 bases)
  • Seq ID No. 9: 5'-CCGCGGAGGCAGCCGCCCTGGAGCGGGAGCTGCTGGAGGATTATCGCTTTGGGCGGC-3' (positions 70-126, positive strand, 57 bases)
  • Seq ID No. 10: 5'-CAGCCACGGCACTAGCATGACCGCACAGCTCCACGAGCTGCTGCCGCCCAAAGCGATA-3' (positions 111-168, reverse complement, 58 bases)
  • Seq ID No. 11: 5'-TGCTAGTGCCGTGGCTGTGACCAAGGCGTTCCCGTTGCCCGCTCTCTCCCGGAAGCAG-3' (positions 152-209, positive strand, 58 bases)
  • Seq ID No. 12: 5'-CTGCCCCGTTCTGCTCCGGGCCACACACGACCAGCACCGTCCTCTGCTTCCGGGAGAG-3' (positions 195-252, reverse complement, 58 bases)
  • Seq ID No. 13: 5'-GCAGAACGGGGCAGTGGGGCTGGTCTGTGCCCGGCACC
  • Seq ID No. 14: 5'-GCAGGTCCAGCGAGCGTGTGGGGTAGAAGATGGTGGGTTCATACTCAAACACCCGC-3' (positions 278-333, reverse complement, 56 bases)
  • Seq ID No. 15: 5'-CACGCTCGCTGGACCTGCTGCATCGGGACCTGACCACCCAGTGCGAGAAGATGGAC-3' (positions 316-371, positive strand, 56 bases)
  • Seq ID No. 16: 5'-ATGAGCTGCACCTCAGTGGGCAGGTAGCTCAGGAAGGGGATGTCCATCTTCTCGC-3' (positions 358-412, reverse complement, 55 bases)
  • Seq ID No. 17: 5'-CCTGCCCACTGAGGTGCAGCTCATTAACGAAGCCTATGGGCTGGTGGTGGATGCCGT-3' (positions 389-445, positive strand, 57 bases)
  • Seq ID No. 18: 5'-GGGGCCCCCGACCTCGCCCGGCTCCACGCCGGGGCCCAGTACGGCATCCACCACC-3' (positions 431-485, reverse complement, 55 bases)
  • Seq ID No. 19: 5'-GCCGGGCGAGGTCGGGGGCCCCTGCACCCGCGCGCTGGCCACGCTCAAGCTGCTGTCC-3' (positions 464-521, positive strand, 58 bases)
  • Seq ID No. 20: 5'-GCCTGAGGGGATGTCCAGGCTCACGAGGGGGATGGACAGCAGCTTGAGCGTGGCC-3' (positions 500-554, reverse complement, 55 bases)
  • Seq ID No. 21: 5'-CATCCCCTCAGGCTGGGACGCAGAGACCGGCAGCGATTCGGAGGACGGGCTGCGGCCTG-3' (positions 542-600, positive strand, 59 bases)
  • Seq ID No. 22: 5'-GCGCAGCGCTTGGGCGCCGCGAGAGACACCAGCACGTCAGGCCGCAGCCCGTCCTCCGA-3' (positions 579-637, reverse complement, 59 bases)
  • Seq ID No. 23: 5'-CGTGCTGGTGTCTCTCGCGGCGCCCAAGCGCTGCGCTGGCCGCTTCTCCGGGCGCCACC-3' (positions 602-660, positive strand, 59 bases)
  • Seq ID No. 24: 5'-CTTGCGGCGCACGTCATCGGGCACGAACCTGCCGGCCACGAAGTGGTGGCGCCCGGAGA-3' (positions 646-704, reverse complement, 59 bases)
  • Seq ID No. 25: 5'-TGACGTGCGCCGCAAGTTCGCTCTGCGCCTGCCGGGATA
  • Seq ID No. 26: 5'-TAGCGGCCGCTCACAGTGCCGCGACGCAGTCGGTGCCCGTGTATCCCGGC-3' (positions 719-768, reverse complement, 50 bases)
  • Seq ID No. 27: 5'-CACATATGATGAGCAGCGCAG-3' (positions 1-21, positive strand, 21 bases)
  • Seq ID No. 28: 5'-TAGCGGCCGCTCACAGTGCCGC-3' (positions 747-768, reverse complement, 22 bases)
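For PCR assembly to work, each positive-strand oligonucleotide has to overlap the reverse complement of its neighbour. The sketch below checks this for Seq ID No. 7 and Seq ID No. 8, with the sequences copied from the list above and internal spaces removed. The helper names are hypothetical, and the check is an illustration rather than part of the patent's procedure.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def overlap_length(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0


seq7 = "CACATATGAGCAGCGCAGCCGGCCCAGACCCGTCGGAGGCGCCCGAAGAGCGGC"
seq8 = "GGGCGGCTGCCTCCGCGGTGCTGAGGAAATGCCGCTCTTCGGGCGCCTCCG"

# Seq ID No. 8 is given as a reverse complement, so flip it before comparing.
print(overlap_length(seq7, reverse_complement(seq8)))
```

For this pair the overlap is 21 bases, which matches the stated position ranges (1-57 and 37-87 share positions 37-57).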
  • Starting from the first oligonucleotide primer of the DNA sequence to be synthesized, the oligonucleotides are first grouped in sets of four, and a longer DNA fragment is synthesized from each set by PCR. For example, Seq ID No. 7, Seq ID No. 8, Seq ID No. 9 and Seq ID No. 10 form one group. Each 25 µl reaction in PCR buffer contained 100 pM of primer, 20 mM dNTP, an appropriate amount of water and 1 U of T4 DNA polymerase (T4 Taq Polymerase).
  • Five PCR cycles are run first, and then the oligonucleotide primers at the two ends are added (for example, when the products of the first and second groups are combined, Seq ID No. 7 and Seq ID No. 14 are each added to 100 pM).
  • A larger DNA fragment is then prepared using the same PCR cycling program; the 72 °C hold time in the cycle can be increased as appropriate.
  • The full sequence synthesis of the designed DNA can be completed by following the procedure shown in Figure 10.
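The grouping scheme can be sketched as follows. It assumes that Seq ID No. 7 to 26 are the assembly oligonucleotides and that they are grouped in numerical order; the function names are hypothetical.

```python
def group_oligos(ids, size=4):
    """Split an ordered list of oligonucleotide IDs into consecutive groups."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def outer_primers(groups, first, last):
    """IDs of the outermost oligonucleotides when groups first..last (inclusive) are joined."""
    return groups[first][0], groups[last][-1]


oligo_ids = list(range(7, 27))  # Seq ID No. 7 to 26 (assumed assembly set)
groups = group_oligos(oligo_ids)

print(groups[0])
print(outer_primers(groups, 0, 1))
```

Joining the first and second groups gives Seq ID No. 7 and Seq ID No. 14 as the end primers, as in the example in the text; joining larger runs of groups follows the same pattern.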
  • The computer-predicted BFC06016 and BFC06104 gene sequences were synthesized and prepared in this way, each with an Nde I restriction site at its 5' end.
  • The full-length DNA synthesized by PCR was inserted into the pTA vector, which has EcoRI and NotI sites to the left and right of the insertion site, respectively.
  • the DNA sequence was verified by sequencing to confirm that the synthesized DNA sequence was correct.
  • This plasmid was named pTA-BFC06016.
  • For quantitative PCR analysis, mRNA from human tissues and cell lines from various sources was used to synthesize cDNA. In a reaction volume of 25 µl, 100 units of M-MLV reverse transcriptase (Ambion), 0.5 mM dNTPs (Epicentre) and 40 ng/ml random hexanucleotide primers (Fisher) were added. The sample was reacted at 25 °C for 10 minutes, at 42 °C for 50 minutes, then at 70 °C for 15 minutes, diluted to 500 µl and finally stored at -20 °C. cDNA can also be purchased from a reagent company (for example, the Clontech MTC cDNA library).
  • PCR primers and probes (labeled with 6-FAM at the 5' end and TAMRA at the 3' end) were designed with ABI primer design software, using the DNA sequence common to the two newly discovered genes to design the PCR detection primers. The synthesis was performed by Qiagen, Biosearch Technologies Inc or Applied Biosystems Inc.
  • Primer used Primer-F (nucleotide position 244-263) Seq ID No 29: 5'- CTGGAGGA
  • The qPCR reaction used the ABI 7700 Sequence Detection System. Each 25 µl reaction contained 5 µl cDNA template, 1X TaqMan Universal PCR Master Mix (ABI), PCR primers, 200 nM probe, and 1X VIC-labeled Beta-2-Microglobulin endogenous control (ABI).
  • The PCR conditions were 50 °C for 2 minutes and 90 °C for 10 minutes, followed by 40 cycles of 95 °C for 15 seconds and 60 °C for 1 minute. Results were analyzed with sequence detection software (ABI), applying the comparative CT method to calculate fold differences in the gene product.
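The comparative CT calculation can be sketched as below. It is the standard 2^-ΔΔCT form and assumes roughly 100% amplification efficiency for both the target and the endogenous control (Beta-2-Microglobulin in the text). The choice of calibrator sample and the CT values are hypothetical.

```python
def fold_change(ct_target_sample, ct_ref_sample, ct_target_calibrator, ct_ref_calibrator):
    """Relative expression by the comparative CT (2^-ddCT) method."""
    # Normalize the target gene to the endogenous control in each sample.
    d_ct_sample = ct_target_sample - ct_ref_sample
    d_ct_calibrator = ct_target_calibrator - ct_ref_calibrator
    # Each CT unit corresponds to a two-fold difference in template.
    return 2.0 ** -(d_ct_sample - d_ct_calibrator)


# Hypothetical CT values: the target crosses threshold 2 cycles earlier
# in the sample than in the calibrator, at equal control CT.
print(fold_change(24.0, 18.0, 26.0, 18.0))
```

A result above 1 means higher expression in the sample than in the calibrator tissue or cell line; a result below 1 means lower expression.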
  • The cloned cDNA fragment was inserted into an expression vector (a DNA sequence encoding 6 histidines was inserted in the reading frame of the gene).
  • IPTG was used to induce expression of the target protein in E. coli (BL21/DE3) under the control of the T7 promoter, and the protein was purified (Fig. 12).
  • The target protein (with or without 6×His) was purified by affinity chromatography on a Ni column. The purified protein was mixed directly with immunoadjuvant and injected subcutaneously into rabbits weighing 3-4 kg, and the immunization was repeated 3 times at intervals of 15-20 days.
  • the purified protein antigen is then injected directly into the vein to boost the immune response.
  • the antibody titer obtained is greater than 1:68, preferably 1:500.
  • The prepared antibody can be used directly in immunological tests for biological activity, functional and clinical detection purposes, including but not limited to ELISA, Western blot, and the like.

Abstract

The present invention discloses a method for identifying novel genes using bioinformatics analysis, computer simulation and prediction techniques, and molecular biology techniques. In particular, computer analysis and prediction tools were developed by in-house programming and applied to human genome sequence data, and a series of novel genes was thereby identified and prepared. These new genes, designated BFC06016 and BFC06104, are similar to human apolipoprotein AIBP, and their GenBank accession numbers are DQ778079 and DQ778080, respectively. The genes and their encoded proteins may be related to cholesterol metabolism in the body, and can be used as candidate drug targets in the diagnosis and treatment of human cardiovascular disease.

Description

Method for discovering new genes and the new genes discovered
Technical field
The present invention relates to establishing new bioinformatic computer analysis methods and approaches for obtaining new functional genes. The results show that this analysis method can yield new gene sequences that agree with the published chromosomal DNA sequence of the human genome. The method can be used to analyze and obtain new genes that have biological functions and are relevant to human health and to the diagnosis and treatment of disease, especially genes that serve as gene drugs or drug targets.
Background art
1. Functional genomics research
Gene drugs are biologically active products made from functional genes, or the products of genes, found in genomics research as starting materials, using corresponding technologies such as biology, molecular biology, biochemistry and bioengineering, with the quality of intermediates and finished products controlled by corresponding analytical techniques; clinically they can be used for the treatment, prevention and diagnosis of certain diseases. Recombinant protein drugs, vaccines, DNA drugs, RNA drugs and gene therapy drugs are all gene drugs. A gene drug target is a functional gene, or gene product (functional protein), found in genomics research that is used as the starting material for making an antagonist or inhibitor by corresponding technologies such as biology, chemistry, physics, molecular biology, biochemistry and bioengineering. For example, a specific antibody is obtained that binds the functional protein as its antigen so that the protein loses its biological activity, or a small-molecule compound is screened out that inhibits the biological activity of the gene product; such a substance (antibody or small-molecule compound) serves as a drug for the treatment and diagnosis of human disease.
The "traditional" procedure for discovering gene drugs and drug target genes starts from an analysis of disease symptoms, looking for differences between healthy people and patients in physiological and biochemical indicators. Human growth hormone is an example: patients are shorter than healthy people of the same age, and various analyses showed that the cause is an endogenous deficiency due to insufficient secretion of human growth hormone, leading to functional insufficiency; the clinical goal of treatment is reached by artificially supplementing the missing protein (in the early days, human growth hormone was extracted from human urine and then injected into patients). Later, as science and technology developed, the isolated and purified natural protein was sequenced, the DNA sequence was deduced from the protein sequence, and a probe (DNA probe) was synthesized and used to "display" the gene fragment, from which the full sequence was then obtained. The gene is expressed in an exogenous system (such as E. coli) using genetic engineering techniques. The prepared and purified recombinant protein becomes a genetically engineered drug after preclinical (animal) testing and clinical trials. This process can be called the "traditional" or "classical" gene drug discovery procedure.
2. Techniques and methods for new gene discovery
The discipline of bioinformatics has advanced rapidly and made great progress. Published patents and research literature, such as Zailin Yu et al. (2002), WIPO patent publication WO 02/052047 A2; USPTO publication 20020155473A1; He Fuchu et al., Chinese Patent Publication No. CN1657537A; Tang YT et al. (2002), USPTO Patent 6,365,371; Bandman O et al. (2000), USPTO Patent 6,020,164; Hamady M et al. (2006), BMC Bioinformatics, published online 2006 January 4: 10.1186/1471-2105-7-1; Schattner P et al. (2006), RNA 12: 15-25; Skupski MP et al. (1999), Nucleic Acids Research 27(1): 35-38; Aaron Levine et al. (2001), Nucleic Acids Res. 29(19): 4006-4013; Nishikawa T et al. (2000), Genome Informatics (11): 12-23; Legato J et al. (2003), Physiological Genomics (13): 179-181; Gary B et al. (2002), Nucleic Acids Res. 30(23): 5310-5317; Zondervan K et al. (2002), Fertil Steril. 78(4): 777-781; Kontkanen O et al. (2002), Expert Opin Ther Targets 6(3): 363-374; Kumar R et al. (2002), J Mol Biol. 319(3): 593-602; Chapman MA et al. (2004), Genome Res. 14(2): 313-318; Uenishi H et al. (2004), Nucleic Acids Res. (32): 484-488; Bass MP et al. (2004), Pac Symp Biocomput. (9): 93-103; Ritter M et al. (2001), Genomics 79: 693-702; Yonan AL et al. (2003), Genes Brain Behav. (5): 303-320; Zhang Deli et al., Acta Genetica Sinica, 2004, 31(5): 431-443; Li Yongqing et al., Life Science Research, 2001, 5(2): 141-145; Zhu Chuanbing et al., Journal of Natural Science of Hunan Normal University, 2004, 27(3): 79-82; Qi Zhenyu et al., Chinese Journal of Experimental Surgery, 2005, 22(7): 849-851; and Xie Zhengxiang et al., Chinese Journal of Medical Physics, 2006, 23(1): 62-63, each describe applications of bioinformatics and the discovery and analysis of new genes, and are cited as related references of the present invention.
Summary of the invention
According to one aspect, the present invention provides a method for discovering new genes, comprising the following steps:
1) obtaining, from a published protein sequence database, all protein sequences with a length of 500 aa or less, preferably 400 aa or less, more preferably 300 aa or less, and converting these sequences into a uniform, recognizable format;
2) performing secretory signal peptide analysis in batch on the protein sequences obtained in step 1), to obtain separately the protein sequences that contain a secretory signal peptide and those that do not;
3) performing transmembrane region analysis in batch on the protein sequences obtained in step 1), to obtain separately the protein sequences that contain a transmembrane region and those that do not;
4) combining the result sequences obtained in steps 2) and 3) and dividing the sequences roughly into three major classes: sequences that contain a secretory signal peptide and no transmembrane region; sequences that contain a secretory signal peptide and a transmembrane region; and, among the sequences that contain a secretory signal peptide and a transmembrane region, those with 5 to 8 transmembrane regions;
5) using the sequences of each of these three classes to perform sequence alignment analysis against an expressed sequence tag library, and obtaining expressed sequence tags with a certain degree of match, the matching conditions being: a sequence similarity of 15% to 95%, preferably 20% to 90%, more preferably greater than 25% and less than 90%, with the mutation points required to be distributed as evenly as possible over the whole matched sequence;
6) performing clustering and assembly analysis on these expressed sequence tags with different algorithms; and
7) analyzing and comparing the results against the sequences of known databases to obtain new full-length genes.
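The matching condition in step 5) can be sketched as two checks on an alignment hit. The identity bounds come from the text (the "more preferred" range), but the text only asks that mutation points be spread "as evenly as possible" and gives no measure, so `mismatch_spread` is a hypothetical heuristic introduced here for illustration.

```python
def passes_similarity(identity_pct, low=25.0, high=90.0):
    """True if identity lies strictly between low and high."""
    return low < identity_pct < high


def mismatch_spread(aligned_query, aligned_hit, windows=4):
    """Fraction of equal-width alignment windows that contain at least one mismatch."""
    n = len(aligned_query)
    if n == 0 or n != len(aligned_hit):
        raise ValueError("alignments must be non-empty and of equal length")
    hit_windows = set()
    for i, (q, h) in enumerate(zip(aligned_query, aligned_hit)):
        if q != h:
            hit_windows.add(i * windows // n)  # index of the window holding column i
    return len(hit_windows) / windows


print(passes_similarity(41.5))
print(mismatch_spread("AAAAAAAA", "ABAAABAA"))
```

A spread close to 1 means the differences occur along the whole alignment; a spread close to 0 means they are clustered in one region. Any threshold on the spread would be a design choice, not something the source specifies.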
Preferably, the method for discovering new genes of the present invention is carried out on a computer system platform, which comprises:
Software for sequence alignment: the blast software package of the US National Center for Biotechnology Information; the blast software package of Washington University; the sequence alignment software package of the European Molecular Biology Laboratory; and the Clustalw multiple sequence alignment analysis software.
Software for amino acid sequence function prediction: protein active site/functional domain analysis software.
Software for sequence editing: software that converts sequences in fasta format to table format; software that converts sequences in GenBank format to fasta format; a DNA sequence reverse complement program; a DNA sequence translation program; software that extracts the CDS sequences from sequence files in GenBank format; and software that merges two simple sequence fragments and filters out the part repeated between them.
Software for database operations: software that deletes any sequence in the database; software that inserts sequences into the sequence database; software that retrieves particular sequences from large databases singly or in batch; software that retrieves DNA and protein sequences from temporary, unindexed databases; software that fetches sequence data from GenBank directly over the network; software that fetches sequence data from a local database directly over the local network; a program that indexes databases in Fasta format; a program that indexes databases in GenBank format; software that extracts fragment sequences from genome sequences; and software for conveniently extracting a given fragment of a sequence.
Software for plotting sequence alignment results: software that draws a rough alignment diagram from blast result data.
Software for data parsing: software that parses the results of batch transmembrane region prediction for protein sequences; software that parses the output of programs such as blastn, blastp and blastx; software that parses the output of the fasty alignment program; software that parses the result data produced by the batch automated signal peptide prediction program; and software that performs automatic machine analysis of large volumes of result output.
Software for assisting other programs: software that enables programs which cannot run automatically on their own to run fully automatically.
Re-optimized software: software that automatically completes the configuration of the cap4 runtime environment; software that converts the score matrix data output by cap4 into files in fasta format; a program for batch automated signal peptide prediction; and software for batch transmembrane region prediction for protein sequences.
The computer system platform is preferably based on the Linux operating system.
The bioinformatics processing programs and the system platform technology of the present invention can be used to discover new genes and analyze their products, giving a clearer understanding of the relationship between gene expression and disease and thereby raising the standard of disease treatment.
The present invention uses a "reverse" procedure, opposite to the conventional "traditional" process described above, to carry out functional genomics research on gene drugs, with the aim of greatly accelerating the screening of novel gene drugs. Compared with the conventional "traditional" approach to finding gene drugs, and with other existing gene discovery techniques and methods, the design of the present invention is simpler, places lower demands on computer equipment, is easier to operate and master, and can shorten the time needed to obtain results by several years.
First, the present invention provides a new, self-written computer software processing system dedicated to screening gene drugs and drug target genes. The self-written programs take published human genomic DNA sequences and, through a series of programs run on a Linux system platform, predict sequences (ORFs) encoding new proteins (genes). The software system combines information on disease type, disease onset, pathogenesis, mechanism, and genetics with the strengths of bioinformatics, for example in predicting secreted peptides, signal peptides, and transmembrane regions. It integrates existing functional genomics methods and computational tools within one new self-written software system, extending their functions and computational capabilities so that novel genes can be predicted, screened, and assembled.
Second, the candidate ORF sequences predicted by computer are studied by functional genomics using high-throughput screening, with gene drug screening carried out at the cellular and animal levels. Molecular biology techniques are used to assemble, clone, and amplify the genes. The resulting DNA sequence information is then studied using comparative biology and pharmacology techniques and laboratory methods such as gene regulation, gene knock-out and knock-in, transfection, antisense RNA, and siRNA, in order to investigate gene localization, expression, overexpression, low-level expression, and differential expression. High-throughput screening methods such as quantitative PCR and gene chip technology are used to screen and validate drug target genes and so determine the biological function of the newly predicted genes, yielding original gene drug target genes.
Third, the biochemical properties and cellular functions of the candidate gene drug target genes are studied further. Immunohistochemistry, pathology, and other predictive methods are used to determine their biological and clinical value efficiently and specifically, yielding new, original gene drug candidates. Preclinical cell and animal studies and clinical trials are then carried out to verify the clinical role of the discovered genes and their products.
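The ORF prediction step described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the invention's own program: the function names, the `min_codons` threshold, the use of the standard genetic code, and the rule that an ORF runs from the first ATG to the next in-frame stop codon are all choices made here for the example.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=30):
    """Yield (strand, frame, start, protein) for each ATG...stop ORF.

    start is the 0-based offset of the ATG on the scanned strand.
    min_codons counts coding codons (stop excluded); 30 is an arbitrary
    illustrative default. ORFs that run off the end without a stop codon
    are not reported.
    """
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for frame in range(3):
            start, protein = None, []
            for pos in range(frame, len(seq) - 2, 3):
                # Codons containing ambiguous bases translate to "X".
                aa = CODON_TABLE.get(seq[pos:pos + 3], "X")
                if start is None:
                    if aa == "M":
                        start, protein = pos, ["M"]
                elif aa == "*":
                    if len(protein) >= min_codons:
                        yield strand, frame, start, "".join(protein)
                    start, protein = None, []
                else:
                    protein.append(aa)
```

Calling `list(find_orfs(sequence))` on a genomic fragment scans all six reading frames. A real pipeline would also need to handle introns, which this sketch ignores.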
The computer-based bioinformatics discovery and analysis platform of the present invention can be used to discover new human genes, and can also be used for, but is not limited to, gene discovery and analysis in animals, plants, and microorganisms.
The computer-implemented bioinformatics analysis programs and the workable platform technology of the present invention bring together published human genome research data and information with the programs designed and run in the present invention, in order to analyze large volumes of data and libraries and obtain newly predicted genes from them. The aim is to overcome the technical and time limitations of obtaining new genes by traditional techniques. In summary, compared with conventional bioinformatics analysis methods, the programs of the present invention have the following advantages: 1) new candidate genes can be analyzed and obtained quickly; 2) the operating procedure is simple and efficient; 3) the new genes obtained have biological function and potential clinical application as gene drugs and gene drug targets.
The techniques and methods described herein were applied to published human genome sequences to analyze and obtain new genes. The results show that computer prediction allows the inventors to obtain a large number of candidate genes with biological function. In running the platform, the inventors obtained 35 new genes with potential biological function. In this description, only two of them, previously unreported genes similar to human apolipoprotein A1 binding protein (Apolipoprotein A1 Binding Protein, APOA1BP), are used as examples to demonstrate that the computer simulation prediction approach described herein is feasible.
Human blood contains low-density lipoprotein (LDL) and high-density lipoprotein (HDL). The protein component of a lipoprotein is called an apolipoprotein. Lipoproteins bind cholesterol to form lipoprotein cholesterol, which carries cholesterol into and out of cells. Clinically, reduced HDL cholesterol suggests susceptibility to coronary heart disease, and elevated LDL cholesterol suggests susceptibility to coronary heart disease and cerebrovascular disease caused by atherosclerosis. The key step in reverse cholesterol transport is the transfer of cholesterol from inside the cell to extracellular lipoproteins, and an important component of every class of lipoprotein is its apolipoprotein. Apolipoproteins are responsible for transporting the different lipoproteins to the various parts of the body. They are proteins located on the lipoprotein surface, made up of amino acids in a defined sequence, and are present in the various lipoproteins in multiple forms and in different proportions. Because the lipoproteins differ in the apolipoproteins they contain, they have different functions and different metabolic pathways.
In 2002, Ritter, M. et al. reported a new apolipoprotein-interacting protein that they had discovered and named AI-BP (apoA-I binding protein). The gene encoding AI-BP, APOA1BP, is located on chromosome 1q21 and consists of 6 exons and 5 introns (2.5 kb). Northern blot analysis showed that APOA1BP mRNA is ubiquitously expressed, with high expression in kidney, heart, liver, thyroid, adrenal gland, and testis. AI-BP protein was not found in the serum of healthy individuals, but high levels were present in serum samples from patients with sepsis syndrome. In healthy individuals, AI-BP is present in substantial amounts in cerebrospinal fluid and urine. Stimulating renal proximal tubule cells with apoA-I or HDL induces concentration-dependent secretion of AI-BP, whereas stimulation with apoA-II, BSA, or LDL does not. This occurs only in the renal proximal tubule; in other tissues apoA-I does not stimulate AI-BP secretion. These experiments indicate that AI-BP plays an important role in the degradation or reabsorption of apoA-I in renal tubular cells (Ritter et al., Genomics, 79: 693-702, 2002). The discovery of new genes for proteins that interact with apolipoproteins may help clarify the pathways involved in cholesterol metabolism, and so help prevent, control, and treat cardiovascular disease.
Accordingly, the present invention also discloses for the first time two new genes, obtained with the programs and methods of the invention, that resemble apolipoprotein-related proteins and are located on human chromosome 19. These two genes differ from the published apolipoprotein-interacting protein gene in that: 1) they are located on a different chromosome; 2) they have no secretory peptide; 3) their protein amino acid sequences share only 40.0% (BFC06016, GenBank accession number DQ778079) and 41.5% (BFC06104, GenBank accession number DQ778080) homology with the known APOA1BP gene; 4) real-time quantitative PCR (QPCR or Real Time PCR) on different human tissues and cell lines showed that the newly discovered genes are expressed at different levels, indicating that they are biologically active.
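The text does not state how the homology percentages were calculated. As a hedged illustration only, one common convention computes percent identity over the columns of a pairwise alignment. The sketch below assumes the two sequences are already aligned with `-` as the gap character and counts gapped columns in the denominator; other conventions give different values, and the function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity between two pre-aligned sequences of equal length.

    Columns where both sequences have a gap are ignored; columns where
    only one has a gap count as compared but not matching. This is one
    convention among several and is an assumption of this sketch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = compared = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared
```

For example, `percent_identity("MKT-LV", "MRTALV")` compares six columns, four of which match.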
1. A computer simulation prediction system platform for discovering new genes
The present invention describes the acquisition and localization of various known or published bioinformatics resources and libraries. The libraries and data obtained include, but are not limited to, the latest databases needed for bioinformatics analysis, downloaded from the NCBI remote databases. These include the human expressed sequence tag (EST) database, the non-redundant protein sequence database, the nucleotide database, the patented protein sequence database, the human chromosome sequence database, and others. All downloaded databases are formatted on the local computer, which converts them into sequence-format databases that the local programs can recognize.
These libraries contain published human chromosomal DNA sequencing results, libraries of human mRNA and cDNA sequencing results, and published protein sequence databases.
All the databases, libraries, and data collections described in the present invention come from publicly available sources. They have been verified and digitally processed on the local computer so that they can be retrieved by the local computer at any time and integrated with the programs of the present invention.
The bioinformatics analysis programs used in the present invention mainly include, but are not limited to, the following.
Software for sequence alignment:
blastall: the BLAST package from NCBI (National Center for Biotechnology Information), which performs general-purpose gene sequence alignment;
WU-blast: the BLAST package from Washington University, which performs particularly well in searching for and analyzing new genes;
Fasta: the sequence alignment package from EMBL (European Molecular Biology Laboratory);
clustalw: multiple sequence alignment software;
sim4: software for aligning expressed sequences to chromosomal genomic sequences;
Software for database editing:
pressdb: nucleotide sequence database formatting software specific to WU-blast;
im_index: mainly used to index sequence databases so that large databases can be handled;
setdb: protein sequence database formatting software specific to WU-blast;
Software for sequence assembly:
Cap4/Phrap: sequence assembly software from the genome sciences research center of the University of Washington;
merger: simple sequence merging software;
Software for amino acid sequence function prediction:
Tmpred: predicts transmembrane regions of protein sequences;
Signalp: predicts signal peptides of protein sequences;
remap: restriction site analysis software for sequences;
restrict: software that reports restriction digest information for sequences;
showorf: DNA sequence translation software;
pepinfo: plots the content of amino acids with various properties in a protein sequence;
pepstats: reports the amino acid composition of a protein sequence together with its molecular weight, isoelectric point, charge, and absorbance at 280 nm;
pepwheel: draws a helical wheel of all amino acid residues in a protein sequence;
Proparam: mainly used for overall determination of protein hydrophilicity/hydrophobicity;
Tmap: plots the transmembrane regions of a protein;
ps_scan: protein active-site/functional-domain analysis software.
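To make the kind of statistics listed above concrete, the following Python sketch computes amino acid composition and a simple hydropathy measure. It is an illustration, not a reimplementation of pepstats or any other tool named here: the function names are hypothetical, and the choice of the Kyte-Doolittle hydropathy scale and a 9-residue window are assumptions of this example.

```python
from collections import Counter

# Kyte-Doolittle hydropathy scale (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(protein):
    """Return the fraction of each residue type in the sequence."""
    if not protein:
        raise ValueError("empty sequence")
    counts = Counter(protein.upper())
    total = sum(counts.values())
    return {aa: n / total for aa, n in sorted(counts.items())}


def gravy(protein):
    """Mean hydropathy over the whole sequence.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    if not protein:
        raise ValueError("empty sequence")
    return sum(KYTE_DOOLITTLE[aa] for aa in protein.upper()) / len(protein)


def hydropathy_profile(protein, window=9):
    """Sliding-window mean hydropathy; empty if the sequence is shorter
    than the window."""
    scores = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]
```

Sustained high values in `hydropathy_profile` are one classic hint of a membrane-spanning segment, although dedicated predictors use more elaborate models.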
In one specific embodiment, the present invention provides a stand-alone bioinformatics computer program for predicting new genes and analyzing example results. The present invention also includes all programs written to run on the local computer, which together form the new gene discovery and analysis system platform of the invention. In particular, these include, but are not limited to, the following.
Software for sequence editing:
tbl2fasta_n/fasta2tbl_n: sequence format conversion software that converts FASTA-format sequences to tabular format;
gb2fasta: sequence format conversion software that converts GenBank-format sequences to FASTA format;
tt_comp_dna: sequence editing software, a DNA sequence reverse-complement program;
translate: sequence editing software, a DNA sequence translation program;
gb2cds: sequence editing software that extracts the CDS sequences from GenBank-format sequence files;
tt_zip_2: sequence editing software mainly used to merge two simple sequence fragments and filter out the overlapping portion between them;
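Two of the sequence editing operations listed above, FASTA-to-table conversion and merging two fragments while dropping their overlap, can be sketched in Python as follows. These are minimal illustrations, not the listed programs themselves: the function names, the tab-separated "header, sequence" table layout, and the exact-match overlap rule with a `min_overlap` threshold are assumptions made for the example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fasta_to_table(lines):
    """Convert FASTA records to one 'header<TAB>sequence' row per record."""
    return [f"{name}\t{seq}" for name, seq in read_fasta(lines)]


def merge_overlap(left, right, min_overlap=3):
    """Join two fragments on the longest exact suffix/prefix overlap.

    Returns None when the overlap is shorter than min_overlap. Mismatches
    inside the overlap are not tolerated in this sketch.
    """
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return None
```

For instance, `merge_overlap("ACGTTGCA", "TGCAGGT")` joins the fragments on the shared `TGCA` and keeps that segment only once.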
Software for database operations:
im_delete: database editing software that deletes any sequence from a database;
im_insert: database editing software that inserts additional sequences into a sequence database;
im_retrieve: database editing software that retrieves selected sequences from a large database, in batch or individually;
tt_get: software for retrieving DNA and protein sequences from temporary, unindexed databases;
rfetch: database operation software that fetches sequence data directly from GenBank over the network;
lfetch: database operation software that fetches sequence data from a local database over the local network;
biofaseqindex: database editing software, a program for indexing FASTA-format databases;
biogbseqindex: database editing software, a program for indexing GenBank-format databases;
tt_subseq_genome: software for extracting fragment sequences from genomic sequences;
tt_sub_seq: sequence editing software for conveniently extracting a given fragment of a sequence;
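The indexing and retrieval operations above can be illustrated by a byte-offset index over a FASTA file. The Python sketch below is one possible approach under stated assumptions, not the design of the listed programs: the function names are hypothetical, the record identifier is taken to be the first word of the header line, and the index is held in memory rather than written to disk.

```python
import io


def build_index(handle):
    """Map each record ID to the byte offset of its header line.

    handle must be opened in binary mode so that tell() and seek()
    work in plain byte offsets.
    """
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            fields = line[1:].split()
            if fields:
                index[fields[0].decode("ascii")] = offset
    return index


def fetch(handle, index, seq_id, start=None, end=None):
    """Return a record's sequence, or the slice [start:end] (0-based)."""
    handle.seek(index[seq_id])
    handle.readline()  # skip the header line
    chunks = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break
        chunks.append(line.strip())
    return b"".join(chunks).decode("ascii")[start:end]


# In-memory demonstration; a real database would be a file opened with "rb".
demo = io.BytesIO(b">a first\nACGT\nAC\n>b\nGGGG\n")
demo_index = build_index(demo)
```

Because only offsets are stored, a single record can be retrieved from a very large file without reading the records before it. The optional `start` and `end` arguments correspond to extracting a fragment of a sequence.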
Software for analyzing and plotting sequence alignment results:
drawBlast: a BLAST result plotting program that draws an approximate alignment diagram from BLAST result data;
Software for data parsing:
tt_tmpred_p: data parser software dedicated to parsing the analysis results generated by tt_tmpred;
parser_bx: parser software that parses the output of programs such as blastn, blastp, and blastx;
parser_fasta: parser software that parses the output of the fasty alignment program;
ps_signalp: data parser software that parses the result data produced by the pepsigp program;
tt_pblast: blastn result parsing software that enables automated machine analysis of large volumes of result output;
Software for assisting the running of other programs:
tt_cycle: auxiliary software mainly used to work with programs that cannot run unattended on their own, so that the whole process runs fully automatically;
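Automated parsing of alignment output, as described above, can be sketched for the simplest case: BLAST's 12-column tab-separated format (`-m 8` in legacy blastall, `-outfmt 6` in BLAST+). Whether the parsers listed above read this format or the default report is not stated, so the input format is an assumption, as are the function names and the E-value cutoff.

```python
from collections import namedtuple

# Column order of the standard 12-column BLAST tabular output.
Hit = namedtuple(
    "Hit",
    "query subject identity length mismatches gap_opens "
    "q_start q_end s_start s_end evalue bit_score",
)


def parse_tabular(lines):
    """Yield a Hit for each data row; skip comments and short rows."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        if len(f) < 12:
            continue
        yield Hit(
            f[0], f[1], float(f[2]),
            int(f[3]), int(f[4]), int(f[5]),
            int(f[6]), int(f[7]), int(f[8]), int(f[9]),
            float(f[10]), float(f[11]),
        )


def best_hits(hits, max_evalue=1e-5):
    """Keep the highest-scoring hit per query below an E-value cutoff.

    The cutoff value is an illustrative assumption.
    """
    best = {}
    for hit in hits:
        if hit.evalue > max_evalue:
            continue
        current = best.get(hit.query)
        if current is None or hit.bit_score > current.bit_score:
            best[hit.query] = hit
    return best
```

Queries absent from the `best_hits` result have no significant match in the searched database, which is one way a batch filter could flag candidate novel sequences for further checks.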
Re-optimized software:
ed_cap4: a recompiled Cap4 program that configures the cap4 runtime environment automatically;
extractcontigs: converts the score matrix data output by cap4 into FASTA-format files;
pepsigp: a recompiled version of Signalp that improves on the original program, which could only predict signal peptides one sequence at a time, to give fully automated batch prediction;
primers_for_fulllength_clone: batch primer design software;
tt_fasty_l: an improved fasty program, mainly intended to be more convenient to operate;
tt_tmpred: a recompiled protein transmembrane-region prediction program; the improved program supports batch analysis.
In addition, the computer system platform of the present invention may include other authoritative bioinformatics data analysis packages, for example Emboss (a biological sequence analysis package; the Linux source code is available at http://emboss.sourceforge.net), Tmpred (transmembrane structure prediction; the Linux source code can be purchased at http://www.ch.embnet.org/), Signalp (secretory signal peptide analysis; the Linux source code can be purchased at http://www.cbs.dtu.dk/), Predict Protein (prediction and analysis of basic protein information; available as a free download from http://www.predictprotein.org/), Clustalw (sequence alignment analysis; a free Linux version can be downloaded from http://www.ebi.ac.uk/), and Primer (primer design and analysis; available from http://primer3.sourceforge.net/). The software of the present invention preferably runs on a Linux system. Because large batches of data must be filtered and screened, the source code of some of the packages above must be obtained and their core functionality used to recompile software suited to this platform.
Some of the software of the present invention, and in particular the combination and coordinated operation of all of the software, forms the basis of the computer software platform of the invention.
One embodiment of the invention sets out the process and method for downloading nucleotide sequence libraries, patented protein libraries, the human expressed sequence tag database, the non-redundant protein sequence database, human chromosome sequence libraries, and other related databases from the NCBI remote databases.
Another embodiment of the invention sets out the use of various bioinformatics analysis software and systems, and in particular the construction of the computer analysis system platform of the invention, which allows the independent software analysis systems to work together in discovering and analyzing new genes.
Another embodiment sets out an overall flow diagram of the computer analysis procedure for discovering new genes.
According to the embodiments, the computer and software engineering completed in the present invention is an independent, complete bioinformatics processing system platform that can be duplicated, copied, and ported, and can be used for, but is not limited to, new gene discovery and functional analysis, demonstration, teaching, commercial purposes, clinical treatment, and medical diagnosis.
2. Discovery of new genes similar to the apolipoprotein A1BP gene
The information processing platform of the present invention was applied in practice to the discovery and analysis of new protein sequences (see Example 3 for the procedure), yielding 35 candidate new protein sequences. Two of them, which are similar to the known apolipoprotein A1BP gene, are disclosed as examples. These two new genes are BFC06016 and BFC06104, which have the nucleotide sequences shown in Seq ID No.1 and Seq ID No.3 respectively; their GenBank accession numbers are DQ778079 and DQ778080 respectively. The amino acid sequences encoded by these nucleotide sequences are shown in Seq ID No.2 and Seq ID No.4 respectively. Various software and bioinformatics analysis techniques were used to obtain protein analysis data, including hydrophilicity/hydrophobicity, the presence or absence of a secretory peptide, possible spatial conformation, transmembrane structure analysis, helical structure, and predicted possible function.
In general, newly discovered genes can be obtained by total DNA sequence synthesis and used in biological and clinical application research and in product development. The examples of the present invention set out in detail a method and technique for synthesizing the complete DNA sequence of a gene: DNA fragments are synthesized stepwise by PCR and then assembled into the full gene sequence, and the result is verified by DNA sequencing. Quantitative PCR was also used to verify that the two newly discovered human genes are biologically active, with expression levels that differ across human tissues and organs. These preliminary experimental results indicate that the two genes are biologically active. The DNA and protein amino acid sequences of the present invention can be applied to new drug development and clinical diagnosis. Preferably, the genes serve as drugs or drug target genes for the diagnosis and treatment of cardiovascular disease; more preferably, they serve as gene drugs or gene therapy drug targets.
The nucleotides of the new genes of the present invention can be introduced into host cells by recombinant cloning techniques so that the proteins they encode are expressed.
In general, host cells are genetically engineered (transduced, transformed, or transfected) so that a vector plasmid carrying a new gene of the present invention is introduced into the host system by invasion, viral ("phage") infection, or similar means. The engineered host cells can be cultured in media containing conventional nutrients, modified as appropriate to favor the promoter. The culture conditions used for selecting transformants or amplifying the nucleotide chains encoding these genes, such as temperature and pH, are controlled appropriately in order to select expressing cells.
According to the present invention, the recombinant vector carries nucleotides encoding a new gene protein. The recombinant vector may be an expression vector that expresses a fusion protein in the host cell from the nucleotide coding sequence it carries. The form may be, but is not limited to, a fusion or a separate insertion. Host organisms and somatic cells include, but are not limited to, vertebrates (such as human, monkey, mouse, and rabbit), fish, chickens, insects, plants, yeast, fungi, and bacteria.
The nucleotides of the present invention can be expressed as protein under the control of a suitable promoter. Suitable promoters include, but are not limited to: adenovirus promoters, such as the adenovirus major late promoter; heterologous promoters, such as the CMV promoter and the RSV promoter; inducible promoters, such as the MMT promoter, heat shock promoter, albumin promoter, ApoAI promoter, and human globin promoter; viral thymidine kinase promoters, such as the herpesvirus thymidine kinase promoter; retroviral LTR promoters, including modified LTR promoters; the β-actin promoter; and the human growth hormone promoter. A native promoter can also be used to control expression of the protein encoded by the nucleotides.
According to the present invention, recombinant cells are able to express a nucleic acid sequence encoding a protein described herein. The recombinant engineered cells may express the new protein of the invention constitutively, or in the presence or absence of an inducer. Recombinant engineered cells include, but are not limited to, cells of vertebrates (e.g., human, monkey, mouse, rabbit, fish, chicken), insects, plants, yeast, fungi, and bacteria.
The invention also covers antibodies to the new proteins, obtained according to the present invention and known techniques, including, but not limited to, polyclonal, monoclonal, and humanized antibodies, and the practical applications of these antibodies. The specific antibodies are preferably produced by immunizing animals. They have important applications in clinical diagnosis, in therapy, and as biological reagents.
The present invention also provides a nucleic acid whose nucleotide sequence has at least 95%, preferably at least 96%, more preferably at least 97%, still more preferably at least 98%, and most preferably at least 99% sequence homology with Seq ID No.1 or Seq ID No.3, and whose encoded protein has the same function as the protein encoded by Seq ID No.1 or Seq ID No.3, respectively. The nucleic acid of the invention preferably encodes a protein that is derived from the protein of Seq ID No.2 or Seq ID No.4 by substitution, deletion or addition of one or several amino acids in the amino acid sequence defined by Seq ID No.2 or Seq ID No.4, and that has the same function as the protein of Seq ID No.2 or Seq ID No.4.
The present invention also provides a protein whose amino acid sequence has at least 90%, preferably at least 92%, more preferably at least 95%, still more preferably at least 97%, and most preferably at least 99% sequence homology with Seq ID No.2 or Seq ID No.4, and which has the same function as the protein of Seq ID No.2 or Seq ID No.4, respectively. Preferably, the protein of the invention is derived from the protein of Seq ID No.2 or Seq ID No.4 by substitution, deletion or addition of one or several amino acids in the amino acid sequence defined by Seq ID No.2 or Seq ID No.4, and has the same function as the protein of Seq ID No.2 or Seq ID No.4.

Brief Description of the Drawings
Figure 1 is a diagram of the computer network hardware connection framework used for novel gene discovery.
Figures 2A, 2B and 2C are, respectively, the flow chart of the method of the invention for novel gene discovery, the flow chart of protein sequence clustering, and the overall workflow diagram of the computer analysis programs.
Figure 3 shows the two newly discovered genes resembling apolipoprotein A1BP: the DNA nucleotide sequence (A1) and corresponding amino acid sequence (A2) of BFC06016 (A), and the DNA nucleotide sequence (B1) and corresponding amino acid sequence (B2) of BFC06104 (B).
Figure 4 shows the results of the protein hydrophobicity/hydrophilicity prediction performed with ProParam on the computer-predicted BFC06016 (A) and BFC06104 (B).
Figures 5A and 5B show the results of the protein transmembrane region analysis performed with Tmpred/tmap. The analysis of BFC06016 shows that it has no transmembrane region (Figure 5A); likewise, BFC06104 has no transmembrane region (Figure 5B).
Figures 6A and 6B show, as pepwheel plots, the helical wheel of the amino acid residues in each protein sequence: Figure 6A for BFC06016 and Figure 6B for BFC06104.
Figures 7A and 7B show the content and distribution of amino acids of different properties in each protein sequence, as computed by pepinfo: Figure 7A for BFC06016 and Figure 7B for BFC06104.
Figure 8 shows that the BFC06016 (A) and BFC06104 (B) genes map to the DNA sequence of human chromosome 19.
Figure 9 compares the amino acid sequences of BFC06016, BFC06104 and the known apolipoprotein A1BP. An asterisk (*) marks positions where the amino acid is identical in all three; a blank ( ) marks positions where the amino acids differ; a period (.) marks a semi-conservative substitution; a colon (:) marks a conservative substitution. The amino acid homology between BFC06016 and apolipoprotein A1BP is 40.0%; that between BFC06104 and apolipoprotein A1BP is 41.5%.
Figure 10 is a flow chart for the total synthesis of the nucleotide sequence of a computer-predicted novel gene.
Figure 11 shows quantitative expression profiles of these genes in different human tissues and cell lines, measured with qPCR primers designed from the DNA sequence common to the two newly discovered genes.
Figure 12 shows the cloning of the newly discovered genes and the results of protein expression in bacteria.

Detailed Description of the Embodiments
Example 1. Downloading and obtaining the databases required for bioinformatic analysis
Connect the computers, servers and other hardware on the internal network as shown in Figure 1, and complete the initial hardware and basic system commissioning. Connect to the remote NCBI databases and download the latest databases needed for the bioinformatic analysis, including: the human expressed sequence tag (EST) database, the non-redundant protein sequence database, the nucleotide database, the patent protein sequence database, and the human chromosome sequence database. All downloaded databases are then formatted on the local machine, converting them into sequence-format databases that the local programs can read.
Example 2. Collecting and writing the programs
The bioinformatics programs used in the invention that come from public sources or commercial software are mainly the following:
blastall: the BLAST package from NCBI (the US National Center for Biotechnology Information); performs general-purpose sequence alignment.
WU-blast: the BLAST package from Washington University; performs particularly well in searching for and analysing novel genes.
Fasta: the sequence alignment package from EMBL (the European Molecular Biology Laboratory).
cap4/Phrap: sequence assembly software from the University of Washington genome sciences research centre.
Tmpred: predicts transmembrane regions in protein sequences.
Signalp: predicts signal peptides in protein sequences.
clustalw: multiple sequence alignment software.
pressdb: database editing software; the nucleotide sequence database formatter specific to WU-blast.
sim4: aligns expressed sequences against chromosomal genomic sequences.
im_index: database editing software; builds indexes on sequence databases so that large databases remain workable.
setdb: database editing software; the protein sequence database formatter specific to WU-blast.
remap: restriction site analysis software.
restrict: restriction digest statistics software.
showorf: DNA sequence translation software.
pepinfo: plots the content of amino acids of different properties in a protein sequence.
pepstats: reports the content of each amino acid in a protein sequence, together with the molecular weight, isoelectric point, charge and absorbance at 280 nm.
pepwheel: plots the helical wheel of all amino acid residues in a protein sequence.
Proparam: mainly used for an overall assessment of protein hydrophilicity/hydrophobicity.
Tmap: plots the transmembrane regions of a protein.
The computer programs written to carry out the invention are mainly the following:
tbl2fasta_n / fasta2tbl_n: sequence format converters; convert sequences between FASTA format and tabular format.
gb2fasta: sequence format converter; converts GenBank-format sequences to FASTA format.
drawBlast: plots BLAST results; produces a schematic alignment diagram from BLAST output.
ed_cap4: a recompiled Cap4 program that configures the cap4 runtime environment automatically.
extractcontigs: converts the score-matrix data output by cap4 into FASTA-format files.
im_delete: database editing software; deletes any sequence from a database.
im_insert: database editing software; inserts sequences into a sequence database.
im_retrieve: database editing software; retrieves sequences from a large database, singly or in batches.
pepsigp: a recompiled Signalp; the original program could only predict signal peptides one sequence at a time, and this version runs fully automated batch predictions.
primers_for_fulllength_clone: batch primer design software.
ps_signalp: data parser; parses the output of pepsigp.
ps_scan: protein active site / functional domain analysis software.
translate: database editing software; DNA sequence translation program.
tt_comp_dna: database editing software; DNA reverse-complement program.
tt_cycle: helper software; wraps programs that cannot run unattended so that the whole workflow can be automated.
tt_fasty_1: a modified fasty program in which the complex parameters and some empirical values are preset, so that fasty can be combined with tt_cycle for convenient operation.
tt_get: retrieves DNA or protein sequences from temporary databases that have not been indexed.
tt_pblast: parser for blastn results; allows large volumes of output to be analysed automatically.
tt_sub_seq: sequence editing software; extracts a given fragment from a sequence.
tt_subseq_genome: extracts fragment sequences from genomic sequences.
tt_tmpred: a re-optimized protein transmembrane region predictor; the modified version supports batch analysis.
tt_tmpred_p: data parser; parses the analysis results produced by tt_tmpred.
tt_zip_2: sequence editing software; merges two simple sequence fragments and removes the part they share.
biofaseqindex: database editing software; builds an index for FASTA-format databases.
biogbseqindex: database editing software; builds an index for GenBank-format databases.
gb2cds: sequence editing software; extracts the CDS sequences from GenBank-format sequence files.
parser_bx: parser for the output of blastn, blastp, blastx and similar programs.
parser_fasta: parser for the output of the fasty alignment program.
rfetch: database operation software; fetches sequence data from GenBank directly over the network.
lfetch: database operation software; fetches sequence data from the local databases directly over the local network.
Together, these software systems form the basis of the invention.
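The source does not reproduce the in-house programs themselves. As an illustration of what a FASTA-to-table converter of the kind described for fasta2tbl_n / tbl2fasta_n might do, here is a minimal Python sketch. The function names, the tab-separated "header, sequence" layout and the 60-column line width are assumptions made for this example, not details taken from the patent.

```python
from typing import Iterable, Iterator, Tuple


def fasta_to_rows(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_table(text: str) -> str:
    """One record per line: header, a tab, then the unwrapped sequence (assumed layout)."""
    return "".join(f"{header}\t{seq}\n" for header, seq in fasta_to_rows(text))


def rows_to_fasta(rows: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Write (header, sequence) pairs back out as FASTA, wrapping at `width` columns."""
    out = []
    for header, seq in rows:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"
```

A one-record-per-line table is convenient for the length filtering and batch processing described in Example 3, because each sequence can be handled with ordinary line-oriented tools.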
Example 3. Workflow for obtaining novel genes
The flow chart of the novel gene discovery method of the invention, the protein sequence clustering flow chart, and the overall workflow of the computer analysis programs are shown in Figures 2A, 2B and 2C.

First, a script parses the patent protein database and extracts every protein sequence of 500 aa or less, preferably 400 aa or less, more preferably 300 aa or less (programs: fasta2tbl_n, tbl2fasta_n). All of these sequences are passed to pepsigp for secretory signal peptide analysis, and the output is parsed by ps_signalp, yielding the protein sequences that contain a secretory signal peptide and those that do not. Using tt_cycle together with tt_tmpred, every signal-peptide-containing sequence is then subjected to transmembrane region prediction. The output is piped directly to tt_tmpred_p for parsing, and the sequences are divided into three broad classes (see Figure 2B): sequences with a secretory signal peptide and no transmembrane region; sequences with a secretory signal peptide and a transmembrane region; and, among the latter, sequences with 5 to 8 transmembrane regions (at least 5 and at most 8).

The amino acid sequence fragments thus obtained are used as queries in a tblastn search against the human expressed sequence tag database. This returns expressed sequence tags with a defined degree of match (sequence similarity of 15% to 95%, preferably 20% to 90%, more preferably greater than 25% and less than 90%, with the mismatches preferably distributed as evenly as possible over the whole matched region). For example, with the parameters set to B=50000; V=50000; S=300, all sequence fragments meeting the preferred criteria can be retrieved. The output is piped to tt_pblast for parsing, all expressed sequence tags that satisfy the preferred criteria are collected, and a script then replaces and filters their poly-A and poly-T stretches.

The environment required to run cap4 is then set up. fastaclust2caml converts all of these FASTA-format sequences into files in an XML data-exchange format. Cap4 and Phrap are each run to assemble the sequences fully, after which extractcontigs converts the assembled data files back to FASTA format and the sequences are merged.

The merged sequences are first compared against the non-secreted protein database by blastx, and parser_bx parses the output and excludes every sequence with a perfect match. The remaining sequences are compared against the patent protein sequence database with tt_fasty_1, and parser_fasta parses the output to give the sequences that remain. Under program-controlled iteration, the remaining sequences are then aligned by blastn against the human chromosome sequence database for verification; aligned by blastn against the patent nucleotide sequence database to correct mutations or deletions in the sequence; aligned by blastn against the nucleotide sequence database and against the human expressed sequence tag database to extend sequences that are too short; and compared by blastx against the non-redundant protein database to check whether the sequence has already been reported. Comparing the results of these five iterated analyses yields the full-length gene sequence.

The full-length gene can then be characterized. Sim4 determines its exact position on the chromosome. ProParam predicts the hydrophobicity/hydrophilicity of the protein. Signalp analyses the protein for a secretory signal peptide. Tmpred and tmap analyse its transmembrane regions. garnier analyses its secondary structure. pepwheel plots the helical wheel of the amino acid residues in the protein sequence. pepinfo computes the content of amino acids of different properties in the sequence and plots their approximate distribution. pepstats reports the content of each amino acid together with the molecular weight, isoelectric point, charge and absorbance at 280 nm. In parallel, PubMed literature searches are used to collect the relevant literature and predict the biological function of the discovered gene.
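The first stage of this workflow (length cutoff, then sorting by signal peptide and transmembrane region count) can be sketched as follows. This is an illustrative sketch only: the `Prediction` record, the label strings and the `classify` function are hypothetical, and the signal peptide and transmembrane calls are assumed to have been produced already by external predictors such as those named above.

```python
from dataclasses import dataclass
from typing import List

MAX_LENGTH = 500  # aa; the text also gives 400 and 300 as preferred cutoffs


@dataclass
class Prediction:
    seq_id: str
    sequence: str
    has_signal_peptide: bool  # assumed to come from a signal peptide predictor
    tm_regions: int           # assumed to come from a transmembrane predictor


def classify(p: Prediction, max_length: int = MAX_LENGTH) -> List[str]:
    """Return the class labels for one sequence; an empty list means it is filtered out."""
    if len(p.sequence) > max_length:
        return []
    if not p.has_signal_peptide:
        return ["no_signal_peptide"]
    if p.tm_regions == 0:
        return ["signal_peptide_no_tm"]
    labels = ["signal_peptide_with_tm"]
    # The third class is a subset of the second: 5 to 8 transmembrane regions.
    if 5 <= p.tm_regions <= 8:
        labels.append("signal_peptide_5_to_8_tm")
    return labels
```

Returning a list rather than a single label reflects the way the text defines the third class as a subset of the second, so a sequence with six transmembrane regions carries both labels.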
Example 4. Obtaining the novel apolipoprotein A1BP-like genes BFC06016 and BFC06104
The workflow described above was run from the server terminal and yielded 35 computer-predicted protein sequences that are candidate novel genes. Two of these resemble the apolipoprotein A1BP gene and are designated BFC06016 and BFC06104. Seq ID No.1 and Seq ID No.2 are the DNA sequence and amino acid sequence of BFC06016, shown in Figure 3(A). Seq ID No.3 and Seq ID No.4 are the DNA sequence and amino acid sequence of BFC06104, shown in Figure 3(B). Both genes have been deposited in the US GenBank database under accession IDs DQ778079 and DQ778080, respectively. The genes were analysed with the in-house analysis programs and with known bioinformatics software. Hydrophobicity/hydrophilicity prediction with ProParam gave GRAVY (grand average of hydropathicity) values, on a scale of +2 to -2, of -0.015 and -0.115, respectively. Signalp was used to analyse the proteins for a secretory signal peptide: Figure 4(A) shows the result for BFC06016, which has no secretory signal peptide, and Figure 4(B) shows that BFC06104 likewise has none. Transmembrane region analysis with Tmpred/tmap showed that BFC06016 has no transmembrane region (Figure 5A), and likewise that BFC06104 has none (Figure 5B). pepwheel was used to plot the helical wheel of the amino acid residues in each protein sequence; Figures 6A and 6B show the results for BFC06016 and BFC06104, respectively. pepinfo was used to compute the content and distribution of amino acids of different properties in each protein sequence; Figure 7A shows the result for BFC06016 and Figure 7B the result for BFC06104.
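For readers unfamiliar with GRAVY: the grand average of hydropathicity is conventionally the sum of the Kyte-Doolittle hydropathy values of all residues divided by the sequence length. The sketch below assumes that this conventional definition is what the ProParam analysis reports; the patent does not state the formula, and the function name is illustrative.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Mean Kyte-Doolittle hydropathy over all residues of a protein sequence."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)
```

Under this definition a negative value indicates a protein that is, on average, hydrophilic, and a positive value one that is hydrophobic.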
Example 5. Comparison of the human genes BFC06016 and BFC06104, which resemble the apolipoprotein A1 binding protein gene, with each other and with the known apolipoprotein A1BP
Sim4 was used to determine the exact position of each full-length gene on the chromosome. The known human apolipoprotein A1BP gene is located on human chromosome 1 (see Ritter et al., Genetics, 79:693-702, 2002). The BFC06016 and BFC06104 genes predicted by the computer analysis method of the invention are both located on human chromosome 19; see Figure 8(A) and Figure 8(B). The full-length cDNA sequences of the BFC06016 and BFC06104 genes, Seq ID No.5 and Seq ID No.6 respectively, were obtained from a human cDNA library. The amino acid sequences of the two genes and of the known human apolipoprotein A1BP are compared in Figure 9. In that figure, an asterisk (*) marks positions where the amino acid is identical in all three; a blank ( ) marks positions where the amino acids differ; a period (.) marks a semi-conservative substitution; a colon (:) marks a conservative substitution. The amino acid homology between BFC06016 and apolipoprotein A1BP is 40.0%; that between BFC06104 and apolipoprotein A1BP is 41.5%.

Example 6. Brief description of the molecular cloning techniques
Routine molecular cloning techniques, including DNA and RNA extraction, agarose and polyacrylamide gel electrophoresis, ligation of DNA fragments, and restriction endonuclease digestion, were carried out according to the literature (Maniatis et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, 1982). For the polymerase chain reaction (PCR) (see Saiki et al., Science, 230:1350, 1985), the enzymes and the thermal cycler were Perkin Elmer products, used according to the manufacturer's protocols. The oligonucleotide primers needed for DNA sequencing and DNA amplification were prepared by a specialized provider. Competent E. coli cells were purchased from GIBCO/BRL. Plasmid DNA purification and DNA fragment recovery were performed with commercial Qiagen purification columns. Pichia pastoris or the BL21(DE3) strain was used for protein expression and preparation.
Example 7. Total synthesis of the BFC06016 and BFC06104 genes
Taking the BFC06106 gene as an example, the following describes how the DNA oligonucleotide primers are designed and how PCR is used to synthesize a computer-predicted gene in full. The synthesis scheme is shown in Figure 10.
Seq ID No.7: 5'-CACATATGAGCAGCGCAGCCGGCCCAGACCCGTCGGAGGCGCCCGAAGAGCGGC-3' (positions 1-57, sense strand, 54 bases)
Seq ID No.8: 5'-GGGCGGCTGCCTCCGCGGTGCTGAGGAAATGCCGCTCTTCGGGCGCCTCCG-3' (positions 37-87, reverse complement, 51 bases)
Seq ID No.9: 5'-CCGCGGAGGCAGCCGCCCTGGAGCGGGAGCTGCTGGAGGATTATCGCTTTGGGCGGC-3' (positions 70-126, sense strand, 57 bases)
Seq ID No.10: 5'-CAGCCACGGCACTAGCATGACCGCACAGCTCCACGAGCTGCTGCCGCCCAAAGCGATA-3' (positions 111-168, reverse complement, 58 bases)
Seq ID No.11: 5'-TGCTAGTGCCGTGGCTGTGACCAAGGCGTTCCCGTTGCCCGCTCTCTCCCGGAAGCAG-3' (positions 152-209, sense strand, 58 bases)
Seq ID No.12: 5'-CTGCCCCGTTCTGCTCCGGGCCACACACGACCAGCACCGTCCTCTGCTTCCGGGAGAG-3' (positions 195-252, reverse complement, 58 bases)
Seq ID No.13: 5'-GCAGAACGGGGCAGTGGGGCTGGTCTGTGCCCGGCACCTGCGGGTGTTTGAGTATGA-3' (positions 239-295, sense strand, 57 bases)
Seq ID No.14: 5'-GCAGGTCCAGCGAGCGTGTGGGGTAGAAGATGGTGGGTTCATACTCAAACACCCGC-3' (positions 278-333, reverse complement, 56 bases)
Seq ID No.15: 5'-CACGCTCGCTGGACCTGCTGCATCGGGACCTGACCACCCAGTGCGAGAAGATGGAC-3' (positions 316-371, sense strand, 56 bases)
Seq ID No.16: 5'-ATGAGCTGCACCTCAGTGGGCAGGTAGCTCAGGAAGGGGATGTCCATCTTCTCGC-3' (positions 358-412, reverse complement, 55 bases)
Seq ID No.17: 5'-CCTGCCCACTGAGGTGCAGCTCATTAACGAAGCCTATGGGCTGGTGGTGGATGCCGT-3' (positions 389-445, sense strand, 57 bases)
Seq ID No.18: 5'-GGGGCCCCCGACCTCGCCCGGCTCCACGCCGGGGCCCAGTACGGCATCCACCACC-3' (positions 431-485, reverse complement, 55 bases)
Seq ID No.19: 5'-GCCGGGCGAGGTCGGGGGCCCCTGCACCCGCGCGCTGGCCACGCTCAAGCTGCTGTCC-3' (positions 464-521, sense strand, 58 bases)
Seq ID No.20: 5'-GCCTGAGGGGATGTCCAGGCTCACGAGGGGGATGGACAGCAGCTTGAGCGTGGCC-3' (positions 500-554, reverse complement, 55 bases)
Seq ID No.21: 5'-CATCCCCTCAGGCTGGGACGCAGAGACCGGCAGCGATTCGGAGGACGGGCTGCGGCCTG-3' (positions 542-600, sense strand, 59 bases)
Seq ID No.22: 5'-GCGCAGCGCTTGGGCGCCGCGAGAGACACCAGCACGTCAGGCCGCAGCCCGTCCTCCGA-3' (positions 579-637, reverse complement, 59 bases)
Seq ID No.23: 5'-CGTGCTGGTGTCTCTCGCGGCGCCCAAGCGCTGCGCTGGCCGCTTCTCCGGGCGCCACC-3' (positions 602-660, sense strand, 59 bases)
Seq ID No.24: 5'-CTTGCGGCGCACGTCATCGGGCACGAACCTGCCGGCCACGAAGTGGTGGCGCCCGGAGA-3' (positions 646-704, reverse complement, 59 bases)
Seq ID No.25: 5'-TGACGTGCGCCGCAAGTTCGCTCTGCGCCTGCCGGGATACACGGGCACCG-3' (positions 689-738, sense strand, 50 bases)
Seq ID No.26: 5'-TAGCGGCCGCTCACAGTGCCGCGACGCAGTCGGTGCCCGTGTATCCCGGC-3' (positions 719-768, reverse complement, 50 bases)
Seq ID No.27: 5'-CACATATGATGAGCAGCGCAG-3' (positions 1-21, sense strand, 21 bases)
Seq ID No.28: 5'-TAGCGGCCGCTCACAGTGCCGC-3' (positions 747-768, reverse complement, 22 bases)
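Because the oligonucleotides above alternate between sense and reverse-complement strands, each one should anneal to its neighbour over the region where their stated positions overlap. A simple way to check this is to reverse-complement one oligonucleotide and look for the longest suffix-to-prefix match with the other. The Python sketch below does this for Seq ID No.7 and Seq ID No.8; the function names are illustrative, and the spaces printed inside the sequences are assumed to be line-wrap artifacts and have been removed.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def longest_overlap(left: str, right: str) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0


SEQ_ID_7 = "CACATATGAGCAGCGCAGCCGGCCCAGACCCGTCGGAGGCGCCCGAAGAGCGGC"
SEQ_ID_8 = "GGGCGGCTGCCTCCGCGGTGCTGAGGAAATGCCGCTCTTCGGGCGCCTCCG"

# Seq ID No.8 is listed as a reverse-complement oligo, so flip it to the sense
# orientation before comparing it with the 3' end of Seq ID No.7.
overlap = longest_overlap(SEQ_ID_7, reverse_complement(SEQ_ID_8))
print(overlap)  # 21, matching the shared positions 37-57 given in the list
```

The same check can be run on every adjacent pair in the list to confirm that each oligonucleotide overlaps the next before ordering them.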
The procedure is briefly as follows:
Starting from the first oligonucleotide primer of the DNA sequence to be synthesized, the oligonucleotides are first taken in groups of four, and each group is assembled into a longer DNA fragment by PCR. For example, Seq ID No. 7, Seq ID No. 8, Seq ID No. 9 and Seq ID No. 10 form one group. Each 25 μl reaction in PCR buffer contains the four primers at 100 pM : 1 pM : 1 pM : 100 pM, 20 mM dNTP, an appropriate amount of water and 1 U of T4 DNA polymerase (T4 Taq Polymerase). The thermal cycler runs 25 cycles of 94 °C for 30 seconds, 55 °C for 30 seconds and 72 °C for 30 seconds, followed by a final incubation at 72 °C for 5 minutes; the product is held at 4 °C until the synthesized DNA fragment is purified. This is the first-group product, and the product of every other group is prepared in the same way. The products of each two adjacent groups are then mixed in equal proportions and subjected to 5 PCR cycles in PCR buffer containing Taq enzyme and dNTP, after which the oligonucleotide primers for the two ends are added (for example, when the first-group and second-group products are joined, Seq ID No. 7 and Seq ID No. 14 are added at 100 pM each). The same PCR cycling program is then used to produce the larger DNA fragment, although the 72 °C step in each cycle may be lengthened as appropriate. Following the operating scheme shown in Figure 10, the full designed DNA sequence can be synthesized in this way. The computer-predicted gene sequences BFC06016 and BFC06104 were synthesized and prepared by this procedure, each with an Nde I restriction site at its 5' end. The full-length DNA synthesized by PCR was inserted into the pTA vector, which carries the two sites EcoRI and NotI to the left and right of the insertion site. Sequencing confirmed that the synthesized DNA sequence was correct. The plasmid was named pTA-BFC06016.
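The group-and-merge scheme described above can be sketched as a small scheduling routine. The Python sketch below is an illustration only. It assumes the assembly oligonucleotides are numbered consecutively from Seq ID No. 7 to Seq ID No. 26, and that an unpaired group is carried into the next round unchanged. Neither detail is stated in the text. The function names are hypothetical.

```python
# Plan the hierarchical PCR assembly: groups of four oligonucleotides,
# then repeated merging of adjacent products. The ID range 7..26 and the
# handling of an unpaired group are assumptions made for illustration.

def assembly_rounds(oligo_ids, group_size=4):
    """Return the list of rounds; each round is a list of oligo-ID groups."""
    groups = [oligo_ids[i:i + group_size]
              for i in range(0, len(oligo_ids), group_size)]
    rounds = [groups]
    while len(groups) > 1:
        merged = []
        for i in range(0, len(groups), 2):
            pair = groups[i:i + 2]          # one or two adjacent products
            merged.append([oligo for group in pair for oligo in group])
        groups = merged
        rounds.append(groups)
    return rounds


def outer_primers(group):
    """The end primers added when a merged product is amplified."""
    return group[0], group[-1]


rounds = assembly_rounds(list(range(7, 27)))
for number, groups in enumerate(rounds):
    spans = [f"{g[0]}-{g[-1]}" for g in groups]
    print(f"round {number}: Seq ID No. groups {spans}")
```

Under these assumptions, the first merged product spans Seq ID No. 7 to 14, so `outer_primers` returns 7 and 14 for it, the same end primers the text names for joining the first and second groups.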
Subsequently, the newly discovered genes BFC06016 and BFC06104 were also obtained directly from a human kidney cDNA library by PCR cloning and were verified by DNA sequencing.
Example 8. Expression of the newly discovered genes in different human tissues and cell lines
For quantitative PCR analysis, mRNA from human tissues and cell lines of various origins was used to synthesize cDNA. Each 25 μl reaction contained 100 units of M-MLV reverse transcriptase (Ambion), 0.5 mM dNTPs (Epicentre) and 40 ng/ml random hexamer primers (Fisher). Samples were incubated at 25 °C for 10 minutes, 42 °C for 50 minutes and 70 °C for 15 minutes, diluted to 500 μl and stored at -20 °C. cDNA can also be purchased from commercial suppliers (including the Clontech MTC cDNA library). PCR primers and probes (labeled with 6-FAM at the 5' end and TAMRA at the 3' end) were designed with ABI primer design software, using a DNA sequence shared by the two newly discovered genes to design the detection primers. Oligonucleotide synthesis was carried out by Qiagen, Biosearch Technologies Inc. or Applied Biosystems Inc.
The primers used were Primer-F (nucleotide positions 244-263), Seq ID No. 29: 5'-CTGGAGGATTATCGCTTTGG-3'; Primer-R (nucleotide positions 480-461), Seq ID No. 30: 5'-ATGCAGCAGGTCCAGCGAGC-3'; and the probe (nucleotide positions 281-300), Seq ID No. 31: 5'-6-FAM-AGCTGTGCGGTCATGCTAGT-TAMRA-3'. None of them is homologous to the nucleotide sequence of the known APBP gene, so they can faithfully reflect differences in the expression level and biological activity of the newly discovered genes across human tissues and cell lines.
The qPCR reactions were run on the ABI 7700 Sequence Detection System. Each 25 μl reaction contained 5 μl of cDNA template, 1× TaqMan Universal PCR Master Mix (ABI), PCR primers at 100 nM, probe at 200 nM and 1× VIC-labeled beta-2-microglobulin endogenous control (ABI). The cycling conditions were 50 °C for 2 minutes and 90 °C for 10 minutes, followed by 40 cycles of 95 °C for 15 seconds and 60 °C for 1 minute. Results were analyzed with the sequence detection software (ABI), and the comparative CT method was used to calculate fold differences in gene product.
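The comparative CT calculation mentioned above is commonly written as fold change = 2^(-ΔΔCT). The Python sketch below illustrates that arithmetic. It assumes roughly 100% amplification efficiency for both the target and the beta-2-microglobulin control. The CT values in the example are invented for illustration and are not data from this document.

```python
# Comparative CT (2^-ddCT) fold-change calculation.
# Assumes ~100% amplification efficiency; example CT values are made up.

def fold_change(ct_target_sample: float, ct_reference_sample: float,
                ct_target_calibrator: float, ct_reference_calibrator: float) -> float:
    """Relative expression of the target in a sample versus a calibrator."""
    # Normalize the target to the endogenous control in each tissue.
    delta_sample = ct_target_sample - ct_reference_sample
    delta_calibrator = ct_target_calibrator - ct_reference_calibrator
    # Compare the normalized values; one cycle corresponds to a factor of two.
    delta_delta = delta_sample - delta_calibrator
    return 2.0 ** (-delta_delta)


# Hypothetical example: the target crosses threshold two cycles earlier
# (after normalization) in the sample than in the calibrator tissue.
example = fold_change(24.0, 18.0, 26.0, 18.0)
print(f"fold change relative to calibrator: {example:.1f}")
```

With these illustrative inputs the normalized difference is two cycles, which corresponds to a fourfold higher expression in the sample.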
The results show that the newly discovered genes are expressed at different mRNA levels in different human tissues and cell lines (Figure 11). This suggests that the genes act through a biological mechanism involving tissue specificity and differing expression levels, and that they have potential value for biological applications.
Example 9. Protein expression and antibody preparation
The cloned cDNA fragment was cloned into an expression vector (with a DNA sequence encoding six histidines inserted in the reading frame of the gene). Using a Novagen pET-series vector, expression of the target protein in E. coli BL21(DE3) was induced with IPTG under the control of the T7 promoter, and the protein was purified (Figure 12). The target protein (with or without the 6×His tag) was purified by affinity chromatography on a Ni column. The purified protein was mixed directly with immune adjuvant and injected subcutaneously into rabbits weighing 3-4 kg; immunization was repeated three times at intervals of 15-20 days. The purified protein antigen was then injected intravenously to boost the immune response. The antibody titer obtained was greater than 1:68, preferably 1:500. The resulting antibody can be used directly in immunoassays for biological activity, functional studies and clinical testing, by methods including but not limited to ELISA and Western blot.

Claims

1. A method for discovering new genes, the method comprising the following steps:
1) obtaining all protein sequences of 500 aa or less in length from published protein sequence databases, and converting these sequences into a uniform, recognizable format;
2) performing batch secretory signal peptide analysis on the protein sequences obtained in step 1), to obtain separately the protein sequences that contain a secretory signal peptide and those that do not;
3) performing batch transmembrane region analysis on the protein sequences obtained in step 1), to obtain separately the protein sequences that contain a transmembrane region and those that do not;
4) combining the result sequences obtained in step 2) and step 3), and dividing the sequences into three major classes: sequences that contain a secretory signal peptide and no transmembrane region; sequences that contain a secretory signal peptide and a transmembrane region; and, among the sequences that contain a secretory signal peptide and a transmembrane region, those with 5 to 8 transmembrane regions;
5) using the sequences of each of these three classes to perform sequence alignment analysis against an expressed sequence tag library, and obtaining expressed sequence tags that match to a defined degree, the matching conditions being that the sequence similarity is 15% to 95% and that the mutation points are evenly distributed over the entire sequence;
6) performing clustering and assembly analysis on these expressed sequence tags using different algorithms; and
7) analyzing and comparing the results against the sequences of known databases to obtain new full-length genes.
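Steps 1) to 5) amount to a length filter, a three-way classification and a similarity window. The Python sketch below shows one way those rules could be expressed. It assumes the signal peptide flag and transmembrane region count have already been produced by external predictors, which the claim does not name. The record type, field names and class labels are hypothetical. The requirement that mutation points be evenly distributed is not modeled here, because the claim does not define how evenness is measured.

```python
# Sketch of the filtering and classification rules in steps 1) to 5).
# Predictor outputs are assumed inputs; names and labels are illustrative.
from dataclasses import dataclass

MAX_LENGTH_AA = 500                      # step 1): 500 aa or less
MIN_SIMILARITY, MAX_SIMILARITY = 15.0, 95.0   # step 5): 15% to 95%


@dataclass
class ProteinRecord:
    name: str
    sequence: str
    has_signal_peptide: bool   # result of a signal peptide predictor
    tm_regions: int            # number of predicted transmembrane regions


def classes_for(record: ProteinRecord) -> list:
    """Return the step 4) classes a record belongs to (possibly none)."""
    if len(record.sequence) > MAX_LENGTH_AA or not record.has_signal_peptide:
        return []
    if record.tm_regions == 0:
        return ["signal_peptide_no_tm"]
    classes = ["signal_peptide_with_tm"]
    if 5 <= record.tm_regions <= 8:
        # The third class is a subset of the second.
        classes.append("signal_peptide_5_to_8_tm")
    return classes


def est_hit_passes(percent_similarity: float) -> bool:
    """Step 5) similarity window for an expressed sequence tag hit."""
    return MIN_SIMILARITY <= percent_similarity <= MAX_SIMILARITY


demo = ProteinRecord("demo", "M" * 120, has_signal_peptide=True, tm_regions=6)
print(classes_for(demo), est_hit_passes(62.5))
```

Returning a list of classes rather than a single label reflects the wording of step 4), in which the sequences with 5 to 8 transmembrane regions are drawn from the second class rather than forming a disjoint set.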
2. The method for discovering new genes according to claim 1, wherein the length of the protein sequences in step 1) is 400 aa or less, or 300 aa or less.
3. The method for discovering new genes according to claim 1, wherein the sequence similarity in step 5) is greater than 25% and less than 90%.
4. The method for discovering new genes according to claim 1, wherein the known databases in step 7) include a non-secreted protein database, a patent protein sequence database, a human chromosome sequence database, a patent nucleotide sequence database, a nucleotide sequence database, a human expressed sequence tag database and a non-redundant protein database.
5. The method for discovering new genes according to claim 1, wherein predictive analysis is performed on the gene after the new full-length gene is obtained.
6. The method for discovering new genes according to claim 1, wherein the method is carried out on a computer system platform comprising:
the following software for sequence alignment:
the BLAST package of the US National Center for Biotechnology Information; the BLAST package of Washington University; the sequence alignment package of the European Molecular Biology Laboratory; the ClustalW multiple sequence alignment software;
the following software for amino acid sequence function prediction:
protein active site/functional domain analysis software;
the following software for sequence editing:
software for converting sequences in FASTA format into tabular format; software for converting sequences in GenBank format into FASTA format; a DNA sequence reverse-complement program; a DNA sequence translation program; software for extracting the CDS sequences from GenBank-format sequence files; software for merging two simple sequence fragments and filtering out the repeated part between them;
the following software for database operations:
software for deleting any sequence in a database; software for inserting and adding sequences to a sequence database; software for retrieving particular sequences from large databases singly or in batches; software for retrieving DNA and protein sequences from temporary, unindexed databases; software for retrieving sequence data from GenBank directly over the network; software for retrieving sequence data from a local database directly over the local network; a program for indexing FASTA-format databases; a program for indexing GenBank-format databases; software for retrieving fragment sequences from genomic sequences; software for conveniently retrieving a given fragment of a sequence;
the following software for plotting sequence alignment results:
software that draws an approximate alignment diagram from BLAST result data;
the following software for data parsing:
software for parsing the results of batch transmembrane region prediction on protein sequences; software for parsing the output of programs such as blastn, blastp and blastx; software for parsing the output of the fasty alignment program; software for parsing the result data produced by the batch automated signal peptide prediction program; software for automated machine analysis of large volumes of result output;
the following software for assisting the running of other programs:
software that works with programs that cannot themselves be automated so that the whole run is fully automated;
the following re-optimized software:
software that automatically configures the cap4 runtime environment; software that converts the score matrix data output by cap4 into a FASTA-format file; a program for batch automated signal peptide prediction; software for batch analysis of transmembrane region prediction on protein sequences.
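As an illustration of one of the sequence-editing utilities listed in claim 6, the Python sketch below converts FASTA-format text into a two-column table (name, sequence). It is a sketch under stated assumptions, not the platform's actual tool. The function name is hypothetical, and the record names `SeqID27` and `SeqID28` are labels invented for the example. The two sequences are Seq ID No. 27 and Seq ID No. 28 from this document.

```python
# Convert FASTA text to (name, sequence) rows and print them tab-separated.
# Illustrative only; the function and record names are hypothetical.

def fasta_to_table(fasta_text: str) -> list:
    """Parse FASTA text into a list of (name, sequence) tuples."""
    rows = []
    name, chunks = None, []
    for raw_line in fasta_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                rows.append((name, "".join(chunks)))
            header = line[1:].strip()
            # Use the first word of the header as the record name.
            name = header.split()[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        rows.append((name, "".join(chunks)))
    return rows


EXAMPLE_FASTA = """>SeqID27 forward end primer
CACATATGATG
AGCAGCGCAG
>SeqID28 reverse end primer
TAGCGGCCGCTCACAGTGCCGC
"""

table = fasta_to_table(EXAMPLE_FASTA)
for record_name, sequence in table:
    print(f"{record_name}\t{sequence}\t{len(sequence)}")
```

One row per record makes the sequences easy to load into a spreadsheet or database table, which is presumably the purpose of the conversion utility named in the claim.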
7. Two new genes, having the nucleotide sequences shown in Seq ID No. 1 and Seq ID No. 3, respectively.
8. Use of the new gene according to claim 1 in the preparation of a therapeutic drug or a diagnostic reagent.
9. The use according to claim 8, wherein the new gene is used as a drug or drug target gene for diagnostic or therapeutic purposes related to cardiovascular disease.
10. The use according to claim 9, wherein the drug or drug target is a gene drug or a gene therapy drug target.
11. Two proteins, having the amino acid sequences shown in Seq ID No. 2 and Seq ID No. 4, respectively.
12. Use of the protein according to claim 11 in the preparation of a therapeutic drug or diagnostic reagent related to cardiovascular disease.
13. A specific antibody against the protein according to claim 11, the specific antibody being produced by immunizing an animal.
14. Use of the specific antibody according to claim 13 in the preparation of a clinical diagnostic agent or therapeutic agent, or as a biological reagent.
15. A nucleic acid whose nucleotide sequence has at least 95% sequence homology with Seq ID No. 1 or Seq ID No. 3, wherein the protein encoded by the nucleic acid has the same function as the protein encoded by Seq ID No. 1 or Seq ID No. 3, respectively.
16. A protein whose amino acid sequence has at least 90% sequence homology with Seq ID No. 2 or Seq ID No. 4, wherein the protein has the same function as the protein shown in Seq ID No. 2 or Seq ID No. 4, respectively.
PCT/CN2007/070153 2006-06-21 2007-06-21 A method for identifying novel gene and the resulting novel genes WO2008000186A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2007800202904A CN101460625A (en) 2006-06-21 2007-06-21 A method for identifying novel gene and the resulting novel genes

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200610089339.9 2006-06-21
CNA2006100893399A CN1884521A (en) 2006-06-21 2006-06-21 Method for finding novel gene and computer system platform using same and novel gene

Publications (2)

Publication Number Publication Date
WO2008000186A1 true WO2008000186A1 (en) 2008-01-03
WO2008000186A8 WO2008000186A8 (en) 2009-07-09

Family

ID=37582826

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2007/070153 WO2008000186A1 (en) 2006-06-21 2007-06-21 A method for identifying novel gene and the resulting novel genes

Country Status (2)

Country Link
CN (2) CN1884521A (en)
WO (1) WO2008000186A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109785900A (en) * 2018-12-12 2019-05-21 上海派森诺生物科技股份有限公司 A kind of function of microbial population genetic analysis method based on protein sequence similarity
CN110033826A (en) * 2018-12-10 2019-07-19 上海派森诺生物科技股份有限公司 A kind of analysis method applied to macrovirus group high-flux sequence data
CN111199772A (en) * 2019-12-27 2020-05-26 上海派森诺生物科技股份有限公司 PEDV (porcine epidemic diarrhea Virus) genome analysis method based on next generation sequencing
CN112750501A (en) * 2020-12-29 2021-05-04 上海派森诺生物科技股份有限公司 Optimized analysis method for macrovirome process

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1884521A (en) * 2006-06-21 2006-12-27 北京未名福源基因药物研究中心有限公司 Method for finding novel gene and computer system platform using same and novel gene
CN101930502B (en) * 2010-09-03 2011-12-21 深圳华大基因科技有限公司 Method and system for detection of phenotype genes and analysis of biological information
CN103186716B (en) * 2011-12-29 2017-02-08 上海生物信息技术研究中心 Metagenomics-based unknown pathogeny rapid identification system and analysis method
CN105095623B (en) * 2014-05-13 2017-11-17 中国人民解放军总医院 Screening assays, platform, server and the system of disease biomarkers
CN110019155B (en) * 2017-09-30 2023-04-07 山西医科大学 MicroRNA omics data perturbation platform
WO2020037085A1 (en) * 2018-08-15 2020-02-20 Zymergen Inc. Bioreachable prediction tool with biological sequence selection
US20210313011A1 (en) * 2018-10-17 2021-10-07 Quest Diagnostics Investments Llc Genomic sequencing selection system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004086568A (en) * 2002-08-27 2004-03-18 Hitachi Ltd New gene producing method and its program
US20060069512A1 (en) * 1999-04-15 2006-03-30 Andrey Rzhetsky Gene discovery through comparisons of networks of structural and functional relationships among known genes and proteins
CN1884521A (en) * 2006-06-21 2006-12-27 北京未名福源基因药物研究中心有限公司 Method for finding novel gene and computer system platform using same and novel gene

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DATABASE GENBANK [online] Database accession no. (EAW84827) *
DATABASE GENBANK [online] Database accession no. (XP_001117298) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110033826A (en) * 2018-12-10 2019-07-19 上海派森诺生物科技股份有限公司 A kind of analysis method applied to macrovirus group high-flux sequence data
CN110033826B (en) * 2018-12-10 2023-08-08 上海派森诺生物科技股份有限公司 Analysis method applied to macrovirome high-throughput sequencing data
CN109785900A (en) * 2018-12-12 2019-05-21 上海派森诺生物科技股份有限公司 A kind of function of microbial population genetic analysis method based on protein sequence similarity
CN109785900B (en) * 2018-12-12 2023-05-23 上海派森诺生物科技股份有限公司 Microbial community functional gene analysis method based on protein sequence similarity
CN111199772A (en) * 2019-12-27 2020-05-26 上海派森诺生物科技股份有限公司 PEDV (porcine epidemic diarrhea Virus) genome analysis method based on next generation sequencing
CN111199772B (en) * 2019-12-27 2023-05-23 上海派森诺生物科技股份有限公司 PEDV (porcine reproductive and respiratory syndrome Virus) genome analysis method based on second-generation sequencing
CN112750501A (en) * 2020-12-29 2021-05-04 上海派森诺生物科技股份有限公司 Optimized analysis method for macrovirome process
CN112750501B (en) * 2020-12-29 2024-04-02 上海派森诺生物科技股份有限公司 Optimized analysis method for macro virus group flow

Also Published As

Publication number Publication date
CN1884521A (en) 2006-12-27
WO2008000186A8 (en) 2009-07-09
CN101460625A (en) 2009-06-17

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 200780020290.4

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07721771

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: RU

122 Ep: pct application non-entry in european phase

Ref document number: 07721771

Country of ref document: EP

Kind code of ref document: A1