JP4899024B2

JP4899024B2 - How to populate a data structure for use in evolutionary simulation

Info

Publication number: JP4899024B2
Application number: JP2000594066A
Authority: JP
Inventors: Sergey A. Selifonov; Willem P. C. Stemmer
Original assignee: Codexis Mayflower Holdings, LLC
Priority date: 1999-01-18
Filing date: 2000-01-18
Publication date: 2012-03-21
Anticipated expiration: 2020-01-18
Also published as: CA2337949C; CA2337949A1; JP2003536117A

Description

【０００１】
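(Cross-reference to related applications)
This application is a continuation-in-part of U.S. Patent Application No. 09/416,837, filed October 21, 1999.
【０００２】
This application also claims priority to "METHODS FOR MAKING CHARACTER STRINGS, POLYNUCLEOTIDES AND POLYPEPTIDES HAVING DESIRED CHARACTERISTICS", a PCT application by Selifonov et al. (filed January 18, 1999; filed by the law offices of Jonathan Alan Quine, attorney docket number 02-289-3PC0). The PCT application (filed January 18, 1999) is a continuation-in-part of U.S. Patent Application No. 09/416,375, "METHODS FOR MAKING CHARACTER STRINGS, POLYNUCLEOTIDES AND POLYPEPTIDES HAVING DESIRED CHARACTERISTICS", by Selifonov et al., filed October 12, 1999. U.S. Patent Application No. 09/416,375 is a non-provisional of U.S. Patent Application No. 60/116,447, "METHODS FOR MAKING CHARACTER STRINGS, POLYNUCLEOTIDES AND POLYPEPTIDES HAVING DESIRED CHARACTERISTICS", by Selifonov and Stemmer, filed January 19, 1999, and also a non-provisional of U.S. Patent Application No. 60/118,854, "METHODS FOR MAKING CHARACTER STRINGS, POLYNUCLEOTIDES AND POLYPEPTIDES HAVING DESIRED CHARACTERISTICS", by Selifonov and Stemmer, filed February 5, 1999.
【０００３】
This application also claims priority to "OLIGONUCLEOTIDE MEDIATED NUCLEIC ACID RECOMBINATION", a PCT application by Crameri et al. (filed January 18, 1999; filed by the law offices of Jonathan Alan Quine, attorney docket number 02-296-3PC). The PCT application (filed January 18, 1999) is a continuation-in-part of U.S. Patent Application No. 09/408,392, "OLIGONUCLEOTIDE MEDIATED NUCLEIC ACID RECOMBINATION", by Crameri et al., filed September 28, 1999. U.S. Patent Application No. 09/408,392 is a non-provisional of U.S. Patent Application No. 60/118,813, "OLIGONUCLEOTIDE MEDIATED NUCLEIC ACID RECOMBINATION", by Crameri et al., filed February 5, 1999, and also a non-provisional of U.S. Patent Application No. 60/141,049, "OLIGONUCLEOTIDE MEDIATED NUCLEIC ACID RECOMBINATION", by Crameri et al., filed June 24, 1999.
【０００４】
This application is also related to U.S. Patent Application No. 09/408,393, "USE OF CODON VARIED OLIGONUCLEOTIDE SYNTHESIS FOR SYNTHETIC SHUFFLING", by Welch et al., filed September 28, 1999.
【０００５】
This application claims priority to, and the benefit of, each of these applications, as appropriate, as provided under 35 U.S.C. § 119 and/or 35 U.S.C. § 120. All of these applications are incorporated herein by reference in their entirety for all purposes.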
【０００６】
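(Copyright notice)
A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the U.S. Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
【０００７】
(Statement as to rights to inventions made under federally sponsored research and development)
Not applicable.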
【０００８】
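(Field of the invention)
This invention relates to the field of computer modeling and simulation. In particular, the invention provides novel methods of populating data structures for use in evolutionary modeling.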
【０００９】
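(Background of the invention)
There is an extensive history of the use of computers to simulate and/or investigate the evolution of life, of individual genetic systems, and/or of population genetic/phenotypic systems. The driving force of most artificial life (A-life) simulations is the algorithm by which the artificial organisms evolve and/or adapt to their environment. These underlying algorithms fall into two main categories: learning algorithms (e.g., those typified by neural networks) and evolutionary algorithms (e.g., those typified by genetic algorithms).
【００１０】
Many artificial life researchers, especially those interested in higher-order processes such as learning and adaptation, endow their organisms with a neural net that serves as an artificial brain (see, e.g., Touretzky (1988-1991), Neural Information Processing Systems, volumes 1-4, Morgan Kaufmann, 1988-1991). Neural networks are learning algorithms. They can be trained, for example, to classify images into categories. A typical task is to recognize which letter a given handwritten character corresponds to.
【００１１】
A neural net is composed of input-output devices called neurons, which are organized in a (highly connected) network. Usually the network is organized into several layers: an input layer that receives sensory input, any number of so-called hidden layers that perform the actual computation, and an output layer that reports the results of these computations. Training a neural network involves adjusting the strengths of the connections between the neurons in the net.
【００１２】
The other major type of biologically inspired underlying algorithm is the "evolutionary" algorithm. Whereas learning processes (e.g., neural networks) are metaphorically based on learning processes in individual organisms, evolutionary algorithms are inspired by evolutionary change within populations of individuals. Relative to neural nets, evolutionary algorithms have only recently gained wide acceptance in academic and industrial circles.
【００１３】
Evolutionary algorithms are generally iterative. An iteration is typically referred to as a "generation". The basic evolutionary algorithm traditionally begins with a population of randomly chosen individuals. In each generation, the individuals "compete" among themselves to solve a posed problem. Individuals that perform relatively well are more likely to "survive" to the next generation. Those surviving to the next generation may be subject to small, random modifications. If the algorithm is set up correctly, and the problem is indeed one amenable to solution in this manner, then as the iteration proceeds the population will contain solutions of increasing quality.
【００１４】
The best-known evolutionary algorithm is the genetic algorithm of J. Holland (J. H. Holland (1992) Adaptation in Natural and Artificial Systems, University of Michigan Press 1975, reprinted by MIT Press). Genetic algorithms are widely used in practical contexts (e.g., financial forecasting, management science, etc.). They are particularly well suited to multivariate problems whose solution space is discontinuous ("rugged") and poorly understood. To apply the genetic algorithm, one defines: 1) a mapping from a set of parameter values onto a set of (0-1) bit strings (e.g., character strings), and 2) a mapping from bit strings onto real numbers (the so-called fitness function).
【００１５】
In most evolutionary algorithms, a set of randomly chosen bit strings constitutes the initial population. In the basic genetic algorithm, a cycle is repeated in which the fitness of each individual in the population is evaluated and copies of individuals are made in proportion to their fitness. The use of an "arbitrary", random, or haphazard starting population can strongly bias the evolutionary algorithm away from an efficient, accurate, or concise solution to the problem at hand, particularly where the algorithm is used to model or analyze a biological history or process. Indeed, the only "force" driving the evolutionary algorithm to any solution whatsoever is the fitness determination and the associated selection pressure. A solution may eventually be reached, but because the process starts from a random (e.g., arbitrary) initial state in which the members of the population bear no relationship to each other, the dynamics of the population as the algorithm proceeds reveal little or no information reflecting the dynamics of the simulated system.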
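The basic cycle described above (evaluate fitness, then copy individuals in proportion to fitness) can be sketched as follows. This is a minimal illustration only: the fitness function (fraction of "1" characters), the population size, and the function names are assumptions chosen for the example and do not come from the patent.

```python
import random

def fitness(bits: str) -> float:
    """Illustrative fitness function (an assumption): fraction of '1' characters."""
    return bits.count("1") / len(bits)

def random_population(size: int, length: int, rng: random.Random) -> list[str]:
    """Conventional starting point: randomly chosen, unrelated bit strings."""
    return ["".join(rng.choice("01") for _ in range(length)) for _ in range(size)]

def next_generation(population: list[str], rng: random.Random) -> list[str]:
    """Copy individuals in proportion to their fitness (roulette-wheel selection)."""
    # A tiny offset keeps the weights valid even if every fitness is zero.
    weights = [fitness(ind) + 1e-9 for ind in population]
    return rng.choices(population, weights=weights, k=len(population))

rng = random.Random(0)
population = random_population(size=20, length=16, rng=rng)
for _ in range(5):
    population = next_generation(population, rng)
```

Note that nothing in this loop relates the starting individuals to one another, which is the limitation the background section points out.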
【００１６】
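Moreover, evolutionary algorithms are typically relatively high-level simulations and provide population-level information. Specific genetic information, if present at all, typically exists as an abstract representation of an allele (typically as a single character) or as an allele frequency. Consequently, evolutionary algorithms provide little or no information about events at the molecular level.
【００１７】
Similarly, neural nets and/or cellular automata take essentially artificial constructs as their starting point and use internal rules (algorithms) to approximate biological processes. As a consequence, such models generally mimic processes or metaprocesses, but again afford little or no information or insight regarding events at the molecular level.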
【００１８】
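(Summary of the invention)
This invention provides novel methods of generating "initial" populations suitable for further computational manipulation (e.g., via genetic/evolutionary algorithms). The members of populations generated by the methods of this invention possess varying degrees of "relatedness" or "similarity" to each other, reflecting the degree of covariance found in naturally occurring populations. In addition, unlike the populations used as input in typical evolutionary algorithms, the populations generated by the methods provided herein typically contain detailed information about the individual members, and that information is typically of sufficient complexity to provide a "continuous" (rather than binary) measure of the variability and/or relatedness among members. Indeed, the methods of this invention provide detailed coding of molecular information in the individuals comprising the populations created according to these methods.
【００１９】
Thus, in one embodiment, this invention provides a method of populating a data structure with character strings (e.g., generating a population or library of character strings). The method preferably involves: i) encoding two or more biological molecules into character strings to provide a population of two or more different initial character strings, wherein each of the biological molecules comprises at least about 10 subunits; ii) selecting at least two substrings from the character strings; iii) concatenating the substrings to form one or more product strings of about the same length as one or more of the initial character strings; iv) adding the product strings to a population of strings (the data structure); and v) optionally repeating steps (i) or (ii) through (iv), using one or more of the product strings as initial strings in the population of initial character strings. In certain preferred embodiments, "encoding" comprises encoding one or more nucleic acid sequences and/or one or more amino acid sequences into the character strings. The nucleic acid and/or amino acid sequences can be unknown and/or haphazardly selected, but preferably encode known protein(s). In one preferred embodiment, the biological molecules are selected such that they have at least about 30%, preferably at least about 50%, more preferably at least about 75%, and most preferably at least about 85%, 90%, or even 95% sequence identity with each other.
【００２０】
In one embodiment, the substring(s) are selected such that the ends of the substrings occur in character string regions of about 3 to about 300, preferably about 6 to about 20, more preferably about 10 to about 100, and most preferably about 20 to about 50 characters that have higher sequence identity with the corresponding region of another of the initial character strings than the overall sequence identity between the same two strings. In another embodiment, the selecting step can comprise selecting substrings such that the ends of the substrings occur in predefined motifs of about 4 to about 100, preferably about 4 to about 50, still more preferably about 4 to about 10, even more preferably about 6 to about 30, and most preferably about 6 to about 20 characters.
【００２１】
In one embodiment, the selecting and concatenating can comprise concatenating substrings from two different initial strings such that the concatenation occurs in a region of about 3 to about 20 characters having higher sequence identity between the two different initial strings than the overall sequence identity between them. The selecting can also comprise aligning two or more of the initial character strings to maximize pairwise identity between two or more substrings of the character strings, and selecting a character that is a member of an aligned pair as the end of one substring.
【００２２】
In certain embodiments, the "adding" step involves calculating the theoretical pI, pK, molecular weight, hydrophobicity, secondary structure, and/or other properties of the protein encoded by the character string. In one preferred embodiment, a product string is added to the population (data structure) only if it has greater than 30%, preferably greater than 50%, more preferably greater than 75%, or 85% sequence identity with an initial string.
【００２３】
The method can further comprise randomly altering one or more characters of the character strings. This can be accomplished by any of a number of methods including, but not limited to, introducing a random string into the initial string population and/or utilizing a stochastic operator as described herein. In certain preferred embodiments, the operations described above are performed in a computer.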
【００２４】
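In another embodiment, this invention provides a computer program product comprising computer code that: i) encodes two or more biological molecules into character strings to provide a population of two or more different initial character strings, wherein each of the biological molecules comprises at least about 10 subunits; ii) selects at least two substrings from the character strings; iii) concatenates the substrings to form one or more product strings of about the same length as one or more of the initial character strings; iv) adds the product strings to a population of strings (i.e., populates a data structure); and v) optionally repeats steps (i) or (ii) through (iv), using one or more of the product strings as initial strings in the population of initial character strings. That is, it is a computer program product comprising computer code that performs the operations described herein. The program code can be provided in compiled form, as source code, as object code, as an executable, and so forth. The program can be provided on any convenient medium (e.g., magnetic media, optical media, electronic media, magneto-optical media, etc.). The code can also reside on a computer (e.g., in memory (dynamic or static), on a hard drive, etc.).
【００２５】
In another embodiment, this invention provides a system for generating labels (tags) and/or music derived from the sequences of biological molecules. The system comprises an encoder for encoding two or more initial strings from biological molecules (e.g., nucleic acids and/or proteins); an isolator for identifying and selecting substrings from the two or more strings; a concatenator for concatenating the substrings; a data structure for storing the concatenated substrings as a population of strings; a comparator for measuring the number and/or variability of the population of strings and determining when sufficient strings are present in the population; and a command writer for writing the population of strings into a raw string file. In a preferred embodiment, the isolator comprises a comparator for aligning the two or more initial strings and determining regions of identity between them. Similarly, the comparator can comprise a means for calculating sequence identity, and the isolator and comparator can optionally share this means. In a preferred embodiment, the isolator selects substrings such that the ends of the substrings occur in string regions of about 3 to about 100 characters that have higher sequence identity with the corresponding region of another of the initial character strings than the overall sequence identity between the same two strings.
【００２６】
In another embodiment, the isolator selects the substrings such that the ends of the substrings occur in predefined motifs of about 4 to about 100, preferably about 4 to about 50, still more preferably about 4 to about 10, even more preferably about 6 to about 30, and most preferably about 6 to about 20 characters. In one embodiment, the isolator and concatenator, individually or in combination, concatenate substrings from two different initial strings such that the concatenation occurs in a region of about 3 to about 300, more preferably about 5 to about 200, and most preferably about 10 to about 100 characters having higher sequence identity between the two different initial strings than the overall sequence identity between them. In one preferred implementation, the isolator aligns two or more of the initial character strings to maximize pairwise identity between two or more substrings of the character strings, and selects a character that is a member of an aligned pair as the end of one substring.
【００２７】
The comparator can impose any of a wide variety of selection criteria. Thus, in various embodiments, the comparator can calculate the theoretical pI, pK, molecular weight, hydrophobicity, secondary structure, and/or other properties of the encoded protein. In one preferred embodiment, the comparator adds a string to the data structure only if the string has greater than 30% identity with an initial string.
【００２８】
The system can optionally comprise an operator that randomly alters one or more characters of the character strings. In certain embodiments, such an operator can randomly select and alter one or more occurrences of a particular preselected character in the character strings. Preferred data structures in this system store encoded (or deconvolved) nucleic acid sequences and/or encoded or deconvolved amino acid sequences.
【００２９】
A further understanding of the invention can be had from the detailed discussion of specific embodiments below. For purposes of clarity, this discussion refers to devices, methods, and concepts in terms of specific examples. However, the methods of the present invention can operate within a variety of types of logic devices. It is therefore intended that the invention not be limited except as provided in the appended claims (as construed under the doctrine of equivalents).
【００３０】
Furthermore, it will be understood that logic systems can include a wide variety of different components and different functions in a modular fashion. Different embodiments of a system can include different mixtures of elements and functions, and can group various functions as parts of various elements. For purposes of clarity, the invention is described in terms of systems that include many different innovative components and innovative combinations of components. No inference should be drawn that the invention is limited to combinations containing all of the innovative components listed in any illustrative embodiment in this specification.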
【００３１】
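(Definitions)
The terms "character string", "word", "binary string", and "encoded string" denote any entity capable of storing sequence information (e.g., the subunit structure of a biological molecule, such as the nucleotide sequence of a nucleic acid, the amino acid sequence of a protein, or the sugar sequence of a polysaccharide). In one embodiment, the character string can be a simple sequence of characters (letters, numbers, or other symbols), or it can be a numeric representation of such information in tangible or intangible (e.g., electronic, magnetic, etc.) form. The character string need not be "linear", but can also exist in a number of other forms (e.g., a linked list).
【００３２】
When used in reference to a character string, a "character" refers to a subunit of the string. In a preferred embodiment, a character of a character string encodes one subunit of the encoded biological molecule. Thus, for example, in a preferred embodiment where the encoded biological molecule is a protein, a character of the string encodes a single amino acid.
【００３３】
A "motif" refers to a pattern of subunits comprising a biological molecule. The motif can refer to a subunit pattern of the unencoded biological molecule or to a subunit pattern of an encoded representation of the biological molecule.
【００３４】
The term "substring" refers to a string that is found within another string. The substring can include the full-length "parent" string, but typically the substring represents a portion of the full-length string.
【００３５】
The term "data structure" refers to the organization, and optionally the associated device, for the storage of information, typically multiple "pieces" of information. The data structure can be a simple recordation of the information (e.g., a list), or the data structure can contain additional information (e.g., annotations) regarding the information contained therein, can establish relationships between the various "members" ("pieces" of information) of the data structure, and can provide pointers or be linked to resources external to the data structure. The data structure can be intangible, but is rendered tangible when stored/represented in a tangible medium. The data structure can represent various information architectures including, but not limited to, simple lists, linked lists, indexed lists, data tables, indexes, hash indices, flat file databases, relational databases, local databases, distributed databases, thin client databases, and the like. In preferred embodiments, the data structure provides fields sufficient for the storage of one or more character strings. The data structure is preferably organized to permit alignment of the character strings and, optionally, to store information regarding the alignment and/or string similarities and/or string differences. In one embodiment, this information is in the form of an alignment "score" (e.g., a similarity index) and/or an alignment map showing the alignment of individual subunits (e.g., nucleotides in the case of nucleic acids). The term "encoded character string" refers to a representation of a biological molecule that preserves the desired sequence and/or structural information regarding that biological molecule.
【００３６】
As used herein, "similarity" can refer to a measure of similarity between the encoded representations of molecules (e.g., the initial character strings) or between the molecules represented by the encoded character strings.
【００３７】
When referring to operations on strings (e.g., insertions, deletions, transformations, etc.), it will be understood that the operation can be performed on the encoded representation of a biological molecule, or on the "molecule" prior to encoding so that the encoded representation captures the operation.
【００３８】
When used in reference to a biological molecule, the term "subunit" refers to the characteristic "monomer" of which the biological molecule is composed. Thus, for example, the subunit of a nucleic acid is a nucleotide, the subunit of a polypeptide is an amino acid, the subunit of a polysaccharide is a sugar, and so on.
【００３９】
The terms "pool" and "population" are used interchangeably when used in reference to strings.
【００４０】
A "biological molecule" refers to a molecule typically found in a biological organism. Preferred biological molecules include biological macromolecules that are typically polymeric in nature, being composed of multiple subunits. Typical biological molecules include, but are not limited to, nucleic acids (formed of nucleotide subunits), proteins (formed of amino acid subunits), polysaccharides (formed of sugar subunits), and the like.
【００４１】
The phrase "encoding a biological molecule" refers to the generation of a representation of that biological molecule that preferably contains the information content of the original biological molecule and can therefore be used to recreate that information content.
【００４２】
The term "nucleic acid" refers to a deoxyribonucleotide or ribonucleotide polymer in either single- or double-stranded form and, unless otherwise limited, encompasses known analogs of natural nucleotides that can function in a manner similar to naturally occurring nucleotides.
【００４３】
A "nucleic acid sequence" refers to the order and identity of the nucleotides comprising a nucleic acid.
【００４４】
The terms "polypeptide", "peptide", and "protein" are used interchangeably herein to refer to a polymer of amino acid residues. The terms apply to amino acid polymers in which one or more amino acids are artificial chemical analogs of a corresponding naturally occurring amino acid, as well as to naturally occurring amino acid polymers.
【００４５】
A "polypeptide sequence" refers to the order and identity of the amino acids comprising a polypeptide.
【００４６】
As used herein, the phrase "adding the product strings to a population of strings" does not require mathematical addition. Rather, it refers to a process of identifying one or more strings as being included within a set of strings. This can be accomplished by a variety of means including, but not limited to, copying or moving the string(s) in question into a data structure that is the population of strings, setting or providing a pointer from the string to a data structure that represents the population of strings, setting a flag associated with the string that indicates its inclusion in a particular set, or simply designating a rule that the string(s) so produced are included in the population.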
【００４７】
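(Detailed description)
(I. Generating a population of character strings)
This invention provides novel computational methods for generating representations of actual or theoretical populations of entities suitable for use as initial (or mature/processed) populations in evolutionary models, more preferably in evolutionary models typified by genetic algorithms. When initialized to reflect the characteristics of particular biological organisms, the entities generated by the methods of this invention each contain significant information regarding the underlying molecular biology (e.g., representative amino acid or nucleic acid sequences), and thereby allow models based on genetic or other algorithms to provide information regarding evolutionary processes at an unprecedented level, namely the molecular level.
【００４８】
In a particularly preferred embodiment, the methods of this invention generate a population of character strings, where each character string represents one or more biological molecules. Using a few strings as "seeds", the invention generates a large population of strings bearing an "evolutionary" relationship to the initial seed members. In contrast to traditional genetic algorithms, in which the initial set of members is arbitrary, random/haphazard, or selected for mathematical or representational convenience, the populations generated by the methods of this invention are, in preferred embodiments, derived from known, existing biological "precursors" (e.g., particular nucleic acid sequences and/or polypeptide sequences).
【００４９】
In a preferred embodiment, the invention involves the following steps:
1) identifying/selecting two or more biological molecules;
2) encoding the biological molecules into character strings;
3) selecting at least two substrings from the character strings;
4) concatenating the substrings to form one or more product strings of about the same length as one or more of the initial character strings;
5) adding the product strings to a collection of strings, which may be the set of initial strings or a separate set;
6) optionally introducing additional variation into the resulting string set;
7) optionally applying selection pressure to the resulting string set;
8) optionally repeating steps (2) or (3) through (7), using one or more of the product strings as initial strings in the collection of initial character strings.
Each of these operations is described in more detail below.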
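The steps listed above can be sketched end to end as follows. This is only one possible reading under stated assumptions: equal-length seed strings, uniform segmentation for step 3, position-preserving random joining for step 4, and no variation or selection pressure (steps 6 and 7 are omitted). The function name `populate` and all parameters are hypothetical.

```python
import random

def populate(seeds, n_segments=4, n_new=10, rounds=2, seed=0):
    """Grow a population of strings from a few encoded seed strings."""
    rng = random.Random(seed)
    population = list(seeds)   # steps 1-2: seeds are assumed already encoded
    parents = list(seeds)
    for _ in range(rounds):
        # Step 3: cut each parent into n_segments ordered, non-overlapping substrings.
        pools = [[] for _ in range(n_segments)]
        for s in parents:
            step = len(s) // n_segments
            for i in range(n_segments):
                end = (i + 1) * step if i < n_segments - 1 else len(s)
                pools[i].append(s[i * step:end])
        # Step 4: join one substring per position to form product strings.
        products = ["".join(rng.choice(pool) for pool in pools) for _ in range(n_new)]
        # Step 5: add the product strings to the collection.
        population.extend(products)
        # Step 8: reuse the product strings as initial strings for the next round.
        parents = products
    return population

population = populate(["AAAAAAAA", "CCCCCCCC"])
```

Because every product string is assembled from segments of the seeds, each member of the resulting population remains related to the seeds, unlike a randomly initialized population.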
【００５０】
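(II. Encoding one or more biological molecules into character strings)
The methods of this invention typically utilize one or more "seed" members. The "seed" members are preferably representations of one or more biological molecules. Thus, the initial stages of preferred embodiments of this invention involve selecting two or more biological molecules and encoding the biological molecules into one or more character strings.
【００５１】
(A. Identifying/selecting "seed/initial" biological molecules)
Virtually any biological molecule can be used in the methods of this invention. However, preferred biological molecules are "polymeric" biological macromolecules comprising a multiplicity of "subunits". Biological macromolecules particularly well suited to the methods of this invention include, but are not limited to, nucleic acids (e.g., DNA, RNA, etc.), proteins, glycoproteins, carbohydrates, polysaccharides, certain fatty acids, and the like.
【００５２】
Where nucleic acids are selected, the nucleic acids can be single-stranded or double-stranded, although it will be appreciated that a single strand is sufficient to represent/encode a double-stranded nucleic acid. The nucleic acids are preferably known nucleic acids. Such nucleic acid sequences can be readily determined from a number of sources including, but not limited to, public databases (e.g., GenBank), proprietary databases (e.g., Incyte databases), scientific publications, commercial or private sequencing laboratories, and in-house sequencing laboratories.
【００５３】
The nucleic acid molecules can include genomic nucleic acids, cDNAs, mRNAs, artificial sequences, natural sequences having modified nucleotides, and the like.
【００５４】
In one preferred embodiment, the two or more biological molecules are "related" but not identical. Thus, the nucleic acids may represent the same gene or genes but differ in the strain, species, genus, family, order, phylum, or kingdom from which they are derived. Similarly, in one embodiment, the proteins, polysaccharides, or other molecules are the same protein, polysaccharide, or other molecule, with differences between the molecules resulting from the fact that they are selected from different strains, species, genera, families, orders, phyla, or kingdoms.
【００５５】
The biological molecules can represent single gene products (e.g., an mRNA, a cDNA, a protein, etc.) or they can represent collections of gene products and/or non-coding nucleic acids. In certain preferred embodiments, the biological molecules represent members of one or more particular metabolic pathways (e.g., regulatory, signaling, or synthetic pathways). Thus, for example, the biological molecules can include the members comprising an entire operon or a complete biosynthetic pathway (e.g., the lac operon, protein: B-DNA gal operon, the colicin A operon, the lux operon, polyketide synthesis pathways, etc.).
【００５６】
In certain preferred embodiments, the biological molecules can include any number of different genes, proteins, and the like. Thus, in certain embodiments, the biological molecules can include the total nucleic acid (e.g., genomic DNA, cDNA, or mRNA), total protein, or total lipid, etc., of an individual, or of multiple individuals of the same or different species.
【００５７】
In certain embodiments, the biological molecules can reflect a "representation" of the total population of molecules of a species. High-level representations of populations of molecules are accomplished in the laboratory and can be performed in silico according to the methods of this invention. Methods of representing complex molecules or populations of molecules are found in Representational Difference Analysis (RDA) and related techniques (see, e.g., Lisitsyn (1995) Trends Genet. 11(8): 303-307; Risinger et al. (1994) Mol Carcinog. 11(1): 13-18; Michiels et al. (1998) Nucleic Acids Res. 26(15): 3608-3610; and references cited therein).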
【００５８】
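Particularly preferred biological molecules for encoding and manipulation in the methods of this invention include proteins and/or the nucleic acids encoding proteins of various classes (e.g., therapeutic proteins such as erythropoietin (EPO); peptide hormones such as insulin and human growth hormone; and growth factors and cytokines such as neutrophil-activating peptide-78, GROα/MGSA, GROβ, GROγ, MIP-1α, MIP-16, MCP-1, epidermal growth factor, fibroblast growth factor, hepatocyte growth factor, insulin-like growth factor, the interferons, the interleukins, keratinocyte growth factor, leukemia inhibitory factor, oncostatin M, PD-ECSF, PDGF, pleiotropin, SCF, c-kit ligand, angiogenic factors (e.g., vascular endothelial growth factors VEGF-A, VEGF-B, VEGF-C, VEGF-D, placental growth factor (PLGF), etc.), growth factors (e.g., G-CSF, GM-CSF), soluble receptors (e.g., IL4R, IL-13R, IL-10R, soluble T-cell receptors, etc.), and the like).
【００５９】
Other preferred molecules for encoding include, but are not limited to, transcription and expression activators. Transcription and expression activators include genes and/or proteins that modulate cell growth, differentiation, regulation, and the like, and are found in prokaryotes, viruses, and eukaryotes, including fungi, plants, and animals. Expression activators include, but are not limited to, cytokines, inflammatory molecules, growth factors, growth factor receptors, and oncogene products; interleukins (e.g., IL-1, IL-2, IL-8, etc.), interferons, FGF, IGF-I, IGF-II, FF, PDGF, TNF, TGF-α, TGF-β, EGK, KGF, SCR/c-kit, CD40L/CD40, VLA-4/VCAM-1, ICAM-1/LFA-1, and hyalurin/CD44; signal transduction molecules and corresponding oncogene products (e.g., Mos, Ras, Raf, and Met); transcriptional activators and suppressors (e.g., p53, Tat, Fos, Myc, Jun, Myb, Rel); and steroid hormone receptors (e.g., receptors for estrogen, progesterone, testosterone, aldosterone, the LDL receptor ligand, and corticosterone).
【００６０】
Preferred molecules for encoding in the methods of this invention also include proteins from infectious or otherwise pathogenic organisms (e.g., proteins characteristic of Aspergillus sp., Candida sp., E. coli, Staphylococci sp., Streptococci sp., Clostridia sp., Neisseria sp., Enterobacteriaceae sp., Helicobacter sp., Vibrio sp., Campylobacter sp., Pseudomonas sp., Ureaplasma sp., Legionella sp., Spirochetes, Mycobacteria sp., Actinomyces sp., Nocardia sp., Chlamydia sp., Rickettsia sp., Coxiella sp., Ehrlichia sp., Rochalimaea, Brucella, Yersinia, Francisella, and Pasteurella; protozoa; and viruses, including (+) RNA viruses, (-) RNA viruses, orthomyxoviruses, dsDNA viruses, retroviruses, and the like).
【００６１】
Still other suitable molecules include inhibitors of transcription, toxins of crop pests, industrially important enzymes (e.g., proteases, nucleases, and lipases), and the like.
【００６２】
Preferred molecules include members of related "families" of nucleic acids or the proteins they encode. Relatedness (e.g., inclusion in or exclusion from the "family") can be determined by protein function and/or by sequence identity with other members of the family. Sequence identity can be determined as described herein, and preferred family members share at least about 30% sequence identity, more preferably at least about 50% sequence identity, and most preferably at least about 80% sequence identity. In certain instances, it is desirable to include molecules that have low sequence identity (e.g., less than about 30%) but significant relatedness. Methods for detecting such relatedness are well known in the bioinformatics literature and typically involve combining molecular folding patterns with sequence/similarity information. One common implementation of such an approach is the "threading algorithm". Threading algorithms detect remote homology by comparing sequences to structural templates. If the structural similarity between target and template is sufficiently large, their relatedness can be detected in the absence of significant sequence similarity. Threading algorithms are well known to those of skill in the art and can be found, for example, in the NCBI Structure Group Threading Package (available from the National Center for Biotechnology Information; see, e.g., http://www.ncbi.nlm.nih.gov/Structure/RESEARCH/threading.html) and in SeqFold (Molecular Simulations, Inc.).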
【００６３】
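(B. Encoding biological molecules into character strings)
The biological molecules are encoded into character strings. In the simplest case, the character string is identical to the character code used to represent the biological molecule. Thus, for example, the character string can comprise the letters A, C, G, T, or U where a nucleic acid is encoded. Similarly, standard amino acid nomenclature can be used to represent polypeptide sequences. It will be recognized, however, that the coding scheme is to some extent arbitrary. Thus, for example, in the case of a nucleic acid, A, C, G, T, and U could be represented by the integers 1, 2, 3, 4, and 5, respectively, and the nucleic acid could be represented as a string of these integers, which is itself a single (albeit typically large) integer. Other coding schemes are also possible. For example, the biological molecule can be encoded into a character string in which each "subunit" of the molecule is encoded into a multi-character representation. Alternatively, various compressed representations are also possible (e.g., where recurrent motifs are represented only once, with appropriate pointers identifying each occurrence).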
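The letter-to-integer coding scheme described above can be sketched as follows. The mapping A, C, G, T, U to 1 through 5 is taken from the text; the function names and the decimal packing into a single integer are assumptions made for illustration.

```python
# Mapping from the text: A, C, G, T, U are represented by 1, 2, 3, 4, 5.
NUCLEOTIDE_CODE = {"A": 1, "C": 2, "G": 3, "T": 4, "U": 5}
DECODE = {value: letter for letter, value in NUCLEOTIDE_CODE.items()}

def encode(sequence: str) -> list[int]:
    """Encode a nucleic acid sequence as a list of integers."""
    return [NUCLEOTIDE_CODE[ch] for ch in sequence.upper()]

def decode(codes: list[int]) -> str:
    """Recover the letter representation, showing the encoding loses no information."""
    return "".join(DECODE[c] for c in codes)

def as_single_integer(codes: list[int]) -> int:
    """Read the string of digits as one (typically large) integer."""
    value = 0
    for c in codes:
        value = value * 10 + c
    return value

codes = encode("ACGT")                 # [1, 2, 3, 4]
packed = as_single_integer(codes)      # 1234
```

The round trip through `decode` reflects the definition of "encoding" given earlier: the representation should allow the original information content to be recreated.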
【００６４】
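The biological molecules also need not be encoded into a data structure that is a discrete/single string. More complex data structures (e.g., arrays, linked lists, indexed structures including, but not limited to, databases or data tables, etc.) can also be used to encode the biological molecule(s).
【００６５】
Essentially any data structure that permits input, storage, and retrieval of a representation of the biological molecule is suitable. While these operations can be accomplished manually (e.g., with pencil and paper, a card file, etc.), preferred data structures are those that can be manipulated optically and/or electronically and/or magnetically, and that therefore permit automated input, storage, and output operations (e.g., by a computer).
【００６６】
(III. Selecting substrings)
In a preferred embodiment, the character strings encoding the biological molecules provide an initial population of strings from which substrings are selected. Typically at least two substrings are selected, with one substring coming from each initial character string. Where there are more than two initial character strings, it is not necessary that every initial character string provide a substring, as long as at least two initial character strings provide such substrings. In preferred embodiments, however, at least one substring is selected from each initial string.
【００６７】
(A. Substring length)
There is virtually no limit on the maximum number of substrings that can be selected from an initial string, other than the theoretical maximum number of strings that can be generated from any given string. Thus, for example, the maximum number of substrings selected from an initial string is the number of strings produced by a complete permutation of the initial string.
【００６８】
However, even with initial strings of relatively modest length, the number of permutations is very large. Thus, in a preferred embodiment, substrings are selected from the initial strings such that the substrings do not overlap. Put another way, in a preferred embodiment, the substrings from any one initial string are selected such that, when joined in the correct order, they reproduce the complete initial string from which they were selected.
【００６９】
Preferred substrings are also selected so as not to be unduly short. Typically a substring is no shorter than the minimum string length necessary to represent one subunit of the encoded biological molecule. Thus, for example, where the encoded biological molecule is a nucleic acid, the substring is long enough to encode at least one nucleotide. Similarly, where the encoded biological molecule is a polypeptide, the substring is long enough to encode at least one amino acid.
【００７０】
In preferred embodiments, the selected substrings can encode at least 2, preferably at least 4, more preferably at least 10, still more preferably at least 20, and most preferably at least 50, 100, 500, or 1000 subunits of the encoded biological molecule.
【００７１】
The substring length can be selected to capture a particular level of biological organization. For example, substrings can be selected that encode an entire gene, cDNA, or mRNA. At a "higher" level of organization, substrings can be selected that encode a series of related genes, cDNAs, mRNAs, etc., such as might be found in an operon or in a regulatory or synthetic pathway. At a still "higher" level of organization, substrings can be selected that encode the total nucleic acid of an individual (e.g., genomic DNA, total RNA, total cDNA). There is virtually no limit on the "level of organization" captured in a substring, as long as the initial string from which the substring is selected encodes a higher level of organization. Thus, where the substring(s) are selected to encode individual genes, the initial string may encode an entire metabolic pathway. Where the substrings are selected to encode the total nucleic acid of an individual, the initial string may encode the total nucleic acid of a population, and so on.
【００７２】
Conversely, substrings can also be selected to encode subunits of a particular level of biological organization. Thus, for example, substrings can be used to select particular domains of a protein, particular regions of a chromosome (e.g., regions characteristically amplified, deleted, or translocated), and the like.
【００７３】
(B. Substring selection algorithms)
Any of a wide variety of approaches can be used to select substrings. The particular approach is determined by the problem being modeled. Preferred selection approaches include, but are not limited to, random substring selection, uniform substring selection, motif-based selection, alignment-based selection, and frequency-biased selection. The same substring selection method need not be applied to every initial character string; rather, different substring selection methods can be used for different initial strings. In addition, multiple substring selection methods can be applied to any one initial character string.
【００７４】
(1. Random substring selection)
In one simple approach, substrings can be selected randomly. Many approaches are available for "random" selection of substrings. For example, where substrings of minimum length "L" are to be selected from an encoded character string of length "M", "cleavage points" can be selected using a random number generator that produces integers (indicating positions along the string) ranging from L to M-L (to avoid short terminal strings). "Internal" substrings of length less than L are discarded.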
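The cleavage-point approach just described can be sketched as follows: cut positions are drawn from the range L to M-L, and any internal piece shorter than L is discarded. The function name, the number of cuts, and the use of Python's `random` module are assumptions made for illustration.

```python
import random

def random_substrings(s: str, min_len: int, n_cuts: int, rng: random.Random) -> list[str]:
    """Cut s at random positions in [L, M-L] and drop pieces shorter than L."""
    m = len(s)
    if m < 2 * min_len:
        raise ValueError("string too short for the requested minimum length")
    # Drawing cuts from [L, M-L] guarantees both terminal pieces have length >= L.
    cuts = sorted({rng.randint(min_len, m - min_len) for _ in range(n_cuts)})
    bounds = [0] + cuts + [m]
    pieces = [s[a:b] for a, b in zip(bounds, bounds[1:])]
    # Internal pieces shorter than L are discarded, as described in the text.
    return [p for p in pieces if len(p) >= min_len]

pieces = random_substrings("ACGT" * 10, min_len=5, n_cuts=4, rng=random.Random(1))
```

Because short internal pieces are discarded, the selected substrings do not always reassemble into the complete initial string.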
【００７５】
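In another approach, each position along the character string is assigned an address (e.g., an integer ranging from 1 to N, where N is the length of the character string). A minimum substring length "L" and a maximum substring length "M" are selected. A random number generator is then used to generate a number "V" ranging from L to M. The algorithm then selects the substring from position 1 to position V, and position V+1 becomes position 1 again. The process is repeated until the initial string is spanned.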
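This second approach can be sketched as a walk along the string in which each step takes a random length V between L and M. The function name is hypothetical, and the handling of the final piece (which may be shorter than L when the string runs out) is an assumption, since the text does not specify it.

```python
import random

def walk_substrings(s: str, min_len: int, max_len: int, rng: random.Random) -> list[str]:
    """Span s with consecutive substrings whose lengths V are drawn from [L, M]."""
    pieces = []
    pos = 0
    while pos < len(s):
        v = rng.randint(min_len, max_len)   # V ranges from L to M inclusive
        pieces.append(s[pos:pos + v])       # positions 1..V relative to pos
        pos += v                            # position V+1 becomes the new position 1
    return pieces

pieces = walk_substrings("ACGTTGCA" * 5, min_len=3, max_len=7, rng=random.Random(2))
```

Unlike the cleavage-point approach, the pieces produced here are non-overlapping and always rejoin, in order, into the original string.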
【００７６】
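Other methods of randomly selecting substrings can readily be devised. For the purposes of this invention, "random" selection does not require that the selection process meet formal statistical requirements for randomness. Pseudorandom or haphazard selection is sufficient in this context.
【００７７】
(2. Uniform substring selection)
In uniform substring selection, the desired number of substrings to be obtained from each initial string is determined. The initial string is then divided uniformly into the desired number of substrings. Where the initial string length does not permit uniform division, one or more shorter or longer substrings may be permitted.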
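Uniform selection can be sketched as follows. When the length does not divide evenly, this sketch makes the leading pieces one character longer; that particular choice, and the function name, are assumptions (the text only says that shorter or longer substrings may be permitted).

```python
def uniform_substrings(s: str, n: int) -> list[str]:
    """Divide s into n consecutive pieces whose lengths differ by at most one."""
    if not 1 <= n <= len(s):
        raise ValueError("n must be between 1 and len(s)")
    base, extra = divmod(len(s), n)
    pieces, pos = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)   # the first `extra` pieces are longer
        pieces.append(s[pos:pos + size])
        pos += size
    return pieces

pieces = uniform_substrings("ABCDEFGHIJ", 3)   # ['ABCD', 'EFG', 'HIJ']
```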
【００７８】
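(3. Motif-based selection)
Substrings can be selected from the initial strings using motif-based selection. In this approach, the initial character strings are scanned for occurrences of a particular preselected motif. Substrings are then selected such that the substring endpoints occur in a predefined relationship to the motif. Thus, for example, the end can be within the motif, or a preselected number of subunits "upstream" or "downstream" of the end of the motif.
【００７９】
The motif can be completely arbitrary, or it can reflect physical factors or properties of the biological molecule. Thus, for example, where the encoded biological molecules are nucleic acids, the motif can be selected to reflect the binding specificity of a restriction endonuclease (e.g., EcoRV, HindIII, BamHI, PvuII, etc.), a protein binding site, a particular intron/exon junction, a transposon, and the like. Similarly, where the encoded biological molecules are proteins, the motif can reflect a protease binding site, a protein binding site, a receptor binding site, a particular ligand, a complementarity determining region, an epitope, and the like.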
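Motif-based selection can be sketched as follows, with the endpoint placed a fixed number of characters into each occurrence of the motif. The example uses the BamHI recognition sequence GGATCC with a cut after the first G; the function name and the offset parameter are assumptions, and only exact, ungapped motif matches are handled.

```python
def motif_substrings(s: str, motif: str, offset: int) -> list[str]:
    """Split s at `offset` characters into every occurrence of `motif`."""
    if not 0 <= offset <= len(motif):
        raise ValueError("offset must fall within the motif")
    cuts = []
    start = s.find(motif)
    while start != -1:
        cuts.append(start + offset)
        start = s.find(motif, start + 1)   # continue scanning, allowing overlaps
    bounds = [0] + cuts + [len(s)]
    # Skip empty pieces that arise when a cut falls at either end of the string.
    return [s[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]

pieces = motif_substrings("AAGGATCCTTGGATCCAA", "GGATCC", 1)
# ['AAG', 'GATCCTTG', 'GATCCAA']
```

Placing the endpoint "upstream" or "downstream" of the motif, as the text also allows, would correspond to an offset outside the motif; this sketch restricts itself to endpoints within it.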
【００８０】
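Similarly, polysaccharides can contain particular sugar motifs, glycoproteins can have particular sugar motifs and/or particular amino acid motifs, and so on.
【００８１】
Motifs need not reflect only the primary structure of the encoded biological molecule. Secondary and tertiary structure motifs are also possible and can be used to delineate substring endpoints. Thus, for example, an encoded protein may contain characteristic α-helix or β-sheet motifs, and the occurrence of such a motif can be used to delineate substring endpoints.
【００８２】
Another kind of "higher-order" motif can be a "meta-motif", as illustrated, for example, by a "fragmentation digestion". In this approach, substring endpoints are not determined by the occurrence of a single motif, but are delineated by the coordinated pattern and spacing of one or more motifs.
【００８３】
Motifs can also be selected/utilized that do not strictly reflect a sequence pattern, but rather reflect the information content of particular domains of the character string. Thus, for example, U.S. Patent No. 5,867,402 describes a computer system and computational method for processing sequence signals by a transformation into an information content weight matrix, represented by R_i(b, l). A second transformation follows, in which a particular sequence signal is applied to the information content weight matrix R_i(b, l) to produce a value R_i, which comprises the individual information content of the particular sequence signal. Other approaches to determining the information content of character strings are also known (see also Staden (1984) Nucleic Acids Res. 12: 505-519; Schneider (1994) Nanotechnology 5: 1-8; Herman et al. (1992) J. Bacteriol. pp. 3558-3560; Schneider et al. (1990) Nucleic Acids Res. 18(20): 6097-6100; Berg et al. (1988) J. Mol. Biol. 200(4): 709-723).
【００８４】
Other contemplated motifs reflect biological signals. Thus, for example, in the case of an encoded nucleic acid, one motif delineating substring endpoints could be a stop codon, a start codon, or a polyadenylation signal, or, in the case of a protein, a methionine, and so on.
【００８５】
The same motif need not be applied to every initial sequence. In addition, multiple motifs, meta-motifs, and/or combinations of motifs and meta-motifs can be applied to any one sequence.
【００８６】
(4. Alignment-based selection)
In another approach, substrings are selected by aligning two or more initial character strings and selecting regions of high identity between the initial strings in which to place the substring endpoints. Thus, for example, after sequence alignment, substrings can be selected such that the substring endpoints occur within (e.g., in the middle of) a region having at least 30%, preferably at least 50%, more preferably at least 70%, still more preferably at least 80%, and most preferably at least 85%, 90%, 95%, or even at least 99% sequence identity over a window at least about 5 subunits long, preferably at least about 10 subunits long, more preferably at least about 20 subunits long, still more preferably at least about 30 subunits long, and most preferably at least about 50, 100, 200, 500, or even 1000 subunits long.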
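The window criterion above can be sketched as follows for two strings that have already been aligned (equal length, with "-" marking gaps). The alignment itself is not computed here; the function names, the gap convention, and the choice of the window midpoint as the endpoint are assumptions made for illustration.

```python
def window_identity(a: str, b: str, start: int, width: int) -> float:
    """Fraction of identical, non-gap positions in one window of an alignment."""
    pairs = zip(a[start:start + width], b[start:start + width])
    return sum(x == y and x != "-" for x, y in pairs) / width

def candidate_endpoints(a: str, b: str, width: int = 5, threshold: float = 0.8) -> list[int]:
    """Midpoints of all windows whose identity meets the threshold."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have equal length")
    return [
        start + width // 2
        for start in range(len(a) - width + 1)
        if window_identity(a, b, start, width) >= threshold
    ]

endpoints = candidate_endpoints("ACGTACGTAC", "ACGTATTTAC", width=5, threshold=0.8)
# [2, 3]
```

Cutting both strings at one of the returned positions places the junction inside a conserved region, which is the stated aim of alignment-based selection.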
【００８７】
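The terms "sequence identity", "percent sequence identity", "percent identity", or percent "homology", in the context of two or more biological macromolecules (e.g., nucleic acids or polypeptides), refer to two or more sequences or subsequences that are the same, or that have a specified percentage of subunits (e.g., amino acid residues or nucleotides) that are the same, when compared and aligned for maximum correspondence, as measured using a sequence comparison algorithm or by visual inspection.
【００８８】
For sequence comparison, typically one sequence acts as a reference sequence, to which test sequences are compared. In a preferred embodiment, when using a sequence comparison algorithm, test and reference sequences are input into a computer, subsequence coordinates are designated if necessary, and sequence algorithm program parameters are designated. The sequence comparison algorithm then calculates the percent sequence identity for the test sequence(s) relative to the reference sequence, based on the designated program parameters.
【００８９】
Alignment and sequence comparison algorithms are well known to those of skill in the art. For example, optimal alignment of sequences for comparison can be conducted by algorithms including, but not limited to, the local homology algorithm of Smith and Waterman (1981) Adv. Appl. Math. 2: 482, the homology alignment algorithm of Needleman and Wunsch (1970) J. Mol. Biol. 48: 443, the search-for-similarity method of Pearson and Lipman (1988) Proc. Natl. Acad. Sci. USA 85: 2444, computerized implementations of these algorithms in commercial modules and/or commercial software packages (e.g., GAP, BESTFIT, FASTA, and TFASTA in the Wisconsin Genetics Software Package, Genetics Computer Group, 575 Science Dr., Madison, WI), or visual inspection (see generally Ausubel et al., supra).
【００９０】
One example of a useful algorithm is PILEUP. PILEUP creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments to show relationship and percent sequence identity. It also plots a tree or dendrogram showing the clustering relationships used to create the alignment. PILEUP uses a simplification of the progressive alignment method of Feng and Doolittle (1987) J. Mol. Evol. 35: 351-360. The method used is similar to the method described by Higgins and Sharp (1989) CABIOS 5: 151-153. The program can align up to 300 sequences, each of a maximum length of 5000 nucleotides or amino acids. The multiple alignment procedure begins with the pairwise alignment of the two most similar sequences, producing a cluster of two aligned sequences. This cluster is then aligned to the next most related sequence or cluster of aligned sequences. Two clusters of sequences are aligned by a simple extension of the pairwise alignment of two individual sequences. The final alignment is achieved by a series of progressive, pairwise alignments. The program is run by designating specific sequences and their amino acid or nucleotide coordinates for regions of sequence comparison and by designating the program parameters. For example, a reference sequence can be compared to other test sequences to determine the percent sequence identity relationship using the following parameters: default gap weight (3.00), default gap length weight (0.10), and weighted end gaps.
【００９１】
Another example of an algorithm suitable for determining percent sequence identity and sequence similarity is the BLAST algorithm, which is described in Altschul et al. (1990) J. Mol. Biol. 215: 403-410. Software for performing BLAST analyses is publicly available through the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/). This algorithm involves first identifying high-scoring sequence pairs (HSPs) by identifying short words of length W in the query sequence that either match or satisfy some positive-valued threshold score T when aligned with a word of the same length in a database sequence. T is referred to as the neighborhood word score threshold (Altschul et al., supra). These initial neighborhood word hits act as seeds for initiating searches to find longer HSPs containing them. The word hits are extended in both directions along each sequence for as long as the cumulative alignment score can be increased. Extension of the word hits in each direction is halted when the cumulative alignment score falls off by the quantity X from its maximum achieved value, when the cumulative score goes to zero or below due to the accumulation of one or more negative-scoring residue alignments, or when the end of either sequence is reached. The BLAST algorithm parameters W, T, and X determine the sensitivity and speed of the alignment. The BLAST program uses as defaults a word length (W) of 11, the BLOSUM62 scoring matrix (see Henikoff and Henikoff (1989) Proc. Natl. Acad. Sci. USA 89: 10915), alignments (B) of 50, expectation (E) of 10, M=5, N=-4, and a comparison of both strands.
【００９２】
In addition to calculating percent sequence identity, the BLAST algorithm also performs a statistical analysis of the similarity between two sequences (see, e.g., Karlin and Altschul (1993) Proc. Natl. Acad. Sci. USA 90: 5873-5787). One measure of similarity provided by the BLAST algorithm is the smallest sum probability (P(N)), which provides an indication of the probability by which a match between two nucleotide or amino acid sequences would occur by chance. For example, a nucleic acid is considered similar to a reference sequence if the smallest sum probability in a comparison of the test nucleic acid to the reference nucleic acid is less than about 0.1, more preferably less than about 0.01, and most preferably less than about 0.001.
【００９３】
The similarity algorithms identified above are intended to be illustrative and not limiting. It will be appreciated that similarity can be determined over the full-length initial character strings or can be restricted to particular subdomains.
【００９４】
(5. Frequency-biased selection)
In frequency-biased substring selection, substrings are selected such that the substring endpoints occur in a particular relationship to a string domain that meets a particular preselected frequency criterion. For example, where it is desired to exclude encoded biological molecules containing highly repeated subunit patterns (e.g., in the case of a nucleic acid, a high concentration of AC repeats such as "ACACACACACAC"), substring selection can be designed to produce an endpoint before the occurrence of a particular repeat density of a particular subunit or subunit motif. In this instance, the repeat density is the number of occurrences of the subunit or subunit motif per character string length, measured in number of subunits or in lengths of the subunit motif, respectively.
【００９５】
Thus, in the example suggested above, substrings can be selected such that the substring endpoint occurs adjacent to a character string region in which the AC motif occurs at a frequency greater than 0.5 (50%) over a length of at least, for example, 4 motif lengths (in this case 8 subunits).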
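The AC-repeat example above can be sketched as a sliding-window density scan. The repeat density follows the definition in the text (occurrences of the motif per window length measured in motif lengths); the function names, the window-sliding strategy, and the choice to report the start of the first qualifying window as the endpoint are assumptions.

```python
def repeat_density(region: str, motif: str) -> float:
    """Occurrences of motif per region length, measured in motif lengths."""
    return region.count(motif) * len(motif) / len(region)

def endpoints_before_repeats(s: str, motif: str = "AC", n_motifs: int = 4,
                             min_density: float = 0.5) -> list[int]:
    """Positions at which a window of n_motifs motif lengths first exceeds min_density."""
    width = n_motifs * len(motif)
    endpoints = []
    i = 0
    while i + width <= len(s):
        if repeat_density(s[i:i + width], motif) > min_density:
            endpoints.append(i)   # place the endpoint just before the dense region
            # Skip ahead until the window leaves this repeat-rich stretch.
            while i + width <= len(s) and repeat_density(s[i:i + width], motif) > min_density:
                i += 1
        else:
            i += 1
    return endpoints

endpoints = endpoints_before_repeats("GGTTGGTT" + "ACACACAC" + "GGTTGGTT")
# [6]
```

In this example the reported endpoint falls two characters ahead of the repeat itself, because the 8-character window first exceeds a density of 0.5 when it covers three of the four AC units.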
【００９６】
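Another example of such selection is substring selection based on the occurrence of a particular subunit at 100% frequency over at least X subunits. Thus, for example, where the encoded biological molecule is a nucleic acid and the subunit is adenosine ("A"), frequency-biased selection can set substring endpoints at the occurrence of a polyadenylation signal (e.g., AAAAAAA). Depending on the design of the frequency-biased substring selection criteria, equivalent results can be obtained using a motif-based selection scheme as described above.
【００９７】
(6. Other criteria)
A number of other criteria can be used to influence and/or determine the selection of particular substrings. Such criteria include the predicted hydrophobicity and/or pI and/or pK of the molecule encoded by the substring. Other criteria include the number of crossovers, the desired fragment size, the distribution of substring lengths, and/or rational information regarding the folding of the molecule(s) encoded by the substrings.
【００９８】
(IV. Concatenating substrings)
Once a population of substrings has been selected from the initial strings, the substrings are concatenated to produce new strings of approximately or exactly the same length as the parent initial strings. The string concatenation can be performed by a wide variety of methods.
【００９９】
In one embodiment, the substrings are concatenated at random to produce "recombined" strings. In one approach to such "random" concatenation, each substring is assigned a unique identifier (e.g., an integer or other identifier). Identifiers are then selected at random from the pool (e.g., using a random number generator), and the substrings corresponding to those identifiers are joined to produce a concatenated string. When the joined substrings reach approximately or exactly the length of the starting character strings, the process starts again to produce another string. The process is repeated until all the substrings have been used. Alternatively, the substrings can be selected without removing them from the "substring pool", and the process is repeated until the desired number of "full-length" strings is obtained.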
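Random concatenation without replacement can be sketched as follows: integer identifiers are shuffled, and substrings are joined in that order until each product string reaches the target length. The function name is hypothetical, and the decision to drop a trailing piece that never reaches the target length is an assumption (the text does not say what happens to leftovers).

```python
import random

def random_concatenate(substrings: list[str], target_len: int,
                       rng: random.Random) -> list[str]:
    """Join substrings in random order into strings of at least target_len."""
    ids = list(range(len(substrings)))   # unique integer identifier per substring
    rng.shuffle(ids)                     # random draw from the pool, without replacement
    products, current = [], ""
    for i in ids:
        current += substrings[i]
        if len(current) >= target_len:   # full length reached: start the next string
            products.append(current)
            current = ""
    return products

products = random_concatenate(["AAAA", "CCCC", "GGGG", "TTTT"], 8, random.Random(0))
```

Sampling with replacement, the alternative mentioned in the text, would replace the shuffle with repeated random draws and stop after a chosen number of products.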
【０１００】
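In a preferred embodiment, however, it is desired to preserve the relative order of the substrings forming the concatenated strings as it exists in the initial strings. This can be accomplished by any of a wide variety of means. For example, each substring selected from a parent string can be "tagged" with an identifier (e.g., a pointer) that identifies the position of that substring in the initial string relative to the positions of the other substrings derived from that parent string. Substrings derived from corresponding positions in other initial strings are assigned similar position identifiers. This approach is illustrated in Figure 2, in which each of three initial strings (designated A, B, and C) gives rise to five substrings numbered 1 through 5. As illustrated, each substring can be uniquely identified (e.g., A1, A2, ... A5, B1, B2, ... B5, C1, C2, ... C5). Concatenated strings can then be produced by randomly selecting a substring from pool 1 (consisting of A1, B1, and C1), from pool 2 (consisting of A2, B2, and C2), and so on through pool 5. This process can be repeated until three strings have been reconstituted.
【０１０１】
In this concatenation scheme, once a substring is concatenated it is removed from the substring pool. However, the concatenation can also be accomplished by "copying" the substring from the pool, thereby using it in the concatenated string while still leaving it available for subsequent concatenations. This generates greater diversity.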
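The position-preserving scheme of pools 1 through 5 can be sketched as follows, covering both variants: removing each substring from its pool once used, or copying it and leaving it available. The function name and the `with_replacement` flag are hypothetical; the labels A1, B1, and so on mirror the example in the text.

```python
import random

def positional_concatenate(parent_segments: list[list[str]], rng: random.Random,
                           with_replacement: bool = False) -> list[str]:
    """parent_segments[p][k] is substring k of parent p; output keeps position order."""
    n_positions = len(parent_segments[0])
    n_out = len(parent_segments)
    # Pool k holds the k-th substring of every parent (e.g. A1, B1, C1).
    pools = [[parent[k] for parent in parent_segments] for k in range(n_positions)]
    if with_replacement:
        # "Copying": a substring stays available for later concatenations.
        return ["".join(rng.choice(pool) for pool in pools) for _ in range(n_out)]
    # "Removal": each substring is used exactly once across the outputs.
    for pool in pools:
        rng.shuffle(pool)
    return ["".join(pool[j] for pool in pools) for j in range(n_out)]

parents = [[f"{name}{k}" for k in range(1, 6)] for name in "ABC"]
products = positional_concatenate(parents, random.Random(0))
```

In the removal variant the three outputs together contain every substring exactly once, whereas the copying variant can reuse a substring in several outputs.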
【０１０２】
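In other embodiments, various alignment and/or similarity algorithms can be used during concatenation to generally preserve the relative order of the substrings. In this approach, substrings are assigned relative positions in the concatenated string by associating regions of high similarity (see, e.g., Figure 3).
【０１０３】
In preferred embodiments, the initially encoded biological molecules bear some relationship to each other. Thus, for example, the encoded molecules may represent members of a particular enzyme family, individuals from a particular population, and so on. The substrings are then expected to share domains of significant similarity. Moreover, important functional domains tend to be conserved, which further increases the similarity of particular domains of the substrings. Aligning regions of high similarity between substrings therefore tends to reconstruct a relative order of the substrings that reflects their order in the initial strings.
【０１０４】
It is not required that perfect order be established in every concatenated character string. It is preferred that a percentage of the concatenated strings (e.g., preferably at least 1 percent, more preferably at least 10 percent, still more preferably at least 20 percent, and most preferably at least 40 percent, at least 60 percent, or at least 80 percent) preserve the original order.
【０１０５】
The use of similarity criteria to reorder substrings is analogous to sequencing by hybridization (SBH) methods, in which similarity algorithms are used to reconstruct a nucleic acid sequence from fragments of the complete sequence (see, e.g., Barinaga (1991) Science 253: 1489; Bains (1992) Bio/Technology 10: 757-758; Drmanac and Crkvenjakov, Yugoslav Patent Application No. 570/87, 1987; Drmanac et al. (1989) Genomics 4: 114; Strezoska et al. (1991) Proc. Natl. Acad. Sci. USA 88: 10089; and Drmanac and Crkvenjakov, U.S. Patent No. 5,202,231).
【０１０６】
It will be appreciated that a particular concatenation alone, or selection and concatenation operations together, can be represented by a particular operator. Operators of this kind are known in genetic algorithms. Thus, for example, a "crossover" (reciprocal translocation) operator can be defined in which substrings at similar positions in two different initial strings are exchanged. Similarly, a "linkage" operator can be defined that links particular substrings so that, in a crossover event, those substrings cross over together (regardless of whether they are adjacent substrings). In view of the foregoing disclosure, other operators will be apparent to those of skill in the art.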
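A crossover operator of the kind described above can be sketched as follows: the substrings occupying the same interval in two strings are exchanged. The function name and the half-open interval convention are assumptions; no linkage operator is shown.

```python
def crossover(a: str, b: str, start: int, end: int) -> tuple[str, str]:
    """Exchange the substrings at positions [start, end) between two strings."""
    if not 0 <= start <= end <= min(len(a), len(b)):
        raise ValueError("interval must lie within both strings")
    child_a = a[:start] + b[start:end] + a[end:]
    child_b = b[:start] + a[start:end] + b[end:]
    return child_a, child_b

children = crossover("AAAAAA", "CCCCCC", 2, 4)   # ('AACCAA', 'CCAACC')
```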
【０１０７】
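(V. Adding the product strings to a collection of strings)
The concatenated strings generated by the methods of this invention are added to a collection of strings that forms the "populated data set". The strings in this collection can be used as initial strings in further iterations of the methods described herein (see Figure 1). "Adding" in this context refers to a process of identifying one or more strings as being included within a set of strings. This can be accomplished by a variety of means including, but not limited to, copying or moving the string(s) in question into a data structure that is the collection of strings, setting or providing a pointer from the string to a data structure that represents the collection of strings, setting a flag associated with the string (indicating its inclusion in a particular set), or simply designating a rule that the string(s) so produced are included in the collection.
【０１０８】
Once one or more concatenated character strings have been generated, selection criteria are optionally imposed to determine whether a concatenated string is to be included in the collection of strings (e.g., as an initial string for a second iteration and/or as an element of the populated data structure). A wide variety of selection criteria can be utilized.
【０１０９】
In one embodiment, a similarity index can be used as a selection criterion. Thus, newly generated concatenated character strings must share a particular predetermined similarity (e.g., greater than 10%, preferably greater than 20% or 30%, more preferably greater than 40% or 50%, and most preferably greater than 60%, 70%, 80%, or even 90%) with each other, and/or with the initial strings (or the molecules they encode), and/or with one or more "reference" strings.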
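A similarity-based selection criterion can be sketched as follows. For simplicity the similarity here is a position-by-position identity without alignment, which is a deliberate simplification and not one of the alignment algorithms cited in the text; the function names and the default threshold are assumptions.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of matching positions, relative to the longer string (no alignment)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / longest

def add_if_similar(candidate: str, initial: list[str], population: list[str],
                   threshold: float = 30.0) -> bool:
    """Add candidate to population only if it exceeds threshold vs. some initial string."""
    if any(percent_identity(candidate, s) > threshold for s in initial):
        population.append(candidate)
        return True
    return False

population: list[str] = []
add_if_similar("ACGA", ["ACGT"], population, threshold=50.0)   # 75% identity: added
add_if_similar("TTTT", ["ACGT"], population, threshold=50.0)   # 25% identity: rejected
```

A production version would typically compute identity from a proper alignment so that insertions and deletions do not destroy the score.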
【０１１０】
選択はまた、配列同一性が極めて低い場合でさえ、「関連性」を評価するアルゴリズムの使用を含み得る。このような方法には、「スレッディング（ｔｈｒｅａｄｉｎｇ）」アルゴリズムおよび／または共分散測定が含まれる。
【０１１１】
その他の選択基準は、結び付けられたストリングによって表される分子がコンピュータにより予測された特定の特性を満足することを要求し得る。従って、例えば、選択基準は、最小または最大の分子量、特定の緩衝系における特定の最小または最大の自由エネルギー、特定の標的分子または表面との最小または最大の接触表面、特定の緩衝系における特定の正味の電荷、予想されたＰＫ、ＰＩ、結合アビディティー、特定の二次もしくは三次形態などを要求し得る。
【０１１２】
なお他の選択基準は、結び付けられたストリングによって表されるその分子が、特定の経験的物理的にアッセイされた特性に合うことを要求し得る。従って、例えば、選択基準は、結び付けられたストリングによって表される分子が特定の温度安定性、酵素活性のレベルを有すること、特定のｐＨの溶液を生成すること、特定の温度および／またはｐＨ至適条件を有すること、特定の溶媒系において最小または最大の可溶性を有すること、最小または最大の親和性で標的分子に結合することなどを要求し得る。特定の選択基準の物理的な決定は、代表的には、結び付けられたストリングによって表されるその分子が、合成され（例えば、化学的に、もしくは組換え法により）るか、または単離されることを要求する。
【０１１３】
物理的系におけるそのような選択基準の適用は、当業者に公知である（例えば、Ｓｔｅｍｍｅｒら（１９９１）ＴｕｍｏｒＴａｒｇｅｔｉｎｇ４：１−４；Ｎｅｓｓら（１９９９）ＮａｔｕｒｅＢｉｏｔｅｃｈｎｏｌｏｇｙ１７：８９３−８９６；Ｃｈａｎｇら（１９９９）ＮａｔｕｒｅＢｉｏｔｅｃｈｎｏｌｏｇｙ１７：７９３−７９７；ＭｉｎｓｈｕｌｌおよびＳｔｅｍｍｅｒ（１９９９）ＣｕｒｒｅｎｔＯｐｉｎｉｏｎｉｎＣｈｅｍｉｃａｌＢｉｏｌｏｇｙ３：２８４−２９０；Ｃｈｒｉｓｔｉａｎｓら（１９９９）ＮａｔｕｒｅＢｉｏｔｅｃｈｎｏｌｏｇｙ１７：２５９−２６４；Ｃｒａｍｅｒｉら（１９９８）Ｎａｔｕｒｅ３９１：２８８−２９１；Ｃｒａｍｅｒｉら（１９９７）ＮａｔｕｒｅＢｉｏｔｅｃｈｎｏｌｏｇｙ１５：４３６−４３８；Ｚｈａｎｇら（１９９７）Ｐｒｏｃ．Ｎａｔｌ．Ａｃａｄ．Ｓｃｉ．，ＵＳＡ、９４：４５０４−４５０９；Ｐａｔｔｅｎら（１９９７）Ｃｕｒｒ．Ｏｐｉｎ．Ｂｉｏｔｅｃｈ．８：７２４−７３３；Ｃｒａｍｅｒｉら（１９９６）ＮａｔｕｒｅＭｅｄ．２：１００−１０３；Ｃｒａｍｅｒｉら（１９９６）ＮａｔｕｒｅＢｉｏｔｅｃｈｎｏｌｏｇｙ１４：３１５−３１９；Ｇａｔｅｓら（１９９６）Ｊ．Ｍｏｌ．Ｂｉｏｌ．２５５：３７３−３８６；Ｓｔｅｍｍｅｒ（１９９６）ＣｒａｍｅｒｉおよびＳｔｅｍｍｅｒ（１９９５）ＢｉｏＴｅｃｈｎｉｑｕｅｓ１８：１９４−１９５；米国特許第５，６０５，７９３号、同第５，８１１，２３８号、同第５，８３０，７２１号、同第５，８３４，２５２号、同第５，８３７，４５８号、ＷＯ９５／２２６２５、ＷＯ９７／００７８、ＷＯ９７／３５９６６、ＷＯ９９／４１４０２；ＷＯ９９／４１３８３、ＷＯ９９／４１３６９、ＷＯ９９４１３６８、ＥＰ０９３４９９９；ＥＰ０９３２６７０；ＷＯ９９２３１０７；ＷＯ９９２１９７９；ＷＯ９８３１８３７；ＷＯ９８２７２３０、およびＷＯ９８１３４８７を参照のこと）。
【０１１４】
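(VI. Introducing further variation)
In certain instances, it is desirable to introduce further variation into the population. This is particularly desirable where repeated iterations of an evolutionary algorithm using the initial population produced by the methods of the invention fail to yield a solution to the modeled problem (e.g., no member meets the selection criteria).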
【０１１５】
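Many methods can be used to introduce variation into the population of strings produced by the methods of the invention. Note that variation can be introduced into the initial strings (the input to the method) or into the concatenated strings (the output). Preferably such variation is introduced before the selection step, but in certain cases it can be introduced after selection (e.g., before a second iteration).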
【０１１６】
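In one approach, a stochastic operator is introduced into the algorithm that randomly/haphazardly alters one or more subunits comprising the encoded molecule. Note that variation can be introduced into the unencoded molecule (which is then encoded into a character string) and/or directly into the encoded character string. A stochastic operator typically invokes two selection processes. One selection process determines which subunit(s) to alter; the other selects/determines what the subunit is altered to. Both selection processes can be stochastic, or one or the other can be deterministic. Thus, for example, the selection of the subunit to "mutate" can be random/haphazard while the mutation is always to the same new/replacement subunit. Alternatively, the particular subunit to be mutated can be predetermined while the choice of the mutated/resulting subunit is random/haphazard. In yet another embodiment, both the selection of the subunit to mutate and the result of the mutation can be random/haphazard.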
【０１１７】
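In a preferred embodiment, the stochastic operator also takes as an input or parameter a "mutation frequency" that sets the average frequency of occurrence of a "mutation". Thus, for example, if the mutation frequency is set to 10%, the stochastic operator permits a mutation to occur in only 1 of every 10 subunits comprising the initial string. The mutation frequency can also be set as a range (e.g., 5% to 10%).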
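A stochastic operator with a mutation frequency parameter might look like the following Python sketch. Both selection processes are random here: each subunit is chosen for mutation independently with probability `frequency`, and the replacement is drawn uniformly from the other symbols of an assumed alphabet. The function name, the alphabet argument, and the per-subunit independence are assumptions made for illustration.

```python
import random


def stochastic_mutation(string, alphabet, frequency, rng=None):
    """Mutate each subunit independently with probability `frequency`.

    `alphabet` is the assumed set of permitted subunits; a mutated subunit
    is always replaced by a different symbol.
    """
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must lie in [0, 1]")
    rng = rng or random.Random()
    mutated = []
    for subunit in string:
        if rng.random() < frequency:
            replacements = [s for s in alphabet if s != subunit]
            if not replacements:
                raise ValueError("alphabet needs at least two distinct subunits")
            mutated.append(rng.choice(replacements))
        else:
            mutated.append(subunit)
    return "".join(mutated)


if __name__ == "__main__":
    # With frequency 0.1, about 1 subunit in 10 is altered on average.
    print(stochastic_mutation("ACGTACGTACGTACGTACGT", "ACGT", 0.1, random.Random(0)))
```

Restricting the operator to a domain amounts to applying it to a slice of the string and splicing the result back in.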
【０１１８】
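The "stochastic operator" need not be applied to all initial strings, or to all substrings comprising an initial string. Thus, in certain embodiments, the action of the stochastic operator is restricted to particular initial strings and/or to particular substrings (e.g., domains) of one or more initial strings.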
【０１１９】
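Where both selection criteria of the stochastic operator are fixed, the operator is no longer stochastic but instead introduces a "directed mutation". Such an operator can, for example, direct that every subunit "A" it encounters be changed to subunit "B". The directed mutation operator can still take a mutation frequency as a parameter/attribute/input. As above, the mutation frequency limits the number of "encountered" subunits that the operator actually transforms.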
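The directed mutation operator can be sketched in Python as follows. The source and target subunits are fixed, and an optional mutation frequency limits how many of the encountered source subunits are actually transformed. The function name and the interpretation of the frequency as a per-encounter probability are assumptions.

```python
import random


def directed_mutation(string, source, target, frequency=1.0, rng=None):
    """Change subunit `source` to `target` wherever it is encountered.

    With frequency < 1.0 only a random fraction of the encountered
    `source` subunits is transformed (an assumed interpretation).
    """
    rng = rng or random.Random()
    return "".join(
        target if subunit == source and rng.random() < frequency else subunit
        for subunit in string
    )


if __name__ == "__main__":
    print(directed_mutation("ABAB", "A", "B"))
```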
【０１２０】
As noted above, it will also be appreciated that the stochastic operator can alter more than one encoded subunit. In certain embodiments, the operator alters multiple encoded subunits, or even entire substrings/domains.
【０１２１】
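Variation can also be introduced by the use of insertion or deletion operators. Insertion and deletion operators are essentially variants of the "stochastic mutation" operator. Instead of transforming one or more subunits, a deletion operator deletes one or more subunits, while an insertion operator inserts one or more subunits. Again, deletion and insertion operators involve two selection processes: one that selects the site of the insertion or deletion, and another that selects the size of the deletion or the identity of the insertion. One or both selection processes can be stochastic. Where both selection processes are predetermined (non-stochastic), the insertion or deletion operator is a directed insertion operator or a directed deletion operator. As with the stochastic operator, the insertion or deletion operator can take a mutation frequency as a parameter/attribute/input.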
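The insertion and deletion operators can be sketched in the same style. The two directed helpers below take a predetermined site and size or insert; the stochastic variant draws both at random. All names, and the choice to cap the deletion size with `max_size`, are illustrative assumptions.

```python
import random


def delete_subunits(string, site, size):
    """Directed deletion: remove `size` subunits starting at `site`."""
    return string[:site] + string[site + size:]


def insert_subunits(string, site, insertion):
    """Directed insertion: place `insertion` before position `site`."""
    return string[:site] + insertion + string[site:]


def stochastic_deletion(string, max_size, rng=None):
    """Stochastic deletion: both the site and the size are chosen at random."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    if not string:
        return string
    rng = rng or random.Random()
    site = rng.randrange(len(string))
    size = rng.randint(1, min(max_size, len(string) - site))
    return delete_subunits(string, site, size)


if __name__ == "__main__":
    shortened = delete_subunits("ACGTAC", 2, 2)
    print(shortened, insert_subunits(shortened, 2, "GT"))
```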
【０１２２】
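In another embodiment, variation is increased by adding one or more randomly or haphazardly generated initial strings that bear no necessary relationship to the initial strings derived from biological molecules. The variation-introducing initial strings can be generated as strictly random or haphazard strings or, in certain embodiments, are generated according to particular predetermined criteria (e.g., frequency of occurrence of particular subunits, minimum and/or maximum degree of similarity to the encoded strings, etc.). The variation-introducing initial strings need not be full-length strings, but can simply comprise one or more substrings. Note that strings or substrings of this nature can also be used to reduce variation. Thus, where a particular molecular domain is "favored", strings or substrings encoding that domain can be added to the population of initial strings.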
【０１２３】
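(VII. Populating the data structure)
In one embodiment, all of the concatenated strings produced by the methods of the invention are used to populate the data structure and/or are used as initial strings in another iteration of the methods described herein. In other embodiments, selection criteria are imposed as described above, and only concatenated strings that meet the selection criteria are used as initial strings and/or used to populate the data structure. The data structure can be populated with the concatenated representations of the encoded molecules used in the above operations, or the concatenated strings can be partially deconvolved to reproduce a simpler encoded representation, or a direct representation, of the encoded biological molecule, and these deconvolved strings can be used to populate the data structure.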
【０１２４】
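In one embodiment, the data structure can be as simple as a piece of paper on which the concatenated strings are written, or a collection of cards with one or more concatenated strings listed on each card. In a preferred embodiment, the data structure is embodied in a medium (e.g., mechanical and/or fluidic and/or optical and/or quantum and/or magnetic and/or electronic) that permits manipulation of the data structure by an appropriately designed computer. In a particularly preferred embodiment, the data structure is formed in computer memory (e.g., dynamic, static, read-only, etc.) and/or in an optical, magnetic, or magneto-optical storage medium.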
【０１２５】
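The data structure, even in computer-accessible form, can simply provide a list of the concatenated strings. Alternatively, the data structure can be structured to preserve relationships among the various "entries". At a simple level, this can entail maintaining a simple identity and/or order of the entries. More elaborate data structures are also available and can provide ancillary structures for indexing and/or sorting and/or maintaining relationships among one or more entries (e.g., concatenated strings) in the data structure. The data structure can additionally contain annotations regarding the entries (e.g., origin, type, physical properties, etc.) or regarding links between an entry and external data sources. Preferred data structures include, but are not limited to, lists, linked lists, tables, hash tables and other indices, flat-file databases, relational databases, and local or distributed computer systems. In a particularly preferred embodiment, the data structure is a data file stored on conventional (e.g., magnetic and/or optical) media or a data file read into computer memory.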
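One way such a data structure could be laid out in computer memory is sketched below in Python: an ordered list of entries, a hash index for lookup, and free-form annotations per entry. The class and field names are hypothetical; the text only requires that entries, their order, and optional annotations be preserved.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One concatenated string plus optional annotations (hypothetical layout)."""
    string: str
    origin: str = ""
    annotations: dict = field(default_factory=dict)


class PopulatedDataStructure:
    """Ordered collection of entries with a hash index on the string itself."""

    def __init__(self):
        self.entries = []   # preserves the order in which strings were added
        self._index = {}    # string -> position in self.entries

    def add(self, string, origin="", **annotations):
        """Add a string once; adding it again returns the existing entry."""
        if string in self._index:
            return self.entries[self._index[string]]
        entry = Entry(string, origin, dict(annotations))
        self._index[string] = len(self.entries)
        self.entries.append(entry)
        return entry

    def __contains__(self, string):
        return string in self._index

    def __len__(self):
        return len(self.entries)


if __name__ == "__main__":
    pool = PopulatedDataStructure()
    pool.add("ACGTAC", origin="crossover", parents=("p1", "p2"))
    print(len(pool), "ACGTAC" in pool)
```

The same entries could equally be written to a flat file or a relational table; the in-memory form is chosen here only to keep the sketch short.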
【０１２６】
（ＶＩＩＩ．プログラムされたデジタル装置における実施態様）
本発明は、適切に構成されたコンピュータデバイスにロードされた場合に、本発明の方法に従って、そのデバイスにデータ構造を居住させる（例えば、結び付けられたストリングのプール／収集物を生成する）論理構造および／またはデータを含む固定された媒体または伝達可能プログラム構成要素において実施され得る。
【０１２７】
図４は、媒体７１７および／またはネットワークポート７１９からの命令を読むことができる論理装置として理解され得るデジタルデバイス７００を示す。装置７００は、その後、その命令を使用して、分子のコードされた表示およびデータ構造の集団の生物学的分子操作のコード化を指向させ得る。本発明を具体化し得る論理装置の１つのタイプは、ＣＰＵ７０７、光学入力デバイス７０９および７１１、ディスクドライブ７１５および必要に応じてモニタ７０５を含む７００に例示されるようなコンピュータシステムである。固定された媒体７１７は、このようなシステムをプログラムするために使用され得、そしてディスクタイプの光学的または磁気的な媒体またはメモリを表し得る。コミュニケーションポート７１９はまた、このようなシステムをプログラムするために使用され得、そして任意のタイプのコミュニケーションコネクションを表し得る。
【０１２８】
本発明はまた、特定の一体化された回路（ＡＳＩＣ）またはプログラム可能な論理デバイス（ＰＬＤ）のアプリケーションの回路内で実施され得る。このような場合、本発明は、本明細書に記載されるように操作されるＡＳＩＣまたはＰＬＤを生成するために使用され得るコンピュータ理解可能な記述子言語で実施され得る。
【０１２９】
本発明はまた、その他のデジタル装置（例えば、カメラ、ディスプレイ、画像編集装置など）の回路または論理プロセス内で実施され得る。
【０１３０】
（ＩＸ．ウェブサイトにおける実施態様）
本発明の方法は、ローカライズされたコンピューティング環境、または分散コンピューティング環境において実現され得る。分散環境において、この方法は、複数のプロセッサーを含む１つのコンピューターまたは多数のコンピューター上で実施され得る。このコンピューターは、例えば、共通バスを通じてリンクされ得るが、より好ましくは、このコンピューターはネットワーク上のノードである。このネットワークは、汎用化したもしくは専用化した、ローカルネットワークまたは広域ネットワークであり得、特定の好ましい実施態様では、コンピューターは、イントラネットまたはインターネットの構成要素であり得る。
【０１３１】
好ましい実施態様では、クライアントシステムは、代表的に、ウェブブラウザを実行し、そしてウェブサーバーを実行するサーバーコンピューターに接続される。このウェブブラウザは、代表的に、ＩＢＭのＷｅｂＥｘｐｌｏｒｅｒ、またはＮｅｔＳｃａｐｅもしくはＭｏｓａｉｃのようなプログラムである。ウェブサーバーは、代表的に、ＩＢＭのＨＴＴＰＤａｅｍｏｎまたは他のＷＷＷデーモンのようなプログラムであるが、それである必要はない。クライアントコンピューターは、ラインを通してかまたはワイアレスシステムを介してサーバーコンピューターと双方向接続される。次いで、このサーバーコンピューターは、本発明の方法を実現するソフトウェアへのアクセスを提供するウェブサイト（サーバーがこのウェブサイトをホスティングする）と双方向接続される。
【０１３２】
イントラネットまたはインターネットに接続されたクライアントのユーザーは、本発明の方法の実現を提供するアプリケーションをホスティングするウェブサイトの部分であるリソースをクライアントに要求させ得る。次いで、サーバープログラムは、特定のリソース（それらは現在利用可能であると想定する）を返答するために要求を処理する。ＵｎｉｆｏｒｍＲｅｓｏｕｒｃｅＬｏｃａｔｏｒ（「ＵＲＬ」）として公知の、標準的な命名規則が適用されている。この規則は、いくつかの形式のロケーション名を含む。これは、現在、例えば以下のようなサブクラスを含む：ＨｙｐｅｒｔｅｘｔＴｒａｎｓｐｏｒｔＰｒｏｔｏｃｏｌ（「ｈｔｔｐ」）、ＦｉｌｅＴｒａｎｓｐｏｒｔＰｒｏｔｏｃｏｌ（「ｆｔｐ」、ゴーファー（ｇｏｐｈｅｒ）、およびＷｉｄｅＡｒｅａＩｎｆｏｒｍａｔｉｏｎＳｅｒｖｉｃｅ（「ＷＡＩＳ」）。リソースがダウンロードされる場合、これはさらなるＵＲＬのリソースを含み得る。従って、クライアントのユーザーは、彼または彼女が具体的に要求しなかった新規なリソースの存在を容易に知ることができる。
【０１３３】
本発明の方法を実現するソフトウェアは、真のクライアント−サーバーアーキテクチャにおいてウェブサイトをホスティングするサーバー上にて、ローカルで実行し得る。従って、クライアントコンピューターのポストは、要求されたプロセスをローカルで実行するホストサーバーに要求し、次いで、結果をクライアントにダウンロードして戻す。あるいは、本発明の方法は、「多層（ｍｕｌｔｉ−ｔｉｅｒ）」形式で実行され得、ここで本方法の構成要素は、クライアントによりローカルで実行される。これは、クライアントによる要求に対してサーバーからダウンロードされるソフトウェア（例えば、Ｊａｖａアプリケーション）により実現され得るか、またはクライアント上に「永久に」インストールされるソフトウェアにより実現され得る。
【０１３４】
１つに実施態様では、本発明の方法を実現するアプリケーションは、フレームへと分割される。このパラダイムにおいて、特徴または機能のコレクションとしてアプリケーションを見るのではなく、代わりに分散したフレームまたはビューのコレクションとしてアプリケーションを見るのに役立つ。例えば、代表的なアプリケーションは、一般的に、一組のメニューアイテム（その各々が特定のフレームを呼び出す−−すなわち、アプリケーションの特定の機能を表すフォーム）を含む。この観点において、アプリケーションは、コードのモノリシック体としてではなく、アプレットのコレクションまたは機能のバンドルとみなされる。この様式において、ブラウザ内から、ユーザーは、ウェブページリンクを選択して、次にアプリケーションの特定のフレーム（すなわち、サブアプリケーション）を呼び出す。従って、例えば、１つ以上のフレームが、１つ以上のキャラクターストリング中に生物学的分子を入力する、および／またはその分子をコードするための機能を提供し得るが、別のフレームは、コードされたキャラクターストリングの多様性を生成するおよび／または増加するためのツールを提供する。
【０１３５】
フレームのコレクションとしてアプリケーションを表現することに加えて、アプリケーションはまた、イントラネットおよび／またはインターネット上の位置（アプリケーションを示すＵＲＬ（ＵｎｉｖｅｒｓａｌＲｅｓｏｕｒｃｅＬｏｃａｔｏｒ）アドレスとして表現される。各ＵＲＬは、好ましくは、２つの特性を含む：データ形式またはＭＩＭＥ（ＭｕｌｔｉｐｕｒｐｏｓｅＩｎｔｅｒｎｅｔＭａｉｌＥｘｔｅｎｓｉｏｎ）形式とともにＵＲＬに関するコンテントデータ（すなわち、どんなデータもサーバー上に保存される）。このデータ形式は、ウェブブラウザが、サーバーから受け取るデータをどのように解釈すべきか（例えば、ビットマップイメージのような．ｇｉｆファイルの解釈）を決定することを可能にする。結局、これは、ブラウザで一旦受入れられたデータの処理の仕方の記述として役割を果たす。バイナリーデータのストリームは、ＨＴＭＬ形式として受入れられる場合、ブラウザは、それをＨＴＭＬページとして描写する。一方、その代わりに、ビットマップの形式で受入れる場合、ブラウザは、それをビットマップイメージとして描画するなどのようである。
【０１３６】
ＭｉｃｒｏｓｏｆｔＷｉｎｄｏｗｓでは、ホストアプリケーションに、あるデータオブジェクト（すなわち、特定の形式のデータ）との関係を登録させる、異なる技術が存在する。ある技術は、アプリケーションについて、あるものについての特定のファイル拡張子との関係（例えば、．ｄｏｃ−−「ＭｉｃｒｏｓｏｆｔＷｏｒｄ書類」）をＷｉｎｄｏｗｓに登録することであり；これは、Ｗｉｎｄｏｗアプリケーションによって採用される最もよく用いられる技術である。ＭｉｃｒｏｓｏｆｔＯｂｊｅｃｔＬｉｎｋｉｎｇａｎｄＥｍｂｅｄｄｅｄ（ＯＬＥ）において採用される別のアプローチは、クラスＧｌｏｂａｌｌｙＵｎｉｑｕｅＩｄｅｎｔｉｆｉｅｒ、すなわちＧＵＩＤ−−（ＧＵＩＤを有する書類をホスティングするために）呼び出すための特定のサーバーアプリケーションを示すための１６バイト識別子の使用である。このクラスＩＤは、特定のＤＬＬ（ＤｙｎａｍｉｃＬｉｎｋＬｉｂｒａｒｙ）またはアプリケーションサーバーに接続されている特定の機器に登録される。
【０１３７】
特定の目的の１つの実施態様において、ホストアプリケーションを書類と関連づけするための技術は、ＭＩＭＥ形式の使用を通じてである。ＭＩＭＥは、書類オブジェクトをパッケージ化するための規格化された技術を提供する、それは、どのアプリケーションが書類をホスティングするのに適切なかを示すＭＩＭＥヘッダを含む。これら書類は、全て、インターネットを通じて転送するのに適するフォーマットで収納される。
【０１３８】
１つの好ましい実施態様において、本発明の方法は、部分的に、本発明の方法の使用に固有のＭＩＭＥ形式の使用を用いて実現される。ＭＩＭＥ形式は、書類（例えば、ＭｉｃｒｏｓｏｆｔＡｃｔｉｖｅＸ書類）をローカルで作成するために必要な情報を含むが、さらに、必要ならば、書類の表示を表現するためのプログラムコードを見つけそしてダウンロードするために必要な情報もまた含む。このプログラムコードが既にローカルに存在する場合、それは、ローカルの複製をアップデートする目的でダウンロードされる必要だけがある。これは、書類の表示を表現するためのダウンロード可能なプログラムコードをサポートする情報を含む新しい書類形式を定義する。
【０１３９】
ＭＩＭＥ形式は、．ＡＰＰのファイル拡張子と関連し得る。．ＡＰＰ拡張子を有するファイルは、ＯＬＥ書類であり、これはＯＬＥＤｏｃＯｂｊｅｃｔによって実現される。．ＡＰＰファイルは１つのファイルであるので、それは、ＨＴＭＬＨＲＥＦを用いてサーバー上に置かれ得そしてリンクされ得る。この．ＡＰＰファイルは、好ましくは以下のデータの断片を含む：（１）ＡｃｔｉｖｅＸオブジェクトのＣＬＤＳＩＤ、これは、本発明の方法の使用に適切な１つ以上のフォームとして実現されるＯＬＥＤｏｃｕｍｅｎｔＶｉｅｗｅｒである；（２）オブジェクトのコードが見出され得るＵＲＬのコードベース、および（３）（必要に応じて）必要とされるバージョン番号。一旦、．ＡＰＰＤｏｃＯｂｊｅｃｔハンドラコードがインストールされ、そしてＡＰＰＭＩＭＥ形式を登録すると、それを使用して、ユーザーのウェブブラウザへと．ＡＰＰファイルをダウンロードし得る。
【０１４０】
サーバー側において、．ＡＰＰファイルは、現実に１つのファイルであるので、ウェブサーバーは、単に要求を受入れ、そしてクライアントにこのファイルを戻す。ＡＰＰファイルがダウンロードされる場合、．ＡＰＰＤｏｃＯｂｊｅｃｔハンドラは、オペレーティングシステムにこの．ＡＰＰファイルに固有のオブジェクトに関するコードベースをダウンロードするように要求する。このシステムの機能は、ＣｏＧｅｔＣｌａｓｓＯｂｊｅｃｔＦｒｏｍＵＲＬ機能を通じて、Ｗｉｎｄｏｗｓにおいて利用可能である。ＡｃｔｉｖｅＸオブジェクトのコードベースがダウンロードされた後、この．ＡＰＰＤｏｃＯｂｊｅｃｔハンドラは、ブラウザにそれ自身の表示を、例えば、Ｅｘｐｌｏｒｅｒ書類サイト上のＡｃｔｉｖａｔｅＭｅ方法を呼び出すことによって、作成することを要求する。次いで、ＩｎｔｅｒｎｅｔＥｘｐｌｏｒｅｒは、ＤｏｃＯｂｊｅｃｔを呼び出して、表示の証拠として実例を示す（それは、ダウンロードされたコードからのＡｃｔｉｖｅＸ表示オブジェクト例を作成することによってなされる）。一旦作成されると、ＡｃｔｉｖｅＸ表示オブジェクトは、ＩｎｔｅｒｎｅｔＥｘｐｌｏｒｅｒにおいて適所で起動される。ＩｎｔｅｒｎｅｔＥｘｐｌｏｒｅｒは、適切なフォームを作成し、そしてフォームの子を制御する。
【０１４１】
一旦このフォームが作成されると、それは、それがその機能を実行するために必要である、もとの任意のリモートサーバーオブジェクトへの接続を確立し得る。この点において、ユーザーは、このフォームで対話し得、このフォームは、ＩｎｔｅｒｎｅｔＥｘｐｌｏｒｅｒフレームに埋め込まれているようである。ユーザーが、違うページに変える場合、ブラウザは、このフォームを最終的に閉じかつ破棄する（ならびに、リモートサーバーに対する任意の未決着の接続も放棄する）責任を想定する。
【０１４２】
１つの好ましい実施態様では、エンドユーザーのデスクトップからの、このシステムへのエントリーポイントは、企業ホームページまたは別の特定のウェブサイトのホームページである。このページは、必要に応じて、従来の様式で、多数のリンクを含み得る。ユーザーがアプリケーションページ（例えば、本発明の方法の機能を提供するページ）への特定のリンクをクリックすることに応じて、ウェブブラウザは、サーバー上に常駐するアプリケーションページ（ファイル）に接続する。
【０１４３】
１つの実施態様では、ユーザーが本発明の方法へのアクセスを要求する場合、このユーザーは、特定のページ形式（例えば、ウェブブラウザにおける（本発明の方法の１つ以上の要素を実行する）アプリケーションの所定の位置での実行のためのアプリケーション（ａｐｐｄｏｃ）ページ）に指向される。各アプリケーションページは、ＵＲＬを使用して位置づけられるので、他のページは、それへのハイパーリンクを有し得る。複数のアプリケーションページは、アプリケーションページへのハイパーリンクを含むカタログページを作成することによってグループ化され得る。ユーザーが、あるアプリケーションページを示すハイパーリンクを選択する場合、ウェブブラウザは、アプリケーションコードをダウンロードし、そしてブラウザ内でページを実行する。
【０１４４】
ブラウザがアプリケーションページをダウンロードする際に、このブラウザ（定義されたＭＩＭＥ形式に基づく）は、ある形式の書類に関するハンドラである、ローカルハンドラを呼び出す。すなわち、詳細には、アプリケーションページは、好ましくはＧｌｏｂａｌｌｙＵｎｉｑｕｅＩｄｅｔｉｆｉｅｒ（ＧＵＩＤ）および書類をホスティングするために呼び出すリモート（ダウンロード可能な）アプリケーションを識別するためのコードベースＵＲＬを含む。アプリケーションページと共に届く書類オブジェクトおよびＧＵＩＤが与えられれば、ローカルハンドラは、ホスティングアプリケーションが既にローカルに常駐しているかどうかを（例えば、Ｗｉｎｄｏｗｓ９５／ＮＴレジストリを検査することによって）確かめるためにクライアント機器を見る。この点で、ローカルハンドラは、（あれば）ローカルコピーを呼び出すことを選択し得るか、またはホストアプリケーションの最新バージョンをダウンロードし得る。
【０１４５】
異なるモデルのダウンロードコードは、市販されている。コードがダウンロードされる場合、「コードベース」仕様（ファイル）は、最初にサーバーから要求される。このコードベース自体は、簡易ＤＬＬファイルから複数の圧縮ファイルを含むＣａｂｉｎｅｔファイル（Ｍｉｃｒｏｓｏｆｔ．ｃａｂファイル）に及び得る。なおさらに、情報（例えば、Ｍｉｃｒｏｓｏｆｔ．ｉｎｆ）ファイルは、ダウンロードされるアプリケーションをインストールする方法をクライアントシステムに指示するために採用され得る。これらの機構は、どのアプリケーションの構成要素が、ダウンロードされるか、そして何時ダウンロードされるかを選択することにおいて、卓越した柔軟性を与える。
【０１４６】
好ましい実施態様について、プログラムコードを実際にダウンロードするために採用される機構そのものが、標準的ＭｉｃｒｏｓｏｆＡｃｔｉｖｅＸＡＰＩ（ＡｐｐｌｉｃａｔｉｏｎＰｒｏｇｒａｍｉｎｇＩｎｔｅｒｆａｃｅ）−コールに依存する。ＡｃｔｉｖｅＸＡＰＩは、ウェブで配布されるアプリケーションに関するネイティブサポートを提供しないが、そのＡＰＩは、プログラムコードの正確なバージョンを位置付け、ローカル機器へそれをコピーし、その整合性を検証し、そしてそれをクライアントオペレーティングシステムに登録するために呼び出され得る。一旦、このコードがダウンロードされると、ハンドラが、書類オブジェクトを表現するために（レジストリが既にインストールされた場合、このレジストリを通じてホスティングアプリケーションを呼び出すのに類似した様式で）今存在するアプリケーションホストを呼び出すことを実行し得る。
【０１４７】
ホスティングアプリケーション（ＯＬＥサーバー）が、クライアントでロードされる以上は、このクライアントシステムは、ブラウザ内でアプリケーションを正しく表現するためにＯＬＥドキュメントビューアーキテクチャを採用し得る。これは、ブラウザのメニューにアプリケーションのメニューを加えるために、および（シングルＡｃｔｉｖｅＸコントロールレクタングル（ｃｏｎｔｒｏｌｒｅｃｔａｎｇｌｅ）−−既述した制限内で実行するのにアプリケーションを要求することとは対照的に）ブラウザのサイズを変える際にアプリケーションのサイズを正しく変えるために、従来のＯＬＥ方法論を用いることを含む。一旦、アプリケーションがクライアントで実行されると、それは例えば、ＲＰＣ（ＲｅｍｏｔｅＰｒｏｃｅｄｕｒｅＣａｌｌ）方法論を使用してリモートロジックを実行し得る。この様式において、リモートプロシージャーとして好適に実現されるロジックも、さらに使用され得る。
【０１４８】
特定の好ましい実施態様では、本発明の方法は、以下の機能を提供する１つ以上のフレームとして実行される。２つ以上の生物学的分子を、キャラクターストリング中にコードして、２つ以上の異なる初期キャラクターストリングのコレクションを提供する機能（ここで、各々の上記生物学的分子は、少なくとも約１０のサブユニットを含む）；キャラクターストリングから少なくとも２つのサブストリングを選択する機能；サブストリングを結び付けて、１つ以上の初期キャラクターストリングとほぼ同じ長さの１つ以上の産物ストリングを形成する機能；およびストリングのコレクションへ産物ストリングを加える（配置する）機能。
【０１４９】
２つ以上の生物学的分子をコードする機能は、好ましくは、１つ以上のウィンドウを提供する。ここで、ユーザーは、生物学的分子の表示を挿入し得る。さらに、コーディング機能はまた、必要に応じて、ローカルネットワークならびに／または、インターネットを通じてアクセス可能な個人のデータベースおよび／もしくは公的なデータベースへのアクセスを提供し、それによって、データベース中に含まれる１つ以上の配列が、本発明の方法へと入力され得る。従って、例えば、１つの実施態様において、エンドユーザーが核酸配列をコーディング機能中に入力する場合、ユーザーは、必要に応じて、ＧｅｎＢａｎｋの検索を要求し、そしてこのような検索によって戻ってきた、１つ以上の配列をコーディング機能および／または多様性生成機能に入力する能力を有し得る。
【０１５０】
コンピュータープロセスならびに／またはデータアクセスプロセスのインターネットおよび／もしくはイントラネットの実施態様を実現する方法は、当業者に周知であり、そして極めて詳細に記録されている（例えば、Ｃｌｕｅｒら、（１９９２）ＡＧｅｎｅｒａｌＦｒａｍｅｗｏｒｋｆｏｒｔｈｅＯｐｔｉｍｉｚａｔｉｏｎｏｆＯｂｊｅｃｔ−ＯｒｉｅｎｔｅｄＱｕｅｒｉｅｓ，ＰｒｏｃＳＩＧＭＯＤＩｎｔｅｒｎａｔｉｏｎａｌＣｏｎｆｅｒｅｎｃｅｏｎＭａｎａｇｅｍｅｎｔｏｆＤａｔａ，ＳａｎＤｉｅｇｏ，Ｃａｌｉｆｏｒｎｉａ，１９９２年６月２〜５日，ＳＩＧＭＯＤＲｅｃｏｒｄ，第２１巻、１９９２年６月２日発行；Ｓｔｏｎｅｂｒａｋｅｒ，Ｍ．編；ＡＣＭＰｒｅｓｓ，３８３−３９２頁；ＩＳＯ−ＡＮＳＩ，ＷｏｒｋｉｎｇＤｒａｆｔ，「ＩｎｆｏｒｍａｔｉｏｎＴｅｃｈｎｏｌｏｇｙ−ＤａｔａｂａｓｅＬａｎｇｕａｇｅＳＱＬ」，ＪｉｍＭｅｌｔｏｎ編，ＩｎｔｅｎａｔｉｏｎａｌＯｒｇａｎｉｚａｔｉｏｎｆｏｒＳｔａｎｄａｒｄｉｚａｔｉｏｎａｎｄＡｍｅｒｉｃａｎＮａｔｉｏｎａｌＳｔａｎｄａｒｄｓＩｎｓｔｉｔｕｔｅ，１９９２年７月；ＭｉｃｒｏｓｏｆｔＣｏｒｐｏｒａｔｉｏｎ，「ＯＤＢＣ２．０Ｐｒｏｇｒａｍｍｅｒ’ｓＲｅｆｅｒｅｎｃｅａｎｄＳＤＫＧｕｉｄｅ．ＴｈｅＭｉｃｒｏｓｏｆｔＯｐｅｎＤａｔａｂａｓｅＳｔａｎｄａｒｄｆｏｒＭｉｃｒｏｓｏｆｔＷｉｎｄｏｗｓ．ＴＭ．ａｎｄＷｉｎｄｏｗｓＮＴ．ＴＭ．，ＭｉｃｒｏｓｏｆｔＯｐｅｎＤａｔａｂａｓｅＣｏｎｎｅｃｔｉｖｉｔｙ．ＴＭ．ＳｏｆｔｗａｒｅＤｅｖｅｌｏｐｍｅｎｔＫｉｔ」，１９９２，１９９３，１９９４ＭｉｃｒｏｓｏｆｔＰｒｅｓｓ，３−３０頁および４１−５６頁；ＩＳＯＷｏｒｋｉｎｇＤｒａｆｔ，「ＤａｔａｂａｓｅＬａｎｇｕａｇｅＳＱＬ−Ｐａｒｔ２：Ｆｏｕｎｄａｔｉｏｎ（ＳＱＬ／Ｆｏｕｎｄａｔｉｏｎ）」，ＣＤ９０７５−２：１９９．ｃｈｉ．ＳＱＬ，１９９７年９月１１日など、を参照のこと）。
【０１５１】
当業者は、多くの改善が、本発明の範囲から逸脱することなく、本構成に対してなされ得ることを認識する。例えば、２段構成において、ＷＷＷゲートウェイの機能を実行するサーバーシステムはまた、ウェブサーバーの機能も実行し得る。例えば、上記の実施態様のいずれか１つは、ＵＲＬ以外の形式であるユーザー（単数／複数）末端からの要求を認めるように変更され得る。なお別の変更は、複数のマネージャー環境への適応を含む。
【０１５２】
（Ｘ．物理的評価およびフィードバックループの組み込み）
上記のように、特定の好ましい実施態様において、選択基準は、結び付けられたストリングにより提示される分子が、特定の経験的な物理的にアッセイされた特性を満たすことが必要であり得る。これらの特性をアッセイするために、コードされた分子を得る必要がある。このことを達成するために、結び付けられたストリングにより提示される分子は、物理的に合成される（例えば、化学的にもしくは組換え法により）か、または単離される。
【０１５３】
本発明に従って生成されたキャラクターストリングの収集物によりコードされる遺伝子、タンパク質、ポリサッカライドの物理的合成は、１つ以上の所望の特性についての物理的アッセイに敏感に反応する物理的提示物を作製するための主な手段である。
【０１５４】
好ましい実施態様において、遺伝子合成技術は、代表的には、一致した様式で、および本発明の方法により生成される結び付けられたストリングの収集物に提供される配列提示に対する忠実な厳守において、ライブラリーを構築するために使用される。
【０１５５】
好ましい遺伝子合成方法は、１０⁴〜１０⁹遺伝子／タンパク質変化のライブラリーの迅速な構築を可能にする。これは、代表的には、物理的アッセイまたは選択方法により完全にサンプリングされるのと同程度に、より大きなライブラリーを作製および維持することがより困難であり、かつときおり作製および維持され得ないので、スクリーニング／選択プロトコルに適切である。例えば、当該分野における既存の物理的アッセイ方法（例えば、「生死（ｌｉｆｅａｎｄｄｅａｔｈ）」選択法を含む）は、一般に、特定のライブラリーの特定のスクリーニングにより約１０⁹の変化以下のサンプリングを可能にし、そして多くのアッセイは約１０⁴〜１０⁵のメンバーのサンプリングに制限されている。従って、いくつかのより小さなライブラリーを構築することは、好ましい方法である。なぜなら、大きなライブラリーは、完全にサンプリングすることは容易にはできないからである。しかし、より大きなライブラリーは、例えば、ハイスループット方法を用いて、やはり作製およびサンプリングされる。
【０１５６】
十分に規定された配列を用いて遺伝子、ポリサッカライド、タンパク質などを合成するために使用され得る多くの方法が存在し、そしてこの分野は、急激に発展している。単に、例を明示する目的で、この議論は、生物学的分子の生成について公知の方法の多くの可能性のあるかつ利用可能な型のうちの１つに焦点を当てている。
【０１５７】
ポリヌクレオチド合成における現在の技術は、当業者がオリゴヌクレオチドを効率的に調製することを可能にする、周知かつ成熟したホスホルアミダイト化学により最もよく表れている。１００ｂｐより有意に長いオリゴヌクレオチドの慣用的合成についてこの化学を使用することは可能であるが、いくらか実際的ではない。そして合成収量は減少し、必要とされる生成の程度は増大する。「代表的な」４０〜８０ｂｐサイズのオリゴヌクレオチドは、非常に高純度で慣用的かつ直接的に獲得され得る。
【０１５８】
オリゴヌクレオチドおよびなお完全な合成（二本鎖または一本鎖）遺伝子を、多くの市販の供給源（例えば、ＴｈｅＭｉｄｌａｎｄＣｅｒｔｉｆｉｅｄＲｅａｇｅｎｔＣｏｍｐａｎｙ（ｍｃｒｃ＠ｏｌｉｇｏｓ．ｃｏｍ）、ＴｈｅＧｒｅａｔＡｍｅｒｉｃａｎＧｅｎｅＣｏｍｐａｎｙ（ｈｔｔｐ：／／ｗｗｗ．ｇｅｎｃｏ．ｃｏｍ）、ＥｘｐｒｅｓｓＧｅｎ，Ｉｎｃ．（ｗｗｗ．ｅｘｐｒｅｓｓｇｅｎ．ｃｏｍ）、ＯｐｅｒｏｎＴｅｃｈｎｏｌｏｇｉｅｓＩｎｃ．（ａｌａｍｅｄａ，ＣＡ）などの多くの商用の供給源のいずれかから注文し得る。同様に、ペプチドを、ＰｅｐｔｉｄｏＧｅｎｉｃ（ｐｋｉｍ＠ｃｃｎｅｔ．ｃｏｍ）、ＨＴＩＢｉｏ−ｐｒｏ＝ｄｕｃｔ，Ｉｎｃ．（ｈｔｔｐ：／／ｗｗｗ．ｈｔｉｂｉｏ．ｃｏｍ）、ＢＭＡＢｉｏｍｅｄｉｃａｌｓ，Ｌｔｄ．（Ｕ．Ｋ．Ｂｉｏ−Ｓｙｎｔｈｅｓｉｓ，Ｉｎｃ．などのような種々の供給元のいずれかから特注し得る。
【０１５９】
最適化、並行、およびハイスループットに容易に敏感に反応しやすい小さなフラグメントからの全遺伝子合成の関連する実証は、ＤｉｌｌｏｎおよびＲｏｓｅｎ（１９９０）Ｂｉｏｔｅｃｈｎｉｑｕｅｓ，（９）３：２９８−３００に記載される。リガーゼを使用することなく部分的に重複する一本鎖オリゴヌクレオチドのセットからの、単純かつ迅速なＰＣＲベースの遺伝子アセンブリプロセスが記載される。いくつかのグループはまた、漸増するサイズの種々の遺伝子の合成に対して、同じＰＣＲベースの遺伝子アセンブリアプローチのバリエーションが首尾よく適用され、従って、この方法の変異した遺伝子のライブラリー合成についての一般的適用性およびコンビナトリアルな性質を実証したことを記載した（有用な参考文献に関しては、Ｓａｎｄｈｕら（１９９２）Ｂｉｏｔｅｃｈｎｉｑｕｅｓ，１２（１）：１５−１６、ＰｒｏｄｏｍｏｕおよびＰｅａｒｌ（１９９２）ＰｒｏｔｅｉｎＥｎｇｉｎ．，５（８）：８２７−８２９、Ｃｈｅｎら（１９９４）ＪＡＣＳ、１９９４（１１）：８７９９−８８００、Ｈａｙａｓｈｉら（１９９４）Ｂｉｏｔｅｃｈｎｉｑｕｅｓ，１７：３１０−３１４などもまた参照のこと）。
【０１６０】
より最近では、Ｓｔｅｍｍｅｒら（１９９５）Ｇｅｎｅ１６４５：４９−５３は、ＰＣＲベースのアセンブリ方法が、数十または数百さえもの合成４０ｂｐオリゴヌクレオチドから、少なくとも２．７ｋｂまでのより大きな遺伝子を構築するために有用であるという証拠を提供した。これらの著者らはまた、「循環」アセンブリＰＣＲが使用される場合、公知のＰＣＲベースの遺伝子合成プロトコル（オリゴヌクレオチド合成、遺伝子アセンブリ、遺伝子増幅、および代表的には、クローニング）を包含する４つの工程から、遺伝子増幅工程が省略され得ることを実証した。
【０１６１】
一旦調製されると、当業者に周知の慣用的方法に従って遺伝子をベクターに挿入し得、そしてこのベクターを使用して、宿主細胞をトランスフェクトし得、そしてコードされたタンパク質を発現し得る。これらの目的を達成するためのクローニング方法論、および核酸の配列を確認するための配列決定方法は、当該分野で周知である。適切なクローニングおよび配列決定技術、ならびに多くのクローニングの実施を通して当業者を指導するに十分な指示は、ＢｅｒｇｅｒおよびＫｉｍｍｅｌ、ＧｕｉｄｅｔｏＭｏｌｅｃｕｌａｒＣｌｏｎｉｎｇＴｅｃｈｎｉｑｕｅｓ，ＭｅｔｈｏｄｓｉｎＥｎｚｙｍｏｌｏｇｙ、第１５２巻、ＡｃａｄｅｍｉｃＰｒｅｓｓ，Ｉｎｃ．、ＳａｎＤｉｅｇｏ（Ｂｅｒｇｅｒ）；Ｓａｍｂｒｏｏｋら（１９８９）ＭｏｌｅｃｕｌａｒＣｌｏｎｉｎｇ＿ＡＬａｂｏｒａｔｏｒｙＭａｎｕａｌ（第２版）第１〜３巻、ＣｏｌｄＳｐｒｉｎｇＨａｒｂｏｒＬａｂｏｒａｔｏｒｙ，ＣｏｌｄＳｐｒｉｎｇＨａｒｂｏｒＰｒｅｓｓ，ＮＹ；およびＣｕｒｒｅｎｔＰｒｏｔｏｃｏｌｓｉｎＭｏｌｅｃｕｌａｒＢｉｏｌｏｇｙ、Ｆ．Ｍ．Ａｕｓｕｂｅｌら編、ＣｕｒｒｅｎｔＰｒｏｔｏｃｏｌｓ、ＧｒｅｅｎｅＰｕｂｌｉｓｈｉｎｇＡｓｓｏｃｉａｔｅｓ，Ｉｎｃ．とＪｏｈｎＷｉｌｅｙ＆Ｓｏｎｓ，Ｉｎｃ．との合弁事業（１９９４、増補）に見出される。生物学的試薬および実験装置の製造業者からの製品情報はまた、公知の生物学的方法において有用な情報を提供する。このような製造業者らとしては、ＳＩＧＭＡＣｈｅｍｉｃａｌｃｏｍｐａｎｙ（ＳａｉｎｔＬｏｕｉｓ，ＭＯ）、Ｒ＆Ｄｓｙｓｔｅｍｓ（Ｍｉｎｎｅａｐｏｌｉｓ，ＭＮ）ＰｈａｒｍａｃｉａＬＫＢＢｉｏｔｅｃｈｎｏｌｏｇｙ（Ｐｉｓｃａｔａｗａｙ，ＮＪ）、ＣＬＯＮＴＥＣＨＬａｂｏｒａｔｏｒｉｅｓ，Ｉｎｃ．（ＰａｌｏＡｌｔｏ，ＣＡ）、ＣｈｅｍＧｅｎｅｓＣｏｒｐ．，ＡｌｄｒｉｃｈＣｈｅｍｉｃａｌＣｏｍｐａｎｙ（Ｍｉｌｗａｕｋｅｅ，ＷＩ）、ＧｌｅｎＲｅｓｅａｒｃｈ，Ｉｎｃ．、ＧＩＢＣＯＢＲＬＬｉｆｅＴｅｃｈｎｏｌｏｇｉｅｓ，Ｉｎｃ．（Ｇａｉｔｈｅｒｓｂｅｒｇ，ＭＤ）、ＦｌｕｋａＣｈｅｍｉｃａＢｉｏＣｈｅｍｉｃａＡｎａｌｙｔｉｋａ（ＦｌｕｋａＣｈｅｍｉｅＡＧ，Ｂｕｃｈｓ，Ｓｗｉｔｚｅｒｌａｎｄ）、Ｉｎｖｉｔｒｏｇｅｎ，ＳａｎＤｉｅｇｏ，ＣＡおよびＡｐｐｌｉｅｄＢｉｏｓｙｓｔｅｍｓ（ＦｏｓｔｅｒＣｉｔｙ，ＣＡ）、ならびに当業者に公知の多くの他の商業的供給元が挙げられる。
【０１６２】
物理的分子は、一旦発現されると、１つ以上の特性についてスクリーニングされ得、そしてこの分子は、それらが選択基準を満たすか否かを決定され得る。次いで、物理的選択基準を満たす分子をコードするキャラクターストリングは、上記のとおりに選択される。物理的特性（例えば、結合特異性および／またはアビディティー、酵素活性、分子量、電荷、熱安定性、至適温度、至適ｐＨなど）についての多くのアッセイは、当業者に周知である。
【０１６３】
特定の実施態様において、物理的分子は、１回以上の「シャッフリング」手順に供され得、そして必要に応じて、特定の物理的特性についてスクリーニングされて、新たな分子を生成する。次いで、この新たな分子は、上記の方法に従ってコードされ、そして処理される。
【０１６４】
種々の「シャッフリング方法」が公知である。これらの方法としては、本発明者らおよび共同研究者ら（例えば、Ｓｔｅｍｍｅｒ（１９９４）Ｎａｔｕｒｅ３７０：３８９−３９１、Ｓｔｅｍｍｅｒら（１９９４）、Ｐｒｏｃ．Ｎａｔｌ．Ａｃａｄ．Ｓｃｉ．ＵＳＡ９１：１０７４７−１０７５１、Ｓｔｅｍｍｅｒ、米国特許第５，６０３，７９３号、Ｓｔｅｍｍｅｒら、米国特許第５，８３０，７２１号、Ｓｔｅｍｍｅｒら、米国特許第５，８１１，２３８号、Ｍｉｎｓｈｕｌｌら、米国特許第５，８３７，４５８号、Ｃｒａｍｅｒｉら（１９９６）ＮａｔｕｒｅＭｅｄ．２（１）１００−１０３、ＰＣＴ公開ＷＯ９５／２２６２５、ＷＯ９７／２００７８、ＷＯ９６／３３２０７、ＷＯ９７／３３９５７、ＷＯ９８／２７２３０、ＷＯ９７／３５９６６、ＷＯ９８／３１８３７、ＷＯ９８／１３４８７、ＷＯ９８／１３４８５、およびＷＯ９８／４２８３２）に教示される方法が挙げられる。さらに、いくつかの同時係属中の出願は、重要なＤＮＡシャッフリング方法論を記載する（例えば、同時係属中の米国特許出願第０９／１１６，１１８号（１９９８年７月１５日出願）、同第６０／１０２，３６２号、およびＳｅｌｉｆｏｎｏｖおよびＳｔｅｍｍｅｒのＭｅｔｈｏｄｓｆｏｒｍａｋｉｎｇｃｈａｒａｃｔｅｒｓｔｒｉｎｇｓ，ｐｏｌｙｎｕｃｌｅｏｔｉｄｅｓ＆ｐｏｌｙｐｅｐｔｉｄｅｈａｖｉｎｇｄｅｓｉｒｅｄｃｈａｒａｃｔｅｒｉｓｔｉｃｓ（０２／０５／１９９９出願）、米国特許出願第６０／１１８，８５４号を参照のこと）。
【０１６５】
さらに、上記の方法はまた、並行様式で実施され得、ここで引き続く物理的スクリーニングのための個々のライブラリーのメンバーの各々（複数の遺伝子、タンパク質、ポリサッカライドなどを含む）は、空間的に分離された容器または容器のアレイにおいて合成されるか、またはプール様式で合成される。プール様式では、所望の複数の分子の全てまたは一部が、単一の容器において合成される。多くの他の合成アプローチは公知であり、そして他方に対する一方の特定の利点は、当業者に容易に決定され得る。
【０１６６】
本明細書中で議論されるプロセスは、ハイスループットシステムを使用する生成に対して敏感に反応する。ハイスループット（例えば、ロボット利用）システムは、市販されている（例えば、ＺｙｍａｒｋＣｏｒｐ．，Ｈｏｐｋｉｎｔｏｎ，ＭＡ；ＡｉｒＴｅｃｈｎｉｃａｌＩｎｄｕｓｔｒｉｅｓ，Ｍｅｎｔｏｒ，ＯＨ；ＢｅｃｋｍａｎＩｎｓｔｒｕｍｅｎｔｓ，Ｉｎｃ．Ｆｕｌｌｅｒｔｏｎ，ＣＡ；ＰｒｅｃｉｓｉｏｎＳｙｓｔｅｍｓ，Ｉｎｃ．，Ｎａｔｉｃｋ，ＭＡなどを参照のこと）。これらのシステムは、代表的には、全てのサンプルおよび試薬のピペッティング、液体分配、時間設定された（ｔｉｍｅｄ）インキュベーション、およびアッセイに適切な検出器でのマイクロプレートの最終的な読みとりを含む全体的な手順を自動化する。これらの設定可能なシステムは、ハイスループットおよび迅速な起動ならびに高い程度の融通性およびカスタマイズを提供する。このようなシステムの製造は、詳細なプロトコルに種々のハイスループットを提供する。従って、例えば、ＺｙｍａｒｋＣｏｒｐ．は、クローニング発現および化学的または組換え的に生成された産物のスクリーニングについてのハイスループットシステムの使用を記載する技術会報を提供する。
【０１６７】
（ＸＩ．生成されたストリング集団の使用）
（Ａ）．遺伝的／進化アルゴリズムの使用）
１つの実施態様において、本発明の方法は、キャラクターストリングの集団を提供する。特に好ましいキャラクターストリングは、コードされた生物学的分子を提示し、そして代表的には、このコードされた分子は、互いが生物学的組織化のレベルを反映するいくらかの関係を有する。結果的に、本発明の方法により生成されたこのキャラクターストリングは、均質な配列空間からの、ランダムなまたは無計画な選択を反映しないが、むしろ、組織化（例えば、遺伝子、遺伝子ファミリー、個体、亜集団など）の特定のレベルが自然界で見出されることを反映する関連性（または変化）の程度を捕捉する。従って、本発明の方法により生成されたキャラクターストリングの収集物（例えば、構成された（ｐｏｐｕｌａｔｅｄ）データ構造）は、種々の進化モデルについての有用な開始点を提供し、そして進化アルゴリズム（進化計算）における使用のために便利である。
【０１６８】
このようなモデルにおいて使用された場合、本発明の方法により生成されたこの集団（キャラクターストリングの収集物）は、任意の集団に対する進化的アルゴリズムの実行よりはるかに多くの情報を提供する。
【０１６９】
例えば、進化的アルゴリズムが開始点として利用される場合、ランダムまたは任意のメンバーのセット、シミュレーションの動力学は、任意の開始点から特定の溶液までの前進を反映する（例えば、得られる集団における特性の分配）。開始点は任意であり、そして本質的に天然のプロセスにより生成された集団と関連しないので、これらの動力学は、天然のプロセス／集団の動力学に関する情報を提供しない。
【０１７０】
対照的に、本発明の方法により生成されるキャラクターストリングの収集物は、従来の進化アルゴリズムにおいて使用される開始点をランダムに生成するより、はるかに多くの情報を含む。第１に、集団の各メンバーは、分子構造に関するかなりの情報を含む。従って、１つのメンバーが、単に「自己／非自己」としてではなく別のメンバーから区別されるが、むしろメンバーは、関連性／類似性の程度により区別される。本発明の方法により生成された集団のメンバーは、変化する共変動の程度を反映する。
【０１７１】
さらに、本発明の方法により生成される集団は、初期ストリングにコードされる生物学的組織化のレベルの微細な構造特徴を反映するので、シミュレーションの初期動力学は、これらのストリングセットを使用して実行されるシミュレーションの初期動力学は、「実世界」集団の動力学を反映し、そして進化プロセスへかなりの洞察を提供する。
【０１７２】
さらに特定の分子が、本発明の方法を使用して生成されるメンバーにより提示されるので、これらのデータ構造を使用して実行された進化アルゴリズムは、分子進化および／または新たなかつ有用な分子実体の設計に関する実際の情報を提供する。
【０１７３】
（Ｂ）指標生成における使用）
別の実施態様において、本発明の方法により生成されるデータ構造は、本質的に任意の種類の情報を指標化するためのタグ（指標）として使用され得る。このアプローチにおいて、より大きな類似性の情報は、より大きな類似性を有するデータ構造（キャラクターストリング）のメンバーを使用してタグ化される。その一方で、より低い類似性の情報は、より低い類似性を有するデータ構造のメンバーでタグ化される。好ましい実施態様において、データの２つの異なる断片をタグ化するために使用されるキャラクターストリングの類似性は、タグ化された情報の類似性を反映する（タグ化された情報の類似性と比例する）。
【０１７４】
検索が行われる場合、最初のヒットが伝統的な検索技術を用いて同定される。次いで、密接に関連した情報が所望であれば、このデータ構造は、上記の周知の類似性アルゴリズムのいずれかを使用して類似するメンバーについて検索され得る。これらの類似性アルゴリズムは、多くのデータ領域（ｄａｔａｓｐａｃｅ）の完全、迅速、かつ有効な検索を提供するように設計される。所望の類似性のメンバー（指標）が同定されると、それらは、タグ化されたデータに注意を向けさせ、それによりエンドユーザーに関連する情報を提供する。
【０１７５】
（Ｃ）データベース検索における参考対象物としての使用）
関連出願では、本発明の方法により生成されるこのデータ構造、またはこのようなデータ構造のメンバー（すなわち、キャラクターストリング）は、データベース検索において参照対象物として使用され得る。例えば、初期の公知の情報（例えば、分子構造、または上記の知識データベース（ｋｎｏｗｌｅｄｇｅｄａｔａｂａｓｅ）からの指標ストリング）は、本明細書中に記載の方法に従ってコードされ、そして改変される。これは、関連するが、明らかではない、初期のコードされた情報の改変を捕捉する新たなデータ構造を生成する。
【０１７６】
得られる情報（例えば、データ構造のメンバー）を解析して、実際の分子または理論的分子を同定し、そしてこれは同じかまたは関連する分子についての代表的なデータベースを検索するために使用され得る。コードされた情報がデータベース指標に由来する場合、このデータ構造のメンバーを使用して、本来のデータベースまたは新たなデータベースをプローブし、関連する／関連した情報を同定し得る。
【０１７７】
（Ｄ）特定の分子特性を付与する構造モチーフの同定）
例えば、機能的操作を容易にするために、特定の特性を担い得る分子（例えば、タンパク質）の領域を同定することは、しばしば、興味深い。これは、通常Ｘ線結晶学により得られる構造情報を使用して、伝統的に行われる。
【０１７８】
類似のまたはなお同一の反応を触媒する天然に存在する酵素の配列は、広範に変化し得；配列は、わずか５０％以下で同一であり得るが、このような酵素のファミリーは、１つの同一の反応を触媒し得、これらの酵素の他の特性は有意に異なり得る。これらの特性としては、例えば、温度および有機溶媒に対する安定性、至適ｐＨ、可溶性、固定化された場合に活性を保持する能力、異なる宿主系での発現の容易さの物理的特性が挙げられる。それらはまた、活性（Ｋ_catおよびＫ_m）、受容される基質の範囲、および行われる化学的な事象（ｅｖｅｎｏｆｃｈｅｍｉｓｔｒｉｅｓ）を含む触媒特性が挙げられる。本明細書中で記載される方法は、非触媒性タンパク質（例えば、サイトカインのようなリガンド）および核酸配列（例えば、多くの異なるリガンドにより誘導可能であり得るプロモーター）さえにも適用され得る。複数の機能的重要性（ｄｉｍｅｎｓｉｏｎｓ）が「相同な」配列のファミリーによりコードされる。
【０１７９】
類似する触媒機能を有する酵素間の分岐が理由で、特定の特性と個々のアミノ酸とを特定の位置で相関づけることは通常は可能でない。あまりにも多くのアミノ酸相違が存在する。しかし、バリアントのライブラリーは、本発明の方法に従う初期ストリングへファミリーのメンバーをコードし、次いで、そのコードされたバリアントを有するデータ構造を構成するようにサブストリングを選択し、そして結び付けることにより相同な天然の配列のファミリーから調製され得る。
【０１８０】
このコードされたか、または解析されたバリアントは、所望の特性についてインシリコで試験され得、そして／またはコードされたバリアントは結び付けられ得、そして対応する分子は、物理的に上記のように合成される。次いで、この合成された分子は、１つ以上の所望の特性についてスクリーニングされ得る。
【０１８１】
データ構造のメンバーは、特定の特性についての特定の条件セットの下で試験される場合、これらの条件についてのこのデータ構造（または初期ストリング収集物）からの配列の最適な組み合わせが決定され得る。このアッセイ条件をわずか１つのパラメーターにおいて変化させる場合、ライブラリー（データ構造）由来の異なる個体が最良のパフォーマーとして同定される。スクリーニング条件は非常に類似しているので、大部分のアミノ酸は、おそらく、最良のパフォーマーの２つのセット（初期ストリング収集物における最良のパフォーマー（セット１）および構成されたデータ構造における最良のパフォーマー（セット２））の間で保存される。従って、この２つの異なる条件化での最良の酵素の配列の比較により、性能における差異の原因である配列差異が同定される。
【０１８２】
素因成分分析（例えば、Ｐａｒｔｅｋｔｙｐｅｓｏｆｔｗａｒｅを用いて）は、このような分析に有用な多くの複数変量ツールのうちの１つである。
【０１８３】
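(E) Use in generating music)
In yet another embodiment, the methods of the invention can be used to generate music. Using any of a number of well-known programs, biological molecules (e.g., DNA, proteins, etc.) can be encoded into musical notes. This can involve mapping particular subunits onto particular notes. The timing and/or timbre of the notes is determined by the motif and/or secondary structure in which the subunit occurs.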
【０１８４】
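Thus, for example, the program SS-midi has been used to encode various nucleic acid and amino acid sequences into music. In one approach (DNA Calypso), purines were played at 3/2 the speed of pyrimidines, the bases C, T, G, and A were mapped to the notes C, F, G, and A, and the first strand was played with a jazz organ while its complementary strand was played with a bass. In another approach, the duration of a note can be longer when the note/subunit is found in a helix than when it is found in a β-sheet. Other variants are, of course, possible.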
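The base-to-note mapping described for the DNA Calypso approach can be written down as a small Python sketch. This is not the SS-midi program itself: it only reproduces the stated mapping (C, T, G, A to C, F, G, A) and models "purines at 3/2 the speed of pyrimidines" as purine notes lasting 2/3 as long, which is an assumed interpretation. The function name and the `(note, duration)` output format are also assumptions.

```python
from fractions import Fraction

# Mapping stated in the text: bases C, T, G, A -> notes C, F, G, A.
NOTE_FOR_BASE = {"C": "C", "T": "F", "G": "G", "A": "A"}
PURINES = {"A", "G"}


def dna_to_notes(sequence, base_duration=Fraction(1)):
    """Return (note, duration) pairs for a DNA string.

    Purines are played at 3/2 the speed of pyrimidines, modeled here as a
    duration of 2/3 of `base_duration` (assumed interpretation).
    """
    notes = []
    for base in sequence.upper():
        if base not in NOTE_FOR_BASE:
            raise ValueError(f"unexpected base: {base!r}")
        duration = base_duration * Fraction(2, 3) if base in PURINES else base_duration
        notes.append((NOTE_FOR_BASE[base], duration))
    return notes


if __name__ == "__main__":
    for note, duration in dna_to_notes("CTGA"):
        print(note, duration)
```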
【０１８５】
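In the methods of the invention, biological molecules are encoded into strings, substrings are selected and concatenated, and a data structure is populated as described above. The populated data structure is then used as input to a program (e.g., SS-midi) that maps the novel sequences encoded in the data structure into music. The data structure can be repeatedly repopulated as described above, thereby generating variants of the musical phrases so produced.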
【０１８６】
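(F) Use in driving synthesis machines)
As indicated above, the data structures produced by the methods of the invention can be used to drive devices for the chemical synthesis of the encoded molecules (e.g., polypeptides, nucleic acids, polysaccharides, etc.). Using only a few starting sequences ("seed members"), the methods of the invention provide literally tens, hundreds, thousands, tens of thousands, hundreds of thousands, or even millions of different encoded molecules. When the resulting data structure, or its members, is used to drive chemical (or recombinant) synthesis, "combinatorial" libraries of the desired molecules of virtually any size can be prepared. Such "combinatorial" libraries are widely desired to provide systems for screening for therapeutics, industrial process molecules, particular enzymes, and the like.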
【０１８７】
(Examples)
The following examples are offered to illustrate, not to limit, the invention.
【０１８８】
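(Example 1: Subtilisin family model)
Amino acid sequences were aligned (codon usage can be optimized on retrotranslation for a preferred expression system, and the number of oligonucleotides for synthesis can be minimized). Dot-plot pairwise alignments of all possible pairs of the seven parents were made (Figures 5, 6, and 7). Pairs 6 and 7 showed 95% identity per window of 7 or more amino acids, while all other pairs showed 80% identity per window of 7 or more amino acids. Note that the stringency of alignment (and the subsequent representation of crossovers between parents) can be manipulated individually for each pair, so that low-homology crossovers can be represented at the expense of highly homologous parents. No structural biases or active-site biases were incorporated in this model.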
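The per-window percent identity used in the pairwise comparison above can be illustrated with a short Python sketch. It assumes two already aligned sequences of equal length and a fixed window of 7 residues; a real dot-plot compares every window of one sequence against every window of the other, so this shows only the diagonal. The function name is hypothetical.

```python
def window_identities(seq_a, seq_b, window=7):
    """Percent identity for each window of `window` aligned positions.

    Assumption: seq_a and seq_b are pre-aligned and of equal length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        100.0 * sum(x == y for x, y in zip(seq_a[i:i + window], seq_b[i:i + window])) / window
        for i in range(len(seq_a) - window + 1)
    ]


if __name__ == "__main__":
    print(window_identities("ABCDEFGH", "ABCDEFGX"))
```

Windows whose identity meets the chosen stringency (for example 80% or 95%) would then be marked as candidate homology regions.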
【０１８９】
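(Example 2: Process for designing crossover oligonucleotides for the synthesis of chimeric polynucleotides)
First, substrings were identified and selected in the parent (starting) strings for application of a crossover operator to form chimeric junctions. This is performed by: a) identifying all or part of the pairwise homology regions between all parent character strings; b) selecting all or part of the identified pairwise homology regions for indexing at least one crossover point within each selected pairwise homology region; and c) selecting one or more pairwise non-homology regions for indexing at least one crossover point within each selected pairwise non-homology region ("c" is an optional step that can be omitted, and is also a step at which structure-activity-based elitist selection can be applied), thereby providing a description of a set of positionally and parent-indexed regions/areas (substrings) of the parent character strings suitable for further selection of crossover points.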
【０１９０】
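Second, further selection of crossover points within each substring of the set of substrings selected in Part 1 is performed. This step includes: a) randomly selecting at least one crossover point in each selected substring; and/or b) selecting at least one crossover point in each selected substring using one or more annealing-simulation-based models for determining the probability of selecting a crossover point within each selected substring; and/or c) selecting one crossover point approximately in the middle of each selected substring, thereby producing a set of pairwise crossover points, where each point is indexed to the corresponding character positions in each of the parent strings between which a chimeric junction is to be formed at that point.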
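Options a) and c) of the crossover-point selection step can be sketched in Python as follows. Each selected substring is represented as a half-open `(start, end)` pair of character positions, which is an assumed representation; option b) is not shown because the text does not specify the annealing-simulation model.

```python
import random


def crossover_points(regions, mode="random", rng=None):
    """Pick one crossover point inside each (start, end) region.

    mode="random"   -> option a): a uniformly random position in the region
    mode="midpoint" -> option c): the approximate middle of the region
    """
    rng = rng or random.Random()
    points = []
    for start, end in regions:
        if end <= start:
            raise ValueError("each region needs end > start")
        if mode == "midpoint":
            points.append((start + end) // 2)
        elif mode == "random":
            points.append(rng.randrange(start, end))
        else:
            raise ValueError(f"unknown mode: {mode!r}")
    return points


if __name__ == "__main__":
    print(crossover_points([(10, 20), (30, 41)], mode="midpoint"))
```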
【０１９１】
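Third, optional codon usage adjustments are performed. The process can vary depending on the method used to determine homology (strings encoding DNA or amino acids). For example, where DNA sequences are used: a) codon adjustment for the selected expression system is performed for all parent strings, and b) codon adjustment among the parents can be performed to standardize codon usage for every given amino acid at every corresponding position. This process can significantly reduce the total number of distinct oligonucleotides for gene library synthesis, and can be particularly advantageous where amino acid homology is higher than DNA homology or with families of highly homologous genes (e.g., 80%+ identity).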
【０１９２】
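This option should be exercised with caution, since it is in essence an expression of an elitist mutation operator. Thus, the benefit of reducing the cost of oligonucleotides is weighed against the introduction of this bias, which can have undesired consequences. More typically, the codon that encodes the amino acid at a given position in the majority of the parents is used.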
【０１９３】
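Where amino acid sequences are used: a) the sequences are retrotranslated to degenerate DNA; b) the degenerate nucleotides are defined using position-by-position reference to codon usage in the original DNA (of the majority of parents, or of the corresponding parent), and/or codon adjustment is performed as appropriate to the selected expression system in which the physical assay is to be performed.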
【０１９４】
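This step can also be used to introduce any restriction sites within the coding portion of the genes, if desired, for subsequent identification/QA/deconvolution/manipulation of library entries. All crossover points identified in Part 2 (indexed to pairs of parents) are correspondingly indexed to the adjusted DNA sequences.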
【０１９５】
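Fourth, an oligonucleotide arrangement is selected for the gene assembly scheme. This step includes several decision steps:
Uniform 40-60-mer oligonucleotides are typically used (using longer oligonucleotides reduces the number of oligonucleotides needed to build the parents, but additional dedicated oligonucleotides are then used to provide representation of closely positioned crossovers/mutations).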
【０１９６】
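Select whether shorter or longer oligonucleotides are allowed (i.e., a yes/no decision). A "yes" decision reduces the total number of oligonucleotides for high-homology genes of different lengths having gaps (deletions/insertions), especially gaps of 1-2 amino acids.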
【０１９７】
Select the overlap length (typically 15-20 bases, which can be symmetric or asymmetric).
【０１９８】
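Select whether degenerate oligonucleotides are allowed (yes/no). This is another potent cost-cutting feature and also a powerful means of obtaining additional sequence diversity. Partial-degeneracy and minimized-degeneracy schemes are particularly advantageous in building mutagenesis libraries.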
【０１９９】
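Where software tools are used for these operations, several variations of the parameters are run to select maximum library complexity at minimum cost. Carrying out complex assembly schemes using oligonucleotides of various lengths significantly complicates indexing of the process and the subsequent assembly of the library in positionally encoded parallel or partially pooled formats. Where this is done without sophisticated software, a simple and uniform scheme (e.g., all oligonucleotides are 40 bases long with 20-base overlaps) can be used.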
【０２００】
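Fifth, "convenience sequences" are designed in front of and behind the parent strings. Ideally, this is the same set that will ultimately be built into every library entry. These include any restriction sites, primer sequences for identification of assembled products, an RBS, a leader peptide, and other special or desired features. In principle, the convenience sequences can be defined at a later stage, and at this stage a "dummy" set of appropriate length can be used (e.g., substrings from an easily recognizable forbidden alphabet).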
【０２０１】
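In Part 6, an indexed matrix of oligonucleotide strings for building every parent is produced according to the selected scheme. The index of every oligonucleotide includes: a parent identifier (parent ID), an indication of coding or complementary strand, and a position number. Crossover points are determined for the indexed coding strings of every parent string with its head and tail convenience substrings. The complement of every strand is generated. Every coding string is split according to the assembly PCR scheme selected in Part 4 (e.g., in 40 bp increments). Every complementary string is split according to the same scheme (e.g., 40 bp with a 20 bp shift).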
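The indexing scheme of Part 6 can be sketched in Python under the simple uniform scheme mentioned earlier (40-base oligonucleotides, 20-base shift). Each oligonucleotide is keyed by a `(parent ID, strand, position number)` tuple as described in the text; the function names, the tuple layout, and the decision to leave the unpaired ends of the complementary strand uncovered are illustrative assumptions.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def tile_parent(parent_id, seq, oligo_len=40, shift=20):
    """Split one parent into indexed coding and complementary oligonucleotides.

    Coding oligos start at 0, oligo_len, 2*oligo_len, ...; complementary
    oligos cover windows offset by `shift`, so each one bridges two
    neighboring coding oligos.
    """
    oligos = []
    for pos, start in enumerate(range(0, len(seq), oligo_len)):
        oligos.append(((parent_id, "coding", pos), seq[start:start + oligo_len]))
    for pos, start in enumerate(range(shift, len(seq), oligo_len)):
        window = seq[start:start + oligo_len]
        oligos.append(((parent_id, "complement", pos), reverse_complement(window)))
    return oligos


if __name__ == "__main__":
    for index, oligo in tile_parent("P1", "ACGT" * 25):
        print(index, len(oligo))
```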
【０２０２】
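In Part 7, an indexed matrix of oligonucleotides is produced for every pairwise crossover operation. First, all oligonucleotides bearing a pairwise crossover marker are determined. Second, all sets of oligonucleotides (four per crossover point) having the same position and the same pair of parent crossover markers are determined. Third, every set of four oligonucleotide strings labeled with the same crossover marker is taken, and another, derivative set of four chimeric oligonucleotide strings is produced, having characters encoding two coding strands and two complementary strands (e.g., with a 20 bp shift in a 40 = 20 + 20 scheme). Two coding strings are possible, each having the forward end sequence string of one parent followed by the backward end of the second parent after the crossover point. The complementary strings are designed in the same manner, thereby yielding an indexed complete inventory of strings encoding oligonucleotides suitable for gene library assembly by PCR.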
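The derivation of four chimeric oligonucleotides per crossover point can be sketched as follows. The sketch assumes two parents of equal length that share one coordinate system (no gaps), a coding oligonucleotide window starting at `coding_start`, and a complementary window shifted by 20 bases; these assumptions, and all names, are for illustration only.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def chimeric_oligos(parent_a, parent_b, point, coding_start, oligo_len=40, shift=20):
    """Build the four chimeric oligos spanning one crossover point.

    Returns two coding strings (A then B, and B then A) and the two
    complementary strings covering the window shifted by `shift`.
    """
    if len(parent_a) != len(parent_b):
        raise ValueError("parents are assumed to be aligned to equal length")
    comp_start = coding_start + shift
    if not (comp_start < point < coding_start + oligo_len):
        raise ValueError("crossover point must fall inside both oligo windows")
    if comp_start + oligo_len > len(parent_a):
        raise ValueError("complementary window runs past the end of the parents")
    oligos = {}
    for label, first, second in (("AB", parent_a, parent_b), ("BA", parent_b, parent_a)):
        chimera = first[:point] + second[point:]
        oligos["coding_" + label] = chimera[coding_start:coding_start + oligo_len]
        oligos["complement_" + label] = reverse_complement(
            chimera[comp_start:comp_start + oligo_len]
        )
    return oligos


if __name__ == "__main__":
    for name, oligo in chimeric_oligos("A" * 60, "C" * 60, point=30, coding_start=0).items():
        print(name, oligo)
```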
【０２０３】
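The inventory can be further refined, if desired, by detecting all duplicate oligonucleotides, counting them, and deleting the duplicates from the inventory, while entering the count into an "abundance = amount" field in the index of each oligonucleotide string. This can be a very advantageous step for reducing the total number of oligonucleotides for library synthesis, particularly where the parent sequences are highly homologous.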
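The inventory refinement step amounts to collapsing duplicate oligonucleotide strings while recording how often each occurred. A minimal Python sketch, assuming the inventory is a plain list of sequence strings and using a hypothetical `abundance` field name:

```python
from collections import Counter


def refine_inventory(oligo_sequences):
    """Collapse duplicate oligos and record each one's abundance.

    Returns one record per distinct sequence, in first-seen order.
    """
    counts = Counter(oligo_sequences)
    return [{"sequence": seq, "abundance": n} for seq, n in counts.items()]


if __name__ == "__main__":
    inventory = ["ACGT", "TTTT", "ACGT", "ACGT"]
    for record in refine_inventory(inventory):
        print(record)
```

For highly homologous parents most oligonucleotides away from the crossover points are identical, so the refined inventory can be much shorter than the raw one.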
【０２０４】
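Modifications can be made to the methods and materials described hereinabove without departing from the spirit or scope of the invention as claimed, and the invention can be put to a number of different uses, including:
The use of an integrated system to generate shuffled nucleic acids and/or to test shuffled nucleic acids, including in an iterative process.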
【０２０５】
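An assay, kit, or system utilizing any one of the selection strategies, materials, components, methods, or substrates described hereinabove. Kits optionally further comprise instructions for performing the methods or assays, packaging materials, one or more containers that contain assay, device, or system components, or the like.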
【０２０６】
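In a further aspect, the invention provides kits embodying the methods and apparatus herein. Kits of the invention optionally comprise one or more of the following: (1) a shuffled component as described herein; (2) instructions for practicing the methods described herein and/or for operating the selection procedures herein; (3) one or more assay components; (4) a container for holding nucleic acids or enzymes, other nucleic acids, transgenic plants, animals, cells, or the like; (5) packaging materials; and (6) software for performing any of the processes and/or decision steps described herein.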
【０２０７】
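In a further aspect, the invention provides for the use of any component or kit herein, for the practice of any method or assay herein, and/or for the use of any apparatus or kit to practice any assay or method herein.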
【０２０８】
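It is understood that the examples and embodiments described herein are for illustrative purposes only, and that various modifications or changes in light thereof will be suggested to persons skilled in the art and are to be included within the spirit and purview of this application and the scope of the appended claims. All publications, patents, and patent applications cited herein are hereby incorporated by reference in their entirety for all purposes.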
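[Brief description of the drawings]
[Figure 1] Figure 1 illustrates a flow chart depicting one embodiment of the methods of the invention.
[Figure 2] Figure 2 illustrates the selection and concatenation of subsequences according to the method(s) of the invention.
[Figure 3] Figure 3 illustrates the selection and concatenation of subsequences according to the method(s) of the invention, where the concatenation utilizes an alignment algorithm to fix the order of the substrings.
[Figure 4] Figure 4 illustrates a representative digital device 700 in accordance with the invention.
[Figure 5] Figure 5 is a chart and relationship dendrogram showing percent similarity for different subtilisins (an exemplary set of initial character strings).
[Figure 6] Figure 6 is a pairwise dot-plot alignment showing regions of homology for different subtilisins.
[Figure 7] Figure 7 is a pairwise dot-plot alignment showing regions of homology for seven different parent subtilisins.
[0001]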
(Cross-reference of related applications)
This application is a continuation-in-part of U.S. Patent Application Ser. No. 09/416,837, filed October 21, 1999.
[0002]
This application also claims priority to "METHODS FOR MAKING CHARACTER STRINGS, POLYNUCLEOTIDES AND POLYPEPTIDES HAVING DESIRED CHARACTERISTICS" by Selifonov et al., a PCT application filed January 18, 1999 (filed by the law office of Jonathan Alan Quite, attorney docket number 02-289-3PC0). The PCT application (filed January 18, 1999) is a continuation-in-part of U.S. Patent Application Ser. No. 09/416,375, "METHODS FOR MAKING CHARACTER STRINGS, POLYNUCLEOTIDES AND POLYPEPTIDES HAVING DESIRED CHARACTERISTICS", filed October 12, 1999 by Selifonov et al. U.S. Patent Application Ser. No. 09/416,375 is a non-provisional of U.S. Patent Application No. 60/116,447, "METHODS FOR MAKING CHARACTER STRINGS, POLYNUCLEOTIDES AND POLYPEPTIDES HAVING DESIRED CHARACTERISTICS", filed January 19, 1999 by Selifonov and Stemmer, and is also a non-provisional of U.S. Patent Application No. 60/118,854, "METHODS FOR MAKING CHARACTER STRINGS, POLYNUCLEOTIDES AND POLYPEPTIDES HAVING DESIRED CHARACTERISTICS", filed February 5, 1999 by Selifonov and Stemmer.
[0003]
This application also claims priority to "OLIGONUCLEOTIDE MEDIATED NUCLEIC ACID RECOMBINATION" by Crameri et al., a PCT application filed January 18, 1999 (filed by the law office of Jonathan Alan Quite, attorney docket number 02-296-3PC). The PCT application (filed January 18, 1999) is a continuation-in-part of U.S. Patent Application Ser. No. 09/408,392, "OLIGONUCLEOTIDE MEDIATED NUCLEIC ACID RECOMBINATION", filed September 28, 1999 by Crameri et al. U.S. Patent Application Ser. No. 09/408,392 is a non-provisional of U.S. Patent Application No. 60/118,813, "OLIGONUCLEOTIDE MEDIATED NUCLEIC ACID RECOMBINATION", filed February 5, 1999 by Crameri et al., and is also a non-provisional of U.S. Patent Application No. 60/141,049, "OLIGONUCLEOTIDE MEDIATED NUCLEIC ACID RECOMBINATION", filed June 24, 1999 by Crameri et al.
[0004]
This application is also related to U.S. Patent Application Ser. No. 09/408,393, "USE OF CODON VARIED OLIGONUCLEOTIDE SYNTHESIS FOR SYNTHETIC SHUFFLING", filed September 28, 1999 by Welch et al.
[0005]
This application claims priority to and the benefit of each of these applications, as provided for under 35 U.S.C. § 119 and/or 35 U.S.C. § 120, as appropriate. All of these applications are incorporated herein by reference in their entirety for all purposes.
[0006]
(Copyright notice)
A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the U.S. Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
[0007]
(Declaration concerning rights to inventions made under research and development supported by the federal government)
Not applicable.
[0008]
(Field of Invention)
The present invention relates to the field of computer modeling and simulation. In particular, the present invention provides a novel way to populate a data structure for use in evolutionary modeling.
[0009]
(Background of the Invention)
There is an extensive history of using computers to simulate and/or explore the evolution of individual genetic systems and/or of the genetic/phenotypic systems of populations. The driving force behind most artificial life (A-life) simulations is the algorithm by which the artificial life forms evolve and/or adapt to their environment. These underlying algorithms fall into two main categories: learning algorithms (e.g., those typified by neural networks) and evolutionary algorithms (e.g., those typified by genetic algorithms).
[0010]
Many artificial life researchers, especially those interested in higher-order processes such as learning and adaptation, provide their organisms with a neural network that acts as an artificial brain (see, e.g., Touretzky (1988-1991) Neural Information Processing Systems, Vols. 1-4, Morgan Kaufmann). Neural networks are learning algorithms. They can be trained, for example, to classify images into categories. A typical task is recognizing which letter a given handwritten character represents.
[0011]
A neural network consists of input-output devices called neurons, which are organized in a (highly connected) network. Typically, the network is organized into multiple layers: an input layer that receives sensory input, any number of so-called hidden layers that perform the actual calculations, and an output layer that reports the results of these calculations. Training a neural network involves adjusting the strength of the connections between neurons in the network.
[0012]
The other major type of biologically inspired underlying algorithm is the evolutionary algorithm. Whereas learning algorithms (e.g., neural networks) are metaphorically based on learning processes in individual organisms, evolutionary algorithms are inspired by evolutionary change in populations of individuals. Compared with neural networks, evolutionary algorithms have only recently gained wide acceptance in academic and industrial circles.
[0013]
Evolutionary algorithms are generally iterative. An iteration is typically referred to as a “generation”. The basic evolutionary algorithm traditionally starts with a randomly selected population of individuals. In each generation, the individuals “compete” among themselves to solve the problem posed. Individuals that perform relatively well “survive” to the next generation. Individuals surviving to the next generation may be subjected to small random modifications. If the algorithm is set up correctly and the problem is in fact amenable to solution in this manner, the population will contain solutions of increasing quality as the iterations proceed.
[0014]
The best-known evolutionary algorithm is the genetic algorithm of J. H. Holland (J. H. Holland (1975) Adaptation in Natural and Artificial Systems, University of Michigan Press; reprinted 1992 by MIT Press). Genetic algorithms are widely used in practical settings (e.g., financial forecasting, management science, etc.). They are particularly well suited to multivariate problems whose solution space is discontinuous (“rugged”) and poorly understood. To apply a genetic algorithm, one defines: 1) a mapping from a set of parameter values to a set of (0-1) bit strings (e.g., character strings), and 2) a mapping from bit strings to real numbers, the so-called fitness function.
[0015]
In most evolutionary algorithms, a randomly selected set of bit strings constitutes the initial population. In a basic genetic algorithm, the following cycle is repeated: the fitness of each individual in the population is evaluated; copies of each individual are made in proportion to its fitness; and the cycle begins again. The use of an arbitrary, random, or haphazard starting population can strongly bias the evolutionary algorithm away from an efficient, accurate, or concise solution of the problem at hand, particularly when the algorithm is used to model or analyze biological history or biological processes. Indeed, the only “force” driving the evolutionary algorithm toward any solution at all is the fitness determination and the accompanying selection pressure. A solution may ultimately be reached, but because the process starts from a random (e.g., arbitrary) initial state in which the members of the population bear no relationship to one another, the trajectory of the population as the algorithm proceeds reveals little or no information reflecting the evolution of the system being simulated.
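The basic cycle described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of the invention: the fitness function (counting 1 bits), the population size, the mutation rate, and all function names are assumptions chosen for the example.

```python
import random

def fitness(bits: str) -> float:
    """Toy fitness function (an assumption): the number of 1 bits."""
    return float(bits.count("1"))

def random_population(size: int, length: int, rng: random.Random) -> list[str]:
    """Randomly selected bit strings, the conventional starting population."""
    return ["".join(rng.choice("01") for _ in range(length)) for _ in range(size)]

def next_generation(population: list[str], rng: random.Random,
                    mutation_rate: float = 0.01) -> list[str]:
    """Copy individuals in proportion to fitness, then apply small random changes."""
    # A tiny offset keeps the weights valid even if every fitness is zero.
    weights = [fitness(ind) + 1e-9 for ind in population]
    parents = rng.choices(population, weights=weights, k=len(population))
    children = []
    for parent in parents:
        child = "".join(
            ("1" if bit == "0" else "0") if rng.random() < mutation_rate else bit
            for bit in parent
        )
        children.append(child)
    return children

rng = random.Random(0)
population = random_population(size=50, length=20, rng=rng)
for _ in range(30):  # each iteration is one "generation"
    population = next_generation(population, rng)
```

Note that the starting strings here are unrelated to one another, which is exactly the limitation the passage above points out.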
[0016]
Furthermore, evolutionary algorithms are typically relatively high-level simulations that provide population-level information. Specific genetic information, if present at all, typically exists as an abstract representation of an allele (typically a single character) or as allele frequencies. As a result, evolutionary algorithms provide little or no information about molecular-level events.
[0017]
Similarly, neural networks and/or cellular automata take an essentially artificial construct as their starting point and use internal rules (algorithms) to approximate biological processes. As a result, such models generally mimic processes or metaprocesses, and they likewise provide little or no information or insight regarding molecular-level events.
[0018]
(Summary of the Invention)
The present invention provides novel methods for generating “initial” populations suitable for further computational manipulation (e.g., via genetic/evolutionary algorithms). The members of populations generated by the methods of the present invention possess varying degrees of “relatedness” or “similarity” to one another, reflecting the degree of covariance found in naturally occurring populations. Further, unlike the populations used as input in typical evolutionary algorithms, the populations generated by the methods provided herein typically contain detailed information about individual members, and that information is typically complex enough to provide a “continuous” (rather than binary) measure of the variability and/or relatedness between members. Indeed, the methods of the present invention provide detailed coding of molecular information in the individuals that make up the populations so created.
[0019]
Accordingly, in one embodiment, the present invention provides a method of populating a data structure with character strings (e.g., generating a collection or library of character strings). The method preferably includes: i) encoding two or more biological molecules into character strings to provide a population of two or more different initial character strings, wherein each of the biological molecules comprises at least about 10 subunits; ii) selecting at least two substrings from the character strings; iii) concatenating the substrings to form one or more solution strings of approximately the same length as one or more of the initial character strings; iv) adding the solution strings to a collection of strings (the data structure); and v) optionally repeating steps (i) or (ii) through (iv), using one or more of the solution strings as initial strings in the population of initial character strings. In certain preferred embodiments, the “encoding” includes encoding one or more nucleic acid sequences and/or one or more amino acid sequences into character strings. The nucleic acid sequences and/or amino acid sequences may be unknown and/or selected at random, but preferably encode known proteins. In one preferred embodiment, the biological molecules are selected to have at least about 30%, preferably at least about 50%, more preferably at least about 75%, and most preferably at least about 85%, 90%, or even 95% sequence identity with one another.
[0020]
In one embodiment, the substrings are selected such that the ends of the substrings occur in character string regions of about 3 to about 300, preferably about 6 to about 20, more preferably about 10 to about 100, and most preferably about 20 to about 50 characters that have higher sequence identity with the corresponding region of another of the initial character strings than the overall sequence identity between the same two strings. In another embodiment, the selecting step can include selecting substrings such that the ends of the substrings occur in predefined motifs of about 4 to about 100, preferably about 4 to about 50, even more preferably about 4 to about 10, still more preferably about 6 to about 30, and most preferably about 6 to about 20 characters.
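As an illustration of the motif-based embodiment, the following Python sketch cuts a string so that each substring ends at a predefined motif. It is a simplification under stated assumptions: the cut is placed at the end of the motif (the text only requires that the end fall within the motif), the motif is matched literally, and the function names, the example sequence, and the motif `GATATC` are hypothetical.

```python
import re

def motif_cut_points(string: str, motif: str) -> list[int]:
    """Return the index just past each (non-overlapping) occurrence of `motif`."""
    return [m.end() for m in re.finditer(re.escape(motif), string)]

def substrings_ending_in_motif(string: str, motif: str) -> list[str]:
    """Split `string` so that every piece except possibly the last ends in `motif`."""
    pieces, start = [], 0
    for cut in motif_cut_points(string, motif):
        pieces.append(string[start:cut])
        start = cut
    if start < len(string):
        pieces.append(string[start:])  # trailing piece with no motif at its end
    return pieces

seq = "ATGGCCGATATCTTTGGGGATATCAAATAG"
pieces = substrings_ending_in_motif(seq, "GATATC")
# pieces == ["ATGGCCGATATC", "TTTGGGGATATC", "AAATAG"]
```

Because the pieces partition the input, concatenating them in order reproduces the original string; substrings taken from different initial strings but bounded by the same motif can be exchanged at those boundaries.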
[0021]
In one embodiment, the selecting and concatenating can include concatenating substrings from two different initial strings such that the concatenation occurs in a region of about 3 to about 20 characters that has higher sequence identity between the two different initial strings than the overall sequence identity between those strings. The selecting can also include aligning two or more of the initial character strings to maximize pairwise identity between two or more substrings of the character strings, and selecting a character that is a member of an aligned pair as the end of one substring.
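The idea of joining substrings inside a region of locally elevated identity can be sketched as follows. The sketch assumes the two strings are already aligned and of equal length with no gaps, uses a fixed window of 4 characters, and always picks the first qualifying window; these choices, the function names, and the example sequences are assumptions made for illustration only.

```python
def identity(a: str, b: str) -> float:
    """Fraction of aligned positions at which two equal-length strings agree."""
    if len(a) != len(b):
        raise ValueError("strings must be pre-aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def high_identity_windows(a: str, b: str, window: int) -> list[int]:
    """Start indices of windows whose local identity exceeds the overall identity."""
    overall = identity(a, b)
    return [
        i for i in range(len(a) - window + 1)
        if identity(a[i:i + window], b[i:i + window]) > overall
    ]

def crossover(a: str, b: str, window: int = 4) -> str:
    """Join a prefix of `a` to a suffix of `b` inside the first qualifying window."""
    starts = high_identity_windows(a, b, window)
    if not starts:
        raise ValueError("no window has higher identity than the strings overall")
    point = starts[0] + window // 2  # cut in the middle of the shared region
    return a[:point] + b[point:]

parent_a = "MKTAYIAKQRQISFVK"
parent_b = "MRSAYIAKERNLSFVR"
child = crossover(parent_a, parent_b)
# child == "MKTAYIAKERNLSFVR"
```

Here the parents agree at 10 of 16 positions overall (62.5%), while the window covering positions 2 to 5 agrees at 3 of 4, so the join is placed there and the solution string has the same length as the initial strings.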
[0022]
In certain embodiments, the “adding” step involves calculating the theoretical pI, pK, molecular weight, hydrophobicity, secondary structure, and/or other properties of the protein encoded by the character string. In one preferred embodiment, a solution string is added to the collection (data structure) only if it has greater than 30%, preferably greater than 50%, more preferably greater than 75% or 85%, sequence identity with an initial string.
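A minimal sketch of the identity-gated “adding” step is shown below. It assumes a naive position-by-position comparison with no alignment, counts any length difference as mismatches, and uses a plain list as the collection; the function names and example strings are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of matching positions; unmatched tail positions count as mismatches."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 100.0 * sum(x == y for x, y in zip(a, b)) / longest

def add_if_similar(collection: list[str], solution: str,
                   initial_strings: list[str], threshold: float = 30.0) -> bool:
    """Add `solution` only if it exceeds `threshold` percent identity to an initial string."""
    if any(percent_identity(solution, s) > threshold for s in initial_strings):
        collection.append(solution)
        return True
    return False

initial = ["ACGTACGTAC"]
collection: list[str] = []
accepted = add_if_similar(collection, "ACGTACGTTT", initial)   # 80% identity
rejected = add_if_similar(collection, "TTTTTTTTTT", initial)   # 20% identity
```

Other criteria mentioned in the text (calculated pI, molecular weight, and so on) could be applied at the same point by replacing or extending the test inside `add_if_similar`.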
[0023]
The method may further include randomly changing one or more characters of the character strings. This can be accomplished by a number of methods including, but not limited to, introducing random strings into the initial string population and/or applying a stochastic operator as described herein. In certain preferred embodiments, the above operations are performed in a computer.
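One simple form of stochastic operator is a per-character point mutation, sketched below. The mutation rate, the alphabet, and the function name `mutate` are assumptions for this example; the text does not prescribe a particular operator.

```python
import random

def mutate(string: str, alphabet: str, rate: float, rng: random.Random) -> str:
    """Replace each character, with probability `rate`, by a different alphabet character."""
    out = []
    for ch in string:
        alternatives = [c for c in alphabet if c != ch]
        if alternatives and rng.random() < rate:
            out.append(rng.choice(alternatives))
        else:
            out.append(ch)
    return "".join(out)

rng = random.Random(42)
original = "ACGTACGTAC"
variant = mutate(original, alphabet="ACGT", rate=0.1, rng=rng)
```

The operator preserves string length, so mutated strings remain comparable position by position with the strings they were derived from.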
[0024]
In another embodiment, the present invention provides a computer program product comprising computer code that: i) encodes two or more biological molecules into character strings to provide a population of two or more different initial character strings, wherein each of the biological molecules comprises at least about 10 subunits; ii) selects at least two substrings from the character strings; iii) concatenates the substrings to form one or more solution strings of approximately the same length as one or more of the initial character strings; iv) adds the solution strings to a collection of strings (i.e., populates a data structure); and v) optionally repeats steps (i) or (ii) through (iv), using one or more of the solution strings as initial strings in the population of initial character strings. That is, the invention provides a computer program product comprising computer code that performs the operations described herein. The program code may be provided in compiled form, as source code, as object code, as an executable, etc. The program may be provided on any convenient medium (e.g., magnetic media, optical media, electronic media, magneto-optical media, etc.). The code may also reside on a computer (e.g., in memory (dynamic or static memory), on a hard drive, etc.).
[0025]
In another embodiment, the present invention provides a system for generating labels (tags) and/or music derived from sequences of biological molecules. The system includes: an encoder for encoding two or more initial strings from biological molecules (e.g., nucleic acids and/or proteins); an isolator for identifying and selecting substrings from the two or more strings; a concatenator for concatenating the substrings; a data structure for storing the concatenated substrings as a collection of strings; a comparator for measuring the number and/or variability of the collection of strings and determining whether sufficient strings exist in the collection; and a command writer for writing the collection of strings into a raw string file. In a preferred embodiment, the isolator includes a comparator for aligning the two or more initial strings and determining regions of identity between them. Similarly, the comparator can include means for calculating sequence identity, and the isolator and comparator can share this means if desired. In a preferred embodiment, the isolator selects substrings such that the ends of the substrings occur in string regions of about 3 to about 100 characters that have higher sequence identity with the corresponding region of another of the initial character strings than the overall sequence identity between the same two strings.
[0026]
In another embodiment, the isolator selects substrings such that the ends of the substrings occur in predefined motifs of about 4 to about 100, preferably about 4 to about 50, even more preferably about 4 to about 10, still more preferably about 6 to about 30, and most preferably about 6 to about 20 characters. In one embodiment, the isolator and concatenator, individually or in combination, concatenate substrings from two different initial strings such that the concatenation occurs in a region of about 3 to about 300, more preferably about 5 to about 200, and most preferably about 10 to about 100 characters that has higher sequence identity between the two different initial strings than the overall sequence identity between those strings. In one preferred embodiment, the isolator aligns two or more of the initial character strings to maximize pairwise identity between two or more substrings of the character strings, and selects a character that is a member of an aligned pair as the end of one substring.
[0027]
The comparator may impose any of a wide variety of selection criteria. Thus, in various embodiments, the comparator can calculate the theoretical pI, pK, molecular weight, hydrophobicity, secondary structure, and/or other properties of the encoded protein. In one preferred embodiment, the comparator adds a string to the data structure only if the string has more than 30% identity with an initial string.
[0028]
The system may optionally include an operator that randomly changes one or more characters in a character string. In certain embodiments, such an operator may randomly select and change one or more occurrences of a particular preselected character in the character string. Preferred data structures in this system store encoded (or deconvolved) nucleic acid sequences and/or encoded (or deconvolved) amino acid sequences.
[0029]
A further understanding of the invention can be obtained from a detailed discussion of the specific embodiments below. For purposes of clarity, this discussion refers to apparatus, methods, and concepts related to particular embodiments. However, the method of the present invention can operate in various types of logic devices. Accordingly, it is intended that the invention not be limited except as provided in the appended claims (as interpreted under equivalence).
[0030]
Further, it is understood that a logic system can include a wide variety of different components and different functions in a modular format. Different embodiments of the system may include different mixtures of elements and functions, and group various functions as part of various elements. For purposes of clarity, the present invention will be described with respect to a system that includes many different innovative components and innovative combinations of components. No inference should be made that the invention is limited to combinations including all of the innovative components listed in any illustrative embodiment herein.
[0031]
(Definition)
The terms “character string”, “word”, “binary string”, and “coded string” refer to any entity capable of storing sequence information (e.g., the subunit structure of a biological molecule, such as the nucleotide sequence of a nucleic acid, the amino acid sequence of a protein, the sugar sequence of a polysaccharide, etc.). In one embodiment, the character string can be a simple array of characters (letters, numbers, or other symbols), or it can be a numerical representation of such information in tangible or intangible (e.g., electronic, magnetic, etc.) form. The character string need not be “linear”, but may also exist in many other forms (e.g., a linked list).
[0032]
“Character” when used in reference to a character in a character string refers to a subunit of the string. In a preferred embodiment, the character of the character string encodes one subunit of the encoded biological molecule. Thus, for example, in a preferred embodiment, if the encoded biological molecule is a protein, the string character encodes a single amino acid.
[0033]
“Motif” refers to a pattern of subunits making up a biological molecule. The motif may refer to a subunit pattern of the unencoded biological molecule or to a subunit pattern of an encoded representation of the biological molecule.
[0034]
The term “substring” refers to a string found within another string. A substring may include the full-length “parent” string, but typically a substring represents only a portion of the full-length string.
[0035]
The term “data structure” refers to an organization for storing information, optionally together with associated devices, and typically holds multiple “pieces” of information. The data structure can be a simple record of the information (e.g., a list), or it can contain additional information (e.g., annotations) about the information it holds, establish relationships between its various “members” (the “pieces of information”), and provide pointers or links to resources external to the data structure. A data structure can be intangible, but becomes tangible when stored or represented in a tangible medium. The data structure may represent various information architectures including, but not limited to, simple lists, linked lists, indexed lists, data tables, indexes, hash indexes, flat file databases, relational databases, local databases, distributed databases, thin client databases, and the like. In a preferred embodiment, the data structure provides fields sufficient for the storage of one or more character strings. The data structure is preferably organized to permit alignment of the character strings and, optionally, to store information regarding the alignment and/or string similarities and/or string differences. In one embodiment, this information takes the form of alignment “scores” (e.g., similarity indices) and/or alignment maps showing the alignment of individual subunits (e.g., nucleotides in the case of nucleic acids). The term “coded character string” refers to a representation of a biological molecule that retains the desired sequence and/or structural information about that molecule.
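As one possible realization of such a data structure, the sketch below stores named character strings together with optional annotations and pairwise similarity scores. The class name `StringCollection`, its field names, and the naive position-wise similarity measure are all assumptions for illustration; a database table or any of the other architectures listed above would serve equally well.

```python
from dataclasses import dataclass, field

def similarity(a: str, b: str) -> float:
    """Naive similarity index: matching positions over the longer length (no alignment)."""
    longest = max(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / longest if longest else 0.0

@dataclass
class StringCollection:
    strings: dict = field(default_factory=dict)      # name -> character string
    annotations: dict = field(default_factory=dict)  # name -> free-text note
    scores: dict = field(default_factory=dict)       # (name, name) -> similarity index

    def add(self, name: str, string: str, note: str = "") -> None:
        """Store a string, its optional annotation, and its score against each member."""
        for other, other_string in self.strings.items():
            key = tuple(sorted((name, other)))
            self.scores[key] = similarity(string, other_string)
        self.strings[name] = string
        if note:
            self.annotations[name] = note

library = StringCollection()
library.add("a", "ACGT", note="hypothetical seed string")
library.add("b", "ACGA")
# library.scores == {("a", "b"): 0.75}
```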
[0036]
As used herein, “similarity” can refer to a measure of similarity between encoded representations of molecules (e.g., initial character strings) or between the molecules represented by the encoded character strings.
[0037]
When referring to operations on strings (e.g., insertions, deletions, transformations, etc.), it is understood that the operation can be performed on the encoded representation of the biological molecule, or on the molecule itself before it is encoded, so that the encoded representation reflects the operation.
[0038]
As used with respect to biological molecules, the term “subunit” refers to the characteristic “monomer” of which the biological molecule is composed. Thus, for example, the subunit of a nucleic acid is a nucleotide, the subunit of a polypeptide is an amino acid, the subunit of a polysaccharide is a sugar, and so on.
[0039]
The terms “pool” or “population” are used interchangeably when used with respect to strings.
[0040]
“Biological molecule” refers to a molecule typically found in a biological organism. Preferred biological molecules include biological macromolecules that are typically polymeric in nature and composed of multiple subunits. Representative biological molecules include, but are not limited to, nucleic acids (composed of nucleotide subunits), proteins (composed of amino acid subunits), polysaccharides (composed of sugar subunits), and the like.
[0041]
The phrase “encoding a biological molecule” refers to generating a representation of the biological molecule that preferably contains the information content of the original biological molecule and can therefore be used to recreate that information content.
[0042]
The term “nucleic acid”, unless otherwise limited, refers to a deoxyribonucleotide polymer or ribonucleotide polymer in single-stranded or double-stranded form, and includes known analogs of natural nucleotides that can function in a manner similar to naturally occurring nucleotides.
[0043]
“Nucleic acid sequence” refers to the order and identity of the nucleotides making up a nucleic acid.
[0044]
The terms “polypeptide”, “peptide”, and “protein” are used interchangeably herein and refer to a polymer of amino acid residues. The terms apply to amino acid polymers in which one or more amino acid residues is an artificial chemical analog of a corresponding naturally occurring amino acid, as well as to naturally occurring amino acid polymers.
[0045]
“Polypeptide sequence” refers to the order and identity of the amino acids making up a polypeptide.
[0046]
As used herein, the phrase “adding a solution string to a collection of strings” does not require mathematical addition. Rather, it refers to the process of identifying one or more strings as belonging to a set of strings. This can be accomplished by a variety of means including, but not limited to: copying or moving the string in question into a data structure that is the collection of strings; setting or providing a pointer from the string to a data structure that represents the collection of strings; setting a flag associated with the string that indicates its inclusion in a particular set; or simply specifying a rule that the strings so generated are included in the collection.
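The four ways of “adding” listed above can each be illustrated with ordinary Python objects. This is a sketch only; the variable names, the `Candidate` class, and the toy recombination rule in `generated_population` are hypothetical and stand in for whatever representation an implementation actually uses.

```python
from dataclasses import dataclass

solution = "ACGTTGCA"

# 1) Copy the string into the data structure that is the collection.
population_by_copy: list[str] = []
population_by_copy.append(solution)

# 2) Keep strings in one pool and record references (keys) to the members.
pool = {"s1": "ACGTTGCA", "s2": "TTGACCGT"}
population_by_reference = ["s1"]

# 3) Set a flag associated with the string to mark its inclusion.
@dataclass
class Candidate:
    sequence: str
    in_population: bool = False

candidates = [Candidate("ACGTTGCA"), Candidate("TTGACCGT")]
candidates[0].in_population = True
population_by_flag = [c.sequence for c in candidates if c.in_population]

# 4) Specify a rule: every string this generator yields is, by definition, a member.
def generated_population(parents: list[str]):
    for a in parents:
        for b in parents:
            if a != b:
                yield a[:4] + b[4:]

population_by_rule = list(generated_population(["ACGTTGCA", "TTGACCGT"]))
```

In the rule-based case no membership is stored at all; the population exists only as the output of the generating procedure.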
[0047]
(Detailed explanation)
(I. Generation of character string groups)
The present invention provides new computational methods for generating representations of large populations of actual or theoretical entities, suitable for use as initial (or mature/processed) populations in evolutionary models, more preferably in evolutionary models typified by genetic algorithms. When initialized to reflect the characteristics of particular biological organisms, the entities generated by the methods of the present invention contain significant information about the underlying molecular biology (e.g., representative amino acid or nucleic acid sequences), thereby allowing models based on genetic or other algorithms to provide information about evolutionary processes at an unprecedented level, namely the molecular level.
[0048]
In a particularly preferred embodiment, the methods of the present invention generate populations of character strings, where each character string represents one or more biological molecules. Using a few strings as “seeds”, the present invention generates large populations of strings that bear an “evolutionary” relationship to the initial seed members. In contrast to traditional genetic algorithms, in which the initial members of the population are arbitrary, random/haphazard, or selected for mathematical or representational convenience, the populations generated by the methods of the present invention are, in preferred embodiments, derived from known, existing biological “precursors” (e.g., specific nucleic acid sequences and/or polypeptide sequences).
[0049]
In a preferred embodiment, the present invention includes the following steps:
1) identifying / selecting two or more biological molecules;
2) encoding a biological molecule into a character string;
3) selecting at least two substrings from the character string;
4) combining these substrings to form one or more solution strings of approximately the same length as the one or more initial character strings;
5) adding the solution string to a collection of strings, which can be a set of initial strings or a separate set;
6) optionally, introducing further variation into the resulting string set;
7) optionally, applying selection pressure to the resulting string set;
8) optionally, repeating steps (2) or (3) through (7), using one or more of the solution strings as initial strings in the collection of initial character strings.
Each of these operations is described in more detail below.
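The steps listed above can be tied together in a compact Python sketch. Everything specific here is an assumption made for illustration: sequences are encoded by simply upper-casing their one-letter codes, substrings are a prefix and a suffix cut at a random shared position, variation is a per-character point mutation, and selection pressure is a minimum identity to any seed. None of these choices is prescribed by the text.

```python
import random

def encode(molecule: str) -> str:
    """Step 2: use the one-letter sequence code directly as the character string."""
    return molecule.upper()

def recombine(a: str, b: str, rng: random.Random) -> str:
    """Steps 3-4: select a prefix of `a` and a suffix of `b`, then concatenate them."""
    point = rng.randrange(1, min(len(a), len(b)))
    return a[:point] + b[point:]

def mutate(s: str, alphabet: str, rate: float, rng: random.Random) -> str:
    """Step 6: introduce further variation by occasional random replacement."""
    return "".join(rng.choice(alphabet) if rng.random() < rate else ch for ch in s)

def identity(a: str, b: str) -> float:
    """Fraction of matching positions relative to the longer string."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def populate(seeds: list[str], generations: int, offspring: int,
             rng: random.Random, alphabet: str = "ACGT",
             mutation_rate: float = 0.02, min_identity: float = 0.3) -> list[str]:
    initial = [encode(s) for s in seeds]          # steps 1-2
    collection = list(initial)
    parents = list(initial)
    for _ in range(generations):                  # step 8: repeat
        solutions = []
        for _ in range(offspring):
            a, b = rng.sample(parents, 2)
            child = mutate(recombine(a, b, rng), alphabet, mutation_rate, rng)
            # Step 7: selection pressure, here a minimum identity to any seed.
            if any(identity(child, s) > min_identity for s in initial):
                solutions.append(child)
        collection.extend(solutions)              # step 5
        parents = parents + solutions             # solutions become initial strings
    return collection

seeds = ["acgtacgtacgt", "acgaacgtaggt", "ccgtacgtacga"]
library = populate(seeds, generations=3, offspring=5, rng=random.Random(1))
```

Because every solution string is assembled from the seeds and their descendants, the members of `library` remain related to one another, unlike a randomly initialized population.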
[0050]
(II. Encoding one or more biological molecules into a character string)
The methods of the present invention typically utilize one or more “seed” members. The “seed” members are preferably representations of one or more biological molecules. Thus, the initial stage of a preferred embodiment of the present invention includes selecting two or more biological molecules and encoding the biological molecules into one or more character strings.
[0051]
(A. Identifying/selecting “seed/initial” biological molecules)
Virtually any biological molecule can be used in the methods of the invention. However, preferred biological molecules are biopolymers composed of multiple “subunits”. Biopolymers that are particularly well suited to the methods of the present invention include, but are not limited to, nucleic acids (e.g., DNA, RNA, etc.), proteins, glycoproteins, carbohydrates, polysaccharides, certain fatty acids, and the like.
[0052]
When a nucleic acid is selected, it is recognized that the nucleic acid can be single-stranded or double-stranded, although a single strand is sufficient to represent/encode a double-stranded nucleic acid. The nucleic acid is preferably a known nucleic acid. Such nucleic acid sequences can be readily obtained from a number of sources including, but not limited to, public databases (e.g., GenBank), proprietary databases (e.g., the Incyte database), scientific publications, commercial or private sequencing laboratories, and in-house sequencing laboratories.
[0053]
Nucleic acid molecules can include genomic nucleic acids, cDNA, mRNA, artificial sequences, natural sequences with modified nucleotides, and the like.
[0054]
In one preferred embodiment, the two or more biological molecules are “related” but not identical. Thus, the nucleic acids may represent the same gene or genes but differ in the strain, species, genus, family, order, phylum, or kingdom from which they are derived. Similarly, in one embodiment, the proteins, polysaccharides, or other molecules are the same protein, polysaccharide, or other molecule, with the differences between them arising from the fact that they are selected from different strains, species, genera, families, orders, phyla, or kingdoms.
[0055]
The biological molecules can represent a single gene product (e.g., mRNA, cDNA, protein, etc.) or they can represent a collection of gene products and/or non-coding nucleic acids. In certain preferred embodiments, the biological molecules represent members of one or more particular metabolic pathways (e.g., regulatory, signaling, or synthetic pathways). Thus, for example, the biological molecules may include an entire operon or the members making up a complete biosynthetic pathway (e.g., the lac operon, the protein: B-DNA gal operon, the colicin A operon, the lux operon, polyketide synthesis pathways, etc.).
[0056]
In certain preferred embodiments, the biological molecules can include a number of different genes, proteins, and the like. Thus, in certain embodiments, the biological molecules can include the total nucleic acid (e.g., genomic DNA, cDNA, or mRNA), total protein, or total lipid, etc., of an individual or of multiple individuals of the same or different species.
[0057]
In certain embodiments, the biological molecules may reflect a “representation” of the entire population of molecules of a species. High-level representation of populations of molecules has been achieved in the laboratory and can be performed in silico according to the methods of the invention. Methods for representing complex molecules or populations of molecules are described for Representational Difference Analysis (RDA) and related techniques (see, e.g., Lisitsyn (1995) Trends Genet. 11(8): 303-307; Risinger et al. (1994) Mol. Carcinog. 11(1): 13-18; and Michiels et al. (1998) Nucleic Acids Res. 26(15): 3608-3610, and references cited therein).
[0058]
Particularly preferred biological molecules for encoding and manipulation in the methods of the invention include proteins of various classes and/or the nucleic acids encoding them, for example: therapeutic proteins such as erythropoietin (EPO), insulin, and peptide hormones such as human growth hormone; and growth factors and cytokines such as neutrophil activating peptide-78, GROα/MGSA, GROβ, GROγ, MIP-1α, MIP-16, MCP-1, epidermal growth factor, fibroblast growth factor, hepatocyte growth factor, insulin-like growth factor, interferons, interleukins, keratinocyte growth factor, leukemia inhibitory factor, oncostatin M, PD-ECSF, PDGF, pleiotropin, SCF, c-kit ligand, angiogenic factors (e.g., vascular endothelial growth factors VEGF- , VEGF-B, VEGF-C, VEGF-D, placental growth factor (PLGF), etc.), growth factors (e.g., G-CSF, GM-CSF), and soluble receptors (e.g., IL4R, IL-13R, IL-10R, soluble T cell receptors).
[0059]
Other preferred molecules for encoding include, but are not limited to, transcriptional activators and expression activators. Transcriptional and expression activators include genes and/or proteins that regulate cell growth, differentiation, regulation, and the like, and are found in prokaryotes, viruses, and eukaryotes, including fungi, plants, and animals. Expression activators include cytokines, inflammatory molecules, growth factors, growth factor receptors, and oncogene products, e.g., interleukins (e.g., IL-1, IL-2, IL-8, etc.), interferons, FGF, IGF-1, IGF-II, FF, PDGF, TNF, TGF-α, TGF-β, EGK, KGF, SCR/c-kit, CD40L/CD40, VLA-4/VCAM-1, ICAM-1/LFA-1, and hyalurin (Hyalurin)/CD44; signaling molecules and corresponding oncogene products (e.g., Mos, RAS, Raf, and Met); transcriptional activators and suppressors (e.g., p53, Tat, Fos, Myc, Jun, Myb, Rel); and steroid hormone receptors such as, but not limited to, the receptors for estrogen, progesterone, testosterone, aldosterone, LDL receptor ligands, and corticosterone.
[0060]
Preferred molecules for encoding in the methods of the present invention also include proteins from infectious or otherwise pathogenic organisms, e.g., characteristic proteins of: fungi such as Aspergillus and Candida; bacteria such as E. coli, Staphylococci, Streptococci, Clostridia, Neisseria spp., Enterobacteriaceae spp., Helicobacter spp., Vibrio spp., Campylobacter spp., Pseudomonas spp., Ureaplasma spp., Legionella spp., Spirochetes spp., Mycobacteria spp., Actinomyces spp., Nocardia spp., Chlamydia spp., Rickettsia spp., Coxiella spp., Ehrlichia spp., Rochalimaea, Brucella, Yersinia, Francisella, and Pasteurella; protozoa; and viruses such as (+) RNA viruses, (−) RNA viruses, orthomyxoviruses, dsDNA viruses, and retroviruses.
[0061]
Still other suitable molecules include inhibitors of transcription, toxins of crop pests, and industrially important enzymes (e.g., proteases, nucleases, and lipases).
[0062]
Preferred molecules include members of related “families” of nucleic acids or of the proteins they encode. Relatedness (e.g., inclusion in or exclusion from a “family”) can be determined by protein function and/or by sequence identity with other members of the family. Sequence identity can be determined as described herein, and preferred family members share at least about 30% sequence identity, more preferably at least about 50% sequence identity, and most preferably at least about 80% sequence identity. In certain instances, it is desirable to include molecules that have low sequence identity (e.g., less than about 30%) but are nonetheless significantly related. Methods for identifying such molecules are well known in the bioinformatics literature and typically involve combining molecular folding patterns with sequence/similarity information. One common implementation of such an approach is the “threading algorithm”. Threading algorithms detect distant homology by comparing sequences against structural templates. If the structural similarity between the target and the template is sufficiently large, their relatedness can be detected even in the absence of significant sequence similarity. Threading algorithms are well known to those skilled in the art and are available, for example, in the NCBI Structure Group Threading Package (available from the National Center for Biotechnology Information; see, e.g., http: //www.ncbi.relm.rht. .Html) and in SeqFold (Molecular Simulations, Inc.).
[0063]
(B. Encoding biological molecules into character strings)
Biological molecules are encoded in character strings. In the simplest case, the character string is identical to the character code conventionally used to represent the biological molecule. Thus, for example, where a nucleic acid is encoded, the character string can consist of the letters A, C, G, T, or U. Similarly, standard amino acid nomenclature can be used to represent polypeptide sequences. It will be appreciated, however, that the encoding scheme is to some extent arbitrary. Thus, for example, in the case of nucleic acids, A, C, G, T, or U can be represented by the integers 1, 2, 3, 4, and 5, respectively, and the nucleic acid itself can be represented as a string of these integers or as a single (typically large) integer. Other encoding schemes are also possible. For example, a biological molecule can be encoded into a character string in which each “subunit” of the molecule is encoded as a multi-character representation. Alternatively, various compressed representations are also possible (e.g., where a repetitive motif is represented only once, with appropriate pointers identifying each occurrence).
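The integer encoding described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, and the choice of base 6 for the single-integer form is an assumption made so that the digits 1-5 never collide with a leading zero.

```python
# Mapping taken from the integer example in the text (A, C, G, T, U -> 1..5).
CODE = {"A": 1, "C": 2, "G": 3, "T": 4, "U": 5}
DECODE = {v: k for k, v in CODE.items()}
BASE = 6  # assumption: any base larger than 5 works; 6 is the smallest


def encode_digits(seq):
    """Encode a nucleic acid as a list of integers, one per subunit."""
    return [CODE[ch] for ch in seq.upper()]


def encode_single_integer(seq):
    """Pack the per-subunit integers into one (typically large) integer."""
    n = 0
    for digit in encode_digits(seq):
        n = n * BASE + digit
    return n


def decode_single_integer(n):
    """Recover the letter string from the packed integer."""
    letters = []
    while n > 0:
        n, digit = divmod(n, BASE)
        letters.append(DECODE[digit])
    return "".join(reversed(letters))
```

Because no subunit maps to 0, the packed integer decodes unambiguously, and Python's arbitrary-precision integers accommodate long sequences.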
[0064]
Biological molecules also need not be encoded in a data structure that is a single discrete string. More complex data structures (e.g., arrays, linked lists, and indexed structures, including but not limited to databases or data tables) can also be used to encode biological molecules.
[0065]
Essentially any data structure that allows entry, storage, and retrieval of a representation of a biological molecule is suitable. Although these operations can be accomplished manually (e.g., using paper and pencil, card files, etc.), preferred data structures are those that can be manipulated optically and/or electronically and/or magnetically, and that thus allow automated input, storage, and output operations (e.g., by a computer).
[0066]
(III. Selection of substring)
In a preferred embodiment, the character strings encoding biological molecules provide an initial population of strings from which substrings are selected. Typically, at least two substrings are selected, with one substring derived from each of at least two initial character strings. If there are more than two initial character strings, not all of them need to provide substrings, as long as at least two initial character strings do so. However, in a preferred embodiment, at least one substring is selected from each initial string.
[0067]
(A. Substring length)
There is virtually no limit to the maximum number of substrings that can be selected from the initial string, other than the maximum number of theoretical strings that can be generated from any given string. Thus, for example, the maximum number of substrings selected from the initial string is the number of strings generated by the complete permutation of the initial string.
[0068]
However, for an initial string of any reasonable length, the number of permutations is very large. Thus, in a preferred embodiment, the substrings are selected from the initial string such that they do not overlap. Expressed differently, in a preferred embodiment, the substrings from any one initial string are selected such that, when concatenated in the correct order, they reproduce the initial string from which they were selected.
[0069]
Preferred substrings are also chosen so as not to be too short. Typically, a substring is no shorter than the minimum string length required to represent one subunit of the encoded biological molecule. Thus, for example, if the encoded biological molecule is a nucleic acid, the substring is long enough to encode at least one nucleotide. Similarly, if the encoded biological molecule is a polypeptide, the substring is long enough to encode at least one amino acid.
[0070]
In preferred embodiments, the selected substring encodes at least 2, preferably at least 4, more preferably at least 10, even more preferably at least 20, and most preferably at least 50, 100, 500, or 1000 subunits of the encoded biological molecule.
[0071]
The substring length can be selected to capture a particular level of biological organization. For example, substrings can be selected that encode an entire gene, cDNA, or mRNA. At a “higher” level of organization, substrings can be selected that encode a series of related genes, cDNAs, mRNAs, etc., such as are found in operons or in regulatory or synthetic pathways. At a still “higher” level of organization, substrings can be selected that encode the entire nucleic acid complement (e.g., genomic DNA, total RNA, total cDNA) of an individual. There is virtually no limitation on the “level of organization” captured in the substring, as long as the initial string from which the substring is selected encodes a higher level of organization. Thus, if the substrings are selected to encode individual genes, the initial string can encode an entire metabolic pathway. If the substrings are selected to encode an individual's entire nucleic acid complement, the initial string may encode the nucleic acids of an entire population, and so on.
[0072]
Conversely, substrings can also be selected to encode subunits of a particular level of biological organization. Thus, for example, substrings can be selected that encode specific domains of proteins, specific regions of chromosomes (e.g., regions that are characteristically amplified, deleted, or translocated), and the like.
[0073]
(B. Substring selection algorithm)
Any of a wide variety of approaches can be used to select substrings. The particular approach is determined by the problem being modeled. Preferred selection approaches include, but are not limited to, random substring selection, uniform substring selection, motif-based selection, alignment-based selection, and frequency-biased selection. The same substring selection method need not be applied to every initial character string; rather, different substring selection methods can be used for different initial strings. Furthermore, multiple substring selection methods can be applied to any one initial character string.
[0074]
(1. Random substring selection)
In one simple approach, substrings can be selected randomly. Many approaches are available for “random” selection of substrings. For example, where substrings of minimum length “L” are to be selected from an encoded character string of length “M”, “cutting points” can be selected using a random number generator that generates integers (indicating positions along the string) ranging from L to M - L (to avoid short terminal substrings). “Internal” substrings of length less than L are discarded.
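The cutting-point approach can be sketched as below. The function name and the `n_cuts` parameter (how many cutting points to draw) are assumptions for illustration; the text does not specify how many cutting points are generated.

```python
import random


def random_cut_substrings(s, min_len, n_cuts, rng=None):
    """Cut s at random positions in [min_len, len(s) - min_len] and
    discard internal pieces shorter than min_len."""
    rng = rng or random.Random()
    m = len(s)
    if m < 2 * min_len:
        return [s]  # too short to cut without creating a short end piece
    # A set removes duplicate cutting points, so no empty pieces arise.
    cuts = sorted({rng.randint(min_len, m - min_len) for _ in range(n_cuts)})
    bounds = [0] + cuts + [m]
    pieces = [s[a:b] for a, b in zip(bounds, bounds[1:])]
    # Terminal pieces are always long enough; short internal ones are dropped.
    return [p for p in pieces if len(p) >= min_len]
```

Restricting cutting points to the range L to M - L guarantees that the first and last pieces are at least L long, so only internal pieces can fall below the minimum.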
[0075]
In another approach, each position along the character string is assigned an address (e.g., an integer ranging from 1 to N, where N is the length of the character string). A minimum substring length “L” and a maximum substring length “M” are selected. A random number generator is then used to generate a number “V” ranging from L to M. The algorithm then selects the substring spanning positions 1 to V, and position V + 1 is renumbered as position 1. This process is repeated until the entire initial string has been spanned.
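A minimal sketch of this second approach follows. The function name is hypothetical, and the handling of the final piece (which may be shorter than L when the string runs out) is an assumption, since the text does not say how the remainder is treated.

```python
import random


def sequential_random_substrings(s, min_len, max_len, rng=None):
    """Walk along s, repeatedly taking a piece of random length V with
    min_len <= V <= max_len, until the whole string is spanned."""
    if not 1 <= min_len <= max_len:
        raise ValueError("need 1 <= min_len <= max_len")
    rng = rng or random.Random()
    pieces, start = [], 0
    while start < len(s):
        v = rng.randint(min_len, max_len)
        # Slicing truncates at the end of s, so the last piece may be short.
        pieces.append(s[start:start + v])
        start += v  # position V + 1 becomes the new position 1
    return pieces
```

The pieces produced this way never overlap and, concatenated in order, reproduce the initial string.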
[0076]
Other methods of randomly selecting substrings are easily devised. For the purposes of the present invention, “random” selection does not require that the selection process meet formal statistical requirements for randomness. Pseudorandom or haphazard selection is sufficient in this context.
[0077]
(2. Uniform substring selection)
In uniform substring selection, the number of desired substrings to be obtained from each initial string is determined. The initial string is then uniformly divided into the desired number of substrings. If the initial string length does not allow uniform splitting, one or more shorter or longer substrings may be allowed.
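Uniform selection can be sketched as follows. The function name is hypothetical, and giving the extra subunits to the leading pieces when the length does not divide evenly is one arbitrary choice among several that the text permits.

```python
def uniform_substrings(s, n):
    """Split s into n contiguous pieces whose lengths differ by at most one."""
    if not 1 <= n <= len(s):
        raise ValueError("n must be between 1 and len(s)")
    base, extra = divmod(len(s), n)
    pieces, start = [], 0
    for i in range(n):
        # The first `extra` pieces are one subunit longer than the rest.
        size = base + (1 if i < extra else 0)
        pieces.append(s[start:start + size])
        start += size
    return pieces
```

For a string of length 10 split three ways, this yields pieces of length 4, 3, and 3.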
[0078]
(3. Selection based on motif)
Substrings can be selected from the initial string using motif-based selection. In this approach, the initial character string is scanned for occurrences of a specific preselected motif. Substrings are then selected such that the substring endpoints occur in a predefined relationship to the motif. Thus, for example, the endpoint may lie within the motif, or a preselected number of subunits “upstream” or “downstream” of the end of the motif.
[0079]
The motif can be completely arbitrary or can reflect the properties of a physical agent or biological molecule. Thus, for example, if the encoded biological molecule is a nucleic acid, the motif can be selected to reflect the binding specificity of a restriction endonuclease (e.g., EcoRV, HindIII, BamHI, PvuII, etc.), a protein binding site, a specific intron/exon junction, a transposon, and the like. Similarly, if the encoded biological molecule is a protein, the motif may reflect a protease binding site, protein binding site, receptor binding site, specific ligand, complementarity determining region, epitope, and the like.
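Motif-based selection for a nucleic acid string can be sketched as below, using a restriction endonuclease recognition site as the motif. The function name and the `offset` parameter (the position of the endpoint relative to the start of each motif occurrence) are illustrative assumptions; the example uses the BamHI site GGATCC, which is cut after the first G.

```python
def motif_substrings(s, motif, offset=0):
    """Split s at every occurrence of motif, placing the endpoint
    `offset` subunits after the start of each occurrence."""
    if not motif:
        raise ValueError("motif must be non-empty")
    cuts = set()
    pos = s.find(motif)
    while pos != -1:
        cut = pos + offset
        if 0 < cut < len(s):  # ignore endpoints that fall on the string ends
            cuts.add(cut)
        pos = s.find(motif, pos + 1)  # also finds overlapping occurrences
    bounds = [0] + sorted(cuts) + [len(s)]
    return [s[a:b] for a, b in zip(bounds, bounds[1:])]


# Example: endpoints one subunit into each GGATCC occurrence.
fragments = motif_substrings("AAGGATCCTTGGATCCAA", "GGATCC", offset=1)
```

A negative `offset` places the endpoint upstream of the motif, and a value larger than the motif length places it downstream.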
[0080]
Similarly, polysaccharides can contain specific sugar motifs, glycoproteins can have specific sugar motifs and / or specific amino acid motifs, and the like.
[0081]
The motif need not reflect only the primary structure of the encoded biological molecule. Secondary and tertiary structure motifs are also possible and can be used to delineate substring endpoints. Thus, for example, the encoded protein can contain a characteristic α-helix, β-sheet, α-helix motif. The occurrence of this motif can then be used to delineate a substring endpoint.
[0082]
Another “higher order” type of motif is the “meta-motif”, as illustrated, for example, by a “fragmented digest”. In this approach, the substring endpoint is not determined by the occurrence of a single motif, but is instead delineated by the coordinated pattern and spacing of one or more motifs.
[0083]
Motifs that do not strictly reflect a sequence pattern, but rather reflect the information content of a particular domain of the character string, can also be selected/utilized. Thus, for example, US Pat. No. 5,867,402 describes a computer system and computation method for processing sequence signals by transformation into an information content weight matrix, represented as R_i(b,l). A second transformation follows, in which a particular sequence signal is applied to the information content weight matrix R_i(b,l) to produce a value, R_i, that comprises the individual information content of the particular sequence signal. Other approaches to determining the information content of character strings are also known (see Staden (1984) Nucleic Acids Res. 12: 505-519; Schneider (1994) Nanotechnology 5: 1-8; Herman et al. (1992) J. Bacteriol. 3558-3560; Schneider et al. (1990) Nucleic Acids Res. 18 (20): 6097-6100; see also Berg et al. (1988) J. Mol. Biol. 200 (4): 709-723).
[0084]
Other contemplated motifs reflect biological signals. Thus, for example, in the case of an encoded nucleic acid, a motif delineating the endpoint of a substring can be a stop codon, a start codon, a polyadenylation signal, or the like, or, in the case of a protein, a methionine.
[0085]
The same motif need not be applied to every initial sequence. In addition, multiple motifs, meta-motifs and / or motif / meta-motif combinations can be applied to any sequence.
[0086]
(4. Selection based on alignment)
In another approach, substrings are selected by aligning two or more initial character strings and choosing substring endpoints within regions of high identity between the initial strings. Thus, for example, after sequence alignment, substring endpoints can be selected to fall within (e.g., in the middle of) a region that has at least 30%, preferably at least 50%, more preferably at least 70%, even more preferably at least 80%, and most preferably at least 85%, 90%, 95%, or even at least 99% sequence identity over a window of at least about 5 subunits, preferably at least about 10 subunits, more preferably at least about 20 subunits, even more preferably at least about 30 subunits, and most preferably at least about 50, 100, 200, 500, or even 1000 subunits in length.
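One way to locate candidate endpoints in regions of high identity is sketched below. It assumes the two strings have already been aligned to equal length with "-" marking gaps; the function name is hypothetical, and returning the midpoint of every qualifying window is an illustrative choice.

```python
def identity_window_endpoints(aligned_a, aligned_b, window, min_identity):
    """Return the midpoints of all windows whose fraction of identical,
    non-gap columns is at least min_identity (a value from 0 to 1)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned strings must have equal length")
    # True where the column holds the same subunit in both strings.
    same = [x == y and x != "-" for x, y in zip(aligned_a, aligned_b)]
    endpoints = []
    for start in range(len(same) - window + 1):
        identity = sum(same[start:start + window]) / window
        if identity >= min_identity:
            endpoints.append(start + window // 2)  # middle of the window
    return endpoints
```

Adjacent qualifying windows yield adjacent candidates, so a caller would typically pick one endpoint per high-identity region.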
[0087]
The terms “sequence identity”, “percent sequence identity”, “percent identity”, or percent “homology”, in the context of two or more biological macromolecules (e.g., nucleic acids or polypeptides), refer to two or more sequences or subsequences that are the same, or that have a specified percentage of subunits (e.g., amino acid residues or nucleotides) that are the same, when compared and aligned for maximum correspondence, as measured using a sequence comparison algorithm or by visual inspection.
[0088]
For sequence comparison, typically one sequence acts as a reference sequence, to which test sequences are compared. In a preferred embodiment, when using a sequence comparison algorithm, test and reference sequences are entered into a computer, subsequence coordinates are designated if necessary, and sequence algorithm program parameters are designated. The sequence comparison algorithm then calculates the percent sequence identity of the test sequence relative to the reference sequence, based on the designated program parameters.
[0089]
Alignment and sequence comparison algorithms are well known to those skilled in the art. For example, optimal alignment of sequences for comparison can be conducted by methods including, but not limited to: the local homology algorithm of Smith and Waterman (1981) Adv. Appl. Math. 2: 482; the homology alignment algorithm of Needleman and Wunsch (1970) J. Mol. Biol. 48: 443; the search-for-similarity method of Pearson and Lipman (1988) Proc. Natl. Acad. Sci. USA 85: 2444; computerized implementations of these algorithms in commercial modules and/or commercial software packages (e.g., GAP, BESTFIT, FASTA, and TFASTA in the Wisconsin Genetics Software Package, Genetics Computer Group, 575 Science Dr., Madison, WI); or visual inspection (see generally Ausubel et al., supra).
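As an illustration of a global alignment followed by a percent identity calculation, the sketch below uses textbook Needleman-Wunsch dynamic programming with a linear gap penalty. The scoring values (match 1, mismatch -1, gap -1) and the convention of dividing by the alignment length are assumptions chosen for simplicity; they are not the parameters of the cited packages.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns holding identical subunits."""
    if not aligned_a:
        return 0.0
    same = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * same / len(aligned_a)
```

Other conventions divide by the length of the shorter sequence or by the number of non-gap columns, so a threshold such as "at least 80% identity" should state which denominator is meant.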
[0090]
One example of a useful algorithm is PILEUP. PILEUP creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments to show relationship and percent sequence identity. It also plots a tree or dendrogram showing the clustering relationships used to create the alignment. PILEUP uses a simplification of the progressive alignment method of Feng and Doolittle (1987) J. Mol. Evol. 35: 351-360. The method used is similar to the method described by Higgins and Sharp (1989) CABIOS 5: 151-153. The program can align up to 300 sequences, each of a maximum length of 5000 nucleotides or amino acids. The multiple alignment procedure begins with the pairwise alignment of the two most similar sequences, producing a cluster of two aligned sequences. This cluster is then aligned to the next most related sequence or cluster of aligned sequences. Two clusters of sequences are aligned by a simple extension of the pairwise alignment of two individual sequences. The final alignment is achieved by a series of progressive, pairwise alignments. The program is run by designating specific sequences and their amino acid or nucleotide coordinates for regions of sequence comparison and by designating the program parameters. For example, a reference sequence can be compared to other test sequences to determine the percent sequence identity relationship using the following parameters: default gap weight (3.00), default gap length weight (0.10), and weighted end gaps.
[0091]
An example of another algorithm suitable for determining percent sequence identity and sequence similarity is the BLAST algorithm, which is described in Altschul et al. (1990) J. Mol. Biol. 215: 403-410. Software for performing BLAST analyses is publicly available through the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/). This algorithm involves first identifying high scoring sequence pairs (HSPs) by identifying short words of length W in the query sequence that either match or satisfy some positive-valued threshold score T when aligned with a word of the same length in a database sequence. T is referred to as the neighborhood word score threshold (Altschul et al., supra). These initial neighborhood word hits act as seeds for initiating searches to find longer HSPs containing them. The word hits are extended in both directions along each sequence for as far as the cumulative alignment score can be increased. Extension of the word hits in each direction is halted when: the cumulative alignment score falls off by the quantity X from its maximum achieved value; the cumulative score goes to zero or below due to the accumulation of one or more negative-scoring residue alignments; or the end of either sequence is reached. The BLAST algorithm parameters W, T, and X determine the sensitivity and speed of the alignment. The BLAST program uses as defaults a word length (W) of 11, the BLOSUM62 scoring matrix (see Henikoff and Henikoff (1989) Proc. Natl. Acad. Sci. USA 89: 10915), alignments (B) of 50, an expectation (E) of 10, M = 5, N = -4, and a comparison of both strands.
[0092]
In addition to calculating percent sequence identity, the BLAST algorithm also performs a statistical analysis of the similarity between two sequences (see, e.g., Karlin and Altschul (1993) Proc. Natl. Acad. Sci. USA 90: 5873-5787). One measure of similarity provided by the BLAST algorithm is the smallest sum probability (P(N)), which provides an indication of the probability by which a match between two nucleotide or amino acid sequences would occur by chance. For example, a nucleic acid is considered similar to a reference sequence if the smallest sum probability in a comparison of the test nucleic acid to the reference nucleic acid is less than about 0.1, more preferably less than about 0.01, and most preferably less than about 0.001.
[0093]
The above identified similarity algorithm is intended to be exemplary and not limiting. It is understood that the similarity can be determined over the full length initial character string or can be limited to a particular subdomain.
[0094]
(5. Frequency-biased selection)
In frequency-biased substring selection, substrings are selected such that the substring endpoints occur in a particular relationship to a string domain that meets certain preselected frequency criteria. For example, if it is desired to exclude encoded biological molecules containing highly repeated subunit patterns (e.g., a high density of AC repeats such as “ACACACACACAC” in the case of a nucleic acid), substring selection can be designed to place an endpoint before the appearance of a particular repeat density of a particular subunit or subunit motif. Here, the repeat density is the number of occurrences of the subunit or subunit motif per unit of character string length, where length is measured in numbers of subunits or of subunit motifs, respectively.
[0095]
Thus, in the example suggested above, a substring endpoint can be selected to occur adjacent to a character string region in which the AC motif occurs at a frequency greater than 0.5 (50%) over a length of, for example, at least 4 motifs (in this case, 8 subunits).
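The AC-repeat example can be sketched as follows. The function names are hypothetical, and placing the endpoint at the start of the first sliding window whose density exceeds the threshold is one possible reading of "adjacent to" the dense region.

```python
def repeat_density(region, motif):
    """Occurrences of motif per region length, with length measured
    in motif-sized units (non-overlapping occurrences are counted)."""
    slots = len(region) // len(motif)
    if slots == 0:
        return 0.0
    return region.count(motif) / slots


def first_dense_endpoint(s, motif, n_motifs, threshold):
    """Return the start of the first window, n_motifs motifs long, whose
    repeat density exceeds threshold, or None if there is none."""
    width = n_motifs * len(motif)
    for start in range(len(s) - width + 1):
        if repeat_density(s[start:start + width], motif) > threshold:
            return start
    return None


# Example from the text: AC motif, window of 4 motifs (8 subunits), density > 0.5.
endpoint = first_dense_endpoint("GGTTGGTTACACACACGG", "AC", 4, 0.5)
```

In this example the window beginning at position 6 is the first to contain three AC motifs (density 0.75), so the substring preceding the endpoint is "GGTTGG".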
[0096]
Another example of such selection is substring selection based on the occurrence of a particular subunit at 100% frequency over at least X subunits. Thus, for example, if the encoded biological molecule is a nucleic acid and the subunit is adenosine (“A”), frequency-biased selection can set the occurrence of a polyadenylation signal (e.g., AAAAAAA) as a substring endpoint. Depending on the design of the frequency-biased substring selection criteria, equivalent results can be obtained using a motif-based selection scheme, as described above.
[0097]
(6. Other criteria)
Numerous other criteria can be used to influence and/or determine the selection of particular substrings. Such criteria include the predicted hydrophobicity and/or pI and/or pK of the molecule encoded by the substring. Other criteria include the desired number of crossovers, the desired fragment size, the distribution of substring lengths, and/or rational information regarding the folding of the molecules encoded by the substrings.
[0098]
(IV. Concatenation of substrings)
Once a group of substrings is selected from the initial string, the substrings are concatenated to generate a new string that is approximately or exactly the same length as the parent initial string. This string concatenation can be performed in a wide variety of ways.
[0099]
In one embodiment, the substrings are concatenated randomly to produce “recombined” strings. In one approach to such “random” concatenation, each substring is assigned a unique identifier (e.g., an integer or other identifier). Identifiers are then randomly selected from the pool (e.g., using a random number generator), and the substrings corresponding to those identifiers are joined to produce a concatenated string. When the concatenated string reaches approximately or exactly the length of the starting character string, the process begins again to generate another string. This process is repeated until all substrings have been used. Alternatively, substrings can be selected without removing them from the “substring pool”, and the process is repeated until the desired number of “full-length” strings is obtained.
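Random concatenation without replacement can be sketched as below. The function name is hypothetical; treating "approximately the length" as "at least `target_len`" and keeping any shorter leftover string at the end are assumptions.

```python
import random


def random_concatenate(substrings, target_len, rng=None):
    """Draw substrings in random order (without replacement) and join them
    into strings of at least target_len subunits each."""
    rng = rng or random.Random()
    pool = dict(enumerate(substrings))  # unique integer identifier -> substring
    ids = list(pool)
    rng.shuffle(ids)  # random draw order over the identifiers
    results, current = [], ""
    for ident in ids:
        current += pool[ident]
        if len(current) >= target_len:
            results.append(current)  # full-length string reached; start anew
            current = ""
    if current:
        results.append(current)  # leftover, possibly shorter than target_len
    return results
```

Sampling with replacement, the alternative mentioned in the text, would instead call `rng.choice` on the identifiers until the desired number of full-length strings had been built.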
[0100]
However, in a preferred embodiment, it is desirable to maintain, in the concatenated string, the relative order that the substrings had in the initial strings. This can be accomplished by any of a wide variety of means. For example, each substring selected from a parent string can be “tagged” with an identifier (e.g., a pointer) that identifies the position of the substring in the initial string relative to the positions of the other substrings derived from that parent string. Substrings derived from corresponding positions in other initial strings are assigned similar position identifiers. This approach is illustrated in FIG. 2, in which each of three initial strings (named A, B, and C) yields five substrings, numbered 1-5. As illustrated, each substring can be uniquely identified (e.g., A1, A2, ... A5, B1, B2, ... B5, C1, C2, ... C5). A concatenated string can then be generated by randomly selecting one substring from each of pool 1 (consisting of A1, B1, and C1), pool 2 (consisting of A2, B2, and C2), and so on through pool 5. This process can be repeated until three strings have been reconstructed.
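The position-pool scheme can be sketched as follows. The function name and the `remove_used` flag are hypothetical; the sketch assumes every parent contributes the same number of substrings, as in the A, B, C example.

```python
import random


def recombine_by_position(parents, rng=None, remove_used=True):
    """parents maps a parent name to its ordered list of substrings.
    Build one child per parent by drawing one substring from each
    position pool, preserving the original relative order."""
    rng = rng or random.Random()
    n_positions = len(next(iter(parents.values())))
    # Pool i holds the i-th substring of every parent (e.g., A1, B1, C1).
    pools = [[subs[i] for subs in parents.values()] for i in range(n_positions)]
    children = []
    for _ in range(len(parents)):
        child = []
        for pool in pools:
            k = rng.randrange(len(pool))
            # Either consume the substring or copy it and leave it in the pool.
            child.append(pool.pop(k) if remove_used else pool[k])
        children.append("".join(child))
    return children
```

With `remove_used=True`, every parental substring appears exactly once among the children; with `remove_used=False`, substrings are copied from the pool and can recur.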
[0101]
In this concatenation scheme, when a substring is concatenated, it is removed from the substring pool. Alternatively, a substring can be “copied” from the pool rather than removed, so that it is used in a concatenated string while remaining available for later concatenation. This creates greater diversity.
[0102]
In other embodiments, various alignment and/or similarity algorithms can be used to maintain the relative order of the substrings during concatenation. In this approach, substrings are assigned a relative position in the concatenated string by aligning regions of high similarity (see, e.g., FIG. 3).
[0103]
In preferred embodiments, the originally encoded biological molecules have some relationship to each other. Thus, for example, the encoded molecules may represent members of a particular enzyme family, individuals from a particular population, or the like. Their substrings are therefore expected to share domains of significant similarity. In addition, important functional domains tend to be conserved, which further increases the similarity of particular substring domains. Thus, aligning regions of high similarity between substrings tends to reconstruct a relative order of the substrings that reflects their order in the initial strings.
[0104]
It is not required that the exact original order be reestablished in every concatenated character string. Preferably, some percentage of the concatenated strings (e.g., at least 1%, more preferably at least 10%, even more preferably at least 20%, and most preferably at least 40%, at least 60%, or at least 80%) retain the original order.
[0105]
The use of similarity criteria to reorder the substrings is analogous to sequencing by hybridization (SBH) methods, in which similarity algorithms are used to reconstruct a nucleic acid sequence from fragments of the complete sequence (see, e.g., Barinaga (1991) Science 253: 1489; Bains (1992) Bio/Technology 10: 757-758; Drmanac and Crkvenjakov, Yugoslavia patent application No. 570/87, 1987; Drmanac et al. (1989) Genomics 4: 114; Strezoska et al. (1991) Proc. Natl. Acad. Sci. USA 88: 10089; and Drmanac and Crkvenjakov, US Pat. No. 31).
[0106]
It will be appreciated that a particular concatenation operation alone, or a selection and concatenation operation together, can be represented by a particular operator. Such operators are known in genetic algorithms. Thus, for example, a “crossover” (reciprocal translocation) operator can be defined, in which substrings at corresponding positions in two different initial strings are exchanged. Similarly, a “link” operator can be defined that links particular substrings such that they cross over together in a crossover event (whether or not they are adjacent substrings). In view of the foregoing disclosure, other operators will be apparent to those skilled in the art.
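The crossover and link operators can be sketched on parents represented as ordered lists of substrings. The function names and the `links` dictionary (mapping a position to the other positions that must cross over with it) are hypothetical representations chosen for illustration.

```python
def crossover(parent_a, parent_b, position):
    """Exchange the substrings at the same position of two parents."""
    child_a, child_b = list(parent_a), list(parent_b)
    child_a[position], child_b[position] = parent_b[position], parent_a[position]
    return child_a, child_b


def linked_crossover(parent_a, parent_b, position, links):
    """Crossover in which positions linked to `position` are exchanged
    along with it, whether or not they are adjacent."""
    child_a, child_b = list(parent_a), list(parent_b)
    for pos in {position, *links.get(position, ())}:
        child_a[pos], child_b[pos] = parent_b[pos], parent_a[pos]
    return child_a, child_b
```

Both functions return new lists and leave the parents unchanged, so the parents remain available for further operations.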
[0107]
(V. Add a solution string to a collection of strings)
The concatenated strings generated by the method of the present invention are added to a collection of strings that forms a “populated data set”. The strings in this collection can be used as initial strings in further iterations of the methods described herein (see FIG. 1). In this context, adding refers to the process of identifying one or more strings as being included in a set of strings. This can be achieved by a variety of means including, but not limited to: copying or moving the string in question into a data structure that is the collection of strings; setting or providing a pointer from the string to a data structure that represents the collection of strings; setting a flag associated with the string (indicating that the string is to be included in a particular set); or simply designating a rule that the strings so generated are included in the collection.
[0108]
Once one or more concatenated character strings have been generated, selection criteria can be used to determine whether the concatenated strings are included in the collection of strings (e.g., as initial strings for a second iteration and/or as elements of the populated data structure). A wide variety of selection criteria can be utilized.
[0109]
In one embodiment, a similarity measure can be used as a selection criterion. Thus, the newly generated concatenated character strings may be required to share a certain predetermined similarity (e.g., greater than 10%, preferably greater than 20% or 30%, more preferably greater than 40% or 50%, and most preferably greater than 60%, 70%, 80%, or even 90%) with each other and/or with the initial strings (or the molecules they encode) and/or with one or more “reference” strings.
[0110]
Selection can also include the use of algorithms that evaluate “relatedness” even when the sequence identity is very low. Such methods include “threading” algorithms and/or covariance measurements.
[0111]
Other selection criteria may require that the molecule represented by the concatenated string satisfy certain computationally predicted properties. Thus, for example, the selection criteria may require a minimum or maximum molecular weight, a particular minimum or maximum free energy in a particular buffer system, a minimum or maximum contact surface with a particular target molecule or surface, a particular net charge in a particular buffer system, a predicted pK, pI, or binding avidity, a particular secondary or tertiary structure, and the like.
[0112]
Still other selection criteria may require that the molecule represented by the concatenated string meet certain empirically assayed physical properties. Thus, for example, the selection criteria may require that the molecule represented by the concatenated string have a particular temperature stability or level of enzymatic activity, produce a solution of a particular pH, have a particular temperature and/or pH optimum, have a minimum or maximum solubility in a particular solvent system, bind a target molecule with a minimum or maximum affinity, and the like. Physical determination of a particular selection criterion typically requires that the molecule represented by the concatenated string be synthesized (e.g., chemically or recombinantly) or isolated.
[0113]
The application of such selection criteria in physical systems is known to those skilled in the art (see, e.g., Stemmer et al. (1991) Tumor Targeting 4: 1-4; Ness et al. (1999) Nature Biotechnology 17: 893-896; Chang et al. (1999) Nature Biotechnology 17: 793-797; Minshull and Stemmer (1999) Current Opinion in Chemical Biology 3: 284-290; Christians et al. (1999) Nature Biotechnology 9: 25: 99N9; -291; Crameri et al. (1997) Nature Biotechnology 15: 436-438; Zhang et al. (1997) Proc. Natl. Acad. Sci. USA 94: 4504-4509; Patten et al. (1997) Curr. Opin. Biotech. ) Nature Med. 2: 100-103; Crameri et al. (1996) Nature Biotechnology 14: 315-319; Gates et al. (1996) J. Mol. Biol. 255: 373-386; BioTechniques 18: 194-195; U.S. Pat. Nos. 5,605,793, 5,811,238, 5,830,721, 5,834,252, and 5,837,458; WO95/22625; WO97/0078; WO97/35966; WO99/41402; WO99/41383; WO99/41369; WO9941368; EP09343670; WO9923107; WO9921979; WO9831837).
[0114]
(VI. Introduction of further modifications)
In certain instances, it may be desirable to introduce further modifications into the population. This is particularly desirable when repeated iterations of the evolutionary algorithm using the initial population generated by the method of the invention do not produce a solution to the modeled problem (e.g., no member meets the selection criteria).
[0115]
Many methods can be used to introduce modifications into the string population generated by the methods of the present invention. Note that modifications can be introduced into the initial strings (the input to the method) or into the concatenated strings (the output). Preferably, such modifications are introduced before the selection step, but in certain cases, modifications can be introduced after selection (e.g., before a second iteration).
[0116]
In one approach, a stochastic operator is introduced into the algorithm that randomly/haphazardly changes one or more subunits comprising the encoded molecule. Note that modifications can be introduced into the unencoded molecule (which is then encoded into a character string) and/or directly into the encoded character string. A stochastic operator typically invokes two selection processes. One selection process determines which subunit(s) are changed; the other determines what the subunit(s) are changed to. Both selection processes can be stochastic, or one or the other can be deterministic. Thus, for example, the selection of a subunit to “mutate” can be random/haphazard, while the mutation always yields the same new/replacement subunit. Alternatively, the particular subunit to be mutated can be predetermined, while the selection of the mutated/resulting subunit is random/haphazard. In yet another embodiment, both the selection of the subunit to be mutated and the result of the mutation can be random/haphazard.
[0117]
In a preferred embodiment, the stochastic operator also takes as an input or parameter a “mutation frequency” that sets the average frequency of occurrence of “mutations”. Thus, for example, if the mutation frequency is set to 10%, the stochastic operator allows a mutation to occur, on average, in only one of every ten subunits comprising the initial string. The mutation frequency can also be set as a range (e.g., 5% to 10%, etc.).
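A stochastic mutation operator with a mutation frequency can be sketched as follows. The function name and parameters are hypothetical; the optional `replacement` argument illustrates the case where the site is chosen stochastically but the outcome is fixed.

```python
import random


def mutate(s, alphabet, frequency, rng=None, replacement=None):
    """Change each subunit of s with probability `frequency`.
    If `replacement` is given, mutated subunits always become that
    subunit; otherwise a different subunit is drawn from `alphabet`."""
    if replacement is None and len(set(alphabet)) < 2:
        raise ValueError("alphabet needs at least two distinct subunits")
    rng = rng or random.Random()
    out = []
    for ch in s:
        if rng.random() < frequency:  # first selection: which subunit changes
            if replacement is not None:
                out.append(replacement)  # deterministic outcome
            else:
                # second selection: what the subunit changes to
                out.append(rng.choice([c for c in alphabet if c != ch]))
        else:
            out.append(ch)
    return "".join(out)
```

A frequency of 0.10 mutates one subunit in ten on average; the actual count varies from run to run because each subunit is tested independently.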
[0118]
The “stochastic operator” need not be applied to all initial strings or to all substrings making up the initial strings. Thus, in certain embodiments, the action of a stochastic operator is constrained to particular initial strings and/or to particular substrings (e.g., domains) of one or more initial strings.
[0119]
If both selection criteria for a stochastic operator are fixed, the operator is no longer stochastic but rather introduces a “directed mutation”. Such an operator may be directed to change all subunits “A” encountered by the operator to subunit “B”. The directed mutation operator can still take mutation frequency as a parameter / attribute / input. As noted above, the mutation frequency limits the number of “encountered” subunits that the operator actually transforms.
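A directed mutation operator of this kind can be sketched as below. The name `directed_mutation` and the use of a per-subunit random draw to apply the mutation frequency are assumptions for illustration; the specification does not say how the frequency limit is enforced.

```python
import random

def directed_mutation(string, source, target, mutation_frequency=1.0, rng=None):
    """Change subunit `source` to `target` wherever it is encountered.
    `mutation_frequency` limits the share of encountered subunits that
    are actually transformed (hypothetical implementation)."""
    rng = rng or random.Random()
    return "".join(
        target if s == source and rng.random() < mutation_frequency else s
        for s in string
    )

all_changed = directed_mutation("ABCABA", "A", "B")          # frequency 1.0
none_changed = directed_mutation("ABCABA", "A", "B", 0.0)    # frequency 0.0
```

At a frequency of 1.0 the operator is fully deterministic; intermediate values keep the identity of the change fixed while leaving the number of changes to chance.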
[0120]
As described above, it is also understood that the probabilistic operator may change one or more encoded subunits. In certain embodiments, the operator modifies multiple encoded subunits, or even the entire substring / domain.
[0121]
Modifications can also be introduced through the use of insertion or deletion operators. Insertion or deletion operators are essentially variants of the “stochastic mutation” operator. Instead of transforming one or more subunits, the deletion operator deletes one or more subunits, while the insertion operator inserts one or more subunits. Again, the deletion and insertion operators each have two selection processes: one that selects the site of the insertion or deletion, and another that selects the size of the deletion or the identity of the insertion. One or both of the selection processes can be stochastic. If both selection processes are predetermined (non-stochastic), the operator is a directed insertion operator or a directed deletion operator. As with the stochastic mutation operator, the insertion or deletion operator can take the mutation frequency as a parameter/attribute/input.
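The stochastic and directed forms of the insertion and deletion operators might look like the following sketch. All function names and the `max_size` parameter are hypothetical, and the mutation-frequency parameter is omitted for brevity.

```python
import random

def stochastic_deletion(string, max_size, rng):
    """Delete a run of 1..max_size subunits starting at a random site."""
    if not string:
        return string
    site = rng.randrange(len(string))          # selection 1: where
    size = rng.randint(1, max_size)            # selection 2: how much
    return string[:site] + string[site + size:]

def stochastic_insertion(string, alphabet, max_size, rng):
    """Insert 1..max_size random subunits at a random site."""
    site = rng.randrange(len(string) + 1)      # selection 1: where
    size = rng.randint(1, max_size)
    insert = "".join(rng.choice(alphabet) for _ in range(size))  # selection 2: what
    return string[:site] + insert + string[site:]

def directed_insertion(string, site, insert):
    """Both selections predetermined: fixed site, fixed inserted subunits."""
    return string[:site] + insert + string[site:]

def directed_deletion(string, site, size):
    """Both selections predetermined: fixed site, fixed deletion size."""
    return string[:site] + string[site + size:]

rng = random.Random(1)
shorter = stochastic_deletion("ACGTACGT", 3, rng)
longer = stochastic_insertion("ACGTACGT", "ACGT", 3, rng)
```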
[0122]
In another embodiment, the diversity of the population can be increased by adding one or more initial strings that are randomly or haphazardly generated and bear no necessary relationship to the initial strings derived from biological molecules. Such a diversity-introducing initial string can be generated as a strictly random or haphazard string or, in certain embodiments, generated according to certain predetermined criteria (e.g., frequency of occurrence of particular subunits, minimum and/or maximum degree of similarity to the encoded strings). The diversity-introducing initial string need not be a full-length string, but may simply comprise one or more substrings. Note that strings or substrings of this nature can also be used to reduce diversity. Thus, if a particular molecular domain is “preferred”, a string or substring encoding this domain can be added to the population of initial strings.
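One way to generate such additional strings under predetermined criteria is sketched below. The helper names, the use of position-wise identity as the similarity measure, and the rejection-sampling loop are assumptions chosen for the example; the specification does not prescribe a particular similarity measure.

```python
import random

def random_string(length, frequencies, rng):
    """Draw a string whose subunits occur with the given relative frequencies."""
    subunits = list(frequencies)
    weights = [frequencies[s] for s in subunits]
    return "".join(rng.choices(subunits, weights=weights, k=length))

def identity(a, b):
    """Fraction of positions at which two strings carry the same subunit."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def random_strings_within_similarity(reference, frequencies, n, lo, hi, rng,
                                     max_tries=10000):
    """Keep random strings whose identity to `reference` lies in [lo, hi]."""
    accepted = []
    for _ in range(max_tries):
        if len(accepted) == n:
            break
        candidate = random_string(len(reference), frequencies, rng)
        if lo <= identity(candidate, reference) <= hi:
            accepted.append(candidate)
    return accepted

rng = random.Random(0)
reference = "ACGTACGTACGTACGTACGT"
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
extra = random_strings_within_similarity(reference, uniform, 5, 0.1, 0.5, rng)
```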
[0123]
(VII. Populating the data structure)
In one embodiment, all associated strings generated by the method of the present invention are used to populate the data structure and/or are used as initial strings in another iteration of the method described herein. In other embodiments, selection criteria are imposed as described above, and only the associated strings that meet the selection criteria are used as initial strings and/or used to populate the data structure. The data structure can be populated with the associated strings in the same encoded representation used in the above operations. Alternatively, the associated strings can be partially deconvolved to reproduce a simpler encoded representation, or a direct representation, of the encoded biological molecules, and these deconvolved strings can be used to populate the data structure.
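The two population modes described above (all strings, or only those meeting a selection criterion) can be sketched as follows. The function `populate`, the list used as the data structure, and the example criterion are illustrative assumptions only.

```python
def populate(data_structure, associated_strings, meets_criteria=None):
    """Add strings to `data_structure`. If `meets_criteria` is given, only
    strings satisfying it are added. The added strings are returned so they
    can serve as initial strings for another iteration."""
    selected = [s for s in associated_strings
                if meets_criteria is None or meets_criteria(s)]
    data_structure.extend(selected)
    return selected

pool = []
# Hypothetical selection criterion: the string must contain subunit "A".
next_initial = populate(pool, ["ACGT", "AAAA", "GGCC"], lambda s: "A" in s)
```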
[0124]
In one embodiment, the data structure can be as simple as a sheet of paper on which the associated strings are written, or a set of cards on each of which one or more associated strings are listed. In a preferred embodiment, the data structure is embodied in a medium (e.g., mechanical and/or fluidic and/or optical and/or quantum and/or magnetic and/or electronic) that allows the data structure to be manipulated by a suitably designed computer. In particularly preferred embodiments, the data structure is formed in computer memory (e.g., dynamic, static, read-only, etc.) and/or in an optical, magnetic, or magneto-optical storage medium.
[0125]
The data structure may simply provide a list of associated strings, albeit in computer-accessible form. Alternatively, the data structure can be constructed to preserve relationships between the various “entries”. At a simple level, this may involve maintaining the identity and/or order of the entries. More sophisticated data structures are also available and can provide ancillary structures for indexing and/or sorting and/or maintaining relationships between one or more entries in the data structure (e.g., associated strings). The data structure may further include annotations on the entries (e.g., origin, type, physical characteristics, etc.) or annotations on links between an entry and an external data source. Preferred data structures include, but are not limited to, lists, linked lists, tables, hash tables and other indexes, flat-file databases, relational databases, and local or distributed computer systems. In a particularly preferred embodiment, the data structure is a data file stored on a conventional (e.g., magnetic and/or optical) medium or a data file read into computer memory.
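A small in-memory structure that maintains entry identity and order, carries annotations, and provides a hash-table index is sketched below. The class and field names (`Entry`, `StringCollection`, `origin`) are hypothetical and merely illustrate one of the many data structures listed above.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    string: str
    annotations: dict = field(default_factory=dict)  # e.g., origin, type

class StringCollection:
    def __init__(self):
        self.entries = []   # list: preserves the order of entries
        self.index = {}     # hash table: string -> position in `entries`

    def add(self, string, **annotations):
        """Add an entry; a string already present is not duplicated."""
        if string in self.index:
            return self.entries[self.index[string]]
        entry = Entry(string, dict(annotations))
        self.index[string] = len(self.entries)
        self.entries.append(entry)
        return entry

    def find(self, string):
        pos = self.index.get(string)
        return None if pos is None else self.entries[pos]

collection = StringCollection()
collection.add("ACGTACGTAC", origin="parent 1")
collection.add("ACGTTTGTAC", origin="recombinant")
collection.add("ACGTACGTAC", origin="duplicate, ignored")
```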
[0126]
(VIII. Implementation in a programmed digital device)
The present invention may be embodied in a fixed medium or transmissible program component containing logic instructions and/or data that, when loaded into an appropriately configured computing device, cause the device to populate the data structure (e.g., generate a pool/collection of associated strings) according to the method of the present invention.
[0127]
FIG. 4 illustrates a digital device 700 that can be understood as a logic device that can read instructions from media 717 and/or network port 719. The apparatus 700 can then use those instructions to direct the encoding of biological molecules, the manipulation of the encoded representations of the molecules, and the population of the data structure. One type of logic device that may embody the invention is a computer system such as that illustrated in 700, which includes a CPU 707, optional input devices 709 and 711, a disk drive 715, and optionally a monitor 705. Fixed medium 717 can be used to program such a system and can represent a disk-type optical or magnetic medium or a memory. Communication port 719 may also be used to program such a system and may represent any type of communication connection.
[0128]
The present invention may also be implemented in the circuitry of an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). In such cases, the present invention may be embodied in a computer-understandable descriptor language that can be used to generate an ASIC or PLD that operates as described herein.
[0129]
The invention may also be implemented within the circuitry or logic process of other digital devices (eg, cameras, displays, image editing devices, etc.).
[0130]
(IX. Embodiments on the website)
The method of the present invention may be implemented in a localized computing environment or a distributed computing environment. In a distributed environment, the method can be implemented on a single computer that includes multiple processors or on multiple computers. The computers can be linked, for example, through a common bus, but more preferably the computers are nodes on a network. The network can be a generalized or dedicated, local or wide area network, and in certain preferred embodiments the computers can be components of an intranet or the Internet.
[0131]
In the preferred embodiment, the client system typically runs a web browser and is connected to a server computer running a web server. The web browser is typically a program such as IBM Web Explorer, NetScape, or Mosaic. The web server is typically, but not necessarily, a program such as IBM's HTTP Daemon or another WWW daemon. The client computer is bi-directionally connected to the server computer through a line or via a wireless system. The server computer is in turn bi-directionally connected to a website that provides access to software implementing the method of the present invention (the server hosts this website).
[0132]
A user of a client connected to an intranet or the Internet may cause the client to request resources that are part of a website hosting an application that provides an implementation of the method of the present invention. The server program then processes the request and replies with the specified resources (assuming they are currently available). A standard naming convention known as the Uniform Resource Locator (“URL”) is applied. This convention encompasses several types of location names, currently including, for example, the following subclasses: Hypertext Transport Protocol (“http”), File Transport Protocol (“ftp”), gopher, and Wide Area Information Service (“WAIS”). When a resource is downloaded, it may include the URLs of additional resources, so the client user can easily learn of the existence of new resources that he or she did not specifically request.
[0133]
The software implementing the method of the present invention may run locally on the server hosting the website in a true client-server architecture. In that case, the client computer posts requests to the host server, which runs the requested process locally and then downloads the results back to the client. Alternatively, the method of the present invention may be performed in a “multi-tier” format, in which components of the method are executed locally by the client. This can be achieved by software downloaded from the server in response to a request by the client (e.g., a Java application) or by software installed “permanently” on the client.
[0134]
In one embodiment, the application implementing the method of the present invention is divided into frames. In this paradigm, rather than viewing the application as a collection of features or functions, it is helpful to view it as a collection of discrete frames or views. For example, a typical application generally includes a set of menu items, each of which invokes a specific frame, i.e., a form that presents a specific function of the application. From this perspective, the application is viewed not as a monolithic body of code but as a collection of applets, or bundles of functionality. In this manner, from within the browser, the user selects a web page link that in turn invokes a specific frame (i.e., sub-application) of the application. Thus, for example, one or more frames may provide functions for entering and/or encoding biological molecules as one or more character strings, while another frame provides tools for generating and/or increasing the diversity of the encoded character strings.
[0135]
In addition to being represented as a collection of frames, the application is also represented as a location on the intranet and/or the Internet: a Uniform Resource Locator (URL) that points to the application. Each URL preferably has two characteristics: the content data for the URL (i.e., whatever data is stored on the server), together with a data type or MIME (Multipurpose Internet Mail Extensions) type. The data type lets the web browser determine how to interpret the data received from the server (e.g., interpreting a .gif file as a bitmap image). In effect, it serves as a description of how to process the data once it has been received by the browser. If a binary data stream is received as type HTML, the browser renders it as an HTML page; if it is instead received as a bitmap, the browser renders it as a bitmap image, and so on.
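The idea that a data type, rather than the raw bytes, tells the receiver how to treat content can be illustrated with Python's standard `mimetypes` module. The function `describe_rendering` and its return strings are hypothetical; a real browser's dispatch logic is considerably more involved.

```python
import mimetypes

def describe_rendering(filename):
    """Decide how content would be handled, based only on its MIME type."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type == "text/html":
        return "render as HTML page"
    if mime_type is not None and mime_type.startswith("image/"):
        return "render as image"
    return "hand to a helper application or save"

html_action = describe_rendering("index.html")
gif_action = describe_rendering("figure.gif")
```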
[0136]
In Microsoft Windows, there are different techniques that allow a host application to register an association with a data object (i.e., data of a specific type). One technique is to register an application with Windows for a specific file extension (e.g., .doc, “Microsoft Word document”); this is the technique most commonly employed by Windows applications. Another approach, taken in Microsoft Object Linking and Embedding (OLE), is the use of a class Globally Unique Identifier (GUID), i.e., a 16-byte identifier that indicates the specific server application to call (to host the document having that GUID). This class ID is registered on a particular machine as being connected to a particular DLL (Dynamic Link Library) or application server.
[0137]
In one embodiment of particular interest, the technique for associating a host application with a document is through the use of a MIME type. MIME provides a standardized technique for packaging document objects; it includes a MIME header that indicates which application is appropriate for hosting the document. All of this is carried in a format suitable for transfer over the Internet.
[0138]
In one preferred embodiment, the method of the present invention is implemented, in part, using a MIME type specific to the use of the method of the present invention. The MIME type contains the information necessary to create a document (e.g., a Microsoft ActiveX document) locally, and additionally contains the information needed to find and download the program code for rendering a view of the document, if necessary. If this program code already exists locally, it only needs to be downloaded for the purpose of updating the local copy. This defines a new document type that includes information supporting downloadable program code for rendering a view of the document.
[0139]
The MIME type may be associated with the .APP file extension. A file with the .APP extension is an OLE document, implemented by an OLE DocObject. Since a .APP file is a single file, it can be placed on the server and linked to using an HTML HREF. The .APP file preferably includes the following pieces of data: (1) the CLSID of the ActiveX object, which is an OLE Document Viewer implemented as one or more forms suitable for use with the method of the present invention; (2) the URL of the code base where the object's code can be found; and (3) a required version number (if needed). Once the .APP DocObject handler code is installed and the .APP MIME type is registered, .APP files can be downloaded into the user's web browser.
[0140]
On the server side, since the .APP file is actually a single file, the web server simply accepts the request and returns the file to the client. When the .APP file is downloaded, the .APP DocObject handler requests download of the code base for the object specified in the .APP file. This system functionality is available in Windows through the CoGetClassObjectFromURL function. After the code base of the ActiveX object is downloaded, the .APP DocObject handler asks the browser to create a view of it, for example by calling the ActivateMe method on the Explorer document site. Internet Explorer then calls the DocObject to instantiate the view, which is done by creating an instance of the ActiveX view object from the downloaded code. Once created, the ActiveX view object is activated in place in Internet Explorer, which creates the appropriate form and all of its child controls.
[0141]
Once this form is created, it can establish connections back to any remote server objects that it needs to perform its function. At this point, the user may interact with the form, which appears to be embedded in the Internet Explorer frame. If the user changes to a different page, the browser assumes responsibility for eventually closing and destroying the form (as well as abandoning any outstanding connections to the remote server).
[0142]
In one preferred embodiment, the entry point to the system from the end user's desktop is the corporate home page or the home page of another specific website. This page may include multiple links in a conventional manner, if desired. In response to a user clicking on a particular link to an application page (eg, a page that provides the functionality of the method of the present invention), the web browser connects to an application page (file) that resides on the server.
[0143]
In one embodiment, when a user requests access to the method of the present invention, the user is directed to a specific page type, e.g., an application (appdoc) page, for in-place execution of an application (implementing one or more elements of the method of the present invention) in the web browser. Each application page is located using a URL, so other pages may have hyperlinks to it. Multiple application pages can be grouped by creating a catalog page that includes hyperlinks to the application pages. When the user selects a hyperlink that points to an application page, the web browser downloads the application code and executes the page within the browser.
[0144]
When a browser downloads an application page, the browser (based on the defined MIME type) calls a local handler, i.e., a handler for documents of that type. More particularly, the application page preferably includes a Globally Unique Identifier (GUID) and a code base URL identifying the remote (downloadable) application to call to host the document. Given the document object and GUID that arrive with the application page, the local handler checks the client machine to see whether the hosting application is already locally resident (e.g., by examining the Windows 95/NT registry). At this point, the local handler may choose to call the local copy (if any) or to download the latest version of the host application.
[0145]
Different models for downloading code are commonly available. When code is downloaded, a “code base” specification (file) is first requested from the server. The code base itself can range from a simple DLL file to a Cabinet file (Microsoft .cab file) containing multiple compressed files. Still further, an information (e.g., Microsoft .inf) file may be employed to instruct the client system how to install the downloaded application. These mechanisms provide great flexibility in choosing which application components are downloaded and when.
[0146]
For the preferred embodiment, the mechanism employed to actually download the program code relies on standard Microsoft ActiveX API (Application Programming Interface) calls. Although the ActiveX API does not provide native support for web-delivered applications, the API can be called to locate the correct version of the program code, copy it to the local machine, verify its integrity, and register it with the client's operating system. Once the code is downloaded, the handler can call the now-resident application host to render the document object (in a manner similar to calling a hosting application through the registry when the application is already installed).
[0147]
Once the hosting application (OLE server) is loaded at the client, the client system may employ the OLE document view architecture to render the application correctly within the browser. This involves using conventional OLE methodology to add the application's menus to the browser's menus and to resize the application correctly when the browser is resized (as opposed to restricting the application to running within a single ActiveX control rectangle). Once the application is executing on the client, it may execute remote logic using, for example, a Remote Procedure Call (RPC) methodology. In this manner, logic that is suitably implemented as a remote procedure can also be used.
[0148]
In certain preferred embodiments, the method of the present invention is implemented as one or more frames that provide the following functionality: a function for encoding two or more biological molecules into character strings to provide a collection of two or more different initial character strings, wherein each of the biological molecules comprises at least about 10 subunits; a function for selecting at least two substrings from the character strings; a function for concatenating the substrings to form one or more product strings of approximately the same length as one or more of the initial character strings; and a function for adding (placing) the product strings into a collection of strings.
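The four functions listed above (encode, select substrings, concatenate, add to a collection) can be sketched end to end as follows. This is a minimal sketch under stated assumptions: the function names, the single-crossover way of choosing substrings, and the example sequences are all hypothetical and are not the specification's prescribed procedure.

```python
import random

def encode(molecule):
    """Encode a molecule as a character string (here, upper-case letters)."""
    return molecule.strip().upper()

def select_substrings(strings, rng):
    """Select two substrings: a prefix of one string and the matching
    suffix of another, split at a randomly chosen crossover point."""
    a, b = rng.sample(strings, 2)
    point = rng.randrange(1, min(len(a), len(b)))
    return a[:point], b[point:]

def concatenate(substrings):
    """Join the selected substrings into a product string."""
    return "".join(substrings)

rng = random.Random(0)
initial = [encode(m) for m in ("acgtacgtacgt", "ttgcaagcttaa")]  # 12 subunits each
collection = list(initial)
product = concatenate(select_substrings(initial, rng))
collection.append(product)   # place the product string into the collection
```

Because both initial strings have the same length and the crossover point is shared, the product string has the same length as the initial strings.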
[0149]
The function for encoding two or more biological molecules preferably provides one or more windows in which the user can insert a representation of the biological molecules. In addition, the encoding function optionally provides access to private and/or public databases accessible through a local network and/or the Internet, so that one or more sequences contained in those databases can be input into the method of the present invention. Thus, for example, in one embodiment, when an end user enters a nucleic acid sequence into the encoding function, the user can optionally request a GenBank search and input one or more of the sequences returned by that search into the encoding function and/or the diversity generation function.
[0150]
Methods for implementing computer processes and/or data access processes over the Internet and/or an intranet are well known to those skilled in the art and are documented in great detail (see, e.g., Cluer et al. (1992) A General Framework for the Optimization of Object-Oriented Queries, Proc SIGMOD International Conference on Management of Data, San Diego, California, June 2-5, S19 Hen; ACM Press, pages 383-392; ISO-ANSI, Working Draft, "Information Technology-Database Language SQL", Jim Melton ed., International Organization for Standardization and American National Standards Institute, 7 May 1992; Microsoft Corporation, "ODBC 2.0 Programmer's Reference and SDK Guide. The Microsoft Open Database Standard for Microsoft Windows.TM. and Windows NT.TM., Microsoft Open Database Connectivity.TM. Software Development Kit", 1992, 1993, 1994 Microsoft Press, pp. 3-30 and 41-56; 199.chi.SQL, September 11, 1997, etc.).
[0151]
Those skilled in the art will recognize that many modifications can be made to this configuration without departing from the scope of the present invention. For example, in a two-tier configuration, a server system that performs the function of a WWW gateway may also perform the function of a web server. As another example, any one of the above embodiments may be modified to accept requests from the user's end that are in a format other than a URL. Yet another modification involves adapting to a multi-manager environment.
[0152]
(X. Physical evaluation and feedback loop integration)
As noted above, in certain preferred embodiments, the selection criteria may require that the molecules represented by the associated strings have certain empirically assayed physical properties. In order to assay these properties, it is necessary to obtain the encoded molecules. To accomplish this, the molecules represented by the associated strings are physically synthesized (e.g., chemically or by recombinant methods) or isolated.
[0153]
The physical synthesis of the genes, proteins, and polysaccharides encoded by a collection of character strings generated in accordance with the present invention is the principal means of creating a physical representation that is amenable to physical assays for one or more desired properties.
[0154]
In a preferred embodiment, gene synthesis techniques are typically used to build libraries in a consistent manner and in strict adherence to the sequence representations provided in the collection of associated strings produced by the methods of the invention.
[0155]
Preferred gene synthesis methods allow rapid construction of libraries of 10⁴-10⁹ gene/protein variants. This is typically adequate for screening/selection protocols, because larger libraries are more difficult to create and maintain and sometimes cannot be sampled as completely by a physical assay or selection method. For example, existing physical assay methods in the art (including, for example, “life and death” selection methods) generally allow sampling of about 10⁹ variants or fewer in a particular screen of a particular library, and many assays are limited to sampling about 10⁴-10⁵ members. Building several smaller libraries is therefore a preferred approach, because a large library cannot easily be sampled completely. However, larger libraries can still be created and sampled using, for example, high-throughput methods.
[0156]
There are many methods that can be used to synthesize genes, polysaccharides, proteins, etc. using well-defined sequences, and the field is developing rapidly. For discussion purposes only, this discussion focuses on one of the many possible and available types of known methods for the production of biological molecules.
[0157]
Current techniques in polynucleotide synthesis are best illustrated by the well-known and mature phosphoramidite chemistry, which allows one skilled in the art to prepare oligonucleotides efficiently. Although it is possible to use this chemistry for routine synthesis of oligonucleotides significantly longer than 100 bp, doing so is somewhat impractical, because synthetic yields decrease and the need for purification increases. “Typical” oligonucleotides of 40-80 bp can be obtained routinely and directly in very high purity.
[0158]
Oligonucleotides and even complete synthetic (double- or single-stranded) genes can be custom ordered from a number of commercial sources (e.g., The Midland Certified Reagent Company, The Great American Gene Company (http://www.genco.com), ExpressGen, Inc. (www.expressgen.com), Operon Technologies Inc. (Alameda, Calif.)). Similarly, peptides can be custom ordered from any of a variety of sources, such as PeptidoGenic, HTI Bio-products, Inc. (http://www.htbio.com), B A Biomedicals, Ltd. (U.K.), and Bio-Synthesis, Inc.
[0159]
A relevant demonstration of total gene synthesis from small fragments that is readily amenable to optimization, parallelization, and high throughput is given by Dillon and Rosen (1990) Biotechniques 9(3): 298-300, which describes a simple and rapid PCR-based process for assembling a gene from a set of partially overlapping single-stranded oligonucleotides without the use of ligase. Several groups have also successfully applied variations of the same PCR-based gene assembly approach to the synthesis of various genes of increasing size, thus demonstrating the general applicability of the approach to the synthesis of libraries of mutated genes (for useful references, see Sandhu et al. (1992) Biotechniques 12(1): 15-16; Prodromou and Pearl (1992) Protein Engin. 5(8): 827-829; Chen et al. (1994) JACS, 1994 (11): 8799-8800; Hayashi et al. (1994) Biotechniques 17: 310-314, etc.).
[0160]
More recently, Stemmer et al. (1995) Gene 164: 49-53 provided evidence that PCR-based assembly methods are useful for constructing larger genes, up to at least 2.7 kb, from dozens or even hundreds of synthetic 40 bp oligonucleotides. These authors also demonstrated that, of the four basic steps of a PCR-based gene synthesis protocol (oligonucleotide synthesis, gene assembly, gene amplification, and, typically, cloning), the gene amplification step can be omitted when “circular” assembly PCR is used.
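As a purely in-silico illustration of how a gene sequence might be tiled into partially overlapping 40-mer oligonucleotides on alternating strands, consider the sketch below. The function names, the fixed 20-base overlap, and the string-based reassembly check are assumptions for the example; they are not the protocol of the cited papers and ignore practical concerns such as melting temperature and secondary structure.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def design_oligos(gene, length=40, overlap=20):
    """Tile `gene` with fragments of up to `length` bases that overlap their
    neighbours by `overlap` bases; odd-numbered fragments are written as the
    opposite strand. Assumes the gene is longer than `overlap`."""
    step = length - overlap
    oligos = []
    for i, start in enumerate(range(0, len(gene) - overlap, step)):
        fragment = gene[start:start + length]
        oligos.append(fragment if i % 2 == 0 else reverse_complement(fragment))
    return oligos

def assemble(oligos, overlap=20):
    """Check the design by rebuilding the top strand from the oligos."""
    top = [o if i % 2 == 0 else reverse_complement(o)
           for i, o in enumerate(oligos)]
    gene = top[0]
    for fragment in top[1:]:
        if gene[-overlap:] != fragment[:overlap]:
            raise ValueError("neighbouring oligos do not overlap as expected")
        gene += fragment[overlap:]
    return gene

gene = "ACGTTGCA" * 13 + "AC"      # arbitrary 106-base example sequence
oligos = design_oligos(gene)
rebuilt = assemble(oligos)
```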
[0161]
Once prepared, the gene can be inserted into a vector according to conventional methods well known to those skilled in the art, and the vector can be used to transfect host cells and express the encoded protein. Cloning methodologies to achieve these objectives and sequencing methods to confirm nucleic acid sequences are well known in the art. Suitable cloning and sequencing techniques, as well as instructions sufficient to guide one of ordinary skill in the art through many cloning exercises, are found in Berger and Kimmel, Guide to Molecular Cloning Techniques, Methods in Enzymology, Vol. 152, Academic Press, Inc., San Diego (Berger); Sambrook et al. (1989) Molecular Cloning: A Laboratory Manual (2nd edition), Volumes 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor (Sambrook); and Current Protocols in Molecular Biology, F. M. Ausubel et al., eds., Current Protocols, a joint venture between Greene Publishing Associates, Inc. and John Wiley & Sons, Inc. (supplemented through 1994). Product information from manufacturers of biological reagents and laboratory equipment also provides useful information on known biological methods. Such manufacturers include SIGMA Chemical Company (Saint Louis, MO), R & D Systems (Minneapolis, MN), Pharmacia LKB Biotechnology (Piscataway, NJ), CLONTECH Laboratories (Palo Alto, CA), Chem Genes Corp., Aldrich Chemical Company (Milwaukee, WI), Glen Research, Inc., GIBCO BRL Life Technologies, Inc. (Gaithersburg, MD), Fluka Chemica-Biochemika Analytika (Fluka Chemie AG, Buchs, Switzerland), Invitrogen (San Diego, CA), and Applied Biosystems.
[0162]
Once expressed, the physical molecules can be screened for one or more properties to determine whether they meet the selection criteria. The character strings encoding molecules that meet the physical selection criteria are then selected as described above. Many assays for physical properties (e.g., binding specificity and/or avidity, enzyme activity, molecular weight, charge, thermal stability, optimal temperature, optimal pH, etc.) are well known to those skilled in the art.
[0163]
In certain embodiments, physical molecules can be subjected to one or more “shuffling” procedures and optionally screened for specific physical properties to generate new molecules. This new molecule is then encoded and processed according to the method described above.
[0164]
Various “shuffling” methods are known. These include the methods taught by the present inventors and their co-workers (e.g., Stemmer (1994) Nature 370: 389-391; Stemmer et al. (1994) Proc. Natl. Acad. Sci. USA 91: 10747-10751; Stemmer, US Pat. No. 5,605,793; Stemmer et al., US Pat. No. 5,830,721; Stemmer et al., US Pat. No. 5,811,238; Minshull et al., US Pat. No. 5,837,458; Crameri et al. (1996) Nature Med. 2(1): 100-103; and PCT publications WO95/22625, WO97/20078, WO96/33207, WO97/33957, WO98/27230, WO97/35966, WO98/31837, WO98/13487, WO98/13485, and WO98/42832). In addition, several co-pending applications describe important DNA shuffling methodologies (e.g., co-pending US patent application Ser. No. 09/116,118 (filed Jul. 15, 1998), 60/102,362, and Selifonov and Stemmer's Methods for Making Character Strings, Polynucleotides & Polypeptides Having Desired Characteristics (US patent application No. 60 / No. 60 / No.
[0165]
Furthermore, the above methods can also be performed in a parallel mode, in which each individual library member (including multiple genes, proteins, polysaccharides, etc.) intended for subsequent physical screening is synthesized in a spatially separate container or array of containers, or in a pooled mode, in which all or part of the desired plurality of molecules is synthesized in a single container. Many other synthetic approaches are known, and the particular advantages of one over another can be readily determined by one skilled in the art.
[0166]
The processes discussed herein are amenable to production using high-throughput systems. High-throughput (e.g., robotic) systems are commercially available (e.g., Zymark Corp., Hopkinton, Mass.; Air Technical Industries, Mentor, Ohio; Beckman Instruments, Inc., Fullerton, Calif.; Precision Science, Inc., MA; etc.). These systems typically automate entire procedures, including all sample and reagent pipetting, liquid dispensing, timed incubations, and final readings of the microplate in a detector appropriate for the assay. These configurable systems provide high throughput and rapid start-up as well as a high degree of flexibility and customization. The manufacturers of such systems provide detailed protocols for the various high-throughput applications. Thus, for example, Zymark Corp. provides technical bulletins describing the use of high-throughput systems for cloning, expression, and screening of chemically or recombinantly produced products.
[0167]
(XI. Use of the generated string populations)
(A) Use in genetic/evolutionary algorithms
In one embodiment, the methods of the present invention provide a population of character strings. Particularly preferred character strings represent encoded biological molecules, and the encoded molecules typically bear some relationship to one another that reflects a level of biological organization. Consequently, the character strings generated by the methods of the present invention do not reflect a random or haphazard selection from a uniform sequence space. Rather, they capture degrees of relatedness (or variation) that reflect those found in nature in populations at a particular level of organization (e.g., gene, gene family, individual, subpopulation). The populations of character strings (e.g., populated data structures) generated by the methods of the present invention therefore provide a useful starting point for various evolutionary models and are convenient for use in evolutionary algorithms (evolutionary computation).
[0168]
When used in such models, the populations (collections of character strings) generated by the methods of the present invention provide far more information than evolutionary algorithms run on arbitrary populations.
[0169]
For example, when an evolutionary algorithm is run using a random or arbitrary collection of members as its starting point, the dynamics of the simulation reflect the progression from that arbitrary starting point to a particular solution (e.g., the distribution of properties in the resulting population). Because the starting point is arbitrary and bears essentially no relationship to a population produced by a natural process, these dynamics provide no information about the dynamics of natural processes or populations.
[0170]
In contrast, the collections of character strings generated by the methods of the present invention contain far more information than the randomly generated starting points used in conventional evolutionary algorithms. First, each member of the population contains considerable information about molecular structure. Thus, rather than being distinguished simply as “self/non-self”, members are distinguished by degrees of relatedness or similarity. The members of a population generated by the methods of the present invention reflect varying degrees of covariation.
[0171]
Furthermore, because the populations generated by the methods of the present invention reflect the fine structure of the level of biological organization encoded in the initial strings, the initial dynamics of a simulation run with these sets of strings reflect the dynamics of “real world” populations and provide considerable insight into evolutionary processes.
[0172]
Because the members generated using the methods of the present invention represent specific molecules, evolutionary algorithms run on these data structures provide real information about molecular evolution and/or the design of new and useful molecular entities.
[0173]
(B) Use in index generation
In another embodiment, the data structures generated by the methods of the present invention can be used as tags (indices) to index essentially any kind of information. In this approach, items of information that are more similar to one another are tagged with members of the data structure (character strings) that are more similar to one another, while items that are less similar are tagged with members that are less similar. In a preferred embodiment, the similarity of the character strings used to tag two different pieces of data reflects (e.g., is proportional to) the similarity of the tagged information.
[0174]
When a search is performed, a first hit is identified using traditional search techniques. If closely related information is desired, the data structure can then be searched for similar members using any of the well-known similarity algorithms described above. These similarity algorithms are designed to provide complete, fast, and efficient searches of large data spaces. Once the members (indices) with the desired similarity are identified, they point to the tagged data, thereby providing the relevant information to the end user.
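The tag-based lookup described above can be sketched as follows. This is a minimal illustration only: the tag strings, the record descriptions, the position-wise identity measure, and the function names `identity` and `related_records` are assumptions made for the example, not part of the described method, which permits any similarity algorithm.

```python
def identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length tag strings match."""
    if len(a) != len(b):
        raise ValueError("tags must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def related_records(index: dict, hit_tag: str, threshold: float) -> list:
    """Return records whose tags are at least `threshold` identical to the tag of the first hit."""
    return [
        record
        for tag, record in index.items()
        if tag != hit_tag and identity(tag, hit_tag) >= threshold
    ]


# Hypothetical index: similar information carries similar tags.
index = {
    "ACGTACGT": "record on protease A",
    "ACGTACGA": "record on protease B",
    "TTTTGGGG": "record on an unrelated topic",
}

# The first hit ("ACGTACGT") is assumed to come from a conventional search.
related = related_records(index, "ACGTACGT", threshold=0.8)
```

Here the second tag differs from the hit at one of eight positions, so its record is returned, while the third tag falls below the threshold.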
[0175]
(C) Use as a reference object in database searches
In a related application, the data structures generated by the methods of the present invention, or members of such data structures (i.e., character strings), can be used as reference objects in database searches. For example, initial known information (e.g., a molecular structure, or an index string from a knowledge database as described above) is encoded and modified according to the methods described herein. This creates a new data structure that captures related but non-obvious variations of the initially encoded information.
[0176]
The resulting information (e.g., members of the data structure) can be analyzed to identify actual or theoretical molecules, which can in turn be used to search typical databases for identical or related molecules. If the encoded information comes from a database index, members of the data structure can be used to probe the original database or a new database to identify relevant or related information.
[0177]
(D) Identification of structural motifs that confer specific molecular properties
To facilitate functional manipulation, for example, it is often of interest to identify the regions of a molecule (e.g., a protein) that are responsible for a particular property. Traditionally this is done using structural information, usually obtained by X-ray crystallography.
[0178]
The sequences of naturally occurring enzymes that catalyze similar or even identical reactions can vary widely; the sequences may be no more than 50% identical. Although the members of a family of such enzymes may all catalyze one identical reaction, their other properties can differ significantly. These include physical properties such as stability to temperature and organic solvents, optimal pH, solubility, the ability to retain activity when immobilized, and ease of expression in different host systems. They also include catalytic properties such as activity (k_cat and K_m), the range of substrates accepted, and even the chemistry performed. The methods described herein can also be applied to non-catalytic proteins (e.g., ligands such as cytokines) and to nucleic acid sequences (e.g., promoters that can be induced by many different ligands), in which multiple functional dimensions are encoded by a family of “homologous” sequences.
[0179]
Because of the divergence between enzymes with similar catalytic functions, it is usually not possible to correlate specific properties with individual amino acids at specific positions; there are simply too many amino acid differences. However, a library of variants can be prepared from a family of homologous natural sequences by encoding the members of the family into initial strings according to the methods of the present invention and then selecting and concatenating substrings to populate a data structure with the encoded variants.
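As a rough illustration of how such a library of encoded variants might be populated, the sketch below cuts aligned, equal-length parent strings at random positions and concatenates one segment per interval, each taken from a randomly chosen parent. The function names, the uniform random choice of cut points, and the toy parent strings are assumptions for the example; the specification also describes alignment-based, motif-based, and other selection schemes.

```python
import random


def recombine(parents, n_crossovers, rng):
    """Build one variant by concatenating substrings of aligned, equal-length parents."""
    length = len(parents[0])
    if any(len(p) != length for p in parents):
        raise ValueError("this sketch assumes parents of equal (aligned) length")
    # Choose distinct internal cut positions, then walk the resulting intervals.
    cuts = sorted(rng.sample(range(1, length), n_crossovers))
    bounds = [0, *cuts, length]
    pieces = [rng.choice(parents)[start:end] for start, end in zip(bounds, bounds[1:])]
    return "".join(pieces)


def populate(parents, size, n_crossovers=2, seed=0):
    """Populate a list (the 'data structure') with `size` recombined variants."""
    rng = random.Random(seed)
    return [recombine(parents, n_crossovers, rng) for _ in range(size)]


# Toy parents: each letter marks which parent a position came from.
parents = ["AAAAAAAAAA", "CCCCCCCCCC", "GGGGGGGGGG"]
library = populate(parents, size=5)
```

Because every segment keeps its original position, each variant has the same length as the parents, which matches the requirement that variant strings be about as long as the initial strings.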
[0180]
The encoded variants can be tested in silico for the desired properties, and/or they can be decoded and the corresponding molecules physically synthesized as described above. The synthesized molecules can then be screened for one or more desired properties.
[0181]
If the members of a data structure are tested for a particular property under a particular set of conditions, the optimal combination of sequences from the data structure (or from the initial string collection) for those conditions can be determined. If the assay conditions are then varied in only one parameter, different individuals from the library (data structure) are identified as the best performers. Because the screening conditions are very similar, most amino acids are probably conserved between the two sets of best performers (the best performers in the initial string collection (set 1) and the best performers in the populated data structure (set 2)). Thus, a comparison of the sequences of the best enzymes under the two different conditions identifies the sequence differences responsible for the difference in performance.
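The comparison of best performers described above amounts to listing the positions at which two aligned sequences differ. The following sketch shows one way to do this; the two sequences and the condition names are hypothetical, and the function name `differing_positions` is an assumption for the example.

```python
def differing_positions(seq_a: str, seq_b: str) -> list:
    """List (1-based position, residue in A, residue in B) wherever two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]


# Hypothetical best performers under two assay conditions that differ in one parameter.
best_condition_1 = "MKTAYIAKQR"
best_condition_2 = "MKTAYLAKQK"

candidates = differing_positions(best_condition_1, best_condition_2)
```

The residues that are conserved between the two winners drop out, leaving positions 6 and 10 as candidates for explaining the performance difference.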
[0182]
Principal component analysis (e.g., using Partek-type software) is one of many multivariate tools useful for such analysis.
[0183]
(E) Use in music generation
In yet another embodiment, the methods of the present invention may be used to generate music. Using any of a number of well-known programs, biological molecules (e.g., DNA, proteins) can be encoded into musical notes. This can involve mapping particular subunits onto particular notes. The timing and/or timbre of the notes can be determined by the motif and/or secondary structure in which the subunit occurs.
[0184]
Thus, for example, the program SS-midi has been used to encode various nucleic acid and amino acid sequences into music. In one approach (DNA calypso), purines are played at 3/2 the rate of pyrimidines, the bases C, T, G, and A are mapped to the notes C, F, G, and A, and the first strand is played with a jazz organ while its complementary strand is played with a bass. In other approaches, the duration of a note may differ (for example, be longer) depending on whether the note/subunit lies in a helix or in a β-sheet. Other variations are of course possible.
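The base-to-note mapping described above can be sketched as follows. This is not the SS-midi program; it is a small illustration under two stated assumptions: "3/2 the rate" is read as a purine note lasting 2/3 as long as a pyrimidine note, and the complementary strand is paired position by position with the first strand so that both voices stay in step. All names are hypothetical.

```python
from fractions import Fraction

# Mapping taken from the text: bases C, T, G, A map to notes C, F, G, A.
NOTE_FOR_BASE = {"C": "C", "T": "F", "G": "G", "A": "A"}
PURINES = {"A", "G"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def to_notes(dna, pyrimidine_beats=Fraction(1)):
    """Return (note, duration in beats) pairs; purines are shortened to 2/3 (assumption)."""
    notes = []
    for base in dna.upper():
        beats = pyrimidine_beats * Fraction(2, 3) if base in PURINES else pyrimidine_beats
        notes.append((NOTE_FOR_BASE[base], beats))
    return notes


def two_voices(dna):
    """First strand for a jazz organ voice, complementary strand for a bass voice."""
    return {
        "jazz organ": to_notes(dna),
        "bass": to_notes(dna.upper().translate(COMPLEMENT)),
    }


voices = two_voices("GATC")
```

A new sequence taken from a populated data structure could be passed to `two_voices` in the same way as a natural sequence.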
[0185]
In the methods of the present invention, biological molecules are encoded into character strings, substrings are selected and concatenated, and a data structure is populated as described above. The populated data structure is then used as input to a program (e.g., SS-midi) that maps the new sequences encoded in the data structure to music. The process may be iterated as described above, thereby generating variations of the musical phrases so produced.
[0186]
(F) Use in driving a synthesis machine
As indicated above, the data structures generated by the methods of the invention can be used to drive devices for the chemical synthesis of the encoded molecules (e.g., polypeptides, nucleic acids, polysaccharides). Starting from only a few initial sequences (“seed members”), the methods of the present invention provide dozens, hundreds, thousands, tens of thousands, hundreds of thousands, or even millions of different encoded molecules as character representations. When the resulting data structure, or members thereof, is used to drive chemical (or recombinant) synthesis, a “combinatorial” library of the desired molecules of virtually any size can be prepared. Such “combinatorial” libraries are widely desired as systems for screening for therapeutic agents, industrial process molecules, specific enzymes, and the like.
[0187]
(Example)
The following examples are provided to illustrate, but not to limit, the invention.
[0188]
(Example 1: Subtilisin family model)
The amino acid sequences were aligned (codon usage can be optimized during retrotranslation for a preferred expression system, and the number of oligonucleotides for synthesis can be minimized). Dot-plot pairwise alignments of all possible pairs of the seven parents were created (FIGS. 5, 6, 7). The pair formed by parents 6 and 7 showed 95% identity for each window of 7 or more amino acids, while all other pairs showed 80% identity for each window of 7 or more amino acids. Note that the stringency of alignment (and the subsequent representation of crossovers between parents) can be manipulated individually for each pair, so that low-homology crossovers can be represented at the expense of highly homologous parents. No structural or active-site bias was incorporated in this model.
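The per-window percent identity quoted above can be computed along the following lines. This is a minimal sketch that assumes the two sequences are already aligned to equal length; the function name `window_identities`, the default window of 7, and the short example sequences are assumptions for illustration, not the subtilisin data of this example.

```python
def window_identities(a: str, b: str, window: int = 7) -> list:
    """Percent identity of each sliding window over two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    scores = []
    for start in range(len(a) - window + 1):
        matches = sum(
            x == y for x, y in zip(a[start:start + window], b[start:start + window])
        )
        scores.append(100.0 * matches / window)
    return scores


# Hypothetical aligned fragments that differ only at the last position.
scores = window_identities("MKTAYIAKQR", "MKTAYIAKQK")
```

Windows that meet a chosen identity threshold mark stretches where a crossover between the two parents is plausible; lowering the threshold for a given pair is one way to adjust the stringency pair by pair.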
[0189]
(Example 2: Process for designing crossover oligonucleotides for the synthesis of chimeric polynucleotides)
First, substrings are identified and selected in the parent (starting) strings so that a crossover operator can be applied to form chimeric junctions. This is done by: a) identifying all or part of the pairwise homology regions between all parent character strings; b) selecting all or part of the identified pairwise homology regions in order to index at least one crossover point within each selected pairwise homology region; and c) selecting one or more pairwise non-homologous regions in order to index at least one crossover point within each selected pairwise non-homologous region (step "c" is optional and is also a step at which structure-activity based selection can be applied). This provides a description of a set of positionally indexed and parent-indexed regions (substrings) of the parent character strings that are suitable for the further selection of crossover points.
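Step a) above, identifying pairwise homology regions, might be approximated as in the sketch below. It assumes two pre-aligned, equal-length strings and defines a homology region as a maximal run of positions covered by at least one sliding window whose identity meets a threshold. The window size, the threshold, and the function name `homology_regions` are assumptions for the example; the specification does not prescribe this particular criterion.

```python
def homology_regions(a: str, b: str, window: int = 7, min_identity: float = 0.8) -> list:
    """Return half-open (start, end) regions covered by windows with identity >= min_identity."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    n = len(a)
    covered = [False] * n
    for start in range(n - window + 1):
        matches = sum(a[i] == b[i] for i in range(start, start + window))
        if matches / window >= min_identity:
            for i in range(start, start + window):
                covered[i] = True
    # Merge consecutive covered positions into regions.
    regions, region_start = [], None
    for i, flag in enumerate(covered + [False]):
        if flag and region_start is None:
            region_start = i
        elif not flag and region_start is not None:
            regions.append((region_start, i))
            region_start = None
    return regions


# Toy pair: identical over the first 10 positions, different afterwards.
parent_1 = "AAAAAAAAAA" + "CCCCCCCCCC"
parent_2 = "AAAAAAAAAA" + "GGGGGGGGGG"
regions = homology_regions(parent_1, parent_2)
```

Running this for every pair of parents yields the positionally indexed, parent-indexed set of regions from which crossover points are chosen in the next step.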
[0190]
Second, crossover points are selected within each substring of the set of substrings selected in step 1. This step includes: a) randomly selecting at least one crossover point in each selected substring; and/or b) selecting at least one crossover point in each selected substring using one or more annealing-simulation-based models that determine the probability of crossover point selection within each selected substring; and/or c) selecting one crossover point approximately in the middle of each selected substring. This creates a set of pairwise crossover points, in which each point is indexed to the corresponding character position in each of the parent strings between which a chimeric junction is to be formed.
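Options a) and c) above can be sketched as follows; option b), the annealing-based model, is not implemented here. The sketch assumes regions are given as half-open (start, end) index pairs keyed by a pair of parent identifiers, and the names `pick_crossover` and `crossover_table` are hypothetical.

```python
import random


def pick_crossover(region, mode="middle", rng=None):
    """Pick one crossover position inside a half-open (start, end) region."""
    start, end = region
    if mode == "middle":
        return (start + end) // 2
    if mode == "random":
        rng = rng or random.Random()
        return rng.randrange(start, end)
    raise ValueError("mode must be 'middle' or 'random'")


def crossover_table(regions_by_pair, mode="middle", seed=None):
    """Map each parent pair to the crossover positions chosen in its regions."""
    rng = random.Random(seed)
    return {
        pair: [pick_crossover(region, mode, rng) for region in regions]
        for pair, regions in regions_by_pair.items()
    }


# Hypothetical homology regions for one pair of parents.
regions_by_pair = {("P1", "P2"): [(0, 11), (20, 30)]}
midpoints = crossover_table(regions_by_pair)
randomized = crossover_table(regions_by_pair, mode="random", seed=1)
```

Because the parents are assumed to be aligned, a single position indexes the same character in both parent strings of the pair.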
[0191]
Third, optional codon usage adjustments are performed. This process can vary depending on the method used to determine homology (strings encoding DNA or amino acids). For example, when DNA sequences are used: a) codon adjustment for the selected expression system is performed for every parent string, and b) codon adjustment between parents can be performed to standardize codon usage for every given amino acid at every corresponding position. This process can significantly reduce the total number of different oligonucleotides needed for gene library synthesis, and it can be particularly advantageous when amino acid homology is higher than DNA homology or when the family consists of highly homologous genes (e.g., 80%+ identical).
[0192]
This option should be exercised with caution, because it is essentially an expression of a selective mutation operator. The benefit of reducing oligonucleotide cost should therefore be weighed against the introduction of this bias, which may have undesirable consequences. Most typically, the codon used is the one that encodes the amino acid at a given position in the majority of parents.
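The majority-codon adjustment between parents can be illustrated as below. The sketch assumes each parent is supplied as a list of aligned codons of equal length and uses the standard genetic code; for every position and every amino acid found there, all parents encoding that amino acid are given the codon most common among them. The function name `standardize_codons` and the toy parents are assumptions for the example.

```python
from collections import Counter

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)


def standardize_codons(parents):
    """Replace synonymous codons at each aligned position with the majority codon."""
    length = len(parents[0])
    if any(len(p) != length for p in parents):
        raise ValueError("parents must be aligned codon lists of equal length")
    result = [list(p) for p in parents]
    for pos in range(length):
        # Count codon usage separately for each amino acid seen at this position.
        usage = {}
        for parent in parents:
            usage.setdefault(CODON_TABLE[parent[pos]], Counter())[parent[pos]] += 1
        for row in result:
            amino_acid = CODON_TABLE[row[pos]]
            row[pos] = usage[amino_acid].most_common(1)[0][0]
    return result


# Toy parents: all encode Ala-Lys but with different codons.
parents = [["GCT", "AAA"], ["GCC", "AAA"], ["GCT", "AAG"]]
adjusted = standardize_codons(parents)
```

After adjustment the three parents share identical codons, so fewer distinct oligonucleotides are needed, at the price of the bias the text warns about.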
[0193]
When amino acid sequences are used: a) the sequences are retrotranslated to degenerate DNA; and b) the degenerate nucleotides are defined using position-by-position reference to the codon usage in the original DNA (of the majority of parents or of the corresponding parent), and/or codon adjustments are made that are appropriate for the selected expression system in which the physical assay will be performed.
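Retrotranslation to degenerate DNA, step a) above, can be sketched with IUPAC ambiguity codes (N = any base, Y = C/T, R = A/G, H = A/C/T). The table below follows the standard genetic code, with one simplification that is an assumption of this sketch: for the three amino acids encoded by six codons (Leu, Arg, Ser) only the four-codon box is used. The function name `retrotranslate` is hypothetical.

```python
# One degenerate codon per amino acid, using IUPAC nucleotide ambiguity codes.
DEGENERATE_CODON = {
    "A": "GCN", "C": "TGY", "D": "GAY", "E": "GAR", "F": "TTY",
    "G": "GGN", "H": "CAY", "I": "ATH", "K": "AAR", "M": "ATG",
    "N": "AAY", "P": "CCN", "Q": "CAR", "T": "ACN", "V": "GTN",
    "W": "TGG", "Y": "TAY",
    # Six-codon amino acids: only the four-fold degenerate box is used here (simplification).
    "L": "CTN", "R": "CGN", "S": "TCN",
}


def retrotranslate(protein: str) -> str:
    """Back-translate an amino acid string into a degenerate DNA string."""
    return "".join(DEGENERATE_CODON[residue] for residue in protein.upper())


degenerate_dna = retrotranslate("MKW")
```

Step b) would then resolve each ambiguity code position by position, for example by consulting the codon actually used in the parent DNA or a codon preference table for the expression host.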
[0194]
This process can also be used to introduce optional restriction sites within the coding portion of the gene for subsequent identification, QA, deconvolution, or manipulation of library entries. All crossover points identified in step 2 (indexed to pairs of parents) are correspondingly indexed to the adjusted DNA sequences.
[0195]
Fourth, the oligonucleotide arrangement is selected for the gene assembly scheme. This process involves several decision steps:
Uniform 40-60 mer oligonucleotides are typically used (using longer oligonucleotides reduces the number of oligonucleotides needed to build the parents, but requires additional dedicated oligonucleotides to represent closely spaced crossovers/mutations).
[0196]
Decide whether shorter or longer oligonucleotides are allowed (i.e., a yes/no decision). A “yes” decision reduces the total number of oligonucleotides for highly homologous genes of different lengths that have gaps (deletions/insertions), particularly gaps of 1-2 amino acids.
[0197]
Select the overlap length (typically 15-20 bases; the overlap can be symmetric or asymmetric).
[0198]
Decide whether degenerate oligonucleotides are allowed (yes/no). This is another powerful cost-reducing feature and also a powerful means of obtaining additional sequence diversity. Partially degenerate and minimally degenerate schemes are particularly advantageous for building mutagenesis libraries.
[0199]
When software tools are used for these operations, several variations of the parameters are run in order to select for maximum library complexity and minimum cost. Complex assembly schemes that use oligonucleotides of various lengths can significantly complicate the indexing process and the subsequent assembly of the library in positionally encoded parallel or partially pooled formats. If this is done without sophisticated software, a simple and uniform scheme (e.g., all oligonucleotides 40 bases long with a 20-base overlap) can be used.
[0200]
Fifth, “convenience sequences” are designed at the head and tail of the parent strings. Ideally, this is the same set that will ultimately be built into every library entry. These sequences include any restriction sites, primer sequences for identification of assembled products, RBS, leader peptides, and other special or desired features. In principle, the convenience sequences can be defined at a later stage, and a “dummy” set of appropriate length can be used at this stage (e.g., a substring of easily recognizable forbidden characters).
[0201]
Sixth, an indexed matrix of oligonucleotide strings for building every parent is created according to the selected scheme. The index of every oligonucleotide includes a parent identifier (parent ID), a designation of coding or complementary strand, and a position number. Crossover points are determined for the indexed coding string of every parent, including its head and tail convenience substrings. A complementary strand is generated for every string. Every coding string is split according to the assembly PCR scheme selected in step 4 (e.g., in 40 bp increments). Every complementary string is split according to the same scheme (e.g., 40 bp with a 20 bp shift).
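The indexed oligonucleotide matrix for one parent under the simple uniform scheme (40-base oligonucleotides, 20-base shift) might look like the sketch below. The index layout (parent ID, strand, start position), the function names, and the choice to leave the two 20-base ends uncovered on the complementary strand are assumptions made to keep the example short.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def oligo_matrix(parent_id: str, coding: str, length: int = 40, shift: int = 20) -> dict:
    """Index coding-strand oligos and staggered complementary-strand oligos for one parent."""
    oligos = {}
    # Coding strand: consecutive pieces of `length` bases.
    for start in range(0, len(coding), length):
        oligos[(parent_id, "coding", start)] = coding[start:start + length]
    # Complementary strand: same piece length, offset by `shift`, stored 5'->3'.
    for start in range(shift, len(coding), length):
        oligos[(parent_id, "complement", start)] = reverse_complement(
            coding[start:start + length]
        )
    return oligos


# Hypothetical 100-base parent (convenience sequences assumed to be included already).
parent = "ACGT" * 25
matrix = oligo_matrix("P1", parent)
```

Each complementary oligonucleotide overlaps two neighbouring coding oligonucleotides by 20 bases, which is what allows the pieces to anneal and be assembled by PCR.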
[0202]
Seventh, an indexed matrix of oligonucleotides is created for every pairwise crossover operation. First, all oligonucleotides that carry pairwise crossover markers are determined. Second, all sets of oligonucleotides that have the same position and the same parent-pair crossover marker are determined (four per crossover point). Third, for every set of four oligonucleotide strings labeled with the same crossover marker, a derived set of four chimeric oligonucleotide strings is created, comprising characters that encode two coding strands and two complementary strands (e.g., with a 20 bp shift in a 40 = 20 + 20 scheme). Two coding strings are possible, each having the forward-end substring of one parent followed, after the crossover point, by the reverse-end substring of the second parent. The complementary strings are designed in the same manner. This yields a complete indexed inventory of strings encoding oligonucleotides suitable for assembling the gene library by PCR.
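One way to derive the four chimeric oligonucleotides for a single crossover point is sketched below. It assumes two aligned, equal-length parent coding strings, the uniform 40-base/20-base-shift grid, and a crossover position at least 20 bases from either end; the function names and index layout are hypothetical. Note that if the crossover falls exactly on an oligonucleotide boundary of one strand, the oligonucleotide on that strand is not actually chimeric.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def chimeric_oligos(a: str, b: str, x: int, length: int = 40, shift: int = 20) -> dict:
    """Return two coding and two complementary oligos spanning crossover position x."""
    if len(a) != len(b):
        raise ValueError("parents must be aligned to the same length")
    if not shift <= x <= len(a) - shift:
        raise ValueError("crossover must lie at least `shift` bases from either end")
    # Full chimeric strings: head of one parent, tail of the other.
    ab = a[:x] + b[x:]
    ba = b[:x] + a[x:]
    # Grid windows (coding and staggered complementary) that contain the crossover.
    c_start = (x // length) * length
    r_start = shift + ((x - shift) // length) * length
    return {
        ("A>B", "coding", c_start): ab[c_start:c_start + length],
        ("B>A", "coding", c_start): ba[c_start:c_start + length],
        ("A>B", "complement", r_start): reverse_complement(ab[r_start:r_start + length]),
        ("B>A", "complement", r_start): reverse_complement(ba[r_start:r_start + length]),
    }


# Toy parents so that the origin of every base is visible.
oligos = chimeric_oligos("A" * 80, "C" * 80, x=50)
```

Adding these four strings to the parental matrix for every crossover point gives the complete inventory referred to in the text.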
[0203]
If desired, this inventory can be further refined by detecting all redundant oligonucleotides, counting them, and deleting the duplicates from the inventory, while recording the count in an “abundance = amount” field in the index of each remaining oligonucleotide string. This can be a very advantageous step for reducing the total number of oligonucleotides needed for library synthesis (especially when the parent sequences are highly homologous).
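The refinement step can be sketched as a simple de-duplication with counting. The inventory layout (a list of index/sequence pairs) and the field names are assumptions for the example.

```python
from collections import Counter


def refine_inventory(inventory):
    """Collapse duplicate oligo sequences, keeping the first index and an abundance count."""
    counts = Counter(sequence for _, sequence in inventory)
    refined = {}
    for index, sequence in inventory:
        if sequence not in refined:
            refined[sequence] = {"index": index, "abundance": counts[sequence]}
    return refined


# Hypothetical inventory: two parents share the same first oligo.
inventory = [
    (("P1", "coding", 0), "ACGT"),
    (("P2", "coding", 0), "ACGT"),
    (("P1", "coding", 40), "GGCC"),
]
refined = refine_inventory(inventory)
```

The abundance value indicates how much of each distinct oligonucleotide is needed, so a shared oligonucleotide is synthesized once in a larger amount instead of several times.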
[0204]
Modifications may be made to the methods and materials described hereinabove without departing from the spirit or scope of the claimed invention, and the invention can be put to many different uses, including the following:
The use of an integrated system to generate shuffled nucleic acids and/or to test shuffled nucleic acids, including in an iterative process.
[0205]
An assay, kit, or system utilizing any one of the selection strategies, materials, components, methods, or substrates described hereinabove. The kit optionally further includes instructions for performing the method or assay, packaging material, one or more containers containing components of the assay, device, or system, and the like.
[0206]
In a further aspect, the present invention provides kits embodying the methods and devices herein. The kits of the present invention optionally comprise one or more of the following: (1) a shuffled component as described herein; (2) instructions for practicing the methods described herein and/or for operating the selection procedures herein; (3) one or more assay components; (4) a container for holding nucleic acids or enzymes, other nucleic acids, transgenic plants, animals, cells, or the like; (5) packaging material; and (6) software for performing any of the processes and/or decision steps described herein.
[0207]
In a further aspect, the present invention provides for the use of any component or kit herein, for the practice of any method or assay herein, and/or for the use of any device or kit to practice any assay or method herein.
[0208]
The examples and embodiments described herein are for illustrative purposes only. Various modifications or changes in light thereof will be suggested to persons skilled in the art and are to be included within the spirit and purview of this application and the scope of the appended claims. All publications, patents, and patent applications cited herein are hereby incorporated by reference in their entirety for all purposes.
[Brief description of the drawings]
FIG. 1 illustrates a flow chart illustrating one embodiment of the method of the present invention.
FIG. 2 illustrates the selection and linking of subsequences according to the method (s) of the present invention.
FIG. 3 illustrates the selection and association of subsequences according to the method (s) of the present invention, where the association utilizes an alignment algorithm to fix the substring regularity.
FIG. 4 illustrates an exemplary digital device 700 according to the present invention.
FIG. 5 is a chart and related phylogenetic tree showing percent similarity for different subtilisins (a typical set of initial character strings).
FIG. 6 is a paired dot plot alignment showing regions of homology for different subtilisins.
FIG. 7 is a paired dot plot alignment showing regions of homology for seven different parent subtilisins.

Claims

A computer-implemented method for identifying molecules for production, wherein the molecules are represented by concatenated strings and the computer comprises a processor and a memory, the method comprising the following steps:
i) the processor encoding two or more biological molecules into a data structure of initial character strings in the memory to provide a collection of two or more different initial character strings in the memory, wherein each of the biological molecules comprises at least about 10 subunits;
ii) the processor selecting at least two substrings from the different initial character strings in the memory;
iii) the processor concatenating the substrings to form one or more solution strings of about the same length as one or more of the different initial character strings;
iv) the processor adding the solution strings to a data structure in the memory, thereby populating a data structure of solution strings in the memory;
v) the processor determining the sequence identity of at least one of the solution strings used to populate the data structure of solution strings in the memory to at least one of the different initial character strings in the memory; and
vi) the processor selecting one or more solution biological molecules for production, wherein the one or more solution biological molecules correspond to one or more of the solution strings used to populate the data structure of solution strings in the memory that have greater than 30% sequence identity to the at least one different initial character string.

The method of claim 1, wherein the encoding step comprises encoding two or more nucleic acid sequences into the character strings.

The method of claim 2, wherein the two or more nucleic acid sequences comprise a nucleic acid sequence encoding a naturally occurring protein.

The method of claim 1, wherein the encoding step comprises encoding two or more amino acid sequences into the character strings.

The method of claim 4, wherein the two or more amino acid sequences comprise an amino acid sequence encoding a naturally occurring protein.

The method of claim 1, wherein the initial character strings have at least 30% sequence identity with each other.

The method of claim 1, wherein the selecting step (ii) comprises selecting at least one substring in one of the initial character strings such that each end of the substring lies in a string region of about 3 to about 20 characters within that initial character string, the region having higher sequence identity to the corresponding region of another initial character string than the overall sequence identity between the two initial character strings.

The method of claim 1, wherein the selecting in (ii) comprises selecting substrings such that the ends of the substrings occur in predetermined motifs of about 4 to about 8 characters.

The method of claim 1, wherein the selecting in (ii) comprises aligning two or more of the initial character strings to maximize pairwise identity between two or more substrings of the initial character strings, and selecting a character that is a member of an aligned pair as an end of one of the two or more substrings.

The method of claim 1, further comprising the processor randomly changing one or more characters of the initial character strings or the solution character strings.

The method of claim 10, further comprising the processor randomly selecting one or more occurrences of a particular preselected character in the initial character strings or the solution character strings and changing them.

The method of claim 1, wherein the encoding, selecting, and concatenating steps are performed on an Internet site.

The method of claim 1, wherein the encoding, selecting, and concatenating steps are performed on a server.

The method of claim 1, wherein the encoding, selecting, and concatenating steps are performed on a client connected to a network.

The method of claim 1, wherein the initial character strings of (i) are related in that they encode members of the same gene or the same protein family that differ in sequence.

The method of claim 1, comprising the processor determining a computationally predicted property of a molecule represented by the solution string.

The method of claim 1, wherein the molecules represented by the solution strings are produced in parallel in an array of containers.

The method of claim 1, wherein the molecule represented by the solution string is produced by assembly of oligonucleotides.

The method of claim 1, wherein the one or more solution strings of (vi) have greater than 50% sequence identity with the at least one initial character string.

The method of claim 1, wherein the one or more solution strings of (vi) have greater than 75% sequence identity with the at least one initial character string.

The method of claim 1, wherein the one or more solution strings of (vi) have greater than 85% sequence identity with the at least one initial character string.

The method of claim 1, wherein the one or more solution strings of (vi) have greater than 90% sequence identity with the at least one initial character string.

The method of claim 1, wherein the one or more solution strings of (vi) have greater than 95% sequence identity with the at least one initial character string.

The method of claim 1, wherein adding the solution string to the data structure comprises adding more than one solution string to the data structure.

The method of claim 1, wherein selecting at least two substrings from the initial character strings comprises selecting substrings randomly.

The method of claim 1, wherein selecting at least two substrings from the initial character strings comprises selecting substrings uniformly.

The method of claim 1, wherein selecting at least two substrings from the initial character strings comprises selecting based on a motif.

The method of claim 1, wherein selecting at least two substrings from the initial character strings comprises selecting based on alignment.

The method of claim 1, wherein selecting at least two substrings from the initial character strings comprises selecting based on frequency.

A computer readable medium encoded with a program that causes a computer including a memory to:
i) encode two or more biological molecules in initial character strings to provide a collection of two or more different initial character strings in the memory, wherein each of the biological molecules comprises at least about 10 subunits;
ii) select at least two substrings from the different initial character strings in the memory;
iii) concatenate the substrings to form one or more solution strings of approximately the same length as one or more of the different initial character strings;
iv) add the solution strings to a data structure in the memory, thereby populating a data structure of solution strings in the memory;
v) determine the sequence identity of at least one of the solution strings to at least one of the different initial character strings in the memory; and
vi) select one or more solution biological molecules for generation, wherein the one or more solution biological molecules correspond to one or more of the solution strings that were used to populate the data structure of solution strings in the memory and that have greater than 30% sequence identity to the at least one different initial character string.
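Steps (ii) through (vi) of the claim above can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes equal-length parent strings, an ungapped position-by-position identity measure, and randomly chosen cut positions; the names `percent_identity`, `recombine`, and `populate` are hypothetical.

```python
import random


def percent_identity(a, b):
    """Ungapped percent identity (an assumed measure): matching positions
    divided by the length of the longer string."""
    if not a and not b:
        return 100.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))


def recombine(parents, n_segments, rng):
    """Steps (ii)-(iii): cut all parents at the same random positions and
    concatenate one substring per segment, each from a randomly chosen parent.
    Assumes all parents have the same length."""
    length = len(parents[0])
    cuts = sorted(rng.sample(range(1, length), n_segments - 1))
    bounds = [0] + cuts + [length]
    return "".join(rng.choice(parents)[s:e] for s, e in zip(bounds, bounds[1:]))


def populate(parents, n_solutions, n_segments=3, threshold=30.0, seed=0):
    """Steps (iv)-(vi): fill a list with solution strings, then keep those
    whose identity to at least one parent exceeds the threshold."""
    rng = random.Random(seed)
    library = [recombine(parents, n_segments, rng) for _ in range(n_solutions)]
    selected = [
        s for s in library
        if any(percent_identity(s, p) > threshold for p in parents)
    ]
    return library, selected
```

A plain list stands in for the claimed data structure here; any container held in memory would serve the same illustrative purpose.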

The computer readable medium of claim 30, wherein the two or more biological molecules are nucleic acid sequences.

The computer readable medium of claim 30, wherein the two or more biological molecules are nucleic acid sequences that encode naturally occurring proteins.

The computer readable medium of claim 30, wherein the two or more biological molecules are amino acid sequences.

The computer readable medium of claim 30, wherein the initial character strings have at least 30% sequence identity with each other.

The computer readable medium of claim 30, wherein the program causes the computer, in (ii), to select at least one substring in one of the initial character strings such that both ends of the substring lie within string regions of about 3 to about 20 characters in that initial character string, the regions having a higher sequence identity to the corresponding regions of another initial character string than the overall sequence identity between the two initial character strings.
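One way to locate candidate regions for the substring ends described above is sketched below. It assumes the two initial character strings are already aligned and of equal length, and it uses a fixed window width; the helper names `identity` and `high_identity_windows` are hypothetical and do not come from the specification.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def high_identity_windows(a, b, width):
    """Start indices of windows (width of about 3 to 20 characters) whose
    local identity exceeds the overall identity of the two strings."""
    overall = identity(a, b)
    return [
        i for i in range(len(a) - width + 1)
        if identity(a[i:i + width], b[i:i + width]) > overall
    ]
```

Substring ends would then be placed inside the returned windows, so that recombination junctions fall where the two parents agree most.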

The computer readable medium of claim 30, wherein the program causes the computer to select substrings such that both ends of each substring lie within a predetermined motif of about 4 to about 8 characters.
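Motif-bounded selection can be sketched as follows. The sketch assumes an exact-match motif and returns every substring that begins at one motif occurrence and ends at a later one; the function name `motif_bounded_substrings` is hypothetical.

```python
def motif_bounded_substrings(s, motif):
    """Substrings of s that start at one occurrence of motif and end at the
    close of a later occurrence, so both ends lie within the motif."""
    starts = []
    i = s.find(motif)
    while i != -1:
        starts.append(i)
        i = s.find(motif, i + 1)  # allow overlapping occurrences
    return [
        s[a:b + len(motif)]
        for k, a in enumerate(starts)
        for b in starts[k + 1:]
    ]
```

With a 4-character motif such as `GATC`, each returned substring both opens and closes on the motif.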

The computer readable medium of claim 30, wherein the program causes the computer to select substrings by aligning two or more of the initial character strings to maximize pairwise identity between two or more substrings of the character strings, and by selecting, for an end of one substring, a character that is a member of an aligned pair.
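Alignment-based selection of substring ends can be approximated with Python's standard `difflib.SequenceMatcher`, which is used here as a stand-in for a real sequence alignment algorithm (an assumption; the claim does not name one). The helper names `aligned_pairs` and `crossover` are hypothetical.

```python
from difflib import SequenceMatcher


def aligned_pairs(a, b):
    """Index pairs (i, j) with a[i] == b[j], taken from the matching blocks
    that SequenceMatcher finds between the two strings."""
    sm = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (blk.a + k, blk.b + k)
        for blk in sm.get_matching_blocks()
        for k in range(blk.size)
    ]


def crossover(a, b, point):
    """Join the prefix of a with the suffix of b at an aligned pair."""
    i, j = point
    return a[:i] + b[j:]
```

Because each junction is a member of an aligned pair, the character at the junction is the same in both parents.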

The computer readable medium of claim 30, wherein the program further causes the computer to randomly alter one or more characters of the initial character strings or the solution character strings.
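Random alteration of characters can be sketched as a per-position point mutation. The mutation rate, the nucleotide alphabet, and the function name `mutate` are assumptions made for illustration only.

```python
import random


def mutate(s, rate, alphabet="ACGT", rng=None):
    """Replace each character with a different character from the alphabet
    with probability `rate`; other characters are kept unchanged."""
    rng = rng or random.Random()
    out = []
    for ch in s:
        if rng.random() < rate:
            out.append(rng.choice([c for c in alphabet if c != ch]))
        else:
            out.append(ch)
    return "".join(out)
```

The same function could be applied to an initial character string before recombination or to a solution string afterwards.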

The computer readable medium of claim 38, wherein the program further causes the computer to randomly select and alter one or more occurrences of a particular preselected character in the initial character strings or the solution character strings.

The computer readable medium of claim 30, wherein the computer readable medium is selected from the group consisting of a magnetic medium, an optical medium, and a magneto-optical medium.

The computer readable medium of claim 30, wherein the computer readable medium is a dynamic memory or a static memory of a computer.

The computer readable medium of claim 30, wherein the initial character strings of (i) are related in that they encode the same gene or the same protein family but differ in sequence.

The computer readable medium of claim 30, wherein the program causes the computer to direct physical screening of molecules represented by the solution strings for one or more desired properties.

The computer readable medium of claim 30, wherein the program causes the computer to direct determination of a computationally predicted property of a molecule represented by the solution strings.

The computer readable medium of claim 30, wherein the program causes the computer to test members of the data structure of solution strings for a particular property and to determine, using multivariate analysis, the sequence differences responsible for differences in that property.

The computer readable medium of claim 30, wherein the one or more solution strings of (vi) have greater than 50% sequence identity with the at least one initial character string.

The computer readable medium of claim 30, wherein the one or more solution strings of (vi) have greater than 75% sequence identity with the at least one initial character string.

The computer readable medium of claim 30, wherein the one or more solution strings of (vi) have greater than 95% sequence identity with the at least one initial character string.

The computer readable medium of claim 30, wherein the program causes the computer to add the solution strings to the data structure by adding more than one solution string to the data structure.

The computer readable medium of claim 30, wherein the program causes the computer to select at least two substrings from the initial character strings by random substring selection.

The computer readable medium of claim 30, wherein the program causes the computer to select at least two substrings from the initial character strings by uniform substring selection.

The computer readable medium of claim 30, wherein the program causes the computer to select at least two substrings from the initial character strings by motif-based selection.

The computer readable medium of claim 30, wherein the program causes the computer to select at least two substrings from the initial character strings by alignment-based selection.

The computer readable medium of claim 30, wherein the program causes the computer to select at least two substrings from the initial character strings by frequency-based selection.

A computer-implemented method for identifying molecules for generation, the molecules being represented by concatenated strings, the computer comprising a processor and a memory, the method comprising the steps of:
i) the processor encoding two or more related biological molecules in a data structure of initial character strings to provide a collection of two or more different initial character strings in the memory, wherein each of the biological molecules comprises at least about 10 subunits;
ii) the processor selecting at least two substrings from the different initial character strings in the memory;
iii) the processor concatenating the substrings to form one or more solution strings;
iv) the processor adding the solution strings to a data structure in the memory, thereby populating a data structure of solution strings in the memory;
v) the processor determining whether at least one of the solution strings used to populate the data structure of solution strings in the memory has at least a predefined threshold value of similarity to at least one of the different initial character strings in the memory; and
vi) the processor selecting one or more solution biological molecules for generation, wherein the one or more solution biological molecules correspond to one or more solution strings that were used to populate the data structure of solution strings in the memory and that were determined to have a sequence identity exceeding a predefined value with respect to the at least one different initial character string.

A computer-implemented method for identifying molecules for generation, the molecules being represented by concatenated strings, the computer comprising a processor and a memory, the method comprising the steps of:
i) the processor encoding two or more naturally occurring biological molecules in a data structure of initial character strings in the memory to provide a collection of two or more different initial character strings in the memory, wherein each of the biological molecules comprises at least about 10 subunits;
ii) the processor selecting at least two substrings from the different initial character strings in the memory;
iii) the processor concatenating the substrings to form one or more solution strings of approximately the same length as one or more of the different initial character strings;
iv) the processor adding the solution strings to a data structure in the memory, thereby populating a data structure of solution strings in the memory; and
v) the processor selecting one or more solution biological molecules for generation, wherein the one or more solution biological molecules correspond to one or more of the solution strings that were used to populate the data structure of solution strings in the memory and that have greater than 30% sequence identity to the at least one different initial character string.

The method of claim 56, wherein the encoding step comprises encoding two or more nucleic acid sequences in the character strings.

The method of claim 56, wherein the encoding step comprises encoding two or more amino acid sequences in the character strings, the two or more amino acid sequences comprising an amino acid sequence encoding a naturally occurring protein.

The method of claim 56, wherein the initial character strings have at least 30% sequence identity with each other.

The method of claim 56, wherein the selecting step of (ii) comprises selecting at least one substring in one of the initial character strings such that both ends of the substring lie within string regions of about 3 to about 20 characters in that initial character string, the regions having a higher sequence identity to the corresponding regions of another initial character string than the overall sequence identity between the two initial character strings.

The method of claim 56, wherein the selecting step of (ii) comprises selecting substrings such that the ends of each substring lie within a predetermined motif of about 4 to about 8 characters.

The method of claim 56, wherein the selecting step of (ii) comprises aligning two or more of the initial character strings to maximize pairwise identity between two or more substrings of the initial character strings, and selecting, for an end of one of the two or more substrings, a character that is a member of an aligned pair.

The method of claim 56, further comprising the processor randomly altering one or more characters of the initial character strings or the solution character strings.

The method of claim 56, wherein the one or more solution strings of (v) have greater than 50% sequence identity with the at least one initial character string.

The method of claim 56, wherein the one or more solution strings of (v) have greater than 75% sequence identity with the at least one initial character string.

The method of claim 56, wherein the one or more solution strings of (v) have greater than 85% sequence identity with the at least one initial character string.

The method of claim 56, wherein the one or more solution strings of (v) have greater than 90% sequence identity with the at least one initial character string.

The method of claim 56, wherein the one or more solution strings of (v) have greater than 95% sequence identity with the at least one initial character string.

The method of claim 56, wherein adding the solution strings to the data structure comprises adding more than one solution string to the data structure.

The method of claim 56, wherein selecting at least two substrings from the initial character strings comprises selecting substrings randomly.

The method of claim 56, wherein selecting at least two substrings from the initial character strings comprises selecting substrings uniformly.

The method of claim 56, wherein selecting at least two substrings from the initial character strings comprises selecting based on a motif.

The method of claim 56, wherein selecting at least two substrings from the initial character strings comprises selecting based on an alignment.

The method of claim 56, wherein selecting at least two substrings from the initial character strings comprises selecting based on frequency.

A computer readable medium encoded with a program that causes a computer including a memory to:
i) encode two or more naturally occurring biological molecules in initial character strings to provide a collection of two or more different initial character strings in the memory, wherein each of the biological molecules comprises at least about 10 subunits;
ii) select at least two substrings from the different initial character strings in the memory;
iii) concatenate the substrings to form one or more solution strings of approximately the same length as one or more of the different initial character strings;
iv) add the solution strings to a data structure in the memory, thereby populating a data structure of solution strings in the memory; and
v) select one or more solution biological molecules for generation, wherein the one or more solution biological molecules correspond to one or more of the solution strings that were used to populate the data structure of solution strings in the memory and that have greater than 30% sequence identity to the at least one different initial character string.

The computer readable medium of claim 75, wherein the program causes the computer to perform the encoding by encoding two or more nucleic acid sequences in the character strings.

The computer readable medium of claim 75, wherein the program causes the computer to encode two or more amino acid sequences in the character strings, the two or more amino acid sequences comprising an amino acid sequence encoding a naturally occurring protein.

The computer readable medium of claim 75, wherein the initial character strings have at least 30% sequence identity with each other.

The computer readable medium of claim 75, wherein the program causes the computer, in (ii), to select at least one substring in one of the initial character strings such that both ends of the substring lie within string regions of about 3 to about 20 characters in that initial character string, the regions having a higher sequence identity to the corresponding regions of another initial character string than the overall sequence identity between the two initial character strings.

The computer readable medium of claim 75, wherein the program causes the computer, in (ii), to select substrings such that both ends of each substring lie within a predetermined motif of about 4 to about 8 characters.

The computer readable medium of claim 75, wherein the program causes the computer, in (ii), to align two or more of the initial character strings to maximize pairwise identity between two or more substrings of the initial character strings, and to select, for an end of one of the two or more substrings, a character that is a member of an aligned pair.

The computer readable medium of claim 75, wherein the program further causes the computer to randomly alter one or more characters of the initial character strings or the solution character strings.

The computer readable medium of claim 75, wherein the one or more solution strings of (v) have greater than 50% sequence identity with the at least one initial character string.

The computer readable medium of claim 75, wherein the one or more solution strings of (v) have greater than 75% sequence identity with the at least one initial character string.

The computer readable medium of claim 75, wherein the one or more solution strings of (v) have greater than 85% sequence identity with the at least one initial character string.

The computer readable medium of claim 75, wherein the one or more solution strings of (v) have greater than 90% sequence identity with the at least one initial character string.

The computer readable medium of claim 75, wherein the one or more solution strings of (v) have greater than 95% sequence identity with the at least one initial character string.

The computer readable medium of claim 75, wherein the program causes the computer to add the solution strings to the data structure by adding more than one solution string to the data structure.

The computer readable medium of claim 75, wherein the program causes the computer to select at least two substrings from the initial character strings by random substring selection.

The computer readable medium of claim 75, wherein the program causes the computer to select at least two substrings from the initial character strings by uniform substring selection.

The computer readable medium of claim 75, wherein the program causes the computer to select at least two substrings from the initial character strings by motif-based selection.

The computer readable medium of claim 75, wherein the program causes the computer to select at least two substrings from the initial character strings by alignment-based selection.

The computer readable medium of claim 75, wherein the program causes the computer to select at least two substrings from the initial character strings by frequency-based selection.

A computer-implemented method for identifying molecules for generation, the molecules being represented by concatenated strings, the computer comprising a processor and a memory, the method comprising the steps of:
i) the processor encoding two or more biological molecules in a data structure of initial character strings in the memory to provide a collection of two or more different initial character strings in the memory, wherein each of the biological molecules comprises at least about 10 subunits;
ii) the processor selecting at least two substrings from the different initial character strings in the memory;
iii) the processor concatenating the substrings to form one or more solution strings of approximately the same length as one or more of the different initial character strings;
iv) the processor adding the solution strings to a data structure in the memory, thereby populating a data structure of solution strings in the memory;
v) the processor obtaining one or more computationally predicted properties for at least one of the solution strings in the data structure of solution strings in the memory; and
vi) the processor selecting one or more solution biological molecules for generation based on the one or more computationally predicted properties.

The method of claim 94, wherein the computationally predicted property comprises one or more of: a maximum or minimum molecular weight, a maximum or minimum free energy, a maximum or minimum contact surface with a target molecule or surface, a specified net charge, a predicted pK, a predicted pI, a binding affinity, a secondary structure, and a tertiary structure.
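As an illustration of selecting on a computationally predicted property, the sketch below estimates net charge for protein strings and keeps the solution strings that match a specified value. The charge model is a deliberately crude assumption (lysine and arginine count as +1, aspartate and glutamate as -1, histidine and the termini are ignored), and the names `net_charge` and `select_by_charge` are hypothetical.

```python
def net_charge(protein):
    """Crude net charge estimate at neutral pH (assumed model):
    K and R contribute +1 each, D and E contribute -1 each."""
    positive = sum(protein.count(c) for c in "KR")
    negative = sum(protein.count(c) for c in "DE")
    return positive - negative


def select_by_charge(solutions, target):
    """Keep the solution strings whose estimated net charge equals target."""
    return [s for s in solutions if net_charge(s) == target]
```

Any of the other listed properties, such as molecular weight or predicted pI, could be substituted for `net_charge` in the same selection pattern.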

The method of claim 94, wherein the encoding step comprises encoding two or more amino acid sequences in the character strings.

The method of claim 94, wherein the selecting step of (ii) comprises aligning two or more of the initial character strings to maximize pairwise identity between two or more substrings of the initial character strings, and selecting, for an end of one of the two or more substrings, a character that is a member of an aligned pair.

The method of claim 94, further comprising the processor randomly altering one or more characters of the initial character strings or the solution character strings.

The method of claim 94, wherein the one or more solution biological molecules of (vi) have greater than 50% sequence identity with the at least one initial character string.

The method of claim 94, wherein the one or more solution biological molecules of (vi) have greater than 75% sequence identity with the at least one initial character string.

The method of claim 94, wherein the one or more solution biological molecules of (vi) have greater than 90% sequence identity with the at least one initial character string.

The method of claim 94, wherein adding the solution strings to the data structure comprises adding more than one solution string to the data structure.

The method of claim 94, wherein selecting at least two substrings from the initial character strings comprises selecting substrings randomly.

The method of claim 94, wherein selecting at least two substrings from the initial character strings comprises selecting substrings uniformly.

The method of claim 94, wherein selecting at least two substrings from the initial character strings comprises selecting based on a motif.

The method of claim 94, wherein selecting at least two substrings from the initial character strings comprises selecting based on an alignment.

The method of claim 94, wherein selecting at least two substrings from the initial character strings comprises selecting based on frequency.

A computer-readable medium encoded with a program that causes a computer comprising a memory to:
i) encode two or more biological molecules into initial character strings in the memory to provide a collection of two or more different initial character strings in the memory, wherein each of the biological molecules comprises at least about 10 subunits;
ii) select at least two substrings from the different initial character strings in the memory;
iii) concatenate the substrings to form one or more solution strings of approximately the same length as one or more of the different initial character strings;
iv) add the solution strings to a data structure in the memory, thereby populating a data structure of solution strings in the memory;
v) obtain one or more computationally predicted properties for at least one of the solution strings in the data structure of solution strings in the memory; and
vi) select one or more solution biological molecules for production based on the one or more computationally predicted properties.
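
As a non-limiting illustration, steps i) to vi) can be sketched in Python as follows. This is only a sketch under stated assumptions: the function names, the single-crossover way of choosing substrings, the strict 10-subunit check, and the toy net-charge score (K and R counted as +1, D and E as -1) are all hypothetical choices made for the example and are not taken from the claim.

```python
import random

def encode(molecules):
    """Step i: hold each biological molecule as a character string in memory."""
    strings = [m.upper() for m in molecules]
    if len(set(strings)) < 2:
        raise ValueError("need two or more different initial character strings")
    if any(len(s) < 10 for s in strings):
        # Assumption: "at least about 10 subunits" is enforced as a hard minimum of 10.
        raise ValueError("each molecule needs at least 10 subunits")
    return strings

def make_solution(strings, rng):
    """Steps ii-iii: take one substring from each of two parents and concatenate them."""
    a, b = rng.sample(strings, 2)
    cut = rng.randint(1, min(len(a), len(b)) - 1)
    return a[:cut] + b[cut:]  # same length as parent b

def net_charge(seq):
    """Step v (toy stand-in): a crude net charge from residue counts."""
    return sum(seq.count(c) for c in "KR") - sum(seq.count(c) for c in "DE")

def run(molecules, n_solutions=20, n_select=3, seed=0):
    rng = random.Random(seed)
    initial = encode(molecules)
    # Step iv: populate the data structure (here, a plain list) with solution strings.
    solutions = [make_solution(initial, rng) for _ in range(n_solutions)]
    # Step v: compute the predicted property for each solution string.
    scored = sorted(((net_charge(s), s) for s in solutions), reverse=True)
    # Step vi: select the highest-scoring solution strings for production.
    return [s for _, s in scored[:n_select]]
```

Calling `run(["MKRDEAGHKLMNPQ", "MARDKAGHELMNPR"])` returns three solution strings, each the same length as the parents and built position by position from one parent or the other.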

The computer-readable medium of claim 108, wherein the computationally predicted property comprises one or more of: a maximum or minimum molecular weight, a maximum or minimum free energy, a maximum or minimum contact surface with a target molecule or surface, a specified net charge, a predicted pK, a predicted pI, a binding affinity, a secondary structure, and a tertiary structure.

The computer-readable medium of claim 108, wherein the program causes the computer to encode in (i) by encoding two or more amino acid sequences into the character strings.

The computer-readable medium of claim 108, wherein the program causes the computer to select in (ii) by aligning two or more of the initial character strings to maximize pairwise identity between two or more substrings of the initial character strings, and by selecting, as an end of one of the two or more substrings, a character that is a member of an aligned pair.
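
A non-limiting Python sketch of this alignment-based selection follows. It assumes a simple ungapped alignment found by sliding one string against the other; the claim does not prescribe an alignment algorithm, and the names `best_offset`, `crossover_points`, and `recombine` are hypothetical.

```python
def best_offset(a, b):
    """Shift of b relative to a that maximizes identical aligned pairs (ungapped)."""
    def matches(off):
        return sum(
            1 for i, ch in enumerate(a)
            if 0 <= i - off < len(b) and b[i - off] == ch
        )
    return max(range(-(len(b) - 1), len(a)), key=matches)

def crossover_points(a, b):
    """Index pairs (i, j) where a[i] and b[j] are aligned and identical."""
    off = best_offset(a, b)
    return [
        (i, i - off) for i, ch in enumerate(a)
        if 0 <= i - off < len(b) and b[i - off] == ch
    ]

def recombine(a, b, point):
    """End the substring from a at an aligned identical character, then continue in b."""
    i, j = point
    return a[:i + 1] + b[j + 1:]
```

Because each substring ends on a character shared by both parents at the aligned position, the junction in the recombined string is consistent with either parent.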

The computer-readable medium of claim 108, wherein the program further causes the computer to randomly change one or more characters of the initial character strings or of the solution character strings.
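
By way of a non-limiting sketch, such a random change of characters could look like the following in Python. The 20-letter amino acid alphabet, the `mutate` name, and its parameters are assumptions made for illustration; a nucleotide alphabet could be passed instead.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed one-letter amino acid alphabet

def mutate(seq, n_changes=1, alphabet=AMINO_ACIDS, rng=None):
    """Replace n_changes distinct, randomly chosen positions with a different character."""
    rng = rng or random.Random()
    chars = list(seq)
    for pos in rng.sample(range(len(chars)), n_changes):
        # Exclude the current character so every chosen position really changes.
        choices = [c for c in alphabet if c != chars[pos]]
        chars[pos] = rng.choice(choices)
    return "".join(chars)
```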

The computer-readable medium of claim 108, wherein the one or more solution biological molecules of (vi) have greater than 50% sequence identity with at least one of the initial character strings.

The computer-readable medium of claim 108, wherein the one or more solution biological molecules of (vi) have greater than 75% sequence identity with at least one of the initial character strings.

The computer-readable medium of claim 108, wherein the one or more solution biological molecules of (vi) have greater than 90% sequence identity with at least one of the initial character strings.
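
The sequence identity thresholds (greater than 50%, 75%, or 90%) can be illustrated with the following non-limiting Python sketch. It assumes equal-length strings that are already aligned position by position; the function names are hypothetical, and a real comparison of unequal-length sequences would first require an alignment.

```python
def percent_identity(a, b):
    """Percentage of positions carrying the same character in both strings."""
    if not a or len(a) != len(b):
        raise ValueError("expects non-empty, equal-length, pre-aligned strings")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

def filter_by_identity(solutions, initial_strings, threshold):
    """Keep solutions exceeding the threshold against at least one initial string."""
    return [
        s for s in solutions
        if any(percent_identity(s, p) > threshold for p in initial_strings)
    ]
```

Note that the comparison is strict, so a solution with exactly 90% identity passes a 75% threshold but not a 90% threshold.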

The computer-readable medium of claim 108, wherein the program causes the computer to add the solution strings to the data structure by adding more than one solution string to the data structure.

The computer-readable medium of claim 108, wherein the program causes the computer to select at least two substrings from the initial character strings by random substring selection.

The computer-readable medium of claim 108, wherein the program causes the computer to select at least two substrings from the initial character strings by uniform substring selection.
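
Random and uniform substring selection can be contrasted with the non-limiting Python sketch below. The readings used here are assumptions, since the claims do not define the terms: "uniform" is taken to mean consecutive windows of equal length, and "random" is taken to mean cutting at randomly chosen interior positions. Both function names are hypothetical.

```python
import random

def uniform_substrings(seq, length):
    """Cut seq into consecutive windows of equal length (the last may be shorter)."""
    if length < 1:
        raise ValueError("length must be positive")
    return [seq[i:i + length] for i in range(0, len(seq), length)]

def random_substrings(seq, n_cuts, rng=None):
    """Cut seq at n_cuts distinct, randomly chosen interior positions."""
    rng = rng or random.Random()
    cuts = sorted(rng.sample(range(1, len(seq)), n_cuts))
    bounds = [0, *cuts, len(seq)]
    return [seq[i:j] for i, j in zip(bounds, bounds[1:])]
```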

The computer-readable medium of claim 108, wherein the program causes the computer to select at least two substrings from the initial character strings by motif-based selection.
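
One possible reading of motif-based selection, offered only as a non-limiting sketch, is to place substring boundaries at the start of each occurrence of a motif so that no motif is split. The `motif_substrings` name and the use of a regular expression to describe the motif are assumptions for illustration.

```python
import re

def motif_substrings(seq, motif):
    """Split seq into substrings whose boundaries fall at the start of each motif match."""
    starts = [m.start() for m in re.finditer(motif, seq)]
    bounds = sorted({0, *starts, len(seq)})
    return [seq[i:j] for i, j in zip(bounds, bounds[1:])]
```

For example, splitting `"MKTGSGAAYGSGLLP"` on the motif `"GSG"` yields substrings that each begin with an intact copy of the motif, apart from the leading segment.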

The computer-readable medium of claim 108, wherein the program causes the computer to select at least two substrings from the initial character strings by alignment-based selection.

The computer-readable medium of claim 108, wherein the program causes the computer to select at least two substrings from the initial character strings by frequency-based selection.
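
Frequency-based selection is not defined in the claims. As a non-limiting sketch under one assumed reading, fixed-length substrings (k-mers) could be drawn with probability proportional to how often they occur across the initial character strings. The names `kmer_counts` and `sample_by_frequency` are hypothetical.

```python
import random
from collections import Counter

def kmer_counts(strings, k):
    """Count every length-k substring across all initial character strings."""
    counts = Counter()
    for s in strings:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def sample_by_frequency(strings, k, n, rng=None):
    """Draw n substrings, weighting each k-mer by its observed frequency."""
    rng = rng or random.Random()
    counts = kmer_counts(strings, k)
    kmers = sorted(counts)  # fixed order so a seeded rng is reproducible
    weights = [counts[x] for x in kmers]
    return rng.choices(kmers, weights=weights, k=n)
```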