RU2792584C1

RU2792584C1 - Method for organizing the search for documents in applied unstructured data bases and a hardware version of dual memory for its implementation

Info

Publication number: RU2792584C1
Application number: RU2022106802A
Authority: RU
Inventors: Ануар Райханович Кулмагамбетов
Original assignee: Ануар Райханович Кулмагамбетов
Filing date: 2022-03-16
Publication date: 2023-03-22

Abstract

FIELD: computer engineering.

SUBSTANCE: auxiliary tables are formed: keywords, documents, binary strings of document numbers and binary strings that complement the inverted index. After selecting the lists by keywords from the inverted index, the lists of documents presented as binary strings are processed in the resulting list processing block. In this case, binary strings are loaded for processing into dual memory, in which the numeric value of the string (memory register) is equal to 2ⁿ. In the block for processing document pages, a response table is formed for each document by keywords. Dual memory is a logical memory circuit and contains data input channels, an SM input, switching the operation of the circuit from conventional memory mode to dual memory mode, a block for addressing bit numbers in memory cells, a switch, logical switches. The output of the write/read control block CS, RD, OE is connected to the output of the bit number selection block (column address).

EFFECT: simplifying and accelerating the procedure for searching for documents in accordance with the specified criteria.

2 cl, 18 dwg

Description

The claimed technical solution relates to the use of large application databases (hereinafter DB) with unstructured or weakly structured data, such as libraries. The invention is intended for searching and selecting, from large document databases, the documents that best match the user's requests. Here, documents are printed publications such as books, journals, articles, brochures, reports, etc.

The rapid growth in the volume of accumulated data, combined with people's growing information needs, requires continual improvement of the supporting information systems so that users obtain the documents (information) that best satisfy their needs (pertinent information). Searching for the documents users need when queries are vague and loosely formulated requires the interactive participation of the system's user. At the same time, the documents displayed on the screen should:

- not overload the user with excess information (hundreds of relevant documents found);

- match the user's needs as closely as possible, which is ensured by a convenient mechanism for analyzing search results, clear presentation of the information provided, and the possibility of interactively adjusting queries.

In addition, the information system must cover the entire database (DB) and must not take long to process a query (by various estimates, no more than 20 seconds). For information systems with structured data (tables), search and data processing already work quite effectively and are built into business processes; these are relational database management systems (DBMS), for example Oracle, Microsoft, MySQL. Work with unstructured data, however, is still very far from producing high-quality results.

Information retrieval systems oriented toward unstructured (NoSQL) data stored on the Internet (Shashank Tiwari. Professional NoSQL. - John Wiley & Sons Inc, 2011. - 384 p. - ISBN 9780470942246, [1]) include, for example, Google, Yandex, Rambler and library search systems. The main disadvantages of the document search method used by these systems are:

- low relevance of documents to the query;

- a large number of documents returned per query;

- difficulty of filtering out unprofessional and unreliable information.

Existing search systems use a well-established information retrieval mechanism based on building an inverted index (Manning C.D., Raghavan P., Schütze H. "Introduction to Information Retrieval". Translated from English. - Moscow, Williams LLC, 2011. - 528 p., [2]). A dictionary of keywords is created from a selection of words drawn from the information stored in the database (DB). The dictionary may include any number of words that the developer considers sufficient for selecting the necessary information. Application programs for lexical, morphological, syntactic and semantic analysis are used for this purpose. Words are selected using a wide range of proven methods based on alphabetical selection, B-trees, thematic division and others.

When building an inverted index (a data structure) for document search, two main approaches are used:

1) document-level granularity (US2004030686A1 "Method and system of searching a database of records" (CARDNO ANDREW JOHN; MULGAN NICHOLAS JOHN, 12.02.2004), [3]; US2009030892 "SYSTEM OF EFFECTIVELY SEARCHING TEXT FOR KEYWORD, AND METHOD THEREOF" (IBM, 29.01.2009), [4]). In these analogues [3, 4], each index word is associated with a list of the documents in which the word occurs. Full-text processing of the document is then carried out for all keywords.

2) word-level granularity (US5696963 "System, method and computer program product for searching through an individual document and a group of documents" (SMARTPATENTS INC, 09.12.1997), [5]; US2004177064 "Selecting effective keywords for database searches" (IBM, 09.09.2004), [6]). In these analogues [5, 6], each index word is associated with a list of the documents in which the word occurs, together with the positions of the word in each document.

In both approaches the data structures may be supplemented with various parameters (frequency, semantic and other characteristics of words and documents). Lists of documents and of word positions are compiled, and these lists are then intersected to form a new list of documents in which all the keywords occur.
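As an illustration of the two index granularities and of the list intersection step described above, here is a minimal Python sketch. It is not taken from the cited analogues [3-6]; the function names, the whitespace tokenizer and the sample documents are assumptions made purely for the example.

```python
from collections import defaultdict


def build_doc_index(docs):
    """Document-level index: word -> sorted list of document numbers."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():  # naive tokenizer (assumption)
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def build_word_index(docs):
    """Word-level index: word -> {document number: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return dict(index)


def intersect(a, b):
    """Merge-intersect two sorted lists of document numbers."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical sample collection.
docs = {
    1: "vitamin c in citrus",
    2: "citrus trees",
    3: "vitamin d and citrus trees",
}
doc_index = build_doc_index(docs)
word_index = build_word_index(docs)

# Documents containing both keywords, found by intersecting two lists.
both = intersect(doc_index["vitamin"], doc_index["citrus"])
print(both)
print(word_index["citrus"])
```

The word-level index stores a position list for every (word, document) pair, which is why it grows much faster than the document-level index as the collection grows.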

The document-level approach [4] is taken as the prototype for the claimed method.

The advantage of the first approach is its simplicity and minimal index size.

Its disadvantages are the large amount of processing required as the number of analyzed keywords grows and its high sensitivity to the order in which words are processed. For example, after documents have been selected by a first combination of keywords, further words must be specified, but the search is then carried out only over the set of documents already found. The order in which keywords are processed therefore matters when analyzing, adjusting and changing a query.

In the second approach, the lists of documents in which each word occurs are retrieved at once, as in the first approach, along with lists of position numbers accurate to the word position and the start and end of the word (or its length), after which their intersection is analyzed.

The advantages of the second approach are its convenience when processing a small number of keywords and the possibility of reconstructing the text, with some loss of quality, if it is lost.

Its disadvantage is the large size of the stored index, which can exceed the memory occupied by the processed documents themselves. This depends on the number of keywords, which has to be optimized. Storing and processing variable-length lists becomes more complex. The amount of processing is large and grows with the number of analyzed keywords and with the size of the DB.

A typical classical memory for writing and reading words is taken as the prototype for the claimed device (A. Tanenbaum, T. Austin. Computer Architecture, p. 200, Fig. 3.27, [7]). The operation of this RAM is considered using an example of 4 memory cells (registers), each register consisting of 3 bits. In the general case, memory is limited neither in register width nor in the number of memory cells (registers). There are chips of 8, 16, 32 and 64 bits with 512, 1024 or more memory cells ([7], p. 204, ots.3.30). As the memory grows, its principle of operation does not change; the circuit shown in Fig. 3.27 [7] is simply replicated many times, with a corresponding increase in register width and in the number of inputs/outputs and addresses. Here each flip-flop stores one bit of information (the figure shows 4 rows of flip-flops, 3 in each row, i.e. register). The memory contains four 3-bit words. Each operation reads or writes an entire 3-bit word. The logic circuit has 8 input lines: 3 data inputs, I₀, I₁, I₂; 2 address inputs, A₀ and A₁; and 3 control inputs, CS (Chip Select, selecting the memory element), RD (ReaD, which distinguishes reading from writing) and OE (Output Enable, enabling the output signals). It also has 3 data output lines, O₀, O₁ and O₂. The state of the address inputs determines which of the four 3-bit words is enabled for input or output. The logic of the process is that the binary string <I₀, I₁, I₂> (where each I_i is 0 or 1, i = 0, 1, 2) is written to the memory cell (register) at the address given by the binary string <A₀, A₁>, in accordance with the commands applied to the inputs <CS, RD, OE>. Likewise, information can be read, in accordance with the commands <CS, RD, OE>, from the register addressed by <A₀, A₁> into the output binary string <O₀, O₁, O₂>.
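The behaviour of the 4 x 3 prototype memory described above can be mimicked with a short Python model. This is only a behavioural sketch, not the gate-level circuit of [7]. It assumes that a write occurs when CS = 1 and RD = 0, that a read drives the outputs when CS = RD = OE = 1, and that A₀ is the low-order address bit; the class and method names are hypothetical.

```python
class WordMemory:
    """Behavioural model of a small word-addressed memory (4 words x 3 bits)."""

    def __init__(self, words=4, bits=3):
        self.bits = bits
        # One list of bits per register; each element stands for one flip-flop.
        self.cells = [[0] * bits for _ in range(words)]

    def cycle(self, address, data=None, cs=1, rd=1, oe=1):
        """Apply one set of input signals.

        address: tuple (A0, A1); data: tuple (I0, I1, I2) for a write.
        Returns (O0, O1, O2) on an enabled read, otherwise None.
        """
        if not cs:
            return None  # chip not selected: nothing happens
        a0, a1 = address
        row = a0 + 2 * a1  # assumption: A0 is the least significant bit
        if rd == 0:
            # Write: the whole word is stored at once.
            if data is None or len(data) != self.bits:
                raise ValueError("a write needs one bit per data input")
            self.cells[row] = list(data)
            return None
        if oe:
            # Read: the whole word appears on the output lines.
            return tuple(self.cells[row])
        return None  # read requested but outputs disabled


mem = WordMemory()
mem.cycle((1, 0), data=(1, 0, 1), rd=0)  # write <1,0,1> to register 1
word = mem.cycle((1, 0))                 # read register 1 back
print(word)
```

Note that every access touches one whole register (row); there is no way in this model to write or read a single bit position across all registers.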

Let us represent the logic diagram of Fig. 3.27 [7] in the generalized form of Fig. 13, keeping only the elements important for understanding: the external pins (inputs, outputs, addresses, control signals) and the memory flip-flops.

The prototype of Fig. 3.27 [7] does not allow binary strings to be written by bit-position number within the memory cells (registers) of Figs. 13 and 14 (in essence, by columns St.0, St.1, St.2 of Fig. 14).

The technical problem addressed is to make it possible to process the keywords of documents and to present the results in a clear and compact form, accurate to the page of a document. This makes it possible to assess quickly, without reading documents from the DB, how well a document matches the formulated query. The user then either makes the final decision by examining the page found, or interactively adds to or changes the keywords and sees the effect of the changes on the screen.

The technical result provided by the claimed technical solutions is a hardware implementation of the process of converting an ordered set of fixed-length binary strings <Str₁, Str₂, …, Str_k> into tabular form, where each string Str_i is a column of the table. The rows of the table can then be read and analyzed one by one as the register representation <R₁, R₂, …, R_n>, which is itself a set of binary strings in which register R_j corresponds to the j-th bit position of the strings <Str₁, Str₂, …, Str_k>. This eliminates the operation of intersecting the lists <Str₁, Str₂, …, Str_k> belonging to different keywords, given that each i-th list Str_i is associated with the i-th keyword and the number of the j-th bit in Str_i denotes the document/page number. The selection of documents/pages is thereby accelerated, because each entire row R_j of <R₁, R₂, …, R_n> refers to a single document/page, the j-th.
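The column-write, row-read idea can be imitated in software. The following Python sketch is an illustration only, not the claimed hardware; the function names and the sample bit strings are assumptions, and positions are numbered from 0 here rather than from 1 as in the text.

```python
def load_columns(strings):
    """Write each fixed-length binary string Str_i as column i of a table
    and return the rows R_j (one tuple of k bits per bit position j)."""
    n = len(strings[0])
    if any(len(s) != n for s in strings):
        raise ValueError("all binary strings must have the same length")
    return [tuple(s[j] for s in strings) for j in range(n)]


def rows_with_all_keywords(rows):
    """Row numbers j whose register R_j holds a 1 for every keyword."""
    return [j for j, row in enumerate(rows) if all(row)]


# Hypothetical strings for 3 keywords over 5 documents:
# bit j of Str_i is 1 when keyword i occurs in document j.
str_1 = [1, 0, 1, 1, 0]
str_2 = [1, 1, 1, 0, 0]
str_3 = [0, 1, 1, 1, 0]

rows = load_columns([str_1, str_2, str_3])
matches = rows_with_all_keywords(rows)
print(rows)
print(matches)
```

Each row already gathers, for one document, the bits of every keyword, so a single pass over the rows replaces the pairwise intersection of k lists.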

The technical result of the method is the proposed data structure, comprising a keyword table (index dictionary), a document table, a table of binary lists and a table of binary strings of documents. This data structure makes it possible to search for documents by sets of keywords with page-level accuracy, without reading the documents themselves from the DB, and to present the search result as a response table, which simplifies selecting documents according to the specified criteria.

The technical results also include:

- When searching for documents, the documents themselves need not be read from the DB; only pages that exactly satisfy the query are selected for viewing;

- Interactive work with the system and continual, cyclic refinement or modification of the query by adding and removing keywords while pages are being analyzed; documents are added to or removed from the list of processed documents, and words are added to the list of keywords used for the search;

- The document search process scales and parallelizes well;

- The method allows priorities to be set within the keyword list and a weight to be assigned to each keyword; the system can refine and suggest options based on collected statistics, including personal statistics used to build individual user models;

- Sets can be defined over the set of keywords {Si} (for example, the set of vitamins, of acids, of trees, etc.) in order to clarify the meaning of documents and find additional answers;

- Fixing the length of the binary strings for document page numbers at 128 bytes makes it unnecessary to sort page lists and yields a simple mechanism for visual analysis of document pages;

- Fixing the length of the binary strings at 128 KB or more for large DBs (more than tens of millions of documents), when dual memory (DM) is implemented, makes it unnecessary to sort document lists. In this case the length of the binary string must match the maximum number of documents in the DB, because the bit number in the binary string must match the document number in the DB. For example, a 128 KB binary string corresponds to 1 million documents;

- Parallel processing of sets (tens and hundreds) of keywords;

- The results of keyword processing can be viewed freely in any order: the order expresses the importance of each keyword to the individual user and is set simply by rearranging the columns of the response table, so that these priorities can be taken into account when selecting pages and less important words dropped when necessary;

- The final response tables selected by the user are a convenient basis for building applied formal algorithms for document analysis, various metrics and document classification spaces, and personal models of the user's information space;

- Processing accuracy is limited to the document page; the page is displayed to the user for a final assessment of its pertinence, and, if necessary, keyword processing can easily be extended within that single page.
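The string lengths quoted in the list above can be checked with simple arithmetic, since one bit stands for one page or one document. The Python sketch below assumes 1 KB = 1024 bytes; the helper names are hypothetical.

```python
KB = 1024  # assumption: binary kilobyte


def bits_in(n_bytes: int) -> int:
    """Number of pages or documents one binary string of n_bytes can cover."""
    return n_bytes * 8


def bytes_for(n_items: int) -> int:
    """Smallest whole number of bytes holding one bit per item."""
    return (n_items + 7) // 8


pages_per_string = bits_in(128)        # page-level string of 128 bytes
docs_per_string = bits_in(128 * KB)    # document-level string of 128 KB
bytes_for_10m_docs = bytes_for(10_000_000)

print(pages_per_string, docs_per_string, bytes_for_10m_docs)
```

Under this assumption a 128-byte string covers documents of up to 1024 pages, and a 128 KB string covers 1,048,576 documents, which agrees with the "1 million documents" figure; a DB of ten million documents would need strings of about 1.25 MB.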

Сущность заявленного способа состоит в том, что пользователь формирует запрос и указывает ключевые слова и логические операции с ними. Затем происходит выделение ключевых слов в запросе в блоке обработки ключевых слов запроса, использования инвертированного индекса c программами формирования, развития и сопровождения работы инвертированного индекса, при этом упомянутые программы взаимодействуют c блоком индексации, который взаимодействует c базой данных, программами выборки списков по ключевым словам, отличающийся тем, что:The essence of the claimed method is that the user forms a request and specifies keywords and logical operations with them. Then the keywords in the query are selected in the query keywords processing block, the inverted index is used with programs for the formation, development and maintenance of the inverted index, while the mentioned programs interact with the indexing block, which interacts with the database, programs for selecting lists by keywords, characterized in that:

- программы формируют вспомогательные таблицы: ключевых слов, документов, бинарных строк номеров документов и бинарных строк, дополняющие инвертированный индекс, при этом- programs form auxiliary tables: keywords, documents, binary strings of document numbers and binary strings that supplement the inverted index, while

- таблица ключевых слов содержит список ключевых слов, каждое ключевое слово имеет ссылку на бинарную строку документов и ссылку на строку таблицы документов, а также здесь же указывается дополнительная информация: количество документов, где используется ключевое слово, другие данные о ключевом слове: термин, сокращение, множество или элемент множества; - the keyword table contains a list of keywords, each keyword has a link to a binary string of documents and a link to a document table string, and additional information is also indicated here: the number of documents where the keyword is used, other data about the keyword: term, abbreviation , a set or an element of a set;

- таблица документов содержит списки пар, связанных с ключевым словом, пара - это номер документа, в котором используется ключевое слово и ссылка на бинарную строку в таблицы бинарных строк;- the document table contains lists of pairs associated with the keyword, the pair is the number of the document in which the keyword is used and a link to the binary string in the binary string table;

- таблица бинарных строк содержит список бинарных строк фиксированной длины, каждая строка связана с номером документа из таблицы документов, при этом каждый бит бинарной строки – соответствует одной странице текста документа, номер бита в строке соответствует номеру страницы этого документа, а 1 или 0, стоящие на определенной позиции, указывают на присутствие или отсутствие заданного ключевого слова на данной странице;- the table of binary strings contains a list of binary strings of a fixed length, each string is associated with the document number from the document table, while each bit of the binary string corresponds to one page of the document text, the bit number in the string corresponds to the page number of this document, and 1 or 0 standing at a certain position, indicate the presence or absence of a given keyword on this page;

- таблица бинарных строк документов состоит из множества бинарных строк фиксированной длины, при этом каждый номер бита в бинарной строке соответствует номеру документа, а 1 в этом бите означает, что в документе с данным номером встречается заданное ключевое слово;- the table of binary strings of documents consists of a set of binary strings of a fixed length, with each bit number in the binary string corresponding to the document number, and 1 in this bit means that the specified keyword occurs in the document with this number;

- после выборки списков по ключевым словам из инвертированного индекса, осуществляют обработку списков документов, представленных в виде бинарных строк в блоке обработки результирующего списка, при этом бинарные строки загружаются для обработки в двойную память, в которой числовое значение строки равна 2ⁿ , где n – разрядность регистра памяти, в случае, если все заданные ключевые слова встречаются в документе с номером равным номеру строки в двойной памяти, это позволяет не прибегать к операции пересечения списков, при этом- after selecting the lists by keywords from the inverted index, the lists of documents presented as binary strings in the processing block of the resulting list are processed, while the binary strings are loaded for processing into double memory, in which the numerical value of the string is 2 ⁿ , where n is bit depth of the memory register, if all the specified keywords occur in a document with a number equal to the line number in double memory, this allows you not to resort to the operation of crossing lists, while

- in the block for processing document pages by keyword, a response table is formed for each document: binary strings are loaded from the table of binary strings into dual memory, sorted and analyzed; in the response table each column corresponds to its own keyword, the columns are ordered by importance, and each keyword column corresponds to a binary string from the table of binary strings; the response table shows which keywords the document contains, to page precision, without loading the document itself from the database.

The essence of the claimed device is that the dual memory for organizing the search for documents in applied unstructured databases is a logical memory circuit that can write data to a memory cell at a given address, read the stored data from a memory cell at a given address, and output the read data to the output lines in accordance with the control signals CS, RD, OE and SM, characterized in that:

- it contains a second data input channel that writes a bit string into a given bit (column number) of all memory cells simultaneously, the length of the bit string being equal to the number of memory cells, while the number and width of the memory cells are limited only by the technological capabilities of microelectronics;

- it has an SM input that switches the circuit from conventional memory mode, in which input data are written into a memory cell, to dual memory mode, in which input data are written into a given bit of every memory cell;

- it contains a block for addressing bit numbers within the memory cells (column numbers), which sets a common bit number for all memory cells into which data are written;

- a switch is installed after the cell addressing block and passes or blocks the memory cell address signal depending on the SM control signal;

- it contains logic switches on all inputs of the memory flip-flops that switch the address channel from memory cell addresses to addresses of bits within the memory cells (columns), depending on the SM control signal;

- it contains logic switches on all data inputs of the memory flip-flops that switch the data input channels depending on the SM control signal;

- the output of the write/read control block (CS, RD, OE) is connected to the output of the bit number selection block (column address).
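The two write modes described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the claimed circuit: the class name `DualMemory`, its method names, and the convention that column 0 is the least significant bit of a cell are all hypothetical, and the control signals CS, RD and OE are not modeled.

```python
class DualMemory:
    """Toy model of a memory with r cells of n bits and two write modes."""

    def __init__(self, cells: int, width: int) -> None:
        self.cells = cells          # r: number of memory cells (rows)
        self.width = width          # n: bits per cell (columns)
        self.rows = [0] * cells     # each cell is held as an n-bit integer

    def write_row(self, address: int, value: int) -> None:
        """Conventional mode: store a whole word in one addressed cell."""
        self.rows[address] = value & ((1 << self.width) - 1)

    def write_column(self, column: int, bits: list[int]) -> None:
        """Dual mode: write one bit string into the same bit of every cell."""
        if len(bits) != self.cells:
            raise ValueError("bit string length must equal the number of cells")
        if not 0 <= column < self.width:
            raise ValueError("column number out of range")
        mask = 1 << column  # assumption: column 0 is the least significant bit
        for address, bit in enumerate(bits):
            if bit:
                self.rows[address] |= mask
            else:
                self.rows[address] &= ~mask

    def read_row(self, address: int) -> int:
        """Read one cell back as an integer."""
        return self.rows[address]


# Two keyword bit strings are written as columns; rows are then read as numbers.
dm = DualMemory(cells=4, width=4)
dm.write_column(0, [1, 0, 1, 1])
dm.write_column(1, [1, 1, 0, 1])
values = [dm.read_row(a) for a in range(dm.cells)]
```

In this model a row in which every written column holds a 1 reads back as a value with all of those bits set, which is the property the method relies on when it looks for rows that match all keywords.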

Brief description of the drawings.

FIG. 1 shows a generalized scheme of the method for organizing the search for documents in large databases;

FIG. 2 - the data structure of the working tables;

FIG. 3 - a binary string;

FIG. 4 - an example of a document response table, with the binary strings of document numbers ASDi in the columns;

FIG. 5 - an example of a page response table, with the strings Str_i corresponding to the keywords in the columns;

FIG. 6 - an example of a compacted and ordered response table;

FIG. 7 - an example of a response table, £ = 0.2;

FIG. 8 - an example of a fragment of the analyzed text;

FIG. 9 - a block diagram of the stages of processing the analyzed text of example 1;

FIG. 10 - the filling of the tables of example 1;

FIG. 11 - a general view of a notional response table;

FIG. 12 - an example of a table of six binary columns, each 12 bits long;

FIG. 13 - a logical block diagram of a 4 x 3 memory;

FIG. 14 - a generalized block diagram of the 4 x 3 memory shown in FIG. 13;

FIG. 15 - a generalized block diagram of a memory of r cells, each n bits wide;

FIG. 16 - a generalized block diagram of a 4 x 4 DM memory;

FIG. 17 - a generalized block diagram of a memory of r cells, each n bits wide;

FIG. 18 - an example implementation of the logic circuit of a 4 x 4 DM memory.

List of designations in FIGS. 13 to 17:

I₀, I₁, I₂, I₃ - input data for writing to registers;

A₀, A₁ - two inputs for addressing memory cells;

CS - memory element select;

RD - read (distinguishes reading from writing);

OE - output enable;

J₀, J₁, J₂, J₃ - input data for writing columns;

CL₀, CL₁ - two inputs for addressing columns;

Ст.0, Ст.1, Ст.2, Ст.3 (St.0 to St.3) - columns (bit numbers within a memory cell);

O₀, O₁, O₂, O₃ - output data read from memory cells.

Data output is enabled by CS·RD·OE.

Implementation of the method.

The method for organizing the search for documents in large databases (FIG. 1) includes the following typical steps:

- The user (4), an interested consumer of information, forms a query (5) in a language close to SQL, specifying keywords and logical operations on them; in most cases the query is limited to a simple enumeration of words;

- The block for processing query keywords (6) extracts the keywords from the query.

- The programs for forming, developing and maintaining the inverted index build the inverted index (3); these programs interact with the indexing block (2), which in turn interacts with the database (1). They select the lists of documents by keyword (7), intersect the keyword lists to find the set of documents in which all the given keywords occur together, apply the given logical conditions to them, and obtain the resulting list. This list of documents is already an intermediate search result and can be shown to the user. Current search engines supplement the processing of this list with a full-text search of every document in it, which makes it possible to highlight all the found keywords in color when the document text is displayed and so makes the text easier for the user to review.

As the diagram indicates, blocks 1 through 7 of FIG. 1 provide the typical functions for building and using inverted indexes in existing information retrieval systems.
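For comparison with the dual-memory approach, the conventional step performed by block 7 can be sketched as a merge-style intersection of sorted posting lists. This is a generic illustration, not code from the patent: the function names, the dictionary-based index and the sample data are hypothetical.

```python
def intersect_sorted(a: list[int], b: list[int]) -> list[int]:
    """Intersect two ascending lists of document numbers in one pass."""
    i = j = 0
    out: list[int] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search_all(index: dict[str, list[int]], keywords: list[str]) -> list[int]:
    """Return the documents that contain every keyword (logical AND)."""
    postings = [index.get(word, []) for word in keywords]
    if not postings:
        return []
    postings.sort(key=len)  # start from the shortest list to shrink the work
    result = postings[0]
    for plist in postings[1:]:
        result = intersect_sorted(result, plist)
    return result


# Hypothetical inverted index: keyword -> ascending document numbers.
index = {
    "vitamin": [1, 3, 8, 11],
    "dose": [3, 8, 9, 11],
    "liver": [2, 8, 11],
}
result = search_all(index, ["vitamin", "dose", "liver"])
```

The cost of this step grows with the number and length of the lists, which is the work the method aims to avoid by reading matching rows directly from dual memory.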

- In the block for processing the resulting list of documents (8), for each document number in the resulting list the address of the binary string is read from the document table (12, FIG. 2), and then the binary string of the document itself is read from the table of binary strings (13, FIG. 2) for each keyword.

- The block for processing document pages by keyword (9) forms a response table for each document (FIG. 4): the binary strings are loaded into dual memory, hereinafter DM (FIG. 17), then sorted and analyzed in accordance with the given parameters for loosening or tightening how closely the answers must match the query. The user can interactively select and view pages of the document and change the query, the importance of keywords and other page selection parameters.

- The final list of selected documents (10) is output as the answer to the submitted query.

In existing search engines, lists of documents are selected by keyword (block 7, FIG. 1). These lists are then intersected in software to find the resulting list of documents, each of which contains all the keywords.

In the proposed version the document lists can be intersected in DM memory. To do this, the lists of documents must be represented as binary lists (FIG. 3), in the same way as the binary page lists Str_i, only larger (millions of bits). This requires a considerably larger DM memory, which is reflected in its cost. For example, a binary string of 128 bytes is sufficient for processing binary page lists, whereas processing binary document lists requires a string of about 128 KB or more to load a list of 1 million documents at once. In this case the bit number in the binary string corresponds to the number of the document in which the keyword occurs. Processing a list of, for example, 14 million documents then takes 14 successive loads of DM memory. All operations carried out in DM memory are identical to the operations on lists of document pages. As in current systems, a document is selected from the document response table (15) and entered into the resulting list of documents of block 7 of FIG. 1 when all the keywords are present in the row that represents the document; in FIG. 4, for example, these are documents 8 and 11, in rows 8 and 11.
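The chunked processing of document bit strings can be sketched in software as follows. This is an illustration under assumptions, not the hardware procedure: Python integers stand in for the bit strings, a bitwise AND stands in for reading rows whose bits are all set, and the names `docs_to_bitmap`, `loads_needed` and `full_matches` are hypothetical.

```python
def docs_to_bitmap(doc_numbers: list[int]) -> int:
    """Build a bit string in which bit d is 1 if document d has the keyword."""
    bitmap = 0
    for d in doc_numbers:
        bitmap |= 1 << d
    return bitmap


def loads_needed(total_docs: int, capacity: int) -> int:
    """Number of successive memory loads for a given capacity in bits."""
    return -(-total_docs // capacity)  # ceiling division


def full_matches(bitmaps: list[int], total_docs: int,
                 capacity: int) -> tuple[list[int], int]:
    """Find documents containing every keyword, one capacity-sized chunk at a time."""
    if not bitmaps:
        return [], 0
    matches: list[int] = []
    loads = 0
    chunk_mask = (1 << capacity) - 1
    for start in range(0, total_docs, capacity):
        loads += 1
        combined = chunk_mask
        for bitmap in bitmaps:
            # Slice out the part of this keyword's bit string for the chunk.
            combined &= (bitmap >> start) & chunk_mask
        offset = 0
        while combined:
            if combined & 1 and start + offset < total_docs:
                matches.append(start + offset)
            combined >>= 1
            offset += 1
    return matches, loads


# Hypothetical data: three keywords over 12 documents, memory of 4 rows.
bitmaps = [docs_to_bitmap(d) for d in ([1, 3, 8, 11], [3, 8, 9, 11], [2, 8, 11])]
found, loads = full_matches(bitmaps, total_docs=12, capacity=4)
```

With a capacity of 1 million rows, `loads_needed(14_000_000, 1_000_000)` gives 14, matching the figure in the text.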

The database (1) is the repository of all accumulated documents. The data format and structure are determined by the chosen standard data management system. The database is accompanied by all the necessary software: tools for recording documents, manipulating data and reading data, and data languages (data description, data manipulation, query).

The programs for forming, developing and maintaining the inverted index (3) perform writing, reading, sorting and modification, form the working data structure and the working keyword vectors (pointers to documents), split or merge indexes, include or exclude words, perform analysis, and so on.

The proposed method is aimed primarily at searching printed publications for factual information in professional libraries (for example, biology, genetic engineering, or pharmacology: information on drugs, the mechanisms of their action on the body, methods of treating diseases, etc.).

Let Ω = {Dj} be the set of documents Dj, where j = 1, ..., m. Here m is the number of documents, for example 700,000 or more. A document is a book, a journal, an article, an archival document, a protocol or a research report, etc., that is, any printed publication suitable for processing and display on a computer screen.

Unlike the prototype [2], the proposed method includes the following data structure, which is used in blocks 8, 9 and 10: the keyword table (11), the document table (12), the table of binary strings (13) and the table of binary strings of documents (14). The structure of these tables is shown in FIG. 2.

The keyword table (11) contains a list of r rows. Each row contains:

- a keyword S_i, i = 1, ..., r;

- a number C_j, j = 1, ..., k, giving the number of documents in which the keyword occurs;

- a reference address ASD_i, i = 1, ..., n, to a row of the table of binary strings of documents (14);

- a reference address Ad_i, i = 1, ..., n, to a row of the document table (12).

There are more words than references. For example, the word "дом" ("house") and its forms "домик", "домом", etc. can share one common reference.

The keyword table (11) can be divided into any number of sections (sub-tables). These sections are tied to the semantics of the keywords: literary words, domain terms, abbreviations, synonyms, names, etc. The keyword table (11) can be extended with columns reflecting the properties of words and their various groupings: synonyms, antonyms, various classes (sets of words), subject terms (mathematics, chemistry, biology, etc.), semantic classes, artificial metrics, etc.

The document table (12) contains t rows Sp_i, i = 1, ..., t, where each row Sp_i consists of a list of pairs:

Sp_i = {<Nd₁, A_str1>₁, <Nd₂, A_str2>₂, <Nd₃, A_str3>₃, ..., <Nd_t, A_strt>_t}_i,

where Nd_j, j = 1, ..., t, is the number of a document that uses the keyword from the keyword table (11);

A_strj is the reference address of a binary string in the table of binary strings (13).

In each row of the document table (12) the pairs are ordered by document number Nd_j. The number C_i from the keyword table (11) gives the number of pairs in the document table (12) associated with the given keyword. The document table (12) lists all the numbers of the documents in which keywords from the keyword table (11) occur.
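The relationship between tables (11), (12) and (13) can be sketched with simple in-memory structures. This is a minimal illustration, assuming list indexes in place of the reference addresses; the class names `KeywordRow` and `DocumentRow`, the function `page_bitmap` and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class KeywordRow:
    """One row of the keyword table (11)."""
    word: str        # keyword S_i
    doc_count: int   # C: number of documents containing the keyword
    asd: int         # ASD_i: reference into the table of document bit strings (14)
    ad: int          # Ad_i: reference into the document table (12)


@dataclass
class DocumentRow:
    """One row Sp_i of the document table (12): pairs (Nd_j, A_str_j)."""
    pairs: list[tuple[int, int]] = field(default_factory=list)  # sorted by Nd_j


def page_bitmap(keyword_table: dict[str, KeywordRow],
                document_table: list[DocumentRow],
                binary_strings: list[int],
                word: str, doc_number: int) -> Optional[int]:
    """Follow keyword -> document row -> page bit string for one document."""
    row = keyword_table.get(word)
    if row is None:
        return None
    for nd, a_str in document_table[row.ad].pairs:
        if nd == doc_number:
            return binary_strings[a_str]
    return None  # the keyword does not occur in this document


# Hypothetical contents: "vitamin" occurs in documents 8 and 11.
keyword_table = {"vitamin": KeywordRow("vitamin", doc_count=2, asd=0, ad=0)}
document_table = [DocumentRow(pairs=[(8, 0), (11, 1)])]
binary_strings = [0b0101, 0b0010]  # table (13): one page bit string per pair
pages = page_bitmap(keyword_table, document_table, binary_strings, "vitamin", 11)
```

Because the pairs are ordered by document number, a real implementation could replace the linear scan with a binary search; the scan is kept here for brevity.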

This makes it possible to assess how informative each keyword is and to set a cut-off for redundant keywords or to construct new ones. A word that occurs in 95% of documents does not discriminate between documents or help single out the required one; such words are classed as stop words. New keywords can also be formed from combinations of existing keywords.
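A cut-off of this kind can be sketched from the per-keyword document counts alone. The function name `split_stop_words`, the default threshold of 0.95 (taken from the 95% example in the text) and the sample counts are illustrative assumptions.

```python
def split_stop_words(doc_counts: dict[str, int], total_docs: int,
                     cutoff: float = 0.95) -> tuple[list[str], list[str]]:
    """Separate informative keywords from stop words by document frequency."""
    keep: list[str] = []
    stop: list[str] = []
    for word, count in doc_counts.items():
        share = count / total_docs  # fraction of documents containing the word
        if share >= cutoff:
            stop.append(word)
        else:
            keep.append(word)
    return keep, stop


# Hypothetical counts C for a collection of 1000 documents.
keep, stop = split_stop_words({"the": 970, "vitamin": 120, "dose": 400}, 1000)
```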

The table of binary strings (13) contains a list of m fixed-length binary strings Str_i, i = 1, ..., m. The structure of a binary string Str_i is shown in FIG. 3, where each cell is one bit corresponding to one page of the document text; the bit number in the string corresponds to the page number of the document, and a 1 or 0 at a given position indicates the presence or absence of the given word on that page.

Suppose document Dj has 2 million characters, which corresponds to roughly 1000 pages of text. It is then represented by a binary string of 1000 bits, stored in 128 bytes. The number of bits in the binary string is larger than the page count (a multiple of a power of 2; 128 bytes hold 1024 bits), so reserve bits appear. It is assumed that books with more pages are very rare, and those that do occur are split into volumes (an artificial split is possible).
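The sizing arithmetic can be checked with a few lines of code. The sketch assumes that the string length is rounded up to the next power of two, which is one reading of the text; the function names are hypothetical.

```python
def padded_bits(pages: int) -> int:
    """Smallest power of two that is at least the number of pages."""
    bits = 1
    while bits < pages:
        bits *= 2
    return bits


def string_size(pages: int) -> tuple[int, int, int]:
    """Return (bits in the string, bytes occupied, reserve bits left over)."""
    bits = padded_bits(pages)
    size_bytes = (bits + 7) // 8
    return bits, size_bytes, bits - pages


bits, size_bytes, reserve = string_size(1000)  # the 1000-page example
```

For 1000 pages this gives 1024 bits in 128 bytes, leaving 24 reserve bits.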

The proposed data structure is open and can always be supplemented with parameters needed for processing that reflect the frequency, semantic, logical, morphological and other characteristics of words and documents. This shows up as additional columns in the tables and/or as additional tables reflecting further characteristics of documents and words. For example, table (11) can be supplemented with references to an additional table of two- and three-word keywords, abbreviations, etc. The structure can be supplemented with a table of sets and subsets of keywords, for example the set of vitamins with a list of vitamins, or the set of trees with a list of species. Tables (12) and (13) can also include additional parameters, especially since each binary string has a reserve of 24 bits (1024 − 1000).

The stages of forming tables (11), (12) and (13) are shown in FIGS. 8, 9 and 10.

When forming a query (5) (FIG. 1), the user specifies a list of keywords <S1, S2, …, Sk> ordered by their importance to the user. A classical query language using the logical operators AND, OR and NOT can also be built. The keywords are chosen by the user on the basis of a personal idea of the required content of the document sought. The degree of importance, that is, the user's individual perception of the keywords (a numeric value or simply the word order), can be set:

- by a linear or other function, which can be computed by building an individual model of the subject area for each user;

- by the user, through the order in which the words are listed (entered);

- automatically, from the frequency characteristics of the words.

The degree of importance makes it possible to discard the least important words and to concentrate on combinations of the more important ones. The number of keywords is limited only by logic and by the user's capacity to take them in; 64 words can be taken as a nominal limit, although some applications, such as those processing sets of keywords, may need considerably more.

On the basis of the query, a response table (16) (FIG. 5) is built for each document, in which each column corresponds to its own keyword.

The columns are ordered by importance. Each keyword column corresponds to a binary string from the table of binary strings (13). Therefore, when forming the response table, the program of block (9) reads, for each document and each keyword, the ready-made fixed-length binary strings from the table of binary strings (13) and places them in the columns of the response table (16) (FIG. 5) in the order of importance specified by the user. By default, the further to the left a word is placed in the response table, the more important it is. The word S_i+1 is more important than the word S_i (FIG. 5).

FIG. 5 shows an example of a response table (16) for one document of m pages and sixteen keywords. Each i-th row of the response table (16) in FIG. 5 reflects all the keywords found on the i-th page of the document Dj under consideration and is a sixteen-bit string that is read as an integer in the range from 0 to 65535 (2¹⁶ − 1).
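How keyword columns turn into per-page numbers can be sketched as a transposition. This is an illustration, assuming that the leftmost (most important) column becomes the most significant bit of the row; the function name `rows_from_columns` and the sample columns are hypothetical. Read as a standard unsigned integer, a sixteen-bit row of all ones has the value 65535.

```python
def rows_from_columns(columns: list[list[int]]) -> list[int]:
    """Turn keyword columns (one bit per page) into one integer per page.

    columns[k][p] is 1 if keyword k occurs on page p. Column 0 is treated
    as the most important keyword and becomes the most significant bit.
    """
    if not columns:
        return []
    pages = len(columns[0])
    rows: list[int] = []
    for p in range(pages):
        value = 0
        for column in columns:
            value = (value << 1) | column[p]
        rows.append(value)
    return rows


# Three keywords over three pages (hypothetical data).
columns = [
    [1, 0, 1],  # most important keyword
    [1, 1, 0],
    [0, 1, 1],  # least important keyword
]
rows = rows_from_columns(columns)  # one number per page
```

Here the first page holds the first two keywords, so its row is the bit pattern 110, which reads as 6.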

These numbers (one bit string per page) are easy to interpret and show which keywords were found on a given page (row). The maximum value corresponds to a row filled entirely with ones, meaning that all the keywords were found on that page. This makes it possible to build simple, understandable and adaptive algorithms:

a) Scanning the table and selecting the pages that match all keywords: the binary number of the row equals 2ⁿ, where n is the number of keywords, or 65536 for 2¹⁶. A document containing more such pages corresponds better to the query, subject to the specified logical constructs. In most queries (up to 98%) users do not resort to logical search conditions and limit themselves to listing keywords;

b) The table can be compacted for visual analysis of the selected documents. For visualization, zero rows, or rows below a cutoff level specified by the user, are excluded. The remaining rows can be ordered by the number of ones and by the specified order of keyword importance. An example of the transformed response table (16) of Fig. 5 is shown in Fig. 6. Any algorithms for processing the text on a page can be included: detecting words at a given distance, phrases, word permutations, explanations, synonyms, etc.

c) Selecting keywords by excluding minimal combinations of the least important words so that the remaining words are completely filled with ones. For example, for the response table (16) in Fig. 5, if the keywords S1, S6, S8, S14 are excluded, a document is obtained in which page 1 fully matches the remaining keywords with the minimum number of exclusions. The corresponding number will be 2¹². Various semantic algorithms can be used that take into account the semantic meanings of words, the features of the subject area of the documents and the user's requirements.

d) Individual viewing settings. When the query contains a large number of keywords (the estimate is individual, usually more than 20, and depends on the field of knowledge and the user's qualifications), the user can specify a cutoff level £ for the pages listed in the response table (16) of Fig. 5. Here

0 < £ ≤ 1,

where the number £ reflects how strictly the viewed materials must correspond to the user's query.

Pages on which the number of keywords is below the specified level £ × n are not included in the resulting response table (16) (the result is rounded). Here n is the number of keywords in the query. For example, for the response table (16) of Fig. 5 with 16 keywords, the user can specify the cutoff level £ = 0.2 (16 × 0.2 = 3.2, or 3 keywords); the response table (16) of Fig. 5 will then look as shown in Fig. 7.

e) At any stage it is possible to view the results, add new keywords, exclude existing ones, and define a different order of keyword importance (the order defines the system of preferences and changes the calculated metrics, the classification of documents and the amount of information displayed).
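The cutoff in item d) and the full-match selection in item a) can be sketched as follows. This is only an illustration under stated assumptions: the page values are hypothetical, the names `filter_pages` and `full_match_pages` do not appear in the description, and Python's built-in `round` is used for the rounding step, although the description does not say which rounding rule is intended.

```python
def popcount(value):
    """Number of keywords present on a page (number of one bits in the row)."""
    return bin(value).count("1")


def filter_pages(page_values, n, cutoff):
    """Keep pages whose keyword count is at least round(cutoff * n).

    page_values maps page number -> row integer; cutoff plays the role of £.
    """
    if not 0 < cutoff <= 1:
        raise ValueError("cutoff must satisfy 0 < cutoff <= 1")
    min_hits = round(cutoff * n)  # e.g. 16 * 0.2 = 3.2 -> 3 keywords
    return {p: v for p, v in page_values.items() if popcount(v) >= min_hits}


def full_match_pages(page_values, n):
    """Pages on which every one of the n keywords occurs (all bits set)."""
    full = (1 << n) - 1
    return [p for p, v in sorted(page_values.items()) if v == full]


# Hypothetical response table for a four-page document and 16 keywords.
pages = {
    1: 0b1111111111111111,
    2: 0b0000000000000011,
    3: 0b0000000000000111,
    4: 0b0000000000000000,
}
print(sorted(filter_pages(pages, 16, 0.2)))  # pages with at least 3 keywords
print(full_match_pages(pages, 16))
```

Because each page is a single integer, both selections reduce to a bit count or an equality test per row.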

Examples of specific implementation.

Example 1. Stages of forming service tables 1, 2 and 3, using a fragment of book text as an example (Fig. 8). A generalized typical document processing scheme, with its description and examples, is shown in Fig. 9.

1) At the document reading stage, all keywords of the document S1, S2, S3, ..., Sn are written to a temporary file W:

File W = {S1="INTRODUCTION", S2="Data", S3="foreign", S3="literature", S4="numerous", S5="studies", S6="ghrelin", …}. Uppercase and lowercase letters are treated as identical.

2) The word is checked against the requirements for forming the dictionary.

3) Keywords from the temporary file W are written to a row of the keyword table (11). If the word is new, a row is appended to the keyword table (11). The information is then added to the document table (12) and the binary string table (13).

Row in the keyword table (11) (Fig. 9): {<Introduction,1,1>}. The first 1 is the number of documents in which this word was found; since the first document is being loaded, it is set to one. The second 1 is the address of the link to the first list in the document table (12).

4) The document number (in the document table (12) this is the first document) and the address of the link to the binary string in the binary string table (13) are written to the first list Sp1 of the document table (12):

List Sp1 = {<1,1>}. Here the first 1 is the document number and the second 1 refers to the first binary string.

5) In the first binary string of the binary string table (13), the first bit is set to 1, which means that the given word from table (11) ("Introduction") was found on the first page. In a binary string each bit number corresponds to a page number in the document (for example, the binary string <1,0,0,0,1,1,0,0,0,…> indicates that the given word was found on the first, fifth and sixth pages).

6) The next word is read from the list in the temporary file W and processing returns to step 2. The length of the word list in file W is monitored. When all words from file W have been processed, a new document is read, i.e. step 6 is carried out.

7) The end of the list of documents to be processed is monitored. The ones in the first position of all the binary strings of the binary string table (13) shown in Fig. 10 indicate that all keywords from the keyword table (11) are located on the first page of the document.
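Steps 1) to 7) can be summarized in a minimal sketch. The container layout is an assumption made for illustration: the keyword table (11) is modeled as a dict mapping a word to `[document_count, list_address]`, the document table (12) as a list of lists of `(document_number, bitstring_address)` pairs, and the binary string table (13) as a list of per-page bit lists. Addresses are 0-based here, whereas the description counts from 1, and the dictionary requirements of step 2 are reduced to lowercasing.

```python
import re


def index_document(doc_no, pages, keyword_table, document_lists, bitstring_table):
    """Add one document (a list of page texts) to the three service tables."""
    seen = {}  # word -> address of its binary string for this document
    for page_no, text in enumerate(pages):
        # Case is ignored, as in the description; tokenization is simplified.
        for word in re.findall(r"\w+", text.lower()):
            if word not in seen:
                if word not in keyword_table:
                    # New word: append a row and an empty document list.
                    document_lists.append([])
                    keyword_table[word] = [0, len(document_lists) - 1]
                entry = keyword_table[word]
                entry[0] += 1  # one more document contains this word
                bitstring_table.append([0] * len(pages))
                addr = len(bitstring_table) - 1
                document_lists[entry[1]].append((doc_no, addr))
                seen[word] = addr
            # Bit number = page number: mark the page where the word occurs.
            bitstring_table[seen[word]][page_no] = 1


keyword_table, document_lists, bitstring_table = {}, [], []
index_document(1, ["Introduction Data", "literature", "DATA"],
               keyword_table, document_lists, bitstring_table)
print(keyword_table["introduction"], document_lists[0], bitstring_table[0])
```

Calling `index_document` again with another document number extends the same tables, which corresponds to looping over the list of documents in step 7.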

Example 2. The response table (16) of Fig. 5 can be formed (column by column) in the dual memory (DM) microchip. Each column corresponds to one keyword, and the register address corresponds to the page number of the document. Thus, one DM load makes it possible to process one document from the DB (1). The set of keywords of a given subject area can be optimized by setting a cutoff boundary for keywords in the general dictionary of the keyword table (11). For example, a boundary is set on the frequency with which a keyword occurs in documents (of the entire DB or of a DB section): no more than 20% of documents. Keywords that occur more often are excluded from the keyword table (11).
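The frequency boundary can be sketched as a simple filter over document counts. The function name `prune_keywords`, the sample counts and the inclusive comparison at exactly 20% are assumptions; the description gives only the 20% figure as an example.

```python
def prune_keywords(doc_counts, total_docs, max_share=0.2):
    """Keep keywords that occur in at most max_share of all documents.

    doc_counts maps keyword -> number of documents containing it.
    """
    return {w: c for w, c in doc_counts.items() if c / total_docs <= max_share}


# Hypothetical counts for a database of 100 documents.
counts = {"the": 90, "ghrelin": 5, "data": 20}
print(prune_keywords(counts, 100))
```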

Boundaries can be set for selecting documents and displaying pages on the screen. For example, only those pages of the document are displayed on which the number of keywords exceeds a set boundary, for example 80% (if keyword weights are defined, the cutoff boundary can also be given as a number). It is also possible to require that only pages containing all keywords (100%) be displayed, i.e. pages are selected (for the example of Fig. 5) whose number equals 65536 for 16 keywords.

Arbitrary semantic rules can be specified: mandatory and replaceable words are set, the mandatory presence of inseparable word combinations is defined (for example, "oxygen concentration", "increase in carbon dioxide penetration"), and the distance (number of characters) between words is set.

All the classical logical operations on keywords can be implemented: AND, OR, NAND, NOR.
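On per-keyword binary strings these operations reduce to bitwise operations over page bitmaps. The sketch below is an illustration, not the circuit-level implementation: it stores each binary string as a Python integer with bit i (0-based) standing for page i + 1, and all function names are hypothetical. NAND and NOR need the page count m so that the complement is limited to existing pages.

```python
def bits_to_mask(bits):
    """Pack a binary string (list of 0/1, first element = page 1) into an int."""
    mask = 0
    for i, bit in enumerate(bits):
        if bit:
            mask |= 1 << i
    return mask


def mask_to_pages(mask):
    """List the 1-based page numbers whose bit is set."""
    return [i + 1 for i in range(mask.bit_length()) if (mask >> i) & 1]


def op_and(a, b):
    return a & b


def op_or(a, b):
    return a | b


def op_nand(a, b, m):
    return ~(a & b) & ((1 << m) - 1)  # complement restricted to m pages


def op_nor(a, b, m):
    return ~(a | b) & ((1 << m) - 1)


# Hypothetical six-page document: word A on pages 1, 5, 6; word B on pages 1, 2, 6.
a = bits_to_mask([1, 0, 0, 0, 1, 1])
b = bits_to_mask([1, 1, 0, 0, 0, 1])
print(mask_to_pages(op_and(a, b)), mask_to_pages(op_or(a, b)))
print(mask_to_pages(op_nand(a, b, 6)), mask_to_pages(op_nor(a, b, 6)))
```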

Example 3. The result of processing a document is displayed as the response table (16) of Fig. 6. Pages on which no keywords were found are not listed. All pages in the table are ordered by the number of keywords found; pages with the same number of keywords are listed in order of keyword importance, and then in ascending order of page number.
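The ordering described in Example 3 can be expressed as a three-part sort key. The sketch assumes that more important keywords occupy more significant bits of the row integer, so that among pages with equal keyword counts the larger integer ranks first; this tie-breaking rule is an interpretation, and the name `order_pages` and the sample values are hypothetical.

```python
def order_pages(page_values):
    """Order non-empty pages by keyword count, then importance, then page number.

    page_values maps page number -> row integer (more important keyword =
    more significant bit, which is an assumption of this sketch).
    """
    hits = {p: v for p, v in page_values.items() if v != 0}  # drop empty pages
    return sorted(hits, key=lambda p: (-bin(hits[p]).count("1"), -hits[p], p))


# Hypothetical four-keyword response table.
table = {1: 0b0110, 2: 0b1111, 3: 0b0000, 4: 0b1001, 5: 0b0110}
print(order_pages(table))
```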

This gives a clear overall picture of how well the document corresponds to the formulated query. When the cursor is placed on:

- a column index (Si), the full keyword appears on the screen;

- a page number, the full text of that page appears on the screen, and the user can check whether it meets their needs;

- all keywords on the displayed page are highlighted in color, and the color reflects the importance of each word. The hue corresponds to a position in the color spectrum (blue corresponds to low importance, red to the most important word).

Additional individual access settings. The user can specify a cutoff level £ for the pages listed in the response table (16) of Fig. 5. Here

0 < £ ≤ 1

The number £ reflects how strictly the viewed materials must correspond to the user's query. Pages on which the number of keywords is below the specified level £ × n are not included in the resulting response table (16) of Fig. 5 (the result is rounded). Here n is the number of keywords in the query. For example, for the table of Fig. 5 with 16 keywords, the user can specify the cutoff level £ = 0.2 (16 × 0.2 = 3.2, or 3 keywords); the response table (16) of Fig. 5 will then look as shown in Fig. 7.

For additional analysis of the information on a found page, all traditional text processing algorithms (elements of semantic analysis) can be used.

A document page found and selected according to the selection criteria is displayed on the screen so that the user can evaluate it and consider ways of refining the query: adding further keywords, excluding specified ones, or formulating new word combinations.

In essence, the response table (16) is the basis for designing the logic circuit of the DM memory.

Implementation of the device.

Dual memory (Double Memory, hereinafter DM) makes it possible to process independent lists of binary strings Str₁, Str₂, ..., Str_n. For clarity, the binary strings are presented as a conditional table (17) (Fig. 11), in which the columns are binary strings and the rows are bit numbers within the binary strings. The operation performed is recognition of the bit-level (table row) intersection of the binary strings.

Here the binary string Str₁ = <01001000001…1>. The length of the string is m bits. The number of rows in the table is determined by the number of bits in the longest string in the processed list; shorter strings are padded with zero bits.

Consider the i-th row of this table. It shows which columns have a one at the i-th bit position and which columns have a zero. For example, in row 11 all columns have a one in the 11th position (it is assumed that columns 6 through n-1 contain ones). The integer (decimal) representation of the row makes it possible to encode the entire binary row. Thus, for the eleventh row this number equals 2ⁿ.

An example of the same table (17), formed from 6 columns of 12 bits each, is shown in Fig. 12. For the same eleventh row (all ones in the row) the integer value equals 2⁶ = 64. The integer value of the third row equals 4.
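How table (17) is assembled from independent binary strings can be sketched as a transposition. The column data below are invented for illustration and do not reproduce Fig. 12; `build_table` and `row_values` are hypothetical names. The row integers are computed with the usual unsigned binary reading (leftmost column as the most significant bit, an assumption), under which a row of six ones evaluates to 2⁶ − 1 = 63, so the values differ by convention from the figures quoted in the text.

```python
from itertools import zip_longest


def build_table(columns):
    """Transpose binary strings (columns) into rows, padding short ones with 0."""
    return [list(row) for row in zip_longest(*columns, fillvalue=0)]


def row_values(rows):
    """Integer value of each row, leftmost column as the most significant bit."""
    values = []
    for row in rows:
        value = 0
        for bit in row:
            value = (value << 1) | bit
        values.append(value)
    return values


# Three hypothetical binary strings of different lengths.
columns = [[0, 1], [1, 1, 1], [0, 1]]
rows = build_table(columns)
print(rows)
print(row_values(rows))
```

Reading the table row by row is the operation that the dual memory described later performs in hardware: strings go in by column and come out by register.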

Thus, by reading a table row as the resulting integer, it is possible to know exactly which positions contain ones. Moreover, by specifying a particular order of the columns in the table, the user sets a conditional degree of importance of the columns, denoted by the symbol ⸖. The relation (j+1) ⸖ j is then interpreted as "column Str_(j+1) is more important than column Str_j". This makes it possible to use semantic algorithms for preliminary evaluation of pages in the claimed method.

The logic block diagram of a 4 × 3 memory is known from the prototype [7], p. 200, and is shown in Fig. 13. A generalized diagram of the same classical memory, consisting of 4 registers (memory cells) of 3 bits each, is shown in Fig. 14. Real circuits are built in the same way, the only difference being that the width of the memory cells (registers) may be 8, 16, 32 or 64 bits and the number of memory cells is in the hundreds of thousands or more (the circuit is replicated many times); however, the examples and generalizations presented are sufficient for understanding the logic of all the processes.

A generalized memory diagram is given below. This avoids concentrating on the element-level design of the flip-flops and switching elements, which can be implemented in many ways depending on the developer's preferences and the technology used, while preserving the logic of all the main memory functions: writing a string to, and reading it from, a memory cell (register) at a given memory cell address.

The generalized diagram of Fig. 14 shows only the essential elements: 12 memory flip-flops, each storing 1 bit of information, the inputs and outputs, and the control signals. Each flip-flop can be in one of two states, 1 or 0. The flip-flops are arranged in a structure of 4 rows (registers) with 3 flip-flops in each row. The input information, arriving as binary strings, is written to memory cells (registers, i.e. rows). The registers are numbered, and their numbers are called the addresses of the memory cells (registers). The logic of the process is as follows: the binary string <I₀, I₁, I₂> (where each I_i equals 0 or 1, i = 0, 1, 2) is written to the register at the address given by the binary string <A₀, A₁>, in accordance with the commands arriving at the inputs <CS, RD, OE>. Likewise, in accordance with the commands <CS, RD, OE>, information can be read from the register specified by the address <A₀, A₁> into the output binary string <O₀, O₁, O₂>.
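The behavior of this classical 4 × 3 memory can be modeled with a short sketch. It is a functional model only, not the gate-level circuit of Fig. 13 or Fig. 14. The control semantics are assumptions consistent with the rest of the description: the chip responds only when CS = 1, RD = 0 means write and RD = 1 means read, and the outputs are driven only when OE = 1. Treating A₀ as the low-order address bit is also an assumption, and the names `ClassicMemory`, `cycle` and `decode_address` are hypothetical.

```python
class ClassicMemory:
    """Functional model of r registers of n bits, addressed by register number."""

    def __init__(self, registers=4, width=3):
        self.width = width
        self.cells = [[0] * width for _ in range(registers)]

    def cycle(self, cs, rd, oe, address, data_in=None):
        """Perform one access; return the output bits on a read, else None."""
        if cs != 1:
            return None  # chip not selected: nothing happens
        if rd == 0:  # write <I0, I1, I2> into the addressed register
            if data_in is None or len(data_in) != self.width:
                raise ValueError("data_in must supply one bit per register bit")
            self.cells[address] = list(data_in)
            return None
        # Read: the outputs <O0, O1, O2> are driven only when OE = 1.
        return list(self.cells[address]) if oe == 1 else None


def decode_address(a_bits):
    """Map <A0, A1> to a register number (A0 as low-order bit: an assumption)."""
    return a_bits[0] | (a_bits[1] << 1)


mem = ClassicMemory()
mem.cycle(cs=1, rd=0, oe=0, address=decode_address([1, 0]), data_in=[1, 0, 1])
print(mem.cycle(cs=1, rd=1, oe=1, address=1))
```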

Fig. 15 shows a generalized logic diagram of classical memory with r memory cells (registers), where each register consists of n bits, intended for writing the input signals I₀, I₁, …, I_n and reading the output signals O₀, O₁, …, O_n.

Examples of RAM chips are known from the analogue [7], Fig. 3.30, p. 204.

Fig. 16 shows the proposed generalized block diagram of an example 4 × 4 DM memory. Each horizontal row consists of 4 flip-flops that make up one word. Four memory cells (registers, i.e. 4 words) are shown.

Unlike the prototype [7], DM can operate as classical memory, writing and reading words by registers (their addresses), and additionally allows writing binary strings Str₁, Str₂, ..., Str₄ supplied sequentially to the inputs <J₀, J₁, J₂, J₃>. Here the binary string Str_i = <J₀, J₁, J₂, J₃>_i, where i = 1, ..., 4, is written to the i-th column at the column address St.i from the list of column addresses St.0, St.1, St.2, St.3, the column address being set on the inputs <CL₀, CL₁>. The information is read in the standard way, by registers (words), to the outputs <O₀, O₁, O₂, O₃>. The CM (Change Memory) input switches the memory from conventional mode to DM mode. In conventional memory mode it operates in accordance with the classical memory circuit of Fig. 13.

An example of a generalized logic diagram of a DM memory with r registers of n bits each is shown in Fig. 17. It shows the additional data input J₀, J₁, …, J_m, where m ≤ r. To make the logic diagram easier to follow, one can refer to the response table of Fig. 11, in which the rows correspond to memory registers and the columns to binary strings. Like conventional memory, DM can of course be built with registers of any width (16, 32, 64 bits or more) and with any number of memory cells (registers), from 1024 (as in the example with pages) to hundreds of millions for processing large lists of documents.

Description of operation.

An example implementation of a 4 × 4 DM memory is shown in Fig. 18, and its generalized form in Fig. 17. The CM input switches the circuit from conventional memory mode to DM mode. In conventional memory mode (CM = 1) the logic circuit operates in the same way as the circuit shown in Fig. 13. In DM mode (CM = 0) the logic circuit switches to data input from the inputs J₀, J₁, J₂, J₃ (see Fig. 18 and Fig. 17). Compared with the logic circuit of Fig. 13, the circuit of Fig. 18 is supplemented with:

- a column addressing block CL₀ and CL₁, which sets the current address of the column into which the data J₀, J₁, J₂, J₃ will be written;

- drivers installed at the outputs of the AND gates in the register addressing block A₀ and A₁; depending on the control signal CM, they pass the register address signal (CM = 1) or block it (CM = 0);

- multiplexers installed at the clock input C of every memory flip-flop (the clock signal opens the data input of the flip-flop). Depending on the control signal CM, the multiplexers switch the source of addresses from register addresses to column addresses (points T₀, T₁, T₂, T₃);

- multiplexers installed at the data input D of every memory flip-flop. The control signal CM switches the data source at all D inputs of the memory flip-flops from I₀, I₁, I₂, I₃ (CM = 1) to J₀, J₁, J₂, J₃ (CM = 0);

- an extension of the output of the AND gate that follows the write/read control block (CS, RD, OE), through point T₄, to the column address selection block.

When the CM input is switched to DM memory mode (0 at the output), columns are addressed by the address lines CL₀ and CL₁ in the upper part of the circuit, in the same way that registers are addressed, and the input lines I₀, I₁, I₂, I₃ are blocked by the CM signal at the multiplexers placed before the D input of each flip-flop. To select a memory column, external logic must set CS to 1 and set RD to 1 for reading or to 0 for writing. The column address lines must indicate which of the four 4-bit columns is to be written. During a read, none of the data input lines are used. During a write, the bits on the data input lines J₀, J₁, J₂, J₃ are loaded into the selected memory column; the output lines are not used.
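The mode switch described above can be sketched as a small behavioral model. This is only an illustration under stated assumptions, not the claimed circuit: the class name `DualMemory`, the method `write`, and the list-of-lists storage are hypothetical, and the model ignores timing and the OE signal.

```python
class DualMemory:
    """Behavioral model of a small dual memory (all names are illustrative)."""

    def __init__(self, n_registers=4, n_bits=4):
        self.n_registers = n_registers
        self.n_bits = n_bits
        # bits[r][c] models the flip-flop in register r, column c.
        self.bits = [[0] * n_bits for _ in range(n_registers)]

    def write(self, cm, address, data, cs=1, rd=0):
        """Write a register (cm=1, I lines) or a column (cm=0, J lines)."""
        if not (cs == 1 and rd == 0):
            return  # writes are enabled only when CS = 1 and RD = 0
        if cm == 1:
            # Conventional mode: the address selects a register (A lines).
            if len(data) != self.n_bits:
                raise ValueError("register write needs one bit per column")
            self.bits[address] = list(data)
        else:
            # DM mode: the address selects a column (CL lines).
            if len(data) != self.n_registers:
                raise ValueError("column write needs one bit per register")
            for r, bit in enumerate(data):
                self.bits[r][address] = bit


mem = DualMemory()
mem.write(cm=0, address=1, data=[1, 0, 1, 1])  # DM mode: load column 1
mem.write(cm=1, address=2, data=[1, 1, 1, 1])  # conventional mode: load register 2
```

In this sketch a single `write` entry point with a `cm` argument mirrors the way one control signal redirects both the address source and the data source.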

The memory shown in Fig. 18 operates as follows. The four column-select AND gates in the upper part of the circuit form a decoder. The input inverters are arranged so that each gate is enabled by one specific address. Each gate drives one column select line. When the chip is to perform a write, the vertical line CS·RD takes the value 1, enabling one of the four write gates (points T₀, T₁, T₂, T₃). Which gate is enabled depends on which column select line is 1. The output of the write gate drives all C signals (flip-flop clock inputs) of the selected column, loading the input data into that column's flip-flops. A write takes place only when CS is 1 and RD is 0, and only the column selected by addresses CL₀ and CL₁ is written.
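The column decoder and write gates can be expressed at gate level as follows. This is a minimal sketch: the function names are hypothetical, and treating CL₁ as the high-order address bit is an assumption, since the text does not state the bit order.

```python
def decode_2to4(cl1, cl0):
    """Four AND gates with input inverters; exactly one select line is 1.

    Assumption: cl1 is the high-order address bit.
    """
    n1, n0 = 1 - cl1, 1 - cl0  # input inverters
    return [n1 & n0, n1 & cl0, cl1 & n0, cl1 & cl0]


def write_gates(cs, rd, cl1, cl0):
    """Outputs at points T0..T3: write enable ANDed with each select line."""
    write_enable = cs & (1 - rd)  # 1 only when CS = 1 and RD = 0
    return [write_enable & sel for sel in decode_2to4(cl1, cl0)]


# Column address 2 (CL1 = 1, CL0 = 0) during a write enables only T2.
t_points = write_gates(cs=1, rd=0, cl1=1, cl0=0)
```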

Reading is similar to the standard read process of the circuit in Fig. 13: CM switches to conventional memory mode (1 at the output), the register address (A₀, A₁) is supplied, and the CS·RD line takes the value 0, so all write gates are disabled and none of the flip-flops changes. Instead, the word select line enables the AND gates connected to the Q bits (flip-flop outputs) of the selected word. The selected word thus passes its data to the 4-input OR gates in the lower part of the circuit, while the other three words output 0 and have no effect on the result. The output of the OR gates is therefore identical to the value stored in the selected word.
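The read path (AND gates on the Q outputs followed by OR gates across words) can be sketched in the same gate-level style. The function name `read_word` and the list representation of flip-flop states are assumptions made for illustration.

```python
def read_word(flipflops, word_select):
    """Return the output bits for the word whose select line is 1.

    flipflops[w][b] is the Q output of bit b in word w;
    word_select[w] is the word select line for word w.
    """
    n_bits = len(flipflops[0])
    outputs = []
    for b in range(n_bits):
        out = 0
        for q_row, sel in zip(flipflops, word_select):
            out |= q_row[b] & sel  # AND with select line, OR across words
        outputs.append(out)
    return outputs


stored = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
word1 = read_word(stored, word_select=[0, 1, 0, 0])
```

Unselected words contribute 0 to every OR gate, so the output equals the selected word regardless of what the other words hold.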

Thus, once the columns have been written to the DM memory, the stored information can be read back register by register: the contents of each register are read at its register address. The contents of the i-th register then give the entire i-th position of the binary lists that were written as columns, and the integer value of the register holds all of this information in compact form, which corresponds exactly to the answer table (16) of Fig. 5 for the example of four keywords and four document pages.
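The effect of writing by columns and reading by registers is a bit-matrix transposition, which can be sketched in software. The bit strings below are made-up sample data, not the values of Fig. 5, and the choice that the first string written becomes the most significant bit of each register is an assumption.

```python
def load_columns(binary_strings):
    """Write each binary string as one column; return register integers.

    Assumption: the first string written occupies the most significant bit.
    """
    n_registers = len(binary_strings[0])
    registers = [0] * n_registers
    for s in binary_strings:
        if len(s) != n_registers:
            raise ValueError("all binary strings must have the same length")
        for i, ch in enumerate(s):
            # Append this string's bit at position i to register i.
            registers[i] = (registers[i] << 1) | int(ch)
    return registers


# Four keywords, four pages (sample data only).
registers = load_columns(["1011", "0011", "1110", "0010"])
# registers == [10, 2, 15, 12]; register 2 has all four bits set.
```

In this sample the register at index 2 reads 0b1111, meaning that position is set in all four binary strings.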

DM memory can be manufactured in various configurations. As an optimal configuration, it may contain 1024 registers (words) of 64 bits each (the columns, whose number is the number of keywords). The memory can be installed stand-alone or as cache memory alongside the processor, which speeds up the algorithm.

Modifications of the logic circuit:

- using the same channels for the input data written to registers (I₀, I₁, I₂, I₃ in the example) and for the input data written to columns (J₀, J₁, J₂, J₃ in the example), since these channels are never used at the same time;

- increasing the register width to work with a larger number of keywords (1024 registers of 128 bits);

- for special applications, an arbitrary number of registers (tens of millions) to work with a large number of documents;

- providing several input ports (groups of registers) for parallel work with several lists.

Thus, the application proposes an approach based on search with page-level precision. In addition to the lists of documents in which a keyword occurs, binary lists of document strings are formed (Fig. 3), and processing them replaces the stage of full-text document analysis. Document analysis is replaced by a page-by-page review of the answer table of Fig. 4.

The technical problem is solved because:

- the binary lists of document pages have a fixed length, which is convenient for processing;

- all pages of a document for a given keyword are represented in a single binary string;

- the hardware implementation of DM memory eliminates the operations of sorting the pages of different keywords;

- DM memory allows many (tens of) keywords (binary lists) to be processed simultaneously to find their intersection. DM quickly converts a set of binary lists for the given keywords into a page-by-page representation of the document (each memory register corresponds to one document page), so that pages containing the given keywords can be quickly reviewed and selected without performing a list intersection operation;

- the application-level interpretation of the binary strings is unrestricted. The result can be read as integers (a row of the answer table of Fig. 4) that depend on the binary values located at the same positions of the binary strings and on the order in which the binary strings are written to DM memory (the order of the keywords in the query);

- a found page is small enough that reviewing it does not burden the user;

- the search algorithm does not need complex logical constructs such as the distance between keywords, the order in which keywords are mentioned, or their use within a sentence, paragraph, etc. All keywords on the page are highlighted with a background color, and this is enough for the user to assess the page on its merits.
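Two of the points above, selecting pages without a list intersection and the dependence of the integer values on keyword order, can be illustrated with a short software sketch. The function names and the sample bit strings are hypothetical, pages are indexed from 0, and the first keyword is assumed to map to the most significant bit.

```python
def answer_registers(keyword_strings):
    """Transpose per-keyword page strings into one integer per page."""
    n_pages = len(keyword_strings[0])
    return [
        int("".join(s[p] for s in keyword_strings), 2)
        for p in range(n_pages)
    ]


def pages_with_all_keywords(registers, n_keywords):
    """Pages whose register has every keyword bit set."""
    full = (1 << n_keywords) - 1  # n_keywords one-bits
    return [p for p, value in enumerate(registers) if value == full]


strings = ["1011", "0011", "1110"]          # three keywords, four pages
regs = answer_registers(strings)            # [5, 1, 7, 6]
hits = pages_with_all_keywords(regs, 3)     # [2]

# Swapping the first two keywords changes the integers, not the hits.
regs_swapped = answer_registers(["0011", "1011", "1110"])  # [3, 1, 7, 6]
hits_swapped = pages_with_all_keywords(regs_swapped, 3)    # [2]
```

The page selection is a single comparison per register, while the individual integer values shift with the order in which the keyword strings are written.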

Claims

1. A method for searching for documents in applied unstructured data bases, implemented in dual memory, comprising: formation by the user of a query to the dual memory, wherein the user specifies keywords and logical operations on them; extraction of the keywords from the query in a query keyword processing block; use of an inverted index with programs for forming, developing and maintaining the inverted index, wherein said programs interact with an indexing block that interacts with the database; and programs for selecting lists by keywords, characterized in that:

- the programs form auxiliary tables in the dual memory: tables of keywords, documents, binary strings of document numbers and binary strings, which supplement the inverted index, wherein

- the keyword table contains a list of keywords, each keyword has a reference to a binary string of documents and a reference to a row of the document table, and additional information is also given there: the number of documents in which the keyword is used and other data about the keyword: term, abbreviation, set or element of a set;

- after the lists are selected by keywords from the inverted index, the lists of documents, represented as binary strings, are processed in the resulting-list processing block, wherein the binary strings are loaded for processing into the dual memory, in which the numeric value of a row equals 2ⁿ, where n is the width of the memory register, if all the specified keywords occur in the document whose number equals the row number in the dual memory, which makes it unnecessary to perform a list intersection operation, wherein

- in the block for processing document pages by keywords, an answer table is formed for each document, wherein binary strings are loaded from the table of binary strings into the dual memory, sorted and analyzed, wherein each column of the answer table corresponds to its own keyword, the columns are ordered by importance, and each keyword column corresponds to a binary string of the table of binary strings; the answer table makes it possible to show the keywords contained in the document with page-level precision without loading the document itself from the database.

2. Dual memory for organizing the search for documents in applied unstructured data bases according to claim 1, being a logical memory circuit that provides for writing data to a memory cell at a given address, reading the written data from memory cells at a given cell address, and outputting the read data to the output lines in accordance with the control signals CS, RD, OE and CM, characterized in that:

- it contains a second data input channel that writes a bit string into a given bit of the memory cells, in all memory cells simultaneously, wherein the length of the bit string equals the number of memory cells, and the number and width of the memory cells are limited by the technological capabilities of microelectronics;

- it contains a block for addressing bit numbers in the memory cells, which sets a common bit number for all memory cells into which data is written;

- a switch is installed after the cell addressing block, which passes or blocks the memory cell address signal depending on the control signal CM;

- it contains logical switches, installed on all inputs of the memory flip-flops, for switching the address source from memory cell addresses to addresses of bits in the memory cells, depending on the control signal CM;

- it contains logical switches, installed on all data inputs of the memory flip-flops, for switching the data source channels depending on the control signal CM;

- the output of the write/read control block CS, RD, OE is connected to the output of the bit number selection block.