CN117389954B - Online multi-version document content positioning method, device, equipment and medium - Google Patents

Info

Publication number
CN117389954B
Authority
CN
China
Prior art keywords
document
model
elements
chain
version
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311711792.9A
Other languages
Chinese (zh)
Other versions
CN117389954A (en
Inventor
廉蔺
李驰
文治恒
周梓龙
王剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Huizhi Xingchuang Technology Co ltd
Original Assignee
Hunan Huizhi Xingchuang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Huizhi Xingchuang Technology Co ltd
Priority to CN202311711792.9A
Publication of CN117389954A
Application granted
Publication of CN117389954B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G06F16/137 Hash-based
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The application relates to a method, a device, equipment and a medium for locating the content of online multi-version documents. The method comprises: constructing a document position model; constructing a document element index structure model according to the document position model; acquiring the content of a document element to be retrieved in a multi-version document; querying the position of the document element in the sorted array index space according to the document element and obtaining an element position chain through a pointer; and traversing the element position chain and the document position model to obtain the position of the document element to be retrieved in the multi-version document. The method establishes associations between elements and documents, which improves retrieval precision, is applicable to large-scale element indexing, and provides interpretability for the index results.

Description

Online multi-version document content positioning method, device, equipment and medium
Technical Field
The present disclosure relates to the field of search and indexing technologies, and in particular, to a method, an apparatus, a device, and a medium for locating content of an online multi-version document.
Background
In a networked environment (including local area networks, the Internet, the mobile Internet, etc.), different users may hold "different versions" of the "same document". Here, "same document" means documents whose authors, title, and content are the same; "different versions" means that the documents may differ in typesetting format, document type, and the like.
For the same electronic document, different users may hold different versions, for example: preprint (PrePrint): a version of a work not yet published in a formal publication, which is voluntarily released in advance at academic conferences or over the Internet for the purpose of exchange with peers; online-first version (Online First): a version of a work that has been accepted for publication through the review process and is released first on the network for rapid dissemination; postprint (PostPrint): the version released in the official publication after the work has passed the review process, also called the print publication; marked version: a version to which a database vendor has added marks such as electronic watermarks or signatures during distribution of the document.
However, current multi-version document management essentially associates only the documents themselves and does not map the positions of the individual elements within them, which limits the ability to process multi-version documents jointly.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, apparatus, device and medium for locating content of online multi-version documents.
An online multi-version document content locating method, the method comprising:
constructing a document position model; wherein the document position model is used for describing document elements contained in the document and position information of each document element;
constructing a document element index structure model according to the document position model; the document element index structure model comprises an element index structure and an element position chain, wherein each element in the element index structure points through a pointer to a linked list in the element position chain; the element index structure is obtained by arranging the document elements contained in the document set into an array in sorted order and mapping the array to a sorted array index space; the element position chain is constructed according to the document information and the position information corresponding to each document element;
acquiring the content of a document element to be retrieved in a multi-version document, querying the position of the document element in the sorted array index space according to the document element, and obtaining the element position chain through the pointer;
and traversing the element position chain and the document position model to obtain the position of the document element to be retrieved in the multi-version document.
In one embodiment, the method further comprises: determining the document elements in the document, a document element being a component used to manage layout-related content; numbering the document elements from the start position of the document to obtain a document element sequence; marking the start position of each document element in the document element sequence with a special mark, and serializing the marked document element sequence to obtain a binary data stream; and determining the position of each document element according to the number of bytes between its special mark and the start position of the binary data stream, thereby obtaining the document position model describing the document elements contained in the document and the position information of each document element.
In one embodiment, the method further comprises encoding the document elements: associating documents by means of association recognition to obtain a document set, the document set comprising a plurality of documents with the same content but different document formats; for the documents of each format, assigning a numeric number as the format number of the document, and establishing a mapping table between format numbers and document formats; hashing each document element and converting the hash result into a number to obtain a first numeric number; and joining the first numeric number with the format number to obtain the numeric code of the document element.
In one embodiment, the method further comprises: acquiring the number of all document elements in the document set, and arranging the document elements in ascending or descending order to obtain a document element sequence; and establishing, through a linear function, a mapping between the numeric codes of the document elements and the sorted array index space, so that when a numeric code is input into the linear function, the position of the corresponding document element in the sorted array index space is output; the linear function is obtained by training a linear regression model.
In one embodiment, the method further comprises: establishing the association with the element position chain by linking each element in the document element sequence to a pointer.
In one embodiment, the method further comprises: calculating, through the document position model, the position information of each document element in each document of the document set; for each document, hashing the content of the document, converting the hash result into a number to obtain a second numeric code, and joining the format number with the second numeric code to obtain the document identification of the document; and acquiring the format type of each document, wherein, for each document element, the document identification, the format type, the pre-acquired URL of the document, and the position information form the element position chain of the document element.
In one embodiment, the method further comprises: acquiring the content of the document element to be retrieved in the multi-version document, and querying the numeric code of the document element to be retrieved; inputting the numeric code into the linear function to obtain the position of the document element to be retrieved in the sorted array index space; linking, from that position in the sorted array index space, to the element position chain of the document element to be retrieved; traversing the element position chain and outputting the document identification, the format type, the pre-acquired URL of the document, and the position information; acquiring the document containing the document element from the URL, and obtaining the binary data stream of the document according to the document position model; and locating the document element to be retrieved in the binary data stream according to the position information and deserializing the binary data stream to obtain the position and content of the document element to be retrieved in the multi-version document.
An online multi-version document content locating apparatus, the apparatus comprising:
the document position model building module is used for building a document position model; wherein the document position model is used for describing document elements contained in the document and position information of each document element;
the document element index structure model building module is used for building a document element index structure model according to the document position model; the document element index structure model comprises an element index structure and an element position chain, wherein each element in the element index structure points through a pointer to a linked list in the element position chain; the element index structure is obtained by arranging the document elements contained in the document set into an array in sorted order and mapping the array to a sorted array index space; the element position chain is constructed according to the document information and the position information corresponding to each document element;
the retrieval module is used for acquiring the content of the document element to be retrieved in the multi-version document, querying the position of the document element in the sorted array index space according to the document element, and obtaining the element position chain through the pointer; and traversing the element position chain and the document position model to obtain the position of the document element to be retrieved in the multi-version document.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
constructing a document position model; wherein the document position model is used for describing document elements contained in the document and position information of each document element;
constructing a document element index structure model according to the document position model; the document element index structure model comprises an element index structure and an element position chain, wherein each element in the element index structure points through a pointer to a linked list in the element position chain; the element index structure is obtained by arranging the document elements contained in the document set into an array in sorted order and mapping the array to a sorted array index space; the element position chain is constructed according to the document information and the position information corresponding to each document element;
acquiring the content of a document element to be retrieved in a multi-version document, querying the position of the document element in the sorted array index space according to the document element, and obtaining the element position chain through the pointer;
and traversing the element position chain and the document position model to obtain the position of the document element to be retrieved in the multi-version document.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
constructing a document position model; wherein the document position model is used for describing document elements contained in the document and position information of each document element;
constructing a document element index structure model according to the document position model; the document element index structure model comprises an element index structure and an element position chain, wherein each element in the element index structure points through a pointer to a linked list in the element position chain; the element index structure is obtained by arranging the document elements contained in the document set into an array in sorted order and mapping the array to a sorted array index space; the element position chain is constructed according to the document information and the position information corresponding to each document element;
acquiring the content of a document element to be retrieved in a multi-version document, querying the position of the document element in the sorted array index space according to the document element, and obtaining the element position chain through the pointer;
and traversing the element position chain and the document position model to obtain the position of the document element to be retrieved in the multi-version document.
In the above online multi-version document content locating method, device, computer equipment and storage medium, a two-layer index structure is designed in which the element index structure indexes the elements and the element position chain indexes their positions. This provides indexing of all elements of a document and establishes associations between elements inside documents, which improves retrieval precision, is applicable to large-scale element indexing, and provides interpretability for the index results.
Drawings
FIG. 1 is a flow diagram of an online multi-version document content localization method in one embodiment;
FIG. 2 is a schematic diagram of a document location model in one embodiment;
FIG. 3 is a schematic diagram of a document element index structure model in one embodiment;
FIG. 4 is a block diagram of an online multi-version document content locating apparatus in one embodiment;
FIG. 5 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, there is provided an online multi-version document content locating method, comprising the steps of:
Step 102: constructing a document position model.
The document position model is used to describe the document elements contained in a document and the position information of each document element. Document elements are the parts that make up a document. Different versions of a document generally contain essentially the same document elements, but the position of each element may differ because the versions are published in different forms and at different times. This step therefore sorts out all the document elements in the document and then describes each of them by its position information.
Step 104: constructing a document element index structure model according to the document position model.
The document element index structure model comprises an element index structure and an element position chain, wherein each element in the element index structure points through a pointer to a linked list in the element position chain; the element index structure is obtained by arranging the document elements contained in the document set into an array in sorted order and mapping the array to a sorted array index space; the element position chain is constructed according to the document information and the position information corresponding to each document element.
Step 106: acquiring the content of the document element to be retrieved in the multi-version document, querying the position of the document element in the sorted array index space according to the document element, and obtaining the element position chain through the pointer.
After the element index structure model and the document position model are established, indexing can be performed based on the document element to be retrieved, so that its element position chain is obtained.
Step 108: traversing the element position chain and the document position model to obtain the position of the document element to be retrieved in the multi-version document.
In the above online multi-version document content locating method, a two-layer index structure is designed in which the element index structure indexes the elements and the element position chain indexes their positions. This provides indexing of all elements of a document and establishes associations between elements inside documents, which improves retrieval precision, is applicable to large-scale element indexing, and provides interpretability for the index results.
In one embodiment, the step of constructing the document position model comprises: determining the document elements in the document, a document element being a component used to manage layout-related content; numbering the document elements from the start position of the document to obtain a document element sequence; marking the start position of each document element in the document element sequence with a special mark, and serializing the marked document element sequence to obtain a binary data stream; and determining the position of each document element according to the number of bytes between its special mark and the start position of the binary data stream, thereby obtaining the document position model describing the document elements contained in the document and the position information of each document element.
Specifically, as shown in fig. 2, a schematic diagram of a document location model is provided, and the operation steps of constructing the document location model are as follows:
S1.1: Given a document, determine its set of elements in the usage sense. Elements in the usage sense are the components of the document used to manage layout-related content, including headings of each level, paragraphs, pictures, tables, formulas, references, and the like;
S1.2: Reading from the beginning of the document, the first document element is assigned sequence number 1 and the remaining document elements are numbered in the same way, so that the document elements of the whole document form a document element sequence in order of occurrence, each element carrying the sequence number of its occurrence;
S1.3: A special mark is placed at the start of each document element; the special mark is a special marker symbol that remains identifiable in the binary data after serialization;
S1.4: The complete document is serialized into a binary data stream; each special mark is located in the binary data stream, and the number of bytes between that special mark and the start position of the stream is obtained;
S1.5: This byte count is the position of the corresponding element in the document. The element position model of the document can then be expressed as the correspondence between each document element and its byte position.
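Steps S1.1 to S1.5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the marker byte string `MARK`, the UTF-8 serialization, and the function name `build_position_model` are all hypothetical choices made for the example.

```python
# Hypothetical special mark; assumed never to occur inside element content.
MARK = b"\x00\x01ELEM\x01\x00"


def build_position_model(elements):
    """Return (stream, positions) for a document given as an ordered list of element strings.

    positions[i] is the byte offset of the mark that precedes element number i + 1
    (elements are numbered from 1, as in S1.2).
    """
    # S1.3 / S1.4: put a mark in front of every element and serialize to one binary stream.
    stream = b"".join(MARK + text.encode("utf-8") for text in elements)

    # S1.4 / S1.5: find every mark and record its distance in bytes from the stream start.
    positions = []
    start = stream.find(MARK)
    while start != -1:
        positions.append(start)
        start = stream.find(MARK, start + len(MARK))
    return stream, positions


stream, positions = build_position_model(
    ["Title", "First paragraph.", "Second paragraph."]
)
```

Each recorded offset points at a mark, so the element content begins `len(MARK)` bytes after it and runs to the next mark or to the end of the stream.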
In one embodiment, as shown in FIG. 3, the specific idea of constructing a document element index structure model is as follows:
S2.1: A document set is given;
S2.2: After layout association recognition, the document set is found to contain a number of different categories of documents; for a given category, all documents in the corresponding set are documents of the same kind, that is, they differ in format but have the same content;
S2.3: The overall multi-format document element index model uses an index mechanism to index the position of each element of same-kind documents in the different formats, to facilitate associated queries;
S2.4: Each element is encoded; the encoded information includes the category of document to which the element belongs and the identification code of the element within that category;
S2.5: An element index structure is established based on the encoded values of the elements;
S2.6: The position information of each element in the documents of different formats is obtained with reference to the document element position model. A linked list is built for each element; each item in the linked list is the position information of the element in one particular document, together with the format type. These linked lists form the element position chain;
S2.7: The element index structure and the element position chain together form the overall multi-format document element index model.
In one embodiment, the document elements also need to be encoded: documents are associated by means of association recognition to obtain a document set, the document set comprising a plurality of documents with the same content but different document formats; for the documents of each format, a numeric number is assigned as the format number of the document, and a mapping table between format numbers and document formats is established; each document element is hashed and the hash result is converted into a number to obtain a first numeric number; and the first numeric number is joined with the format number to obtain the numeric code of the document element.
In one embodiment, the specific steps for encoding the document elements may be:
S3.1: To build the element index structure, the document elements need to be encoded;
S3.2: Association recognition is performed on the document set using an existing association recognition method; documents belonging to the same category form a set, in which the documents differ in format but have the same content;
S3.3: Each category of documents is assigned a numeric number, which serves as the format number of the documents in that category;
S3.4: A document-to-category mapping table is established and stored, so that, given a document, its category number can be queried through the mapping table;
S3.5: Each document element is hashed to obtain a hash value, and the hash value is then converted into the first numeric number;
S3.6: The numeric code of each document element is formed by joining the first numeric number and the number assigned in S3.3 with a connector symbol, so that each element has a unique numeric code.
In one embodiment, the number of all document elements in the document set is acquired, and the document elements are arranged in ascending or descending order to obtain a document element sequence; a mapping between the numeric codes of the document elements and the sorted array index space is established through a linear function, so that when a numeric code is input into the linear function, the position of the corresponding document element in the sorted array index space is output; the linear function is obtained by training a linear regression model.
Specifically, the method can be realized by the following steps:
S4.1: For a document set, constructing the element index structure means building, for each element of same-kind documents, a data structure that supports fast retrieval;
S4.2: The number of all document elements is determined, where a document element that appears in several formats of same-kind documents is counted once. The document elements are sorted in ascending or descending order of their numeric codes to form a document element sequence;
S4.3: A mapping is established from the space of numeric codes of the document elements to the sorted array index space, which may be an ascending or a descending array index space;
S4.4: For this mapping, a linear regression model is trained to construct a linear function whose input is the numeric code of a document element and whose output is the position of that document element in the sorted array index space;
S4.5: Each element in the document element sequence is linked to a pointer that points to a linked list (the element position chain).
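A possible reading of S4.2 to S4.4 is sketched below: the codes are sorted, a line is fitted by ordinary least squares from code to array position, and a lookup evaluates the line. The bounded correction search around the predicted position is an addition of this sketch (the text does not say how prediction error is handled), and the class name `ElementIndex` and the use of small integer codes are assumptions. Very large codes, such as 64-bit hash values, would need scaling before float arithmetic is reliable.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y ≈ a * x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x if var_x else 0.0
    return a, mean_y - a * mean_x


class ElementIndex:
    """Sorted array of numeric codes plus a learned linear code-to-position function."""

    def __init__(self, codes):
        self.codes = sorted(codes)  # S4.2: ascending order
        # S4.4: train the linear function on (code, position) pairs.
        self.a, self.b = fit_line(self.codes, range(len(self.codes)))
        # Largest prediction error on the indexed codes; bounds the correction search.
        self.max_err = max(
            abs(self._predict(c) - i) for i, c in enumerate(self.codes)
        )

    def _predict(self, code):
        guess = round(self.a * code + self.b)
        return min(max(guess, 0), len(self.codes) - 1)

    def position(self, code):
        """Return the array position of code, or None if the code is not indexed."""
        guess = self._predict(code)
        lo = max(guess - self.max_err, 0)
        hi = min(guess + self.max_err, len(self.codes) - 1)
        for i in range(lo, hi + 1):
            if self.codes[i] == code:
                return i
        return None


index = ElementIndex([40, 10, 30, 20, 95])
```

Since `max_err` is measured on the indexed codes themselves, every indexed code is guaranteed to lie inside the search window around its predicted position.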
In one embodiment, the position information of each document element in each document of the document set is calculated through the document position model; for each document, the content of the document is hashed, the hash result is converted into a number to obtain a second numeric code, and the format number is joined with the second numeric code to obtain the document identification of the document; and the format type of each document is acquired, wherein, for each document element, the document identification, the format type, the pre-acquired URL of the document, and the position information form the element position chain of the document element.
Specifically, the implementation steps may be:
S5.1: For each document element of same-kind documents, its position information in each document of the category is calculated through the document element position model;
S5.2: Suppose a category contains a number of documents, which together form the set for that category; the number of such categories is the total number of document categories identified in the document set by association recognition;
S5.3: For each document element, its position in each document of the category is calculated;
S5.4: For each document, its content is hashed to obtain a hash value, which is then converted into a second numeric code;
S5.5: The code of each document is formed by joining the format number and the second numeric code with a connector symbol; each document thereby obtains its document identification;
S5.6: For each document, its layout type is given, such as preprint (PrePrint), online-first version (Online First), postprint (PostPrint), and the like;
S5.7: A linked list is constructed for each document element. Each linked list item holds the information of one document containing the element, namely the document identification, the layout type, the document URL, and the position of the element in that document. This linked list is the element position chain.
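The linked list of S5.7 could be represented as below. The field names, the `build_chain` and `walk` helpers, and the example identifiers and `example.org` URLs are hypothetical; only the four pieces of information stored per item come from the text.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class ChainItem:
    document_id: str   # document identification (S5.5)
    layout_type: str   # e.g. "PrePrint", "Online First", "PostPrint" (S5.6)
    url: str           # document URL
    position: int      # byte offset of the element in that document (S5.3)
    next: Optional["ChainItem"] = None


def build_chain(entries) -> Optional[ChainItem]:
    """Link (document_id, layout_type, url, position) tuples into a singly linked list."""
    head = None
    for entry in reversed(entries):
        head = ChainItem(*entry, next=head)
    return head


def walk(head: Optional[ChainItem]) -> Iterator[ChainItem]:
    """Traverse the chain from its head, yielding one item per document."""
    node = head
    while node is not None:
        yield node
        node = node.next


# One element that occurs in two formats of the same document (made-up values).
chain = build_chain([
    ("3-1001", "PrePrint", "https://example.org/a.pdf", 120),
    ("3-1002", "PostPrint", "https://example.org/b.pdf", 342),
])
```

In the full model, the array slot of an element in the element index structure would hold a reference to the head of such a chain.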
In one embodiment, the content of the document element to be retrieved in the multi-version document is acquired, and the numeric code of the document element to be retrieved is queried; the numeric code is input into the linear function to obtain the position of the document element to be retrieved in the sorted array index space; from that position in the sorted array index space, the element position chain of the document element to be retrieved is reached; the element position chain is traversed, and the document identification, the format type, the pre-acquired URL of the document, and the position information are output; the document containing the document element is acquired from the URL, and the binary data stream of the document is obtained according to the document position model; and the document element to be retrieved is located in the binary data stream according to the position information, and the binary data stream is deserialized to obtain the position and content of the document element to be retrieved in the multi-version document.
Specifically, the method comprises the following implementation steps:
S1: a user selects a document element in a document (e.g., a title, paragraph, picture, table or formula) and needs to query its content in documents of other formats in the same category;
S2: query the category number of the document through the mapping relation table;
S3: hash the document element to obtain a hash value, and then convert the hash value into the first digital code;
S4: input the code as the query condition into the linear function, which returns the position of the element in the ascending array in which it is located;
S5: query the element position chain linked to that element of the ascending array;
S6: traverse the element position chain and output the document URL, the layout type, the document identification and the position of the element in the document;
S7: obtain the document from the document URL; after obtaining its binary data stream, seek from the start of the stream by the number of bytes given by that position and deserialize the complete bytes that follow; the result is the starting point of the content queried by the user in the other-format documents of the same category.
It should be understood that, although the steps in the flowchart of FIG. 1 are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in FIG. 1 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 4, there is provided an online multi-version document content locating apparatus comprising: a document location model construction module 402, a document element index structure model construction module 404, and a retrieval module 406, wherein:
a document location model construction module 402 for constructing a document location model; wherein the document position model is used for describing document elements contained in the document and position information of each document element;
a document element index structure model construction module 404, configured to construct a document element index structure model according to the document position model; the document element index structure model comprises an element index structure and an element position chain, wherein elements in the element index structure point to a linked list in the element position chain through pointers; the element index structure is obtained by mapping document elements contained in the document set into an array according to the sequence, and mapping the array to a sequence array index space; the element position chain is constructed according to the document information corresponding to each document element and the position information;
the retrieval module 406 is configured to obtain contents of a document element to be retrieved in a multi-version document, query its position in the ordered array index space according to the document element, and obtain the element position chain through a pointer; and traversing the element position chain and the document position model to obtain the position of the document element to be retrieved in the multi-version document.
In one embodiment, the document location model construction module 402 is further configured to determine document elements in a document; the document element is a component for managing layout correlation; numbering the document elements from the starting position of the document to obtain a document element sequence; marking the initial position of the document element in the document element sequence by adopting a special mark, and binarizing the marked document element sequence to obtain a binary data stream; and determining the position of the current document element according to the byte number of the special mark from the starting position in the binary data stream, thereby obtaining a document position model for describing the document elements contained in the document and the position information of each document element.
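A minimal Python sketch of such a document position model is given below. The choice of the byte `0x1E` as the special mark, the UTF-8 encoding and the function names are assumptions made for illustration; the sketch also assumes the mark never occurs inside element content.

```python
MARK = b"\x1e"  # hypothetical special mark (ASCII record separator)


def build_position_model(elements):
    """Number the elements from the start of the document, mark each start, binarize.

    Returns the binary data stream and a map {element number: byte offset of its mark}.
    """
    stream = bytearray()
    positions = {}
    for number, text in enumerate(elements, start=1):
        positions[number] = len(stream)        # bytes between stream start and this mark
        stream += MARK + text.encode("utf-8")
    return bytes(stream), positions


def read_element(stream: bytes, position: int) -> str:
    """Seek to a recorded position and deserialize the element that starts there."""
    if stream[position:position + 1] != MARK:
        raise ValueError("position does not point at a special mark")
    end = stream.find(MARK, position + 1)      # the next mark ends this element
    if end == -1:
        end = len(stream)                      # last element runs to the end of the stream
    return stream[position + 1:end].decode("utf-8")


elements = ["Title", "First paragraph", "Figure 1"]
stream, positions = build_position_model(elements)
second = read_element(stream, positions[2])
```

Because each position is a plain byte count from the start of the stream, a reader can seek straight to an element without parsing the elements in front of it.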
In one embodiment, the document element index structure model building module 404 is further configured to encode the document elements: associating the documents by means of an association identifier to obtain a document set, the document set comprising a plurality of documents with the same content but different document formats; assigning each document format a numeric code as the format number of the document, and establishing a mapping relation table between format numbers and document formats; hashing the document element and digitally encoding the hash result to obtain a first digital number; and concatenating the first digital number with the format number to obtain the digital code of the document element.
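The encoding can be illustrated with a short Python sketch. The mapping table contents, the use of SHA-256, the 12-digit width of the first digital number and the order of concatenation (format number first) are all assumptions; the text only states that a hash-derived number and a format number are joined.

```python
import hashlib

# Hypothetical mapping relation table between document formats and format numbers.
FORMAT_NUMBERS = {"pdf": 1, "docx": 2, "html": 3}
DIGITS = 12  # assumed width, in decimal digits, of the first digital number


def first_digital_number(element: bytes) -> int:
    """Hash the element and reduce the hash to a fixed-width decimal number."""
    digest = hashlib.sha256(element).hexdigest()
    return int(digest, 16) % 10**DIGITS


def element_digital_code(element: bytes, doc_format: str) -> int:
    """Join the format number and the first digital number into one integer code."""
    return FORMAT_NUMBERS[doc_format] * 10**DIGITS + first_digital_number(element)


paragraph = "Online documents often exist in several formats.".encode("utf-8")
code_pdf = element_digital_code(paragraph, "pdf")
code_docx = element_digital_code(paragraph, "docx")
```

With this layout the leading digits identify the format and the trailing twelve digits identify the element content, so the same element in two formats shares its trailing digits.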
In one embodiment, the document element index structure model building module 404 is further configured to obtain the number of all document elements in the document set, and arrange the document elements in an ascending order or a descending order to obtain a document element sequence; establishing a mapping relation between the digital codes of the document elements and an ordering array index space through a linear function, so that when the digital codes are input into the linear function, the positions of the document elements in the ordering array index space can be output; the linear function is obtained through training of a linear regression model.
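How a linear function could be trained on the sorted codes is sketched below using ordinary least squares in plain Python. The function names and the toy code values are hypothetical, and measuring the maximum prediction error is an added illustration rather than something the text describes.

```python
def fit_linear(codes):
    """Least-squares fit of index ~ slope * code + intercept over an ascending code array."""
    n = len(codes)
    mean_x = sum(codes) / n
    mean_y = (n - 1) / 2                       # mean of the indices 0 .. n-1
    sxx = sum((x - mean_x) ** 2 for x in codes)
    sxy = sum((x - mean_x) * (i - mean_y) for i, x in enumerate(codes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def max_error(codes, slope, intercept):
    """Largest gap between the predicted and the true array index."""
    return max(abs(round(slope * x + intercept) - i) for i, x in enumerate(codes))


even_codes = [10, 20, 30, 40, 50]              # evenly spaced: the line fits exactly
slope, intercept = fit_linear(even_codes)
predicted = [round(slope * x + intercept) for x in even_codes]

skewed_codes = [1, 2, 3, 50]                   # unevenly spaced: the line is only approximate
skewed_error = max_error(skewed_codes, *fit_linear(skewed_codes))
```

On evenly spaced codes the fitted line reproduces every index, while on the skewed example some predictions are off by one position, which suggests that a practical lookup may want to check the neighbours of the predicted position.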
In one embodiment, the document element index structure model building module 404 is further configured to establish an association with the chain of element positions by linking each element in the sequence of document elements to a pointer.
In one embodiment, the document element index structure model building module 404 is further configured to calculate the location information of the document of each document element in the document collection through the document location model; for each document, carrying out hash processing on the content of the document, digitizing the hash processing result to obtain a second digital code, and connecting the format number with the second digital code to obtain a document identifier of each document; and acquiring the format type of each document, wherein for each document element, the document identification, the format type, the URL corresponding to the pre-acquired document and the position information form an element position chain of the document element.
In one embodiment, the retrieval module 406 is further configured to acquire the content of the document element to be retrieved in the multi-version document and query the digital code of that element; input the digital code into the linear function to obtain the position of the element in the sorted array index space; follow the pointer at that position to the element position chain of the element; traverse the element position chain and output the document identification, the format type, the URL of the pre-acquired document and the position information; acquire the document corresponding to the document element from the URL and obtain its binary data stream according to the document position model; and look up the document element to be retrieved by its position information and deserialize the binary data stream to obtain the position and the content of the element in the multi-version document.
For specific limitations on the online multi-version document content locating apparatus, reference may be made to the limitations on the online multi-version document content locating method above, which are not repeated here. Each module in the apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the corresponding operations.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements an online multi-version document content localization method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 5 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an embodiment a computer device is provided comprising a memory storing a computer program and a processor implementing the steps of the method of the above embodiments when the computer program is executed.
In one embodiment, a computer readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method of the above embodiments.
Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer readable storage medium; when executed, the program may carry out the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.
The above examples represent only a few embodiments of the present application. They are described in some detail, but are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art could make various modifications and improvements without departing from the spirit of the present application, all of which fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (9)

1. An online multi-version document content locating method, the method comprising:
constructing a document position model; wherein the document location model is used for describing document elements contained in the document and element locations of each document element;
constructing a document element index structure model according to the document position model; the document element index structure model comprises an element index structure and an element position chain, wherein elements in the element index structure point to a linked list in the element position chain through pointers; the element index structure is obtained by mapping document elements contained in the document set into an array according to the sequence, and mapping the array to a sequence array index space; the element position chain is constructed according to the document information corresponding to each document element and the element position;
acquiring the content of a document element to be retrieved in a multi-version document, inquiring the position of the document element in the index space of the sequencing array according to the document element, and obtaining the element position chain through a pointer;
traversing the element position chain and the document position model to obtain the position of the document element to be retrieved in the multi-version document;
the building of the document location model comprises:
determining document elements in the document; the document element is a component for managing layout correlation;
numbering the document elements from the starting position of the document to obtain a document element sequence;
marking the initial position of a document element in the document element sequence by adopting a special mark, binarizing the document, and obtaining a binary data stream;
and determining the element position of the current document element according to the byte number of the binarized special mark from the starting position in the binary data stream, thereby obtaining a document position model for describing the document elements contained in the document and the element position of each document element.
2. The method of claim 1, further comprising, prior to constructing a document element index structure model from the document location model:
encoding the document element;
encoding the document element, comprising:
for a document set, associating the documents in an association identification mode, wherein the documents belonging to the same category form a set; the collection comprises a plurality of documents with the same content but different document formats;
for each kind of document, assigning a digital number as a document kind code of the document, and establishing a mapping relation table of the document kind code and the document;
carrying out hash processing on the document element, and carrying out digital coding on the hash processing result to obtain element identification coding;
and connecting the element identification code with the document type code to obtain the digital code of the document element.
3. The method of claim 2, wherein the step of constructing an element index structure comprises:
acquiring the number of all document elements in the document set, and arranging the document elements in an ascending order or a descending order to obtain a document element sequence;
establishing a mapping relation between the digital codes of the document elements and an ordering array index space through a linear function, so that when the digital codes are input into the linear function, element positions of the document elements in the ordering array index space are output; the linear function is obtained through training of a linear regression model.
4. A method according to claim 3, characterized in that the method further comprises:
by linking each element in the sequence of document elements to a pointer, an association with the chain of element positions is established.
5. A method according to claim 3, wherein the step of constructing a chain of element positions comprises:
calculating the element position of each document element in the document set through the document position model;
for each document, carrying out hash processing on the content of the document, digitizing the hash processing result to obtain a document identification code, and connecting the document type code with the document identification code to obtain a document identification of each document;
and acquiring the format type of each document, and for each document element, constructing an element position chain of the document element by the document identification, the format type, the URL corresponding to the pre-acquired document and the element position.
6. The method of claim 5, wherein obtaining the content of the document element to be retrieved in the multi-version document, querying its position in the ordered array index space according to the document element, and obtaining the element position chain by a pointer, traversing the element position chain and the document position model, and obtaining the position of the document element to be retrieved in the multi-version document, comprises:
acquiring the content of a document element to be searched in a multi-version document, and inquiring the digital code of the document element to be searched;
inputting the digital codes of the document elements to be searched into the linear function to obtain the positions of the document elements to be searched in the sorting array index space;
linking to the element position chain of the document element to be searched according to the position of that document element in the sorting array index space;
traversing the element position chain, and outputting the document identification, the format type, the URL corresponding to the pre-acquired document and the element position;
acquiring a document corresponding to the document element from the URL, and obtaining a binary data stream corresponding to the document according to the document position model;
and obtaining the document element to be searched according to the element position query, and obtaining the position and the content of the document element to be searched in the multi-version document after deserializing the binary data stream.
7. An on-line multi-version document content locating apparatus, the apparatus comprising:
the document position model building module is used for building a document position model; wherein the document location model is used for describing document elements contained in the document and element locations of each document element;
the document element index structure model building module is used for building a document element index structure model according to the document position model; the document element index structure model comprises an element index structure and an element position chain, wherein elements in the element index structure point to a linked list in the element position chain through pointers; the element index structure is obtained by mapping document elements contained in the document set into an array according to the sequence, and mapping the array to a sequence array index space; the element position chain is constructed according to the document information corresponding to each document element and the element position;
the retrieval module is used for acquiring the content of the document element to be retrieved in the multi-version document, inquiring the position of the document element in the ordering array index space according to the document element, and obtaining the element position chain through a pointer; traversing the element position chain and the document position model to obtain the position of the document element to be retrieved in the multi-version document;
the document position model construction module is also used for determining document elements in the document; the document element is a component for managing layout correlation; numbering the document elements from the starting position of the document to obtain a document element sequence; marking the initial position of a document element in the document element sequence by adopting a special mark, binarizing the document, and obtaining a binary data stream; and determining the element position of the current document element according to the byte number of the binarized special mark from the starting position in the binary data stream, thereby obtaining a document position model for describing the document elements contained in the document and the element position of each document element.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
CN202311711792.9A 2023-12-13 2023-12-13 Online multi-version document content positioning method, device, equipment and medium Active CN117389954B (en)

Publications (2)

Publication Number Publication Date
CN117389954A CN117389954A (en) 2024-01-12
CN117389954B CN117389954B (en) 2024-03-29

Family

ID=89468873

Country Status (1)

Country Link
CN (1) CN117389954B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5953723A (en) * 1993-04-02 1999-09-14 T.M. Patents, L.P. System and method for compressing inverted index files in document search/retrieval system
US6938204B1 (en) * 2000-08-31 2005-08-30 International Business Machines Corporation Array-based extensible document storage format
JP2012173796A (en) * 2011-02-17 2012-09-10 Nippon Telegr & Teleph Corp <Ntt> Document retrieval device using ranking function generating device having margin generation function, document retrieval method using ranking function generating device having margin generation function, and document retrieval program using ranking function generating device having margin generation function
EP2722778A1 (en) * 2012-10-17 2014-04-23 Thomson Licensing Method and apparatus for retrieving a media file of interest
CN111045988A (en) * 2018-10-12 2020-04-21 伊姆西Ip控股有限责任公司 File searching method, equipment and computer program product
CN114722139A (en) * 2022-03-11 2022-07-08 中国人民解放军国防科技大学 Space-time multi-attribute index method capable of self-adaptive dynamic expansion and retrieval method thereof
CN115438633A (en) * 2022-09-30 2022-12-06 湖南汇智兴创科技有限公司 Cross-document online discussion processing method, interaction method, device and equipment
CN115687566A (en) * 2022-09-30 2023-02-03 中国人民解放军93114部队 Method and device for full-text retrieval and retrieval result display
CN115994232A (en) * 2023-03-21 2023-04-21 湖南汇智兴创科技有限公司 Online multi-version document identity authentication method, system and computer equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120233205A1 (en) * 2008-03-07 2012-09-13 Inware, Llc System and method for document management
US10459900B2 (en) * 2016-06-15 2019-10-29 International Business Machines Corporation Holistic document search

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant