CN111949765B - Semantic-based similar text searching method, system, device and storage medium - Google Patents


Publication number
CN111949765B
CN111949765B (application CN202010843746.4A)
Authority
CN
China
Prior art keywords
text
semantic
split
target
semantic features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010843746.4A
Other languages
Chinese (zh)
Other versions
CN111949765A (en)
Inventor
卓民
杨楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Kaniu Technology Co ltd
Original Assignee
Shenzhen Kaniu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Kaniu Technology Co ltd filed Critical Shenzhen Kaniu Technology Co ltd
Priority to CN202010843746.4A priority Critical patent/CN111949765B/en
Publication of CN111949765A publication Critical patent/CN111949765A/en
Application granted granted Critical
Publication of CN111949765B publication Critical patent/CN111949765B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis


Abstract

The embodiment of the invention discloses a semantic-based similar text searching method, system, device and storage medium. The method comprises the following steps: acquiring a target text; splitting the target text to obtain a plurality of first split texts; searching the first semantic feature of each first split text in a semantic feature table generated based on a preset database; acquiring a target semantic feature of the target text, wherein the target semantic feature is the average of the first semantic features of the plurality of first split texts; and obtaining, from the preset database, texts similar to the target text according to the target semantic feature. By incorporating semantics, the embodiment of the invention improves the accuracy of similar text search.

Description

Semantic-based similar text searching method, system, device and storage medium
Technical Field
The embodiment of the invention relates to text technology, and in particular to a semantic-based similar text searching method, system, device and storage medium.
Background
With the development of internet technology and the arrival of the information age, people acquire information in ever more varied ways, and scenarios that call for searching similar texts are especially widespread.
In the common existing approach to similar text search, keywords, synonyms and paraphrases serve as keys: a bag-of-words (BOW) index is built over the articles and then queried. Such a search incorporates neither semantics nor context, however, so its results are inaccurate.
Disclosure of Invention
The embodiment of the invention provides a semantic-based similar text searching method, system, device and storage medium, which improve the accuracy of similar text search by incorporating semantics.
To this end, an embodiment of the present invention provides a semantic-based similar text search method, which includes:
acquiring a target text;
splitting the target text to obtain a plurality of first split texts;
searching the first semantic feature of each first split text in a semantic feature table generated based on a preset database;
acquiring a target semantic feature of the target text, wherein the target semantic feature is the average of the first semantic features of the plurality of first split texts;
and obtaining, from the preset database, texts similar to the target text according to the target semantic feature.
Further, the obtaining the target text includes:
Acquiring the precision requirement input by a user;
Determining a semantic radius according to the precision requirement;
The searching the first semantic feature of each first split text in the semantic feature table generated based on the preset database comprises the following steps:
and searching the first semantic features of each first split text in a semantic feature table generated based on a preset database based on the semantic radius.
Further, before the target text is obtained, the method includes:
acquiring training texts in a preset database;
Splitting the training text to obtain a plurality of second split texts;
Inputting each second split text into a preset neural network model to obtain second semantic features of each second split text, wherein the second semantic features are matrixes formed by occurrence probabilities of the rest second split texts of the second split text within a preset semantic radius;
acquiring training semantic features of the training text, wherein the training semantic features are average values of a plurality of second semantic features;
And generating a semantic feature table according to the second semantic features.
Further, the inputting each of the second split texts into a preset neural network model to obtain the second semantic feature of each of the second split texts includes:
converting the second split text into a third split text based on one-hot coding;
and inputting each third split text into a preset neural network model to obtain a second semantic feature of each second split text.
Further, the obtaining, according to the target semantic feature, similar text similar to the target text from the preset database includes:
and obtaining similar texts similar to the target text from the preset database according to the target semantic features and the training semantic features.
Further, the obtaining similar text similar to the target text from the preset database according to the target semantic feature and the training semantic feature includes:
Obtaining similar semantic features, wherein the similar semantic features are training semantic features with the difference value between the training semantic features and the target semantic features being smaller than a first threshold value;
and obtaining similar texts from the preset database according to the similar semantic features.
Further, the neural network model is a Skip-Gram model based on Word2vec.
In one aspect, an embodiment of the present invention further provides a semantic-based similar text search system, where the system includes:
the text acquisition module is used for acquiring a target text;
the text splitting module is used for splitting the target text to obtain a plurality of first split texts;
the feature searching module is used for searching first semantic features of each first split text in a semantic feature table generated based on a preset database;
The feature acquisition module is used for acquiring target semantic features of the target text, wherein the target semantic features are average values of first semantic features of a plurality of first split texts;
and the text searching module is used for acquiring similar texts similar to the target text from the preset database according to the target semantic features.
In another aspect, an embodiment of the present invention further provides a computer device, including: one or more processors; and a storage means for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement a method as provided by any of the embodiments of the present invention.
In yet another aspect, embodiments of the present invention further provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as provided by any of the embodiments of the present invention.
According to the embodiment of the invention, a target text is acquired; the target text is split to obtain a plurality of first split texts; the first semantic feature of each first split text is found in a semantic feature table generated based on a preset database; the target semantic feature of the target text, namely the average of the first semantic features of the plurality of first split texts, is acquired; and texts similar to the target text are obtained from the preset database according to the target semantic feature, thereby improving the accuracy of similar text search by incorporating semantics.
Drawings
FIG. 1 is a flow chart of a semantic-based similar text search method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for generating a semantic feature table according to a second embodiment of the present invention;
FIG. 3 is a flow chart of a semantic-based similar text search method according to a second embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a semantic-based similar text search system according to a third embodiment of the present invention;
fig. 5 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are for purposes of illustration and not of limitation. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Before discussing exemplary embodiments in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart depicts steps as a sequential process, many of the steps may be implemented in parallel, concurrently, or with other steps. Furthermore, the order of the steps may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figures. The processes may correspond to methods, functions, procedures, subroutines, and the like.
Furthermore, the terms "first," "second," and the like, may be used herein to describe various directions, acts, steps, or elements, etc., but these directions, acts, steps, or elements are not limited by these terms. These terms are only used to distinguish one direction, action, step or element from another direction, action, step or element. For example, a first module may be referred to as a second module, and similarly, a second module may be referred to as a first module, without departing from the scope of the application. Both the first module and the second module are modules, but they are not the same module. The terms "first," "second," and the like, are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more of the described features. In the description of the embodiments of the present application, the meaning of "plurality" is at least two, for example, two, three, etc., unless explicitly defined otherwise.
Embodiment One
As shown in fig. 1, a first embodiment of the present invention provides a semantic-based similar text searching method, which includes:
s110, acquiring a target text.
S120, splitting the target text to obtain a plurality of first split texts.
In this embodiment, a target text is obtained first. The target text may be an electronic book, web-page news, a journal article, a patent or the like; specifically, it is the text for which the user needs to find similar texts. After the target text is obtained, it is split into a plurality of first split texts, where a first split text may be a word or a single character; preferably, the jieba word segmentation method is used to split the target text into the plurality of first split texts.
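Steps S110 and S120 can be sketched minimally in Python. This is only an illustration: a naive whitespace split stands in for the jieba segmenter so the sketch needs no third-party dependency, and `split_text` and the sample sentence are hypothetical names, not from the patent.

```python
def split_text(text):
    # The patent splits the target text with the jieba segmenter; a naive
    # whitespace split stands in here so the sketch stays self-contained.
    return [token for token in text.split() if token]

target_text = "Chengdu is located in Sichuan Province"  # hypothetical target
first_split_texts = split_text(target_text)
print(first_split_texts)
```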
S130, searching first semantic features of each first split text in a semantic feature table generated based on a preset database.
S140, acquiring target semantic features of the target text, wherein the target semantic features are average values of the first semantic features of the plurality of first split texts.
In this embodiment, after a plurality of first split texts are obtained, the first semantic feature of each first split text may be found in a semantic feature table generated based on a preset database, where the semantic feature table is generated by training using text data in the preset database, and each word corresponds to a unique semantic feature, so that the first semantic feature of each first split text may be found in the semantic feature table. Further, the first semantic features of each first split text are summed and averaged to obtain an average value of all the first semantic features, and the average value is used as a target semantic feature of the whole target text, so that the text feature based on the semantic of the target text, namely the target semantic feature, is obtained.
S150, obtaining similar texts similar to the target text from the preset database according to the target semantic features.
In this embodiment, after the target semantic feature of the target text is obtained, texts similar to the target text may be obtained from the preset database according to the target semantic feature: the closer the semantic features of two texts, the more similar the two texts are.
According to the embodiment of the invention, a target text is acquired; the target text is split to obtain a plurality of first split texts; the first semantic feature of each first split text is found in a semantic feature table generated based on a preset database; the target semantic feature of the target text, namely the average of the first semantic features of the plurality of first split texts, is acquired; and texts similar to the target text are obtained from the preset database according to the target semantic feature, thereby improving the accuracy of similar text search by incorporating semantics.
Embodiment Two
As shown in fig. 2 and fig. 3, a second embodiment of the present invention provides a semantic-based similar text searching method, and the second embodiment of the present invention is further explained based on the first embodiment of the present invention.
In this embodiment, as shown in fig. 2, before executing the similar text searching method based on semantics, a semantic feature table generated based on a preset database needs to be obtained, which specifically includes:
S210, acquiring training texts in a preset database.
S220, splitting the training texts to obtain a plurality of second split texts.
S230, converting the second split text into a third split text based on one-hot coding.
S240, inputting each third split text into a preset neural network model to obtain second semantic features of each second split text, wherein the second semantic features are matrixes formed by occurrence probabilities of the rest second split texts of the second split text within a preset semantic radius.
S250, acquiring training semantic features of the training text, wherein the training semantic features are average values of a plurality of second semantic features.
S260, generating a semantic feature table according to the second semantic features.
In this embodiment, obtaining the semantic feature table requires training a neural network model with a large number of training texts in the preset database. First, the training texts in the preset database are acquired; there are multiple training texts, and each is split into a plurality of second split texts. To suit the neural network model, each second split text is further converted into a third split text based on one-hot encoding, and the third split texts are then input into a preset neural network model. The neural network model is a Skip-Gram model based on Word2vec, so the user must specify a semantic radius at input time; the Skip-Gram model then outputs, for each second split text, a unique matrix formed by the occurrence probabilities of the other second split texts within the preset semantic radius. This matrix is the second semantic feature of that second split text. Finally, the second semantic features and their corresponding second split texts are used to generate the semantic feature table, and the second semantic features within one training text are averaged to obtain the training semantic feature of that training text, which can be used in subsequent search and comparison.
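The one-hot conversion of step S230 can be sketched as below. The vocabulary and tokens are hypothetical; the encoding itself is the standard one the patent names.

```python
def one_hot(vocabulary, token):
    # Encode a second split text as a one-hot vector over the training
    # vocabulary -- the input format the Skip-Gram network expects.
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(token)] = 1
    return vector

vocab = ["Chengdu", "located in", "China", "Sichuan Province"]  # hypothetical
print(one_hot(vocab, "China"))  # only the position of "China" is set to 1
```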
Preferably, multiple semantic radii can be input during training to perform multiple training to address subsequent search requirements of different precision.
For example, one of the training texts reads: "Chengdu, also known as Jincheng and abbreviated as Rong, is located in the middle-east of Sichuan Province, China, in the west of the Sichuan Basin." The text is first split with the jieba word segmentation method into second split texts such as "Chengdu", "Jincheng", "abbreviated", "Rong", "located in", "China", "Sichuan Province", "middle-east part", "at", "Sichuan Basin" and "west"; each second split text is then converted into a third split text based on one-hot encoding, and each third split text is input into the preset neural network model to obtain the second semantic feature of each second split text.
Specifically, take the second split text "China" and set the semantic radius to 2. The lattice phrase formed by the other second split texts within this radius is shown in Table 1, and the corresponding second semantic features are shown in Table 2; the two tables correspond entry by entry. That is, 0.87 is the probability that "Rong" appears two positions before "China", 0.68 is the probability that "located in" appears one position before "China", 0.94 is the probability that "Sichuan Province" appears one position after "China", and 0.78 is the probability that "middle-east part" appears two positions after "China". The second semantic feature shown in Table 2 can thus be used to represent the second split text "China".
TABLE 1
Rong	located in
Sichuan Province	middle-east part
TABLE 2
0.87	0.68
0.94	0.78
TABLE 3
Jincheng	abbreviated	Rong	located in
Sichuan Province	middle-east part	at	Sichuan Basin
TABLE 4
0.68	0.75	0.87	0.68
0.94	0.78	0.67	0.61
Further, set the semantic radius to 4. The lattice phrase of "China" formed by the other second split texts within this radius is shown in Table 3, and the corresponding second semantic features are shown in Table 4: 0.68 is the probability that "Jincheng" appears four positions before "China", 0.75 the probability that "abbreviated" appears three positions before, 0.87 the probability that "Rong" appears two positions before, 0.68 the probability that "located in" appears one position before, 0.94 the probability that "Sichuan Province" appears one position after, 0.78 the probability that "middle-east part" appears two positions after, 0.67 the probability that "at" appears three positions after, and 0.61 the probability that "Sichuan Basin" appears four positions after "China". The second semantic feature shown in Table 4 can represent the second split text "China" more precisely.
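The lattice phrases of Tables 1 and 3 can be read off mechanically: they are just the tokens within the semantic radius on either side of the chosen split text. A sketch, with the example sentence's tokens transliterated (the function name is an assumption):

```python
def context_window(tokens, index, radius):
    # Collect the other split texts within `radius` positions before and
    # after the given one -- the context over which the Skip-Gram model
    # predicts occurrence probabilities.
    before = tokens[max(0, index - radius):index]
    after = tokens[index + 1:index + 1 + radius]
    return before, after

tokens = ["Chengdu", "Jincheng", "abbreviated", "Rong", "located in",
          "China", "Sichuan Province", "middle-east part", "at",
          "Sichuan Basin", "west"]
print(context_window(tokens, tokens.index("China"), 2))
# With radius 2 this reproduces Table 1's four context words.
```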
In this embodiment, as shown in fig. 3, the semantic-based similar text searching method includes:
S310, acquiring a target text.
S320, obtaining the precision requirement input by the user.
S330, determining the semantic radius according to the precision requirement.
S340, splitting the target text to obtain a plurality of first split texts.
S350, searching first semantic features of each first split text in a semantic feature table generated based on a preset database based on the semantic radius.
S360, acquiring target semantic features of the target text, wherein the target semantic features are average values of the first semantic features of the plurality of first split texts.
In this embodiment, the target text is split in the same way as the training texts, which is not repeated here. The larger the semantic radius of the target text, the higher the search precision and the stronger the representational power of the corresponding semantic features: a small semantic radius suits early coarse recall, while a large semantic radius suits tasks with higher precision requirements, such as fine ranking. The semantic radius may not, however, exceed the semantic radius used in training. After the user inputs a precision requirement, the corresponding semantic radius can be determined; if the user inputs several precision requirements, several semantic radii are determined accordingly. Since the semantic feature table holds different semantic features trained for different semantic radii, several target semantic features based on different semantic radii can finally be obtained.
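One way to honor the constraint that the query-time radius may not exceed a radius used in training is to clamp the requested radius to the trained ones. This mapping is an assumption for illustration; the patent only states the bound, not how the precision requirement is translated into a radius.

```python
def choose_radius(requested_radius, trained_radii):
    # Pick the largest trained semantic radius that does not exceed the
    # requested one; fall back to the smallest trained radius otherwise.
    # (Hypothetical policy -- the patent only requires the upper bound.)
    usable = [r for r in sorted(trained_radii) if r <= requested_radius]
    return usable[-1] if usable else min(trained_radii)

print(choose_radius(3, [2, 4]))  # a request of 3 falls back to the trained radius 2
```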
S370, obtaining similar semantic features, wherein the similar semantic features are training semantic features with differences from the target semantic features smaller than a first threshold.
S380, obtaining similar texts from the preset database according to the similar semantic features.
In this embodiment, after the target semantic feature of the target text is obtained, a similarity search can be performed in the preset database: the training semantic features of the training texts in the preset database are obtained, and those whose difference from the target semantic feature is smaller than the first threshold are taken as similar semantic features. Training texts similar to the target text are then obtained from the preset database according to the similar semantic features, completing the similarity search for the target text.
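Steps S370 and S380 amount to a threshold test over the stored training semantic features. A sketch, using Euclidean distance as the "difference value" (an assumption; the patent does not name a metric) and a hypothetical in-memory database:

```python
def find_similar_texts(target_feature, training_features, threshold):
    # Keep every training text whose training semantic feature differs
    # from the target semantic feature by less than the first threshold.
    similar = []
    for text, feature in training_features.items():
        diff = sum((a - b) ** 2 for a, b in zip(target_feature, feature)) ** 0.5
        if diff < threshold:
            similar.append(text)
    return similar

database = {"doc A": [0.4, 0.6], "doc B": [0.9, 0.1]}  # hypothetical database
print(find_similar_texts([0.41, 0.59], database, 0.1))
```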
Preferably, after each target text completes the search, the target text and the corresponding target semantic features are stored in a preset database as training texts for subsequent similar searches.
Embodiment Three
As shown in fig. 4, the third embodiment of the present invention provides a similar text search system 100 based on semantics, where the similar text search system 100 based on semantics provided in the third embodiment of the present invention can execute the similar text search method based on semantics provided in any embodiment of the present invention, and has the corresponding functional modules and beneficial effects of the execution method. The semantic-based similar text search system 100 includes a text retrieval module 200, a text splitting module 300, a feature lookup module 400, a feature retrieval module 500, and a text search module 600.
Specifically, the text obtaining module 200 is configured to obtain a target text; the text splitting module 300 is configured to split the target text to obtain a plurality of first split texts; the feature searching module 400 is configured to find a first semantic feature of each of the first split texts in a semantic feature table generated based on a preset database; the feature acquisition module 500 is configured to acquire a target semantic feature of the target text, where the target semantic feature is an average value of first semantic features of a plurality of first split texts; the text search module 600 is configured to obtain, from the preset database, a similar text similar to the target text according to the target semantic feature.
In this embodiment, the semantic-based similar text search system 100 further includes an accuracy determination module 700 and a text training module 800. The neural network model is a Skip-Gram model based on Word2vec.
Specifically, the accuracy determining module 700 is configured to obtain an accuracy requirement input by a user; and determining a semantic radius according to the precision requirement. The feature searching module 400 is specifically configured to find, based on the semantic radius, a first semantic feature of each of the first split texts in a semantic feature table generated based on a preset database. The text training module 800 is configured to obtain training text in a preset database; splitting the training text to obtain a plurality of second split texts; inputting each second split text into a preset neural network model to obtain second semantic features of each second split text, wherein the second semantic features are matrixes formed by occurrence probabilities of the rest second split texts of the second split text within a preset semantic radius; acquiring training semantic features of the training text, wherein the training semantic features are average values of a plurality of second semantic features; and generating a semantic feature table according to the second semantic features. The text training module 800 is specifically configured to convert the second split text into a third split text based on one-hot encoding; and inputting each third split text into a preset neural network model to obtain a second semantic feature of each second split text.
Further, the text search module 600 is specifically configured to obtain, from the preset database, a similar text similar to the target text according to the target semantic feature and the training semantic feature. The text search module 600 is specifically further configured to obtain similar semantic features, where the similar semantic features are training semantic features with a difference value from the target semantic features being smaller than a first threshold; and obtaining similar texts from the preset database according to the similar semantic features.
Embodiment Four
Fig. 5 is a schematic structural diagram of a computer device 12 according to a fourth embodiment of the present invention. Fig. 5 illustrates a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present invention. The computer device 12 shown in fig. 5 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in FIG. 5, the computer device 12 is in the form of a general purpose computing device. Components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, a bus 18 that connects the various system components, including the system memory 28 and the processing units 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, commonly referred to as a "hard disk drive"). Although not shown in fig. 5, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods of the embodiments described herein.
The computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a user to interact with the computer device 12, and/or any devices (e.g., network card, modem, etc.) that enable the computer device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Moreover, computer device 12 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through network adapter 20. As shown, network adapter 20 communicates with other modules of computer device 12 via bus 18. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with computer device 12, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and performs data processing by running programs stored in the system memory 28, for example implementing the method provided by the embodiments of the present invention:
acquiring a target text;
splitting the target text to obtain a plurality of first split texts;
searching for first semantic features of each first split text in a semantic feature table generated based on a preset database;
acquiring target semantic features of the target text, wherein the target semantic features are average values of the first semantic features of the plurality of first split texts;
and obtaining, from the preset database according to the target semantic features, similar texts similar to the target text.
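Although the patent does not prescribe an implementation, the retrieval steps listed above (split, look up, average, compare) can be sketched as follows. All function and variable names here are hypothetical; whitespace splitting stands in for the unspecified splitting rule, and Euclidean distance stands in for the unspecified difference measure between semantic features:

```python
import numpy as np

def split_text(text):
    """Split the target text into first split texts.

    Whitespace tokenization is an assumption; the patent does not fix
    a concrete splitting rule.
    """
    return text.split()

def text_vector(text, feature_table):
    """Look up the first semantic feature of each split text in the
    semantic feature table and average them to obtain the target
    semantic feature."""
    feats = [feature_table[t] for t in split_text(text) if t in feature_table]
    return np.mean(feats, axis=0)

def find_similar(target_text, feature_table, database_vectors, threshold=1.0):
    """Return database texts whose stored (training) semantic feature
    differs from the target semantic feature by less than the threshold."""
    target = text_vector(target_text, feature_table)
    return [doc for doc, vec in database_vectors.items()
            if np.linalg.norm(vec - target) < threshold]
```

For example, with a two-word feature table and two database texts, `find_similar` returns only the text whose stored feature lies within the threshold of the averaged target feature.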
Embodiment Five
The fifth embodiment of the present application further provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method provided by the embodiments of the present application:
acquiring a target text;
splitting the target text to obtain a plurality of first split texts;
searching for first semantic features of each first split text in a semantic feature table generated based on a preset database;
acquiring target semantic features of the target text, wherein the target semantic features are average values of the first semantic features of the plurality of first split texts;
and obtaining, from the preset database according to the target semantic features, similar texts similar to the target text.
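The claims below additionally describe converting each split text into a one-hot coded form (the "third split texts") before it enters the preset neural network model. A minimal sketch of that conversion, with a hypothetical helper name and a sorted vocabulary ordering assumed:

```python
import numpy as np

def one_hot_encode(split_texts):
    """Map each distinct split text to a one-hot vector over the vocabulary.

    The sorted vocabulary ordering is an assumption; the patent only
    requires one-hot coding, not a particular index assignment.
    """
    vocab = sorted(set(split_texts))
    identity = np.eye(len(vocab))
    return {word: identity[i] for i, word in enumerate(vocab)}
```

Each resulting vector would then be fed to the neural network model to produce the second semantic features.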
The computer storage media of embodiments of the invention may take the form of any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above are only preferred embodiments of the present invention and the technical principles applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements, and substitutions can be made by those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to them and may include many other equivalent embodiments without departing from the spirit of the invention, the scope of which is determined by the scope of the appended claims.

Claims (9)

1. A semantic-based similar text search method, comprising:
acquiring training texts in a preset database;
splitting the training text to obtain a plurality of second split texts;
inputting each second split text into a preset neural network model to obtain second semantic features of each second split text, wherein each second semantic feature is a matrix formed by the occurrence probabilities of the remaining second split texts within a preset semantic radius of that second split text;
acquiring training semantic features of the training text, wherein the training semantic features are average values of the plurality of second semantic features;
generating a semantic feature table according to the second semantic features;
acquiring a target text, wherein the target text is a text for which a user needs to obtain similar texts; the target text comprises electronic books, web page news, journals, and patents;
splitting the target text to obtain a plurality of first split texts;
searching for first semantic features of each first split text in the semantic feature table generated based on the preset database;
acquiring target semantic features of the target text, wherein the target semantic features are average values of the first semantic features of the plurality of first split texts;
and obtaining, from the preset database according to the target semantic features, similar texts similar to the target text.
2. The method of claim 1, wherein the obtaining the target text comprises:
acquiring a precision requirement input by a user;
determining a semantic radius according to the precision requirement;
the searching for first semantic features of each first split text in the semantic feature table generated based on the preset database comprises:
searching for the first semantic features of each first split text in the semantic feature table generated based on the preset database, based on the semantic radius.
3. The method of claim 1, wherein said inputting each of the second split texts into a preset neural network model to obtain a second semantic feature of each of the second split texts comprises:
converting each second split text into a third split text based on one-hot coding;
and inputting each third split text into the preset neural network model to obtain the second semantic features of each second split text.
4. The method according to claim 1, wherein the obtaining, from the preset database according to the target semantic features, similar texts similar to the target text comprises:
obtaining, from the preset database, similar texts similar to the target text according to the target semantic features and the training semantic features.
5. The method of claim 4, wherein the obtaining, from the preset database, similar texts similar to the target text according to the target semantic features and the training semantic features comprises:
obtaining similar semantic features, wherein the similar semantic features are training semantic features whose difference from the target semantic features is smaller than a first threshold;
and obtaining similar texts from the preset database according to the similar semantic features.
6. The method of claim 1, wherein the neural network model is a Skip-Gram model based on Word2Vec.
7. A semantic-based similar text search system, comprising:
The text training module is used for acquiring a training text in a preset database; splitting the training text to obtain a plurality of second split texts; inputting each second split text into a preset neural network model to obtain second semantic features of each second split text, wherein each second semantic feature is a matrix formed by the occurrence probabilities of the remaining second split texts within a preset semantic radius of that second split text; acquiring training semantic features of the training text, wherein the training semantic features are average values of the plurality of second semantic features; and generating a semantic feature table according to the second semantic features;
the text acquisition module is used for acquiring a target text, wherein the target text is a text for which a user needs to obtain similar texts; the target text comprises electronic books, web page news, journals, and patents;
the text splitting module is used for splitting the target text to obtain a plurality of first split texts;
the feature searching module is used for searching for first semantic features of each first split text in the semantic feature table generated based on the preset database;
The feature acquisition module is used for acquiring target semantic features of the target text, wherein the target semantic features are average values of first semantic features of a plurality of first split texts;
and the text searching module is used for acquiring similar texts similar to the target text from the preset database according to the target semantic features.
8. A computer device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-6.
CN202010843746.4A 2020-08-20 2020-08-20 Semantic-based similar text searching method, system, device and storage medium Active CN111949765B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010843746.4A CN111949765B (en) 2020-08-20 2020-08-20 Semantic-based similar text searching method, system, device and storage medium

Publications (2)

Publication Number Publication Date
CN111949765A CN111949765A (en) 2020-11-17
CN111949765B true CN111949765B (en) 2024-06-14

Family

ID=73358906

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010843746.4A Active CN111949765B (en) 2020-08-20 2020-08-20 Semantic-based similar text searching method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN111949765B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113010072A (en) * 2021-04-27 2021-06-22 维沃移动通信(杭州)有限公司 Searching method and device, electronic equipment and readable storage medium

Citations (2)

Publication number Priority date Publication date Assignee Title
CN101930462A (en) * 2010-08-20 2010-12-29 华中科技大学 Comprehensive body similarity detection method
CN109740077A (en) * 2018-12-29 2019-05-10 北京百度网讯科技有限公司 Answer searching method, device and its relevant device based on semantic indexing

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN107491518B (en) * 2017-08-15 2020-08-04 北京百度网讯科技有限公司 Search recall method and device, server and storage medium


Also Published As

Publication number Publication date
CN111949765A (en) 2020-11-17

Similar Documents

Publication Publication Date Title
CN107992596B (en) Text clustering method, text clustering device, server and storage medium
US20210264109A1 (en) Stylistic Text Rewriting for a Target Author
CN114595333B (en) Semi-supervision method and device for public opinion text analysis
CN109614625B (en) Method, device and equipment for determining title text relevancy and storage medium
CN113495900B (en) Method and device for obtaining structured query language statement based on natural language
KR102254612B1 (en) method and device for retelling text, server and storage medium
CN110647614A (en) Intelligent question and answer method, device, medium and electronic equipment
CN112287069B (en) Information retrieval method and device based on voice semantics and computer equipment
CN113486178B (en) Text recognition model training method, text recognition method, device and medium
US20200012650A1 (en) Method and apparatus for determining response for user input data, and medium
CN113434636B (en) Semantic-based approximate text searching method, semantic-based approximate text searching device, computer equipment and medium
CN111291177A (en) Information processing method and device and computer storage medium
CN111597800B (en) Method, device, equipment and storage medium for obtaining synonyms
CN111259262A (en) Information retrieval method, device, equipment and medium
CN112818091A (en) Object query method, device, medium and equipment based on keyword extraction
CN111738791B (en) Text processing method, device, equipment and storage medium
CN110704608A (en) Text theme generation method and device and computer equipment
CN115392235A (en) Character matching method and device, electronic equipment and readable storage medium
CN111949765B (en) Semantic-based similar text searching method, system, device and storage medium
CN111191011B (en) Text label searching and matching method, device, equipment and storage medium
CN113139558A (en) Method and apparatus for determining a multi-level classification label for an article
CN114742062B (en) Text keyword extraction processing method and system
CN114970467B (en) Method, device, equipment and medium for generating composition manuscript based on artificial intelligence
CN111552780B (en) Medical scene search processing method and device, storage medium and electronic equipment
CN114065727A (en) Information duplication eliminating method, apparatus and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant