CN117076652B - Semantic text retrieval method, system and storage medium for middle phrases - Google Patents

Semantic text retrieval method, system and storage medium for middle phrases

Info

Publication number
CN117076652B
Authority
CN
China
Prior art keywords
sentence
word
words
semantic
expansion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311337759.4A
Other languages
Chinese (zh)
Other versions
CN117076652A (en
Inventor
申小维
田恩华
樊金斗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianqi Heima Information Technology Wuhan Co ltd
Tianqi Heima Information Technology Beijing Co ltd
Original Assignee
Tianqi Heima Information Technology Wuhan Co ltd
Tianqi Heima Information Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianqi Heima Information Technology Wuhan Co ltd, Tianqi Heima Information Technology Beijing Co ltd
Priority to CN202311337759.4A
Publication of CN117076652A
Application granted
Publication of CN117076652B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3325 Reformulation based on results of preceding query
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a semantic text retrieval method, system and storage medium for medium-short sentences. The method comprises the following steps. Step S1: extracting target words from the medium-short sentence text, performing a primary keyword retrieval using the target words, locating the first M results, and defining them as first results. Step S2: performing word expansion on the target words based on the first results to obtain expansion words for each target word, performing a secondary keyword retrieval using the target words and their corresponding expansion words, locating the first J results, and defining them as second results. Step S3: performing semantic retrieval based on the medium-short sentence text to obtain semantic retrieval results, and calculating a feature value for each semantic retrieval result based on a first formula. Step S4: sorting the semantic retrieval results by feature value to obtain the final retrieval results. This technical scheme addresses the low accuracy of semantic retrieval for medium-short sentence text in the prior art.

Description

Semantic text retrieval method, system and storage medium for middle phrases
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a semantic text retrieval method, a semantic text retrieval system and a storage medium for medium phrases.
Background
In the prior art there are intelligent patent similarity search methods based on semantic retrieval. Generally, the target text is segmented into words, the relevant keywords obtained from segmentation are expanded with synonyms, and retrieval is then performed using both the segmentation results of the relevant fields of the target text and the synonym expansion results. The retrieval results are sorted by their similarity to the target text, and the final ranking is returned to the user. For example, the application document with publication number CN111126074A discloses a semantic expansion method for a search request: a search request sent by a user is received and preprocessed to generate a keyword list; it is then judged whether the keywords in the list are vocabulary in a vocabulary model; if so, the user's search intention is judged from the search request; finally, a corresponding expansion policy is matched to the keyword list according to the search intention, and the keyword list is semantically expanded according to that policy to generate an expansion word set. As another example, the document with publication number CN104199965B discloses a semantic information retrieval method: a query submitted by the user is received, and the keywords it contains are obtained through word segmentation; query analysis is performed according to the semantic relations among the keywords, and the query is converted into a concept expression; the text to be retrieved is read from a storage medium in units of keywords; the text read is segmented into sentences and words; semantic analysis is performed on each sentence to obtain the concept category of the sentence and the concept expressions of its words; the semantic distance between the concept expression of the query and that of the text to be retrieved is calculated; and the query results are returned sorted from nearest to farthest by semantic distance.
For long text, the greater length means that more keywords can be extracted and more keyword expansion can be performed. For medium-short sentence text, however, few keywords can be extracted, so the keywords available for expansion are limited. As a result, when the text is short, semantic retrieval cannot produce sufficiently accurate results.
Disclosure of Invention
In order to solve the problems, the invention provides a semantic text retrieval method, a semantic text retrieval system and a storage medium for medium short sentences, which are used for solving the problem that the semantic retrieval of medium short sentence texts in the prior art has lower accuracy.
In order to achieve the above object, the present invention provides a semantic text retrieval method for a middle phrase, comprising:
step S1: extracting target words from the medium-short sentence text, performing a primary keyword retrieval using the target words, sorting the results of the primary keyword retrieval based on the number and frequency of occurrence of the target words, locating the first M results after sorting, and defining them as first results;
step S2: performing word expansion on the target words based on the first results to obtain expansion words for each target word, performing a secondary keyword retrieval using the target words and their corresponding expansion words, sorting the results of the secondary keyword retrieval, locating the first J results after sorting, and defining them as second results;
step S3: performing semantic retrieval based on the medium-short sentence text to obtain semantic retrieval results, and calculating the feature value of the i-th semantic retrieval result based on a first formula [formula not reproduced], whose terms are: the similarity between the i-th semantic retrieval result and the medium-short sentence text; the similarity between the i-th semantic retrieval result and the j-th second result; and the weight of the j-th second result;
step S4: and sorting the semantic search results based on the magnitude of the characteristic values to obtain final search results.
Further, the similarity between the middle phrase text and the semantic search result is calculated based on the following steps:
splitting the medium-short sentence text and the semantic retrieval result into a plurality of first sentences and second sentences respectively; setting a statistical value; extracting one first sentence and judging whether a second sentence is the same as the extracted first sentence, and if so, adding 1 to the statistical value; then extracting the next first sentence and repeating the judgment; and, after all the first sentences have been extracted and judged, calculating the similarity between the medium-short sentence text and the semantic retrieval result based on a second formula: $\text{similarity} = \frac{m}{M} \times 100\%$, where $m$ is the statistical value and $M$ is the number of first sentences.
Further, it is determined whether the first sentence is identical to the second sentence based on the steps of:
extracting all the target words included in the first sentence and generating as many data sets as there are target words, each data set containing one target word; for each data set, splitting the second sentence into a plurality of basic words according to the target word in that data set and placing the basic words in the data set; comparing the target word in the data set with each basic word to obtain a first matching degree between the target word and each basic word; and selecting the target word and basic word that correspond to the largest first matching degree, which are defined as a first matching word and a second matching word respectively;
in the first sentence, expanding from left to right with the first matching word as the starting point until the character immediately before another first matching word in the same sentence is reached, to obtain a first expansion word; in the second sentence, expanding from left to right with the second matching word as the starting point until the character immediately before another second matching word in the same sentence is reached, to obtain a second expansion word; and comparing the first expansion word with the second expansion word to obtain a second matching degree between them;
locating the associated word of the first expansion word in the first sentence and the associated word of the second expansion word in the second sentence; intercepting a first long sentence from the first sentence and a second long sentence from the second sentence, the first long sentence comprising the first expansion word and its associated word and the second long sentence comprising the second expansion word and its associated word; and calculating a third matching degree between the first long sentence and the second long sentence;
calculating fluctuation values of the first sentence and the second sentence based on the first matching degree, the second matching degree and the third matching degree, and judging that the first sentence is different from the second sentence if the fluctuation values are smaller than a first threshold value.
Further, calculating the fluctuation value of the first sentence and the second sentence includes the steps of:
calculating, for the k-th pair of first and second matching words in the first sentence and the second sentence, a first numerical value and a second numerical value based on a third formula and a fourth formula respectively [formulas not reproduced], whose terms are: the first matching degree of the k-th pair of first and second matching words; the second matching degree of the first and second expansion words generated from the k-th pair; and the third matching degree between the first and second long sentences generated from the k-th pair;
calculating a comparison value of the first sentence and the second sentence based on a fifth formula [formula not reproduced], where K is the number of target words in the first sentence.
Further, performing word expansion on the target word includes the following steps:
splitting the first results into a plurality of paragraphs and dividing the paragraphs into a plurality of first types, where paragraphs of the same first type include at least one identical target word; comparing the paragraphs of the same first type, and if a word appears in at least a first number of those paragraphs, defining that word as a first reference word of the target word;
re-dividing the paragraphs into a plurality of second types, where the target words included in paragraphs of the same second type are completely different; comparing the paragraphs of the same second type, and if a word appears in at least a second number of those paragraphs, defining that word as a second reference word;
removing the second reference words from the first reference words, setting the remaining first reference words as the expansion words of the target word, generating a search formula based on the target word and the expansion words, and performing the secondary keyword retrieval based on the search formula.
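The word-expansion steps can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the claimed implementation: paragraphs are assumed to be already tokenized, the "second type" grouping is simplified to "paragraphs that do not contain the target word", and the function name `expansion_words` and the thresholds `first_number` and `second_number` are hypothetical.

```python
from collections import Counter


def expansion_words(paragraphs, target, targets, first_number=2, second_number=2):
    """Return candidate expansion words for `target`.

    paragraphs: list of token lists taken from the first results.
    targets: all target words (never returned as expansion words).
    Assumption: the second-type group is approximated by the paragraphs
    that do not contain `target`.
    """
    with_target = [set(p) for p in paragraphs if target in p]
    without_target = [set(p) for p in paragraphs if target not in p]

    def frequent(groups, minimum):
        # Words that occur in at least `minimum` paragraphs of the group.
        counts = Counter(word for group in groups for word in group)
        return {word for word, n in counts.items() if n >= minimum}

    first_reference = frequent(with_target, first_number) - set(targets)
    second_reference = frequent(without_target, second_number)
    # Words that are also common elsewhere are treated as generic and dropped.
    return first_reference - second_reference


paragraphs = [
    ["data", "cipher", "key", "the"],
    ["data", "cipher", "the", "storage"],
    ["encryption", "key", "the", "hash"],
    ["encryption", "the", "hash"],
]
targets = {"data", "encryption"}
print(expansion_words(paragraphs, "data", targets))        # {'cipher'}
print(expansion_words(paragraphs, "encryption", targets))  # {'hash'}
```

In this toy input the word "the" is frequent both with and without each target word, so it is removed as a second reference word, which is the purpose of the elimination step.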
The invention also provides a semantic text retrieval system for medium-short sentences, which is used to implement the above semantic text retrieval method and comprises:
a primary retrieval module, used for extracting target words from the medium-short sentence text, performing a primary keyword retrieval using the target words, sorting the results of the primary keyword retrieval based on the number and frequency of occurrence of the target words, locating the first M results after sorting, and defining them as first results;
a secondary retrieval module, used for performing word expansion on the target words based on the first results to obtain expansion words for each target word, performing a secondary keyword retrieval using the target words and their corresponding expansion words, sorting the results of the secondary keyword retrieval, locating the first J results after sorting, and defining them as second results;
a semantic retrieval module, used for performing semantic retrieval based on the medium-short sentence text to obtain semantic retrieval results, and calculating the feature value of the i-th semantic retrieval result based on a first formula [formula not reproduced], whose terms are: the similarity between the i-th semantic retrieval result and the medium-short sentence text; the similarity between the i-th semantic retrieval result and the j-th second result; and the weight of the j-th second result; the semantic retrieval module further sorts the semantic retrieval results by feature value to obtain the final retrieval results.
The invention also provides a computer storage medium which stores program instructions, wherein the program instructions control equipment where the computer storage medium is located to execute the semantic text retrieval method for the middle phrases when running.
Compared with the prior art, the invention has the following beneficial effects:
The method first extracts the target words from the medium-short sentence text and performs a primary keyword retrieval with them, obtaining a number of retrieval results for the target words, and takes the highest-ranked of these as the first results. It then expands the target words based on the first results to obtain expansion words related to the target words, and retrieves again using both the target words and the expansion words, so that more retrieval results, and results more relevant to the medium-short sentence text, can be obtained.
After the secondary keyword retrieval, the highest-ranked results are selected as the second results. Semantic retrieval is then performed based on the medium-short sentence text to obtain semantic retrieval results that are semantically close to it, and the feature value of each semantic retrieval result is calculated with the first formula. In the first formula, the weight of each second result is set according to its rank in the secondary keyword retrieval. Because the second results rank near the top, they are relatively close to the content of the medium-short sentence text. A semantic retrieval result with a large feature value is therefore either close to most of the second results, and hence representative, or very close in meaning to the medium-short sentence text. In either case, ranking it earlier lets the user obtain the desired retrieval results more quickly.
Drawings
FIG. 1 is a flow chart of steps of a semantic text retrieval method for a mid-phrase of the present invention;
FIG. 2 is a schematic diagram of a system architecture for semantic text retrieval of medium phrases according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
It will be understood that the terms "first," "second," and the like, as used herein, may be used to describe various elements, but these elements are not limited by these terms unless otherwise specified. These terms are only used to distinguish one element from another element. For example, a first xx script may be referred to as a second xx script, and similarly, a second xx script may be referred to as a first xx script, without departing from the scope of the present application.
As shown in fig. 1, a semantic text retrieval method for a middle phrase includes:
step S1: extracting target words in the short sentence text, performing primary keyword retrieval by using the target words, sorting the results of the primary keyword retrieval based on the number and frequency of the target words, and locating the first M results after sorting to define a first result.
Specifically, after the user inputs a medium-short sentence text, the text is segmented to obtain a number of independent words. Each word is then identified to find the content words among them, that is, words with a definite lexical meaning such as nouns, verbs and adjectives. Many methods exist for text segmentation and content-word identification, such as string-matching algorithms and machine-learning-based segmentation algorithms; these are prior art and are not described further here. For example, if the target words obtained are "data" and "encryption algorithm", a primary retrieval is performed with the search formula (data AND encryption algorithm) and the retrieved results are ranked. Ranking can first prioritize results by the number of distinct target words they contain, and then order results containing the same number of target words by how often those words occur. For example, results that contain both "data" and "encryption algorithm" are ranked first, and among those, the results in which "data" and "encryption algorithm" occur most frequently are ranked highest. After ranking, the first M retrieval results are taken as the first results; in this embodiment M is set to 20.
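The two-level ranking described for step S1 can be sketched as follows. This is a minimal illustration, assuming that the retrieval results are plain strings and that occurrences are counted by naive case-insensitive substring matching; the function name `rank_primary` is hypothetical.

```python
def rank_primary(results, target_words, top_m=20):
    """Rank results by (distinct target words present, total occurrences)."""

    def sort_key(text):
        lowered = text.lower()
        # Naive substring counting; a real system would count segmented tokens.
        counts = [lowered.count(word.lower()) for word in target_words]
        distinct = sum(1 for c in counts if c > 0)
        return (distinct, sum(counts))

    return sorted(results, key=sort_key, reverse=True)[:top_m]


results = [
    "data only here",
    "data encryption algorithm data",
    "encryption algorithm and data",
    "nothing relevant",
]
first_results = rank_primary(results, ["data", "encryption algorithm"], top_m=3)
print(first_results)
```

Results containing both target words come first, ties are broken by total frequency, and only the first `top_m` entries are kept as first results.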
Step S2: and carrying out word expansion on the target words based on the first results, obtaining expansion words of each target word, carrying out secondary keyword retrieval by using the target words and the corresponding expansion words, sorting the results of the secondary keyword retrieval, positioning the first J results after sorting, and defining the first J results as second results.
After the first results are obtained, the expansion words of each target word are derived from them. A search formula is then generated from the target words and their corresponding expansion words, and the secondary keyword retrieval is performed with this search formula. An expansion word may be a hypernym or hyponym of the target word, or a word associated with the target word; the specific method for obtaining expansion words and the specific way of generating the search formula are described later. After the retrieval results are obtained, they are ranked by the frequency of occurrence of the target words and expansion words, and the first J results after ranking are located and defined as the second results; in this embodiment J is set to 25.
Step S3: performing semantic retrieval based on the medium-short sentence text to obtain semantic retrieval results, and calculating the feature value of the i-th semantic retrieval result based on a first formula [formula not reproduced], whose terms are: the similarity between the i-th semantic retrieval result and the medium-short sentence text; the similarity between the i-th semantic retrieval result and the j-th second result; and the weight of the j-th second result.
Step S4: and sorting semantic search results based on the magnitude of the characteristic values to obtain final search results.
After the second results are obtained, semantic retrieval is first performed based on the medium-short sentence text. In this embodiment, the similarity between the medium-short sentence text and a retrieval result is determined from the number of identical sentences they contain; the specific calculation is described later. After the semantic retrieval results are obtained, each is compared with the second results to obtain its similarity to each second result, and finally the feature value of each semantic retrieval result is calculated with the first formula. In the first formula, the weight of a second result depends on its rank in the secondary keyword retrieval results and lies in the range 0 to 0.5; for example, the second result ranked 1 has a weight of 0.5 and the second result ranked 2 has a weight of 0.4. After the feature values have been calculated, the semantic retrieval results are sorted by feature value, with larger feature values ranked earlier.
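Steps S3 and S4 can be sketched in Python as follows. The exact form of the first formula is not available in this text, so the sketch assumes one plausible combination: the similarity to the input text plus the weighted sum of the similarities to the second results. That combination, and the names `feature_value` and `rank_semantic_results`, are assumptions made for illustration only.

```python
def feature_value(sim_to_text, sims_to_second, weights):
    """Assumed form: sim_to_text + sum_j(weight_j * sim_to_second_j)."""
    if len(sims_to_second) != len(weights):
        raise ValueError("one weight is required per second result")
    return sim_to_text + sum(w * s for w, s in zip(weights, sims_to_second))


def rank_semantic_results(results, weights):
    """results: list of (result_id, sim_to_text, sims_to_second) tuples."""
    scored = [(rid, feature_value(st, ss, weights)) for rid, st, ss in results]
    # Larger feature values are ranked earlier (step S4).
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Weights follow the embodiment's example: rank 1 -> 0.5, rank 2 -> 0.4.
weights = [0.5, 0.4]
results = [
    ("a", 0.5, [1.0, 0.0]),
    ("b", 0.9, [0.0, 0.0]),
    ("c", 0.2, [1.0, 1.0]),
]
ranking = rank_semantic_results(results, weights)
print([rid for rid, _ in ranking])  # ['c', 'a', 'b']
```

Under this assumed form, result "c" ranks first even though its direct similarity to the input text is low, because it is close to both second results; this mirrors the "representative" case discussed in the text.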
The method first extracts the target words from the medium-short sentence text and performs a primary keyword retrieval with them, obtaining a number of retrieval results for the target words, and takes the highest-ranked of these as the first results. It then expands the target words based on the first results to obtain expansion words related to the target words, and retrieves again using both the target words and the expansion words, so that more retrieval results, and results more relevant to the medium-short sentence text, can be obtained.
After the secondary keyword retrieval, the highest-ranked results are selected as the second results. Semantic retrieval is then performed based on the medium-short sentence text to obtain semantic retrieval results that are semantically close to it, and the feature value of each semantic retrieval result is calculated with the first formula. In the first formula, the weight of each second result is set according to its rank in the secondary keyword retrieval. Because the second results rank near the top, they are relatively close to the content of the medium-short sentence text. A semantic retrieval result with a large feature value is therefore either close to most of the second results, and hence representative, or very close in meaning to the medium-short sentence text. In either case, ranking it earlier lets the user obtain the desired retrieval results more quickly.
In particular, this technical scheme addresses the low accuracy of semantic retrieval for medium-short sentence text in the prior art.
In this embodiment, the similarity between the phrase text and the semantic search result is calculated based on the following steps:
splitting the medium-short sentence text and the semantic retrieval result into a plurality of first sentences and second sentences respectively; setting a statistical value; extracting one first sentence and judging whether a second sentence is the same as the extracted first sentence, and if so, adding 1 to the statistical value; then extracting the next first sentence and repeating the judgment; and, after all the first sentences have been extracted and judged, calculating the similarity between the medium-short sentence text and the semantic retrieval result based on a second formula: $\text{similarity} = \frac{m}{M} \times 100\%$, where $m$ is the statistical value and $M$ is the number of first sentences.
Specifically, take the similarity between the medium-short sentence text and one semantic retrieval result as an example. The medium-short sentence text is first split into several first sentences, defined as first sentence 1 and first sentence 2, and the semantic retrieval result to be compared is split into several second sentences, defined as second sentence 1, second sentence 2 and second sentence 3. First sentence 1 is compared with second sentences 1-3 in turn to judge whether any second sentence is the same as first sentence 1; the method of judgment is described later. If first sentence 1 is the same as second sentence 1, the statistical value is increased by 1. First sentence 2 is then compared with second sentences 1-3 in turn; if first sentence 2 is the same as second sentence 2, the statistical value is again increased by 1 and becomes 2. Finally, the similarity is calculated with the second formula: 2/2 × 100% = 100%, so the similarity between the medium-short sentence text and the semantic retrieval result is 100%.
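The counting procedure and the second formula can be sketched in a few lines of Python. The "same sentence" judgment is passed in as a function, because the embodiment defines it separately; plain string equality is used here only as a stand-in, and the name `sentence_similarity` is hypothetical.

```python
def sentence_similarity(first_sentences, second_sentences, is_same):
    """Return m / M * 100, where m counts first sentences that have a match."""
    if not first_sentences:
        return 0.0
    m = sum(
        1
        for first in first_sentences
        if any(is_same(first, second) for second in second_sentences)
    )
    return m / len(first_sentences) * 100.0


def exact(a, b):
    # Stand-in predicate; the embodiment uses a matching-degree test instead.
    return a == b


print(sentence_similarity(["s1", "s2"], ["s1", "s2", "s3"], exact))  # 100.0
print(sentence_similarity(["s1", "sx"], ["s1", "s2", "s3"], exact))  # 50.0
```

The first call reproduces the worked example above: two first sentences, both matched, giving 2/2 × 100% = 100%.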
In the present embodiment, it is determined whether the first sentence is identical to the second sentence based on the following steps:
extracting all target words included in the first sentence, generating data sets with the same number as the target words, correspondingly splitting the second sentence into a plurality of basic words based on the target words included in each data set, dividing the basic words into corresponding data sets, comparing the target words in the data sets with each basic word, acquiring the first matching degree of the target words and each basic word, selecting the target words and the basic words corresponding to the maximum first matching degree from the first matching degree, and defining the target words and the basic words as the first matching words and the second matching words respectively.
Specifically, first sentence 1 is extracted from the medium-short sentence text. Because the target words of the medium-short sentence text have already been identified, the target words included in first sentence 1 can be extracted directly. For example, if target word 1 and target word 2 are extracted from first sentence 1, data set 1 and data set 2 are generated, with data set 1 containing target word 1 and data set 2 containing target word 2. Second sentence 1 is then extracted from the semantic retrieval result and split into a plurality of basic words based on target word 1 in data set 1. Suppose second sentence 1 is "the weather is good today" and target word 1 contains two characters; second sentence 1 is then split into five two-character basic words. Target word 1 is compared with each of the five basic words to obtain the matching degree between them: specifically, word vectors of the two words being compared are obtained and the cosine value between the vectors is calculated. The cosine value lies in the range 0 to 1, and the closer it is to 1, the more similar the meanings of the two words; this is prior art and is not described further here. After the comparison is complete, the basic word with the largest cosine value relative to target word 1 is extracted from the five basic words and set as the second matching word, the corresponding target word 1 is set as the first matching word, and the cosine value between the two is the first matching degree. Target word 2 in data set 2 is compared in the same way.
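The first matching degree can be illustrated with the sketch below. It assumes that the second sentence is split with a sliding window whose length equals the target word's length (which yields five basic words for a six-character sentence, consistent with the example), and that word vectors are supplied by the caller in a dictionary. The names `split_basic_words`, `cosine` and `best_match` and the toy vectors are hypothetical.

```python
import math


def split_basic_words(sentence, length):
    """Sliding-window split into overlapping basic words of `length` characters."""
    return [sentence[i:i + length] for i in range(len(sentence) - length + 1)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_match(target, sentence, vectors):
    """Return (first matching degree, second matching word) for `target`."""
    candidates = split_basic_words(sentence, len(target))
    scored = [
        (cosine(vectors[target], vectors[word]), word)
        for word in candidates
        if word in vectors  # basic words without a vector are skipped
    ]
    return max(scored) if scored else (0.0, None)


# Toy vectors standing in for a real word-embedding model.
vectors = {"xy": [1.0, 0.0], "ab": [0.0, 1.0], "cd": [1.0, 0.1]}
print(split_basic_words("abcdef", 2))  # ['ab', 'bc', 'cd', 'de', 'ef']
degree, word = best_match("xy", "abcdef", vectors)
print(word)  # cd
```

Here "cd" becomes the second matching word because its vector has the largest cosine value with the vector of the target word "xy".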
In the first sentence, expansion proceeds from left to right starting from the first matching word until the character preceding another first matching word in the same sentence is reached, yielding the first expansion word; in the second sentence, expansion proceeds from left to right starting from the second matching word until the character preceding another second matching word in the same sentence is reached, yielding the second expansion word; the first expansion word and the second expansion word are then compared to obtain their second matching degree.
Specifically, suppose first sentence 1 is "good day" and its first matching words are "good" and "day". Expansion proceeds from left to right starting from "good" and stops on reaching the character immediately preceding "day"; the text covered up to that point is the first expansion word. Similarly, if the second matching word in second sentence 1 that corresponds to "good" is "good" and there is no character to its right, expansion from left to right yields "good" itself as the second expansion word. The cosine value between the word vectors of the first expansion word and the second expansion word, i.e. the second matching degree, is then calculated.
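The left-to-right expansion can be sketched as a simple string operation. The function name `expand` and the use of the first occurrence of each matching word are assumptions made for illustration; the text does not specify how repeated occurrences are handled.

```python
from typing import Iterable


def expand(sentence: str, match: str, other_matches: Iterable[str]) -> str:
    """Expand `match` rightwards until just before the next matching word.

    Returns the substring that starts at `match` and ends at the character
    preceding the nearest other matching word to its right, or at the end
    of the sentence when no other matching word follows.
    """
    start = sentence.find(match)
    if start == -1:
        raise ValueError(f"{match!r} does not occur in the sentence")
    end = len(sentence)
    for other in other_matches:
        pos = sentence.find(other, start + len(match))
        if pos != -1:
            end = min(end, pos)
    return sentence[start:end]
```

For instance, with matching words "AB" and "CD" in the string "ABxCDy", expanding from "AB" stops before "CD", while expanding from "CD" runs to the end of the string.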
Locating the associated word of the first expansion word in the first sentence and the associated word of the second expansion word in the second sentence, and intercepting a first long sentence and a second long sentence from the first sentence and the second sentence respectively, wherein the first long sentence comprises the first expansion word and its associated word and the second long sentence comprises the second expansion word and its associated word; and calculating a third matching degree between the first long sentence and the second long sentence;
Specifically, if the first expansion word is the subject part, the associated word is the object. For example, in "the first component drives the second component to rotate", the first expansion word is "the first component drives", the corresponding associated word is "the second component", and the generated first long sentence is "the first component drives the second component". The associated word of the second expansion word is obtained in the same way and the second long sentence is generated. The cosine value between the vectors of the first long sentence and the second long sentence, i.e. the third matching degree, is then calculated. In particular, the vector of "the first component drives the second component" can be generated from the word vectors of "the first component drives" and "the second component"; ways of merging vectors are well known in the art and are not repeated here.
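The text leaves the vector-merging method open. One common choice, assumed here purely for illustration, is to average the component vectors element-wise and then take the cosine of the two merged vectors; the function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def merge_mean(vectors: Sequence[Vector]) -> List[float]:
    """Merge equal-length vectors by element-wise averaging (assumed method)."""
    if not vectors:
        raise ValueError("at least one vector is required")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def third_matching_degree(first_parts: Sequence[Vector],
                          second_parts: Sequence[Vector]) -> float:
    """Cosine between the merged vectors of the two long sentences."""
    return cosine(merge_mean(first_parts), merge_mean(second_parts))
```

Here `first_parts` would hold the vectors of the first expansion word and its associated word, and `second_parts` those of the second expansion word and its associated word.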
Calculating a fluctuation value of the first sentence and the second sentence based on the first matching degree, the second matching degree and the third matching degree, and judging that the first sentence is different from the second sentence if the fluctuation value is smaller than a first threshold value.
Whether the first sentence and the second sentence are the same can be judged in a number of ways based on the first matching degree, the second matching degree and the third matching degree. For example, a fixed value may be set, and a first long sentence and a second long sentence are determined to be the same only when the first, second and third matching degrees between them are all larger than that value; the first sentence and the second sentence are then judged to be the same if all of their long sentences are the same. This approach is not adopted in this embodiment, however, because the first similarity of two words may be high while the second and third similarities obtained after expansion are low, yet the average value obtained from them may still be high, so that the first long sentence and the second long sentence may be wrongly determined to be the same, as happens with sentences containing a polysemous word.
In this embodiment, therefore, calculating the fluctuation value of the first sentence and the second sentence includes the following steps:
calculating a first value $a_k$ and a second value $b_k$ of the k-th pair of first matching word and second matching word in the first sentence and the second sentence based on a third formula and a fourth formula respectively. The third formula is: $a_k = p_k^{(2)} - p_k^{(1)}$. The fourth formula is: $b_k = p_k^{(3)} - p_k^{(2)}$. Here $p_k^{(1)}$ is the first matching degree of the k-th pair of first matching word and second matching word, $p_k^{(2)}$ is the second matching degree of the first expansion word and the second expansion word generated from the k-th pair, and $p_k^{(3)}$ is the third matching degree between the first long sentence and the second long sentence generated from the k-th pair.
For convenience of description, assume that only one pair of first matching word and second matching word exists in the first sentence and the second sentence, that the first similarity of the first matching word and the second matching word is 0.75, that the second similarity of the first expansion word and the second expansion word generated from them is 0.6, and that the third similarity of the first long sentence and the second long sentence generated from the expansion words is 0.5. The first value of this pair is then 0.6 - 0.75 = -0.15, and its second value is 0.5 - 0.6 = -0.1.
Calculating the fluctuation value $B$ of the first sentence and the second sentence based on a fifth formula. The fifth formula is: $B = \frac{1}{K}\sum_{k=1}^{K}\left[p_k^{(1)} - \left(|a_k| + |b_k|\right)\right]$, where K is the number of target words in the first sentence.
The fluctuation value of the first sentence and the second sentence is then calculated by the fifth formula. In this embodiment, the value corresponding to the first long sentence and the second long sentence is first calculated by $p_k^{(1)} - (|a_k| + |b_k|)$; here the result is 0.75 - (0.15 + 0.1) = 0.5. If the first threshold is set to 0.9, the first sentence is judged to be different from the second sentence. Since this example contains only one pair of matching words, K = 1 and the value of the first long sentence and the second long sentence is the fluctuation value of the first sentence and the second sentence. The principle of the fifth formula is as follows. If the first similarity between the first matching word and the second matching word is high and the first expansion word and the second expansion word generated from them are also highly similar, the magnitude of the first value is small, indicating that the semantics remain close after expansion; likewise, a small magnitude of the second value indicates that the semantics of the first long sentence and the second long sentence generated from the expansion words are also close. When both changes are small, $|a_k| + |b_k|$ is small and has little effect on the first similarity. In addition, because words are short, two words with a high similarity can basically be taken to express the same meaning. It follows that if the similarity does not change much as similar words are expanded into sentences, the expanded sentences are similar, whereas if the similarity changes greatly even once during expansion, the two sentences are very likely to be dissimilar.
In particular, if a plurality of first matching words and second matching words exist in the first sentence and the second sentence, the two sentences contain a plurality of first long sentences and second long sentences. The similarity of the first sentence and the second sentence is then determined by calculating the value for each pair of corresponding long sentences and taking the average, which avoids misjudgments caused by polysemous words.
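The fluctuation-value calculation can be sketched as below. The sketch reads the fifth formula as the first matching degree minus the absolute values of the first and second values, averaged over all matching pairs; this reading is an assumption that reproduces the worked example 0.75 - (0.15 + 0.1) = 0.5. The function names, and treating "not smaller than the threshold" as "the same", are likewise assumptions.

```python
from typing import Sequence, Tuple

# (first, second, third) matching degrees of one pair of matching words
Degrees = Tuple[float, float, float]


def pair_value(p1: float, p2: float, p3: float) -> float:
    """Value of one long-sentence pair: p1 minus the two absolute changes."""
    first_value = p2 - p1   # third formula
    second_value = p3 - p2  # fourth formula
    return p1 - (abs(first_value) + abs(second_value))


def fluctuation_value(pairs: Sequence[Degrees]) -> float:
    """Average of the pair values over all K matching pairs (fifth formula)."""
    if not pairs:
        raise ValueError("at least one matching pair is required")
    return sum(pair_value(*degrees) for degrees in pairs) / len(pairs)


def is_same_sentence(pairs: Sequence[Degrees], first_threshold: float) -> bool:
    """Sentences are judged different when the value is below the threshold."""
    return fluctuation_value(pairs) >= first_threshold
```

Calling `fluctuation_value([(0.75, 0.6, 0.5)])` gives approximately 0.5 (up to floating-point rounding), so with a first threshold of 0.9 the two sentences are judged different, as in the example.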
In this embodiment, performing word expansion on the target word includes the following steps:
splitting the first result into a plurality of paragraphs, dividing the paragraphs into a plurality of first types, wherein the paragraphs of the same first type comprise at least one identical target word, comparing the paragraphs of the same first type, and defining the word as a first reference word of the target word if the word exists in at least a first number of the paragraphs.
Specifically, all the obtained first results are split into paragraphs, and the paragraphs are then divided into a plurality of first types. For example, 7 paragraphs are divided into first type 1, first type 2 and first type 3, where first type 1 contains paragraph 1, paragraph 2 and paragraph 3, first type 2 contains paragraph 4 and paragraph 5, and first type 3 contains paragraph 6 and paragraph 7. Paragraphs 1, 2 and 3 are placed in the same type because they share the target word "sliding connection"; the other types are divided on the same principle. Paragraphs 1, 2 and 3 are then compared to obtain the words they contain. For example, if all three paragraphs include "runner", "spring", "bolt" and "thread" and the first number is 2, then "runner", "spring", "bolt" and "thread" are set as first reference words of "sliding connection", because each of them appears in three paragraphs.
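Selecting reference words within one type of paragraphs amounts to counting, for each word, the number of paragraphs that contain it. The sketch below assumes the paragraphs have already been tokenized into word lists; the function name `reference_words` and the sample words other than those in the example above are hypothetical.

```python
from collections import Counter
from typing import Iterable, Sequence, Set


def reference_words(paragraphs: Iterable[Sequence[str]],
                    min_paragraphs: int) -> Set[str]:
    """Words that occur in at least `min_paragraphs` of the given paragraphs.

    Each paragraph is a sequence of already-tokenized words; a word is
    counted at most once per paragraph.
    """
    counts: Counter = Counter()
    for words in paragraphs:
        counts.update(set(words))
    return {word for word, count in counts.items() if count >= min_paragraphs}
```

The same helper can be applied to paragraphs of one first type (with the first number) or of one second type (with the second number).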
The paragraphs are re-divided into a plurality of second types, wherein the target words included in the paragraphs of the same second type are completely different; the paragraphs of the same second type are compared, and if a word appears in at least a second number of the paragraphs, the word is defined as a second reference word.
Similarly, continuing the example above, the 7 paragraphs are divided into second type 1, second type 2 and second type 3, where second type 1 contains paragraph 1 and paragraph 4, second type 2 contains paragraph 2 and paragraph 5, and second type 3 contains paragraph 3, paragraph 6 and paragraph 7. Paragraph 1 contains the target word "sliding connection" and paragraph 4 contains the target word "fixed connection"; the two paragraphs share no target word, so paragraph 1 and paragraph 4 are placed in the same second type 1. The other types are divided on the same principle. Paragraph 1 and paragraph 4 are then compared to obtain the words they contain. If both paragraphs include "driving" and "moving" and the second number is set to 1, then, since the two words each appear twice, "driving" and "moving" are set as second reference words.
And eliminating the second reference word from the first reference word, setting the reserved first reference word as an expansion word of the target word, generating a search formula based on the target word and the expansion word, and performing secondary keyword search based on the search formula.
If the first reference words of a target word include "driving", "driving" is removed. The principle of this process is as follows: if two paragraphs share the same target word, the words common to them are likely to describe or extend that word, so they are set as expansion words of the target word. However, if such a word also appears in paragraphs that include different target words, it is a more general word; using it in the search would lengthen the search formula without yielding more accurate search results, so it is eliminated. In addition, this embodiment also expands the hyponyms of the target word. The hyponyms of a target word can be obtained from a preset hyponym library; for example, a hyponym of "slot" is "blind slot", obtained from a specific hyponym library. The blind slot is then expanded to obtain the expansion words of "blind slot". Finally, this embodiment also automatically generates a search formula in the format (target word A OR expansion word A1 OR ... OR expansion word AN) AND (target word B OR expansion word B1 OR ... OR expansion word BN) AND (hyponym Z OR expansion word Z1 OR ... OR expansion word ZN), where N is a positive integer greater than 0, and the search is performed automatically based on the search formula once it has been generated.
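The elimination step and the search-formula format can be sketched as follows. The function names and the sample words "groove" and "blind slot" groupings are illustrative assumptions; only the (X OR X1 OR ... OR XN) AND (...) layout is taken from the text, and no particular search engine's query syntax is implied.

```python
from typing import Iterable, List, Sequence, Tuple


def expansion_words(first_refs: Iterable[str],
                    second_refs: Iterable[str]) -> List[str]:
    """First reference words that are not also second reference words."""
    return sorted(set(first_refs) - set(second_refs))


def build_search_formula(groups: Sequence[Tuple[str, Sequence[str]]]) -> str:
    """Join each head word with its expansion words by OR, and groups by AND.

    Each group is (head word, expansion words), where the head word is a
    target word or one of its hyponyms.
    """
    clauses = ["(" + " OR ".join([head, *expansions]) + ")"
               for head, expansions in groups]
    return " AND ".join(clauses)
```

A usage example under these assumptions:

```python
kept = expansion_words({"runner", "spring", "driving"}, {"driving", "moving"})
formula = build_search_formula([("sliding connection", kept),
                                ("blind slot", ["groove"])])
# formula == "(sliding connection OR runner OR spring) AND (blind slot OR groove)"
```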
As shown in fig. 2, the present invention further provides a semantic text retrieval system for a middle phrase, where the system is configured to implement a semantic text retrieval method for a middle phrase, and the system includes:
the primary retrieval module is used for extracting target words from the mid-phrase text, performing primary keyword retrieval using the target words, sorting the results of the primary keyword retrieval based on the number and frequency of occurrence of the target words, locating the first M results after sorting, and defining them as first results;
the secondary retrieval module is used for performing word expansion on the target words based on the first results to obtain the expansion words of each target word, performing secondary keyword retrieval using the target words and the corresponding expansion words, sorting the results of the secondary keyword retrieval, locating the first J results after sorting, and defining them as second results;
the semantic retrieval module is used for performing semantic retrieval based on the mid-phrase text to obtain semantic retrieval results, and calculating a characteristic value of the i-th semantic retrieval result based on a first formula (the formula image is not reproduced in this text), the terms of the first formula being: the similarity between the i-th semantic retrieval result and the mid-phrase text; the similarity between the i-th semantic retrieval result and the j-th second result; and the weight of the j-th second result. The semantic retrieval module sorts the semantic retrieval results based on the magnitude of the characteristic value to obtain the final retrieval result.
The invention also provides a computer storage medium storing program instructions, wherein, when the program instructions run, the device in which the computer storage medium is located is controlled to execute the semantic text retrieval method for mid-phrases.
It should be understood that, although the steps in the flowcharts of the embodiments of the present invention are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the various embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times; nor are these sub-steps or stages necessarily performed in sequence, but they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program, which may be stored on a non-transitory computer readable storage medium and which, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the foregoing embodiments may be arbitrarily combined, and for brevity, all of the possible combinations of the technical features of the foregoing embodiments are not described, however, they should be considered as the scope of the disclosure as long as there is no contradiction between the combinations of the technical features.
The foregoing examples illustrate only a few embodiments of the invention and are described in detail herein without thereby limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (6)

1. A semantic text retrieval method for a mid-phrase, comprising:
Step S1: extracting target words from a mid-phrase text, performing primary keyword retrieval using the target words, sorting the results of the primary keyword retrieval based on the number and frequency of occurrence of the target words, locating the first M results after sorting, and defining them as first results;
Step S2: performing word expansion on the target words based on the first results to obtain expansion words of each target word, performing secondary keyword retrieval using the target words and the corresponding expansion words, sorting the results of the secondary keyword retrieval, locating the first J results after sorting, and defining them as second results;
Step S3: performing semantic retrieval based on the mid-phrase text to obtain semantic retrieval results, and calculating a characteristic value of the i-th semantic retrieval result based on a first formula (the formula image is not reproduced in this text), the terms of the first formula being: the similarity between the i-th semantic retrieval result and the mid-phrase text; the similarity between the i-th semantic retrieval result and the j-th second result; and the weight of the j-th second result;
Step S4: sorting the semantic retrieval results based on the magnitude of the characteristic values to obtain a final retrieval result;
calculating the similarity between the mid-phrase text and the semantic retrieval result based on the following steps:
splitting the mid-phrase text and the semantic retrieval result into a plurality of first sentences and a plurality of second sentences respectively, setting a statistical value, extracting one first sentence, adding 1 to the statistical value if there is a second sentence that is the same as the extracted first sentence, then extracting the next first sentence and continuing the judgment, and, after all the first sentences have been extracted and judged, calculating the similarity $S$ between the mid-phrase text and the semantic retrieval result based on a second formula. The second formula is: $S = \frac{m}{M}$, where $m$ is the statistical value and $M$ is the number of the first sentences.
2. The semantic text retrieval method for a mid-sentence according to claim 1, wherein the determination of whether the first sentence is identical to the second sentence is based on the steps of:
extracting all the target words included in the first sentence, generating as many data sets as there are target words, correspondingly splitting the second sentence into a plurality of basic words based on the target word included in each data set, dividing the basic words into the corresponding data sets, comparing the target word in each data set with each basic word, acquiring a first matching degree of the target word and each basic word, and selecting the target word and the basic word corresponding to the maximum first matching degree, which are respectively defined as a first matching word and a second matching word;
the first matching word is used as a starting point to be expanded from left to right in the first sentence until the previous characters of other first matching words in the same sentence are expanded, a first expansion word is obtained, the second matching word is used as a starting point to be expanded from left to right in the second sentence until the previous characters of other second matching words in the same sentence are expanded, a second expansion word is obtained, and the first expansion word and the second expansion word are compared to obtain second matching degree of the first expansion word and the second expansion word;
locating the associated word of the first expansion word in the first sentence and the associated word of the second expansion word in the second sentence, and intercepting a first long sentence and a second long sentence from the first sentence and the second sentence respectively, wherein the first long sentence comprises the first expansion word and the corresponding associated word, and the second long sentence comprises the second expansion word and the corresponding associated word, and calculating a third matching degree between the first long sentence and the second long sentence;
calculating fluctuation values of the first sentence and the second sentence based on the first matching degree, the second matching degree and the third matching degree, and judging that the first sentence is different from the second sentence if the fluctuation values are smaller than a first threshold value.
3. A semantic text retrieval method for a mid-phrase according to claim 2, wherein calculating the fluctuation value of the first sentence and the second sentence comprises the steps of:
calculating a first value $a_k$ and a second value $b_k$ of the k-th pair of the first matching word and the second matching word in the first sentence and the second sentence based on a third formula and a fourth formula respectively, the third formula being: $a_k = p_k^{(2)} - p_k^{(1)}$, and the fourth formula being: $b_k = p_k^{(3)} - p_k^{(2)}$, wherein $p_k^{(1)}$ is the first matching degree of the k-th pair of the first matching word and the second matching word, $p_k^{(2)}$ is the second matching degree of the first expansion word and the second expansion word generated from the k-th pair of the first matching word and the second matching word, and $p_k^{(3)}$ is the third matching degree between the first long sentence and the second long sentence generated from the k-th pair of the first matching word and the second matching word;
calculating the fluctuation value $B$ of the first sentence and the second sentence based on a fifth formula, the fifth formula being: $B = \frac{1}{K}\sum_{k=1}^{K}\left[p_k^{(1)} - \left(|a_k| + |b_k|\right)\right]$, wherein K is the number of the target words in the first sentence.
4. The semantic text retrieval method for a mid-phrase according to claim 1, wherein word expansion of the target word comprises the steps of:
splitting the first result into a plurality of paragraphs, dividing the paragraphs into a plurality of first types, wherein the paragraphs of the same first type comprise at least one same target word, comparing the paragraphs of the same first type, and defining a word as a first reference word of the target word if the word exists in at least a first number of the paragraphs;
Re-dividing the paragraphs into a plurality of second types, wherein the target words included in the paragraphs of the same second type are completely different, comparing the paragraphs of the same second type, and defining the words as second reference words if there are words appearing in at least a second number of the paragraphs;
and eliminating the second reference word from the first reference word, setting the reserved first reference word as the expansion word of the target word, generating a search formula based on the target word and the expansion word, and performing the secondary keyword search based on the search formula.
5. A semantic text retrieval system for mid-phrases, for implementing a semantic text retrieval method for mid-phrases according to any one of claims 1-4, comprising:
the primary retrieval module is used for extracting target words from a mid-phrase text, performing primary keyword retrieval using the target words, sorting the results of the primary keyword retrieval based on the number and frequency of occurrence of the target words, locating the first M results after sorting, and defining them as first results;
the secondary retrieval module is used for performing word expansion on the target words based on the first results to obtain expansion words of each target word, performing secondary keyword retrieval using the target words and the corresponding expansion words, sorting the results of the secondary keyword retrieval, locating the first J results after sorting, and defining them as second results;
the semantic retrieval module is used for performing semantic retrieval based on the mid-phrase text to obtain semantic retrieval results, and calculating a characteristic value of the i-th semantic retrieval result based on a first formula (the formula image is not reproduced in this text), the terms of the first formula being: the similarity between the i-th semantic retrieval result and the mid-phrase text; the similarity between the i-th semantic retrieval result and the j-th second result; and the weight of the j-th second result; the semantic retrieval module sorts the semantic retrieval results based on the magnitude of the characteristic value to obtain a final retrieval result, wherein, when obtaining the similarity between the mid-phrase text and the semantic retrieval result, the semantic retrieval module splits the mid-phrase text and the semantic retrieval result into a plurality of first sentences and a plurality of second sentences respectively, sets a statistical value, extracts one first sentence, adds 1 to the statistical value if there is a second sentence that is the same as the extracted first sentence, then extracts the next first sentence and continues the judgment, and, after all the first sentences have been extracted and judged, calculates the similarity $S$ between the mid-phrase text and the semantic retrieval result based on a second formula, the second formula being: $S = \frac{m}{M}$, where $m$ is the statistical value and $M$ is the number of the first sentences.
6. A computer storage medium storing program instructions, wherein the program instructions, when executed, control a device in which the computer storage medium is located to perform a semantic text retrieval method for a mid-phrase according to any one of claims 1-4.
CN202311337759.4A 2023-10-17 2023-10-17 Semantic text retrieval method, system and storage medium for middle phrases Active CN117076652B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311337759.4A CN117076652B (en) 2023-10-17 2023-10-17 Semantic text retrieval method, system and storage medium for middle phrases


Publications (2)

Publication Number Publication Date
CN117076652A CN117076652A (en) 2023-11-17
CN117076652B (en) 2023-12-29

Family

ID=88715675


Country Status (1)

Country Link
CN (1) CN117076652B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117312688B (en) * 2023-11-29 2024-01-26 Zhejiang University Cross-source data retrieval method, medium and device based on space-time asset catalogue




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant