CN113392196A - Topic retrieval method and system based on multi-mode cross comparison - Google Patents


Info

Publication number
CN113392196A
Authority
CN
China
Prior art keywords
topic
similarity
question
representation
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110622823.8A
Other languages
Chinese (zh)
Other versions
CN113392196B (en)
Inventor
余胜泉
陈鹏鹤
刘杰飞
徐琪
陈玲
卢宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Normal University
Original Assignee
Beijing Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Normal University
Priority to CN202110622823.8A
Publication of CN113392196A
Application granted
Publication of CN113392196B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a topic retrieval system and method based on multi-modal cross comparison. The system comprises a topic data analysis module, a topic similarity calculation module and a result output module. The topic data analysis module receives the topic information input by a user and performs preprocessing and structured arrangement. The topic similarity calculation module cross-computes the similarities between the text representation and picture representation of the user's topic and the text representation and picture representation of each topic in the topic bank, and combines them into a weighted comprehensive similarity. The result output module returns to the user the topics in the topic bank, together with their answers and other related information, whose comprehensive similarity is greater than the preset subject threshold. With this system, the retrieval result for each topic in the topic bank is more accurate.

Description

Topic retrieval method and system based on multi-mode cross comparison
Technical Field
The invention relates to the technical field of computers, in particular to a topic retrieval method and system based on multi-mode cross comparison.
Background
In recent years, with the development of the internet and artificial intelligence technology, topic question-answering systems have developed rapidly and provide great help for personalized teaching. It is becoming increasingly important to quickly and accurately retrieve, from a topic bank, topics that are the same as or similar to the topic a user inputs, and then return their answers.
At present, topic retrieval systems are generally implemented by comparing the text similarity between topics: the user submits a text describing the topic to the retrieval system, the system compares the text similarity between the user's topic text and the topic texts in the topic bank, and the topic with the highest similarity is returned to the user as the retrieval result. If the topic information input by the user is in the form of a picture, topic similarity is instead compared through picture similarity.
Current text similarity calculation methods mainly fall into two categories: character-based methods and vector-space-based methods.
Character-based methods, such as the traditional edit distance, Hamming distance, Jaccard and LCS measures, evaluate text similarity by directly comparing the shared characters of two strings and their order.
Vector-space-based methods, such as TF-IDF and BM25, first convert texts into vector representations and then compute the similarity between them with cosine similarity, or compare the similarity between texts directly with a neural network.
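As a rough illustration of these two families of methods (not part of the original disclosure), the sketch below computes a character-based similarity with Python's difflib and a vector-space similarity with TF-IDF plus cosine similarity; the choice of scikit-learn and the sample strings are assumptions made only for illustration.

# Illustrative sketch of the two families of text similarity measures.
from difflib import SequenceMatcher
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def char_similarity(a, b):
    # Character-based: compares shared character subsequences directly.
    return SequenceMatcher(None, a, b).ratio()

def tfidf_cosine(a, b):
    # Vector-space: TF-IDF vectors compared by cosine similarity.
    vectors = TfidfVectorizer().fit_transform([a, b])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

print(char_similarity("triangle rule of vector addition", "vector addition triangle rule"))
print(tfidf_cosine("triangle rule of vector addition", "vector addition triangle rule"))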
With the development of multimedia, users no longer describe topic information only with text; it is increasingly common for topic information to be described with text combined with pictures.
At present, the mainstream topic retrieval services and systems on the market only support retrieval of text-only topics or picture-only topics, such as "Easy Search Topic" (https://www.xuesai.cn/souti/) and "Ape Search Topic", as shown in fig. 1 and fig. 2.
The prior art mainly has the following problems, which are illustrated below.
(1) Text-only input: for subjects such as mathematics and physics, or for topics that can only be conveyed through a figure, users often do not know how to type the required symbols or cannot express the topic clearly in words.
(2) Picture-only input: although the correct answer can be returned, it may not meet the user's actual need. For example, as shown in fig. 2, the user may not be familiar with the triangle rule of vector addition used in the answer and therefore, even after seeing the answer, still may not know how to solve the problem. If the user were also allowed to enter his or her specific need as text, the system could assist the user better.
(3) Some question-answering systems do let the user input a topic text part and a topic picture part at the same time, but they convert the picture part into text through picture text recognition, directly splice the topic text with the recognized picture text, and then compare text similarity between topics. Because picture text recognition is imperfect, the topic content may be recognized incorrectly; for example, the same topic photographed under different lighting or angles can yield different recognition results, so topics that are in fact identical are judged dissimilar owing to recognition errors.
Disclosure of Invention
In view of the defects of the prior art, the invention provides a topic retrieval method and system based on multi-modal cross comparison, in which (1) the topic input by the user is structured by a topic data analysis module, (2) a topic similarity calculation module computes the similarity between the user's topic and each candidate topic one by one, and the resulting information is returned to the user.
In order to achieve the above object, the present invention is achieved by the following technical solutions.
According to one aspect of the invention, a topic retrieval system based on multi-modal cross comparison is provided, which comprises a topic data analysis module, a topic similarity calculation module and a result output module, wherein:
the topic data analysis module is used for receiving the topic information input by the user and preprocessing it;
the topic similarity calculation module is used for calculating the similarity between the topic input by the user and the topics in the topic bank;
and the result output module is used for returning to the user the topics in the topic bank whose similarity is greater than a preset subject threshold.
Further, the topic similarity calculation module:
a. splices the cleaned topic title text and the cleaned topic content text as the text representation, and uses the text recognized from the topic picture together with the topic picture as the picture representation;
b. denotes the text representation and picture representation of the user's topic by T1 and P1 and the text representation and picture representation of a topic in the topic bank by T2 and P2, and calculates the similarities between T1 and T2, T1 and P2, P1 and T2, and P1 and P2, denoted S1, S2, S3 and S4 respectively;
c. calculates the comprehensive similarity s.
Further, the similarity S1 is calculated with the Jaccard method, and the similarities S2, S3 and S4 are calculated with cosine similarity; preferably, the text recognized from the topic picture is converted into a vector representation by a BERT model, the topic picture is converted into a vector by a LeNet convolutional network model, and the two vectors are then spliced as the vectorized form of the picture representation.
Further, the calculation formula of the comprehensive similarity s is as follows:

s = (w1·S1 + w2·S2 + w3·S3 + w4·S4) / 4

where w1 to w4 are the subject weights.
Further, the subject weights are:

Subject        Weights (w1, w2, w3, w4)
Chinese        5.5, 2, 2, 0.5
History        5, 2, 2, 1
Geography      5, 2, 2, 1
Politics       4, 2, 2, 2
Physics        5, 1, 1, 3
Mathematics    4, 2, 2, 2
Chemistry      5, 1, 1, 3
Biology        4, 2, 2, 2
Further, in the result output module, if the comprehensive similarity is greater than the subject threshold corresponding to the subject of the topic input by the user, the topic in the topic bank is taken as a candidate topic.
Further, the subject thresholds are:

Subject        Threshold
Chinese        0.8
History        0.8
Geography      0.7
Politics       0.6
Physics        0.7
Mathematics    0.5
Chemistry      0.5
Biology        0.5
According to another aspect of the present invention, a topic retrieval method based on multi-modal cross-comparison is provided, which includes:
step 1, receiving the topic information input by a user;
step 2, calculating the similarity between the received topic and the topics in the topic bank;
step 3, returning to the user, as candidate topics, the topics in the topic bank whose similarity is greater than the subject threshold.
Further, the step 2 comprises:
a. acquiring the text representation and picture representation of the topic information;
b. denoting the text representation and picture representation of the topic by T1 and P1 and those of a topic in the topic bank by T2 and P2, and then cross-comparing the four similarities S1, S2, S3 and S4, wherein S1 is calculated with the Jaccard method and S2, S3 and S4 with cosine similarity;
c. calculating the comprehensive similarity s.
Further, in the step 3, the topics in the topic bank whose comprehensive similarity is greater than the subject threshold corresponding to the subject of the user's topic are taken as candidate topics.
Compared with the prior art, the invention has the following beneficial effects:
(1) For topic retrieval, the invention adopts a multi-modal cross comparison method; compared with directly comparing the similarity of topic texts or of topic pictures, the accuracy of topic comparison is greatly improved.
(2) When the four cross-comparison similarities are combined into the comprehensive topic similarity, different weight combinations are set empirically for different subjects, which improves the accuracy of topic comparison for topics from different subjects.
(3) The comprehensive similarity of the topic comparison is judged against preset thresholds that differ by subject, which further improves the accuracy of topic comparison.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic interface diagram of a prior art question-answering system;
FIG. 2 is a schematic diagram of another interface of a prior art question-answering system;
FIG. 3 is a schematic diagram of a topic retrieval system according to one embodiment of the present invention;
FIG. 4 is a schematic flow chart of topic data parsing according to one embodiment of the invention;
FIG. 5 is a schematic diagram of a cross-comparison in accordance with one embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, a technical solution in an embodiment of the present invention will be described in detail and completely with reference to the accompanying drawings in the embodiment of the present invention, and it is obvious that the described embodiment is a part of embodiments of the present invention, but not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The invention provides a topic retrieval system based on multi-modal cross comparison, which can be applied in a question-answering system: it receives the topic input by a user, searches the topic bank, and returns the same or similar topics and their answers to the user to help the user complete a learning task. As shown in fig. 3, the system comprises the following modules:
(1) Topic data analysis module: receives the topic information input by the user, including the subject, topic title, topic description, topic picture and topic type. The topic data analysis module can be deployed on a front-end web page or a mobile APP.
After receiving the topic data, the topic data analysis module first divides it into topic text data and topic picture data and preprocesses each separately. The topic data analysis flow is shown in fig. 4.
Preprocessing of the topic text data comprises text cleaning and word segmentation. Text cleaning includes unifying the character encoding and removing illegal characters, emoji, special character symbols, HTML tags and other markup, and invalid characters.
The invalid characters are expressions unrelated to the topic content, compiled by analyzing and counting the topic data in the topic bank; they mainly include the following:
thank you, solve online, ask, teacher's help, read, ask for help, do not, solve, each location, help, hard, you, trouble, explain, help me to say, give a message, want to help me to solve, question how to do, ask how to answer, help me, check, give a message, not understand, not clear, explain, help me to see, baituo, answer, hope teacher's help to see, i's go, do not know, how to write, how to explain, how to ask, how to list, how to do, detail, not give, do, not understand, not read, help analyze, ask, give a question, give a message to me, give a message, how to give a message, do not especially, not know, how to do, do not think, do, not think, give a message, thought, not know, point, give a message to do, do not know, help, do not particularly, do not understand, do not know, do not like, give a message, or a message, give a message, or a message, instruction, wish details, model, volume, middle exam, end of term, notice, version.
The topic picture data is the picture of the topic; picture text recognition is performed on it, the topic information in the picture is recognized as text by an existing picture recognition service, and the data is then structured.
After preprocessing, the topic text and topic picture data are merged into a topic record with the structure shown below, in which the picture information stored is the picture address.
{
    "question_id": unique identifier of the topic record,
    "question_title_original": original text of the topic title,
    "question_content_original": original text of the topic content,
    "question_type": topic type,
    "question_create_time": topic creation time,
    "question_subject": subject,
    "question_pic": topic picture address,
    "question_pic_content": text recognized from the topic picture,
    "question_title_clean": cleaned topic title text,
    "question_content_clean": cleaned topic content text
}
Example:

{
    "question_id": 00063274-1008-4525-9b7d-9f297f72bde1,
    "question_title_original": Teacher, please help me with this question about "Inscription on a Humble Room": how should I do it?,
    "question_content_original": Analyze the material in the question. "A mountain need not be high; it is famed if an immortal dwells there." What idea does the author want to express?,
    "question_type": question-and-answer,
    "question_create_time": 2019-06-27 20:58:28,
    "question_subject": Chinese,
    "question_pic": https://cs.101.com/v0.1/download/actions/direct?dentryId=9e563a62-a96a-49e2-a238-a810b550b001&serviceName=fep,
    "question_pic_content": "Inscription on a Humble Room": A mountain need not be high; it is famed if an immortal dwells there. Water need not be deep; it is numinous if a dragon dwells there. This is a humble room, but my virtue makes it fragrant. Moss creeps green onto the steps; the color of grass shows green through the curtain. Learned scholars converse and laugh here; no unlettered folk come and go.,
    "question_title_clean": about "Inscription on a Humble Room",
    "question_content_clean": Analyze the material in the question. "A mountain need not be high; it is famed if an immortal dwells there." What the author wants to express is
}
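For illustration only (not part of the original disclosure), the following Python sketch shows how such a record could be assembled. clean_text(), ocr_picture() and the abbreviated INVALID_PHRASES list are hypothetical stand-ins for the cleaning rules and the external picture text recognition service described above.

# Hypothetical sketch of the topic data analysis step.
import re
import uuid
from datetime import datetime

INVALID_PHRASES = ["thank you", "please help", "how to do", "do not understand"]  # abbreviated stand-in list

def clean_text(text):
    text = re.sub(r"<[^>]+>", "", text)                  # strip HTML tags
    text = re.sub(r"[\U0001F300-\U0001FAFF]", "", text)  # strip emoji
    for phrase in INVALID_PHRASES:
        text = text.replace(phrase, "")                  # strip invalid characters
    return text.strip()

def ocr_picture(pic_url):
    # Placeholder for the existing picture text recognition service.
    raise NotImplementedError

def build_topic_record(title, content, subject, qtype, pic_url):
    return {
        "question_id": str(uuid.uuid4()),
        "question_title_original": title,
        "question_content_original": content,
        "question_type": qtype,
        "question_create_time": datetime.now().isoformat(sep=" ", timespec="seconds"),
        "question_subject": subject,
        "question_pic": pic_url,
        "question_pic_content": ocr_picture(pic_url) if pic_url else "",
        "question_title_clean": clean_text(title),
        "question_content_clean": clean_text(content),
    }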
(2) Topic similarity calculation module: compares the topic data with the topic data in the topic bank one by one and returns the same or similar topics to the user. As shown in fig. 5, the comparison proceeds as follows:
a. The cleaned topic title text and the cleaned topic content text are spliced as the text representation, and the text recognized from the topic picture together with the topic picture itself is used as the picture representation.
b. The text representation and picture representation of topic 1 are denoted T1 and P1, and those of topic 2 are denoted T2 and P2; the similarities of the four pairs (T1, T2), (T1, P2), (P1, T2) and (P1, P2) are then cross-compared and denoted S1, S2, S3 and S4 respectively.
When calculating the similarity S1, i.e. the similarity between the text representations of the two topics, the Jaccard method is used: the texts to be compared are segmented into words, stop words and punctuation are removed, and the similarity is computed from the intersection and union of the word sets, as follows:

J(A, B) = |A ∩ B| / |A ∪ B|

where J(A, B) ∈ [0, 1].
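A minimal Python sketch of the Jaccard similarity S1 over segmented, stop-word-filtered word sets is given below for illustration; the use of the jieba segmenter and the abbreviated stop-word set are assumptions, not specified in the original text.

# Sketch of S1: Jaccard similarity between two text representations.
import jieba

STOP_WORDS = {"的", "了", "是", ",", "。", "?"}  # abbreviated stand-in list

def jaccard_similarity(text_a, text_b):
    words_a = {w for w in jieba.cut(text_a) if w not in STOP_WORDS}
    words_b = {w for w in jieba.cut(text_b) if w not in STOP_WORDS}
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)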
When calculating the similarities S2 and S3, i.e. the similarity between a text representation and a picture representation, the text representation is vectorized by a BERT model and the picture representation is then vectorized as follows: the text recognized from the topic picture is likewise converted into a vector by the BERT model, the topic picture is converted into a vector by a LeNet convolutional network model, and the two vectors are spliced as the vectorized form of the picture representation. The similarity between the vectors is then computed with cosine similarity:

cos(A, B) = (A · B) / (‖A‖ × ‖B‖)
When calculating the similarity S4, i.e. the similarity between the two picture representations, the topic pictures are vectorized in the same way and the cosine similarity is then computed.
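The following sketch illustrates, purely for explanation, how the picture representation can be vectorized and the similarities compared by cosine similarity. bert_encode() and lenet_encode() are placeholder stand-ins for a BERT sentence encoder and a LeNet-style image encoder (the placeholder dimensions are arbitrary); comparing a text vector with the concatenated picture vector for S2 and S3 additionally requires both vectors to share the same dimensionality, a detail the description leaves unspecified.

# Sketch of the vectorization used for S2, S3 and S4.
import numpy as np

def bert_encode(text):
    # Stand-in for a BERT sentence encoder (replace with a real model).
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(768)

def lenet_encode(image):
    # Stand-in for a LeNet-style convolutional image encoder.
    rng = np.random.default_rng(0)
    return rng.standard_normal(84)

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def picture_vector(pic_text, image):
    # Picture representation: BERT vector of the recognized picture text
    # concatenated with the LeNet vector of the picture itself.
    return np.concatenate([bert_encode(pic_text), lenet_encode(image)])

# e.g. S4 = cosine(picture_vector(text1, img1), picture_vector(text2, img2))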
After the cross-comparison similarities are obtained, different weights wi, determined from prior statistical analysis and experience, are assigned to the different similarity values, and the comprehensive similarity of the two topics being compared is expressed as:

s = (w1·S1 + w2·S2 + w3·S3 + w4·S4) / 4
For each subject, the corresponding weights are set as follows (the subject weights can be set empirically or obtained by training a neural network):
Table 1. Weight settings

Subject        Weights (w1, w2, w3, w4)
Chinese        5.5, 2, 2, 0.5
History        5, 2, 2, 1
Geography      5, 2, 2, 1
Politics       4, 2, 2, 2
Physics        5, 1, 1, 3
Mathematics    4, 2, 2, 2
Chemistry      5, 1, 1, 3
Biology        4, 2, 2, 2
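A short sketch of the weighted comprehensive similarity using the Table 1 weights is given below, for illustration; the names and the averaging over four terms follow the formula and worked example in this description.

# Sketch of the comprehensive similarity s = (w1*S1 + w2*S2 + w3*S3 + w4*S4) / 4.
SUBJECT_WEIGHTS = {
    "Chinese": (5.5, 2, 2, 0.5), "History": (5, 2, 2, 1),
    "Geography": (5, 2, 2, 1),   "Politics": (4, 2, 2, 2),
    "Physics": (5, 1, 1, 3),     "Mathematics": (4, 2, 2, 2),
    "Chemistry": (5, 1, 1, 3),   "Biology": (4, 2, 2, 2),
}

def comprehensive_similarity(s1, s2, s3, s4, subject):
    w1, w2, w3, w4 = SUBJECT_WEIGHTS[subject]
    return (w1 * s1 + w2 * s2 + w3 * s3 + w4 * s4) / 4

# A low result means the two topics are judged dissimilar for that subject.
s = comprehensive_similarity(0.05, 0.14, 0.19, 0.95, "Chinese")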
Compared with the original approach, this cross comparison compares the topic text content and the topic picture content separately, analysing the topic content at a finer granularity, which improves comparison accuracy.
This is illustrated by the following example.

Topic 1:

Topic content: Analyze the material in the question. "A mountain need not be high; it is famed if an immortal dwells there." What idea does the author want to express?

Topic picture text content: "Inscription on a Humble Room": A mountain need not be high; it is famed if an immortal dwells there. Water need not be deep; it is numinous if a dragon dwells there. This is a humble room, but my virtue makes it fragrant. Moss creeps green onto the steps; the color of grass shows green through the curtain. Learned scholars converse and laugh here; no unlettered folk come and go.

Topic 2:

Topic content: "Moss creeps green onto the steps; the color of grass shows green through the curtain." What does this line describe?

Topic picture text content: "Inscription on a Humble Room": A mountain need not be high; it is famed if an immortal dwells there. Water need not be deep; it is numinous if a dragon dwells there. This is a humble room, but my virtue makes it fragrant. Moss creeps green onto the steps; the color of grass shows green through the curtain.

If the topic text content and the topic picture content are directly spliced, the Jaccard similarity between the content of topic 1 and the content of topic 2 is 0.68.

With the cross comparison, the similarity between the text content of topic 1 and the text content of topic 2 is 0.05, between the picture content of topic 1 and the text content of topic 2 is 0.14, between the text content of topic 1 and the picture content of topic 2 is 0.19, and between the picture content of topic 1 and the picture content of topic 2 is 0.95. Weighted with the preset weights for the Chinese subject (5.5, 2, 2, 0.5), the comprehensive similarity of topic 1 and topic 2 is 0.3, indicating that the two are only slightly similar or not similar. A human judge would in fact consider the two topics dissimilar, so the cross-comparison result of 0.3 reflects the similarity of the topics better than the 0.68 produced by the previous method.
(3) Result output module: compares the computed comprehensive similarity with a preset subject threshold, where different comparison thresholds are set for different subjects, as shown in Table 2. If the comprehensive similarity is greater than the threshold corresponding to the subject of the topic input by the user, the topic in the topic bank is considered similar to the user's topic and is taken as a candidate topic. The thresholds can also be obtained by training an artificial intelligence network on a data set. After all topics in the topic bank have been compared, the candidate topics are ranked by comprehensive similarity and the top five are returned to the user; if there are fewer than five candidates, all qualifying candidates are returned; if there are none, the user is told that no similar topic was found, and the user's topic is published as a new question.
Table 2. Threshold settings

Subject        Threshold
Chinese        0.8
History        0.8
Geography      0.7
Politics       0.6
Physics        0.7
Mathematics    0.5
Chemistry      0.5
Biology        0.5
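For illustration, the result output step can be sketched as follows using the Table 2 thresholds; select_candidates() is a hypothetical name, and the handling of the empty case mirrors the description above.

# Sketch of the result output module: filter by subject threshold, rank, return top five.
SUBJECT_THRESHOLDS = {
    "Chinese": 0.8, "History": 0.8, "Geography": 0.7, "Politics": 0.6,
    "Physics": 0.7, "Mathematics": 0.5, "Chemistry": 0.5, "Biology": 0.5,
}

def select_candidates(scored_topics, subject, top_k=5):
    # scored_topics: iterable of (topic, comprehensive_similarity) pairs.
    threshold = SUBJECT_THRESHOLDS[subject]
    candidates = [(t, s) for t, s in scored_topics if s > threshold]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    if not candidates:
        return None  # caller tells the user nothing similar was found and publishes the topic as new
    return candidates[:top_k]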
According to another aspect of the present invention, a topic retrieval method based on multi-modal cross-comparison is provided, which includes:
step 1, receiving the topic information input by a user;
step 2, calculating the similarity between the received topic and the topics in the topic bank;
step 3, returning to the user, as candidate topics, the topics in the topic bank whose similarity is greater than the subject threshold.
In step 1, after the topic data is received, it is divided into topic text data and topic picture data, and each is preprocessed separately. The specific process is as described for the topic data analysis module.
In step 2: a. the text representation and picture representation of the user's topic information are acquired; b. the text representation and picture representation of the topic are denoted T1 and P1 and those of a topic in the topic bank are denoted T2 and P2, and the four similarities S1, S2, S3 and S4 are cross-compared, wherein S1 is calculated with the Jaccard method; for S2 and S3, the text representation is first vectorized by a BERT model and the picture representation is then vectorized, that is, the text recognized from the topic picture is converted into a vector by the BERT model, the topic picture is converted into a vector by a LeNet convolutional network model, and the two vectors are spliced as the vectorized form of the picture representation, after which the similarity between the vectors is computed with cosine similarity; for S4, the cosine similarity between the vectorized representation of the user's topic picture and that of the topic picture in the topic bank is computed; c. the comprehensive similarity s is calculated. The specific calculation method is as described above.
In step 3, the calculated similarity is compared with a preset subject threshold, where different comparison thresholds are set for different subjects, as shown in Table 2. If the similarity is greater than the threshold corresponding to the subject of the topic input by the user, the topic in the topic bank is considered similar to the user's topic and is taken as a candidate topic. The thresholds can also be obtained by training an artificial intelligence network on a data set. After all topics in the topic bank have been compared, the candidate topics are ranked by similarity and the top five are returned to the user; if there are fewer than five candidates, all qualifying candidates are returned; if there are none, the user is told that no similar topic was found, and the user's topic is published as a new question.
The above examples are only for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (10)

1. A topic retrieval system based on multi-modal cross comparison, the system comprising: a topic data analysis module, a topic similarity calculation module and a result output module; wherein:
the topic data analysis module is used for receiving the topic information input by the user and preprocessing it;
the topic similarity calculation module is used for calculating the similarity between the topic input by the user and the topics in the topic bank;
and the result output module is used for returning to the user the topics in the topic bank whose similarity is greater than a preset subject threshold.
2. The topic retrieval system according to claim 1, wherein the topic similarity calculation module is configured to:
a. splice the cleaned topic title text and the cleaned topic content text as the text representation, and use the text recognized from the topic picture together with the topic picture as the picture representation;
b. denote the text representation and picture representation of the user's topic by T1 and P1 and the text representation and picture representation of a topic in the topic bank by T2 and P2, and calculate the similarities between T1 and T2, T1 and P2, P1 and T2, and P1 and P2, denoted S1, S2, S3 and S4 respectively;
c. calculate the comprehensive similarity s.
3. The topic retrieval system according to claim 2, wherein the similarity S1 is calculated with the Jaccard method, and the similarities S2, S3 and S4 are calculated with cosine similarity; preferably, the text recognized from the topic picture is converted into a vector representation by a BERT model, the topic picture is converted into a vector by a LeNet convolutional network model, and the two vectors are then spliced as the vectorized form of the picture representation.
4. The topic retrieval system according to claim 2, wherein the comprehensive similarity s is calculated as:

s = (w1·S1 + w2·S2 + w3·S3 + w4·S4) / 4

where w1 to w4 are the subject weights.
5. The topic retrieval system according to claim 4, wherein the subject weights are:

Subject        Weights (w1, w2, w3, w4)
Chinese        5.5, 2, 2, 0.5
History        5, 2, 2, 1
Geography      5, 2, 2, 1
Politics       4, 2, 2, 2
Physics        5, 1, 1, 3
Mathematics    4, 2, 2, 2
Chemistry      5, 1, 1, 3
Biology        4, 2, 2, 2
6. The topic retrieval system according to claim 1, wherein, in the result output module, if the comprehensive similarity is greater than the subject threshold corresponding to the subject of the topic input by the user, the topic in the topic bank is taken as a candidate topic.
7. The topic retrieval system according to claim 6, wherein the subject thresholds are:

Subject        Threshold
Chinese        0.8
History        0.8
Geography      0.7
Politics       0.6
Physics        0.7
Mathematics    0.5
Chemistry      0.5
Biology        0.5
8. A topic retrieval method based on multi-modal cross comparison, comprising:
step 1, receiving the topic information input by a user;
step 2, calculating the similarity between the received topic and the topics in the topic bank;
step 3, returning to the user, as candidate topics, the topics in the topic bank whose similarity is greater than the subject threshold.
9. The topic retrieval method according to claim 8, wherein the step 2 comprises:
a. acquiring the text representation and picture representation of the topic information;
b. denoting the text representation and picture representation of the topic by T1 and P1 and those of a topic in the topic bank by T2 and P2, and then cross-comparing the four similarities S1, S2, S3 and S4, wherein S1 is calculated with the Jaccard method and S2, S3 and S4 with cosine similarity;
c. calculating the comprehensive similarity s.
10. The topic retrieval method according to claim 9, wherein, in the step 3, the topics in the topic bank whose comprehensive similarity is greater than the subject threshold corresponding to the subject of the user's topic are taken as candidate topics.
CN202110622823.8A 2021-06-04 2021-06-04 Question retrieval method and system based on multi-mode cross comparison Active CN113392196B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110622823.8A CN113392196B (en) 2021-06-04 2021-06-04 Question retrieval method and system based on multi-mode cross comparison

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110622823.8A CN113392196B (en) 2021-06-04 2021-06-04 Question retrieval method and system based on multi-mode cross comparison

Publications (2)

Publication Number Publication Date
CN113392196A (en) 2021-09-14
CN113392196B (en) 2023-04-21

Family

ID=77618168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110622823.8A Active CN113392196B (en) 2021-06-04 2021-06-04 Question retrieval method and system based on multi-mode cross comparison

Country Status (1)

Country Link
CN (1) CN113392196B (en)

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104794169A (en) * 2015-03-30 2015-07-22 明博教育科技有限公司 Subject term extraction method and system based on sequence labeling model
CN105760507A (en) * 2016-02-23 2016-07-13 复旦大学 Cross-modal subject correlation modeling method based on deep learning
CN108269110A (en) * 2016-12-30 2018-07-10 华为技术有限公司 Item recommendation method, system and user equipment based on community's question and answer
CN106886601A (en) * 2017-03-02 2017-06-23 大连理工大学 A kind of Cross-modality searching algorithm based on the study of subspace vehicle mixing
CN107145571A (en) * 2017-05-05 2017-09-08 广东艾檬电子科技有限公司 A kind of searching method and device
CN107562812A (en) * 2017-08-11 2018-01-09 北京大学 A kind of cross-module state similarity-based learning method based on the modeling of modality-specific semantic space
CN109213853A (en) * 2018-08-16 2019-01-15 昆明理工大学 A kind of Chinese community's question and answer cross-module state search method based on CCA algorithm
CN109558498A (en) * 2018-11-07 2019-04-02 南京邮电大学 Multi-modal hash method based on deep learning
US20210150151A1 (en) * 2019-01-08 2021-05-20 Institute Of Automation, Chinese Academy Of Sciences Autonomous evolution intelligent dialogue method, system, and device based on a game with a physical environment
CN110059217A (en) * 2019-04-29 2019-07-26 广西师范大学 A kind of image text cross-media retrieval method of two-level network
CN110196930A (en) * 2019-05-22 2019-09-03 山东大学 A kind of multi-modal customer service automatic reply method and system
CN110287951A (en) * 2019-06-21 2019-09-27 北京百度网讯科技有限公司 A kind of method and device of Text region
CN112287130A (en) * 2019-07-23 2021-01-29 小船出海教育科技(北京)有限公司 Searching method, device and equipment for graphic questions
US20210057058A1 (en) * 2019-08-23 2021-02-25 Alibaba Group Holding Limited Data processing method, apparatus, and device
CN111753190A (en) * 2020-05-29 2020-10-09 中山大学 Meta learning-based unsupervised cross-modal Hash retrieval method
CN112035669A (en) * 2020-09-09 2020-12-04 中国科学技术大学 Social media multi-modal rumor detection method based on propagation heterogeneous graph modeling
CN112905827A (en) * 2021-02-08 2021-06-04 中国科学技术大学 Cross-modal image-text matching method and device and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘泳文: "Research and Implementation of a Question-Search *** Based on Image Recognition" (基于图像识别的搜题***的研究与实现) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114048354A (en) * 2022-01-10 2022-02-15 广州启辰电子科技有限公司 Test question retrieval method, device and medium based on multi-element characterization and metric learning

Also Published As

Publication number Publication date
CN113392196B (en) 2023-04-21

Similar Documents

Publication Publication Date Title
CN116166782A (en) Intelligent question-answering method based on deep learning
CN115269857A (en) Knowledge graph construction method and device based on document relation extraction
CN114036281B (en) Knowledge graph-based citrus control question-answering module construction method and question-answering system
CN113821605B (en) Event extraction method
CN114416942A (en) Automatic question-answering method based on deep learning
CN110119510B (en) Relationship extraction method and device based on transfer dependency relationship and structure auxiliary word
US20150026184A1 (en) Methods and systems for content management
CN111339269A (en) Knowledge graph question-answer training and application service system with automatically generated template
CN115564393A (en) Recruitment requirement similarity-based job recommendation method
CN113157885A (en) Efficient intelligent question-answering system for knowledge in artificial intelligence field
CN112541337A (en) Document template automatic generation method and system based on recurrent neural network language model
CN111552773A (en) Method and system for searching key sentence of question or not in reading and understanding task
CN111782793A (en) Intelligent customer service processing method, system and equipment
CN114661872A (en) Beginner-oriented API self-adaptive recommendation method and system
CN106897274B (en) Cross-language comment replying method
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN111460118A (en) Artificial intelligence conflict semantic recognition method and device
CN112711666B (en) Futures label extraction method and device
CN112349294B (en) Voice processing method and device, computer readable medium and electronic equipment
CN113392196A (en) Topic retrieval method and system based on multi-mode cross comparison
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN115774996B (en) Intelligent interview topdressing problem generation method and device and electronic equipment
Karpagam et al. Deep learning approaches for answer selection in question answering system for conversation agents
CN115658845A (en) Intelligent question-answering method and device suitable for open-source software supply chain
CN115129866A (en) Training text generation method, model training device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant