CN111209745B - Information reliability evaluation method, equipment and storage medium

Info

Publication number: CN111209745B
Application number: CN201811302280.6A
Authority: CN (China)
Other versions: CN111209745A (application publication, 2020-05-29)
Grant publication: CN111209745B (2022-04-22)
Priority/filing date: 2018-11-02
Legal status: Active (granted)
Inventors: 田勇, 毕海, 殷晓珑
Assignee: Beijing Haola Technology Co., Ltd.
Classification: Information Retrieval, Db Structures And Fs Structures Therefor

Abstract

The invention discloses an information reliability evaluation method, equipment and a storage medium. The method comprises the following steps: respectively carrying out depth semantic vector coding on all information in an information base; calculating the similarity between every two pieces of information according to the depth semantic vector of each piece of information to obtain a semantic similarity matrix; constructing a semantic network according to the semantic similarity matrix; and carrying out reliability scoring on each piece of information in the information base according to a preset random walk model and the information corresponding to the central node of the semantic network. By carrying out depth semantic vector coding on the information and constructing a semantic network from the pairwise similarities, the method evaluates the reliability of the viewpoint expressed in each piece of information and computes a reliability score for it.

Description

Information reliability evaluation method, equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method, a device, and a storage medium for evaluating reliability of information.
Background
Traditionally, information acquisition was active: a user would browse a portal for the latest news or search for information of interest with a search engine. In recent years, with the development of computer networks and artificial intelligence, the way people acquire information has changed greatly. Waterfall-style feeds and intelligently pushed information are presented directly to users, who in many cases receive it passively. During this shift from active to passive acquisition, alongside the benign development of the technology, information explosion and information flooding have allowed false information and even rumors to spread rapidly, and positive information (such as health information) is adversely affected by such negative information. Discriminating the authenticity of information is a matter of reliability evaluation, and how to evaluate the reliability of information effectively has become a problem to be solved.
In rumor identification work, the focus has been on analyzing the content of the information: exaggerated or unreasonable content is identified through review by professionals or through network crowdsourcing, from which it is inferred whether the information is a rumor. However, review by professionals and network crowdsourcing both have great limitations and consume a great deal of labor. Because no efficient rumor identification method currently exists, network crowdsourcing has in practice become the only choice for rumor-debunking platforms. Network crowdsourcing relies on the social participation of internet users and draws on collective intelligence: rumor content is jointly marked and identified, and the reliability of information is judged by counting the marks.
With the wide application of deep learning, researchers have begun to consider using deep learning models to identify rumors. The basic idea still starts from the content of the information itself: by labeling a large number of rumor and non-rumor samples, a classifier that distinguishes rumors from non-rumors is trained with a deep learning network, and the reliability of the information content is judged directly. However, this approach has the following problems. First, although deep learning models perform well in the image and video fields, in the natural language field, especially for information whose truth ordinary readers cannot judge, it is difficult to find a suitable deep learning model that meets practical requirements. Second, the interpretability of deep learning models still needs further study: in practical applications, their output is produced by a large amount of complex computation, the final result is hard to control, and its quality cannot be directly verified with evidence.
Disclosure of Invention
The invention mainly aims to provide an information reliability evaluation method, equipment and a storage medium, so as to solve the problems of high labor cost and low accuracy of the existing information reliability evaluation method.
In view of the above technical problems, the invention adopts the following technical scheme:
the invention provides an information reliability evaluation method, which comprises the following steps: respectively carrying out depth semantic vector coding on all information in the information base; calculating the similarity between every two information according to the depth semantic vector of each information to obtain a semantic similarity matrix; constructing a semantic network according to the semantic similarity matrix; and according to a preset random walk model and information corresponding to a central node in the semantic network, carrying out reliability scoring on each piece of information in the information base.
Wherein, the depth semantic vector coding is respectively performed on all the information in the information base, and the depth semantic vector coding comprises the following steps: capturing common words in a preset website, and adding the common words into a preset word segmentation tool; utilizing the word segmentation tool to perform word segmentation processing on all information in the information base respectively to obtain a plurality of words; according to a preset distributed word vector representation method, training a preset distributed word vector model by using the multiple participles to obtain a distributed word vector corresponding to each participle; and carrying out depth semantic vector coding on each piece of information in the information base according to the distributed word vector corresponding to each participle.
Wherein, the constructing a semantic network according to the semantic similarity matrix comprises: performing principal component analysis on the semantic similarity matrix to construct a sparse semantic similarity matrix; and constructing a single-connected undirected simple graph with the weight as a semantic network according to the semantic similarity matrix and the sparse semantic similarity matrix.
Wherein, according to the semantic similarity matrix and the sparse semantic similarity matrix, a single-connected undirected simple graph with weights is constructed, which comprises the following steps: constructing a weighted undirected simple graph according to the sparse semantic similarity matrix; determining a plurality of disconnected subgraphs contained in the weighted undirected simple graph; querying the similarity of node pairs among the disconnected subgraphs in the semantic similarity matrix; in the weighted undirected simple graph, the node pairs with the maximum similarity are connected, and the maximum similarity is used as the weight of the connection to form the singly-connected weighted undirected simple graph.
Wherein, the scoring the reliability of each information in the information base according to the preset random walk model and the information corresponding to the central node in the semantic network comprises: based on the random walk model, carrying out support scoring on information corresponding to each node in the semantic network; extracting keywords and topics from information corresponding to the central node of the semantic network to serve as the keywords and the topics of the information base; carrying out depth semantic vector coding on the keywords and the subjects of the information base; respectively calculating the similarity between the depth semantic vector of each piece of information in the information base and the depth semantic vector of the information base, and taking the similarity as the self-evidence score of each piece of information; and obtaining the reliability score of each piece of information according to the support score and the self-evidence score of each piece of information.
Wherein the method further comprises: and obtaining the reliability score of the information base according to the reliability score of each piece of information in the information base.
The present invention also provides an information reliability evaluation apparatus, which includes: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program when executed by the processor implementing the steps of: respectively carrying out depth semantic vector coding on all information in the information base; calculating the similarity between every two information according to the depth semantic vector of each information to obtain a semantic similarity matrix; constructing a semantic network according to the semantic similarity matrix; and according to a preset random walk model and information corresponding to a central node in the semantic network, carrying out reliability scoring on each piece of information in the information base.
Wherein the processor is further configured to execute the computer program stored in the memory to implement the steps of: based on the random walk model, carrying out support scoring on information corresponding to each node in the semantic network; extracting keywords and topics from information corresponding to the central node of the semantic network to serve as the keywords and the topics of the information base; carrying out depth semantic vector coding on the keywords and the subjects of the information base; respectively calculating the similarity between the depth semantic vector of each piece of information in the information base and the depth semantic vector of the information base, and taking the similarity as the self-evidence score of each piece of information; and obtaining the reliability score of each piece of information according to the support score and the self-evidence score of each piece of information.
Wherein the processor is further configured to execute the computer program stored in the memory to implement the steps of: and obtaining the reliability score of the information base according to the reliability score of each piece of information in the information base.
The invention further provides a storage medium, on which an information reliability evaluation program is stored, which when executed by a processor implements the steps of the information reliability evaluation method described above.
The invention has the following beneficial effects:
the method has the advantages that the reliability of the viewpoint in the information is evaluated, the information is subjected to deep language vector coding, a semantic network is constructed by calculating the similarity between every two information, and then the reliability score of each information can be calculated. Furthermore, in the evaluation process, the invention not only relies on the evidence provided by the information to be evaluated, but also needs other information with the same view point as the information in the information base for support, if the other information supporting the view point in the information base is few, even if the other information has a view point incompatible with the view point, the reliability of the information is low, otherwise, a large amount of other information has the evidence with the same view point as the considered information for verification, the reliability of the information is high.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flowchart illustrating a method for evaluating reliability of information according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating the steps of depth semantic vector encoding according to a second embodiment of the present invention;
FIG. 3 is a flowchart of steps of semantic network construction according to a third embodiment of the present invention;
FIG. 4 is a flowchart illustrating the steps of reliability scoring according to a fourth embodiment of the present invention;
FIG. 5 is a block diagram of an information reliability evaluation apparatus according to a fifth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and specific embodiments.
Example one
According to a first embodiment of the present invention, an information reliability evaluation method is provided.
Fig. 1 is a flowchart illustrating an information reliability evaluation method according to an embodiment of the invention.
Step S110, depth semantic vector encoding is performed on all the information in the information base.
Depth semantic vector coding refers to extracting a vector representation of information in a semantic context space by means of deep learning. Deep learning can model the semantics of a word better by exploiting the context in which the word appears in the information, and vector coding converts the information into a computable quantity that is convenient for a computer to process.
Step S120, calculating the similarity between every two information according to the depth semantic vector of each information to obtain a semantic similarity matrix.
The semantic similarity matrix includes the similarity between any two pieces of information in the information base.
And step S130, constructing a semantic network according to the semantic similarity matrix.
The nodes in the semantic network are information in the information base, the connection between any two nodes in the semantic network has weight, and the value of the weight is the similarity of the two nodes.
Step S140, according to the preset random walk model and the information corresponding to the central node in the semantic network, reliability scoring is carried out on each information in the information base.
A random walk model applied to a network is a stochastic process describing the path probabilities formed by a series of random steps. The walk starts from an initial node and then jumps to the next node according to a preset transition probability that depends on the structure of the (semantic) network; as the number of iteration steps increases, the transition probability converges to a stationary distribution. The random walk model describes the inherent properties of the network structure well and can find the central nodes that play an important role in the network.
In this embodiment, after obtaining the reliability score of each information in the information base, the reliability score of the information base can be obtained according to the reliability score of each information in the information base.
The higher the reliability score of a piece of information, the more reliable it is; the lower the score, the less reliable it is. Similarly, the higher the reliability score of the information base, the more reliable the information base, and the lower the score, the less reliable it is.
In this embodiment, the information can be sorted according to the reliability score, and the information with high reliability score can be provided to the user. Further, according to the reliability scores of the information bases, the information with the highest reliability score is selected from the information base with the highest reliability score and provided for the user to check.
This embodiment provides an information reliability evaluation method based on ontology self-consistency. The reliability of the viewpoint expressed in each piece of information is evaluated: the information is subjected to depth semantic vector coding, a semantic network is constructed by calculating the similarity between every two pieces of information, and a reliability score can then be calculated for each piece of information.
In the evaluation process, a piece of information is evaluated not only by the evidence it provides itself but also by the support of other information in the information base that holds the same viewpoint. If little other information in the information base supports the viewpoint, or other information even holds an incompatible viewpoint, the reliability of the information is low; conversely, if a large amount of other information provides evidence of the same viewpoint, the reliability of the information is high.
The procedure of the first embodiment is further described below with reference to the second to fourth embodiments, which are explained in detail using the health field as an example.
Example two
The present embodiment further describes the steps of depth semantic vector encoding.
Fig. 2 is a flowchart of the steps of depth semantic vector encoding according to the second embodiment of the present invention.
Step S210, capturing common words in a preset website, and adding the common words into a preset word segmentation tool.
Common words are technical terms, professional terms, common names, or frequently occurring words that appear in the preset websites.
The preset website is, for example: "A + medical encyclopedia", "39 health net", "medical inquiry and medicine net", and "Baidu medical encyclopedia".
The word segmentation tools are, for example: jieba, NLPIR, LTP, THULAC, and IK-Analyzer.
The common words are obtained by crawling entries from the preset websites and are used to expand the dictionary of the word segmentation tool so as to obtain a more satisfactory segmentation result. For example, the term "allergic rhinitis" is a common disease noun; most word segmentation tools split it into the two words "allergic" and "rhinitis", after which the meaning of the specific disease is no longer fully and effectively reflected, which greatly harms subsequent semantic analysis. Therefore, health websites can be designated and their disease and symptom terms crawled to obtain the common words.
The health websites are selected according to the following criteria: (1) the website has both a disease encyclopedia and a symptom encyclopedia, with detailed, linked pages describing the diseases and symptoms; (2) after links explicitly marked as advertisements are filtered out, the website ranks relatively near the top of the results in multiple search engines and has a relatively clear site structure.
The common words are loaded into the word segmentation tool as a user dictionary, so that the word segmentation tool can be used to remove symbols, remove stop words, and segment each piece of health information in the health information base.
Step S220, using the word segmentation tool to perform word segmentation processing on all the information in the information base respectively to obtain a plurality of words.
Word segmentation is performed on each piece of health information in the health information base to obtain a plurality of words, which form a health information data set. A sketch of this step is given below.
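The following is a minimal sketch of this segmentation step, assuming the jieba segmenter as the word segmentation tool and a crawled term list saved one term per line; the file name, the stop-word set and the variable information_base are illustrative assumptions rather than elements of the patent.

import re
import jieba

jieba.load_userdict("medical_terms.txt")          # one crawled disease/symptom term per line

STOP_WORDS = {"的", "了", "和", "是", "在"}         # tiny illustrative stop-word set

def tokenize(text):
    """Remove symbols and stop words, then segment one piece of health information."""
    text = re.sub(r"[^\w]+", " ", text)            # strip punctuation and other symbols
    return [w for w in jieba.cut(text) if w.strip() and w not in STOP_WORDS]

# information_base: a list of raw text strings, one per piece of health information
corpus = [tokenize(doc) for doc in information_base]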
Step S230, training a preset distributed word vector model by using the multiple participles according to a preset distributed word vector representation method, to obtain a distributed word vector corresponding to each participle.
In this embodiment, the distributed word vector representation method may be a distributed vector representation method based on word embedding (Word Embedding). The participles in the health information data set are encoded (represented as vectors) using this word embedding based distributed representation.
The distributed word vector model can be a word2vec model or a GloVe model. The word2vec model is a typical three-layer feedforward network consisting of an input layer, a hidden (mapping) layer and an output layer; its input and output are constructed from the context of words in the information base, so that the contextual semantic relations of words can be learned. The dimensionality can be predefined, for example 250 dimensions to represent the context of all words; each dimension is a mixture of multiple senses, which is referred to as a distributed semantic representation. The input and output vectors of the word2vec model are one-hot encodings of each word based on its dictionary position; for example, if "healthy" is numbered 500 in the dictionary, position 500 is 1 and all other positions are 0. The word2vec model has two training methods whose definitions of input and output are opposite: one, the continuous bag-of-words (CBOW) model, predicts a word from its context words; the other, Skip-gram, predicts the context words from the word itself. Their network structures and optimization methods differ little, and both aim to obtain a relatively compact semantic representation of words.
In many natural language processing tasks, the distributed vector representation of words has become a cornerstone of quantitative computation on natural language because it quantifies a word well through its context semantics. Therefore, the health information data set is used as the training data set of the word2vec model, and the word2vec model is trained with it: the sequences of participles in the health information data set are input into the word2vec model, and appropriate parameters are set, such as the distributed dimensionality of the words, the context window size, the number of iteration epochs and the training method, so that the word2vec model outputs the distributed word vector corresponding to each participle.
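As a concrete illustration of this training step, the following sketch uses the gensim library's word2vec implementation (an assumed choice; the patent does not prescribe a particular toolkit). The parameter values mirror the examples above (250 dimensions, Skip-gram) but are only example settings, and corpus is the tokenized health information data set from the previous sketch.

from gensim.models import Word2Vec

w2v = Word2Vec(
    sentences=corpus,   # tokenized health information data set
    vector_size=250,    # distributed dimensionality of each word
    window=5,           # context window size
    sg=1,               # 1 = Skip-gram, 0 = CBOW
    min_count=1,
    epochs=10,          # iteration epochs
)
word_vectors = w2v.wv   # maps each participle to its distributed word vector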
Step S240, according to the distributed word vector corresponding to each word segmentation, performing depth semantic vector coding on each information in the information base.
In the word embedding based distributed word vector representation, the context semantics of the participles are additive, so the depth semantic vector of each piece of information can be obtained as a weighted average of the vectors of its participles.
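A minimal sketch of step S240 under the same assumptions: each piece of information is encoded as the average of its word vectors. The patent only requires some weighted average, so the uniform weighting used here is an assumption.

import numpy as np

def encode(tokens):
    """Depth semantic vector of one piece of information: mean of its word vectors."""
    vecs = [word_vectors[w] for w in tokens if w in word_vectors]
    if not vecs:
        return np.zeros(w2v.vector_size)
    return np.mean(vecs, axis=0)

doc_vectors = np.vstack([encode(toks) for toks in corpus])   # one row per piece of information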
EXAMPLE III
The present embodiment further describes the construction of semantic networks.
FIG. 3 is a flowchart of steps of semantic network construction according to a third embodiment of the present invention.
Step S310, according to the depth semantic vector of each information, calculating the similarity between every two information to obtain a semantic similarity matrix.
The semantic similarity matrix includes the similarity between any two pieces of information in the information base.
In this embodiment, the purpose of the similarity calculation is to find mutual support between similar viewpoints. For example, information A expresses viewpoint a and information B expresses viewpoint b. If viewpoint a and viewpoint b have similar semantics, information A and information B support each other, and the strength of this support can be defined as the semantic similarity S(a, b) of a and b: the higher the similarity, the stronger the support; the lower the similarity, the weaker the support. In this process, the depth semantic vectors of information A and information B are denoted v_a and v_b respectively.
Although the semantics of the word embedding based distributed word vector representation are additive, this embodiment uses not only the direction similarity S_pos(a, b) but also adds an amplitude similarity S_str(a, b); together they measure the similarity of the two pieces of information.
The direction similarity S_pos(a, b) may be the cosine similarity, which is defined as:

S_pos(a, b) = (v_a · v_b) / (||v_a|| · ||v_b||)

where ||v_a|| denotes the modulus (norm) of the vector v_a and ||v_b|| denotes the modulus of the vector v_b.
The amplitude similarity S_str(a, b) is defined as follows:

[equation image in the original defining S_str(a, b)]
Thus, the similarity of information A and B can be defined as the weighted sum of the two similarities:

S(a, b) = λ · S_pos(a, b) + (1 - λ) · S_str(a, b)

where the parameter λ (0.5 < λ < 1) is a preset value used to adjust the weights of the direction similarity and the amplitude similarity. In this embodiment, the direction similarity represents the consistency of the direction of the expressed viewpoint in the semantic space, and the amplitude similarity represents the consistency of its strength; direction is usually more important than strength, so in this embodiment the value range is S(a, b) ∈ (-λ, 1).
After the similarity of any two pieces of information in the information base is calculated, a semantic similarity matrix can be constructed according to the obtained multiple similarities.
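The following sketch computes the combined similarity and assembles the semantic similarity matrix. Because the amplitude similarity formula appears in the original only as an image, the min/max norm ratio used for S_str below is an assumption chosen to keep S(a, b) within the stated range; λ = 0.7 is an example value.

import numpy as np

LAMBDA = 0.7   # example value, 0.5 < λ < 1

def similarity(va, vb, lam=LAMBDA):
    na, nb = np.linalg.norm(va), np.linalg.norm(vb)
    if na == 0 or nb == 0:
        return 0.0
    s_pos = float(va @ vb / (na * nb))          # direction similarity (cosine)
    s_str = float(min(na, nb) / max(na, nb))    # assumed amplitude similarity
    return lam * s_pos + (1 - lam) * s_str

n = len(doc_vectors)
S = np.array([[similarity(doc_vectors[i], doc_vectors[j]) for j in range(n)]
              for i in range(n)])               # dense semantic similarity matrix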
And step S320, performing principal component analysis on the semantic similarity matrix to construct a sparse semantic similarity matrix.
Because the depth semantic vectors of information usually have high dimensionality, the probability that the semantics of two pieces of information are completely orthogonal (i.e., that their similarity is 0) is extremely small, which means the semantic similarity matrix is dense. This density is partly a result of the distributed representation of each sense in the word embedding based distributed word vector representation, and partly due to high-frequency noise in the information base that is unrelated to the subject semantics of the information.
To eliminate the influence of high-frequency noise in the semantics, Principal Component Analysis (PCA) can be performed on the dense semantic similarity matrix: mathematically, the matrix is subjected to Singular Value Decomposition (SVD) and then reconstructed to obtain a sparse representation. The semantic similarity matrix obtained after reconstruction is the sparse semantic similarity matrix, an approximation of the original semantic similarity matrix. Besides eliminating the influence of high-frequency noise, it also reduces the amount of computation in subsequent operations, making the subsequent random walk algorithm more robust.
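A sketch of this sparsification step, assuming plain NumPy SVD; the number of retained components and the threshold below which reconstructed similarities are dropped are illustrative choices, not values given in the patent.

import numpy as np

def sparsify(S, k=20, threshold=0.3):
    """Low-rank (principal component) reconstruction of S, then drop weak links."""
    U, sigma, Vt = np.linalg.svd(S)
    S_hat = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]     # keep k leading components
    S_sparse = np.where(S_hat >= threshold, S_hat, 0.0)   # suppress high-frequency noise
    np.fill_diagonal(S_sparse, 0.0)                       # no self-loops in the simple graph
    return S_sparse

S_sparse = sparsify(S)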
And S330, constructing a single-connected undirected weighted simple graph as a semantic network according to the semantic similarity matrix and the sparse semantic similarity matrix.
And 1, constructing a weighted undirected simple graph according to the sparse semantic similarity matrix.
A weighted undirected simple graph is a graph in which each pair of vertices is associated with at most one edge, no vertex has an edge to itself (i.e., no self-loops), and the edges carry weights.
A weighted undirected simple graph is constructed using the sparse semantic similarity matrix as the adjacency matrix. This weighted undirected simple graph is in fact a semantic context network, and each piece of information corresponds to a node in the weighted undirected graph.
And 2, determining a plurality of disconnected subgraphs contained in the weighted undirected simple graph.
A disconnected subgraph is a subgraph that has no connections to other subgraphs.
Because the principal component analysis removes the semantic context connections between some nodes, the weighted undirected simple graph may no longer be a single connected network. For the needs of subsequent analysis, the unconnected sub-networks in the weighted undirected simple graph must be found and bridges constructed between them so that the semantic context of the whole network is connected.
And 3, inquiring the similarity of the node pairs among the disconnected subgraphs in the semantic similarity matrix.
A node pair comprises two nodes, one located in each of the two disconnected subgraphs.
In order to affect the original semantic context as little as possible, the disconnected subgraphs should be connected with as few edges as possible, and those edges should carry as much of the semantic context between the subgraphs as possible.
And 4, connecting the node pairs with the maximum similarity in the weighted undirected simple graph, and using the maximum similarity as the weight of the connection to form the singly-connected weighted undirected simple graph.
Between each pair of disconnected subgraphs: a first node in the first disconnected subgraph and a second node in the second disconnected subgraph are determined, and the similarity between the first node and the second node is queried in the semantic similarity matrix. The first disconnected subgraph contains a plurality of first nodes and the second disconnected subgraph contains a plurality of second nodes; the similarity between every first node and every second node is queried, the obtained similarities are sorted and the maximum similarity is determined, the first node and second node corresponding to the maximum similarity are connected, and the maximum similarity is used as the weight of the connection, so that the first disconnected subgraph and the second disconnected subgraph become connected.
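The construction can be sketched with the networkx library (an assumed choice): build the weighted undirected simple graph from the sparse matrix, find its connected components, and bridge each pair of disconnected subgraphs through the node pair with the highest similarity in the original dense matrix S.

import itertools
import networkx as nx

G = nx.from_numpy_array(S_sparse)              # nodes 0..n-1, edge weight = similarity

components = [list(c) for c in nx.connected_components(G)]
for comp_a, comp_b in itertools.combinations(components, 2):
    # node pair with the largest similarity in the dense matrix S bridges the two subgraphs
    i, j = max(((u, v) for u in comp_a for v in comp_b), key=lambda p: S[p[0], p[1]])
    G.add_edge(i, j, weight=S[i, j])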
Example four
This embodiment further describes how to score the reliability of the information.
Fig. 4 is a flowchart of the steps of reliability scoring according to the fourth embodiment of the present invention.
Step S410, based on the random walk model, carrying out support scoring on the information corresponding to each node in the semantic network.
In this embodiment, a random walk model is implemented in the semantic network to complete the data support scoring of each node in the semantic network.
For example, the health information base has N pieces of information and the semantic similarity matrix is M, where the similarity between health information i and j is s_ij. The statement support score of information i at step t of the iteration is denoted RE_i^(t), and the initial statement support score RE_i^(0) of information i is obtained from the semantic similarity matrix as follows:

[equation image in the original defining the initial support score RE_i^(0)]

In the random walk model, the statement support score of node i is obtained from the support scores of the other nodes in the previous step: one part comes from adjacent nodes, and the other part comes from the random average contribution of the other nodes. The iterative formula for the statement support score of node i at step t+1 is therefore:

[equation image in the original giving the iterative formula for RE_i^(t+1)]

where P is a preset value representing the probability that, when two nodes in the semantic network are connected, a node chooses to walk to the adjacent node, and 1-P correspondingly represents the probability of randomly selecting any other node, adjacent or not; in this embodiment the preferred range is 0.5 ≤ P ≤ 1. W denotes the semantic network, i and j are adjacent nodes in the semantic network, k is another node (k ≠ i and k ≠ j), w_ij is the weight of the connection between information i and j, i.e., the similarity of information i and j; w_kj is the weight of the connection between information k and j, i.e., the similarity of information k and j; and s_ik is the similarity of i and k.
Thus, through the initial condition and the iterative formula, the data support score of each node can be obtained.
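Since the exact iterative formula is reproduced above only as an image, the following sketch uses a PageRank-style stand-in that matches the description: with probability P the walker moves to a neighbour in proportion to edge weight, and with probability 1-P it jumps to a random node; the stationary vector RE is used as the statement support score. P = 0.85 is an example value within the stated range.

import numpy as np
import networkx as nx

P = 0.85                                        # example walk probability

W = nx.to_numpy_array(G)                        # weighted adjacency of the semantic network
col_sums = W.sum(axis=0)
T = W / np.where(col_sums == 0, 1, col_sums)    # column-stochastic transition matrix

n = W.shape[0]
RE = np.full(n, 1.0 / n)                        # initial statement support scores
for _ in range(100):                            # iterate until approximately stable
    RE_next = P * (T @ RE) + (1 - P) / n
    if np.abs(RE_next - RE).sum() < 1e-9:
        break
    RE = RE_next
support_scores = RE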
Step S420, extracting keywords and topics from the information corresponding to the central node of the semantic network, and using the keywords and topics as the keywords and topics of the information base.
And determining a central node in the semantic network according to the node contribution amount of the nodes in the semantic network.
The types of the node contribution amount include: degree center (degree center) contribution, proximity center (close center) contribution, betweenness center (betweenness center) contribution, and eigen value center (eigen center) contribution.
The degree center contribution, which is the number of edges connecting the node (if an edge is weighted, the weighted sum of the edges), is also commonly referred to simply as degree.
The proximity (closeness) center contribution is determined by the average distance between the node and the remaining nodes in the semantic network.
The betweenness center contribution is the number of times the node serves as a bridge in viewpoint propagation, i.e., the number of times it acts as the bridge between two other nodes.
The eigenvalue center contribution is obtained by eigenvalue decomposition of the network adjacency matrix: nodes associated with the eigenvector of the largest eigenvalue have high eigenvalue centrality. It is similar to the degree center, but it also takes into account the centrality of the nodes it is connected to. The eigenvalue center considers both the temporal and spatial factors of propagation, so such a node can spread an information viewpoint more widely in a shorter time.
To balance the four types of central node measures, they are considered together: the four centrality measures are calculated for each node, the minimum of the four is taken as the node's contribution to being a central node, and the nodes with large contributions are then sought among all nodes, so that the health information corresponding to the central node of the semantic network can be found. Specifically, the four types of node contribution, namely the degree center contribution, the proximity center contribution, the betweenness center contribution and the eigenvalue center contribution, are calculated for each node in the semantic network; for each node, the minimum of these four contributions is determined; the minima of all nodes are sorted in descending order, the maximum node contribution is determined, and the node corresponding to this maximum contribution is taken as the central node. That is, the minimum of the four node contributions of the central node is the largest among all nodes.
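A sketch of this selection using networkx centrality functions (assumed implementations; how edge weights enter each measure is an approximation of the description above):

import networkx as nx

centralities = [
    nx.degree_centrality(G),                                       # degree center contribution
    nx.closeness_centrality(G),                                    # proximity center contribution
    nx.betweenness_centrality(G, weight="weight"),                 # betweenness center contribution
    nx.eigenvector_centrality(G, weight="weight", max_iter=1000),  # eigenvalue center contribution
]

# each node's contribution is the minimum of its four measures; the central node maximizes it
contribution = {node: min(c[node] for c in centralities) for node in G.nodes}
central_node = max(contribution, key=contribution.get)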
And determining information corresponding to one or more types of central nodes according to the types of the central nodes, and extracting keywords and topics from the determined information.
And performing TFIDF (term frequency-inverse document frequency) calculation on the information content corresponding to the central node to complete extraction of the keywords and basic topic analysis.
Of course, the extraction can also be performed according to the characteristics of keywords and titles. Since a keyword is the word that appears most often in a piece of information, the most frequently occurring words can be extracted as its keywords. Since the topic generally appears in the title of the information, the content at the <title> position of the information's Hyper Text Markup Language (HTML) document can be extracted as its topic.
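A minimal TF-IDF sketch for keyword extraction from the central node's text, assuming scikit-learn and the pre-tokenized corpus from the earlier sketches; the number of keywords kept is arbitrary, and topic extraction from the <title> element is omitted here.

from sklearn.feature_extraction.text import TfidfVectorizer

texts = [" ".join(toks) for toks in corpus]               # space-joined, pre-tokenized documents
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
tfidf = vectorizer.fit_transform(texts)

row = tfidf[central_node].toarray().ravel()               # TF-IDF weights of the central information
terms = vectorizer.get_feature_names_out()
keywords = [terms[i] for i in row.argsort()[::-1][:10]]   # top-10 terms as keywords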
Step S430, performing depth semantic vector coding on the keywords and topics of the information base.
Step S440, respectively calculating the similarity between the depth semantic vector of each information in the information base and the depth semantic vector of the information base, as the self-evidence score of each information.
Step S450, obtaining the reliability score of each piece of information according to the support score and the self-evidence score of each piece of information.
The depth semantic vector of information i is v_i and the depth semantic vector of the information base is v_c. The statement self-evidence score of information i is then E_ic = cos(v_i, v_c), and the reliability score of information i is obtained as the weighted sum of the self-evidence score and the support score:

E_i = α · E_ic + (1 - α) · RE_i

In practical applications, 0.5 < α < 1, and the weight of the self-evidence score is generally smaller than the weight of the support score.
The reliability score of the entire information base may be defined as the average of the reliability scores of its pieces of information:

E_base = (1/N) Σ_i E_i

where N is the number of pieces of information in the information base and E_base denotes the reliability score of the information base.
thus, each piece of information in the information base and the reliability evaluation of the information base are obtained.
This embodiment combines analysis of the information's own content with the contextual structure of the information base in which it resides, so that a piece of information is both self-evidencing and corroborated by others, and is self-consistent within the context of the information base. Conversely, if incompatibilities or even contradictions arise, the reliability of the information within the information base is greatly reduced.
EXAMPLE five
The present embodiment provides an information reliability evaluation apparatus. FIG. 5 is a block diagram of an information reliability evaluating apparatus according to a fifth embodiment of the present invention.
In this embodiment, the information reliability evaluation apparatus 500 includes, but is not limited to: processor 510, memory 520.
The processor 510 is configured to execute the information reliability evaluation program stored in the memory 520 to implement the information reliability evaluation method according to the first to fourth embodiments.
Specifically, the processor 510 is configured to execute the information reliability evaluation program stored in the memory 520 to implement the following steps: respectively carrying out depth semantic vector coding on all information in the information base; calculating the similarity between every two information according to the depth semantic vector of each information to obtain a semantic similarity matrix; constructing a semantic network according to the semantic similarity matrix; and according to a preset random walk model and information corresponding to a central node in the semantic network, carrying out reliability scoring on each piece of information in the information base.
Wherein the processor is further configured to execute the computer program stored in the memory to implement the steps of: capturing common words in a preset website, and adding the common words into a preset word segmentation tool; utilizing the word segmentation tool to perform word segmentation processing on all information in the information base respectively to obtain a plurality of words; according to a preset distributed word vector representation method, training a preset distributed word vector model by using the multiple participles to obtain a distributed word vector corresponding to each participle; and carrying out depth semantic vector coding on each piece of information in the information base according to the distributed word vector corresponding to each participle.
Wherein the processor is further configured to execute the computer program stored in the memory to implement the steps of: performing principal component analysis on the semantic similarity matrix to construct a sparse semantic similarity matrix; and constructing a single-connected undirected simple graph with the weight as a semantic network according to the semantic similarity matrix and the sparse semantic similarity matrix.
Wherein the processor is further configured to execute the computer program stored in the memory to implement the steps of: constructing a weighted undirected simple graph according to the sparse semantic similarity matrix; determining a plurality of disconnected subgraphs contained in the weighted undirected simple graph; querying the similarity of node pairs among the disconnected subgraphs in the semantic similarity matrix; in the weighted undirected simple graph, the node pairs with the maximum similarity are connected, and the maximum similarity is used as the weight of the connection to form the singly-connected weighted undirected simple graph.
Wherein the processor is further configured to execute the computer program stored in the memory to implement the steps of: based on the random walk model, carrying out support scoring on information corresponding to each node in the semantic network; extracting keywords and topics from information corresponding to the central node of the semantic network to serve as the keywords and the topics of the information base; carrying out depth semantic vector coding on the keywords and the subjects of the information base; respectively calculating the similarity between the depth semantic vector of each piece of information in the information base and the depth semantic vector of the information base, and taking the similarity as the self-evidence score of each piece of information; and obtaining the reliability score of each piece of information according to the support score and the self-evidence score of each piece of information.
Wherein the processor is further configured to execute the computer program stored in the memory to implement the steps of: and obtaining the reliability score of the information base according to the reliability score of each piece of information in the information base.
EXAMPLE six
The embodiment of the invention also provides a storage medium. The storage medium herein stores one or more programs. Among others, the storage medium may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid state disk; the memory may also comprise a combination of memories of the kind described above.
When one or more programs in the storage medium are executed by one or more processors, the above-described information reliability evaluation method is implemented.
Specifically, the processor is configured to execute an information reliability evaluation program stored in the memory to implement the following steps: respectively carrying out depth semantic vector coding on all information in the information base; calculating the similarity between every two information according to the depth semantic vector of each information to obtain a semantic similarity matrix; constructing a semantic network according to the semantic similarity matrix; and according to a preset random walk model and information corresponding to a central node in the semantic network, carrying out reliability scoring on each piece of information in the information base.
Wherein the processor is further configured to execute the computer program stored in the memory to implement the steps of: capturing common words in a preset website, and adding the common words into a preset word segmentation tool; utilizing the word segmentation tool to perform word segmentation processing on all information in the information base respectively to obtain a plurality of words; according to a preset distributed word vector representation method, training a preset distributed word vector model by using the multiple participles to obtain a distributed word vector corresponding to each participle; and carrying out depth semantic vector coding on each piece of information in the information base according to the distributed word vector corresponding to each participle.
Wherein the processor is further configured to execute the computer program stored in the memory to implement the steps of: performing principal component analysis on the semantic similarity matrix to construct a sparse semantic similarity matrix; and constructing a single-connected undirected simple graph with the weight as a semantic network according to the semantic similarity matrix and the sparse semantic similarity matrix.
Wherein the processor is further configured to execute the computer program stored in the memory to implement the steps of: constructing a weighted undirected simple graph according to the sparse semantic similarity matrix; determining a plurality of disconnected subgraphs contained in the weighted undirected simple graph; querying the similarity of node pairs among the disconnected subgraphs in the semantic similarity matrix; in the weighted undirected simple graph, the node pairs with the maximum similarity are connected, and the maximum similarity is used as the weight of the connection to form the singly-connected weighted undirected simple graph.
Wherein the processor is further configured to execute the computer program stored in the memory to implement the steps of: based on the random walk model, carrying out support scoring on information corresponding to each node in the semantic network; extracting keywords and topics from information corresponding to the central node of the semantic network to serve as the keywords and the topics of the information base; carrying out depth semantic vector coding on the keywords and the subjects of the information base; respectively calculating the similarity between the depth semantic vector of each piece of information in the information base and the depth semantic vector of the information base, and taking the similarity as the self-evidence score of each piece of information; and obtaining the reliability score of each piece of information according to the support score and the self-evidence score of each piece of information.
Wherein the processor is further configured to execute the computer program stored in the memory to implement the steps of: and obtaining the reliability score of the information base according to the reliability score of each piece of information in the information base.
The above description is only an example of the present invention, and is not intended to limit the present invention, and it is obvious to those skilled in the art that various modifications and variations can be made in the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (8)

1. An information reliability evaluation method, comprising:
respectively carrying out depth semantic vector coding on all information in the information base;
calculating the similarity between every two information according to the depth semantic vector of each information to obtain a semantic similarity matrix;
constructing a semantic network according to the semantic similarity matrix;
according to a preset random walk model and information corresponding to a central node in the semantic network, reliability scoring is carried out on each information in the information base;
the reliability scoring of each information in the information base according to a preset random walk model and information corresponding to a central node in the semantic network comprises the following steps:
based on the random walk model, carrying out support scoring on information corresponding to each node in the semantic network;
extracting keywords and topics from information corresponding to the central node of the semantic network to serve as the keywords and the topics of the information base;
carrying out depth semantic vector coding on the keywords and the subjects of the information base;
respectively calculating the similarity between the depth semantic vector of each piece of information in the information base and the depth semantic vector of the information base, and taking the similarity as the self-evidence score of each piece of information;
and obtaining the reliability score of each piece of information according to the support score and the self-evidence score of each piece of information.
2. The method of claim 1, wherein the depth semantic vector encoding is performed on all information in the information base respectively, and comprises:
capturing common words in a preset website, and adding the common words into a preset word segmentation tool;
utilizing the word segmentation tool to perform word segmentation processing on all information in the information base respectively to obtain a plurality of words;
according to a preset distributed word vector representation method, training a preset distributed word vector model by using the multiple participles to obtain a distributed word vector corresponding to each participle;
and carrying out depth semantic vector coding on each piece of information in the information base according to the distributed word vector corresponding to each participle.
3. The method according to claim 1, wherein the constructing a semantic network according to the semantic similarity matrix comprises:
performing principal component analysis on the semantic similarity matrix to construct a sparse semantic similarity matrix;
and constructing a single-connected undirected simple graph with the weight as a semantic network according to the semantic similarity matrix and the sparse semantic similarity matrix.
4. The method according to claim 3, wherein constructing a single connected weighted undirected simple graph according to the semantic similarity matrix and the sparse semantic similarity matrix comprises:
constructing a weighted undirected simple graph according to the sparse semantic similarity matrix;
determining a plurality of disconnected subgraphs contained in the weighted undirected simple graph;
querying the similarity of node pairs among the disconnected subgraphs in the semantic similarity matrix;
in the weighted undirected simple graph, the node pairs with the maximum similarity are connected, and the maximum similarity is used as the weight of the connection to form the singly-connected weighted undirected simple graph.
5. The method of claim 1, further comprising:
and obtaining the reliability score of the information base according to the reliability score of each piece of information in the information base.
6. An information reliability evaluation apparatus, characterized by comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program when executed by the processor implementing the steps of:
respectively carrying out depth semantic vector coding on all information in the information base;
calculating the similarity between every two information according to the depth semantic vector of each information to obtain a semantic similarity matrix;
constructing a semantic network according to the semantic similarity matrix;
according to a preset random walk model and information corresponding to a central node in the semantic network, reliability scoring is carried out on each information in the information base;
the processor is further configured to execute a computer program stored in the memory to implement the steps of:
based on the random walk model, carrying out support scoring on information corresponding to each node in the semantic network;
extracting keywords and topics from information corresponding to the central node of the semantic network to serve as the keywords and the topics of the information base;
carrying out depth semantic vector coding on the keywords and the subjects of the information base;
respectively calculating the similarity between the depth semantic vector of each piece of information in the information base and the depth semantic vector of the information base, and taking the similarity as the self-evidence score of each piece of information;
and obtaining the reliability score of each piece of information according to the support score and the self-evidence score of each piece of information.
7. The apparatus of claim 6, wherein the processor is further configured to execute a computer program stored in the memory to perform the steps of:
and obtaining the reliability score of the information base according to the reliability score of each piece of information in the information base.
8. A storage medium having stored thereon an information reliability evaluation program, the information reliability evaluation program being executed by a processor to implement the steps of the information reliability evaluation method according to any one of claims 1 to 5.