CN114386421A - Similar news detection method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN114386421A
CN114386421A (application CN202210035103.6A)
Authority
CN
China
Prior art keywords
news
similar
vector
target
abstract
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210035103.6A
Other languages
Chinese (zh)
Inventor
严勇文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210035103.6A priority Critical patent/CN114386421A/en
Publication of CN114386421A publication Critical patent/CN114386421A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2237Vectors, bitmaps or matrices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a similar news detection method and device, computer equipment and a storage medium, wherein the method comprises the following steps: when similar news of the target news is to be determined, performing basic information acquisition on the target news, which at least comprises extracting the abstract of the target news; inputting the abstract into a preconfigured twin network model to obtain vector representation information of the target news; searching a preconfigured vector database according to the vector representation information to determine whether similar vector representation information of the target news exists in the vector database; and when similar vector representation information of the target news exists in the vector database, retrieving the similar news of the target news from a preconfigured historical news database based on the similar vector representation information. The method improves the processing efficiency of similar news determination.

Description

Similar news detection method and device, computer equipment and storage medium
Technical Field
The invention relates to the technical field of big data analysis, in particular to a similar news detection method, a similar news detection device, computer equipment and a storage medium.
Background
A similarity algorithm for massive text is a fundamental building block of text processing: many applications, such as news deduplication in news analysis and web-page deduplication in search engines, require a similarity algorithm that can handle massive amounts of text.
At present, the mainstream similarity calculation method for massive text is the simhash algorithm. Simhash is a locality-sensitive hashing algorithm. Its principle is to decompose a text into words, compute the hash value of each word, and take a per-bit weighted sum; positions whose sum is greater than 0 become 1 and the remaining positions become 0, yielding the hash fingerprint of the text. Whether two texts are similar is then judged by comparing the Hamming distance of their fingerprints: if the Hamming distance exceeds a threshold the texts are considered dissimilar, otherwise they are considered similar. However, locality-sensitive hashing such as simhash cannot judge whether two news items describe the same thing at the semantic level (for example, when the wording differs greatly but both texts describe the same event).
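The simhash procedure just described can be sketched in a few lines of Python. This is an illustrative sketch only: the md5 hash, the 64-bit width, and the example word weights are common choices and not mandated by this disclosure.

```python
import hashlib

def simhash(weighted_words, bits=64):
    """Decompose a text into (word, weight) pairs, hash each word, take a
    per-bit weighted sum, and threshold at zero to obtain the fingerprint."""
    v = [0] * bits
    for word, weight in weighted_words:
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += weight if (h >> i) & 1 else -weight
    # positions whose sum is greater than 0 become 1, the rest become 0
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Texts are judged similar when the Hamming distance is below a threshold.
doc1 = [("news", 3), ("similar", 2), ("detection", 1)]
doc2 = [("news", 3), ("similar", 2), ("method", 1)]
d = hamming_distance(simhash(doc1), simhash(doc2))
```

Because fingerprints can be precomputed and stored, comparing two texts then costs only one XOR and a popcount, which is what makes simhash fast at scale.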
Besides the basic simhash algorithm, similar-text query based on BERT does have semantic capability, but once the news corpus reaches the million level the query speed becomes very slow and cannot be used in a high-traffic online environment. Conventionally, judging a pair of labeled news items with BERT requires feeding each pair into the model for inference, one pair at a time. For example, if news items A, B, C and D are labeled, then A [SEP] B, A [SEP] C and A [SEP] D must be computed as 3 separate inputs to BERT ([SEP] is the separator token in BERT). With on the order of a million labeled news items, each newly arrived target news item would require on the order of a million inference computations, which takes about an hour on a V100 GPU and cannot meet the latency requirement of the project.
Disclosure of Invention
The application provides a similar news detection method, a similar news detection device, computer equipment and a storage medium.
A first aspect provides a method of similar news detection, the method comprising:
when similar news of the target news is determined, basic information acquisition is carried out on the target news, and the basic information acquisition at least comprises the following steps: extracting the abstract of the target news;
inputting the abstract into a pre-configured twin network model to obtain vector representation information of the target news;
searching in a pre-configured vector database according to the vector characterization information, and determining whether similar vector characterization information of the target news exists in the vector database; the vector database stores vector representation information extracted by processing historical news when similar news of the historical news is detected;
and when the similar vector representation information of the target news exists in the vector database, retrieving the similar news of the target news from a pre-configured historical news database based on the similar vector representation information.
In some embodiments, the extracting the summary of the target news includes:
cutting the target news into sentences to obtain a sentence list;
inputting the sentence list into a preconfigured Bert model, and extracting to obtain text features;
inputting the text features into an odd-even sentence coding layer, which identifies whether each sentence occupies an odd or even position in the sentence list and encodes odd- and even-position sentences separately;
and decoding with a decoder extracted from a Transformer model to obtain the abstract of the target news.
In some embodiments, when similar vector representation information of the target news exists in the vector database, the similar news detection method further includes:
inputting the abstract into a preconfigured entity extraction model to obtain a named entity in the abstract;
inputting the abstract into a pre-configured keyword extraction model to obtain keywords in the abstract;
inputting the abstract into a pre-configured classification model to obtain a target news classification of the target news determined according to the abstract;
marking the unique ID of the target news;
converting the target news, the abstract, the named entities, the keywords, the target news classification and the unique ID into database index information based on the index-building algorithm of a relational database, and storing the database index information into the historical news database;
and converting the text feature vector and the unique ID into vector index information, and storing the vector index information into the vector database.
In some embodiments, after determining that the historical news corresponding to the similarity vector characterization information is similar to the target news, the method further includes:
determining whether the target news and the historical news are semantically the same news according to one or a combination of the Hamming distance, the first overlap ratio, the second overlap ratio and the third overlap ratio between the target news and the historical news; wherein the Hamming distance is the Hamming distance between the simhash value of the target news and the simhash value of the historical news; the first overlap ratio is the overlap between the named entities of the target news and those of the historical news; the second overlap ratio is the overlap between the keywords of the target news and those of the historical news; and the third overlap ratio is the overlap between the target news classification of the target news and that of the historical news.
In some embodiments, said extracting named entities in said summary comprises:
inputting the abstract of the target news into a preconfigured BERT-BiLSTM-CRF model to obtain the named entities in the abstract; wherein the BERT-BiLSTM-CRF model comprises a BERT pre-training model layer, a BiLSTM network layer and a CRF inference layer: the BERT pre-training model layer encodes each character to obtain the word vector of that character; the BiLSTM network layer bidirectionally encodes the sequence formed by the word vectors to obtain new feature vectors; and the CRF inference layer outputs the named entity with the maximum probability based on the new feature vectors.
In some embodiments, where the abstract of the target news is taken as the input of a preconfigured twin network model, the training method of the twin network model comprises:
acquiring a plurality of pairs of identical training news items and a plurality of pairs of similar training news items;
performing basic information acquisition on the training news, and extracting the abstract of the training news;
inputting the abstracts of the identical training news into the twin network model as positive samples and the abstracts of the similar training news as negative samples; the twin network model converts each abstract into a vector using a Bert model, the two output vectors are passed through an average pooling layer, and similarity calculation is performed on them to obtain the similarity of the two training news items;
and training the twin network model according to the similarity of the two training news.
In some embodiments, the determining whether similar vector representation information of the target news exists in the vector database includes:
and determining whether similar vector representation information of the target news exists in the vector database according to the cosine similarity between the vector representation information of the target news and the vector representation information in the vector database.
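The cosine-similarity check described above can be sketched as follows; the 0.9 threshold and the dictionary-shaped database are illustrative assumptions, since the patent does not fix either.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def find_similar(query_vec, vector_db, threshold=0.9):
    """Return (news_id, score) pairs whose cosine similarity to the query
    meets the threshold, best first; an empty list means no similar news."""
    hits = [(news_id, cosine_similarity(query_vec, vec))
            for news_id, vec in vector_db.items()]
    return sorted([h for h in hits if h[1] >= threshold],
                  key=lambda h: -h[1])
```

An empty result corresponds to the case where no similar vector representation information of the target news exists in the vector database.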
A second aspect provides a similar news detection apparatus, comprising:
the basic information acquisition unit is used for acquiring basic information of the target news when similar news of the target news is determined, and the basic information acquisition at least comprises the following steps: extracting the abstract of the target news;
the twin network model unit is used for inputting the abstract into a pre-configured twin network model to obtain vector representation information of the target news;
the vector database unit is used for searching in a pre-configured vector database according to the vector characterization information and determining whether similar vector characterization information of the target news exists in the vector database; the vector database stores vector representation information extracted by processing historical news when similar news of the historical news is detected;
and the result output unit is used for searching the similar news of the target news in a pre-configured historical news database based on the similar vector representation information when the similar vector representation information of the target news exists in the vector database.
A third aspect provides a computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the similar news detection method described above.
A fourth aspect provides a storage medium having stored thereon computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the similar news detection method described above.
According to the similar news detection method and device, the computer equipment and the storage medium, firstly, when the similar news of the target news is to be determined, basic information acquisition is performed on the target news, which at least comprises extracting the abstract of the target news; secondly, the abstract of the target news is taken as the input of a preconfigured twin network model to obtain the vector representation information of the target news; then, a preconfigured vector database is searched according to the vector representation information of the target news to determine whether similar vector representation information of the target news exists in the vector database, the vector database storing the vector representation information extracted when similar news of the historical news was detected; and finally, when similar vector representation information of the target news exists in the vector database, the historical news corresponding to that similar vector representation information is determined to be similar to the target news. In this way, through the SBERT model, text comparison of identical news is converted into a vector-similarity comparison problem, and the vector representation information of all historical news is obtained in advance with the help of the preconfigured vector database, so that semantic-level judgment of identical news can scale to news corpora on the order of hundreds of millions of items and return a result within 100 milliseconds (typical value); that is, compared with the prior art, both the precision of the matching result and the matching efficiency are improved.
Drawings
FIG. 1 is a diagram of an environment for implementing a similar news detection method provided in one embodiment;
FIG. 2 is a flow diagram of a similar news detection method in one embodiment;
FIG. 3 is a schematic diagram of a twin network model of a similar news detection method in one embodiment;
fig. 4 is a block diagram showing the structure of a similar news detection apparatus in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of understanding, technical terms related to the embodiments of the present invention are first explained:
twin neural networks (siameseeuralnetworks), also known as twins, are coupled frameworks built on two artificial neural networks. The twin neural network takes two samples as input and outputs the characterization of embedding high-dimensional space of the two samples so as to compare the similarity degree of the two samples. The narrowly defined twin neural network is formed by splicing two neural networks which have the same structure and share the weight. A generalized twin neural network, or a pseudo-twin neural network (pseudo-twin neural network), may be formed by splicing any two neural networks. Twin neural networks typically have a deep structure and may consist of convolutional neural networks, cyclic neural networks, and the like. In the supervised learning paradigm, a twin neural network maximizes the characterization of different tags and minimizes the characterization of the same tag. In an unsupervised or unsupervised learning paradigm, a twin neural network can minimize the characterization between the original input and the interfering input (e.g., the original image and the clipping of the image). The twin neural network can perform one-shot learning (one-shot learning), and is not easily interfered by an error sample, so that the twin neural network can be used for pattern recognition problems with strict requirements on fault tolerance, such as portrait recognition, fingerprint recognition, target tracking, and the like.
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language representation model. Unlike other recent language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on context in all layers. As a result, the pre-trained BERT representation can be fine-tuned with just one additional output layer to build state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
Simhash is one of the commonly used text deduplication hash algorithms, similar to md5, crc32, etc. Its principle is to map a large piece of text to a hash value of only 8 bytes by weighting the keywords extracted from the text data. Simhash does not support direct similarity calculation on texts, but the generated hash values can be compared with the Hamming distance, from which the similarity between texts is derived. Since the Hamming distance is computed on the simhash results rather than on the original text data, the amount of computation is very small, and the simhash value of each text can be computed in advance once the text is obtained.
As shown in fig. 1, fig. 1 is a schematic terminal structure diagram of a hardware operating environment according to an embodiment of the present invention.
The terminal of the embodiment of the invention is a server.
As shown in fig. 1, the terminal may include: a processor 001 such as a CPU, a network interface 004, a user interface 003, a memory 005, and a communication bus 002, where the communication bus 002 enables communication among these components. The user interface 003 may include a display and an input unit such as a keyboard, and optionally may also include standard wired and wireless interfaces. The network interface 004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 005 may be a high-speed RAM memory or a non-volatile memory (e.g., a disk memory), and may alternatively be a storage device separate from the processor 001.
Those skilled in the art will appreciate that the terminal structure shown in fig. 1 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, the memory 005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a text information matching measurement program.
In the terminal shown in fig. 1, the network interface 004 is mainly used for connecting to the background server and performing data communication with the background server; the user interface 003 is mainly used for connecting a client (user terminal) and performing data communication with the client; and the processor may be used to invoke the text information matching metric program stored in the memory 005.
As shown in fig. 2, in an embodiment, a similar news detection method is provided, which specifically includes the following steps:
step 201, when determining similar news of the target news, performing basic information acquisition on the target news, wherein the basic information acquisition at least comprises the following steps: extracting an abstract of target news;
in some embodiments, extracting a summary of the target news includes:
step 201a, cutting a target news into sentences to obtain a sentence list;
Wherein the target news is segmented into sentences using [CLS] tokens.
Step 201b, inputting the sentence list into a preconfigured Bert model, and extracting text features;
The Bert model may be obtained directly from the package repository in which it is published. The Bert model (Bidirectional Encoder Representations from Transformers) is a publicly available general natural language processing framework whose internal structure comprises an embedding layer, a multi-head attention layer and a feed-forward layer: the embedding layer represents the text as a matrix, the multi-head attention layer extracts text features from that matrix, and the feed-forward layer adjusts the internal parameters of the Bert model according to the text features so as to optimize the model.
Step 201c, inputting the text features into an odd-even sentence coding layer, which identifies whether each sentence occupies an odd or even position in the sentence list and encodes odd- and even-position sentences separately;
The abstract extraction model is obtained by adding an odd-even sentence coding layer after the feed-forward layer of the Bert model. The main purpose of the odd-even sentence coding layer is to identify whether each sentence occupies an odd or even position in the sentence list, so that odd- and even-position sentences are encoded separately and the encoder can distinguish adjacent sentences.
And step 201d, decoding with a decoder extracted from a Transformer model to obtain the abstract of the target news.
The Transformer model is an open-source natural language processing model that includes a decoder. The decoder is extracted from the Transformer model and combined with the encoder to obtain the abstract extraction model.
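The odd-even sentence coding of step 201c can be illustrated with a minimal sketch that assigns alternating segment ids per sentence, patterned after interval segment embeddings in extractive summarization; the whitespace tokenization and the 0/1 id scheme are assumptions for illustration.

```python
def interval_segment_ids(sentences):
    """Prepend a [CLS] token to every sentence and assign segment id 0 to
    odd-position sentences and 1 to even-position ones, so the encoder can
    tell adjacent sentences apart."""
    token_ids, segment_ids, cls_positions = [], [], []
    for i, sent in enumerate(sentences):
        tokens = ["[CLS]"] + sent.split()
        cls_positions.append(len(token_ids))  # where this sentence's [CLS] sits
        token_ids.extend(tokens)
        segment_ids.extend([i % 2] * len(tokens))
    return token_ids, segment_ids, cls_positions
```

The vector at each recorded [CLS] position would then serve as that sentence's representation when selecting abstract sentences.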
Step 202, inputting the abstract into a pre-configured twin network model to obtain vector representation information of target news;
It can be understood that, since the twin network consists of two parallel BERT models, each input is a sentence pair, so the data needs some processing before training: pairs of similar sentences and pairs of identical sentences must be constructed, where identical in this embodiment means identical at the semantic level. During training, the two sentences are fed into the two BERT models of the twin network, which share parameters; the last-layer output of each is taken, and an average pooling strategy is applied, taking the mean over all tokens in each output dimension as the embedding vector. Let u be the output vector of the first sentence and v the output vector of the second; the cosine similarity of u and v is used as the optimization objective. Training this network fine-tunes the underlying BERT network.
As shown in fig. 3, the twin network uses a BERT pre-training model to obtain a vector for each sentence from the text, produces two outputs (u, v) through pooling and a fully connected layer (dense), and computes the cosine similarity of the two outputs to obtain the final similarity probability value.
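The forward pass of fig. 3 — shared encoder, average pooling, cosine similarity — can be sketched as follows, with `encode` a hypothetical stand-in for the shared BERT encoder (it is a placeholder, not a real API):

```python
import math

def mean_pooling(token_vectors):
    """Average the last-layer token vectors into one sentence embedding."""
    dim, n = len(token_vectors[0]), len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

def twin_similarity(tokens_a, tokens_b, encode):
    """Run both sentences through the shared encoder, mean-pool each output,
    and compare the two embeddings with cosine similarity."""
    u = mean_pooling(encode(tokens_a))
    v = mean_pooling(encode(tokens_b))
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv)
```

Because the two branches share one `encode`, each news item can also be embedded once and reused, which is what makes the vector-database lookup possible.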
In some embodiments, where in step 202 the abstract of the target news is taken as the input of a preconfigured twin network model, the training method of the twin network model comprises:
Step 202a, acquiring a plurality of pairs of identical training news items and a plurality of pairs of similar training news items;
Step 202b, performing basic information acquisition on the training news, and extracting the abstract of the training news;
Step 202c, inputting the abstracts of the identical training news into the twin network model as positive samples and the abstracts of the similar training news as negative samples; the twin network model converts each abstract into a vector using a Bert model, the two output vectors are passed through an average pooling layer, and similarity calculation is performed on them to obtain the similarity of the two training news items;
And step 202d, training the twin network model according to the similarity of the two training news items.
Step 203, searching in a pre-configured vector database according to the vector characterization information, and determining whether similar vector characterization information of the target news exists in the vector database; the vector database stores vector representation information extracted by processing historical news when similar news of the historical news is detected;
It is understood that the title plus abstract is passed through the SBERT model, the pooled vectors are averaged, and the result is inserted into the Milvus vector database (Milvus supports near-real-time search: a vector can be retrieved shortly after it is inserted). The target news vector is then searched in a vector database such as Milvus or Faiss.
And 204, when the similar vector representation information of the target news exists in the vector database, retrieving the similar news of the target news from the pre-configured historical news database based on the similar vector representation information.
In some embodiments, determining whether similar vector representation information to the vector representation information of the target news exists in the vector database may include: and determining whether similar vector representation information of the target news exists in the vector database according to the cosine similarity between the vector representation information of the target news and the vector representation information in the vector database.
In some embodiments, when similar vector representation information of the target news exists in the vector database, the similar news detection method in an embodiment further includes:
(1) extracting named entities in the abstract of the target news;
in some embodiments, extracting named entities in a summary of target news includes:
inputting the abstract of the target news into a preconfigured BERT-BiLSTM-CRF model to obtain the named entities in the abstract of the target news; wherein the BERT-BiLSTM-CRF model comprises a BERT pre-training model layer, a BiLSTM network layer and a CRF inference layer: the BERT pre-training model layer encodes each character to obtain the word vector of that character; the BiLSTM network layer bidirectionally encodes the sequence formed by the word vectors to obtain new feature vectors; and the CRF inference layer outputs the named entity with the maximum probability based on the new feature vectors.
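The CRF inference layer's job — picking the globally best tag sequence from the BiLSTM's per-token scores — can be illustrated with a plain Viterbi decoder. This is a sketch: real CRF layers also learn start/end transitions and are trained jointly with the network.

```python
def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence given per-token emission
    scores (from the BiLSTM) and tag-to-tag transition scores (the CRF)."""
    n_tags = len(tags)
    score = list(emissions[0])  # first token: emission scores only
    back = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags),
                         key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score, back = new_score, back + [pointers]
    # follow back-pointers from the best final tag
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(back):
        best = pointers[best]
        path.append(best)
    return [tags[i] for i in reversed(path)]
```

With all transition scores zero this reduces to a per-token argmax; the learned transitions are what let the CRF forbid invalid tag sequences such as an I tag without a preceding B tag.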
A named entity recognition model built on BERT alleviates the difficulty and low precision of entity recognition when labeled data are insufficient and entity boundaries are fuzzy, improving both the performance and the recognition accuracy of the model.
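The CRF inference layer's maximum-probability decoding is conventionally done with the Viterbi algorithm; a simplified NumPy sketch (omitting start/stop transition scores, which a full CRF layer would include) might look like:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence as a list of tag indices.
    `emissions` is a (seq_len, n_tags) score matrix from the BiLSTM layer;
    `transitions[i][j]` is the score of moving from tag i to tag j."""
    seq_len, n_tags = emissions.shape
    score = emissions[0].copy()
    backpointers = []
    for t in range(1, seq_len):
        # Score of every (prev_tag -> tag) path ending at step t.
        combined = score[:, None] + transitions + emissions[t][None, :]
        backpointers.append(combined.argmax(axis=0))
        score = combined.max(axis=0)
    best_tag = int(score.argmax())
    path = [best_tag]
    for bp in reversed(backpointers):
        best_tag = int(bp[best_tag])
        path.append(best_tag)
    return path[::-1]
```

With zero transition scores the decoder simply follows the per-step emission maxima; non-zero transitions let the CRF penalize invalid tag sequences such as an I- tag following O.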
(2) Extracting key words in the abstract of the target news;
in some embodiments, the keywords of the abstract are extracted by the TF-IDF algorithm.
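A stdlib-only sketch of TF-IDF keyword ranking follows; a real pipeline would first segment Chinese text with a tokenizer such as LAC or jieba, and the function name is illustrative:

```python
import math
from collections import Counter

def tfidf_keywords(doc_tokens, corpus_tokens, top_k=3):
    """Rank the tokens of one document by TF-IDF against a corpus and
    return the top_k highest-scoring tokens."""
    n_docs = len(corpus_tokens)
    df = Counter()                       # document frequency per token
    for tokens in corpus_tokens:
        df.update(set(tokens))
    tf = Counter(doc_tokens)             # term frequency in this document
    scores = {
        tok: (count / len(doc_tokens)) * math.log((1 + n_docs) / (1 + df[tok]))
        for tok, count in tf.items()
    }
    return [tok for tok, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

corpus = [["a", "b"], ["a", "c"], ["a", "d", "d"]]
top = tfidf_keywords(["a", "d", "d"], corpus, top_k=1)
```

Tokens that appear in every document (like "a" above) receive an IDF near zero and drop out of the keyword list, which is the intended behavior for stop-word-like terms.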
(3) Determining target news classification of the target news according to the abstract of the target news;
in some embodiments, determining a target news category for the target news may include:
clustering the training news with an LDA model, and labeling each cluster of training news with a category;
using the labeled training news as training data for a BERT model, and training the BERT model to obtain a type analysis model;
and inputting the abstract of the target news into the type analysis model to obtain the category of the target news.
(4) Marking the unique ID of the target news;
(5) based on a relational-database construction algorithm, converting the target news, abstract, named entities, keywords, target news classification and unique ID into database index information and storing it in the historical news database;
(6) and converting the text feature vector and the unique ID into vector index information, and storing the vector index information into a vector database.
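Steps (4)-(6) can be sketched with SQLite standing in for the relational historical news database; the table name, column names, and function name are assumptions for illustration, not from the patent:

```python
import sqlite3
import json

# In-memory relational store playing the role of the historical news database.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE history_news (
    news_id  TEXT PRIMARY KEY,
    title    TEXT,
    abstract TEXT,
    entities TEXT,   -- JSON-encoded list of named entities
    keywords TEXT,   -- JSON-encoded list of keywords
    category TEXT)""")

def index_news(news_id, title, abstract, entities, keywords, category, vector):
    """Store the relational index row and return the (id, vector) pair
    that would be inserted into the vector database (Milvus/Faiss)."""
    conn.execute(
        "INSERT INTO history_news VALUES (?, ?, ?, ?, ?, ?)",
        (news_id, title, abstract, json.dumps(entities), json.dumps(keywords), category),
    )
    return news_id, vector

index_news("n-001", "title", "abstract", ["PersonX"], ["kw"], "finance", [0.1, 0.2])
row = conn.execute(
    "SELECT category FROM history_news WHERE news_id = 'n-001'").fetchone()
```

Keeping the same unique ID in both stores is what lets a vector-database hit be resolved back to the full news record during retrieval.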
Further, after step 204 the similar news of the target news has been recalled; after recall, news that is similar to but not actually the same as the target news needs to be filtered out to improve precision.
In some embodiments, after determining that the historical news corresponding to the similarity vector characterization information is similar news of the target news, the method further includes:
determining whether the target news and the historical news are semantically the same news according to one or a combination of the Hamming distance, the first overlap ratio, the second overlap ratio and the third overlap ratio between the target news and the historical news; wherein the Hamming distance is the Hamming distance between the simhash value of the target news and the simhash value of the historical news; the first overlap ratio is the overlap ratio between the named entities of the target news and the named entities of the historical news; the second overlap ratio is the overlap ratio between the keywords of the target news and the keywords of the historical news; and the third overlap ratio is the overlap ratio between the target news classification of the target news and the target news classification of the historical news.
A specific example:
If the cosine similarity between the target news and the historical news is greater than 0.8, continue; otherwise, filter out the pair.
If the Hamming distance between the simhash values of the target news and the historical news is less than 20, the match is accepted directly.
If the Hamming distance between the simhash values is greater than or equal to 20, filter as follows:
extract keywords from the title and the abstract using the rank mode of LAC (the Baidu lexical analyzer), where keywords with a score of 3 are core keywords and keywords with a score of 2 are important keywords;
if the newly arrived target news contains core keywords and every historical news item compared with it contains any one of those core keywords, the match is accepted directly;
if the newly arrived target news has no core keywords, compare the important keywords: if every historical news item compared with it contains any 2 of the important keywords, the match is accepted directly.
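The simhash fingerprint and Hamming-distance check used in this example can be sketched as follows; the choice of hash function and token granularity are illustrative, not specified by the patent:

```python
import hashlib

def simhash(tokens, bits=64):
    """Classic simhash: each token's hash casts a +1/-1 vote per bit
    position; the fingerprint keeps the sign of each bit's total."""
    votes = [0] * bits
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i, v in enumerate(votes):
        if v > 0:
            fp |= 1 << i
    return fp

def hamming_distance(a, b):
    """Number of differing bits between two fingerprints; the example
    above accepts matches directly when this is below 20."""
    return bin(a ^ b).count("1")
```

Identical token lists always produce identical fingerprints (distance 0), while near-duplicate texts differ in only a few bit positions, which is what makes the fixed-distance threshold workable.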
As shown in fig. 4, in an embodiment, a similar news detection apparatus is provided, which may specifically include:
the basic information collecting unit 411 is configured to, when similar news of the target news is determined, perform basic information collection on the target news, where the basic information collection at least includes: extracting an abstract of target news;
the twin network model unit 412 is used for taking the abstract of the target news as the input of a pre-configured twin network model to obtain the vector representation information of the target news;
the vector database unit 413 is configured to search in a preconfigured vector database according to the vector characterization information of the target news, and determine whether similar vector characterization information of the target news exists in the vector database; the vector database stores vector representation information extracted by processing the historical news when similar news of the historical news is detected;
and a result output unit 414, configured to determine, when similar vector characterization information of the target news exists in the vector database, that the historical news corresponding to the similar vector characterization information is similar news of the target news.
In one embodiment, a computer device is provided, which may include a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: when similar news of the target news is to be determined, performing basic information acquisition on the target news, the basic information acquisition at least including: extracting the abstract of the target news; taking the abstract of the target news as the input of a preconfigured twin network model to obtain the vector representation information of the target news; searching a preconfigured vector database according to the vector representation information of the target news, and determining whether similar vector representation information of the target news exists in the vector database, wherein the vector database stores vector representation information extracted from historical news when similar news of the historical news was detected; and when similar vector representation information of the target news exists in the vector database, determining that the historical news corresponding to the similar vector representation information is similar news of the target news.
In one embodiment, a storage medium is provided that stores computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of: when similar news of the target news is determined, basic information acquisition is carried out on the target news, and the basic information acquisition at least comprises the following steps: extracting an abstract of target news; taking the abstract of the target news as the input of a pre-configured twin network model to obtain the vector representation information of the target news; searching in a pre-configured vector database according to the vector characterization information of the target news, and determining whether similar vector characterization information of the target news exists in the vector database; the vector database stores vector representation information extracted by processing the historical news when similar news of the historical news is detected; and when the similar vector representation information of the target news exists in the vector database, determining that the historical news corresponding to the similar vector representation information is similar news of the target news.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the computer program is executed. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-only memory (ROM), or a Random Access Memory (RAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above examples show only some embodiments of the present invention, and while their description is specific and detailed, it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for detecting similar news, the method comprising:
when similar news of the target news is determined, basic information acquisition is carried out on the target news, and the basic information acquisition at least comprises the following steps: extracting the abstract of the target news;
inputting the abstract into a pre-configured twin network model to obtain vector representation information of the target news;
searching in a pre-configured vector database according to the vector characterization information, and determining whether similar vector characterization information of the target news exists in the vector database; the vector database stores vector representation information extracted by processing historical news when similar news of the historical news is detected;
and when the similar vector representation information of the target news exists in the vector database, retrieving the similar news of the target news from a pre-configured historical news database based on the similar vector representation information.
2. The similar news detection method of claim 1, wherein the extracting the summary of the target news comprises:
cutting the target news into sentences to obtain a sentence list;
inputting the sentence list into a preconfigured Bert model, and extracting to obtain text features;
inputting the text characteristics into an odd-even sentence coding layer, identifying whether the number of words in a sentence is odd or even, and performing separate coding on the odd sentence and the even sentence;
and decoding with a decoder extracted from the Transformer model to extract the abstract of the target news.
3. The similar news detection method of claim 1, wherein when similar vector representation information of the target news exists in the vector database, the similar news detection method further comprises:
inputting the abstract into a preconfigured entity extraction model to obtain a named entity in the abstract;
inputting the abstract into a pre-configured keyword extraction model to obtain keywords in the abstract;
inputting the abstract into a pre-configured classification model to obtain a target news classification of the target news determined according to the abstract;
marking the unique ID of the target news;
based on an establishment algorithm of a relational database, converting the target news, the abstract, the named entity, the keyword, the target news classification and the unique ID into database index information and storing the database index information into a historical news database;
and converting the text feature vector and the unique ID into vector index information, and storing the vector index information into the vector database.
4. The similar news detection method of claim 3, wherein after determining that the historical news corresponding to the similar vector representation information is similar news of the target news, the method further comprises:
determining whether the target news and the historical news are semantically the same news according to one or a combination of the Hamming distance, the first overlap ratio, the second overlap ratio and the third overlap ratio between the target news and the historical news; wherein the Hamming distance is the Hamming distance between the simhash value of the target news and the simhash value of the historical news; the first overlap ratio is the overlap ratio between the named entities of the target news and the named entities of the historical news; the second overlap ratio is the overlap ratio between the keywords of the target news and the keywords of the historical news; and the third overlap ratio is the overlap ratio between the target news classification of the target news and the target news classification of the historical news.
5. The method for detecting similar news as claimed in claim 3, wherein the extracting named entities in the abstract comprises:
inputting the abstract of the target news into a preconfigured BERT-BiLSTM-CRF model to obtain the named entities in the abstract; wherein the BERT-BiLSTM-CRF model comprises a BERT pre-training model layer, a BiLSTM network layer and a CRF inference layer, the BERT pre-training model layer being used for encoding each character to obtain the word vector of the corresponding character; the BiLSTM network layer being used for bidirectionally encoding the sequence formed by the word vectors to obtain new feature vectors; and the CRF inference layer being used for outputting the named entity with the maximum probability based on the new feature vectors.
6. The similar news detection method of claim 1, wherein the summarization of the target news is taken as an input of a pre-configured twin network model, and the training method of the twin network model comprises:
acquiring a plurality of same training news and a plurality of similar training news;
acquiring basic information of the training news, and extracting the abstract of the training news;
the abstracts of the same training news are input into a twin network model as positive samples, and the abstracts of the similar training news are input into the twin network model as negative samples; the twin network model converts each abstract into a vector with a BERT model, and after an average pooling layer, similarity calculation is performed on the two output vectors to obtain the similarity of the two training news items;
and training the twin network model according to the similarity of the two training news.
7. The similar news detection method of claim 1, wherein the determining whether similar vector representation information of the target news exists in the vector database comprises:
determining whether similar vector representation information of the target news exists in the vector database according to the cosine similarity between the vector representation information of the target news and the vector representation information in the vector database.
8. A similar news detection apparatus, comprising:
the basic information acquisition unit is used for acquiring basic information of the target news when similar news of the target news is determined, and the basic information acquisition at least comprises the following steps: extracting the abstract of the target news;
the twin network model unit is used for inputting the abstract into a pre-configured twin network model to obtain vector representation information of the target news;
the vector database unit is used for searching in a pre-configured vector database according to the vector characterization information and determining whether similar vector characterization information of the target news exists in the vector database; the vector database stores vector representation information extracted by processing historical news when similar news of the historical news is detected;
and the result output unit is used for searching the similar news of the target news in a pre-configured historical news database based on the similar vector representation information when the similar vector representation information of the target news exists in the vector database.
9. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to carry out the steps of the similar news detection method as claimed in any one of claims 1 to 7.
10. A storage medium having stored thereon computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the similar news detection method as claimed in any one of claims 1 to 7.
CN202210035103.6A 2022-01-13 2022-01-13 Similar news detection method and device, computer equipment and storage medium Withdrawn CN114386421A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210035103.6A CN114386421A (en) 2022-01-13 2022-01-13 Similar news detection method and device, computer equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114386421A true CN114386421A (en) 2022-04-22

Family

ID=81202348

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210035103.6A Withdrawn CN114386421A (en) 2022-01-13 2022-01-13 Similar news detection method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114386421A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250550A (en) * 2016-08-12 2016-12-21 智者四海(北京)技术有限公司 A kind of method and apparatus of real time correlation news content recommendation
CN112001177A (en) * 2020-08-24 2020-11-27 浪潮云信息技术股份公司 Electronic medical record named entity identification method and system integrating deep learning and rules
CN112182337A (en) * 2020-10-14 2021-01-05 数库(上海)科技有限公司 Method for identifying similar news from massive short news and related equipment
CN112528013A (en) * 2020-12-10 2021-03-19 平安科技(深圳)有限公司 Text abstract extraction method and device, electronic equipment and storage medium
WO2021196468A1 (en) * 2020-03-31 2021-10-07 深圳壹账通智能科技有限公司 Tag creation method and apparatus, electronic device and medium
CN113704386A (en) * 2021-10-27 2021-11-26 深圳前海环融联易信息科技服务有限公司 Text recommendation method and device based on deep learning and related media


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115309860A (en) * 2022-07-18 2022-11-08 黑龙江大学 False news detection method based on pseudo twin network
CN116304745A (en) * 2023-03-27 2023-06-23 济南大学 Text topic matching method and system based on deep semantic information
CN116304745B (en) * 2023-03-27 2024-04-12 济南大学 Text topic matching method and system based on deep semantic information
CN116304065A (en) * 2023-05-23 2023-06-23 美云智数科技有限公司 Public opinion text classification method, device, electronic equipment and storage medium
CN116304065B (en) * 2023-05-23 2023-09-29 美云智数科技有限公司 Public opinion text classification method, device, electronic equipment and storage medium
CN116522165A (en) * 2023-06-27 2023-08-01 武汉爱科软件技术股份有限公司 Public opinion text matching system and method based on twin structure
CN116522165B (en) * 2023-06-27 2024-04-02 武汉爱科软件技术股份有限公司 Public opinion text matching system and method based on twin structure
CN117573726A (en) * 2024-01-12 2024-02-20 邯郸鉴晨网络科技有限公司 Order information intelligent searching method based on big data
CN117573726B (en) * 2024-01-12 2024-05-03 新疆原行网智慧文旅有限公司 Order information intelligent searching method based on big data

Similar Documents

Publication Publication Date Title
CN112084337B (en) Training method of text classification model, text classification method and equipment
CN114386421A (en) Similar news detection method and device, computer equipment and storage medium
CN107229668B (en) Text extraction method based on keyword matching
CN111125460B (en) Information recommendation method and device
CN112800170A (en) Question matching method and device and question reply method and device
CN111104511B (en) Method, device and storage medium for extracting hot topics
CN113806482B (en) Cross-modal retrieval method, device, storage medium and equipment for video text
CN111428028A (en) Information classification method based on deep learning and related equipment
CN112131352A (en) Method and system for detecting bad information of webpage text type
CN113282711B (en) Internet of vehicles text matching method and device, electronic equipment and storage medium
CN114676346A (en) News event processing method and device, computer equipment and storage medium
CN113486664A (en) Text data visualization analysis method, device, equipment and storage medium
CN113626704A (en) Method, device and equipment for recommending information based on word2vec model
CN114218348A (en) Method, device, equipment and medium for acquiring live broadcast segments based on question and answer text
CN114691864A (en) Text classification model training method and device and text classification method and device
CN115168590A (en) Text feature extraction method, model training method, device, equipment and medium
CN117609479B (en) Model processing method, device, equipment, medium and product
CN114328800A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN117033626A (en) Text auditing method, device, equipment and storage medium
CN116127097A (en) Structured text relation extraction method, device and equipment
CN113010643B (en) Method, device, equipment and storage medium for processing vocabulary in Buddha field
CN113051886B (en) Test question duplicate checking method, device, storage medium and equipment
CN114610882A (en) Abnormal equipment code detection method and system based on electric power short text classification
CN111061939B (en) Scientific research academic news keyword matching recommendation method based on deep learning
CN117725555B (en) Multi-source knowledge tree association fusion method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20220422