CN110134792B

CN110134792B - Text recognition method and device, electronic equipment and storage medium

Info

Publication number: CN110134792B
Application number: CN201910431256.0A
Authority: CN
Inventors: 李长亮; 樊骏锋; 汪美玲; 唐剑波
Original assignee: Chengdu Kingsoft Interactive Entertainment Technology Co ltd; Beijing Kingsoft Digital Entertainment Co Ltd
Current assignee: Chengdu Kingsoft Interactive Entertainment Technology Co ltd; Beijing Kingsoft Digital Entertainment Co Ltd
Priority date: 2019-05-22
Filing date: 2019-05-22
Publication date: 2022-03-08
Anticipated expiration: 2039-05-22
Also published as: CN110134792A

Abstract

The present specification provides a text recognition method and apparatus, an electronic device, and a storage medium. The text recognition method includes: acquiring a text set of a plurality of texts; extracting topic keywords of the texts in the text set, and acquiring actual topic keywords extracted from at least one text in the text set; determining a first distribution of the topic keywords in each text in the text set and a second distribution of the actual topic keywords in each text in the text set; and inputting the texts in the text set carrying the first distribution and the second distribution into a classifier for recognition to obtain key sentences and non-key sentences of the texts in the text set. With this method, the key sentences and non-key sentences of a text can be acquired quickly and accurately. Cleaning out the non-key sentences makes the key sentences easier to annotate and improves the efficiency of knowledge graph construction, and retaining the key sentences lets a user quickly grasp the main content of a text when consulting it.

Description

Text recognition method and device, electronic equipment and storage medium

Technical Field

The specification relates to the technical field of natural language processing, in particular to a text recognition method. The present specification also relates to a text recognition apparatus, an electronic device, and a computer-readable storage medium.

Background

With the development of internet technology, people frequently obtain the information they need over a network. When a user queries information in a given field, it is convenient if the user can quickly learn the topic of each article returned. Screening out the topic key sentences of each article and displaying them lets the user judge, by reading those key sentences, whether an article contains the required information.

In the prior art, there are various methods for extracting the topic key sentences of each article. One of them extracts the topic keywords of each article through an unsupervised keyword screening method and then determines the topic key sentences according to the number of keywords contained in each sentence of the article.

However, since the accuracy of the topic keywords extracted by the unsupervised keyword screening method is not very high, the accuracy of extracting the topic key sentences of each article is greatly reduced, so that the topic key sentences viewed by the user are not necessarily the actual topic key sentences of the articles when the user looks up the articles.

Disclosure of Invention

In view of this, embodiments of the present disclosure provide a text recognition method to solve technical defects in the prior art. The embodiment of the specification also provides a text recognition device, an electronic device and a computer readable storage medium.

According to a first aspect of embodiments of the present specification, there is provided a text recognition method including:

acquiring a text set of a plurality of texts;

extracting a topic keyword of each text in the text set, and acquiring an actual topic keyword extracted from at least one text in the text set;

determining a first distribution of the topic keyword in each text in the text set and a second distribution of the actual topic keyword in each text in the text set;

and inputting the texts in the text set carrying the first distribution and the second distribution into a classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set.

Optionally, the extracting the topic keyword of each text in the text set includes:

performing word segmentation processing on each text in the text set through a word segmentation processing algorithm, and determining a keyword of each text in the text set according to a word segmentation processing result;

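and inputting the keywords of each text into a topic generation model for topic keyword identification, and outputting the identified keywords as the topic keywords.

Optionally, the extracting the topic keyword of each text in the text set includes:

performing word segmentation processing on each text in the text set through a word segmentation processing algorithm, and determining a keyword of each text in the text set according to a word segmentation processing result;

calculating the frequency with which each keyword occurs in its corresponding text and the reverse keyword frequency of the keyword across the texts in the text set;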

determining the keyword score of the keyword according to the product of the frequency and the reverse keyword frequency;

and taking the keywords with the scores larger than the keyword score threshold value as the topic keywords.

Optionally, the obtaining of the actual topic keyword extracted from at least one text in the text set includes:

randomly selecting at least one text from the text set, and manually extracting corresponding actual topic keywords from the at least one randomly extracted text;

and acquiring the actual subject key words of the at least one text extracted manually.

Optionally, the determining a first distribution of the topic keyword in each text in the text set and a second distribution of the actual topic keyword in each text in the text set includes:

generating a keyword distribution matrix of each text at a sentence level according to the topic keywords contained in the sentences in each text, wherein the keyword distribution matrix is used as the first distribution;

and generating an actual keyword distribution matrix of each text at a sentence level according to actual topic keywords contained in the sentences in each text, wherein the actual keyword distribution matrix is used as the second distribution.

Optionally, the classifier is constructed in the following manner:

constructing the classifier according to the association between the keyword distribution matrix and the sentences contained in each text, a preset classification rule and the corresponding weight of the sentences contained in each text;

correspondingly, the step of inputting the texts in the text set carrying the first distribution and the second distribution into a classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set, is executed;

the inputting the texts in the text set carrying the first distribution and the second distribution into a classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set, includes:

inputting the texts in the text set carrying the keyword distribution matrix and the actual keyword distribution matrix into the classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set.
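The specification does not disclose the preset classification rule or the sentence weights, so the following is only a minimal sketch under stated assumptions: each sentence is scored by a weighted sum of its keyword hits in the two matrices and compared with a threshold. The function name `classify_sentences` and the parameters `w_topic`, `w_actual` and `threshold` are hypothetical.

```python
from typing import List, Tuple


def classify_sentences(
    first: List[List[int]],   # sentence x topic-keyword matrix (first distribution)
    second: List[List[int]],  # sentence x actual-topic-keyword matrix (second distribution)
    w_topic: float = 1.0,     # assumed weight for automatically extracted keywords
    w_actual: float = 2.0,    # assumed weight for manually labeled keywords
    threshold: float = 2.0,   # assumed rule: a sentence is "key" if score >= threshold
) -> Tuple[List[int], List[int]]:
    """Return (indices of key sentences, indices of non-key sentences)."""
    key, non_key = [], []
    for i, (row1, row2) in enumerate(zip(first, second)):
        score = w_topic * sum(row1) + w_actual * sum(row2)
        (key if score >= threshold else non_key).append(i)
    return key, non_key
```

Giving the manually labeled keywords a larger weight is one way to reflect their higher accuracy; any real rule would need to be tuned on labeled data.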

Optionally, after the step of inputting the texts in the text set carrying the first distribution and the second distribution into the classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set, is executed, the method further includes:

calculating the recall rate and/or the accuracy rate of each text according to the number of key sentences and non-key sentences of the text in the text set;

and optimizing the classifier according to the recall rate and/or the accuracy rate of each text.

Optionally, the calculating the recall ratio of each text includes:

counting the total number of key sentences contained in each text and the actual number of key sentences contained in the output key sentences of each text;

and calculating the ratio of the actual number of the key sentences to the total number of the key sentences as the recall rate of each text.

Optionally, the calculating the accuracy of each text includes:

counting the number of the output key sentences of each text and the number of actual key sentences contained in the output key sentences of each text;

and calculating the ratio of the number of actual key sentences contained in the output key sentences to the number of output key sentences as the accuracy of each text.
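The two ratios defined above can be sketched as follows. Sentences are represented by their indices; the function name `recall_and_accuracy` and the convention of returning 0.0 for an empty denominator are assumptions made for illustration.

```python
from typing import Set, Tuple


def recall_and_accuracy(output_key: Set[int], true_key: Set[int]) -> Tuple[float, float]:
    """output_key: sentences the classifier marked as key sentences.
    true_key: all sentences of the text that really are key sentences."""
    hits = len(output_key & true_key)  # actual key sentences among the output
    recall = hits / len(true_key) if true_key else 0.0
    accuracy = hits / len(output_key) if output_key else 0.0
    return recall, accuracy
```

The quantity the specification calls accuracy corresponds to what is usually called precision in information retrieval.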

Optionally, the obtaining a text set of a plurality of texts includes:

and acquiring a plurality of texts of the same category in the vertical field, and creating the text set according to the plurality of texts.

According to a second aspect of embodiments herein, there is provided a text recognition apparatus including:

an acquisition module configured to acquire a text set of a plurality of texts;

the extraction module is configured to extract a subject keyword of each text in the text set and acquire an actual subject keyword extracted from at least one text in the text set;

a determining module configured to determine a first distribution of the topic keyword in each text in the text set and a second distribution of the actual topic keyword in each text in the text set;

and the identification module is configured to identify key sentences and non-key sentences of the texts in the text set by inputting the texts in the text set carrying the first distribution and the second distribution into the classifier, so as to obtain the key sentences and the non-key sentences of the texts in the text set.

Optionally, the extracting module includes:

the first word segmentation processing unit is configured to perform word segmentation processing on each text in the text set through a word segmentation processing algorithm, and determine a keyword of each text in the text set according to a word segmentation processing result;

and the identification unit is configured to input the keywords of each text into a theme generation model for theme keyword identification, and output the keywords as the theme keywords.

Optionally, the extracting module includes:

the second word segmentation processing unit is configured to perform word segmentation processing on each text in the text set through a word segmentation processing algorithm, and determine a keyword of each text in the text set according to a word segmentation processing result;

a first calculating unit, configured to calculate the matching frequency of the keywords in the corresponding texts and the reverse keyword frequency of each text in the text set;

a keyword scoring unit configured to determine a keyword score of the keyword according to a product of the frequency and the reverse keyword frequency;

and the subject keyword determining unit is configured to take the keywords with the keyword scores larger than a keyword score threshold value as the subject keywords.

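Optionally, the extracting module is further configured to:

randomly select at least one text from the text set, and acquire the actual topic keywords manually extracted from the at least one randomly selected text.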

Optionally, the determining module includes:

a keyword distribution matrix generation unit configured to generate a keyword distribution matrix of each text at a sentence level as the first distribution according to a topic keyword included in a sentence in each text;

an actual keyword distribution matrix generation unit configured to generate an actual keyword distribution matrix of each text at a sentence level as the second distribution according to the actual topic keywords included in the sentences in each text.

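Optionally, the classifier is constructed in the following manner:

constructing the classifier according to the association between the keyword distribution matrix and the sentences contained in each text, a preset classification rule and the corresponding weight of the sentences contained in each text;

correspondingly, operating the identification module;

the identification module is further configured to:

input the texts in the text set carrying the keyword distribution matrix and the actual keyword distribution matrix into the classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set.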

Optionally, the text recognition apparatus further includes:

the second calculation unit is configured to calculate the recall rate and/or the accuracy rate of each text according to the number of key sentences and non-key sentences of the texts in the text set;

an optimizing unit configured to optimize the classifier according to the recall rate and/or the accuracy rate of each text.

Optionally, the second computing unit includes:

a first statistic submodule configured to count a total number of key sentences included in each text and an actual number of key sentences included in the output key sentences of each text;

a recall rate calculation submodule configured to calculate a ratio of the actual number of key sentences to the total number of key sentences as the recall rate of each text.

Optionally, the second computing unit includes:

the second counting submodule is configured to count the number of the output key sentences of each text and the number of actual key sentences contained in the output key sentences of each text;

an accuracy calculation submodule configured to calculate the ratio of the number of actual key sentences contained in the output key sentences to the number of output key sentences as the accuracy of each text.

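Optionally, the acquisition module is further configured to:

acquire a plurality of texts of the same category in the vertical field, and create the text set according to the plurality of texts.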

According to a third aspect of embodiments herein, there is provided an electronic apparatus including:

a memory and a processor;

the memory is for storing computer-executable instructions that when executed by the processor implement the steps of the text recognition method.

According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of any one of the text recognition methods.

Compared with the prior art, the specification has the following advantages:

the present specification provides a text recognition method, including: acquiring a text set of a plurality of texts; extracting a subject keyword of each text in the text set, and acquiring an actual subject keyword extracted from at least one text in the text set; determining a first distribution of the topic keyword in each text in the text set and a second distribution of the actual topic keyword in each text in the text set; and inputting the texts in the text set carrying the first distribution and the second distribution into a classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set.

The text recognition method provided by this specification extracts topic keywords from the large number of texts in the text set and actual topic keywords from a small number of those texts, determines a first distribution of the topic keywords in each text in the text set and a second distribution of the actual topic keywords in each text in the text set, and inputs each text carrying the first distribution and the second distribution into a classifier to identify the key sentences and non-key sentences of each text in the text set. Cleaning out the non-key sentences while retaining the key sentences makes the key sentences easier to annotate and improves efficiency when constructing a knowledge graph, and the retained key sentences let a user quickly grasp the main content of a text when consulting it.

Drawings

Fig. 1 is a flowchart of a text recognition method provided in an embodiment of the present specification;

Fig. 2 is a process flow diagram of a text recognition process provided by an embodiment of the present specification;

Fig. 3 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present specification;

Fig. 4 is a block diagram of an electronic device according to an embodiment of the present specification.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present description. This description may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; those skilled in the art can make similar extensions without departing from its spirit and scope.

The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It will be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first can also be referred to as a second and, similarly, a second can also be referred to as a first without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "when", "upon" or "in response to a determination", depending on the context.

First, the terms used in one or more embodiments of the present specification are explained.

TF-IDF: Term Frequency-Inverse Document Frequency, a weighting technique commonly used in information retrieval and data mining, where TF denotes term frequency and IDF denotes inverse document frequency. It is a statistical method for assessing how important a word is to one document in a document set or corpus.
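For reference, one common formulation of this weighting is given below; it is not quoted from the specification, and several variants exist. Here $\mathrm{tf}(t,d)$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents in the set, and $n_t$ is the number of documents that contain $t$ (all symbols are introduced here for illustration).

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log \frac{N}{n_t}
```

A term that occurs often in one document but in few documents overall receives a high weight.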

LDA: Latent Dirichlet Allocation, a document topic generation model, also called a three-layer Bayesian probability model, comprising a three-layer structure of words, topics and documents. It is an unsupervised machine learning technique that can be used to identify latent topic information in large-scale document sets or corpora.

Keywords: words or phrases used to express the subject matter of a document such as a scientific paper, a scientific report, an academic paper or an article.

Actual topic keywords: words or phrases expressing the subject matter of a document that are labeled manually on a small number of texts such as scientific papers, scientific reports, academic papers or articles. Manually labeled actual topic keywords are highly accurate.

Topic keywords: words or phrases expressing the subject matter of a document that are labeled by TF-IDF or LDA on a large number of texts such as scientific papers, scientific reports, academic papers or articles. Labeling topic keywords in this way is highly efficient.

In the present specification, a text recognition method is provided. This specification also relates to a text recognition apparatus, an electronic device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.

Fig. 1 shows a flow diagram of a text recognition method according to an embodiment of the present description, including steps 102 to 108.

Step 102: a text set of a plurality of texts is obtained.

In one embodiment of the present disclosure, the text set of the plurality of texts may be a text set composed of a plurality of articles or a text set composed of a plurality of news reports, wherein the text set composed of the plurality of articles or the text set composed of the plurality of news reports belong to the same domain. For example, searching for soccer in a search engine, a platform carrying the search engine may show a large number of articles, news and pictures about soccer, all of which belong to the field of sports soccer.

Here, the text recognition method is described by taking a text set composed of articles as an example. When a user wants to learn about a certain subject, the user usually searches for related articles over a network. When a search engine returns articles on that subject, the key sentences of each article are extracted and displayed first, so that the user can quickly and accurately learn what the main content of each article is and whether it is the article the user needs.

In order to provide accurate key sentences to users, after a text set composed of a plurality of texts is acquired, topic keywords are extracted from the large number of texts in the text set and actual topic keywords are extracted from a small number of those texts. A first distribution is determined from the distribution of the topic keywords in the sentences of each text in the text set, and a second distribution is determined from the distribution of the actual topic keywords in the sentences of each text in the text set. The texts in the text set carrying the first distribution and the second distribution are then input into a classifier for key sentence identification, which improves the accuracy of identifying the key sentences of each text. The non-key sentences are cleaned out and the key sentences of each text are retained as the main content displayed to the user, ensuring that what is displayed is the actual main content of the corresponding article.

In addition, document-level event extraction is an important link in constructing a knowledge graph, and extracting key sentences and cleaning out non-key sentences at the document level plays an important role in the accuracy and efficiency of the subsequent event extraction. By cleaning out the non-key sentences of a text and retaining its key sentences, the key sentences are easier to annotate and the efficiency of knowledge graph construction is improved.

For example, a news article of 10000 words reports a car accident. Some readers care only about the main content of the news, such as where and when the accident occurred and how many people were injured. The key sentence of this news article is that a car accident occurred at location A at 8 a.m. on April 17, 2019, and no one was injured.

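In one or more implementations of this embodiment, the obtaining a text set of a plurality of texts includes:

acquiring a plurality of texts of the same category in the vertical field, and creating the text set according to the plurality of texts.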

Specifically, in the process of identifying key sentences in subsequent texts, the key sentences of the texts in the same category in the vertical field are identified, that is, the obtained text sets of the plurality of texts are the text sets created for the plurality of texts in the same category in the vertical field.

Here, a vertical field may be understood as one of the smaller fields obtained by vertically subdividing a large field. For example, within the sports vertical field, track and field is a second-level field and can be treated as one category of the sports vertical field. The second-level field of track and field can in turn be subdivided into third-level fields; for example, the 100 metres, the relay and the marathon are all third-level fields of track and field.

In the text set in the vertical field, the attributes of the keywords are similar, and the types of the keywords are limited, so that the texts in the same category in the vertical field are acquired, the text set is created according to the texts in the same category in the vertical field, and the texts in the same field are identified in the subsequent process of identifying the text key sentences, so that the obtained text key sentences can be more accurate.
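Creating one text set per category can be sketched as below. The sketch assumes every text already carries a category label (for example "sports/track-and-field"); how that label is obtained is not described in the specification, and the function name `build_text_sets` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_text_sets(labelled_texts: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (category, text) pairs into one text set per category."""
    text_sets: Dict[str, List[str]] = defaultdict(list)
    for category, text in labelled_texts:
        text_sets[category].append(text)
    return dict(text_sets)
```

Key sentence identification would then be run separately on each returned text set, so that all texts processed together share the same limited keyword vocabulary.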

Step 104: the topic keywords of each text in the text set are extracted, and the actual topic keywords extracted from at least one text in the text set are acquired.

Specifically, the topic keywords of each text in the obtained text set are extracted, and the actual topic keywords of at least one text in the text set are obtained. The topic keywords of each text are extracted through a set algorithm or a set model, whereas the actual topic keywords are extracted by manual labeling.

For example, a text set consists of 100 articles about football. Keywords are extracted from the 100 articles through a set model, and the topic keywords are determined to be "football", "winner" and "score". One of the 100 articles is labeled manually, and its actual topic keywords are determined to be "football", "winner", "score", "team", "home/away venue", "match time" and "player". It follows that the manually labeled actual topic keywords are richer than the topic keywords extracted by the set model.

On the basis of obtaining the actual topic keyword of at least one text in the text set, further, in one or more embodiments of this embodiment, the obtained actual topic keyword is extracted manually, and a specific implementation manner is as follows:

Specifically, on the basis of obtaining the text set of the plurality of texts, a small number of texts are randomly selected from the text set to manually extract actual subject keywords, and the actual subject keywords of the manually extracted small number of texts are obtained.

In practical applications, the manual extraction process is described by taking as an example an article paragraph that reads "flower blossom, flower withering, and not meaning that the life of the flower is lost … …". Through manual labeling, the keywords of the paragraph are determined to include "flower", "blossom", "withering", "not", "meaning", "life", "in" and "lost"; by understanding what the paragraph describes, the annotator determines the actual topic keywords to be "flower", "blossom", "withering" and "lost".

The accuracy of the extracted actual subject key words of the texts can be ensured by manually extracting a small amount of actual subject key words of the texts, a measuring standard can be provided for subsequently identifying the key sentences of each text, and the high accuracy of the subsequently identified key sentences of each text is ensured.
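The random selection of a small number of texts for manual labeling can be sketched with the standard library. The function name `sample_for_annotation` and the `seed` parameter (added only to make the selection reproducible) are assumptions.

```python
import random
from typing import List, Optional


def sample_for_annotation(text_set: List[str], k: int = 1,
                          seed: Optional[int] = None) -> List[str]:
    """Randomly pick at least one and at most len(text_set) texts."""
    if not text_set:
        return []
    rng = random.Random(seed)
    k = max(1, min(k, len(text_set)))  # "at least one text"
    return rng.sample(text_set, k)
```

The selected texts would then be handed to annotators, and the keywords they mark become the actual topic keywords.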

On the basis of extracting the topic keyword of each text in the text set, further, in one or more embodiments of this embodiment, the topic keyword of each text in the text set is extracted, and a specific implementation manner is as follows:

Specifically, word segmentation processing is performed on each text in the text set through a word segmentation processing algorithm in natural language processing, the keywords of each text are determined according to the word segmentation result, the keywords of each text are input into a topic generation model for topic keyword recognition, and the recognized keywords are used as the topic keywords of each text.

In this embodiment, the topic generation model identifies the topic keywords by traversing the keywords and counting the number of times each keyword occurs in the corresponding text.

For example, one paragraph of an article reads "flower blossom, flower withering, and does not mean that the life of the flower is dying … …". A word segmentation algorithm splits the paragraph into words such as "flower", "blossom", "withering", "not", "meaning", "life", "in" and "dying"; all 11 resulting keywords are input into the topic generation model for topic keyword recognition, and the topic keyword obtained for the paragraph is "flower".
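The occurrence counting described above can be sketched as follows. This is a deliberately simplified stand-in and not an LDA implementation; the function name `most_frequent_keywords` and the English token list used in the usage note are assumptions.

```python
from collections import Counter
from typing import List


def most_frequent_keywords(keywords: List[str], top_n: int = 1) -> List[str]:
    """Return the top_n keywords that occur most often in one text."""
    counts = Counter(keywords)
    return [word for word, _ in counts.most_common(top_n)]
```

Called on an assumed segmentation such as `["flower", "blossom", "flower", "withering", "not", "meaning", "flower", "life", "dying"]`, the function selects "flower", the only word that occurs more than once.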

In practical applications, the topic generation model must be trained on a large number of samples so that the topic keywords it identifies are more accurate. A suitable sample library may be selected for training according to the practical application, which is not limited in this specification.

On the basis of extracting the topic keyword of each text in the text set, further, this specification further provides another method for extracting the topic keyword of each text in the text set, where in one or more embodiments of this embodiment, the extracting the topic keyword of each text in the text set includes:

Specifically, word segmentation processing is performed on each text in the text set through a word segmentation processing algorithm in natural language processing, and the keywords of each text are determined according to the word segmentation result. For each keyword, the frequency with which it appears in the corresponding text and its reverse keyword frequency are calculated, and the two values are multiplied to give the keyword score. The keyword score is then compared with a keyword score threshold: a keyword whose score is greater than the threshold is taken as a topic keyword, and a keyword whose score is less than or equal to the threshold is not processed further.
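The scoring and thresholding steps can be sketched as below, assuming that texts are already segmented into word lists and that the reverse keyword frequency takes a base-10 logarithmic form. The function names and the choice of a raw occurrence count as the frequency are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def reverse_keyword_frequency(total_texts: int, texts_with_keyword: int) -> float:
    """Base-10 log of (texts in the set / texts containing the keyword)."""
    return math.log10(total_texts / texts_with_keyword)


def topic_keywords(text_set: List[List[str]], index: int,
                   threshold: float) -> Dict[str, float]:
    """Score the keywords of text_set[index]; keep those above the threshold."""
    # Number of texts in which each word appears at least once.
    doc_freq = Counter(word for text in text_set for word in set(text))
    counts = Counter(text_set[index])
    scores = {
        word: count * reverse_keyword_frequency(len(text_set), doc_freq[word])
        for word, count in counts.items()
    }
    return {word: s for word, s in scores.items() if s > threshold}
```

Words that appear in every text get a reverse keyword frequency of 0 and therefore never pass a positive threshold, which is the intended effect of the weighting.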

In specific implementation, the reverse keyword frequency of the keyword in each text can be calculated in the following manner: determining the weight of each keyword in each text, and determining the reverse keyword frequency of each keyword relative to the corresponding text through the weight; here, the weight of each keyword may be determined by matching each keyword with a keyword in a preset keyword library, where the keywords in the keyword library all have corresponding weights, and assigning the keyword matched with the keyword library to a weight recorded in the keyword library, that is, the reverse keyword frequency of each keyword in each text in the text set may be determined according to the weight of each keyword.
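The library-based variant amounts to a dictionary lookup, sketched below. The library contents, the function name `lookup_reverse_frequency` and the default value used for keywords that are absent from the library are all assumptions; the specification does not say how unmatched keywords are handled.

```python
from typing import Dict, List


def lookup_reverse_frequency(keywords: List[str], library: Dict[str, float],
                             default: float = 0.0) -> Dict[str, float]:
    """Assign each keyword the weight recorded for it in the keyword library."""
    return {word: library.get(word, default) for word in keywords}
```

For instance, with an illustrative library `{"flower": 0.7, "life": 0.5, "not": 0.2}`, the keyword "flower" is assigned 0.7 and a keyword missing from the library falls back to the default.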

Or the reverse keyword frequency of the keyword in each text can be calculated by the following method: the reverse keyword frequency of each keyword is determined in a logarithmic function manner, for example, in ten million articles, the word "china" appears in one thousand articles, and the reverse keyword frequency of the word "china" in one ten million articles is determined to be lg (10000000/1000) ═ 4 through the logarithmic function.

In practical applications, still taking the above paragraph as "flower blossom, flower withering, and not meaning flower life is dying … …", another method for extracting the subject keyword of each text in the text set is described, determining the words of the paragraph as "flower", "flower blossom", "withering", "not", "meaning", "life", "at", and "dying" through a word segmentation algorithm, determining the matching frequency of each keyword as "flower" matching frequency of 3 "," matching frequency of "flower" of 3 "," blooming "," withering "," not "," meaning "," life "," at ", and" dying "of 3", determining the matching frequency of "flower" as 0.7 ", determining the reverse keyword frequency of" flower "as 0.1" and "blooming" as 0.1 according to calculation, The reverse keyword frequency of "zero", "mean", "life", and "death" is 0.5, and the reverse keyword frequency of "and", "not", and "at" is 0.2, the keyword score of "flower" is determined to be 2.1 by calculation, the keyword score of "at" is 0.3, the keyword score of "blossom", "zero", "mean", "life", and "death" is 0.5, the keywords of "and", "not", and "at" are 0.3, and the keyword score threshold is 1, and the keyword "flower" is determined to be the subject keyword of the paragraph.
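The arithmetic of this example can be checked with a few lines of Python. The frequency of 3 for "flower", the reverse keyword frequencies and the threshold of 1 come from the text; the frequency of 1 assumed for every other word is an inference from the stated scores, not a value given in the specification.

```python
# Frequencies: "flower" occurs 3 times (from the text); 1 is assumed for the rest.
frequency = {"flower": 3, "blossom": 1, "withering": 1, "meaning": 1,
             "life": 1, "dying": 1, "and": 1, "not": 1, "at": 1}
# Reverse keyword frequencies as stated in the example.
reverse_frequency = {"flower": 0.7, "blossom": 0.5, "withering": 0.5,
                     "meaning": 0.5, "life": 0.5, "dying": 0.5,
                     "and": 0.2, "not": 0.2, "at": 0.2}
THRESHOLD = 1.0

# Keyword score = frequency x reverse keyword frequency.
scores = {w: frequency[w] * reverse_frequency[w] for w in frequency}
# Keep only the keywords whose score exceeds the threshold.
topic = [w for w, s in scores.items() if s > THRESHOLD]
```

Only "flower", with a score of about 2.1, exceeds the threshold, which matches the conclusion of the example.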

In addition, the topic keywords of each text can also be extracted by the TF-IDF statistical method or the LDA document topic generation model, which is not described in detail in this specification.

In the process of extracting the topic keywords, the topic keywords of each text are extracted by the two methods, so that the accuracy of the extracted topic keywords and the extraction efficiency of the extracted topic keywords are ensured, and an important basis is laid for the subsequent more accurate identification of the key sentences of each text.

Step 106: a first distribution of the topic keyword in each text in the corpus and a second distribution of the actual topic keyword in each text in the corpus are determined.

Specifically, on the basis of the topic keywords extracted from each text and the actual topic keywords extracted from at least one text, the first distribution is determined according to the distribution of the topic keywords in each text, and the second distribution is determined according to the distribution of the actual topic keywords in each text.

In specific implementation, the first distribution of the topic keywords in each text is the distribution of the topic keywords over the sentences of each text, and the second distribution of the actual topic keywords in each text is the distribution of the actual topic keywords over the sentences of each text.

On the basis of the foregoing determination of the first distribution and the second distribution, further, in one or more implementations of this embodiment, a specific implementation manner of the generation process of the first distribution and the second distribution is as follows:

Specifically, a keyword distribution matrix of each text at a sentence level is generated according to the topic keywords included in the sentences in each text, the distribution matrix of the topic keywords at the sentence level in each text is determined as the first distribution, an actual keyword distribution matrix of each text at the sentence level is generated according to the actual topic keywords included in the sentences in each text, and the distribution matrix of the actual topic keywords at the sentence level in each text is determined as the second distribution.

In practical applications, two texts doc1 and doc2 are taken as examples to describe the process of determining the first distribution and the second distribution, where doc1 is "I like playing football" and doc2 is "I like playing tennis". The extracted topic keywords are "I" and "like", the actual topic keywords are "football" and "tennis", and the element values in the keyword matrix and the actual keyword matrix represent word frequency. The keyword matrix determined according to the distribution of the topic keywords in the two texts is as follows:

$$\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$$

where a11, a12, a21 and a22 are all 1, which means that "I" and "like" each appear in both texts doc1 and doc2 with a frequency of 1.

The actual keyword matrix determined according to the distribution of the actual topic keywords in the two texts is as follows:

$$\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

where B11 and B22 are 1 and B12 and B21 are 0, which means that "football" appears in the text doc1 with a frequency of 1 and in the text doc2 with a frequency of 0, and "tennis" appears in the text doc1 with a frequency of 0 and in the text doc2 with a frequency of 1.

The keyword distribution matrix is determined from the distribution of the topic keywords at the sentence level of each text and is taken as the first distribution of the topic keywords. The actual keyword distribution matrix is determined from the distribution of the actual topic keywords at the sentence level of each text and is taken as the second distribution of the actual topic keywords. Expressing the first distribution and the second distribution as matrices makes the distribution of the topic keywords and the actual topic keywords in each text more intuitive to determine.
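The doc1/doc2 example can be reproduced with a short sketch. It assumes whitespace tokenization and a layout in which each row corresponds to a keyword and each column to a text; the function name is illustrative and is not taken from the specification.

```python
def keyword_distribution_matrix(keywords, texts):
    """Build a word-frequency matrix.

    Rows follow the order of `keywords`, columns follow the order of `texts`,
    and each entry is the number of times the keyword occurs in the text.
    Assumption: words are separated by whitespace.
    """
    tokenized = [text.split() for text in texts]
    return [[tokens.count(keyword) for tokens in tokenized] for keyword in keywords]

texts = ["I like playing football", "I like playing tennis"]  # doc1, doc2

first_distribution = keyword_distribution_matrix(["I", "like"], texts)
second_distribution = keyword_distribution_matrix(["football", "tennis"], texts)

print(first_distribution)   # [[1, 1], [1, 1]]
print(second_distribution)  # [[1, 0], [0, 1]]
```

For a sentence-level matrix, the same function could be called with the sentences of one text in place of `texts`.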

Step 108: The texts in the text set carrying the first distribution and the second distribution are input into a classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set.

Specifically, according to the first distribution of the determined topic keywords in each text and the second distribution of the actual topic keywords in each text, the text in the text set carrying the first distribution and the second distribution is input to the classifier for identifying key sentences and non-key sentences, and the key sentences and the non-key sentences of the text in the text set are obtained.

In specific implementation, the texts carrying the first distribution and the second distribution are identified by the classifier to obtain the key sentences and non-key sentences of each text. In the recognition process, the classifier calculates, for each sentence in a text carrying the first distribution and the second distribution, the probability that the sentence is a key sentence, and the text output by the classifier may carry this probability for each sentence. Sentences whose probability is greater than or equal to a preset threshold are taken as key sentences, and sentences whose probability is smaller than the preset threshold are taken as non-key sentences. For the output, two sets can be created for each text: one is the set of key sentences corresponding to the text, and the other is the set of non-key sentences corresponding to the text.
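The thresholding step described above can be sketched as follows. The probabilities, the threshold value of 0.5, and the function name are assumptions chosen for illustration; the specification does not fix any of them.

```python
def split_by_threshold(sentence_probabilities, threshold):
    """Split the sentences of one text into key and non-key sentences.

    `sentence_probabilities` is a list of (sentence, probability) pairs.
    A sentence is a key sentence when its probability >= threshold.
    """
    key_sentences, non_key_sentences = [], []
    for sentence, probability in sentence_probabilities:
        if probability >= threshold:
            key_sentences.append(sentence)
        else:
            non_key_sentences.append(sentence)
    return key_sentences, non_key_sentences

# Hypothetical classifier output for one text.
output = [
    ("The sun is shining brightly today", 0.2),
    ("I want to go for a walk in the park", 0.9),
]
key, non_key = split_by_threshold(output, threshold=0.5)
print(key)
print(non_key)
```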

In specific implementation, in the text output by the classifier, different labels can be respectively carried out on the key sentences and the non-key sentences of each text, the key sentences can be labeled in a highlight mode, the non-key sentences are not labeled, and the key sentences and the non-key sentences of each text can be easily and quickly identified; at least one label exists in each text output by the classifier, and the label is used for labeling key sentences in the text.

For example, a passage in an article is: "The sun is shining brightly today, and I want to go for a walk in the park". The passage is input into the classifier to identify key sentences and non-key sentences, and the corresponding output is "The sun is shining brightly today, and I want to go for a walk in the park", in which "I want to go for a walk in the park" is marked as the key sentence by rendering its text in bold.

On the basis that the classifier identifies a key sentence and a non-key sentence, in one or more embodiments of this embodiment, the classifier is constructed in the following manner:

Specifically, the classifier is constructed by giving a weight to the sentences in each text, presetting a classification rule, and establishing an association relationship between the keyword distribution matrix corresponding to the first distribution and the sentences contained in each text, wherein the corresponding weight of the sentences contained in each text is set by the reverse keyword frequency of the keywords contained in each sentence.

Based on this, after the classifier is constructed according to the incidence relation between the keyword distribution matrix and the sentences contained in each text, the preset classification rule and the corresponding weight of the sentences contained in each text, the classifier is used for correspondingly identifying the key sentences and the non-key sentences of the text set of the plurality of texts.

In specific implementation and in practical applications, the preset classification rule can be set according to the application scenario, and this specification places no limitation on it.
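One possible reading of the classifier construction is sketched below. It is an assumption-heavy illustration, not the construction defined by the specification: the weight of a sentence is taken to be the sum, over the keywords it contains, of keyword count times reverse keyword frequency, and the preset classification rule is taken to be a simple weight threshold. All names and numbers are hypothetical.

```python
def sentence_weights(keyword_matrix, reverse_frequencies):
    """Weight each sentence from a sentence-level keyword distribution matrix.

    keyword_matrix[k][s] is the count of keyword k in sentence s, and
    reverse_frequencies[k] is the reverse keyword frequency of keyword k.
    """
    n_sentences = len(keyword_matrix[0]) if keyword_matrix else 0
    return [
        sum(row[s] * rf for row, rf in zip(keyword_matrix, reverse_frequencies))
        for s in range(n_sentences)
    ]

def classify_sentences(keyword_matrix, reverse_frequencies, rule_threshold):
    """Hypothetical rule: a sentence is a key sentence if weight >= threshold."""
    weights = sentence_weights(keyword_matrix, reverse_frequencies)
    return [weight >= rule_threshold for weight in weights]

# Two keywords (rows) over three sentences (columns); values are illustrative.
matrix = [[1, 0, 2],
          [0, 0, 1]]
reverse_frequencies = [0.5, 1.0]

print(sentence_weights(matrix, reverse_frequencies))         # [0.5, 0.0, 2.0]
print(classify_sentences(matrix, reverse_frequencies, 1.0))  # [False, False, True]
```

A different rule, for example one that also consults the actual keyword distribution matrix, could be substituted without changing the weighting step.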

By adopting the classifier to identify the key sentences and the non-key sentences of each text, compared with the deep learning method, the method can identify the key sentences and the non-key sentences of each text without a large amount of labeled data, thereby saving the cost of labeling data in the deep learning method.

On the basis of identifying the key sentences and the non-key sentences of each text by the classifier, further, in one or more embodiments of this embodiment, the classifier is optimized, and a process of specifically optimizing the classifier is as follows:

Specifically, the recall rate and/or the accuracy rate of each text is calculated according to the number of key sentences and the number of non-key sentences of each text output by the classifier, and the weights of the sentences corresponding to each text in the classifier are adjusted according to the recall rate and/or the accuracy rate of each text. The weights are adjusted through a back propagation algorithm. After each adjustment, the recall rate and/or the accuracy rate of each text is recalculated with the adjusted weights to check whether it approaches 1; if not, iteration continues through the back propagation algorithm and the weights are adjusted further until the recall rate and/or the accuracy rate approaches 1. In addition, a small number of text samples are randomly extracted and manually labeled, and the classifier is trained on them, so that the resulting classifier identifies key sentences and non-key sentences with higher accuracy.
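The iterative weight adjustment can be illustrated with a deliberately simplified stand-in: a single-layer logistic model trained by gradient descent on a handful of manually labeled sentences. For a single layer, back propagation reduces to this gradient update. The features (topic keyword count and actual topic keyword count per sentence), the labels, the learning rate, and the number of epochs are all assumptions made for the sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(features, labels, learning_rate=0.1, epochs=5000):
    """Adjust weights iteratively by gradient descent on the log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = p - y  # gradient of the log loss w.r.t. the pre-activation
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias

def predict(weights, bias, x, threshold=0.5):
    """Return True when the sentence is predicted to be a key sentence."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= threshold

# Hypothetical labeled sample: (topic keyword count, actual topic keyword count).
features = [(2, 1), (3, 1), (0, 0), (1, 0)]
labels = [1, 1, 0, 0]  # 1 = key sentence, 0 = non-key sentence

weights, bias = train(features, labels)
predictions = [predict(weights, bias, x) for x in features]
print(predictions)
```

In a fuller implementation the loop would stop once the recall rate and/or accuracy rate computed on the labeled sample is close enough to 1, rather than after a fixed number of epochs.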

In addition, F1 parameters of each text are calculated according to the number of key sentences and non-key sentences of the text in the text set, and the classifier is optimized according to the F1 parameters of each text. The F1 parameter is determined according to the recall ratio and the accuracy, and can be understood as an integrated standard determined by integrating the recall ratio and the accuracy.

For example, there are 1400 articles, of which 300 are about football, 300 are about basketball, and 800 are about track and field. These 1400 articles are searched for football articles, and the search returns 200 articles about football, 100 articles about basketball, and 100 articles about track and field. The accuracy rate of this search is 200/(200+100+100) = 50%, the recall rate is 200/300 = 66.7%, and the F1 parameter is 50% × 66.7% × 2/(50% + 66.7%) = 57.1%.
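The arithmetic of this example can be checked with a small sketch. It uses the figures that the stated results imply: 200 correct articles among 400 returned, out of 300 relevant articles in total. The function name is an assumption made for the example.

```python
def precision_recall_f1(correct_returned, total_returned, total_relevant):
    """Compute the accuracy rate (precision), recall rate, and F1 parameter."""
    precision = correct_returned / total_returned if total_returned else 0.0
    recall = correct_returned / total_relevant if total_relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = precision_recall_f1(
    correct_returned=200,          # relevant articles among those returned
    total_returned=200 + 100 + 100,
    total_relevant=300,
)
print(f"{precision:.1%} {recall:.1%} {f1:.1%}")  # 50.0% 66.7% 57.1%
```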

On the basis of the optimization of the classifier, in one or more embodiments of this embodiment, a calculation process of the recall ratio is as follows:

Specifically, the recall rate = the number of actual key sentences identified by the classifier / the total number of actual key sentences in the text. The recall rate measures how completely the classifier identifies the key sentences: the higher the recall rate, the more of the actual key sentences the classifier identifies; conversely, the lower the recall rate, the more actual key sentences the classifier misses.

On the basis of the optimization of the classifier, further, in one or more implementations of this embodiment, a calculation process of the accuracy is as follows:

Specifically, the accuracy rate = the number of actual key sentences identified by the classifier / the total number of sentences identified as key sentences by the classifier. The accuracy rate measures the accuracy of the key sentences identified by the classifier: the higher the accuracy rate, the more accurately the classifier identifies the key sentences; conversely, the lower the accuracy rate, the less accurately the classifier identifies the key sentences.

In the process of optimizing the classifier, the recall rate and the accuracy rate are both applied, in order to enable the classifier to identify the key sentences and the non-key sentences more accurately, a metric value can be determined by fusing the recall rate and the accuracy rate, the metric value is an F1 parameter, and the classifier is further optimized through an F1 parameter, so that the identification accuracy of the classifier is higher.
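Applied to the sentences of a single text, the three measures can be sketched with sets of sentence indices. The sketch assumes that a manually labeled set of actual key sentences is available for the text; the index values and the function name are illustrative.

```python
def evaluate_text(predicted_key, actual_key):
    """Return (accuracy rate, recall rate, F1) for one text.

    `predicted_key`: indices of sentences the classifier marked as key.
    `actual_key`: indices of sentences manually labeled as key (assumed given).
    """
    predicted, actual = set(predicted_key), set(actual_key)
    correct = len(predicted & actual)
    accuracy_rate = correct / len(predicted) if predicted else 0.0
    recall_rate = correct / len(actual) if actual else 0.0
    if accuracy_rate + recall_rate == 0:
        return accuracy_rate, recall_rate, 0.0
    f1 = 2 * accuracy_rate * recall_rate / (accuracy_rate + recall_rate)
    return accuracy_rate, recall_rate, f1

# Hypothetical text: the classifier marks sentences 0, 2, 3; the labels say 2, 3, 5, 6.
accuracy_rate, recall_rate, f1 = evaluate_text([0, 2, 3], [2, 3, 5, 6])
print(accuracy_rate, recall_rate, f1)
```

Because F1 is the harmonic mean of the two rates, it stays low whenever either rate is low, which is why it serves as a single fused metric.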

The text recognition method provided by the specification obtains the topic keywords of each text in the text set by adopting a statistical algorithm or a topic generation model, and obtains the actual topic keywords of only a small number of texts by manual extraction, which greatly reduces the high cost of manually extracting keywords. The first distribution and the second distribution are then determined according to the distribution of the extracted topic keywords and actual topic keywords in each text. Both distributions take the form of a matrix, so the distribution of the topic keywords and the actual topic keywords in each text can be determined more intuitively. The key sentences and non-key sentences of each text are identified by inputting the first distribution and the second distribution into the classifier, and the classifier is optimized through the accuracy rate and/or the recall rate, which ensures the accuracy with which the classifier identifies the key sentences and non-key sentences of each text and improves the recognition efficiency. By cleaning the non-key sentences of a text and retaining its key sentences, the text recognition method provided by the specification makes it convenient to label the key sentences of the text and improves construction efficiency in the process of constructing a knowledge graph.

The text recognition method provided in the present specification will be further described below with reference to fig. 2, by taking an example of an application of the text recognition method to recognition of a sports news related article. The specific steps include steps 202 to 218.

Step 202: a sports text collection consisting of a large number of sports news articles is obtained.

Specifically, the sports news articles are the same category of sports news articles in the same field.

Step 204: the topic keywords of each sports news article are extracted.

Specifically, topic keyword extraction is performed on each sports news article through LDA or TF-IDF.

Step 206: actual topic keywords in a small number of sports news articles are obtained.

Specifically, a small number of sports news articles are randomly selected from a large number of sports news articles to manually extract actual subject keywords of the small number of sports news articles.

Step 208: A keyword distribution matrix is determined according to the distribution of the topic keywords at the sentence level of each article.

Step 210: An actual keyword distribution matrix is determined according to the distribution of the actual topic keywords at the sentence level of each article.

Step 204 and step 206 are executed in parallel, and step 208 and step 210 are executed in parallel.
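The parallel execution of step 204 and step 206 could, for instance, be arranged with a thread pool. In the sketch below both step functions are simple stand-ins: taking the most frequent word of each article replaces LDA or TF-IDF, and a dictionary lookup replaces manual extraction. All names and data are assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def extract_topic_keywords(articles):
    """Stand-in for step 204: the most frequent word of each non-empty article."""
    return [Counter(article.lower().split()).most_common(1)[0][0] for article in articles]

def get_actual_topic_keywords(annotations, sample_ids):
    """Stand-in for step 206: look up manually extracted keywords by article id."""
    return {article_id: annotations[article_id] for article_id in sample_ids}

articles = ["goal goal match", "serve ace serve"]  # hypothetical sports articles
annotations = {0: ["goal"]}                        # hypothetical manual labels

# Submit both steps so that neither waits for the other to finish.
with ThreadPoolExecutor(max_workers=2) as pool:
    future_204 = pool.submit(extract_topic_keywords, articles)
    future_206 = pool.submit(get_actual_topic_keywords, annotations, [0])
    topic_keywords = future_204.result()
    actual_keywords = future_206.result()

print(topic_keywords)
print(actual_keywords)
```

Steps 208 and 210 can be submitted to the same pool in the same way once both results are available.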

Step 212: The keyword distribution matrix and the actual keyword distribution matrix are input into a classifier to identify key sentences and non-key sentences.

Step 214: key sentences and non-key sentences of each sports news article are obtained.

Step 216: The accuracy rate is calculated according to the number of key sentences and non-key sentences of each sports news article.

Step 218: the weight of the sentences contained in each sports news article in the classifier is adjusted according to the accuracy rate.

Specifically, the weight of the sentence contained in each sports news article in the classifier is adjusted through the accuracy rate, so that the classifier can identify the key sentence and the non-key sentence more accurately.

The text recognition method provided by the specification obtains the topic keywords of each sports news article by adopting a statistical algorithm or a topic generation model, and obtains the actual topic keywords of only a small number of articles by manual extraction, which greatly reduces the high cost of manual extraction. The keyword distribution matrix and the actual keyword distribution matrix are then determined according to the distribution of the extracted topic keywords and actual topic keywords in each sports news article, so that the distribution of the topic keywords and the actual topic keywords in each sports news article can be determined more intuitively. The key sentences and non-key sentences of each sports news article are identified by inputting the keyword distribution matrix and the actual keyword distribution matrix into the classifier, and the classifier is optimized through the accuracy rate, which ensures the accuracy with which the classifier identifies the key sentences and non-key sentences of each sports news article and improves the recognition efficiency.

Corresponding to the above method embodiment, the present specification further provides a text recognition apparatus embodiment, and fig. 3 shows a schematic structural diagram of the text recognition apparatus according to an embodiment of the present specification. As shown in fig. 3, the apparatus includes:

an obtaining module 302 configured to obtain a text set of a plurality of texts;

an extracting module 304, configured to extract a topic keyword of each text in the text set, and obtain an actual topic keyword extracted from at least one text in the text set;

a determining module 306 configured to determine a first distribution of the topic keyword in each text in the text set and a second distribution of the actual topic keyword in each text in the text set;

an identifying module 308 configured to input the texts in the text set carrying the first distribution and the second distribution into a classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set.

In an optional embodiment, the extracting module 304 includes:

In an optional embodiment, the extracting module 304 is further configured to:

In an optional embodiment, the determining module 306 includes:

In an optional embodiment, the classifier is constructed in the following manner:

accordingly, the identification module 308 is run;

the identification module 308, further configured to:

In an optional embodiment, the text recognition apparatus further includes:

In an optional embodiment, the second computing unit includes:

In an optional embodiment, the obtaining module 302 is further configured to:

The text recognition device provided by the specification obtains the topic keywords of each text in the text set by adopting a statistical algorithm or a topic generation model, and obtains the actual topic keywords of only a small number of texts by manual extraction, which greatly reduces the high cost of manually extracting keywords. The first distribution and the second distribution are then determined according to the distribution of the extracted topic keywords and actual topic keywords in each text. Both distributions take the form of a matrix, so the distribution of the topic keywords and the actual topic keywords in each text can be determined more intuitively. The key sentences and non-key sentences of each text are identified by inputting the first distribution and the second distribution into the classifier, and the classifier is optimized through the accuracy rate and/or the recall rate, which ensures the accuracy with which the classifier identifies the key sentences and non-key sentences of each text and improves the recognition efficiency. By cleaning the non-key sentences of a text and retaining its key sentences, the text recognition device provided by the specification makes it convenient to label the key sentences of the text and improves construction efficiency in the process of constructing a knowledge graph.

The above is a schematic scheme of a text recognition apparatus of the present embodiment. It should be noted that the technical solution of the text recognition apparatus and the technical solution of the text recognition method belong to the same concept, and details that are not described in detail in the technical solution of the text recognition apparatus can be referred to the description of the technical solution of the text recognition method.

Fig. 4 shows a block diagram of an electronic device 400 according to an embodiment of the present description. The components of the electronic device 400 include, but are not limited to, a memory 410 and a processor 420. Processor 420 is coupled to memory 410 via bus 430 and database 450 is used to store data.

The electronic device 400 also includes an access device 440, the access device 440 enabling the electronic device 400 to communicate via one or more networks 460. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. The access device 440 may include one or more of any type of network interface (e.g., a Network Interface Card (NIC)) whether wired or wireless, such as an IEEE802.11 Wireless Local Area Network (WLAN) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.

In one embodiment of the present description, the above-mentioned components of the electronic device 400 and other components not shown in fig. 4 may also be connected to each other, for example, through a bus. It should be understood that the block diagram of the electronic device shown in fig. 4 is for exemplary purposes only and is not intended to limit the scope of the present disclosure. Those skilled in the art may add or replace other components as desired.

The electronic device 400 may be any type of stationary or mobile electronic device, including a mobile computer or mobile electronic device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable electronic device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary electronic device such as a desktop computer or PC. The electronic device 400 may also be a mobile or stationary server.

Wherein processor 420 is configured to execute the following computer-executable instructions:

acquiring a text set of a plurality of texts;

Optionally, the classifier is constructed in the following manner:

Optionally, the calculating the recall ratio of each text includes:

Optionally, the calculating the accuracy of each text includes:

Optionally, the obtaining a text set of a plurality of texts includes:

acquiring a plurality of texts of the same category in a vertical field, and creating the text set according to the plurality of texts.

The above is a schematic scheme of an electronic device of the present embodiment. It should be noted that the technical solution of the electronic device and the technical solution of the text recognition method belong to the same concept, and details that are not described in detail in the technical solution of the electronic device can be referred to the description of the technical solution of the text recognition method.

An embodiment of the present specification further provides a computer readable storage medium storing computer instructions, which when executed by a processor, implement the steps of the text recognition method as described above.

The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the text recognition method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the text recognition method.

The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.

It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of combined actions, but those skilled in the art should understand that the present specification is not limited by the described order of actions, as some steps may be performed in other orders or simultaneously according to the present specification. Further, those skilled in the art should also appreciate that the embodiments described in this specification are preferred embodiments and that the actions and modules referred to are not necessarily all required by this specification.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

The preferred embodiments of the present specification disclosed above are intended only to aid in the description of the specification. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the specification and its practical application, to thereby enable others skilled in the art to best understand the specification and its practical application. The specification is limited only by the claims and their full scope and equivalents.

Claims

1. A text recognition method, comprising:

acquiring a text set of a plurality of texts;

generating a keyword distribution matrix of each text at a sentence level according to a topic keyword contained in a sentence in each text in the text set, wherein the keyword distribution matrix is used as a first distribution of the topic keyword in each text in the text set, and generating an actual keyword distribution matrix of each text at the sentence level according to an actual topic keyword contained in the sentence in each text in the text set, and the actual keyword distribution matrix is used as a second distribution of the actual topic keyword in each text in the text set;

2. The method of claim 1, wherein the extracting the topic keyword of each text in the text set comprises:

3. The method of claim 1, wherein the extracting the topic keyword of each text in the text set comprises:

4. The text recognition method of claim 1, wherein the obtaining of the actual topic keyword extracted from at least one text in the text set comprises:

5. The text recognition method of claim 1, wherein the classifier is constructed by:

6. The text recognition method of claim 1, wherein after the step of performing the steps of recognizing key sentences and non-key sentences by the text input classifier in the text set carrying the first distribution and the second distribution to obtain the key sentences and the non-key sentences of the text in the text set is performed, the method further comprises:

7. The text recognition method of claim 6, wherein the calculating the recall ratio of each text comprises:

8. The method of claim 6, wherein the calculating the accuracy of each text comprises:

9. The method of claim 1, wherein the obtaining a text set of a plurality of texts comprises:

10. A text recognition apparatus, comprising:

an acquisition module configured to acquire a text set of a plurality of texts;

a determining module configured to generate a keyword distribution matrix of each text at a sentence level according to a topic keyword included in a sentence in each text in the text set, as a first distribution of the topic keyword in each text in the text set, and generate an actual keyword distribution matrix of each text at the sentence level according to an actual topic keyword included in the sentence in each text in the text set, as a second distribution of the actual topic keyword in each text in the text set;

11. The text recognition apparatus of claim 10, wherein the extraction module comprises:

12. The text recognition apparatus of claim 10, wherein the extraction module comprises:

13. The text recognition apparatus of claim 10, wherein the extraction module is further configured to:

14. The text recognition apparatus of claim 10, wherein the classifier is constructed as follows:

correspondingly, operating the identification module;

the identification module further configured to:

15. The text recognition apparatus of claim 10, further comprising:

16. The text recognition apparatus according to claim 15, wherein the second calculation unit includes:

17. The text recognition apparatus according to claim 15, wherein the second calculation unit includes:

18. The text recognition apparatus of claim 10, wherein the obtaining module is further configured to:

19. An electronic device, comprising:

a memory and a processor;

the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the steps of the text recognition method of any one of claims 1 to 9.

20. A computer-readable storage medium storing computer instructions, which when executed by a processor, perform the steps of the text recognition method of any one of claims 1 to 9.