CN110134792B - Text recognition method and device, electronic equipment and storage medium - Google Patents

Text recognition method and device, electronic equipment and storage medium

Info

Publication number
CN110134792B
Authority
CN
China
Prior art keywords
text
keyword
key sentences
sentences
distribution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910431256.0A
Other languages
Chinese (zh)
Other versions
CN110134792A (en
Inventor
李长亮
樊骏锋
汪美玲
唐剑波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Kingsoft Interactive Entertainment Technology Co ltd
Beijing Kingsoft Digital Entertainment Co Ltd
Original Assignee
Chengdu Kingsoft Interactive Entertainment Technology Co ltd
Beijing Kingsoft Digital Entertainment Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Kingsoft Interactive Entertainment Technology Co ltd, Beijing Kingsoft Digital Entertainment Co Ltd
Priority to CN201910431256.0A
Publication of CN110134792A
Application granted
Publication of CN110134792B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present specification provides a text recognition method, an apparatus, an electronic device, and a storage medium. The text recognition method includes: acquiring a text set of a plurality of texts; extracting topic keywords of the texts in the text set, and acquiring actual topic keywords extracted from at least one text in the text set; determining a first distribution of the topic keywords in each text in the text set and a second distribution of the actual topic keywords in each text in the text set; and inputting the texts in the text set carrying the first distribution and the second distribution into a classifier for recognition to obtain key sentences and non-key sentences of the texts in the text set. With this method, the key sentences and non-key sentences of a text can be acquired quickly and accurately. Cleaning out the non-key sentences makes the key sentences easier to label, which improves the efficiency of knowledge graph construction, and retaining the key sentences lets a user quickly grasp the main content of a text when consulting it.

Description

Text recognition method and device, electronic equipment and storage medium
Technical Field
The specification relates to the technical field of natural language processing, in particular to a text recognition method. The present specification also relates to a text recognition apparatus, an electronic device, and a computer-readable storage medium.
Background
With the development of internet technology, people frequently obtain the information they need over a network. When a user queries information in a given field, it is convenient if the user can quickly learn the topic of each article. By screening out the topic key sentences of each article and displaying them to the user, the user can tell from those key sentences whether an article contains the required information.
In the prior art, there are various methods for extracting the topic key sentences of an article. One approach extracts the topic keywords of each article with an unsupervised keyword screening method and then determines the topic key sentences according to the number of keywords contained in each sentence.
However, because the topic keywords extracted by an unsupervised keyword screening method are not very accurate, the accuracy of the extracted topic key sentences is greatly reduced, so the topic key sentences shown to a user consulting an article are not necessarily the actual topic key sentences of that article.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide a text recognition method to solve technical defects in the prior art. The embodiment of the specification also provides a text recognition device, an electronic device and a computer readable storage medium.
According to a first aspect of embodiments of the present specification, there is provided a text recognition method including:
acquiring a text set of a plurality of texts;
extracting a topic keyword of each text in the text set, and acquiring an actual topic keyword extracted from at least one text in the text set;
determining a first distribution of the topic keyword in each text in the text set and a second distribution of the actual topic keyword in each text in the text set;
and inputting the texts in the text set carrying the first distribution and the second distribution into a classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set.
Optionally, the extracting the topic keyword of each text in the text set includes:
performing word segmentation processing on each text in the text set through a word segmentation processing algorithm, and determining a keyword of each text in the text set according to a word segmentation processing result;
and inputting the keywords of each text into a topic generation model for topic keyword identification, and outputting the identified keywords as the topic keywords.
Optionally, the extracting the topic keyword of each text in the text set includes:
performing word segmentation processing on each text in the text set through a word segmentation processing algorithm, and determining a keyword of each text in the text set according to a word segmentation processing result;
calculating the frequency of each keyword in its corresponding text and the inverse keyword frequency of the keyword across the texts in the text set;
determining the keyword score of the keyword according to the product of the frequency and the inverse keyword frequency;
and taking the keywords whose keyword scores are greater than a keyword score threshold as the topic keywords.
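The frequency-based scoring steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `tfidf_topic_keywords`, the whitespace tokenization used in place of a word segmentation algorithm, the logarithmic form of the inverse frequency, and the threshold value are all hypothetical.

```python
import math
from collections import Counter

def tfidf_topic_keywords(texts, threshold):
    """Return, for each text, the keywords whose score exceeds `threshold`.

    `texts` is a list of non-empty token lists (segmentation is assumed done).
    """
    n_texts = len(texts)
    # Number of texts in the set that contain each keyword.
    doc_freq = Counter()
    for tokens in texts:
        doc_freq.update(set(tokens))

    result = []
    for tokens in texts:
        counts = Counter(tokens)
        selected = set()
        for word, count in counts.items():
            tf = count / len(tokens)                  # frequency within this text
            idf = math.log(n_texts / doc_freq[word])  # assumed log form of the inverse frequency
            if tf * idf > threshold:                  # score is the product of the two
                selected.add(word)
        result.append(selected)
    return result

# Hypothetical three-text set; whitespace splitting stands in for word segmentation.
texts = [
    "the match ended and the winner raised the cup".split(),
    "the score was level until the final whistle".split(),
    "the team trained before the match".split(),
]
topics = tfidf_topic_keywords(texts, threshold=0.1)
```

A word that appears in every text, such as "the" here, gets an inverse frequency of zero and is never selected, which is the intended effect of multiplying the two frequencies.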
Optionally, the obtaining of the actual topic keyword extracted from at least one text in the text set includes:
randomly selecting at least one text from the text set, and manually extracting corresponding actual topic keywords from the at least one randomly selected text;
and acquiring the manually extracted actual topic keywords of the at least one text.
Optionally, the determining a first distribution of the topic keyword in each text in the text set and a second distribution of the actual topic keyword in each text in the text set includes:
generating a keyword distribution matrix of each text at a sentence level according to the topic keywords contained in the sentences in each text, wherein the keyword distribution matrix is used as the first distribution;
and generating an actual keyword distribution matrix of each text at a sentence level according to actual topic keywords contained in the sentences in each text, wherein the actual keyword distribution matrix is used as the second distribution.
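One possible reading of the sentence-level distribution matrices is sketched below. The source does not define the matrix layout, so the choice of one row per sentence, one column per keyword, and occurrence counts as entries is an assumption, as are the function name `keyword_distribution_matrix` and the sample data.

```python
def keyword_distribution_matrix(sentences, keywords):
    """Build a sentence-by-keyword count matrix.

    `sentences` is a list of token lists; entry [i][j] is how often
    keywords[j] occurs in sentence i (assumed layout).
    """
    return [[sentence.count(kw) for kw in keywords] for sentence in sentences]

# Hypothetical text with two segmented sentences.
sentences = [
    ["the", "winner", "took", "the", "score", "lead"],
    ["fans", "left", "early"],
]
topic_keywords = ["football", "winner", "score"]                            # model-extracted
actual_topic_keywords = ["football", "winner", "score", "team", "player"]  # manually labeled

first_distribution = keyword_distribution_matrix(sentences, topic_keywords)
second_distribution = keyword_distribution_matrix(sentences, actual_topic_keywords)
```

Both matrices have the same number of rows, so each sentence carries one row from the first distribution and one row from the second.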
Optionally, the classifier is constructed in the following manner:
constructing the classifier according to the incidence relation between the keyword distribution matrix and the sentences contained in each text, a preset classification rule and the corresponding weight of the sentences contained in each text;
correspondingly, the inputting the texts in the text set carrying the first distribution and the second distribution into a classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set, comprises:
inputting the texts in the text set carrying the topic keyword distribution matrix and the actual topic keyword distribution matrix into the classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set.
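The source names the ingredients of the classifier (the distribution matrices, a preset classification rule, and per-sentence weights) but not how they are combined. The sketch below shows one hypothetical combination: each sentence is scored by its weight times the total keyword count in its two matrix rows, and the preset rule is a fixed threshold. The function name, the scoring formula, and the threshold are all assumptions.

```python
def classify_sentences(first, second, sentence_weights, threshold=1.0):
    """Label each sentence as key (True) or non-key (False).

    `first` and `second` are sentence-by-keyword count matrices with one
    row per sentence; `sentence_weights` holds one weight per sentence.
    """
    labels = []
    for row_first, row_second, weight in zip(first, second, sentence_weights):
        score = weight * (sum(row_first) + sum(row_second))  # assumed scoring rule
        labels.append(score >= threshold)                    # assumed preset rule
    return labels

# Hypothetical matrices for a three-sentence text.
first = [[0, 1, 1], [0, 0, 0], [1, 0, 0]]
second = [[0, 1, 1, 0, 0], [0, 0, 0, 0, 0], [1, 0, 0, 0, 0]]
weights = [1.0, 1.0, 0.4]

labels = classify_sentences(first, second, weights)
key_sentences = [i for i, is_key in enumerate(labels) if is_key]
non_key_sentences = [i for i, is_key in enumerate(labels) if not is_key]
```

In this example the third sentence contains a keyword but falls below the threshold because of its low weight, which shows how the weights can override raw keyword counts.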
Optionally, after the step of inputting the texts in the text set carrying the first distribution and the second distribution into the classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set, the method further includes:
calculating the recall rate and/or the accuracy rate of each text according to the number of key sentences and non-key sentences of the text in the text set;
and optimizing the classifier according to the recall rate and/or the accuracy rate of each text.
Optionally, the calculating the recall rate of each text includes:
counting the total number of key sentences contained in each text and the number of actual key sentences contained in the output key sentences of each text;
and calculating the ratio of the number of actual key sentences to the total number of key sentences as the recall rate of each text.
Optionally, the calculating the accuracy rate of each text includes:
counting the number of output key sentences of each text and the number of actual key sentences contained in the output key sentences of each text;
and calculating the ratio of the number of actual key sentences to the number of output key sentences as the accuracy rate of each text.
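The two ratios above can be computed as in the following sketch. Sentences are identified by index, and the function name `recall_and_accuracy` and the sample indices are hypothetical. Note that the accuracy rate as defined here (correct outputs divided by all outputs) corresponds to what is commonly called precision.

```python
def recall_and_accuracy(output_key, actual_key):
    """Return (recall rate, accuracy rate) for one text.

    `output_key` holds the sentence indices the classifier marked as key;
    `actual_key` holds the indices of the text's true key sentences.
    """
    output_key, actual_key = set(output_key), set(actual_key)
    correct = len(output_key & actual_key)  # actual key sentences among the outputs
    recall = correct / len(actual_key) if actual_key else 0.0
    accuracy = correct / len(output_key) if output_key else 0.0
    return recall, accuracy

# Hypothetical text: the classifier output 3 sentences, 2 of them truly key,
# while the text contains 4 key sentences in total.
recall, accuracy = recall_and_accuracy(output_key=[0, 2, 5], actual_key=[0, 2, 3, 4])
```

Returning 0.0 when a denominator is empty is a design choice of this sketch, made only to avoid division by zero.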
Optionally, the obtaining a text set of a plurality of texts includes:
and acquiring a plurality of texts of the same category in the vertical field, and creating the text set according to the plurality of texts.
According to a second aspect of embodiments herein, there is provided a text recognition apparatus including:
an acquisition module configured to acquire a text set of a plurality of texts;
an extraction module configured to extract a topic keyword of each text in the text set and acquire an actual topic keyword extracted from at least one text in the text set;
a determining module configured to determine a first distribution of the topic keyword in each text in the text set and a second distribution of the actual topic keyword in each text in the text set;
and the identification module is configured to identify key sentences and non-key sentences of the texts in the text set by inputting the texts in the text set carrying the first distribution and the second distribution into the classifier, so as to obtain the key sentences and the non-key sentences of the texts in the text set.
Optionally, the extracting module includes:
the first word segmentation processing unit is configured to perform word segmentation processing on each text in the text set through a word segmentation processing algorithm, and determine a keyword of each text in the text set according to a word segmentation processing result;
and an identification unit configured to input the keywords of each text into a topic generation model for topic keyword identification, and output the identified keywords as the topic keywords.
Optionally, the extracting module includes:
the second word segmentation processing unit is configured to perform word segmentation processing on each text in the text set through a word segmentation processing algorithm, and determine a keyword of each text in the text set according to a word segmentation processing result;
a first calculating unit configured to calculate the frequency of each keyword in its corresponding text and the inverse keyword frequency of the keyword across the texts in the text set;
a keyword scoring unit configured to determine a keyword score of the keyword according to a product of the frequency and the inverse keyword frequency;
and the subject keyword determining unit is configured to take the keywords with the keyword scores larger than a keyword score threshold value as the subject keywords.
Optionally, the extracting module is further configured to:
randomly selecting at least one text from the text set, and manually extracting corresponding actual topic keywords from the at least one randomly selected text;
and acquiring the manually extracted actual topic keywords of the at least one text.
Optionally, the determining module includes:
a keyword distribution matrix generation unit configured to generate a keyword distribution matrix of each text at a sentence level as the first distribution according to a topic keyword included in a sentence in each text;
and an actual keyword distribution matrix generation unit configured to generate an actual keyword distribution matrix of each text at a sentence level, as the second distribution, according to the actual topic keywords included in the sentences in each text.
Optionally, the classifier is constructed in the following manner:
constructing the classifier according to the incidence relation between the keyword distribution matrix and the sentences contained in each text, a preset classification rule and the corresponding weight of the sentences contained in each text;
correspondingly, the identification module is further configured to:
input the texts in the text set carrying the topic keyword distribution matrix and the actual topic keyword distribution matrix into the classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set.
Optionally, the text recognition apparatus further includes:
the second calculation unit is configured to calculate the recall rate and/or the accuracy rate of each text according to the number of key sentences and non-key sentences of the texts in the text set;
an optimizing unit configured to optimize the classifier according to the recall rate and/or the accuracy rate of each text.
Optionally, the second computing unit includes:
a first counting submodule configured to count the total number of key sentences contained in each text and the number of actual key sentences contained in the output key sentences of each text;
a recall rate calculation submodule configured to calculate a ratio of the number of actual key sentences to the total number of key sentences as the recall rate of each text.
Optionally, the second computing unit includes:
a second counting submodule configured to count the number of output key sentences of each text and the number of actual key sentences contained in the output key sentences of each text;
and an accuracy rate calculation submodule configured to calculate a ratio of the number of actual key sentences to the number of output key sentences as the accuracy rate of each text.
Optionally, the obtaining module is further configured to:
and acquiring a plurality of texts of the same category in the vertical field, and creating the text set according to the plurality of texts.
According to a third aspect of embodiments herein, there is provided an electronic apparatus including:
a memory and a processor;
the memory is for storing computer-executable instructions that when executed by the processor implement the steps of the text recognition method.
According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of any one of the text recognition methods.
Compared with the prior art, the specification has the following advantages:
the present specification provides a text recognition method, including: acquiring a text set of a plurality of texts; extracting a subject keyword of each text in the text set, and acquiring an actual subject keyword extracted from at least one text in the text set; determining a first distribution of the topic keyword in each text in the text set and a second distribution of the actual topic keyword in each text in the text set; and inputting the texts in the text set carrying the first distribution and the second distribution into a classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set.
The text recognition method provided by the specification extracts actual topic keywords from a small number of texts in the text set and topic keywords from a large number of texts in the text set, determines a first distribution of the topic keywords in each text in the text set and a second distribution of the actual topic keywords in each text in the text set, and inputs each text carrying the first distribution and the second distribution into a classifier to identify key sentences and non-key sentences, thereby determining the key sentences and non-key sentences of each text in the text set. Cleaning out the non-key sentences while retaining the key sentences makes the key sentences easier to label, which improves efficiency when constructing a knowledge graph, and retaining the key sentences lets a user quickly grasp the main content of a text when consulting it.
Drawings
Fig. 1 is a flowchart of a text recognition method provided in an embodiment of the present specification;
FIG. 2 is a process flow diagram of a text recognition process provided by an embodiment of the present specification;
fig. 3 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present disclosure;
fig. 4 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present description. This description may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein, as those skilled in the art will be able to make and use the present disclosure without departing from the spirit and scope of the present disclosure.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first can also be referred to as a second and, similarly, a second can also be referred to as a first without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.
First, the terms used in one or more embodiments of the present specification are explained.
TF-IDF (Term Frequency-Inverse Document Frequency): a commonly used weighting technique for information retrieval and data mining, where TF means term frequency and IDF means inverse document frequency. It is a statistical method for assessing how important a word is to a document in a document set or corpus.
LDA (Latent Dirichlet Allocation): a document topic generation model, also called a three-layer Bayesian probability model, comprising the three layers of words, topics and documents. It is an unsupervised machine learning technique that can be used to identify latent topic information in large-scale document sets or corpora.
Keywords: words, terms or phrases used to express the subject matter of a document such as a scientific paper, a scientific report, an academic paper or an article.
Actual topic keywords: words, terms or phrases expressing the subject matter of a document that are labeled manually on a small number of texts such as scientific papers, scientific reports, academic papers or articles. Manually labeled actual topic keywords are highly accurate.
Topic keywords: words, terms or phrases expressing the subject matter of a document that are labeled by TF-IDF or LDA on a large number of texts such as scientific papers, scientific reports, academic papers or articles. Labeling topic keywords in this way is highly efficient.
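For reference, one common formulation of the TF-IDF weight defined in the terms above is given below. The source does not state which variant is used, so the normalized term frequency and the logarithmic inverse document frequency are assumptions. Here $n_{t,d}$ is the number of occurrences of term $t$ in document $d$, and $N$ is the number of documents in the set $D$.

```latex
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{\lvert \{ d \in D : t \in d \} \rvert}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)
```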
In the present specification, a text recognition method is provided. This specification also relates to a text recognition apparatus, an electronic device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
Fig. 1 shows a flow diagram of a text recognition method according to an embodiment of the present description, including steps 102 to 108.
Step 102: a text set of a plurality of texts is obtained.
In one embodiment of the present disclosure, the text set of the plurality of texts may be a text set composed of a plurality of articles or a text set composed of a plurality of news reports, wherein the text set composed of the plurality of articles or the text set composed of the plurality of news reports belong to the same domain. For example, searching for soccer in a search engine, a platform carrying the search engine may show a large number of articles, news and pictures about soccer, all of which belong to the field of sports soccer.
Here, the text recognition method is described by taking a text set composed of articles as an example. When a user wants to learn about a certain subject, the user usually searches a network for related articles. When a search engine returns articles on that subject, the key sentences of each article are extracted and displayed first, so that the user can quickly and accurately learn what the main content of each article is and whether it is the article the user needs.
To provide accurate key sentences to users, after a text set composed of a plurality of texts is acquired, actual topic keywords are extracted from a small number of texts in the text set and topic keywords are extracted from a large number of texts in the text set. A first distribution is determined from the distribution of the topic keywords in the sentences of each text, and a second distribution is determined from the distribution of the actual topic keywords in the sentences of each text. The texts carrying the first distribution and the second distribution are then input into a classifier for key sentence identification, which improves the accuracy of identifying the key sentences of each text. The non-key sentences are cleaned out and the key sentences of each text are retained as the main content displayed to the user, so that the key sentences displayed are the actual main content of the corresponding article.
In addition, document-level event extraction is an important link in knowledge graph construction, and extracting key sentences and cleaning out non-key sentences at the document level plays an important role in the accuracy and efficiency of subsequent event extraction. Cleaning out the non-key sentences of a text while retaining its key sentences makes the key sentences easier to label and improves efficiency when constructing a knowledge graph.
For example, a news article reporting a car accident contains 10,000 words, but some users care only about the main content of the news, such as where and when the accident occurred and how many people were injured. The key sentence of this article is that a car accident occurred at location A at 8 a.m. on April 17, 2019, and no one was injured.
In one or more implementations of this embodiment, the obtaining a text set of a plurality of texts includes:
and acquiring a plurality of texts of the same category in the vertical field, and creating the text set according to the plurality of texts.
Specifically, in the process of identifying key sentences in subsequent texts, the key sentences of the texts in the same category in the vertical field are identified, that is, the obtained text sets of the plurality of texts are the text sets created for the plurality of texts in the same category in the vertical field.
Here, the vertical field may be understood as follows: one large field is subdivided vertically into several smaller fields, each of which belongs to the vertical field. For example, within the sports vertical field, track and field is a second-level field and can be treated as one category of the sports vertical field. The second-level field of track and field can be further divided into third-level fields; for example, the 100-metre dash, relay and marathon are all third-level fields subdivided from track and field.
In the text set in the vertical field, the attributes of the keywords are similar, and the types of the keywords are limited, so that the texts in the same category in the vertical field are acquired, the text set is created according to the texts in the same category in the vertical field, and the texts in the same field are identified in the subsequent process of identifying the text key sentences, so that the obtained text key sentences can be more accurate.
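Creating a text set from texts of the same category can be sketched as a simple grouping step. The record fields `domain`, `category` and `text`, the function name `build_text_sets`, and the sample records are hypothetical; the source does not say how category labels are obtained.

```python
from collections import defaultdict

def build_text_sets(documents):
    """Group texts by (vertical field, category); each group is one text set."""
    text_sets = defaultdict(list)
    for doc in documents:
        text_sets[(doc["domain"], doc["category"])].append(doc["text"])
    return dict(text_sets)

# Hypothetical records that already carry a field and a category label.
documents = [
    {"domain": "sports", "category": "track and field", "text": "Marathon report"},
    {"domain": "sports", "category": "track and field", "text": "Relay final recap"},
    {"domain": "sports", "category": "football", "text": "League match summary"},
]
text_sets = build_text_sets(documents)
track_set = text_sets[("sports", "track and field")]
```

Key sentence identification would then run on one such text set at a time, so that all texts it sees share similar keyword types.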
Step 104: extracting the topic keywords of each text in the text set, and acquiring the actual topic keywords extracted from at least one text in the text set.
Specifically, based on the acquired text set, the topic keywords of each text in the text set are extracted, and the actual topic keywords of at least one text in the text set are acquired. The topic keywords of each text are extracted by a set algorithm or a set model, and the actual topic keywords are extracted by manual labeling.
For example, a text set consists of 100 articles about football. Keywords are extracted from the 100 articles through a set model, and the topic keywords are determined to be football, winner and score. Keywords are also labeled manually on one of the 100 articles, and the actual topic keywords are determined to be football, winner, score, team, home/away venue, game time and player. It can thus be seen that the manually labeled actual topic keywords are richer than the topic keywords extracted by the set model.
On the basis of obtaining the actual topic keyword of at least one text in the text set, further, in one or more embodiments of this embodiment, the obtained actual topic keyword is extracted manually, and a specific implementation manner is as follows:
randomly selecting at least one text from the text set, and manually extracting corresponding actual topic keywords from the at least one randomly selected text;
and acquiring the manually extracted actual topic keywords of the at least one text.
Specifically, on the basis of obtaining the text set of the plurality of texts, a small number of texts are randomly selected from the text set to manually extract actual subject keywords, and the actual subject keywords of the manually extracted small number of texts are obtained.
In practical applications, the manual extraction process is again described using the paragraph of the above article, "Flowers bloom and flowers wither, but this does not mean that the life of the flower is lost … …". Through manual labeling, the keywords of the paragraph are determined to include "flower", "blossom", "withering", "not", "meaning", "life", "in", and "lost"; by understanding what the paragraph describes, the actual topic keywords are determined to be "flower", "blossom", "withering", and "lost".
Manually extracting the actual topic keywords of a small number of texts ensures the accuracy of the extracted actual topic keywords, provides a reference standard for the subsequent identification of the key sentences of each text, and thereby helps ensure that the key sentences subsequently identified are highly accurate.
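The random selection of a small number of texts for manual labeling can be sketched as follows. This is a minimal illustration only: the function name `select_for_manual_labeling`, the sample size, and the seed are assumptions introduced here and do not appear in the specification.

```python
import random

def select_for_manual_labeling(text_set, sample_size, seed=None):
    """Randomly pick a small subset of texts to be keyworded by hand."""
    if not 1 <= sample_size <= len(text_set):
        raise ValueError("sample_size must be between 1 and len(text_set)")
    rng = random.Random(seed)  # seeded only so that the sketch is reproducible
    return rng.sample(text_set, sample_size)

# Hypothetical text set of 100 article identifiers.
text_set = [f"article_{i}" for i in range(100)]
to_label = select_for_manual_labeling(text_set, sample_size=5, seed=0)
```

The selected texts would then be handed to a human annotator; the annotation step itself is outside the scope of the sketch.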
On the basis of extracting the topic keyword of each text in the text set, further, in one or more embodiments of this embodiment, the topic keyword of each text in the text set is extracted, and a specific implementation manner is as follows:
performing word segmentation processing on each text in the text set through a word segmentation processing algorithm, and determining a keyword of each text in the text set according to a word segmentation processing result;
and inputting the keywords of each text into a topic generation model for topic keyword identification, and outputting the identified keywords as the topic keywords.
Specifically, word segmentation is performed on each text in the text set by a word segmentation algorithm from natural language processing, the keywords of each text are determined from the segmentation result, and the keywords of each text are input into a topic generation model for topic keyword identification; the identified keywords are used as the topic keywords of each text.
Based on the above, the topic generation model identifies the topic keywords by counting the number of times each keyword occurs in the corresponding text.
For example, a paragraph in one article reads "Flowers bloom and flowers wither, but this does not mean that the life of the flower is dying … …". The words of the paragraph are determined by a word segmentation algorithm to include "flower", "blossom", "withering", "not", "meaning", "life", "in", and "dying"; all 11 keywords are input into the topic generation model for topic keyword identification, and the topic keyword obtained for the paragraph is "flower".
In practical applications, the topic generation model needs to be trained on a large number of samples so that the topic keywords it identifies are more accurate; a suitable sample library may be selected for training according to the practical application, which is not limited in this specification.
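The occurrence counting described above can be illustrated with a short sketch. It assumes that the text has already been segmented into tokens; the token list, the stop-word set, and the function name `topic_keywords_by_count` are hypothetical, and a trained topic generation model would be considerably more involved than this counting step.

```python
from collections import Counter

def topic_keywords_by_count(tokens, stop_words=frozenset(), top_n=1):
    """Return the top_n most frequent tokens, ignoring stop words."""
    counts = Counter(t for t in tokens if t not in stop_words)
    return [word for word, _ in counts.most_common(top_n)]

# Illustrative segmentation result, not the output of a real segmenter.
tokens = ["flower", "blossom", "flower", "wither", "not",
          "mean", "flower", "life", "in", "die"]
topic = topic_keywords_by_count(tokens, stop_words={"not", "in"})
```

With this token list, "flower" occurs most often and is returned as the topic keyword.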
On the basis of extracting the topic keyword of each text in the text set, further, this specification further provides another method for extracting the topic keyword of each text in the text set, where in one or more embodiments of this embodiment, the extracting the topic keyword of each text in the text set includes:
performing word segmentation processing on each text in the text set through a word segmentation processing algorithm, and determining a keyword of each text in the text set according to a word segmentation processing result;
calculating the matching frequency of each keyword in the corresponding text and the reverse keyword frequency of the keyword with respect to each text in the text set;
determining the keyword score of the keyword according to the product of the frequency and the reverse keyword frequency;
and taking the keywords with the scores larger than the keyword score threshold value as the topic keywords.
Specifically, word segmentation is performed on each text in the text set by a word segmentation algorithm from natural language processing, and the keywords of each text are determined from the segmentation result. For each keyword, the frequency with which it appears in the corresponding text and its reverse keyword frequency are calculated, and the two are multiplied to give the keyword score. The keyword score is compared with a keyword score threshold: a keyword whose score is greater than the threshold is taken as a topic keyword, and a keyword whose score is less than or equal to the threshold is not processed further.
In specific implementation, the reverse keyword frequency of a keyword in each text may be calculated as follows: the weight of each keyword in each text is determined, and the reverse keyword frequency of the keyword relative to the corresponding text is determined from that weight. Here, the weight of a keyword may be determined by matching the keyword against a preset keyword library in which every keyword has a corresponding weight; a keyword that matches an entry in the library is assigned the weight recorded there. That is, the reverse keyword frequency of each keyword in each text in the text set can be determined from the weight of the keyword.
Alternatively, the reverse keyword frequency of a keyword may be determined by a logarithmic function. For example, if the word "China" appears in one thousand of ten million articles, the reverse keyword frequency of the word "China" over the ten million articles is determined by the logarithmic function to be lg(10000000/1000) = 4.
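The logarithmic calculation can be written as a one-line function. The sketch below assumes a base-10 logarithm, as the "lg" notation suggests; the function name `reverse_keyword_frequency` and its parameter names are introduced here for illustration.

```python
import math

def reverse_keyword_frequency(total_texts, texts_containing_keyword):
    """Base-10 log of (number of texts / number of texts containing the keyword)."""
    if not 0 < texts_containing_keyword <= total_texts:
        raise ValueError("keyword must occur in between 1 and total_texts texts")
    return math.log10(total_texts / texts_containing_keyword)

# The "China" example: ten million articles, one thousand containing the word.
value = reverse_keyword_frequency(10_000_000, 1_000)
```

For the figures in the example, the result is 4 up to floating-point rounding.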
In practical applications, the paragraph "Flowers bloom and flowers wither, but this does not mean that the life of the flower is dying … …" is again used to describe this second method of extracting the topic keywords of each text in the text set. The words of the paragraph are determined by a word segmentation algorithm to include "flower", "blossom", "withering", "and", "not", "meaning", "life", "at", and "dying". The matching frequency of "flower" is determined to be 3. By calculation, the reverse keyword frequency of "flower" is 0.7, that of "blossom", "withering", "meaning", "life", and "dying" is 0.5, and that of "and", "not", and "at" is 0.2. The keyword score of "flower" is therefore 2.1, the keyword score of each of "blossom", "withering", "meaning", "life", and "dying" is 0.5, and the keyword score of each of "and", "not", and "at" is 0.3. With a keyword score threshold of 1, the keyword "flower" is determined to be the topic keyword of the paragraph.
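The score-and-threshold step can be sketched as below. The values for "flower" (frequency 3, reverse keyword frequency 0.7, score of about 2.1) follow the example; the frequency of 1 assumed for the other two words and the function names are illustrative assumptions only.

```python
def keyword_scores(frequencies, reverse_frequencies):
    """Score each keyword as frequency times reverse keyword frequency."""
    return {w: frequencies[w] * reverse_frequencies[w] for w in frequencies}

def select_topic_keywords(scores, threshold):
    """Keep only the keywords whose score is strictly above the threshold."""
    return [w for w, s in scores.items() if s > threshold]

frequencies = {"flower": 3, "blossom": 1, "withering": 1}          # assumed counts
reverse_frequencies = {"flower": 0.7, "blossom": 0.5, "withering": 0.5}

scores = keyword_scores(frequencies, reverse_frequencies)
topic_keywords = select_topic_keywords(scores, threshold=1.0)
```

Only "flower" exceeds the threshold of 1, so it is the only topic keyword selected.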
In addition, each text topic keyword can be extracted by a TF-IDF statistical method or an LDA document topic generation model, and the description of the specification is omitted.
In the process of extracting the topic keywords, the topic keywords of each text are extracted by the two methods, so that the accuracy of the extracted topic keywords and the extraction efficiency of the extracted topic keywords are ensured, and an important basis is laid for the subsequent more accurate identification of the key sentences of each text.
Step 106: a first distribution of the topic keyword in each text in the corpus and a second distribution of the actual topic keyword in each text in the corpus are determined.
Specifically, the method includes extracting a topic keyword from each text, extracting an actual topic keyword of at least one text, determining the first distribution according to the distribution of the topic keyword in each text based on the topic keyword, and determining the second distribution according to the distribution of the actual topic keyword in each text.
In specific implementation, the first distribution is the distribution of the topic keywords over the sentences of each text, and the second distribution is the distribution of the actual topic keywords over the sentences of each text.
On the basis of the foregoing determination of the first distribution and the second distribution, further, in one or more implementations of this embodiment, a specific implementation manner of the generation process of the first distribution and the second distribution is as follows:
generating a keyword distribution matrix of each text at a sentence level according to the topic keywords contained in the sentences in each text, wherein the keyword distribution matrix is used as the first distribution;
and generating an actual keyword distribution matrix of each text at a sentence level according to actual topic keywords contained in the sentences in each text, wherein the actual keyword distribution matrix is used as the second distribution.
Specifically, a keyword distribution matrix of each text at a sentence level is generated according to the topic keywords included in the sentences in each text, the distribution matrix of the topic keywords at the sentence level in each text is determined as the first distribution, an actual keyword distribution matrix of each text at the sentence level is generated according to the actual topic keywords included in the sentences in each text, and the distribution matrix of the actual topic keywords at the sentence level in each text is determined as the second distribution.
In practical applications, two texts, doc1 and doc2, are taken as an example to describe how the first distribution and the second distribution are determined, where doc1 is "I like playing football" and doc2 is "I like playing tennis". The extracted topic keywords are "I" and "like", the actual topic keywords are "football" and "tennis", and the element values of the keyword matrix and the actual keyword matrix represent word frequencies. The keyword matrix determined from the distribution of the topic keywords in the two texts is as follows:
$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$$
wherein a11, a12, a21, and a22 are all 1, which means that "I" and "like" each appear in the two texts doc1 and doc2 with a frequency of 1;
determining an actual keyword matrix according to the distribution of the actual topic keywords in the two texts as follows:
$$B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$
where B11 and B22 are 1 and B12 and B21 are 0, which means that "football" appears in the text doc1 with a frequency of 1 and in the text doc2 with a frequency of 0, and "tennis" appears in the text doc1 with a frequency of 0 and in the text doc2 with a frequency of 1.
The keyword distribution matrix is determined from the distribution of the topic keywords at the sentence level of each text and is taken as the first distribution; the actual keyword distribution matrix is determined from the distribution of the actual topic keywords at the sentence level of each text and is taken as the second distribution. Expressing the first distribution and the second distribution as matrices makes the distribution of the topic keywords and the actual topic keywords in each text more intuitive.
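The doc1/doc2 example can be reproduced with a small sketch. It assumes whitespace tokenization and treats each text as a single sentence, with rows corresponding to texts and columns to keywords; the function name `keyword_matrix` is hypothetical.

```python
def keyword_matrix(texts, keywords):
    """Rows are texts, columns are keywords, entries are word frequencies."""
    return [[text.split().count(k) for k in keywords] for text in texts]

doc1 = "I like playing football"
doc2 = "I like playing tennis"

# First distribution: topic keywords "I" and "like".
A = keyword_matrix([doc1, doc2], ["I", "like"])
# Second distribution: actual topic keywords "football" and "tennis".
B = keyword_matrix([doc1, doc2], ["football", "tennis"])
```

Here `A` is a matrix of ones, and `B` has ones on the diagonal and zeros elsewhere, matching the element values described in the example.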
Step 108: and inputting the texts in the text set carrying the first distribution and the second distribution into a classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set.
Specifically, according to the first distribution of the determined topic keywords in each text and the second distribution of the actual topic keywords in each text, the text in the text set carrying the first distribution and the second distribution is input to the classifier for identifying key sentences and non-key sentences, and the key sentences and the non-key sentences of the text in the text set are obtained.
In specific implementation, the texts carrying the first distribution and the second distribution are identified by the classifier to obtain the key sentences and non-key sentences of each text. In the recognition process, the classifier calculates, for each sentence in a text carrying the first distribution and the second distribution, the probability that the sentence is a key sentence; the text output by the classifier may carry this probability for each sentence. Sentences whose probability is greater than or equal to a preset threshold are taken as key sentences, and sentences whose probability is smaller than the preset threshold are taken as non-key sentences. For the output, two sets may be created per text: one set of the key sentences of the text and one set of its non-key sentences.
In specific implementation, in the text output by the classifier, the key sentences and non-key sentences of each text may be marked differently: for example, key sentences may be highlighted while non-key sentences are left unmarked, so that the two can be distinguished easily and quickly. Each text output by the classifier carries at least one mark, which labels a key sentence in the text.
For example, a passage in an article reads: "It is bright and sunny today, and I want to go for a walk in the park". The passage is input into the classifier to identify key sentences and non-key sentences, and the same passage is output with "I want to go for a walk in the park" marked as the key sentence in bold.
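The thresholding of per-sentence probabilities into two sets can be sketched as follows. The probabilities, the threshold value of 0.5, and the function name `split_by_probability` are assumptions for illustration; how the classifier arrives at the probabilities is not shown.

```python
def split_by_probability(sentence_probs, threshold=0.5):
    """Split (sentence, probability) pairs into key and non-key sentences."""
    key, non_key = [], []
    for sentence, prob in sentence_probs:
        # A probability equal to the threshold counts as a key sentence.
        (key if prob >= threshold else non_key).append(sentence)
    return key, non_key

# Hypothetical classifier output for one text.
sentence_probs = [("s1", 0.9), ("s2", 0.5), ("s3", 0.2)]
key_sentences, non_key_sentences = split_by_probability(sentence_probs)
```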
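Marking key sentences in the output text can be illustrated with a short sketch. The `<b>` tags stand in for whatever highlighting or bold mark an implementation uses, and the function name `mark_key_sentences` and the example sentences are hypothetical.

```python
def mark_key_sentences(sentences, key_sentences, open_tag="<b>", close_tag="</b>"):
    """Wrap key sentences in a mark and leave non-key sentences unmarked."""
    key = set(key_sentences)
    return [f"{open_tag}{s}{close_tag}" if s in key else s for s in sentences]

sentences = ["It is bright and sunny today",
             "I want to go for a walk in the park"]
marked = mark_key_sentences(sentences, ["I want to go for a walk in the park"])
```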
On the basis that the classifier identifies a key sentence and a non-key sentence, in one or more embodiments of this embodiment, the classifier is constructed in the following manner:
constructing the classifier according to the incidence relation between the keyword distribution matrix and the sentences contained in each text, a preset classification rule and the corresponding weight of the sentences contained in each text;
correspondingly, the step of inputting the texts in the text set carrying the first distribution and the second distribution into a classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set, comprises:
and inputting the text in the text set carrying the subject keyword distribution matrix and the actual subject keyword distribution matrix into the classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the text in the text set.
Specifically, the classifier is constructed by giving a weight to the sentences in each text, presetting a classification rule, and establishing an association relationship between the keyword distribution matrix corresponding to the first distribution and the sentences contained in each text, wherein the corresponding weight of the sentences contained in each text is set by the reverse keyword frequency of the keywords contained in each sentence.
Based on this, after the classifier is constructed according to the incidence relation between the keyword distribution matrix and the sentences contained in each text, the preset classification rule and the corresponding weight of the sentences contained in each text, the classifier is used for correspondingly identifying the key sentences and the non-key sentences of the text set of the plurality of texts.
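One possible reading of this construction is sketched below. Because the preset classification rule is not spelled out, the sketch assumes a simple rule: a sentence's weight is the sum of the reverse keyword frequencies of the topic keywords it contains, and the sentence is a key sentence when that weight reaches a threshold. The names `sentence_weight` and `build_classifier`, the weights, and the threshold are all illustrative assumptions.

```python
def sentence_weight(tokens, reverse_keyword_frequency):
    """Sum the reverse keyword frequencies of the topic keywords in a sentence."""
    return sum(reverse_keyword_frequency.get(t, 0.0) for t in tokens)

def build_classifier(reverse_keyword_frequency, threshold):
    """Return a function that labels a tokenized sentence as key (True) or not."""
    def classify(tokens):
        return sentence_weight(tokens, reverse_keyword_frequency) >= threshold
    return classify

# Hypothetical reverse keyword frequencies of two topic keywords.
rkf = {"football": 0.7, "score": 0.5}
classify = build_classifier(rkf, threshold=1.0)

is_key = classify(["the", "football", "score", "was", "close"])
is_not_key = classify(["it", "rained"])
```

A different classification rule could be substituted in `classify` without changing the rest of the sketch.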
In specific implementation, the preset classification rule may be set according to the application scenario in practical applications, and this specification places no limitation on it.
By adopting the classifier to identify the key sentences and the non-key sentences of each text, compared with the deep learning method, the method can identify the key sentences and the non-key sentences of each text without a large amount of labeled data, thereby saving the cost of labeling data in the deep learning method.
On the basis of identifying the key sentences and the non-key sentences of each text by the classifier, further, in one or more embodiments of this embodiment, the classifier is optimized, and a process of specifically optimizing the classifier is as follows:
calculating the recall rate and/or the accuracy rate of each text according to the number of key sentences and non-key sentences of the text in the text set;
and optimizing the classifier according to the recall rate and/or the accuracy rate of each text.
Specifically, the recall rate and/or the accuracy rate of each text is calculated from the number of key sentences and non-key sentences of each text output by the classifier, and the weights of the sentences of each text in the classifier are adjusted accordingly. The weights are adjusted by a back-propagation algorithm: after each adjustment, the recall rate and/or the accuracy rate of each text is recalculated to check whether it approaches 1; if not, iteration continues through the back-propagation algorithm and the weights are adjusted further until the recall rate and/or the accuracy rate approaches 1. In this process, a small number of text samples are randomly extracted and labeled manually to train the classifier, so that the resulting classifier identifies key sentences and non-key sentences with higher accuracy.
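The iterative weight adjustment can be illustrated with a deliberately simplified stand-in. The sketch below is not the classifier of this specification: it assumes a single-layer logistic model, for which back-propagation reduces to plain gradient descent, and it stops as soon as every manually labeled sentence is classified correctly (that is, when recall and accuracy on the labeled sample both reach 1). The features, labels, learning rate, and function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x, threshold=0.5):
    """Label a sentence as key (True) when its probability reaches the threshold."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) >= threshold

def train(features, labels, lr=0.5, max_epochs=2000):
    """Adjust weights by gradient descent until all labeled sentences are correct."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(max_epochs):
        for x, y in zip(features, labels):
            p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            err = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
        if all(predict(w, b, x) == bool(y) for x, y in zip(features, labels)):
            break  # recall and accuracy on the labeled sample are both 1
    return w, b

# Hypothetical per-sentence features:
# (topic keyword count, actual topic keyword count); label 1 = key sentence.
features = [(2, 1), (1, 2), (0, 0), (1, 0), (3, 2), (0, 1)]
labels = [1, 1, 0, 0, 1, 0]
w, b = train(features, labels)
```

The stopping test plays the role of the "approaches 1" check; a real implementation would evaluate it on held-out labeled samples rather than on the training sample itself.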
In addition, F1 parameters of each text are calculated according to the number of key sentences and non-key sentences of the text in the text set, and the classifier is optimized according to the F1 parameters of each text. The F1 parameter is determined according to the recall ratio and the accuracy, and can be understood as an integrated standard determined by integrating the recall ratio and the accuracy.
For example, there are 1400 articles: 300 about football, 300 about basketball, and 800 about track and field. A search for football articles over these 1400 articles returns 200 articles about football, 100 about basketball, and 100 about track and field. The accuracy rate of this retrieval is 200/(200+100+100) = 50%, the recall rate is 200/300 = 66.7%, and the F1 parameter is 50% × 66.7% × 2/(50% + 66.7%) = 57.1%.
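The arithmetic of this example can be checked with a short sketch. It reads the example as a search for football articles, of which there are 300 in the collection and 200 among the 400 articles returned; the function name `precision_recall_f1` and its parameter names are introduced here, and nonzero denominators are assumed.

```python
def precision_recall_f1(relevant_retrieved, total_retrieved, total_relevant):
    """Accuracy rate (precision), recall rate, and their harmonic mean F1."""
    precision = relevant_retrieved / total_retrieved
    recall = relevant_retrieved / total_relevant
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 200 football articles among 400 returned; 300 football articles in total.
precision, recall, f1 = precision_recall_f1(200, 200 + 100 + 100, 300)
```

The three values come out at about 0.5, 0.667, and 0.571, in line with the percentages in the example.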
On the basis of the optimization of the classifier, in one or more embodiments of this embodiment, a calculation process of the recall ratio is as follows:
counting the total number of key sentences contained in each text and the actual number of key sentences contained in the output key sentences of each text;
and calculating the ratio of the actual number of the key sentences to the total number of the key sentences as the recall rate of each text.
Specifically, recall rate = number of actual key sentences among the output key sentences / total number of key sentences contained in the text. The recall rate measures how completely the classifier identifies the key sentences: the higher the recall rate, the more of the text's key sentences the classifier identifies, and the lower the recall rate, the fewer it identifies.
On the basis of the optimization of the classifier, further, in one or more implementations of this embodiment, a calculation process of the accuracy is as follows:
counting the number of the output key sentences of each text and the number of actual key sentences contained in the output key sentences of each text;
and calculating the ratio of the actual number of the key sentences to the number of the key sentences to serve as the accuracy of each text.
Specifically, accuracy rate = number of actual key sentences among the output key sentences / number of output key sentences. The accuracy rate measures how accurate the key sentences identified by the classifier are: the higher the accuracy rate, the more accurately the classifier identifies key sentences, and the lower the accuracy rate, the less accurately it does so.
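The two per-text ratios can be computed from sets of sentences, as in the following sketch. The sentence identifiers and the function name `recall_and_accuracy` are hypothetical; the actual key sentences are assumed to come from manual labeling.

```python
def recall_and_accuracy(output_key_sentences, actual_key_sentences):
    """Per-text recall rate and accuracy rate from two collections of sentences."""
    output, actual = set(output_key_sentences), set(actual_key_sentences)
    hits = len(output & actual)  # actual key sentences among the output
    recall = hits / len(actual) if actual else 0.0
    accuracy = hits / len(output) if output else 0.0
    return recall, accuracy

# The classifier outputs four sentences; three sentences are truly key.
recall_rate, accuracy_rate = recall_and_accuracy(
    ["s1", "s2", "s3", "s4"], ["s1", "s2", "s5"])
```

In this example two of the three actual key sentences are found (recall rate of about 0.667), and two of the four output sentences are correct (accuracy rate of 0.5).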
In the process of optimizing the classifier, the recall rate and the accuracy rate are both applied, in order to enable the classifier to identify the key sentences and the non-key sentences more accurately, a metric value can be determined by fusing the recall rate and the accuracy rate, the metric value is an F1 parameter, and the classifier is further optimized through an F1 parameter, so that the identification accuracy of the classifier is higher.
The text recognition method provided in this specification obtains the topic keywords of each text in the text set by a statistical algorithm or a topic generation model, and extracts the actual topic keywords of only a small number of texts manually, which greatly reduces the cost of manual keyword extraction. The first distribution and the second distribution are then determined from the distributions of the extracted topic keywords and actual topic keywords in each text; because both take the form of a matrix, the distribution of the topic keywords and the actual topic keywords in each text can be seen more intuitively. The key sentences and non-key sentences of each text are identified by inputting the first distribution and the second distribution into the classifier, and the classifier is optimized by the accuracy rate and/or the recall rate, which ensures the accuracy with which the classifier identifies key sentences and non-key sentences and improves identification efficiency. In addition, the method retains the key sentences of a text by cleaning out its non-key sentences, which makes the key sentences easy to label and improves construction efficiency when a knowledge graph is built.
The text recognition method provided in the present specification will be further described below with reference to fig. 2, by taking an example of an application of the text recognition method to recognition of a sports news related article. The specific steps include steps 202 to 218.
Step 202: a sports text collection consisting of a large number of sports news articles is obtained.
Specifically, the sports news articles are the same category of sports news articles in the same field.
Step 204: the topic keywords of each sports news article are extracted.
Specifically, topic keyword extraction is performed on each sports news article through LDA or TF-IDF.
Step 206: actual topic keywords in a small number of sports news articles are obtained.
Specifically, a small number of sports news articles are randomly selected from a large number of sports news articles to manually extract actual subject keywords of the small number of sports news articles.
Step 208: and determining a keyword distribution matrix according to the distribution of the topic keywords in the sentence level of each article.
Step 210: and determining an actual keyword distribution matrix according to the distribution of the actual topic keywords in the sentence level of each article.
Wherein the step 204 and the step 206 are executed in parallel, and the step 208 and the step 210 are executed in parallel.
Step 212: and inputting the keyword distribution matrix and the actual keyword distribution matrix into a classifier to identify key sentences and non-key sentences.
Step 214: key sentences and non-key sentences of each sports news article are obtained.
Step 216: and calculating the accuracy according to the number of the key sentences and the non-key sentences of each sports news article.
Step 218: the weight of the sentences contained in each sports news article in the classifier is adjusted according to the accuracy rate.
Specifically, the weight of the sentence contained in each sports news article in the classifier is adjusted through the accuracy rate, so that the classifier can identify the key sentence and the non-key sentence more accurately.
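The parallel execution of steps 204 and 206 can be sketched with a thread pool. The two functions below are simplified stand-ins for the real steps, and the article texts and manual labels are invented for illustration; a thread pool is chosen only for brevity, and a process pool may suit CPU-bound extraction better.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def extract_topic_keywords(articles):
    """Stand-in for step 204: the most frequent token of each article."""
    return [Counter(a.split()).most_common(1)[0][0] for a in articles]

def load_manual_keywords(manual_labels):
    """Stand-in for step 206: keywords that were already labeled by hand."""
    return dict(manual_labels)

articles = ["goal goal team", "match score score"]   # hypothetical articles
manual_labels = {0: ["goal", "team"]}                 # hypothetical labels

# Submit both steps at once and wait for both results.
with ThreadPoolExecutor(max_workers=2) as pool:
    topic_future = pool.submit(extract_topic_keywords, articles)
    actual_future = pool.submit(load_manual_keywords, manual_labels)
    topic_keywords = topic_future.result()
    actual_keywords = actual_future.result()
```

Steps 208 and 210 could be submitted to the same pool in the same way once both keyword sets are available.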
The text recognition method provided in this specification obtains the topic keywords of each sports news article by a statistical algorithm or a topic generation model, and extracts the actual topic keywords of only a small number of articles manually, which greatly reduces the cost of manual extraction. The keyword distribution matrix and the actual keyword distribution matrix are then determined from the distributions of the extracted topic keywords and actual topic keywords in each sports news article, so that the distribution of the topic keywords and the actual topic keywords in each article can be seen more intuitively. The key sentences and non-key sentences of each sports news article are identified by inputting the keyword distribution matrix and the actual keyword distribution matrix into the classifier, and the classifier is optimized by the accuracy rate, which ensures the accuracy with which the classifier identifies the key sentences and non-key sentences of each article and improves identification efficiency.
Corresponding to the above method embodiment, the present specification further provides a text recognition apparatus embodiment, and fig. 3 shows a schematic structural diagram of the text recognition apparatus according to an embodiment of the present specification. As shown in fig. 3, the apparatus includes:
an obtaining module 302 configured to obtain a text set of a plurality of texts;
an extracting module 304, configured to extract a topic keyword of each text in the text set, and obtain an actual topic keyword extracted from at least one text in the text set;
a determining module 306 configured to determine a first distribution of the topic keyword in each text in the text set and a second distribution of the actual topic keyword in each text in the text set;
an identifying module 308 configured to input the texts in the text set carrying the first distribution and the second distribution into a classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set.
In an optional embodiment, the extracting module 304 includes:
the first word segmentation processing unit is configured to perform word segmentation processing on each text in the text set through a word segmentation processing algorithm, and determine a keyword of each text in the text set according to a word segmentation processing result;
and an identification unit configured to input the keywords of each text into a topic generation model for topic keyword identification, and output the identified keywords as the topic keywords.
In an optional embodiment, the extracting module 304 includes:
the second word segmentation processing unit is configured to perform word segmentation processing on each text in the text set through a word segmentation processing algorithm, and determine a keyword of each text in the text set according to a word segmentation processing result;
a first calculating unit configured to calculate the matching frequency of each keyword in the corresponding text and the reverse keyword frequency of the keyword with respect to each text in the text set;
a keyword scoring unit configured to determine a keyword score of the keyword according to a product of the frequency and the reverse keyword frequency;
and the subject keyword determining unit is configured to take the keywords with the keyword scores larger than a keyword score threshold value as the subject keywords.
In an optional embodiment, the extracting module 304 is further configured to:
randomly selecting at least one text from the text set, and manually extracting corresponding actual topic keywords from the at least one randomly extracted text;
and acquiring the actual subject key words of the at least one text extracted manually.
In an optional embodiment, the determining module 306 includes:
a keyword distribution matrix generation unit configured to generate a keyword distribution matrix of each text at a sentence level as the first distribution according to a topic keyword included in a sentence in each text;
and an actual keyword distribution matrix generation unit configured to generate an actual keyword distribution matrix of each text at a sentence level as the second distribution according to the actual topic keywords included in the sentences in each text.
In an optional embodiment, the classifier is constructed in the following manner:
constructing the classifier according to the incidence relation between the keyword distribution matrix and the sentences contained in each text, a preset classification rule and the corresponding weight of the sentences contained in each text;
accordingly, the identification module 308 is run;
the identification module 308, further configured to:
and inputting the text in the text set carrying the subject keyword distribution matrix and the actual subject keyword distribution matrix into the classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the text in the text set.
In an optional embodiment, the text recognition apparatus further includes:
the second calculation unit is configured to calculate the recall rate and/or the accuracy rate of each text according to the number of key sentences and non-key sentences of the texts in the text set;
an optimizing unit configured to optimize the classifier according to the recall rate and/or the accuracy rate of each text.
In an optional embodiment, the second computing unit includes:
a first statistic submodule configured to count a total number of key sentences included in each text and an actual number of key sentences included in the output key sentences of each text;
a recall rate calculation submodule configured to calculate a ratio of the actual number of key sentences to the total number of key sentences as the recall rate of each text.
In an optional embodiment, the second computing unit includes:
the second counting submodule is configured to count the number of the output key sentences of each text and the number of actual key sentences contained in the output key sentences of each text;
and an accuracy rate calculation submodule configured to calculate a ratio of the actual number of the key sentences to the number of the output key sentences as the accuracy rate of each text.
In an optional embodiment, the obtaining module 302 is further configured to:
and acquiring a plurality of texts of the same category in the vertical field, and creating the text set according to the plurality of texts.
The text recognition apparatus provided in this specification obtains the topic keywords of each text in the text set by a statistical algorithm or a topic generation model, and extracts the actual topic keywords of only a small number of texts manually, which greatly reduces the cost of manual keyword extraction. The first distribution and the second distribution are then determined from the distributions of the extracted topic keywords and actual topic keywords in each text; because both take the form of a matrix, the distribution of the topic keywords and the actual topic keywords in each text can be seen more intuitively. The key sentences and non-key sentences of each text are identified by inputting the first distribution and the second distribution into the classifier, and the classifier is optimized by the accuracy rate and/or the recall rate, which ensures the accuracy with which the classifier identifies key sentences and non-key sentences and improves identification efficiency. In addition, the apparatus retains the key sentences of a text by cleaning out its non-key sentences, which makes the key sentences easy to label and improves construction efficiency when a knowledge graph is built.
The above is a schematic scheme of a text recognition apparatus of the present embodiment. It should be noted that the technical solution of the text recognition apparatus and the technical solution of the text recognition method belong to the same concept, and details that are not described in detail in the technical solution of the text recognition apparatus can be referred to the description of the technical solution of the text recognition method.
Fig. 4 shows a block diagram of an electronic device 400 according to an embodiment of the present description. The components of the electronic device 400 include, but are not limited to, a memory 410 and a processor 420. Processor 420 is coupled to memory 410 via bus 430 and database 450 is used to store data.
The electronic device 400 also includes an access device 440, which enables the electronic device 400 to communicate via one or more networks 460. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. The access device 440 may include one or more of any type of network interface (e.g., a Network Interface Card (NIC)), whether wired or wireless, such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-mentioned components of the electronic device 400 and other components not shown in fig. 4 may also be connected to each other, for example, through a bus. It should be understood that the block diagram of the electronic device shown in fig. 4 is for exemplary purposes only and is not intended to limit the scope of the present disclosure. Those skilled in the art may add or replace other components as desired.
The electronic device 400 may be any type of stationary or mobile electronic device, including a mobile computer or mobile electronic device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable electronic device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary electronic device such as a desktop computer or PC. The electronic device 400 may also be a mobile or stationary server.
The processor 420 is configured to execute the following computer-executable instructions:
acquiring a text set of a plurality of texts;
extracting a topic keyword of each text in the text set, and acquiring an actual topic keyword extracted from at least one text in the text set;
determining a first distribution of the topic keyword in each text in the text set and a second distribution of the actual topic keyword in each text in the text set;
and inputting the texts in the text set carrying the first distribution and the second distribution into a classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set.
Optionally, the extracting the topic keyword of each text in the text set includes:
performing word segmentation processing on each text in the text set through a word segmentation processing algorithm, and determining a keyword of each text in the text set according to a word segmentation processing result;
and inputting the keywords of each text into a topic generation model for topic keyword identification, and taking the keywords output by the model as the topic keywords.
Optionally, the extracting the topic keyword of each text in the text set includes:
performing word segmentation processing on each text in the text set through a word segmentation processing algorithm, and determining a keyword of each text in the text set according to a word segmentation processing result;
calculating the frequency with which each keyword occurs in its corresponding text, and the inverse keyword frequency of the keyword across the texts in the text set;
determining the keyword score of the keyword according to the product of the frequency and the inverse keyword frequency;
and taking the keywords whose keyword scores are greater than a keyword score threshold as the topic keywords.
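The scoring steps above read like a TF-IDF style calculation, and can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `topic_keywords`, the natural-log form of the inverse frequency, and the example threshold are all hypothetical, and the texts are assumed to be already segmented into keyword lists.

```python
import math
from collections import Counter

def topic_keywords(tokenized_texts, score_threshold):
    """Return, for each text, the keywords whose frequency * inverse-frequency
    score exceeds score_threshold (hypothetical helper)."""
    n_texts = len(tokenized_texts)
    # Number of texts in the set that contain each keyword at least once.
    text_freq = Counter()
    for tokens in tokenized_texts:
        text_freq.update(set(tokens))

    selected_per_text = []
    for tokens in tokenized_texts:
        selected = set()
        counts = Counter(tokens)
        for word, count in counts.items():
            frequency = count / len(tokens)                     # frequency in this text
            inverse_freq = math.log(n_texts / text_freq[word])  # assumed log form
            if frequency * inverse_freq > score_threshold:
                selected.add(word)
        selected_per_text.append(selected)
    return selected_per_text

# Toy text set, already segmented into keywords (assumed input format).
texts = [
    ["loan", "rate", "loan", "bank"],
    ["bank", "branch", "staff"],
    ["bank", "loan", "fee"],
]
keywords = topic_keywords(texts, score_threshold=0.25)
```

A keyword such as "bank", which occurs in every text, gets an inverse frequency of zero and is never selected, which is the usual reason for multiplying the two quantities.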
Optionally, the obtaining of the actual topic keyword extracted from at least one text in the text set includes:
randomly selecting at least one text from the text set, the corresponding actual topic keywords being manually extracted from the at least one randomly selected text;
and acquiring the manually extracted actual topic keywords of the at least one text.
Optionally, the determining a first distribution of the topic keyword in each text in the text set and a second distribution of the actual topic keyword in each text in the text set includes:
generating a keyword distribution matrix of each text at a sentence level according to the topic keywords contained in the sentences in each text, wherein the keyword distribution matrix is used as the first distribution;
and generating an actual keyword distribution matrix of each text at a sentence level according to actual topic keywords contained in the sentences in each text, wherein the actual keyword distribution matrix is used as the second distribution.
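One way to picture the two sentence-level matrices is sketched below. The specification does not fix the matrix layout, so the layout here is an assumption: one row per sentence, one column per keyword, and an entry of 1 when the sentence contains the keyword. The helper name `distribution_matrix` and the sample data are hypothetical.

```python
def distribution_matrix(sentences, keywords):
    """Build a binary sentence-by-keyword matrix for one text.

    sentences: list of token lists, one per sentence (assumed input format).
    keywords:  set of keywords whose distribution is recorded.
    Returns the column order and the matrix as a list of rows.
    """
    columns = sorted(keywords)  # fixed column order so rows are comparable
    matrix = []
    for sentence in sentences:
        tokens = set(sentence)
        matrix.append([1 if keyword in tokens else 0 for keyword in columns])
    return columns, matrix

sentences = [
    ["the", "bank", "raised", "the", "loan", "rate"],
    ["staff", "were", "friendly"],
    ["the", "loan", "fee", "is", "fixed"],
]
# First distribution: automatically extracted topic keywords.
topic_columns, first_distribution = distribution_matrix(sentences, {"loan", "rate", "fee"})
# Second distribution: manually extracted actual topic keywords.
actual_columns, second_distribution = distribution_matrix(sentences, {"loan"})
```

Counts or weights could be stored instead of 0/1 entries; the binary form is chosen here only to keep the sketch short.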
Optionally, the classifier is constructed in the following manner:
constructing the classifier according to the association between the keyword distribution matrix and the sentences contained in each text, a preset classification rule, and the weights corresponding to the sentences contained in each text;
correspondingly, the step of inputting the texts in the text set carrying the first distribution and the second distribution into the classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set, is then executed;
wherein this step comprises:
inputting the texts in the text set carrying the topic keyword distribution matrix and the actual topic keyword distribution matrix into the classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set.
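The specification leaves the preset classification rule and the weights open. Purely as an illustration, the sketch below assumes a simple rule: each sentence receives a weighted sum of its matches in the two matrices and is labelled a key sentence when that score reaches a threshold. The function name `classify_sentences`, the weights, and the threshold are hypothetical choices, not values taken from the patent.

```python
def classify_sentences(first_distribution, second_distribution,
                       topic_weight=1.0, actual_weight=2.0, threshold=2.0):
    """Label each sentence as key (True) or non-key (False).

    Both arguments are sentence-by-keyword matrices with one row per sentence.
    The weighted-sum rule is an assumed stand-in for the preset classification rule.
    """
    labels = []
    for topic_row, actual_row in zip(first_distribution, second_distribution):
        score = topic_weight * sum(topic_row) + actual_weight * sum(actual_row)
        labels.append(score >= threshold)
    return labels

# Rows correspond to three sentences of one text.
first_distribution = [[0, 1, 1], [0, 0, 0], [1, 1, 0]]
second_distribution = [[1], [0], [1]]
labels = classify_sentences(first_distribution, second_distribution)

key_sentences = [i for i, is_key in enumerate(labels) if is_key]
non_key_sentences = [i for i, is_key in enumerate(labels) if not is_key]
```

Giving the manually extracted keywords a larger weight reflects the idea that they are more reliable than the automatically extracted ones; the weights and threshold are the natural quantities to adjust when the classifier is optimized.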
Optionally, after the step of inputting the texts in the text set carrying the first distribution and the second distribution into the classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set, is executed, the method further includes:
calculating the recall rate and/or the accuracy rate of each text according to the number of key sentences and non-key sentences of the text in the text set;
and optimizing the classifier according to the recall rate and/or the accuracy rate of each text.
Optionally, the calculating the recall rate of each text includes:
counting the total number of key sentences contained in each text and the number of actual key sentences among the output key sentences of each text;
and calculating the ratio of the number of actual key sentences among the output key sentences to the total number of key sentences as the recall rate of each text.
Optionally, the calculating the accuracy of each text includes:
counting the number of key sentences output for each text and the number of actual key sentences among the output key sentences of each text;
and calculating the ratio of the number of actual key sentences among the output key sentences to the number of output key sentences as the accuracy of each text.
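The two ratios above can be sketched for a single text as follows. The helper name `recall_and_accuracy` and the use of sentence indices to identify sentences are assumptions made for illustration; note that the "accuracy" defined here is the quantity usually called precision.

```python
def recall_and_accuracy(output_key, true_key):
    """Compute the recall rate and accuracy of one text.

    output_key: indices of sentences the classifier output as key sentences.
    true_key:   indices of the sentences that really are key sentences.
    """
    output_key, true_key = set(output_key), set(true_key)
    # Actual key sentences contained in the classifier's output.
    hits = len(output_key & true_key)
    # Recall rate: hits over all key sentences contained in the text.
    recall = hits / len(true_key) if true_key else 0.0
    # Accuracy: hits over all sentences output as key sentences.
    accuracy = hits / len(output_key) if output_key else 0.0
    return recall, accuracy

recall, accuracy = recall_and_accuracy(output_key={0, 2, 3}, true_key={0, 2, 4, 5})
```

Returning 0.0 when a denominator is empty is a convention chosen for this sketch; the specification does not address that case.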
Optionally, the obtaining a text set of a plurality of texts includes:
and acquiring a plurality of texts of the same category in a vertical field, and creating the text set from the plurality of texts.
The above is a schematic scheme of an electronic device of the present embodiment. It should be noted that the technical solution of the electronic device and the technical solution of the text recognition method belong to the same concept; for details not described in the technical solution of the electronic device, reference may be made to the description of the technical solution of the text recognition method.
An embodiment of the present specification further provides a computer readable storage medium storing computer instructions, which when executed by a processor, implement the steps of the text recognition method as described above.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the text recognition method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the text recognition method.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content covered by the computer-readable medium may be expanded or restricted as appropriate to meet the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present disclosure is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present disclosure. Further, those skilled in the art should also appreciate that the embodiments described in this specification are preferred embodiments and that acts and modules referred to are not necessarily required for this description.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are intended only to aid in the description of the specification. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the specification and its practical application, to thereby enable others skilled in the art to best understand the specification and its practical application. The specification is limited only by the claims and their full scope and equivalents.

Claims (20)

1. A text recognition method, comprising:
acquiring a text set of a plurality of texts;
extracting a topic keyword of each text in the text set, and acquiring an actual topic keyword extracted from at least one text in the text set;
generating a keyword distribution matrix of each text at a sentence level according to a topic keyword contained in a sentence in each text in the text set, wherein the keyword distribution matrix is used as a first distribution of the topic keyword in each text in the text set, and generating an actual keyword distribution matrix of each text at the sentence level according to an actual topic keyword contained in the sentence in each text in the text set, and the actual keyword distribution matrix is used as a second distribution of the actual topic keyword in each text in the text set;
and inputting the texts in the text set carrying the first distribution and the second distribution into a classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set.
2. The method of claim 1, wherein the extracting the topic keyword of each text in the text set comprises:
performing word segmentation processing on each text in the text set through a word segmentation processing algorithm, and determining a keyword of each text in the text set according to a word segmentation processing result;
and inputting the keywords of each text into a topic generation model for topic keyword identification, and taking the keywords output by the model as the topic keywords.
3. The method of claim 1, wherein the extracting the topic keyword of each text in the text set comprises:
performing word segmentation processing on each text in the text set through a word segmentation processing algorithm, and determining a keyword of each text in the text set according to a word segmentation processing result;
calculating the frequency with which each keyword occurs in its corresponding text, and the inverse keyword frequency of the keyword across the texts in the text set;
determining the keyword score of the keyword according to the product of the frequency and the inverse keyword frequency;
and taking the keywords whose keyword scores are greater than a keyword score threshold as the topic keywords.
4. The text recognition method of claim 1, wherein the obtaining of the actual topic keyword extracted from at least one text in the text set comprises:
randomly selecting at least one text from the text set, the corresponding actual topic keywords being manually extracted from the at least one randomly selected text;
and acquiring the manually extracted actual topic keywords of the at least one text.
5. The text recognition method of claim 1, wherein the classifier is constructed by:
constructing the classifier according to the association between the keyword distribution matrix and the sentences contained in each text, a preset classification rule, and the weights corresponding to the sentences contained in each text;
correspondingly, the step of inputting the texts in the text set carrying the first distribution and the second distribution into the classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set, is then executed;
wherein this step comprises:
inputting the texts in the text set carrying the topic keyword distribution matrix and the actual topic keyword distribution matrix into the classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set.
6. The text recognition method of claim 1, wherein after the step of inputting the texts in the text set carrying the first distribution and the second distribution into the classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set, is executed, the method further comprises:
calculating the recall rate and/or the accuracy rate of each text according to the number of key sentences and non-key sentences of the text in the text set;
and optimizing the classifier according to the recall rate and/or the accuracy rate of each text.
7. The text recognition method of claim 6, wherein the calculating the recall rate of each text comprises:
counting the total number of key sentences contained in each text and the number of actual key sentences among the output key sentences of each text;
and calculating the ratio of the number of actual key sentences among the output key sentences to the total number of key sentences as the recall rate of each text.
8. The method of claim 6, wherein the calculating the accuracy of each text comprises:
counting the number of key sentences output for each text and the number of actual key sentences among the output key sentences of each text;
and calculating the ratio of the number of actual key sentences among the output key sentences to the number of output key sentences as the accuracy of each text.
9. The method of claim 1, wherein the obtaining a text set of a plurality of texts comprises:
and acquiring a plurality of texts of the same category in the vertical field, and creating the text set according to the plurality of texts.
10. A text recognition apparatus, comprising:
an acquisition module configured to acquire a text set of a plurality of texts;
an extraction module configured to extract a topic keyword of each text in the text set and acquire an actual topic keyword extracted from at least one text in the text set;
a determining module configured to generate a keyword distribution matrix of each text at a sentence level according to a topic keyword included in a sentence in each text in the text set, as a first distribution of the topic keyword in each text in the text set, and generate an actual keyword distribution matrix of each text at the sentence level according to an actual topic keyword included in the sentence in each text in the text set, as a second distribution of the actual topic keyword in each text in the text set;
and the identification module is configured to identify key sentences and non-key sentences of the texts in the text set by inputting the texts in the text set carrying the first distribution and the second distribution into the classifier, so as to obtain the key sentences and the non-key sentences of the texts in the text set.
11. The text recognition apparatus of claim 10, wherein the extraction module comprises:
the first word segmentation processing unit is configured to perform word segmentation processing on each text in the text set through a word segmentation processing algorithm, and determine a keyword of each text in the text set according to a word segmentation processing result;
and the identification unit is configured to input the keywords of each text into a topic generation model for topic keyword identification, and take the keywords output by the model as the topic keywords.
12. The text recognition apparatus of claim 10, wherein the extraction module comprises:
the second word segmentation processing unit is configured to perform word segmentation processing on each text in the text set through a word segmentation processing algorithm, and determine a keyword of each text in the text set according to a word segmentation processing result;
a first calculating unit configured to calculate the frequency with which each keyword occurs in its corresponding text, and the inverse keyword frequency of the keyword across the texts in the text set;
a keyword scoring unit configured to determine a keyword score of the keyword according to the product of the frequency and the inverse keyword frequency;
and a topic keyword determining unit configured to take the keywords whose keyword scores are greater than a keyword score threshold as the topic keywords.
13. The text recognition apparatus of claim 10, wherein the extraction module is further configured to:
randomly select at least one text from the text set, the corresponding actual topic keywords being manually extracted from the at least one randomly selected text;
and acquire the manually extracted actual topic keywords of the at least one text.
14. The text recognition apparatus of claim 10, wherein the classifier is constructed as follows:
constructing the classifier according to the association between the keyword distribution matrix and the sentences contained in each text, a preset classification rule, and the weights corresponding to the sentences contained in each text;
correspondingly, the identification module is operated;
the identification module is further configured to:
input the texts in the text set carrying the topic keyword distribution matrix and the actual topic keyword distribution matrix into the classifier to identify key sentences and non-key sentences, so as to obtain the key sentences and non-key sentences of the texts in the text set.
15. The text recognition apparatus of claim 10, further comprising:
the second calculation unit is configured to calculate the recall rate and/or the accuracy rate of each text according to the number of key sentences and non-key sentences of the texts in the text set;
an optimizing unit configured to optimize the classifier according to the recall rate and/or the accuracy rate of each text.
16. The text recognition apparatus according to claim 15, wherein the second calculation unit includes:
a first statistic submodule configured to count a total number of key sentences included in each text and an actual number of key sentences included in the output key sentences of each text;
a recall rate calculation submodule configured to calculate a ratio of the actual number of key sentences to the total number of key sentences as the recall rate of each text.
17. The text recognition apparatus according to claim 15, wherein the second calculation unit includes:
the second counting submodule is configured to count the number of the output key sentences of each text and the number of actual key sentences contained in the output key sentences of each text;
and the accuracy calculation submodule is configured to calculate the ratio of the number of actual key sentences among the output key sentences to the number of output key sentences as the accuracy of each text.
18. The text recognition apparatus of claim 10, wherein the obtaining module is further configured to:
acquire a plurality of texts of the same category in a vertical field, and create the text set from the plurality of texts.
19. An electronic device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the steps of the text recognition method of any one of claims 1 to 9.
20. A computer-readable storage medium storing computer instructions, which when executed by a processor, perform the steps of the text recognition method of any one of claims 1 to 9.
CN201910431256.0A 2019-05-22 2019-05-22 Text recognition method and device, electronic equipment and storage medium Active CN110134792B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910431256.0A CN110134792B (en) 2019-05-22 2019-05-22 Text recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910431256.0A CN110134792B (en) 2019-05-22 2019-05-22 Text recognition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110134792A CN110134792A (en) 2019-08-16
CN110134792B true CN110134792B (en) 2022-03-08

Family

ID=67572514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910431256.0A Active CN110134792B (en) 2019-05-22 2019-05-22 Text recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110134792B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598211B (en) * 2019-09-02 2023-09-26 腾讯科技(深圳)有限公司 Article identification method and device, storage medium and electronic device
CN110781299B (en) * 2019-09-18 2024-03-19 平安科技(深圳)有限公司 Asset information identification method, device, computer equipment and storage medium
CN110728143A (en) * 2019-09-23 2020-01-24 上海蜜度信息技术有限公司 Method and equipment for identifying document key sentences
JP7415433B2 (en) * 2019-10-24 2024-01-17 富士フイルムビジネスイノベーション株式会社 Information processing device and program
CN110851598B (en) * 2019-10-30 2023-04-07 深圳价值在线信息科技股份有限公司 Text classification method and device, terminal equipment and storage medium
CN111291186B (en) * 2020-01-21 2024-01-09 北京捷通华声科技股份有限公司 Context mining method and device based on clustering algorithm and electronic equipment
CN111814482B (en) * 2020-09-03 2020-12-11 平安国际智慧城市科技股份有限公司 Text key data extraction method and system and computer equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104281645A (en) * 2014-08-27 2015-01-14 北京理工大学 Method for identifying emotion key sentence on basis of lexical semantics and syntactic dependency
CN108549634A (en) * 2018-04-09 2018-09-18 北京信息科技大学 A kind of Chinese patent text similarity calculating method
CN108897857A (en) * 2018-06-28 2018-11-27 东华大学 The Chinese Text Topic sentence generating method of domain-oriented

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9213997B2 (en) * 2012-10-24 2015-12-15 Moodwire, Inc. Method and system for social media burst classifications

Also Published As

Publication number Publication date
CN110134792A (en) 2019-08-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant