CN112699232A

CN112699232A - Text label extraction method, device, equipment and storage medium

Info

Publication number: CN112699232A
Application number: CN201910986050.4A
Authority: CN
Inventors: 窦方正
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2019-10-17
Filing date: 2019-10-17
Publication date: 2021-04-23

Abstract

The embodiment of the invention discloses a text label extraction method, a text label extraction device, text label extraction equipment and a storage medium. The method comprises the following steps: acquiring each text of a label to be extracted, and vectorizing each text to obtain a text vector corresponding to the corresponding text; clustering each text vector to obtain at least one text clustering result; extracting keywords from each text clustering result to obtain each label candidate word corresponding to each text clustering result; and determining the text label of each text according to each label candidate word corresponding to each text clustering result. By the technical scheme, the automatic extraction of the text labels is realized, and the accuracy and comprehensiveness of the text label extraction and the expandability of the label extraction method are improved.

Description

Text label extraction method, device, equipment and storage medium

Technical Field

The embodiment of the invention relates to computer technology, in particular to a text label extraction method, a text label extraction device, text label extraction equipment and a storage medium.

Background

In application scenarios such as information search and information recommendation, data mining is usually required, and one of its tasks is the extraction of text labels. Taking an e-commerce platform as an example, the objects of text label extraction are usually commodity-related information, such as the detailed description of the commodity (the article introduction detail drawing for short), the specification parameters of the commodity, comments, and the like. The article introduction detail drawing contains more detailed and comprehensive article description information, such as marketing labels relating to the occasions in which the article is used, the applicable crowds, a small occupied area, large suction force and the like; the commodity specification parameters are stored in a structured form such as a table, and include specification attributes such as the length and width of the commodity, and commodity extension attributes such as the color, memory and communication mode of the commodity.

At present, text label extraction methods for e-commerce platforms roughly fall into three types. First, text labels are obtained from a manual filling system. For example, when a merchant shelves a commodity, the merchant is required to fill in attribute information related to the commodity in the manual filling system. Second, text labels are automatically extracted from structured commodity-related information such as the commodity specification parameters; for example, structured text carriers such as tables are automatically recognized, and the text labels are then extracted from those carriers. Third, text labels are automatically extracted from unstructured commodity-related information such as article introduction detail drawings and comments; for example, the text labels to be extracted are first manually annotated, and the relevant text labels are then automatically extracted.

In the process of implementing the invention, the inventor found that the prior art has at least the following problems: (1) manually filled information suffers from errors, omissions, inaccurate expression and the like, and manually filled labels are highly standardized and insufficiently personalized; (2) although automatically extracting labels from structured information can correct manual errors to a certain extent and improve the efficiency of manually filling in specified attributes, the problem of insufficient personalization remains; (3) although automatically extracting labels from unstructured information such as article introduction detail drawings can alleviate the insufficient personalization of text labels to a certain extent, manual data annotation is required, so the efficiency of text label extraction is low and the scalability is poor.

Disclosure of Invention

The embodiment of the invention provides a text label extraction method, a text label extraction device, text label extraction equipment and a storage medium, so as to realize automatic extraction of text labels and improve the accuracy, comprehensiveness and expandability of text label extraction.

In a first aspect, an embodiment of the present invention provides a text label extraction method, including:

acquiring each text of a label to be extracted, and vectorizing each text to obtain a text vector corresponding to the corresponding text;

clustering each text vector to obtain at least one text clustering result;

extracting keywords from each text clustering result to obtain each label candidate word corresponding to each text clustering result;

and determining the text label of each text according to each label candidate word corresponding to each text clustering result.

In a second aspect, an embodiment of the present invention further provides a text label extraction apparatus, where the apparatus includes:

the text vector acquisition module is used for acquiring each text of the label to be extracted, vectorizing each text and acquiring a text vector corresponding to the corresponding text;

a text clustering result obtaining module, configured to cluster the text vectors to obtain at least one text clustering result;

a label candidate word obtaining module, configured to perform keyword extraction on each text clustering result to obtain each label candidate word corresponding to each text clustering result;

and the text label determining module is used for determining the text label of each text according to each label candidate word corresponding to each text clustering result.

In a third aspect, an embodiment of the present invention further provides an apparatus, where the apparatus includes:

one or more processors;

a storage device for storing one or more programs,

when the one or more programs are executed by the one or more processors, the one or more processors implement the text label extraction method provided by any embodiment of the invention.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the text label extraction method provided in any embodiment of the present invention.

The method comprises the steps of obtaining each text of a label to be extracted, and vectorizing each text to obtain a text vector corresponding to the corresponding text; and clustering each text vector to obtain at least one text clustering result. The method and the device realize unsupervised classification of a plurality of texts, reduce human intervention in the label extraction process, and improve the accuracy of subsequent label extraction. Extracting keywords from each text clustering result to obtain each label candidate word corresponding to each text clustering result; and determining the text label of each text according to each label candidate word corresponding to each text clustering result. The method and the device realize the unsupervised automatic extraction of the labels of various texts under the same theme, solve the problems of incomplete labels and insufficient individuation caused by extracting the labels of the structured texts only, and improve the individuation degree and comprehensiveness of the text labels; meanwhile, unsupervised text label extraction solves the problems of low label extraction efficiency and poor extraction method expansibility caused by manual data labeling, improves the extraction efficiency of text labels, improves the applicability of methods for extracting labels under different subjects, and enhances the expandability of the label extraction method.

Drawings

FIG. 1a is a flowchart of a text label extraction method in the first embodiment of the present invention;

FIG. 1b is a schematic diagram of extracting text information from an article introduction detail drawing in the first embodiment of the present invention;

FIG. 1c is a schematic diagram of the word segmentation result of the text in the article introduction detail drawing in the first embodiment of the present invention;

FIG. 1d is a schematic diagram of the word vectorization result of the word vector model in the first embodiment of the present invention;

FIG. 1e is a schematic diagram of a text clustering result in the first embodiment of the present invention;

FIG. 1f is a schematic diagram of a keyword extraction result of a text clustering result in the first embodiment of the present invention;

FIG. 1g is a schematic diagram of a label clustering result in the first embodiment of the present invention;

FIG. 2a is a flowchart of a text label extraction method in the second embodiment of the present invention;

FIG. 2b is a schematic diagram of a text recognition result of an article introduction detail drawing in the second embodiment of the present invention;

FIG. 2c is a schematic diagram of the merged text in the article introduction detail drawing in the second embodiment of the present invention;

FIG. 2d is a schematic diagram of text merging in the second embodiment of the present invention;

FIG. 2e is a schematic diagram of the candidate labels and word frequencies in the second embodiment of the present invention;

FIG. 3 is a flowchart of a text label extraction method in the third embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a text label extraction apparatus in the fourth embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an apparatus in the fifth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Example one

The text label extraction method provided by the embodiment can be suitable for the condition of extracting labels from a plurality of texts under the same theme, and is particularly suitable for extracting text labels in an e-commerce platform. The method may be performed by a text label extraction apparatus, which may be implemented by means of software and/or hardware, which may be integrated in a device such as a personal computer or a server. Referring to fig. 1a, the method of the present embodiment specifically includes the following steps:

S110, obtaining each text of the label to be extracted, and vectorizing each text to obtain a text vector corresponding to the corresponding text.

The label refers to a feature description of a text in a certain aspect, and may be, for example, a keyword expressing the emphasis of the text. In the e-commerce platform, a label may be a description of an item-specific attribute, such as a specification attribute, an extended attribute or a functional attribute of the item. The functional attributes of an article can be energy saving, mute, small occupied area, large suction and the like; such marketing labels match the points of interest that users care about. The text vector refers to a numeric vector obtained by converting the text.

In order to automatically extract the text labels, a plurality of texts including the text labels need to be acquired. Further, in order to improve the tag extraction efficiency and the convenience of subsequent tag classification and use, a plurality of texts under the same theme can be acquired. The subject here refers to the main content of the text, and may be, for example, a research subject of articles, or an article type in an e-commerce platform, such as a mobile phone type, a range hood type, and the like. In the embodiment of the invention, the range hood category is taken as an example for explanation, and a plurality of texts under the range hood category need to be acquired. Thereafter, each text obtained needs to be digitized for automatic extraction of subsequent text labels. In this embodiment, each text is vectorized and converted into a digitized vector.

Illustratively, vectorizing each text to obtain a text vector corresponding to the corresponding text includes: for each text, performing word segmentation and stop-word removal on the text to obtain a word segmentation result to be extracted corresponding to the text, and vectorizing the word segmentation result to be extracted by using at least one feature vector model to obtain a text vector corresponding to the text.

The word segmentation result to be extracted refers to the word segmentation result of the text that is used for label extraction. The feature vector model refers to a model for converting text into vectors, and may be, for example, a topic model such as the open-source LDA (Latent Dirichlet Allocation) model, a Term Frequency-Inverse Document Frequency (TF-IDF) model, or a word vector model such as the word2vec model. These models are described below.

Text vectorization can be performed on the texts one by one. A single text is processed as follows. First, the text is segmented into a plurality of words by using a word segmentation tool. Then stop-word removal is performed: punctuation marks, numbers, English letters and the brand names under the current category are removed from the words obtained by segmentation (a list of the brand names under the corresponding category can be obtained from the database of the e-commerce platform by category number), and the remaining words are the word segmentation result to be extracted for the text. Next, each word in the word segmentation result to be extracted is input into at least one feature vector model among the topic model, the TF-IDF model and the word vector model, and the word vector of each word of the text is obtained through the vectorization performed by that model. The corresponding element values of the word vectors of all the words obtained by the same feature vector model are added to obtain an initial text vector of the text in the vectorization dimension of that feature vector model. If only one feature vector model is used, the initial text vector is the final text vector. If all three feature vector models are used, three initial text vectors are obtained for the text, and they are directly concatenated to obtain the final text vector of the text. Assuming that the vector dimensions of the three initial text vectors are 8, 10 and 12 respectively, the vector dimension of the final text vector is 30. The advantage of this arrangement is that a more accurate text vector can be obtained, which improves the precision of subsequent text label extraction to a certain extent.
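The summation and concatenation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: each feature vector model is represented by a plain word-to-vector lookup table, and the names `sum_word_vectors` and `build_text_vector` are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = List[float]
# Hypothetical stand-in for a feature vector model: a word -> vector lookup.
WordVectors = Dict[str, Vector]


def sum_word_vectors(words: Sequence[str], model: WordVectors, dim: int) -> Vector:
    """Add the word vectors element-wise to get one initial text vector."""
    total = [0.0] * dim
    for word in words:
        vec = model.get(word)
        if vec is None:  # assumption: words unknown to the model are skipped
            continue
        for i, value in enumerate(vec):
            total[i] += value
    return total


def build_text_vector(words: Sequence[str],
                      models: Sequence[WordVectors],
                      dims: Sequence[int]) -> Vector:
    """Concatenate the initial text vectors produced by each model."""
    final: Vector = []
    for model, dim in zip(models, dims):
        final.extend(sum_word_vectors(words, model, dim))
    return final


if __name__ == "__main__":
    words = ["large", "suction"]
    dims = [8, 10, 12]
    # Toy models: every word maps to a constant vector of the model's dimension.
    models = [{w: [1.0] * d for w in words} for d in dims]
    print(len(build_text_vector(words, models, dims)))  # 8 + 10 + 12 = 30
```

Concatenation keeps the contribution of each model in its own block of dimensions, which is why the final dimension is simply the sum of the individual dimensions.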

The Term Frequency-Inverse Document Frequency (TF-IDF) model measures the importance of a term to a text within the set of all texts (called the corpus). The importance increases in proportion to the number of times the term appears in the text, but decreases in inverse proportion to the frequency with which the term appears in the corpus. The calculation formula is as follows:

$$\mathrm{TFIDF} = \frac{c_{wd}}{c_{w}} \times \log\frac{c_{d}}{c_{dw}}$$

wherein $c_{wd}$ is the number of times the word appears in the text to be processed, $c_{w}$ is the total number of words in the text to be processed, $c_{d}$ is the total number of texts in the corpus, and $c_{dw}$ is the number of texts in the corpus that contain the word.
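A minimal sketch of the TF-IDF computation using the four counts defined above. It assumes the common unsmoothed form (term frequency multiplied by the natural logarithm of the inverse document frequency); the exact variant used by the embodiment is not confirmed here, and the function name `tf_idf` is hypothetical.

```python
import math
from typing import Sequence


def tf_idf(word: str, text: Sequence[str], corpus: Sequence[Sequence[str]]) -> float:
    """TF-IDF of `word` for one segmented text within a corpus of segmented texts."""
    c_wd = sum(1 for w in text if w == word)        # occurrences of the word in the text
    c_w = len(text)                                  # total number of words in the text
    c_d = len(corpus)                                # total number of texts in the corpus
    c_dw = sum(1 for doc in corpus if word in doc)   # number of texts containing the word
    if c_w == 0 or c_dw == 0:
        return 0.0  # assumption: undefined cases score zero
    return (c_wd / c_w) * math.log(c_d / c_dw)


if __name__ == "__main__":
    corpus = [
        ["low", "noise", "motor"],
        ["large", "suction"],
        ["low", "noise"],
        ["easy", "clean"],
    ]
    # "noise" appears once among 3 words, and in 2 of 4 texts: (1/3) * ln(2)
    print(tf_idf("noise", corpus[0], corpus))
```

A word that appears in every text of the corpus scores zero under this form, which matches the intent that corpus-wide words carry little distinguishing information.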

The word vector model may be a model existing in the related art or a model obtained by retraining. In this embodiment, the article introduction detail drawing in the e-commerce platform is used as a training data source to perform model retraining, and the training process is as follows:

First, the text information in the article introduction detail drawing is obtained as corpus data by running a text detection and recognition system, as shown in FIG. 1b. Then, the corpus data is subjected to word segmentation processing, and the word segmentation result shown in FIG. 1c is obtained as the vocabulary for subsequent model training. Finally, the vocabulary is input into the word2vec model to be trained (with the parameters set as follows: -train vocabulary.txt -output word_vector.bin -cbow 1 -size 200 -window 5 -negative 25 -threads 20 -binary 0), and the word vector model is obtained. The trained word vector model is used to vectorize the words "reservation" and "protection" to obtain the corresponding word vectors, and the result is shown in FIG. 1d.

S120, clustering each text vector to obtain at least one text clustering result.

The obtained text vectors are clustered by using a clustering algorithm such as K-Means or AP (Affinity Propagation), which automatically divides the text vectors into a plurality of classes, that is, a plurality of text clustering results. The texts in each text clustering result express similar meanings. According to the above description of text vectorization, a text vector is obtained by vectorizing a plurality of words, so one text vector corresponds to one word segmentation result to be extracted, and the text vectors contained in one text clustering result can be represented by the word segmentation results to be extracted corresponding to those text vectors. Referring to FIG. 1e, the text clustering result numbered 239 includes the word segmentation results to be extracted corresponding to 14 texts, and the text clustering result numbered 139 includes the word segmentation results to be extracted corresponding to 4 texts; it can be seen that the texts within the same text clustering result express similar meanings.
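The clustering step can be illustrated with a plain K-Means loop. This is a simplified sketch rather than the embodiment's code: the centroids are seeded with the first k vectors so that the result is deterministic, the number of iterations is fixed, and the function name `k_means` is hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors: Sequence[Vector], k: int, iterations: int = 20) -> List[int]:
    """Return a cluster index for each text vector (plain K-Means)."""
    # Assumption: the first k vectors seed the centroids (deterministic sketch).
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


if __name__ == "__main__":
    text_vectors = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0]]
    print(k_means(text_vectors, k=2))  # [0, 1, 0, 1, 0]
```

Texts that share a cluster index form one text clustering result; their word segmentation results are then passed together to keyword extraction.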

It should be noted that identical word segmentation results to be extracted may exist within a text clustering result, because different SKUs in the category may use the same text description.

S130, extracting keywords from each text clustering result to obtain each label candidate word corresponding to each text clustering result.

Keywords are extracted from each text clustering result by using a keyword extraction algorithm provided by an existing tool, and the top n keywords are retained according to application requirements, so as to obtain a keyword extraction result corresponding to each text clustering result. The extracted keywords are used as label candidate words from which text labels are subsequently generated. For example, the two groups of label candidate words shown in FIG. 1f can be obtained by performing keyword extraction on the text clustering results numbered 239 and 139.

It should be noted that, in the keyword extraction process, if there are overlapped keywords between each text clustering result, keyword deduplication needs to be performed.
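The top-n retention and the deduplication across text clustering results can be sketched as below. The keyword extraction algorithm actually used is not identified here, so ranking by raw word frequency is only a stand-in; the rule that the first cluster to claim a keyword keeps it is likewise an assumption, and both function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Set


def top_keywords(cluster_texts: Sequence[Sequence[str]], n: int) -> List[str]:
    """Rank the words of one text clustering result by frequency; keep the top n."""
    counts = Counter(word for text in cluster_texts for word in text)
    return [word for word, _ in counts.most_common(n)]


def extract_candidates(clusters: Dict[int, Sequence[Sequence[str]]],
                       n: int) -> Dict[int, List[str]]:
    """Label candidate words per cluster, without keywords taken by an earlier cluster."""
    seen: Set[str] = set()
    result: Dict[int, List[str]] = {}
    for cluster_id, texts in clusters.items():
        unique = [w for w in top_keywords(texts, n) if w not in seen]
        seen.update(unique)
        result[cluster_id] = unique
    return result


if __name__ == "__main__":
    clusters = {
        239: [["mute", "noise", "reduction"], ["mute", "noise"], ["mute"]],
        139: [["suction", "large", "mute"], ["suction", "large"], ["suction"]],
    }
    # "mute" is already a candidate of cluster 239, so it is dropped from cluster 139.
    print(extract_candidates(clusters, n=3))
```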

S140, determining the text label of each text according to each label candidate word corresponding to each text clustering result.

Because the semantics of a single keyword are not sufficiently clear, in this embodiment, after keyword extraction, the label candidate words in each text clustering result are merged to obtain candidate word combinations with more definite semantics, which serve as the text labels finally extracted from the texts.

As described above, text label extraction in the embodiment of the present invention is mainly based on unsupervised clustering and word frequency information, so the method is independent of the text topic (such as the category of articles) and can be migrated between different topics at low cost. The method therefore has high extensibility and can efficiently and conveniently serve various scenarios based on article and user related analysis, such as improving search and recommendation accuracy and providing richer article information for product related data analysis, thereby achieving the goal that "the correct product accurately reaches the correct user".

According to the technical scheme of the embodiment, each text of the label to be extracted is obtained, and each text is vectorized to obtain a text vector corresponding to the corresponding text; and clustering each text vector to obtain at least one text clustering result. The method and the device realize unsupervised classification of a plurality of texts, reduce human intervention in the label extraction process, and improve the accuracy of subsequent label extraction. Extracting keywords from each text clustering result to obtain each label candidate word corresponding to each text clustering result; and determining the text label of each text according to each label candidate word corresponding to each text clustering result. The method and the device realize the unsupervised automatic extraction of the labels of various texts under the same theme, solve the problems of incomplete labels and insufficient individuation caused by extracting the labels of the structured texts only, and improve the individuation degree and comprehensiveness of the text labels; meanwhile, unsupervised text label extraction solves the problems of low label extraction efficiency and poor extraction method expansibility caused by manual data labeling, improves the extraction efficiency of text labels, improves the applicability of methods for extracting labels under different subjects, and enhances the expandability of the label extraction method.

On the basis of the above technical solution, after determining the text label of each text according to each label candidate word corresponding to each text clustering result, the method further includes: clustering each text label to obtain at least one label clustering result; and aiming at each label clustering result, taking any text label in the label clustering results as a corrected text label of the corresponding label clustering result, and taking each residual text label except the corrected text label in the label clustering results as a similar text label of the corrected text label.

According to the above description, if label candidate words with similar semantics exist in one text clustering result, labels with similar semantics may also exist among the text labels corresponding to that text clustering result. In order to simplify the labels, the present embodiment performs clustering analysis again, this time on the text labels. In specific implementation, for each text clustering result, the text labels corresponding to the text clustering result are clustered to obtain a plurality of label clustering results. Then, for each label clustering result, any one text label in it is taken as the corrected text label of that label clustering result, and the remaining text labels in it are taken as similar text labels of the corrected text label. Thus, each label clustering result corresponds to one corrected text label, and the corrected text label corresponds to a plurality of similar text labels. The corrected text label and the similar text labels may be stored in a list for subsequent use.

For example, in FIG. 1g, "noise reduction mute, decibel mute, sound pressure level noise, decibel low noise" is one label clustering result. The first text label, "noise reduction mute", is selected as the corrected text label of this label clustering result, and the remaining text labels are similar text labels, so the text label list of this label clustering result may be {"noise reduction mute": ["noise reduction mute", "decibel mute", "sound pressure level noise", "decibel low noise"]}.
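Building the text label list from label clustering results can be sketched as follows. The sketch assumes the label clusters have already been computed and, since any label may serve as the corrected text label, simply takes the first one; the function name `build_label_list` is hypothetical. Following the example above, the stored list keeps the corrected label together with its similar labels.

```python
from typing import Dict, List, Sequence


def build_label_list(label_clusters: Sequence[Sequence[str]]) -> Dict[str, List[str]]:
    """Map one corrected text label per label cluster to that cluster's text labels."""
    label_list: Dict[str, List[str]] = {}
    for cluster in label_clusters:
        if not cluster:
            continue  # nothing to record for an empty cluster
        corrected = cluster[0]  # assumption: "any" label may be chosen, so take the first
        label_list[corrected] = list(cluster)
    return label_list


if __name__ == "__main__":
    clusters = [[
        "noise reduction mute",
        "decibel mute",
        "sound pressure level noise",
        "decibel low noise",
    ]]
    print(build_label_list(clusters))
```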

Example two

This embodiment further optimizes "obtaining each text of the label to be extracted" on the basis of the first embodiment. On this basis, "determining the text label of each text according to each label candidate word corresponding to each text clustering result" can also be further optimized. Explanations of terms that are the same as or correspond to those of the above embodiments are omitted.

For convenience of subsequent description, the application scenario is set to extract a text label from an article introduction detail drawing of the e-commerce platform in the embodiment. In addition to the structured article specification attribute and the article extension attribute, the article introduction detail drawing also comprises a large amount of unstructured article information description, and the article information comprises a large amount of personalized labels of the articles. The personalized tags can be used for correcting and supplementing manually filled article attributes, structured article attributes and the like so as to complete description of the articles, and further can be used for basic data of applications such as personalized information search and recommendation.

Referring to fig. 2a, the text label extraction method provided in this embodiment includes:

S210, performing character recognition on the article introduction detail drawing to obtain text data of the article introduction detail drawing.

Text data of the article introduction detail drawing, as shown in FIG. 2b, can be obtained by performing character recognition on the article introduction detail drawing with a character recognition system. In addition to structured article attributes, these text data contain a large amount of unstructured functional description information. Besides the text itself, the text data also records information such as the position of the text in the drawing and the font, font size and color of the text.

S220, according to the text attributes in the text data and preset text screening conditions, importance screening is conducted on the texts in the text data, screening texts corresponding to the article introduction detail diagrams are obtained, and texts of the labels to be extracted corresponding to the article introduction detail diagrams are determined according to the screening texts.

The text attribute refers to information such as a text position, a font of the text, a font size, a color and the like. The preset text filtering condition is a preset condition for filtering the text, and may be, for example, a font size, a color, and other text attributes used for representing the importance degree of the text.

In current article introduction detail drawings, in order to help users grasp the main performance characteristics of an article, the texts describing the important characteristics of the article are displayed in a larger font size or a special color. Based on this, in this embodiment, after the text data corresponding to the article introduction detail drawing is obtained, the texts are first screened by using the text attributes in the text data and the preset text screening condition. For example, if the preset text screening condition is set to font size 20, all texts with a font size smaller than 20 in the text data are removed (text with a larger font size is more strongly correlated with the label, while text with a smaller font size generally elaborates on the personalized label), and the remaining texts are the screening texts. Then, each text of the label to be extracted corresponding to the article introduction detail drawing is determined according to the screening texts, for example by directly taking the screening texts as the required texts, or by performing text merging on the screening texts to obtain the required texts.
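The importance screening by font size can be sketched as a simple filter. The record type `RecognizedText` and the function `screen_texts` are hypothetical names introduced for illustration; only the threshold of 20 comes from the example above.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class RecognizedText:
    """Hypothetical record for one recognized text and its attributes."""
    content: str
    font_size: float
    color: str = "black"


def screen_texts(texts: Sequence[RecognizedText],
                 min_font_size: float = 20) -> List[RecognizedText]:
    """Remove texts whose font size is smaller than the preset screening threshold."""
    return [t for t in texts if t.font_size >= min_font_size]


if __name__ == "__main__":
    recognized = [
        RecognizedText("Large suction", 28),
        RecognizedText("See the manual for details", 12),
        RecognizedText("Low noise", 20),
    ]
    print([t.content for t in screen_texts(recognized)])  # ['Large suction', 'Low noise']
```

Other text attributes, such as color, could be added to the predicate in the same way if they are part of the preset text screening condition.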

Exemplarily, determining each text of the tag to be extracted corresponding to the article introduction detail drawing according to each filtering text includes: for every two longitudinally adjacent screening texts, if any one of a left edge difference absolute value, a center coordinate difference absolute value and a right edge difference absolute value between text boxes corresponding to the two screening texts in the text data is smaller than a set text box height, and a longitudinal difference absolute value between the two text boxes is smaller than the set text box height, merging the two screening texts to serve as a longitudinal merged text, wherein the set text box height is a larger value of the text box heights in the two text boxes, and the longitudinal difference absolute value is a distance difference absolute value between a lower edge of an upper text box and an upper edge of a lower text box; and for every two transversely adjacent longitudinal merged texts, if the longitudinal overlapping distance between the text boxes of the two longitudinal merged texts is less than the set text box height, merging the two longitudinal merged texts to serve as the texts of the labels to be extracted corresponding to the article introduction detail graphs.

The text in an article introduction detail drawing differs from ordinary text in that the texts describing the same thing may be arranged within one area of the drawing, so in this embodiment the required texts are determined by merging the screening texts in the longitudinal and transverse directions. Referring to FIG. 2c, "intelligent control, on-duty, the food materials can enter a preparation state", "food material entry management, food material entry, overdue reminding", and "health recipe online recommendation, also according to the entry of food materials and the recommendation of recipes" all describe the same thing, so the texts need to be merged longitudinally first and then transversely.

In vertical merging, referring to FIG. 2d, for every two vertically adjacent screening texts, the absolute value 201 of the left edge difference (denoted as x_l), the absolute value 202 of the center coordinate difference (denoted as x_m) and the absolute value 203 of the right edge difference (denoted as x_r) between the upper text box 210 and the lower text box 220 are obtained. If any of these absolute difference values is less than the set text box height 204 (denoted as h_m), it is then determined whether the absolute value 205 of the vertical difference (denoted as y_b) between the upper text box 210 and the lower text box 220 is smaller than the set text box height 204; if so, the two screening texts corresponding to the two text boxes are vertically merged to form the vertically merged text 200 shown in FIG. 2c. That is, two screening texts that are merged vertically satisfy the relationship: (x_l < h_m or x_m < h_m or x_r < h_m) and y_b < h_m.
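The vertical merging condition can be written as a small predicate. This is a sketch under stated assumptions: text boxes are axis-aligned rectangles in image coordinates with y growing downward, and the names `TextBox` and `should_merge_vertically` are hypothetical. Only the vertical condition is sketched here.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """Axis-aligned text box; y grows downward as in image coordinates (assumption)."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def height(self) -> float:
        return self.bottom - self.top

    @property
    def center_x(self) -> float:
        return (self.left + self.right) / 2


def should_merge_vertically(upper: TextBox, lower: TextBox) -> bool:
    """Check (x_l < h_m or x_m < h_m or x_r < h_m) and y_b < h_m."""
    h_m = max(upper.height, lower.height)        # set text box height: the larger height
    x_l = abs(upper.left - lower.left)           # left edge difference
    x_m = abs(upper.center_x - lower.center_x)   # center coordinate difference
    x_r = abs(upper.right - lower.right)         # right edge difference
    y_b = abs(lower.top - upper.bottom)          # gap between upper's bottom and lower's top
    return (x_l < h_m or x_m < h_m or x_r < h_m) and y_b < h_m


if __name__ == "__main__":
    title = TextBox(left=0, top=0, right=100, bottom=20)
    subtitle = TextBox(left=5, top=25, right=80, bottom=45)
    print(should_merge_vertically(title, subtitle))  # True: left edges align, gap is 5
```

Using the larger of the two box heights as the tolerance makes the test scale with the font size, so headline-sized text tolerates a larger misalignment than body text.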

After the vertical merging, it is also necessary to determine whether horizontal merging is needed between every two vertically merged texts. For every two vertically merged texts, the vertical overlap distance 206 (denoted as y_o) between the left text box 230 and the right text box 240 is obtained. If the vertical overlap distance 206 is less than the set text box height 204, the two vertically merged texts corresponding to the two text boxes are merged horizontally to form the text 200' shown in FIG. 2c. That is, two vertically merged texts that are merged horizontally satisfy the relationship: y_o < h_m.

S230, vectorizing each text to obtain a text vector corresponding to that text.

S240, clustering each text vector to obtain at least one text clustering result.

S250, extracting keywords from each text clustering result to obtain each label candidate word corresponding to each text clustering result.

S260, according to a co-occurrence matrix between words obtained in advance based on each text, determining the co-occurrence frequency between every two label candidate words corresponding to each text clustering result, screening out at least one pair of label candidate words with the co-occurrence frequency meeting a preset co-occurrence frequency threshold value, and generating each label reorganized word.

The co-occurrence matrix is a matrix that records how frequently two words occur together in one text.

When the label candidate words are combined, the co-occurrence frequency between every two label candidate words is used. First, the co-occurrence frequency between every two words in the texts is counted according to the word segmentation result of each text obtained in S220, so as to obtain the co-occurrence matrix covering all the words. Then, the co-occurrence frequency between every two label candidate words in a text clustering result is looked up in the co-occurrence matrix. In this way, the co-occurrence frequency of the combined word formed by every two label candidate words in the text clustering result is obtained. The combined words whose co-occurrence frequency is greater than a preset co-occurrence frequency threshold (an empirically set co-occurrence frequency) are screened out; these combined words are the label reorganized words corresponding to the text clustering result. By this process, the label reorganized words corresponding to each text clustering result can be obtained.
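The counting and lookup described above can be sketched as follows. This is only an illustration under stated assumptions: co-occurrence is counted once per text in which both words appear, the sparse matrix is stored as a `Counter` keyed by alphabetically sorted word pairs, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(segmented_texts):
    """Count, for every word pair, the number of texts containing both words."""
    counts = Counter()
    for words in segmented_texts:
        for a, b in combinations(sorted(set(words)), 2):
            counts[(a, b)] += 1
    return counts


def reorganized_words(candidates, counts, threshold):
    """Keep candidate pairs whose co-occurrence frequency exceeds the threshold."""
    result = []
    for a, b in combinations(sorted(set(candidates)), 2):
        freq = counts[(a, b)]  # a Counter returns 0 for unseen pairs
        if freq > threshold:
            result.append(((a, b), freq))
    return result
```

For example, with the segmented texts `[["green", "eco", "quiet"], ["green", "eco"], ["quiet", "large"]]` and a threshold of 1, only the pair `("eco", "green")` is kept, because it is the only pair that co-occurs in more than one text.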

Before determining the co-occurrence frequency between every two label candidate words corresponding to each text clustering result according to the co-occurrence matrix between words obtained in advance based on each text, the method further includes: determining a candidate word vector of each label candidate word corresponding to each text clustering result by using a feature vector model; and determining the first text similarity between the two corresponding label candidate words according to the candidate word vectors of every two label candidate words, and eliminating the label candidate words of which the first text similarity is smaller than a first preset similarity threshold value in the label candidate words corresponding to each text clustering result.

In order to further improve the accuracy of merging the label candidate words, in this embodiment, before merging the label candidate words by using the co-occurrence frequency, the text similarity (i.e., the first text similarity) between every two label candidate words is calculated. And then, removing the label candidate words with the first text similarity smaller than a first preset similarity threshold (empirically set similarity) from all the label candidate words in the text clustering result, and merging the remaining label candidate words based on the co-occurrence frequency.
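A minimal sketch of this pre-screening step follows. The embodiment does not spell out how a pairwise similarity decides which individual word is removed, so this sketch assumes cosine similarity and keeps a candidate word only if it is sufficiently similar to at least one other candidate; the function names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def filter_candidates(vectors, threshold):
    """Keep words whose similarity to some other candidate reaches the threshold.

    `vectors` maps each label candidate word to its candidate word vector.
    """
    words = list(vectors)
    kept = []
    for word in words:
        others = (o for o in words if o != word)
        if any(cosine(vectors[word], vectors[o]) >= threshold for o in others):
            kept.append(word)
    return kept
```

With the toy vectors `{"green": [1, 0], "eco": [0.9, 0.1], "cheap": [0, 1]}` and a threshold of 0.8, "cheap" is removed because it is not close to any other candidate, and only "green" and "eco" proceed to merging by co-occurrence frequency.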

S270, for each label reorganized word, if the label reorganized word exists in each text, determining the label reorganized word as a text label.

After the label reorganized words are obtained, it is necessary to determine whether each label reorganized word exists in the texts obtained in S220. If so, the word is determined as a text label; if not, it is not regarded as a text label.

Exemplarily, S270 includes: if the label reorganized word exists in each text, determining the label reorganized word as a candidate label; and if the word frequency of the candidate label meets a preset word frequency threshold, determining the label reorganized word as a text label.

In the process of determining the label reorganized words as text labels, on the basis of judging that a label reorganized word exists in the texts, further screening is performed according to the frequency with which each label reorganized word appears in the texts (i.e., the word frequency; see the number after each label reorganized word in fig. 2e), so that candidate labels whose word frequency meets a preset word frequency threshold (a word frequency set empirically according to the accuracy required by the application) are determined as text labels. The advantage of this arrangement is that the accuracy of the text labels is further improved.
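The existence check and word-frequency screening of S270 can be sketched as follows. This is an illustration only: it assumes each label reorganized word is available as a single string, that "exists in the text" means a substring match, and that the word frequency is the total number of occurrences over all texts; the function name is hypothetical.

```python
def select_text_labels(reorganized, texts, min_freq):
    """Return {label: word frequency} for reorganized words that pass S270."""
    labels = {}
    for word in reorganized:
        freq = sum(text.count(word) for text in texts)
        if freq == 0:
            continue  # the word does not exist in any text: not a candidate label
        if freq >= min_freq:
            labels[word] = freq
    return labels


texts = ["green eco fridge", "green eco and quiet", "quiet fridge"]
print(select_text_labels(["green eco", "quiet fridge", "eco quiet"], texts, 2))
```

Here "eco quiet" is dropped because it never appears in the texts, and "quiet fridge" is dropped because it appears only once, below the threshold of 2.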

According to the technical scheme of the embodiment, text data of the article introduction detail drawing is obtained by performing character recognition on the article introduction detail drawing; and according to the text attributes in the text data and preset text screening conditions, performing importance screening on the texts in the text data to obtain screening texts corresponding to the article introduction detail drawing, and determining each text of the label to be extracted corresponding to the article introduction detail drawing according to each screening text. The method and the device have the advantages that the text acquisition based on the article introduction detail drawing is realized, the texts with small label correlation in the acquired texts are reduced, and the efficiency of extracting the subsequent text labels is further improved. Determining the co-occurrence frequency between every two label candidate words corresponding to each text clustering result according to a co-occurrence matrix between words obtained in advance based on each text, screening out at least one pair of label candidate words with the co-occurrence frequency meeting a preset co-occurrence frequency threshold, and generating each label reorganized word; and aiming at each label reorganized word, if the label reorganized word exists in each text, determining the label reorganized word as a text label. The extraction of the text labels of the unstructured information in the article introduction detail drawing is realized, the information utilization rate of the article introduction detail drawing is improved, and the comprehensiveness and the accuracy of the extraction of the text labels are further improved.

Example three

On the basis of the foregoing embodiments, the present embodiment describes the step of automatically labeling a text to be labeled with labels. Explanations of terms that are the same as or correspond to those of the above embodiments are omitted.

S310, obtaining a text to be labeled, performing word segmentation and stop-word removal on the text to be labeled, and obtaining a word segmentation result to be labeled corresponding to the text to be labeled.

The text to be labeled is a text that needs to be labeled automatically. It may be an ordinary text, or a text obtained from the article introduction detail drawing in the manner of S210 to S220 in the second embodiment. Because automatic labeling uses the text labels extracted from a large number of texts, the text to be labeled should belong to the same subject as the texts from which the text labels were extracted. After the text to be labeled is obtained, word segmentation and stop-word removal are performed on it to obtain the corresponding word segmentation result, namely the word segmentation result to be labeled.

S320, determining text labels contained in the text to be labeled according to the word segmentation result to be labeled and the text labels, and using the text labels as labeling labels of the text to be labeled.

Correlation analysis is performed on the word segmentation result to be labeled and each text label so as to determine the text labels contained in the text to be labeled. These text labels can be used to label the text to be labeled automatically. The text labels here may include all of the corrected text labels and/or similar text labels.

Exemplarily, determining the text label included in the text to be labeled according to the word segmentation result to be labeled and each text label includes:

A. Sequentially judging whether each text label co-occurs in the word segmentation result to be labeled according to a preset sliding window, and determining the co-occurring text labels as the text labels contained in the text to be labeled.

Correlation analysis is performed on the word segmentation result to be labeled and each text label to analyze whether the words of a label co-occur. In a specific implementation, a sliding window of preset size is moved over the text to be labeled with a certain step length (such as 1 character or word), and it is judged whether any text label co-occurs within each sliding window of the text to be labeled. If so, the co-occurring text label is determined as a text label contained in the text to be labeled.

Exemplarily, sequentially judging whether each text label coexists in the segmentation result to be labeled according to a preset sliding window comprises: if the preset sliding window is larger than two words, segmenting each text label to obtain a keyword group corresponding to each text label; and sequentially judging whether all key phrases co-occur in the segmentation result to be marked according to a preset sliding window.

Whether the text labels need to be segmented is determined by the size of the preset sliding window. For example, if the preset sliding window is 2 words, no word segmentation is needed. If the preset sliding window is larger than 2 words, the text labels need to be segmented. Because a text label is a combination of two words, judging co-occurrence within a window of 3 or more words after segmentation increases the co-occurrence probability of the text label: it covers not only the case where the two words appear next to each other but also the case where they are separated by other words. This improves the stability of determining the text labels in the text to be labeled. In this case, it is judged whether the keyword group corresponding to each text label co-occurs within the sliding window over the text to be labeled.
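Step A can be sketched as follows. This is a minimal illustration under assumptions: each text label is supplied together with its keyword group (its segmented words), the window size is measured in words, the step length is 1 word, and the function name is hypothetical.

```python
def labels_by_sliding_window(tokens, labels, window):
    """Return the labels whose keyword group co-occurs inside some window.

    `tokens` is the word segmentation result to be labeled; `labels` maps
    each text label to its keyword group.
    """
    found = []
    # At least one window is examined even if the text is shorter than it.
    last_start = max(1, len(tokens) - window + 1)
    for label, keywords in labels.items():
        needed = set(keywords)
        for start in range(last_start):
            if needed <= set(tokens[start:start + window]):
                found.append(label)
                break
    return found


tokens = ["fridge", "green", "very", "eco", "quiet"]
labels = {"green eco": ["green", "eco"], "large capacity": ["large", "capacity"]}
print(labels_by_sliding_window(tokens, labels, 2))  # no window holds both words
print(labels_by_sliding_window(tokens, labels, 3))  # "green" and "eco" share a window
```

The two calls show the effect described above: with a window of 2 words the label is missed because "green" and "eco" are separated by another word, while a window of 3 words finds it.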

B. Vectorizing the word segmentation result to be labeled and each text label by using the feature vector model to obtain each word segmentation vector to be labeled and each text label vector.

The automatically extracted text labels (i.e., the text label list) may be incomplete, that is, the extracted personalized labels may not fully cover a label meaning. For example, "green environmental protection" and "pollution-free" both belong to environment-friendly labels. If the text label list contains only "green environmental protection" and not "pollution-free", while the texts corresponding to the article introduction detail drawing contain only "pollution-free", labeling by the above method would produce a wrong result. Therefore, in this embodiment, a feature vector model is introduced to obtain the text label vector of each text label and the word vector of the combined word of every two segmented words in the word segmentation result to be labeled, i.e., the word segmentation vector to be labeled.

C. Determining the second text similarity between each combined word in the word segmentation result to be labeled and each text label according to each word segmentation vector to be labeled and each text label vector.

The text similarity between each word segmentation vector to be labeled and each text label vector is calculated; this is the second text similarity.

D. Determining the text labels contained in the text to be labeled according to the second text similarity and a second preset similarity threshold.

Text labels whose second text similarity is smaller than a second preset similarity threshold (a similarity set empirically according to the accuracy required by the application) are eliminated, and the remaining text labels are the text labels contained in the text to be labeled. In the example above, this process judges the similarity between "pollution-free" and "green environmental protection", so that each labeling label is determined by text similarity. The advantage is that the accuracy of label determination can be further improved.
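Steps B to D can be sketched as follows. The embodiment does not specify the feature vector model or how the vector of a combined word is formed, so this sketch makes two labeled assumptions: the combined-word vector is the mean of the two word vectors, and the second text similarity is cosine similarity. The toy embeddings and all function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def labels_by_similarity(tokens, embed, label_vectors, threshold):
    """Keep labels similar enough to some two-word combination of the tokens."""
    # Step B (assumed): combined-word vector = mean of the two word vectors.
    pair_vectors = [
        [(x + y) / 2 for x, y in zip(embed[a], embed[b])]
        for a, b in combinations(tokens, 2)
        if a in embed and b in embed
    ]
    kept = []
    for label, label_vec in label_vectors.items():
        # Step C: best second text similarity over all combined words.
        best = max((cosine(pv, label_vec) for pv in pair_vectors), default=0.0)
        # Step D: eliminate labels below the second preset threshold.
        if best >= threshold:
            kept.append(label)
    return kept
```

With toy embeddings in which "pollution" and "free" point in nearly the same direction as the label "green eco", the text `["pollution", "free"]` is labeled "green eco" even though neither label word appears in it, which is the gap that step A alone cannot close.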

It should be noted that step A and steps B to D may be executed alternatively or simultaneously. When both are executed, two determination results for the labeling labels are obtained. The union of the two results can be taken as the final labeling labels of the text to be labeled, so as to ensure the comprehensiveness of the labeling labels; alternatively, the intersection of the two results can be taken, with the overlapping text labels used as the final labeling labels of the text to be labeled, so as to further ensure the accuracy of the labeling labels.
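The two ways of combining the results can be sketched as follows; the function name and the `mode` parameter are hypothetical, and the results are sorted only to make the output deterministic.

```python
def combine_labels(window_labels, similarity_labels, mode="union"):
    """Combine the labels from step A and from steps B to D."""
    a, b = set(window_labels), set(similarity_labels)
    if mode == "union":          # favors comprehensiveness
        return sorted(a | b)
    if mode == "intersection":   # favors accuracy
        return sorted(a & b)
    raise ValueError(f"unknown mode: {mode!r}")
```

The union keeps every label found by either route, while the intersection keeps only the labels confirmed by both.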

According to the technical scheme of this embodiment, a text to be labeled is obtained, and word segmentation and stop-word removal are performed on it to obtain the corresponding word segmentation result to be labeled; the text labels contained in the text to be labeled are then determined according to the word segmentation result to be labeled and each text label, and are used as the labeling labels of the text to be labeled. Automatic labeling of the text to be labeled is thus realized, the problems of human error and subjective randomness caused by manual data labeling are largely solved, and the efficiency and accuracy of labeling texts with labels are improved.

Example four

The present embodiment provides a text label extracting apparatus, referring to fig. 4, the apparatus specifically includes:

the text vector obtaining module 410 is configured to obtain each text of the tag to be extracted, and perform vectorization on each text to obtain a text vector corresponding to the corresponding text;

a text clustering result obtaining module 420, configured to cluster the text vectors to obtain at least one text clustering result;

a tag candidate word obtaining module 430, configured to perform keyword extraction on each text clustering result to obtain each tag candidate word corresponding to each text clustering result;

and the text label determining module 440 is configured to determine a text label of each text according to each label candidate word corresponding to each text clustering result.

Optionally, the text vector obtaining module 410 is specifically configured to:

for each text, performing word segmentation and stop-word removal on the text to obtain a word segmentation result to be extracted corresponding to the text, and vectorizing the word segmentation result to be extracted by using at least one feature vector model to obtain a text vector corresponding to the text.

Optionally, the text label determination module 440 includes:

the label recombination word generation submodule is used for determining the co-occurrence frequency between every two label candidate words corresponding to each text clustering result according to a co-occurrence matrix between words obtained in advance based on each text, screening out at least one pair of label candidate words with the co-occurrence frequency meeting a preset co-occurrence frequency threshold value, and generating each label recombination word;

and the text label determining submodule is used for determining the label restructuring words as text labels according to each label restructuring word if the label restructuring words exist in each text.

Further, the text label determination module 440 further includes:

the label candidate word screening submodule is used for determining a candidate word vector of each label candidate word corresponding to each text clustering result by using a characteristic vector model before determining the co-occurrence frequency between every two label candidate words corresponding to each text clustering result according to a co-occurrence matrix between words obtained in advance based on each text;

and determining the first text similarity between the two corresponding label candidate words according to the candidate word vectors of every two label candidate words, and eliminating the label candidate words of which the first text similarity is smaller than a first preset similarity threshold value in the label candidate words corresponding to each text clustering result.

Optionally, the text label determination sub-module is specifically configured to:

if the label restructuring words exist in each text, determining the label restructuring words as candidate labels;

and if the word frequency of the candidate label meets a preset word frequency threshold value, determining the label recombination word as a text label.

Optionally, the text vector obtaining module 410 includes:

the text data obtaining sub-module is used for carrying out character recognition on the article introduction detail drawing to obtain text data of the article introduction detail drawing when the application scene is that the text labels are extracted from the article introduction detail drawing;

and the text acquisition submodule is used for performing importance screening on the texts in the text data according to the text attributes in the text data and preset text screening conditions to obtain screening texts corresponding to the article introduction detail diagrams, and determining texts of the labels to be extracted corresponding to the article introduction detail diagrams according to the screening texts.

The text acquisition submodule is specifically configured to:

for every two longitudinally adjacent screening texts, if any one of a left edge difference absolute value, a center coordinate difference absolute value and a right edge difference absolute value between text boxes corresponding to the two screening texts in the text data is smaller than a set text box height, and a longitudinal difference absolute value between the two text boxes is smaller than the set text box height, merging the two screening texts to serve as a longitudinal merged text, wherein the set text box height is a larger value of the text box heights in the two text boxes, and the longitudinal difference absolute value is a distance difference absolute value between a lower edge of an upper text box and an upper edge of a lower text box;

and for every two transversely adjacent longitudinal merged texts, if the longitudinal overlapping distance between the text boxes of the two longitudinal merged texts is less than the set text box height, merging the two longitudinal merged texts to serve as the texts of the labels to be extracted corresponding to the article introduction detail graphs.

Optionally, on the basis of the above apparatus, the apparatus further includes a text label modification module, configured to:

after determining the text labels of the texts according to the label candidate words corresponding to each text clustering result, clustering the text labels to obtain at least one label clustering result;

and aiming at each label clustering result, taking any text label in the label clustering results as a corrected text label of the corresponding label clustering result, and taking each residual text label except the corrected text label in the label clustering results as a similar text label of the corrected text label.
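The correction performed by the text label modification module can be sketched as follows. The clustering itself is assumed to have been done already; since the embodiment allows any text label in a cluster to serve as the corrected text label, this sketch simply picks the first one. The function name is hypothetical.

```python
def correct_labels(label_clusters):
    """Map each corrected text label to its similar text labels.

    `label_clusters` is a list of label clustering results, each a list of
    text labels.
    """
    mapping = {}
    for cluster in label_clusters:
        if not cluster:
            continue  # skip empty clusters
        corrected, *similar = cluster  # any member may serve; take the first
        mapping[corrected] = similar
    return mapping
```

For the clusters `[["green eco", "pollution free", "eco friendly"], ["quiet"]]`, "green eco" becomes the corrected text label with two similar text labels, and "quiet" becomes a corrected text label with none.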

Optionally, on the basis of the above apparatus, the apparatus further includes a text label labeling module, where the text label labeling module includes:

the to-be-labeled word segmentation result obtaining submodule is used for obtaining a to-be-labeled text of the to-be-labeled label, performing word segmentation and word stop removal on the to-be-labeled text, and obtaining a to-be-labeled word segmentation result corresponding to the to-be-labeled text;

and the labeling label determining submodule is used for determining the text labels contained in the text to be labeled according to the word segmentation result to be labeled and each text label, and the text labels are used as the labeling labels of the text to be labeled.

Optionally, the labeling label determining submodule is specifically configured to:

and sequentially judging whether the text labels co-occur in the word segmentation result to be labeled according to a preset sliding window, and determining the co-occurring text labels as the text labels contained in the text to be labeled.

Further, the labeling label determining submodule is specifically configured to:

if the preset sliding window is larger than two words, segmenting each text label to obtain a keyword group corresponding to each text label;

and sequentially judging whether each keyword group co-occurs in the word segmentation result to be labeled according to the preset sliding window;

vectorizing the word segmentation result to be labeled and each text label by using a feature vector model to obtain each word segmentation vector to be labeled and each text label vector;

determining a second text similarity between each combined word in the word segmentation result to be labeled and each text label according to each word segmentation vector to be labeled and each text label vector;

and determining the text labels contained in the text to be labeled according to the second text similarity and a second preset similarity threshold.

Through the text label extraction device in the fourth embodiment of the invention, unsupervised automatic label extraction of various texts under the same theme is realized, the problems of incomplete labels and insufficient individuation caused by only extracting labels from structured texts are solved, and the individuation degree and the comprehensiveness of text labels are improved; meanwhile, unsupervised text label extraction solves the problems of low label extraction efficiency and poor extraction method expansibility caused by manual data labeling, improves the extraction efficiency of text labels, improves the applicability of methods for extracting labels under different subjects, and enhances the expandability of the label extraction method.

The text label extraction device provided by the embodiment of the invention can execute the text label extraction method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.

It should be noted that, in the embodiment of the text label extraction apparatus, the included units and modules are only divided according to functional logic, but are not limited to the above division, as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.

Example five

Referring to fig. 5, the present embodiment provides an apparatus, which includes: one or more processors 520; the storage 510 is configured to store one or more programs, and when the one or more programs are executed by the one or more processors 520, the one or more processors 520 implement the text label extraction method provided in the embodiment of the present invention, including:

clustering each text vector to obtain at least one text clustering result;

Of course, those skilled in the art can understand that the processor 520 may also implement the technical solution of the text label extraction method provided in any embodiment of the present invention.

The device shown in fig. 5 is only an example and should not bring any limitation to the function and the scope of use of the embodiments of the present invention. As shown in fig. 5, the apparatus includes a processor 520, a storage device 510, an input device 530, and an output device 540; the number of the processors 520 in the device may be one or more, and one processor 520 is taken as an example in fig. 5; the processor 520, the memory device 510, the input device 530 and the output device 540 of the apparatus may be connected by a bus or other means, such as by a bus 550 in fig. 5.

The storage device 510 is a computer-readable storage medium, and can be used to store software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the text label extraction method in the embodiment of the present invention (for example, a text vector acquisition module, a text clustering result acquisition module, a label candidate word acquisition module, and a text label determination module in the text label extraction device).

The storage device 510 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the storage 510 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, storage 510 may further include memory located remotely from processor 520, which may be connected to devices over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 530 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the apparatus. The output device 540 may include a display device such as a display screen.

Example six

The present embodiments provide a storage medium containing computer-executable instructions which, when executed by a computer processor, are operable to perform a method of text label extraction, the method comprising:

clustering each text vector to obtain at least one text clustering result;

Of course, the storage medium provided by the embodiment of the present invention contains computer-executable instructions, and the computer-executable instructions are not limited to the above method operations, and may also perform related operations in the text label extraction method provided by any embodiment of the present invention.

From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk, or an optical disk of a computer, and includes several instructions to enable a device (which may be a personal computer, a server, or a network device) to execute the text label extraction method provided in the embodiments of the present invention.

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A text label extraction method is characterized by comprising the following steps:

acquiring texts of labels to be extracted under the same theme, and vectorizing each text to obtain a text vector corresponding to the corresponding text;

clustering each text vector to obtain at least one text clustering result;

2. The method of claim 1, wherein vectorizing each of the texts to obtain a text vector corresponding to the corresponding text comprises:

performing word segmentation and stop-word removal on each text to obtain a word segmentation result to be extracted corresponding to the text, and vectorizing each word segmentation result to be extracted by using at least one feature vector model to obtain a text vector corresponding to the text.

3. The method of claim 1, wherein determining the text label of each text according to each label candidate word corresponding to each text clustering result comprises:

determining the co-occurrence frequency between every two label candidate words corresponding to each text clustering result according to a co-occurrence matrix between words obtained in advance based on each text, screening out at least one pair of label candidate words with the co-occurrence frequency meeting a preset co-occurrence frequency threshold value, and generating each label reorganized word;

and for each label restructuring word, if the label restructuring word exists in each text, determining the label restructuring word as the text label.

4. The method according to claim 3, before determining a co-occurrence frequency between every two label candidate words corresponding to each text clustering result according to a co-occurrence matrix between words obtained in advance based on each text, further comprising:

determining a candidate word vector of each label candidate word corresponding to each text clustering result by using a feature vector model;

according to the candidate word vectors of every two label candidate words, determining a first text similarity between the corresponding two label candidate words, and eliminating the label candidate words with the first text similarity smaller than a first preset similarity threshold value in the label candidate words corresponding to each text clustering result.

5. The method of claim 3, wherein determining the tagged reformulated words as the text tags if the tagged reformulated words exist in each of the texts comprises:

and if the word frequency of the candidate label meets a preset word frequency threshold value, determining the label recombination word as the text label.

6. The method according to claim 1, wherein when the application scenario is extracting text labels from an article introduction detail drawing, acquiring each text of the labels to be extracted comprises:

performing character recognition on the article introduction detail drawing to obtain text data of the article introduction detail drawing;

and according to the text attributes in the text data and preset text screening conditions, performing importance screening on the texts in the text data to obtain screening texts corresponding to the article introduction detail drawing, and determining the texts of the labels to be extracted corresponding to the article introduction detail drawing according to the screening texts.

7. The method according to claim 6, wherein determining each text of the label to be extracted corresponding to the article introduction detail drawing according to each screening text comprises:

for every two longitudinally adjacent screening texts, if any one of the absolute left-edge difference, the absolute center-coordinate difference and the absolute right-edge difference between the text boxes corresponding to the two screening texts in the text data is smaller than a set text box height, and the absolute longitudinal difference between the two text boxes is smaller than the set text box height, merging the two screening texts into a longitudinal merged text, wherein the set text box height is the larger of the heights of the two text boxes, and the absolute longitudinal difference is the absolute distance between the lower edge of the upper text box and the upper edge of the lower text box;

and for every two transversely adjacent longitudinal merged texts, if the longitudinal overlapping distance between the text boxes of the two longitudinal merged texts is less than the set text box height, merging the two longitudinal merged texts to serve as the texts of the labels to be extracted corresponding to the article introduction detail drawing.
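The longitudinal merging condition of claim 7 can be sketched as a small predicate. This is an illustration only: the `Box` type and the function name are hypothetical, the coordinate system is assumed to have the vertical axis increasing downward, and the "center coordinate" is assumed to be the horizontal center of the text box. Only the first (longitudinal) merging step is shown.

```python
from typing import NamedTuple

class Box(NamedTuple):
    """Hypothetical text box from character recognition; y grows downward."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def height(self):
        return self.bottom - self.top

    @property
    def center_x(self):
        # Assumed: the center coordinate is the horizontal center.
        return (self.left + self.right) / 2

def should_merge_vertically(upper, lower):
    """True if two longitudinally adjacent boxes satisfy both conditions."""
    # Set text box height: the larger of the two box heights.
    h = max(upper.height, lower.height)
    # Any one of the left, center or right differences below h is enough.
    aligned = min(
        abs(upper.left - lower.left),
        abs(upper.center_x - lower.center_x),
        abs(upper.right - lower.right),
    ) < h
    # Gap between the lower edge of the upper box and the upper edge of the lower box.
    gap = abs(lower.top - upper.bottom)
    return aligned and gap < h
```

For example, two left-aligned boxes of height 20 separated by a gap of 5 would be merged, whereas the same boxes separated by a gap of 40 would not.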

8. The method of claim 1, wherein after determining the text label of each text according to each label candidate word corresponding to each text clustering result, the method further comprises:

clustering each text label to obtain at least one label clustering result;

9. The method of claim 1, wherein after determining the text label of each text according to each label candidate word corresponding to each text clustering result, the method further comprises:

acquiring a text to be labeled, performing word segmentation and stop-word removal on the text to be labeled, and obtaining a word segmentation result to be labeled corresponding to the text to be labeled;

and determining text labels contained in the text to be labeled according to the word segmentation result to be labeled and each text label, wherein the text labels are used as labeling labels of the text to be labeled.

10. The method of claim 9, wherein determining the text labels included in the text to be labeled according to the segmentation result to be labeled and each text label comprises:

11. The method of claim 10, wherein sequentially determining whether each text label co-occurs in the segmentation result to be labeled according to a preset sliding window comprises:

if the preset sliding window is larger than two words, performing word segmentation on each text label to obtain a keyword group corresponding to each text label;

and sequentially judging whether each keyword group co-occurs in the segmentation result to be marked according to the preset sliding window.
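The sliding-window check of claim 11 can be sketched as follows. The claim does not define co-occurrence precisely, so this sketch assumes that a keyword group co-occurs when all of its keywords appear inside some window of consecutive words in the segmentation result. The function name and the English tokens are hypothetical; in practice the tokens would come from word segmentation and stop-word removal of the text to be labeled.

```python
def cooccurs_in_window(tokens, keywords, window):
    """Return True if every keyword appears within some run of `window`
    consecutive tokens (an assumed definition of co-occurrence)."""
    needed = set(keywords)
    if not needed:
        return False
    # Slide the window one token at a time; a short text yields one window.
    for start in range(max(1, len(tokens) - window + 1)):
        if needed <= set(tokens[start:start + window]):
            return True
    return False

# Hypothetical segmentation result of a text to be labeled.
tokens = ["vacuum", "with", "large", "suction", "force", "small", "footprint"]

# Keyword group obtained by segmenting a hypothetical text label.
label_keywords = ["large", "force"]
is_match = cooccurs_in_window(tokens, label_keywords, window=3)
```

A larger window tolerates more intervening words between the keywords of a label, at the cost of matching keywords that are only loosely related in the text.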

12. The method of claim 9, wherein determining the text labels included in the text to be labeled according to the segmentation result to be labeled and each text label comprises:

vectorizing the word segmentation result to be labeled and each text label by using a feature vector model to obtain each word segmentation vector to be labeled and each text label vector;

13. A text label extraction apparatus, comprising:

14. An apparatus, characterized in that the apparatus comprises:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the text label extraction method as recited in any one of claims 1-12.

15. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the text label extraction method according to any one of claims 1-12.