CN111125332A - Method, device, equipment and storage medium for calculating TF-IDF value of word

Info

Publication number: CN111125332A
Application number: CN201911330916.2A
Authority: CN
Inventors: 陈建华; 崔朝辉; 赵立军; 张霞
Original assignee: Neusoft Corp
Current assignee: Neusoft Corp
Priority date: 2019-12-20
Filing date: 2019-12-20
Publication date: 2020-05-08
Anticipated expiration: 2039-12-20
Also published as: CN111125332B

Abstract

The application discloses a method, an apparatus, a device, and a storage medium for calculating the TF-IDF value of a word. The method comprises: traversing the child nodes under the root node of a tree data structure according to a target word to determine a target child node; acquiring a target set corresponding to the target child node, wherein the target set comprises text identifiers and the TF value of the target word in the text corresponding to each text identifier; and determining the IDF value of the target word according to the number of text identifiers included in the target set and the total number of texts indicated by the root node, and then calculating the TF-IDF value of the target word. Once the target set corresponding to the target word is determined, the texts in the text library that contain the target word, and the TF value of the target word in each of them, can be read directly from the set. This reduces the computing resources consumed in locating those texts and improves the efficiency of determining the TF-IDF value of the target word.

Description

Method, device, equipment and storage medium for calculating TF-IDF value of word

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for calculating a TF-IDF value of a word.

Background

Compared with the traditional customer service system, the intelligent question answering system has the advantages of high efficiency, low cost and the like. At present, more and more enterprises use intelligent question-answering systems to provide conversation services for users.

During an intelligent conversation with a user, the intelligent question-answering system generally takes the question posed by the user and determines a corresponding answer from a text library according to the term frequency-inverse text frequency (TF-IDF) values, in that text library, of the keywords contained in the question. However, as the text library of the intelligent question-answering system grows, the amount of calculation required to obtain the TF-IDF values of the keywords grows as well, so the system consumes a large amount of computing resources when determining the answer according to those values.

Disclosure of Invention

The embodiment of the application provides a method, a device, equipment and a storage medium for calculating a TF-IDF value of a word, so that the calculation amount required for calculating the TF-IDF value of the word is reduced.

In a first aspect, an embodiment of the present application provides a method for calculating a TF-IDF value of a word, the method including:

traversing child nodes under a root node of a tree data structure according to a first target word to determine a first target child node, wherein the participle indicated by the first target child node is the first target word;

acquiring a first target set corresponding to the first target child node, wherein the first target set comprises a first text identifier and a first term frequency (TF) value of the first target word in the text corresponding to the first text identifier, and texts corresponding to different first text identifiers are different from each other;

determining a first inverse text frequency (IDF) value of the first target word according to the number of first text identifiers included in the first target set and the total number of texts indicated by the root node;

and calculating a TF-IDF value corresponding to the first target word according to the first TF value and the first IDF value.

In one possible embodiment, the method further comprises:

traversing virtual child nodes under the first target child node according to a second target word to determine a first virtual child node, wherein a pointer of the first virtual child node points to a second target child node under the root node, the participle indicated by the second target child node is the second target word, and the second target word and the first target word are consecutive in the same text;

acquiring a second target set corresponding to the second target child node, wherein the second target set comprises a second text identifier and a second TF value of the second target word in the text corresponding to the second text identifier, and texts corresponding to different second text identifiers are different from each other;

determining a second IDF value of the second target word according to the number of second text identifiers included in the second target set and the total number of texts indicated by the root node;

and calculating a TF-IDF value corresponding to the second target word according to the second TF value and the second IDF value.
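The pointer-following step above can be sketched as follows. This is only an illustrative sketch: the class names `ChildNode` and `VirtualChildNode`, the attribute names, and the sample values are hypothetical and do not come from the application.

```python
class ChildNode:
    """Child node under the root: one participle plus its corresponding set."""
    def __init__(self, word):
        self.word = word
        self.postings = {}          # text identifier -> TF value
        self.virtual_children = []  # VirtualChildNode objects


class VirtualChildNode:
    """Holds no set of its own; its pointer refers to a child node under the root."""
    def __init__(self, target):
        self.target = target        # the pointer; the indicated participle is target.word


def follow(first_node, second_word):
    """Traverse the virtual child nodes under first_node looking for second_word."""
    for virtual in first_node.virtual_children:
        if virtual.target.word == second_word:
            return virtual.target   # the second target child node
    return None


# Hypothetical data: "go" is followed by "where" in some text.
go = ChildNode("go")
where = ChildNode("where")
where.postings = {1: 0.5}
go.virtual_children.append(VirtualChildNode(where))

second_target = follow(go, "where")
second_target_set = second_target.postings
```

Because the virtual child node only stores a pointer, the set of the second target word is kept in one place, under the root node.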

In one possible implementation, the first target word and the second target word are participles in a participle set, the participle set is obtained by segmenting a question text, and the method further includes:

determining a keyword set of a text corresponding to each first text identifier and a keyword set of a text corresponding to each second text identifier, wherein a TF-IDF value corresponding to each keyword in the keyword set is not lower than TF-IDF values corresponding to other participles in the text, and the keyword set comprises the first target word and/or the second target word;

determining a text corresponding to a target keyword set as an answer text corresponding to the question text, wherein the similarity between the target keyword set and the participle set is higher than the similarity between other keyword sets and the participle set.

In one possible implementation, the number of keywords included in the keyword set does not exceed the number of participles included in the participle set.
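A minimal sketch of this answer-selection step is given below. The similarity measure is not specified above, so Jaccard similarity is used purely as an assumption; the function names and the TF-IDF values are likewise hypothetical.

```python
def keyword_set(scores, limit):
    """Up to `limit` participles of one text with the highest TF-IDF values."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:limit])


def jaccard(a, b):
    """Assumed similarity measure: size of intersection over size of union."""
    return len(a & b) / len(a | b)


def pick_answer(question_words, tfidf_by_text):
    """Return the identifier of the text whose keyword set best matches the question."""
    question = set(question_words)
    limit = len(question)  # keyword set is no larger than the participle set
    return max(
        tfidf_by_text,
        key=lambda text_id: jaccard(keyword_set(tfidf_by_text[text_id], limit), question),
    )


# Hypothetical TF-IDF values of the participles in two candidate texts.
tfidf_by_text = {
    1: {"go": 0.35, "where": 0.69, "you": 0.10},
    2: {"go": 0.17, "home": 0.35, "now": 0.35},
}
answer_id = pick_answer(["where", "go"], tfidf_by_text)
```

Capping the keyword set at the size of the participle set keeps the two sets comparable in size, so a long text does not gain an advantage merely by contributing more keywords.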

In one possible embodiment, the method further comprises:

performing word segmentation on a text to be processed to obtain a word segmentation set of the text to be processed, wherein the word segmentation set at least comprises a first word segmentation, and the text to be processed is any one text in a text library or a newly added text to be added into the text library;

when the participles indicated by the child nodes under the root node of the tree data structure do not include the first participle, creating, under the root node of the tree data structure, a first child node and a set corresponding to the first child node, wherein the participle indicated by the first child node is the first participle, and adding the text identifier of the text to be processed and a first TF value of the first participle in the text to be processed to the set corresponding to the first child node;

when a first child node whose indicated participle is the first participle already exists among the child nodes under the root node, adding the text identifier of the text to be processed and a first TF value of the first participle in the text to be processed to the set corresponding to the first child node.

In one possible implementation, the set of participles of the text to be processed further includes a second participle, and the second participle is consecutive to the first participle in the text to be processed, and the method further includes:

when the participles indicated by the virtual child nodes under the first child node do not include the second participle, creating a target virtual child node under the first child node, establishing for the target virtual child node a pointer pointing to a second child node under the root node, wherein the participle indicated by the second child node is the second participle, and adding the text identifier of the text to be processed and a second TF value of the second participle in the text to be processed to the set corresponding to the second child node;

when a target virtual child node whose indicated participle is the second participle already exists among the virtual child nodes under the first child node, adding the text identifier of the text to be processed and a second TF value of the second participle in the text to be processed to the set corresponding to the second child node pointed to by the pointer of the target virtual child node.

In one possible implementation, the set of participles further includes a third participle, and the method further includes:

when the third participle is determined to be the same as the first participle, not adding the text identifier of the text to be processed and the TF value of the third participle in the text to be processed to the set corresponding to the first child node again.
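The construction steps above can be sketched as follows. This is an illustrative sketch under several assumptions: dictionaries stand in for the lists of child nodes and virtual child nodes, every call is assumed to add a new text (so the total kept by the root is incremented each time), and the names `Root`, `Node`, and `add_text` are hypothetical.

```python
from collections import Counter


class Node:
    """Child node under the root node."""
    def __init__(self, word):
        self.word = word
        self.postings = {}  # the corresponding set: text identifier -> TF value
        self.virtual = {}   # virtual child nodes: next participle -> pointed-to child node


class Root:
    def __init__(self):
        self.total_texts = 0  # total number of texts indicated by the root node
        self.children = {}    # participle -> Node


def add_text(root, text_id, tokens):
    """Add one segmented text; `tokens` lists its participles in order."""
    root.total_texts += 1
    counts = Counter(tokens)
    for i, word in enumerate(tokens):
        node = root.children.setdefault(word, Node(word))
        # A repeated participle (the "third participle" case) is not added again.
        if text_id not in node.postings:
            node.postings[text_id] = counts[word] / len(tokens)
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            target = root.children.setdefault(nxt, Node(nxt))
            # Create the virtual child node only if it does not exist yet.
            node.virtual.setdefault(nxt, target)


root = Root()
add_text(root, 1, ["go", "where", "go"])
add_text(root, 2, ["where", "bar"])
```

After these two calls, the set of "where" holds one entry per text, while the set of "go" holds a single entry for text 1 even though "go" occurs twice in it.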

In a second aspect, an embodiment of the present application further provides an apparatus for calculating a TF-IDF value of a word, where the apparatus includes:

a first determining module, configured to traverse child nodes under a root node of a tree data structure according to a first target word to determine a first target child node, wherein the participle indicated by the first target child node is the first target word;

a first obtaining module, configured to obtain a first target set corresponding to the first target child node, where the first target set includes a first text identifier and a first word frequency TF value of the first target word in a text corresponding to the first text identifier, and texts corresponding to different first text identifiers are different from each other;

a second determining module, configured to determine, according to the number of first text identifiers included in the first target set and a total number of texts indicated by the root node, a first inverse text frequency IDF value of the first target word;

and the first calculating module is used for calculating a TF-IDF value corresponding to the first target word according to the first TF value and the first IDF value.

In a possible embodiment, the apparatus further comprises:

a third determining module, configured to traverse virtual child nodes under the first target child node according to a second target word, and determine a first virtual child node, where a pointer of the first virtual child node points to a second target child node under the root node, the participle indicated by the second target child node is the second target word, and the second target word and the first target word are consecutive in the same text;

a second obtaining module, configured to obtain a second target set corresponding to the second target child node, where the second target set includes a second text identifier and a second TF value of the second target word in a text corresponding to the second text identifier, and texts corresponding to different second text identifiers are different from each other;

a fourth determining module, configured to determine a second IDF value of the second target word according to the number of second text identifiers included in the second target set and a total number of texts indicated by the root node;

and the second calculating module is used for calculating a TF-IDF value corresponding to the second target word according to the second TF value and the second IDF value.

In some possible embodiments, the first target word and the second target word are participles in a participle set, the participle set is obtained by segmenting a question text, and the apparatus further includes:

a fifth determining module, configured to determine a keyword set of the text corresponding to each first text identifier and a keyword set of the text corresponding to each second text identifier, where a TF-IDF value corresponding to each keyword in the keyword set is not lower than TF-IDF values corresponding to other participles in the text, and the keyword set includes the first target word and/or the second target word;

a sixth determining module, configured to determine a text corresponding to a target keyword set as an answer text corresponding to the question text, where a similarity between the target keyword set and the participle set is higher than a similarity between other keyword sets and the participle set.

In some possible embodiments, the number of keywords included in the keyword set does not exceed the number of participles included in the participle set.

In some possible embodiments, the apparatus further comprises:

the word segmentation module is used for segmenting a text to be processed to obtain a word segmentation set of the text to be processed, wherein the word segmentation set at least comprises a first word segmentation, and the text to be processed is any one text in a text library or a newly added text to be added to the text library;

a first creating module, configured to create a first child node and a set corresponding to the first child node under a root node of the tree-shaped data structure when a participle indicated by the child node under the root node of the tree-shaped data structure does not include the first participle, where the participle indicated by the first child node is the first participle, and add a text identifier of the text to be processed and a first TF value of the first participle in the text to be processed in the set corresponding to the first child node;

a first adding module, configured to add, when a participle indicated by a first child node exists in the child nodes below the root node and is the first participle, a text identifier of the to-be-processed text and a first TF value of the first participle in the to-be-processed text in a set corresponding to the first child node.

In some possible embodiments, the set of participles of the to-be-processed text further includes a second participle, and the second participle is consecutive to the first participle in the to-be-processed text, and the apparatus further includes:

a second creating module, configured to create a target virtual child node under the first child node when a word segmentation indicated by the virtual child node under the first child node does not include the second word segmentation, and create a pointer pointing to a second child node under the root node for the target virtual child node, where the word segmentation indicated by the second child node is the second word segmentation, and add a text identifier of the text to be processed and a second TF value of the second word segmentation in the text to be processed in a set corresponding to the second child node;

a second adding module, configured to, when a target virtual child node whose indicated participle is the second participle already exists among the virtual child nodes under the first child node, add a text identifier of the text to be processed and a second TF value of the second participle in the text to be processed to the set corresponding to the second child node pointed to by the pointer of the target virtual child node.

In some possible embodiments, the set of participles further includes a third participle, and the apparatus further includes:

and the non-adding module is used for not adding the text identifier of the text to be processed and the TF value of the third participle in the text to be processed in the set corresponding to the first child node when the third participle is determined to be the same as the first participle.

In a third aspect, an embodiment of the present application further provides an apparatus, where the apparatus includes a processor and a memory;

the memory is used for storing program codes and transmitting the program codes to the processor;

the processor is configured to perform the method for calculating a TF-IDF value of a word according to any of the embodiments of the first aspect according to instructions in the program code.

In a fourth aspect, embodiments of the present application further provide a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to perform the method for calculating a TF-IDF value of a word according to any one of the embodiments of the first aspect.

In the implementations of the embodiments of the present application, when the TF-IDF value of a target word needs to be calculated, the child nodes under the root node of a pre-constructed tree data structure can be traversed according to the target word to determine a target child node, the participle indicated by which is the target word. After the target child node is determined, a target set corresponding to it can be obtained. The target set comprises text identifiers and the TF value of the target word in the text corresponding to each text identifier; it follows that the text corresponding to every text identifier in the target set contains the target word. The IDF value of the target word can then be determined according to the number of text identifiers included in the target set and the total number of texts indicated by the root node, and once the TF value and the IDF value of the target word are available, the TF-IDF value of the target word can be calculated.

Therefore, in the process of calculating the TF-IDF value of the target word, once the target set corresponding to the target word is determined, the texts in the text library that contain the target word (i.e., the texts corresponding to the text identifiers in the target set) can be read directly from the target set, without traversing the text library at a large cost in computing resources. This reduces the computing resources required to determine those texts, that is, the resources required to calculate the IDF value of the target word, and improves the efficiency of determining the IDF value. Meanwhile, the TF value of the target word in each text containing it can also be read from the target set, so the TF value does not need to be recalculated for each text. Obtaining the TF value by querying the set reduces the amount of repeated calculation and improves the efficiency of obtaining the TF value of the target word.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art according to the drawings.

FIG. 1 is a schematic diagram of an exemplary application scenario in an embodiment of the present application;

FIG. 2 is a flow chart illustrating a method for calculating a TF-IDF value of a word according to an embodiment of the present application;

FIG. 3 is a diagram of a tree data structure;

FIG. 4 is a schematic diagram of another tree data structure;

FIG. 5 is a diagram illustrating an apparatus for calculating TF-IDF values of a word according to an embodiment of the present application;

FIG. 6 is a schematic hardware structure diagram of an apparatus in an embodiment of the present application.

Detailed Description

When a user poses a question to the intelligent question-answering system, the system determines keywords from the question and retrieves from the text library, based on those keywords, an answer with which to reply to the user. Specifically, the system can locate the texts in the text library that contain the keywords; these texts are candidates for answering the user's question. To make the answer as accurate as possible (i.e., to avoid giving an answer that is beside the point), the system can further calculate the TF-IDF value of the keywords in each located text, so that the texts in which the keywords have higher TF-IDF values are used as the answer to the user's question.

The TF value of the keyword can be calculated by formula (1):

$$TF_i = \frac{n_i}{\sum_k n_k} \quad (1)$$

where $n_i$ is the number of times the keyword $t_i$ appears in a text, and $\sum_k n_k$ is the total number of all participles in the text.

The IDF value of the keyword can be calculated by the following formula (2):

$$IDF_i = \log \frac{|D|}{|\{ j : t_i \in d_j \}|} \quad (2)$$

where $|D|$ is the total number of texts in the text library, and $|\{ j : t_i \in d_j \}|$ is the total number of texts $d_j$ containing the keyword $t_i$.

Then, the product of the TF value and the IDF value of the keyword is calculated to obtain the TF-IDF value corresponding to the keyword.
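As a minimal sketch of formulas (1) and (2) computed directly from the texts, consider the following. It assumes that the texts have already been segmented into lists of participles, that the keyword occurs in at least one text, and that the logarithm is the natural logarithm (the base is not fixed above); the function names and the sample text library are hypothetical.

```python
import math


def tf(word, tokens):
    """Formula (1): occurrences of `word` over the number of participles in the text."""
    return tokens.count(word) / len(tokens)


def idf(word, corpus):
    """Formula (2): log(total texts / texts containing `word`); visits every text."""
    containing = sum(1 for tokens in corpus if word in tokens)
    return math.log(len(corpus) / containing)


# Hypothetical, already segmented text library.
corpus = [
    ["where", "go"],
    ["go", "home", "go", "now"],
    ["hello", "there"],
    ["good", "day"],
]

tf_value = tf("go", corpus[1])  # 2 occurrences out of 4 participles
idf_value = idf("go", corpus)   # 2 of the 4 texts contain "go"
tf_idf = tf_value * idf_value
```

Note that `idf` has to look at every text in the library on each call, so its cost grows with the size of the library.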

However, each time the IDF value of a keyword is calculated, the words included in every text in the text library must be traversed in order to pick out, from a large number of texts, those that contain the keyword. As the text library of the intelligent question-answering system grows, determining $|\{ j : t_i \in d_j \}|$ by traversing the texts consumes a large amount of the system's computing resources. Moreover, for the same keyword, the TF value of the keyword in each text has to be recalculated every time, which also consumes considerable resources.

Based on this, the embodiments of the present application provide a method for calculating the TF-IDF value of a word, aiming to reduce the amount of calculation required. In a specific implementation, when the TF-IDF value of a target word needs to be calculated, the child nodes under the root node of a pre-constructed tree data structure can be traversed according to the target word to determine a target child node, the participle indicated by which is the target word. After the target child node is determined, a target set corresponding to it can be obtained. The target set comprises text identifiers and the TF value of the target word in the text corresponding to each text identifier; it follows that the text corresponding to every text identifier in the target set contains the target word. The IDF value of the target word can then be determined according to the number of text identifiers included in the target set and the total number of texts indicated by the root node, and once the TF value and the IDF value of the target word are available, the TF-IDF value of the target word can be calculated.

As an example, the embodiments of the present application may be applied to the exemplary application scenario shown in fig. 1. In this scenario, the user 101 may interact by voice with the intelligent question-answering system, which is hosted on the terminal 102. The user 101 can input the target word to the terminal 102 by voice. The intelligent question-answering system on the terminal 102 can traverse the child nodes under the root node of the tree data structure according to the target word to determine a target child node, the participle indicated by which is the target word. The system may then obtain the target set corresponding to the target child node, where the target set includes text identifiers and the TF value of the target word in the text corresponding to each text identifier. Next, the system may determine the IDF value of the target word according to the number of text identifiers included in the target set and the total number of texts indicated by the root node, and, with the TF value and the IDF value available, calculate the TF-IDF value of the target word with respect to the text corresponding to each text identifier in the target set. In this way, the intelligent question-answering system on the terminal 102 can take the text for which the target word has the maximum TF-IDF value as the text with which to reply to the user, and convert that text into voice to inform the user 101.

It is to be understood that the above scenario is only one example of a scenario provided in the embodiment of the present application, and the embodiment of the present application is not limited to this scenario. For example, in other possible application scenarios, the intelligent question-answering system may also be loaded on a server, the terminal 102 is responsible for converting a question input by a user through voice into a text and sending the text to the server, the server performs the above-mentioned process of calculating the TF-IDF value of the target word, and returns the finally determined text to the terminal 102, so that the terminal 102 presents the text to the user 101; of course, the process of calculating the TF-IDF value corresponding to the target word may be applied not only to the scenario where the user performs question-answer interaction with the intelligent question-answer system, but also to other application scenarios such as keyword retrieval. In summary, the embodiments of the present application may be applied in any applicable scenario and are not limited to the scenario examples described above.

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, various non-limiting embodiments accompanying the present application examples are described below with reference to the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Referring to fig. 2, fig. 2 is a flow chart illustrating a method for calculating a TF-IDF value of a word in an embodiment of the present application. The method may be executed by the intelligent question-answering system on the terminal 102 shown in fig. 1, or may be executed by a server or other devices. The method specifically comprises the following steps:

S201: Traverse the child nodes under the root node of the tree data structure according to a first target word to determine a first target child node, wherein the participle indicated by the first target child node is the first target word.

S202: Acquire a first target set corresponding to the first target child node, wherein the first target set comprises a first text identifier and a first TF value of the first target word in the text corresponding to the first text identifier, and texts corresponding to different first text identifiers are different from each other.

In this embodiment, a tree data structure may be constructed in advance based on a text library (alternatively referred to as a corpus) including a plurality of texts. In the tree data structure, the root node may indicate the total number of texts contained in the text library, and the child nodes under the root node may indicate the participles included in the texts. Each child node may indicate one participle, and different child nodes under the root node may indicate different participles. The specific implementation process for building the tree data structure is described in detail later.

It should be noted that each child node may have a set corresponding to it. The set may include one or more text identifiers, and the text corresponding to each text identifier contains the participle indicated by the child node. A text identifier uniquely identifies a text, and texts corresponding to different text identifiers are different from each other; for example, different texts may be identified by incrementing numerical values. Thus, based on the text identifiers recorded in the set, it can be determined which texts in the text library contain the participle. The TF value of the participle in the text corresponding to each text identifier may also be recorded in the set. As an example, these TF values may be calculated in advance while the tree data structure is being constructed and stored together with the corresponding text identifiers. For example, for a child node N, the corresponding set may include [id₁, TF₁], [id₂, TF₂], [id₃, TF₃], and so on, indicating that the participle appears in the texts identified by id₁, id₂, and id₃, and that its TF value is TF₁ in the text identified by id₁, TF₂ in the text identified by id₂, TF₃ in the text identified by id₃, and so on.

Therefore, for the word segmentation indicated by each child node, the text identifier corresponding to the text containing the word segmentation and the TF value of the word segmentation in the text can be determined by searching the set corresponding to the child node, so that traversal for each text from a text library is not needed, repeated calculation of the TF value of the same word segmentation in each text is not needed, consumption of computing resources is reduced, and meanwhile, the efficiency of obtaining the TF value of the word segmentation in each text can be improved.

When the TF-IDF value corresponding to a word (hereinafter referred to as a first target word) needs to be calculated, the child nodes under the root node of the tree data structure may be traversed to find the child node whose indicated participle is the first target word, thereby determining the first target child node. As shown in fig. 3, assuming that the first target word is "go", the child nodes under the root node may be traversed (fig. 3 shows 5 child nodes by way of example, indicating the participles "your", "good", "go", "where" and "bar" respectively), so that the participle indicated by the 3rd child node (from left to right) is found to be "go", and the 3rd child node is the above-mentioned first target child node.

After the first target child node is determined, the set corresponding to it (hereinafter referred to as the first target set for convenience of description) may be further determined, so that the text identifiers of the texts in the text library that contain the first target word, and the TF value of the first target word in the text corresponding to each text identifier, can be determined from the information in the first target set. When the first target child node is created, a set may be created for it at the same time and associated with it, thereby establishing the correspondence between the first target child node and the set.
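Steps S201 and S202 on the five child nodes of fig. 3 can be sketched as follows. This is illustrative only: the list-of-pairs layout, the function name, and the text identifiers and TF values in the set of "go" are hypothetical, and the other sets are left empty for brevity.

```python
# Child nodes under the root node, left to right as in fig. 3.
# Each entry pairs a participle with its set (text identifier -> TF value).
children = [
    ("your", {}),
    ("good", {}),
    ("go", {1: 0.25, 4: 0.5}),  # hypothetical entries
    ("where", {}),
    ("bar", {}),
]


def find_target_child(children, target_word):
    """Return (position, set) of the child node indicating target_word, or None."""
    for position, (word, postings) in enumerate(children, start=1):
        if word == target_word:
            return position, postings
    return None


position, first_target_set = find_target_child(children, "go")
texts_containing_word = sorted(first_target_set)  # identifiers of texts containing "go"
```

The traversal visits child nodes rather than texts, so its cost depends on the number of distinct participles under the root node, not on the number or length of the texts.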

S203: Determine a first IDF value of the first target word according to the number of first text identifiers included in the first target set and the total number of texts indicated by the root node.

It is to be understood that each first text identifier in the first target set identifies a text containing the first target word, so the number of text identifiers included in the first target set equals the number of texts in the text library that contain the first target word. Therefore, when calculating the IDF value of the first target word (hereinafter referred to as the first IDF value for convenience of description), the number of first text identifiers in the first target set may be counted, and the first IDF value may be calculated by formula (2) above from that number and the total number of texts in the text library indicated by the root node.

Therefore, in the process of calculating the IDF value of the first target word, because the first text identifier is recorded in the first target set, the number of texts in the text base containing the first target word is determined without traversing the text base by consuming a large amount of computing resources, so that the consumption of computing resources can be effectively reduced, and the efficiency of calculating the IDF value of the first target word is improved.
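The following sketch illustrates how the first IDF value could be derived from the size of the first target set alone. Formula (2) is not reproduced in this part of the description, so the sketch assumes the common form idf = log(N / n), where N is the total number of texts and n is the number of text identifiers in the set; the function name and sample values are hypothetical.

```python
import math


def idf_from_target_set(target_set, total_texts):
    """IDF from the number of text identifiers in the set (assumed form: log(N / n))."""
    containing = len(target_set)  # number of texts that contain the word
    if containing == 0:
        raise ValueError("target set is empty; the word occurs in no text")
    return math.log(total_texts / containing)


# Hypothetical first target set: text identifier -> TF value.
first_target_set = {"t2": 0.2, "t3": 0.5}
first_idf = idf_from_target_set(first_target_set, total_texts=4)
```

If formula (2) uses a smoothed variant (for example, adding 1 to the denominator), only the return expression would change; the count still comes from the set rather than from a scan of the text library.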

S204: calculating a TF-IDF value corresponding to the first target word according to the first TF value and the first IDF value.

In a specific implementation, after the first TF value and the first IDF value corresponding to the first target word are obtained according to the information recorded in the first target set, the product of the first TF value and the first IDF value may be calculated to obtain the TF-IDF value corresponding to the first target word. It should be noted that a TF-IDF value is determined for the first target word with respect to the text corresponding to each first text identifier in the first target set. Specifically, when the first target set includes only one first text identifier, one TF-IDF value is calculated based on the TF value corresponding to that identifier; when the first target set includes a plurality of first text identifiers, a TF-IDF value is calculated from the TF value corresponding to each first text identifier, so that a plurality of TF-IDF values are obtained.
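A minimal sketch of this step is given below: one TF-IDF value is produced per text identifier in the target set by multiplying the stored TF value by the single IDF value of the word. The function name and the sample numbers are illustrative assumptions.

```python
def tfidf_per_text(target_set, idf):
    """Return text identifier -> TF-IDF value, one entry per text containing the word."""
    return {text_id: tf * idf for text_id, tf in target_set.items()}


# Hypothetical inputs: two texts contain the word, and its IDF value is 2.0.
first_target_set = {"t2": 0.2, "t3": 0.5}
tfidf_values = tfidf_per_text(first_target_set, idf=2.0)
```

With two text identifiers in the set, two TF-IDF values are obtained; with a single identifier, the result would contain a single value.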

In many practical scenarios, it may be necessary to successively calculate the TF-IDF values corresponding to a plurality of participles. For example, in a scenario in which a user performs question-answer interaction with a terminal, a question text presented by the user may include a plurality of participles, and the terminal needs to calculate a TF-IDF value corresponding to each participle. Therefore, in this embodiment, after the TF-IDF value corresponding to the first target word is calculated, the TF-IDF value of a second target word that is continuous with the first target word may be further calculated. The first target word and the second target word being continuous means that they appear consecutively in the same sentence. For example, for a sentence whose consecutive participles are "you", "go", "which" and "play", when the first target word is "you", the second target word is the word "go" continuous with "you", and when the first target word is "which", the second target word is the word "play" continuous with "which".

When the TF-IDF value corresponding to the second target word is calculated, similar to the first target word, the child nodes under the root node may also be traversed, so that a second target child node indicating the second target word may be determined, and according to information recorded in a second target set corresponding to the second target child node, the TF value corresponding to the second target word in each text and the IDF value corresponding to the second target word are respectively determined, so as to obtain the TF-IDF value corresponding to the second target word.

In other possible embodiments, the first target child node may further include a virtual child node, a pointer of each virtual child node may point to a corresponding child node below the root node, and a segment indicated by the virtual child node is a segment indicated by a corresponding child node below the root node. In this way, after the first target child node is traversed for the first target word, a virtual child node (hereinafter referred to as a first virtual child node) indicating a second target word may be traversed from the virtual child nodes under the first target child node, and then a second target child node (which is a child node under the root node) pointed by the pointer of the first virtual child node may be determined. It can be understood that, in practical applications, the number of the participles consecutive to the first target word (i.e., the number of the virtual child nodes under the first target child node) is usually much smaller than the number of the child nodes under the root node, and therefore, only the virtual child nodes under the first target child node are traversed, and then the second target child node is determined according to the first virtual child node obtained by the traversal, so that the calculation resources required to determine the second target child node indicating the second target word can be effectively reduced, and the calculation resource consumption for calculating the TF-IDF value of the second target word can be further reduced.

As shown in fig. 4, after the first target child node corresponding to the first target word "you" is determined, the virtual child nodes under the first target child node (two virtual child nodes, "good" and "go", shown as dashed circles in the figure) may be traversed to determine the virtual child node indicating the second target word "go". The pointer of this virtual child node points to the child node "go" under the root node (shown as a solid circle), as indicated by the dashed arrow in fig. 4, so that when the second target child node corresponding to the second target word "go" is determined, the child nodes under the root node do not need to be traversed.

Of course, if the virtual child node under the first target child node is traversed and it is determined that there is no virtual child node indicating the second target word, the traversal may be performed from the child nodes under the root node to determine the second target child node corresponding to the second target word. If the child node indicating the second target word does not exist in the child nodes under the root node, a new child node can be created under the root node, and the word segmentation indicated by the new child node is the second target word.
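The lookup order described above (virtual child nodes first, then the child nodes under the root node, then creation of a new child node) can be sketched as follows. This is a simplified illustration: the names `Node`, `Root`, and `find_second_target` are hypothetical, and each virtual child node is modeled simply as a reference (pointer) to the corresponding child node under the root node.

```python
class Node:
    def __init__(self, word):
        self.word = word             # participle indicated by this child node
        self.tf_by_text = {}         # set: text identifier -> TF value
        self.virtual_children = []   # pointers to child nodes under the root


class Root:
    def __init__(self):
        self.total_texts = 0
        self.children = []


def find_second_target(root, first_target_child, second_word):
    # 1. Traverse the (usually few) virtual child nodes of the first target child node.
    for target in first_target_child.virtual_children:
        if target.word == second_word:
            return target
    # 2. Fall back to traversing the child nodes under the root node.
    for child in root.children:
        if child.word == second_word:
            return child
    # 3. No child node indicates the word: create a new one under the root node.
    new_child = Node(second_word)
    root.children.append(new_child)
    return new_child


root = Root()
you, good, go = Node("you"), Node("good"), Node("go")
root.children = [you, good, go]
you.virtual_children = [go]          # "go" has followed "you" in some text

via_pointer = find_second_target(root, you, "go")      # found through a virtual child
via_root = find_second_target(root, you, "good")       # found by traversing the root
created = find_second_target(root, you, "bar")         # newly created child node
```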

After the second target child node is determined, a second target set corresponding to the second target child node may be further obtained, so that the text identifiers recorded in the second target set (for convenience of distinguishing, referred to as second text identifiers) and a second TF value of the second target word in the text corresponding to each second text identifier may be obtained. The texts corresponding to different second text identifiers are different from each other, and a plurality of second text identifiers correspond to a plurality of second TF values. Then, according to the number of second text identifiers included in the second target set and the total number of texts indicated by the root node, a second IDF value of the second target word may be determined, and further, according to the second TF value and the second IDF value, a TF-IDF value corresponding to the second target word may be calculated. When the second target set includes a plurality of second text identifiers, a TF-IDF value can be calculated from the TF value corresponding to each second text identifier, so that a plurality of TF-IDF values are obtained.

It is to be noted that, in this embodiment, the calculation for the first target word and the second target word is taken as an example to describe the technical solution of the embodiment of the present application. In a further possible embodiment, in addition to the first target word and the second target word, the calculation may continue for a third target word, a fourth target word, and so on. When the TF-IDF value corresponding to the third target word is calculated, traversal may start from the virtual child nodes under the first virtual child node indicating the second target word. For example, in fig. 4, when the TF-IDF value corresponding to the third target word "which" is calculated, the virtual child nodes under the first virtual child node indicating the second target word "go" (the virtual child nodes indicating "which" and "bar", respectively) may be traversed, so as to determine the virtual child node indicating the third target word "which"; similarly, the pointer of the virtual child node "which" points to the corresponding child node "which" under the root node. After the child node ("which") under the root node pointed to by the pointer of the virtual child node ("which") is determined, the subsequent process of calculating the TF-IDF value corresponding to the third target word according to the set corresponding to that child node is similar to the process of calculating the TF-IDF value corresponding to the second target word, and is not described herein again.

In a scenario of practical application, the first target word and the second target word may be words in a question text, and when an answer text for replying the question needs to be determined from a text library, first, the question text may be obtained, and word segmentation processing is performed on the question text to obtain a word segmentation set corresponding to the question text, where the word segmentation set at least includes the first target word and the second target word.

Then, the above process may be performed to determine a first text identifier corresponding to the text containing the first target word and a second text identifier corresponding to the text containing the second target word, respectively, so that the text containing the first target word and/or the second target word may be determined from the plurality of texts contained in the text library.

Then, for each determined text, a TF-IDF value corresponding to each participle in the text can be calculated, wherein the calculation process of the TF-IDF corresponding to each participle can refer to the calculation process of the TF-IDF of the first target word and the second target word, so as to determine a keyword set corresponding to the text according to the TF-IDF value corresponding to each participle in the text, the TF-IDF value corresponding to each keyword in the keyword set is not lower than the TF-IDF values corresponding to other participles in the text, and the keywords included in each keyword set are different from each other. In a specific implementation, for each text, a participle with a larger TF-IDF value may be selected from the text as a keyword, and a keyword set corresponding to the text is formed.

In this way, each determined text (containing the first target word and/or the second target word) has a corresponding keyword set. The similarity between the keyword set corresponding to each text and the participle set corresponding to the question text, such as the cosine similarity between the two sets, can then be calculated, so that the keyword set with the highest similarity to the participle set (for convenience of description, hereinafter referred to as the target keyword set) can be determined from the keyword sets; that is, the similarity between the target keyword set and the participle set is higher than the similarity between any other keyword set and the participle set. The text corresponding to the target keyword set is determined as the answer text corresponding to the question text.
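One way to illustrate the selection of the answer text is sketched below. The description does not fix how a set is turned into a vector, so the sketch assumes that each set is treated as a binary bag of words, in which case the cosine similarity reduces to |A ∩ B| / sqrt(|A| · |B|). The function names and sample sets are hypothetical.

```python
import math


def set_cosine(a, b):
    """Cosine similarity of two word sets viewed as binary vectors (an assumption)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def pick_answer_text(question_participles, keyword_sets):
    """Return the text identifier whose keyword set is most similar to the question."""
    return max(keyword_sets,
               key=lambda text_id: set_cosine(keyword_sets[text_id], question_participles))


# Hypothetical participle set of the question and keyword sets of candidate texts.
question = {"you", "go", "which", "play"}
keyword_sets = {
    "t1": {"you", "good"},
    "t2": {"go", "which", "bar"},
}
answer_text_id = pick_answer_text(question, keyword_sets)
```

A weighted variant that uses the TF-IDF values as vector components would follow the same structure, with the dot product and norms computed over the weights instead of set sizes.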

Of course, in a further possible implementation, when determining the keyword set corresponding to each determined text, the number of keywords included in the keyword set may also be limited. As an example, the number of keywords included in the keyword set may be limited so as not to exceed the number of participles included in the participle set; for example, if the participle set contains 8 participles, each keyword set may contain 8 or 7 keywords, and so on. Of course, in practical applications, the number of keywords included in the keyword set may also exceed the number of participles included in the participle set; for example, the participle set may contain 8 participles while each keyword set contains 9 keywords.
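The keyword selection with an upper limit on the set size can be illustrated with the following sketch, which keeps the participles with the largest TF-IDF values in a text. The function name, the parameter `max_keywords`, and the sample values are assumptions introduced for illustration.

```python
def select_keywords(tfidf_by_word, max_keywords):
    """Keep at most max_keywords participles with the largest TF-IDF values."""
    ranked = sorted(tfidf_by_word.items(), key=lambda item: item[1], reverse=True)
    return {word for word, _ in ranked[:max_keywords]}


# Hypothetical TF-IDF values of the participles in one text.
tfidf_in_text = {"go": 0.9, "you": 0.1, "bar": 0.5, "good": 0.3}

# Limit the keyword set to the size of a question participle set of 2 words.
keywords = select_keywords(tfidf_in_text, max_keywords=2)
```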

It should be noted that, when the participles included in the participle set further include other words besides the first target word and the second target word, the text determined from the text library may also be a text including other participles. Correspondingly, the determined target keyword set may also be a keyword set corresponding to a text including other participles.

It should be noted that, in this embodiment, the set corresponding to the child node under each root node may specifically be a list created for the child node, that is, a text identifier of a text where a word segmentation indicated by the child node is located and a TF value of the word segmentation in the text are correspondingly recorded in the list corresponding to the child node; of course, in other possible embodiments, the set corresponding to the child node may also exist in the form of a key-value pair. That is, in the key value pair ("key value") corresponding to the child node, "key" may be a text identifier of the text where the word is located, and "value" may be a TF value of the word in the text. In practical applications, the set corresponding to the child node may also exist in other forms, which is not limited in this embodiment.

Next, a process of constructing a tree data structure based on a plurality of texts in a text library will be described in detail:

In the process of constructing the tree data structure, for any text in the text library (for convenience of description, hereinafter referred to as a to-be-processed text), word segmentation may be performed on the to-be-processed text to obtain a participle set of the to-be-processed text, where the participle set at least includes a first participle. It is understood that, in practical applications, the participle set may include a plurality of participles; here, the process of constructing the tree data structure is illustrated by taking one of them (hereinafter referred to as a first participle) as an example. First, the child nodes under the root node of the tree data structure may be traversed to check whether any of them indicates the first participle. When none does (for example, when the data structure is initially constructed and there is no child node under the root node, or there are only a few child nodes under the root node and none of them indicates the first participle), a first child node and a set corresponding to the first child node may be created under the root node, and the text identifier of the to-be-processed text and a first TF value of the first participle in the to-be-processed text are added to the set corresponding to the first child node, where the participle indicated by the first child node is the first participle, and the first TF value may be calculated by the foregoing formula (1).
When a first child node indicating the first participle already exists among the child nodes under the root node, the text identifier of the to-be-processed text and the first TF value of the first participle in the to-be-processed text are added to the set corresponding to the first child node. In this case, because the first child node was created in advance for an earlier text, the set includes at least two different text identifiers and two TF values after the addition. Therefore, the process of constructing the tree data structure based on the first participle in the to-be-processed text is completed.
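The two cases above (creating the first child node, or reusing an existing one) can be sketched as follows. The names `ChildNode`, `Root`, and `add_first_participle` are hypothetical, and the TF values are passed in directly because formula (1) is defined elsewhere in the description.

```python
class ChildNode:
    def __init__(self, word):
        self.word = word         # participle indicated by this child node
        self.tf_by_text = {}     # set: text identifier -> TF value


class Root:
    def __init__(self):
        self.total_texts = 0
        self.children = []


def add_first_participle(root, text_id, word, tf):
    """Record (text_id, tf) in the set of the child node indicating `word`."""
    first_child = None
    for child in root.children:          # traverse the child nodes under the root
        if child.word == word:
            first_child = child
            break
    if first_child is None:              # no such node yet: create node and set
        first_child = ChildNode(word)
        root.children.append(first_child)
    first_child.tf_by_text[text_id] = tf
    return first_child


root = Root()
node_a = add_first_participle(root, "t1", "go", 0.25)   # creates the child node
node_b = add_first_participle(root, "t2", "go", 0.5)    # reuses the same child node
```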

Further, when the tree data structure is constructed based on the second participle in the participle set of the text to be processed, the tree data structure can also be constructed according to a construction process similar to the first participle.

In yet another possible implementation, after the process of building the tree data structure according to the first participle is completed, for a second participle that is continuous with the first participle in the to-be-processed text, the virtual child nodes under the first child node may first be traversed to check whether any of them indicates the second participle. When none does, a target virtual child node may be created under the first child node, and a pointer pointing to a second child node under the root node may be created for the target virtual child node. The participles indicated by the target virtual child node and by the second child node are both the second participle. The second child node may be determined by traversing the child nodes under the root node (similar to the process of determining the first child node according to the first participle); of course, if the child nodes under the root node do not include a child node indicating the second participle, the second child node and a set corresponding to the second child node may be created for the second participle. Meanwhile, the text identifier of the to-be-processed text and the second TF value of the second participle in the to-be-processed text can be added to the set corresponding to the second child node. Thus, when the TF value corresponding to the second participle needs to be queried, the virtual child nodes under the first child node may be traversed to determine the target virtual child node indicating the second participle, so as to determine the second child node based on the pointer of the target virtual child node, and further obtain the TF value recorded in the set corresponding to the second child node (of course, the text identifier of the to-be-processed text may also be obtained from the set).
Because the number of virtual child nodes under a child node is usually much smaller than the number of child nodes under the root node, determining the child node corresponding to the second participle by traversing the virtual child nodes can effectively reduce the required computing resources and improve the processing efficiency.

When the word segmentation indicated by the virtual child node under the first child node includes the second word segmentation, the second child node can be determined according to the pointer of the virtual child node indicating the second word segmentation, and then the identifier of the text to be processed and the second TF value of the second word segmentation in the text to be processed are added to the set corresponding to the second child node. Therefore, the process of constructing the tree data structure based on the second word in the text to be processed is completed.
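A minimal sketch of how a virtual child node might be created or reused during construction is shown below. The class `VirtualNode` holds only a pointer to the child node under the root node; all names and TF values are hypothetical and serve only to illustrate the two branches described above.

```python
class Node:
    def __init__(self, word):
        self.word = word
        self.tf_by_text = {}   # set: text identifier -> TF value
        self.virtual = []      # virtual child nodes under this child node


class VirtualNode:
    def __init__(self, target):
        self.target = target   # pointer to the corresponding child node under the root

    @property
    def word(self):
        return self.target.word


def link_second_participle(root_children, first_child, second_word, text_id, tf):
    # Case 1: a virtual child node already indicates the second participle.
    second_child = None
    for v in first_child.virtual:
        if v.word == second_word:
            second_child = v.target
            break
    if second_child is None:
        # Case 2: find (or create) the second child node under the root node,
        # then create the target virtual child node pointing to it.
        second_child = next((c for c in root_children if c.word == second_word), None)
        if second_child is None:
            second_child = Node(second_word)
            root_children.append(second_child)
        first_child.virtual.append(VirtualNode(second_child))
    second_child.tf_by_text[text_id] = tf
    return second_child


root_children = [Node("you")]
you = root_children[0]
go_1 = link_second_participle(root_children, you, "go", "t1", 0.25)  # creates node and pointer
go_2 = link_second_participle(root_children, you, "go", "t2", 0.5)   # follows existing pointer
```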

It should be noted that, in practical applications, the participle set of a to-be-processed text may further include a third participle that is the same as the first participle. Since the first participle and the third participle are participles in the same to-be-processed text, their TF values in the to-be-processed text are the same. Therefore, after the text identifier of the to-be-processed text and the first TF value of the first participle in the to-be-processed text have been added to the set corresponding to the first child node based on the first participle, it is not necessary to add the same text identifier and TF value to that set again for the third participle (of course, even if the same text identifier and TF value are added repeatedly, the duplicate text identifiers and TF values may be merged subsequently).

Based on this, in an exemplary embodiment, when it is determined that the third participle in the participle set is the same as the first participle, the text identifier of the text to be processed and the TF value of the third participle in the text to be processed may not be added to the set corresponding to the first child node.
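The skipping of a repeated participle can be illustrated as follows. The sketch assumes that the TF value is the number of occurrences of the participle divided by the total number of participles in the text, which is a common definition and may differ from formula (1); the function name and the flat mapping used in place of the tree are simplifications.

```python
def record_participles(sets_by_word, text_id, participles):
    """Add (text_id, TF) once per distinct participle of one text."""
    seen = set()
    total = len(participles)
    for word in participles:
        if word in seen:
            continue                              # same participle again: nothing to add
        seen.add(word)
        tf = participles.count(word) / total      # assumed TF definition
        sets_by_word.setdefault(word, {})[text_id] = tf


# Hypothetical text t1 in which "go" occurs twice (first and third participle).
sets_by_word = {}
record_participles(sets_by_word, "t1", ["go", "you", "go", "bar"])
```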

For each word segmentation in each text in the text library, the tree data structure can be constructed according to the above process, so that the tree data structure constructed based on the text library can be obtained.

In addition, when a new text needs to be added to the text base, the process of updating the tree data structure with the new text is similar to the above process of building the tree data structure from the texts in the text base. Of course, the total number of texts indicated by the root node in the tree data structure increases accordingly, and the increase equals the number of newly added texts.

In this embodiment, when the TF-IDF value of the target word needs to be calculated, the child nodes under the root node of a tree data structure constructed in advance may be traversed according to the target word to determine a target child node, where the participle indicated by the target child node is the target word; after the target child node is determined, a target set corresponding to the target child node can be further obtained, where the target set includes text identifiers and the TF values of the target word in the texts corresponding to the text identifiers, and it can be understood that the text corresponding to each text identifier in the target set contains the target word; then, the IDF value of the target word may be determined according to the number of text identifiers included in the target set and the total number of texts indicated by the root node, and after the TF value and the IDF value of the target word are obtained, the TF-IDF value of the target word may be further calculated.
Therefore, in the process of calculating the TF-IDF value of the target word, after the target set corresponding to the target word is determined, which texts in the text base contain the target word can be directly determined from the target set, and the texts in the text base containing the target word are determined without traversing the text base by consuming a large amount of computing resources, so that not only can computing resources required for determining the texts be effectively reduced, that is, computing resources required for calculating the IDF value of the target word be effectively reduced, but also the efficiency of determining the IDF value of the target word can be effectively improved; meanwhile, the TF value of the target word in each text containing the target word can be determined according to the target set, so that the TF value of the target word does not need to be repeatedly calculated for each text, the calculation amount required for repeatedly calculating the TF value of the target word can be reduced, the TF value of the target word is obtained in a query set mode, and the efficiency of obtaining the TF value of the target word can be effectively improved.

In addition, the embodiment of the application also provides a device for calculating the TF-IDF value of the word. Referring to fig. 5, fig. 5 shows an apparatus for calculating a TF-IDF value of a word according to an embodiment of the present application, where the apparatus 500 includes:

a first determining module 501, configured to traverse, according to a first target word, the child nodes under a root node of a tree data structure to determine a first target child node, where the participle indicated by the first target child node is the first target word;

a first obtaining module 502, configured to obtain a first target set corresponding to the first target child node, where the first target set includes a first text identifier and a first word frequency TF value of the first target word in a text corresponding to the first text identifier, and texts corresponding to different first text identifiers are different from each other;

a second determining module 503, configured to determine a first inverse text frequency IDF value of the first target word according to the number of the first text identifiers included in the first target set and the total number of texts indicated by the root node;

a first calculating module 504, configured to calculate a TF-IDF value corresponding to the first target word according to the first TF value and the first IDF value.

In a possible implementation, the apparatus 500 further includes:

In some possible embodiments, the first target word and the second target word are participles in a participle set, the participle set is obtained by participling a question text, and the apparatus 500 further includes:

In some possible embodiments, the apparatus 500 further comprises:

In some possible embodiments, the participle set of the to-be-processed text further includes a second participle, and the second participle is consecutive to the first participle in the to-be-processed text, and the apparatus 500 further includes:

In some possible embodiments, the set of participles further includes a third participle, and the apparatus 500 further includes:

It should be noted that, for the contents of information interaction, execution process, and the like between the modules and units of the apparatus, since the same concept is based on the method embodiment in the embodiment of the present application, the technical effect brought by the contents is the same as that of the method embodiment in the embodiment of the present application, and specific contents may refer to the description in the foregoing method embodiment in the embodiment of the present application, and are not described herein again.

In addition, the embodiment of the application also provides a device. Referring to fig. 6, fig. 6 shows a schematic hardware structure diagram of a device in an embodiment of the present application, where the device 600 includes a processor 601 and a memory 602:

the memory 602 is used for storing program codes and transmitting the program codes to the processor 601;

the processor 601 is configured to perform the method for calculating the TF-IDF value of a word as described in the above method embodiments according to instructions in the program code.

The embodiment of the application also provides a computer readable storage medium. The methods described in the above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media may include computer storage media and communication media, and may include any medium that can communicate a computer program from one place to another. A storage medium may be any target medium that can be accessed by a computer.

As an alternative design, a computer-readable medium may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that is accessible by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk and blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above are also intended to be included within the scope of computer-readable media.

It should be noted that the terms "of", "corresponding to" and "corresponding" may sometimes be used interchangeably in the present application; when the difference is not emphasized, their intended meanings are consistent.

It should be noted that in the embodiments of the present application, words such as "exemplary" or "for example" are used to indicate examples, illustrations or explanations. Any embodiment or design described herein as "exemplary" or "for example" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "for example" is intended to present related concepts in a concrete fashion.

In the present application, "at least one" means one or more, and "plurality" means two or more. "And/or" describes the association relationship of the associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone, wherein A and B can be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural item(s). For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or multiple. In addition, in order to facilitate clear description of the technical solutions of the embodiments of the present application, terms such as "first" and "second" are used to distinguish the same items or similar items having substantially the same functions and actions. Those skilled in the art will appreciate that the terms "first," "second," etc. do not denote any order, quantity, or importance.

As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the above embodiment methods can be implemented by software plus a general hardware platform. Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a read-only memory (ROM)/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network communication device such as a router) to execute the method according to the embodiments or some parts of the embodiments of the present application.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the method embodiment for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the modules described as separate parts may or may not be physically separate, and the parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

The above description is only an exemplary embodiment of the present application, and is not intended to limit the scope of the present application.

Claims

1. A method of calculating a TF-IDF value of a word, the method comprising:

2. The method of claim 1, further comprising:

3. The method of claim 2, wherein the first target word and the second target word are participles in a participle set, and the participle set is obtained by participling a question text, and the method further comprises:

4. The method of claim 3, wherein the number of keywords included in the set of keywords does not exceed the number of participles included in the set of participles.

5. The method of claim 1, further comprising:

6. The method of claim 5, wherein the participle set of the text to be processed further comprises a second participle that is consecutive to the first participle in the text to be processed, the method further comprising:

7. The method of claim 5, wherein the set of participles further comprises a third participle, the method further comprising:

8. An apparatus for calculating a TF-IDF value of a word, the apparatus comprising:

9. An apparatus, comprising a processor and a memory;

the processor for performing the method of calculating a TF-IDF value of a word according to any of claims 1 to 7 according to instructions in the program code.

10. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of calculating a TF-IDF value of a word according to any one of claims 1 to 7.