CN111125332B

CN111125332B - Method, device, equipment and storage medium for calculating TF-IDF value of word

Info

Publication number: CN111125332B
Application number: CN201911330916.2A
Authority: CN
Inventors: 陈建华; 崔朝辉; 赵立军; 张霞
Original assignee: Neusoft Corp
Current assignee: Neusoft Corp
Priority date: 2019-12-20
Filing date: 2019-12-20
Publication date: 2023-07-21
Anticipated expiration: 2039-12-20
Also published as: CN111125332A

Abstract

The application discloses a method, a device, equipment and a storage medium for calculating TF-IDF values of words, comprising the following steps: traversing the child nodes under the root node of the tree data structure according to the target word, and determining the target child nodes; acquiring a target set corresponding to the target child node, wherein the target set comprises a text identifier and a TF value of the target word in a text corresponding to the text identifier; and determining the IDF value of the target word according to the number of text identifiers included in the target set and the total text number indicated by the root node, and further calculating the TF-IDF value of the target word. Therefore, after the target set corresponding to the target word is determined, it can be directly determined which texts in the text library contain the target word and the TF value of the target word in each text, so that the calculation resources required to determine the texts can be effectively reduced, and the efficiency of determining the TF-IDF value corresponding to the target word is improved.

Description

Method, device, equipment and storage medium for calculating TF-IDF value of word

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for calculating a TF-IDF value of a word.

Background

Intelligent question answering is an important application in the field of artificial intelligence and, compared with a traditional customer service system, offers advantages such as high efficiency and low cost. Currently, more and more enterprises use intelligent question-answering systems to provide dialogue services to users.

When an intelligent question-answering system conducts a dialogue with a user, it generally needs to determine, for a question posed by the user, the corresponding answer from a text library according to the word frequency-inverse text frequency (TF-IDF) values, in the text library, of the keywords included in the question. However, as the text library of the intelligent question-answering system grows, the amount of calculation required to compute the TF-IDF values of the keywords also grows, so the system consumes more computing resources when determining the answer according to those values.

Disclosure of Invention

The embodiment of the application provides a method, a device, equipment and a storage medium for calculating the TF-IDF value of a word, so as to reduce the calculation amount required for calculating the TF-IDF value of the word.

In a first aspect, embodiments of the present application provide a method for calculating TF-IDF values of words, the method including:

traversing the sub-nodes under the root node of the tree data structure according to a first target word, and determining a first target sub-node, wherein the word segmentation indicated by the first target sub-node is the first target word;

acquiring a first target set corresponding to the first target child node, wherein the first target set comprises a first text identifier and a first word frequency TF value of the first target word in texts corresponding to the first text identifier, and texts corresponding to different first text identifiers are different;

determining a first inverse text frequency (IDF) value of the first target word according to the number of first text identifiers included in the first target set and the total number of texts indicated by the root node;

and calculating a TF-IDF value corresponding to the first target word according to the first TF value and the first IDF value.

In one possible embodiment, the method further comprises:

traversing the virtual sub-node under the first target sub-node according to a second target word, determining a first virtual sub-node, wherein a pointer of the first virtual sub-node points to the second target sub-node under the root node, the segmentation indicated by the second target sub-node is the second target word, and the second target word and the first target word are continuous in the same text;

acquiring a second target set corresponding to the second target child node, wherein the second target set comprises a second text identifier and a second TF value of the second target word in texts corresponding to the second text identifier, and texts corresponding to different second text identifiers are different;

determining a second IDF value of the second target word according to the number of second text identifiers included in the second target set and the total text number indicated by the root node;

and calculating a TF-IDF value corresponding to the second target word according to the second TF value and the second IDF value.
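The lookup of a second, consecutive target word through a virtual sub-node can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the dictionary layout and the names `make_node`, `postings`, `virtual` and `tf_idf_for_next_word` are hypothetical, and the natural logarithm is assumed for the IDF value.

```python
import math

def make_node():
    # Hypothetical node layout: "postings" maps text id -> TF value,
    # "virtual" maps a following word -> the node for that word under the root.
    return {"postings": {}, "virtual": {}}

go = make_node()
go["postings"] = {1: 0.25, 2: 0.5}
you = make_node()
you["postings"] = {1: 0.25}
you["virtual"]["go"] = go  # pointer of the virtual sub-node to the "go" node under the root

root = {"total_texts": 4, "children": {"you": you, "go": go}}

def tf_idf_for_next_word(root, first_node, second_word):
    """Follow the virtual sub-node under first_node instead of re-scanning the root."""
    second_node = first_node["virtual"].get(second_word)
    if second_node is None:
        return {}  # the two words were never consecutive in any indexed text
    idf = math.log(root["total_texts"] / len(second_node["postings"]))
    return {text_id: tf * idf for text_id, tf in second_node["postings"].items()}

scores = tf_idf_for_next_word(root, root["children"]["you"], "go")
```

Because the virtual sub-node only stores a pointer, the second target set is the same object as the set held by the second target sub-node under the root, so nothing is duplicated.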

In one possible implementation manner, the first target word and the second target word are word segments in a word segment set, where the word segment set is obtained by word segment processing on a question text, and the method further includes:

determining a keyword set of a text corresponding to each first text identifier and a keyword set of a text corresponding to each second text identifier, wherein a TF-IDF value corresponding to each keyword in the keyword set is not lower than TF-IDF values corresponding to other segmentation words in the text, and the keyword set comprises the first target word and/or the second target word;

and determining the text corresponding to the target keyword set as the answer text corresponding to the question text, wherein the similarity between the target keyword set and the word segmentation set is higher than the similarity between other keyword sets and the word segmentation set.

In one possible implementation, the number of keywords included in the keyword set does not exceed the number of tokens included in the token set.
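One way to picture the answer-selection step is the sketch below. It is illustrative only: the text does not specify the similarity measure, so Jaccard similarity is assumed here, and the names `jaccard`, `pick_answer` and `tfidf_by_text` are hypothetical. The keyword set of each candidate text is taken as its highest-scoring word segments, capped at the number of distinct word segments in the question.

```python
def jaccard(a, b):
    # Assumed similarity measure: size of intersection over size of union.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def pick_answer(question_tokens, tfidf_by_text):
    """Return the id of the text whose keyword set is most similar to the question tokens."""
    k = len(set(question_tokens))  # keyword set may not exceed the word segmentation set
    best_id, best_sim = None, -1.0
    for text_id, scores in tfidf_by_text.items():
        # Keyword set: the k word segments with the highest TF-IDF values in this text.
        keywords = sorted(scores, key=scores.get, reverse=True)[:k]
        sim = jaccard(keywords, question_tokens)
        if sim > best_sim:
            best_id, best_sim = text_id, sim
    return best_id

# Hypothetical TF-IDF values per candidate text (text id -> word -> TF-IDF).
tfidf_by_text = {
    1: {"you": 0.1, "go": 0.3, "where": 0.4, "today": 0.05},
    2: {"go": 0.2, "home": 0.5, "now": 0.3},
}
answer_id = pick_answer(["you", "go", "where"], tfidf_by_text)
```

Here text 1 is selected because its three highest-scoring word segments coincide with the question's word segmentation set.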

In one possible embodiment, the method further comprises:

word segmentation is carried out on a text to be processed to obtain a word segmentation set of the text to be processed, wherein the word segmentation set at least comprises a first word segmentation, and the text to be processed is any text in a text library or a newly added text to be added into the text library;

when the first word is not included in the word segmentation indicated by the child node under the root node of the tree data structure, a first child node and a set corresponding to the first child node are created under the root node of the tree data structure, the word segmentation indicated by the first child node is the first word segmentation, and a text identifier of the text to be processed and a first TF value of the first word in the text to be processed are added in the set corresponding to the first child node;

when a first sub-node whose indicated word segmentation is the first word segmentation already exists among the sub-nodes under the root node, adding a text identifier of the text to be processed and a first TF value of the first word segmentation in the text to be processed to the set corresponding to the first sub-node.

In a possible implementation manner, the word segmentation set of the text to be processed further includes a second word segment, where the second word segment is continuous with the first word segment in the text to be processed, and the method further includes:

when the word segmentation indicated by the virtual child node under the first child node does not contain the second word segmentation, a target virtual child node is created under the first child node, a pointer pointing to a second child node under the root node is established for the target virtual child node, the word segmentation indicated by the second child node is the second word segmentation, and a text identifier of the text to be processed and a second TF value of the second word in the text to be processed are added in a set corresponding to the second child node;

when a target virtual child node whose indicated word segmentation is the second word segmentation already exists among the virtual child nodes under the first child node, adding a text identifier of the text to be processed and a second TF value of the second word segmentation in the text to be processed to the set corresponding to the second child node pointed to by the pointer of the target virtual child node.

In a possible implementation manner, the word segmentation set further includes a third word segmentation, and the method further includes:

and when the third word segmentation is determined to be the same as the first word segmentation, not adding the text identifier of the text to be processed and the TF value of the third word segmentation in the text to be processed in the set corresponding to the first child node.
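The construction steps above can be sketched as follows. This is a minimal sketch under assumptions, not the patented implementation: the nested-dictionary layout and the names `new_root`, `add_text`, `postings` and `virtual` are hypothetical, the root's text count is assumed to be incremented once per added text, and a dictionary lookup stands in for traversing the child nodes.

```python
from collections import Counter

def new_root():
    # Hypothetical layout: the root records the total number of texts and its child nodes.
    return {"total_texts": 0, "children": {}}

def _child(root, word):
    # Create the child node and its (empty) set on first sight of the word.
    return root["children"].setdefault(word, {"postings": {}, "virtual": {}})

def add_text(root, text_id, tokens):
    """Index one segmented text: child nodes, their sets, and virtual child nodes."""
    root["total_texts"] += 1  # assumption: the root count grows by one per text
    counts = Counter(tokens)
    for i, word in enumerate(tokens):
        node = _child(root, word)
        # A repeated word (the "third word segmentation" case) is not added again.
        if text_id not in node["postings"]:
            node["postings"][text_id] = counts[word] / len(tokens)
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            # Virtual child node: a pointer to the node for the next word under the root.
            node["virtual"].setdefault(nxt, _child(root, nxt))

root = new_root()
add_text(root, 1, ["you", "go", "where", "go"])
add_text(root, 2, ["go", "home"])
```

After these two calls the set for "go" holds one entry per text, even though "go" occurs twice in text 1, and the virtual child "go" under "you" refers to the same node that sits under the root.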

In a second aspect, embodiments of the present application further provide an apparatus for calculating a TF-IDF value of a word, where the apparatus includes:

the first determining module is used for traversing the sub-nodes under the root node of the tree data structure according to the first target word, determining the first target sub-node, wherein the word segmentation indicated by the first target sub-node is the first target word;

the first acquisition module is used for acquiring a first target set corresponding to the first target child node, wherein the first target set comprises a first text identifier and a first word frequency TF value of the first target word in a text corresponding to the first text identifier, and texts corresponding to different first text identifiers are different;

a second determining module, configured to determine a first inverse text frequency IDF value of the first target word according to a number of first text identifiers included in the first target set and a total number of texts indicated by the root node;

and the first calculation module is used for calculating the TF-IDF value corresponding to the first target word according to the first TF value and the first IDF value.

In one possible embodiment, the apparatus further comprises:

the third determining module is used for traversing the virtual sub-node under the first target sub-node according to a second target word, determining a first virtual sub-node, wherein a pointer of the first virtual sub-node points to the second target sub-node under the root node, the word segmentation indicated by the second target sub-node is the second target word, and the second target word and the first target word are continuous in the same text;

the second obtaining module is used for obtaining a second target set corresponding to the second target child node, the second target set comprises a second text identifier and a second TF value of the second target word in a text corresponding to the second text identifier, and texts corresponding to different second text identifiers are different;

a fourth determining module, configured to determine a second IDF value of the second target word according to a number of second text identifiers included in the second target set and a total number of texts indicated by the root node;

and the second calculation module is used for calculating the TF-IDF value corresponding to the second target word according to the second TF value and the second IDF value.

In some possible implementations, the first target word and the second target word are word segments in a word segment set, where the word segment set is obtained by word segment processing on a question text, and the apparatus further includes:

a fifth determining module, configured to determine a keyword set of a text corresponding to each first text identifier and a keyword set of a text corresponding to each second text identifier, where a TF-IDF value corresponding to each keyword in the keyword set is not lower than TF-IDF values corresponding to other segmentation words in the text, and the keyword set includes the first target word and/or the second target word;

and a sixth determining module, configured to determine a text corresponding to a target keyword set as an answer text corresponding to the question text, where a similarity between the target keyword set and the word segmentation set is higher than a similarity between other keyword sets and the word segmentation set.

In some possible embodiments, the number of keywords included in the keyword set does not exceed the number of tokens included in the token set.

In some possible embodiments, the apparatus further comprises:

the word segmentation module is used for segmenting a text to be processed to obtain a word segmentation set of the text to be processed, wherein the word segmentation set at least comprises a first word segmentation, and the text to be processed is any text in a text library or a new text to be added into the text library;

a first creating module, configured to create a first child node and a set corresponding to the first child node under a root node of the tree data structure when the first word is not included in the word segments indicated by the child nodes under the root node of the tree data structure, where the word segments indicated by the first child node are the first word segments, and add a text identifier of the text to be processed and a first TF value of the first word segment in the text to be processed in the set corresponding to the first child node;

and the first adding module is used for adding the text identifier of the text to be processed and the first TF value of the first word in the text to be processed in a set corresponding to the first sub-node when the word indicated by the first sub-node exists in the sub-node under the root node as the first word.

In some possible embodiments, the word segmentation set of the text to be processed further includes a second word segment, the second word segment being continuous with the first word segment in the text to be processed, and the apparatus further includes:

the second creating module is used for creating a target virtual child node under the first child node and establishing a pointer pointing to a second child node under the root node for the target virtual child node when the second word is not included in the word indicated by the virtual child node under the first child node, wherein the word indicated by the second child node is the second word, and a text identifier of the text to be processed and a second TF value of the second word in the text to be processed are added in a set corresponding to the second child node;

and the second adding module is used for adding the text identifier of the text to be processed and a second TF value of the second word in the text to be processed in a set corresponding to the second sub-node pointed by the pointer of the target virtual sub-node when the second word indicated by the target virtual sub-node exists in the virtual sub-node under the first sub-node.

In some possible embodiments, the word segmentation set further includes a third word segment, and the apparatus further includes:

and the non-adding module is used for not adding the text identifier of the text to be processed and the TF value of the third word in the text to be processed in the set corresponding to the first child node when the third word is determined to be identical to the first word.

In a third aspect, embodiments of the present application further provide an apparatus, including a processor and a memory;

the memory is used for storing program codes and transmitting the program codes to the processor;

the processor is configured to execute the method for calculating the TF-IDF value of the word according to any one of the embodiments of the first aspect according to an instruction in the program code.

In a fourth aspect, embodiments of the present application further provide a computer-readable storage medium, including instructions, which when executed on a computer, cause the computer to perform the method for calculating a TF-IDF value of a word according to any one of the embodiments of the first aspect.

In the above implementations of the embodiments of the present application, when the TF-IDF value of a target word needs to be calculated, a target child node may be determined by traversing, according to the target word, the child nodes under the root node of a pre-constructed tree data structure, where the word segment indicated by the target child node is the target word. After the target child node is determined, a target set corresponding to the target child node is obtained; the target set includes text identifiers and the TF value of the target word in the text corresponding to each text identifier, and it can be understood that the text corresponding to each text identifier in the target set contains the target word. The IDF value of the target word can then be determined according to the number of text identifiers included in the target set and the total number of texts indicated by the root node, and once the TF value and the IDF value of the target word are obtained, the TF-IDF value of the target word can be calculated.

Therefore, in the process of calculating the TF-IDF value of the target word, once the target set corresponding to the target word is determined, the texts in the text library that contain the target word (namely, the texts corresponding to the text identifiers in the target set) can be read directly from the target set, without traversing the text library at a large cost in computing resources. This reduces the computing resources required to determine those texts, that is, the computing resources required to calculate the IDF value of the target word, and improves the efficiency of determining the IDF value. Meanwhile, the TF value of the target word in each text containing the target word can also be read from the target set, so the TF value does not need to be recalculated for each text; this reduces the amount of calculation spent on repeated TF computation and, because the TF value is obtained by querying the set, improves the efficiency of obtaining it.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings may be obtained according to these drawings for a person having ordinary skill in the art.

FIG. 1 is a schematic diagram of an exemplary application scenario in an embodiment of the present application;

FIG. 2 is a flowchart of a method for calculating the TF-IDF value of a word according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a tree data structure;

FIG. 4 is a schematic diagram of another tree data structure;

FIG. 5 is a schematic diagram of an apparatus for calculating TF-IDF values of words according to an embodiment of the present application;

FIG. 6 is a schematic diagram of the hardware structure of an apparatus according to an embodiment of the present application.

Detailed Description

When a user poses a question to the intelligent question-answering system, the system determines a keyword from the question and retrieves from a text library an answer for replying to the user based on that keyword. Specifically, the system may locate in the text library the texts that contain the keyword, any of which may be the text for replying to the user's question. To make the reply as accurate as possible (i.e., to avoid irrelevant answers as far as possible), the system may further calculate the TF-IDF value of the keyword in each located text, so that the text in which the keyword has a higher TF-IDF value can be used as the answer text for replying to the user's question.

The TF value of the keyword can be calculated by formula (1):

$$\mathrm{TF}_i = \frac{n_i}{\sum_k n_k} \qquad (1)$$

where $n_i$ is the number of occurrences of the keyword in a text, and $\sum_k n_k$ is the total number of all word segments in the text.

The IDF value of the keyword can be calculated by the following formula (2):

$$\mathrm{IDF}_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|} \qquad (2)$$

where $|D|$ is the total number of texts in the text library, and $|\{j : t_i \in d_j\}|$ is the total number of texts containing the keyword.

The TF-IDF value corresponding to the keyword is then obtained as the product of its TF value and its IDF value.
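As a point of reference, the direct computation described by formulas (1) and (2) can be sketched as below. This is an illustrative sketch only: the function names `tf` and `idf`, the toy `library`, and the choice of the natural logarithm are assumptions, since the text does not fix the base of the logarithm.

```python
import math
from collections import Counter

def tf(word, tokens):
    """Formula (1): occurrences of the word divided by the total number of word segments."""
    return Counter(tokens)[word] / len(tokens)

def idf(word, library):
    """Formula (2): log of the total number of texts over the number of texts containing the word."""
    containing = sum(1 for tokens in library if word in tokens)  # scans every text
    return math.log(len(library) / containing)

# Toy text library: each text is already segmented into word segments.
library = [
    ["a", "b", "a", "c"],
    ["b", "c"],
    ["c", "d"],
    ["d", "d"],
]

tf_a = tf("a", library[0])    # 2 occurrences out of 4 word segments
idf_a = idf("a", library)     # 4 texts in total, 1 of them contains "a"
tfidf_a = tf_a * idf_a
```

Note that `idf` walks the whole library on every call and raises `ZeroDivisionError` when no text contains the word; a real system would need to handle that case.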

However, each time the IDF value of a keyword is calculated, the words included in each text in the text library must be traversed in order to determine which of the texts contain the keyword. As the text library of the intelligent question-answering system grows, determining $|\{j : t_i \in d_j\}|$ by traversing the texts in this way consumes more and more of the system's computing resources. In addition, for the same keyword, the TF value of the keyword in each text has to be recalculated every time, so calculating the TF values also consumes considerable resources.

Based on this, the embodiments of the present application provide a method for calculating the TF-IDF value of a word, aiming to reduce the amount of calculation required. When the TF-IDF value of a target word needs to be calculated, the child nodes under the root node of a pre-constructed tree data structure can be traversed according to the target word to determine a target child node, where the word segment indicated by the target child node is the target word. After the target child node is determined, a target set corresponding to the target child node is obtained; the target set includes text identifiers and the TF value of the target word in the text corresponding to each text identifier, and it can be understood that the text corresponding to each text identifier in the target set contains the target word. The IDF value of the target word can then be determined according to the number of text identifiers included in the target set and the total number of texts indicated by the root node, and once the TF value and the IDF value of the target word are obtained, the TF-IDF value of the target word can be calculated.

As an example, the embodiments of the present application may be applied to the exemplary application scenario shown in FIG. 1. In this scenario, user 101 may interact by voice with an intelligent question-answering system carried on terminal 102. The user 101 may input a target word to the terminal 102 by voice; the intelligent question-answering system on the terminal 102 can traverse the sub-nodes under the root node of the tree data structure according to the target word and determine the target sub-node, where the word indicated by the target sub-node is the target word; the system can then acquire the target set corresponding to the target sub-node, where the target set includes text identifiers and the TF value of the target word in the text corresponding to each text identifier; next, the system can determine the IDF value of the target word according to the number of text identifiers included in the target set and the total number of texts indicated by the root node and, having obtained the TF values and the IDF value, calculate the TF-IDF value of the target word with respect to the text corresponding to each text identifier in the target set. The intelligent question-answering system on the terminal 102 can then take the text for which the target word has the maximum TF-IDF value as the text for replying to the user, convert that text into voice, and play it to the user 101.

It is understood that the above scenario is only one example of a scenario provided in the embodiments of the present application, and the embodiments of the present application are not limited to this scenario. For example, in other possible application scenarios, the intelligent question-answering system may instead be carried on a server: the terminal 102 converts the question input by the user's voice into text and sends the text to the server, and the server performs the above process of calculating the TF-IDF value of the target word and returns the finally determined text to the terminal 102, so that the terminal 102 presents the text to the user 101. Of course, the process of calculating the TF-IDF value corresponding to the target word may be applied not only to question-answer interaction between a user and the intelligent question-answering system, but also to other application scenarios such as keyword search. In summary, the embodiments of the present application may be applied to any applicable scenario and are not limited to the scenario examples described above.

In order to make the above objects, features and advantages of the present application more comprehensible, various non-limiting implementations in accordance with embodiments of the present application are described below with reference to the accompanying drawings. It will be apparent that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

Referring to FIG. 2, FIG. 2 is a flowchart of a method for calculating the TF-IDF value of a word in an embodiment of the present application. The method may be performed by the intelligent question-answering system on the terminal 102 shown in FIG. 1, or may be performed by a server or other device. The method specifically comprises the following steps:

S201: Traverse the child nodes under the root node of the tree data structure according to the first target word, and determine the first target child node, wherein the word segment indicated by the first target child node is the first target word.

S202: Obtain a first target set corresponding to the first target child node, wherein the first target set comprises a first text identifier and a first TF value of the first target word in the text corresponding to the first text identifier, and texts corresponding to different first text identifiers are different from each other.

In this embodiment, a tree data structure may be constructed in advance based on a text library (or referred to as a corpus) containing a plurality of texts. In the tree data structure, a root node may indicate the total number of texts contained in the text library, and child nodes under the root node may indicate the segmentation included in the texts. Wherein each child node may indicate a word and different child nodes under the root node may indicate different words. The specific implementation of the construction of the tree data structure will be described in detail later.

It is noted that each child node may have a set corresponding to the child node, where one or more text identifiers may be included in the set, and each text identifier corresponds to a text that includes the word segment indicated by the child node. A text identifier uniquely identifies a text, and the texts corresponding to different text identifiers are different from each other; for example, different texts may be identified by incrementing numerical values. Thus, based on the text identifiers recorded in the set, it can be determined which texts in the text library contain the word segment, that is, in which texts of the text library the word segment appears. Meanwhile, the TF value of the word segment in the text corresponding to each text identifier may also be recorded in the set; for example, in the process of constructing the tree data structure, the TF value of the word segment in the text corresponding to each text identifier may be calculated in advance and stored in correspondence with that text identifier. For example, for child node N, the corresponding set may include [id₁, TF₁], [id₂, TF₂], [id₃, TF₃], etc., indicating that the word segment appears in the texts identified by id₁, id₂ and id₃, with a TF value of TF₁ in the text identified by id₁, TF₂ in the text identified by id₂, and TF₃ in the text identified by id₃.

Therefore, for the word segmentation indicated by each child node, the text identifier corresponding to the text containing the word segmentation and the TF value of the word segmentation in the text can be determined by searching the set corresponding to the child node, so that traversing is not required for each text from a text library, repeated calculation of the TF value of the same word segmentation in each text is not required, consumption of calculation resources is reduced, and meanwhile efficiency of acquiring the TF value of the word segmentation in each text is improved.
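The structure described above can be pictured with the following minimal sketch. The class and field names (`Root`, `ChildNode`, `total_texts`, `postings`) and the sample values are hypothetical, and a dictionary keyed by word is assumed in place of an explicit traversal of the child nodes.

```python
from dataclasses import dataclass, field

@dataclass
class ChildNode:
    word: str                                      # the word segment this node indicates
    postings: dict = field(default_factory=dict)   # the node's set: text id -> TF value

@dataclass
class Root:
    total_texts: int = 0                           # total number of texts in the text library
    children: dict = field(default_factory=dict)   # word -> ChildNode

# A hypothetical library of 3 texts in which "go" appears in texts 1 and 3.
root = Root(total_texts=3)
root.children["go"] = ChildNode("go", {1: 0.25, 3: 0.5})

node = root.children["go"]
texts_with_word = sorted(node.postings)  # which texts contain the word segment
tf_in_text_3 = node.postings[3]          # precomputed TF value, no recomputation needed
```

Both questions (which texts contain the word segment, and what its TF value is in each) are answered by reading the set, without touching the texts themselves.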

When the TF-IDF value corresponding to a certain word (hereinafter referred to as a first target word for convenience of description) needs to be calculated, the child nodes under the root node of the tree data structure may be traversed to find the child node whose indicated word segment is the first target word, thereby determining the first target child node. As shown in FIG. 3, assuming that the first target word is "go", the plurality of child nodes under the root node (FIG. 3 shows 5 child nodes by way of example, indicating the words "your", "good", "go", "which" and "bar" respectively) may be traversed, so that it may be determined that the word indicated by the 3rd child node (from left to right) is "go", and the 3rd child node is the first target child node.

After the first target child node is determined, the set corresponding to the first target child node (for convenience of description, hereinafter referred to as a first target set) may be further determined, so that the text identifiers of the texts in the text library that contain the first target word, and the TF value of the first target word in the text corresponding to each text identifier, may be determined from the information included in the first target set. When the first target child node is created, a set can be created for it at the same time and associated with it, so that a correspondence between the first target child node and the set is established.

S203: Determine a first IDF value of the first target word according to the number of first text identifiers included in the first target set and the total number of texts indicated by the root node.

It may be appreciated that, since each first text identifier in the first target set identifies a text that includes the first target word, the number of text identifiers included in the first target set is the number of texts in the text library that include the first target word. Therefore, when calculating the IDF value of the first target word (hereinafter referred to as the first IDF value for convenience of description), the number of first text identifiers in the first target set may be counted, and the IDF value of the first target word may be calculated using the foregoing formula (2) according to the number of first text identifiers and the total number of texts in the text library indicated by the root node.

Therefore, in the process of calculating the IDF value of the first target word, since the first text identifier is recorded in the first target set, the text quantity containing the first target word in the text library is not required to be determined in a way of traversing the text library by consuming a large amount of calculation resources, so that the consumption of calculation resources can be effectively reduced, and the efficiency of calculating the IDF value of the first target word is improved.
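As an illustration of step S203, the following minimal Python sketch derives the IDF value from the size of the set attached to a child node and the total recorded at the root node. Formula (2) is not reproduced in this passage, so the sketch assumes the common form idf = log(N / n); the dictionary layout and the names `root`, `total_texts`, `children` and `idf` are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical in-memory layout: the root node records the total number of
# texts, and each child node's set maps text identifier -> TF value.
root = {
    "total_texts": 4,
    "children": {
        "go": {"doc1": 0.25, "doc3": 0.5},
        "which": {"doc2": 0.2},
    },
}


def idf(root, word):
    """IDF from the size of the word's set; assumes idf = log(N / n)."""
    postings = root["children"].get(word)
    if not postings:
        raise KeyError(f"no child node indicates {word!r}")
    # len(postings) is the number of texts containing the word, so the
    # text library itself is never scanned.
    return math.log(root["total_texts"] / len(postings))


first_idf = idf(root, "go")
```

If formula (2) uses a smoothed variant, only the return expression would need to change.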

S204: a TF-IDF value corresponding to the first target word is calculated according to the first TF value and the first IDF value.

Specifically, when the TF-IDF value corresponding to the first target word is calculated according to the information recorded in the first target set, the product of the first TF value and the first IDF value may be calculated to obtain the TF-IDF value. It should be noted that one TF-IDF value of the first target word is obtained for each text corresponding to a first text identifier in the first target set. Specifically, when the first target set includes only one first text identifier, one TF-IDF value is calculated based on the TF value corresponding to that identifier; when the first target set includes a plurality of first text identifiers, a TF-IDF value is calculated from the TF value corresponding to each first text identifier, thereby obtaining a plurality of TF-IDF values.
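Step S204 can be sketched as follows: one TF-IDF value is produced per text identifier in the first target set. This is a minimal illustration, not the claimed implementation; the function name `tf_idf_per_text`, the sample data, and the IDF form log(N / n) are assumptions.

```python
import math


def tf_idf_per_text(postings, total_texts):
    """Return text identifier -> TF-IDF for one word.

    `postings` maps text identifier -> TF value (the word's set).
    The IDF form log(N / n) is an assumption standing in for formula (2).
    """
    idf = math.log(total_texts / len(postings))
    return {text_id: tf * idf for text_id, tf in postings.items()}


first_target_set = {"doc1": 0.25, "doc3": 0.5}
scores = tf_idf_per_text(first_target_set, total_texts=4)
```

With two text identifiers in the set, `scores` holds two TF-IDF values that share one IDF value and differ only through their TF values.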

In many practical applications, TF-IDF values corresponding to a plurality of word segments may need to be calculated in succession. For example, in a scenario where a user performs question-answer interaction with a terminal, the text of a question posed by the user may include a plurality of words, and the terminal needs to calculate a TF-IDF value corresponding to each word. Therefore, in this embodiment, after the TF-IDF value corresponding to the first target word is calculated, the TF-IDF value of a second target word that is continuous with the first target word may then be calculated. The first target word and the second target word being continuous means that they appear consecutively in the same sentence. For example, for the sentence "your go to which play", when the first target word is "your", the second target word is the word "go to" continuous with "your", and when the first target word is "which", the second target word is the word "play" continuous with "which".

When the TF-IDF value corresponding to the second target word is calculated, similarly to the first target word, the child nodes under the root node can be traversed to determine a second target child node indicating the second target word. The TF value of the second target word in each text and the IDF value of the second target word are then determined according to the information recorded in the second target set corresponding to the second target child node, so that the TF-IDF value corresponding to the second target word is obtained.

When the second target child node is determined for the second target word, the child nodes under the root node need to be traversed again, and this traversal also consumes certain computing resources (although far fewer than traversing the word segments of every text in the text library). Therefore, in other possible embodiments, the first target child node may further have virtual child nodes, where the pointer of each virtual child node points to the corresponding child node under the root node, and the word segment indicated by a virtual child node is the word segment indicated by the child node it points to. Thus, after the first target child node is found for the first target word, a virtual child node indicating the second target word (hereinafter referred to as a first virtual child node) can be found by traversing the virtual child nodes under the first target child node, and the second target child node (a child node under the root node) pointed to by the pointer of the first virtual child node can then be determined.
It can be understood that, in practical application, the number of word segments continuous with the first target word (that is, the number of virtual child nodes under the first target child node) is generally far smaller than the number of child nodes under the root node. Traversing only the virtual child nodes under the first target child node, and determining the second target child node from the first virtual child node so found, can therefore effectively reduce the computing resources required to determine the second target child node, and further reduce the computing resources consumed in calculating the TF-IDF value of the second target word.

As shown in fig. 4, after the first target child node corresponding to the first target word "your" is determined, the virtual child nodes under the first target child node (the two virtual child nodes "good" and "go", drawn as dashed circles) may be traversed to determine the virtual child node indicating the second target word "go". The pointer of that virtual child node points to the child node "go" under the root node (drawn as a solid circle), as shown by the dashed arrow in fig. 4, so that the second target child node corresponding to the second target word "go" can be determined without traversing the child nodes under the root node.

Of course, if the virtual child node under the first target child node is traversed, and it is determined that there is no virtual child node indicating the second target word, the traversing may be performed from the child nodes under the root node to determine the second target child node corresponding to the second target word. If there is no child node indicating the second target word in the child nodes under the root node, a new child node may be created under the root node, and the word indicated by the new child node is the second target word.
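The lookup order described above (virtual child nodes first, then the child nodes under the root node, then creation of a new child node) can be sketched in Python as follows. The class names `ChildNode`, `VirtualNode` and `Root` and the helper `next_target` are hypothetical; linear scans over lists are used to mirror the traversal in the description, and attaching virtual child nodes to root-level child nodes is a simplifying assumption.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class ChildNode:
    word: str
    postings: dict = field(default_factory=dict)  # text identifier -> TF value
    virtual: list = field(default_factory=list)   # VirtualNode entries


@dataclass(eq=False)
class VirtualNode:
    word: str
    target: ChildNode  # the "pointer" to a child node under the root


@dataclass(eq=False)
class Root:
    total_texts: int = 0
    children: list = field(default_factory=list)


def find(nodes, word):
    """Linear traversal; returns None when no node indicates the word."""
    return next((n for n in nodes if n.word == word), None)


def next_target(root, current, word):
    """Find the child node for a word that follows `current`."""
    virtual = find(current.virtual, word)
    if virtual is not None:
        return virtual.target             # resolved through the pointer
    node = find(root.children, word)      # fall back to the root's children
    if node is None:                      # still absent: create a new child node
        node = ChildNode(word)
        root.children.append(node)
    return node


root = Root(total_texts=2)
your, good, go = ChildNode("your"), ChildNode("good"), ChildNode("go")
root.children += [your, good, go]
your.virtual += [VirtualNode("good", good), VirtualNode("go", go)]

second = next_target(root, your, "go")       # found via a virtual child node
missing = next_target(root, your, "which")   # falls back, then creates a node
```

In this arrangement, mirroring fig. 4, finding "go" after "your" inspects at most the two virtual child nodes instead of every child node under the root node.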

After the second target child node is determined, a second target set corresponding to the second target child node may be further obtained, so that the text identifiers recorded in the second target set (hereinafter referred to as second text identifiers for convenience of distinction) and a second TF value of the second target word in the text corresponding to each second text identifier may be obtained. The texts corresponding to different second text identifiers are different from each other, and a plurality of second text identifiers correspond to a plurality of second TF values. Then, according to the number of second text identifiers included in the second target set and the total number of texts indicated by the root node, a second IDF value of the second target word can be determined, and according to the second TF value and the second IDF value, the TF-IDF value corresponding to the second target word can be calculated. When the second target set contains a plurality of second text identifiers, a TF-IDF value can be calculated from the TF value corresponding to each second text identifier, so that a plurality of TF-IDF values are obtained.

It should be noted that this embodiment describes the technical solution of the embodiments of the present application by taking the calculation for the first target word and the second target word as an example. In further possible embodiments, the calculation may continue with a third target word, a fourth target word, and so on. When calculating the TF-IDF value corresponding to the third target word, the traversal may start from the virtual child nodes under the first virtual child node indicating the second target word. For example, when calculating the TF-IDF value corresponding to the third target word in fig. 4, the virtual child nodes under the first virtual child node indicating the second target word (the virtual child nodes indicating "which" and "bar" respectively) may be traversed to determine the virtual child node indicating "which", which corresponds to the third target word; similarly, the pointer of the virtual child node indicating "which" also points to the corresponding child node under the root node. After the child node ("which") under the root node pointed to by that pointer is determined, the subsequent process of calculating the TF-IDF value corresponding to the third target word from the set corresponding to that child node may refer to the foregoing process for the second target word, and is not described in detail here.

In a practical application scenario, the first target word and the second target word may be words in a question text. When an answer text for replying to the question needs to be determined from the text library, the question text may be obtained first, and word segmentation processing may be performed on the question text to obtain a word segmentation set corresponding to the question text, where the word segmentation set includes at least the first target word and the second target word.

Then, the first text identifier corresponding to the text containing the first target word and the second text identifier corresponding to the text containing the second target word can be determined by executing the above process, so that the text containing the first target word and/or the second target word can be determined from a plurality of texts contained in the text library.
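As a rough sketch of this step, the candidate texts can be collected as the union of the text identifiers recorded in the sets of the question's word segments. The function name `candidate_texts` and the sample data are hypothetical, and each set is assumed to map text identifier to TF value.

```python
def candidate_texts(children, question_words):
    """Union of the text identifiers recorded for each question word."""
    text_ids = set()
    for word in question_words:
        # Words with no child node contribute no candidates.
        text_ids.update(children.get(word, {}))
    return text_ids


children = {
    "go": {"doc1": 0.25, "doc3": 0.5},
    "which": {"doc2": 0.2, "doc3": 0.1},
    "good": {"doc4": 0.5},
}
candidates = candidate_texts(children, ["go", "which", "play"])
```

A text is a candidate as soon as it contains any one of the question's word segments, which corresponds to the "and/or" condition above.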

Then, for each determined text, the TF-IDF value corresponding to each word segment in the text may be calculated, where the calculation may refer to the calculation of the TF-IDF values of the first target word and the second target word. According to these TF-IDF values, a keyword set corresponding to the text is determined, such that the TF-IDF value of each keyword in the keyword set is not lower than the TF-IDF values of the other word segments in the text, and the keywords within each keyword set are different from each other. In a specific implementation, for each text, the word segments with larger TF-IDF values may be selected from the text as keywords to form the keyword set corresponding to the text.

In this way, each determined text (a text including the first target word and/or the second target word) has a corresponding keyword set. The similarity between the keyword set of each text and the word segmentation set of the question text can then be calculated, for example as the cosine similarity between the two sets, so that the keyword set with the highest similarity to the word segmentation set (hereinafter referred to as the target keyword set for convenience of description) can be determined; that is, the similarity between the target keyword set and the word segmentation set is higher than the similarity between any other keyword set and the word segmentation set. The text corresponding to the target keyword set is determined as the answer text corresponding to the question text.

Of course, in a further possible embodiment, when the keyword set corresponding to each determined text is determined, the number of keywords included in the keyword set may also be limited. As an example, the number of keywords included in each keyword set may be limited to not exceed the number of word segments included in the word segmentation set; for example, if the word segmentation set includes 8 word segments, each keyword set may include 8 or 7 keywords, or the like. Of course, in practical application, the number of keywords included in a keyword set may also exceed the number of word segments included in the word segmentation set; for example, if the word segmentation set includes 8 word segments, each keyword set may include 9 keywords.
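The keyword selection and answer matching described above can be sketched as follows. This is a minimal illustration under stated assumptions: the keyword set is taken as the `k` word segments with the largest TF-IDF values, and the cosine similarity is computed over binary membership vectors, which for two sets A and B reduces to |A ∩ B| / sqrt(|A| · |B|). The function names and the sample TF-IDF values are hypothetical.

```python
import math


def top_keywords(tfidf_by_word, k):
    """Keyword set: the k word segments with the largest TF-IDF values."""
    ranked = sorted(tfidf_by_word.items(), key=lambda item: item[1], reverse=True)
    return {word for word, _ in ranked[:k]}


def cosine(a, b):
    """Cosine similarity of two sets viewed as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def best_answer(question_words, tfidf_by_text, k):
    """Text identifier whose keyword set is most similar to the question."""
    keyword_sets = {t: top_keywords(s, k) for t, s in tfidf_by_text.items()}
    return max(keyword_sets, key=lambda t: cosine(keyword_sets[t], question_words))


question = {"go", "which", "play"}
tfidf_by_text = {
    "doc1": {"go": 0.3, "play": 0.2, "good": 0.05},
    "doc2": {"which": 0.4, "bar": 0.3, "good": 0.1},
}
answer = best_answer(question, tfidf_by_text, k=2)
```

Here `k` plays the role of the limit on the number of keywords; the description allows it to be smaller than, equal to, or larger than the number of word segments in the question.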

It is noted that, when the word segmentation set further includes word segments other than the first target word and the second target word, the texts determined from the text library may also be texts including those other word segments. Correspondingly, the determined target keyword set may also be a keyword set corresponding to a text including those other word segments.

It should be noted that, in this embodiment, the set corresponding to each child node under the root node may specifically be a list created for the child node; that is, the text identifier of each text in which the word indicated by the child node appears, and the TF value of the word in that text, are recorded in the list corresponding to the child node. Of course, in other possible embodiments, the set corresponding to the child node may also exist in the form of key-value pairs. That is, in a key-value pair ("key=value") corresponding to the child node, the "key" may be the text identifier of a text in which the word appears, and the "value" may be the TF value of the word in that text. In practical applications, the set corresponding to the child node may also exist in other forms, which is not limited in this embodiment.
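The two forms of the set mentioned above can be compared with a short sketch. The variable and function names are hypothetical; both forms record the same information, and they differ only in how a TF value is looked up.

```python
# List form: each entry pairs a text identifier with the word's TF value.
postings_as_list = [("doc1", 0.25), ("doc3", 0.5)]

# Key-value form: key = text identifier, value = TF value.
postings_as_map = {"doc1": 0.25, "doc3": 0.5}


def tf_from_list(postings, text_id):
    """Scan the list for the text identifier; None if it is not recorded."""
    for recorded_id, tf in postings:
        if recorded_id == text_id:
            return tf
    return None


def tf_from_map(postings, text_id):
    """Direct lookup by key; None if the text identifier is not recorded."""
    return postings.get(text_id)
```

In both forms, the number of entries is the number of texts containing the word, which is the quantity needed for the IDF value.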

Next, a process of constructing a tree data structure based on a plurality of texts in a text library will be described in detail:

In the process of constructing the tree data structure, any text in the text library (hereinafter referred to as a text to be processed for convenience of description) may be segmented to obtain a word segmentation set of the text to be processed, where the word segmentation set includes at least a first word segment. It will be appreciated that, in practical applications, the word segmentation set may include a plurality of word segments; the process of building the tree data structure is described here by taking one of them (hereinafter referred to as a first word segment) as an example. First, the child nodes under the root node of the tree data structure may be traversed to check whether any of them indicates the first word segment. When none does (for example, when the data structure is initially constructed and there are no child nodes under the root node, or when there are few child nodes under the root node and none of them indicates the first word segment), a first child node and a set corresponding to the first child node may be created under the root node, and the text identifier of the text to be processed and a first TF value of the first word segment in the text to be processed may be added to the set corresponding to the first child node, where the word segment indicated by the first child node is the first word segment, and the first TF value may be calculated by the foregoing formula (1).
When a first child node indicating the first word segment already exists under the root node, the text identifier of the text to be processed and the first TF value of the first word segment in the text to be processed are added to the set corresponding to the first child node. In this case, the first child node was created in advance, and after the text identifier and the corresponding first TF value are added, the set includes at least two different text identifiers and two TF values. Thus, the construction process of the tree data structure based on the first word segment in the text to be processed is completed.

Further, when the tree data structure is constructed based on a second word segment in the word segmentation set of the text to be processed, the construction may follow a process similar to that of the first word segment.

In another possible implementation manner, after the process of building the tree data structure according to the first word segment is completed, for a second word segment that is continuous with the first word segment in the text to be processed, the virtual child nodes under the first child node may first be traversed to check whether any of them indicates the second word segment. When none does, a target virtual child node may be created under the first child node, and a pointer pointing to a second child node under the root node may be created for the target virtual child node. Both the target virtual child node and the second child node indicate the second word segment, and the second child node may be determined by traversing the child nodes under the root node (similar to the process of determining the child node for the first word segment). Of course, if no child node under the root node indicates the second word segment either, the second child node and the set corresponding to the second child node may be created for the second word segment. Meanwhile, the text identifier of the text to be processed and a second TF value of the second word segment in the text to be processed may be added to the set corresponding to the second child node. Thus, when the TF value corresponding to the second word segment needs to be queried, the virtual child nodes under the first child node may be traversed to determine the target virtual child node indicating the second word segment, the second child node is determined based on the pointer of the target virtual child node, and the TF value recorded in the set corresponding to the second child node is then obtained (of course, the text identifier of the text to be processed may also be obtained from the set).
Because the number of virtual child nodes is generally far smaller than the number of child nodes under the root node, determining the child node indicating the second word segment by traversing the virtual child nodes can effectively reduce the required computing resources and improve processing efficiency.

When one of the virtual child nodes under the first child node already indicates the second word segment, the second child node may be determined according to the pointer of that virtual child node, and the text identifier of the text to be processed and the second TF value of the second word segment in the text to be processed are then added to the set corresponding to the second child node. Thus, the construction process of the tree data structure based on the second word segment in the text to be processed is completed.

It should be noted that, in practical application, the word segmentation set of the text to be processed may further include a third word segment that is identical to the first word segment. Since the first word segment and the third word segment are word segments in the same text to be processed, their TF values in the text to be processed are identical. Therefore, after the text identifier of the text to be processed and the first TF value of the first word segment in the text to be processed have been added to the set corresponding to the first child node based on the first word segment, the same text identifier and TF value do not need to be added to that set again (of course, even if the same text identifier and TF value are added repeatedly, the duplicate entries can subsequently be merged).

Based on this, in an exemplary embodiment, when it is determined that the third word segment in the word segmentation set is the same as the first word segment, the text identifier of the text to be processed and the TF value of the third word segment in the text to be processed may not be added to the set corresponding to the first child node.

For each word in each text in the text library, the construction of the tree data structure can be performed according to the above process, so that the tree data structure constructed based on the text library can be obtained.

In addition, when new texts need to be added to the text library, the tree data structure may be updated for the new texts by a process similar to the construction process described above. Of course, the total number of texts indicated by the root node in the tree data structure increases accordingly, the increase being the number of newly added texts.
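The construction process, including virtual child nodes, repeated word segments, and the later addition of new texts, can be sketched as follows. This is a simplified illustration rather than the claimed implementation: formula (1) is not reproduced in this passage, so the TF value is assumed to be the word's count divided by the number of word segments in the text; dictionaries stand in for the traversals, and the names `build` and `add_text` are hypothetical.

```python
from collections import Counter


def add_text(root, text_id, words):
    """Insert one segmented text (word segments in order) into the tree."""
    counts = Counter(words)
    previous = None
    for word in words:
        node = None
        if previous is not None:
            # Look among the virtual child nodes of the preceding word first.
            node = previous["virtual"].get(word)
        if node is None:
            # Fall back to the root's child nodes, creating one if needed.
            node = root["children"].setdefault(
                word, {"postings": {}, "virtual": {}}
            )
            if previous is not None:
                previous["virtual"][word] = node  # the virtual node's pointer
        # Keyed by text identifier, so a repeated word segment in the same
        # text does not add a duplicate entry. TF form is an assumption.
        node["postings"][text_id] = counts[word] / len(words)
        previous = node
    root["total_texts"] += 1


def build(texts):
    """texts: text identifier -> list of word segments."""
    root = {"total_texts": 0, "children": {}}
    for text_id, words in texts.items():
        add_text(root, text_id, words)
    return root


root = build({
    "doc1": ["your", "go", "which", "go"],
    "doc2": ["your", "good"],
})
add_text(root, "doc3", ["go", "good"])  # a text added to the library later
```

After the two initial texts, the child node for "your" has virtual child nodes for "go" and "good", matching the arrangement of fig. 4; adding "doc3" reuses the same routine and raises the total recorded at the root node by one.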

In this embodiment, when the TF-IDF value of a target word needs to be calculated, a target child node may be determined by traversing the child nodes under the root node of a pre-constructed tree data structure according to the target word, where the word segment indicated by the target child node is the target word. After the target child node is determined, a target set corresponding to the target child node is further obtained, where the target set includes text identifiers and the TF values of the target word in the texts corresponding to the text identifiers; it can be understood that the text corresponding to each text identifier in the target set includes the target word. Then, according to the number of text identifiers included in the target set and the total number of texts indicated by the root node, the IDF value of the target word can be determined, and after the TF value and the IDF value of the target word are obtained, the TF-IDF value of the target word can be further calculated.
Therefore, in the process of calculating the TF-IDF value of the target word, after the target set corresponding to the target word is determined, the texts in the text library that contain the target word can be determined directly from the target set, without traversing the text library at the cost of a large amount of computing resources. This not only effectively reduces the computing resources required to determine those texts, that is, the computing resources required to calculate the IDF value of the target word, but also effectively improves the efficiency of determining the IDF value of the target word. Meanwhile, the TF value of the target word in each text containing the target word can be determined from the target set, so that the TF value does not need to be recalculated for each text; this reduces the amount of repeated calculation, and since the TF value is obtained by querying the set, the efficiency of obtaining the TF value of the target word can be effectively improved.

In addition, the embodiment of the application also provides a device for calculating the TF-IDF value of the word. Referring to fig. 5, fig. 5 shows an apparatus for calculating TF-IDF values of words in an embodiment of the present application, where the apparatus 500 includes:

a first determining module 501, configured to traverse child nodes under a root node of a tree data structure according to a first target word, determine a first target child node, where a word segment indicated by the first target child node is the first target word;

a first obtaining module 502, configured to obtain a first target set corresponding to the first target child node, where the first target set includes a first text identifier and a first word frequency TF value of the first target word in a text corresponding to the first text identifier, and texts corresponding to different first text identifiers are different from each other;

a second determining module 503, configured to determine a first inverse text frequency IDF value of the first target word according to a number of first text identifiers included in the first target set and a total number of texts indicated by the root node;

a first calculating module 504, configured to calculate a TF-IDF value corresponding to the first target word according to the first TF value and the first IDF value.

In one possible embodiment, the apparatus 500 further comprises:

In some possible embodiments, the first target word and the second target word are word segments in a word segment set, where the word segment set is obtained by performing word segment processing on the question text, and the apparatus 500 further includes:

In some possible embodiments, the apparatus 500 further comprises:

In some possible embodiments, the word segmentation set of the text to be processed further includes a second word segment, where the second word segment is continuous with the first word segment in the text to be processed, and the apparatus 500 further includes:

In some possible embodiments, the word segmentation set further includes a third word segment, and the apparatus 500 further includes:

It should be noted that, because the content of information interaction and execution process between each module and unit of the above-mentioned apparatus is based on the same concept as the method embodiment in the embodiment of the present application, the technical effects brought by the content are the same as the method embodiment in the embodiment of the present application, and specific content can be referred to the description in the foregoing method embodiment shown in the embodiment of the present application, which is not repeated here.

In addition, the embodiment of the application also provides equipment. Referring to fig. 6, fig. 6 shows a schematic hardware structure of an apparatus according to an embodiment of the present application, where the apparatus 600 includes a processor 601 and a memory 602:

the memory 602 is used for storing program codes and transmitting the program codes to the processor 601;

the processor 601 is configured to execute the method for calculating TF-IDF values of words described in the method embodiments above according to instructions in the program code.

Embodiments of the present application also provide a computer-readable storage medium. The methods described in the above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. If implemented in software, the functions may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Computer-readable media can include computer storage media and communication media, and can include any medium that can transfer a computer program from one place to another. The storage media may be any available media that can be accessed by a computer.

As an alternative design, the computer-readable medium may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store the desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

In this application, "of", "corresponding (relevant)" and "corresponding" are sometimes used interchangeably; it should be noted that, when the distinction is not emphasized, the meanings they express are consistent.

It should be noted that, in the embodiments of the present application, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.

In this application, "at least one" means one or more, and "plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: A alone, both A and B, and B alone, where A and B may each be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" or similar expressions refers to any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b and c may each be single or plural. In addition, in order to clearly describe the technical solutions of the embodiments of the present application, the words "first", "second", and the like are used in the embodiments of the present application to distinguish identical or similar items having substantially the same function and effect. It will be appreciated by those of skill in the art that the words "first", "second", and the like do not limit the quantity or the order of execution, and do not necessarily indicate that the items are different.

From the above description of embodiments, it will be apparent to those skilled in the art that all or part of the steps of the above described example methods may be implemented in software plus general hardware platforms. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a read-only memory (ROM)/RAM, a magnetic disk, an optical disk, or the like, including several instructions for causing a computer device (which may be a personal computer, a server, or a network communication device such as a router) to perform the methods described in the embodiments or some parts of the embodiments of the present application.

In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to each other, and each embodiment mainly describes its differences from the other embodiments. In particular, since the device embodiments are substantially similar to the method embodiments, their description is relatively brief, and reference may be made to the description of the method embodiments for relevant points. The apparatus embodiments described above are merely illustrative: the modules illustrated as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules, that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.

The foregoing describes merely exemplary embodiments of the present application and is not intended to limit the scope of the present application.

Claims

1. A method of calculating TF-IDF values of words, the method comprising:

acquiring a first target set corresponding to the first target child node, wherein the first target set comprises a first text identifier and a first word frequency TF value of the first target word in a text corresponding to the first text identifier, and texts corresponding to different first text identifiers are different;

2. The method according to claim 1, wherein the method further comprises:

acquiring a second target set corresponding to the second target child node, wherein the second target set comprises a second text identifier and a second TF value of the second target word in a text corresponding to the second text identifier, and texts corresponding to different second text identifiers are different;

3. The method of claim 2, wherein the first target word and the second target word are tokens in a set of tokens, the set of tokens being obtained by performing word segmentation on a question text, the method further comprising:

4. The method of claim 3, wherein the number of keywords included in the set of keywords does not exceed the number of tokens included in the set of tokens.

5. The method according to claim 1, wherein the method further comprises:

6. The method of claim 5, wherein the set of tokens of the text to be processed further comprises a second token that is continuous with the first token in the text to be processed, the method further comprising:

7. The method of claim 5, wherein the set of tokens further comprises a third token, the method further comprising:

8. An apparatus for calculating TF-IDF values of words, said apparatus comprising:

9. An apparatus comprising a processor and a memory;

the processor being configured to perform the method of calculating TF-IDF values of words according to any of claims 1 to 7 according to instructions in said program code.

10. A computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of calculating TF-IDF values of words according to any of claims 1 to 7.