CN113191145B

CN113191145B - Keyword processing method and device, electronic equipment and medium

Info

Publication number: CN113191145B
Application number: CN202110556831.7A
Authority: CN
Inventors: 李�浩; 庞敏辉; 冯婧超; 赵志新
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-05-21
Filing date: 2021-05-21
Publication date: 2023-08-11
Anticipated expiration: 2041-05-21
Also published as: CN113191145A

Abstract

The disclosure provides a keyword processing method, a keyword processing device, an electronic device and a medium, and relates to the technical field of computers, in particular to artificial intelligence and natural language processing. The keyword processing method comprises the following steps: extracting a plurality of candidate keywords from a corpus comprising a plurality of corpora; establishing keyword association pairs between the extracted candidate keywords; identifying an entity word and an attribute word associated with the entity word in the plurality of corpora; and generating a target keyword pair based on the association pair, the identified entity word and the attribute word associated with the entity word.

Description

Keyword processing method and device, electronic equipment and medium

Technical Field

The disclosure relates to the technical field of computers, in particular to artificial intelligence and natural language processing, and specifically to a keyword processing method, a keyword processing device, electronic equipment and a medium.

Background

Natural language processing techniques are widely used in various fields. Extracting keywords from a corpus and automatically performing operations such as searching, querying and responding according to the extracted keywords can greatly improve processing speed. Keyword extraction methods are generally based either on manual rules or on deep learning. Rule-based methods rely on people's prior knowledge to classify the data of an industry in order to extract keywords; they consume a large amount of manpower and require professional background knowledge to classify the data. Deep-learning-based methods require a large amount of training data to train a model, which takes a long time. Therefore, a rapid and accurate keyword extraction technique is required.

Disclosure of Invention

The disclosure provides a keyword processing method, a keyword processing device, electronic equipment and a medium.

According to an aspect of the present disclosure, there is provided a keyword processing method, including:

extracting a plurality of candidate keywords from a corpus comprising a plurality of corpora;

establishing a keyword association pair between the extracted candidate keywords;

identifying an entity word and an attribute word associated with the entity word in the plurality of corpora; and

generating a target keyword pair based on the association pair, the identified entity word and the attribute word associated with the entity word.

According to another aspect of the present disclosure, there is provided a keyword processing apparatus, including:

an extraction module configured to extract a plurality of candidate keywords from a corpus comprising a plurality of corpora;

an establishing module configured to establish keyword association pairs between the extracted candidate keywords;

a recognition module configured to identify an entity word and an attribute word associated with the entity word in the plurality of corpora; and

a generation module configured to generate a target keyword pair based on the association pair, the identified entity word and the attribute word associated with the entity word.

According to another aspect of the present disclosure, there is provided an electronic device including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method according to an aspect of the present disclosure.

According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method according to an aspect of the present disclosure.

According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method according to an aspect of the present disclosure.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.

Drawings

The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a flow chart of a method of processing keywords according to an embodiment of the present disclosure;

FIG. 2 is a flow chart of a method of processing keywords according to another embodiment of the present disclosure;

FIG. 3 is a flow chart of a method of extracting candidate keywords according to an embodiment of the present disclosure;

FIG. 4 is a flow chart of a method of establishing association pairs of keywords among a plurality of extracted candidate keywords according to an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of determining an answer sentence for a received request sentence in accordance with an embodiment of the disclosure;

FIG. 6 is a schematic diagram of a keyword processing apparatus according to an embodiment of the present disclosure; and

FIG. 7 illustrates a schematic block diagram of an example electronic device that may be used to implement embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

Fig. 1 is a flowchart of a method 100 of processing keywords according to an embodiment of the present disclosure.

In step S110, a plurality of candidate keywords are extracted from a corpus including a plurality of corpora. In some embodiments, the corpus may include received or collected sentences, sentences retrieved from a database, and so forth.

For different application domains, the sentences may be frequently asked questions (FAQs) or intention data for that domain. For example, in the financial field, the FAQs may include "what is the annual fee of the card", "how to apply for consumption installments", etc., and the intention data may include "help me promote the credit of the card", "how to waive the annual fee of the card", "how to apply for consumption installments", etc. In the e-commerce field, the FAQs may include "how to process a return", "why has the order not been shipped for so long", etc., and the intention data may include "how to process a return", "I want to merge the receiving addresses of two orders", etc.

In step S120, a keyword association pair is established between the extracted plurality of candidate keywords. The association pair may be made up of two candidate keywords in the corpus that are associated with each other. For example, two candidate keywords associated with each other in the same cluster constitute an association pair, or two candidate keywords associated with each other in the same sentence constitute an association pair.

In some embodiments, when two keywords co-occur in one sentence, the two keywords are determined to be a candidate association pair. For the plurality of candidate association pairs so determined, an association value is calculated for each pair, and the candidate association pairs whose association values are higher than a predetermined threshold are selected as the association pairs. The predetermined threshold is, for example, 1/5 or 1/10 of the highest association value.

Keywords that are associated in a sentence are typically located close to each other, i.e., adjacent or separated by only a few characters. For example, for the sentence "help me promote the credit of the card" in the above example, the keywords "promote-card" constitute a candidate association pair, while for the sentence "I want to merge the receiving addresses of two orders", the keywords "merge-order" constitute a candidate association pair.

In step S130, an entity word and an attribute word associated with the entity word are identified in the plurality of corpora. The entity word may be a word in the corpus that has a plurality of associated attributes. For example, for the sentence "help me promote the credit of the card" in the above example, "card" may have a plurality of associated attributes, so the word "card" is identified as an entity word. The attribute word may be an attribute associated with the identified entity word. For example, for the same sentence, once the entity word "card" has been identified, the attribute word "credit" associated with it is identified.

In some embodiments, the entity word has an attribute word associated therewith, and the entity word and its associated attribute word are extracted from the sentence using a predetermined rule. The predetermined rule may be based on a predetermined semantic relationship between words. For example, for two nouns in the same sentence, the predetermined semantic relationship may be "(attribute word) of (entity word)"; then, for the sentence "help me promote the credit of the card" in the above example, "(credit) of (card)" is extracted, giving the entity word and the attribute word associated with it.

For example, for the sentence "help me promote the credit of the card" in the above example, the phrase (credit of the card) is obtained, and then the entity word "card" and the attribute word "credit" associated therewith are obtained. For the sentence "I want to merge the receiving addresses of two orders", the phrase (receiving address of order) is obtained, and then the entity word "order" and the attribute word "receiving address" associated therewith are obtained.
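The rule-based extraction described above can be sketched as follows. This is only an illustrative sketch: the function name `extract_entity_attribute`, the part-of-speech tags, and the use of the English word "of" as the relation marker are assumptions made for the example and are not specified by the disclosure.

```python
from typing import List, Optional, Tuple

# A token is (word, part-of-speech tag). The tag set ("n" noun, "v" verb, ...)
# is illustrative only.
Token = Tuple[str, str]


def extract_entity_attribute(tokens: List[Token]) -> Optional[Tuple[str, str]]:
    """Return (entity, attribute) for the first "<noun> of ... <noun>" pattern."""
    for i, (word, _pos) in enumerate(tokens):
        if word != "of" or i == 0:
            continue
        attr_word, attr_pos = tokens[i - 1]
        if attr_pos != "n":
            continue
        # The entity is taken to be the first noun after "of",
        # skipping determiners, numerals and similar words.
        for ent_word, ent_pos in tokens[i + 1:]:
            if ent_pos == "n":
                return ent_word, attr_word
    return None


sentence = [("help", "v"), ("me", "r"), ("promote", "v"), ("the", "u"),
            ("credit", "n"), ("of", "p"), ("the", "u"), ("card", "n")]
print(extract_entity_attribute(sentence))  # ('card', 'credit')
```

A sentence without the "(attribute word) of (entity word)" pattern yields `None`, so no entity-attribute pair is recorded for it.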

In some embodiments, for a given entity word, the attribute words that occur most often may be selected according to a threshold. For example, for intention data such as "help me promote the credit of the card" and "I want to waive the annual fee of the card", the entity word "card" may have attribute words such as "credit", "annual fee", "overdue" and "installment" associated with it. Attribute words whose frequency of occurrence is higher than a threshold may be selected as the attribute words associated with the entity word, and attribute words whose frequency of occurrence is lower than the threshold may be discarded. The threshold may be, for example, 1/3, 1/5 or 1/10 of the highest frequency of occurrence. For sentences such as FAQs, the amount of data in one cluster after the clustering operation is small, so the attribute words associated with the individual entity words need not be discarded.
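A minimal sketch of this frequency-based selection is given below. The function name `frequent_attributes` and the example counts are hypothetical; keeping attribute words whose count equals the threshold is also an assumption, since the text only distinguishes "higher" and "lower".

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def frequent_attributes(pairs: Iterable[Tuple[str, str]],
                        ratio: float = 1 / 3) -> Dict[str, List[str]]:
    """Keep, per entity word, the attribute words whose count reaches
    ratio * (highest count for that entity word)."""
    counts: Dict[str, Counter] = {}
    for entity, attribute in pairs:
        counts.setdefault(entity, Counter())[attribute] += 1

    kept: Dict[str, List[str]] = {}
    for entity, counter in counts.items():
        threshold = ratio * max(counter.values())
        kept[entity] = [a for a, c in counter.items() if c >= threshold]
    return kept


# Made-up observations: (entity word, attribute word) pairs from intention data.
observed = ([("card", "credit")] * 6
            + [("card", "annual fee")] * 3
            + [("card", "overdue")])
print(frequent_attributes(observed))  # {'card': ['credit', 'annual fee']}
```

With a ratio of 1/3 and a highest count of 6, the threshold is 2, so "overdue" (seen once) is discarded.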

In step S140, a target keyword pair is generated based on the association pair and the identified entity word and the attribute word associated with the entity word.

In some embodiments, the established association pair includes two keywords co-occurring in one sentence. For example, for the sentence "help me promote the credit of the card" in the above example, the association pair is "promote-card" and the attribute word associated with the entity word "card" therein is "credit", thereby generating the target keyword pairs "promote-card", "promote-credit" and "card-credit". For another example, for the sentence "I want to merge the receiving addresses of two orders", the association pair is "merge-order" and the attribute word associated with the entity word "order" therein is "receiving address", thereby generating the target keyword pairs "merge-order", "merge-receiving address" and "order-receiving address".
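The way the examples above combine an association pair with an entity-attribute relation can be sketched as follows. The function name `generate_target_pairs` and the dictionary layout are assumptions for illustration; the disclosure does not prescribe a data structure.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def generate_target_pairs(association_pair: Pair,
                          attributes_by_entity: Dict[str, List[str]]) -> List[Pair]:
    """Expand an association pair with the attribute words of any member
    that is a known entity word."""
    first, second = association_pair
    pairs: List[Pair] = [(first, second)]
    # Check each member of the pair as a possible entity word.
    for entity, other in ((second, first), (first, second)):
        for attribute in attributes_by_entity.get(entity, []):
            for pair in ((other, attribute), (entity, attribute)):
                if pair not in pairs:
                    pairs.append(pair)
    return pairs


print(generate_target_pairs(("promote", "card"), {"card": ["credit"]}))
# [('promote', 'card'), ('promote', 'credit'), ('card', 'credit')]
```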

Embodiments of the present disclosure generate a target keyword pair based on an association pair made up of two associated keywords co-occurring in one sentence, and on an entity word identified from the corpus and an attribute word associated with the entity word; that is, an association between keywords is established in consideration of their semantic relationship. Accordingly, operations such as searching, querying and responding are performed on sentences using the generated target keyword pairs instead of keywords that are unrelated to each other. With the embodiments of the disclosure, the extracted target keyword pairs can be more accurate, and in further applications the answer sentence for a request sentence can be determined more quickly and accurately using the target keyword pairs.

Fig. 2 is a flow chart of a method 200 of processing keywords according to another embodiment of the present disclosure.

In fig. 2, steps S220 to S250 correspond to steps S110 to S140, respectively, in the method 100. In addition, step S210 is included before step S220 to cluster the corpus, so that candidate keywords are extracted for each cluster. In some embodiments, the plurality of corpora may belong to a plurality of categories, and the corpora belonging to one category are divided into one cluster.

Specifically, for FAQ, standard sentences and sentences similar to the standard sentences are divided into one cluster. For example, the standard sentence is "how to increase the amount", and the similar sentence may be "how to increase the amount", or the like. For intention data, sentences belonging to the same intention are divided into one cluster. For example, the intent data may include "I want to merge the order", "I want to modify the shipping address", and so on.
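As a small sketch of this grouping, assuming each sentence already carries a category label (a standard question for FAQs, or an intention name for intention data), the clusters can be built with a dictionary. The label strings and the function name `cluster_by_label` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def cluster_by_label(labelled: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group sentences so that each category label forms one cluster."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for label, sentence in labelled:
        clusters[label].append(sentence)
    return dict(clusters)


# Hypothetical intention labels attached to the example sentences.
data = [
    ("merge_order", "I want to merge the order"),
    ("modify_address", "I want to modify the shipping address"),
    ("merge_order", "I want to merge the receiving addresses of two orders"),
]
clusters = cluster_by_label(data)
print(len(clusters["merge_order"]))  # 2
```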

In fig. 2, after step S250, the method further includes step S260 of determining, in response to a received request sentence, an answer sentence for the request sentence using the generated target keyword pair.

In some embodiments, there may be a plurality of request sentences, and a normalized sentence is determined for them using the generated target keyword pairs. For example, request sentence 1: [ what is the annual fee of the card ], reply 1: [ the annual fee is 300 per year ]; request sentence 2: [ then help me transact this ]. The target keyword pairs "card-annual fee" and "card-transact" are generated by the method 100, so that the normalized sentence [ help me transact the card ] is obtained for the plurality of sentences and the corresponding answer sentence is determined. For another example, request sentence 1: [ when will my order be shipped ], reply 1: [ goods are shipped after being prepared ]; request sentence 2: [ can orders be merged ], reply 2: [ yes ]; request sentence 3: [ then send them to one receiving address ]. The target keyword pairs "merge-order" and "order-receiving address" are generated by the method 100, so that the normalized sentence [ help me merge the shipment ] is obtained for the plurality of sentences and the corresponding answer sentence is determined.

With the method 200, the corpus is clustered before the plurality of candidate keywords are extracted from it, so that keyword association pairs can be established among the candidate keywords within each cluster and mutually associated keywords can be extracted more quickly. Further, when responding to a received request sentence, searching, querying and responding are performed using the generated target keyword pairs rather than keywords that are unrelated to each other, so that they can be performed quickly and accurately.

Fig. 3 is a flow chart of a method 300 of extracting candidate keywords according to an embodiment of the present disclosure.

In step S310, word segmentation is performed on the multiple corpora to obtain multiple keywords in the corpora and parts of speech of the multiple keywords. The word segmentation and part-of-speech acquisition can be realized by adopting a generalized word segmentation model in combination with a word segmentation model of a specific application scene. The word segmentation model of the specific application scene is combined with possible entity words in the scene to segment the words.

In some embodiments, the word segmentation model of a particular application scenario may be a word segmenter for that scenario, which can correct segmentation errors for certain scenario-specific entity words. A stop word list may also be used to remove the influence of generic words. For example, in the e-commerce field, for the sentence "help me view order logistics progress", an erroneous segmentation yields "order object" and "flow progress"; the scenario-specific word segmenter corrects this to "order", "logistics" and "progress".

In step S320, the plurality of keywords are filtered based on the parts of speech to obtain a plurality of keywords having the target parts of speech as the plurality of candidate keywords. In some embodiments, based on the parts of speech obtained in step S310, the words having a target part of speech are selected as candidate keywords. For example, words that are nouns or verbs are selected as candidate keywords; as another example, words that are nouns or adjectives are selected as candidate keywords.
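The part-of-speech filtering of step S320 can be sketched as below, assuming the segmenter of step S310 returns (word, tag) tuples. The tag names "n", "v", "r" and the function name `filter_by_pos` are illustrative assumptions and do not correspond to any particular segmenter's tag set.

```python
from typing import FrozenSet, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def filter_by_pos(tokens: List[Token],
                  target_pos: FrozenSet[str] = frozenset({"n", "v"})) -> List[str]:
    """Keep only the words whose part of speech is one of the target tags."""
    return [word for word, pos in tokens if pos in target_pos]


segmented = [("help", "v"), ("me", "r"), ("view", "v"),
             ("order", "n"), ("logistics", "n"), ("progress", "n")]
print(filter_by_pos(segmented))
# ['help', 'view', 'order', 'logistics', 'progress']
```

Passing `frozenset({"n", "a"})` instead would select nouns and adjectives, matching the second example in the text.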

In extracting the candidate keywords, various keyword extraction algorithms may be employed, such as a keyword extraction algorithm based on term frequency-inverse document frequency (tf-idf).

The tf-idf-based keyword extraction algorithm first calculates the tf value, i.e., the word frequency value, of each candidate keyword:

$$\mathrm{tf}_i = \frac{n_i}{\sum_k n_k}$$

where $n_i$ is the number of occurrences of the currently counted candidate keyword in the document, and $\sum_k n_k$ is the sum of the numbers of occurrences of all candidate keywords in the document.
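A minimal sketch of the word frequency computation defined above, where a document is represented as a list of candidate keywords; the function name `term_frequencies` and the sample list are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List


def term_frequencies(candidates: List[str]) -> Dict[str, float]:
    """tf of each candidate keyword: its count divided by the total count
    of all candidate keywords in the same document."""
    counts = Counter(candidates)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


tf = term_frequencies(["card", "credit", "card", "promote"])
print(tf)  # {'card': 0.5, 'credit': 0.25, 'promote': 0.25}
```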

When counting word frequencies, consecutive verbs are counted only once, because several consecutive verbs may be a repetition of the same verb or may together form a single word, for example [ under search ], [ under change ], etc. For nouns, whether or not the candidate keywords are consecutive is not considered, and all occurrences are counted.

Next, the idf value of the candidate keyword, i.e., the inverse document frequency, is calculated:

$$\mathrm{idf}_i = \frac{\text{number of documents}}{\text{number of documents in which the candidate keyword occurs} + 1}$$
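The idf formula above can be sketched as follows. The sketch follows the formula exactly as written, without a logarithm (many common tf-idf formulations apply one). Treating each inner list as one document is an assumption, as is the function name `inverse_document_frequencies`.

```python
from collections import Counter
from typing import Dict, List


def inverse_document_frequencies(documents: List[List[str]]) -> Dict[str, float]:
    """idf = number of documents / (documents containing the keyword + 1)."""
    n_docs = len(documents)
    doc_freq: Counter = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each keyword once per document
    return {word: n_docs / (df + 1) for word, df in doc_freq.items()}


docs = [["card", "credit", "card", "promote"],
        ["order", "merge", "card"],
        ["order", "address"]]
idf = inverse_document_frequencies(docs)
print(idf["card"], idf["credit"])  # 1.0 1.5
```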

When counting the number of documents containing the candidate keyword, the rank of the candidate keyword's word frequency within the current cluster may be taken into account. If the word frequency ranks low, e.g., in the bottom quarter, the candidate keyword is considered to occur infrequently and is deleted from the cluster. Finally, the tf-idf value of the candidate keyword is calculated from the tf value and the idf value, for example according to the following equation:

$$\mathrm{tf\_idf}_i = \mathrm{tf}_i \cdot \mathrm{idf}_i$$

where $\mathrm{tf\_idf}_i$ represents the tf-idf value of the current candidate keyword.
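The removal of low-frequency candidate keywords from a cluster, mentioned before the equation, might look like the sketch below. The function name `prune_low_frequency`, the rounding of the number of dropped keywords, and the handling of ties are assumptions; the text only gives "the last 1/4" as an example.

```python
from typing import Dict


def prune_low_frequency(counts: Dict[str, int],
                        drop_fraction: float = 0.25) -> Dict[str, int]:
    """Drop the candidate keywords whose word frequency ranks in the
    lowest `drop_fraction` of the cluster."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    n_drop = int(len(ranked) * drop_fraction)  # rounded down (assumption)
    kept = ranked[:len(ranked) - n_drop]
    return {word: counts[word] for word in kept}


cluster_counts = {"card": 8, "credit": 5, "promote": 4, "please": 1}
print(prune_low_frequency(cluster_counts))
# {'card': 8, 'credit': 5, 'promote': 4}
```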

After calculating tf-idf values of the candidate keywords, a part of the candidate keywords may be selected as final candidate keywords according to a predetermined rule. For example, the first several high-frequency candidate keywords are selected as final candidate keywords, or the first several high-frequency candidate keywords with the target parts of speech are selected as final candidate keywords.
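One possible reading of this selection step is to rank candidate keywords by their tf-idf value and keep the first few. The sketch below takes already computed tf and idf values as input; the function name `top_candidates`, the value of `k`, and the numbers in the example are made up for illustration.

```python
from typing import Dict, List, Tuple


def top_candidates(tf: Dict[str, float], idf: Dict[str, float],
                   k: int = 2) -> List[Tuple[str, float]]:
    """Multiply tf by idf for each candidate keyword and keep the k
    highest-scoring keywords."""
    scores = {word: tf[word] * idf[word] for word in tf}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


tf_values = {"card": 0.5, "credit": 0.25, "promote": 0.25}
idf_values = {"card": 1.0, "credit": 1.5, "promote": 1.0}
print(top_candidates(tf_values, idf_values))
# [('card', 0.5), ('credit', 0.375)]
```

A part-of-speech filter could be applied to the ranked list before truncation to realize the second variant described in the text.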

According to the embodiment of the disclosure, the plurality of words with the target parts of speech can be more accurately obtained as candidate keywords by segmenting the plurality of corpus and filtering based on the parts of speech.

Fig. 4 is a flow chart of a method of establishing keyword association pairs between the extracted candidate keywords according to an embodiment of the present disclosure.

In step S410, it is determined whether two candidate keywords co-occur in one sentence. In some embodiments, where a first keyword having a first part of speech and a second keyword having a second part of speech are included in one sentence and the first keyword is adjacent to the second keyword, a co-occurrence of the first keyword and the second keyword is determined. The first part of speech may be a noun and the second part of speech may be a verb.
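The adjacency test of step S410 can be sketched as follows for a tagged sentence. The tag names, the function name `adjacent_pairs`, and the choice to report each pair verb-first (as in "merge-order") are assumptions; only directly adjacent tokens are considered here.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def adjacent_pairs(tokens: List[Token], first_pos: str = "n",
                   second_pos: str = "v") -> List[Tuple[str, str]]:
    """Return (verb, noun) pairs for every noun that is directly adjacent
    to a verb in the sentence, in either order."""
    pairs = []
    for (w1, p1), (w2, p2) in zip(tokens, tokens[1:]):
        if p1 == second_pos and p2 == first_pos:
            pairs.append((w1, w2))  # verb followed by noun
        elif p1 == first_pos and p2 == second_pos:
            pairs.append((w2, w1))  # noun followed by verb, reported verb-first
    return pairs


tagged = [("I", "r"), ("want", "v"), ("merge", "v"), ("order", "n")]
print(adjacent_pairs(tagged))  # [('merge', 'order')]
```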

In some embodiments, the co-occurrence frequency of the two keywords may also be calculated. For two keywords adjacent to each other in a sentence, the frequency with which the second keyword having the second part of speech occurs together with the first keyword having the first part of speech is calculated, the frequency with which the first keyword occurs together with the second keyword is calculated, and the co-occurrence frequency of the first keyword and the second keyword is obtained from the two by a weighted calculation.
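The text does not give the exact formula for this weighted calculation. One possible interpretation, shown purely as an assumption, treats the two frequencies as the share of each keyword's occurrences in which it is adjacent to the other keyword and combines them with a weight. The function name, the parameter names and the default weight of 0.5 are all hypothetical.

```python
def co_occurrence_frequency(pair_count: int, first_count: int,
                            second_count: int, weight: float = 0.5) -> float:
    """Weighted combination of two directional frequencies (assumed form).

    pair_count   -- times the two keywords are adjacent in a sentence
    first_count  -- total occurrences of the first keyword
    second_count -- total occurrences of the second keyword
    """
    second_given_first = pair_count / first_count
    first_given_second = pair_count / second_count
    return weight * second_given_first + (1 - weight) * first_given_second


# Made-up counts: the pair is seen 3 times, "card" 4 times, "promote" 6 times.
print(co_occurrence_frequency(3, 4, 6))  # 0.625
```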

In step S420, when it is determined that two candidate keywords co-occur in one sentence, the two candidate keywords are determined to be a candidate association pair. In some embodiments, an association value is calculated for each of the determined candidate association pairs, and the candidate association pairs whose association values are above a predetermined threshold are selected as association pairs.

In some embodiments, the tf-idf value $\mathrm{tf\_idf}_1$ of the first keyword having the first part of speech and the tf-idf value $\mathrm{tf\_idf}_2$ of the second keyword having the second part of speech are calculated, and the association value is obtained according to the following equation:

$$(\mathrm{tf\_idf}_1 + \mathrm{tf\_idf}_2) \cdot f_{co}$$

where $f_{co}$ is the co-occurrence frequency of the first keyword having the first part of speech and the second keyword having the second part of speech.

In a further embodiment, the predetermined threshold may be set according to the highest association value of the plurality of association values. For example, 1/10 of the highest association value may be used as the threshold, and candidate association pairs with association values less than the threshold may be discarded.
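The association value and the relative threshold described above can be sketched together as follows. The function name `select_association_pairs`, the dictionary layout and all numeric values are assumptions made for the example.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def select_association_pairs(co_freq: Dict[Pair, float],
                             tf_idf: Dict[str, float],
                             ratio: float = 0.1) -> Dict[Pair, float]:
    """Score each candidate pair as (tf_idf_1 + tf_idf_2) * f_co and discard
    pairs scoring below ratio * (highest score)."""
    values = {pair: (tf_idf[pair[0]] + tf_idf[pair[1]]) * f_co
              for pair, f_co in co_freq.items()}
    if not values:
        return {}
    threshold = ratio * max(values.values())
    return {pair: value for pair, value in values.items() if value >= threshold}


# Made-up tf-idf values and co-occurrence frequencies.
tf_idf_scores = {"promote": 0.3, "card": 0.5, "check": 0.05, "thing": 0.01}
candidates = {("promote", "card"): 0.625, ("check", "thing"): 0.1}
selected = select_association_pairs(candidates, tf_idf_scores)
print(list(selected))  # [('promote', 'card')]
```

Here the highest association value is about 0.5, so the threshold is about 0.05 and the pair "check-thing" (about 0.006) is discarded.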

Embodiments of the present disclosure select two keywords adjacent to each other as an association pair of keywords, and obtain association values of the two keywords according to tf-idf values of each of the two keywords and co-occurrence frequencies of the two keywords, so that a candidate association pair having an association value higher than a predetermined threshold may be selected as an association pair.

Fig. 5 is a schematic diagram of determining an answer sentence for a received request sentence according to an embodiment of the present disclosure.

As shown in fig. 5, an association pair 510 of keywords is established using the method 100 or 200. For example, for the sentence "help me promote the credit of the card" in the above example, the keywords "promote-card" constitute a candidate association pair 510, while for the sentence "I want to merge the receiving addresses of two orders", the keywords "merge-order" constitute a candidate association pair 510.

Further, the entity word and the attribute word associated with the entity word are identified, resulting in an entity word-associated attribute word 520. For the sentence "help me promote the credit of the card" in the above example, the phrase (credit of the card) is obtained, and then the entity word "card" and the attribute word "credit" associated with the entity word "card" are obtained. For the sentence "I want to merge the receiving addresses of two orders", the phrase (receiving address of order) is obtained, and then the entity word "order" and the attribute word "receiving address" associated therewith are obtained.

A target keyword pair 530 is generated based on the association pair 510 and the entity word and the attribute word 520 associated with the entity word. For example, for the sentence "help me promote the credit of the card" in the above example, the association pair is "promote-card" and the attribute word associated with the entity word "card" therein is "credit", and thus the target keyword pairs 530 are generated: "promote-card", "promote-credit" and "card-credit". For another example, for the sentence "I want to merge the receiving addresses of two orders", the association pair is "merge-order" and the attribute word associated with the entity word "order" therein is "receiving address", thereby generating the target keyword pairs 530: "merge-order", "merge-receiving address" and "order-receiving address".

After the target keyword pair is generated, a normalized sentence 550 for the request sentence may also be determined using the generated target keyword pair in response to the received request sentence 540, so as to generate an answer sentence 560 for the request sentence 540.

In some embodiments, there may be a plurality of request sentences, and a normalized sentence 550 is determined for them using the generated target keyword pairs 530. For example, request sentence 1: [ what is the annual fee of the card ], reply 1: [ the annual fee is 300 per year ]; request sentence 2: [ then help me transact this ]. From the target keyword pairs 530 "card-annual fee" and "card-transact", the normalized sentence 550 [ help me transact the card ] is obtained for the plurality of sentences, and an answer sentence 560 is generated. For another example, request sentence 1: [ when will my order be shipped ], reply 1: [ goods are shipped after being prepared ]; request sentence 2: [ can orders be merged ], reply 2: [ yes ]; request sentence 3: [ then send them to one receiving address ]. From the target keyword pairs 530 "merge-order" and "order-receiving address", the normalized sentence 550 [ help me merge the shipment ] is obtained for the plurality of sentences, and an answer sentence 560 is generated.

When responding to the received request statement, the embodiment of the disclosure adopts the generated target keyword pair to search, inquire and respond, rather than adopting keywords which are not related to each other to search, inquire and respond, so that the response can be quickly and accurately performed.

The disclosure also provides a keyword processing device for executing any one of the above methods.

Fig. 6 is a schematic diagram of a keyword processing apparatus 600 according to an embodiment of the present disclosure.

As shown in fig. 6, the keyword processing apparatus 600 includes an extraction module 610, an establishing module 620, a recognition module 630, and a generation module 640.

The extraction module 610 extracts a plurality of candidate keywords from a corpus comprising a plurality of corpora. In some embodiments, the corpus may include received or collected sentences, sentences retrieved from a database, and so forth.

The establishing module 620 establishes a keyword association pair between the extracted plurality of candidate keywords with each other. The association pair may be made up of two candidate keywords in the corpus that are associated with each other. For example, two candidate keywords associated with each other in the same cluster constitute an association pair, or two candidate keywords associated with each other in the same sentence constitute an association pair.

The recognition module 630 recognizes an entity word in the plurality of corpora and an attribute word associated with the entity word. The entity word may be a word in the corpus that has a plurality of associated attributes. For example, for the sentence "help me promote the credit of the card" in the above example, "card" may have a plurality of associated attributes, so the word "card" is identified as an entity word. The attribute word may be an attribute associated with the identified entity word. For example, for the same sentence, once the entity word "card" has been identified, the attribute word "credit" associated with it is identified.

The generation module 640 generates a target keyword pair based on the association pair and the identified entity word and the attribute word associated with the entity word.

Embodiments of the present disclosure generate a target keyword pair based on an association pair made up of two associated keywords co-occurring in one sentence, and on an entity word identified from the corpus and an attribute word associated with the entity word; that is, an association between keywords is established in consideration of their semantic relationship. Accordingly, operations such as searching, querying and responding are performed on sentences using the generated target keyword pairs instead of keywords that are unrelated to each other. According to the embodiments of the disclosure, the extracted target keyword pairs can be more accurate, and in further applications the answer sentence for a request sentence can be determined more quickly and accurately using the target keyword pairs.

Further, the extraction module 610 in fig. 6 may further include a word segmentation module 610a and a filtering module 610b.

The word segmentation module 610a performs word segmentation on the plurality of corpora to obtain a plurality of keywords in the corpora and parts of speech of the plurality of keywords. The word segmentation and part-of-speech acquisition can be realized by adopting a generalized word segmentation model in combination with a word segmentation model of a specific application scene. The word segmentation model of the specific application scene is combined with possible entity words in the scene to segment the words.

The filtering module 610b filters the plurality of keywords based on the part of speech to obtain a plurality of keywords having the target part of speech as a plurality of candidate keywords. In some embodiments, for the part of speech obtained by the word segmentation module 610a, a plurality of words having the target part of speech are selected as candidate keywords. For example, a plurality of words having noun parts of speech and verb parts of speech are selected as candidate keywords, and for example, a plurality of words having noun parts of speech and adjective parts of speech are selected as candidate keywords.

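In extracting the candidate keywords, the tf-idf value of each candidate keyword may be calculated as:

$$\mathrm{tf\_idf}_i = \mathrm{tf}_i \cdot \mathrm{idf}_i$$

where $\mathrm{tf\_idf}_i$ represents the tf-idf value of the current candidate keyword.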

Further, the establishing module 620 in fig. 6 may further include a first determination module 620a and a second determination module 620b.

The first determination module 620a determines whether two candidate keywords co-occur in a sentence. In some embodiments, where a first keyword having a first part of speech and a second keyword having a second part of speech are included in one sentence and the first keyword is adjacent to the second keyword, a co-occurrence of the first keyword and the second keyword is determined. The first part of speech may be a noun and the second part of speech may be a verb.
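The adjacency-based co-occurrence check can be sketched as follows. The sketch assumes that adjacency is evaluated on the raw token sequence of one sentence and that either order (noun then verb, or verb then noun) counts; both points are assumptions, as are the tag names and the function name.

```python
from typing import List, Tuple

def cooccurring_pairs(
    tagged_sentence: List[Tuple[str, str]],
    first_pos: str = "n",   # hypothetical tag for the first part of speech (noun)
    second_pos: str = "v",  # hypothetical tag for the second part of speech (verb)
) -> List[Tuple[str, str]]:
    """Return (first_keyword, second_keyword) pairs that are adjacent in the sentence."""
    pairs = []
    # Walk over every pair of neighbouring tokens in the sentence.
    for (w1, p1), (w2, p2) in zip(tagged_sentence, tagged_sentence[1:]):
        if p1 == first_pos and p2 == second_pos:
            pairs.append((w1, w2))
        elif p1 == second_pos and p2 == first_pos:
            # Normalise so the first-part-of-speech word always comes first.
            pairs.append((w2, w1))
    return pairs

# Hypothetical tagged sentence.
sentence = [("reset", "v"), ("password", "n"), ("for", "p"),
            ("account", "n"), ("fails", "v")]
pairs = cooccurring_pairs(sentence)
```

Counting how often each returned pair appears across all sentences would give a co-occurrence frequency for that pair.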

The second determination module 620b determines two candidate keywords to be a candidate association pair when it is determined that the two candidate keywords co-occur in one sentence.

In some embodiments, the second determination module 620b may include a calculation module 6201 and a selection module 6202 (not shown). The calculation module 6201 calculates, for the determined plurality of candidate association pairs, a respective association value of each candidate association pair, resulting in a plurality of association values. The selection module 6202 selects, from the plurality of candidate association pairs, a candidate association pair having an association value above a predetermined threshold as the association pair.

In some embodiments, the calculation module 6201 calculates the tf-idf value $\mathrm{tf\_idf}_1$ of the first keyword having the first part of speech and the tf-idf value $\mathrm{tf\_idf}_2$ of the second keyword having the second part of speech, and obtains the association value according to the following equation:

$$(\mathrm{tf\_idf}_1 + \mathrm{tf\_idf}_2) \times f_{co}$$

where $f_{co}$ is the co-occurrence frequency of the two keywords.

In a further embodiment, the second determining module 620b may further include a setting module 6203 (not shown in the figure), the setting module 6203 setting the predetermined threshold according to a highest association value of the plurality of association values. For example, 1/10 of the highest association value may be used as the threshold, and candidate association pairs with association values less than the threshold may be discarded.
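The association value and the relative threshold can be sketched together as below. The sketch assumes that the co-occurrence frequency is a raw count of sentences in which the pair co-occurs and that pairs exactly at the threshold are kept (only values less than the threshold are discarded); the function names and sample numbers are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]

def association_value(tf_idf_1: float, tf_idf_2: float, f_co: float) -> float:
    """(tf_idf_1 + tf_idf_2) * f_co, following the equation above."""
    return (tf_idf_1 + tf_idf_2) * f_co

def select_association_pairs(values: Dict[Pair, float], ratio: float = 0.1) -> List[Pair]:
    """Keep pairs whose association value is not less than ratio * highest value."""
    if not values:
        return []
    threshold = max(values.values()) * ratio  # e.g. 1/10 of the highest value
    return [pair for pair, value in values.items() if value >= threshold]

# Hypothetical tf-idf scores and co-occurrence counts.
tf_idf = {"password": 0.30, "reset": 0.20, "account": 0.10, "fails": 0.05}
co_counts = {("password", "reset"): 8, ("account", "fails"): 1}

values = {
    (first, second): association_value(tf_idf[first], tf_idf[second], count)
    for (first, second), count in co_counts.items()
}
selected = select_association_pairs(values)
```

Tying the threshold to the highest association value keeps the cut-off scale-free, so the same ratio can be reused across corpora of different sizes.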

Further, the processing device 600 may also include an answer module 650. In response to a received request sentence, the answer module 650 determines an answer sentence for the request sentence using the generated target keyword pair.

Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 7, the device 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 702 or a computer program loaded from a storage unit 708 into a random access memory (RAM) 703. Various programs and data required for the operation of the device 700 may also be stored in the RAM 703. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.

Various components in device 700 are connected to I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.

The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the various methods and processes described above, such as method 100 or 200. For example, in some embodiments, the methods described above may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 700 via ROM 702 and/or communication unit 709. When the computer program is loaded into RAM 703 and executed by computing unit 701, one or more steps of the method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the method by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. Such program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.

The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the high management difficulty and weak service scalability of traditional physical hosts and virtual private server (VPS) services. The server may also be a server of a distributed system or a server that incorporates a blockchain.

It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.

The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims

1. A keyword processing method comprises the following steps:

generating a target keyword pair based on the association pair and the identified entity word and the attribute word associated with the entity word,

wherein, for the determined plurality of candidate association pairs, respective association values of the plurality of candidate association pairs are calculated to obtain a plurality of association values, and a candidate association pair having an association value above a predetermined threshold among the plurality of candidate association pairs is selected as the association pair, and

wherein, for each candidate association pair, the association value is calculated based on the respective term frequency-inverse document frequency (tf-idf) values of the two keywords in the candidate association pair and the co-occurrence frequency of the two keywords.

2. The method of claim 1, wherein the extracting a plurality of candidate keywords from a corpus comprising a plurality of corpora comprises:

performing word segmentation on the plurality of corpora to obtain a plurality of keywords in the corpora and parts of speech of the plurality of keywords;

and filtering the keywords based on the part of speech to obtain a plurality of keywords with target part of speech as the candidate keywords.

3. The method of claim 1 or 2, wherein the establishing keyword association pairs between the extracted plurality of candidate keywords further comprises:

determining whether two candidate keywords co-occur in one sentence;

in the case that two candidate keywords are determined to co-occur in one sentence, candidate association pairs of the two candidate keywords are determined.

4. The method of claim 3, wherein the determining whether two candidate keywords co-occur in one sentence comprises:

determining that a first keyword and a second keyword co-occur when the first keyword having a first part of speech and the second keyword having a second part of speech are included in the one sentence and the first keyword is adjacent to the second keyword.

5. The method of claim 4, wherein the first part of speech is a noun and the second part of speech is a verb.

6. The method of claim 1, wherein the predetermined threshold is set according to a highest association value of the plurality of association values.

7. The method of claim 1, wherein the association value is calculated by the following equation:

$$(\mathrm{tf\_idf}_1 + \mathrm{tf\_idf}_2) \times f_{co}$$

wherein $\mathrm{tf\_idf}_1$ is the tf-idf value of one of the two keywords, $\mathrm{tf\_idf}_2$ is the tf-idf value of the other of the two keywords, and $f_{co}$ is the co-occurrence frequency of the two keywords.

8. The method of claim 5, wherein the generating a target keyword pair based on the association pair and the identified entity word and the attribute word associated with the entity word comprises:

generating a target keyword pair comprising a first sub-keyword pair, a second sub-keyword pair and a third sub-keyword pair based on the established association pair, the entity word and the attribute word associated with the entity word, wherein the first sub-keyword pair is a keyword pair formed by the first keyword with the first part of speech and the entity word, the second sub-keyword pair is a keyword pair formed by the first keyword with the first part of speech and the attribute word, and the third sub-keyword pair is a keyword pair formed by the entity word and the attribute word.

9. The method of claim 1, further comprising:

in response to the received request sentence, an answer sentence for the request sentence is determined using the generated target keyword pair.

10. A keyword processing apparatus, comprising:

a generation module for generating a target keyword pair based on the association pair, the identified entity word and the attribute word associated with the entity word,

wherein, the establishment module includes:

a calculation module for calculating, for the determined plurality of candidate association pairs, respective association values of the plurality of candidate association pairs to obtain a plurality of association values; and

a selection module for selecting, from the plurality of candidate association pairs, a candidate association pair having an association value higher than a predetermined threshold as the association pair, and

wherein the calculation module calculates, for each candidate association pair, the association value based on the respective term frequency-inverse document frequency (tf-idf) values of the two keywords in the candidate association pair and the co-occurrence frequency of the two keywords.

11. The apparatus of claim 10, wherein the extraction module further comprises:

a word segmentation module for performing word segmentation on the plurality of corpora to obtain a plurality of keywords in the corpora and parts of speech of the plurality of keywords; and

a filtering module for filtering the plurality of keywords based on the parts of speech to obtain a plurality of keywords having a target part of speech as the plurality of candidate keywords.

12. The apparatus of claim 10 or 11, wherein the establishment module further comprises:

a first determination module for determining whether two candidate keywords co-occur in one sentence; and

a second determination module for determining a candidate association pair of the two candidate keywords in the case that the two candidate keywords are determined to co-occur in one sentence.

13. The apparatus of claim 12, wherein the first determination module determines that a first keyword and a second keyword co-occur if the one sentence includes the first keyword having a first part of speech and the second keyword having a second part of speech and the first keyword is adjacent to the second keyword.

14. The apparatus of claim 13, wherein the first part of speech is a noun and the second part of speech is a verb.

15. The apparatus of claim 10, wherein the establishment module further comprises a setting module for setting the predetermined threshold according to a highest association value of the plurality of association values.

16. The apparatus of claim 10, wherein the calculation module calculates the association value by:

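$$(\mathrm{tf\_idf}_1 + \mathrm{tf\_idf}_2) \times f_{co}$$

wherein $\mathrm{tf\_idf}_1$ is the tf-idf value of one of the two keywords, $\mathrm{tf\_idf}_2$ is the tf-idf value of the other of the two keywords, and $f_{co}$ is the co-occurrence frequency of the two keywords.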

17. The apparatus of claim 14, wherein the generation module generates a target keyword pair comprising a first sub-keyword pair, a second sub-keyword pair, and a third sub-keyword pair based on the established association pair and the entity word and the attribute word associated with the entity word, the first sub-keyword pair being a keyword pair formed by the first keyword having the first part of speech and the entity word, the second sub-keyword pair being a keyword pair formed by the first keyword having the first part of speech and the attribute word, the third sub-keyword pair being a keyword pair formed by the entity word and the attribute word.

18. The apparatus of claim 10, further comprising an answer module that determines an answer sentence for the request sentence using the generated target keyword pair in response to the received request sentence.

19. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.

20. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-9.