TW202030625A

TW202030625A - Method and device for extracting main words through reinforcement learning

Info

Publication number: TW202030625A
Application number: TW108132431A
Authority: TW
Inventors: 劉佳; 崔恆斌
Original assignee: 香港商阿里巴巴集團服務有限公司
Priority date: 2019-02-13
Filing date: 2019-09-09
Publication date: 2020-08-16
Also published as: TWI717826B; CN110008332A; WO2020164336A1; CN110008332B

Abstract

The embodiment of the invention provides a method and a device for extracting main words through reinforcement learning, and the method comprises the steps: firstly, training a classification network for sentence classification by utilizing a sentence sample set; secondly, performing main word extraction on sample sentences in the sentence sample set by utilizing a strategy network under the current strategy parameters to obtain a main word set, and determining current first loss according to the number of words in the sample sentences and the number of words in the main word set; and then, carrying out classification processing on the alternative sentences formed by the main word set by utilizing a classification network to obtain classification results of the alternative sentences, and determining current second loss according to the classification results and classification labels of the sample sentences. Therefore, the current total loss can be determined according to the current first loss and the second loss. Furthermore, the reinforcement learning system is updated in the direction in which the total loss is reduced, including at least updating the policy network for extracting the main word from the sentence to be analyzed.

Description

通過強化學習提取主幹詞的方法及裝置Method and device for extracting main words through reinforcement learning

本說明書一個或多個實施例涉及機器學習領域，尤其涉及利用強化學習的方式提取句子中的主幹詞的方法和裝置。One or more embodiments of this specification relate to the field of machine learning, and more particularly to methods and devices for extracting stem words in sentences by means of reinforcement learning.

電腦執行的自然語言處理和文本分析，例如意圖識別，事件抽取等，已經應用到多種技術場景中，例如智慧客服。在智慧客服中，需要對使用者的描述問題進行意圖識別，進而將其匹配到知識庫中的知識點，從而自動地回答用戶的問題。然而，用戶在進行問題描述時，特別是通過語音進行問題描述，例如電話交流溝通時，經常有一些口語描述，比如『嗯』『啊』『那個』『哦』『就是』等等，或者包含一些非重點的，不必要的詞語。這就需要把句子中主要的詞，即主幹詞提取出來，以便後續做語義分析和意圖識別。在進行事件抽取時，也需要排除一些停用詞，提取出主幹詞，從而優化事件提取的效果。因此，希望能有改進的方案，可以有效地對句子中的主幹詞進行提取，從而優化文本分析效果。Computer-executed natural language processing and text analysis, such as intent recognition and event extraction, have been applied to a variety of technical scenarios, such as smart customer service. In smart customer service, it is necessary to identify the user's intent to describe the problem, and then match it to the knowledge points in the knowledge base, so as to automatically answer the user's question. However, when the user describes the problem, especially the problem description through voice, such as telephone communication, there are often some spoken descriptions, such as "um", "ah", "that", "oh", "yes", etc., or include Some non-essential, unnecessary words. This requires extracting the main words in the sentence, that is, the main words for subsequent semantic analysis and intention recognition. When performing event extraction, it is also necessary to exclude some stop words and extract the main words, so as to optimize the effect of event extraction. Therefore, it is hoped that there will be an improved scheme that can effectively extract the main words in the sentence, so as to optimize the effect of text analysis.

本說明書一個或多個實施例描述了一種利用強化學習系統提取主幹詞的方法和裝置。通過實施例中的方法和裝置，利用強化學習的方式進行主幹詞提取的訓練，從而減少人工標注成本，提高主幹詞提取效率，優化文本分析效果。根據第一態樣，提供了一種通過強化學習提取主幹詞的方法，包括：利用句子樣本集，訓練用於句子分類的分類網路；利用當前策略參數下的策略網路，對所述句子樣本集中的第一樣本句子進行主幹詞提取，獲得第一主幹詞集合，並根據所述第一樣本句子中的詞語數目和所述第一主幹詞集合中的詞語數目，確定當前的第一損失；利用所述分類網路對由所述第一主幹詞集合構成的第一備選句子進行分類處理，獲得所述第一備選句子的第一分類結果，並根據所述第一分類結果以及所述第一樣本句子的分類標籤，確定當前的第二損失；根據所述當前的第一損失和當前的第二損失，確定當前的總損失；在總損失減小的方向，至少更新所述策略網路，以用於從待分析句子中提取主幹詞。在一個實施例中，策略網路包括第一嵌入層，第一處理層和第二處理層，所述利用策略網路對所述句子樣本集中的第一樣本句子進行主幹詞提取包括：在所述第一嵌入層，獲得所述第一樣本句子中的各個詞的詞嵌入向量；在所述第一處理層，根據所述詞嵌入向量，確定所述各個詞作為主幹詞的機率；在所述第二處理層，至少根據所述機率，從所述各個詞中選擇至少一部分詞，構成所述第一主幹詞集合。在一個進一步的實施例中，在所述第二處理層，從所述各個詞中選擇機率值大於預設閾值的詞，構成所述第一主幹詞集合。根據一種實施方式，分類網路包括第二嵌入層和第三處理層，所述利用所述分類網路對由所述第一主幹詞集合構成的第一備選句子進行分類處理包括：在所述第二嵌入層，獲得所述第一備選句子對應的句子嵌入向量；在所述第三處理層，根據所述句子嵌入向量，確定所述第一備選句子的第一分類結果。在一種實施方式中，策略網路和/或分類網路基於循環神經網路RNN。在一個實施例中，上述方法還包括，確定所述總損失減小的方向，包括：分別利用N群組原則參數下的所述策略網路處理所述第一樣本句子，獲得對應的N個主幹詞集合，並分別確定N個第一損失；利用所述分類網路，對所述N個主幹詞集合分別對應的N個備選句子進行分類處理，獲得N個分類結果，並分別確定N個第二損失；根據N個第一損失和N個第二損失，確定對應的N個總損失，以及所述N個總損失的均值；確定損失值小於等於所述均值的至少一個第一總損失，以及損失值大於所述均值的至少一個第二總損失；基於所述至少一個第一總損失和所述至少一個第二總損失，確定所述總損失減小的方向。進一步的，在一個實施例中，上述N個分類結果是利用同一組分類參數下的所述分類網路，對所述N個備選句子分別進行分類處理而獲得；在這樣的情況下，所述N個總損失對應於所述N群組原則參數；此時，確定所述總損失減小的方向，包括：確定所述至少一個第一總損失對應的至少一組第一策略參數相對於當前策略參數的梯度的累積，作為正方向；確定所述至少一個第二總損失對應的至少一組第二策略參數相對於當前策略參數的梯度的累積，作為負方向；將所述正方向與所述負方向的相反方向疊加，作為所述總損失減小的方向。進一步的，在上述情況下，可以在所述總損失減小的方向，更新所述策略網路中的當前策略參數。在另一實施例中，所述N個分類結果是利用M組分類參數下的所述分類網路，對所述N個備選句子進行分類處理而獲得，其中M＜=N；在這樣的情況下，所述N個總損失對應於N個參數集，其中第i參數集包括第i群組原則參數，和處理第i備選句子時所述分類網路對應的分類參數；此時，確定所述總損失減小的方向包括：確定所述至少一個第一總損失對應的至少一組第一參數集相對於當前策略參數的梯度的累積，作為第一正方向；確定所述至少一個第二總損失對應的至少一組第二參數集相對於當前策略參數的梯度的累積，作為第一負方向；將所述第一正方向與所述第一負方向的相反方向疊加，作為第一調整方向；確定所述至少一個第一總損失對應的至少一組第一參數集相對於當前分類參數的梯度的累積，作為第二正方向；確定所述至少一個第二總損失對應的至少一組第二參數集相對於當前分類參數的梯度的累積，作為第二負方向；將所述第二正方向與所述第二負方向的相反方向疊加，作為第二調整方向；將所述第一調整方向和第二調整方向的總和作為所述總損失減小的方向。進一步的，在上述情況下，可以在所述第一調整方向，更新所述策略網路的當前策略參數；在所述第二調整方向，更新所述分類網路的當前分類參數。根據一種實施方式，上述方法還包括：將待分析的第二句子輸入所述策略網路；根據所述策略網路的輸出，確定所述第二句子中的主幹詞。根據第二態樣，提供了一種通過強化學習提取主幹詞的裝置，包括：分類網路訓練單元，配置為利用句子樣本集，訓練用於句子分類的分類網路；第一確定單元，配置為利用當前策略參數下的策略網路，對所述句子樣本集中的第一樣本句子進行主幹詞提取，獲得第一主幹詞集合，並根據所述第一樣本句子中的詞語數目和所述第一主幹詞集合中的詞語數目，確定當前的第一損失；第二確定單元，配置為利用所述分類網路對由所述第一主幹詞集合構成的第一備選句子進行分類處理，獲得所述第一備選句子的第一分類結果，並根據所述第一分類結果以及所述第一樣本句子的分類標籤，確定當前的第二損失；總損失確定單元，配置為根據所述當前的第一損失和當前的第二損失，確定當前的總損失；更新單元，配置為在總損失減小的方向，至少更新所述策略網路，以用於從待分析句子中提取主幹詞。根據第三態樣，提供了一種電腦可讀儲存媒體，其上儲存有電腦程式，當所述電腦程式在電腦中執行時，令電腦執行第一態樣的方法。根據第四態樣，提供了一種計算設備，包括記憶體和處理器，其特徵在於，所述記憶體中儲存有可執行程式碼，所述處理器執行所述可執行程式碼時，實現第一態樣的方法。根據本說明書實施例提供的方法和裝置，通過強化學習的方式，進行主幹詞提取的學習和訓練。更具體的，採用actor-critic方式的強化學習系統進行主幹詞提取，其中在強化學習系統中，策略網路作為actor，用於主幹詞提取；分類網路作為critic，用於對句子進行分類。可以利用現有的句子樣本庫作為訓練預料訓練分類網路，從而避免主幹詞標注的人工成本。經過初步訓練的分類網路即可對策略網路提取的主幹詞構成的句子進行分類，如此評估主幹詞提取的效果。通過對策略網路和分類網路的輸出結果均設置損失，根據總損失反覆訓練策略網路和分類網路，可以得到理想的強化學習系統。如此，可以在無需主幹詞人工標注的情況下，訓練得到理想的網路系統，實現對主幹詞的有效提取。One or more embodiments of this specification describe a method and device for extracting backbone words using a reinforcement learning system. Through the method and device in the embodiment, the training of stem word extraction is carried out in the manner of reinforcement learning, thereby reducing the cost of manual labeling, improving the efficiency of stem word extraction, and optimizing the effect of text analysis. According to the first aspect, a method for extracting stem words through reinforcement learning is provided, including: Use sentence sample sets to train a classification network for sentence classification; Use the strategy network under the current strategy parameters to extract the stem words of the first sample sentence in the sentence sample set to obtain the first stem word set, and according to the number of words in the first sample sentence and the The number of words in the first stem word set determines the current first loss; Use the classification network to classify the first candidate sentence formed by the first stem word set to obtain the first classification result of the first candidate sentence, and according to the first classification result and the State the classification label of the first sample sentence and determine the current second loss; Determine the current total loss according to the current first loss and the current second loss; In the direction where the total loss is reduced, at least the strategy network is updated to extract the main words from the sentence to be analyzed. In one embodiment, the strategy network includes a first embedding layer, a first processing layer, and a second processing layer. The use of the strategy network to extract the stem words of the first sample sentence in the sentence sample set includes: In the first embedding layer, obtaining the word embedding vector of each word in the first sample sentence; In the first processing layer, according to the word embedding vector, determine the probability of each word as a backbone word; In the second processing layer, at least a part of the words is selected from the various words at least according to the probability to form the first trunk word set. In a further embodiment, in the second processing layer, words with a probability value greater than a preset threshold are selected from the various words to form the first trunk word set. According to an embodiment, the classification network includes a second embedding layer and a third processing layer, and the classification processing of the first candidate sentence formed by the first set of stem words using the classification network includes: In the second embedding layer, obtain the sentence embedding vector corresponding to the first candidate sentence; In the third processing layer, the first classification result of the first candidate sentence is determined according to the sentence embedding vector. In one embodiment, the strategy network and/or classification network is based on a recurrent neural network RNN. In an embodiment, the above method further includes determining the direction in which the total loss is reduced, including: Processing the first sample sentence by using the strategy network under the N group principle parameters to obtain corresponding N trunk word sets, and respectively determining N first losses; Use the classification network to classify the N candidate sentences corresponding to the N trunk word sets, obtain N classification results, and determine N second losses respectively; Determine the corresponding N total losses and the average value of the N total losses according to the N first losses and N second losses; Determine at least one first total loss whose loss value is less than or equal to the average value, and at least one second total loss whose loss value is greater than the average value; Based on the at least one first total loss and the at least one second total loss, a direction in which the total loss decreases is determined. Further, in an embodiment, the foregoing N classification results are obtained by using the classification network under the same set of classification parameters to perform classification processing on the N candidate sentences; in this case, The N total losses correspond to the N group principle parameters; At this time, determining the direction in which the total loss is reduced includes: Determining the accumulation of gradients of at least one group of first strategy parameters corresponding to the at least one first total loss with respect to the current strategy parameters as a positive direction; Determining the accumulation of gradients of at least one set of second strategy parameters corresponding to the at least one second total loss with respect to the current strategy parameters as a negative direction; The direction opposite to the positive direction and the negative direction is superimposed as the direction in which the total loss decreases. Further, in the above case, the current strategy parameters in the strategy network can be updated in the direction in which the total loss is reduced. In another embodiment, the N classification results are obtained by classifying the N candidate sentences using the classification network under M groups of classification parameters, where M<=N; In this case, the N total losses correspond to N parameter sets, where the i-th parameter set includes the i-th group principle parameter and the classification parameter corresponding to the classification network when processing the i-th candidate sentence; At this time, determining the direction of the total loss reduction includes: Determining the accumulation of the gradient of the at least one first parameter set corresponding to the at least one first total loss relative to the current strategy parameter as the first positive direction; Determining the accumulation of the gradient of the at least one second parameter set corresponding to the at least one second total loss relative to the current strategy parameter as the first negative direction; Superimposing the first positive direction and the opposite direction of the first negative direction as the first adjustment direction; Determining the accumulation of the gradient of the at least one first parameter set corresponding to the at least one first total loss relative to the current classification parameter as the second positive direction; Determining the accumulation of the gradient of the at least one second parameter set corresponding to the at least one second total loss with respect to the current classification parameter as the second negative direction; Superimposing the second positive direction and the opposite direction of the second negative direction as a second adjustment direction; The sum of the first adjustment direction and the second adjustment direction is taken as the direction in which the total loss is reduced. Further, in the above case, the current strategy parameters of the strategy network can be updated in the first adjustment direction; and the current classification parameters of the classification network can be updated in the second adjustment direction. According to an embodiment, the above method further includes: Input the second sentence to be analyzed into the strategy network; Determine the main word in the second sentence according to the output of the strategy network. According to the second aspect, a device for extracting stem words through reinforcement learning is provided, including: The classification network training unit is configured to use the sentence sample set to train the classification network for sentence classification; The first determining unit is configured to use the strategy network under the current strategy parameters to extract the stem words of the first sample sentence in the sentence sample set to obtain the first stem word set, and according to the first sample sentence The number of words in and the number of words in the first stem word set to determine the current first loss; The second determining unit is configured to use the classification network to classify the first candidate sentence formed by the first stem word set, to obtain the first classification result of the first candidate sentence, and according to the Describe the first classification result and the classification label of the first sample sentence, and determine the current second loss; The total loss determining unit is configured to determine the current total loss according to the current first loss and the current second loss; The update unit is configured to update at least the strategy network in a direction in which the total loss is reduced, so as to extract the main words from the sentence to be analyzed. According to the third aspect, a computer-readable storage medium is provided, on which a computer program is stored, and when the computer program is executed in the computer, the computer is caused to execute the method of the first aspect. According to a fourth aspect, there is provided a computing device, including a memory and a processor, wherein the memory stores executable program code, and when the processor executes the executable program code, the first The same method. According to the method and device provided in the embodiment of this specification, the learning and training of the extraction of the main words are carried out by means of reinforcement learning. More specifically, an actor-critic reinforcement learning system is used for stem word extraction. In the reinforcement learning system, the strategy network is used as an actor for stem word extraction; the classification network is used as a critic to classify sentences. The existing sentence sample library can be used as a training prediction to train the classification network, thereby avoiding the labor cost of the main word tagging. The classification network after preliminary training can classify the sentences composed of the main words extracted by the strategy network, and thus evaluate the effect of main word extraction. By setting losses for both the output results of the strategy network and the classification network, and repeatedly training the strategy network and the classification network according to the total loss, an ideal reinforcement learning system can be obtained. In this way, an ideal network system can be trained without manual labeling of the main words, and the effective extraction of main words can be realized.

下面結合圖式，對本說明書提供的方案進行描述。如前所述，在多種文本分析場景中，都需要對句子的主幹詞進行提取。為了能夠自動地進行主幹詞提取，在一種方案中，可以通過有監督的機器學習方法來訓練主幹詞提取模型。根據常規的監督學習方法，為了訓練這樣的主幹詞提取模型，就需要大量人工標注的標注資料，這些標注資料需要對句子中各個詞是否為主幹詞進行標注，人工成本很大。根據本說明書實施例的構思，採用強化學習的方式進行主幹詞提取，降低人工標注成本，優化主幹詞提取效果。如本領域技術人員所知，強化學習是一種基於序列行為的回饋，進行的無標注的學習策略的方法。一般地，強化學習系統包括智慧體和執行環境，智慧體通過與執行環境的交流和回饋，不斷進行學習，優化其策略。具體而言，智慧體觀察並獲得執行環境的狀態(state)，根據一定策略，針對當前執行環境的狀態確定要採取的行為或動作(action)。這樣的行為作用於執行環境，會改變執行環境的狀態，同時產生一個回饋給智慧體，該回饋又稱為獎勵分數(reward)。智慧體根據獲得的獎勵分數來判斷，之前的行為是否正確，策略是否需要調整，進而更新其策略。通過反覆不斷地觀察狀態、確定行為、收到回饋，使得智慧體可以不斷更新策略，最終目標是能夠學習到一個策略，使得獲得的獎勵分數累積最大化。存在多種演算法來進行智慧體中策略的學習和優化，其中Actor-Critic方法是用於強化學習的一種策略梯度方法。圖1示出採用Actor-Critic方式的深度強化學習系統的示意圖。如圖1所示，系統包括作為actor的策略模型和作為critic的評估模型。策略模型從環境獲得環境狀態s，根據一定策略，輸出在當前環境狀態下要採取的動作a。評估模型獲取上述環境狀態s，以及策略模型輸出的動作a，對策略模型在狀態s下採取動作a的本次決策進行評分，並將該評分回饋給策略模型。策略模型根據評估模型的評分來調整策略，以期獲得更高的評分。也就是說，策略模型訓練的目標是，獲得評估模型的盡可能高的評分。另一態樣，評估模型也會不斷調整其評分方式，使得評分更好的反映環境回饋的獎勵分數r的累積。如此，反覆訓練評估模型和策略模型，使得評估模型的評分越來越準確，越來越接近環境回饋的獎勵，於是，策略模型採取的策略也越來越優化合理，得到更多的環境的獎勵。基於以上的特點，根據本說明書的實施例，通過採用Actor-Critic方式的強化學習系統進行主幹詞提取。圖2為本說明書披露的一個實施例的強化學習系統示意圖。如圖2所示，用於主幹詞提取的強化學習系統包括策略網路100和分類網路200。策略網路100用於從句子中提取主幹詞，它對應於圖1所示的策略模型，作用為Actor；分類網路200用於對句子進行分類，它對應於圖1所示的評估模型，作用為Critic。策略網路100和分類網路200均為神經網路。為了對策略網路100和分類網路200進行訓練，可以採用具有句子分類標籤的樣本句子。在訓練過程中，將樣本句子(對應於環境狀態s)輸入到策略網路100。通過一定策略，策略網路100從該樣本句子中提取出若干主幹詞，形成主幹詞集合(相當於採取的一個動作a)，該主幹詞集合可以對應於一個主幹句。分類網路200獲取主幹詞集合，並對該主幹詞集合對應的主幹句子進行分類，得到分類結果。通過比對該分類結果與原始樣本句子的分類標籤，來評估該主幹詞集合提取得是否正確。可以分別為策略網路100的主幹詞提取過程和分類網路200的分類過程設置損失(圖中的損失1和損失2)，基於該損失反覆訓練策略網路100和分類網路200，使得損失更小，分類更准。如此訓練得到的策略網路100，就可以用於對待分析的句子進行主幹詞提取。下面描述以上系統的訓練過程和處理過程。圖3示出根據一個實施例的訓練用於主幹詞提取的強化學習系統的方法流程圖。可以理解，該方法可以通過任何具有計算、處理能力的裝置、設備、平臺、設備群集來執行。如圖3所示，該方法包括：步驟31，利用句子樣本集，訓練用於句子分類的分類網路；步驟32，利用當前策略參數組下的策略網路，對句子樣本集中的第一樣本句子進行主幹詞提取，獲得第一主幹詞集合，並根據所述第一樣本句子中的詞語數目和所述第一主幹詞集合中的詞語數目，確定當前的第一損失；步驟33，利用所述分類網路對由所述第一主幹詞集合構成的第一備選句子進行分類處理，獲得所述第一備選句子的第一分類結果，並根據所述第一分類結果以及所述第一樣本句子的分類標籤，確定當前的第二損失；步驟34，根據當前的第一損失和當前的第二損失，確定當前的總損失；步驟35，在總損失減小的方向，至少更新所述策略網路，以用於從待分析句子中提取主幹詞。下面描述以上各個步驟的具體執行方式。如以上結合圖2所述，策略網路100用於從句子中提取主幹詞，分類網路200用於對句子進行分類，進而評估策略網路提取主幹詞的品質。這兩個神經網路互相交流，需要反覆進行訓練，才可以獲得理想的網路參數。為了促進模型儘快收斂，在第一階段，單獨訓練分類網路200，使其可以實現基本的句子分類。因此，首先，在步驟31，利用句子樣本集，訓練用於句子分類的分類網路。句子分類，或稱為文本分類，是文本分析中的常見任務，因此，已經存在大量豐富的樣本語料，可以用於進行分類訓練。因此，在步驟31，可以從已有語料庫中獲取一些句子樣本，構成句子樣本集，這裡的句子樣本包括原始句子，以及為該原始句子添加的分類標籤。利用這樣的具有分類標籤的句子樣本構成的句子樣本集，就可以訓練句子分類網路。訓練的方式可以採用經典的監督訓練的方式進行。如此，通過步驟31，可以得到初步訓練的分類網路，該分類網路可以用於對句子進行分類。在此基礎上，就可以利用上述分類網路對策略網路進行評估，從而訓練強化學習系統。具體地，在步驟32，利用當前策略參數組下的策略網路，對句子樣本集中的任意的一個樣本句子，下文稱為第一樣本句子，進行主幹詞提取，獲得對應的主幹詞集合，稱為第一主幹詞集合。可以理解，初始地，策略網路中的策略參數可以是隨機初始化的；隨著策略網路的訓練，策略參數會不斷調整和更新。當前的策略參數組可以是初始狀態下隨機的參數組，也可以是訓練過程中，某一狀態下的策略參數。策略網路的一群組原則參數可以認為對應於一種策略。相應的，在步驟32，策略網路根據當前策略，對輸入的第一樣本句子進行處理，從中提取出主幹詞。在一個實施例中，策略網路可以包括多個網路層，通過該多個網路層實現主幹詞提取。圖4示出根據一個實施例的策略網路的結構示意圖。如圖4所示，策略網路100可以包括，嵌入層110，第一處理層120和第二處理層130。嵌入層110獲得樣本句子，對於句子中的各個詞，計算其詞嵌入向量。例如，對於第一樣本句子，將其分詞後可以得到詞序列{W₁ ,W₂ ,…,W_n }，其中包括n個詞。嵌入層針對每個詞W_i 計算對應的詞嵌入向量E_i ，於是得到{E₁ ,E₂ ,…,E_n }。第一處理層120根據以上的詞嵌入向量，確定各個詞作為主幹詞的機率。例如，對於n個詞的詞嵌入向量{E₁ ,E₂ ,…,E_n }，確定各個詞的作為主幹詞的機率{P₁ ，P₂ ,…,P_n }。第二處理層130根據上述機率，從各個詞中選擇至少一部分詞，作為主幹詞，構成主幹詞集合。在一個實施例中，預先設置一個機率閾值。第二處理層從各個詞中，選出機率大於上述閾值的詞，作為主幹詞。以上嵌入層110、第一處理層120和第二處理層130中各層網路參數的整體，構成策略參數。在一個實施例中，策略網路100採用循環神經網路RNN。更具體的，可以通過RNN實現以上的嵌入層110，從而在進行各個詞的詞嵌入時，考慮詞的時序影響。第一處理層120和第二處理層130可以通過全連接處理層實現。在其他實施例中，策略網路100也可以採用不同的神經網路架構，例如基於RNN改進的長短期記憶LSTM神經網路，GRU神經網路，或者深度神經網路DNN，等等。通過以上的策略網路，可以對樣本句子進行主幹詞提取。例如，對於第一樣本句子中的n個詞，策略網路通過當前策略，從中選擇了m個詞(m＜=n)作為主幹詞，這m個主幹詞表示為{w₁ ,w₂ ,…,w_m }。如此，得到主幹詞集合。在得到主幹詞集合的基礎上，可以通過一個損失函數，下文稱為第一損失函數，衡量主幹詞提取過程的損失，下文稱為第一損失，記為LK(Loss_Keyword)。也就是，在步驟32，在獲得第一主幹詞集合的基礎上，根據第一樣本句子中的詞語數目和第一主幹詞集合中的詞語數目，確定當前的第一損失。在一個實施例中，第一損失函數被設定為，提取的主幹詞的數目越少，損失值越低；主幹詞數目越多，損失值越高。在一個實施例中，還可以根據提取的主幹詞相對於樣本句子的占比來確定第一損失，占比越高，損失值越大，占比越小，損失值越低。這都是考慮到，希望訓練完成的理想狀態下，策略網路100可以從原始句子中排除儘量多的無用詞，保留盡可能少的詞作為主幹詞。例如，在一個例子中，第一損失函數可以設置為： LK=Num_Reserve/Num_Total 其中，Num_Reserve為作為主幹詞保留下來的詞語數目，即主幹詞集合中的詞語數目，Num_Total為樣本句子中的詞語數目。在以上例子中，假定第一樣本句子中包含n個詞，策略網路通過當前策略，從中選擇了m個詞，那麼當前的第一損失為LK=m/n。接下來，在步驟33，利用分類網路對由第一主幹詞集合構成的第一備選句子進行分類處理，獲得第一備選句子的第一分類結果。可以理解，通過步驟31的初步訓練，確定出了分類網路的初步分類參數，這樣的分類網路可以用於對句子進行分類。此外，在步驟32，策略網路100可以輸出針對第一樣本句子提取的第一主幹詞集合，該第一主幹詞集合可以對應一個備選句子，即第一備選句子。該第一備選句子可以理解為，對第一樣本句子排除停用詞、無意義詞，僅保留主幹詞後得到的句子。相應的，在步驟33，可以用分類網路對該第一備選句子進行分類處理，得到分類結果。在一個實施例中，分類網路可以包括多個網路層，通過該多個網路層實現句子分類。圖5示出根據一個實施例的分類網路的結構示意圖。如圖5所示，分類網路200可以包括，嵌入層210，全連接處理層220。嵌入層210獲得策略網路100輸出的主幹詞集合，對於各個詞，計算其詞嵌入向量，進而計算出該主幹詞集合所構成的備選句子的句子嵌入向量。例如，對於第一主幹詞集合{w₁ ,w₂ ,…,w_m }，可以分別計算各個詞的詞嵌入向量{e₁ ,e₂ ,…,e_m }，然後基於各個詞嵌入向量，得到第一備選句子的句子嵌入向量Es。在不同實施例中，句子嵌入向量可以通過對各個詞嵌入向量進行拼接、平均等運算而得到。接著，全連接處理層220根據以上的句子嵌入向量Es，確定第一備選句子的分類結果，即第一分類結果。以上嵌入層210和全連接處理層220中各層網路參數的整體，構成分類參數。與策略網路100類似的，分類網路200可以採用循環神經網路RNN來實現。更具體的，可以通過RNN實現以上的嵌入層210。在其他實施例中，分類網路200也可以採用不同的神經網路架構，例如LSTM神經網路，GRU神經網路，或者深度神經網路DNN，等等。在對備選句子進行分類後，可以通過另一個損失函數，下文稱為第二損失函數，衡量分類過程的損失，下文稱為第二損失，記為LC(Loss_Classify)。也就是，在步驟33，在獲得第一分類結果的基礎上，根據該第一分類結果以及第一樣本句子的分類標籤，確定當前的第二損失。在一個實施例中，第二損失函數被設定為，基於交叉熵演算法確定第二損失LC。在其他實施例中，也可以通過其他形式和其他演算法的損失函數，基於分類結果和分類標籤之間的差異，確定出第二損失LC。相應的，通過上述第二損失函數，基於本次分類得到的第一分類結果，以及第一樣本句子對應的分類標籤之間的比對，可以確定出本次分類的分類損失，即當前的第二損失。在確定出第一損失和第二損失的基礎上，在步驟34，根據當前的第一損失和當前的第二損失，確定當前的總損失。總損失可以理解為，整個強化學習系統的損失，包括策略網路提取主幹詞過程的損失，和分類網路進行分類過程的損失。在一個實施例中，總損失定義為，上述第一損失和第二損失的加和。在另一實施例中，還可以為第一損失和第二損失各自賦予一定權重，將總損失定義為，第一損失和第二損失的加權求和。根據總損失的定義方式，基於本次提取主幹詞對應的當前的第一損失，以及本次分類對應的當前的第二損失，可以確定出當前的總損失。基於這樣的總損失，就可以對強化學習系統進行訓練，訓練的目標是使得總損失盡可能小。根據以上第一損失、第二損失和總損失的定義方式，可以理解，總損失盡可能小意味著，在策略網路100排除儘量多的無用詞、提取儘量少的主幹詞的同時，不改變句子的含義，因而分類網路200的句子分類結果儘量接近原始句子的分類標籤。為了達到總損失減小的目的，在步驟35，在總損失減小的方向，更新強化學習系統。更新強化學習系統至少包括，更新策略網路100，還可以包括，更新分類網路200。以上總損失減小的方向的確定方式，以及強化學習系統的更新方式，在不同訓練方式下、不同訓練階段中可以有所不同，下面分別進行描述。根據一種訓練方式，為了確定出總損失減小的方向，在策略網路100中用不同策略分別處理多個樣本句子，得出對應的多個主幹詞句子，以及對應的多個第一損失；然後利用分類網路200對各個主幹詞句子進行分類，得出對應的多個分類結果，以及對應的多個第二損失。於是，得到對多個樣本句子進行處理的多個總損失。比較當前損失與多個總損失，將多個總損失中比當前損失小的總損失所對應的網路參數相對於當前網路參數的梯度，確定為總損失減小的方向。根據另一種訓練方式，為了確定出總損失減小的方向，對同一樣本句子進行多次處理得到多個總損失，基於這樣的多個總損失，確定總損失減小的方向。圖6示出在該訓練方式下確定總損失減小方向的步驟流程圖。為了探索出更多更好的策略，在策略網路100中，可以在當前策略的基礎上加入一定隨機性而產生N個策略，這N個策略對應於N群組原則參數。結合圖4所示的網路結構，可以對嵌入層的嵌入演算法加入隨機擾動，得到新的策略；可以對第一處理層中確定主幹詞機率的演算法進行變動，得到新的策略；也可以對機率選擇的規則演算法，例如對機率閾值進行變動，得到新的策略。通過以上各種變動方式的組合，可以得到N種策略，對應於N群組原則參數。相應的，在步驟61，分別利用上述N群組原則參數下的策略網路處理第一樣本句子，獲得對應的N個主幹詞集合。並且，可以根據如前所述的第一損失函數，分別確定出N個第一損失。然後，在步驟62，利用分類網路200，對所述N個主幹詞集合分別對應的N個備選句子進行分類處理，獲得N個分類結果。並且，根據前述的第二損失函數，分別確定N個分類結果對應的N個第二損失。在步驟63，根據N個第一損失和N個第二損失，確定對應的N個總損失，記為L1，L2，…,Ln。並且，還可以確定出上述N個總損失的均值La。在步驟64，確定損失值小於等於均值的至少一個第一總損失，以及損失值大於均值的至少一個第二總損失。換而言之，將上述N個總損失劃分為，小於等於均值La的總損失，稱為第一總損失，以及大於均值La的總損失，稱為第二總損失。在步驟65，基於上述第一總損失和第二總損失，確定總損失減小的方向。更具體而言，上述第一總損失由於損失較小，可以對應於正向學習的方向，上述第二總損失由於損失較大，可以對應於負向學習的方向。因此，在步驟65，綜合正向學習的方向，和負向學習方向的反方向，可以得到總的學習方向，即總損失減小的方向。對於以上的訓練方式，在不同的訓練階段，也可以有不同的具體執行方式。如前所述，在整個強化學習系統訓練的第一階段，單獨訓練分類網路，如步驟31所示。為了加速模型的收斂，在一個實施例，在接下來的第二階段，固定住上述分類網路，僅訓練和更新策略網路；然後，在第三階段，同時訓練更新策略網路和分類網路。下面分別描述第二階段和第三階段，圖6流程的執行方式。具體的，在第二階段中，分類網路被固定，也就是分類網路中的分類參數不變，不進行調整。那麼相應的，在圖6的步驟62中，是利用同一組分類參數下的分類網路，對前述N個備選句子進行分類處理，也就是基於同樣的分類方式進行分類，得到了所述N個分類結果。由於分類參數不變，在這樣的情況下，步驟63中確定的N個總損失，實際上對應於策略網路的N個策略，進而對應於N群組原則參數。也就是，第i個總損失Li，對應於第i群組原則參數PSi。然後在步驟64，在確定出第一總損失和第二總損失的基礎上，確定出第一總損失對應的第一策略參數，以及第二總損失對應的第二策略參數。換而言之，如果總損失Li小於等於均值La，則將該總損失歸為第一總損失，相應的策略參數組PSi則歸為第一策略參數；如果總損失Li大於均值La，則將該總損失歸為第二總損失，相應的策略參數組PSi則歸為第二策略參數。接下來，在步驟65，通過以下方式確定總損失減小的方向：確定至少一組第一策略參數相對於當前策略參數的梯度的累積，作為正方向；確定至少一組第二策略參數相對於當前策略參數的梯度的累積，作為負方向；將所述正方向與所述負方向的相反方向疊加，作為總損失減小的方向。這是由於，第一策略參數對應於損失值小於等於平均值的總損失，或者說，損失值較小的總損失，因此，可以認為第一策略參數所對應的策略選擇方向是正確的，是系統學習的“正樣本”，應該進行正向學習；而第二策略參數對應於損失值大於平均值的總損失，是損失值較大的總損失，因此，可以認為第二策略參數所對應的策略選擇方向是錯誤的，是系統學習的“負樣本”，應該進行反向學習。一般地，損失值小於等於平均值的第一總損失可以是多個，相應的，第一策略參數可以是多組第一策略參數。該多組第一策略參數有可能對樣本句子不同位置的主幹詞提取有不同的效果，因此，在一個實施例中，對該多組第一策略參數均進行正向學習，確定各組第一策略參數相對於當前策略參數的梯度，將其進行累積，得到上述正方向。相應的，第二策略參數也可以是多組第二策略參數。在一個實施例中，對該多組第二策略參數均進行負向學習，確定各組第二策略參數相對於當前策略參數的梯度，將其進行累積，得到上述負方向。最後，將負方向取反，與正方向進行疊加，作為總損失減小的方向。以上總損失減小的方向可以表示為：

其中，PSi為第一策略參數，PSj為第二策略參數，

為當前策略參數。在一個具體例子中，假定N=10，其中L1-L6小於損失均值，為第一總損失，相應的策略參數組PS1-PS6為第一策略參數；假定L7-L10大於損失均值，為第二總損失，相應的策略參數組PS7-PS10為第二策略參數。在一個實施例中，分別計算PS1-PS5這6群組原則參數相對於當前策略參數的梯度，將其進行累積，得到上述正方向；分別計算PS7-PS10這4群組原則參數相對於當前策略參數的梯度，將其進行累積，得到上述負方向，進而得到總損失減小的方向。如此，在系統訓練的第二階段的一個實施例中，通過以上方式確定出總損失減小的方向。於是，在圖3的步驟35，在總損失減小的方向，更新策略網路100中的當前策略參數組。通過不斷執行以上過程，在分類網路200的分類方式不變的情況下，探索更多的主幹詞提取策略，並不斷更新、優化策略網路100中的策略參數，從而針對性的訓練策略網路100。在策略網路的訓練達到一定訓練目標之後，強化學習系統的訓練可以進入第三階段，同時訓練和更新策略網路100和分類網路200。下面描述在第三階段，圖6的執行方式。在第三階段，在步驟61，仍然利用N組不同的策略參數下的策略網路處理第一樣本句子，獲得對應的N個主幹詞集合，這N個主幹詞集合可以對應於N個備選句子。然而，不同的是，在第三階段中，分類網路不固定，也就是說，分類網路中的分類參數也可以進行調整。那麼相應的，在步驟62中，是利用M組不同的分類參數下的分類網路，對步驟61得到的N個備選句子進行分類處理，得到N個備選句子對應的N個分類結果。其中，M＜=N。在M=N的情況下，相當於，對於N個備選句子，分別採用了M=N種不同的分類方法(對應於N組分類參數)進行分類；在M＜N的情況下，相當於，對前述N個備選句子進行分類所採用的分類參數，不完全相同。接著在步驟63，根據N個第一損失和N個第二損失，確定對應的N個總損失。需要理解的是，在以上得到N個分類結果的過程中，策略網路和分類網路的網路參數均發生了變化。此時，N個總損失對應於N個參數集，其中第i參數集Si包括第i群組原則參數PSi，和處理第i備選句子時分類網路對應的分類參數CSi。換而言之，上述參數集是策略網路100和分類網路200的網路參數的整體集合。此外，與前述類似的，可以確定出N個總損失的均值La。然後，在步驟64，將上述N個總損失劃分為，小於等於均值La的第一總損失，以及大於均值La的第二總損失。並且，在確定出第一總損失和第二總損失的基礎上，可以相應確定出第一總損失對應的第一參數集，以及第二總損失對應的第二參數集。換而言之，如果總損失Li小於等於均值La，則將該總損失歸為第一總損失，相應的參數集Si則歸為第一參數集；如果總損失Li大於均值La，則將該總損失歸為第二總損失，相應的參數集Si則歸為第二參數集。接下來，在步驟65，通過以下方式確定總損失減小的方向：確定至少一組第一參數集相對於當前策略參數的梯度的累積，作為第一正方向；確定至少一組第二參數集相對於當前策略參數的梯度的累積，作為第一負方向；將第一正方向與第一負方向的相反方向疊加，作為第一調整方向，即策略參數優化方向；確定至少一組第一參數集相對於當前分類參數的梯度的累積，作為第二正方向；確定至少一組第二參數集相對於當前分類參數的梯度的累積，作為第二負方向；將第二正方向與第二負方向的相反方向疊加，作為第二調整方向，即分類參數優化方向。以上確定總損失減小的方向，也就是參數調整方向的構思與第二階段相同，也就是，將損失值較小的總損失所對應的參數集，即第一參數集，作為系統學習的“正樣本”，進行正向學習；將損失值較大的總損失所對應的參數集，即第二參數集，作為系統學習的“負樣本”，進行反向學習。學習時，對於策略網路和分類網路，分別確定各自對應的策略參數和分類參數的調整優化方向。具體的，對於策略網路的策略參數，其調整方向的確定與第二階段類似，只是計算梯度時，是計算整個參數集相對於當前策略參數的梯度。一般的，參數集中策略參數和分類參數是兩套相互獨立的參數，因此，實際梯度運算中，仍然是通過計算參數集中的策略參數部分相對於當前策略參數的梯度，獲得前述的第一正方向和第一負方向，進而確定出第一調整方向，即策略參數優化方向。以上第一調整方向可以表示為：

其中，Si為第一參數集，Sj為第二參數集，

為當前策略參數。對於分類網路中的分類參數，其調整方向的確定與策略參數相似，具體地，計算第一參數集相對於當前分類參數的梯度的累積，作為第二正方向；計算第二參數集相對於當前分類參數的梯度的累積，作為第二負方向；將第二正方向與第二負方向的相反方向疊加，作為分類優化方向。如前所述，由於策略參數和分類參數通常相互獨立，在實際梯度運算中，可以通過計算各個參數集中的分類參數部分相對於當前分類參數的梯度，獲得前述的第二正方向和第二負方向，進而確定出第二調整方向，作為分類參數優化方向。以上第二調整方向可以表示為：

其中，Si為第一參數集，Sj為第二參數集，

為當前分類參數。於是，可以將第一調整方向和第二調整方向的總和，作為總損失減小的方向，即整個系統的調整方向。如此，在系統訓練的第三階段的一個實施例中，通過以上方式確定出總損失減小的方向。於是，在圖3的步驟35，在總損失減小的方向，更新強化學習系統包括，按照上述第一調整方向，更新策略網路100中的當前策略參數，按照上述第二調整方向，更新分類網路中的當前分類參數。如此，在第三階段，同時訓練策略網路和分類網路。可以理解，儘管以上實施例中描述了在第一階段單獨訓練分類網路之後，在第二階段固定住分類網路，單獨訓練策略網路，然後在第三階段，同時訓練策略網路和分類網路的訓練過程，但是，在其他實施例中，也可以在第一階段之後，跳過第二階段而直接進入第三階段，同時訓練策略網路和分類網路。通過不斷訓練策略網路和分類網路，可以探索、確定出更優化的主幹詞提取策略以及分類演算法，不斷優化整個強化學習系統，使得系統的總損失不斷減小，實現訓練目標。在達成訓練目標的情況下，策略網路可以準確地提取出儘量少的主幹詞，以使得句子表達更加精煉，同時不影響句子的含義，也就是不影響該句子的語義分類結果。在實現訓練目標的情況下，就可以將訓練得到的策略網路用於主幹詞提取。在這樣的情況下，可以將待分析的句子輸入給策略網路，策略網路利用訓練得到的策略參數，對該句子進行處理。根據策略網路的輸出，就可以確定該句子中的主幹詞。這些主幹詞的集合可以對應於一個主幹句子，用於後續的意圖識別，語義匹配等進一步文本分析，優化後續文本分析的效果。綜合以上，通過強化學習的方式，進行主幹詞提取的學習和訓練。在強化學習系統中，策略網路作為actor，用於主幹詞提取；分類網路作為critic，用於對句子進行分類。可以利用現有的句子樣本庫作為訓練預料訓練分類網路，從而避免主幹詞標注的人工成本。經過初步訓練的分類網路即可對策略網路提取的主幹詞構成的句子進行分類，如此評估主幹詞提取的效果。通過對策略網路和分類網路的輸出結果均設置損失，根據總損失反覆訓練策略網路和分類網路，可以得到理想的強化學習系統。如此，可以在無需主幹詞人工標注的情況下，訓練得到理想的網路系統，實現對主幹詞的有效提取。根據另一態樣的實施例，還提供一種通過強化學習提取主幹詞的裝置。該裝置可以部署在任何具有計算、處理能力的設備或平臺上。圖7示出根據一個實施例的裝置示意圖。如圖7所示，該裝置700包括：分類網路訓練單元71，配置為利用句子樣本集，訓練用於句子分類的分類網路；第一確定單元72，配置為利用當前策略參數下的策略網路，對所述句子樣本集中的第一樣本句子進行主幹詞提取，獲得第一主幹詞集合，並根據所述第一樣本句子中的詞語數目和所述第一主幹詞集合中的詞語數目，確定當前的第一損失；第二確定單元73，配置為利用所述分類網路對由所述第一主幹詞集合構成的第一備選句子進行分類處理，獲得所述第一備選句子的第一分類結果，並根據所述第一分類結果以及所述第一樣本句子的分類標籤，確定當前的第二損失；總損失確定單元74，配置為根據所述當前的第一損失和當前的第二損失，確定當前的總損失；更新單元75，配置為在總損失減小的方向，至少更新所述策略網路，以用於從待分析句子中提取主幹詞。在一個實施例中，策略網路包括第一嵌入層，第一處理層和第二處理層。第一確定單元72具體配置為：在所述第一嵌入層，獲得所述第一樣本句子中的各個詞的詞嵌入向量；在所述第一處理層，根據所述詞嵌入向量，確定所述各個詞作為主幹詞的機率；在所述第二處理層，至少根據所述機率，從所述各個詞中選擇至少一部分詞，構成所述第一主幹詞集合。進一步的，在一個實施例中，在所述第二處理層，從所述各個詞中選擇機率值大於預設閾值的詞，構成所述第一主幹詞集合。在一個實施例中，分類網路包括第二嵌入層和第三處理層，第二確定單元73具體配置為：在所述第二嵌入層，獲得所述第一備選句子對應的句子嵌入向量；在所述第三處理層，根據所述句子嵌入向量，確定所述第一備選句子的第一分類結果。根據一種實施方式，所述策略網路和/或所述分類網路基於循環神經網路RNN。在一種實施方式中，第一確定單元72還配置為，分別利用N群組原則參數下的所述策略網路處理所述第一樣本句子，獲得對應的N個主幹詞集合，並分別確定N個第一損失；所述第二確定單元73還配置為，利用所述分類網路，對所述N個主幹詞集合分別對應的N個備選句子進行分類處理，獲得N個分類結果，並分別確定N個第二損失；所述總損失確定單元74還配置為，根據N個第一損失和N個第二損失，確定對應的N個總損失，以及所述N個總損失的均值；以及，確定損失值小於等於所述均值的至少一個第一總損失，以及損失值大於所述均值的至少一個第二總損失。此外，更新單元75包括方向確定模組751以及更新模組752。其中，方向確定模組751配置為，基於所述至少一個第一總損失和所述至少一個第二總損失，確定所述總損失減小的方向；更新模組752配置為，根據方向確定模組751確定的方向，執行網路更新。更具體的，在一個實施例中，第二確定單元73配置為：利用同一組分類參數下的所述分類網路，對所述N個備選句子分別進行分類處理，獲得所述N個分類結果；在這樣的情況下，所述N個總損失對應於所述N群組原則參數；如此，所述方向確定模組751配置為：確定所述至少一個第一總損失對應的至少一組第一策略參數相對於當前策略參數的梯度的累積，作為正方向；確定所述至少一個第二總損失對應的至少一組第二策略參數相對於當前策略參數的梯度的累積，作為負方向；將所述正方向與所述負方向的相反方向疊加，作為所述總損失減小的方向。與此相應的，在一個實施例中，更新模組752配置為：在所述總損失減小的方向，更新所述策略網路中的當前策略參數。在另一實施例中，第二確定單元73配置為：利用M組分類參數下的所述分類網路，對所述N個備選句子進行分類處理，獲得所述N個備選句子對應的N個分類結果，其中M＜=N；在這樣的情況下，所述N個總損失對應於N個參數集，其中第i參數集包括第i群組原則參數，和處理第i備選句子時所述分類網路對應的分類參數；此時，所述方向確定模組751配置為：確定所述至少一個第一總損失對應的至少一組第一參數集相對於當前策略參數的梯度的累積，作為第一正方向；確定所述至少一個第二總損失對應的至少一組第二參數集相對於當前策略參數的梯度的累積，作為第一負方向；將所述第一正方向與所述第一負方向的相反方向疊加，作為第一調整方向；確定所述至少一個第一總損失對應的至少一組第一參數集相對於當前分類參數的梯度的累積，作為第二正方向；確定所述至少一個第二總損失對應的至少一組第二參數集相對於當前分類參數的梯度的累積，作為第二負方向；將所述第二正方向與所述第二負方向的相反方向疊加，作為第二調整方向；將所述第一調整方向和第二調整方向的總和作為所述總損失減小的方向。與此相應的，在一個實施例中，所述更新模組752配置為：在所述第一調整方向，更新所述策略網路的當前策略參數；在所述第二調整方向，更新所述分類網路的當前分類參數。根據一種實施方式，所述裝置700還包括預測單元(未示出)，配置為：將待分析的第二句子輸入所述策略網路；根據所述策略網路的輸出，確定所述第二句子中的主幹詞。通過以上裝置，利用深度強化學習系統，實現主幹詞的提取。根據另一態樣的實施例，還提供一種電腦可讀儲存媒體，其上儲存有電腦程式，當所述電腦程式在電腦中執行時，令電腦執行結合圖2和圖4所描述的方法。根據再一態樣的實施例，還提供一種計算設備，包括記憶體和處理器，所述記憶體中儲存有可執行程式碼，所述處理器執行所述可執行程式碼時，實現結合圖3和圖6所述的方法。本領域技術人員應該可以意識到，在上述一個或多個示例中，本發明所描述的功能可以用硬體、軟體、韌體或它們的任意組合來實現。當使用軟體實現時，可以將這些功能儲存在電腦可讀媒體中或者作為電腦可讀媒體上的一個或多個指令或程式碼進行傳輸。以上所述的具體實施方式，對本發明的目的、技術方案和有益效果進行了進一步詳細說明，所應理解的是，以上所述僅為本發明的具體實施方式而已，並不用於限定本發明的保護範圍，凡在本發明的技術方案的基礎之上，所做的任何修改、等同替換、改進等，均應包括在本發明的保護範圍之內。The following describes the solutions provided in this specification in conjunction with the drawings. As mentioned earlier, in a variety of text analysis scenarios, it is necessary to extract the main words of sentences. In order to be able to automatically extract the stem words, in a scheme, a supervised machine learning method can be used to train the stem word extraction model. According to conventional supervised learning methods, in order to train such a stem word extraction model, a large amount of manually labeled annotation data is needed. These annotation data need to label each word in a sentence whether it is a stem word, and the labor cost is high. According to the concept of the embodiment of this specification, the method of reinforcement learning is adopted to extract the main words, which reduces the cost of manual labeling and optimizes the effect of the main words extraction. As those skilled in the art know, reinforcement learning is a method of unlabeled learning strategy based on the feedback of sequence behavior. Generally, a reinforcement learning system includes a smart body and an execution environment. The smart body continuously learns and optimizes its strategy through communication and feedback with the execution environment. Specifically, the intelligent body observes and obtains the state of the execution environment, and according to a certain strategy, determines the action or action to be taken for the current state of the execution environment. Such behavior acts on the execution environment, will change the state of the execution environment, and at the same time generate a feedback to the intelligent body, which is also called reward. The intelligent body judges whether the previous behavior is correct according to the reward score obtained, and whether the strategy needs to be adjusted, and then updates its strategy. By repeatedly observing the state, determining the behavior, and receiving feedback, the intelligent body can continuously update the strategy. The ultimate goal is to learn a strategy that maximizes the accumulation of reward points. There are a variety of algorithms to learn and optimize strategies in the intelligent body, and the Actor-Critic method is a strategy gradient method for reinforcement learning. Figure 1 shows a schematic diagram of a deep reinforcement learning system using the Actor-Critic method. As shown in Figure 1, the system includes a strategy model as an actor and an evaluation model as a critic. The strategy model obtains the environment state s from the environment, and outputs the action a to be taken in the current environment state according to a certain strategy. The evaluation model obtains the above-mentioned environmental state s and the action a output by the strategy model, scores this decision of the strategy model taking action a in the state s, and feeds the score back to the strategy model. The strategy model adjusts the strategy according to the score of the evaluation model in order to obtain a higher score. In other words, the goal of strategy model training is to obtain the highest possible score for the evaluation model. In another aspect, the evaluation model will continuously adjust its scoring method so that the scoring better reflects the accumulation of the reward score r of environmental feedback. In this way, repeated training of the evaluation model and strategy model makes the evaluation model's score more and more accurate, getting closer and closer to the rewards of environmental feedback, so the strategies adopted by the strategy model are more and more optimized and reasonable, and more environmental rewards are obtained. . Based on the above characteristics, according to the embodiment of this specification, the main word extraction is performed by the reinforcement learning system adopting the Actor-Critic method. Fig. 2 is a schematic diagram of a reinforcement learning system according to an embodiment disclosed in this specification. As shown in FIG. 2, the reinforcement learning system used for stem word extraction includes a strategy network 100 and a classification network 200. The strategy network 100 is used to extract the main words from the sentence, which corresponds to the strategy model shown in Fig. 1 and functions as an Actor; the classification network 200 is used to classify sentences, which corresponds to the evaluation model shown in Fig. 1, The role is Critic. Both the strategy network 100 and the classification network 200 are neural networks. In order to train the strategy network 100 and the classification network 200, sample sentences with sentence classification tags can be used. During the training process, sample sentences (corresponding to the environment state s) are input to the policy network 100. Through a certain strategy, the strategy network 100 extracts a number of stem words from the sample sentence to form a stem word set (equivalent to an action a taken), and the stem word set can correspond to a stem sentence. The classification network 200 obtains the main word set, and classifies the main sentence corresponding to the main word set to obtain the classification result. By comparing the classification result with the classification label of the original sample sentence, it is evaluated whether the main word set is extracted correctly. The loss (loss 1 and loss 2 in the figure) can be set for the main word extraction process of the strategy network 100 and the classification process of the classification network 200 respectively. Based on the loss, the strategy network 100 and the classification network 200 are repeatedly trained to make the loss Smaller, more accurate classification. The strategy network 100 thus trained can be used to extract the main words of the sentence to be analyzed. The training process and processing process of the above system are described below. Fig. 3 shows a flowchart of a method for training a reinforcement learning system for stem word extraction according to an embodiment. It can be understood that the method can be executed by any device, device, platform, or device cluster with computing and processing capabilities. As shown in Figure 3, the method includes: step 31, using a sentence sample set to train a classification network for sentence classification; step 32, using the strategy network under the current strategy parameter set to perform the same The main word extraction of this sentence is performed to obtain the first main word set, and the current first loss is determined according to the number of words in the first sample sentence and the number of words in the first main word set; step 33, Use the classification network to classify the first candidate sentence formed by the first stem word set to obtain the first classification result of the first candidate sentence, and according to the first classification result and the State the classification label of the first sample sentence, determine the current second loss; step 34, determine the current total loss based on the current first loss and the current second loss; step 35, in the direction of the total loss reduction, At least the strategy network is updated to extract the main words from the sentence to be analyzed. The following describes the specific implementation of the above steps. As described above in conjunction with FIG. 2, the strategy network 100 is used to extract main words from a sentence, and the classification network 200 is used to classify sentences, and then evaluate the quality of the main words extracted by the strategy network. These two neural networks communicate with each other and require repeated training to obtain ideal network parameters. In order to promote the model to converge as soon as possible, in the first stage, the classification network 200 is trained separately so that it can realize basic sentence classification. Therefore, first, in step 31, use sentence sample sets to train a classification network for sentence classification. Sentence classification, or text classification, is a common task in text analysis. Therefore, there are already a large number of rich sample corpora that can be used for classification training. Therefore, in step 31, some sentence samples can be obtained from the existing corpus to form a sentence sample set, where the sentence samples include the original sentence and the classification label added to the original sentence. Using such a sentence sample set composed of sentence samples with classification labels, the sentence classification network can be trained. The training method can be carried out by the classic supervised training method. In this way, through step 31, a preliminary trained classification network can be obtained, and the classification network can be used to classify sentences. On this basis, the above classification network can be used to evaluate the strategy network to train the reinforcement learning system. Specifically, in step 32, use the strategy network under the current strategy parameter group to extract the main word from any sample sentence in the sentence sample set, which is referred to as the first sample sentence hereinafter, to obtain the corresponding main word set. It is called the first trunk word set. It can be understood that initially, the strategy parameters in the strategy network can be initialized randomly; as the strategy network is trained, the strategy parameters will be continuously adjusted and updated. The current strategy parameter group can be a random parameter group in the initial state, or it can be a strategy parameter in a certain state during the training process. A group of policy parameters of the strategy network can be considered to correspond to a strategy. Correspondingly, in step 32, the strategy network processes the input first sample sentence according to the current strategy, and extracts the main word from it. In an embodiment, the policy network may include multiple network layers, and the main word extraction is realized through the multiple network layers. Fig. 4 shows a schematic structural diagram of a policy network according to an embodiment. As shown in FIG. 4, the policy network 100 may include an embedded layer 110, a first processing layer 120, and a second processing layer 130. The embedding layer 110 obtains a sample sentence, and calculates its word embedding vector for each word in the sentence. For example, for the first sample sentence, the word sequence {W ₁ , W ₂ ,..., W _n } can be obtained after word segmentation, which includes n words. Calculates the corresponding word embedded layer embedded vector E _i for each word W _i, thereby obtaining _{_{{E 1, E 2, ...}} , E n}. The first processing layer 120 determines the probability of each word as a backbone word according to the above word embedding vector. For example, for a word embedding vector {E ₁ , E ₂ ,..., En} of _n words, determine the probability {P ₁ , P ₂ ,..., P _n } of each word as a backbone word. The second processing layer 130 selects at least a part of the words from each word according to the aforementioned probability as the main word to form a main word set. In one embodiment, a probability threshold is preset. The second processing layer selects words with a probability greater than the above threshold from each word as the main word. The entirety of the network parameters of each layer in the above embedded layer 110, the first processing layer 120, and the second processing layer 130 constitutes a strategy parameter. In one embodiment, the strategy network 100 uses a recurrent neural network RNN. More specifically, the above embedding layer 110 can be implemented by RNN, so that when embedding each word, the timing effect of the word is considered. The first processing layer 120 and the second processing layer 130 may be implemented by a fully connected processing layer. In other embodiments, the strategy network 100 may also adopt different neural network architectures, such as an improved long and short-term memory LSTM neural network based on RNN, a GRU neural network, or a deep neural network DNN, etc. Through the above strategy network, the main words of the sample sentences can be extracted. For example, for the n words in the first sample sentence, the strategy network selects m words (m<=n) as the stem words through the current strategy, and these m stem words are represented as {w ₁ ,w ₂ ,…,W _m }. In this way, the main word set is obtained. On the basis of obtaining the set of trunk words, a loss function can be used, which is called the first loss function below, to measure the loss of the trunk word extraction process, which is called the first loss below, and is denoted as LK(Loss_Keyword). That is, in step 32, on the basis of obtaining the first stem word set, the current first loss is determined according to the number of words in the first sample sentence and the number of words in the first stem word set. In one embodiment, the first loss function is set such that the smaller the number of extracted stem words, the lower the loss value; the more the number of stem words, the higher the loss value. In an embodiment, the first loss can also be determined according to the proportion of the extracted stem words relative to the sample sentence. The higher the proportion, the greater the loss value, and the smaller the proportion, the lower the loss value. This is all considered. In the ideal state where the training is expected to be completed, the strategy network 100 can exclude as many useless words as possible from the original sentence and retain as few words as possible as the main words. For example, in an example, the first loss function can be set as: LK=Num_Reserve/Num_Total where Num_Reserve is the number of words retained as the backbone word, that is, the number of words in the trunk word set, and Num_Total is the number of words in the sample sentence . In the above example, assuming that the first sample sentence contains n words and the strategy network selects m words from the current strategy, the current first loss is LK=m/n. Next, in step 33, the classification network is used to classify the first candidate sentence formed by the first stem word set to obtain the first classification result of the first candidate sentence. It can be understood that through the preliminary training in step 31, the preliminary classification parameters of the classification network are determined, and such a classification network can be used to classify sentences. In addition, in step 32, the policy network 100 may output the first set of stem words extracted for the first sample sentence, and the first set of stem words may correspond to a candidate sentence, that is, the first candidate sentence. The first candidate sentence can be understood as a sentence obtained after excluding stop words and meaningless words from the first sample sentence, and only retaining the main word. Correspondingly, in step 33, the classification network can be used to classify the first candidate sentence to obtain the classification result. In one embodiment, the classification network may include multiple network layers, and sentence classification is implemented through the multiple network layers. Fig. 5 shows a schematic structural diagram of a classification network according to an embodiment. As shown in FIG. 5, the classification network 200 may include an embedded layer 210 and a fully connected processing layer 220. The embedding layer 210 obtains the stem word set output by the strategy network 100, calculates the word embedding vector for each word, and then calculates the sentence embedding vector of the candidate sentence formed by the stem word set. For example, for the first set of stem words {w ₁ ,w ₂ ,...,w _m }, the word embedding vector {e ₁ ,e ₂ ,...,e _m } of each word can be calculated separately, and then based on each word embedding vector, Obtain the sentence embedding vector Es of the first candidate sentence. In different embodiments, the sentence embedding vector can be obtained by performing operations such as splicing and averaging on each word embedding vector. Next, the fully connected processing layer 220 determines the classification result of the first candidate sentence according to the above sentence embedding vector Es, that is, the first classification result. The entirety of the network parameters of each layer in the above embedded layer 210 and the fully connected processing layer 220 constitutes a classification parameter. Similar to the strategy network 100, the classification network 200 can be implemented using a recurrent neural network RNN. More specifically, the above embedding layer 210 can be implemented by RNN. In other embodiments, the classification network 200 may also adopt different neural network architectures, such as LSTM neural network, GRU neural network, or deep neural network DNN, etc. After classifying the candidate sentences, another loss function, referred to as the second loss function below, can be used to measure the loss of the classification process, referred to as the second loss below, denoted as LC (Loss_Classify). That is, in step 33, on the basis of obtaining the first classification result, the current second loss is determined according to the first classification result and the classification label of the first sample sentence. In one embodiment, the second loss function is set to determine the second loss LC based on a cross-entropy algorithm. In other embodiments, the second loss LC may also be determined based on the difference between the classification result and the classification label through loss functions of other forms and other algorithms. Correspondingly, through the above-mentioned second loss function, based on the comparison between the first classification result obtained in this classification and the classification labels corresponding to the first sample sentence, the classification loss of this classification can be determined, that is, the current The second loss. After the first loss and the second loss are determined, in step 34, the current total loss is determined according to the current first loss and the current second loss. The total loss can be understood as the loss of the entire reinforcement learning system, including the loss of the process of extracting the main words of the strategy network, and the loss of the classification process of the classification network. In one embodiment, the total loss is defined as the sum of the above-mentioned first loss and second loss. In another embodiment, a certain weight may be assigned to the first loss and the second loss, and the total loss is defined as the weighted sum of the first loss and the second loss. According to the definition of the total loss, the current total loss can be determined based on the current first loss corresponding to the main word extracted this time, and the current second loss corresponding to this classification. Based on this total loss, the reinforcement learning system can be trained. The goal of training is to make the total loss as small as possible. Based on the above definitions of the first loss, second loss and total loss, it can be understood that the total loss as small as possible means that the strategy network 100 eliminates as many useless words as possible and extracts as few main words as possible without changing The meaning of the sentence, so the sentence classification result of the classification network 200 is as close as possible to the classification label of the original sentence. In order to achieve the goal of reducing the total loss, in step 35, the reinforcement learning system is updated in the direction of the total loss reduction. Updating the reinforcement learning system includes at least updating the strategy network 100 and may also include updating the classification network 200. The method for determining the direction of the above total loss reduction and the updating method of the reinforcement learning system may be different in different training methods and different training stages, which are described separately below. According to a training method, in order to determine the direction of the total loss reduction, different strategies are used in the strategy network 100 to process multiple sample sentences respectively to obtain multiple corresponding stem word sentences and multiple corresponding first losses; Then, the classification network 200 is used to classify each stem word sentence, and multiple corresponding classification results and multiple corresponding second losses are obtained. Thus, multiple total losses for processing multiple sample sentences are obtained. The current loss is compared with multiple total losses, and the gradient of the network parameter corresponding to the total loss smaller than the current loss in the multiple total losses relative to the current network parameter is determined as the direction in which the total loss decreases. According to another training method, in order to determine the direction of total loss reduction, multiple processing of the same sample sentence is performed to obtain multiple total losses. Based on such multiple total losses, the direction of total loss reduction is determined. Fig. 6 shows a flowchart of steps for determining the direction of total loss reduction in this training mode. In order to explore more and better strategies, in the strategy network 100, a certain randomness can be added to the current strategy to generate N strategies, and these N strategies correspond to N group principle parameters. Combined with the network structure shown in Figure 4, random disturbances can be added to the embedding algorithm of the embedding layer to obtain a new strategy; the algorithm for determining the probability of the main word in the first processing layer can be changed to obtain a new strategy; also The rule algorithm for probability selection, such as changing the probability threshold, can be used to obtain a new strategy. Through the combination of the above various changes, N strategies can be obtained, corresponding to the N group principle parameters. Correspondingly, in step 61, the first sample sentence is processed by using the strategy network under the above-mentioned N group principle parameters to obtain the corresponding set of N main words. Moreover, N first losses can be determined respectively according to the first loss function as described above. Then, in step 62, the classification network 200 is used to classify the N candidate sentences respectively corresponding to the N trunk word sets to obtain N classification results. In addition, according to the aforementioned second loss function, N second losses corresponding to the N classification results are respectively determined. In step 63, according to the N first losses and N second losses, the corresponding N total losses are determined, denoted as L1, L2, ..., Ln. In addition, the average value La of the above N total losses can also be determined. In step 64, at least one first total loss whose loss value is less than or equal to the mean value and at least one second total loss whose loss value is greater than the mean value are determined. In other words, the above-mentioned N total losses are divided into total losses less than or equal to the average value La, called the first total loss, and total losses greater than the average value La, called the second total loss. In step 65, based on the above-mentioned first total loss and second total loss, the direction in which the total loss is reduced is determined. More specifically, the first total loss may correspond to the direction of positive learning because the loss is small, and the second total loss may correspond to the direction of negative learning because of the large loss. Therefore, in step 65, the direction of positive learning and the opposite direction of the negative learning direction are combined to obtain the total learning direction, that is, the direction in which the total loss is reduced. For the above training methods, there can also be different specific execution methods in different training stages. As mentioned earlier, in the first stage of the training of the entire reinforcement learning system, the classification network is trained separately, as shown in step 31. In order to accelerate the convergence of the model, in one embodiment, in the next second stage, the above classification network is fixed, and only the strategy network is trained and updated; then, in the third stage, the update strategy network and the classification network are trained at the same time road. The following describes the second stage and the third stage, the execution method of the process in Figure 6. Specifically, in the second stage, the classification network is fixed, that is, the classification parameters in the classification network remain unchanged and no adjustment is made. Correspondingly, in step 62 in Figure 6, the classification network under the same set of classification parameters is used to classify the aforementioned N candidate sentences, that is, based on the same classification method, to obtain the N Classification results. Since the classification parameters are unchanged, in this case, the N total losses determined in step 63 actually correspond to the N strategies of the strategy network, and thus correspond to the N group principle parameters. That is, the i-th total loss Li corresponds to the i-th group principle parameter PSi. Then in step 64, on the basis of determining the first total loss and the second total loss, the first strategy parameter corresponding to the first total loss and the second strategy parameter corresponding to the second total loss are determined. In other words, if the total loss Li is less than or equal to the mean value La, the total loss is classified as the first total loss, and the corresponding strategy parameter group PSi is classified as the first strategy parameter; if the total loss Li is greater than the mean value La, the total loss is classified as the first total loss. The total loss is classified as the second total loss, and the corresponding strategy parameter group PSi is classified as the second strategy parameter. Next, in step 65, the direction of total loss reduction is determined by: determining the accumulation of the gradient of at least one set of first strategy parameters relative to the current strategy parameter as the positive direction; determining the relative direction of at least one set of second strategy parameters The accumulation of the gradient of the current strategy parameter is taken as the negative direction; the positive direction and the opposite direction of the negative direction are superimposed as the direction in which the total loss decreases. This is because the first strategy parameter corresponds to the total loss whose loss value is less than or equal to the average value, or the total loss whose loss value is smaller. Therefore, it can be considered that the strategy selection direction corresponding to the first strategy parameter is correct. The “positive sample” learned by the system should be forward learning; and the second strategy parameter corresponds to the total loss whose loss value is greater than the average value, which is the total loss whose loss value is larger. Therefore, it can be considered that the second strategy parameter corresponds to The direction of strategy selection is wrong, it is a "negative sample" of system learning, and reverse learning should be performed. Generally, there may be multiple first total losses whose loss value is less than or equal to the average value. Correspondingly, the first strategy parameter may be multiple sets of first strategy parameters. The multiple sets of first strategy parameters may have different effects on the extraction of stem words at different positions of the sample sentence. Therefore, in one embodiment, the multiple sets of first strategy parameters are all forward-learned to determine the first strategy parameters in each group. The gradient of the strategy parameter relative to the current strategy parameter is accumulated to obtain the above-mentioned positive direction. Correspondingly, the second strategy parameter may also be multiple sets of second strategy parameters. In one embodiment, negative learning is performed on the multiple sets of second strategy parameters, the gradient of each set of second strategy parameters with respect to the current strategy parameter is determined, and the gradients are accumulated to obtain the aforementioned negative direction. Finally, the negative direction is reversed and superimposed with the positive direction as the direction in which the total loss is reduced. The above direction of total loss reduction can be expressed as:

Among them, PSi is the first strategy parameter, PSj is the second strategy parameter,

Is the current strategy parameter. In a specific example, assume that N=10, where L1-L6 is less than the average loss, which is the first total loss, and the corresponding strategy parameter group PS1-PS6 is the first strategy parameter; assuming that L7-L10 is greater than the average loss, it is the second For the total loss, the corresponding strategy parameter group PS7-PS10 is the second strategy parameter. In one embodiment, the gradients of the 6 group principle parameters of PS1-PS5 relative to the current strategy parameters are calculated respectively, and they are accumulated to obtain the above-mentioned positive direction; the 4 group principle parameters of PS7-PS10 are calculated separately relative to the current strategy The gradient of the parameter is accumulated to obtain the above-mentioned negative direction, and then the direction in which the total loss decreases. In this way, in an embodiment of the second stage of system training, the direction in which the total loss is reduced is determined by the above method. Therefore, in step 35 of FIG. 3, the current policy parameter set in the policy network 100 is updated in the direction in which the total loss decreases. Through the continuous implementation of the above process, under the condition that the classification method of the classification network 200 remains unchanged, explore more main word extraction strategies, and continuously update and optimize the strategy parameters in the strategy network 100, so as to train the strategy network in a targeted manner Road 100. After the training of the strategy network reaches a certain training goal, the training of the reinforcement learning system can enter the third stage, and the strategy network 100 and the classification network 200 are trained and updated at the same time. The following describes the implementation of Figure 6 in the third stage. In the third stage, in step 61, the strategy network under N sets of different strategy parameters is still used to process the first sample sentence to obtain the corresponding N trunk word sets. These N trunk word sets can correspond to N backups. Choose a sentence. However, the difference is that in the third stage, the classification network is not fixed, that is, the classification parameters in the classification network can also be adjusted. Then correspondingly, in step 62, the classification network under M groups of different classification parameters is used to classify the N candidate sentences obtained in step 61 to obtain N classification results corresponding to the N candidate sentences. Among them, M<=N. In the case of M=N, it is equivalent to using M=N different classification methods (corresponding to N groups of classification parameters) for classification of N candidate sentences; in the case of M<N, it is equivalent to , The classification parameters used to classify the aforementioned N candidate sentences are not completely the same. Next, in step 63, the corresponding N total losses are determined according to the N first losses and N second losses. It should be understood that in the process of obtaining N classification results above, the network parameters of the strategy network and the classification network have changed. At this time, N total losses correspond to N parameter sets, where the i-th parameter set Si includes the i-th group principle parameter PSi, and the classification parameter CSi corresponding to the classification network when processing the i-th candidate sentence. In other words, the aforementioned parameter set is an overall set of network parameters of the policy network 100 and the classification network 200. In addition, similar to the foregoing, the average value La of N total losses can be determined. Then, in step 64, the aforementioned N total losses are divided into a first total loss less than or equal to the average value La, and a second total loss greater than the average value La. Furthermore, on the basis of determining the first total loss and the second total loss, the first parameter set corresponding to the first total loss and the second parameter set corresponding to the second total loss can be determined accordingly. In other words, if the total loss Li is less than or equal to the average value La, the total loss is classified as the first total loss, and the corresponding parameter set Si is classified as the first parameter set; if the total loss Li is greater than the average La, then the total loss The total loss is classified as the second total loss, and the corresponding parameter set Si is classified as the second parameter set. Next, in step 65, the direction of the total loss reduction is determined by: determining the accumulation of the gradient of at least one set of first parameter sets with respect to the current strategy parameters as the first positive direction; determining at least one set of second parameters The accumulation of the gradient relative to the current strategy parameter is taken as the first negative direction; the first positive direction and the opposite direction of the first negative direction are superimposed as the first adjustment direction, that is, the strategy parameter optimization direction; determine at least one set of first parameters The accumulation of the gradient of the set relative to the current classification parameter is regarded as the second positive direction; the accumulation of the gradient of at least one set of second parameter sets relative to the current classification parameter is determined as the second negative direction; the second positive direction and the second negative The opposite directions of the directions are superimposed as the second adjustment direction, that is, the direction of optimizing classification parameters. The above concept of determining the direction of total loss reduction, that is, the direction of parameter adjustment, is the same as in the second stage, that is, the parameter set corresponding to the total loss with the smaller loss value, that is, the first parameter set, is used as the system learning " Positive sample" for positive learning; the parameter set corresponding to the total loss with a larger loss value, that is, the second parameter set, is used as the "negative sample" of system learning for reverse learning. When learning, for the strategy network and classification network, respectively determine the adjustment and optimization directions of the corresponding strategy parameters and classification parameters. Specifically, for the strategy parameters of the strategy network, the determination of the adjustment direction is similar to the second stage, except that when calculating the gradient, the gradient of the entire parameter set relative to the current strategy parameter is calculated. Generally, the parameter set strategy parameter and the classification parameter are two sets of independent parameters. Therefore, in the actual gradient calculation, the gradient of the strategy parameter part in the parameter set relative to the current strategy parameter is still calculated to obtain the aforementioned first positive direction. And the first negative direction to determine the first adjustment direction, that is, the strategy parameter optimization direction. The above first adjustment direction can be expressed as:

Among them, Si is the first parameter set, Sj is the second parameter set,

Is the current strategy parameter. For the classification parameters in the classification network, the determination of the adjustment direction is similar to that of the strategy parameters. Specifically, the accumulation of the gradient of the first parameter set relative to the current classification parameter is calculated as the second positive direction; the second parameter set is calculated relative to The accumulation of the gradient of the current classification parameter is used as the second negative direction; the second positive direction and the opposite direction of the second negative direction are superimposed as the classification optimization direction. As mentioned above, since the strategy parameters and the classification parameters are usually independent of each other, in the actual gradient calculation, the aforementioned second positive direction and second negative direction can be obtained by calculating the gradient of the classification parameter part in each parameter set relative to the current classification parameter. Direction, and then determine the second adjustment direction as the optimization direction of the classification parameters. The above second adjustment direction can be expressed as:

Among them, Si is the first parameter set, Sj is the second parameter set,

Is the current classification parameter. Therefore, the sum of the first adjustment direction and the second adjustment direction can be used as the direction of total loss reduction, that is, the adjustment direction of the entire system. In this way, in an embodiment of the third stage of system training, the direction in which the total loss is reduced is determined by the above method. Therefore, in step 35 of FIG. 3, updating the reinforcement learning system in the direction in which the total loss is reduced includes updating the current strategy parameters in the strategy network 100 according to the above-mentioned first adjustment direction, and updating the classification according to the above-mentioned second adjustment direction. The current classification parameters in the network. In this way, in the third stage, both the strategy network and the classification network are trained. It can be understood that although the above embodiment describes that after the classification network is trained separately in the first stage, the classification network is fixed in the second stage, the strategy network is trained separately, and then in the third stage, the strategy network and the classification are trained simultaneously The training process of the network, however, in other embodiments, after the first stage, the second stage can be skipped and the third stage is directly entered, and the strategy network and the classification network are trained at the same time. Through continuous training of the strategy network and classification network, more optimized stem word extraction strategies and classification algorithms can be explored and determined, and the entire reinforcement learning system can be continuously optimized, so that the total loss of the system is continuously reduced and the training goal is achieved. In the case of achieving the training goal, the strategy network can accurately extract as few main words as possible to make the sentence expression more refined without affecting the meaning of the sentence, that is, the semantic classification result of the sentence. In the case of achieving the training goal, the trained strategy network can be used for stem word extraction. In this case, the sentence to be analyzed can be input to the strategy network, and the strategy network uses the strategy parameters obtained by training to process the sentence. According to the output of the strategy network, the main word in the sentence can be determined. The set of these main words can correspond to a main sentence, which is used for further text analysis such as subsequent intent recognition and semantic matching to optimize the effect of subsequent text analysis. Based on the above, the learning and training of the extraction of main words are carried out by means of reinforcement learning. In the reinforcement learning system, the strategy network is used as an actor for stem word extraction; the classification network is used as a critic to classify sentences. The existing sentence sample library can be used as a training prediction to train the classification network, thereby avoiding the labor cost of the main word tagging. The classification network after preliminary training can classify the sentences composed of the main words extracted by the strategy network, and thus evaluate the effect of main word extraction. By setting losses for both the output results of the strategy network and the classification network, and repeatedly training the strategy network and the classification network according to the total loss, an ideal reinforcement learning system can be obtained. In this way, an ideal network system can be trained without manual labeling of the main words, and the effective extraction of main words can be realized. According to another aspect of the embodiment, an apparatus for extracting main words through reinforcement learning is also provided. The device can be deployed on any equipment or platform with computing and processing capabilities. Fig. 7 shows a schematic diagram of an apparatus according to an embodiment. As shown in FIG. 7, the device 700 includes: a classification network training unit 71, configured to use a sentence sample set to train a classification network for sentence classification; a first determining unit 72, configured to use a strategy under current strategy parameters The network extracts the stem words of the first sample sentence in the sentence sample set to obtain the first stem word set, and according to the number of words in the first sample sentence and the number of the first stem word set The number of words to determine the current first loss; the second determining unit 73 is configured to use the classification network to classify the first candidate sentence formed by the first set of stem words to obtain the first backup Select the first classification result of the sentence, and determine the current second loss according to the first classification result and the classification label of the first sample sentence; the total loss determination unit 74 is configured to determine the current second loss according to the current first The loss and the current second loss determine the current total loss; the update unit 75 is configured to update at least the strategy network in the direction in which the total loss decreases, so as to extract the main words from the sentence to be analyzed. In one embodiment, the policy network includes a first embedding layer, a first processing layer, and a second processing layer. The first determining unit 72 is specifically configured to: at the first embedding layer, obtain the word embedding vector of each word in the first sample sentence; at the first processing layer, determine according to the word embedding vector The probability of each of the words as the main word; in the second processing layer, at least a part of the words is selected from the various words at least according to the probability to form the first set of main words. Further, in one embodiment, in the second processing layer, a word with a probability value greater than a preset threshold is selected from the various words to form the first main word set. In one embodiment, the classification network includes a second embedding layer and a third processing layer, and the second determining unit 73 is specifically configured to: in the second embedding layer, obtain a sentence embedding vector corresponding to the first candidate sentence ; In the third processing layer, determine the first classification result of the first candidate sentence according to the sentence embedding vector. According to one embodiment, the strategy network and/or the classification network are based on a recurrent neural network RNN. In an embodiment, the first determining unit 72 is further configured to process the first sample sentence by using the strategy network under the N group principle parameters to obtain the corresponding N trunk word sets, and respectively determine N first losses; the second determining unit 73 is further configured to use the classification network to classify the N candidate sentences corresponding to the N trunk word sets to obtain N classification results, And respectively determine N second losses; the total loss determining unit 74 is further configured to determine the corresponding N total losses and the average value of the N total losses according to the N first losses and N second losses And, determining at least one first total loss whose loss value is less than or equal to the average value, and at least one second total loss whose loss value is greater than the average value. In addition, the update unit 75 includes a direction determination module 751 and an update module 752. Wherein, the direction determining module 751 is configured to determine the direction in which the total loss is reduced based on the at least one first total loss and the at least one second total loss; the updating module 752 is configured to determine the mode according to the direction The group 751 determines the direction to perform network update. More specifically, in one embodiment, the second determining unit 73 is configured to: use the classification network under the same set of classification parameters to perform classification processing on the N candidate sentences respectively to obtain the N classifications Result; In this case, the N total losses correspond to the N group principle parameters; In this way, the direction determining module 751 is configured to: determine at least one group corresponding to the at least one first total loss The accumulation of the gradient of the first strategy parameter with respect to the current strategy parameter is taken as the positive direction; the accumulation of the gradient of the at least one set of second strategy parameters corresponding to the at least one second total loss with respect to the current strategy parameter is determined as the negative direction; The direction opposite to the positive direction and the negative direction is superimposed as the direction in which the total loss decreases. Correspondingly, in one embodiment, the update module 752 is configured to update the current strategy parameters in the strategy network in the direction in which the total loss decreases. In another embodiment, the second determining unit 73 is configured to: use the classification network under the M groups of classification parameters to classify the N candidate sentences to obtain the corresponding N candidate sentences N classification results, where M<=N; in this case, the N total losses correspond to N parameter sets, where the i-th parameter set includes the i-th group principle parameter, and the i-th candidate sentence is processed Is the classification parameter corresponding to the classification network; at this time, the direction determination module 751 is configured to: determine the gradient of the at least one first parameter set corresponding to the at least one first total loss with respect to the current strategy parameter Accumulation, as the first positive direction; determining the accumulation of the gradient of the at least one second parameter set corresponding to the at least one second total loss with respect to the current strategy parameter, as the first negative direction; and comparing the first positive direction with The superposition of the opposite directions of the first negative direction is used as the first adjustment direction; the accumulation of the gradient of the at least one first parameter set corresponding to the at least one first total loss relative to the current classification parameter is determined as the second positive direction ; Determine the accumulation of the gradient of the at least one set of second parameter sets corresponding to the at least one second total loss with respect to the current classification parameters, as the second negative direction; compare the second positive direction and the second negative direction The opposite directions are superimposed as the second adjustment direction; the sum of the first adjustment direction and the second adjustment direction is taken as the direction in which the total loss is reduced. Correspondingly, in one embodiment, the update module 752 is configured to: update the current strategy parameters of the strategy network in the first adjustment direction; update the current strategy parameters in the second adjustment direction The current classification parameters of the classification network. According to an embodiment, the device 700 further includes a prediction unit (not shown) configured to: input a second sentence to be analyzed into the strategy network; determine the second sentence according to the output of the strategy network The main word in the sentence. Through the above devices, the use of deep reinforcement learning system to achieve the extraction of main words. According to another aspect of the embodiment, there is also provided a computer-readable storage medium on which a computer program is stored. When the computer program is executed in the computer, the computer is caused to execute the method described in conjunction with FIG. 2 and FIG. 4. According to still another aspect of the embodiment, there is also provided a computing device, including a memory and a processor, the memory stores executable program code, and when the processor executes the executable program code, a combination diagram is realized 3 and the method described in Figure 6. Those skilled in the art should be aware that in one or more of the above examples, the functions described in the present invention can be implemented by hardware, software, firmware, or any combination thereof. When implemented by software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or program codes on the computer-readable medium. The specific embodiments described above further describe the purpose, technical solutions and beneficial effects of the present invention in further detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit the scope of the present invention. The protection scope, any modification, equivalent replacement, improvement, etc. made on the basis of the technical solution of the present invention shall be included in the protection scope of the present invention.

31:步驟 32:步驟 33:步驟 34:步驟 35:步驟 61:步驟 62:步驟 63:步驟 64:步驟 65:步驟 71:分類網路訓練單元 72:第一確定單元 73:第二確定單元 74:總損失確定單元 75:更新單元 100:策略網路 110:嵌入層 120:第一處理層 130:第二處理層 200:分類網路 210:嵌入層 220:全連接處理層 751:方向確定模組 752:更新模組31: steps 32: steps 33: steps 34: steps 35: steps 61: Step 62: steps 63: steps 64: steps 65: steps 71: Classification network training unit 72: The first determination unit 73: The second determination unit 74: Total loss determination unit 75: update unit 100: Strategic Network 110: Embedded layer 120: first processing layer 130: second processing layer 200: Classification network 210: Embedded layer 220: Fully connected processing layer 751: Direction Determination Module 752: update module

為了更清楚地說明本發明實施例的技術方案，下面將對實施例描述中所需要使用的圖式作簡單地介紹，顯而易見地，下面描述中的圖式僅僅是本發明的一些實施例，對於本領域普通技術人員來講，在不付出進步性勞動的前提下，還可以根據這些圖式獲得其它的圖式。圖1示出採用Actor-Critic方式的深度強化學習系統的示意圖；圖2為本說明書披露的一個實施例的強化學習系統示意圖；圖3示出根據一個實施例的訓練用於主幹詞提取的強化學習系統的方法流程圖；圖4示出根據一個實施例的策略網路的結構示意圖；圖5示出根據一個實施例的分類網路的結構示意圖；圖6示出在一種訓練方式下確定總損失減小方向的步驟流程圖；圖7示出根據一個實施例的裝置示意圖。In order to explain the technical solutions of the embodiments of the present invention more clearly, the following will briefly introduce the drawings that need to be used in the description of the embodiments. Obviously, the drawings in the following description are only some embodiments of the present invention. Those of ordinary skill in the art can also obtain other schemas based on these schemas without making progressive labor. Figure 1 shows a schematic diagram of a deep reinforcement learning system using Actor-Critic mode; Figure 2 is a schematic diagram of a reinforcement learning system according to an embodiment disclosed in this specification; Fig. 3 shows a flowchart of a method for training a reinforcement learning system for stem word extraction according to an embodiment; Figure 4 shows a schematic structural diagram of a policy network according to an embodiment; Figure 5 shows a schematic structural diagram of a classification network according to an embodiment; Figure 6 shows a flow chart of steps for determining the direction of total loss reduction in a training mode; Fig. 7 shows a schematic diagram of an apparatus according to an embodiment.

100:策略網路 100: Strategic Network

200:分類網路 200: Classification network

Claims

一種通過強化學習提取主幹詞的方法，包括：利用句子樣本集，訓練用於句子分類的分類網路；利用當前策略參數下的策略網路，對所述句子樣本集中的第一樣本句子進行主幹詞提取，獲得第一主幹詞集合，並根據所述第一樣本句子中的詞語數目和所述第一主幹詞集合中的詞語數目，確定當前的第一損失；利用所述分類網路對由所述第一主幹詞集合構成的第一備選句子進行分類處理，獲得所述第一備選句子的第一分類結果，並根據所述第一分類結果以及所述第一樣本句子的分類標籤，確定當前的第二損失；根據所述當前的第一損失和當前的第二損失，確定當前的總損失；在總損失減小的方向，至少更新所述策略網路，以用於從待分析句子中提取主幹詞。A method for extracting main words through reinforcement learning, including: Use sentence sample sets to train a classification network for sentence classification; Use the strategy network under the current strategy parameters to extract the stem words of the first sample sentence in the sentence sample set to obtain the first stem word set, and according to the number of words in the first sample sentence and the The number of words in the first stem word set determines the current first loss; Use the classification network to classify the first candidate sentence formed by the first stem word set to obtain the first classification result of the first candidate sentence, and according to the first classification result and the State the classification label of the first sample sentence and determine the current second loss; Determine the current total loss according to the current first loss and the current second loss; In the direction where the total loss is reduced, at least the strategy network is updated to extract the main words from the sentence to be analyzed.

根據申請專利範圍第1項所述的方法，其中，所述策略網路包括第一嵌入層，第一處理層和第二處理層，所述利用策略網路對所述句子樣本集中的第一樣本句子進行主幹詞提取包括：在所述第一嵌入層，獲得所述第一樣本句子中的各個詞的詞嵌入向量；在所述第一處理層，根據所述詞嵌入向量，確定所述各個詞作為主幹詞的機率；在所述第二處理層，至少根據所述機率，從所述各個詞中選擇至少一部分詞，構成所述第一主幹詞集合。The method according to item 1 of the scope of patent application, wherein the strategy network includes a first embedding layer, a first processing layer and a second processing layer, and the strategy network is used to analyze the first set of sentence samples The main word extraction of sample sentences includes: In the first embedding layer, obtaining the word embedding vector of each word in the first sample sentence; In the first processing layer, according to the word embedding vector, determine the probability of each word as a backbone word; In the second processing layer, at least a part of the words is selected from the various words at least according to the probability to form the first trunk word set.

根據申請專利範圍第2項所述的方法，其中，在所述第二處理層，從所述各個詞中選擇機率值大於預設閾值的詞，構成所述第一主幹詞集合。The method according to item 2 of the scope of patent application, wherein, in the second processing layer, words with a probability value greater than a preset threshold are selected from the various words to form the first trunk word set.

根據申請專利範圍第1項所述的方法，其中，所述分類網路包括第二嵌入層和第三處理層，所述利用所述分類網路對由所述第一主幹詞集合構成的第一備選句子進行分類處理包括：在所述第二嵌入層，獲得所述第一備選句子對應的句子嵌入向量；在所述第三處理層，根據所述句子嵌入向量，確定所述第一備選句子的第一分類結果。The method according to item 1 of the scope of patent application, wherein the classification network includes a second embedding layer and a third processing layer, and the classification network is used to compare the first set of stem words The classification processing of a candidate sentence includes: In the second embedding layer, obtain the sentence embedding vector corresponding to the first candidate sentence; In the third processing layer, the first classification result of the first candidate sentence is determined according to the sentence embedding vector.

根據申請專利範圍第1項所述的方法，其中，所述策略網路和/或所述分類網路基於循環神經網路RNN。The method according to item 1 of the scope of patent application, wherein the strategy network and/or the classification network are based on a recurrent neural network RNN.

根據申請專利範圍第1項所述的方法，還包括：分別利用N群組原則參數下的所述策略網路處理所述第一樣本句子，獲得對應的N個主幹詞集合，並分別確定N個第一損失；利用所述分類網路，對所述N個主幹詞集合分別對應的N個備選句子進行分類處理，獲得N個分類結果，並分別確定N個第二損失；根據N個第一損失和N個第二損失，確定對應的N個總損失，以及所述N個總損失的均值；確定損失值小於等於所述均值的至少一個第一總損失，以及損失值大於所述均值的至少一個第二總損失；基於所述至少一個第一總損失和所述至少一個第二總損失，確定所述總損失減小的方向。According to the method described in item 1 of the scope of patent application, it also includes: Processing the first sample sentence by using the strategy network under the N group principle parameters to obtain corresponding N trunk word sets, and respectively determining N first losses; Use the classification network to classify the N candidate sentences corresponding to the N trunk word sets, obtain N classification results, and determine N second losses respectively; Determine the corresponding N total losses and the average value of the N total losses according to the N first losses and N second losses; Determine at least one first total loss whose loss value is less than or equal to the average value, and at least one second total loss whose loss value is greater than the average value; Based on the at least one first total loss and the at least one second total loss, a direction in which the total loss decreases is determined.

根據申請專利範圍第6項所述的方法，其中，利用所述分類網路，對所述N個主幹詞集合分別對應的N個備選句子分別進行分類處理，獲得N個分類結果包括：利用同一組分類參數下的所述分類網路，對所述N個備選句子分別進行分類處理，獲得所述N個分類結果；其中，所述N個總損失對應於所述N群組原則參數；基於所述至少一個第一總損失和所述至少一個第二總損失，確定所述總損失減小的方向，包括：確定所述至少一個第一總損失對應的至少一組第一策略參數相對於當前策略參數的梯度的累積，作為正方向；確定所述至少一個第二總損失對應的至少一組第二策略參數相對於當前策略參數的梯度的累積，作為負方向；將所述正方向與所述負方向的相反方向疊加，作為所述總損失減小的方向。The method according to item 6 of the scope of patent application, wherein the classification network is used to classify the N candidate sentences corresponding to the N trunk word sets respectively, and obtaining N classification results includes: The classification network under the same set of classification parameters performs classification processing on the N candidate sentences respectively to obtain the N classification results; Wherein, the N total losses correspond to the N group principle parameters; Based on the at least one first total loss and the at least one second total loss, determining the direction in which the total loss is reduced includes: Determining the accumulation of gradients of at least one group of first strategy parameters corresponding to the at least one first total loss with respect to the current strategy parameters as a positive direction; Determining the accumulation of gradients of at least one set of second strategy parameters corresponding to the at least one second total loss with respect to the current strategy parameters as a negative direction; The direction opposite to the positive direction and the negative direction is superimposed as the direction in which the total loss decreases.

根據申請專利範圍第7項所述的方法，其中，所述在總損失減小的方向，至少更新所述策略網路包括：在所述總損失減小的方向，更新所述策略網路中的當前策略參數。The method according to item 7 of the scope of patent application, wherein, in the direction of reducing the total loss, at least updating the strategy network includes: In the direction in which the total loss decreases, the current strategy parameters in the strategy network are updated.

根據申請專利範圍第6項所述的方法，其中，利用所述分類網路，對所述N個主幹詞集合分別對應的N個備選句子分別進行分類處理，獲得N個分類結果包括：利用M組分類參數下的所述分類網路，對所述N個備選句子進行分類處理，獲得所述N個備選句子對應的N個分類結果，其中M＜=N；其中，所述N個總損失對應於N個參數集，其中第i參數集包括第i群組原則參數，和處理第i備選句子時所述分類網路對應的分類參數；所述確定所述總損失減小的方向包括：確定所述至少一個第一總損失對應的至少一組第一參數集相對於當前策略參數的梯度的累積，作為第一正方向；確定所述至少一個第二總損失對應的至少一組第二參數集相對於當前策略參數的梯度的累積，作為第一負方向；將所述第一正方向與所述第一負方向的相反方向疊加，作為第一調整方向；確定所述至少一個第一總損失對應的至少一組第一參數集相對於當前分類參數的梯度的累積，作為第二正方向；確定所述至少一個第二總損失對應的至少一組第二參數集相對於當前分類參數的梯度的累積，作為第二負方向；將所述第二正方向與所述第二負方向的相反方向疊加，作為第二調整方向；將所述第一調整方向和第二調整方向的總和作為所述總損失減小的方向。The method according to item 6 of the scope of patent application, wherein the classification network is used to classify the N candidate sentences corresponding to the N trunk word sets respectively, and obtaining N classification results includes: The classification network under M groups of classification parameters performs classification processing on the N candidate sentences, and obtains N classification results corresponding to the N candidate sentences, where M<=N; Wherein, the N total losses correspond to N parameter sets, wherein the i-th parameter set includes the i-th group principle parameter and the classification parameter corresponding to the classification network when processing the i-th candidate sentence; The determining the direction of the total loss reduction includes: Determining the accumulation of the gradient of the at least one first parameter set corresponding to the at least one first total loss relative to the current strategy parameter as the first positive direction; Determining the accumulation of the gradient of the at least one second parameter set corresponding to the at least one second total loss relative to the current strategy parameter as the first negative direction; Superimposing the first positive direction and the opposite direction of the first negative direction as the first adjustment direction; Determining the accumulation of the gradient of the at least one first parameter set corresponding to the at least one first total loss relative to the current classification parameter as the second positive direction; Determining the accumulation of the gradient of the at least one second parameter set corresponding to the at least one second total loss with respect to the current classification parameter as the second negative direction; Superimposing the second positive direction and the opposite direction of the second negative direction as a second adjustment direction; The sum of the first adjustment direction and the second adjustment direction is taken as the direction in which the total loss is reduced.

根據申請專利範圍第9項所述的方法，其中，所述在總損失減小的方向，至少更新所述策略網路包括：在所述第一調整方向，更新所述策略網路的當前策略參數；在所述第二調整方向，更新所述分類網路的當前分類參數。The method according to item 9 of the scope of patent application, wherein the updating at least the strategy network in the direction of reducing the total loss includes: In the first adjustment direction, update the current policy parameters of the policy network; In the second adjustment direction, the current classification parameters of the classification network are updated.

根據申請專利範圍第1項所述的方法，還包括：將待分析的第二句子輸入所述策略網路；根據所述策略網路的輸出，確定所述第二句子中的主幹詞。According to the method described in item 1 of the scope of patent application, it also includes: Input the second sentence to be analyzed into the strategy network; Determine the main word in the second sentence according to the output of the strategy network.

一種通過強化學習提取主幹詞的裝置，包括：分類網路訓練單元，配置為利用句子樣本集，訓練用於句子分類的分類網路；第一確定單元，配置為利用當前策略參數下的策略網路，對所述句子樣本集中的第一樣本句子進行主幹詞提取，獲得第一主幹詞集合，並根據所述第一樣本句子中的詞語數目和所述第一主幹詞集合中的詞語數目，確定當前的第一損失；第二確定單元，配置為利用所述分類網路對由所述第一主幹詞集合構成的第一備選句子進行分類處理，獲得所述第一備選句子的第一分類結果，並根據所述第一分類結果以及所述第一樣本句子的分類標籤，確定當前的第二損失；總損失確定單元，配置為根據所述當前的第一損失和當前的第二損失，確定當前的總損失；更新單元，配置為在總損失減小的方向，至少更新所述策略網路，以用於從待分析句子中提取主幹詞。A device for extracting main words through reinforcement learning, including: The classification network training unit is configured to use the sentence sample set to train the classification network for sentence classification; The first determining unit is configured to use the strategy network under the current strategy parameters to extract the stem words of the first sample sentence in the sentence sample set to obtain the first stem word set, and according to the first sample sentence The number of words in and the number of words in the first stem word set to determine the current first loss; The second determining unit is configured to use the classification network to classify the first candidate sentence formed by the first stem word set, to obtain the first classification result of the first candidate sentence, and according to the Describe the first classification result and the classification label of the first sample sentence, and determine the current second loss; The total loss determining unit is configured to determine the current total loss according to the current first loss and the current second loss; The update unit is configured to update at least the strategy network in a direction in which the total loss is reduced, so as to extract the main words from the sentence to be analyzed.

根據申請專利範圍第12項所述的裝置，其中，所述策略網路包括第一嵌入層，第一處理層和第二處理層，所述第一確定單元配置為利用策略網路對所述句子樣本集中的第一樣本句子進行主幹詞提取，具體包括：在所述第一嵌入層，獲得所述第一樣本句子中的各個詞的詞嵌入向量；在所述第一處理層，根據所述詞嵌入向量，確定所述各個詞作為主幹詞的機率；在所述第二處理層，至少根據所述機率，從所述各個詞中選擇至少一部分詞，構成所述第一主幹詞集合。The device according to item 12 of the scope of patent application, wherein the strategy network includes a first embedding layer, a first processing layer and a second processing layer, and the first determining unit is configured to use the strategy network to The stem word extraction is performed on the first sample sentence in the sentence sample set, which specifically includes: In the first embedding layer, obtaining the word embedding vector of each word in the first sample sentence; In the first processing layer, according to the word embedding vector, determine the probability of each word as a backbone word; In the second processing layer, at least a part of the words is selected from the various words at least according to the probability to form the first trunk word set.

根據申請專利範圍第13項所述的裝置，其中，在所述第二處理層，從所述各個詞中選擇機率值大於預設閾值的詞，構成所述第一主幹詞集合。The device according to item 13 of the scope of patent application, wherein, in the second processing layer, words with a probability value greater than a preset threshold are selected from the various words to form the first trunk word set.

根據申請專利範圍第12項所述的裝置，其中，所述分類網路包括第二嵌入層和第三處理層，所述第二確定單元配置為利用所述分類網路對由所述第一主幹詞集合構成的第一備選句子進行分類處理，具體包括：在所述第二嵌入層，獲得所述第一備選句子對應的句子嵌入向量；在所述第三處理層，根據所述句子嵌入向量，確定所述第一備選句子的第一分類結果。The device according to item 12 of the scope of patent application, wherein the classification network includes a second embedding layer and a third processing layer, and the second determining unit is configured to use the classification network to pair the first The first candidate sentence formed by the stem word set is classified, which specifically includes: In the second embedding layer, obtain the sentence embedding vector corresponding to the first candidate sentence; In the third processing layer, the first classification result of the first candidate sentence is determined according to the sentence embedding vector.

根據申請專利範圍第12項所述的裝置，其中，所述策略網路和/或所述分類網路基於循環神經網路RNN。The device according to claim 12, wherein the strategy network and/or the classification network is based on a recurrent neural network RNN.

根據申請專利範圍第12項所述的裝置，其中：所述第一確定單元還配置為，分別利用N群組原則參數下的所述策略網路處理所述第一樣本句子，獲得對應的N個主幹詞集合，並分別確定N個第一損失；所述第二確定單元還配置為，利用所述分類網路，對所述N個主幹詞集合分別對應的N個備選句子進行分類處理，獲得N個分類結果，並分別確定N個第二損失；所述總損失確定單元還配置為，根據N個第一損失和N個第二損失，確定對應的N個總損失和所述N個總損失的均值；以及，確定損失值小於等於所述均值的至少一個第一總損失，以及損失值大於所述均值的至少一個第二總損失；所述更新單元包括：方向確定模組，配置為基於所述至少一個第一總損失和所述至少一個第二總損失，確定所述總損失減小的方向；更新模組，配置為根據所述總損失減小的方向，執行網路更新。The device according to item 12 of the scope of patent application, wherein: The first determining unit is further configured to process the first sample sentences by using the strategy network under the N group principle parameters to obtain corresponding N trunk word sets, and to determine N first losses respectively ； The second determining unit is further configured to use the classification network to classify the N candidate sentences corresponding to the N trunk word sets, to obtain N classification results, and to determine N second loss; The total loss determining unit is further configured to determine the corresponding N total losses and the average value of the N total losses according to the N first losses and the N second losses; and determine that the loss value is less than or equal to the average value At least one first total loss of, and at least one second total loss with a loss value greater than the average value; The update unit includes: A direction determining module, configured to determine a direction in which the total loss is reduced based on the at least one first total loss and the at least one second total loss; The update module is configured to perform network update according to the direction in which the total loss decreases.

根據申請專利範圍第17項所述的裝置，其中，所述第二確定單元配置為：利用同一組分類參數下的所述分類網路，對所述N個備選句子分別進行分類處理，獲得所述N個分類結果；其中，所述N個總損失對應於所述N群組原則參數；所述方向確定模組配置為：確定所述至少一個第一總損失對應的至少一組第一策略參數相對於當前策略參數的梯度的累積，作為正方向；確定所述至少一個第二總損失對應的至少一組第二策略參數相對於當前策略參數的梯度的累積，作為負方向；將所述正方向與所述負方向的相反方向疊加，作為所述總損失減小的方向。The device according to item 17 of the scope of patent application, wherein the second determining unit is configured to: use the classification network under the same set of classification parameters to classify the N candidate sentences respectively to obtain The N classification results; Wherein, the N total losses correspond to the N group principle parameters; The direction determining module is configured as: Determining the accumulation of gradients of at least one group of first strategy parameters corresponding to the at least one first total loss with respect to the current strategy parameters as a positive direction; Determining the accumulation of gradients of at least one set of second strategy parameters corresponding to the at least one second total loss with respect to the current strategy parameters as a negative direction; The direction opposite to the positive direction and the negative direction is superimposed as the direction in which the total loss decreases.

根據申請專利範圍第18項所述的裝置，其中，所述更新模組配置為：在所述總損失減小的方向，更新所述策略網路中的當前策略參數。The device according to item 18 of the scope of patent application, wherein the update module is configured as: In the direction in which the total loss decreases, the current strategy parameters in the strategy network are updated.

根據申請專利範圍第17項所述的裝置，其中，所述第二確定單元配置為：利用M組分類參數下的所述分類網路，對所述N個備選句子進行分類處理，獲得所述N個備選句子對應的N個分類結果，其中M＜=N；其中，所述N個總損失對應於N個參數集，其中第i參數集包括第i群組原則參數，和處理第i備選句子時所述分類網路對應的分類參數；所述方向確定模組配置為：確定所述至少一個第一總損失對應的至少一組第一參數集相對於當前策略參數的梯度的累積，作為第一正方向；確定所述至少一個第二總損失對應的至少一組第二參數集相對於當前策略參數的梯度的累積，作為第一負方向；將所述第一正方向與所述第一負方向的相反方向疊加，作為第一調整方向；確定所述至少一個第一總損失對應的至少一組第一參數集相對於當前分類參數的梯度的累積，作為第二正方向；確定所述至少一個第二總損失對應的至少一組第二參數集相對於當前分類參數的梯度的累積，作為第二負方向；將所述第二正方向與所述第二負方向的相反方向疊加，作為第二調整方向；將所述第一調整方向和第二調整方向的總和作為所述總損失減小的方向。The device according to item 17 of the scope of patent application, wherein the second determining unit is configured to: use the classification network under M groups of classification parameters to classify the N candidate sentences to obtain all Describe N classification results corresponding to N candidate sentences, where M<=N; Wherein, the N total losses correspond to N parameter sets, wherein the i-th parameter set includes the i-th group principle parameter and the classification parameter corresponding to the classification network when processing the i-th candidate sentence; The direction determining module is configured as: Determining the accumulation of the gradient of the at least one first parameter set corresponding to the at least one first total loss relative to the current strategy parameter as the first positive direction; Determining the accumulation of the gradient of the at least one second parameter set corresponding to the at least one second total loss relative to the current strategy parameter as the first negative direction; Superimposing the first positive direction and the opposite direction of the first negative direction as the first adjustment direction; Determining the accumulation of the gradient of the at least one first parameter set corresponding to the at least one first total loss relative to the current classification parameter as the second positive direction; Determining the accumulation of the gradient of the at least one second parameter set corresponding to the at least one second total loss with respect to the current classification parameter as the second negative direction; Superimposing the second positive direction and the opposite direction of the second negative direction as a second adjustment direction; The sum of the first adjustment direction and the second adjustment direction is taken as the direction in which the total loss is reduced.

根據申請專利範圍第20項所述的裝置，其中，所述更新模組配置為：在所述第一調整方向，更新所述策略網路的當前策略參數；在所述第二調整方向，更新所述分類網路的當前分類參數。The device according to item 20 of the scope of patent application, wherein the update module is configured as: In the first adjustment direction, update the current policy parameters of the policy network; In the second adjustment direction, the current classification parameters of the classification network are updated.

根據申請專利範圍第12項所述的裝置，還包括預測單元，配置為：將待分析的第二句子輸入所述策略網路；根據所述策略網路的輸出，確定所述第二句子中的主幹詞。The device according to item 12 of the scope of patent application further includes a prediction unit configured to: Input the second sentence to be analyzed into the strategy network; Determine the main word in the second sentence according to the output of the strategy network.

一種電腦可讀儲存媒體，其上儲存有電腦程式，當所述電腦程式在電腦中執行時，令電腦執行申請專利範圍第1-11項中任一項的所述的方法。A computer-readable storage medium has a computer program stored thereon, and when the computer program is executed in a computer, the computer is made to execute the method described in any one of items 1-11 in the scope of the patent application.

一種計算設備，包括記憶體和處理器，其特徵在於，所述記憶體中儲存有可執行程式碼，所述處理器執行所述可執行程式碼時，實現申請專利範圍第1-11項中任一項所述的方法。A computing device, comprising a memory and a processor, characterized in that executable program codes are stored in the memory, and when the processor executes the executable program codes, items 1-11 of the scope of patent application Any one of the methods.