CN111046662A

CN111046662A - Training method, device and system of word segmentation model and storage medium

Info

Publication number: CN111046662A
Application number: CN201811123200.0A
Authority: CN
Inventors: 徐光伟; 王潇斌; 李林琳; 司罗
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2018-09-26
Filing date: 2018-09-26
Publication date: 2020-04-21
Anticipated expiration: 2038-09-26
Also published as: CN111046662B

Abstract

The invention discloses a training method, device, system and storage medium for a word segmentation model. The method comprises: acquiring a labeled data set, wherein the labeled data set comprises training texts with separation marks; acquiring a search behavior data set, and generating a word frequency dictionary from the search behavior data set; and training the word segmentation model corresponding to the labeled data set based on the training texts with separation marks and the word frequency dictionary, to obtain the trained word segmentation model corresponding to the labeled data set. The method provided by the embodiments of the invention can improve both the accuracy and the domain adaptability of word segmentation.

Description

Training method, device and system of word segmentation model and storage medium

Technical Field

The invention relates to the technical field of computers, in particular to a training method, a device, a system and a storage medium of a word segmentation model.

Background

In Chinese information processing, word segmentation is a technique that can improve the analysis capability of a search system. Briefly, Chinese word segmentation recombines a sequence of Chinese characters into a sequence of words. Because written Chinese has no natural separator between words, word segmentation is needed to improve system analysis capability in many Chinese information processing scenarios.

With the development of natural language processing technology, the demand for Chinese word segmentation in specific and professional fields is increasing. Chinese word segmentation corpora are abundant, but the corpus of each field has its own vocabulary, and labeled data sets from different fields neither transfer well to one another nor cover one another. As a result, if a field has no large-scale labeled data set of its own, word segmentation accuracy in that field is poor.

Disclosure of Invention

The embodiments of the invention provide a training method, device, system and storage medium for a word segmentation model, which can improve both the accuracy and the domain adaptability of word segmentation.

According to an aspect of the embodiments of the present invention, there is provided a method for training a segmentation model, including:

acquiring a labeled data set, wherein the labeled data set comprises training texts with separation marks; acquiring a search behavior data set, and generating a word frequency dictionary from the search behavior data set; and training the word segmentation model corresponding to the labeled data set based on the training texts with separation marks and the word frequency dictionary, to obtain the trained word segmentation model corresponding to the labeled data set.

According to another aspect of the embodiments of the present invention, there is provided a training apparatus for a segmentation model, including:

a labeled data set acquisition module, configured to acquire a labeled data set comprising training texts with separation marks; a search behavior data set acquisition module, configured to acquire a search behavior data set and generate a word frequency dictionary from the search behavior data set; and a first word segmentation model training module, configured to train the word segmentation model corresponding to the labeled data set based on the training texts with separation marks and the word frequency dictionary, to obtain the trained word segmentation model corresponding to the labeled data set.

According to another aspect of the embodiments of the present invention, there is provided a system for training a segmentation model, including:

a memory and a processor; the memory is used for storing programs; the processor is used for reading the executable program code stored in the memory to execute the training method of the word segmentation model.

According to a further aspect of the embodiments of the present invention, there is provided a computer-readable storage medium having stored therein instructions, which when executed on a computer, cause the computer to execute the training method of the word segmentation model of the above aspects.

According to still another aspect of the embodiments of the present invention, there is provided a word segmentation method, including:

acquiring a character sequence of a specified text; and performing word segmentation on the character sequence of the specified text by using a word segmentation model to obtain a word segmentation result of the specified text, wherein the word segmentation model is obtained by training with the training texts with separation marks in a labeled data set and with a word frequency dictionary, and the word frequency dictionary is generated from a search behavior data set.

According to still another aspect of the embodiments of the present invention, there is provided a word segmentation apparatus, including:

a text acquisition module, configured to acquire a character sequence of a specified text; and a word segmentation module, configured to segment the character sequence of the specified text by using a word segmentation model to obtain a word segmentation result of the specified text, wherein the word segmentation model is obtained by training with the training texts with separation marks in a labeled data set and with a word frequency dictionary, and the word frequency dictionary is generated from a search behavior data set.

According to still another aspect of the embodiments of the present invention, there is provided a word segmentation system, including:

a memory and a processor; the memory is used for storing executable program codes; the processor is used for reading the executable program codes stored in the memory to execute the word segmentation method of the aspects.

According to a further aspect of the embodiments of the present invention, there is provided a computer-readable storage medium having stored therein instructions which, when run on a computer, cause the computer to perform the word segmentation method of the above-described aspects.

According to another aspect of the embodiments of the present invention, there is provided an information retrieval method, including:

acquiring search data input by a user during a search; segmenting the character sequence in the search data by using a word segmentation model to obtain a word segmentation result of the search data, wherein the word segmentation model is obtained by training with the training texts with separation marks in a labeled data set and with a word frequency dictionary, and the word frequency dictionary is generated from a search behavior data set; and performing retrieval using the word segmentation result of the search data.

According to still another aspect of the embodiments of the present invention, there is provided an information retrieval apparatus including:

a data acquisition module, configured to acquire search data input by a user during a search; a word segmentation module, configured to segment the character sequence in the search data by using a word segmentation model to obtain a word segmentation result of the search data, wherein the word segmentation model is obtained by training with the training texts with separation marks in a labeled data set and with a word frequency dictionary, and the word frequency dictionary is generated from a search behavior data set; and a retrieval module, configured to perform retrieval using the word segmentation result of the search data.

According to yet another aspect of embodiments of the present invention, there is provided an information retrieval system, including a memory and a processor; the memory is used for storing executable program codes; the processor is used for reading the executable program codes stored in the memory to execute the information retrieval method of the aspects.

According to still another aspect of embodiments of the present invention, there is provided a computer-readable storage medium having stored therein instructions that, when executed on a computer, cause the computer to execute the above-described aspects of the information retrieval method.

According to the training method, device, system and storage medium for a word segmentation model in the embodiments of the invention, a small amount of manually labeled data is combined with a search data set generated automatically from the search behavior of e-commerce users, so that the training data is more accurate and the word segmentation effect in the e-commerce field is improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a schematic diagram illustrating the basic principle of scoring the segmentation results in a training method of a segmentation model according to an embodiment of the present invention;

FIG. 2 is a flow diagram illustrating a method of training a segmentation model according to an embodiment of the invention;

FIG. 3 is a detailed flow diagram illustrating a method for training a segmentation model according to an embodiment of the present invention;

FIG. 4 is a flow diagram illustrating a method of training a segmentation model according to another embodiment of the present invention;

FIG. 5 is a detailed flowchart of a method for training a segmentation model according to another embodiment of the present invention;

FIG. 6 is a system flow diagram illustrating a model training method according to an exemplary embodiment of the present invention;

FIG. 7 is a schematic structural diagram illustrating a training apparatus for a word segmentation model according to an embodiment of the present invention;

FIG. 8 is a flow diagram illustrating a method of word segmentation in accordance with an embodiment of the present invention;

FIG. 9 is a schematic structural diagram showing a word segmentation apparatus according to an embodiment of the present invention;

FIG. 10 is a flow diagram illustrating an information retrieval method according to an embodiment of the present invention;

FIG. 11 is a schematic structural diagram showing an information retrieval apparatus according to an embodiment of the present invention;

FIG. 12 is a block diagram illustrating an exemplary hardware architecture of a computing device in which the method and apparatus for training a segmentation model according to embodiments of the present invention may be implemented;

FIG. 13 is a block diagram illustrating an exemplary hardware architecture of a computing device capable of implementing the word segmentation method and apparatus in accordance with embodiments of the present invention;

FIG. 14 is a block diagram illustrating an exemplary hardware architecture of a computing device capable of implementing the information retrieval method and apparatus according to embodiments of the present invention.

Detailed Description

Features and exemplary embodiments of various aspects of the present invention will be described in detail below, and in order to make objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It will be apparent to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of the present invention.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

In the embodiments of the invention, word segmentation can be understood as the process of recombining a continuous character sequence into a word sequence according to a certain specification. Word segmentation labeling marks the positions of word separators in a text; a word separator can be a space or another punctuation mark with a separating function. The difficulties of Chinese word segmentation are disambiguation and the identification of new words. Because written Chinese has no natural separator between words, Chinese word segmentation is required in many application scenarios (such as news information processing and Chinese search term processing), where it can improve the analysis capability of the system to a certain extent.

For more general fields, such as the news field, there are continuously updated TreeBank data sets that contain a large number of well-annotated, manually segmented results. Based on a TreeBank data set, the Chinese word segmentation task in the news field can be completed well with a suitable word segmentation algorithm. However, for Chinese word segmentation in some professional fields, the change of field content inevitably introduces much field vocabulary that does not appear in general fields, and relying on general-field corpus resources reduces word segmentation performance. Data sets of different domains therefore have a domain adaptability problem.

Those skilled in the art will appreciate that a specific field as used herein may be a professional field such as the e-commerce field, the financial field, the IT field or the medical field. For simplicity of description, the embodiments described herein illustrate a domain-specific word segmentation model training method by taking the e-commerce field as an example. This description is not to be construed as limiting the scope or the possible implementations of the solution; processing in specific fields other than e-commerce is consistent with the word segmentation processing in the e-commerce field.

As can be seen from the above description, because of the differences between fields, a large-scale labeled data set from a general field does not cover the e-commerce field well, which results in poor word segmentation for e-commerce text. A large-scale labeled data set usually contains more than 100,000 sentences, and constructing a large-scale, high-quality labeled data set for the e-commerce field would be extremely time-consuming and labor-intensive.

In the prior art, some technologies use the search behavior of e-commerce users to construct a word segmentation data set for the e-commerce field, or use the search behavior data of e-commerce users to perform word segmentation directly, but neither achieves the effect obtained with a high-quality manually labeled data set.

In order to improve the word segmentation processing capability of Chinese word segmentation in a specific field and improve the word segmentation accuracy, the embodiment of the invention provides a word segmentation processing method, a word segmentation processing device and a word segmentation processing system, which can combine a small amount of manually labeled data and data automatically generated by user search behaviors to improve the word segmentation effect in the E-commerce field.

It should be noted that the word segmentation processing scheme according to the embodiments of the present invention is not limited to Chinese text; it is also applicable to other text in which words are not separated by spaces, such as Japanese and Korean text. The following description uses Chinese text only as an example.

First, a process of extracting training data from a manually labeled data set and a user search data set automatically generated by a user search behavior in the embodiment of the present invention is described below with reference to tables 1 to 3 by specific examples.

In an embodiment of the invention, the training data may be a complete sentence, i.e. a unit of language that expresses a relatively complete meaning. The training method of the word segmentation model provided by the embodiments of the invention can train the word segmentation model on complete sentences.

Table 1 below shows an example of a manually annotated data set according to an exemplary embodiment of the invention.

Table 1. Example of manually annotated data

Type	Data
Manual labeling	Pelargy water power deep moisturizing toner
Manual labeling	Cerasus exfoliator
Manual labeling	Korean edition shape-fitting male leisure trousers knitting panty big-size summer thin summer and spring style

As shown in Table 1, each piece of manually annotated data is a complete sentence and can be used as a manually annotated training text (hereinafter referred to as a labeled training text) for training the word segmentation model in the embodiments of the invention. Each piece of manually annotated data in Table 1 contains a manually labeled word segmentation result. As an example, for the labeled training text "Pelargy water power deep moisturizing toner", the manually labeled word segmentation result is the following word sequence: "Pelargy", "water power", "deep", "moisturizing" and "toner".

In the following description of the embodiments, for a word sequence in a word segmentation result, a word in the word sequence may be referred to as a word segmentation unit. Namely, each word segmentation result of the labeled training text comprises one or more word segmentation units.

In the embodiments of the invention, the search behavior data set is a data set constructed from data generated automatically by search behaviors. Specifically, the search behaviors include a search input behavior and a search click behavior, and the search behavior data set includes a search term data set made up of the search data entered by users and a search click data set made up of the titles of the search results that users clicked.

In one embodiment, for the search input behavior, search terms are obtained automatically by using the separators that the user typed when entering search data, and the obtained search terms form a first search training data set, referred to as Query training data in the following description of the embodiments.

As an example, table 2 below exemplarily shows a search behavior data set according to an exemplary embodiment of the present invention.

Table 2. Example of search (Query) data

Type	Data
Query data	School bag handbag canvas leisure for students
Query data	Necklace female birthday
Query data	Nine cents trousers woman bag post

As shown in table 2, the field "Query data" represents a Query word input by the user. Each record in the search data table may represent Query training data automatically obtained from a search input action during a search by the user.
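The extraction of Query training data from user-typed separators can be sketched as follows. This is a minimal illustration, assuming the separator is whitespace; the function name `query_to_terms` and the sample queries are hypothetical and do not come from the original.

```python
def query_to_terms(query):
    """Split a raw query on the whitespace separators typed by the user.

    Assumption: whitespace is the only separator; real logs may also
    contain other separating punctuation.
    """
    return query.split()


# Hypothetical sample queries in the style of Table 2.
raw_queries = ["necklace female birthday", "school-bag  canvas"]
query_training_data = [query_to_terms(q) for q in raw_queries]
```

Each resulting list of terms is one Query training record whose word boundaries were supplied by the user rather than by an annotator.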

In one embodiment, for the search click behavior, the segments that repeat between the query terms and the titles clicked by the user are extracted to obtain a second search training data set. A segment (sentence segment) is an incomplete sentence. This second search training data set is referred to as Query-Title training data in the description of the embodiments below.

As an example, Table 3 below illustratively shows a search click dataset in accordance with an embodiment of the present invention.

Table 3. Example of title (Query-Title) data of clicked search results

Type	Data
Query-Title data	Simple tassel earrings/tassel earrings
Query-Title data	Rhizoma Kaempferiae paper bag/rhizoma Kaempferiae
Query-Title data	Picard dune electronic pet/pick dune

As an example, one search click behavior in a user search is described with reference to Table 3. In this example, a user enters the query "simple tassel earrings" with separators (spaces), and submitting it yields multiple search results. The titles of the search results clicked by the user are collected, and segments that repeat in those titles, such as "tassel earrings", are obtained.

As shown in Table 3, in the title data table of clicked search results, Query denotes a query entered by the user, and Title denotes the title of a search result clicked by the user. Each record in the table represents Query-Title training data extracted from the search click behavior of a user during one search.
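One simple way to obtain such repeated segments is to keep the query terms that also occur in the clicked title. The sketch below assumes exact substring matching; the function name `repeated_segments` and the Chinese sample strings are invented for illustration and are not taken from the original.

```python
def repeated_segments(query, clicked_title):
    """Return the query terms that also appear verbatim in the clicked title.

    Assumption: a term counts as repeated when it is an exact substring
    of the title; the original does not specify the matching rule.
    """
    return [term for term in query.split() if term in clicked_title]


# Hypothetical query and clicked title.
segments = repeated_segments("简约 流苏耳环", "新款流苏耳环女长款")
```

Here only "流苏耳环" survives, because "简约" does not occur in the title.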

In the embodiments of the invention, for the manually labeled data set, programming languages such as C++ or Python can be used to handle natural language processing tasks efficiently, such as automatic data acquisition and automatic labeling; for the search behavior data set, a distributed computing platform can be used to count how often each search term occurs in the search logs, and search terms can be selected accordingly.

In one embodiment, the word frequency of a search term is the number of times the search term occurs in the search behavior data set; a higher count indicates that the search term is more important. Word frequency statistics are computed over the automatically generated Query training data and Query-Title training data to obtain a vocabulary of search terms and the word frequency of each search term. Table 4 below shows an example of the word frequency data and the statistical results.

Table 4. Example of word frequency data

Type	Data
Word frequency data	Sofa 14741
Word frequency data	Middle heel 14677
Word frequency data	Yi Jia 14611

As shown in table 4, each record in the word frequency data table represents a search word and a word frequency statistical result obtained by statistics. Since the word frequency is statistically derived from a large amount of data that is automatically generated, the frequency of many words is a very large data value.
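The word frequency statistics can be sketched with a plain counter. This single-machine version is only an illustration; the original mentions a distributed computing platform for real search logs, and the function name `build_word_frequency` and the sample records are hypothetical.

```python
from collections import Counter


def build_word_frequency(query_records, query_title_records):
    """Count how often each term occurs across both search training sets.

    Each record is a list of terms. Returns a dict mapping term -> count.
    """
    counts = Counter()
    for terms in query_records:
        counts.update(terms)
    for terms in query_title_records:
        counts.update(terms)
    return dict(counts)


# Hypothetical records: two Query records and one Query-Title record.
freq = build_word_frequency(
    [["sofa", "fabric"], ["sofa"]],
    [["sofa"]],
)
```

With these toy records, "sofa" is counted three times and "fabric" once.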

Therefore, before the word frequency information of the search terms is used, it can be preprocessed. The principle of this preprocessing is to reduce the absolute magnitude of the word frequency values while changing the properties of the data as little as possible.

In one embodiment, the logarithm of the word frequency of each search term may be taken in advance. Taking the logarithm reduces the absolute magnitude of the word frequency values, which simplifies subsequent calculation, and it does not change the properties of the data, such as the relative order of the frequencies. The compressed values are used as the variable, which makes the calculation results more stable. It should be understood that other data processing methods may also be used to process the word frequency information of the search terms, as long as they satisfy the above principle.
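The logarithmic compression can be sketched as below. The natural logarithm is an assumption, since the original does not state the base, and the function name `log_scale` is hypothetical; the counts are those of Table 4.

```python
import math


def log_scale(word_frequency):
    """Replace each raw count with its natural logarithm.

    Assumption: natural log; any base preserves the ordering of counts.
    Non-positive counts are skipped because their logarithm is undefined.
    """
    return {w: math.log(c) for w, c in word_frequency.items() if c > 0}


raw = {"Sofa": 14741, "Middle heel": 14677, "Yi Jia": 14611}
scaled = log_scale(raw)
```

Because the logarithm is monotonic, the ranking of terms by frequency is unchanged while the values shrink from the tens of thousands to single digits.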

In the embodiments of the invention, the search terms and the repeated segments of clicked titles represent the query requests of most users and implicitly carry information about the importance of the search terms. Training data containing this importance information is extracted from massive user search logs. Because the training data comes from real search data and reflects search term importance through user search behavior, its coverage is wider, and the training result is closer to the target word segmentation result of a sentence to be segmented in a practical application scenario.

Therefore, in order to improve the word segmentation effect in the e-commerce field, the embodiment of the invention provides a training method of a word segmentation model, which only needs to use manual labeling data acquired with a small amount of labor cost and combines with a search behavior data set automatically generated by search behaviors of e-commerce users, so that the training data is more accurate, and the word segmentation effect in the e-commerce field is improved.

In the embodiments of the invention, the word segmentation algorithm used for training the word segmentation model comprises the following steps: segmenting the specified text in different segmentation manners to obtain a plurality of word segmentation results; scoring each of the word segmentation results to obtain its score, wherein the score of a word segmentation result is the sum of the scores of the word segmentation units in the result and the connection scores of those units; and selecting the word segmentation result with the highest score as the predicted word segmentation result of the specified text.
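The select-the-highest-score procedure can be sketched as follows. Exhaustive enumeration of candidates is an assumption made for clarity, since the original does not say how candidate segmentations are searched, and `unit_score`, `link_score` and `max_len` are hypothetical stand-ins for the model's scoring components.

```python
def segmentations(text, max_len=4):
    """Yield every way to cut text into units of at most max_len characters."""
    if not text:
        yield []
        return
    for end in range(1, min(max_len, len(text)) + 1):
        for rest in segmentations(text[end:], max_len):
            yield [text[:end]] + rest


def score(units, unit_score, link_score):
    """Sum each unit's own score and its connection score to the previous unit."""
    total = 0.0
    for i, unit in enumerate(units):
        previous = units[i - 1] if i > 0 else None
        total += unit_score(unit) + link_score(previous, unit)
    return total


def best_segmentation(text, unit_score, link_score, max_len=4):
    """Return the candidate segmentation with the highest total score."""
    return max(
        segmentations(text, max_len),
        key=lambda units: score(units, unit_score, link_score),
    )


# Toy scorers: known units get a positive score, everything else -1.
known = {"ab": 2.0, "c": 1.0}
best = best_segmentation(
    "abc",
    unit_score=lambda u: known.get(u, -1.0),
    link_score=lambda prev, u: 0.0,
)
```

With the toy scorers, the candidate "ab" + "c" scores 3.0 and beats the three other ways of cutting "abc". Enumeration grows exponentially with text length, so a practical system would use a more efficient search.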

For ease of understanding, the basic flow of scoring the word segmentation results obtained in different segmentation manners in the training method of the word segmentation model according to the embodiments of the present invention is first described below with reference to FIG. 1. FIG. 1 is a schematic diagram illustrating the basic principle of scoring a word segmentation result in a training method of a word segmentation model according to an embodiment of the present invention.

As shown in FIG. 1, in one embodiment, a method 100 of scoring a word segmentation result of a specified text may comprise:

Step S110, as shown in the text segmentation module in FIG. 1, performing word segmentation on the character sequence of the specified text in a first word segmentation manner to obtain a word segmentation result of the specified text.

As an example, the specified text contains the character sequence "C1C2C3C4C5C6C7C8", and the characters of the specified text are recombined into words in the first word segmentation manner, yielding a word sequence of the specified text such as "C1C2", "C3", "C4C5", "C6C7C8".

Step S120, as shown in the character vector determination module of the word segmentation model in FIG. 1, determining the character vector of each character in the specified text.

In this step, each character in the specified text is mapped to a high-dimensional vector space to obtain a fixed-length feature vector. This is the feature vector of the single character, referred to as a character vector for short.

In the embodiments of the present invention, an embedding vector (word embedding) maps units of natural language to a vector space, producing numeric values that a computer can process.

In one embodiment, a neural network model may be used in advance to represent characters as vectors according to their features. As one example, the neural network model may be a word vector model. For example, through Word2vec model training, the feature information of the characters in the specified text is mapped to a high-dimensional vector space to obtain fixed-length vectors.

In this embodiment, the character vector determination module may determine the feature vector of each single character in the specified text by looking it up in a vector mapping table (lookup table).
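The table lookup can be sketched as a dictionary from characters to fixed-length vectors. In this illustration the vectors are randomly initialised, which is an assumption; the original describes obtaining them through Word2vec training. The names `build_lookup_table`, `lookup` and the zero vector used for unknown characters are hypothetical.

```python
import random


def build_lookup_table(characters, dim=4, seed=0):
    """Map each character to a fixed-length vector.

    Assumption: small random values stand in for pretrained vectors.
    """
    rng = random.Random(seed)
    return {ch: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for ch in characters}


def lookup(table, text, unknown):
    """Return one vector per character, using `unknown` for unseen characters."""
    return [table.get(ch, unknown) for ch in text]


table = build_lookup_table("流苏耳环", dim=4)
vectors = lookup(table, "流苏包", unknown=[0.0] * 4)
```

Every character of the input, including the unseen "包", receives a vector of the same length, which is what the later composition steps require.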

Step S130, as shown in the word vector determination module of the word segmentation model in FIG. 1, determining the word vector of each segmented word in the word segmentation result according to the vectors of the individual characters that the segmented word contains.

In this step, the word vector of a segmented word is determined from the vectors of the characters in that segmented word.

In the embodiments of the present invention, a word vector maps a word to a vector of fixed dimension and can be used to characterize the syntactic and semantic information of the word. In one embodiment, a word vector may be used to represent the features of the word. A neural network model extracts the features of each character in the segmented word and, taking into account the position of each character within the segmented word, linearly combines these features to obtain the feature information of the word formed by the adjacent characters, that is, the word vector of the segmented word.

In one embodiment, the neural network model may be a convolutional neural network (CNN) or a gated combination neural network (GCNN). For example, a GCNN combines the features of each character in the segmented word to obtain the vector representation of the segmented word.
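The position-aware linear combination of character features can be illustrated with a deliberately simplified sketch. It uses one weight matrix per character position followed by a tanh nonlinearity; this is an assumption for illustration only and is not the CNN or GCNN named above, which have additional structure. All names (`matvec`, `compose_word_vector`, `position_weights`) are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def compose_word_vector(char_vectors, position_weights):
    """Combine character vectors into one word vector.

    The k-th character is projected with the k-th position-specific
    matrix, the projections are summed, and tanh is applied.
    """
    dim = len(char_vectors[0])
    total = [0.0] * dim
    for k, char_vector in enumerate(char_vectors):
        projected = matvec(position_weights[k], char_vector)
        total = [t + p for t, p in zip(total, projected)]
    return [math.tanh(t) for t in total]


# Two-character word, 2-dimensional vectors, identity weights per position.
identity = [[1.0, 0.0], [0.0, 1.0]]
word_vector = compose_word_vector([[1.0, 0.0], [0.0, 1.0]], [identity, identity])
```

The output has the same dimension as the character vectors regardless of word length, so words of different lengths can be scored with the same parameter vector.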

Step S140, as shown in the word connection vector determination module of the word segmentation model in FIG. 1, determining the connection feature vectors of the segmented words according to the word vectors of the segmented words in the word segmentation result.

In one embodiment, the connection feature vectors of the segmented words can be extracted from the word vectors of the segmented words in the word segmentation result by a specified neural network model.

In this embodiment, the neural network model may be a Long Short-Term Memory (LSTM) neural network model. The word vector of a segmented word is taken as the input of the LSTM model, and the LSTM determines the connection feature vector of the segmented word from the input word vector, taking into account the connection relation with the segmented words before and after it.

Step S150, as shown in the segmentation result scoring module of the word segmentation model in FIG. 1, scoring the word segmentation result based on the word vectors and the connection feature vectors of the segmented words in the word segmentation result, to obtain a score of the word segmentation result.

In one embodiment, step S150 may include:

Step S151, based on the word vectors of the participles, scoring the participles in the word segmentation result to obtain the scores of the participles in the word segmentation result.

Specifically, an inner product operation may be performed on the word vector of the word segmentation and a parameter vector of the word segmentation model to obtain a score of the word segmentation unit in the word segmentation result.

In this step, the parameter vector of the word segmentation model is a parameter of the word segmentation model expressed in vector form, and it may have the same dimension as the word vector of the participle. In this embodiment, the parameter vector may be used to perform an inner product operation with the word vector of the participle to obtain the score of the participle.

As an example, the initial value of the parameter vector may be a random parameter or a parameter value set empirically by the user.

As a specific example, in FIG. 1, y_i (i ≥ 1) represents the word vector of the i-th participle in the word segmentation result, u is the parameter vector of the word segmentation model, and the dot product of y_i and u is the score of the i-th participle.

Step S152, determining the connection score of the participle based on the word vector of the participle in the participle result.

As an example, as shown in FIG. 1, p_i (i ≥ 1) denotes the connection feature vector of the i-th participle in the word segmentation result, y_i (i ≥ 1) denotes the word vector of the i-th participle in the word segmentation result, and the dot product of p_i and y_i is the connection score of the i-th participle.

Step S153, determining the score of the word segmentation result corresponding to the appointed text and the first word segmentation mode according to the score of the word segmentation and the connection score of the word segmentation in the word segmentation result.

Specifically, the sum of the score of the participle and the connection score of the participle in the participle result is used as the score of the participle result of the specified text corresponding to the first participle mode.

Step S160, performing word segmentation on the specified text by using multiple different word segmentation modes to obtain scores of multiple different word segmentation results of the specified text, and selecting the word segmentation result with the highest score as the word segmentation result of the specified text.
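Steps S151 to S160 can be sketched as follows, assuming the word vectors y_i and connection feature vectors p_i of each candidate result have already been computed. The helper names `dot`, `result_score` and `best_result` and the toy vectors are illustrative and do not come from the embodiment.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# One candidate result: a list of (word vector y_i, connection vector p_i).
Candidate = List[Tuple[Vector, Vector]]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def result_score(candidate: Candidate, u: Vector) -> float:
    """Sum over participles of y_i . u (S151) plus p_i . y_i (S152), as in S153."""
    return sum(dot(y_i, u) + dot(p_i, y_i) for y_i, p_i in candidate)


def best_result(candidates: List[Candidate], u: Vector) -> int:
    """Index of the highest-scoring candidate result (S160)."""
    if not candidates:
        raise ValueError("at least one candidate result is required")
    return max(range(len(candidates)),
               key=lambda k: result_score(candidates[k], u))


# Toy data: two candidate results for the same text, 2-dimensional vectors.
u = [2.0, 1.0]
candidate_a = [([1.0, 0.0], [0.0, 1.0]), ([0.5, 0.5], [1.0, 1.0])]
candidate_b = [([0.0, 1.0], [0.0, 0.0])]
chosen = best_result([candidate_a, candidate_b], u)
```

Because the score decomposes into a sum over participles, an extra per-participle term can be added without changing the selection logic.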

The word segmentation model of the embodiment of the invention can directly score the participles in a word segmentation result when segmenting the specified text. Therefore, in the word segmentation model training method described in the following embodiments, the word frequency score can conveniently be used as an additional score for the participles in the word segmentation result during training. The word frequency score is thereby fused into the training of a word segmentation model that has only a small amount of manually labeled data, which significantly improves the word segmentation effect.

For better understanding of the present invention, a method for training a segmentation model according to an embodiment of the present invention is described in detail below with reference to fig. 2 and 3. It should be noted that these examples are not intended to limit the scope of the present disclosure. FIG. 2 is a flowchart illustrating a method for training a segmentation model according to an embodiment of the present invention. FIG. 3 is a detailed flowchart of a training method of a segmentation model according to an embodiment of the present invention.

As shown in fig. 2, in one embodiment, a word frequency dictionary is generated according to Query training data and a Query-Title data set, and the word frequency dictionary includes a word list and word frequencies corresponding to words in the word list. And training the word segmentation model by utilizing the labeled data set and the word frequency dictionary.
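A minimal sketch of building such a word frequency dictionary is given below. It assumes the search data is available as query strings whose words are already separated by a space character; the function name `build_word_frequency_dictionary` and the sample queries are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def build_word_frequency_dictionary(segmented_queries: Iterable[str],
                                    separator: str = " ") -> Dict[str, int]:
    """Map each word in the word list to its frequency in the queries.

    Assumption: each query is a string whose words are delimited by
    `separator`; empty tokens from repeated separators are ignored.
    """
    counts: Counter = Counter()
    for query in segmented_queries:
        counts.update(token for token in query.split(separator) if token)
    return dict(counts)


# Toy search queries (illustrative only).
queries = ["红色 连衣裙", "连衣裙 夏季", "红色 运动鞋"]
word_freq = build_word_frequency_dictionary(queries)
```

The dictionary keys form the word list, and the values are the corresponding word frequencies.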

As shown in FIG. 3, in one embodiment, a method 200 for training a segmentation model may comprise:

Step S210, acquiring a labeled data set, wherein the labeled data set comprises training texts with separation marks.

Step S220, acquiring a search behavior data set, and generating a word frequency dictionary according to the search behavior data set.

In one embodiment, the search behavior data set may include Query training data and/or a Query-Title data set as described in the above embodiments.

Step S230, training the segmentation model corresponding to the labeled data set based on the training text with the separation mark and the word frequency dictionary to obtain the segmentation model corresponding to the labeled data set after training.

In an embodiment, step S230 may specifically include:

Step S231, determining word segmentation results of the training text with the separation identifier under different segmentation conditions according to the word segmentation model corresponding to the labeled data set and the word segmentation model parameters corresponding to the labeled data set.

Step S232, scoring the word segmentation results in combination with the word frequency dictionary, determining the scores of the word segmentation results, and taking the word segmentation result with the highest score as the predicted word segmentation result of the training text with the separation marks.

In one embodiment, in step S232, the step of scoring the segmentation result in combination with the word frequency dictionary and determining the score of the segmentation result may specifically include:

Step S232-01, determining the score of the participles in the word segmentation result according to the word vectors of the participles in the word segmentation result and the training parameter vector.

Step S232-02, determining the connection score of the participles in the word segmentation result according to the word vectors of the participles and the connection feature vectors of the participles.

In this step, the connected feature vector of the segmented word may be a vector obtained by training the word vector of the segmented word according to a long-and-short-term memory model.

Step S232-03, the word frequency dictionary is searched, and the word frequency score of the word segmentation in the word segmentation result is determined.

Step S232-04, taking the sum of the score of the participles, the connection score of the participles and the word frequency score of the participles in the word segmentation result as the score of the word segmentation result.

Step S233, according to the training text with the separation marks, determining the labeling word segmentation result of the training text, determining word segmentation errors by using the prediction word segmentation result and the labeling word segmentation result, and constructing a loss function of a word segmentation model corresponding to the labeling data set according to the word segmentation errors.

Step S234, the parameters of the segmentation model corresponding to the labeled data set are adjusted by using the loss function, and the trained segmentation model corresponding to the labeled data set is obtained by using the adjusted parameters of the segmentation model.
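The three-term score of steps S232-01 to S232-04 can be sketched as below. The text does not state how a word frequency is converted into a word frequency score, so the sketch assumes a logarithmic mapping with a hypothetical `weight`; the per-participle scores and connection scores are taken as precomputed inputs.

```python
import math
from typing import Dict, List


def word_frequency_score(word: str, word_freq: Dict[str, int],
                         weight: float = 1.0) -> float:
    """Assumed mapping: weight * log(1 + frequency); unseen words score 0."""
    return weight * math.log1p(word_freq.get(word, 0))


def result_score_with_frequency(words: List[str],
                                word_scores: List[float],
                                connection_scores: List[float],
                                word_freq: Dict[str, int]) -> float:
    """Sum of participle, connection and word frequency scores (S232-04)."""
    if not (len(words) == len(word_scores) == len(connection_scores)):
        raise ValueError("one score of each kind is required per participle")
    total = 0.0
    for word, s_word, s_conn in zip(words, word_scores, connection_scores):
        total += s_word + s_conn + word_frequency_score(word, word_freq)
    return total


# Toy example: the second participle is absent from the dictionary.
toy_total = result_score_with_frequency(
    ["连衣裙", "夏"], [1.0, 2.0], [0.5, 0.5], {"连衣裙": 3})
```

A frequent dictionary word raises the score of every candidate result that contains it, which is how the search behavior data influences the predicted word segmentation result.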

In one embodiment, step S234 may specifically include:

Step S234-01, determining the gradient corresponding to the loss function.

Step S234-02, adjusting the word segmentation model parameters corresponding to the labeled data set according to the gradient corresponding to the loss function.

In an embodiment, step S234 may specifically include:

Step S234-01, determining the word segmentation error of the trained word segmentation model corresponding to the labeled data set by using the predicted word segmentation result and the labeled word segmentation result.

Step S234-02, constructing a loss function of the segmentation model corresponding to the labeled data set according to the word segmentation error.

Step S234-03, adjusting the word segmentation model parameters corresponding to the labeled data set by using the loss function.

Step S234-04, when the change in the word segmentation error no longer increases and is smaller than a set threshold, or when the number of training iterations of the word segmentation model corresponding to the labeled data set reaches the maximum number of training iterations, obtaining the trained word segmentation model corresponding to the labeled data set.
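The stopping condition can be sketched as a small loop. This is one possible reading, assuming the "change in the word segmentation error" is the absolute difference between the errors of two consecutive training iterations; `train_step` is a hypothetical callable that runs one iteration and returns the current error.

```python
from typing import Callable, List


def train_until_converged(train_step: Callable[[], float],
                          threshold: float,
                          max_steps: int) -> List[float]:
    """Run training iterations until the error change is small or the budget ends.

    Returns the error recorded after each iteration that was run.
    """
    errors: List[float] = []
    for _ in range(max_steps):
        errors.append(train_step())
        # Stop once two consecutive errors differ by less than the threshold.
        if len(errors) >= 2 and abs(errors[-1] - errors[-2]) < threshold:
            break
    return errors


# Toy training step whose error halves on every call.
_state = {"error": 1.0}


def toy_step() -> float:
    _state["error"] *= 0.5
    return _state["error"]


history = train_until_converged(toy_step, threshold=0.05, max_steps=100)
```

With the toy step, the loop stops after the fifth iteration, when the error moves from 0.0625 to 0.03125.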

In the embodiment of the present invention, a Loss Function (Loss Function) may be used to estimate a gap between a result of model training and a target of the model training, that is, a word segmentation error of a word segmentation model. For example, the result of model training is the predicted word segmentation result obtained by the word segmentation model trained each time, and the target of model training is the input labeled word segmentation result.

In one embodiment, the parameters of the segmentation model may be adjusted according to the gradient corresponding to the loss function, and the model parameters in the segmentation model may be updated.

In one embodiment, the initial value of the model parameter may be a random parameter or a parameter value that is set empirically by the user.

In the embodiment of the invention, in the training process of the word segmentation model, the gradient corresponding to the loss function is determined by using a gradient descent algorithm, and the word segmentation model parameters are adjusted according to the gradient corresponding to the loss function.

As one example, the gradient descent algorithm may include a back propagation algorithm based on a gradient descent algorithm, an Adam optimization algorithm, and the like.

In some embodiments, the gradient itself is a vector, which may be referred to as a gradient vector. The gradient vector indicates the direction in which the model parameters are to be updated during training of the word segmentation model. Updating the word segmentation model parameters along the direction of the gradient vector, by an amount determined by its magnitude, brings the result of each training iteration closer to the target of the model training.

As an example, in the training method of the word segmentation model according to the embodiment of the present invention, a model parameter, such as a model parameter u, of the word segmentation model corresponding to the labeled data set may be initialized randomly, iterative training is performed on the word segmentation model of the labeled data set, and in each iterative training process, a loss function of the word segmentation model is solved through a back propagation algorithm to obtain a gradient corresponding to the loss function, so as to adjust the model parameter of the word segmentation model according to the gradient corresponding to the loss function.
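A single parameter update of this kind can be sketched as follows. The text does not give the form of the loss function, so the sketch assumes a margin (hinge) loss between the predicted and labeled results and, for brevity, differentiates only the term y_i · u with respect to the parameter vector u. A full model would also back-propagate into the encoder and LSTM parameters. All function names are illustrative.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sum_vectors(vectors: List[Vector], dim: int) -> List[float]:
    """Element-wise sum of a list of vectors of dimension `dim`."""
    total = [0.0] * dim
    for vec in vectors:
        for d in range(dim):
            total[d] += vec[d]
    return total


def hinge_loss_and_gradient(u: Vector, labeled: List[Vector],
                            predicted: List[Vector],
                            margin: float = 1.0) -> Tuple[float, List[float]]:
    """Assumed loss: max(0, margin + score(predicted) - score(labeled)),
    where score(result) = sum_i y_i . u. Returns (loss, d loss / d u)."""
    dim = len(u)
    diff = [p - g for p, g in zip(sum_vectors(predicted, dim),
                                  sum_vectors(labeled, dim))]
    loss = margin + sum(d * w for d, w in zip(diff, u))
    if loss <= 0.0:
        return 0.0, [0.0] * dim
    return loss, diff


def gradient_step(u: Vector, grad: Vector, learning_rate: float) -> List[float]:
    """Move the parameters against the gradient direction."""
    return [w - learning_rate * g for w, g in zip(u, grad)]


# Toy update: u starts at zero; labeled and predicted results differ.
u0 = [0.0, 0.0]
loss0, grad0 = hinge_loss_and_gradient(u0, [[1.0, 0.0]], [[0.0, 1.0]])
u1 = gradient_step(u0, grad0, learning_rate=0.5)
loss1, _ = hinge_loss_and_gradient(u1, [[1.0, 0.0]], [[0.0, 1.0]])
```

In the toy case one step is enough to score the labeled result above the predicted one by the full margin, so the loss drops from 1.0 to 0.0.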

In the embodiment of the invention, the segmentation model is trained by utilizing the labeling data set and the word frequency dictionary extracted from the search behavior data set, and in the training process of the segmentation model, the segmentation result is scored by combining the word frequency dictionary, so that the accuracy of the predicted segmentation result of the segmentation model can be improved, and the segmentation effect of the trained segmentation model is improved.

The following describes in detail a training method of a segmentation model according to another embodiment of the present invention with reference to fig. 4 and 5. It should be noted that these examples are not intended to limit the scope of the present disclosure. FIG. 4 is a flowchart illustrating a method for training a segmentation model according to another embodiment of the present invention. FIG. 5 is a flowchart illustrating a method for training a segmentation model according to another embodiment of the present invention. Steps in fig. 5 that are the same or equivalent to those in fig. 3 use the same reference numerals.

As shown in fig. 4, in an embodiment, a word frequency dictionary is generated according to Query training data and a Query-Title data set, where the word frequency dictionary includes a word list and word frequencies corresponding to words in the word list; and combining the word frequency dictionary, and performing multi-task model training by using the labeling data set, Query training data and Query-Title data set to obtain a word segmentation model after the labeling data set is trained.

As shown in fig. 5, fig. 5 is substantially the same as fig. 3, except that, in one embodiment, the training method 500 of the segmentation model may further include:

step S240, training the segmentation model corresponding to the search behavior data set by using the search training text extracted from the search behavior data set, wherein the segmentation model corresponding to the search behavior data set and the segmentation model corresponding to the label data set share the word encoding layer and the word connection relation layer. The word segmentation model corresponding to the search behavior data set can comprise a word segmentation model corresponding to Query training data and/or a word segmentation model corresponding to a Query-Title data set.

In one embodiment, the word segmentation model corresponding to the labeling data set, the word segmentation model corresponding to the Query training data and the word segmentation model corresponding to the Query-Title data set are subjected to multi-task learning.

In one embodiment, a word coding layer and a word connection relation layer are shared for a word segmentation model corresponding to a labeling data set, a word segmentation model corresponding to Query training data and a word segmentation model corresponding to a Query-Title data set.

The word coding layer is used for determining word vectors of word segmentation in the first word segmentation result of the training text with the separation marks under different segmentation conditions and determining word vectors of word segmentation in the second word segmentation result of the search training text under different segmentation conditions;

and the word connection relation layer is used for determining the connection characteristic vector of the word in the first word segmentation result and the connection characteristic vector of the word in the second word segmentation result by using the word vector of the word in the first word segmentation result and the word vector of the word in the second word segmentation result.

In an embodiment, the step of training the segmentation model corresponding to the search behavior data set by using the search training text extracted from the search behavior data set in step S240 may specifically include:

step S241, determining word vectors of the word segmentation in the first word segmentation result of the training text with the separation marks under different segmentation conditions and determining word vectors of the word segmentation in the second word segmentation result of the search training text under different segmentation conditions by using a word coding module;

step S242, determining, by using the word connection vector determining module, a connection feature vector of the word in the first word segmentation result and a connection feature vector of the word in the second word segmentation result according to the word vector of the word in the first word segmentation result and the word vector of the word in the second word segmentation result.

In the embodiment of the invention, multi-task learning (Multi-task Learning) is used to train the segmentation model corresponding to the labeled data set, the segmentation model corresponding to the Query training data and the segmentation model corresponding to the Query-Title data set, giving three model training tasks. The model training tasks are trained in parallel, and the correlation among the tasks is exploited during training to improve the performance of the training result of each task.

FIG. 6 shows a system flow diagram of a model training method according to an exemplary embodiment of the present invention. In the following description of the embodiments, the segmentation model corresponding to the annotation data set may be used as the first segmentation model, and the model corresponding to the search behavior data set may be used as the second segmentation model. In training the first and second segmentation models, the training text with the separation markers from the annotation data set may be used as the first training text and the training data from the search behavior data set may be used as the second training text.

As shown in fig. 6, the first segmentation model may include a first input module, a word encoder module, a word connection vector determination module, a first segmentation result scoring module, and a first model parameter adjustment module; the second word segmentation model can comprise a second input module, a word encoder module, a word connection vector determination module, a second word segmentation result scoring module and a second model parameter adjustment module.

The first word segmentation model and the second word segmentation model share a word encoder module and a word connection vector determination module. And the first word segmentation result scoring module and the first model parameter adjusting module are private modules of the first word segmentation model, and the second word segmentation result scoring module and the second model parameter adjusting module are private modules of the second word segmentation model.
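The split between shared and private modules can be illustrated with a small object sketch. The encoder here is a toy lookup table and the connection module simply returns the previous word vector; both are placeholders for the neural word encoder and the LSTM of the embodiment, and all class names are hypothetical.

```python
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


class WordEncoder:
    """Shared module: toy lookup table standing in for the word encoder."""

    def __init__(self, table: Dict[str, List[float]], dim: int) -> None:
        self.table = table
        self.dim = dim

    def encode(self, word: str) -> List[float]:
        return self.table.setdefault(word, [0.0] * self.dim)


class ConnectionModule:
    """Shared module: placeholder for the LSTM; uses the previous word vector."""

    def connect(self, word_vectors: List[List[float]]) -> List[List[float]]:
        if not word_vectors:
            return []
        zero = [0.0] * len(word_vectors[0])
        return [zero] + word_vectors[:-1]


class SegmentationModel:
    """One task: shared encoder and connector, private scoring parameters u."""

    def __init__(self, encoder: WordEncoder, connector: ConnectionModule,
                 u: List[float]) -> None:
        self.encoder = encoder
        self.connector = connector
        self.u = u  # private to this task

    def score(self, words: List[str]) -> float:
        ys = [self.encoder.encode(w) for w in words]
        ps = self.connector.connect(ys)
        return sum(dot(y, self.u) + dot(p, y) for y, p in zip(ys, ps))


shared_encoder = WordEncoder({"a": [1.0, 0.0], "b": [0.0, 1.0]}, dim=2)
shared_connector = ConnectionModule()
first_model = SegmentationModel(shared_encoder, shared_connector, [1.0, 0.0])
second_model = SegmentationModel(shared_encoder, shared_connector, [0.0, 2.0])
```

Because both models hold references to the same encoder and connection objects, a parameter update made while training one task is immediately visible to the other, whereas each `u` stays task-specific.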

As shown in fig. 6, in an embodiment, the method 600 for training the word segmentation model specifically includes:

step S610, acquiring a training text with a separation mark in the labeled data set by using a first input module of the first segmentation model, and acquiring a search training text in the search behavior data set by using a second input module of the second segmentation model.

In this step, the search behavior data set includes Query training data and/or a Query-Title data set. The Query training data contains search terms with separation identifiers, and the Query-Title data set contains sentence fragments.

Step S620, determining word vectors of the word segmentation in the word segmentation result of the first training text under different segmentation conditions by using the word encoder module.

In this step, a word encoder module may be further used to determine word vectors of the word segmentation in the word segmentation result of the second training text under different segmentation conditions.

As shown in fig. 6, in one embodiment, the word segmentation result corresponding to the word segmentation mode may be determined through a Language Model (Language Model) or a preset dictionary. The first training text or the second training text may be decomposed into word segmentation units using a language model or looking up word groups in a dictionary. As one example, the language model may be an N-Gram language model.
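The dictionary-based decomposition can be sketched as a recursive enumeration of candidate results. The sketch assumes that any single character is always allowed as a fallback word and that longer pieces must appear in the preset dictionary; the function name, the `max_word_len` limit and the sample dictionary are illustrative.

```python
from typing import List, Set


def dictionary_segmentations(text: str, dictionary: Set[str],
                             max_word_len: int = 4) -> List[List[str]]:
    """Enumerate segmentations whose pieces are dictionary words or single characters."""
    if not text:
        return [[]]
    results: List[List[str]] = []
    for end in range(1, min(len(text), max_word_len) + 1):
        piece = text[:end]
        # Single characters are always permitted; longer pieces need a dictionary hit.
        if end == 1 or piece in dictionary:
            for rest in dictionary_segmentations(text[end:], dictionary,
                                                 max_word_len):
                results.append([piece] + rest)
    return results


# Toy dictionary and text with an ambiguous boundary.
toy_dictionary = {"研究", "研究生", "生命"}
candidates = dictionary_segmentations("研究生命", toy_dictionary)
```

Each returned list is one word segmentation result under a different segmentation condition; the scoring modules then rank them. Exhaustive recursion is exponential in the worst case, so a practical system would prune or use dynamic programming.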

In one embodiment, a word Encoder module, such as a word Encoder, may be used to implement the functions of the word vector determination module and the word vector module of the word segmentation model in the embodiments described above.

As an example, the word encoder module may determine the vectors of the single characters in the word segmentation result of the first training text under different segmentation conditions and, according to those vectors, determine the word vectors of the words in the word segmentation result of the first training text under different segmentation conditions. These functions have been described above and will not be described again here.

Step S630, determining a connection feature vector of the word in the word segmentation result of the first training text under different segmentation conditions according to the word vector of the word in the word segmentation result of the first training text under different segmentation conditions by using a word connection vector determination module.

In this step, the word connection vector determining module may further determine a connection feature vector of a word in the word segmentation result of the second training text under different segmentation conditions according to a word vector of a word in the word segmentation result of the second training text under different segmentation conditions.

As shown in FIG. 6, in one embodiment, the word connection vector determination module includes an LSTM neural network model. And inputting the word vector of the word in the word segmentation result of the first training text under different segmentation conditions and the word vector of the word in the word segmentation result of the second training text under different segmentation conditions into an LSTM neural network for training to obtain the connection characteristic vector of the word in the word segmentation result of the first training text under different segmentation conditions and the connection characteristic vector of the word in the word segmentation result of the second training text under different segmentation conditions.

In this example, the connected feature vectors of the participles are used to characterize the inter-participle connection relationship features.

Step S640, using the first segmentation result scoring module to score the segmentation results of the first training text under different segmentation conditions based on the word vectors and the connection feature vectors of the participles of the first training text under different segmentation conditions, determining the scores of the segmentation results of the first training text under different segmentation conditions, and taking the highest-scoring segmentation result of the first training text as the predicted segmentation result of the first training text.

In one embodiment, the score of the word segmentation result of the first training text under different segmentation conditions is the sum of the score determined by the word vector of the word segmentation in the word segmentation result of the first training text under different segmentation conditions and the score determined by the connection feature vector of the word segmentation in the word segmentation result.

In one embodiment, word frequency dictionaries generated according to the search behavior data sets can be used for determining word frequency scores of participles in the participle results of the first training text under different segmentation conditions. The score of the word segmentation result of the first training text under different segmentation conditions is the sum of the score determined by the word vector of the word in the word segmentation result of the first training text under different segmentation conditions, the score determined by the connection feature vector of the word in the word segmentation result, and the word frequency score of the word in the word segmentation result.

In this step, a second word segmentation result scoring module may be further used to score the word segmentation results of the second training text under different segmentation conditions based on the word vectors of the words in the word segmentation results of the second training text under different segmentation conditions and the connection feature vectors of the words in the word segmentation results, determine scores of the word segmentation results of the second training text under different segmentation conditions, and use the word segmentation result of the second training text with the highest score under different segmentation conditions as the predicted word segmentation result of the second training text.

In one embodiment, the scores of the word segmentation results of the second training text under different segmentation conditions are the sum of the score determined by the word vector of the word segmentation in the word segmentation results of the second training text under different segmentation conditions and the score determined by the connection feature vector of the word segmentation in the word segmentation results.

Step S650, determining a loss function of the first segmentation model through a predicted segmentation result of the first training text and a labeled segmentation result corresponding to the predicted segmentation result by using a model parameter adjusting module; and adjusting the model parameters of the first segmentation model by using the loss function of the first segmentation model, so as to obtain the trained first segmentation model by using the adjusted parameters of the first segmentation model.

In this step, the first training text in the annotation dataset has a separation marker. Therefore, the labeling word segmentation result of the first training text can be determined according to the first training text with the separation mark; and constructing a loss function of the segmentation model corresponding to the labeled data set by using the predicted segmentation result of the first training text and the labeled segmentation result of the first training text, wherein the loss function of the first segmentation model is used for representing the segmentation error of the first segmentation model.
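Deriving the labeled word segmentation result from a text with separation marks, and comparing it with a predicted result, can be sketched as follows. The sketch assumes the separation mark is a space and measures the word segmentation error as the number of word spans that appear in only one of the two results; the text does not prescribe this particular error measure.

```python
from typing import List, Set, Tuple


def labeled_result(marked_text: str, separator: str = " ") -> List[str]:
    """Split a training text on its separation marks (assumed to be spaces)."""
    return [word for word in marked_text.split(separator) if word]


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Character spans (start, end) covered by each word, in order."""
    spans: Set[Tuple[int, int]] = set()
    position = 0
    for word in words:
        spans.add((position, position + len(word)))
        position += len(word)
    return spans


def segmentation_error(predicted: List[str], labeled: List[str]) -> int:
    """Assumed error: count of word spans present in exactly one result."""
    return len(word_spans(predicted) ^ word_spans(labeled))


gold = labeled_result("研究 生命 起源")
predicted = ["研究生", "命", "起源"]
error = segmentation_error(predicted, gold)
```

An error of zero means the predicted and labeled results have identical word boundaries; a loss function built on such an error shrinks as the prediction approaches the labeling.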

In one embodiment, step S650 may further determine a loss function of the second segmentation model according to the predicted segmentation result of the second training text and the labeled segmentation result corresponding to the predicted segmentation result.

In this step, the model parameters of the second segmentation model may also be adjusted by using the loss function of the second segmentation model, so as to obtain the trained second segmentation model by using the adjusted parameters of the second segmentation model.

In this step, a plurality of first training texts may be obtained from the labeled data set, so that the first segmentation model is trained multiple times by using the plurality of first training texts. The word segmentation error of the first segmentation model after each training iteration is determined according to the predicted word segmentation result obtained in that iteration and the labeled word segmentation result corresponding to the predicted word segmentation result. When the change in the word segmentation error of the first segmentation model no longer increases and is smaller than a set threshold, or when the number of training iterations of the first segmentation model reaches the maximum number of training iterations, the trained first segmentation model is obtained.

The training method of the word segmentation model disclosed in the invention is a supervised machine learning method. The closer the predicted word segmentation result of the first training text is to the labeled word segmentation result in the data set, the more accurate the first segmentation model is.

In the embodiment of the invention, the search behavior data is automatically mined from the search log, and the word frequency of the search word in the massive search behavior data can be used for representing the importance of the search word. Therefore, the coverage of the automatically extracted search behavior data is wider, the word frequency information extracted by the search behavior data is fused in the training process of the word segmentation model, the training precision of the word segmentation model can be improved, and the accuracy of word segmentation is improved.

For a better understanding of the present invention, the training module of the word segmentation model according to the embodiments of the present invention is described in detail below with reference to the accompanying drawings. It should be noted that these embodiments are not intended to limit the scope of the present disclosure.

FIG. 7 is a diagram illustrating the structure of a training module of a word segmentation model according to an embodiment of the present invention. As shown in fig. 7, the training module 300 of the word segmentation model includes:

a labeling data set obtaining module 310, configured to obtain a labeling data set, where the labeling data set includes a training text with a separation identifier;

a search behavior data set obtaining module 320, configured to obtain a search behavior data set, and generate a word frequency dictionary according to the search behavior data set;

and the first segmentation model training module 330 is configured to train a segmentation model corresponding to the labeled data set based on the training text with the separation identifier and the word frequency dictionary to obtain a trained segmentation model corresponding to the labeled data set.

In one embodiment, the first segmentation model training module 330 includes:

the text segmentation unit is used for determining word segmentation results of the training text with the separation marks under different segmentation conditions according to the word segmentation model corresponding to the labeling data set and the word segmentation model parameters corresponding to the labeling data set;

the word segmentation result scoring unit is used for scoring the word segmentation result by combining the word frequency dictionary, determining the score of the word segmentation result, and taking the word segmentation result with the highest score as the predicted word segmentation result of the training text with the separation mark;

the loss function determining unit is used for determining a labeled word segmentation result of the training text according to the training text with the separation identification, and constructing a loss function of a word segmentation model corresponding to the labeled data set by using the predicted word segmentation result and the labeled word segmentation result;

and the model parameter adjusting unit is used for adjusting the word segmentation model parameters corresponding to the labeling data set by using the loss function so as to obtain the trained word segmentation model corresponding to the labeling data set by using the adjusted word segmentation model parameters.

In one embodiment, the word segmentation result scoring unit comprises:

the word segmentation score determining subunit is used for determining the score of the word segmentation in the word segmentation result according to the word vector of the word segmentation in the word segmentation result and the training parameter vector;

the connection score determining subunit is used for determining the connection score of the word segmentation in the word segmentation result according to the word vector of the word segmentation and the connection feature vector of the word segmentation;

the word frequency score determining subunit is used for searching the word frequency dictionary and determining the word frequency score of the participle in the participle result;

and the word segmentation result determining subunit is used for taking the sum of the score of the word segmentation, the connection score of the word segmentation and the word frequency score of the word segmentation in the word segmentation result as the score of the word segmentation result.

In this embodiment, the connected feature vector of the segmented word is a vector obtained by training the word vector of the segmented word according to the long-and-short-term memory model.

In one embodiment, the model parameter adjustment unit includes:

the gradient determining subunit is used for determining the gradient corresponding to the loss function;

and the model parameter adjusting unit is also used for adjusting the word segmentation model parameters corresponding to the labeling data set according to the gradient corresponding to the loss function.
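By way of a hedged example, adjusting parameters according to the gradient of a loss function may look like the sketch below. The embodiment does not specify how the gradient is computed or applied; the central-difference estimate, the plain gradient-descent update, the learning rate, and the names `numerical_gradient` and `gradient_step` are all assumptions used only for illustration.

```python
from typing import Callable, List


def numerical_gradient(
    loss: Callable[[List[float]], float],
    params: List[float],
    eps: float = 1e-6,
) -> List[float]:
    """Estimate the gradient of `loss` at `params` by central differences."""
    grads = []
    for i in range(len(params)):
        up = params[:i] + [params[i] + eps] + params[i + 1:]
        down = params[:i] + [params[i] - eps] + params[i + 1:]
        grads.append((loss(up) - loss(down)) / (2 * eps))
    return grads


def gradient_step(
    params: List[float],
    grads: List[float],
    learning_rate: float = 0.1,
) -> List[float]:
    """Move each parameter a small step against its gradient."""
    return [p - learning_rate * g for p, g in zip(params, grads)]
```

A practical system would more likely obtain the gradient by backpropagation through the model; the numerical estimate is used here only to keep the sketch self-contained.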

In one embodiment, the training module 300 for the word segmentation model may further include:

a second word segmentation model training module 340, configured to train the word segmentation model corresponding to the search behavior data set by using search training texts extracted from the search behavior data set, wherein,

the word segmentation model corresponding to the search behavior data set and the word segmentation model corresponding to the labeled data set share a word encoding module and a word connection vector determining module.

In one embodiment, the word encoding module is configured to determine word vectors of words in a first word segmentation result of a training text with separation identifiers under different segmentation conditions, and determine word vectors of words in a second word segmentation result of a search training text under different segmentation conditions;

and the word connection vector determining module is used for determining the connection characteristic vector of the word in the first word segmentation result and the connection characteristic vector of the word in the second word segmentation result by using the word vector of the word in the first word segmentation result and the word vector of the word in the second word segmentation result.
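The sharing of modules between the two word segmentation models may be illustrated by the following minimal sketch. The class names `WordEncoder` and `SegmentationModel`, the dictionary-based embedding table, and the zero vector for unseen words are hypothetical; the shared word connection vector determining module is omitted for brevity and would be shared in the same way.

```python
from typing import Dict, List


class WordEncoder:
    """Hypothetical shared word encoding module: maps a word to its word vector."""

    def __init__(self, dim: int = 2) -> None:
        self.dim = dim
        self.table: Dict[str, List[float]] = {}

    def encode(self, word: str) -> List[float]:
        # Unseen words get a zero vector (an assumption made for this sketch).
        return self.table.setdefault(word, [0.0] * self.dim)


class SegmentationModel:
    """One word segmentation model with task-specific parameters and a shared encoder."""

    def __init__(self, encoder: WordEncoder, params: List[float]) -> None:
        self.encoder = encoder  # the same object is passed to both models
        self.params = params    # task-specific scoring parameters

    def word_score(self, word: str) -> float:
        return sum(x * p for x, p in zip(self.encoder.encode(word), self.params))


shared_encoder = WordEncoder()
labeled_model = SegmentationModel(shared_encoder, [1.0, 0.0])  # labeled data set
search_model = SegmentationModel(shared_encoder, [0.0, 1.0])   # search behavior data set
```

Because both models hold a reference to the same encoder object, any update to a word vector made while training one model is immediately visible to the other.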

In one embodiment, the search behavior data set of an embodiment of the present invention includes: search data input during a user's search, and title information of the clicked search results.

In an embodiment, the model parameter adjusting unit may be further configured to:

determining a word segmentation error of the trained word segmentation model corresponding to the labeled data set by using the predicted word segmentation result and the labeled word segmentation result;

constructing a loss function of a word segmentation model corresponding to the labeling data set according to the word segmentation errors;

adjusting the word segmentation model parameters corresponding to the labeled data set by using a loss function;

and when the change in the word segmentation error no longer increases and is smaller than a set threshold value, or when the number of training iterations of the word segmentation model corresponding to the labeled data set reaches the maximum number of training iterations, obtaining the trained word segmentation model corresponding to the labeled data set.
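The stopping condition described above may be sketched as a simple training loop. The function name `train_until_converged`, the callable `step` (assumed to run one round of parameter adjustment and return the current word segmentation error), and the default threshold and iteration limit are hypothetical choices for this illustration.

```python
from typing import Callable, Optional, Tuple


def train_until_converged(
    step: Callable[[], float],
    threshold: float = 1e-3,
    max_iters: int = 100,
) -> Tuple[int, Optional[float]]:
    """Run `step` until the error change falls below `threshold` or `max_iters` is reached.

    Returns the number of iterations performed and the last observed error.
    """
    prev_error: Optional[float] = None
    for iteration in range(1, max_iters + 1):
        error = step()
        # Stop once the change in error between two rounds is below the threshold.
        if prev_error is not None and abs(prev_error - error) < threshold:
            return iteration, error
        prev_error = error
    # Maximum number of training iterations reached.
    return max_iters, prev_error
```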

In the embodiment of the invention, manually labeled data in the e-commerce field usually requires high labor cost, whereas search behavior training data can be acquired automatically from search behavior logs, and the search behavior data carries both the corpus information specific to the field and the general grammatical structure characteristics of the field. Therefore, training the word segmentation model with the labeled data set and the search behavior data set, in combination with the word frequency scores, can improve the word segmentation accuracy and the domain adaptability of word segmentation, and thereby improve the segmentation effect of the word segmentation model.

FIG. 8 shows a flow diagram of a word segmentation method according to an embodiment of the invention. As shown in fig. 8, the word segmentation method 400 may include:

step S410, acquiring a character sequence of a specified text;

step S420, utilizing the word segmentation model to segment the character sequence of the specified text to obtain the word segmentation result of the specified text, wherein,

the word segmentation model is a model obtained by training a training text with separation marks in a labeling data set and a word frequency dictionary, and the word frequency dictionary is a set generated according to a search behavior data set.

In one embodiment, step S420 may include:

step S421, determining word vectors of word segmentation in word segmentation results of the specified text under different segmentation conditions;

step S422, determining the connection characteristic vector of the word segmentation in the word segmentation result of the specified text under different segmentation conditions according to the word vector of the word segmentation in the word segmentation result of the specified text under different segmentation conditions;

step S423, determining scores of the word segmentation results of the specified text under different segmentation conditions based on the word vectors of the words in the word segmentation results of the specified text under different segmentation conditions, the connection feature vectors of the words in the word segmentation results and the word frequency dictionary;

step S424, the word segmentation result of the designated text with the highest score under different segmentation conditions is used as the word segmentation result of the designated text.
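Steps S421 to S424 amount to scoring candidate segmentations and keeping the best one, which may be sketched as follows. The exhaustive enumeration, the `max_word_len` limit, and the names `enumerate_segmentations` and `best_segmentation` are assumptions of this sketch; the embodiment does not state how the candidate segmentations are generated, and the scoring function is passed in rather than fixed.

```python
from typing import Callable, Iterator, List


def enumerate_segmentations(text: str, max_word_len: int = 4) -> Iterator[List[str]]:
    """Yield every way of cutting `text` into words of at most `max_word_len` characters."""
    if not text:
        yield []
        return
    for end in range(1, min(max_word_len, len(text)) + 1):
        head = text[:end]
        for rest in enumerate_segmentations(text[end:], max_word_len):
            yield [head] + rest


def best_segmentation(
    text: str,
    score: Callable[[List[str]], float],
    max_word_len: int = 4,
) -> List[str]:
    """Return the candidate segmentation with the highest score."""
    return max(enumerate_segmentations(text, max_word_len), key=score)
```

Exhaustive enumeration grows exponentially with the text length, so a practical implementation would likely use beam search or dynamic programming instead; the sketch keeps the enumeration explicit to mirror the "different segmentation conditions" of the steps above.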

Fig. 9 is a schematic structural diagram of a word segmentation apparatus according to an embodiment of the present invention. As shown in fig. 9, in one embodiment, the word segmentation apparatus 500 may include:

a text obtaining module 510, configured to obtain a character sequence of a specified text;

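a word segmentation module 520, configured to perform word segmentation on the character sequence of the specified text by using a word segmentation model to obtain a word segmentation result of the specified text, where the word segmentation model is a model obtained by training with training texts having separation marks in a labeled data set and a word frequency dictionary, and the word frequency dictionary is a set generated according to a search behavior data set.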

In one embodiment, the word segmentation module 520 may specifically include:

the word encoder module is used for determining word vectors of word segmentation in the word segmentation result of the specified text under different segmentation conditions;

the word connection vector determining module is used for determining connection characteristic vectors of word segmentation in the word segmentation result of the specified text under different segmentation conditions according to word vectors of word segmentation in the word segmentation result of the specified text under different segmentation conditions;

the word segmentation result scoring module is used for determining the scores of the word segmentation results of the specified texts under different segmentation conditions based on word vectors of the words in the word segmentation results of the specified texts under different segmentation conditions, connection characteristic vectors of the words in the word segmentation results and a word frequency dictionary;

the word segmentation module 520 is further configured to take the word segmentation result of the specified text with the highest score under different segmentation conditions as the word segmentation result of the specified text.

Fig. 10 shows a flow chart of an information retrieval method according to an embodiment of the invention. As shown in fig. 10, the information retrieval method 600 may include:

step S610, acquiring search data input in the user search process.

Step S620, utilizing a word segmentation model to segment the character sequence in the search data to obtain a word segmentation result of the search data, wherein the word segmentation model is a model obtained by training with training texts having separation marks in a labeled data set and a word frequency dictionary, and the word frequency dictionary is a set generated according to the search behavior data set.

Step S630, a search is performed on the word segmentation result of the search data.

In one embodiment, step S620 may include:

step S421, determining word vectors of word segmentation in the word segmentation result of the character sequence in the search data under different segmentation conditions;

step S422, determining connection characteristic vectors of the word segmentation in the word segmentation result under different segmentation conditions of the character sequence in the search data according to the word vectors of the word segmentation in the word segmentation result under different segmentation conditions of the character sequence in the search data;

step S423, determining scores of the word segmentation results of the character sequence in the search data under different segmentation conditions, based on the word vectors of the words in those word segmentation results, the connection feature vectors of the words in those word segmentation results, and the word frequency dictionary;

step S424, the word segmentation result of the character sequence in the search data with the highest score under different segmentation conditions is used as the word segmentation result of the character sequence in the search data.
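The retrieval flow of steps S610 to S630 may be illustrated with a small inverted-index sketch. The embodiment only states that retrieval is performed on the word segmentation result; the inverted index, the requirement that a document contain every query word, and the names `build_index` and `retrieve` are assumptions made for this example, and the segmenter is passed in as a callable.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set


def build_index(docs: Dict[int, List[str]]) -> Dict[str, Set[int]]:
    """Map each word to the set of document ids whose word list contains it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, words in docs.items():
        for word in words:
            index[word].add(doc_id)
    return index


def retrieve(
    query: str,
    segment: Callable[[str], List[str]],
    index: Dict[str, Set[int]],
) -> List[int]:
    """Segment the query, then return ids of documents containing every query word."""
    words = segment(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)
```

Here `segment` stands in for the trained word segmentation model; any function that turns the search data into a list of words can be supplied.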

Fig. 11 is a schematic structural diagram of an information retrieval apparatus according to an embodiment of the present invention. As shown in fig. 11, the information retrieval apparatus 700 may include:

a data obtaining module 710, configured to obtain search data input in a user search process;

a word segmentation module 720, configured to perform word segmentation on the character sequence in the search data by using a word segmentation model to obtain a word segmentation result of the search data, where the word segmentation model is a model obtained by training with training texts having separation marks in a labeled data set and a word frequency dictionary, and the word frequency dictionary is a set generated according to a search behavior data set;

and the retrieval module 730 is used for retrieving the word segmentation result of the search data.

In one embodiment, the word segmentation module 720 may include:

the word encoder module is used for determining word vectors of word segmentation in word segmentation results of character sequences in the search data under different segmentation conditions;

the word connection vector determining module is used for determining connection feature vectors of word segmentation in the word segmentation result under different segmentation conditions of the character sequence in the search data according to the word vectors of the word segmentation in the word segmentation result under different segmentation conditions of the character sequence in the search data;

the word segmentation result scoring module is used for determining the scores of the word segmentation results of the character sequence in the search data under different segmentation conditions, based on the word vectors of the words in those word segmentation results, the connection feature vectors of the words in those word segmentation results, and the word frequency dictionary;

the word segmentation module 720 is also used for taking the word segmentation result of the character sequence in the search data with the highest score under different segmentation conditions as the word segmentation result of the character sequence in the search data.

It should be noted that the apparatuses in the foregoing embodiments can be used as the execution main body in the methods in the foregoing embodiments, and can implement corresponding processes in the methods to achieve the same technical effects, and for brevity, the contents of this aspect are not described again.

FIG. 12 is a block diagram illustrating an exemplary hardware architecture of a computing device capable of implementing the method and apparatus for training a segmentation model according to embodiments of the present invention.

As shown in fig. 12, computing device 1200 includes an input device 1201, an input interface 1202, a central processor 1203, a memory 1204, an output interface 1205, and an output device 1206. The input interface 1202, the central processing unit 1203, the memory 1204, and the output interface 1205 are connected to each other through the bus 1210, and the input device 1201 and the output device 1206 are connected to the bus 1210 through the input interface 1202 and the output interface 1205, respectively, and further connected to other components of the computing device 1200.

Specifically, the input device 1201 receives input information from the outside and transmits the input information to the central processor 1203 via the input interface 1202; the central processor 1203 processes the input information based on computer-executable instructions stored in the memory 1204 to generate output information, temporarily or permanently stores the output information in the memory 1204, and then transmits the output information to the output device 1206 via the output interface 1205; output device 1206 outputs output information to the exterior of computing device 1200 for use by a user.

In one embodiment, the computing device 1200 shown in fig. 12 may be implemented as a training system for a segmentation model that may include: a memory configured to store a program; a processor configured to execute the program stored in the memory to execute the training method of the word segmentation model described in the above embodiments.

According to an embodiment of the invention, the process described above with reference to the flow chart may be implemented as a computer software program. For example, an embodiment of the invention includes a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing a method of training a segmentation model as illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network, and/or installed from a removable storage medium.

FIG. 13 is a block diagram illustrating an exemplary hardware architecture of a computing device capable of implementing the word segmentation method and apparatus according to embodiments of the present invention.

As shown in fig. 13, computing device 1300 includes an input device 1301, an input interface 1302, a central processor 1303, a memory 1304, an output interface 1305, and an output device 1306. The input interface 1302, the central processor 1303, the memory 1304, and the output interface 1305 are connected to each other through a bus 1310, and the input device 1301 and the output device 1306 are connected to the bus 1310 through the input interface 1302 and the output interface 1305, respectively, and further connected to other components of the computing device 1300.

Specifically, the input device 1301 receives input information from the outside, and transmits the input information to the central processor 1303 through the input interface 1302; the central processor 1303 processes input information based on computer-executable instructions stored in the memory 1304 to generate output information, stores the output information in the memory 1304 temporarily or permanently, and then transmits the output information to the output device 1306 through the output interface 1305; output device 1306 outputs output information to the exterior of computing device 1300 for use by a user.

In one embodiment, the computing device 1300 shown in fig. 13 may be implemented as a word segmentation system that may include: a memory configured to store a program; a processor configured to execute the program stored in the memory to perform the word segmentation method described in the above embodiments.

According to an embodiment of the invention, the process described above with reference to the flow chart may be implemented as a computer software program. For example, embodiments of the invention include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing a method of word segmentation as illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network, and/or installed from a removable storage medium.

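FIG. 14 is a block diagram illustrating an exemplary hardware architecture of a computing device capable of implementing the information retrieval method and apparatus according to embodiments of the present invention.

As shown in fig. 14, computing device 1400 includes an input device 1401, an input interface 1402, a central processor 1403, a memory 1404, an output interface 1405, and an output device 1406. The input interface 1402, the central processor 1403, the memory 1404, and the output interface 1405 are connected to each other through a bus 1410, and the input device 1401 and the output device 1406 are connected to the bus 1410 through the input interface 1402 and the output interface 1405, respectively, and further connected to other components of the computing device 1400.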

Specifically, the input device 1401 receives input information from the outside and transmits the input information to the central processor 1403 via the input interface 1402; central processor 1403 processes the input information based on computer-executable instructions stored in memory 1404 to generate output information, stores the output information temporarily or permanently in memory 1404, and then transmits the output information to output device 1406 via output interface 1405; output device 1406 outputs output information to the exterior of computing device 1400 for use by a user.

In one embodiment, the computing device 1400 shown in FIG. 14 may be implemented as an information retrieval system that may include: a memory configured to store a program; a processor configured to execute the program stored in the memory to perform the information retrieval method described in the above embodiments.

According to an embodiment of the invention, the process described above with reference to the flow chart may be implemented as a computer software program. For example, embodiments of the invention include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing an information retrieval method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network, and/or installed from a removable storage medium.

The above embodiments may be implemented wholly or partially in software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions which, when run on a computer, cause the computer to perform the methods described in the various embodiments above. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), among others.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

It is to be understood that the invention is not limited to the particular arrangements and instrumentality described in the above embodiments and shown in the drawings. For convenience and brevity of description, detailed description of a known method is omitted here, and for the specific working processes of the system, the module and the unit described above, reference may be made to corresponding processes in the foregoing method embodiments, which are not described herein again.

It will be apparent to those skilled in the art that the method procedures of the present invention are not limited to the specific steps described and illustrated. After appreciating the spirit of the present invention, those skilled in the art can make various changes, modifications and additions, or equivalent substitutions, or change the sequence of steps, all of which fall within the technical scope of the present invention.

Claims

1. A training method of a word segmentation model comprises the following steps:

acquiring a labeling data set, wherein the labeling data set comprises training texts with separation marks;

acquiring a search behavior data set, and generating a word frequency dictionary according to the search behavior data set;

and training the word segmentation model corresponding to the labeled data set based on the training text with the separation identification and the word frequency dictionary to obtain the trained word segmentation model corresponding to the labeled data set.

2. The method for training the segmentation model according to claim 1, wherein the training the segmentation model corresponding to the labeled data set based on the training text with the separation identifier and the word frequency dictionary to obtain the trained segmentation model corresponding to the labeled data set includes:

determining word segmentation results of the training text with the separation marks under different segmentation conditions according to the word segmentation model corresponding to the labeling data set and the word segmentation model parameters corresponding to the labeling data set;

scoring the word segmentation result by combining the word frequency dictionary, determining the score of the word segmentation result, and taking the word segmentation result with the highest score as the predicted word segmentation result of the training text with the separation mark;

determining a labeled word segmentation result of the training text according to the training text with the separation mark, and constructing a loss function of a word segmentation model corresponding to the labeled data set by using the predicted word segmentation result and the labeled word segmentation result;

and adjusting the word segmentation model parameters corresponding to the labeled data set by using the loss function, so as to obtain the trained word segmentation model corresponding to the labeled data set by using the adjusted word segmentation model parameters.

3. The method for training the word segmentation model according to claim 2, wherein the scoring the word segmentation result in combination with the word frequency dictionary to determine the score of the word segmentation result comprises:

determining the score of the word segmentation in the word segmentation result according to the word vector and the parameter vector of the word segmentation in the word segmentation result;

determining the connection score of the word segmentation in the word segmentation result according to the word vector of the word segmentation and the connection feature vector of the word segmentation;

searching the word frequency dictionary, and determining the word frequency score of each segmented word in the word segmentation result;

and taking the sum of the score of the segmented word, the connection score of the segmented word and the word frequency score of the segmented word in the word segmentation result as the score of the word segmentation result.

4. The method for training a segmentation model according to claim 3, wherein,

the connection feature vector of the segmented word is a vector obtained by training on the word vector of the segmented word with a long short-term memory (LSTM) model.

5. The method for training the segmentation model according to claim 2, wherein the adjusting the segmentation model parameters corresponding to the labeled data set by using the loss function includes:

determining a gradient corresponding to the loss function;

and adjusting the word segmentation model parameters corresponding to the labeling data set according to the gradient corresponding to the loss function.

6. The method for training a segmentation model according to claim 1, further comprising:

training a segmentation model corresponding to the search behavior data set by using a search training text extracted from the search behavior data set, wherein,

the word segmentation model corresponding to the search behavior data set and the word segmentation model corresponding to the marking data set share a word coding module and a word connection vector determination module.

7. The method for training the segmentation model according to claim 6, wherein the training the segmentation model corresponding to the search behavior data set by using the search training text extracted from the search behavior data set includes:

determining word vectors of word segmentation in the first word segmentation result of the training text with the separation identification under different segmentation conditions and determining word vectors of word segmentation in the second word segmentation result of the search training text under different segmentation conditions by using the word coding module;

and determining the connection characteristic vector of the word in the first word segmentation result and the connection characteristic vector of the word in the second word segmentation result by using the word connection vector determination module according to the word vector of the word in the first word segmentation result and the word vector of the word in the second word segmentation result.

8. The method for training a segmentation model according to claim 1,

the search behavior data set comprises: search data input during a user's search, and title information of the clicked search results.

9. The method for training the segmentation model according to claim 2, wherein the adjusting the segmentation model parameters corresponding to the labeled data set by using the loss function to obtain the trained segmentation model corresponding to the labeled data set by using the adjusted segmentation model parameters comprises:

determining word segmentation errors of the trained word segmentation models corresponding to the labeled data sets by using the predicted word segmentation results and the labeled word segmentation results;

constructing a loss function of a word segmentation model corresponding to the labeling data set according to the word segmentation error;

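adjusting the word segmentation model parameters corresponding to the labeled data set by using the loss function;

and when the change in the word segmentation error no longer increases and is smaller than a set threshold value, or when the number of training iterations of the word segmentation model corresponding to the labeled data set reaches the maximum number of training iterations, obtaining the trained word segmentation model corresponding to the labeled data set.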

10. A training apparatus for a segmentation model, comprising:

a labeled data set acquisition module, configured to acquire a labeled data set, wherein the labeled data set comprises training texts with separation marks;

the searching behavior data set acquisition module is used for acquiring a searching behavior data set and generating a word frequency dictionary according to the searching behavior data set;

and the first word segmentation model training module is used for training the word segmentation model corresponding to the labeled data set based on the training text with the separation identification and the word frequency dictionary to obtain the trained word segmentation model corresponding to the labeled data set.

11. The apparatus for training a segmentation model according to claim 10, further comprising:

a second segmentation model training module, configured to train a segmentation model corresponding to the search behavior data set by using a search training text extracted from the search behavior data set, where,

and the word segmentation model corresponding to the search behavior data set and the word segmentation model corresponding to the marking data set share a word coding layer and a word connection relation layer.

12. A method of word segmentation, comprising:

acquiring a character sequence of a specified text;

utilizing a word segmentation model to segment the character sequence of the specified text to obtain a word segmentation result of the specified text, wherein,

the word segmentation model is obtained by training a training text with separation marks in a labeling data set and a word frequency dictionary, and the word frequency dictionary is a set generated according to a search behavior data set.

13. The word segmentation method according to claim 12, wherein the obtaining of the word segmentation result of the specified text by performing word segmentation on the character sequence of the specified text by using a word segmentation model comprises:

determining word vectors of word segmentation in word segmentation results of the specified text under different segmentation conditions;

determining a connection characteristic vector of the word segmentation in the word segmentation result of the specified text under different segmentation conditions according to the word vector of the word segmentation in the word segmentation result of the specified text under different segmentation conditions;

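determining scores of the word segmentation results of the specified text under different segmentation conditions based on the word vectors of the words in the word segmentation results of the specified text under different segmentation conditions, the connection feature vectors of the words in the word segmentation results and the word frequency dictionary;

and taking the word segmentation result of the specified text with the highest score under different segmentation conditions as the word segmentation result of the specified text.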

14. A word segmentation device comprising:

the text acquisition module is used for acquiring a character sequence of the specified text;

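a word segmentation module, configured to perform word segmentation on the character sequence of the specified text by using a word segmentation model to obtain a word segmentation result of the specified text, wherein the word segmentation model is a model obtained by training with training texts having separation marks in a labeled data set and a word frequency dictionary, and the word frequency dictionary is a set generated according to a search behavior data set.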

15. The word segmentation apparatus according to claim 14, wherein the word segmentation module comprises:

the word encoder module is used for determining word vectors of the words in the word segmentation results of the specified text under different segmentation conditions;

the word connection vector determining module is used for determining connection feature vectors of the words in the word segmentation results of the specified text under different segmentation conditions according to the word vectors of the words in the word segmentation results of the specified text under different segmentation conditions;

the word segmentation result scoring module is used for determining scores of the word segmentation results of the specified text under different segmentation conditions based on the word vectors of the words in the word segmentation results of the specified text under different segmentation conditions, the connection feature vectors of the words in the word segmentation results, and a word frequency dictionary;

and the word segmentation model is further used for taking the word segmentation result with the highest score among the word segmentation results of the specified text under different segmentation conditions as the word segmentation result of the specified text.

16. An information retrieval method, comprising:

acquiring search data input in a user search process;

segmenting the character sequence in the search data by using a word segmentation model to obtain a word segmentation result of the search data, wherein the word segmentation model is obtained by training with training texts with separation marks in a labeled data set and with a word frequency dictionary, the word frequency dictionary being generated from a search behavior data set;

and performing retrieval on the word segmentation result of the search data.
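The three steps of claim 16 (acquire search data, segment it, retrieve on the segmentation result) can be illustrated with a minimal sketch. The claim does not say how retrieval is carried out, so the inverted index below is an assumption. The function `toy_segmenter` is a hypothetical forward-maximum-matching placeholder for the trained word segmentation model, not the model itself. All names and the sample documents are illustrative.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set


def build_inverted_index(docs: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each word to the set of document ids that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, words in docs.items():
        for word in words:
            index[word].add(doc_id)
    return dict(index)


def toy_segmenter(text: str, vocab: Set[str], max_len: int = 4) -> List[str]:
    """Placeholder segmenter (forward maximum matching), NOT the trained model."""
    words, i = [], 0
    while i < len(text):
        # Try the longest piece first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def retrieve(query: str,
             segmenter: Callable[[str], List[str]],
             index: Dict[str, Set[str]]) -> List[str]:
    """Segment the query, then rank documents by number of matched words."""
    hits: Dict[str, int] = defaultdict(int)
    for word in segmenter(query):
        for doc_id in index.get(word, ()):
            hits[doc_id] += 1
    # More matched words first; ties broken by document id.
    return sorted(hits, key=lambda d: (-hits[d], d))


# Illustrative data only.
VOCAB = {"红色", "连衣裙", "蓝色", "手机壳"}
DOCS = {
    "d1": ["红色", "连衣裙"],
    "d2": ["蓝色", "连衣裙"],
    "d3": ["红色", "手机壳"],
}
INDEX = build_inverted_index(DOCS)
print(retrieve("红色连衣裙", lambda q: toy_segmenter(q, VOCAB), INDEX))
```

Passing the segmenter in as a callable keeps the retrieval step independent of how the segmentation model was trained, which mirrors the separation between the segmentation and retrieval steps in the claim.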

17. The information retrieval method according to claim 16, wherein segmenting the character sequence in the search data by using the word segmentation model to obtain the word segmentation result of the search data comprises:

determining word vectors of the words in the word segmentation results of the character sequence in the search data under different segmentation conditions;

determining connection feature vectors of the words in the word segmentation results of the character sequence in the search data under different segmentation conditions according to the word vectors of the words in the word segmentation results of the character sequence in the search data under different segmentation conditions;

determining scores of the word segmentation results of the character sequence in the search data under different segmentation conditions based on the word vectors of the words in the word segmentation results of the character sequence in the search data under different segmentation conditions, the connection feature vectors of the words in the word segmentation results, and a word frequency dictionary;

and taking the word segmentation result with the highest score among the word segmentation results of the character sequence in the search data under different segmentation conditions as the word segmentation result of the character sequence in the search data.
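The scoring procedure recited above can be sketched as follows: enumerate candidate segmentations, compute a vector for each word, compute a connection feature for each pair of adjacent words, combine these with word frequencies into a score, and keep the highest-scoring candidate. This is a minimal sketch under strong assumptions. The claim does not specify the encoders or the scoring function, so `word_vector`, `connection_vector`, the fixed weights, and the frequency values are hand-picked toy stand-ins for what a trained model would learn.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Assumed toy weights; a trained model would learn its own parameters.
WORD_WEIGHTS = (-0.5, 0.1)   # per-word penalty, per-character reward
CONN_WEIGHTS = (-0.05,)      # penalty on length mismatch of adjacent words


def enumerate_segmentations(chars: str, max_word_len: int = 4) -> List[List[str]]:
    """Return every split of `chars` into words of at most `max_word_len` characters."""
    if not chars:
        return [[]]
    results = []
    for end in range(1, min(max_word_len, len(chars)) + 1):
        for tail in enumerate_segmentations(chars[end:], max_word_len):
            results.append([chars[:end]] + tail)
    return results


def word_vector(word: str) -> Tuple[float, float]:
    """Toy stand-in for a word encoder: (bias, word length)."""
    return (1.0, float(len(word)))


def connection_vector(left: Sequence[float], right: Sequence[float]) -> Tuple[float]:
    """Toy connection feature of two adjacent words: their length difference."""
    return (abs(left[1] - right[1]),)


def dot(weights: Sequence[float], vec: Sequence[float]) -> float:
    return sum(w * x for w, x in zip(weights, vec))


def score(words: List[str], freq_dict: Dict[str, int]) -> float:
    """Combine word vectors, connection features and word frequencies into one score."""
    vectors = [word_vector(w) for w in words]
    total = sum(dot(WORD_WEIGHTS, v) for v in vectors)
    total += sum(dot(CONN_WEIGHTS, connection_vector(a, b))
                 for a, b in zip(vectors, vectors[1:]))
    total += sum(math.log1p(freq_dict.get(w, 0)) for w in words)
    return total


def segment(chars: str, freq_dict: Dict[str, int], max_word_len: int = 4) -> List[str]:
    """Pick the candidate segmentation with the highest score."""
    candidates = enumerate_segmentations(chars, max_word_len)
    return max(candidates, key=lambda words: score(words, freq_dict))


# Made-up frequencies, for illustration only.
EXAMPLE_FREQ = {"南京市": 50, "长江大桥": 40, "南京": 30, "市长": 20,
                "长江": 35, "大桥": 25, "江大桥": 1}
print(segment("南京市长江大桥", EXAMPLE_FREQ))  # ['南京市', '长江', '大桥']
```

Exhaustive enumeration is exponential in the text length and is used here only to keep the sketch short; a practical system would more likely use dynamic programming or beam search over candidates.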

18. An information retrieval apparatus comprising:

the data acquisition module is used for acquiring search data input in the search process of a user;

the word segmentation module is used for segmenting the character sequence in the search data by using a word segmentation model to obtain a word segmentation result of the search data, wherein the word segmentation model is obtained by training with training texts with separation marks in a labeled data set and with a word frequency dictionary, the word frequency dictionary being generated from a search behavior data set;

and the retrieval module is used for performing retrieval on the word segmentation result of the search data.

19. The information retrieval device of claim 18, wherein the word segmentation module comprises:

the word encoder module is used for determining word vectors of the words in the word segmentation results of the character sequence in the search data under different segmentation conditions;

the word segmentation result scoring module is used for determining scores of the word segmentation results of the character sequence in the search data under different segmentation conditions based on the word vectors of the words in the word segmentation results of the character sequence in the search data under different segmentation conditions, the connection feature vectors of the words in the word segmentation results, and the word frequency dictionary;

the word segmentation model is further used for taking the word segmentation result with the highest score among the word segmentation results of the character sequence in the search data under different segmentation conditions as the word segmentation result of the character sequence in the search data.

20. A training system of a word segmentation model, comprising a memory and a processor;

the memory is used for storing executable program codes;

the processor is configured to read executable program code stored in the memory to perform the training method of a word segmentation model according to any one of claims 1 to 9.

21. A word segmentation system comprising a memory and a processor;

the memory is used for storing executable program codes;

the processor is configured to read executable program code stored in the memory to perform the word segmentation method of claim 12.

22. An information retrieval system comprising a memory and a processor;

the memory is used for storing executable program codes;

the processor is configured to read executable program code stored in the memory to perform the information retrieval method of claim 16.

23. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the training method of a word segmentation model according to any one of claims 1 to 9.

24. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the word segmentation method of claim 12.

25. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the information retrieval method of claim 16.