WO2020224219A1 - Chinese word segmentation method and apparatus, electronic device and readable storage medium - Google Patents

Chinese word segmentation method and apparatus, electronic device and readable storage medium

Info

Publication number
WO2020224219A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
training
neural network
network model
convolutional neural
Prior art date
Application number
PCT/CN2019/117900
Other languages
French (fr)
Chinese (zh)
Inventor
金戈
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2020224219A1 publication Critical patent/WO2020224219A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/14 Image acquisition
    • G06V 30/148 Segmentation of character regions
    • G06V 30/153 Segmentation of character regions using recognition of characters or words

Definitions

  • This application relates to the technical field of data analysis, and in particular to a Chinese word segmentation method, apparatus, electronic device, and readable storage medium that perform Chinese word segmentation through a convolutional neural network model.
  • When performing data analysis on text-type data, word segmentation is required first: the coherent text is broken down into a sequence of units, each with a specific linguistic meaning. This need is particularly prominent in Chinese information processing. As a basic step in natural language processing, word segmentation plays an important role in the field.
  • Chinese word segmentation differs from English tokenization: English text is made up of words separated by spaces, whereas Chinese text is a sequence of characters, and all the characters in a sentence run together to express a complete meaning.
  • Chinese word segmentation therefore divides a sequence of Chinese characters into meaningful words, which is also called word cutting. For example, segmenting the phrase "knowledge is power" yields "knowledge / is / power".
  • The accuracy of Chinese word segmentation often directly affects the relevance ranking of search results.
  • At present, text segmentation algorithms generally use template matching, such as word segmentation algorithms based on text matching or on dictionaries; their accuracy depends entirely on the template, resulting in low segmentation accuracy.
  • To solve the problem of low word segmentation accuracy in the prior art, this application provides a Chinese word segmentation method, apparatus, electronic device, and readable storage medium that improve segmentation accuracy and segment text quickly.
  • In the first aspect, this application provides a Chinese word segmentation method based on a convolutional neural network model.
  • The Chinese word segmentation method includes the following steps:
  • Step 1: Obtain a word dictionary, remove the special symbols and non-Chinese characters in it, and separate each character in the word dictionary into individual character form; the set of characters in individual character form is the first training text;
  • Step 2: Convert the first training text into a first word vector training text in word vector form through word vector training, and determine a word vector dictionary from the first training text and the first word vector training text; the word vector dictionary records the correspondence between characters and word vectors;
  • Step 3: Obtain a second training text with word segmentation annotations, and convert the second training text into training information in word vector form according to the word vector dictionary;
  • Step 4: Train the convolutional neural network model according to the training information, a preset cross-entropy loss function, and the ADAM optimization algorithm;
  • Step 5: Perform character boundary recognition prediction on the input text to be segmented according to the training result of the convolutional neural network model.
  • In the second aspect, this application provides a Chinese word segmentation apparatus based on a convolutional neural network model, including a preprocessing module, a word vector training module, a training information generation module, a training module, and a recognition prediction module, wherein:
  • The preprocessing module is used to obtain a word dictionary and preprocess it: the preprocessing removes the special symbols and non-Chinese characters in the word dictionary and separates each character in the word dictionary into individual character form; the set of characters in individual character form is the first training text;
  • The word vector training module is configured to convert the first training text into a first word vector training text in word vector form, and to determine a word vector dictionary from the first training text and the first word vector training text; the word vector dictionary records the correspondence between characters and word vectors;
  • The training information generation module is used to obtain a second training text with word segmentation annotations and convert it into training information in word vector form according to the word vector dictionary;
  • The training module is used to train the convolutional neural network model according to a preset cross-entropy loss function, the ADAM optimization algorithm, and the training information;
  • The recognition prediction module is configured to perform character boundary recognition prediction on the input text to be segmented according to the training result of the convolutional neural network model.
  • In the third aspect, the present application also provides an electronic device, which includes a memory, a processor, and a database; a word dictionary and a second training text are stored in the database.
  • The memory includes a preprocessing program, a word vector training program, a training information generation program, and a convolutional neural network model.
  • The convolutional neural network model includes four convolutional layers, and the convolution kernel of each convolutional layer is one-dimensional. The first convolutional layer includes three one-dimensional convolution kernels of lengths 1, 3, and 5, each with 128 channels. The second to fourth convolutional layers each include a one-dimensional convolution kernel of length 3; the kernels of the second, third, and fourth layers each have 384 channels. A parallel attention mechanism is constructed at the fourth convolutional layer; it is used for attention weight calculation and adjusts the weight of each channel.
  • When the preprocessing program, word vector training program, training information generation program, and convolutional neural network model are executed by the processor, the following steps are implemented:
  • The preprocessing program obtains the word dictionary from the database and preprocesses it, removing the special symbols and non-Chinese characters and separating the word dictionary into the first training text in individual character form;
  • The word vector training program converts the first training text, in individual character form, into a word vector dictionary in word vector form;
  • The training information generation program obtains the second training text with word segmentation annotations from the database and converts it into training information in word vector form according to the word vector dictionary;
  • The convolutional neural network model obtains the training information and is trained according to the training information, the preset cross-entropy loss function, and the ADAM optimization algorithm.
  • In the fourth aspect, the present application also provides a computer non-volatile readable storage medium.
  • The computer non-volatile readable storage medium includes a computer program and a database.
  • When the computer program is executed by a processor, the steps of the Chinese word segmentation method based on the convolutional neural network model described above are implemented.
  • The Chinese word segmentation method, apparatus, electronic device, and readable storage medium provided in this application first obtain a word vector dictionary, then use the word vector dictionary to convert the second training text into training information, train a convolutional neural network model on that training information, and finally use the trained convolutional neural network model to perform character boundary recognition prediction on the input text to be segmented.
  • Word segmentation through the convolutional neural network model consumes fewer resources, runs quickly, and achieves high accuracy.
  • After an attention mechanism is constructed at the fourth convolutional layer of the convolutional neural network model, training with it optimizes the model and improves the accuracy of its predictions.
  • Fig. 1 is a flowchart of a Chinese word segmentation method based on a convolutional neural network model according to an embodiment of the present application.
  • Fig. 2 is a working flowchart of the programs in an electronic device according to an embodiment of the present application.
  • Fig. 3 is a schematic diagram of the logical structure of an electronic device according to an embodiment of the present application.
  • This embodiment provides a Chinese word segmentation method based on a convolutional neural network model. The model includes four convolutional layers, and the convolution kernel of each convolutional layer is one-dimensional.
  • The first convolutional layer includes three one-dimensional convolution kernels, of lengths 1, 3, and 5, each with 128 channels.
  • The second to fourth convolutional layers each include a one-dimensional convolution kernel of length 3; the kernels of the second, third, and fourth layers each have 384 channels.
  • An attention mechanism parallel to the convolutional neural network model is built at the fourth convolutional layer; it is used for attention weight calculation and adjusts the weight of each channel of the fourth layer's one-dimensional convolution kernel, as sketched below.
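To make the architecture concrete, the following is a minimal sketch in Python with tensorflow.keras. The patent names tensorflow only for the softmax, so the framework choice, the embedding size, the padding mode, the activation functions, the width of the fully connected layers, and the exact form of the per-channel attention are all assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

EMB_DIM, NUM_TAGS = 100, 4   # embedding size assumed; 4 boundary tags: B, M, E, S

inputs = layers.Input(shape=(None, EMB_DIM))   # a sequence of character vectors

# First convolutional layer: three parallel one-dimensional kernels of
# lengths 1, 3 and 5, with 128 channels each.
branches = [layers.Conv1D(128, k, padding="same", activation="relu")(inputs)
            for k in (1, 3, 5)]
x = layers.Concatenate()(branches)

# Second to fourth convolutional layers: one-dimensional kernels of length 3
# with 384 channels.
x = layers.Conv1D(384, 3, padding="same", activation="relu")(x)
x = layers.Conv1D(384, 3, padding="same", activation="relu")(x)
conv4 = layers.Conv1D(384, 3, padding="same", activation="relu")(x)

# Parallel attention at the fourth layer: compute one weight per channel and
# rescale the channel outputs (a simple per-channel weighting stands in for
# the attention computation, which the text does not fully specify).
att = layers.GlobalAveragePooling1D()(conv4)
att = layers.Dense(384, activation="softmax")(att)
weighted = conv4 * layers.Reshape((1, 384))(att)

# Two fully connected layers, then a per-character softmax over the BMES tags.
h = layers.Dense(256, activation="relu")(weighted)
outputs = layers.Dense(NUM_TAGS, activation="softmax")(h)

model = tf.keras.Model(inputs, outputs)
```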
  • When training the convolutional neural network model, the attention mechanism adjusts the weight of the convolution result output by each channel of the fourth convolutional layer to obtain a weighted result, which is then input into the softmax function.
  • The softmax function maps the character boundary of each character to a probability value between 0 and 1 and outputs the boundary with the highest probability value as the prediction result, completing the character boundary recognition prediction for each character.
  • The probability values are the probabilities that a character is the beginning of a word, the middle of a word, the end of a word, or a single-character word; when one of these probability values is the highest, the character is predicted to have the corresponding boundary.
  • Because the softmax function outputs the boundary with the highest probability, the corresponding boundary is the most likely one for that character, thereby realizing character boundary prediction.
  • In this embodiment, the identification labels for character boundaries are BMES: B stands for the beginning of a word, M for the middle of a word, E for the end of a word, and S for a single-character word. That is, the label B is added to characters predicted to begin a word, M to characters predicted to lie in the middle of a word, E to characters predicted to end a word, and S to characters predicted to be single-character words.
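For illustration, here is a small sketch of how a BMES tag sequence decodes into segmented words once every character has been labeled; the helper name is ours, not the patent's.

```python
def decode_bmes(chars, tags):
    """chars: a list of characters; tags: one of 'B', 'M', 'E', 'S' per character."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":                      # single-character word
            if current:
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":                    # beginning of a word
            if current:
                words.append(current)
            current = ch
        elif tag == "M":                    # middle of a word
            current += ch
        else:                               # "E": end of a word
            words.append(current + ch)
            current = ""
    if current:                             # flush a trailing partial word
        words.append(current)
    return words

# "知识就是力量" ("knowledge is power") tagged B E B E B E decodes into
# 知识 / 就是 / 力量, matching the "knowledge / is / power" example above.
print(decode_bmes(list("知识就是力量"), ["B", "E", "B", "E", "B", "E"]))
```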
  • Fig. 1 shows a flowchart of the Chinese word segmentation method based on a convolutional neural network model according to an embodiment of the present application.
  • As shown in Fig. 1, the Chinese word segmentation method based on a convolutional neural network model provided by this embodiment includes the following steps:
  • S110: First obtain the word dictionary. In this implementation the word dictionary is the Chinese Wikipedia; it can be stored in a database and obtained by accessing the database.
  • Then remove the special symbols and non-Chinese characters from the word dictionary: the non-Chinese characters include pinyin, digits, and English symbols, and the special symbols include phonetic notation and other non-Chinese symbols.
  • Finally, separate each character in the word dictionary into individual character form, so that every Chinese character becomes an independent unit; the set of characters in individual character form is the first training text.
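A minimal sketch of this preprocessing step, assuming a regular expression that keeps only characters in the main CJK range (the exact character ranges and the helper name are assumptions):

```python
import re

def preprocess(raw_text):
    # Drop everything that is not a Chinese character; this removes pinyin,
    # digits, English symbols, phonetic notation, and other special symbols.
    chinese_only = re.sub(r"[^\u4e00-\u9fff]", "", raw_text)
    # Separate the remaining text into individual character form.
    return list(chinese_only)

print(preprocess("知识jiu4是力量!"))   # ['知', '识', '是', '力', '量']
```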
  • S120: Convert the first training text into the first word vector training text in word vector form; the conversion is implemented with the Word2Vec algorithm.
  • In practice, the first training text, a collection of characters in individual character form, is input into the Word2Vec algorithm for word vector training and converted into the first word vector training text in word vector form.
  • The word vector dictionary is then obtained from the first training text and the converted first word vector training text; it records the correspondence between characters and word vectors, which makes the later conversion between characters and word vectors convenient.
  • Converting the first training text with the Word2Vec algorithm is faster than the prior-art approach of converting text into word vectors with one-hot encoding, and in this application the word vector dictionary obtained with Word2Vec yields more accurate final character boundary recognition predictions than one obtained with conventional one-hot encoding.
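A sketch of this training step follows. The patent names the Word2Vec algorithm but no library, so the use of gensim and all hyperparameters here are assumptions:

```python
from gensim.models import Word2Vec

# The first training text: sentences already separated into individual characters.
first_training_text = [["知", "识", "就", "是", "力", "量"],
                       ["分", "词", "方", "法"]]

w2v = Word2Vec(sentences=first_training_text, vector_size=100, window=5, min_count=1)

# The word vector dictionary records the correspondence between each
# character and its trained vector.
word_vector_dict = {ch: w2v.wv[ch] for ch in w2v.wv.index_to_key}
```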
  • S130: After obtaining the word vector dictionary, obtain the second training text with word segmentation annotations; that is, the second training text has already undergone Chinese word segmentation, so the word beginnings, word middles, word ends, and single-character words in it are known. In this embodiment, the segmentation annotations use the BMES identification labels.
  • The second training text can be stored in a database and obtained by accessing the database.
  • The second training text is converted into training information in word vector form according to the word vector dictionary, which serves as a lookup table: the word vector corresponding to each character of the second training text is obtained through the word vector dictionary.
  • Converting the text into training information in word vector form makes it readable by the convolutional neural network model: the model can only recognize and read training information in word vector form and cannot directly read the second training text in Chinese character form.
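A sketch of this conversion, with assumed helper names; `word_vector_dict` comes from the previous sketch, and the integer tag ids are an encoding choice of ours:

```python
import numpy as np

TAG_IDS = {"B": 0, "M": 1, "E": 2, "S": 3}

def to_training_info(chars, tags, word_vector_dict):
    # Inputs: the character vectors looked up in the word vector dictionary.
    x = np.stack([word_vector_dict[ch] for ch in chars])   # (seq_len, emb_dim)
    # Labels: the BMES segmentation annotations as integer ids.
    y = np.array([TAG_IDS[t] for t in tags])               # (seq_len,)
    return x, y

x, y = to_training_info(list("知识就是"), ["B", "E", "B", "E"], word_vector_dict)
```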
  • S140: After the training information is obtained in step S130, it is input into the convolutional neural network model, and the model is trained according to the training information, the cross-entropy loss function, and the ADAM optimization algorithm: the training information is the input, the cross-entropy loss is the loss function, and the ADAM optimization algorithm is the optimizer.
  • The convolutional neural network model is trained on the input training information; once trained, it can perform character boundary recognition prediction.
  • This character boundary recognition prediction is the boundary prediction described above in this embodiment. After boundary prediction is complete, the word beginnings, word middles, word ends, and single-character words in a text can be distinguished, realizing word segmentation of the text.
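Under the stated setup the training step might look like the sketch below, reusing the `model` from the architecture sketch; the batch size, number of epochs, and the padding of all sequences to a common length are assumptions:

```python
import numpy as np

# Toy stand-ins for the real training information (normally built with
# to_training_info and padded to a common length).
train_x = np.random.rand(8, 32, 100).astype("float32")   # (samples, seq_len, emb_dim)
train_y = np.random.randint(0, 4, size=(8, 32))          # integer BMES tag ids

model.compile(optimizer=tf.keras.optimizers.Adam(),                  # ADAM optimizer
              loss=tf.keras.losses.SparseCategoricalCrossentropy())  # cross-entropy loss
model.fit(train_x, train_y, batch_size=4, epochs=10)
```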
  • The input text to be segmented can be obtained from the database or a cache by copy transmission; it can also be entered through an input device such as a keyboard; and of course it can also be text data transmitted by signal from other equipment.
  • In the attention weight calculation, a b*a matrix is matrix-multiplied with the attention matrix formed according to the attention mechanism, and the resulting b*a matrix is converted into an a*b*1 three-dimensional matrix, which is summed with the convolution result to obtain and output the weighted result, completing the weight adjustment of each channel.
  • The weighted result is transmitted to two fully connected layers and then calculated by the softmax function, and the boundary with the highest calculated probability value is taken as the prediction result.
  • The softmax function can be implemented with the tensorflow library in Python.
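Putting the pieces together, an end-to-end inference sketch; every helper name here comes from the earlier sketches, not from the patent:

```python
import numpy as np

chars = preprocess("知识就是力量")                           # individual characters
x = np.stack([word_vector_dict[ch] for ch in chars])[None]   # batch of one
tag_probs = model.predict(x)                                 # (1, seq_len, 4) BMES probabilities
tags = [list(TAG_IDS)[i] for i in tag_probs[0].argmax(-1)]   # highest-probability boundary
print(decode_bmes(chars, tags))                              # e.g. ['知识', '就是', '力量']
```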
  • This embodiment provides a Chinese word segmentation apparatus based on a convolutional neural network model, which includes a preprocessing module, a word vector training module, a training information generation module, a training module, and a recognition prediction module, wherein:
  • The preprocessing module is used to obtain a word dictionary and preprocess it: the preprocessing removes the special symbols and non-Chinese characters in the word dictionary and separates each character in the word dictionary into individual character form; the set of characters in individual character form is the first training text;
  • The word vector training module is configured to convert the first training text into a first word vector training text in word vector form, and to determine a word vector dictionary from the first training text and the first word vector training text; the word vector dictionary records the correspondence between characters and word vectors;
  • The training information generation module is used to obtain a second training text with word segmentation annotations and convert it into training information in word vector form according to the word vector dictionary;
  • The training module is used to train the convolutional neural network model according to a preset cross-entropy loss function, the ADAM optimization algorithm, and the training information;
  • The recognition prediction module is configured to perform character boundary recognition prediction on the input text to be segmented according to the training result of the convolutional neural network model.
  • The convolutional neural network model includes four convolutional layers, and the convolution kernel of each convolutional layer is one-dimensional.
  • The first convolutional layer includes three one-dimensional convolution kernels, of lengths 1, 3, and 5, each with 128 channels.
  • The second to fourth convolutional layers each include a one-dimensional convolution kernel of length 3; the kernels of the second, third, and fourth layers each have 384 channels. A parallel attention mechanism is constructed at the fourth convolutional layer for attention weight calculation and per-channel weight adjustment. The convolutional neural network model is also provided with a softmax function: after the channel weights are adjusted, the adjusted weighted results of each channel are input into the softmax function.
  • The softmax function maps the character boundary of each character to a probability value between 0 and 1, and the boundary with the highest probability value is output as the result of the character boundary recognition prediction.
  • Fig. 3 is a schematic diagram of the logical structure of an electronic device according to an embodiment of the present application.
  • the electronic device 1 includes a processor 2 and a memory 3, and a computer program 4 is stored in the memory.
  • the electronic device 1 further includes a database in which a word dictionary and a second training text are stored.
  • the word dictionary is Chinese Wikipedia
  • the second training text is marked with word segmentation.
  • a computer program 4 is stored in the aforementioned memory, and the computer program 4 includes a preprocessing program, a word vector training program, a training information generation program, and a convolutional neural network model.
  • The aforementioned convolutional neural network model includes four convolutional layers, and the convolution kernel of each convolutional layer is one-dimensional.
  • The first convolutional layer includes three one-dimensional convolution kernels, of lengths 1, 3, and 5, each with 128 channels.
  • The second to fourth convolutional layers each include a one-dimensional convolution kernel of length 3; the kernels of the second, third, and fourth layers each have 384 channels. A parallel attention mechanism is constructed at the fourth convolutional layer for attention weight calculation and per-channel weight adjustment. The model is also provided with a softmax function: after the channel weights are adjusted, the adjusted weighted results of each channel are input into the softmax function.
  • The softmax function maps the character boundary of each character to a probability value between 0 and 1, and the boundary with the highest probability value is output as the result of the character boundary recognition prediction.
  • Figure 2 provides a working flowchart of each program in an electronic device according to an embodiment of the application. As shown in Figure 2, the above preprocessing program, word vector training program, training information generation program, and convolutional neural network model implement the following steps when executed by the processor:
  • The aforementioned preprocessing program obtains the word dictionary from the database, which can be accessed to retrieve it; after obtaining the word dictionary, the program preprocesses it.
  • The preprocessing refers to the removal of the special symbols and non-Chinese characters in the word dictionary: the non-Chinese characters include pinyin, digits, and English symbols, and the special symbols include phonetic notation and other non-Chinese symbols.
  • After removing the special symbols and non-Chinese characters, the preprocessing separates the word dictionary into the first training text in individual character form, completing the preprocessing step.
  • The above word vector training program converts the first training text, in individual character form, into a word vector dictionary in word vector form; the program includes the Word2Vec algorithm, which performs word vector training on the first training text.
  • The first training text is a collection of characters in individual character form, and it is converted into the first word vector training text in word vector form through the Word2Vec algorithm.
  • A word vector dictionary is then obtained from the first training text and the converted first word vector training text; the word vector dictionary records the correspondence between characters and word vectors.
  • The training information generation program obtains the second training text with word segmentation annotations from the database and converts it into training information in word vector form according to the word vector dictionary: the dictionary records the correspondence between characters and word vectors, and the second training text supplies the characters, so the word vector corresponding to each character can be obtained through the dictionary, yielding the training information in word vector form.
  • The convolutional neural network model obtains the training information and is trained according to the training information, the preset cross-entropy loss function, and the ADAM optimization algorithm.
  • The training of the convolutional neural network model can be conducted in a conventional manner, with the training information as the input data.
  • After training, a trained convolutional neural network model is obtained, which can perform character boundary recognition prediction on text according to the training results.
  • The one or more programs may be a series of instruction segments of the computer program 4 capable of completing specific functions; the instruction segments are used to describe the execution process of the computer program 4 in the electronic device 1.
  • the electronic device 1 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • The electronic device 1 may include, but is not limited to, a processor 2 and a memory 3. Those skilled in the art will understand that this does not constitute a limitation on the electronic device 1: it may include more or fewer components than shown, combine certain components, or use different components. For example, the electronic device 1 may also include input and output equipment, network access equipment, a bus, and so on.
  • The processor 2 can be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • The general-purpose processor may be a controller, a microcontroller, or a microprocessor, or the processor may be any conventional processor. It is used to execute the programs stored in the memory, such as the preprocessing program, the word vector training program, the training information generation program, and the convolutional neural network model.
  • the memory 3 may be an internal storage unit of the electronic device 1, such as a hard disk or a memory of the electronic device 1.
  • The memory 3 may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, a multimedia card, a card-type memory, a magnetic memory, a magnetic disk, or an optical disk equipped on the electronic device 1.
  • Further, the memory 3 may include both an internal storage unit of the electronic device 1 and an external storage device.
  • the memory 3 is used to store the computer program 4 and other programs and data required by the electronic device.
  • the memory 3 can also be used to temporarily store data that has been output or will be output.
  • This embodiment provides a computer non-volatile readable storage medium.
  • the computer non-volatile readable storage medium includes a computer program and a database.
  • When the computer program is executed by a processor, the steps of the Chinese word segmentation method described in Embodiment 1 above are implemented; they will not be repeated here.
  • Those skilled in the art can understand that dividing the device into units means dividing its internal structure into different functional units to complete all or part of the functions described above.
  • The functional units in the embodiments can be integrated into one processing unit, each unit can exist alone physically, or two or more units can be integrated into one unit.
  • The above integrated units can be realized in the form of hardware or in the form of software functional units.
  • The specific names of the functional units are only for the convenience of distinguishing them from each other and are not used to limit the protection scope of this application. For the specific working process of the units in the above system, refer to the corresponding process in the foregoing method embodiments, which will not be repeated here.
  • the disclosed device and method may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • The above division of units is only a logical functional division; in actual implementation there may be other divisions. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • If the aforementioned integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • All or part of the processes in the methods of the above embodiments of this application can also be completed by instructing relevant hardware through a computer program.
  • the above-mentioned computer program may be stored in a computer-readable storage medium. When executed by the processor, the steps of the foregoing method embodiments can be implemented.
  • the above-mentioned computer program includes computer program code, and the above-mentioned computer program code may be in the form of source code, object code, executable file, or some intermediate forms.
  • The above computer-readable medium may include any entity or device capable of carrying the above computer program code, a recording medium, a USB flash drive, a mobile hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunications signal, a software distribution medium, and so on.
  • The content contained in the above computer-readable media can be appropriately added or removed in accordance with the requirements of legislation and patent practice in a given jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, computer-readable media may not include electric carrier signals and telecommunications signals.
  • In summary, the Chinese word segmentation method, electronic device, and readable storage medium provided in this application first obtain a word vector dictionary, convert the second training text into training information through the word vector dictionary, and then train a convolutional neural network model on the training information; the trained convolutional neural network model performs character boundary recognition prediction on the input text to be segmented.
  • Word segmentation through the convolutional neural network model consumes fewer resources, runs quickly, and achieves high accuracy.
  • An attention mechanism is built at the fourth convolutional layer of the convolutional neural network model; when training the model, the attention mechanism optimizes it and improves the accuracy of its predictions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Character Discrimination (AREA)
  • Machine Translation (AREA)

Abstract

A Chinese word segmentation method based on a convolutional neural network model, an electronic apparatus and a readable storage medium. The method comprises: firstly, acquiring a word vector dictionary; converting, by means of the word vector dictionary, second text into training information; then training a convolutional neural network model according to the training information; and finally, the convolutional neural network model performing character boundary recognition and prediction according to input text. Word segmentation is performed by means of a convolutional neural network model, such that fewer resources are consumed, the speed of word segmentation is high, and the accuracy is high. An attention mechanism is constructed at a fourth convolutional layer of the convolutional neural network model, and when the convolutional neural network model is trained, the convolutional neural network model can be optimized by means of the arrangement of the attention mechanism, thereby improving the prediction accuracy of the convolutional neural network model.

Description

Chinese word segmentation method, apparatus, electronic device and readable storage medium
This application claims the priority of Chinese Patent Application No. 201910371045.2, filed on May 6, 2019, the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the technical field of data analysis, and in particular to a Chinese word segmentation method, apparatus, electronic device, and readable storage medium that perform Chinese word segmentation through a convolutional neural network model.
Background
With the development of Internet technology, the number of texts appearing on the Internet is increasing day by day, such as e-mails, real-time news, and technology blog posts, producing massive amounts of text-type data. People's demand for information analysis and information processing keeps growing, and the need to process this text-type data to obtain the required information is increasingly urgent.
When performing data analysis on text-type data, word segmentation is required first: the coherent text is broken down into a sequence of units, each with a specific linguistic meaning. This need is particularly prominent in Chinese information processing. As a basic step in natural language processing, word segmentation plays an important role in the field.
Chinese word segmentation differs from English tokenization: English text is made up of words separated by spaces, whereas Chinese text is a sequence of characters, and all the characters in a sentence run together to express a complete meaning. Chinese word segmentation therefore divides a sequence of Chinese characters into meaningful words, which is also called word cutting. For example, segmenting the phrase "knowledge is power" yields "knowledge / is / power". The accuracy of Chinese word segmentation often directly affects the relevance ranking of search results.
At present, text segmentation algorithms generally use template matching, such as word segmentation algorithms based on text matching or on dictionaries; their accuracy depends entirely on the template, resulting in low segmentation accuracy.
Summary of the invention
In order to solve the problem of low word segmentation accuracy in the prior art, this application provides a Chinese word segmentation method, apparatus, electronic device, and readable storage medium that improve segmentation accuracy and segment text quickly.
In the first aspect, this application provides a Chinese word segmentation method based on a convolutional neural network model. The Chinese word segmentation method includes the following steps:
Step 1: Obtain a word dictionary, remove the special symbols and non-Chinese characters in it, and separate each character in the word dictionary into individual character form; the set of characters in individual character form is the first training text;
Step 2: Convert the first training text into a first word vector training text in word vector form through word vector training, and determine a word vector dictionary from the first training text and the first word vector training text; the word vector dictionary records the correspondence between characters and word vectors;
Step 3: Obtain a second training text with word segmentation annotations, and convert the second training text into training information in word vector form according to the word vector dictionary;
Step 4: Train the convolutional neural network model according to the training information, a preset cross-entropy loss function, and the ADAM optimization algorithm;
Step 5: Perform character boundary recognition prediction on the input text to be segmented according to the training result of the convolutional neural network model.
In the second aspect, this application provides a Chinese word segmentation apparatus based on a convolutional neural network model, including a preprocessing module, a word vector training module, a training information generation module, a training module, and a recognition prediction module, wherein:
The preprocessing module is used to obtain a word dictionary and preprocess it: the preprocessing removes the special symbols and non-Chinese characters in the word dictionary and separates each character in the word dictionary into individual character form; the set of characters in individual character form is the first training text;
The word vector training module is configured to convert the first training text into a first word vector training text in word vector form, and to determine a word vector dictionary from the first training text and the first word vector training text; the word vector dictionary records the correspondence between characters and word vectors;
The training information generation module is used to obtain a second training text with word segmentation annotations and convert it into training information in word vector form according to the word vector dictionary;
The training module is used to train the convolutional neural network model according to a preset cross-entropy loss function, the ADAM optimization algorithm, and the training information;
The recognition prediction module is configured to perform character boundary recognition prediction on the input text to be segmented according to the training result of the convolutional neural network model.
In the third aspect, the present application also provides an electronic device, which includes a memory, a processor, and a database; a word dictionary and a second training text are stored in the database. The memory includes a preprocessing program, a word vector training program, a training information generation program, and a convolutional neural network model.
The convolutional neural network model includes four convolutional layers, and the convolution kernel of each convolutional layer is one-dimensional. The first convolutional layer includes three one-dimensional convolution kernels of lengths 1, 3, and 5, each with 128 channels. The second to fourth convolutional layers each include a one-dimensional convolution kernel of length 3; the kernels of the second, third, and fourth layers each have 384 channels. A parallel attention mechanism is constructed at the fourth convolutional layer; it is used for attention weight calculation and adjusts the weight of each channel.
When the preprocessing program, word vector training program, training information generation program, and convolutional neural network model are executed by the processor, the following steps are implemented:
The preprocessing program obtains the word dictionary from the database and preprocesses it, removing the special symbols and non-Chinese characters and separating the word dictionary into the first training text in individual character form;
The word vector training program converts the first training text, in individual character form, into a word vector dictionary in word vector form;
The training information generation program obtains the second training text with word segmentation annotations from the database and converts it into training information in word vector form according to the word vector dictionary;
The convolutional neural network model obtains the training information and is trained according to the training information, the preset cross-entropy loss function, and the ADAM optimization algorithm.
In the fourth aspect, the present application also provides a computer non-volatile readable storage medium that includes a computer program and a database; when the computer program is executed by a processor, the steps of the Chinese word segmentation method based on the convolutional neural network model described above are implemented.
Compared with the prior art, the Chinese word segmentation method, apparatus, electronic device, and readable storage medium provided in this application have the following beneficial effects:
This application first obtains a word vector dictionary, then uses the word vector dictionary to convert the second training text into training information, trains a convolutional neural network model on that training information, and finally uses the trained convolutional neural network model to perform character boundary recognition prediction on the input text to be segmented. Word segmentation through the convolutional neural network model consumes fewer resources, runs quickly, and achieves high accuracy. With the attention mechanism constructed at the fourth convolutional layer of the convolutional neural network model, training the model with this attention mechanism optimizes it and improves the accuracy of its predictions.
Description of the drawings
By referring to the following description of the drawings and the content of the claims, and with a more comprehensive understanding of this application, its other purposes and results will become clearer and easier to understand. In the drawings:
Fig. 1 is a flowchart of a Chinese word segmentation method based on a convolutional neural network model according to an embodiment of the present application.
Fig. 2 is a working flowchart of the programs in an electronic device according to an embodiment of the present application.
Fig. 3 is a schematic diagram of the logical structure of an electronic device according to an embodiment of the present application.
The same reference numerals throughout the drawings indicate similar or corresponding features or functions.
Detailed description
In the following description, many specific details are set forth for illustrative purposes in order to provide a comprehensive understanding of one or more embodiments. However, it is obvious that these embodiments can also be implemented without these specific details. In other examples, well-known structures and devices are shown in the form of block diagrams for the convenience of describing one or more embodiments.
The specific embodiments of the present application will be described in detail below in conjunction with the accompanying drawings.
Embodiment 1
This embodiment provides a Chinese word segmentation method based on a convolutional neural network model. The convolutional neural network model includes four convolutional layers, and the convolution kernel of each convolutional layer is one-dimensional. The first convolutional layer includes three one-dimensional convolution kernels of lengths 1, 3, and 5, each with 128 channels. The second to fourth convolutional layers each include a one-dimensional convolution kernel of length 3; the kernels of the second, third, and fourth layers each have 384 channels. An attention mechanism parallel to the convolutional neural network model is built at the fourth convolutional layer; it is used for attention weight calculation and adjusts the weight of each channel of the fourth layer's one-dimensional convolution kernel.
When training the convolutional neural network model, the attention mechanism adjusts the weight of the convolution result output by each channel of the fourth convolutional layer to obtain a weighted result, which is then input into the softmax function. The softmax function maps the character boundary of each character to a probability value between 0 and 1 and outputs the boundary with the highest probability value as the prediction result, completing the character boundary recognition prediction for each character. The probability values are the probabilities that a character is the beginning of a word, the middle of a word, the end of a word, or a single-character word; when one of these probability values is the highest, the character is predicted to have the corresponding boundary. Because the softmax function outputs the boundary with the highest probability, the corresponding boundary is the most likely one for that character, thereby realizing character boundary prediction.
In this embodiment, the identification labels for character boundaries are BMES: B stands for the beginning of a word, M for the middle of a word, E for the end of a word, and S for a single-character word. That is, the label B is added to characters predicted to begin a word, M to characters predicted to lie in the middle of a word, E to characters predicted to end a word, and S to characters predicted to be single-character words.
图1示出了基于本申请实施例的基于卷积神经网络模型的中文分词方法的流程图,如图1所示,本实施例提供的基于卷积神经网络模型的中文分词方法包括如下步骤:Fig. 1 shows a flowchart of a Chinese word segmentation method based on a convolutional neural network model based on an embodiment of the present application. As shown in Fig. 1, the Chinese word segmentation method based on a convolutional neural network model provided by this embodiment includes the following steps:
S110:首先获取文字字典,在具体实施时,该文字字典是中文***,该文字字典可以存储在数据库中,通过访问数据库获取该文字字典;然后去除该文字字典中的特殊符号和非中文字符,该非中文字符包括拼音、数字和英文符号,该特殊符号包括音标或其它非中文的符号。接着将文字字典中的各文字分隔为单独文字形式,通过分隔的方式将每个汉字分隔为独立的单元,该单独文字形式的文字的集合为第一训练文本。S110: First obtain a word dictionary. In specific implementation, the word dictionary is Chinese Wikipedia. The word dictionary can be stored in the database. The word dictionary can be obtained by accessing the database; then the special symbols and non-Chinese characters in the word dictionary are removed , The non-Chinese characters include pinyin, numbers and English symbols, and the special symbols include phonetic symbols or other non-Chinese symbols. Next, each character in the character dictionary is separated into separate character forms, and each Chinese character is separated into independent units by means of separation, and the set of characters in the separate character form is the first training text.
S120: The first training text is converted into the first character-vector training text, i.e., into character-vector form; this conversion is implemented with the Word2Vec algorithm.
In a specific operation, the first training text, which is a collection of individual characters, may be input into the Word2Vec algorithm for character-vector training, whereby it is converted into the first character-vector training text in character-vector form. A character-vector dictionary is then obtained from the first training text and the converted first character-vector training text; the dictionary records the correspondence between characters and character vectors so as to facilitate later conversion between the two.
Converting the first training text into character-vector form with the Word2Vec algorithm is faster than the one-hot encoding used in the prior art to turn text into character vectors. In the application of this embodiment, a character-vector dictionary obtained with the Word2Vec algorithm also yields more accurate results in the final character boundary recognition prediction than one obtained with conventional one-hot encoding.
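A minimal sketch of the character-vector training, assuming the gensim implementation of Word2Vec; the hyperparameters shown are illustrative assumptions, as the embodiment fixes only the choice of algorithm:

    from gensim.models import Word2Vec

    # Sentences are sequences of individual characters, as produced in S110.
    sentences = [["深", "度", "学", "习"], ["中", "文", "分", "词"]]

    # vector_size, window and skip-gram (sg=1) are assumed hyperparameters.
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

    # The character-vector dictionary records the character-to-vector mapping.
    char_vector_dict = {ch: model.wv[ch] for ch in model.wv.index_to_key}
    print(char_vector_dict["中"].shape)  # (100,)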
S130: After the character-vector dictionary has been obtained, a second training text with word-segmentation annotations is obtained. Because it carries segmentation annotations, the second training text is text on which Chinese word segmentation has already been completed, so its word beginnings, word middles, word endings, and single-character words are known; in this embodiment, the segmentation is annotated with the identification labels BMES. The second training text may be stored in a database and obtained by accessing that database. It is converted into training information in character-vector form according to the character-vector dictionary, which serves as a lookup table: the character vector corresponding to each character of the second training text is retrieved from the dictionary. This conversion is necessary because the convolutional neural network model can only recognize and read training information in character-vector form; it cannot directly read the second training text in the form of Chinese characters.
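A sketch of the conversion in step S130 under an assumed annotation format of (character, BMES label) pairs; the embodiment does not fix the exact format of the annotated text:

    import numpy as np

    LABEL_TO_ID = {"B": 0, "M": 1, "E": 2, "S": 3}

    def to_training_info(annotated_chars, char_vector_dict):
        """Look up each annotated character in the character-vector dictionary
        and collect the corresponding BMES label ids."""
        vectors = np.stack([char_vector_dict[ch] for ch, _ in annotated_chars])
        labels = np.array([LABEL_TO_ID[tag] for _, tag in annotated_chars])
        return vectors, labels  # shapes: (seq_len, dim) and (seq_len,)

    # Hypothetical usage with a two-character annotated sample:
    dummy_dict = {"中": np.zeros(100), "文": np.ones(100)}
    vecs, labs = to_training_info([("中", "B"), ("文", "E")], dummy_dict)
    print(vecs.shape, labs)  # (2, 100) [0 2]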
S140: After the training information has been obtained in step S130, it is input into the convolutional neural network model, and the model is trained according to the training information, a cross-entropy loss function, and the ADAM optimization algorithm: the training information is input into the model, cross-entropy serves as the loss function, and ADAM serves as the optimization algorithm. Once trained, the convolutional neural network model can perform character boundary recognition prediction, i.e., the character boundary prediction described above; when this prediction is complete, the word beginnings, word middles, word endings, and single-character words of a text can be distinguished, realizing word segmentation of the text.
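A minimal training sketch using the tensorflow library mentioned in this embodiment; the stand-in network below is a single convolutional layer rather than the full four-layer model described later, and the data shapes and random placeholder data are assumptions made for illustration:

    import numpy as np
    import tensorflow as tf

    seq_len, dim = 32, 100
    vectors = np.random.rand(8, seq_len, dim).astype("float32")  # training information
    labels = np.random.randint(0, 4, size=(8, seq_len))          # BMES label ids

    model = tf.keras.Sequential([
        tf.keras.layers.Conv1D(64, 3, padding="same", activation="relu",
                               input_shape=(seq_len, dim)),
        tf.keras.layers.Dense(4, activation="softmax"),          # per-character BMES
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(),          # ADAM optimization algorithm
                  loss="sparse_categorical_crossentropy")        # cross-entropy loss function
    model.fit(vectors, labels, epochs=3, verbose=0)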
S150: After the convolutional neural network model has been trained, the text to be segmented is input into it, and character boundary recognition prediction is performed on the input text. This prediction is the process of obtaining the word-beginning, word-middle, word-ending, and single-character-word information, and it ultimately yields the prediction result of character boundary recognition. The input text to be segmented may be obtained from a database or cache by copy transmission, entered through an input device such as a keyboard, or received as text data transmitted by other equipment.
When the attention mechanism adjusts the weights of the convolution results as described above: the output of the fourth convolutional layer is converted into an a*b matrix, where a is the number of channels (384 in this embodiment) and b is the length of the processed text. Two parallel feed-forward layers output an a*b matrix and a b*a matrix, which are multiplied together and then mapped to probabilities by a softmax function, obtaining the fourth convolutional layer's convolution result. When the weights of the convolution result are adjusted, another parallel feed-forward layer outputs a b*a matrix, which is matrix-multiplied with the attention matrix formed according to the attention mechanism; the resulting b*a matrix is converted into an a*b*1 three-dimensional matrix and added to the convolution result mapped to probabilities, yielding and outputting the weighted result and completing the weight adjustment of each channel. After the channel weights have been adjusted, the weighted result is transmitted to two fully connected layers and then computed through the softmax function, and the label with the highest computed probability value is taken as the prediction result; the softmax computation can be implemented with the tensorflow library in Python.
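Because the matrix bookkeeping above is terse, the following is only one possible reading of the channel-weight adjustment, sketched with tensorflow; the dense layers standing in for the parallel feed-forward layers, the softmax axis, and the final addition are interpretations rather than a definitive implementation:

    import tensorflow as tf

    a, b = 384, 32                         # a: number of channels; b: text length
    conv_out = tf.random.normal((a, b))    # fourth-layer output as an a*b matrix

    # Two parallel feed-forward layers yield an a*b and a b*a matrix; their
    # product, mapped by softmax, forms an a*a attention matrix over channels.
    ff_ab = tf.keras.layers.Dense(b)
    ff_ba = tf.keras.layers.Dense(a)
    attention = tf.nn.softmax(ff_ab(conv_out) @ ff_ba(tf.transpose(conv_out)))  # (a, a)

    # Another parallel feed-forward layer outputs a b*a matrix; multiplying it
    # with the attention matrix and adding the result back reweights each channel.
    ff2_ba = tf.keras.layers.Dense(a)
    reweighted = ff2_ba(tf.transpose(conv_out)) @ attention      # (b, a)
    weighted_result = tf.transpose(reweighted) + conv_out        # (a, b)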
Embodiment 2
This embodiment provides a Chinese word segmentation apparatus based on a convolutional neural network model, comprising a preprocessing module, a character-vector training module, a training-information generation module, a training module, and a recognition prediction module, wherein:
The preprocessing module is used to obtain a character dictionary and preprocess it; the preprocessing removes the special symbols and non-Chinese characters from the character dictionary and separates each character in the character dictionary into characters in individual-character form, the set of which constitutes the first training text.
The character-vector training module is used to convert the first training text into the first character-vector training text in character-vector form and to determine a character-vector dictionary according to the first training text and the first character-vector training text, the dictionary recording the correspondence between characters and character vectors.
The training-information generation module is used to obtain the second training text with word-segmentation annotations and to convert it into training information in character-vector form according to the character-vector dictionary.
The training module is used to train the convolutional neural network model according to the preset cross-entropy loss function, the ADAM optimization algorithm, and the training information.
The recognition prediction module is used to perform character boundary recognition prediction on input text to be segmented according to the training result of the convolutional neural network model.
The convolutional neural network model comprises four convolutional layers, each using one-dimensional convolution kernels. The first convolutional layer comprises three one-dimensional convolution kernels of lengths 1, 3, and 5 respectively, each with 128 channels. The second to fourth convolutional layers each comprise a one-dimensional convolution kernel of length 3, and the kernels of the second, third, and fourth layers each have 384 channels. A parallel attention mechanism is constructed at the fourth convolutional layer; it is used for attention weight calculation and adjusts the weight of each channel. The model is further provided with a softmax function: after the channel weights have been adjusted, the adjusted weighted result of each channel is input into the softmax function, which maps each character's boundary labels to probability values between 0 and 1 and outputs the label with the highest probability as the prediction result of character boundary recognition.
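A sketch of the described network, again using tensorflow; the concatenation of the three first-layer branches into 384 channels, the "same" padding, the ReLU activations, and the widths of the two fully connected layers are assumptions, and the parallel attention at the fourth layer (sketched earlier) is omitted for brevity:

    import tensorflow as tf

    def build_segmentation_model(seq_len, dim, num_labels=4):
        inputs = tf.keras.Input(shape=(seq_len, dim))
        # First layer: three one-dimensional kernels of lengths 1, 3 and 5,
        # each with 128 channels.
        branches = [tf.keras.layers.Conv1D(128, k, padding="same", activation="relu")(inputs)
                    for k in (1, 3, 5)]
        x = tf.keras.layers.Concatenate()(branches)  # 3 * 128 = 384 channels
        # Second to fourth layers: kernel length 3, 384 channels each.
        for _ in range(3):
            x = tf.keras.layers.Conv1D(384, 3, padding="same", activation="relu")(x)
        # Two fully connected layers, then a per-character softmax over BMES.
        x = tf.keras.layers.Dense(128, activation="relu")(x)
        x = tf.keras.layers.Dense(64, activation="relu")(x)
        outputs = tf.keras.layers.Dense(num_labels, activation="softmax")(x)
        return tf.keras.Model(inputs, outputs)

    model = build_segmentation_model(seq_len=32, dim=100)
    model.summary()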
It should be noted that the functions implemented, and the method steps executed, by the modules of the Chinese word segmentation apparatus are substantially the same as the steps of the Chinese word segmentation method described above and are therefore not repeated here.
Embodiment 3
Fig. 3 is a schematic diagram of the logical structure of an electronic device according to an embodiment of the present application. As shown in Fig. 3, the electronic device 1 comprises a processor 2 and a memory 3, and a computer program 4 is stored in the memory.
The electronic device 1 further comprises a database in which a character dictionary and the second training text are stored; in this embodiment, the character dictionary is the Chinese Wikipedia corpus, and the second training text carries word-segmentation annotations.
The computer program 4 stored in the memory comprises a preprocessing program, a character-vector training program, a training-information generation program, and a convolutional neural network model.
The convolutional neural network model comprises four convolutional layers, each using one-dimensional convolution kernels. The first convolutional layer comprises three one-dimensional convolution kernels of lengths 1, 3, and 5 respectively, each with 128 channels. The second to fourth convolutional layers each comprise a one-dimensional convolution kernel of length 3, and the kernels of the second, third, and fourth layers each have 384 channels. A parallel attention mechanism is constructed at the fourth convolutional layer; it is used for attention weight calculation and adjusts the weight of each channel. The model is further provided with a softmax function: after the channel weights have been adjusted, the adjusted weighted result of each channel is input into the softmax function, which maps each character's boundary labels to probability values between 0 and 1 and outputs the label with the highest probability as the prediction result of character boundary recognition.
Fig. 2 is a flowchart of the operation of the programs in the electronic device according to an embodiment of the present application. As shown in Fig. 2, the preprocessing program, the character-vector training program, the training-information generation program, and the convolutional neural network model, when executed by the processor, implement the following steps:
S210: The preprocessing program obtains the character dictionary from the database, which may be done by accessing the database, and then preprocesses it. The preprocessing refers to removing the special symbols and non-Chinese characters from the character dictionary, where the non-Chinese characters include pinyin, digits, and English symbols and the special symbols include phonetic notation and other non-Chinese marks; after their removal, the character dictionary is separated into the first training text in individual-character form, completing the preprocessing step.
S220: The character-vector training program converts the first training text in individual-character form into a character-vector dictionary in character-vector form. The program includes the Word2Vec algorithm: the first training text, a collection of individual characters, undergoes character-vector training with the Word2Vec algorithm and is converted into the first character-vector training text in character-vector form. The character-vector dictionary is then obtained from the first training text and the converted first character-vector training text and records the correspondence between characters and character vectors.
S230: The training-information generation program obtains the second training text with word-segmentation annotations from the database and converts it into training information in character-vector form according to the character-vector dictionary. Since the dictionary records the correspondence between characters and character vectors and the second training text consists of characters, the character vector corresponding to each character can be looked up in the dictionary, yielding the training information in character-vector form.
S240: The convolutional neural network model obtains the training information and is trained according to the training information, the preset cross-entropy loss function, and the ADAM optimization algorithm. The training may be conducted in a conventional manner with the training information as the input data; after training with the cross-entropy loss function and the ADAM optimization algorithm, a trained convolutional neural network model is obtained that can perform character boundary recognition prediction on text according to its training results.
One or more of the programs may be a series of instruction segments of the computer program 4 capable of completing specific functions; the instruction segments describe the execution of the computer program 4 in the electronic device 1.
The electronic device 1 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The electronic device 1 may include, but is not limited to, the processor 2 and the memory 3. Those skilled in the art will understand that this does not constitute a limitation on the electronic device 1, which may include more or fewer components than shown, combine certain components, or use different components; for example, the electronic device 1 may further include input/output devices, network access devices, buses, and the like.
The processor 2 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like; the general-purpose processor may be a controller, a microcontroller, a microprocessor, or any conventional processor. The processor is used to execute the preprocessing program, the character-vector training program, the training-information generation program, and the convolutional neural network model.
The memory 3 may be an internal storage unit of the electronic device 1, such as a hard disk or internal memory of the electronic device 1. The memory 3 may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, a multimedia card, a card-type memory, a magnetic memory, a magnetic disk, or an optical disk provided on the electronic device 1. Further, the memory 3 may include both an internal storage unit and an external storage device of the terminal device. The memory 3 is used to store the computer program 4 and the other programs and data required by the electronic device, and may also be used to temporarily store data that has been or will be output.
It should be noted that the specific implementation of the electronic device of the present application is substantially the same as that of the Chinese word segmentation method and apparatus described above and is therefore not repeated here.
Embodiment 4
This embodiment provides a computer non-volatile readable storage medium comprising a computer program and a database; when the computer program is executed by a processor, the steps of the Chinese word segmentation method of Embodiment 1 above are implemented and are not repeated here.
Those skilled in the art will clearly understand that, for convenience and brevity of description, only the division into the functional units described above is given as an example; in practical applications, the functions may be allocated to different functional units as needed, that is, the internal structure of the apparatus may be divided into different functional units to complete all or part of the functions described above. The functional units of the embodiments may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units serve only to distinguish them from one another and are not intended to limit the protection scope of the present application. For the specific working process of the units in the above system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed or recorded in one embodiment, reference may be made to the relevant descriptions of the other embodiments.
A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled artisans may implement the described functions differently for each particular application, but such implementations should not be considered to go beyond the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the present application may implement all or part of the processes of the above method embodiments by instructing the relevant hardware through a computer program; the computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of each of the above method embodiments. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, certain intermediate forms, or the like. The computer-readable medium may include any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be added to or removed from as appropriate according to the requirements of legislation and patent practice in the relevant jurisdiction; for example, in some jurisdictions, computer-readable media exclude electrical carrier signals and telecommunication signals according to legislation and patent practice.
The Chinese word segmentation method, electronic apparatus, and readable storage medium provided by the present application first obtain a character-vector dictionary, convert the second training text into training information through the character-vector dictionary, and then train a convolutional neural network model according to the training information; the trained convolutional neural network model performs character boundary recognition prediction on input text to be segmented. Performing word segmentation with a convolutional neural network model consumes fewer resources and achieves a high segmentation speed and accuracy rate. An attention mechanism is constructed at the fourth convolutional layer of the model; when the model is trained, this attention mechanism optimizes the convolutional neural network model and improves the accuracy of its predictions.
The Chinese word segmentation method, apparatus, electronic device, and readable storage medium according to the present application have been described above by way of example with reference to the accompanying drawings. However, those skilled in the art should understand that various improvements can be made to them without departing from the content of the present application. Therefore, the protection scope of the present application shall be determined by the content of the appended claims.

Claims (20)

  1. A Chinese word segmentation method based on a convolutional neural network model, characterized in that it comprises the following steps:
    Step 1: obtaining a character dictionary, removing the special symbols and non-Chinese characters from the character dictionary, and separating each character in the character dictionary into characters in individual-character form, the set of characters in individual-character form being a first training text;
    Step 2: converting the first training text into a first character-vector training text in character-vector form through character-vector training, and determining a character-vector dictionary according to the first training text and the first character-vector training text, the character-vector dictionary recording the correspondence between characters and character vectors;
    Step 3: obtaining a second training text with word-segmentation annotations, and converting the second training text into training information in character-vector form according to the character-vector dictionary;
    Step 4: training the convolutional neural network model according to a preset cross-entropy loss function, the ADAM optimization algorithm, and the training information;
    Step 5: performing character boundary recognition prediction on input text to be segmented according to the training result of the convolutional neural network model.
  2. The Chinese word segmentation method based on a convolutional neural network model according to claim 1, characterized in that converting the first training text into the first character-vector training text in character-vector form through character-vector training comprises the following steps: running the Word2Vec algorithm, performing character-vector training on the first training text based on the Word2Vec algorithm, and converting the first training text into the first character-vector training text in character-vector form through the Word2Vec algorithm.
  3. The Chinese word segmentation method based on a convolutional neural network model according to claim 1, characterized in that the convolutional neural network model comprises four convolutional layers, the convolution kernel of each convolutional layer being a one-dimensional convolution kernel; an attention mechanism parallel to the convolutional neural network model is constructed at the fourth convolutional layer, the attention mechanism being used for attention weight calculation and adjusting the weight of each channel of the one-dimensional convolution kernel of the fourth convolutional layer;
    when the convolutional neural network model is trained in step 4, the attention mechanism adjusts the weights of the convolution results output by the channels of the fourth convolutional layer to obtain a weighted result, the weighted result is then input into a softmax function, and the softmax function then outputs the prediction result of the character boundary recognition prediction.
  4. The Chinese word segmentation method based on a convolutional neural network model according to claim 3, characterized in that the softmax function maps the character boundary of each character into a probability value between 0 and 1 and takes the label with the highest probability value as the prediction result;
    the identification labels of the character boundaries are BMES, where B denotes the beginning of a word, M the middle of a word, E the end of a word, and S a single-character word.
  5. The Chinese word segmentation method based on a convolutional neural network model according to claim 3, characterized in that, when the weights of the convolution results are adjusted:
    the output of the fourth convolutional layer is converted into an a*b matrix, where a is the number of channels and b is the length of the processed text; two parallel feed-forward layers output an a*b matrix and a b*a matrix, which are multiplied together and then mapped to probabilities by the softmax function, obtaining the convolution result of the fourth convolutional layer;
    another parallel feed-forward layer outputs a b*a matrix, which is matrix-multiplied with the attention matrix formed according to the attention mechanism; the resulting b*a matrix is converted into an a*b*1 three-dimensional matrix and added to the convolution result mapped to probabilities, and the weighted result is obtained and output, completing the weight adjustment of each channel.
  6. The Chinese word segmentation method based on a convolutional neural network model according to claim 5, characterized in that, after the weight adjustment of each channel is completed, the weighted result is transmitted to two fully connected layers, and the softmax function then maps the character boundary of each character into a probability value between 0 and 1, taking the label with the highest probability value as the prediction result.
  7. The Chinese word segmentation method based on a convolutional neural network model according to claim 3, characterized in that the first convolutional layer comprises three one-dimensional convolution kernels of lengths 1, 3, and 5 respectively, each one-dimensional convolution kernel of the first convolutional layer having 128 channels;
    the second to fourth convolutional layers each comprise a one-dimensional convolution kernel of length 3, and the one-dimensional convolution kernels of the second, third, and fourth convolutional layers each have 384 channels.
  8. A Chinese word segmentation apparatus based on a convolutional neural network model, characterized in that it comprises a preprocessing module, a character-vector training module, a training-information generation module, a training module, and a recognition prediction module, wherein:
    the preprocessing module is configured to obtain a character dictionary and preprocess it, the preprocessing removing the special symbols and non-Chinese characters from the character dictionary and separating each character in the character dictionary into characters in individual-character form, the set of characters in individual-character form being a first training text;
    the character-vector training module is configured to convert the first training text into a first character-vector training text in character-vector form and to determine a character-vector dictionary according to the first training text and the first character-vector training text, the character-vector dictionary recording the correspondence between characters and character vectors;
    the training-information generation module is configured to obtain a second training text with word-segmentation annotations and to convert the second training text into training information in character-vector form according to the character-vector dictionary;
    the training module is configured to train the convolutional neural network model according to a preset cross-entropy loss function, the ADAM optimization algorithm, and the training information;
    the recognition prediction module is configured to perform character boundary recognition prediction on input text to be segmented according to the training result of the convolutional neural network model.
  9. The Chinese word segmentation apparatus based on a convolutional neural network model according to claim 8, characterized in that the character-vector training module performs character-vector training on the first training text based on the Word2Vec algorithm and converts the first training text into the first character-vector training text in character-vector form through the Word2Vec algorithm.
  10. The Chinese word segmentation apparatus based on a convolutional neural network model according to claim 8, characterized in that the convolutional neural network model comprises four convolutional layers, the convolution kernel of each convolutional layer being a one-dimensional convolution kernel; an attention mechanism parallel to the convolutional neural network model is constructed at the fourth convolutional layer, the attention mechanism being used for attention weight calculation and adjusting the weight of each channel of the one-dimensional convolution kernel of the fourth convolutional layer;
    when the training module trains the convolutional neural network model, the attention mechanism adjusts the weights of the convolution results output by the channels of the fourth convolutional layer to obtain a weighted result, the weighted result is then input into a softmax function, and the softmax function then outputs the prediction result of the character boundary recognition prediction.
  11. The Chinese word segmentation apparatus based on a convolutional neural network model according to claim 10, characterized in that the softmax function maps the character boundary of each character into a probability value between 0 and 1 and takes the label with the highest probability value as the prediction result;
    the identification labels of the character boundaries are BMES, where B denotes the beginning of a word, M the middle of a word, E the end of a word, and S a single-character word.
  12. The Chinese word segmentation apparatus based on a convolutional neural network model according to claim 10, characterized in that, when the weights of the convolution results are adjusted:
    the output of the fourth convolutional layer is converted into an a*b matrix, where a is the number of channels and b is the length of the processed text; two parallel feed-forward layers output an a*b matrix and a b*a matrix, which are multiplied together and then mapped to probabilities by the softmax function, obtaining the convolution result of the fourth convolutional layer;
    another parallel feed-forward layer outputs a b*a matrix, which is matrix-multiplied with the attention matrix formed according to the attention mechanism; the resulting b*a matrix is converted into an a*b*1 three-dimensional matrix and added to the convolution result mapped to probabilities, and the weighted result is obtained and output, completing the weight adjustment of each channel.
  13. The Chinese word segmentation apparatus based on a convolutional neural network model according to claim 10, characterized in that the first convolutional layer comprises three one-dimensional convolution kernels of lengths 1, 3, and 5 respectively, each one-dimensional convolution kernel of the first convolutional layer having 128 channels;
    the second to fourth convolutional layers each comprise a one-dimensional convolution kernel of length 3, and the one-dimensional convolution kernels of the second, third, and fourth convolutional layers each have 384 channels.
  14. An electronic device, characterized in that the electronic device comprises a memory, a processor, and a database, the database storing a character dictionary and a second training text, and the memory storing a preprocessing program, a character-vector training program, a training-information generation program, and a convolutional neural network model;
    the convolutional neural network model comprises four convolutional layers, the convolution kernel of each convolutional layer being a one-dimensional convolution kernel; the first convolutional layer comprises three one-dimensional convolution kernels of lengths 1, 3, and 5 respectively, each one-dimensional convolution kernel of the first convolutional layer having 128 channels; the second to fourth convolutional layers each comprise a one-dimensional convolution kernel of length 3, and the one-dimensional convolution kernels of the second, third, and fourth convolutional layers each have 384 channels; a parallel attention mechanism is constructed at the fourth convolutional layer, the attention mechanism being used for attention weight calculation and adjusting the weight of each channel;
    when executed by the processor, the preprocessing program, the character-vector training program, the training-information generation program, and the convolutional neural network model implement the following steps:
    the preprocessing program obtains the character dictionary from the database and then preprocesses it, the preprocessing removing the special symbols and non-Chinese characters from the character dictionary and separating the character dictionary into a first training text in individual-character form;
    the character-vector training program converts the first training text in individual-character form into a character-vector dictionary in character-vector form;
    the training-information generation program obtains the second training text with word-segmentation annotations from the database and converts the second training text into training information in character-vector form according to the character-vector dictionary;
    the convolutional neural network model obtains the training information and is trained according to the training information, a preset cross-entropy loss function, and the ADAM optimization algorithm.
  15. The electronic device according to claim 14, characterized in that the character-vector training program includes the Word2Vec algorithm and converts the first training text into the character-vector dictionary in character-vector form through the Word2Vec algorithm.
  16. The electronic device according to claim 14, characterized in that the convolutional neural network model is further provided with a softmax function; after the weights of the channels have been adjusted, the adjusted weighted result of each channel is input into the softmax function, and the softmax function outputs the prediction result of the character boundary recognition prediction.
  17. The electronic device according to claim 16, characterized in that the softmax function maps the character boundary of each character into a probability value between 0 and 1 and outputs the label with the highest probability value as the prediction result; the identification labels of the character boundaries are BMES, where B denotes the beginning of a word, M the middle of a word, E the end of a word, and S a single-character word.
  18. A computer non-volatile readable storage medium, characterized in that the computer non-volatile readable storage medium includes a computer program and a database, and when the computer program is executed by a processor, the Chinese word segmentation method based on a convolutional neural network model according to claim 1 is implemented.
  19. The computer non-volatile readable storage medium according to claim 18, characterized in that converting the first training text into the first character-vector training text in character-vector form through character-vector training comprises the following steps: running the Word2Vec algorithm, performing character-vector training on the first training text based on the Word2Vec algorithm, and converting the first training text into the first character-vector training text in character-vector form through the Word2Vec algorithm.
  20. The computer non-volatile readable storage medium according to claim 18, characterized in that the convolutional neural network model comprises four convolutional layers, the convolution kernel of each convolutional layer being a one-dimensional convolution kernel; an attention mechanism parallel to the convolutional neural network model is constructed at the fourth convolutional layer, the attention mechanism being used for attention weight calculation and adjusting the weight of each channel of the one-dimensional convolution kernel of the fourth convolutional layer;
    when the convolutional neural network model is trained in step 4, the attention mechanism adjusts the weights of the convolution results output by the channels of the fourth convolutional layer to obtain a weighted result, the weighted result is then input into a softmax function, and the softmax function then outputs the prediction result of the character boundary recognition prediction.
PCT/CN2019/117900 2019-05-06 2019-11-13 Chinese word segmentation method and apparatus, electronic device and readable storage medium WO2020224219A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910371045.2A CN110287961B (en) 2019-05-06 2019-05-06 Chinese word segmentation method, electronic device and readable storage medium
CN201910371045.2 2019-05-06

Publications (1)

Publication Number Publication Date
WO2020224219A1 true WO2020224219A1 (en) 2020-11-12

Family

ID=68001770

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117900 WO2020224219A1 (en) 2019-05-06 2019-11-13 Chinese word segmentation method and apparatus, electronic device and readable storage medium

Country Status (2)

Country Link
CN (1) CN110287961B (en)
WO (1) WO2020224219A1 (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329477A (en) * 2020-11-27 2021-02-05 上海浦东发展银行股份有限公司 Information extraction method, device and equipment based on pre-training model and storage medium
CN112364663A (en) * 2020-11-16 2021-02-12 上海优扬新媒信息技术有限公司 User feature recognition method, device, equipment and storage medium
CN112487803A (en) * 2020-11-20 2021-03-12 中国人寿保险股份有限公司 Contract auditing method and device based on deep learning and electronic equipment
CN112507112A (en) * 2020-12-07 2021-03-16 中国平安人寿保险股份有限公司 Comment generation method, device, equipment and storage medium
CN112528658A (en) * 2020-12-24 2021-03-19 北京百度网讯科技有限公司 Hierarchical classification method and device, electronic equipment and storage medium
CN112800183A (en) * 2021-02-25 2021-05-14 国网河北省电力有限公司电力科学研究院 Content name data processing method and terminal equipment
CN112906382A (en) * 2021-02-05 2021-06-04 山东省计算中心(国家超级计算济南中心) Policy text multi-label labeling method and system based on graph neural network
CN113012220A (en) * 2021-02-02 2021-06-22 深圳市识农智能科技有限公司 Fruit counting method and device and electronic equipment
CN113065359A (en) * 2021-04-07 2021-07-02 齐鲁工业大学 Sentence-to-semantic matching method and device oriented to intelligent interaction
CN113109782A (en) * 2021-04-15 2021-07-13 中国人民解放军空军航空大学 Novel classification method directly applied to radar radiation source amplitude sequence
CN113220936A (en) * 2021-06-04 2021-08-06 黑龙江广播电视台 Intelligent video recommendation method and device based on random matrix coding and simplified convolutional network and storage medium
CN113378541A (en) * 2021-05-21 2021-09-10 标贝(北京)科技有限公司 Text punctuation prediction method, device, system and storage medium
CN113988068A (en) * 2021-12-29 2022-01-28 深圳前海硬之城信息技术有限公司 Word segmentation method, device, equipment and storage medium of BOM text
CN114091631A (en) * 2021-10-28 2022-02-25 国网江苏省电力有限公司连云港市赣榆区供电分公司 Power grid accident information publishing method and device
CN114580424A (en) * 2022-04-24 2022-06-03 之江实验室 Labeling method and device for named entity identification of legal document
WO2022267453A1 (en) * 2021-06-24 2022-12-29 平安科技(深圳)有限公司 Method for training key information extraction model, and extraction method, device and medium

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287961B (en) * 2019-05-06 2024-04-09 平安科技(深圳)有限公司 Chinese word segmentation method, electronic device and readable storage medium
CN111079418B (en) * 2019-11-06 2023-12-05 科大讯飞股份有限公司 Named entity recognition method, device, electronic equipment and storage medium
CN110929517B (en) * 2019-11-28 2023-04-18 海南大学 Geographical position positioning method, system, computer equipment and storage medium
CN111507103B (en) * 2020-03-09 2020-12-29 杭州电子科技大学 Self-training neural network word segmentation model using partial label set
CN111767718B (en) * 2020-07-03 2021-12-07 北京邮电大学 Chinese grammar error correction method based on weakened grammar error feature representation
CN113051913A (en) * 2021-04-09 2021-06-29 中译语通科技股份有限公司 Tibetan word segmentation information processing method, system, storage medium, terminal and application
CN113313129B (en) * 2021-06-22 2024-04-05 中国平安财产保险股份有限公司 Training method, device, equipment and storage medium for disaster damage recognition model
CN113901814A (en) * 2021-10-11 2022-01-07 国网电子商务有限公司 Neural network word segmentation method and device for energy E-commerce field

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6879951B1 (en) * 1999-07-29 2005-04-12 Matsushita Electric Industrial Co., Ltd. Chinese word segmentation apparatus
CN107273355A (en) * 2017-06-12 2017-10-20 大连理工大学 A kind of Chinese word vector generation method based on words joint training
CN108255816A (en) * 2018-03-12 2018-07-06 北京神州泰岳软件股份有限公司 A kind of name entity recognition method, apparatus and system
CN110287961A (en) * 2019-05-06 2019-09-27 平安科技(深圳)有限公司 Chinese word cutting method, electronic device and readable storage medium storing program for executing

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595590A (en) * 2018-04-19 2018-09-28 中国科学院电子学研究所苏州研究院 A kind of Chinese Text Categorization based on fusion attention model
CN109086267B (en) * 2018-07-11 2022-07-26 南京邮电大学 Chinese word segmentation method based on deep learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6879951B1 (en) * 1999-07-29 2005-04-12 Matsushita Electric Industrial Co., Ltd. Chinese word segmentation apparatus
CN107273355A (en) * 2017-06-12 2017-10-20 大连理工大学 Chinese word vector generation method based on character-word joint training
CN108255816A (en) * 2018-03-12 2018-07-06 北京神州泰岳软件股份有限公司 Named entity recognition method, apparatus and system
CN110287961A (en) * 2019-05-06 2019-09-27 平安科技(深圳)有限公司 Chinese word segmentation method, electronic device and readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG, DENGYI ET AL.: "Joint learning method based on BLSTM for Chinese word segmentation", APPLICATION RESEARCH OF COMPUTERS, vol. 36, no. 10, October 2019 (2019-10-01), pages 1 - 5, XP055751491, ISSN: 1001-3695 *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364663A (en) * 2020-11-16 2021-02-12 上海优扬新媒信息技术有限公司 User feature recognition method, device, equipment and storage medium
CN112487803A (en) * 2020-11-20 2021-03-12 中国人寿保险股份有限公司 Contract auditing method and device based on deep learning and electronic equipment
CN112329477A (en) * 2020-11-27 2021-02-05 上海浦东发展银行股份有限公司 Information extraction method, device and equipment based on pre-trained model, and storage medium
CN112507112B (en) * 2020-12-07 2023-07-25 中国平安人寿保险股份有限公司 Comment generation method, comment generation device, comment generation equipment and storage medium
CN112507112A (en) * 2020-12-07 2021-03-16 中国平安人寿保险股份有限公司 Comment generation method, device, equipment and storage medium
CN112528658A (en) * 2020-12-24 2021-03-19 北京百度网讯科技有限公司 Hierarchical classification method and device, electronic equipment and storage medium
CN112528658B (en) * 2020-12-24 2023-07-25 北京百度网讯科技有限公司 Hierarchical classification method, hierarchical classification device, electronic equipment and storage medium
CN113012220A (en) * 2021-02-02 2021-06-22 深圳市识农智能科技有限公司 Fruit counting method and device and electronic equipment
CN112906382A (en) * 2021-02-05 2021-06-04 山东省计算中心(国家超级计算济南中心) Policy text multi-label labeling method and system based on graph neural network
CN112906382B (en) * 2021-02-05 2022-06-21 山东省计算中心(国家超级计算济南中心) Policy text multi-label labeling method and system based on graph neural network
CN112800183B (en) * 2021-02-25 2023-09-26 国网河北省电力有限公司电力科学研究院 Content name data processing method and terminal equipment
CN112800183A (en) * 2021-02-25 2021-05-14 国网河北省电力有限公司电力科学研究院 Content name data processing method and terminal equipment
CN113065359B (en) * 2021-04-07 2022-05-24 齐鲁工业大学 Sentence-to-semantic matching method and device oriented to intelligent interaction
CN113065359A (en) * 2021-04-07 2021-07-02 齐鲁工业大学 Sentence-to-semantic matching method and device oriented to intelligent interaction
CN113109782A (en) * 2021-04-15 2021-07-13 中国人民解放军空军航空大学 Novel classification method directly applied to radar radiation source amplitude sequence
CN113109782B (en) * 2021-04-15 2023-08-15 中国人民解放军空军航空大学 Classification method directly applied to radar radiation source amplitude sequence
CN113378541A (en) * 2021-05-21 2021-09-10 标贝(北京)科技有限公司 Text punctuation prediction method, device, system and storage medium
CN113378541B (en) * 2021-05-21 2023-07-07 标贝(北京)科技有限公司 Text punctuation prediction method, device, system and storage medium
CN113220936A (en) * 2021-06-04 2021-08-06 黑龙江广播电视台 Intelligent video recommendation method and device based on random matrix coding and simplified convolutional network and storage medium
CN113220936B (en) * 2021-06-04 2023-08-15 黑龙江广播电视台 Video intelligent recommendation method, device and storage medium based on random matrix coding and simplified convolutional network
WO2022267453A1 (en) * 2021-06-24 2022-12-29 平安科技(深圳)有限公司 Method for training key information extraction model, and extraction method, device and medium
CN114091631A (en) * 2021-10-28 2022-02-25 国网江苏省电力有限公司连云港市赣榆区供电分公司 Power grid accident information publishing method and device
CN113988068A (en) * 2021-12-29 2022-01-28 深圳前海硬之城信息技术有限公司 Word segmentation method, device, equipment and storage medium for BOM text
CN114580424B (en) * 2022-04-24 2022-08-05 之江实验室 Labeling method and device for named entity identification of legal document
CN114580424A (en) * 2022-04-24 2022-06-03 之江实验室 Labeling method and device for named entity identification of legal document

Also Published As

Publication number Publication date
CN110287961A (en) 2019-09-27
CN110287961B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
WO2020224219A1 (en) Chinese word segmentation method and apparatus, electronic device and readable storage medium
CN108959246B (en) Answer selection method and device based on improved attention mechanism and electronic equipment
US11544474B2 (en) Generation of text from structured data
CN111814466A (en) 2020-10-23 Information extraction method based on machine reading comprehension and related equipment thereof
CN112215008B (en) Entity identification method, device, computer equipment and medium based on semantic understanding
WO2022142011A1 (en) Method and device for address recognition, computer device, and storage medium
WO2020147409A1 (en) Text classification method and apparatus, computer device, and storage medium
CN112541338A (en) Similar text matching method and device, electronic equipment and computer storage medium
US20230130006A1 (en) 2023-04-27 Method of processing video, method of querying video, and method of training model
CN111599340A (en) Polyphone pronunciation prediction method and device and computer readable storage medium
CN112036184A (en) 2020-12-04 Entity identification method, device, computer device and storage medium based on BiLSTM network model and CRF model
WO2023092960A1 (en) Labeling method and apparatus for named entity recognition in legal document
CN112559687A (en) Question identification and query method and device, electronic equipment and storage medium
CN111339775A (en) Named entity identification method, device, terminal equipment and storage medium
JP2022145623A (en) Method and device for presenting hint information and computer program
CN114218945A (en) Entity identification method, device, server and storage medium
CN111597807B (en) Word segmentation data set generation method, device, equipment and storage medium thereof
CN113722512A (en) Text retrieval method, device and equipment based on language model and storage medium
CN116701574A (en) Text semantic similarity calculation method, device, equipment and storage medium
CN113360654B (en) Text classification method, apparatus, electronic device and readable storage medium
CN114445832A (en) Character image recognition method and device based on global semantics and computer equipment
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment
CN113704466B (en) Text multi-label classification method and device based on iterative network and electronic equipment
CN110705287B (en) Method and system for generating text abstract
CN112926314A (en) Document repeatability identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19927740

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19927740

Country of ref document: EP

Kind code of ref document: A1
