CN111178065A - Word segmentation recognition word stock construction method, Chinese word segmentation method and device - Google Patents


Info

Publication number: CN111178065A (application CN201911288705.7A)
Authority: CN (China)
Prior art keywords: neuron, word, segmented, word segmentation, link
Legal status (an assumption, not a legal conclusion): Granted
Application number: CN201911288705.7A
Other languages: Chinese (zh)
Other versions: CN111178065B (en)
Inventor: 李胤文
Current Assignee: CCB Finetech Co Ltd
Original Assignees: China Construction Bank Corp; CCB Finetech Co Ltd
Application filed by China Construction Bank Corp and CCB Finetech Co Ltd
Priority: CN201911288705.7A
Publication of CN111178065A
Application granted; publication of CN111178065B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a method for constructing a word-segmentation recognition lexicon, a Chinese word segmentation method, and a corresponding device, in the field of computer technology. One embodiment of the method comprises, for each short sentence in a training text set: deduplicating the characters of the short sentence and constructing a corresponding neuron for each character remaining after deduplication, wherein the signal type indicated by a neuron matches the character corresponding to that neuron; constructing a link relation between the two neurons corresponding to every pair of characters according to the relative position and occurrence frequency of the pair in the short sentence, so as to form a short-sentence neural network, wherein a link relation indicates a link coefficient and a signal transmission direction; and fusing the short-sentence neural networks to form the word-segmentation recognition lexicon. The embodiment can effectively increase the number of words covered by the lexicon and the accuracy of word segmentation.

Description

Word segmentation recognition word stock construction method, Chinese word segmentation method and device
Technical Field
The invention relates to the field of computer technology, and in particular to a method for constructing a word-segmentation recognition lexicon, and a Chinese word segmentation method and device.
Background
Lexicon-based Chinese word segmentation is one of the most common segmentation methods at present. Constructing and maintaining the lexicon is therefore the foundation of word segmentation.
Existing lexicons are mainly constructed and maintained manually: existing words, such as the entries of a modern standard Chinese dictionary and new words appearing on the internet, are collected by hand and stored in the lexicon.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
compared with the massive text data on the internet, manually collected words are very limited, so the number of words stored in the lexicon is very limited. When segmentation is performed against a manually constructed or maintained lexicon, the words it stores often cannot meet the segmentation requirements.
Disclosure of Invention
In view of this, embodiments of the present invention provide a word-segmentation recognition lexicon construction method, a Chinese word segmentation method, and a server, which can effectively increase the number of words in the lexicon and the accuracy of word segmentation.
In order to achieve the above object, according to one aspect of the embodiments of the present invention, there is provided a method for constructing a word-segmentation recognition lexicon, comprising:
for each short sentence in the training text set, executing:
deduplicating the characters of the short sentence, and constructing a corresponding neuron for each character remaining after deduplication, wherein the signal type indicated by a neuron matches the character corresponding to that neuron;
constructing a link relation between the two neurons corresponding to every pair of characters according to the relative position and occurrence frequency of the pair in the short sentence, so as to form a short-sentence neural network corresponding to the short sentence, wherein a link relation indicates a link coefficient and a signal transmission direction;
and fusing the short-sentence neural networks to form the word-segmentation recognition lexicon.
Preferably,
the word-segmentation recognition lexicon comprises a master neural network and a linker linked to the neurons in the master neural network;
fusing the short-sentence neural networks comprises:
executing, for each short-sentence neural network:
linking each neuron in the short-sentence neural network to the linker;
traversing, through the linker, each neuron in the short-sentence neural network;
and, when the traversal result shows that neurons of the same signal type exist in both the master neural network and the short-sentence neural network, deleting those neurons from the short-sentence neural network and connecting their associated link relations to the master neural network.
Preferably,
when the traversal result shows that the master neural network and the short-sentence neural network contain link relations with the same signal transmission direction,
the link coefficient indicated by the link relation on the master neural network is updated according to the link coefficient indicated by the corresponding link relation in the short-sentence neural network.
Preferably,
the lexicon construction method further comprises:
acquiring a newly added short sentence;
and executing, for each added character in the newly added short sentence:
converting the added character into a corresponding neuron;
searching the master neural network, through the linker, for a first neuron matching the neuron corresponding to the added character, and activating the first neuron;
when a first link relation exists between two first neurons, calculating a first link coefficient for the first link relation using a preset activation function;
and updating the link coefficient indicated by the first link relation with the calculated first link coefficient.
Preferably,
the word segmentation recognition word stock construction method further comprises the following steps:
setting an activated state and an inhibited state for each neuron, wherein the activated state indicates that the neuron is in use and the inhibited state indicates that it is not;
when a recovery signal is received, converting the neuron from the activated state to the inhibited state.
Preferably,
each neuron further indicates a signal strength;
searching for a first neuron matching the neuron corresponding to the added character comprises:
searching for a first neuron matching the signal type indicated by the neuron corresponding to the added character;
and activating the first neuron when the signal strength indicated by the neuron corresponding to the added character is not less than a preset threshold.
Preferably,
the lexicon construction method further comprises:
executing, for each short sentence in the training text set:
calculating the MD5 hash of the short sentence;
and judging whether that MD5 hash has already been recorded; if so, ignoring the short sentence; otherwise, recording the hash, and executing the steps of deduplicating the short sentence and constructing a corresponding neuron for each character remaining after deduplication.
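The MD5-based duplicate check described above can be sketched in a few lines of Python (the function name `dedup_phrases` and the generator interface are illustrative choices, not from the patent):

```python
import hashlib

def dedup_phrases(phrases, seen=None):
    """Yield only phrases whose MD5 digest has not been recorded yet,
    ignoring any phrase whose digest was already seen."""
    seen = set() if seen is None else seen
    for phrase in phrases:
        digest = hashlib.md5(phrase.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # phrase already processed: ignore it
        seen.add(digest)  # record the hash, then process the phrase
        yield phrase
```

A caller would feed every short sentence of the training text set through this filter before building any neurons.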
Preferably,
the word segmentation recognition word stock construction method further comprises the following steps:
setting corresponding attenuation periods for link relations and neurons respectively;
when the word-segmentation recognition lexicon is in use,
deleting a link relation when the time it has spent in the inhibited state reaches its attenuation period;
and, when the time a neuron has spent in the inhibited state reaches its attenuation period, deleting the neuron together with the link relations attached to it.
Preferably,
the word segmentation recognition word stock construction method further comprises the following steps:
calculating a second link coefficient for the current time according to the current time, the time of last activation, and a preset attenuation function;
and updating the link coefficient indicated by the link relation with the calculated second link coefficient.
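The attenuation behaviour of the last two embodiments can be sketched as follows. The exponential form of the attenuation function and the dictionary-based bookkeeping (items mapped to their last-activation time) are assumptions for illustration; the patent names a "preset attenuation function" and "attenuation periods" but this excerpt does not define them:

```python
import math

def decayed_coefficient(coeff, last_activated, now, tau):
    """Recompute a link coefficient from the current time and the time of
    last activation. An exponential decay with time constant tau is
    assumed here as the 'preset attenuation function'."""
    return coeff * math.exp(-(now - last_activated) / tau)

def prune(neurons, links, now, neuron_period, link_period):
    """Delete links idle for link_period and neurons idle for
    neuron_period, together with every link attached to a deleted neuron.
    `neurons` maps characters, and `links` maps (src, dst) pairs, to the
    time each was last activated."""
    links = {e: t for e, t in links.items() if now - t < link_period}
    dead = {n for n, t in neurons.items() if now - t >= neuron_period}
    neurons = {n: t for n, t in neurons.items() if n not in dead}
    links = {e: t for e, t in links.items()
             if e[0] not in dead and e[1] not in dead}
    return neurons, links
```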
Preferably, the lexicon construction method further comprises:
executing, for at least one second link relation corresponding to a deterministic word sequence:
updating the link coefficient indicated by each second link relation with a preset link constant.
Preferably, the lexicon construction method further comprises:
executing, for at least one third link relation corresponding to a deterministic non-word sequence:
deleting the at least one third link relation.
According to a second aspect of the embodiments of the present invention, there is provided a Chinese word segmentation method, comprising:
executing, for each short sentence to be segmented in the text to be segmented:
converting each character to be segmented in the short sentence into a corresponding neuron to be segmented;
searching a word-segmentation recognition lexicon for a matching neuron for each neuron to be segmented, wherein the lexicon is constructed by any of the construction methods above and comprises a plurality of neurons and the link relations among the neurons;
and determining the segmentation positions of the short sentence according to the positional order of the characters to be segmented in the short sentence and the link relations found between every two matching neurons.
Preferably,
searching for a matching neuron for each neuron to be segmented comprises:
searching for the matching neurons sequentially, in the positional order of the characters to be segmented in the short sentence, and converting each matching neuron from the inhibited state to the activated state.
Preferably, the Chinese word segmentation method further comprises:
executing, for each neuron to be segmented:
determining a first output signal strength of the neuron to be segmented;
when a neuron in the activated state exists in the word-segmentation recognition lexicon, penetrating that activated neuron with the neuron to be segmented;
calculating the first penetration signal strength generated by that penetration according to the relative positions, in the short sentence to be segmented, of the activated neuron and the neuron to be segmented;
determining the segmentation position of the short sentence to be segmented then comprises:
calculating a first segmentation signal strength for the neuron to be segmented from the first output signal strength and the first penetration signal strength;
and segmenting according to the first segmentation signal strength.
Preferably,
determining the segmentation position of the short sentence to be segmented comprises:
locating the current candidate segmentation position and the previous segmentation position corresponding to it;
determining a second output signal strength of the neuron corresponding to the first character to be segmented between the current candidate position and the previous position;
executing, for every two target characters to be segmented between the current candidate position and the previous position:
determining a second penetration signal strength for the two target characters according to their relative positions in the short sentence and the link coefficient corresponding to them;
calculating a second segmentation signal strength for the neuron to be segmented from the second output signal strength and the second penetration signal strength;
and judging whether the second segmentation signal strength satisfies a preset segmentation condition; if so, determining the current candidate position to be a segmentation position.
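The boundary decision above can be illustrated with a deliberately simplified sketch: between each pair of adjacent characters, take the link coefficient of the corresponding neurons as the segmentation-signal strength and cut wherever it falls below a preset threshold. The real method combines output and penetration signal strengths over every character pair between two boundaries; collapsing that to adjacent pairs, and the threshold value itself, are assumptions made for illustration:

```python
def segment(phrase, links, threshold=0.05):
    """Split `phrase` at every adjacent-character boundary whose link
    coefficient (from `links`, a {(src, dst): coefficient} mapping) is
    below `threshold`. A weak link means the two characters are unlikely
    to belong to the same word."""
    words, start = [], 0
    for k in range(len(phrase) - 1):
        strength = links.get((phrase[k], phrase[k + 1]), 0.0)
        if strength < threshold:          # weak link: boundary here
            words.append(phrase[start:k + 1])
            start = k + 1
    words.append(phrase[start:])
    return words
```

With the lexicon of the "this is an apple" example, only 苹→果 carries a strong link, so 苹果 survives as a word while 这 and 是 are split off.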
According to a third aspect of the embodiments of the present invention, there is provided a device for constructing a word-segmentation recognition lexicon, comprising a construction unit and a first linker, wherein:
the construction unit is configured to execute, for each short sentence in the training text set: deduplicating the characters of the short sentence, and constructing a corresponding neuron for each character remaining after deduplication, wherein the signal type indicated by a neuron matches the character corresponding to that neuron; and constructing a link relation between the two neurons corresponding to every pair of characters according to the relative position and occurrence frequency of the pair in the short sentence, so as to form a short-sentence neural network corresponding to the short sentence, wherein a link relation indicates a link coefficient and a signal transmission direction;
and the first linker is configured to connect the neurons built by the construction unit and to fuse each short-sentence neural network built by the construction unit into the word-segmentation recognition lexicon.
According to a fourth aspect of the embodiments of the present invention, there is provided a Chinese word segmentation device, comprising a conversion unit and a second linker, wherein:
the conversion unit is configured to execute, for each short sentence to be segmented in the text to be segmented: converting each character to be segmented in the short sentence into a corresponding neuron to be segmented;
the second linker is configured to search a word-segmentation recognition lexicon for a matching neuron for each neuron to be segmented, wherein the lexicon comprises a plurality of neurons and the link relations among the neurons;
and the second linker is further configured to determine the segmentation positions of the short sentence according to the positional order of the characters to be segmented in the short sentence and the link relations found between every two matching neurons.
One embodiment of the above invention has the following advantage: a neuron is constructed for each character of each short sentence in the training text set, and link relations between neurons are constructed from the relative position and co-occurrence frequency of every character pair, so the word-segmentation recognition lexicon stores character neurons and the link relations between characters rather than whole words or phrases. During segmentation, words are formed from characters and their link relations instead of being looked up as stored word or phrase entries. The scheme provided by the embodiments of the invention can therefore effectively increase the number of words the lexicon covers and the accuracy of segmentation.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
fig. 1 is a schematic diagram of a main flow of a word segmentation recognition word stock construction method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a portion of a neural network in accordance with an embodiment of the present invention;
FIG. 3 is a schematic diagram of the relationship between a linker and a neuron according to an embodiment of the invention;
FIG. 4 is a diagram illustrating a main flow of a word segmentation recognition lexicon construction method according to another embodiment of the present invention;
FIG. 5 is a schematic diagram of a phrasal neural network in accordance with another embodiment of the present invention;
FIG. 6 is a schematic diagram of a portion of a master neural network in a word segmentation recognition lexicon according to another embodiment of the present invention;
FIG. 7 is a diagram illustrating a main flow of a word segmentation recognition lexicon construction method according to another embodiment of the present invention;
FIG. 8 is a schematic diagram of the main flow of a method of word segmentation in accordance with one embodiment of the present invention;
FIG. 9 is a schematic illustration of a main flow of determining a position of a word segmentation according to one embodiment of the invention;
FIG. 10 is a schematic diagram of a portion of a master neural network in a word segmentation recognition lexicon according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of a main flow of determining the position of a word segmentation according to another embodiment of the present invention;
fig. 12 is a schematic diagram of the main units of a word segmentation recognition lexicon construction apparatus according to an embodiment of the present invention;
FIG. 13 is a schematic diagram of the main units of a Chinese word segmentation apparatus according to an embodiment of the present invention;
FIG. 14 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 15 is a schematic block diagram of a computer system suitable for use with a server implementing an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
A short sentence refers to a segment of an article delimited by punctuation marks such as commas, pause marks, quotation marks, periods, and question marks.
A neuron is an abstract node carrying an electrical signal; neurons make up a neural network.
A word is the smallest unit of language that can be used independently.
Chinese word segmentation is the process of recombining a continuous character sequence into a word sequence according to a certain specification. In Chinese, characters, sentences, and paragraphs can be delimited by obvious delimiters; only words have no formal delimiter.
A neuron is the basic structural and functional unit of a neural network: it processes the information arriving on its input links and outputs the result to other neurons through its output links. A neuron carries a signal type and a signal strength.
the neural network refers to a network composed of neurons and link relations between the neurons.
A neural signal is a signal transmitted between neurons; it indicates a signal type and a signal strength. During lexicon construction or segmentation, a neuron is activated when a neural signal has the same signal type as the neuron and its signal strength is greater than a threshold. The signal type does not change during operation, while the signal strength does. Signals are classified into input and output signal types, with corresponding input and output signal strengths.
A link coefficient is the coefficient of the link between two neurons. It controls the ratio of output to input signal strength across a neuron link formed by two or more characters, and is divided into an input coefficient and an output coefficient. The output signal strength of a neural signal can be calculated from its input signal strength and the link coefficient, which is the basis of lexicon construction and segmentation.
Inhibited state: neurons in the word-segmentation recognition lexicon are normally in the inhibited state. An inhibited neuron is activated only when it receives a signal whose type matches it and whose strength is greater than a threshold; non-matching signals are not processed.
Activated state: an inhibited neuron in the lexicon becomes activated by a matching neural signal. An activated neuron is penetrable by all input neural signals, and returns to the inhibited state when it receives a recovery signal. For example, neuron A becomes activated after receiving an A-type signal; if it then receives a B-type signal, a neuron operation is performed between the B-type signal and neuron A, because A is currently activated. If a recovery signal is received, all neurons enter the inhibited state, and neuron A is activated again only when it receives another A-type signal.
The linker is connected to the neurons. It forwards an input neural signal to the output neuron whose type matches it. During lexicon construction, if an input neural signal matches no neuron, the neuron corresponding to that signal is stored into the lexicon; during segmentation, an unmatched input signal produces an abnormal-signal output. The link coefficient between the linker and the neurons it connects is 1 (after the linker receives a neural signal, forwarding does not attenuate its strength; the signal passes straight through). Each character corresponds to one neural-signal type; forwarding an input signal to an output neuron of the same type means that the output neuron expresses the same character.
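The linker's two behaviours (register a new neuron while building the lexicon; report an abnormal signal while segmenting) can be sketched as follows. The class and method names, and the dictionary representation of neuron state, are illustrative assumptions:

```python
class Linker:
    """Forwards an input neural signal to the neuron of the same signal
    type with link coefficient 1 (no attenuation). In building mode an
    unmatched signal registers a new neuron; in segmentation mode it
    raises an abnormal-signal error."""

    def __init__(self, building=True):
        self.neurons = {}      # signal type (character) -> neuron state
        self.building = building

    def send(self, signal_type, strength):
        if signal_type not in self.neurons:
            if not self.building:
                raise LookupError("abnormal signal: " + signal_type)
            self.neurons[signal_type] = {"strength": 0.0}
        neuron = self.neurons[signal_type]
        neuron["strength"] = strength   # coefficient 1: passed through unattenuated
        return neuron
```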
Fig. 1 shows a method for constructing a word-segmentation recognition lexicon according to an embodiment of the present invention. As shown in fig. 1, the method may comprise the following steps, executed for each short sentence in the training text set:
S101: deduplicating the characters of the short sentence, and constructing a corresponding neuron for each character remaining after deduplication, wherein the signal type indicated by a neuron matches the character corresponding to that neuron;
S102: constructing a link relation between the two neurons corresponding to every pair of characters according to the relative position and occurrence frequency of the pair in the short sentence, so as to form a short-sentence neural network corresponding to the short sentence, wherein a link relation indicates a link coefficient and a signal transmission direction;
S103: fusing the short-sentence neural networks to form the word-segmentation recognition lexicon.
The training text set is drawn from any readable text, such as dictionaries, textbooks, newspapers, and online articles.
The link coefficient in the short-sentence neural network is calculated from the distance between the current character and a subsequent character by the following formula (1):

W(NL_i) = 10 × pow(0.1, i)    (1)
wherein NL_i denotes a pair of characters at distance i in the short sentence; W(NL_i) denotes the link coefficient between two characters at distance i; pow(0.1, i) denotes 0.1 raised to the i-th power, i being an integer not less than 0: i is 0 for a character and itself, 1 when the two characters are adjacent, and so on. For example, in the short sentence "this is an apple" (这是苹果), the interval between "this" (这) and "is" (是) is 1; the interval between "this" and "apple" (苹) is 2; the interval between "this" and "fruit" (果) is 3. The neural network constructed for this short sentence by steps S101 and S102 above is shown in fig. 2. Per formula (1), the link coefficient in the signal transmission direction from the "this" neuron to the "is" neuron is 0.1; from "this" to "apple", 0.01; from "this" to "fruit", 0.001; from "is" to "apple", 0.1; and from "apple" to "fruit", 0.2.
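A sketch of this coefficient calculation follows. Note that formula (1) as printed (10 × 0.1^i with i = 1 for adjacent characters) would give 1 for adjacent characters, while the worked example gives 0.1; the code below reproduces the worked example by taking the exponent as the positional gap plus one. That offset is an interpretation made to match the example's numbers, not something this excerpt states, and the function names are illustrative:

```python
def link_coefficient(i):
    """Formula (1): W(NL_i) = 10 * 0.1**i."""
    return 10 * 0.1 ** i

def phrase_links(phrase):
    """Build the directed link coefficients for every ordered character
    pair in a phrase. The exponent passed to link_coefficient is the
    positional gap plus one, an assumption chosen so that adjacent
    characters get coefficient 0.1 as in the worked example."""
    links = {}
    for a in range(len(phrase)):
        for b in range(a + 1, len(phrase)):
            links[(phrase[a], phrase[b])] = link_coefficient(b - a + 1)
    return links
```

For "这是苹果" this yields 0.1 for adjacent pairs such as 这→是, 0.01 for pairs one character apart, and 0.001 for 这→果.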
That is, the link coefficient is a value calculated from the relative position of two characters (for example, the number of characters between them) and the frequency with which the two characters appear together in short sentences; to some extent it reflects the probability that the two characters belong to the same word or phrase. The greater the distance, the lower that probability, and hence the smaller the link coefficient. The lexicon constructed by the embodiment of the invention can therefore capture the relations between characters well.
The signal transmission direction reflects the order of two characters. For example, for the short sentence "I love to eat apples", neurons are constructed for "I", "love", "eat", "apple" (苹) and "fruit" (果), and the transmission direction between the "I" neuron and the "love" neuron runs from "I" to "love"; for the short sentence "love my China", neurons are constructed for "love", "me", "middle" (中) and "hua" (华), and the transmission direction between the "me" neuron and the "love" neuron runs from "love" to "me".
In the scheme provided by the embodiment of the invention, a neuron is constructed for each character of each short sentence in the training text set, and the link relations between neurons are constructed from the relative position and co-occurrence frequency of every character pair. The word-segmentation recognition lexicon therefore stores character neurons and the link relations between characters rather than words or phrases, and can represent more words without being constrained to a fixed word list. The scheme can thus effectively increase the number of words the lexicon covers.
In one embodiment of the invention, the word-segmentation recognition lexicon comprises a master neural network and a linker linked to the neurons in the master neural network. The relationship between the neurons and the linker is shown in fig. 3: the linker comprises an entrance linker and an exit linker, and the neuron corresponding to each character on the master neural network (character-A neuron, character-B neuron, …, character-C neuron, character-D neuron, …) is connected to both, so that neurons or neural signals are input through the entrance linker and results are output through the exit linker.
Accordingly, based on the linker, the specific implementation of step S103 may include: as shown in fig. 4, steps S401 to S403 are performed for each of the phrase neural networks:
s401: linking each neuron in the phrasal neural network to a linker;
s402: traversing each neuron in the short sentence neural network through a linker;
s403: and when the traversal result shows that the neurons with the same signal types exist between the main neural network and the phrase neural network, deleting the neurons with the same signal types in the phrase neural network, and connecting the related linkage relations of the neurons with the same signal types to the main neural network.
For example, take the neural network shown in fig. 2 as the main neural network of the word segmentation recognition word bank and fuse into it the phrase neural network shown in fig. 5. Each neuron of the phrase neural network in fig. 5 (the neurons corresponding to "我", "爱", "苹" and "果") is traversed. The neurons whose signal types also exist in the network of fig. 2 are the "苹" neuron and the "果" neuron, so those neurons are deleted from the phrase neural network and their related link relations are connected to the main neural network, yielding the neural network shown in fig. 6.
In an embodiment of the invention, when the traversal result shows that a link relation with the same signal transmission direction exists in both the main neural network and the phrase neural network, the link coefficient indicated by that link relation on the main neural network is updated according to the link coefficient indicated by the corresponding link relation on the phrase neural network. For example, if the signal transmission direction of the link relation from the "苹" neuron to the "果" neuron in the phrase neural network of fig. 5 is the same as that of the link relation from the "苹" neuron to the "果" neuron in the network of fig. 2, the link coefficient of the "苹"-to-"果" link in fig. 6 is updated with the link coefficient of that link in the phrase neural network of fig. 5.
The above shows how a phrase neural network is fused into the master neural network to update it. An embodiment is given below in which each character of a new short sentence is input to the master neural network in turn to update it. In an embodiment of the present invention, as shown in fig. 7, for a newly obtained short sentence, the method for constructing the word segmentation recognition lexicon may further include the following steps:
executing S701 to S704 for each added word in the added short sentence:
s701: converting the added words into corresponding neurons;
s702: on the master neural network, searching a first neuron matched with a neuron corresponding to the added word through a linker, and activating the first neuron;
s703: when a first link relation exists between two first neurons, calculating a first link coefficient corresponding to the first link relation by using a preset activation function;
s704: the link coefficient indicated by the first link relation is updated using the calculated first link coefficient.
In the above step S702, if no matching first neuron is found, the neuron corresponding to the added character is added to the word stock as a new neuron. A neuron in the lexicon is in the suppressed state by default; when it is matched, it is activated and its state changes to the activated state.
The activation function of step S703 is shown in the following calculation formula (2):
calculating formula (2):
W'(n+1)(NL) = Su(Su(W'(n)(NL)) + 1)

wherein W'(n+1)(NL) is the link coefficient between the neurons of the two characters of link NL at its (n+1)-th activation; W'(n)(NL) is that link coefficient at the n-th activation; n is an integer not less than 0; Su(P) = sigmoid(P - N), where for W'(n+1)(NL) the argument P stands for W'(n)(NL) in the inner application and for Su(W'(n)(NL)) + 1 in the outer one; N is a setting parameter giving the x-axis position of the centre of the sigmoid function.
The formula form of Su (P) ═ sigmoid (P-N) is shown in the following calculation formula (3):
calculation formula (3):
Su(P) = 1 / (1 + e^(-(P - N)))
The value of the sigmoid function lies between 0 and 1; it is an S-shaped curve, centrally symmetric about 0.5, so the link coefficient stays between 0 and 1 while the result grows with the number of activations, which facilitates the subsequent word segmentation calculation. In a preferred embodiment, N = 10.
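Calculation formulas (2) and (3) can be sketched directly. This is a literal transcription of the formulas as stated, under the preferred setting N = 10; it makes no claim about the patent's actual implementation.

```python
import math

N = 10  # sigmoid centre, the preferred value from the embodiment

def su(p):
    # Su(P) = sigmoid(P - N), calculation formula (3)
    return 1.0 / (1.0 + math.exp(-(p - N)))

def update_link_coefficient(w_n):
    # Calculation formula (2): W'(n+1)(NL) = Su(Su(W'(n)(NL)) + 1)
    return su(su(w_n) + 1)

w = update_link_coefficient(0.5)
```

Since Su maps any input into (0, 1), the updated coefficient always stays within that interval, matching the property claimed in the text.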
In an embodiment of the present invention, the method for constructing a word segmentation recognition lexicon further includes: setting an activation state and an inhibition state for the neuron, wherein the activation state indicates that the neuron is used and the inhibition state indicates that the neuron is not used; when a recovery signal is acquired, the activation state of the neuron is converted to the inhibition state. Through the active state, the inhibition state and the recovery signal, in the searching process, the target neuron can be searched by searching the neuron in the active state, so that the link coefficient can be more efficiently and quickly counted, and the word can be more efficiently and quickly segmented. In addition, through the suppression state and the recovery signal, resources occupied by neurons in the word segmentation recognition word stock can be effectively reduced.
For example, suppose the neural network shown in fig. 2 already exists in the word segmentation recognition word stock. After each character of "我爱苹果" is abstracted into a corresponding neural signal, the signals for "我", "爱", "苹" and "果" are input to the linker in turn. The signals for "我" and "爱" find no matching neurons, so no neuron is activated; the signal for "苹" matches the "苹" neuron in fig. 2 and activates it; the signal for "果" can then penetrate the link relation between the "苹" neuron and the "果" neuron and activate the "果" neuron. For another example, after each character of "效果" ("effect") is abstracted into a neural signal, the "效" and "果" signals are input to the linker in turn; the "效" signal activates the "效" neuron by matching the one in fig. 2, and the "果" signal penetrates the link relation between the "效" neuron and the "果" neuron and activates the "果" neuron. In the process of constructing the word segmentation lexicon, each time a link relation is penetrated, that link relation counts as activated once.
In one embodiment of the invention, a neuron further indicates signal strength. Step S702 may then be implemented as: searching for a first neuron matching the signal type indicated by the neuron corresponding to the added character, and activating that first neuron when the signal intensity indicated by the neuron corresponding to the added character is not less than a preset threshold. Since the link coefficient between a linker and a neuron is 1, searching for neurons through the linker does not reduce a neuron's signal strength.
In an embodiment of the present invention, the method for constructing the word segmentation recognition lexicon may further include, for each short sentence in the training text set: calculating the MD5 code of the short sentence; judging whether that MD5 code has already been recorded; if so, ignoring the short sentence; otherwise recording the MD5 code and executing the steps of de-duplicating the short sentence and constructing a corresponding neuron for each character of the de-duplicated short sentence. Here MD5 is a 128-bit feature code obtained by mathematically transforming the short sentence according to the published MD5 algorithm; identical short sentences yield identical MD5 codes. The MD5 code prevents the same short sentence from being trained twice, effectively improving the accuracy of the link relations between neurons.
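The MD5-based de-duplication step can be sketched with the standard `hashlib` library. The function name and the in-memory `set` of seen digests are illustrative assumptions; the patent only specifies computing and recording the MD5 code.

```python
import hashlib

seen_digests = set()

def should_train(phrase):
    # Compute the MD5 feature code of the phrase and skip training
    # when the same code has already been recorded.
    digest = hashlib.md5(phrase.encode("utf-8")).hexdigest()
    if digest in seen_digests:
        return False  # same phrase seen before: ignore it
    seen_digests.add(digest)
    return True
```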
In an embodiment of the present invention, the method for constructing the word segmentation recognition lexicon may further include: setting corresponding attenuation periods for link relations and neurons; when the word segmentation recognition word stock is in use and a link relation has remained in the suppressed state for its attenuation period, deleting that link relation; and when a neuron has remained in the suppressed state for its attenuation period, deleting the neuron together with the link relations linked to it. This deletes unused link relations and neurons, reducing the resources they occupy and avoiding unnecessary overhead, thereby further lowering resource usage and effectively improving word segmentation efficiency.
In an embodiment of the present invention, the method for constructing a word segmentation recognition lexicon further includes: calculating a second link coefficient corresponding to the current time according to the current time, the last activated time corresponding to the current time and a preset attenuation function; and updating the link coefficient indicated by the link relation by using the calculated second link coefficient. The link coefficient can reflect the activity degree of the link relation so as to more accurately indicate the activity degree of the words corresponding to the link relation. Meanwhile, the attenuation process is embodied on the link coefficient and can represent the forgetting of long-term unused words in the neural network, and the forgetting is helpful for filtering the recently unused words and reducing the recognition calculation cost in the word segmentation recognition process.
The specific implementation of calculating the second link coefficient corresponding to the current time according to the current time, the last activated time corresponding to the current time, and the preset attenuation function may include:
calculating a second link coefficient corresponding to the current time by using the following calculation formula (4):
calculating formula (4):
Sd(t)=v*Ebbinghaus(t-t0)
wherein Sd(t) is the second link coefficient corresponding to the current time t; t0 is the last activation time before the current time; v is the attenuation coefficient; Ebbinghaus() is the Ebbinghaus forgetting-curve function. Preferably, v = 1. The attenuation period may be set according to practical conditions, for example decaying from 100% to 0% over one year. Through this attenuation setting, the decay of the links between neurons (or of the neurons themselves) in the word segmentation recognition word stock simulates the forgetting process of the human brain: forgetting begins immediately after learning and is not uniform, being fast at first and gradually slowing down. The Ebbinghaus forgetting curve expresses retention as a function of time, and is used here to model that process when calculating the decay of links between neurons or of the neurons in the word segmentation recognition word bank.
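Calculation formula (4) can be sketched as below. The patent names the Ebbinghaus forgetting curve but gives no closed form, so an exponential decay with a hypothetical 30-day half-life is assumed here as a stand-in; times are expressed in days.

```python
import math

def ebbinghaus(days, half_life=30.0):
    # Stand-in retention curve: exponential decay with an assumed
    # half-life, substituting for the unspecified Ebbinghaus function.
    return math.exp(-days * math.log(2) / half_life)

def second_link_coefficient(t_now, t_last_activated, v=1.0):
    # Calculation formula (4): Sd(t) = v * Ebbinghaus(t - t0)
    return v * ebbinghaus(t_now - t_last_activated)
```

A link activated just now keeps full strength; one idle for a half-life retains half, mimicking fast-then-slow forgetting.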
In an embodiment of the present invention, the method for constructing the word segmentation recognition lexicon further includes, for the at least one second link relation corresponding to a deterministic word sequence: updating the link coefficient indicated by each second link relation with a preset link constant. This can further improve the accuracy of word segmentation using the word segmentation recognition word stock. For example, if the two-character word "复习" ("review") is deterministic, the link coefficient between the "复" neuron and the "习" neuron is updated with the preset link constant; if the four-character word "卧虎藏龙" ("crouching tiger, hidden dragon") is deterministic, the link coefficients between the "卧" and "虎" neurons, between the "虎" and "藏" neurons, and between the "藏" and "龙" neurons are updated with the preset link constant. Preferably, the preset link constant is 0.9.
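Pinning a deterministic word's links to the preferred constant 0.9 can be sketched as follows; the function name and the dict-of-links representation are illustrative assumptions.

```python
LINK_CONSTANT = 0.9  # the preferred preset link constant

def pin_deterministic_word(word, links):
    # Set the link coefficient of every adjacent-character link of a
    # deterministic word to the preset constant.
    for src, dst in zip(word, word[1:]):
        links[(src, dst)] = LINK_CONSTANT

links = {}
pin_deterministic_word("卧虎藏龙", links)
```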
In an embodiment of the present invention, the method for constructing the word segmentation recognition lexicon may further include, for the at least one third link relation corresponding to a deterministic non-word sequence: deleting that third link relation. This reduces unnecessary link relations and the resource overhead of word segmentation. For example, deleting the link relations corresponding to non-word sequences such as "your", "my" or "his" greatly reduces the non-word sequences in the word segmentation recognition word bank, and removing them from the neural network makes the network's output, and hence the word segmentation, more accurate.
Fig. 8 shows a method for performing chinese word segmentation based on the word segmentation recognition lexicon provided by the above embodiment. As shown in fig. 8, the method for performing chinese segmentation based on the above-mentioned segmentation recognition lexicon may include: aiming at each short sentence to be participled in the text to be participled, the following steps are executed:
s801: converting each word to be segmented in the short sentence to be segmented into a corresponding neuron to be segmented;
s802: searching a matching neuron matched with each neuron to be segmented in a segmentation recognition word bank, wherein the segmentation recognition word bank comprises a plurality of neurons and link relations among the neurons;
s803: and determining the word segmentation position of the short sentence to be segmented according to the position sequence of each word to be segmented in the short sentence to be segmented and the found link relation of every two matched neurons.
For example, for a short sentence to be segmented ABCDE, the A, B, C, D and E neurons are searched in the word segmentation recognition word bank, together with the link relations between every two of them: A-B, A-C, A-D, A-E, B-C, B-D, B-E, C-D, C-E and D-E. The segmentation position is then determined from the found link relations: a segmentation position can be identified where the output signal calculated from the link relations is not greater than a preset segmentation threshold. For example, if the output signal of AB, calculated through the A-B link relation, is greater than the segmentation threshold, while the output signal of ABC, calculated through the A-B, B-C and A-C link relations, is less than the threshold, the segmentation position is between B and C. Alternatively, if the output signal intensity of AB calculated with the link coefficients is greater than that of ABC, the segmentation position is between B and C.
In an embodiment of the present invention, searching for the matching neuron of each neuron to be segmented may be implemented as follows: matching neurons are searched in the position order of the characters in the short sentence to be segmented, and each matching neuron found is switched from the suppressed state to the activated state. For example, for a short sentence ABCDE, the matching neuron for A is searched first, then the one for B, and so on. The matching may first search among neurons already in the activated state and, only if nothing is found there, search the whole word segmentation recognition word bank. For example, when searching for the matching neuron of B, the search starts from the activated neuron matching A: if B can be found by penetrating A, AB may be a word; if not, AB is not a word and the segmentation position is between A and B. Searching the activated neurons first thus both preliminarily determines whether a link relation exists between the searched neuron and the activated ones (i.e. whether they may belong to the same word) and greatly narrows the search range, saving resources and search time and improving search efficiency.
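The activated-first search described above can be sketched as below. The function name and dict representation are assumptions; a found neuron is moved from the suppressed to the activated state, so later characters of the same word are found in the fast path.

```python
def find_matching_neuron(char, activated, lexicon):
    # Hypothetical sketch: look among already-activated neurons first,
    # then fall back to the whole word segmentation recognition word bank.
    if char in activated:
        return activated[char]
    neuron = lexicon.get(char)
    if neuron is not None:
        activated[char] = neuron  # suppressed -> activated
    return neuron

lexicon = {c: {"char": c} for c in "ABCDE"}
activated = {}
```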
Based on the neural network in the word segmentation recognition word stock, two ways of determining the word segmentation position can be provided.
A first way to determine the word segmentation position is shown in fig. 9, and the following steps may be performed for each neuron to be segmented abstracted for each word in the short sentence to be segmented:
s901: determining a first output signal strength of a neuron to be divided; when the neuron in the activated state exists in the word segmentation recognition word stock, executing step S902; when the neuron in the activated state does not exist in the word segmentation recognition word stock, executing the step S903;
s902: penetrating the neurons in the activated state by using the neurons to be classified, and executing the step S906;
s903: searching the neurons matched with the neurons to be classified in the word segmentation recognition word bank, and if the neurons are searched, executing S904; if not, executing S905;
s904: converting the matched neuron from an inhibition state to an activation state, and outputting a signal with first output signal intensity; and ending the current flow;
s905: outputting an abnormal signal and ending the current process;
s906: calculating the first penetration signal intensity generated by the neuron to be divided penetrating the neuron in the activated state according to the relative positions of the neuron in the activated state and the neuron to be divided in the word segmentation short sentence;
s907: calculating the first segmentation signal intensity corresponding to the neuron to be segmented according to the first output signal intensity and the first penetration signal intensity;
s908: and performing word segmentation according to the first word segmentation signal strength.
A specific embodiment of determining the first output signal intensity of the neuron to be divided in step S901 is to calculate the first output signal intensity of the neuron to be divided by using the following calculation formula (5):
calculating formula (5):
OutS(X)=InS(X)×InW(X)×InW(NL0)×OutW(NL0)×OutW(X)
wherein OutS(X) is the first output signal intensity of the neuron X to be segmented; InS(X) is the input neural signal intensity of X: 1 if a neuron matching X is found in the word segmentation recognition lexicon, otherwise a preset abnormal intensity parameter such as 0; InW(X) is the link coefficient of the input of X (the link coefficient between the entrance linker and X): 1 if a match is found, otherwise a preset abnormal input link parameter such as 0; OutW(X) is the link coefficient of the output of X (the link coefficient between the exit linker and X): 1 if a match is found, otherwise a preset abnormal output link parameter such as 0; InW(NL0) is the input link coefficient from X to the matched neuron: 1 if a match is found, otherwise a preset abnormal output link parameter such as 0; OutW(NL0) is the output link coefficient from X to the matched neuron, OutW(NL0) = 10 × pow(0.1, 0) = 10.
That is, through step S901 it can be determined that, if the neuron X to be segmented is found in the word segmentation recognition lexicon, its first output signal intensity is 10; if it is not found, X is an abnormal signal, i.e. the character corresponding to X is an abnormal character, and an abnormal signal of 0 is output. After the abnormal signal is output, the segmentation position is at the position of X; the word segmentation recognition word bank can then be reset, and a new segmentation starts from the position after X. Resetting the word segmentation recognition word bank means inputting a recovery signal into it so that all activated neurons return to the suppressed state.
The specific implementation of step S906 is as follows: determine the character spacing between the activated neuron and the neuron to be segmented from their relative positions in the short sentence; the number of digits taken after the decimal point of the link coefficient in the link relation between the activated neuron and the neuron to be segmented equals this character spacing; if no link relation exists between the activated neuron and the neuron to be segmented, their link coefficient is 0. The character spacing can be calculated by the following calculation formula (6):
calculating formula (6):
K(X->Y) = K_X - K_Y

wherein K(X->Y) is the character spacing between the neuron X to be segmented and the activated neuron Y that X penetrates; K_X means that X corresponds to the K_X-th character of the segmentation short sentence; K_Y means that Y corresponds to the K_Y-th character, with K_Y ≤ K_X. For example, for a segmentation short sentence CDEF, the activated neurons Y are the neurons matching C, D and E, where C is the 1st character, D the 2nd and E the 3rd; F is the neuron X to be segmented and the 4th character, so K(F->C) = 3, K(F->D) = 2 and K(F->E) = 1. Correspondingly, the link coefficient from the C-matched neuron to the F-matched neuron takes 3 digits after the decimal point, which is the first penetration signal intensity generated when the F-matched neuron penetrates the C-matched neuron; the link coefficient from the D-matched neuron to the F-matched neuron takes 2 digits after the decimal point, which is the first penetration signal intensity generated when the F-matched neuron penetrates the D-matched neuron; and the link coefficient from the E-matched neuron to the F-matched neuron takes 1 digit after the decimal point, which is the first penetration signal intensity generated when the F-matched neuron penetrates the E-matched neuron.
It is worth noting that the value taken can be either only the spacing-th digit after the decimal point, with the other digits zeroed, or the first spacing digits after the decimal point. For example, in the short sentence ABCDE the character spacing between A and E is 4; if the link coefficient between A and E in the neural network is 0.1223, the value may be 0.0003 (only the 4th digit after the decimal point) or 0.1223 (the first 4 digits after the decimal point). Which convention is used can be fixed by setting or selection.
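The two digit-taking conventions can be sketched in one helper. The function name and signature are assumptions; the small epsilon guards against floating-point truncation error when scaling the coefficient.

```python
def penetration_strength(coeff, spacing, single_digit=False):
    # Take the first `spacing` digits after the decimal point of a link
    # coefficient, or only the `spacing`-th digit (others zeroed) when
    # single_digit is True; 1e-9 guards against float truncation error.
    scaled = int(coeff * 10 ** spacing + 1e-9)
    if single_digit:
        scaled %= 10
    return scaled / 10 ** spacing
```

With coefficient 0.1223 and spacing 4, this reproduces the document's two alternatives, 0.1223 and 0.0003.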
In the step S907, the first segmentation signal strength corresponding to the neuron to be segmented is calculated as: calculated according to the following calculation formula (7):
calculating formula (7):
OutS(1->X) = OutS(X) + Σ(i=1..X-1) InSi × OutW(NLi)

wherein OutS(1->X) is the first segmentation signal strength from the 1st character to the X-th character of the short sentence to be segmented; OutS(X) is the first output signal intensity of the neuron to be segmented corresponding to character X, X being other than the 1st character; InSi is the input signal strength of the i-th character, InSi = 1; OutW(NLi) is the first penetration signal intensity generated when the neuron corresponding to character X penetrates the neuron corresponding to the i-th character before X.
Word segmentation is performed when the first segmentation signal strength meets the segmentation condition. The condition may be that OutS(1->X) is not less than a preset segmentation threshold while OutS(1->X+1) is less than that threshold, in which case the segmentation position lies between the X-th and (X+1)-th characters of the short sentence to be segmented. Alternatively, the condition may be OutS(1->X) > OutS(1->X+1), again placing the segmentation position between the X-th and (X+1)-th characters.
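Calculation formula (7) together with the digit-taking rule can be sketched as follows. This is an illustrative sketch: the dict of link coefficients and the base output intensity of 10 follow the document's worked examples, and the A-to-E coefficient 0.2 is the value used in the AE example below.

```python
def segmentation_signal(chars, links, out_s=10.0):
    # Sketch of calculation formula (7):
    #   OutS(1->X) = OutS(X) + sum_{i=1}^{X-1} InS_i * OutW(NL_i)
    # The penetration strength from the i-th character to the last one
    # takes (X - i) digits after the decimal point of their link
    # coefficient, and InS_i = 1 throughout.
    x = len(chars)
    total = out_s
    for i in range(x - 1):
        coeff = links.get((chars[i], chars[-1]), 0.0)
        spacing = x - 1 - i
        total += int(coeff * 10 ** spacing + 1e-9) / 10 ** spacing
    return total

out_ae = segmentation_signal("AE", {("A", "E"): 0.2})  # 10 + 1 x 0.2
```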
Taking the partial main neural network of the word segmentation recognition lexicon given in fig. 10 as an example, and setting the threshold to 10.2, the method for determining the segmentation position is illustrated below by segmenting the short sentences AE, BC, EBC, BCE and CDE.
For the phrase AE, the neural signal for A and the neural signal for E are input to the linker in turn, giving the output signals: OutS(A) = 10, and OutS(A->E) = OutS(E) + InS1 × OutW(NL1) = 10 + 1 × 0.2 = 10.2. Because A and E are 1 character apart in the phrase AE, OutW(NL1) takes one digit after the decimal point of the A-to-E link coefficient. Since 10.2 ≥ threshold (10.2), AE is a word.
For the phrase BC, the neural signal for B and the neural signal for C are input to the linker in turn, giving the output signals: OutS(B) = 10, and OutS(B->C) = OutS(C) + InS1 × OutW(NL1) = 10 + 1 × 0.2 = 10.2. Because B and C are 1 character apart in the phrase BC, OutW(NL1) takes one digit after the decimal point of the B-to-C link coefficient. Since 10.2 ≥ threshold (10.2), BC is a word.
For the phrase EBC, the neural signals for E, B and C are input to the linker in turn, giving the output signals: OutS(E) = 10, OutS(E->B) = OutS(B) + InS1 × OutW(NL1) = 10 + 1 × 0.0 = 10.0, and OutS(B->C) = 10.2. Because E and B are 1 character apart in the phrase EBC, OutW(NL1) takes one digit after the decimal point of the E-to-B link coefficient. Since 10.0 < threshold (10.2), EB is not a word; BC was shown above to be a word, so the segmentation position is between E and B.
For the phrase BCE, the neural signals for B, C and E are input to the linker in turn, giving the output signals: OutS(B) = 10, OutS(B->C) = OutS(C) + InS1 × OutW(NL1) = 10 + 1 × 0.2 = 10.2, and OutS(B->C->E) = OutS(E) + InS1 × OutW(NL2) + InS2 × OutW(NL1) = 10.14. Because B and E are 2 characters apart in the phrase BCE, OutW(NL2) takes two digits after the decimal point of the B-to-E link coefficient; C and E are 1 character apart, so OutW(NL1) takes one digit after the decimal point of the C-to-E link coefficient. Since 10.14 < threshold (10.2), BCE is not a word; BC is a word, so the segmentation position is between C and E.
For the phrase CDE, the neural signals for C, D and E are input to the linker in turn, giving the output signals: OutS(C) = 10, OutS(C->D) = OutS(D) + InS1 × OutW(NL1) = 10 + 1 × 0.2 = 10.2, and OutS(C->D->E) = OutS(E) + InS1 × OutW(NL2) + InS2 × OutW(NL1) = 10 + 1 × 0.13 + 1 × 0.3 = 10.43, or alternatively 10 + 1 × 0.03 + 1 × 0.3 = 10.33. Because C and E are 2 characters apart in the phrase CDE, OutW(NL2) takes either the first two digits after the decimal point of the C-to-E link coefficient or only its 2nd digit after the decimal point; D and E are 1 character apart, so OutW(NL1) takes one digit after the decimal point of the D-to-E link coefficient. Since 10.43 ≥ threshold (10.2), or 10.33 ≥ threshold (10.2), CDE is a word.
It is to be understood that the link coefficients given in fig. 2, 5, 6, and 10 are only given by way of example, and do not constitute a limitation on the link coefficients.
A second way of determining the position of the word segmentation is shown in fig. 11, which may include the following steps:
s1101: locating the current position to be segmented and the previous word segmentation position corresponding to the current position to be segmented;
s1102: determining the second output signal strength of the neuron to be segmented corresponding to the first word to be segmented between the current position to be segmented and the previous word segmentation position;
For every two target words to be segmented between the current position to be segmented and the previous word segmentation position, steps S1103 to S1107 are executed:
s1103: determining the second penetration signal strength corresponding to the two target words to be segmented according to the relative positions of the two target words in the short sentence to be segmented and the link coefficients corresponding to the two target words;
s1104: calculating the second word segmentation signal strength corresponding to the neuron to be segmented according to the second output signal strength and the second penetration signal strength;
s1105: judging whether the second word segmentation signal strength meets a preset word segmentation condition; if so, executing step S1106; otherwise, executing step S1107;
s1106: determining that the current position to be segmented is a word segmentation position, and ending the current flow;
s1107: determining that the current position to be segmented is not a word segmentation position, taking the next position to be segmented as the current position to be segmented, and returning to step S1103.
The following explanation takes ABC, located between the current position to be segmented and the previous word segmentation position, as an example. The specific steps are as follows:
Initial state reset: a RESET signal (restore signal) is input at the main entrance and sent through the linker to the activated neurons, resetting them to the suppressed state.
Inputting the first word: the neural signal corresponding to word A is input from the entrance linker, with signal strength InS(A) = 1 (input signals are assigned a strength of 1 by default), and the neuron A corresponding to word A is activated through the linker. If the linker is not linked to word A, an ERROR signal is output to the exit linker; if word A is linked, a signal A of strength 1 is sent to neuron A, which is activated and outputs an A signal to the exit linker, with signal strength calculated by formula (5) above:
OutS(A) = InS(A) × InW(A) × InW(NL0) × OutW(NL0) × OutW(A) = 1 × 1 × 1 × 10 × 1 = 10
Inputting the second word: n = 1. Word B is input from the entrance linker with signal strength InS(B) = 1 (input signals are assigned a strength of 1 by default). The B signal activates neuron B and penetrates the activated neuron A, so there are two corresponding output signal strengths. One is the B signal activating neuron B, with output strength OutS(B0) = InS(B) × InW(B) × InW(NL0) × OutW(NL0) × OutW(B) = 1 × 1 × 1 × 10 × 1 = 10. The other is the B signal penetrating the activated neuron A: if the link coefficient InW(NL1(A->B)) from neuron A to B is smaller than a preset threshold, the output signal is OutS(A->B) = 0; if InW(NL1(A->B)) is larger than the preset threshold, the link NL1 from A to B outputs a B signal of strength OutS(A->B) = InS(B) × InW(A->B) × OutW(NL1(A->B)). Assume the link coefficient from neuron A to neuron B is 0.88; to prevent mutual interference between the activation values of words at different interval positions, OutS(A->B) takes only the 1st digit after the decimal point, i.e., 0 ≤ OutS(A->B) ≤ 0.9. The signal strength of the B signal output to the exit linker is OutS(B) = OutS(B0) + OutS(A->B) = 10 + 0.8 = 10.8.
Inputting the third word C: n = 2. Word C is input from the entrance linker with signal strength InS(C) = 1 (input signals are assigned a strength of 1 by default). The C signal activates neuron C and penetrates the activated neurons A and B; neuron C outputs a signal of strength OutS(C0) = InS(C) × InW(C) × InW(NL0) × OutW(NL0) × OutW(C) = 10. Since neurons A and B have been activated, neuron A receives the C signal, with OutS(A->C) = InS(C) × InW(A->C) × OutW(NL2(A->C)), 0 ≤ OutS(A->C) ≤ 0.09; neuron B receives the C signal, with OutS(B->C) = InS(C) × InW(B->C) × OutW(NL1(B->C)), 0 ≤ OutS(B->C) ≤ 0.9. The exit linker receives the output signal strengths OutS(C0), OutS(A->C), and OutS(B->C), and outputs OutS(C) = OutS(C0) + OutS(A->C) + OutS(B->C), where 0 ≤ OutS(C) ≤ 10.99.
It is worth noting that OutW(NLm(Q->Z)) above means that when characters Z and Q are m characters apart, the link coefficient from the neuron corresponding to character Q to the neuron corresponding to character Z keeps only the m-th digit after the decimal point (the remaining digits are filled with 0), where m = K_Z − K_Q, K_Z denoting that character Z is the K_Z-th character in the short sentence to be segmented and K_Q denoting that character Q is the K_Q-th character. For example, for the phrase ABC, assuming the link coefficient from neuron A to neuron B is 0.88 and the link coefficient from neuron A to neuron C is 0.088, then OutW(NL1(A->B)) = 0.8 and OutW(NL2(A->C)) = 0.08.
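Steps S1101 to S1107 and the ABC walkthrough above can be sketched as follows. This is a minimal illustration under assumptions of my own (a dictionary of pairwise link coefficients, a base output strength of 10, and a threshold of 10.2), not the patented implementation:

```python
def truncated(coeff, m):
    # OutW(NLm): keep only the m-th digit after the decimal point.
    return (int(coeff * 10 ** m) % 10) / 10 ** m

def next_segmentation_position(chars, start, links, threshold=10.2):
    """Scan candidate end positions after the previous word segmentation
    position `start` (S1101). For each candidate span, the second word
    segmentation signal strength is the base output strength of the first
    character's neuron (S1102, assumed 10) plus the penetration strength
    of every pair of target characters in the span (S1103/S1104); the
    first span meeting the threshold gives the segmentation position
    (S1105/S1106), otherwise the scan advances (S1107)."""
    for end in range(start + 2, len(chars) + 1):
        signal = 10.0
        span = chars[start:end]
        for i in range(len(span)):
            for j in range(i + 1, len(span)):
                coeff = links.get((span[i], span[j]))
                if coeff is not None:
                    signal += truncated(coeff, j - i)
        if signal >= threshold:
            return end, signal
    return None

# With the assumed A->B coefficient 0.88, span "AB" reaches 10 + 0.8,
# which already meets the threshold.
links = {("A", "B"): 0.88, ("A", "C"): 0.088}
result = next_segmentation_position("ABC", 0, links)
```

The digit truncation in `truncated` is what keeps penetration signals from characters at different distances in separate decimal digits, as described above.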
In conclusion, the text content is first separated into short sentences according to punctuation marks, spaces, and other symbols, and each short sentence is then input from left to right into the neural network of the word segmentation recognition word stock. On a neural network that has been RESET, when the first character is input, the character count n is set to 0 and the neural signal corresponding to the first character T is input. If the exit linker outputs an ERROR signal, the character T is an abnormal character, and the neural network is RESET with a restore signal; if the exit linker outputs a signal strength corresponding to T, the character T is a recognizable character. For the nth character F in the short sentence (n being a positive integer larger than 1), the output signal strength is obtained after the neural signal corresponding to character F penetrates the neurons corresponding to the first n−1 characters. If this output signal strength is smaller than a preset threshold while the output signal strength obtained for the character preceding F is larger than the preset threshold, the word segmentation position is between character F and its preceding character; likewise, if the output signal strength obtained for the preceding character is larger than that obtained for character F, the word segmentation position is between character F and its preceding character. After segmentation is finished, a restore signal is input to reset the neural network, and segmentation of a new word begins.
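The left-to-right procedure summarized above can likewise be sketched in Python. The link table, the base activation of 10, and the threshold of 10.2 are assumptions for illustration only:

```python
def segment_phrase(chars, links, threshold=10.2):
    """Feed characters left to right; cut a word when the output strength
    for the current character falls below the threshold while the
    previous character's strength exceeded it, or when the strength
    simply drops, then restart (the 'reset') from the cut position."""
    words = []
    start = 0
    k = 0
    prev_strength = None
    while k < len(chars):
        # Output strength of chars[k] after penetrating chars[start:k]:
        # base activation 10 plus digit-truncated penetration signals.
        strength = 10.0
        for i in range(start, k):
            coeff = links.get((chars[i], chars[k]))
            if coeff is not None:
                m = k - i
                strength += (int(coeff * 10 ** m) % 10) / 10 ** m
        if k > start and ((strength < threshold and prev_strength > threshold)
                          or prev_strength > strength):
            words.append(chars[start:k])  # segmentation position before chars[k]
            start = k                     # reset and start a new word
            prev_strength = None
            continue
        prev_strength = strength
        k += 1
    words.append(chars[start:])
    return words

# With only an A->B link, "AB" holds together and a cut falls before "C".
links = {("A", "B"): 0.88}
```

The two cut conditions in the `if` mirror the two conditions stated in the summary above; whether they fire depends entirely on the assumed link table.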
It can be understood that the preset thresholds can be set by the user according to the actual situation of the word segmentation recognition lexicon.
As shown in fig. 12, an embodiment of the present invention provides a device 1200 for constructing a segmented word recognition lexicon, where the device 1200 for constructing a segmented word recognition lexicon includes: a building unit 1201 and a first linker 1202, wherein,
a constructing unit 1201, configured to execute, for a short sentence in the training text set: removing the duplicates of the short sentences, and constructing corresponding neurons for each word in the short sentences after the duplicates are removed, wherein the signal types indicated by the neurons are matched with the words corresponding to the neurons; constructing a link relation between two neurons corresponding to every two words according to the relative position and the occurrence frequency of every two words in the short sentence so as to form a short sentence neural network corresponding to the short sentence, wherein the link relation indicates a link coefficient and a signal transmission direction;
the first linker 1202 is configured to connect the neurons constructed by the construction unit 1201, and fuse each short sentence neural network constructed by the construction unit 1201 to the word segmentation recognition lexicon.
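As a hedged illustration of what the construction unit 1201 does, a phrase network can be represented as a set of neurons plus directed link relations keyed by character pair and distance. The 0.1 increment per co-occurrence is an assumption of mine; the text only says the link coefficient reflects relative position and occurrence frequency:

```python
def build_phrase_network(phrase, network=None):
    """Construction-unit sketch: one neuron per distinct character
    (de-duplication via the set), and one directed link relation per
    ordered character pair, keyed by (from, to, distance)."""
    if network is None:
        network = {"neurons": set(), "links": {}}
    for ch in phrase:
        network["neurons"].add(ch)
    for i, q in enumerate(phrase):
        for j in range(i + 1, len(phrase)):
            key = (q, phrase[j], j - i)
            # Each co-occurrence strengthens the link; the 0.1 step is an
            # illustrative assumption, not a value given in the text.
            network["links"][key] = network["links"].get(key, 0.0) + 0.1
    return network

net = build_phrase_network("ABC")
net = build_phrase_network("ABD", net)   # A->B (distance 1) seen twice
```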
In an embodiment of the present invention, the word segmentation recognition lexicon comprises a main neural network, and the first linker 1202 is further configured to perform, for each short sentence neural network: linking each neuron in the phrase neural network; traversing each neuron in the short sentence neural network; and when the traversal result shows that the neurons with the same signal types exist between the main neural network and the phrase neural network, deleting the neurons with the same signal types in the phrase neural network, and connecting the related linkage relations of the neurons with the same signal types to the main neural network.
In an embodiment of the present invention, the first linker 1202 is further configured to, when the traversal result indicates that there is a link relationship with the same signal transmission direction between the master neural network and the phrasal neural network, update the link coefficient indicated by the link relationship with the same signal transmission direction on the master neural network according to the link coefficient indicated by the link relationship with the same signal transmission direction.
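The fusing performed by the first linker can be pictured as a small graph merge. The dictionary representation and the use of `max` to "update" coefficients of same-direction links are illustrative assumptions; the text says only that same-type neurons collapse and same-direction link coefficients are updated:

```python
def fuse(main, phrase_net):
    """First-linker sketch: merge a phrase network into the main network.
    Neurons with the same signal type collapse into one, and link
    relations with the same signal transmission direction update the
    main network's coefficient (taking the max is an assumption)."""
    main["neurons"] |= phrase_net["neurons"]
    for key, coeff in phrase_net["links"].items():
        if key in main["links"]:
            main["links"][key] = max(main["links"][key], coeff)
        else:
            main["links"][key] = coeff
    return main

main_net = {"neurons": {"A", "B"}, "links": {("A", "B", 1): 0.5}}
phrase = {"neurons": {"B", "C"}, "links": {("A", "B", 1): 0.7, ("B", "C", 1): 0.3}}
main_net = fuse(main_net, phrase)   # B collapses; A->B coefficient updated
```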
In an embodiment of the present invention, the first linker 1202 is further configured to acquire a newly added short sentence, and for each added word in the newly added short sentence, perform: converting the added word into a corresponding neuron; searching the main neural network for a first neuron matching the neuron corresponding to the added word, and activating the first neuron; when a first link relation exists between two first neurons, calculating a first link coefficient corresponding to the first link relation by using a preset activation function; and updating the link coefficient indicated by the first link relation with the calculated first link coefficient.
In one embodiment of the present invention, the first linker 1202 is further configured to set an active state and a suppressed state for the neuron, wherein the active state indicates that the neuron is used and the suppressed state indicates that the neuron is not used; when a recovery signal is acquired, the activation state of the neuron is converted to the inhibition state.
In one embodiment of the invention, the neuron further indicates signal strength; a first linker 1202 further for finding a first neuron matching the signal type indicated by the neuron corresponding to the add word; and when the signal intensity indicated by the neuron corresponding to the increased word is not less than a preset threshold value, activating the first neuron.
In an embodiment of the present invention, the word segmentation recognition lexicon constructing device further includes: a computing unit (not shown in the figure) for performing, for each short sentence in the training text set: calculating the md5 code corresponding to the short sentence; and judging whether the md5 code has already been recorded; if so, ignoring the short sentence; otherwise, recording the md5 code and outputting the short sentence to the construction unit.
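The computing unit's de-duplication can be sketched with Python's standard `hashlib`; the generator shape below is illustrative, not the patented unit:

```python
import hashlib

def dedupe_phrases(phrases):
    """Computing-unit sketch: yield each short sentence only the first
    time its md5 code appears; repeats are ignored."""
    seen = set()
    for phrase in phrases:
        code = hashlib.md5(phrase.encode("utf-8")).hexdigest()
        if code in seen:
            continue            # md5 code already recorded: ignore
        seen.add(code)          # record the md5 code
        yield phrase            # pass the phrase to the construction unit

print(list(dedupe_phrases(["ABC", "ABD", "ABC"])))  # ['ABC', 'ABD']
```

Hashing rather than storing the sentences themselves keeps the seen-set small regardless of sentence length.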
In an embodiment of the present invention, the word segmentation recognition lexicon constructing device further includes: an updating unit (not shown in the figure) for setting corresponding decay periods for the link relation and the neuron, respectively; when the word segmentation recognition word stock is used and the duration of the link relation in the inhibition state reaches the attenuation period corresponding to the link relation, deleting the link relation; and when the duration of the neuron in the inhibition state reaches the attenuation period corresponding to the neuron, deleting the neuron and the link relation linked with the neuron.
In an embodiment of the present invention, the word segmentation recognition lexicon constructing device further includes: an updating unit (not shown in the figure), further configured to calculate a second link coefficient corresponding to the current time according to the current time, the last activated time corresponding to the current time, and a preset attenuation function; and updating the link coefficient indicated by the link relation by using the calculated second link coefficient.
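The text leaves the attenuation function unspecified; one plausible shape for the second link coefficient is exponential decay with idle time, sketched here under a purely assumed half-life form:

```python
import math

def decayed_coefficient(coeff, now, last_activated, half_life=30.0):
    """Second link coefficient from the current time, the last activated
    time, and an assumed exponential attenuation function (the text does
    not fix the function; the half-life form and units are illustrative)."""
    idle = now - last_activated
    return coeff * math.exp(-math.log(2.0) * idle / half_life)

# After one assumed half-life of inactivity, the coefficient halves.
print(round(decayed_coefficient(0.8, now=60.0, last_activated=30.0), 3))  # 0.4
```

Under such a function, long-idle links weaken smoothly toward 0 before the decay period of the preceding embodiment deletes them outright.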
In an embodiment of the present invention, the word segmentation recognition lexicon constructing device further includes: and an updating unit (not shown in the figure) further configured to update the link coefficient indicated by each second link relation by using a preset link constant.
In an embodiment of the present invention, the word segmentation recognition lexicon constructing device further includes: an updating unit (not shown in the figure), further configured to, for at least one third link relation corresponding to a deterministic non-word sequence, perform: deleting the at least one third link relation.
The first linker is divided into an inlet first linker and an outlet first linker, wherein the inlet first linker is arranged at the neural signal inlet, and the outlet first linker is arranged at the neural signal outlet and is linked with neurons in a neural network in a word segmentation recognition word bank.
As shown in fig. 13, an embodiment of the present invention provides a chinese word segmentation apparatus 1300, where the chinese word segmentation apparatus 1300 includes: a conversion unit 1301 and a second linker 1302, wherein,
a converting unit 1301, configured to, for each short sentence to be participated in the text to be participated, perform: converting each word to be segmented in the short sentence to be segmented into a corresponding neuron to be segmented;
a second linker 1302, configured to search a matching neuron matched with each neuron to be segmented in a word segmentation recognition word bank, where the word segmentation recognition word bank includes a plurality of neurons and link relationships between the neurons;
the second linker 1302 is further configured to determine the word segmentation position of the short sentence to be segmented according to the position sequence of each word to be segmented in the short sentence to be segmented and the found link relationship between every two matching neurons.
In an embodiment of the present invention, the second linker 1302 is further configured to sequentially search matching neurons matching each word to be segmented according to a position sequence of each word to be segmented in the short sentence to be segmented, and convert the matching neurons from a suppressed state to an activated state.
In an embodiment of the present invention, the second linker 1302 is further configured to, for each neuron to be divided, perform: determining a first output signal strength of a neuron to be divided; when the neurons in the activated state exist in the word segmentation recognition word stock, penetrating the neurons in the activated state by using the neurons to be segmented; calculating the strength of a first penetration signal generated by the neuron to be divided penetrating through the neuron in the activated state according to the relative positions of the neuron in the activated state and the neuron to be divided in the word segmentation short sentence; calculating the first segmentation signal intensity corresponding to the neuron to be segmented according to the first output signal intensity and the first penetration signal intensity; and performing word segmentation according to the first word segmentation signal strength.
In an embodiment of the present invention, the second linker 1302 is further configured to locate a current position to be segmented and a previous segmentation position corresponding to the current position to be segmented; determining second output signal intensity of a neuron to be divided corresponding to a first word to be divided between the current word position to be divided and the last word position to be divided; aiming at every two target words to be segmented between the current word to be segmented position and the last word to be segmented position, executing: determining second penetration signal intensity corresponding to the two target characters to be segmented according to the relative positions of the two target characters to be segmented in the short sentence to be segmented and the link coefficients corresponding to the two target characters to be segmented; calculating second word segmentation signal intensity corresponding to the neuron to be segmented according to the second output signal intensity and the second penetration signal intensity; and judging whether the second word segmentation signal strength meets a preset word segmentation condition, and if so, determining the current word segmentation position to be the word segmentation position.
The second linker is divided into an inlet second linker and an outlet second linker, wherein the inlet second linker is arranged at the neural signal inlet, and the outlet second linker is arranged at the neural signal outlet and is linked with neurons in a neural network in the word segmentation recognition word bank.
Fig. 14 shows an exemplary system architecture 1400 to which the word segmentation recognition word stock construction method or the word segmentation recognition word stock construction apparatus or the chinese word segmentation method or the chinese word segmentation apparatus according to the embodiment of the present invention can be applied.
As shown in fig. 14, the system architecture 1400 may include terminal devices 1401, 1402, 1403, a network 1404, and a server 1405. The network 1404 serves to provide a medium for communication links between the terminal devices 1401, 1402, 1403 and the server 1405. The network 1404 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may use terminal devices 1401, 1402, 1403 to interact with a server 1405 via a network 1404, to receive or send messages or the like. The terminal devices 1401, 1402, 1403 may have installed thereon various communication client applications, such as a word segmentation class application, a web browser application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc. (by way of example only).
The terminal devices 1401, 1402, 1403 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 1405 may be a server that provides various services, such as a background management server (for example only) that provides word segmentation recognition word stock construction or word segmentation support for information-based websites browsed by users using the terminal devices 1401, 1402, and 1403. The background management server may analyze and perform other processing on the received text content and the like to construct a word segmentation recognition word bank or perform word segmentation, and feed back a processing result (for example, a word segmentation result — only an example) to the terminal device.
It should be noted that the word segmentation recognition word stock construction method or the chinese word segmentation method provided in the embodiment of the present invention is generally executed by the server 1405, and accordingly, the word segmentation recognition word stock construction device or the chinese word segmentation device is generally disposed in the server 1405.
It should be understood that the number of terminal devices, networks, and servers in fig. 14 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to fig. 15, a block diagram of a computer system 1500 suitable for implementing a server according to embodiments of the present invention is shown. The server shown in fig. 15 is only an example, and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
As shown in fig. 15, the computer system 1500 includes a Central Processing Unit (CPU) 1501, which can perform various appropriate actions and processes in accordance with a program stored in a Read-Only Memory (ROM) 1502 or a program loaded from a storage section 1508 into a Random Access Memory (RAM) 1503. In the RAM 1503, various programs and data necessary for the operation of the system 1500 are also stored. The CPU 1501, the ROM 1502, and the RAM 1503 are connected to each other by a bus 1504. An input/output (I/O) interface 1505 is also connected to the bus 1504.
The following components are connected to the I/O interface 1505: an input portion 1506 including a keyboard, a mouse, and the like; an output portion 1507 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 1508 including a hard disk and the like; and a communication section 1509 including a network interface card such as a LAN card, a modem, or the like. The communication section 1509 performs communication processing via a network such as the internet. A drive 1510 is also connected to the I/O interface 1505 as needed. A removable medium 1511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1510 as necessary, so that a computer program read out therefrom is mounted into the storage section 1508 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1509, and/or installed from the removable medium 1511. The computer program executes the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 1501.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present invention may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes a building unit and a first linker. Where the names of these units do not in some cases constitute a limitation on the units themselves, for example, a building unit may also be described as a "unit that trains a neural network using training samples".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to comprise: for short sentences in the training text set, executing: removing the duplicates of the short sentences, and constructing corresponding neurons for each word in the short sentences after the duplicates are removed, wherein the signal types indicated by the neurons are matched with the words corresponding to the neurons; constructing a link relation between two neurons corresponding to every two words according to the relative position and the occurrence frequency of every two words in the short sentence so as to form a short sentence neural network corresponding to the short sentence, wherein the link relation indicates a link coefficient and a signal transmission direction; and fusing each short sentence neural network to a word segmentation recognition word stock.
The computer readable medium carries one or more programs which, when executed by a device, cause the device to comprise: the word segmentation recognition word bank comprises a main neural network and a linker linked by neurons in the main neural network; each short sentence neural network is fused to a word segmentation recognition word stock, and the method comprises the following steps: performing for each phrase neural network: linking each neuron in the phrasal neural network to a linker; traversing each neuron in the short sentence neural network through a linker; and when the traversal result shows that the neurons with the same signal types exist between the main neural network and the phrase neural network, deleting the neurons with the same signal types in the phrase neural network, and connecting the related linkage relations of the neurons with the same signal types to the main neural network.
The computer readable medium carries one or more programs which, when executed by a device, cause the device to comprise: acquiring a newly added short sentence; and aiming at each added word in the newly added short sentence, executing: converting the added words into corresponding neurons; on the master neural network, searching a first neuron matched with a neuron corresponding to the added word through a linker, and activating the first neuron; when a first link relation exists between two first neurons, calculating a first link coefficient corresponding to the first link relation by using a preset activation function; the link coefficient indicated by the first link relation is updated using the calculated first link coefficient.
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to comprise: aiming at each short sentence to be participled in the text to be participled, executing: converting each word to be segmented in the short sentence to be segmented into a corresponding neuron to be segmented; searching a matching neuron matched with each neuron to be segmented in a segmentation recognition word bank, wherein the segmentation recognition word bank comprises a plurality of neurons and link relations among the neurons; and determining the word segmentation position of the short sentence to be segmented according to the position sequence of each word to be segmented in the short sentence to be segmented and the found link relation of every two matched neurons.
The computer readable medium carries one or more programs which, when executed by a device, cause the device to comprise: according to the position sequence of each word to be segmented in the short sentence to be segmented, sequentially searching a matched neuron matched with each neuron to be segmented, and converting the matched neuron from a suppression state to an activation state; for each neuron to be divided, performing: determining a first output signal strength of a neuron to be divided; when the neurons in the activated state exist in the word segmentation recognition word stock, penetrating the neurons in the activated state by using the neurons to be segmented; calculating the first penetration signal intensity generated by the neuron to be divided penetrating the neuron in the activated state according to the relative positions of the neuron in the activated state and the neuron to be divided in the word segmentation short sentence; determining the word segmentation position of the short sentence to be segmented, comprising the following steps: calculating the first segmentation signal intensity corresponding to the neuron to be segmented according to the first output signal intensity and the first penetration signal intensity; and performing word segmentation according to the first word segmentation signal strength.
The computer readable medium carries one or more programs which, when executed by a device, cause the device to comprise: positioning the current word segmentation position and the previous word segmentation position corresponding to the current word segmentation position; determining second output signal intensity of a neuron to be divided corresponding to a first word to be divided between the current word position to be divided and the last word position to be divided; aiming at every two target words to be segmented between the current word to be segmented position and the last word to be segmented position, executing: determining second penetration signal intensity corresponding to the two target characters to be segmented according to the relative positions of the two target characters to be segmented in the short sentence to be segmented and the link coefficients corresponding to the two target characters to be segmented; calculating second word segmentation signal intensity corresponding to the neuron to be segmented according to the second output signal intensity and the second penetration signal intensity; and judging whether the second word segmentation signal strength meets a preset word segmentation condition, and if so, determining the current word segmentation position to be the word segmentation position.
According to the technical scheme of the embodiment of the invention, a neuron is constructed for each word in the short sentences of the training text set, and the link relation between neurons is constructed according to the relative position and occurrence frequency of every two words. The word segmentation recognition lexicon therefore stores the neurons of single words and the link relations between them, rather than storing words or phrases, and word segmentation based on this lexicon is performed according to the link relations between words. Compared with a lexicon that stores words or phrases, the embodiment of the invention builds the lexicon on single words and their link relations; the link relations between single words can represent many more words, and the word segmentation process, being likewise based on single words and link relations, is not constrained by a fixed vocabulary of words or phrases. Therefore, the scheme provided by the embodiment of the invention can effectively increase both the word capacity of the lexicon and the accuracy of word segmentation.
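The core construction step summarized above — one neuron per de-duplicated word, with directed links whose coefficient reflects relative position and occurrence frequency — can be sketched as follows. The 1/distance weighting and the summation over co-occurrences are illustrative assumptions; the patent only requires that the link coefficient depend on both quantities.

```python
from collections import defaultdict

def build_phrase_network(phrase):
    """Build a toy short-sentence neural network: one neuron per distinct
    word (character), plus directed links from earlier to later words.
    Each co-occurrence adds 1/distance to the link coefficient, so the
    coefficient grows with frequency and shrinks with separation."""
    neurons = set(phrase)                  # de-duplicated words
    links = defaultdict(float)             # (source, target) -> coefficient
    for i, a in enumerate(phrase):
        for j in range(i + 1, len(phrase)):
            # signal transmission direction: earlier word -> later word
            links[(a, phrase[j])] += 1.0 / (j - i)
    return neurons, dict(links)

neurons, links = build_phrase_network("abcab")
```

For the sample phrase "abcab", the pair (a, b) occurs adjacently twice and once at distance 4, so its coefficient accumulates to 2.25, while the rarer reverse pair (b, a) gets a smaller value.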
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (19)

1. A method for constructing a word segmentation recognition lexicon, characterized by comprising the following steps:
for short sentences in the training text set, executing:
de-duplicating the short sentence, and constructing a corresponding neuron for each word in the de-duplicated short sentence, wherein the signal type indicated by the neuron is matched with the word corresponding to the neuron;
constructing a link relation between two neurons corresponding to every two words according to the relative position and the occurrence frequency of every two words in the short sentence so as to form a short sentence neural network corresponding to the short sentence, wherein the link relation indicates a link coefficient and a signal transmission direction;
and fusing the short sentence neural networks to form a word segmentation recognition lexicon.
2. The method for constructing a word segmentation recognition lexicon according to claim 1, wherein
the word segmentation recognition lexicon comprises a main neural network and a linker linked to the neurons in the main neural network;
the fusing each short sentence neural network comprises:
performing, for each short sentence neural network:
linking each neuron in the short sentence neural network to the linker;
traversing each neuron in the short sentence neural network through the linker;
and when the traversal result shows that neurons with the same signal type exist in both the main neural network and the short sentence neural network, deleting the neurons with the same signal type from the short sentence neural network, and connecting the link relations of the deleted neurons to the main neural network.
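A minimal sketch of this fusion step, assuming networks are stored as neuron sets plus link dictionaries: neurons whose signal type (word) already exists in the master network collapse automatically, and the phrase network's link relations are re-attached to the master. Summing coefficients on shared links is an assumed update rule.

```python
def merge_into_master(master_neurons, master_links, phrase_neurons, phrase_links):
    """Fuse one short-sentence network into the master network.
    Duplicate neurons (same signal type) collapse because neurons are
    keyed by their word; links carried by the phrase network are
    re-attached to the master, summing coefficients on shared edges
    (an assumed update rule)."""
    master_neurons |= phrase_neurons
    for edge, coeff in phrase_links.items():
        master_links[edge] = master_links.get(edge, 0.0) + coeff
    return master_neurons, master_links

m_neurons, m_links = merge_into_master(
    {"a", "b"}, {("a", "b"): 1.0},
    {"b", "c"}, {("b", "c"): 0.5, ("a", "b"): 0.5},
)
```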
3. The method for constructing a word segmentation recognition lexicon according to claim 2, wherein,
when the traversal result shows that the main neural network and the short sentence neural network have link relations with the same signal transmission direction,
the link coefficient indicated by the link relation with the same signal transmission direction on the main neural network is updated according to the link coefficient indicated by the corresponding link relation on the short sentence neural network.
4. The method for constructing a word segmentation recognition lexicon according to claim 2, further comprising:
acquiring a newly added short sentence;
and executing the following steps aiming at each added word in the added short sentence:
converting the added words into corresponding neurons;
on the master neural network, searching a first neuron matched with the neuron corresponding to the added word through the linker, and activating the first neuron;
when a first link relation exists between the two first neurons, calculating a first link coefficient corresponding to the first link relation by using a preset activation function;
updating the link coefficient indicated by the first link relation with the calculated first link coefficient.
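Claim 4 leaves the "preset activation function" unspecified. One plausible sketch is a saturating update in which every co-activation of two linked neurons moves the link coefficient toward a ceiling of 1, so frequent pairs strengthen without growing unboundedly. Both the functional form and the learning rate below are assumptions.

```python
def activation_update(coeff, learning_rate=0.1):
    """Assumed activation function: each time the two first neurons of a
    first link relation are activated together, the first link
    coefficient is nudged toward 1.0, saturating for frequent pairs."""
    return coeff + learning_rate * (1.0 - coeff)
```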
5. The method for constructing a word segmentation recognition lexicon according to claim 4, further comprising:
setting an activation state and an inhibition state for the neuron, wherein the activation state indicates that the neuron is used and the inhibition state indicates that the neuron is not used;
when a recovery signal is acquired, the activation state of the neuron is converted to a suppression state.
6. The method for constructing a word segmentation recognition lexicon according to claim 5, wherein
the neuron further indicates a signal strength;
the searching for the first neuron matching the neuron corresponding to the added word comprises:
searching a first neuron matched with the signal type indicated by the neuron corresponding to the increased word;
and when the signal intensity indicated by the neuron corresponding to the increased word is not less than a preset threshold value, activating the first neuron.
7. The method for constructing a word segmentation recognition lexicon according to any one of claims 1 to 4, 5 and 6, further comprising:
for each short sentence in the training text set, performing:
calculating md5 codes corresponding to the short sentences;
and judging whether the md5 code has been recorded; if so, ignoring the short sentence; otherwise, recording the md5 code and executing the steps of de-duplicating the short sentence and constructing a corresponding neuron for each word in the de-duplicated short sentence.
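The md5-based de-duplication of claim 7 maps directly onto the standard library: only short sentences whose digest has not yet been recorded proceed to neuron construction.

```python
import hashlib

def filter_new_phrases(phrases):
    """Keep only short sentences whose MD5 digest has not been seen;
    repeated sentences are ignored, as in claim 7."""
    recorded = set()
    unique = []
    for phrase in phrases:
        digest = hashlib.md5(phrase.encode("utf-8")).hexdigest()
        if digest not in recorded:
            recorded.add(digest)
            unique.append(phrase)
    return unique
```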
8. The method for constructing a word segmentation recognition lexicon according to any one of claims 1 to 4, 5 and 6, further comprising:
setting corresponding attenuation periods for the link relation and the neuron respectively;
when the word segmentation recognition lexicon is used,
deleting the link relation when the time length of the link relation in the inhibition state reaches the attenuation period corresponding to the link relation;
and deleting the neuron and the link relations linked to the neuron when the time length of the neuron in the inhibition state reaches the attenuation period corresponding to the neuron.
9. The method for constructing a word segmentation recognition lexicon according to claim 8, further comprising:
calculating a second link coefficient corresponding to the current time according to the current time, the last activated time corresponding to the current time and a preset attenuation function;
and updating the link coefficient indicated by the link relation by using the calculated second link coefficient.
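Claims 8 and 9 describe attenuation of idle links but do not fix a concrete attenuation function. An exponential half-life decay of the link coefficient with idle time is a common choice and is used here purely as an assumed sketch; the half-life value is likewise an assumption.

```python
def decayed_coefficient(coeff, now, last_activated, half_life=86400.0):
    """Assumed attenuation function for claim 9: the second link
    coefficient halves for every `half_life` seconds the link has been
    idle since its last activation."""
    idle = now - last_activated
    return coeff * 0.5 ** (idle / half_life)
```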
10. The method for constructing a word segmentation recognition lexicon according to any one of claims 1 to 4, 5 and 6, further comprising:
for at least one second link relation corresponding to a deterministic word sequence, performing:
updating the link coefficient indicated by each second link relation with a preset link constant.
11. The method for constructing a word segmentation recognition lexicon according to any one of claims 1 to 4, 5 and 6, further comprising:
for at least one third link relation corresponding to a deterministic non-word sequence, performing:
deleting the at least one third linking relationship.
12. A Chinese word segmentation method is characterized by comprising the following steps:
for each short sentence to be segmented in the text to be segmented, executing:
converting each word to be segmented in the short sentence to be segmented into a corresponding neuron to be segmented;
searching a word segmentation recognition lexicon for a matching neuron matched with each neuron to be segmented, wherein the word segmentation recognition lexicon is constructed by the word segmentation recognition lexicon construction method of any one of claims 1 to 11, and the word segmentation recognition lexicon comprises a plurality of neurons and link relations among the neurons;
and determining the word segmentation position of the short sentence to be segmented according to the position sequence of each word to be segmented in the short sentence to be segmented and the found link relation of every two matched neurons.
13. The Chinese word segmentation method according to claim 12, wherein the searching for a matching neuron matched with each neuron to be segmented comprises:
and sequentially searching matched neurons matched with each neuron to be segmented according to the position sequence of each word to be segmented in the short sentence to be segmented, and converting the matched neurons from a suppression state to an activation state.
14. The Chinese word segmentation method according to claim 13, further comprising:
for each neuron to be segmented, performing:
determining a first output signal strength of the neuron to be segmented;
when a neuron in the activated state exists in the word segmentation recognition lexicon, penetrating the neuron in the activated state with the neuron to be segmented;
calculating the first penetration signal strength generated by the neuron to be segmented penetrating the neuron in the activated state, according to the relative positions of the neuron in the activated state and the neuron to be segmented in the short sentence to be segmented;
the determining the word segmentation position of the short sentence to be segmented comprises:
calculating the first word segmentation signal strength corresponding to the neuron to be segmented according to the first output signal strength and the first penetration signal strength;
and performing word segmentation according to the first word segmentation signal strength.
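The scoring in claim 14 can be sketched as follows, assuming a unit output signal strength and a 1/distance penetration attenuation (both assumptions): each earlier, already-activated word "penetrates" the current word through its stored link coefficient, and the sum forms the first word segmentation signal strength. A weak signal suggests a weak tie to the preceding words, i.e. a candidate segmentation boundary.

```python
def segmentation_signal(phrase, pos, links, output_strength=1.0):
    """First word segmentation signal strength for the word at `pos`:
    its own output strength plus penetration from every earlier
    (activated) word, scaled by the stored link coefficient and
    attenuated by the relative distance in the short sentence."""
    target = phrase[pos]
    penetration = 0.0
    for i in range(pos):
        coeff = links.get((phrase[i], target), 0.0)
        penetration += coeff / (pos - i)
    return output_strength + penetration

score = segmentation_signal("abc", 2, {("a", "c"): 0.4, ("b", "c"): 1.2})
```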
15. The Chinese word segmentation method according to claim 12, wherein the determining the word segmentation position of the short sentence to be segmented comprises:
positioning a current word segmentation position and a previous word segmentation position corresponding to the current word segmentation position;
determining a second output signal strength of the neuron to be segmented corresponding to the first word to be segmented between the current word segmentation position and the previous word segmentation position;
for every two target words to be segmented between the current word segmentation position and the previous word segmentation position, performing:
determining a second penetration signal strength corresponding to the two target words to be segmented according to their relative positions in the short sentence to be segmented and the link coefficient corresponding to the two target words to be segmented;
calculating a second word segmentation signal strength corresponding to the neuron to be segmented according to the second output signal strength and the second penetration signal strength;
and judging whether the second word segmentation signal strength meets a preset word segmentation condition, and if so, determining the current word segmentation position to be a word segmentation position.
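The "preset word segmentation condition" of claim 15 is not specified. A simple threshold test is one plausible reading: if the second word segmentation signal strength falls below a preset value, the candidate position is only weakly linked to what precedes it and is taken as a word segmentation position. The threshold value below is an assumption.

```python
def meets_segmentation_condition(second_signal, threshold=1.5):
    """Assumed preset condition: a second word segmentation signal
    strength below the threshold marks the current position as a word
    segmentation position (weakly linked to the preceding words)."""
    return second_signal < threshold
```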
16. A device for constructing a word segmentation recognition lexicon, characterized by comprising: a construction unit and a first linker, wherein,
the construction unit is configured to execute, for short sentences in the training text set: de-duplicating the short sentence, and constructing a corresponding neuron for each word in the de-duplicated short sentence, wherein the signal type indicated by the neuron is matched with the word corresponding to the neuron; and constructing a link relation between the two neurons corresponding to every two words according to the relative position and the occurrence frequency of every two words in the short sentence, so as to form a short sentence neural network corresponding to the short sentence, wherein the link relation indicates a link coefficient and a signal transmission direction;
the first linker is used for linking the neurons constructed by the construction unit and fusing each short sentence neural network constructed by the construction unit into a word segmentation recognition lexicon.
17. A Chinese word segmentation device is characterized by comprising: a conversion unit and a second linker, wherein,
the conversion unit is used for executing, for each short sentence to be segmented in the text to be segmented: converting each word to be segmented in the short sentence to be segmented into a corresponding neuron to be segmented;
the second linker is used for searching a word segmentation recognition lexicon for a matching neuron matched with each neuron to be segmented, wherein the word segmentation recognition lexicon comprises a plurality of neurons and link relations among the neurons;
the second linker is further used for determining the word segmentation position of the short sentence to be segmented according to the position sequence of each word to be segmented in the short sentence to be segmented and the found link relation of every two matched neurons.
18. A word segmentation recognition lexicon construction server, characterized by comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-15.
19. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-15.
CN201911288705.7A 2019-12-12 2019-12-12 Word segmentation recognition word stock construction method, chinese word segmentation method and Chinese word segmentation device Active CN111178065B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911288705.7A CN111178065B (en) 2019-12-12 2019-12-12 Word segmentation recognition word stock construction method, chinese word segmentation method and Chinese word segmentation device

Publications (2)

Publication Number Publication Date
CN111178065A true CN111178065A (en) 2020-05-19
CN111178065B CN111178065B (en) 2023-06-27

Family

ID=70652028

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911288705.7A Active CN111178065B (en) 2019-12-12 2019-12-12 Word segmentation recognition word stock construction method, chinese word segmentation method and Chinese word segmentation device

Country Status (1)

Country Link
CN (1) CN111178065B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779207A (en) * 2020-12-03 2021-12-10 北京沃东天骏信息技术有限公司 Visual angle layering method and device for dialect text

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101458694A (en) * 2008-10-09 2009-06-17 浙江大学 Chinese participle method based on tree thesaurus
US20120054192A1 (en) * 2010-08-30 2012-03-01 Microsoft Corporation Enhancing search-result relevance ranking using uniform resource locators for queries containing non-encoding characters
CN102880703A (en) * 2012-09-25 2013-01-16 广州市动景计算机科技有限公司 Methods and systems for encoding and decoding Chinese webpage data
CN105528420A (en) * 2015-12-07 2016-04-27 北京金山安全软件有限公司 Character encoding and decoding method and device and electronic equipment
CN109992766A (en) * 2017-12-29 2019-07-09 北京京东尚科信息技术有限公司 The method and apparatus for extracting target word

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YAN NIU et al.: "An Improved Chinese Segmentation Algorithm Based on New Dictionary Construction" *
吴建源: "Research on a Chinese Word Segmentation Algorithm Based on BP Neural Networks" *
周程远, 朱敏, 杨云: "Research on Dictionary-Based Chinese Word Segmentation Algorithms" *
王坚, 赵恒永: "Implementation and Research of a Chinese Word Segmentation Algorithm for Specialized Search Engines" *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220930

Address after: 12 / F, 15 / F, 99 Yincheng Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 200120

Applicant after: Jianxin Financial Science and Technology Co.,Ltd.

Address before: 25 Financial Street, Xicheng District, Beijing 100033

Applicant before: CHINA CONSTRUCTION BANK Corp.

Applicant before: Jianxin Financial Science and Technology Co.,Ltd.

GR01 Patent grant
GR01 Patent grant