CN111178065B - Word segmentation recognition word stock construction method, Chinese word segmentation method and Chinese word segmentation device - Google Patents

Info

Publication number
CN111178065B
CN111178065B (application CN201911288705.7A)
Authority
CN
China
Prior art keywords
word
neuron
neurons
word segmentation
segmented
Prior art date
Legal status
Active
Application number
CN201911288705.7A
Other languages
Chinese (zh)
Other versions
CN111178065A (en)
Inventor
李胤文
Current Assignee
CCB Finetech Co Ltd
Original Assignee
CCB Finetech Co Ltd
Priority date
Filing date
Publication date
Application filed by CCB Finetech Co Ltd filed Critical CCB Finetech Co Ltd
Priority to CN201911288705.7A
Publication of CN111178065A
Application granted
Publication of CN111178065B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a word segmentation recognition word stock construction method, a Chinese word segmentation method and a Chinese word segmentation device, and relates to the field of computer technology. One embodiment of the method comprises the following steps: for each phrase in the training text set, de-duplicating the phrase and constructing a corresponding neuron for each word in the de-duplicated phrase, wherein the signal type indicated by each neuron matches the word to which that neuron corresponds; constructing, according to the relative position of and co-occurrence frequency between every two words in the phrase, a link relation between the two neurons corresponding to the two words, so as to form a phrase neural network corresponding to the phrase, wherein the link relation indicates a link coefficient and a signal transmission direction; and fusing the phrase neural networks to form the word segmentation recognition word stock. The word capacity of the word stock and the accuracy of word segmentation can be effectively improved.

Description

Word segmentation recognition word stock construction method, Chinese word segmentation method and Chinese word segmentation device
Technical Field
The invention relates to the technical field of computers, in particular to a word segmentation recognition word stock construction method, a Chinese word segmentation method and a Chinese word segmentation device.
Background
Word-stock-based Chinese word segmentation is currently one of the most common approaches to word segmentation. Constructing and maintaining the word stock is therefore the foundation of such segmentation.
Existing word stocks are mainly constructed and maintained manually: existing words, for example entries in the Modern Chinese Standard Dictionary, together with new words appearing on the internet, are collected by hand and stored in the word stock.
In the process of implementing the present invention, the inventor found that the prior art has at least the following problem:
compared with the massive text data on the network, manually collected words are very limited, so the number of words stored in the word stock is very limited. As a result, when word segmentation is performed based on a manually constructed or maintained word stock, the stored words often cannot meet the requirements of segmentation.
Disclosure of Invention
In view of the above, embodiments of the invention provide a word segmentation recognition word stock construction method, a Chinese word segmentation method and a Chinese word segmentation device, which can effectively improve the word capacity of the word stock and the accuracy of word segmentation.
To achieve the above object, according to one aspect of the embodiments of the present invention, there is provided a word segmentation recognition word stock construction method, including:
for each phrase in the training text set, performing:
de-duplicating the phrase, and constructing a corresponding neuron for each word in the de-duplicated phrase, wherein the signal type indicated by each neuron matches the word to which that neuron corresponds;
constructing, according to the relative position of and co-occurrence frequency between every two words in the phrase, a link relation between the two neurons corresponding to the two words, so as to form a phrase neural network corresponding to the phrase, wherein the link relation indicates a link coefficient and a signal transmission direction;
and fusing the phrase neural networks to form the word segmentation recognition word stock.
Preferably,
the word segmentation recognition word library comprises a main neural network and a linker linked with neurons in the main neural network;
fusing each phrase neural network, including:
for each phrase neural network, performing:
linking each neuron in the phrase neural network to a linker;
traversing each neuron in the phrase neural network through a linker;
and when the traversal finds that neurons with the same signal type exist in both the main neural network and the phrase neural network, deleting the neurons with the same signal type from the phrase neural network, and connecting the link relations involving those neurons to the main neural network.
Preferably,
when the traversal finds that a link relation with the same signal transmission direction exists in both the main neural network and the phrase neural network,
the link coefficient indicated by that link relation on the main neural network is updated according to the link coefficient indicated by the corresponding link relation in the phrase neural network.
Preferably,
the word segmentation recognition word stock construction method further comprises the following steps:
acquiring a newly added short sentence;
for each newly added word in the newly added short sentence, performing:
converting the newly added word into a corresponding neuron;
searching, through the linker, the main neural network for a first neuron matching the neuron corresponding to the newly added word, and activating the first neuron;
when a first link relation exists between two first neurons, calculating a first link coefficient corresponding to the first link relation by using a preset activation function;
and updating the link coefficient indicated by the first link relation with the calculated first link coefficient.
Preferably,
the word segmentation recognition word stock construction method further comprises the following steps:
setting an activated state and an inhibited state for the neurons, wherein the activated state indicates that a neuron is in use and the inhibited state indicates that it is not;
and converting a neuron from the activated state to the inhibited state when a recovery signal is acquired.
Preferably,
neurons further indicate signal strength;
searching for a first neuron matching the neuron corresponding to the newly added word, comprising:
searching for a first neuron matching the signal type indicated by the neuron corresponding to the newly added word;
and activating the first neuron when the signal strength indicated by the neuron corresponding to the newly added word is not smaller than a preset threshold.
Preferably,
the word segmentation recognition word stock construction method further comprises the following steps:
for each phrase in the training text set, performing:
calculating an md5 code corresponding to the short sentence;
judging whether the md5 code has already been recorded; if so, ignoring the phrase; otherwise, recording the md5 code, de-duplicating the phrase, and constructing a corresponding neuron for each word in the de-duplicated phrase.
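As an illustration, the md5-based de-duplication of phrases described above, together with the de-duplication of words within a retained phrase, might be sketched as follows (a minimal sketch in Python; all function and variable names are illustrative, not taken from the patent):

```python
import hashlib

seen_md5 = set()  # md5 codes of phrases that have already been processed

def is_new_phrase(phrase: str) -> bool:
    """Return True (and record the md5 code) only for a phrase not seen before."""
    code = hashlib.md5(phrase.encode("utf-8")).hexdigest()
    if code in seen_md5:
        return False       # md5 code already recorded: ignore the phrase
    seen_md5.add(code)     # record the md5 code
    return True

def dedup_words(phrase: str) -> list:
    """De-duplicate the words (characters) of a phrase, preserving first-seen
    order; one neuron would then be constructed per remaining word."""
    seen, out = set(), []
    for word in phrase:
        if word not in seen:
            seen.add(word)
            out.append(word)
    return out
```

A phrase seen a second time hashes to an already-recorded md5 code and is skipped, so its neurons are not rebuilt.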
Preferably,
the word segmentation recognition word stock construction method further comprises the following steps:
setting corresponding attenuation periods for the link relation and the neurons respectively;
when a word segmentation recognition thesaurus is used,
deleting the link relation when the duration of the link relation in the inhibition state reaches the attenuation period corresponding to the link relation;
and deleting the neuron and the link relation to which the neuron is linked when the duration of the neuron in the inhibition state reaches the attenuation period corresponding to the neuron.
Preferably,
the word segmentation recognition word stock construction method further comprises the following steps:
calculating a second link coefficient corresponding to the current time according to the current time, the last activated time corresponding to the current time and a preset decay function;
and updating the link coefficient indicated by the link relation by using the calculated second link coefficient.
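The text refers to a preset decay function without specifying it. As an illustrative assumption only, the sketch below uses exponential decay of the link coefficient with the time elapsed since the link relation was last activated; the decay constant and all names are hypothetical:

```python
import math

DECAY_RATE = 0.01  # assumed decay constant; the patent leaves the function unspecified

def second_link_coefficient(coefficient: float, last_activated: float, now: float) -> float:
    """Attenuate a stored link coefficient according to the current time and
    the time the link relation was last activated (assumed exponential decay)."""
    elapsed = max(0.0, now - last_activated)
    return coefficient * math.exp(-DECAY_RATE * elapsed)
```

A link activated just now keeps its coefficient, while a long-dormant link decays toward zero, consistent with the attenuation-period deletion rule above.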
Preferably, the word segmentation recognition word stock construction method further comprises the steps of:
executing, for at least one second link relation corresponding to the deterministic word sequence:
and updating the link coefficient indicated by each second link relation by using a preset link constant.
Preferably, the word segmentation recognition word stock construction method further comprises the steps of:
executing, for at least one third link relation corresponding to the deterministic non-word sequence:
at least one third linking relation is deleted.
According to a second aspect of the embodiments of the present invention, there is provided a Chinese word segmentation method, including:
executing, for each short sentence to be segmented in the text to be segmented:
converting each word to be segmented in the short sentence to be segmented into a corresponding neuron to be segmented;
searching a word segmentation recognition word stock for a matched neuron matching each neuron to be segmented, wherein the word segmentation recognition word stock is constructed by any one of the above word segmentation recognition word stock construction methods and comprises a plurality of neurons and link relations among the neurons;
and determining the word segmentation positions of the short sentence to be segmented according to the position order of each word to be segmented in the short sentence to be segmented and the link relations of every two matched neurons found.
Preferably,
searching for a matched neuron matching each neuron to be segmented, comprising:
according to the position order of each word to be segmented in the short sentence to be segmented, sequentially searching for the matched neuron matching each neuron to be segmented, and converting each matched neuron from the inhibited state to the activated state.
Preferably, the Chinese word segmentation method further comprises:
for each neuron to be segmented, performing:
determining the first output signal strength of the neuron to be segmented;
when a neuron in the activated state exists in the word segmentation recognition word stock, penetrating the neuron in the activated state with the neuron to be segmented;
calculating, according to the relative positions of the neuron in the activated state and the neuron to be segmented within the short sentence to be segmented, the first penetration signal strength generated when the neuron to be segmented penetrates the neuron in the activated state;
determining word segmentation positions of short sentences to be segmented comprises the following steps:
calculating the first word segmentation signal strength corresponding to the neuron to be segmented according to the first output signal strength and the first penetration signal strength;
and performing word segmentation according to the first word segmentation signal strength.
Preferably,
determining word segmentation positions of short sentences to be segmented comprises the following steps:
locating the current word segmentation position and the previous word segmentation position corresponding to it;
determining the second output signal strength of the neuron to be segmented corresponding to the first word to be segmented between the current word segmentation position and the previous word segmentation position;
for every two target words to be segmented between the current word segmentation position and the previous word segmentation position, executing:
determining the second penetration signal strength corresponding to the two target words according to their relative positions in the short sentence to be segmented and the link coefficient corresponding to the two target words;
calculating the second word segmentation signal strength corresponding to the neuron to be segmented according to the second output signal strength and the second penetration signal strength;
judging whether the second word segmentation signal strength meets a preset word segmentation condition, and if so, determining the current position to be segmented as a word segmentation position.
According to a third aspect of the embodiments of the present invention, there is provided a word segmentation recognition word stock construction apparatus, comprising a construction unit and a first linker, wherein:
the construction unit is configured to execute, for each phrase in a training text set: de-duplicating the phrase, and constructing a corresponding neuron for each word in the de-duplicated phrase, wherein the signal type indicated by each neuron matches the word to which that neuron corresponds; and constructing, according to the relative position of and co-occurrence frequency between every two words in the phrase, a link relation between the two neurons corresponding to the two words, so as to form a phrase neural network corresponding to the phrase, wherein the link relation indicates a link coefficient and a signal transmission direction;
the first linker is used for connecting the neurons constructed by the construction unit and fusing each phrase neural network constructed by the construction unit into the word segmentation recognition word stock.
According to a fourth aspect of the embodiments of the present invention, there is provided a Chinese word segmentation apparatus, comprising a conversion unit and a second linker, wherein:
the conversion unit is used for executing, for each short sentence to be segmented in the text to be segmented: converting each word to be segmented in the short sentence to be segmented into corresponding neurons to be segmented;
the second linker is used for searching for a matched neuron matched with each neuron to be segmented in the word segmentation recognition word stock, wherein the word segmentation recognition word stock comprises a plurality of neurons and a link relation among the neurons;
the second linker is further used for determining the word segmentation position of the short sentence to be segmented according to the position sequence of each word to be segmented in the short sentence to be segmented and the searched link relation of every two matched neurons.
One embodiment of the above invention has the following advantage or beneficial effect: a neuron is constructed for each word in the short sentences of the training text set, and link relations between neurons are constructed according to the relative position of and co-occurrence frequency between every two words, so that the word segmentation recognition word stock stores neurons carrying words and the link relations between them, rather than storing words or phrases directly. When word segmentation is performed based on this word stock, words are segmented according to the link relations between characters. Therefore, the scheme provided by the embodiment of the invention can effectively improve both the word capacity of the word stock and the accuracy of word segmentation.
Further effects of the above optional implementations are described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main flow of a word segmentation recognition thesaurus construction method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a portion of a neural network, according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a relationship between a linker and a neuron according to an embodiment of the invention;
FIG. 4 is a schematic diagram of the main flow of a word segmentation recognition thesaurus construction method according to another embodiment of the present invention;
FIG. 5 is a schematic diagram of a phrase neural network, according to another embodiment of the invention;
FIG. 6 is a schematic diagram of a portion of a principal neural network in a word stock for word segmentation recognition, according to another embodiment of the present invention;
FIG. 7 is a schematic diagram of the main flow of a word segmentation recognition thesaurus construction method according to another embodiment of the present invention;
FIG. 8 is a schematic diagram of the main flow of a Chinese word segmentation method according to one embodiment of the invention;
FIG. 9 is a schematic diagram of a main flow of determining word segmentation locations according to one embodiment of the invention;
FIG. 10 is a schematic diagram of a portion of a principal neural network in a word stock for word segmentation recognition, according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of a main flow of determining word segmentation locations according to another embodiment of the present invention;
FIG. 12 is a schematic diagram of the main units of the word segmentation recognition thesaurus construction apparatus according to an embodiment of the present invention;
FIG. 13 is a schematic diagram of the main units of a Chinese word segmentation apparatus according to an embodiment of the present invention;
FIG. 14 is an exemplary system architecture diagram in which embodiments of the present invention may be applied;
FIG. 15 is a schematic diagram of a computer system suitable for use in implementing embodiments of the present invention.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Short sentences refer to the segments of an article separated by punctuation marks such as commas, enumeration commas, quotation marks, periods and question marks.
Neurons refer to abstract nodes with electrical signals that make up a neural network.
Words are the smallest units of language that can be used independently.
Chinese word segmentation is the process of recombining a sequence of consecutive characters into a sequence of words according to certain specifications. In Chinese, sentences and paragraphs can be delimited by obvious punctuation marks, but words have no formal delimiter of their own.
Neurons are the basic structural and functional units that make up a neural network; a neuron processes the information on its input links and outputs the result to other neurons via its output links. Each neuron is associated with a signal type and a signal strength.
a neural network refers to a network composed of neurons and link relationships between neurons.
Neural signals refer to signals transmitted between neurons; a neural signal indicates a signal type and a signal strength. During word stock construction or word segmentation, when the signal type of a neural signal is the same as that of a neuron, the neuron may be activated if the signal strength exceeds a threshold. The signal type does not change during operation, while the signal strength does; signal strength is divided into input signal strength and output signal strength.
The link coefficient refers to the coefficient of the link between two neurons. It controls the ratio of input signal strength to output signal strength across a neuron link formed by two or more words, and is divided into an input coefficient and an output coefficient. The output signal strength of a neural signal can be calculated from its input signal strength and the link coefficient, which is how the word stock is constructed and how segmentation is performed.
The inhibited state: neurons in the word segmentation recognition word stock are usually in the inhibited state. A neuron in the inhibited state is activated when it receives a matching signal whose strength exceeds a threshold; non-matching signals are not processed.
The activated state: a neuron in the inhibited state becomes activated when it receives a matching neural signal. A neuron in the activated state passes through (penetrates) all input neural signals, and returns to the inhibited state upon receiving a recovery signal. For example, neuron a becomes activated after receiving neural signal a; if it then receives neural signal b, a neuron operation is performed between signal b and neuron a because neuron a is currently activated. If a recovery signal is received, all neurons enter the inhibited state, and neuron a is activated again only when it receives neural signal a.
The linker interfaces with neurons and forwards an input neuron signal to the output neuron whose type is the same as, or matches, that signal. During word stock construction, if no match is found, the neuron corresponding to the input signal is stored into the word stock; during word segmentation, if no match is found, an abnormal signal is output. The link coefficient between the linker and the neurons it interfaces with is 1; that is, when the linker relays a neuron signal, the signal strength is not attenuated and the signal passes through directly. Each word corresponds to one type of neuron signal; when an input neuron signal reaches an output neuron of the same or a matching type, the input signal and that neuron are said to express the same word.
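The inhibited/activated behaviour described above can be sketched as a small state machine. This is a minimal illustration; the class name, threshold value and method names are assumptions, not taken from the patent:

```python
class Neuron:
    """A neuron indicates one signal type and rests in the inhibited state.
    A matching signal whose strength exceeds the threshold activates it;
    a recovery signal returns it to the inhibited state."""

    THRESHOLD = 0.5  # assumed activation threshold

    def __init__(self, signal_type: str):
        self.signal_type = signal_type
        self.activated = False  # neurons are usually in the inhibited state

    def receive(self, signal_type: str, strength: float) -> None:
        # non-matching or too-weak signals are not processed
        if signal_type == self.signal_type and strength > self.THRESHOLD:
            self.activated = True

    def recover(self) -> None:
        # recovery signal: back to the inhibited state
        self.activated = False
```

For instance, a neuron for signal type a ignores a signal of type b while inhibited, but becomes activated by a sufficiently strong signal of type a, matching the a/b example in the text.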
Fig. 1 shows a word segmentation recognition word stock construction method according to an embodiment of the present invention. As shown in fig. 1, the method may include the following steps, with S101 and S102 performed for each phrase in the training text set:
S101: de-duplicating the phrase, and constructing a corresponding neuron for each word in the de-duplicated phrase, wherein the signal type indicated by each neuron matches the word to which that neuron corresponds;
S102: constructing, according to the relative position of and co-occurrence frequency between every two words in the phrase, a link relation between the two neurons corresponding to the two words, so as to form a phrase neural network corresponding to the phrase, wherein the link relation indicates a link coefficient and a signal transmission direction;
S103: fusing the phrase neural networks to form the word segmentation recognition word stock.
The training text set comes from dictionaries, textbooks, newspapers, network articles and other readable text content.
The link coefficient in the phrase neural network is calculated from the interval between the current word and each subsequent word according to the following formula (1).
Formula (1):
W(NL_i) = 10 × pow(0.1, i)
wherein NL_i denotes a pair of words in the phrase separated by interval i; W(NL_i) denotes the link coefficient between two words at interval i; pow(0.1, i) denotes 0.1 raised to the i-th power; i is an integer not smaller than 0: i = 0 for a word and itself, i = 1 when two words are adjacent, and so on. For example, in "this is apple", the interval between "this" and "is" is 1; the interval between "this" and "apple" is 2; the interval between "this" and "fruit" is 3. Thus the neural network constructed for the phrase "this is apple" through the above steps S101 and S102 is shown in fig. 2. According to formula (1), the link coefficient in the signal transmission direction from the "this" neuron to the "is" neuron is 0.1; from the "this" neuron to the "apple" neuron, 0.01; from the "this" neuron to the "fruit" neuron, 0.001; from the "is" neuron to the "apple" neuron, 0.1; from the "is" neuron to the "fruit" neuron, 0.1; and from the "apple" neuron to the "fruit" neuron, 0.2.
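The pairwise link construction of steps S101 and S102 together with formula (1) can be sketched as follows. Note that the worked example's coefficients do not all follow the printed formula (possibly a translation artifact), so this sketch implements formula (1) exactly as printed; the function names are illustrative:

```python
def link_coefficient(i: int) -> float:
    """Formula (1): W(NL_i) = 10 * pow(0.1, i), where i is the interval
    between two words of the phrase (i = 1 for adjacent words)."""
    return 10 * pow(0.1, i)

def build_phrase_network(words: list) -> dict:
    """Build the directed link relations of a phrase neural network: one link
    from each word to every later word, keyed by (earlier, later) to record
    the signal transmission direction."""
    links = {}
    for a in range(len(words)):
        for b in range(a + 1, len(words)):
            links[(words[a], words[b])] = link_coefficient(b - a)
    return links
```

For a four-word phrase such as "this is apple fruit" in the translation's rendering, this yields six directed links, with the coefficient shrinking by a factor of ten per extra character of interval.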
That is, the link coefficient is a value calculated from the relative position of two words, such as the number of characters between them, and the frequency with which the two words appear together in a phrase; to some extent it reflects the probability that the two words belong to the same word or the same phrase. The farther apart two words are, the lower the probability that they belong to the same word, and therefore the smaller the link coefficient. The word segmentation recognition word stock constructed by the embodiment of the invention can thus express the relations between words.
The signal transmission direction reflects the order of two words. For example, for the phrase "I love eating apples", a corresponding neuron is constructed for each word, and the signal transmission direction between the neuron corresponding to "I" and the neuron corresponding to "love" runs from the "I" neuron to the "love" neuron, because "I" precedes "love" in the phrase.
In the scheme provided by the embodiment of the invention, a neuron is constructed for each word in the short sentences of the training text set, and link relations between neurons are constructed according to the relative position of and co-occurrence frequency between every two words. The word segmentation recognition word stock therefore consists of neurons carrying words and the link relations between them, rather than stored words or phrases; it can express far more words and is no longer constrained by a fixed vocabulary. The scheme provided by the embodiment of the invention can thus effectively improve the word capacity of the word stock.
In one embodiment of the invention, the word segmentation recognition word stock comprises a main neural network and a linker linked with the neurons in the main neural network. The relationship between the neurons and the linker is shown in fig. 3: the linker comprises an inlet linker and an outlet linker, and the neurons corresponding to the characters on the main neural network (the character-A neuron, the character-B neuron, …, the character-C neuron, the character-D neuron, …) are connected to the inlet linker and the outlet linker respectively, so that neurons or neural signals are input through the inlet linker and results are output through the outlet linker.
Accordingly, based on the linker, a specific implementation of step S103 may include the following. As shown in fig. 4, steps S401 to S403 are performed for each phrase neural network:
S401: linking each neuron in the phrase neural network to the linker;
S402: traversing each neuron in the phrase neural network through the linker;
S403: when the traversal finds that neurons with the same signal type exist in both the main neural network and the phrase neural network, deleting the neurons with the same signal type from the phrase neural network, and connecting the link relations involving those neurons to the main neural network.
For example, take the neural network shown in fig. 2 as the main neural network of the word segmentation recognition word stock and fuse into it the phrase neural network shown in fig. 5. Each neuron in the phrase neural network of fig. 5 (the neurons corresponding to "I", "love", "apple" and "fruit") is traversed. The neurons whose signal types also appear in the network of fig. 2 are the "apple" neuron and the "fruit" neuron, so those neurons are deleted from the phrase neural network and their link relations are connected to the main neural network, yielding the neural network shown in fig. 6.
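The fusion of steps S401 to S403 can be sketched as follows. This is a minimal illustration only: it assumes neurons are identified by their signal type (character) and link relations are stored as a map from (source, target) pairs to coefficients, neither of which the patent prescribes as a data structure, and it uses `max()` as a stand-in for the patent's coefficient update.

```python
# Hypothetical sketch of fusing a phrase neural network into the main
# neural network (steps S401-S403). Data layout is assumed: neurons are
# identified by signal type (character), links by (source, target) pairs.

def fuse(main_neurons, main_links, phrase_neurons, phrase_links):
    # S401/S402: link and traverse each neuron of the phrase network.
    for char in phrase_neurons:
        # S403: a neuron whose signal type already exists in the main
        # network is deduplicated (the set absorbs it); its link
        # relations are reattached to the main network below.
        main_neurons.add(char)
    for (src, dst), coeff in phrase_links.items():
        if (src, dst) in main_links:
            # Same signal transmission direction on both networks:
            # update the main-network coefficient (the patent applies
            # its activation function here; max() is a stand-in).
            main_links[(src, dst)] = max(main_links[(src, dst)], coeff)
        else:
            main_links[(src, dst)] = coeff

main_n = {"apple", "fruit"}
main_l = {("apple", "fruit"): 0.8}
phrase_n = {"i", "love", "apple", "fruit"}
phrase_l = {("i", "love"): 0.5, ("love", "apple"): 0.4, ("apple", "fruit"): 0.9}
fuse(main_n, main_l, phrase_n, phrase_l)
```

After the call, the duplicated "apple" and "fruit" neurons appear only once and all phrase links live on the main network, mirroring the fig. 5 into fig. 2 example.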
In one embodiment of the invention, when the traversal finds a link relation with the same signal transmission direction in both the main neural network and the phrase neural network, the link coefficient indicated by that link relation on the main neural network is updated according to the link coefficient indicated by the corresponding link relation on the phrase neural network. For example, the link relation from the "apple" neuron to the "fruit" neuron in the phrase neural network of fig. 5 has the same signal transmission direction as the link relation from the "apple" neuron to the "fruit" neuron in the network of fig. 2, so the link coefficient from the "apple" neuron to the "fruit" neuron in fig. 6 is updated using the coefficient from the phrase neural network of fig. 5.
The above describes fusing a phrase neural network into the main neural network to update it. The following embodiment instead inputs each character of a phrase into the main neural network in sequence to update it. In one embodiment of the present invention, as shown in fig. 7, the method for constructing the word segmentation recognition word stock may further include the following steps:
Steps S701 to S704 are performed for each newly added word in the newly added phrase:
S701: converting the newly added word into a corresponding neuron;
S702: searching, through the linker, the main neural network for a first neuron matching the neuron corresponding to the newly added word, and activating the first neuron;
S703: when a first link relation exists between two first neurons, calculating a first link coefficient for that first link relation by using a preset activation function;
S704: updating the link coefficient indicated by the first link relation with the calculated first link coefficient.
In step S702, if no matching first neuron can be found, a new neuron is added to the word stock for the neuron corresponding to the newly added word. Neurons in the word stock are normally in a suppressed state; when a neuron is matched it is activated, and its state changes to the activated state.
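Steps S701 to S704 might be sketched as follows, reusing the activation function of formula (2) given below. The network representation (a neuron set plus a coefficient map) and the "update only the link from the previous character" policy are illustrative assumptions:

```python
import math

N = 10  # sigmoid centre, as in formulas (2)/(3) below

def su(p):
    # Su(P) = sigmoid(P - N)
    return 1.0 / (1.0 + math.exp(-(p - N)))

def add_word(char, neurons, links, activated):
    """S701-S704 for one newly added character: match-or-create the
    neuron, activate it, and refresh the coefficient of the link from
    the previously activated neuron using formula (2)."""
    if char not in neurons:        # S702 fallback: no match -> new neuron
        neurons.add(char)
    if activated:                  # S703/S704: update link from last char
        prev = activated[-1]
        w = links.get((prev, char), 0.0)
        links[(prev, char)] = su(su(w) + 1)   # formula (2)
    activated.append(char)         # S702: the matched neuron is activated

neurons, links, activated = set(), {}, []
for ch in "abc":
    add_word(ch, neurons, links, activated)
```

Each pass both grows the neuron set where needed and nudges the coefficient of the preceding link, which is the incremental counterpart of the batch fusion described earlier.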
The activation function of the above step S703 is shown in the following calculation formula (2):
calculation formula (2):
W′n+1(NL) = Su(Su(W′n(NL)) + 1)
wherein W′n+1(NL) characterizes the link coefficient between the neurons corresponding to the two characters at the (n+1)-th activation of NL; W′n(NL) characterizes the link coefficient between the neurons corresponding to the two characters at the n-th activation of NL; n is an integer not less than 0; Su(P) = sigmoid(P − N); for W′n+1(NL), P represents W′n(NL) or Su(W′n(NL)) + 1; N characterizes a set parameter, namely the x-axis centre point position of the sigmoid function.
Wherein, the formula form of Su (P) =sigmoid (P-N) is shown in the following calculation formula (3):
calculation formula (3):
sigmoid(P − N) = 1 / (1 + e^(−(P − N)))
The sigmoid function takes values between 0 and 1, is centrally symmetric about 0.5, and is an S-shaped curve, so the link coefficient is kept between 0 and 1 and grows with the number of activations, which facilitates the subsequent word segmentation calculation. In a preferred embodiment, N = 10.
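As a sketch, formulas (2) and (3) might be implemented as follows, assuming the preferred setting N = 10; the repeated-activation loop is illustrative only.

```python
import math

N = 10  # x-axis centre point of the sigmoid (the preferred setting)

def su(p):
    """Formula (3): Su(P) = sigmoid(P - N) = 1 / (1 + e^(-(P - N)))."""
    return 1.0 / (1.0 + math.exp(-(p - N)))

def activate(w_n):
    """Formula (2): W'n+1(NL) = Su(Su(W'n(NL)) + 1)."""
    return su(su(w_n) + 1)

# The coefficient always stays inside (0, 1), whatever its prior value.
w = 0.0
for _ in range(3):
    w = activate(w)
```

Because every output passes through the sigmoid, the coefficient can never leave the (0, 1) interval regardless of how many activations occur.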
In an embodiment of the present invention, the method for constructing the word segmentation recognition word stock further includes: setting an activated state and a suppressed state for each neuron, the activated state indicating that the neuron is in use and the suppressed state that it is not; when a recovery signal is acquired, activated neurons are returned to the suppressed state. With the activated state, the suppressed state and the recovery signal, a target neuron can be found during search by examining only the neurons in the activated state, so link coefficients can be counted, and words segmented, more efficiently and rapidly. In addition, the suppressed state and the recovery signal effectively reduce the resources occupied by neurons in the word segmentation recognition word stock.
For example, suppose the neural network shown in fig. 2 exists in the word segmentation recognition word stock. After each character of "I love apples" is abstracted into a corresponding neural signal, the "I", "love", "apple" and "fruit" neural signals are input to the linker in sequence; the signals corresponding to "I" and "love" find no matching neurons, so no neuron is activated; the "apple" signal matches the "apple" neuron in fig. 2 and activates it; the "fruit" signal then penetrates the link relation between the "apple" neuron and the "fruit" neuron and activates the neuron corresponding to "fruit". For another example, after each character of the phrase "this effect" is abstracted into a corresponding neural signal, the signals are input to the linker in sequence; the "this" signal matches and activates the "this" neuron in fig. 2, and the "fruit" signal penetrates the link relation between the "effect" neuron and the "fruit" neuron and activates the neuron corresponding to "fruit". In the processes of word segmentation and word stock construction, penetrating a link relation once counts as activating it once.
In one embodiment of the invention, the neuron further indicates a signal strength. The specific embodiment of step S702 may then include: searching for a first neuron matching the signal type indicated by the neuron corresponding to the newly added word; and activating the first neuron when the signal strength indicated by the neuron corresponding to the newly added word is not less than a preset threshold. Since the link coefficient between the linker and any neuron is 1, searching for neurons through the linker does not reduce the signal strength.
In an embodiment of the present invention, the method for constructing the word segmentation recognition word stock may further include, for each phrase in the training text set: calculating the md5 code corresponding to the phrase; judging whether the md5 code has already been recorded; if so, ignoring the phrase; otherwise, recording the md5 code, de-duplicating the phrase, and constructing a corresponding neuron for each character in the de-duplicated phrase. Here MD5 is a 128-bit signature obtained by mathematically transforming the phrase according to the published MD5 algorithm; identical phrases yield identical MD5 codes. The md5 code thus prevents the same phrase from being trained twice, which effectively improves the accuracy of the link relations between neurons.
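The MD5-based de-duplication step can be sketched as follows; the function name and return convention are illustrative, not from the patent.

```python
import hashlib

seen_md5 = set()

def train_phrase(phrase):
    """Skip a phrase whose MD5 code was already recorded; otherwise record
    the code, de-duplicate the characters, and return the characters for
    which neurons should be built (None means: ignore the phrase)."""
    digest = hashlib.md5(phrase.encode("utf-8")).hexdigest()  # 128-bit code
    if digest in seen_md5:
        return None
    seen_md5.add(digest)
    # De-duplication: one neuron per distinct character, first-seen order.
    return list(dict.fromkeys(phrase))
```

A second call with the same phrase hits the recorded MD5 code and is ignored, so the link relations are never counted twice for one phrase.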
In one embodiment of the present invention, the word segmentation recognition word stock construction method may further include: setting corresponding attenuation periods for the link relations and the neurons respectively; when the word segmentation recognition word stock is in use, deleting a link relation when the time it has spent in the suppressed state reaches its attenuation period; and deleting a neuron, together with the link relations attached to it, when the time it has spent in the suppressed state reaches its attenuation period. This deletes unused link relations and neurons, reduces the resources they occupy, and avoids unnecessary resource overhead, thereby further reducing resource occupation and effectively improving word segmentation efficiency.
In an embodiment of the present invention, the method for constructing the word segmentation recognition word stock further includes: calculating a second link coefficient corresponding to the current time according to the current time, the time of the last activation before the current time, and a preset decay function; and updating the link coefficient indicated by the link relation with the calculated second link coefficient. The link coefficient can thus reflect how active a link relation is, and hence how active the corresponding word is. Embodying the attenuation process in the link coefficient also models the forgetting of long-unused words in the neural network, which helps filter out words not used recently and reduces the recognition calculation overhead during word segmentation.
The specific embodiment of calculating the second link coefficient corresponding to the current time according to the current time, the last activated time corresponding to the current time and the preset decay function may include:
calculating a second link coefficient corresponding to the current time by using the following calculation formula (4):
calculation formula (4):
Sd(t)=v*Ebbinghaus(t-t0)
wherein Sd(t) represents the second link coefficient corresponding to the current time t; t0 represents the time of the last activation before the current time; v represents an attenuation coefficient; and Ebbinghaus() characterizes the Ebbinghaus forgetting curve function. Preferably, v = 1. The decay time may be set according to the practical situation; for example, a decay from 100% to 0% over one year may be set. Through this attenuation setting, the decay of links between neurons, or of the neurons themselves, in the word segmentation recognition word stock simulates the forgetting process of the human brain: forgetting begins immediately after learning and is not uniform, being fast at first and gradually slower. The Ebbinghaus forgetting curve expresses retention and forgetting as a function of time, and is used here to simulate the brain's forgetting process when calculating the decay of links or neurons in the word stock.
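Formula (4) could be sketched as below. The exponential form of the Ebbinghaus curve and the one-year time constant are assumptions, since the patent fixes neither; times are in seconds.

```python
import math

V = 1.0                   # attenuation coefficient (preferably v = 1)
TAU = 365.0 * 24 * 3600   # assumed time constant: roughly one year

def ebbinghaus(dt):
    """A common exponential approximation of the Ebbinghaus retention
    curve, R = e^(-dt / tau): fast forgetting first, then slower."""
    return math.exp(-dt / TAU)

def second_link_coefficient(now, last_activated):
    """Formula (4): Sd(t) = v * Ebbinghaus(t - t0)."""
    return V * ebbinghaus(now - last_activated)
```

A link activated just now keeps its full coefficient, and the coefficient shrinks monotonically the longer the link goes unused.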
In an embodiment of the present invention, the method for constructing the word segmentation recognition word stock further includes, for the at least one second link relation corresponding to a deterministic word sequence: updating the link coefficient indicated by each second link relation with a preset link constant. This can further improve the accuracy of word segmentation using the word stock. For example, if "review" is a deterministic two-character word, the preset link constant is used to update the link coefficient between the neurons corresponding to its two characters; if the four-character idiom "crouching tiger, hidden dragon" is a deterministic word, the preset link constant is used to update the link coefficient between the "crouching" and "tiger" neurons, between the "tiger" and "hidden" neurons, and between the "hidden" and "dragon" neurons. Preferably, the preset link constant is 0.9.
In one embodiment of the present invention, the word segmentation recognition word stock construction method may further include, for at least one third link relation corresponding to a deterministic non-word sequence: deleting the at least one third link relation. This reduces unnecessary link relations and therefore the resource overhead of word segmentation. For example, deleting the link relations corresponding to deterministic non-word sequences can greatly reduce the non-word sequences in the word stock, and removing them from the neural network makes the network's output, and hence the word segmentation, more accurate.
Fig. 8 shows a method for Chinese word segmentation based on the word segmentation recognition word stock provided in the above embodiments. As shown in fig. 8, the method may include executing the following steps for each short sentence to be segmented in the text to be segmented:
S801: converting each word to be segmented in the short sentence to be segmented into a corresponding neuron to be segmented;
S802: searching the word segmentation recognition word stock for a matched neuron matching each neuron to be segmented, wherein the word stock comprises a plurality of neurons and the link relations among them;
S803: determining the word segmentation positions of the short sentence according to the positional order of each word to be segmented in the short sentence and the link relations found between every two matched neurons.
For example, for a short sentence ABCDE to be segmented, neuron A, neuron B, neuron C, neuron D and neuron E are searched in the word segmentation recognition word stock, along with the link relations between every two of them (A-B, A-C, A-D, A-E, B-C, B-D, B-E, C-D, C-E and D-E). The word segmentation positions are then determined from the found link relations: an output signal is calculated for each candidate word using the link coefficients, and when the output signal is not greater than a preset word segmentation threshold, the candidate is not a word. For instance, if the output signal calculated for AB from the link relation between neuron A and neuron B is greater than the word segmentation threshold while the output signal for ABC is not, the word segmentation position lies between B and C. Likewise, if the output signal strength calculated for AB using the link coefficients is greater than that for ABC, the word segmentation position is between B and C.
In one embodiment of the present invention, searching for the matched neuron matching each neuron to be segmented includes: searching for the matched neurons in sequence, according to the positional order of each word in the short sentence to be segmented, and converting each matched neuron from the suppressed state to the activated state. For example, for the short sentence ABCDE, the matched neuron for A is searched first, then the matched neuron for B, and so on. The matching process may first search among the neurons already in the activated state and, only if nothing is found there, search the whole word stock. For example, when searching a match for B, the search starts from the activated neuron matched to A: if B can be reached by penetrating A, then AB may be a word; if not, AB is not a word and a word segmentation position lies between A and B. Searching via activated neurons thus, on the one hand, preliminarily determines whether the found neuron has a link relation with an activated neuron, i.e. whether the two may belong to the same word, and, on the other hand, greatly narrows the search range, saving resources and search time and improving search efficiency.
There are two ways to determine the word segmentation position based on the neural network in the word segmentation recognition word library.
The first way of determining the word segmentation position is shown in fig. 9. For each neuron to be segmented abstracted from each word in the short sentence to be segmented, the following steps may be performed:
S901: determining the first output signal strength of the neuron to be segmented; when an activated neuron exists in the word segmentation recognition word stock, executing step S902; when no activated neuron exists, executing step S903;
S902: penetrating the activated neurons with the neuron to be segmented, and executing step S906;
S903: searching the word stock for a neuron matching the neuron to be segmented; if found, executing S904; if not, executing S905;
S904: converting the matched neuron from the suppressed state to the activated state and outputting a signal of the first output signal strength; and ending the current flow;
S905: outputting an abnormal signal and ending the current flow;
S906: calculating, according to the relative positions of the activated neurons and the neuron to be segmented in the short sentence, the first penetration signal strengths generated by the neuron to be segmented penetrating the activated neurons;
S907: calculating the first word segmentation signal strength corresponding to the neuron to be segmented according to the first output signal strength and the first penetration signal strengths;
S908: performing word segmentation according to the first word segmentation signal strength.
In the specific embodiment of determining the first output signal strength of the neuron to be segmented in step S901, the first output signal strength is calculated with the following calculation formula (5):
calculation formula (5):
OutS(X) = InS(X) × InW(X) × InW(NL0) × OutW(NL0) × OutW(X)
wherein OutS(X) represents the first output signal strength of the neuron X to be segmented; InS(X) characterizes the strength of the neural signal input to neuron X: if a neuron matching X is found in the word segmentation recognition word stock, InS(X) = 1, otherwise it is a preset abnormal strength parameter such as 0; InW(X) represents the input link coefficient of neuron X (the link coefficient between the inlet linker and X): if a matching neuron is found, InW(X) = 1, otherwise it is a preset abnormal input link parameter such as 0; OutW(X) represents the output link coefficient of neuron X (the link coefficient between the outlet linker and X): if a matching neuron is found, OutW(X) = 1, otherwise it is a preset abnormal output link parameter such as 0; InW(NL0) characterizes the input link coefficient from neuron X to the matched neuron: if a matching neuron is found, InW(NL0) = 1, otherwise it is a preset abnormal parameter such as 0; and OutW(NL0) represents the output link coefficient from neuron X to the matched neuron, OutW(NL0) = 10 × pow(0.1, 0) = 10.
That is, through step S901, if the neuron X to be segmented is found in the word segmentation recognition word stock, its first output signal strength is 10; if it is not found, the signal is judged abnormal, i.e. the character corresponding to X is an abnormal character, and an abnormal signal such as 0 is output. After an abnormal signal is output, the word segmentation position is at the position of neuron X; the word stock can then be reset and a new round of word segmentation started from the position after X. Resetting the word segmentation recognition word stock means inputting a recovery signal into it so that all activated neurons return to the suppressed state.
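A minimal sketch of formula (5), assuming only the two cases the text describes (matched vs. not matched, with 0 standing in for every abnormal parameter):

```python
def first_output_strength(matched):
    """Formula (5): OutS(X) = InS(X)*InW(X)*InW(NL0)*OutW(NL0)*OutW(X).
    When a matching neuron is found, every factor except OutW(NL0) is 1
    and OutW(NL0) = 10 * pow(0.1, 0) = 10, so OutS(X) = 10; otherwise
    the abnormal parameters (assumed 0 here) zero the product out."""
    ins = inw = inw_nl0 = outw = 1.0 if matched else 0.0
    outw_nl0 = 10 * pow(0.1, 0)  # = 10
    return ins * inw * inw_nl0 * outw_nl0 * outw
```

So a matched character always contributes a base strength of 10, and an unmatched character contributes the abnormal value, triggering the reset described above.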
In the specific embodiment of step S906, the character spacing between an activated neuron and the neuron to be segmented is determined from their relative positions in the short sentence to be segmented, and the number of digits after the decimal point to take from the link coefficient between them is determined by that character spacing (i.e. the number of decimal digits taken equals the character spacing); if no link relation exists between the activated neuron and the neuron to be segmented, the link coefficient between them is 0. The character spacing can be calculated with the following calculation formula (6):
Calculation formula (6):
K(X->Y) = K(X) − K(Y)
wherein K(X->Y) characterizes the character spacing between the neuron X to be segmented and the activated neuron Y penetrated by X; K(X) characterizes that neuron X corresponds to the K(X)-th character of the short sentence; and K(Y) characterizes that neuron Y corresponds to the K(Y)-th character, with K(Y) ≤ K(X). For example, for a short sentence CDEF, the activated neurons Y are the neurons matched to C, D and E, where C is the 1st character, D the 2nd and E the 3rd of the short sentence, and F is the neuron X to be segmented, the 4th character; then K(F->C) = 3, K(F->D) = 2 and K(F->E) = 1. The link coefficient from the C-matched neuron to the F-matched neuron accordingly takes 3 digits after the decimal point, giving the first penetration signal strength generated by the F-matched neuron penetrating the C-matched neuron; the coefficient from the D-matched neuron takes 2 digits, giving the penetration strength through the D-matched neuron; and the coefficient from the E-matched neuron takes 1 digit, giving the penetration strength through the E-matched neuron.
It is worth noting that "taking a number of digits" can mean taking only the K-th decimal digit, with the other digits zeroed, or taking the first K decimal digits. For example, in the short sentence ABCDE the character spacing between A and E is 4; if the link coefficient between A and E in the neural network is 0.1223, taking only the 4th decimal digit gives 0.0003, while taking the first 4 decimal digits gives 0.1223. Which mode to use can be fixed by configuration or selected as needed.
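The two value-taking modes can be sketched as follows (a float-tolerant truncation; the function and mode names are illustrative):

```python
import math

def penetration_strength(coeff, spacing, mode="first_k"):
    """Take decimal digits of a link coefficient according to the character
    spacing K: either the first K decimal digits, or only the K-th digit
    with the others zeroed. The 1e-6 guard avoids float truncation error."""
    scaled = math.floor(coeff * 10 ** spacing + 1e-6)
    if mode == "kth_digit":
        return (scaled % 10) / 10 ** spacing
    return scaled / 10 ** spacing
```

With the document's example (coefficient 0.1223, spacing 4), the "K-th digit" mode yields 0.0003 and the "first K digits" mode yields 0.1223.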
In step S907, the first word segmentation signal strength corresponding to the neuron to be segmented is calculated according to the following calculation formula (7):
calculation formula (7):
OutS(1->X) = OutS(X) + Σ(i=1..X−1) InSi × OutW(NLi)
wherein OutS(1->X) characterizes the first word segmentation signal strength from the 1st character to the X-th character of the short sentence to be segmented; OutS(X) represents the first output signal strength of the neuron corresponding to the X-th character, X not being the 1st character; InSi characterizes the input signal strength of the i-th character, InSi = 1; and OutW(NLi) is the first penetration signal strength generated by the neuron corresponding to the X-th character penetrating the neuron corresponding to the i-th character located before it.
Word segmentation is performed when the first word segmentation signal strength meets the word segmentation condition. The condition may be OutS(1->X) ≥ a preset word segmentation threshold and OutS(1->X+1) < the threshold, in which case a word segmentation position lies between the X-th and (X+1)-th characters of the short sentence. Alternatively, the condition may be OutS(1->X) > OutS(1->X+1), again placing a word segmentation position between the X-th and (X+1)-th characters.
Taking the partial main neural network of the word segmentation recognition word stock shown in fig. 10 as an example, and setting the threshold to 10.2, the way the word segmentation position is determined is illustrated by segmenting the short sentences AE, BC, EBC, BCE and CDE.
For the short sentence AE, the neural signals corresponding to A and E are input to the linker in sequence, giving output signals: OutS(A) = 10, OutS(A->E) = OutS(E) + InS1 × OutW(NL1) = 10 + 1 × 0.2 = 10.2. Since A and E differ by 1 character in AE, OutW(NL1), the link coefficient from A to E, takes one digit after the decimal point. 10.2 ≥ the threshold (10.2), so AE is a word.
For the short sentence BC, the neural signals corresponding to B and C are input to the linker in sequence, giving: OutS(B) = 10, OutS(B->C) = OutS(C) + InS1 × OutW(NL1) = 10 + 1 × 0.2 = 10.2. Since B and C differ by 1 character, OutW(NL1), the link coefficient from B to C, takes one digit after the decimal point. 10.2 ≥ the threshold (10.2), so BC is a word.
For the short sentence EBC, the neural signals corresponding to E, B and C are input to the linker in sequence, giving: OutS(E) = 10, OutS(E->B) = OutS(B) + InS1 × OutW(NL1) = 10 + 1 × 0.0 = 10.0, and OutS(B->C) = 10.2. Since E and B differ by 1 character, OutW(NL1), the link coefficient from E to B, takes one digit after the decimal point. 10.0 < the threshold (10.2), so EB is not a word; since BC is already known to be a word, the word segmentation position is between E and B.
For the short sentence BCE, the neural signals corresponding to B, C and E are input to the linker in sequence, giving: OutS(B) = 10, OutS(B->C) = OutS(C) + InS1 × OutW(NL1) = 10 + 1 × 0.2 = 10.2, and OutS(B->E) = OutS(E) + InS1 × OutW(NL2) + InS2 × OutW(NL1) = 10.14. Since B and E differ by 2 characters in BCE, OutW(NL2), the link coefficient from B to E, takes two digits after the decimal point; C and E differ by 1 character, so OutW(NL1), the link coefficient from C to E, takes one digit after the decimal point. 10.14 < the threshold (10.2), so BCE is not a word; BC is a word, and the word segmentation position is between C and E.
For the short sentence CDE, the neural signals corresponding to C, D and E are input to the linker in sequence, giving: OutS(C) = 10, OutS(C->D) = OutS(D) + InS1 × OutW(NL1) = 10 + 1 × 0.2 = 10.2, and OutS(C->E) = OutS(E) + InS1 × OutW(NL2) + InS2 × OutW(NL1) = 10.43 or 10.33, depending on the value-taking mode. Since C and E differ by 2 characters in CDE, OutW(NL2), the link coefficient from C to E, takes either the first two digits after the decimal point or only the 2nd digit; D and E differ by 1 character, so OutW(NL1), the link coefficient from D to E, takes one digit after the decimal point. 10.43 ≥ the threshold (10.2), or 10.33 ≥ the threshold (10.2), so CDE is a word.
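The threshold test used in the AE, BC and EBC walkthroughs above can be sketched end to end; the link coefficients below are hypothetical stand-ins for fig. 10, which is not reproduced here.

```python
import math

THRESHOLD = 10.2
# Hypothetical directed link coefficients standing in for fig. 10.
LINKS = {("A", "E"): 0.2, ("B", "C"): 0.2}

def penetration(coeff, spacing):
    # Take `spacing` digits after the decimal point (float-tolerant).
    return math.floor(coeff * 10 ** spacing + 1e-6) / 10 ** spacing

def strength(phrase):
    """OutS(1->X) for the whole candidate: 10 for the matched last neuron
    plus the penetration signal from every preceding character."""
    x = len(phrase) - 1
    out = 10.0
    for i in range(x):
        out += penetration(LINKS.get((phrase[i], phrase[x]), 0.0), x - i)
    return out

def is_word(phrase):
    return strength(phrase) >= THRESHOLD
```

With these assumed coefficients, AE and BC reach 10.2 and pass the threshold, while EB has no link and stays at 10.0, matching the decisions in the walkthrough.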
It will be appreciated that the linking coefficients given in fig. 2, 5, 6 and 10 are given by way of example only and do not constitute a limitation on the linking coefficients.
The second way of determining the word segmentation position is shown in fig. 11, and may include the following steps:
s1101: positioning the current word segmentation position and the last word segmentation position corresponding to the current word segmentation position;
s1102: determining the second output signal intensity of a neuron to be segmented corresponding to a first word to be segmented between the current word segmentation position and the last word segmentation position;
for each two target to-be-segmented words between the current to-be-segmented word position and the last segmented word position, executing step S1103 to step S1107:
s1103: determining second penetration signal strength corresponding to the two target words to be separated according to the relative positions of the two target words to be separated in the word to be separated short sentence and the link coefficients corresponding to the two target words to be separated;
s1104: calculating second word segmentation signal intensity corresponding to the neuron to be segmented according to the second output signal intensity and the second penetration signal intensity;
S1105: judging whether the second word segmentation signal strength meets a preset word segmentation condition, and if so, executing a step S1106; otherwise, step S1107 is performed;
s1106: determining the current position to be segmented as a segmentation position, and ending the current flow;
s1107: and determining that the current position to be segmented is not the segmentation position, taking the next segmentation position as the current segmentation position, and executing S1103.
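The loop of steps S1101 to S1107 amounts to scoring the span between the last word segmentation position and the current candidate position. A minimal sketch follows; the scoring rule, the 10.0 base output, the 10.2 threshold and all link coefficients are hypothetical stand-ins for values that would come from the word segmentation recognition word stock.

```python
def is_segmentation_position(chars, link_coeff, base_output=10.0, threshold=10.2):
    """Score the span between the last word segmentation position and the
    current candidate position (steps S1102-S1107).

    `link_coeff` maps ordered character pairs (q, z) to link coefficients;
    a missing pair contributes nothing. The contribution of a pair that is
    m characters apart is placed at the m-th decimal place, so
    contributions at different distances cannot interfere.
    """
    score = base_output  # S1102: output intensity of the first neuron
    for i in range(len(chars)):
        for j in range(i + 1, len(chars)):                   # S1103: each pair
            coeff = link_coeff.get((chars[i], chars[j]), 0.0)
            m = j - i
            score += (int(coeff * 10 ** m) % 10) / 10 ** m   # S1104
    return score >= threshold                                # S1105-S1107

# Hypothetical coefficients under which ABC scores as a single word.
coeffs = {("A", "B"): 0.88, ("A", "C"): 0.88, ("B", "C"): 0.55}
```

With these coefficients the span ABC scores 10 + 0.8 + 0.08 + 0.5 = 11.38, above the threshold, so the candidate position would be confirmed as a word segmentation position (S1106); with no links at all the score stays at the base output and the search moves on (S1107).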
Taking ABC as an example of the characters between the current word segmentation position and the last word segmentation position, the procedure is as follows:
initial state reset: the main portal inputs a RESET signal (restore signal) that is sent through the linker to the activated neurons to RESET the activated neurons to the inhibited state.
Inputting a first word: n=0. The neural signal corresponding to the A word is input from the entrance linker with signal intensity InS(A)=1 (the default input signal intensity is 1). If the linker is not linked to the A word, an ERROR signal is output to the outlet linker; if the A word is linked, an A signal of intensity 1 is sent to the A-word neuron, and neuron A is activated and outputs an A signal to the outlet linker, whose intensity is given by calculation formula (5):

OutS(A)=InS(A)×InW(A)×InW(NL0)×OutW(NL0)×OutW(A)=1×1×1×10×1=10
Inputting a second word: n=1. The B word is input from the entrance linker with signal intensity InS(B)=1 (the default input signal intensity is 1). The B signal activates neuron B and penetrates the activated neuron A, so there are two corresponding output signal intensities. One is the output of neuron B produced by the B signal activating it: OutS(B0)=InS(B)×InW(B)×InW(NL0)×OutW(NL0)×OutW(B)=1×1×1×10×1=10. The other is produced by the B signal penetrating the activated neuron A: if the link coefficient of neurons A to B satisfies InW(NL1(A->B)) < a preset threshold, the output is OutS(A->B)=0; if InW(NL1(A->B)) > the preset threshold, the intensity of the B signal output from the A->B link NL1 is OutS(A->B)=InS(B)×InW(A->B)×OutW(NL1(A->B)). Assuming the link coefficient of neuron A to neuron B is 0.88, then, to prevent mutual interference among the activation values of words at different spacing positions, OutS(NL1(A->B)) takes the 1st bit after the decimal point, i.e., 0≤OutS(A->B)≤0.9. The signal intensity of the B signal output to the outlet linker is OutS(B)=OutS(B0)+OutS(A->B)=10+0.8=10.8.
Inputting a third word C: n=2. The C word is input from the entrance linker with signal intensity InS(C)=1 (the default input signal intensity is 1). The C signal activates neuron C and penetrates the activated neurons A and B. The output signal intensity of neuron C is OutS(C0)=InS(C)×InW(C)×InW(NL0)×OutW(NL0)×OutW(C)=10. Since neurons A and B have been activated, neuron A receives the C signal: OutS(A->C)=InS(C)×InW(A->C)×OutW(NL2(A->C)), with 0≤OutS(A->C)≤0.09; neuron B receives the C signal: OutS(B->C)=InS(C)×InW(B->C)×OutW(NL1(B->C)), with 0≤OutS(B->C)≤0.9. The output signal intensities received by the outlet linker are OutS(C0), OutS(A->C) and OutS(B->C), so OutS(C)=OutS(C0)+OutS(A->C)+OutS(B->C), with 0≤OutS(C)≤10.99.
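The three input steps can be replayed as a small simulation. The first term mirrors calculation formula (5) with InW(..)=OutW(..)=1 and OutW(NL0)=10; the 0.88 (A->B) and 0.088 (A->C) coefficients are the example's values, while 0.55 for B->C is a further assumption.

```python
def feed_phrase(phrase, link_coeff, in_s=1.0, base_gain=10.0):
    """Feed the characters of a short sentence one by one and return the
    intensity seen at the outlet linker after each character.

    Each already-activated neuron at distance m adds its penetration
    contribution at the m-th decimal place of the output.
    """
    activated, outputs = [], []
    for n, ch in enumerate(phrase):
        out = in_s * 1.0 * 1.0 * base_gain * 1.0   # OutS of the new neuron
        for i, prev in enumerate(activated):       # penetrate activated ones
            m = n - i
            coeff = link_coeff.get((prev, ch), 0.0)
            out += (int(coeff * 10 ** m) % 10) / 10 ** m
        activated.append(ch)
        outputs.append(out)
    return outputs

# 0.88 (A->B) and 0.088 (A->C) follow the example; 0.55 (B->C) is assumed.
coeffs = {("A", "B"): 0.88, ("A", "C"): 0.088, ("B", "C"): 0.55}
outs = feed_phrase("ABC", coeffs)   # [10.0, 10.8, 10.58]
```

OutS(A)=10 and OutS(B)=10.8 match the walk-through above; OutS(C)=10+0.08+0.5=10.58 under the assumed B->C coefficient.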
The above-mentioned OutW(NLm(Q->Z)) means that when characters Z and Q are m characters apart, the link coefficient from the neuron corresponding to character Q to the neuron corresponding to character Z takes the m-th bit after the decimal point (the remaining bits are complemented by 0), where m=K_Z−K_Q, K_Z characterizing that character Z is the K_Z-th character in the short sentence to be segmented and K_Q characterizing that character Q is the K_Q-th character. For example, for the phrase ABC, let the link coefficient of neuron A to neuron B be 0.88 and the link coefficient of neuron A to neuron C be 0.088; then OutW(NL1(A->B))=0.8 and OutW(NL2(A->C))=0.08.
In summary, the text content is first divided into short sentences according to punctuation marks, spaces and other symbols, and each short sentence is then input word by word, from left to right, into the neural network in the word segmentation recognition word stock. On a neural network that has been reset by the RESET (restore) signal, when the first word is input, the character count n=0 and the neural signal corresponding to the first character T is input. If the outlet linker outputs an ERROR signal, the character T is an abnormal character, and the neural network is reset through the restore signal; if the outlet linker outputs a signal intensity corresponding to T, the character T is an identifiable character. For the n-th character F in the short sentence (n being a positive integer greater than 1), the output signal intensity is obtained by calculating the penetration, by the neural signal corresponding to character F, of the neurons corresponding to the preceding n−1 characters. If that output signal intensity is smaller than a preset threshold while the output signal intensity obtained for the character preceding F is larger than the preset threshold, the word segmentation position lies between character F and its preceding character; likewise, if the output signal intensity obtained for the character preceding F is larger than the output signal intensity corresponding to F, the word segmentation position lies between character F and its preceding character. After one word is segmented, a restore signal is input to reset the neural network, and a new round of word segmentation begins.
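The summarized flow, from splitting on punctuation to cutting when the output intensity drops, can be sketched end to end. `split_phrases`, `segment_phrase` and `toy_score` are illustrative names, and the toy score function stands in for the neural network's outlet-linker intensity.

```python
import re

def split_phrases(text):
    """Split text into short sentences on punctuation marks and spaces."""
    return [p for p in re.split(r"[,.;:!?，。；：！？\s]+", text) if p]

def toy_score(span):
    """Stand-in for the outlet-linker output intensity of a span."""
    vocab = {"AB": 11.0, "CD": 11.0}     # hypothetical known words
    if len(span) == 1:
        return 10.5                      # any single character is recognizable
    return vocab.get(span, 9.0)          # unknown multi-character spans drop

def segment_phrase(phrase, score_fn, threshold=10.2):
    """Greedy segmentation: extend the current word while the output
    intensity stays at or above the threshold and does not drop below the
    previous character's intensity; otherwise cut before the current
    character and reset (the restore/RESET signal)."""
    words, start, prev = [], 0, None
    for i in range(len(phrase)):
        s = score_fn(phrase[start:i + 1])
        if prev is not None and (s < threshold or s < prev):
            words.append(phrase[start:i])     # word ends before character i
            start = i                         # reset: a new word begins here
            s = score_fn(phrase[start:i + 1])
        prev = s
    words.append(phrase[start:])
    return words
```

Under the toy vocabulary, "ABCD" segments into "AB" and "CD": the intensity drops below the threshold at the third character, triggering a cut and a reset.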
It can be understood that each preset threshold can be set by the user according to the actual situation of the word segmentation recognition word stock.
As shown in fig. 12, an embodiment of the present invention provides a word segmentation recognition thesaurus construction apparatus 1200, where the word segmentation recognition thesaurus construction apparatus 1200 includes: a build unit 1201, and a first linker 1202, wherein,
a building unit 1201, configured to perform, for a phrase in the training text set: the method comprises the steps of de-duplicating short sentences, and constructing corresponding neurons for each word in the de-duplicated short sentences, wherein the signal types indicated by the neurons are matched with the words corresponding to the neurons; according to the relative position and occurrence frequency between every two words in the phrase, constructing a link relation between two neurons corresponding to every two words to form a phrase neural network corresponding to the phrase, wherein the link relation indicates a link coefficient and a signal transmission direction;
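As a sketch of what the building unit 1201 does, the snippet below builds neurons and link relations from a set of short sentences. The concrete coefficient formula (co-occurrence count damped by distance) is an assumption, since the text states only that the coefficient is derived from relative position and occurrence frequency.

```python
from collections import defaultdict

def build_phrase_network(phrases):
    """Build one phrase-level network: a neuron per distinct character and
    a directed link per ordered character pair, keyed (q, z, distance)."""
    neurons = set()
    counts = defaultdict(int)
    for phrase in dict.fromkeys(phrases):   # de-duplicate identical phrases
        neurons.update(phrase)
        for i, q in enumerate(phrase):
            for j in range(i + 1, len(phrase)):
                counts[(q, phrase[j], j - i)] += 1
    # Assumed coefficient: frequent pairs approach 1, distant pairs are damped.
    links = {key: round(n / (n + 1) / key[2], 3) for key, n in counts.items()}
    return neurons, links

neurons, links = build_phrase_network(["AB", "AB", "ABC"])
```

The (q, z, distance) key encodes both the signal transmission direction (q before z) and the relative position, matching the two quantities the link relation is said to indicate.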
a first linker 1202 for connecting the neurons constructed by the construction unit 1201 and fusing each phrase neural network constructed by the construction unit 1201 to the word segmentation recognition word stock.
In one embodiment of the invention, the word stock comprises a primary neural network, the first linker 1202 further for performing, for each phrase neural network: linking each neuron in the phrase neural network; traversing each neuron in the phrase neural network; and when the traversing result is that the neurons with the same signal type exist between the main neural network and the phrase neural network, deleting the neurons with the same signal type in the phrase neural network, and connecting the link relation related to the neurons with the same signal type to the main neural network.
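The fusing step performed by the first linker 1202 can be sketched as a merge keyed by signal type. The `max()` rule for coinciding link relations is an illustrative choice; the text says only that the existing coefficient is updated.

```python
def merge_into_main(main_neurons, main_links, phrase_neurons, phrase_links):
    """Fuse a phrase neural network into the main network.

    Because neurons are keyed by signal type, neurons of the same type
    collapse into one; the phrase network's links are re-attached to the
    main network. When the same directed link already exists, keeping the
    larger coefficient is an illustrative update rule.
    """
    main_neurons |= phrase_neurons
    for key, coeff in phrase_links.items():
        main_links[key] = max(main_links.get(key, 0.0), coeff)
    return main_neurons, main_links

merged_neurons, merged_links = merge_into_main(
    {"A", "B"}, {("A", "B", 1): 0.5},
    {"B", "C"}, {("A", "B", 1): 0.7, ("B", "C", 1): 0.4},
)
```

Duplicate neuron B collapses into the existing one, the A->B link coefficient is updated, and the new B->C link is attached.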
In one embodiment of the present invention, the first linker 1202 is further configured to update the link coefficients indicated by the link relationships with the same signal transmission direction on the main neural network according to the link coefficients indicated by the link relationships with the same signal transmission direction when the traversing result is that the link relationships with the same signal transmission direction exist between the main neural network and the phrase neural network.
In one embodiment of the present invention, the first linker 1202 is further configured to obtain a new phrase; for each add word in the newly added phrase, performing: converting the augmentation word into a corresponding neuron; searching a first neuron matched with a neuron corresponding to the increment word on the main neural network, and activating the first neuron; when a first link relation exists between two first neurons, calculating a first link coefficient corresponding to the first link relation by using a preset activation function; and updating the link coefficient indicated by the first link relation by using the calculated first link coefficient.
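The activation-driven update of the first link coefficient can be sketched as below. The text says only that a preset activation function is used, so the saturating update here is an illustrative assumption.

```python
def reinforce(coeff, rate=0.1):
    """Update a first link coefficient after its two neurons are activated
    together. This saturating step is an illustrative stand-in for the
    preset activation function: repeated co-activation pushes the
    coefficient toward, but never past, 1.0."""
    return coeff + rate * (1.0 - coeff)

c = 0.5
for _ in range(3):        # the link is activated by three new phrases
    c = reinforce(c)      # 0.5 -> 0.55 -> 0.595 -> 0.6355
```

A bounded update of this shape keeps frequently co-occurring pairs strongly linked without letting any coefficient grow without limit.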
In one embodiment of the present invention, the first linker 1202 is further configured to set an activation state and an inhibition state for the neuron, wherein the activation state indicates that the neuron is used and the inhibition state indicates that the neuron is not used; when a recovery signal is acquired, the activation state of the neuron is converted to the inhibition state.
In one embodiment of the invention, the neuron is further indicative of signal strength; a first linker 1202 further for searching for a first neuron matching a signal type indicated by a neuron corresponding to the increment word; and when the signal intensity indicated by the neuron corresponding to the increment word is not smaller than a preset threshold value, activating the first neuron.
In one embodiment of the present invention, the word segmentation recognition thesaurus construction apparatus further includes: a calculation unit (not shown in the figure) for performing, for each phrase in the training text set: calculating an md5 code corresponding to the short sentence; judging whether the md5 code is already recorded, if so, ignoring the phrase, otherwise, recording the md5 code, and outputting the phrase to the construction unit.
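The calculation unit's de-duplication step maps directly onto Python's standard `hashlib`; the in-memory set of seen digests is a stand-in for however the md5 codes are actually recorded.

```python
import hashlib

def dedup_phrases(phrases):
    """Yield each phrase whose md5 code has not been recorded before."""
    seen = set()
    for phrase in phrases:
        digest = hashlib.md5(phrase.encode("utf-8")).hexdigest()
        if digest in seen:
            continue              # md5 code already recorded: ignore phrase
        seen.add(digest)          # record the md5 code
        yield phrase              # output the phrase to the building unit

unique = list(dedup_phrases(["AB", "CD", "AB"]))
```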
In one embodiment of the present invention, the word segmentation recognition thesaurus construction apparatus further includes: an updating unit (not shown in the figure) for setting corresponding decay periods for the link relation and the neurons, respectively; when the word segmentation recognition word stock is used, deleting the link relation when the duration of the link relation in the inhibition state reaches the attenuation period corresponding to the link relation; and deleting the neuron and the link relation to which the neuron is linked when the duration of the neuron in the inhibition state reaches the attenuation period corresponding to the neuron.
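The decay-period cleanup can be sketched with last-activation timestamps; the concrete decay periods below are hypothetical values.

```python
def prune(neurons, links, now, neuron_decay=3600.0, link_decay=1800.0):
    """Delete link relations inhibited for at least link_decay seconds,
    then neurons inhibited for at least neuron_decay seconds together with
    every link attached to them.

    `neurons` maps signal type -> last-activation timestamp; `links` maps
    (q, z) pairs -> last-activation timestamp.
    """
    links = {k: t for k, t in links.items() if now - t < link_decay}
    dead = {s for s, t in neurons.items() if now - t >= neuron_decay}
    neurons = {s: t for s, t in neurons.items() if s not in dead}
    links = {k: t for k, t in links.items()
             if k[0] not in dead and k[1] not in dead}
    return neurons, links

kept_neurons, kept_links = prune(
    {"A": 9000.0, "B": 5000.0},
    {("A", "B"): 9500.0, ("A", "C"): 8000.0},
    now=10000.0,
)
```

Here neuron B has been inhibited past its decay period, so it is deleted together with the still-fresh A->B link attached to it, while the stale A->C link is deleted on its own.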
In one embodiment of the present invention, the word segmentation recognition thesaurus construction apparatus further includes: an updating unit (not shown in the figure) further configured to calculate a second link coefficient corresponding to the current time according to the current time, the last activated time corresponding to the current time, and a preset decay function; and updating the link coefficient indicated by the link relation by using the calculated second link coefficient.
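The decay function itself is not specified, so the exponential half-life decay below is purely illustrative of how a second link coefficient could be computed from the current time and the last activated time.

```python
import math

def decayed_coefficient(coeff, now, last_activated, half_life=86400.0):
    """Second link coefficient computed from the current time, the last
    activated time and a decay function; the exponential form and the
    one-day half-life are illustrative assumptions."""
    elapsed = max(0.0, now - last_activated)
    return coeff * math.exp(-math.log(2.0) * elapsed / half_life)
```

A link that has not been activated for one half-life keeps half its coefficient, so rarely used links weaken gradually instead of being deleted outright.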
In one embodiment of the present invention, the word segmentation recognition thesaurus construction apparatus further includes: an updating unit (not shown in the figure) further configured to update the link coefficient indicated by each of the second link relationships with a preset link constant.
In one embodiment of the present invention, the word segmentation recognition thesaurus construction apparatus further includes: an updating unit (not shown in the figure), further configured to perform, for at least one third link relation corresponding to a deterministic non-word sequence: deleting the at least one third link relation.
The first linker is divided into an inlet first linker and an outlet first linker, wherein the inlet first linker is arranged at a neural signal inlet, the outlet first linker is arranged at a neural signal outlet, and the inlet first linker is connected with neurons in a neural network in a word segmentation recognition word stock.
As shown in fig. 13, an embodiment of the present invention provides a chinese word segmentation apparatus 1300, where the chinese word segmentation apparatus 1300 includes: a conversion unit 1301 and a second linker 1302, wherein,
a conversion unit 1301, configured to execute, for each word to be segmented phrase in the word to be segmented text: converting each word to be segmented in the short sentence to be segmented into corresponding neurons to be segmented;
a second linker 1302, configured to search for a matched neuron that is matched with each neuron to be segmented in a word segmentation recognition word library, where the word segmentation recognition word library includes a plurality of neurons and a link relationship between the neurons;
the second linker 1302 is further configured to determine a word segmentation position of the short sentence to be segmented according to the position sequence of each word to be segmented in the short sentence to be segmented and the searched linking relationship of every two matched neurons.
In one embodiment of the present invention, the second linker 1302 is further configured to sequentially search the matched neurons matched with each of the neurons to be segmented according to the position sequence of each of the words to be segmented in the phrase to be segmented, and convert the matched neurons from the suppression state to the activation state.
In one embodiment of the present invention, the second linker 1302 is further configured to perform, for each neuron to be divided: determining a first output signal strength of the neuron to be divided; when the word segmentation recognition word stock has neurons in an activated state, penetrating the neurons in the activated state by using the neurons to be segmented; according to the relative positions of the neurons in the activated state and the neurons to be segmented in the word segmentation phrases, calculating the first penetration signal intensity generated by the neurons in the activated state when the neurons to be segmented penetrate through; calculating first word segmentation signal intensity corresponding to neurons to be segmented according to the first output signal intensity and the first penetration signal intensity; and performing word segmentation according to the first word segmentation signal intensity.
In one embodiment of the present invention, the second linker 1302 is further configured to locate the current position to be segmented and the last word segmentation position corresponding to it; determine the second output signal intensity of the neuron to be segmented corresponding to the first word to be segmented between the current position to be segmented and the last word segmentation position; and, for every two target words to be segmented between the current position to be segmented and the last word segmentation position, perform: determining the second penetration signal strength corresponding to the two target words to be segmented according to the relative positions of the two target words in the short sentence to be segmented and the link coefficients corresponding to the two target words; calculating the second word segmentation signal intensity corresponding to the neuron to be segmented according to the second output signal intensity and the second penetration signal intensity; and judging whether the second word segmentation signal strength meets a preset word segmentation condition, and if so, determining the current position to be segmented as a word segmentation position.
The second linker is divided into an inlet second linker and an outlet second linker, wherein the inlet second linker is arranged at a neural signal inlet, the outlet second linker is arranged at a neural signal outlet, and the inlet second linker is connected with neurons in a neural network in the word segmentation recognition word stock.
Fig. 14 illustrates an exemplary system architecture 1400 to which the word segmentation recognition thesaurus construction method or the word segmentation recognition thesaurus construction apparatus or the chinese word segmentation method or the chinese word segmentation apparatus of the embodiments of the present invention can be applied.
As shown in fig. 14, the system architecture 1400 may include end devices 1401, 1402, 1403, a network 1404, and a server 1405. The network 1404 serves as a medium to provide communications links between the terminal devices 1401, 1402, 1403 and the server 1405. The network 1404 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 1405 through the network 1404 using the terminal devices 1401, 1402, 1403 to receive or send messages, etc. Various communication client applications such as a word segmentation application, a web browser application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc. (by way of example only) may be installed on the terminal devices 1401, 1402, 1403.
The terminal devices 1401, 1402, 1403 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 1405 may be a server providing various services, such as a background management server (merely an example) providing a word segmentation recognition thesaurus construction or a word segmentation support for an information-class website browsed by the user using the terminal devices 1401, 1402, 1403. The background management server may perform analysis and other processes on the received text content and the like to construct a word segmentation recognition word library or perform word segmentation, and feed back a processing result (e.g., a word segmentation result—merely an example) to the terminal device.
It should be noted that, the word segmentation recognition word stock construction method or the chinese word segmentation method provided by the embodiment of the present invention is generally executed by the server 1405, and accordingly, the word segmentation recognition word stock construction device or the chinese word segmentation device is generally disposed in the server 1405.
It should be understood that the number of terminal devices, networks and servers in fig. 14 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 15, there is illustrated a schematic diagram of a computer system 1500 suitable for use in implementing a server of an embodiment of the present invention. The terminal device shown in fig. 15 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiment of the present invention.
As shown in fig. 15, the computer system 1500 includes a Central Processing Unit (CPU) 1501, which can execute various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1502 or a program loaded from a storage section 1508 into a Random Access Memory (RAM) 1503. In the RAM 1503, various programs and data required for the operation of the system 1500 are also stored. The CPU 1501, the ROM 1502 and the RAM 1503 are connected to each other through a bus 1504. An input/output (I/O) interface 1505 is also connected to the bus 1504.
The following components are connected to I/O interface 1505: an input section 1506 including a keyboard, mouse, and the like; an output portion 1507 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker, and the like; a storage section 1508 including a hard disk and the like; and a communication section 1509 including a network interface card such as a LAN card, a modem, or the like. The communication section 1509 performs communication processing via a network such as the internet. A drive 1510 is also connected to the I/O interface 1505 as needed. Removable media 1511, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 1510 as needed so that a computer program read therefrom is mounted into the storage section 1508 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network via the communication portion 1509, and/or installed from the removable medium 1511. The above-described functions defined in the system of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU) 1501.
The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present invention may be implemented in software or in hardware. The described units may also be provided in a processor, for example, described as: a processor includes a build unit and a first linker. Where the names of the units do not constitute a limitation on the unit itself in some cases, for example, a building unit may also be described as "a unit that trains a neural network with training samples".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be present alone without being fitted into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to include: for a phrase in the training text set, performing: the method comprises the steps of de-duplicating short sentences, and constructing corresponding neurons for each word in the de-duplicated short sentences, wherein the signal types indicated by the neurons are matched with the words corresponding to the neurons; according to the relative position and occurrence frequency between every two words in the phrase, constructing a link relation between two neurons corresponding to every two words to form a phrase neural network corresponding to the phrase, wherein the link relation indicates a link coefficient and a signal transmission direction; and fusing each phrase neural network to a word segmentation recognition word stock.
The computer readable medium carries one or more programs which, when executed by a device, cause the device to include: the word segmentation recognition word library comprises a main neural network and a linker linked with neurons in the main neural network; each phrase neural network is fused to a word segmentation recognition thesaurus, comprising: for each phrase neural network, performing: linking each neuron in the phrase neural network to a linker; traversing each neuron in the phrase neural network through a linker; and when the traversing result is that the neurons with the same signal type exist between the main neural network and the phrase neural network, deleting the neurons with the same signal type in the phrase neural network, and connecting the link relation related to the neurons with the same signal type to the main neural network.
The computer readable medium carries one or more programs which, when executed by a device, cause the device to include: acquiring a new short sentence; for each add word in the newly added phrase, performing: converting the augmentation word into a corresponding neuron; searching a first neuron matched with a neuron corresponding to the increment word through a linker on a main neural network, and activating the first neuron; when a first link relation exists between two first neurons, calculating a first link coefficient corresponding to the first link relation by using a preset activation function; and updating the link coefficient indicated by the first link relation by using the calculated first link coefficient.
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be present alone without being fitted into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to include: executing, for each word to be segmented phrase in the word to be segmented text: converting each word to be segmented in the short sentence to be segmented into corresponding neurons to be segmented; searching matched neurons matched with each neuron to be segmented in a word segmentation recognition word stock, wherein the word segmentation recognition word stock comprises a plurality of neurons and a link relation among the neurons; and determining the word segmentation position of the short sentence to be segmented according to the position sequence of each word to be segmented in the short sentence to be segmented and the searched link relation of every two matched neurons.
The computer readable medium carries one or more programs which, when executed by a device, cause the device to include: according to the position sequence of each word to be segmented in the word to be segmented phrase, sequentially searching matched neurons matched with each neuron to be segmented, and converting the matched neurons from a suppression state to an activation state; for each neuron to be divided, performing: determining a first output signal strength of the neuron to be divided; when the word segmentation recognition word stock has neurons in an activated state, the neurons in the activated state are penetrated by the neurons to be segmented; according to the relative positions of the neurons in the activated state and the neurons to be segmented in the segmentation phrases, calculating the first penetration signal intensity generated by the neurons in the activated state of penetration of the neurons to be segmented; determining word segmentation positions of short sentences to be segmented comprises the following steps: calculating first word segmentation signal intensity corresponding to neurons to be segmented according to the first output signal intensity and the first penetration signal intensity; and performing word segmentation according to the first word segmentation signal intensity.
The computer readable medium carries one or more programs which, when executed by a device, cause the device to include: positioning the current word segmentation position and the last word segmentation position corresponding to the current word segmentation position; determining the second output signal intensity of a neuron to be segmented corresponding to a first word to be segmented between the current word segmentation position and the last word segmentation position; for every two target to-be-segmented words between the current to-be-segmented word position and the last segmented word position, executing: determining second penetration signal strength corresponding to the two target words to be separated according to the relative positions of the two target words to be separated in the word to be separated short sentence and the link coefficients corresponding to the two target words to be separated; calculating second word segmentation signal intensity corresponding to the neuron to be segmented according to the second output signal intensity and the second penetration signal intensity; judging whether the second word segmentation signal strength meets a preset word segmentation condition, and if so, determining the current position to be segmented as a word segmentation position.
According to the technical scheme of the embodiments of the present invention, a neuron is constructed for each word in the short sentences of the training text set, and link relations between neurons are constructed according to the relative position and occurrence frequency of every two words. The word segmentation recognition word stock therefore contains neurons carrying words and the link relations between words, rather than stored words or phrases, and when word segmentation is performed on the basis of this word stock, words are segmented according to the link relations between words. Compared with a word stock that stores words or phrases, a word stock built from words and link relations can embody a larger vocabulary through those link relations, and the word segmentation process is correspondingly based on words and link relations rather than being restricted to stored words or phrases. Therefore, the scheme provided by the embodiments of the present invention can effectively improve both the vocabulary of the word stock and the word segmentation accuracy.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (19)

1. The word stock construction method for word segmentation recognition is characterized by comprising the following steps:
for a phrase in the training text set, performing:
the method comprises the steps of de-duplicating a phrase, and constructing a corresponding neuron for each word in the de-duplicated phrase, wherein the signal type indicated by the neuron is matched with the word corresponding to the neuron;
according to the relative position and occurrence frequency between every two words in the phrase, constructing a link relation between two neurons corresponding to every two words to form a phrase neural network corresponding to the phrase, wherein the link relation indicates a link coefficient and a signal transmission direction, and fusing the phrase neural networks to form a word segmentation recognition word stock;
wherein,
the link coefficient is a value calculated according to the relative position between two words and the frequency of the two words appearing in a short sentence at the same time;
the signal transmission direction refers to the sequence between two words.
2. The method for constructing a word stock for word segmentation according to claim 1, wherein,
the word segmentation recognition word stock comprises a main neural network and a linker linked with neurons in the main neural network;
the fusing of each of the phrase neural networks includes:
for each of the phrase neural networks, performing:
linking each of the neurons in the phrase neural network to the linker;
traversing each of the neurons in the phrase neural network by the linker;
and when the traversal result is that neurons with the same signal type exist between the main neural network and the phrase neural network, deleting the neurons with the same signal type from the phrase neural network and connecting the link relations related to those neurons to the main neural network.
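A hedged sketch of the fusion step of claim 2, representing each network as a plain dict; how two coefficients for the same link are combined is not stated in the claim, so taking the maximum here is an assumption:

```python
# Sketch of fusing a phrase network into the main network (claim 2).
# Networks are dicts: {'neurons': set of chars, 'links': {(a, b): coeff}}.
def fuse(main_net, phrase_net):
    # Neurons with the same signal type (same character) collapse in the union.
    main_net['neurons'] |= phrase_net['neurons']
    # Re-attach the phrase network's links to the main network; combining
    # duplicate links with max() is an assumption, not claim language.
    for pair, coeff in phrase_net['links'].items():
        main_net['links'][pair] = max(main_net['links'].get(pair, 0.0), coeff)
    return main_net
```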
3. The method for constructing a word stock for word segmentation according to claim 2, wherein,
when the result of the traversal is that the link relation with the same signal transmission direction exists between the main neural network and the phrase neural network,
and updating the link coefficients indicated by the link relation with the same signal transmission direction on the main neural network according to the link coefficients indicated by the link relation with the same signal transmission direction.
4. The method of claim 2, further comprising:
acquiring a newly added short sentence;
for each increment word in the newly added short sentence, performing:
converting the increment word into a corresponding neuron;
searching a first neuron matched with a neuron corresponding to the increment word through the linker on the main neural network, and activating the first neuron;
when a first link relation exists between the two first neurons, calculating a first link coefficient corresponding to the first link relation by using a preset activation function;
and updating the link coefficient indicated by the first link relation by using the calculated first link coefficient.
5. The method of claim 4, further comprising:
setting an activation state and an inhibition state for the neuron, wherein the activation state indicates that the neuron is used and the inhibition state indicates that the neuron is not used;
and when a recovery signal is acquired, converting the activation state of the neuron into an inhibition state.
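The two neuron states of claim 5 and the effect of a recovery signal can be sketched minimally (the names below are assumed, not taken from the patent):

```python
from enum import Enum

class NeuronState(Enum):
    ACTIVATED = "activated"   # the neuron is being used
    INHIBITED = "inhibited"   # the neuron is not being used

def on_recovery_signal(state):
    # A recovery signal converts an activated neuron back to inhibited;
    # an already-inhibited neuron is left unchanged.
    return NeuronState.INHIBITED if state is NeuronState.ACTIVATED else state
```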
6. The method for constructing a word stock for word segmentation as set forth in claim 5, wherein,
the neuron is further indicative of signal strength;
the searching the first neuron matched with the neuron corresponding to the increment word comprises the following steps:
searching a first neuron whose signal type matches the signal type indicated by the neuron corresponding to the increment word;
and when the signal intensity indicated by the neuron corresponding to the increment word is not smaller than a preset threshold value, activating the first neuron.
7. The word segmentation recognition word stock construction method according to any one of claims 1 to 6, further comprising:
for each phrase in the training text set, performing:
calculating an md5 code corresponding to the phrase;
and judging whether the md5 code is recorded, if so, ignoring the phrase, otherwise, recording the md5 code, and executing the steps of de-duplicating the phrase and constructing corresponding neurons for each word in the de-duplicated phrase.
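Claim 7's phrase-level de-duplication maps directly onto `hashlib.md5`; a sketch follows, where the in-memory `seen` set stands in for whatever record the implementation keeps:

```python
import hashlib

seen = set()  # stands in for the recorded md5 codes

def is_new_phrase(phrase):
    # Compute the md5 code of the phrase; skip it if already recorded.
    digest = hashlib.md5(phrase.encode("utf-8")).hexdigest()
    if digest in seen:
        return False      # phrase already processed: ignore it
    seen.add(digest)      # record the md5 code, then proceed to training
    return True
```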
8. The word segmentation recognition word stock construction method according to any one of claims 1 to 6, further comprising:
setting corresponding attenuation periods for the link relation and the neurons respectively;
when the word segmentation recognition thesaurus is used,
deleting the link relation when the duration of the link relation in the inhibition state reaches the attenuation period corresponding to the link relation;
and deleting the neuron and the link relation to which the neuron is linked when the duration of the neuron in the inhibition state reaches the attenuation period corresponding to the neuron.
9. The method of claim 8, further comprising:
calculating a second link coefficient corresponding to the current time according to the current time, the last activated time corresponding to the current time and a preset decay function;
and updating the link coefficient indicated by the link relation by using the calculated second link coefficient.
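Claim 9 leaves the decay function unspecified; an exponential decay over the time elapsed since the link was last activated is one plausible choice. The functional form and the default half-life are assumptions, not claim language:

```python
import math

def decayed_coefficient(coeff, now, last_activated, half_life=86400.0):
    # Second link coefficient: the stored coefficient decayed by the time
    # (in seconds) since the link relation was last activated. With the
    # default half_life, a coefficient halves after one day of inactivity.
    elapsed = now - last_activated
    return coeff * math.exp(-math.log(2.0) * elapsed / half_life)
```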
10. The word segmentation recognition word stock construction method according to any one of claims 1 to 6, further comprising:
executing, for at least one second link relation corresponding to the deterministic word sequence:
and updating the link coefficient indicated by each second link relation by using a preset link constant.
11. The word segmentation recognition word stock construction method according to any one of claims 1 to 6, further comprising:
executing, for at least one third link relation corresponding to the deterministic non-word sequence:
and deleting the at least one third link relation.
12. A method for Chinese word segmentation, comprising:
executing, for each word to be segmented phrase in the word to be segmented text:
Converting each word to be segmented in the short sentence to be segmented into corresponding neurons to be segmented;
searching matched neurons matched with each neuron to be segmented in a word segmentation recognition word stock, wherein the word segmentation recognition word stock is constructed by the word segmentation recognition word stock construction method according to any one of claims 1 to 11, and the word segmentation recognition word stock comprises a plurality of neurons and a link relation among the neurons;
and determining the word segmentation position of the short sentence to be segmented according to the position sequence of each word to be segmented in the short sentence to be segmented and the searched link relation of every two matched neurons.
13. The method of claim 12, wherein the searching for a matching neuron that matches each of the neurons to be segmented comprises:
and according to the position sequence of each word to be segmented in the word to be segmented phrase, sequentially searching matched neurons matched with each neuron to be segmented, and converting the matched neurons from an inhibition state to an activation state.
14. The method of Chinese word segmentation as set forth in claim 13, further comprising:
for each of the neurons to be segmented, performing:
determining the first output signal intensity of the neuron to be segmented;
when a neuron in an activated state exists in the word segmentation recognition word stock, penetrating the neuron in the activated state with the neuron to be segmented;
according to the relative positions, in the short sentence to be segmented, of the neuron in the activated state and the neuron to be segmented, calculating the first penetration signal intensity generated by the neuron in the activated state when penetrated;
the determining the word segmentation position of the short sentence to be segmented comprises the following steps:
calculating first word segmentation signal intensity corresponding to the neuron to be segmented according to the first output signal intensity and the first penetration signal intensity;
and performing word segmentation according to the first word segmentation signal intensity.
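The scoring loop of claim 14 can be sketched as follows. Combining output and penetration intensities by simple addition, and weighting penetration by inverse distance, are assumptions; the claim only requires that both quantities contribute to the word segmentation signal:

```python
def segmentation_signal(phrase, links, base_intensity=1.0):
    # For each character (candidate neuron), add its own output intensity to
    # the penetration intensity contributed by every earlier, already
    # activated character, weighted by their distance in the phrase.
    signals = []
    for j, ch in enumerate(phrase):
        penetration = sum(
            links.get((phrase[i], ch), 0.0) / (j - i) for i in range(j)
        )
        signals.append(base_intensity + penetration)
    return signals
```

With links trained so that ("中", "文") and ("分", "词") each have coefficient 1.0, `segmentation_signal("中文分词", links)` yields `[1.0, 2.0, 1.0, 2.0]`: the dip at "分" marks a plausible cut between "中文" and "分词".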
15. The method of Chinese word segmentation according to claim 12, wherein said determining the word segmentation position of the short sentence to be segmented comprises:
positioning a current word segmentation position and a last word segmentation position corresponding to the current word segmentation position;
determining the second output signal intensity of neurons to be segmented corresponding to a first word to be segmented between the current word segmentation position and the last word segmentation position;
executing, for each two target to-be-segmented words between the current to-be-segmented word position and the last segmented word position:
determining the second penetration signal intensities corresponding to the two target words to be segmented according to the relative positions of the two target words in the short sentence to be segmented and the link coefficients corresponding to the two target words;
calculating second word segmentation signal intensity corresponding to the neuron to be segmented according to the second output signal intensity and the second penetration signal intensity;
judging whether the second word segmentation signal strength meets a preset word segmentation condition, and if so, determining the current position to be segmented as a word segmentation position.
16. The word segmentation recognition word stock construction device is characterized by comprising: a building unit and a first linker, wherein,
the building unit is configured to perform, for a phrase in the training text set: the method comprises the steps of de-duplicating a phrase, and constructing a corresponding neuron for each word in the de-duplicated phrase, wherein the signal type indicated by the neuron is matched with the word corresponding to the neuron; according to the relative position and occurrence frequency between every two words in the phrase, constructing a link relation between two neurons corresponding to every two words to form a phrase neural network corresponding to the phrase, wherein the link relation indicates a link coefficient and a signal transmission direction;
the first linker is used for connecting the neurons constructed by the building unit and fusing the phrase neural networks constructed by the building unit into a word segmentation recognition word stock;
the link coefficient is a value calculated according to the relative position between two words and the frequency of the two words appearing in a short sentence at the same time;
the signal transmission direction refers to the sequence between two words.
17. A Chinese word segmentation apparatus, comprising: a conversion unit and a second linker, wherein,
the conversion unit is configured to execute, for each phrase to be segmented in the text to be segmented, the following steps: converting each word to be segmented in the short sentence to be segmented into corresponding neurons to be segmented;
the second linker is configured to search a word segmentation recognition word library for a matched neuron that is matched with each neuron to be segmented, where the word segmentation recognition word library is constructed by the word segmentation recognition word library construction method according to any one of claims 1 to 11, and the word segmentation recognition word library includes a plurality of neurons and a link relationship between the neurons;
the second linker is further configured to determine a word segmentation position of the short sentence to be segmented according to a position sequence of each word to be segmented in the short sentence to be segmented and the searched link relationship of each two matched neurons.
18. A word segmentation recognition thesaurus construction server, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1 to 15.
19. A computer readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any one of claims 1-15.
CN201911288705.7A 2019-12-12 2019-12-12 Word segmentation recognition word stock construction method, chinese word segmentation method and Chinese word segmentation device Active CN111178065B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911288705.7A CN111178065B (en) 2019-12-12 2019-12-12 Word segmentation recognition word stock construction method, chinese word segmentation method and Chinese word segmentation device

Publications (2)

Publication Number Publication Date
CN111178065A CN111178065A (en) 2020-05-19
CN111178065B true CN111178065B (en) 2023-06-27

Family

ID=70652028


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779207A (en) * 2020-12-03 2021-12-10 北京沃东天骏信息技术有限公司 Visual angle layering method and device for dialect text

Citations (4)

Publication number Priority date Publication date Assignee Title
CN101458694A (en) * 2008-10-09 2009-06-17 浙江大学 Chinese participle method based on tree thesaurus
CN102880703A (en) * 2012-09-25 2013-01-16 广州市动景计算机科技有限公司 Methods and systems for encoding and decoding Chinese webpage data
CN105528420A (en) * 2015-12-07 2016-04-27 北京金山安全软件有限公司 Character encoding and decoding method and device and electronic equipment
CN109992766A (en) * 2017-12-29 2019-07-09 北京京东尚科信息技术有限公司 The method and apparatus for extracting target word

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US8977624B2 (en) * 2010-08-30 2015-03-10 Microsoft Technology Licensing, Llc Enhancing search-result relevance ranking using uniform resource locators for queries containing non-encoding characters


Non-Patent Citations (4)

Title
Yan Niu et al. An Improved Chinese Segmentation Algorithm Based on New Dictionary Construction. 2009 International Conference on Computational Science and Engineering, 2009, full text. *
Wu Jianyuan. Research on a Chinese Word Segmentation Algorithm Based on BP Neural Networks. Journal of Foshan University (Natural Science Edition), 2012, No. 2, full text. *
Zhou Chengyuan; Zhu Min; Yang Yun. Research on Dictionary-Based Chinese Word Segmentation Algorithms. Computer and Digital Engineering, 2009, No. 3, full text. *
Wang Jian; Zhao Hengyong. Implementation and Research of Chinese Word Segmentation Algorithms for Specialized Search Engines. Fujian Computer, 2005, No. 7, full text. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20220930
Address after: 12 / F, 15 / F, 99 Yincheng Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 200120
Applicant after: Jianxin Financial Science and Technology Co.,Ltd.
Address before: 25 Financial Street, Xicheng District, Beijing 100033
Applicant before: CHINA CONSTRUCTION BANK Corp.
Applicant before: Jianxin Financial Science and Technology Co.,Ltd.
GR01 Patent grant