Detailed description of the embodiments
Embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the following embodiments are not intended to limit the present invention, and that not all combinations of the aspects described in the following embodiments are necessarily required for the means by which the present invention solves its problems. For simplicity, identical structural parts or steps are given identical reference labels or numerals, and their description is not repeated.
[Hardware configuration of the information processing unit]
First, the hardware configuration of the information processing unit 1000 is described with reference to Fig. 1. The following configuration is described as an example in the present embodiment, but the information processing unit of the present invention is not limited to the configuration shown in Fig. 1.
Fig. 1 is a diagram showing the hardware configuration of the information processing unit 1000 in the present embodiment. In the present embodiment, a smartphone is described as an example of the information processing unit. Note that although a mobile terminal (including but not limited to a smartphone, smartwatch, smart bracelet, or music player) is illustrated as the information processing unit 1000 in the present embodiment, the invention is of course not limited to this. The information processing unit of the present invention may be any of various devices, such as a laptop, a tablet computer, a PDA (personal digital assistant), a PC, or an internet-connected device having a touch display screen and an information processing function (for example, a digital camera, a refrigerator, or a television set).
As shown in Fig. 1, the information processing unit 1000 (2000, 3000) includes an input interface 102, a CPU 103, a ROM 104, a RAM 105, an external memory 106, an output interface 107, a display 108, a communication unit 109, and a short-range wireless communication unit 110, which are connected to one another via a system bus. The input interface 102 is an interface for receiving data input by the user and instructions to execute functions; it receives the data and operation instructions that the user inputs via an operating unit (not shown) such as keys, buttons, or a touch screen. Note that the display 108 described later and the operating unit may be at least partially integrated, for example in a configuration in which screen output and reception of user operations are performed on the same screen.
The CPU 103 is a system control unit and performs overall control of the information processing unit 1000. For example, the CPU 103 performs display control of the display 108 of the information processing unit 1000. The ROM 104 stores fixed data such as data tables, as well as the control programs and the operating system (OS) program executed by the CPU 103. In the present embodiment, each control program stored in the ROM 104 performs software execution control such as scheduling, task switching, and interrupt handling under the management of the OS stored in the ROM 104.
The RAM 105 (internal storage unit) is constructed of, for example, an SRAM (static random access memory) that requires a backup power source, a DRAM, or the like. In this case, the RAM 105 can store important data such as program control variables in a non-volatile manner. A storage area for storing setting information of the information processing unit 1000, management data of the information processing unit 1000, and the like is also provided in the RAM 105. In addition, the RAM 105 is used as the working memory and main memory of the CPU 103.
The external memory 106 stores, for example, a predefined dictionary, a sequence labeling model, and an application program for executing the word segmentation method according to the present invention. The external memory 106 also stores various programs, such as an information transmission/reception control program for transmitting to and receiving from a communication device (not shown) via the communication unit 109, and the various information used by these programs.
The output interface 107 is an interface for controlling the display 108 so that it displays information and the display screens of application programs. The display 108 is constructed of, for example, an LCD (liquid crystal display). By arranging on the display 108 a soft keyboard having keys such as numeric input keys, a mode setting key, an enter key, a cancel key, and a power key, input from the user can be received via the display 108.
The information processing unit 1000 performs data communication with an external device (not shown) via the communication unit 109 by a wireless communication method such as Wi-Fi (Wireless Fidelity) or Bluetooth.
In addition, the information processing unit 1000 can also connect wirelessly to an external device or the like within a short range via the short-range wireless communication unit 110 and perform data communication. The short-range wireless communication unit 110 communicates by a communication method different from that of the communication unit 109. For example, Bluetooth Low Energy (BLE), whose communication range is shorter than that of the communication method of the communication unit 109, can be used as the communication method of the short-range wireless communication unit 110. Alternatively, NFC (near-field communication) or Wi-Fi Aware, for example, may be used as the communication method of the short-range wireless communication unit 110.
[First embodiment]
Next, the software configuration of the information processing unit according to the first embodiment is described with reference to Fig. 2.
As shown in Fig. 2, the information processing unit 1000 includes: a selecting unit 1101 that segments a segmentation target (for example, a sentence input by the user via the touch screen) and obtains a word segmentation result expressed as a combination of multiple words; a first concatenation unit 1102 that splices adjacent words in the combination; a sequence labeling unit 1103 that uses a sequence labeling model to perform sequence labeling on each word in the combination spliced by the first concatenation unit 1102 and merges words in the combination according to the result of the sequence labeling; and a second concatenation unit 1104 that splices, according to a predetermined rule, the words in the combination merged by the sequence labeling unit 1103. The sequence labeling unit 1103 includes: an extraction section 11031 that extracts words of a predefined type from the words in the combination spliced by the first concatenation unit 1102; a prediction section 11032 that predicts the corresponding word segmentation result of the extracted words according to the predefined type; a selection section 11033 that selects a word segmentation result from the predicted word segmentation results; and a merging section 11034 configured to merge words in the combination according to the word segmentation result selected by the selection section 11033.
Next, the word segmentation method according to the first embodiment of the present invention is described with reference to Fig. 3.
As shown in Fig. 3, the word segmentation method may include the following steps S101 to S104:
In step S101, the sentence to be segmented is obtained and matched against the words in the segmentation dictionary, all matching word combinations are taken out, and the combination with the highest score under the segmentation strategy is determined. The segmentation strategy includes term weighting and language model scoring.
Next, in step S102, adjacent words in the word segmentation result are spliced together; if the spliced result appears in the dictionary, it replaces the original word segmentation result.
Then, in step S103, the word segmentation result generated in the previous step is input to the sequence labeling model for sequence labeling, the output of the sequence labeling model is screened, and partial fragments of the previous step's word segmentation result are merged according to the screened result. Processing then proceeds to step S104.
In step S104, common collocations in the word segmentation result generated in the previous step, such as numeral-classifier phrases, dates, times, and alphabetic expressions, are spliced, and the spliced result is taken as the final word segmentation result.
The word segmentation process is illustrated below using the following sentence as an example: "On January 29, 2016, district leaders Wu Guiying, Wang Hao, Chen Hongzhi, Chen Tao, Gan Jingzhong, Liu Junsheng, and Sun Qijun visited central units stationed in the district, such as the Ministry of Foreign Affairs and the People's Daily agency."
In step S101, basic segmentation is performed: the segmentation target is split into multiple words in different ways, forming multiple word combinations. Each word in each combination is matched against the words in the segmentation dictionary, and all matching word combinations are taken out.
For example, the sentence is split into the following combinations:
(1) 2016, year, 1, month, 29, day, district, leader, Wu, Guiying, Wang, Hao, Chen Hongzhi, Chen, Tao, Gan, Jing, Zhong, Liu Jun, Sheng, Sun, Qi, Jun, visit, Foreign Affairs, Ministry, People's Daily, agency, etc., stationed-in-district, central, unit;
(2) 2016, year, 1, month, 29, day, district, leader, Wu Gui, Ying, Wang, Hao, Chen Hongzhi, Chen, Tao, Gan, Jing, Zhong, Liu, Junsheng, Sun, Qi, Jun, visited, Ministry of Foreign Affairs People's, Daily agency etc., stationed-in-district, central unit;
(3) 2016, year, 1, month, 29, day, district, leader, Wu, Guiying, Wang, Hao, Chen Hong, Zhi, Chen, Tao, Gan, Jing, Zhong, Liu Jun, Sheng, Sun, Qi, Jun, visited, Ministry of Foreign Affairs, People's, Daily agency, etc., stationed-in-district, central, unit.
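The enumeration of candidate combinations in step S101 can be sketched as a recursive dictionary match. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `enumerate_segmentations` and the toy dictionary are hypothetical, and short Latin strings stand in for an unspaced sentence.

```python
from typing import List, Set


def enumerate_segmentations(sentence: str, dictionary: Set[str]) -> List[List[str]]:
    """Return every way to split `sentence` into consecutive dictionary words."""
    results: List[List[str]] = []

    def extend(start: int, prefix: List[str]) -> None:
        if start == len(sentence):
            # The whole sentence is covered: record one complete combination.
            results.append(list(prefix))
            return
        for end in range(start + 1, len(sentence) + 1):
            word = sentence[start:end]
            if word in dictionary:
                prefix.append(word)
                extend(end, prefix)
                prefix.pop()

    extend(0, [])
    return results


# Hypothetical toy dictionary; "abc" plays the role of an unspaced sentence.
toy_dictionary = {"a", "ab", "b", "bc", "c"}
combinations = enumerate_segmentations("abc", toy_dictionary)
for combination in combinations:
    print(combination)
```

Exhaustive enumeration grows quickly with sentence length, so a practical system would likely prune or use dynamic programming; the sketch only shows what "all matching combinations" means.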
Then the score of each combination is calculated according to the segmentation strategy, and the combination with the highest score is selected. The segmentation strategy includes, but is not limited to, term weighting and language model scoring. For the vocabulary used in this step, when the vocabulary is built, the frequency with which each word occurs in the corpus is saved along with the word itself. The term weight is the sum of the word frequencies of all words in the current combination.
The score calculation is illustrated below with a simpler example. In the combination "I", "love", "Beijing Tian'anmen", the word frequency of "I" is 130132, that of "love" is 74150, and that of "Beijing Tian'anmen" is 5924, so the term weight of the combination is 210206. The term weights of the multiple combinations are then normalized: each normalized term weight is the current combination's term weight (the numerator) divided by the highest term weight among all combinations (the denominator). Next, the bigram language model probability of the whole combination is used as the language model score. Finally, the language model score is multiplied by the term weight score to give the final score.
Calculating combinations (1) to (3) with the term weighting strategy gives the following scores:
Score of combination (1): 0.7411.
Score of combination (2): 1.0.
Score of combination (3): 0.8951.
Calculating each combination with the language model scoring strategy gives the following scores:
Score of combination (1): 0.9013.
Score of combination (2): 0.7542.
Score of combination (3): 0.9631.
The final scores of combinations (1) to (3) are 0.6680, 0.7542, and 0.8620, respectively. The combination with the highest score, combination (3), is selected and passed to the next step.
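The selection can be checked by multiplying the two score lists reported above; the products agree with the stated final scores to about three decimal places. The variable names are illustrative only.

```python
# Scores reported in the text for combinations (1) to (3).
term_scores = {1: 0.7411, 2: 1.0, 3: 0.8951}
lm_scores = {1: 0.9013, 2: 0.7542, 3: 0.9631}

# Final score = normalized term weight score * language model score.
final_scores = {k: term_scores[k] * lm_scores[k] for k in term_scores}

# Pick the combination with the highest final score.
best = max(final_scores, key=final_scores.get)

for k in sorted(final_scores):
    print(k, round(final_scores[k], 3))
print("selected combination:", best)
```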
In step S102, adjacent words in the word segmentation result of combination (3) are spliced together; for example, "People's" and "Daily agency" are spliced together. The spliced word segmentation result is as follows:
2016, year, 1, month, 29, day, district, leader, Wu, Guiying, Wang, Hao, Chen Hongzhi, Chen, Tao, Gan, Jing, Zhong, Liu Jun, Sheng, Sun, Qi, Jun, visited, Ministry of Foreign Affairs, People's Daily agency, etc., stationed-in-district, central, unit.
If a spliced result exists in the dictionary, the spliced result replaces the original word segmentation result.
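Step S102 can be sketched as a single left-to-right pass that joins two neighbouring words whenever their concatenation is a dictionary entry. This is one possible reading: the text does not say whether more than two words may be joined or whether the pass is repeated, so the pairwise, single-pass behaviour is an assumption. The function name `splice_adjacent` is hypothetical, and the tokens are romanized stand-ins ("renmin" for "People's", "ribaoshe" for "Daily agency", "waijiaobu" for "Ministry of Foreign Affairs", "deng" for "etc.").

```python
from typing import List, Set


def splice_adjacent(tokens: List[str], dictionary: Set[str]) -> List[str]:
    """Join neighbouring tokens whose concatenation appears in the dictionary."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] + tokens[i + 1] in dictionary:
            # The spliced word exists in the dictionary, so it replaces the pair.
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


dictionary = {"renminribaoshe"}  # hypothetical dictionary entry
spliced = splice_adjacent(["waijiaobu", "renmin", "ribaoshe", "deng"], dictionary)
print(spliced)
```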
Step S103 includes steps S1031 to S1034, as shown in Fig. 4.
In step S1031, words related to personal names are extracted from the word segmentation result generated in the previous step, namely "2016, year, 1, month, 29, day, district, leader, Wu, Guiying, Wang, Hao, Chen Hongzhi, Chen, Tao, Gan, Jing, Zhong, Liu Jun, Sheng, Sun, Qi, Jun, visited, Ministry of Foreign Affairs, People's Daily agency, etc., stationed-in-district, central, unit". The extracted words are "Wu, Guiying, Wang, Hao, Chen Hongzhi, Chen, Tao, Gan, Jing, Zhong, Liu Jun, Sheng, Sun, Qi, Jun".
In step S1032, the word segmentation result of the extracted words is predicted according to whether they are words related to personal names:
Wu, Guiying: Wu Guiying
Wang, Hao: Wang Hao
Chen, Tao: Chen Tao
Gan, Jing, Zhong: Gan Jingzhong
Liu Jun, Sheng: Liu Junsheng
Sun, Qi, Jun: Sun Qijun.
In step S1033, the output of the sequence labeling model is screened to remove results that are clearly not personal names. For example, for some segmentation targets, the sequence labeling result may mistakenly label a phrase such as "the king's man" as a personal name, so the labeling results need further screening.
In step S1034, partial fragments of the previous step's word segmentation result are merged according to the screened result. The word segmentation result obtained after merging is as follows:
2016, year, 1, month, 29, day, district, leader, Wu Guiying, Wang Hao, Chen Hongzhi, Chen Tao, Gan Jingzhong, Liu Junsheng, Sun Qijun, visited, Ministry of Foreign Affairs, People's Daily agency, etc., stationed-in-district, central, unit.
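The screen-then-merge logic of steps S1033 and S1034 can be sketched as below, assuming the sequence labeling model has already produced one tag per token. The B/M/E/O tag set (begin, middle, end, outside), the function name `merge_by_tags`, and the `keep` predicate are assumptions made for illustration; the text does not specify the tag scheme for this step or the screening rule. Tokens are romanized stand-ins ("qu" for "district", "lingdao" for "leader").

```python
from typing import Callable, List


def merge_by_tags(
    tokens: List[str],
    tags: List[str],
    keep: Callable[[str], bool] = lambda name: True,
) -> List[str]:
    """Merge B..E runs into one word unless the screening predicate rejects them."""
    out: List[str] = []
    buffer: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            out.extend(buffer)  # flush any unfinished run unchanged
            buffer = [token]
        elif tag == "M" and buffer:
            buffer.append(token)
        elif tag == "E" and buffer:
            buffer.append(token)
            candidate = "".join(buffer)
            if keep(candidate):
                out.append(candidate)  # accepted as a name: merge
            else:
                out.extend(buffer)  # screened out: leave tokens as they were
            buffer = []
        else:
            out.extend(buffer)
            buffer = []
            out.append(token)
    out.extend(buffer)
    return out


tokens = ["qu", "lingdao", "Wu", "guiying", "Wang", "hao"]
tags = ["O", "O", "B", "E", "B", "E"]  # hypothetical model output
merged = merge_by_tags(tokens, tags)
print(merged)

# A hypothetical screen that rejects one candidate leaves its tokens unmerged.
screened = merge_by_tags(tokens, tags, keep=lambda name: name != "Wanghao")
print(screened)
```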
In step S104, common collocations in the word segmentation result generated in the previous step are spliced; for example, "2016, year, 1, month, 29, day" is spliced into "January 29, 2016". Common collocations include numeral-classifier phrases, dates, times, and textual expressions. The spliced result is taken as the final word segmentation result. In this example, the spliced result is: "January 29, 2016", district, leader, Wu Guiying, Wang Hao, Chen Hongzhi, Chen Tao, Gan Jingzhong, Liu Junsheng, Sun Qijun, visited, Ministry of Foreign Affairs, People's Daily agency, etc., stationed-in-district, central, unit.
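The date case of step S104 can be sketched as a token-level rule that joins a number-unit sequence for year, month, and day. Only full year-month-day dates are handled here; the other collocation types named in the text (numeral-classifier phrases, times, and so on) would need their own rules. The function name `splice_dates` is hypothetical, and "nian", "yue", and "ri" are romanized stand-ins for the "year", "month", and "day" tokens.

```python
from typing import List

DATE_UNITS = ("nian", "yue", "ri")  # stand-ins for year, month, day


def splice_dates(tokens: List[str]) -> List[str]:
    """Join '<digits> nian <digits> yue <digits> ri' runs into a single token."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        j = i
        for unit in DATE_UNITS:
            if j + 1 < len(tokens) and tokens[j].isdigit() and tokens[j + 1] == unit:
                j += 2
            else:
                break
        if j - i == 2 * len(DATE_UNITS):
            # A complete year-month-day run was matched: emit it as one token.
            out.append("".join(tokens[i:j]))
            i = j
        else:
            out.append(tokens[i])
            i += 1
    return out


final_tokens = splice_dates(["2016", "nian", "1", "yue", "29", "ri", "qu", "lingdao"])
print(final_tokens)
```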
According to the present invention, splicing, labeling, and then splicing the word segmentation result again increases the granularity of the word segmentation result.
[Second embodiment]
In the first embodiment, the dictionary and the sequence labeling model are read by loading the program from the RAM 105 and running it there. In the second embodiment, by contrast, the sequence labeling unit performs the sequence labeling processing in the external memory 106.
An information processing unit in the prior art generally includes memory (an internal storage unit) such as RAM, and external storage (an external memory) such as an SD card or a hard disk. RAM is typically used to run application programs, while external storage is typically used to store databases and application programs. With the conventional technique, the sequence labeling model is stored in external storage, and the program corresponding to the sequence labeling model runs in memory. As a result, a mobile phone occupies more memory during word segmentation and processes more slowly. In the present embodiment, the sequence labeling model is likewise stored in external storage, but the corresponding program also runs in external storage.
The word segmentation processing in the second embodiment is described below with reference to Fig. 5 and Fig. 6. Fig. 5 illustrates a functional block diagram of the information processing unit according to the second embodiment of the present invention.
As shown in Fig. 5, the information processing unit 2000 includes: a word segmentation unit 2102 that segments a segmentation target and obtains a word segmentation result expressed as a combination of multiple words; and a sequence labeling unit 2103 that performs the sequence labeling processing in the external memory. For the word segmentation result obtained by segmenting the segmentation target, the sequence labeling unit 2103 uses the sequence labeling model to perform sequence labeling on the words in the combination and merges words in the combination according to the result of the sequence labeling. The sequence labeling unit 2103 includes: a storage section 21031 that stores the emission probabilities and state probabilities of the sequence labeling model in a first file in the external memory; a calculation section 21032 that performs a hash operation on the feature functions of the words in the combination and stores, as hash values in a second file, each feature function together with the storage location of the emission probability or state probability corresponding to that feature function; an extraction section 21033 that extracts, from the storage locations stored by the calculation section 21032, the probability that adjacent words in the combination form one joint word; and a merging section 21034 configured to splice the words in the combination according to the extracted probabilities.
Fig. 6 illustrates a flowchart of the word segmentation method according to the second embodiment of the present invention.
Referring to Fig. 6, the word segmentation method according to the second embodiment of the present invention is described using the segmentation target "I love Beijing Tian'anmen" as an example.
In step S201, the sentence is split into: I, love, Bei, Jing, Tian, An, Men (the last five being the individual characters of "Beijing" and "Tian'anmen").
In step S202, sequence labeling is performed on the word segmentation result of step S201. The sequence labeling processing includes steps S2021 to S2024, as shown in Fig. 7.
In the storing step S2021, the original model parameters of the sequence labeling model are divided into three parts for storage: the emission probabilities and state probabilities (first parameters), the feature functions (second parameters), and the feature templates and other parameters (third parameters). The emission probabilities and state probabilities are stored as a separate file (the first file).
In the calculating step S2022, a hash operation is performed on each feature function using the Blizzard hash algorithm, and each feature function, together with the storage location (value) of the emission probability or state probability corresponding to that feature function, is stored as hash values in another binary file (the second file). The feature templates and other parameters are stored as a third file.
Specifically, when the sequence labeling model is positioned at the character "Bei" (the first character of "Beijing"), one of the feature templates is U06:%x[0,0]/%x[1,0]. This template denotes the combination of the current character and the character immediately after it, that is, U06:Bei/Jing. "U06:Bei/Jing" is passed as the variable to the Blizzard hash function, which computes three hash values: a main hash value M, a left hash value L, and a right hash value R. A binary shift operation on the main hash value yields the lookup position (that is, the address of the corresponding feature, such as the address of "Bei" and "Jing" in the file), and the obtained left and right hash values are compared with the left and right hash values stored in advance. If they are identical, the position given by the main hash value is confirmed as the storage location in the second file. If the check is true, the storage location of the emission probability (or state probability) stored there is taken out; if it is false, -1 is returned. If the hash values are not equal, 1 is added to M and the above lookup and comparison are repeated.
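The template expansion and the three-hash lookup can be sketched as follows. Several points are assumptions and not taken from the text: `hashlib.blake2b` with three different personalization strings is used as a stand-in for the three Blizzard hash variants, a modulo replaces the binary shift, an empty slot is treated as the "return -1" case, and the names `expand_template`, `three_hashes`, and `FeatureIndex` are hypothetical. Tokens are romanized, with "bei" and "jing" standing for the two characters of "Beijing".

```python
import hashlib
import re
from typing import List, Optional, Tuple

_MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand_template(template: str, tokens: List[str], pos: int) -> str:
    """Expand %x[row,col] macros; only column 0 (the token itself) is modelled."""
    def replace(match: "re.Match[str]") -> str:
        row = pos + int(match.group(1))
        return tokens[row] if 0 <= row < len(tokens) else "_PAD_"
    return _MACRO.sub(replace, template)


def _hash32(key: str, variant: bytes) -> int:
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=4, person=variant).digest()
    return int.from_bytes(digest, "little")


def three_hashes(key: str) -> Tuple[int, int, int]:
    """Main hash M picks the slot; left/right hashes L and R verify the key."""
    return _hash32(key, b"main"), _hash32(key, b"left"), _hash32(key, b"right")


class FeatureIndex:
    """Open-addressing table mapping a feature string to a file offset."""

    def __init__(self, size: int) -> None:
        self.slots: List[Optional[Tuple[int, int, int]]] = [None] * size

    def insert(self, feature: str, offset: int) -> None:
        m, left, right = three_hashes(feature)
        pos = m % len(self.slots)
        for _ in range(len(self.slots)):
            slot = self.slots[pos]
            if slot is None or slot[:2] == (left, right):
                self.slots[pos] = (left, right, offset)
                return
            pos = (pos + 1) % len(self.slots)  # collision: try M + 1
        raise RuntimeError("index is full")

    def lookup(self, feature: str) -> int:
        m, left, right = three_hashes(feature)
        pos = m % len(self.slots)
        for _ in range(len(self.slots)):
            slot = self.slots[pos]
            if slot is None:
                return -1  # nothing stored here: feature unknown
            if slot[:2] == (left, right):
                return slot[2]  # verified: return the stored offset
            pos = (pos + 1) % len(self.slots)  # mismatch: try M + 1
        return -1


tokens = ["wo", "ai", "bei", "jing", "tian", "an", "men"]
feature = expand_template("U06:%x[0,0]/%x[1,0]", tokens, 2)
index = FeatureIndex(8)
index.insert(feature, 0)
index.insert("U06:tian/an", 32)
print(feature, index.lookup(feature), index.lookup("U06:unknown/feature"))
```

Storing only the two verification hashes and an offset keeps each slot a fixed size, which is what makes it practical to keep the table in a binary file and probe it by position.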
In the extracting step S2023, the probability that adjacent words form one joint word is extracted from the storage locations, stored in step S2022, of the feature functions and their corresponding emission probabilities or state probabilities.
Specifically, the return value (address) of the repeated lookup-and-compare operation in step S2022 is used as the position of the emission probability (or state probability) in the first file, and the value at that position is read. The number of weights or probabilities read equals the number of sequence labeling labels, and each weight represents the probability that "U06:Bei/Jing" (the two characters of "Beijing") occurs jointly under a given label of the current character, for example B. For example, the probability that "Beijing" occurs jointly is 98%, and the probability that "Tian'anmen" occurs jointly is 95%.
In the merging step S2024, the word segmentation result of step S201 is spliced according to the probabilities calculated in step S2023.
Specifically, the word segmentation result of step S201 is: I, love, Bei, Jing, Tian, An, Men. According to the probabilities calculated in step S2023, the probability that "Beijing" forms one joint word is 98%, and the probability that "Tian'anmen" forms one joint word is 95%. It is therefore determined that "Bei" and "Jing" are spliced into "Beijing", and that "Tian", "An", and "Men" are spliced into "Tian'anmen". The word segmentation result finally obtained in step S2024 is: I, love, Beijing, Tian'anmen.
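The merging step can be sketched as joining every candidate span whose joint-word probability reaches a threshold. The threshold value of 0.9, the span representation, and the function name `merge_spans` are assumptions, since the text gives only the two probabilities and the resulting merge. Tokens are romanized stand-ins for the seven tokens of step S201.

```python
from typing import Dict, List, Tuple


def merge_spans(
    tokens: List[str],
    span_probs: Dict[Tuple[int, int], float],
    threshold: float = 0.9,
) -> List[str]:
    """Join token spans [start, end) whose joint-word probability >= threshold."""
    chosen = sorted(span for span, prob in span_probs.items() if prob >= threshold)
    out: List[str] = []
    cursor = 0
    for start, end in chosen:
        if start < cursor:
            continue  # overlaps a span that was already merged
        out.extend(tokens[cursor:start])
        out.append("".join(tokens[start:end]))
        cursor = end
    out.extend(tokens[cursor:])
    return out


tokens = ["wo", "ai", "bei", "jing", "tian", "an", "men"]
span_probs = {(2, 4): 0.98, (4, 7): 0.95}  # probabilities from the text
result = merge_spans(tokens, span_probs)
print(result)
```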
According to the second embodiment of the present invention, sequence labeling is performed in external storage rather than in memory, which reduces the memory footprint on the information processing unit and improves its operating speed.
[Third embodiment]
The hardware configuration of the information processing unit of the third embodiment of the present invention is the same as that of the information processing units of the first and second embodiments. The technical solution of the third embodiment is a combination of the technical solutions of the first and second embodiments. That is, the information processing unit of the third embodiment includes the selecting unit, the first concatenation unit, and the second concatenation unit of the first embodiment, together with the external memory and the sequence labeling unit of the second embodiment.
Fig. 8 illustrates a functional block diagram of the information processing unit according to the third embodiment of the present invention.
As shown in Fig. 8, the information processing unit 3000 includes: a selecting unit 3101 that segments a segmentation target (for example, a sentence input by the user via the touch screen) and obtains a word segmentation result expressed as a combination of multiple words; a first concatenation unit 3102 that splices adjacent words in the combination; a sequence labeling unit 3103 that uses a sequence labeling model to perform sequence labeling on each word in the combination spliced by the first concatenation unit 3102 and merges words in the combination according to the result of the sequence labeling; and a second concatenation unit 3104 that splices, according to a predetermined rule, the words in the combination merged by the sequence labeling unit 3103.
The sequence labeling unit 3103 includes: a storage section 31031 that stores the emission probabilities and state probabilities of the sequence labeling model in a first file in the external memory; a calculation section 31032 that performs a hash operation on the feature functions of the words in the combination and stores, as hash values in a second file, each feature function together with the storage location of the emission probability or state probability corresponding to that feature function; an extraction section 31033 that extracts, from the storage locations stored by the calculation section 31032, the probability that adjacent words in the combination form one joint word; and a merging section 31034 configured to splice the words in the combination according to the extracted probabilities.
The word segmentation method of the third embodiment includes the selecting step, the first splicing step, and the second splicing step of the first embodiment, while the sequence labeling step between the first splicing step and the second splicing step is the sequence labeling step of the second embodiment.
According to the third embodiment of the present invention, an information processing unit with coarse segmentation granularity, low memory usage, and high processing speed can be obtained.
The information processing unit of the present invention achieves the following technical effect: common collocations and semantically meaningful combinations are segmented out as far as possible, so that more meaningful fragments can be conveniently extracted from the word segmentation result.
Although the present invention has been described above with reference to exemplary embodiments, the above embodiments merely illustrate the technical concepts and features of the present invention and are not intended to limit its scope of protection. Any equivalent change or modification made in accordance with the spirit and essence of the present invention shall fall within the scope of protection of the present invention.