CN107291695B - Information processing apparatus and word segmentation processing method thereof

Information processing apparatus and word segmentation processing method thereof

Publication number
CN107291695B
CN107291695B CN201710505392.0A CN201710505392A CN107291695B CN 107291695 B CN107291695 B CN 107291695B CN 201710505392 A CN201710505392 A CN 201710505392A CN 107291695 B CN107291695 B CN 107291695B
Authority
CN
China
Prior art keywords
word
sequence labelling
word segmentation
combination
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710505392.0A
Other languages
Chinese (zh)
Other versions
CN107291695A (en
Inventor
王卓然
亓超
马宇驰
侯兴林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Triangle Animal (beijing) Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Triangle Animal (beijing) Technology Co Ltd
Priority to CN201811400632.1A (CN109492228B)
Priority to CN201710505392.0A (CN107291695B)
Publication of CN107291695A
Application granted
Publication of CN107291695B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention provides an information processing apparatus and a word segmentation processing method thereof. The information processing apparatus includes: a selecting unit configured to segment a segmentation object and obtain a word segmentation result expressed as a combination of multiple words; a first splicing unit configured to splice adjacent words in the combination; a sequence labelling unit configured to use a sequence labelling model to perform sequence labelling on each word in the combination after splicing by the first splicing unit, and to merge words in the combination according to the result of the sequence labelling; and a second splicing unit configured to splice, according to a predetermined rule, the words merged by the sequence labelling unit.

Description

Information processing apparatus and word segmentation processing method thereof
Technical field
The present invention relates to an information processing apparatus capable of performing word segmentation processing, and to a word segmentation processing method thereof.
Background art
Existing word segmentation methods fall into three main categories: methods based on string matching, methods based on understanding, and methods based on statistics. For example, the prior art (Chinese patent application, Publication No. CN104462051A) describes a statistics-based segmentation method comprising: obtaining the number of times a word is searched in different search fields over a period of time, and calculating a statistical score for the word from the number of searches; calculating a length score for the word from its length; obtaining a score value for the word from its statistical score and length score, and generating a segmentation dictionary from the words and their score values; and obtaining a sentence to be segmented, matching it against the words in the segmentation dictionary to obtain multiple segmentation results, calculating the score value of each segmentation result, and taking a high-scoring segmentation result as the segmentation result of the sentence.
However, in the segmentation technique disclosed in the above patent publication, the segmentation result relies too heavily on the segmentation dictionary. If the technique is used on an information processing apparatus such as a mobile phone or tablet computer, an overly large dictionary cannot be used, so the granularity of the segmentation result is too fine. In addition, because the program must run in memory and occupies excessive memory resources, the system runs slowly.
Summary of the invention
The present invention is proposed in view of the above problems in the prior art, in order to solve all or at least one of them. An object of the present invention is to provide a word segmentation processing technique with large segmentation granularity and high processing speed.
According to one aspect of the present invention, an information processing apparatus capable of performing word segmentation processing is provided, characterized in that the information processing apparatus includes: a selecting unit configured to segment a segmentation object and obtain a word segmentation result expressed as a combination of multiple words; a first splicing unit configured to splice adjacent words in the combination; a sequence labelling unit configured to use a sequence labelling model to perform sequence labelling on each word in the combination after splicing by the first splicing unit, and to merge the words in the combination according to the result of the sequence labelling; and a second splicing unit configured to splice, according to a predetermined rule, the words merged by the sequence labelling unit.
The technical solution of the first aspect of the present invention realizes an information processing apparatus with large segmentation granularity.
According to another aspect of the present invention, an information processing apparatus capable of performing word segmentation processing is provided, the information processing apparatus including an external memory that stores a sequence labelling model, characterized in that the information processing apparatus includes: a segmentation unit configured to segment a segmentation object and obtain a word segmentation result expressed as a combination of multiple words; and a sequence labelling unit configured to, for the word segmentation result obtained by segmenting the segmentation object and expressed as a combination of multiple words, perform sequence labelling processing on the words in the combination using the sequence labelling model, and merge the words in the combination according to the result of the sequence labelling, wherein the sequence labelling unit performs the sequence labelling processing in the external memory.
The technical solution of the second aspect of the present invention realizes an information processing apparatus that occupies little memory and processes quickly.
According to another aspect of the present invention, a word segmentation processing method for an information processing apparatus is provided, the method including the following steps: a selecting step of segmenting a segmentation object and obtaining a word segmentation result expressed as a combination of multiple words; a first splicing step of splicing adjacent words in the combination; a sequence labelling step of using a sequence labelling model to perform sequence labelling on each word in the combination after the splicing in the first splicing step, and merging the words in the combination according to the result of the sequence labelling; and a second splicing step of splicing, according to a predetermined rule, the words in the combination merged in the sequence labelling step.
According to another aspect of the present invention, a word segmentation processing method for an information processing apparatus is provided, the information processing apparatus including an external memory that stores a sequence labelling model, and the method including the following steps: a segmentation step of segmenting a segmentation object and obtaining a word segmentation result expressed as a combination of multiple words; and a sequence labelling step of, for the word segmentation result obtained by segmenting the segmentation object and expressed as a combination of multiple words, performing sequence labelling processing on the words in the combination using the sequence labelling model, and merging the words in the combination according to the result of the sequence labelling, wherein, in the sequence labelling step, the sequence labelling processing is performed in the external memory.
The information processing apparatus and word segmentation processing method of the present invention segment with a larger granularity and occupy fewer memory resources, thereby increasing the processing speed of the information processing apparatus.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some of the embodiments described in the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 illustrates a block diagram of the hardware configuration of the information processing apparatus according to the present invention.
Fig. 2 illustrates a functional block diagram of the information processing apparatus according to the first embodiment of the present invention.
Fig. 3 illustrates a flowchart of the word segmentation processing method according to the first embodiment of the present invention.
Fig. 4 illustrates a flowchart of the sequence labelling processing according to the first embodiment of the present invention.
Fig. 5 illustrates a functional block diagram of the information processing apparatus according to the second embodiment of the present invention.
Fig. 6 illustrates a flowchart of the word segmentation processing method according to the second embodiment of the present invention.
Fig. 7 illustrates a flowchart of the sequence labelling processing according to the second embodiment of the present invention.
Fig. 8 illustrates a functional block diagram of the information processing apparatus according to the third embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the following embodiments are not intended to limit the present invention, and that not all combinations of the aspects described in the following embodiments are necessarily required for the means by which the present invention solves the problem. For simplicity, identical structural parts or steps are given the same reference numerals or signs, and their description is omitted.
[Hardware configuration of the information processing apparatus]
First, the hardware configuration of the information processing apparatus 1000 is described with reference to Fig. 1. The following configuration is described as an example in the present embodiment, but the information processing apparatus of the present invention is not limited to the configuration shown in Fig. 1.
Fig. 1 is a diagram showing the hardware configuration of the information processing apparatus 1000 in the present embodiment. In the present embodiment, a smartphone is described as an example of the information processing apparatus. Note that although a mobile terminal (including but not limited to a smartphone, smartwatch, smart bracelet, or music player) is illustrated as the information processing apparatus 1000 in the present embodiment, the apparatus is not limited to this. The information processing apparatus of the present invention may be any of various devices such as a laptop, a tablet computer, a PDA (personal digital assistant), a PC, or an internet device with a touch display and an information processing function (such as a digital camera, refrigerator, or television).
As shown in Fig. 1, the information processing apparatus 1000 (2000, 3000) includes an input interface 102, a CPU 103, a ROM 104, a RAM 105, an external memory 106, an output interface 107, a display 108, a communication unit 109, and a short-range wireless communication unit 110, which are connected to one another via a system bus. The input interface 102 is an interface for receiving data input by the user and instructions to execute functions, which are input by the user via an operating unit (not shown) such as keys, buttons, or a touch screen. Note that the display 108 described later and the operating unit may be at least partly integrated; for example, the configuration may output a screen and receive user operations on the same screen.
The CPU 103 is a system control unit and controls the information processing apparatus 1000 as a whole. For example, the CPU 103 performs display control of the display 108 of the information processing apparatus 1000. The ROM 104 stores fixed data such as data tables, as well as the control programs and operating system (OS) program executed by the CPU 103. In the present embodiment, each control program stored in the ROM 104 performs software execution control, such as scheduling, task switching, and interrupt processing, under the management of the OS stored in the ROM 104.
The RAM 105 (internal storage unit) is constituted by, for example, an SRAM (static random access memory) that requires a backup power supply, a DRAM, or the like. In this case, the RAM 105 can store important data such as program control variables in a non-volatile manner. A storage area for the setting information of the information processing apparatus 1000, the management data of the information processing apparatus 1000, and the like is also provided in the RAM 105. In addition, the RAM 105 serves as the working memory and main memory of the CPU 103.
The external memory 106 stores, for example, a predefined dictionary, the sequence labelling model, and an application program for executing the word segmentation processing method according to the present invention. In addition, the external memory 106 stores various programs, such as an information transmission/reception control program for transmitting to and receiving from a communication device (not shown) via the communication unit 109, and the various information used by these programs.
The output interface 107 is an interface for controlling the display 108 to display information and the display screens of application programs. The display 108 is constituted by, for example, an LCD (liquid crystal display). By arranging on the display 108 a soft keyboard with keys such as numeric input keys, a mode setting key, an enter key, a cancel key, and a power key, input from the user can be received via the display 108.
The information processing apparatus 1000 performs data communication with an external device (not shown) via the communication unit 109 using a wireless communication method such as Wi-Fi (Wireless Fidelity) or Bluetooth.
In addition, the information processing apparatus 1000 can also connect wirelessly to an external device or the like within a short range via the short-range wireless communication unit 110 and perform data communication. The short-range wireless communication unit 110 communicates using a communication method different from that of the communication unit 109. For example, Bluetooth Low Energy (BLE), whose communication range is shorter than that of the communication method of the communication unit 109, may be used as the communication method of the short-range wireless communication unit 110. NFC (near-field communication) or Wi-Fi Aware, for example, may also be used as the communication method of the short-range wireless communication unit 110.
[First embodiment]
Next, the software configuration of the information processing apparatus according to the first embodiment is described with reference to Fig. 2.
As shown in Fig. 2, the information processing apparatus 1000 includes: a selecting unit 1101, which segments a segmentation object (such as a sentence input by the user through the touch screen) and obtains a word segmentation result expressed as a combination of multiple words; a first splicing unit 1102, which splices adjacent words in the combination; a sequence labelling unit 1103, which uses a sequence labelling model to perform sequence labelling on each word in the combination after splicing by the first splicing unit, and merges the words in the combination according to the result of the sequence labelling; and a second splicing unit 1104, which splices, according to a predetermined rule, the words in the combination merged by the sequence labelling unit. The sequence labelling unit 1103 includes: an extraction section 11031, which extracts words of a predetermined type from the words in the combination after splicing by the first splicing unit 1102; a prediction section 11032, which predicts the corresponding segmentation result of the extracted words according to the predetermined type; a selection section 11033, which selects a segmentation result from the predicted segmentation results; and a merging section 11034, which is configured to merge the words in the combination according to the segmentation result selected by the selection section.
In the following, illustrating participle processing method according to a first embodiment of the present invention referring to Fig. 3.
As shown in figure 3, the participle processing method, it may include following steps S101-S104:
In step s101, it is matched by obtaining sentence to be segmented with word in dictionary for word segmentation, then all The word combination being fitted on all takes out, and calculates the combination of the highest scoring in participle strategy in each combination, the participle plan It slightly include: term weighing and language model scores.
Next, entering step S102, in step s 102, adjacent word in word segmentation result is stitched together, if should As a result occur in dictionary, which is just replaced to the former word segmentation result in dictionary.
Then, step S103 is entered, in step s 103, the word segmentation result that previous step is generated, into sequence labelling Model screens the result of sequence labelling model after carrying out sequence labelling, and according to the result after screening by previous step Word segmentation result Partial Fragment merges, and enters step S104.
In step S104, some common collocation in the word segmentation result of previous step generation are spliced, such as: quantity Word, date, time and letter expressing etc., and by result after splicing as final word segmentation result.
The above word segmentation process is illustrated below by segmenting the sentence "On 29 January 2016, district leaders Wu Guiying, Wang Hao, Chen Hongzhi, Chen Tao, Gan Jingzhong, Liu Junsheng, and Sun Qijun visited the Ministry of Foreign Affairs, the People's Daily agency, and other central units stationed in the district".
In step S101, basic segmentation is performed: the sentence is obtained, and the segmentation object is split into multiple words in different ways, forming multiple combinations of words. Each word in each combination is matched against the words in the segmentation dictionary, and all matched word combinations are then extracted.
For example, the sentence is split into the following combinations (names are given in pinyin, other words as English glosses):
(1) 2016, year, 1, month, 29, day, district, leader, Wu, Guiying, Wang, Hao, Chen Hongzhi, Chen, Tao, Gan, Jing, Zhong, Liu Jun, Sheng, Sun, Qi, Jun, visit, foreign affairs, ministry, People's Daily, agency, etc., stationed in district, central, unit;
(2) 2016, year, 1, month, 29, day, district, leader, Wu Gui, Ying, Wang, Hao, Chen Hongzhi, Chen, Tao, Gan, Jing, Zhong, Liu, Junsheng, Sun, Qi, Jun, visited, Ministry of Foreign Affairs people, Daily agency etc., stationed in district, central unit;
(3) 2016, year, 1, month, 29, day, district, leader, Wu, Guiying, Wang, Hao, Chen Hong, Zhi, Chen, Tao, Gan, Jing, Zhong, Liu Jun, Sheng, Sun, Qi, Jun, visited, Ministry of Foreign Affairs, people, Daily agency, etc., stationed in district, central, unit.
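The enumeration of candidate combinations in step S101 can be sketched roughly as follows. This is a minimal illustration rather than the patented implementation: the function name `all_segmentations`, the toy dictionary, and the rule that any single character is always an allowed word are assumptions, and the Chinese strings are an assumed back-translation of "People's Daily agency".

```python
def all_segmentations(text, dictionary):
    """Return every way to split `text` into dictionary words.

    Assumption: a single character is always accepted as a word, so that
    at least one segmentation exists for any input.
    """
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if end == 1 or head in dictionary:
            for tail in all_segmentations(text[end:], dictionary):
                results.append([head] + tail)
    return results


# Hypothetical toy dictionary; a real segmentation dictionary is far larger.
toy_dictionary = {"人民", "日报", "日报社", "人民日报"}
combos = all_segmentations("人民日报社", toy_dictionary)
```

Each entry of `combos` is one candidate combination; the scoring described in the text then picks one of them. Exhaustive enumeration grows quickly with sentence length, so a practical system would likely use a lattice or dynamic programming instead.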
The score of each combination under the segmentation strategy is then calculated, and the combination with the highest score is selected. The segmentation strategy includes, but is not limited to, the term weight and the language model score. For the vocabulary used in this segmentation step, when the vocabulary is built, the frequency with which each word occurs in the corpus is saved in addition to the word itself. The term weight is the sum of the word frequencies of all words in the current segmentation combination.
The score calculation is illustrated below with a simpler example. In the combination "I", "love", "Beijing Tian'anmen", the word frequency of "I" is 130132, the word frequency of "love" is 74150, and the word frequency of "Beijing Tian'anmen" is 5924, so the term weight of the combination is 210206. The term weight scores of the multiple combinations are then normalized: each term weight score is normalized by taking the highest term weight among all combinations as the denominator and the current term weight as the numerator. Next, the bigram language model probability of the whole combination is used as the language model score. Finally, the language model score is multiplied by the term weight score to give the final score.
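The arithmetic in this example can be written out as a short sketch. The function names are hypothetical, and the bigram language model score is treated as a given input because the text does not specify how the language model is built.

```python
def term_weight(combo, word_freq):
    """Term weight of a combination: the sum of its words' corpus frequencies."""
    return sum(word_freq[word] for word in combo)


def normalize(weights):
    """Divide every term weight by the highest term weight among all combinations."""
    top = max(weights)
    return [w / top for w in weights]


def final_score(normalized_weight, lm_score):
    """Final score: normalized term weight multiplied by the language model score."""
    return normalized_weight * lm_score


# Word frequencies quoted in the example above.
word_freq = {"I": 130132, "love": 74150, "Beijing Tian'anmen": 5924}
weight = term_weight(["I", "love", "Beijing Tian'anmen"], word_freq)
```

Here `weight` is the 210206 stated in the text; `normalize` would be applied across the term weights of all candidate combinations before multiplying by each combination's language model score.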
Calculated with the term weight strategy, the scores of combinations (1)-(3) are as follows:
Score of combination (1): 0.7411.
Score of combination (2): 1.0.
Score of combination (3): 0.8951.
Calculated with the language model score strategy, the scores of the combinations are as follows:
Score of combination (1): 0.9013.
Score of combination (2): 0.7542.
Score of combination (3): 0.9631.
The final scores of combinations (1)-(3) are 0.6680, 0.7542, and 0.8620 respectively. The combination with the highest score, namely combination (3), is selected, and processing continues with the next step.
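As a quick check of the selection, the two sets of published scores can be multiplied and compared. The variable names are illustrative only; because the inputs are already rounded to four decimals, the products may differ from the published final scores in the last digit.

```python
# Scores quoted above, keyed by combination number.
term_weight_scores = {1: 0.7411, 2: 1.0, 3: 0.8951}
lm_scores = {1: 0.9013, 2: 0.7542, 3: 0.9631}

# Final score = term weight score * language model score.
final = {k: term_weight_scores[k] * lm_scores[k] for k in term_weight_scores}

# The combination with the highest final score is carried into step S102.
best = max(final, key=final.get)
```

Note that combination (2) has the highest term weight but the lowest language model score, which is why multiplying the two changes the ranking.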
In step S102, adjacent words in the segmentation result of combination (3) are spliced together; for example, "people" and "Daily agency" are spliced together. The segmentation result after splicing is as follows:
2016, year, 1, month, 29, day, district, leader, Wu, Guiying, Wang, Hao, Chen Hongzhi, Chen, Tao, Gan, Jing, Zhong, Liu Jun, Sheng, Sun, Qi, Jun, visited, Ministry of Foreign Affairs, People's Daily agency, etc., stationed in district, central, unit.
If the spliced result is present in the dictionary, the spliced result replaces the original segmentation result.
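A minimal sketch of this first splicing step is given below. It assumes a greedy left-to-right pass and that a spliced word is kept only when it is found in the dictionary; the function name, the toy dictionary, and the Chinese strings (an assumed back-translation of the example words) are illustrative.

```python
def splice_adjacent(words, dictionary):
    """Join neighbouring words whose concatenation is a dictionary entry.

    Assumption: a single greedy left-to-right pass; the text does not say
    how competing splices are resolved.
    """
    result = []
    for word in words:
        if result and result[-1] + word in dictionary:
            result[-1] = result[-1] + word  # replace the two words with the spliced one
        else:
            result.append(word)
    return result


# Hypothetical inputs: "Ministry of Foreign Affairs", "people", "Daily agency", "etc."
words = ["外交部", "人民", "日报社", "等"]
spliced = splice_adjacent(words, {"人民日报社"})
```

Because the accumulated word is compared again on each step, the same pass can also build a word out of three or more fragments, provided every intermediate concatenation is itself in the dictionary.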
Step S103 includes steps S1031-S1034 as shown in Fig. 4.
In step S1031, words related to personal names are extracted from the segmentation result generated in the previous step, i.e., from "2016, year, 1, month, 29, day, district, leader, Wu, Guiying, Wang, Hao, Chen Hongzhi, Chen, Tao, Gan, Jing, Zhong, Liu Jun, Sheng, Sun, Qi, Jun, visited, Ministry of Foreign Affairs, People's Daily agency, etc., stationed in district, central, unit". The extracted words are "Wu, Guiying, Wang, Hao, Chen Hongzhi, Chen, Tao, Gan, Jing, Zhong, Liu Jun, Sheng, Sun, Qi, Jun".
In step S1032, the segmentation result of the extracted words is predicted according to whether the extracted words are related to personal names:
Wu, Guiying: Wu Guiying
Wang, Hao: Wang Hao
Chen, Tao: Chen Tao
Gan, Jing, Zhong: Gan Jingzhong
Liu Jun, Sheng: Liu Junsheng
Sun, Qi, Jun: Sun Qijun.
In step S1033, the results of the sequence labelling model are screened to remove results that are clearly not personal names. For example, for some segmentation objects the sequence labelling results may include something like "the king's man" mislabelled as a personal name, so the labelling results need further screening.
In step S1034, some fragments of the previous step's segmentation result are merged according to the screened results. The segmentation result obtained after merging is as follows:
2016, year, 1, month, 29, day, district, leader, Wu Guiying, Wang Hao, Chen Hongzhi, Chen Tao, Gan Jingzhong, Liu Junsheng, Sun Qijun, visited, Ministry of Foreign Affairs, People's Daily agency, etc., stationed in district, central, unit.
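The screening and merging of steps S1033 and S1034 can be sketched as below, assuming the sequence labelling model emits one label per word in a B/I/O scheme (the text mentions only a label B; the I and O labels, the function name, and the `keep` screening callback are assumptions). The labels are written by hand here, not produced by a model, and the Chinese strings are an assumed back-translation of the example.

```python
def merge_by_labels(words, labels, keep=lambda candidate: True):
    """Merge each run labelled B, I, I, ... into one word.

    `keep` is a screening hook: a merged candidate is accepted only if it
    returns True; otherwise the original fragments are left unmerged.
    """
    merged, i = [], 0
    while i < len(words):
        if labels[i] == "B":
            j = i + 1
            while j < len(words) and labels[j] == "I":
                j += 1
            candidate = "".join(words[i:j])
            if j - i > 1 and keep(candidate):
                merged.append(candidate)
            else:
                merged.extend(words[i:j])
            i = j
        else:
            merged.append(words[i])
            i += 1
    return merged


# Hypothetical fragment of the example: leader, Wu, Guiying, Wang, Hao, visited.
words = ["领导", "吴", "桂英", "王", "浩", "走访"]
labels = ["O", "B", "I", "B", "I", "O"]
merged = merge_by_labels(words, labels)
```

Passing a stricter `keep` function, for example one that rejects candidates found on a list of known false positives, corresponds to the screening described in step S1033.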
In step S104, some common collocations in the segmentation result generated in the previous step are spliced; for example, "2016, year, 1, month, 29, day" is spliced into "29 January 2016". Common collocations include numeral-classifier compounds, dates, times, and textual expressions, and the spliced result is taken as the final segmentation result. For this example, the spliced result is: 29 January 2016, district, leader, Wu Guiying, Wang Hao, Chen Hongzhi, Chen Tao, Gan Jingzhong, Liu Junsheng, Sun Qijun, visited, Ministry of Foreign Affairs, People's Daily agency, etc., stationed in district, central, unit.
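One possible form of such a predetermined rule, covering only dates, is sketched below. The rule itself (a number followed by the year, month, and day characters in that order, with at least two such pairs) is an assumption, since the text does not list the actual rules; the function name and the Chinese tokens are likewise illustrative.

```python
DATE_UNITS = ("年", "月", "日")  # year, month, day


def splice_dates(words):
    """Join consecutive number+unit pairs (year, month, day) into one date word."""
    result, i = [], 0
    while i < len(words):
        parts, j = [], i
        for unit in DATE_UNITS:
            if j + 1 < len(words) and words[j].isdigit() and words[j + 1] == unit:
                parts += [words[j], unit]
                j += 2
            else:
                break
        if len(parts) >= 4:  # at least two number+unit pairs were matched
            result.append("".join(parts))
            i = j
        else:
            result.append(words[i])
            i += 1
    return result


# Hypothetical tokens for "2016, year, 1, month, 29, day, district, leader".
dated = splice_dates(["2016", "年", "1", "月", "29", "日", "区", "领导"])
```

Rules for times, numeral-classifier compounds, and other collocations would follow the same pattern of matching a short token sequence and replacing it with the joined word.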
According to the present invention, splicing, labelling, and re-splicing the segmentation result increases the granularity of the segmentation result.
[Second embodiment]
In the first embodiment, the dictionary and the sequence labelling model are read by reading the program from the RAM 105 and running it there. In the second embodiment, by contrast, the sequence labelling unit performs the sequence labelling processing in the external memory 106.
An information processing apparatus in the prior art generally includes memory (internal storage) such as RAM, and external storage such as an SD card or hard disk. RAM is commonly used to run application programs, while external storage is commonly used to store databases and application programs. According to common practice, the sequence labelling model is stored in external storage and the program corresponding to the sequence labelling model is run in memory. As a result, a mobile phone occupies more memory and processes more slowly when performing word segmentation. In the present embodiment, although the sequence labelling model is likewise stored in external storage, the corresponding program is also run in external storage.
The word segmentation processing performed in the second embodiment is described below with reference to Fig. 5 and Fig. 6. Fig. 5 illustrates a functional block diagram of the information processing apparatus according to the second embodiment of the present invention.
As shown in Fig. 5, the information processing apparatus 2000 includes: a segmentation unit 2102, which segments a segmentation object and obtains a word segmentation result expressed as a combination of multiple words; and a sequence labelling unit 2103, which performs the sequence labelling processing in the external memory and which, for the word segmentation result obtained by segmenting the segmentation object and expressed as a combination of multiple words, performs sequence labelling on the words in the combination using the sequence labelling model and merges the words in the combination according to the result of the sequence labelling. The sequence labelling unit 2103 includes: a storage section 21031, which stores the emission probabilities and state probabilities of the sequence labelling model in a first file in the external memory; a calculation section 21032, which performs a hash operation on the feature functions of the words in the combination and stores, with the hash value, each feature function and the storage location of the emission probability or state probability corresponding to that feature function in a second file; an extraction section 21033, which extracts, from the storage location stored by the calculation section, the probability that adjacent words in the combination form one combined word; and a merging section 21034, which is configured to splice the words in the combination according to the extracted probability.
Fig. 6 illustrates a flowchart of the word segmentation processing method according to the second embodiment of the present invention.
With reference to Fig. 6, the word segmentation processing method according to the second embodiment of the present invention is described by taking the segmentation object "I love Beijing Tian'anmen" as an example.
In step S201, the segmentation object is divided into single characters: I, love, Bei, Jing, Tian, An, Men.
In step S202, sequence labelling is performed on the segmentation result of step S201. The sequence labelling processing includes steps S2021 to S2024 as shown in Fig. 7.
In the storing step S2021, the original model parameters of the sequence labelling model are divided into three parts for storage: the feature functions (second parameter), the emission probabilities and state probabilities (first parameter), and the feature templates and other parameters (third parameter). The emission probabilities and state probabilities are stored as a separate file (the first file).
In the calculating step S2022, a hash operation is performed on the feature functions using the Blizzard hash algorithm, and each feature function and the storage location (value) of the emission probability or state probability corresponding to that feature function are then stored, with the hash value, in another binary file (the second file). The feature templates and other parameters are stored as a third file.
Specifically, when the sequence labelling model is positioned at the character "Bei", one of the feature templates is U06:%x[0,0]/%x[1,0]. This template denotes the combination of the current character and the character in the following position, i.e., U06:Bei/Jing. "U06:Bei/Jing" is passed as the variable to the Blizzard hash function, which computes three hash values: a main hash value M, a left hash value L, and a right hash value R. A binary shift operation on the main hash value yields the storage slot (i.e., the specific address of the corresponding feature, such as the address of "Bei" and "Jing" in the file), and the left and right hash values obtained are compared with the left and right hash values stored in advance. If they are identical, the slot is determined to be the storage location for the main hash value in the second file; if this determination is true, the storage location of the emission probability (or state probability) stored there is taken out, and if it is false, -1 is returned. If they are not equal, 1 is added to M and the above fetch-and-compare operation is repeated.
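The lookup described above can be sketched as an open-addressing table keyed by three hash values. This sketch does not reproduce the hash algorithm named in the text: it derives the main, left, and right values from a SHA-256 digest as a stand-in, masks the main value instead of shifting it, and uses a hypothetical table size, boundary marker, and function names. The characters are written in Chinese as an assumed back-translation of "Bei" and "Jing".

```python
import hashlib

TABLE_SIZE = 1 << 10  # hypothetical number of slots (a power of two)


def three_hashes(key):
    """Stand-in for the three-value hash: main, left, and right values."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    main = int.from_bytes(digest[0:4], "little")
    left = int.from_bytes(digest[4:8], "little")
    right = int.from_bytes(digest[8:12], "little")
    return main, left, right


def expand_template(chars, pos):
    """Expand U06:%x[0,0]/%x[1,0]: the current character and the next one."""
    nxt = chars[pos + 1] if pos + 1 < len(chars) else "<END>"  # hypothetical marker
    return f"U06:{chars[pos]}/{nxt}"


def insert(table, key, offset):
    """Store (left, right, offset) at the first free slot from the main hash."""
    main, left, right = three_hashes(key)
    slot = main & (TABLE_SIZE - 1)
    while table[slot] is not None:  # slot taken: move on by 1
        slot = (slot + 1) & (TABLE_SIZE - 1)
    table[slot] = (left, right, offset)


def lookup(table, key):
    """Return the stored offset for `key`, or -1 if the key is absent."""
    main, left, right = three_hashes(key)
    slot = main & (TABLE_SIZE - 1)
    for _ in range(TABLE_SIZE):
        entry = table[slot]
        if entry is None:
            return -1
        if entry[0] == left and entry[1] == right:
            return entry[2]
        slot = (slot + 1) & (TABLE_SIZE - 1)  # mismatch: add 1 and compare again
    return -1


table = [None] * TABLE_SIZE
feature = expand_template(list("北京"), 0)
insert(table, feature, 128)  # 128 is a made-up byte offset into the first file
```

Storing only the two check values and an offset, rather than the feature string itself, keeps each slot a fixed size, which is what allows the table to live in a binary file and be probed by position.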
In the extraction step S2023, the probability that adjacent characters form a single joint word is extracted from the storage locations of the feature functions and of their corresponding emission or state probabilities, as stored in step S2022.
Specifically, the value (address) returned by the repeated lookup-and-compare operation in step S2022 is used as the position of the emission probability (or state probability) in the first file, and the value at that position is read. The number of weights or probabilities read equals the number of sequence labels; each weight represents how likely "U06:north/capital" is to occur jointly when the current character carries a given label, such as B. For example, the probability that "Beijing" occurs as a joint word is 98%, and the probability that "Tiananmen" occurs as a joint word is 95%.
In the merging step S2024, the word segmentation result from step S201 is spliced according to the probabilities calculated in step S2023.
Specifically, the word segmentation result of step S201 is: I, love, north, capital, day, peace, door. According to the probabilities calculated in step S2023, the probability that "Beijing" occurs as a joint word is 98% and the probability that "Tiananmen" occurs as a joint word is 95%. It is therefore determined that "north" and "capital" are spliced into "Beijing", and that "day", "peace" and "door" are spliced into "Tiananmen". The word segmentation result finally obtained in step S2024 is: I, love, Beijing, Tiananmen.
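The merging step can be illustrated with a small greedy routine. This is a sketch only: the 0.9 threshold, the longest-span-first strategy, the `joint_prob` dictionary and the function name are assumptions, since the source states only that splicing follows the extracted probabilities. The tokens are romanized stand-ins for the single characters that the text above renders as "north", "capital", "day", "peace" and "door".

```python
def merge_by_probability(tokens, joint_prob, threshold=0.9, max_len=3):
    """Greedily splice adjacent tokens whose joint-word probability meets the threshold."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            span = tuple(tokens[i:i + n])
            if joint_prob.get(span, 0.0) >= threshold:
                merged.append("".join(span))
                i += n
                break
        else:
            # No span starting here qualified: keep the token as it is.
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["I", "love", "Bei", "jing", "Tian", "an", "men"]
joint_prob = {("Bei", "jing"): 0.98, ("Tian", "an", "men"): 0.95}
result = merge_by_probability(tokens, joint_prob)
```

With the probabilities from the example (98% and 95%), `result` holds the four tokens I, love, Beijing, Tiananmen.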
According to the second embodiment of the present invention, the sequence labelling processing is carried out in external storage rather than in main memory, which reduces the memory occupied on the information processing unit and improves its running speed.
[3rd embodiment]
The hardware configuration of the information processing unit of the third embodiment of the present invention is the same as that of the information processing units of the first and second embodiments. The technical solution of the third embodiment is a combination of the technical solutions of the first and second embodiments. That is, the information processing unit of the third embodiment includes the selecting unit, the first concatenation unit and the second concatenation unit of the first embodiment, together with the external memory and the sequence labelling unit of the second embodiment.
Fig. 8 is a functional block diagram of the information processing unit according to the third embodiment of the present invention.
As shown in Fig. 8, the information processing unit 3000 includes: a selecting unit 3101, which segments an object to be segmented (for example, a sentence entered by a user through a touch screen) and obtains a word segmentation result represented as a combination of multiple words; a first concatenation unit 3102, which splices adjacent words in the combination; a sequence labelling unit 3103, which uses a sequence labelling model to perform sequence labelling on each word in the combination after splicing by the first concatenation unit, and merges the words in the combination according to the result of the sequence labelling; and a second concatenation unit 3104, which splices, according to a predetermined rule, the words in the combination after merging by the sequence labelling unit.
The sequence labelling unit 3103 includes: a storage unit 31031, which stores the emission probabilities and state probabilities of the sequence labelling model in a first file in the external memory; a calculation part 31032, which performs a hash operation on the feature functions of the words in the combination and stores the storage location of each feature function and of its corresponding emission probability or state probability, keyed by the hash value, in a second file; an extraction unit 31033, which extracts, from the storage locations stored by the calculation part, the probability that adjacent words in the combination form a single joint word; and a merging portion 31034, which splices the words in the combination according to the extracted probabilities.
The word segmentation processing method of the third embodiment includes the selecting step, the first splicing step and the second splicing step of the first embodiment; the sequence labelling step performed between the first splicing step and the second splicing step is the sequence labelling step of the second embodiment.
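The order of the steps in the third embodiment can be sketched as a simple pipeline. Every stage below is a toy placeholder written for illustration: whitespace splitting stands in for the selecting step, the first splicing step is left as a pass-through, a fixed lookup stands in for the disk-backed sequence labelling merge, and the number-joining rule is only one hypothetical example of a predetermined rule. None of these bodies reflects the patent's actual implementation; only the ordering is taken from the text.

```python
def select(text):
    # Stand-in for the selecting step: split the input on whitespace.
    return text.split()


def first_splice(tokens):
    # Placeholder for the first splicing step: returns the tokens unchanged.
    return list(tokens)


def sequence_label_merge(tokens, known=(("Bei", "jing"),)):
    # Stand-in for the sequence labelling step: merge pairs listed in `known`.
    out, i = [], 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) in known:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def second_splice(tokens):
    # Toy predetermined rule: join a number with the token that follows it.
    out, i = [], 0
    while i < len(tokens):
        if tokens[i].isdigit() and i + 1 < len(tokens):
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def segment(text):
    # Selecting, first splicing, sequence labelling, second splicing, in that order.
    tokens = select(text)
    for stage in (first_splice, sequence_label_merge, second_splice):
        tokens = stage(tokens)
    return tokens


result = segment("I love Bei jing 3 days")
```

Keeping each stage as a function from a token list to a token list makes it straightforward to swap the in-memory sequence labelling stage for the disk-backed one from the second embodiment.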
According to the third embodiment of the present invention, an information processing unit can be obtained that has a coarse segmentation granularity, occupies little memory and processes quickly.
The information processing unit of the present invention achieves the following technical effect: common collocations and semantically meaningful combinations are segmented out as far as possible, so that more meaningful segments can conveniently be extracted from the word segmentation result.
Although the present invention has been described above with reference to exemplary embodiments, the above embodiments merely illustrate the technical concept and features of the invention and are not intended to limit its scope of protection. Any equivalent change or modification made in accordance with the spirit and essence of the present invention shall fall within its scope of protection.

Claims (18)

1. An information processing unit capable of performing word segmentation processing, characterized in that the information processing unit comprises:
a selecting unit configured to segment an object to be segmented and obtain multiple word segmentation results, each of the multiple word segmentation results being represented as a combination of multiple words;
a first concatenation unit configured to splice adjacent words in the combination;
a sequence labelling unit configured to use a sequence labelling model to perform sequence labelling on each word in the combination after splicing by the first concatenation unit, and to merge the words in the combination according to the result of the sequence labelling; and
a second concatenation unit configured to splice, according to a predetermined rule, the words merged by the sequence labelling unit,
wherein the sequence labelling unit comprises:
an extraction unit configured to extract words of a predefined type from each segmented word in the combination after splicing by the first concatenation unit;
a prediction section configured to predict, according to the predefined type, the corresponding word segmentation results of the extracted words;
a selector configured to select a word segmentation result from the predicted word segmentation results; and
a merging portion configured to merge the words in the combination according to the word segmentation result selected by the selector.
2. The information processing unit according to claim 1, wherein the predetermined rule includes splicing adjacent words that may relate to an event, a date, a numeral-classifier compound or a letter expression.
3. The information processing unit according to claim 1, wherein the predefined type includes personal names, place names and organization names.
4. The information processing unit according to claim 1, wherein the selecting unit calculates a score for each of the multiple word combinations according to a segmentation strategy, and selects the combination with the highest score from the multiple word combinations.
5. The information processing unit according to claim 4, wherein the segmentation strategy includes term weighting and a language model score.
6. An information processing unit capable of performing word segmentation processing, the information processing unit including an external memory that stores a sequence labelling model, characterized in that the information processing unit comprises:
a segmentation unit configured to segment an object to be segmented and obtain multiple word segmentation results, each of the multiple word segmentation results being represented as a combination of multiple words; and
a sequence labelling unit configured to, for the word segmentation result obtained by segmenting the object to be segmented and represented as a combination of multiple words, perform sequence labelling processing on the words in the combination using the sequence labelling model, and merge the words in the combination according to the result of the sequence labelling,
wherein the sequence labelling unit performs the sequence labelling processing in the external memory, and
wherein the sequence labelling unit comprises:
a storage unit configured to store the emission probabilities and state probabilities of the sequence labelling model in a first file in the external memory;
a calculation part configured to perform a hash operation on the feature functions of the words in the combination, and to store the storage location of each feature function and of its corresponding emission probability or state probability, keyed by the hash value, in a second file;
an extraction unit configured to extract, from the storage locations stored by the calculation part, the probability that adjacent words in the combination form a single joint word; and
a merging portion configured to splice the words in the combination according to the extracted probabilities.
7. The information processing unit according to claim 6, wherein, when performing the sequence labelling processing, the sequence labelling unit calculates the address of the sequence labelling model in the external memory, obtains from that address the corresponding information of the sequence labelling model in the external memory, and uses the sequence labelling model.
8. The information processing unit according to claim 6, wherein the external memory is a hard disk.
9. The information processing unit according to claim 6,
wherein the calculation part obtains a main hash value, a left hash value and a right hash value of the feature function by performing a hash operation on the feature function,
wherein the storage location is stored in the second file under the main hash value, and
the left hash value and the right hash value are used to determine whether the storage location is stored.
10. A word segmentation processing method for an information processing unit, the word segmentation processing method comprising the following steps:
a selecting step of segmenting an object to be segmented and obtaining multiple word segmentation results, each of the multiple word segmentation results being represented as a combination of multiple words;
a first splicing step of splicing adjacent words in the combination;
a sequence labelling step of using a sequence labelling model to perform sequence labelling on each word in the combination after splicing in the first splicing step, and merging the words in the combination according to the result of the sequence labelling; and
a second splicing step of splicing, according to a predetermined rule, the words in the combination after merging in the sequence labelling step,
wherein the sequence labelling step comprises:
an extraction step of extracting words of a predefined type from each segmented word in the combination after splicing in the first splicing step;
a prediction step of predicting, according to the predefined type, the corresponding word segmentation results of the extracted words;
a selection step of selecting a word segmentation result from the predicted word segmentation results; and
a merging step of merging the words in the combination according to the word segmentation result selected in the selection step.
11. The word segmentation processing method according to claim 10, wherein the predetermined rule includes splicing adjacent words that may relate to an event, a date, a numeral-classifier compound or a letter expression.
12. The word segmentation processing method according to claim 10, wherein the predefined type includes personal names, place names and organization names.
13. The word segmentation processing method according to claim 10, wherein, in the selecting step, a score is calculated for each of the multiple word combinations according to a segmentation strategy, and the combination with the highest score is selected from the multiple word combinations.
14. The word segmentation processing method according to claim 13, wherein the segmentation strategy includes term weighting and a language model score.
15. A word segmentation processing method for an information processing unit, the information processing unit including an external memory that stores a sequence labelling model, the word segmentation processing method comprising the following steps:
a segmentation step of segmenting an object to be segmented and obtaining multiple word segmentation results, each of the multiple word segmentation results being represented as a combination of multiple words; and
a sequence labelling step of, for the word segmentation result obtained by segmenting the object to be segmented and represented as a combination of multiple words, performing sequence labelling processing on the words in the combination using the sequence labelling model, and merging the words in the combination according to the result of the sequence labelling,
wherein, in the sequence labelling step, the sequence labelling processing is performed in the external memory, and
wherein the sequence labelling step comprises:
a storing step of storing the emission probabilities and state probabilities of the sequence labelling model in a first file;
a calculating step of performing a hash operation on the feature functions of the words in the combination, and storing the storage location of each feature function and of its corresponding emission probability or state probability, keyed by the hash value, in a second file;
an extraction step of extracting, from the storage locations stored in the calculating step, the probability that adjacent words in the combination form a single joint word; and
a merging step of splicing the words in the combination according to the extracted probabilities.
16. The word segmentation processing method according to claim 15, wherein, when the sequence labelling processing is performed, the address of the sequence labelling model in the external memory is calculated, the corresponding information of the sequence labelling model in the external memory is obtained from that address, and the sequence labelling model is used.
17. The word segmentation processing method according to claim 15, wherein the external memory is a hard disk.
18. The word segmentation processing method according to claim 15, wherein, in the calculating step, a main hash value, a left hash value and a right hash value of the feature function are obtained by performing a hash operation on the feature function,
wherein the storage location is stored in the second file under the main hash value, and
the left hash value and the right hash value are used to determine whether the storage location is stored.
CN201710505392.0A 2017-06-28 2017-06-28 Information processing unit and its participle processing method Active CN107291695B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201811400632.1A CN109492228B (en) 2017-06-28 2017-06-28 Information processing apparatus and word segmentation processing method thereof
CN201710505392.0A CN107291695B (en) 2017-06-28 2017-06-28 Information processing unit and its participle processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710505392.0A CN107291695B (en) 2017-06-28 2017-06-28 Information processing unit and its participle processing method

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201811400632.1A Division CN109492228B (en) 2017-06-28 2017-06-28 Information processing apparatus and word segmentation processing method thereof

Publications (2)

Publication Number Publication Date
CN107291695A CN107291695A (en) 2017-10-24
CN107291695B true CN107291695B (en) 2019-01-11

Family

ID=60098659

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201811400632.1A Active CN109492228B (en) 2017-06-28 2017-06-28 Information processing apparatus and word segmentation processing method thereof
CN201710505392.0A Active CN107291695B (en) 2017-06-28 2017-06-28 Information processing unit and its participle processing method

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201811400632.1A Active CN109492228B (en) 2017-06-28 2017-06-28 Information processing apparatus and word segmentation processing method thereof

Country Status (1)

Country Link
CN (2) CN109492228B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766539B (en) * 2018-11-30 2022-12-20 平安科技(深圳)有限公司 Standard word stock word segmentation method, device, equipment and computer readable storage medium
CN111339250B (en) * 2020-02-20 2023-08-18 北京百度网讯科技有限公司 Mining method for new category labels, electronic equipment and computer readable medium
CN115497465A (en) * 2022-09-06 2022-12-20 平安银行股份有限公司 Voice interaction method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060155782A1 (en) * 2005-01-11 2006-07-13 Viktors Berstis Systems, methods, and media for aggregating electronic document usage information
CN103984735A (en) * 2014-05-21 2014-08-13 北京京东尚科信息技术有限公司 Method and device for generating recommended delivery place name
CN104469002A (en) * 2014-12-02 2015-03-25 科大讯飞股份有限公司 Mobile phone contact person determination method and device
CN105095391A (en) * 2015-06-30 2015-11-25 北京奇虎科技有限公司 Device and method for identifying organization name by word segmentation program

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7393665B2 (en) * 2005-02-10 2008-07-01 Population Genetics Technologies Ltd Methods and compositions for tagging and identifying polynucleotides
CN102360383B (en) * 2011-10-15 2013-07-31 西安交通大学 Method for extracting text-oriented field term and term relationship
CN103309852A (en) * 2013-06-14 2013-09-18 瑞达信息安全产业股份有限公司 Method for discovering compound words in specific field based on statistics and rules
CN103646088B (en) * 2013-12-13 2017-03-15 合肥工业大学 Product comment fine-grained emotional element extraction method based on CRFs and SVM
CN103823862B (en) * 2014-02-24 2017-02-15 西安交通大学 Cross-linguistic electronic text plagiarism detection system and detection method
CN105718586B (en) * 2016-01-26 2018-12-28 中国人民解放军国防科学技术大学 The method and device of participle
CN106021229B (en) * 2016-05-19 2018-11-02 苏州大学 A kind of Chinese event synchronous anomalies method

Also Published As

Publication number Publication date
CN109492228B (en) 2020-01-14
CN107291695A (en) 2017-10-24
CN109492228A (en) 2019-03-19

Similar Documents

Publication Publication Date Title
KR102462365B1 (en) Method and apparatus for predicting text input based on user demographic information and context information
US20170337261A1 (en) Decision Making and Planning/Prediction System for Human Intention Resolution
CN106934069B (en) Data retrieval method and system
US10650068B2 (en) Search engine
CN110309316B (en) Method and device for determining knowledge graph vector, terminal equipment and medium
US20200279182A1 (en) Method and system for automatically producing plain-text explanation of machine learning models
US20150286943A1 (en) Decision Making and Planning/Prediction System for Human Intention Resolution
CN107291695B (en) Information processing unit and its participle processing method
CN112215008A (en) Entity recognition method and device based on semantic understanding, computer equipment and medium
EP2839391A1 (en) Conversational agent
US20210256076A1 (en) Integrated browser experience for learning and automating tasks
EP2919097B1 (en) Information processing system and information processing method for character input prediction
CN105653134A (en) An application switching method and a system thereof
CN110033382B (en) Insurance service processing method, device and equipment
KR20200009117A (en) Systems for data collection and analysis
CN110275962B (en) Method and apparatus for outputting information
US11126972B2 (en) Enhanced task management feature for electronic applications
CN112801425B (en) Method and device for determining information click rate, computer equipment and storage medium
US10896291B2 (en) Method and device for providing notes by using artificial intelligence-based correlation calculation
CN110113492A (en) Information display method and device based on notification information
CN114328838A (en) Event extraction method and device, electronic equipment and readable storage medium
CN117708428A (en) Recommendation information prediction method and device and electronic equipment
CN112905787A (en) Text information processing method, short message processing method, electronic device and readable medium
CN115048536A (en) Knowledge graph generation method and device, computer equipment and storage medium
EP4348524A1 (en) Apparatus and method for suggesting user-relevant digital content using edge computing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Wang Zhuoran

Inventor after: Qi Chao

Inventor after: Ma Yuchi

Inventor after: Hou Xinglin

Inventor before: Hou Xinglin

Inventor before: Qi Chao

Inventor before: Wang Zhuoran

Inventor before: Ma Yuchi

GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200727

Address after: 518000 Nanshan District science and technology zone, Guangdong, Zhejiang Province, science and technology in the Tencent Building on the 1st floor of the 35 layer

Patentee after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

Address before: 100029, Beijing, Chaoyang District new East Street, building No. 2, -3 to 25, 101, 8, 804 rooms

Patentee before: Tricorn (Beijing) Technology Co.,Ltd.
