CN105573980A - Information segment generation method and device - Google Patents

Info

Publication number
CN105573980A
Authority
CN
China
Prior art keywords
argument
information
subordinate sentence
sentence
carried out
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510918463.0A
Other languages
Chinese (zh)
Inventor
张新展
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201510918463.0A
Publication of CN105573980A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/221 Parsing markup language streams
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an information segment generation method and device. In one specific implementation, the method comprises: performing sentence segmentation on obtained information to obtain at least one sub-sentence; tagging arguments in the at least one sub-sentence according to a preset argument set; performing word segmentation on the sub-sentences with tagged arguments and part-of-speech tagging on the resulting words; and analyzing the sub-sentences with tagged arguments, based on a preset word collocation pair set, the tagged arguments, and the part-of-speech tagging result, to generate subject-verb-object information segments. This implementation generates information segments quickly and effectively, and thereby expresses information accurately and concisely.

Description

Information segment generation method and device
Technical field
This application relates to the field of computer technology, specifically to the field of terminal technology, and in particular to an information segment generation method and device.
Background
With the rapid development of the Internet and information technology, the Internet now contains a massive amount of information. A keyword is a word used when an index is built for a single medium; it can express the subject and the core content of a piece of information. A key sentence is a sentence that reflects the core idea of a piece of information. However, when keywords are used to express information, the subject, the object, and the relation between them are unclear, so the core content is not fully reflected. A key sentence can reflect the core content, but it is not concise enough for the user to identify that content quickly and accurately. A form of information representation is therefore needed that both fully reflects the core content of the information and is concise.
Summary of the invention
The purpose of this application is to propose an improved information segment generation method and device that solve the technical problem described in the Background section above.
In a first aspect, this application provides an information segment generation method, the method comprising: performing sentence segmentation on obtained information to obtain at least one sub-sentence; tagging arguments in the at least one sub-sentence according to a preset argument set; performing word segmentation on each sub-sentence with tagged arguments, and performing part-of-speech tagging on the words obtained from the word segmentation; and analyzing each sub-sentence with tagged arguments, based on a preset word collocation pair set, the tagged arguments, and the part-of-speech tagging result, to generate information segments with a subject-verb-object (SVO) structure.
In some embodiments, the method further comprises: removing, based on a domain dictionary and the word collocation pair set, information segments that are ambiguous and/or structurally incomplete from the generated information segments.
In some embodiments, performing sentence segmentation on the obtained information to obtain at least one sub-sentence comprises: segmenting the sentences in the information according to the punctuation marks in the obtained information, to obtain at least one sub-sentence.
In some embodiments, tagging the arguments in the at least one sub-sentence according to the preset argument set comprises: building a word lookup tree from the argument set; and determining, according to the word lookup tree, whether each sub-sentence contains an argument from the argument set and, if so, tagging that argument.
In some embodiments, performing word segmentation on each sub-sentence with tagged arguments comprises: segmenting each sub-sentence with tagged arguments using a full segmentation method in combination with a domain dictionary, to obtain at least one word.
In a second aspect, this application provides an information segment generating device, the device comprising: a segmentation unit configured to perform sentence segmentation on obtained information to obtain at least one sub-sentence; a tagging unit configured to tag arguments in the at least one sub-sentence according to a preset argument set; a word segmentation unit configured to perform word segmentation on each sub-sentence with tagged arguments and to perform part-of-speech tagging on the words obtained from the word segmentation; and a generation unit configured to analyze each sub-sentence with tagged arguments, based on a preset word collocation pair set, the tagged arguments, and the part-of-speech tagging result, to generate information segments with an SVO structure.
In some embodiments, the device further comprises: a removal unit configured to remove, based on a domain dictionary and the word collocation pair set, information segments that are ambiguous and/or structurally incomplete from the generated information segments.
In some embodiments, the segmentation unit is further configured to: segment the sentences in the information according to the punctuation marks in the obtained information, to obtain at least one sub-sentence.
In some embodiments, the tagging unit is further configured to: build a word lookup tree from the argument set; and determine, according to the word lookup tree, whether each sub-sentence contains an argument from the argument set and, if so, tag that argument.
In some embodiments, the word segmentation unit is further configured to: segment each sub-sentence with tagged arguments using a full segmentation method in combination with a domain dictionary, to obtain at least one word.
The information segment generation method and device provided by this application tag arguments in the sub-sentences of the obtained information, perform word segmentation on the sub-sentences with tagged arguments, perform part-of-speech tagging on the resulting words, and finally, based on the word collocation pair set, the tagged arguments, and the part-of-speech tagging result, generate information segments that fully express the core content of the information and are concise. Information segments are thus generated quickly and effectively, and the information is expressed accurately and concisely.
Brief description of the drawings
Other features, objects, and advantages of this application will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 is a diagram of an exemplary system architecture to which this application can be applied;
Fig. 2 is a flowchart of one embodiment of the information segment generation method according to this application;
Fig. 3 is a schematic diagram of an application scenario of the information segment generation method according to this application;
Fig. 4 is a flowchart of another embodiment of the information segment generation method according to this application;
Fig. 5 is a structural diagram of one embodiment of the information segment generating device according to this application;
Fig. 6 is a structural diagram of a computer system suitable for implementing a terminal device or server of the embodiments of this application.
Detailed description
This application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the relevant invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the invention.
It should be noted that, where no conflict arises, the embodiments in this application and the features of those embodiments may be combined with one another. This application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the information segment generation method or the information segment generating device of this application can be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links or fiber-optic cables.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 over the network 104, for example to receive or send messages. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as web browser applications, news applications, search applications, instant messaging tools, email clients, and social platform software.
The terminal devices 101, 102, 103 may be any electronic devices that have a display screen and support information processing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, and desktop computers.
The server 105 may be a server that provides various services, for example a back-end web server that provides information to the terminal devices 101, 102, 103. The back-end web server may send information from the Internet to a terminal device directly, or may first analyze or otherwise process that information and then send the result to the terminal device.
It should be noted that the information segment generation method provided by the embodiments of this application may be executed by the terminal devices 101, 102, 103 alone, or jointly by the terminal devices 101, 102, 103 and the server 105. Correspondingly, the information segment generating device may be arranged in the terminal devices 101, 102, 103, or units of the information segment generating device may be arranged in the server 105.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative. Any number of terminal devices, networks, and servers may be used, as the implementation requires.
With continued reference to Fig. 2, a flow 200 of one embodiment of the information segment generation method according to this application is shown. The information segment generation method comprises the following steps:
Step 201: perform sentence segmentation on the obtained information to obtain at least one sub-sentence.
In this embodiment, the electronic device on which the information segment generation method runs (for example, the terminal devices 101, 102, 103 shown in Fig. 1) may obtain the information locally, or may obtain it over a wired or wireless connection from a back-end server that provides it with information. When the information is text, the electronic device may segment it according to text paragraphs, font style, font size, and the like, to obtain at least one sub-sentence. When the information is an image or speech, the electronic device may first recognize it to produce text, and then segment that text according to text paragraphs, font style, font size, and the like, to obtain at least one sub-sentence.
It should be pointed out that the wireless connection may include, but is not limited to, 3G/4G, WiFi, Bluetooth, WiMAX, Zigbee, and UWB (ultra wideband) connections, as well as other wireless connections now known or developed in the future.
In some optional implementations of this embodiment, the electronic device may segment the information into at least one sub-sentence according to the punctuation marks in the obtained information (for example, a domestic news item). The punctuation mark used may be one specific mark (for example, the full stop) or any punctuation mark that occurs in the information. For example, every punctuation mark occurring in the information may be treated as a separator, and the sentences split into sub-sentences at those separators. As an example, the long sentence "Areas where wild animals are concentrated, are generally underdeveloped areas. The local people have made great contributions to protecting animals, and have therefore lost many development opportunities." can be split at its punctuation marks into four sub-sentences: "Areas where wild animals are concentrated"; "are generally underdeveloped areas"; "The local people have made great contributions to protecting animals"; "and have therefore lost many development opportunities".
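The punctuation-based segmentation described above can be sketched in a few lines of Python. This is only an illustration, not the patent's implementation: the function name `split_sub_sentences` and the particular set of separator characters are assumptions made for the example.

```python
import re

# Punctuation marks treated as separators. This particular set (Chinese and
# ASCII marks) is an assumption for illustration; the patent does not fix one.
SEPARATORS = "，。！？；,.!?;"

def split_sub_sentences(text, separators=SEPARATORS):
    """Split text at every separator and drop empty pieces."""
    pattern = "[" + re.escape(separators) + "]"
    return [part.strip() for part in re.split(pattern, text) if part.strip()]

sub_sentences = split_sub_sentences(
    "Areas where wild animals are concentrated, are generally underdeveloped areas. "
    "The local people have made great contributions to protecting animals, "
    "and have therefore lost many development opportunities."
)
```

Passing a narrower `separators` string (for example, only the full stop) corresponds to the variant in which one specific punctuation mark is used.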
Step 202: tag the arguments in the at least one sub-sentence according to the preset argument set.
In this embodiment, based on the at least one sub-sentence obtained in step 201, the electronic device may match the words in each sub-sentence against the arguments in the preset argument set and tag the arguments in the sub-sentence that match. Here, an argument is a nominal word in a sentence. The arguments in the argument set may be set according to actual needs and may be arguments related to the information segments to be generated. For example, if the information segments to be extracted concern several stocks, the argument set may contain the names of those stocks.
In some optional implementations of this embodiment, the electronic device may first build a word lookup tree (trie) from the argument set. The root node of the trie is empty; every other node contains a single Chinese character, letter, or symbol of some argument in the argument set. Concatenating the characters along the path from the root node to a given node yields an argument (or a prefix of one), and the characters held by the child nodes of any one node all differ from each other. The electronic device then uses the trie to determine whether each sub-sentence contains an argument from the argument set and, if so, tags that argument.
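A minimal sketch of such a word lookup tree follows. The class and function names (`TrieNode`, `build_trie`, `tag_arguments`) are hypothetical, and the left-to-right longest-match scan is an assumption; the patent only states that the trie is used to decide whether a sub-sentence contains an argument.

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # one character -> child node
        self.is_argument = False  # True if the path from the root spells an argument

def build_trie(arguments):
    """Build a trie whose root is empty and whose other nodes hold one character."""
    root = TrieNode()
    for argument in arguments:
        node = root
        for ch in argument:
            node = node.children.setdefault(ch, TrieNode())
        node.is_argument = True
    return root

def tag_arguments(sub_sentence, root):
    """Return (start, end, argument) for each match, scanning left to right.

    Longest match at each position is an assumption of this sketch.
    """
    tags = []
    i = 0
    while i < len(sub_sentence):
        node, end = root, None
        for j in range(i, len(sub_sentence)):
            node = node.children.get(sub_sentence[j])
            if node is None:
                break
            if node.is_argument:
                end = j + 1
        if end is None:
            i += 1
        else:
            tags.append((i, end, sub_sentence[i:end]))
            i = end
    return tags

trie = build_trie(["A team", "B team"])
tags = tag_arguments("A team and B team have a match", trie)
```

Matching is done character by character, which suits Chinese text where words are not separated by spaces; on English text the same scan would also match inside longer words.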
Step 203: perform word segmentation on each sub-sentence with tagged arguments, and perform part-of-speech tagging on the words obtained from the word segmentation.
In this embodiment, the electronic device on which the information segment generation method runs may use a word segmentation tool to segment each sub-sentence whose arguments were tagged in step 202, and then perform part-of-speech tagging on the resulting words to determine the part of speech of each word. The word segmentation tool may be any tool that provides both word segmentation and part-of-speech tagging, for example the jieba word segmentation tool commonly used for Chinese, or the Chinese lexical analysis system ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System).
In some optional implementations of this embodiment, the electronic device may first segment each sub-sentence with tagged arguments using a full segmentation method, and then use a domain dictionary to merge words that were wrongly split, obtaining at least one word. For example, the stock name "State New Energy" may be wrongly split into "State / New Energy"; the wrongly split words then need to be merged using a stock dictionary.
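The merge step can be illustrated as follows. This sketch covers only the dictionary-based merging of wrongly split words, not the full segmentation method itself; the function name, the `max_span` limit, and the `joiner` parameter are assumptions introduced for the example, and the English tokens stand in for the Chinese words of the original.

```python
def merge_with_domain_dictionary(tokens, domain_dictionary, max_span=4, joiner=""):
    """Merge runs of adjacent tokens whose concatenation is a dictionary entry.

    joiner defaults to "" because Chinese words are written without spaces.
    """
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest run of adjacent tokens first (at least two tokens).
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = joiner.join(tokens[i:i + span])
            if candidate in domain_dictionary:
                merged.append(candidate)
                i += span
                break
        else:
            # No run starting here is in the dictionary: keep the token as is.
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical stock dictionary with a single entry.
stock_dictionary = {"State New Energy"}
words = merge_with_domain_dictionary(
    ["State", "New Energy", "rises"], stock_dictionary, joiner=" "
)
```

Trying longer runs first is a design choice of the sketch: it prefers the longest dictionary entry when several entries overlap.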
Step 204: analyze each sub-sentence with tagged arguments, based on the preset word collocation pair set, the tagged arguments, and the part-of-speech tagging result, to generate information segments with an SVO structure.
In this embodiment, the electronic device may first retrieve from the information the arguments tagged in step 202 and record the position of each argument in the information. It then retrieves the predicate verbs tagged in step 203 and matches each retrieved predicate verb against the predicate verbs in the preset word collocation pair set; if the match succeeds, it records the position of that predicate verb in the information. Next, it determines whether a negation word lies between a tagged argument and a successfully matched predicate verb whose positions are adjacent. If there is no negation word and the predicate verb is intransitive, the argument and the predicate verb are matched against the collocation pairs in the word collocation pair set, and when the match succeeds an SVO-structured information segment is generated from the argument and the predicate verb and output. If there is a negation word and the predicate verb is intransitive, the argument and the predicate verb are likewise matched against the collocation pairs, and when the match succeeds the retrieved negation word is inserted between the argument and the predicate verb, and an SVO-structured information segment is generated from the argument, the negation word, and the predicate verb and output. If the predicate verb is transitive, then after the argument and the predicate verb have been successfully matched against the collocation pairs, the device goes on to retrieve the noun or noun phrase immediately following the predicate verb, and generates and outputs an SVO-structured information segment from the argument, the negation word (if any), the predicate verb, and the retrieved noun or noun phrase.
The SVO-structured information segment may contain a subject, a predicate, and an object, or only a subject and a predicate, for example "enterprise goes bankrupt" or "student buys pencil". The word collocation pair set may contain several kinds of word collocation pairs (for example, subject-predicate pairs and verb-object pairs). These pairs may be collected manually after analyzing a large amount of information related to the information being processed, or may be acquired automatically from such information using an existing method.
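The matching procedure in this step can be sketched as below, under several simplifying assumptions that are not in the patent: each sub-sentence is given as a list of (word, tag) pairs with the hypothetical tags "ARG", "v", and "n"; the collocation pair set is stored as (argument, predicate verb) tuples; transitive verbs are supplied as a separate set; and the negation words are placeholders.

```python
# Hypothetical negation words; the patent does not list them.
NEGATION_WORDS = {"not", "never"}

def generate_segments(tagged_words, collocation_pairs, transitive_verbs):
    """Build SVO segments from a list of (word, tag) pairs.

    Tags used in this sketch: "ARG" (tagged argument), "v" (verb), "n" (noun).
    """
    segments = []
    for i, (word, tag) in enumerate(tagged_words):
        if tag != "ARG":
            continue
        j = i + 1
        negation = None
        # An optional negation word may sit between the argument and the verb.
        if j < len(tagged_words) and tagged_words[j][0] in NEGATION_WORDS:
            negation = tagged_words[j][0]
            j += 1
        if j >= len(tagged_words) or tagged_words[j][1] != "v":
            continue
        verb = tagged_words[j][0]
        # Keep only argument/verb combinations found in the collocation pair set.
        if (word, verb) not in collocation_pairs:
            continue
        segment = [word]
        if negation is not None:
            segment.append(negation)
        segment.append(verb)
        # For a transitive verb, also take the adjacent noun as the object.
        if verb in transitive_verbs and j + 1 < len(tagged_words):
            next_word, next_tag = tagged_words[j + 1]
            if next_tag == "n":
                segment.append(next_word)
        segments.append(tuple(segment))
    return segments

pairs = {("student", "buys"), ("enterprise", "goes bankrupt")}
svo = generate_segments(
    [("student", "ARG"), ("buys", "v"), ("pencil", "n")], pairs, {"buys"}
)
```

In this sketch a transitive verb with no following noun still produces a subject-predicate segment; such structurally incomplete segments are the kind a later filtering step would discard.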
With continued reference to Fig. 3, Fig. 3 is a schematic diagram of an application scenario of the information segment generation method according to this embodiment. In the scenario of Fig. 3, the user presets, as needed, the argument set "A team; B team" and the word collocation pair set "team loses; team advances; team wins the championship", and enters the information "On November 12, Beijing time, A team and B team played a match at the sports center; after a fierce contest, A team advanced successfully." into the terminal device through the input control 301 displayed by the terminal device. After the user confirms the submission with the submit button 302, the terminal device first performs sentence segmentation on the obtained information to obtain at least one sub-sentence; next, it tags the arguments in each sub-sentence according to the argument set; then it performs word segmentation on the sub-sentences with tagged arguments and part-of-speech tagging on the resulting words; finally, based on the word collocation pair set, the tagged arguments, and the part-of-speech tagging result, it generates the SVO-structured information segment "A team advances" and displays it through the output control 303, as shown in Fig. 3.
By generating SVO-structured information segments, the method provided by the above embodiment of this application expresses the core content of the information fully and concisely.
With further reference to Fig. 4, a flow 400 of another embodiment of the information segment generation method is shown. The flow 400 of this information segment generation method comprises the following steps:
Step 401: perform sentence segmentation on the obtained information to obtain at least one sub-sentence.
In this embodiment, the electronic device on which the information segment generation method runs may perform sentence segmentation on the obtained information to obtain at least one sub-sentence. The information may be text directly, or text produced by recognizing image information or speech information.
Step 402: tag the arguments in the at least one sub-sentence according to the preset argument set.
In this embodiment, based on the at least one sub-sentence obtained in step 401, the electronic device may match the words in each sub-sentence against the arguments in the preset argument set and tag the arguments in the sub-sentence that match.
Step 403: perform word segmentation on each sub-sentence with tagged arguments, and perform part-of-speech tagging on the words obtained from the word segmentation.
In this embodiment, the electronic device may perform word segmentation on each sub-sentence whose arguments were tagged in step 402, and perform part-of-speech tagging on the resulting words to determine the part of speech of each word.
Step 404: analyze each sub-sentence with tagged arguments, based on the preset word collocation pair set, the tagged arguments, and the part-of-speech tagging result, to generate information segments with an SVO structure.
In this embodiment, the electronic device may match the arguments tagged in step 402 and the predicate verbs tagged in step 403 against the collocation pairs in the preset word collocation pair set, and generate SVO-structured information segments according to the matching result. An SVO-structured information segment here is an information segment that contains a subject, a predicate, and/or an object.
Step 405: remove, based on a domain dictionary and the word collocation pair set, information segments that are ambiguous and/or structurally incomplete from the generated information segments.
In this embodiment, the electronic device may further process the SVO-structured information segments generated in step 404, based on a domain dictionary and the preset word collocation pair set. Specifically, it removes information segments that are ambiguous and/or structurally incomplete; for example, it removes a segment such as "scientist proposes", whose object is missing. The domain dictionary may be chosen directly from existing domain dictionaries (for example, a finance dictionary, an electric power dictionary, or a machinery dictionary) according to the content of the obtained information, or may be built according to actual needs.
As can be seen from Fig. 4, compared with the embodiment corresponding to Fig. 2, the flow 400 of the information segment generation method in this embodiment adds the step of removing ambiguous and/or structurally incomplete information segments. The scheme described in this embodiment can therefore make the generated information segments more accurate and effective, and thus express the information more accurately.
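One way to picture the structural-completeness check is the sketch below. It is an assumption-laden illustration, not the patent's procedure: segments are represented as dictionaries with hypothetical keys "subject", "predicate", and "object"; a verb counts as transitive if it appears in a verb-object collocation pair; and the removal of ambiguous segments is not covered.

```python
def remove_incomplete_segments(segments, verb_object_pairs):
    """Drop segments whose predicate takes an object but has none.

    verb_object_pairs: set of (verb, object) collocation pairs; any verb that
    appears in one is treated as transitive (an assumption of this sketch).
    """
    transitive = {verb for verb, _ in verb_object_pairs}
    return [
        segment for segment in segments
        if not (segment["predicate"] in transitive and segment.get("object") is None)
    ]

verb_object_pairs = {("proposes", "theory"), ("buys", "pencil")}
candidates = [
    {"subject": "scientist", "predicate": "proposes", "object": None},
    {"subject": "student", "predicate": "buys", "object": "pencil"},
    {"subject": "enterprise", "predicate": "goes bankrupt", "object": None},
]
kept = remove_incomplete_segments(candidates, verb_object_pairs)
```

Here "scientist proposes" is dropped because its object is missing, while "enterprise goes bankrupt" is kept because its predicate needs no object.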
With further reference to Fig. 5, as an implementation of the methods shown in the figures above, this application provides one embodiment of an information segment generating device. This device embodiment corresponds to the method embodiment shown in Fig. 2, and the device may be applied in various electronic devices.
As shown in Fig. 5, the information segment generating device 500 of this embodiment comprises a segmentation unit 501, a tagging unit 502, a word segmentation unit 503, and a generation unit 504. The segmentation unit 501 is configured to perform sentence segmentation on obtained information to obtain at least one sub-sentence; the tagging unit 502 is configured to tag arguments in the at least one sub-sentence according to a preset argument set; the word segmentation unit 503 is configured to perform word segmentation on each sub-sentence with tagged arguments and to perform part-of-speech tagging on the words obtained from the word segmentation; and the generation unit 504 is configured to analyze each sub-sentence with tagged arguments, based on a preset word collocation pair set, the tagged arguments, and the part-of-speech tagging result, to generate information segments with an SVO structure.
In this embodiment, the segmentation unit 501 of the information segment generating device 500 may segment the sentences of the information according to the punctuation marks in the obtained information (for example, a domestic news item), to obtain at least one sub-sentence.
In this embodiment, based on the at least one sub-sentence produced by the segmentation unit 501, the tagging unit 502 may match the words in each sub-sentence against the arguments in the preset argument set and tag the arguments in the sub-sentence that match.
In this embodiment, the word segmentation unit 503 may perform word segmentation on each sub-sentence whose arguments have been tagged by the tagging unit 502, and perform part-of-speech tagging on the resulting words to determine the part of speech of each word.
In this embodiment, the generation unit 504 may analyze each sub-sentence with tagged arguments, based on the preset word collocation pair set, the tagged arguments, and the part-of-speech tagging result, to generate information segments with an SVO structure.
Those skilled in the art will understand that the information segment generating device 500 also includes other well-known structures, such as a processor and memory. To avoid unnecessarily obscuring the embodiments of the present disclosure, these well-known structures are not shown in Fig. 5.
Below with reference to Fig. 6, it illustrates the structural representation of the computer system 600 of terminal device or the server be suitable for for realizing the embodiment of the present application.
As shown in Figure 6, computer system 600 comprises CPU (central processing unit) (CPU) 601, and it or can be loaded into the program random access storage device (RAM) 603 from storage area 608 and perform various suitable action and process according to the program be stored in ROM (read-only memory) (ROM) 602.In RAM603, also store system 600 and operate required various program and data.CPU601, ROM602 and RAM603 are connected with each other by bus 604.I/O (I/O) interface 605 is also connected to bus 604.
I/O interface 605 is connected to: the importation 606 comprising keyboard, mouse etc. with lower component; Comprise the output 607 of such as cathode-ray tube (CRT) (CRT), liquid crystal display (LCD) etc. and loudspeaker etc.; Comprise the storage area 608 of hard disk etc.; And comprise the communications portion 609 of network interface unit of such as LAN card, modulator-demodular unit etc.Communications portion 609 is via the network executive communication process of such as the Internet.Driver 610 is also connected to I/O interface 605 as required.Detachable media 611, such as disk, CD, magneto-optic disk, semiconductor memory etc., be arranged on driver 610 as required, so that the computer program read from it is mounted into storage area 608 as required.
In particular, according to embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611.
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and any combination of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present application may be implemented in software or in hardware. The described units may also be provided in a processor; for example, this may be described as: a processor comprising a cutting unit, a marking unit, a word segmentation unit, and a generation unit. In certain cases the names of these units do not limit the units themselves; for example, the cutting unit may also be described as "a unit that performs sentence segmentation on the obtained information to obtain at least one clause".
As another aspect, the present application also provides a non-volatile computer storage medium. This non-volatile computer storage medium may be the one included in the apparatus described in the above embodiments, or it may be a non-volatile computer storage medium that exists separately and is not assembled into a terminal. The non-volatile computer storage medium stores one or more programs which, when executed by a device, cause the device to: perform sentence segmentation on obtained information to obtain at least one clause; mark arguments in the at least one clause according to a preset argument set; perform word segmentation on each clause in which arguments have been marked, and perform part-of-speech tagging on the words obtained by the word segmentation; and analyze each clause in which arguments have been marked, based on a preset lexical collocation set, the marked arguments, and the part-of-speech tagging results, to generate information segments with a subject-verb-object (SVO) structure.
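The first two steps listed above (cutting the information into clauses and marking arguments) can be sketched as follows, assuming punctuation-based cutting and a word lookup tree (trie) as recited in claims 3 and 4. The delimiter set, the function names, and the longest-match strategy are assumptions chosen for illustration only.

```python
import re

# Punctuation treated as clause boundaries; this particular set is an assumption.
CLAUSE_DELIMITERS = "。！？；，.!?;,"


def split_clauses(text):
    """Cut text into clauses at punctuation marks, dropping empty pieces."""
    pattern = "[" + re.escape(CLAUSE_DELIMITERS) + "]"
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


def build_trie(arguments):
    """Build a character trie (word lookup tree) from the argument set."""
    root = {}
    for word in arguments:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$end"] = True  # sentinel key: a complete argument ends here
    return root


def mark_arguments(clause, trie):
    """Return (start, end, argument) for the longest argument found at each position."""
    found = []
    i = 0
    while i < len(clause):
        node, longest = trie, None
        for j in range(i, len(clause)):
            node = node.get(clause[j])
            if node is None:
                break
            if "$end" in node:
                longest = j + 1
        if longest is None:
            i += 1
        else:
            found.append((i, longest, clause[i:longest]))
            i = longest
    return found


trie = build_trie(["手机", "电池"])
for clause in split_clauses("手机很轻，电池耐用。"):
    print(clause, mark_arguments(clause, trie))
```

Because the trie is built once from the argument set, each clause is scanned without comparing it against every argument separately. Note that treating "." as a delimiter would also split decimal numbers, so a real delimiter set would need to be chosen with the input data in mind.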
The above description is only of the preferred embodiments of the present application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the particular combination of the above technical features; it also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present application.

Claims (10)

1. An information segment generation method, characterized in that the method comprises:
performing sentence segmentation on obtained information to obtain at least one clause;
marking arguments in the at least one clause according to a preset argument set;
performing word segmentation on each clause in which arguments have been marked, and performing part-of-speech tagging on the words obtained by the word segmentation;
analyzing each clause in which arguments have been marked, based on a preset lexical collocation set, the marked arguments, and the part-of-speech tagging results, to generate information segments with a subject-verb-object (SVO) structure.
2. The method according to claim 1, characterized in that the method further comprises:
removing, based on a domain dictionary and the lexical collocation set, information segments that are ambiguous and/or structurally incomplete from the generated information segments.
3. The method according to claim 1, characterized in that performing sentence segmentation on the obtained information to obtain at least one clause comprises:
cutting the sentences in the information according to the punctuation marks in the obtained information, to obtain at least one clause.
4. The method according to claim 1, characterized in that marking arguments in the at least one clause according to the preset argument set comprises:
building a word lookup tree according to the argument set;
determining, according to the word lookup tree, whether each clause contains an argument in the argument set, and if so, marking that argument.
5. The method according to claim 1, characterized in that performing word segmentation on each clause in which arguments have been marked comprises:
performing word segmentation on each clause in which arguments have been marked, using a full segmentation method in combination with a domain dictionary, to obtain at least one word.
6. An information segment generating apparatus, characterized in that the apparatus comprises:
a cutting unit, configured to perform sentence segmentation on obtained information to obtain at least one clause;
a marking unit, configured to mark arguments in the at least one clause according to a preset argument set;
a word segmentation unit, configured to perform word segmentation on each clause in which arguments have been marked, and to perform part-of-speech tagging on the words obtained by the word segmentation;
a generation unit, configured to analyze each clause in which arguments have been marked, based on a preset lexical collocation set, the marked arguments, and the part-of-speech tagging results, to generate information segments with a subject-verb-object (SVO) structure.
7. The apparatus according to claim 6, characterized in that the apparatus further comprises:
a removal unit, configured to remove, based on a domain dictionary and the lexical collocation set, information segments that are ambiguous and/or structurally incomplete from the generated information segments.
8. The apparatus according to claim 6, characterized in that the cutting unit is further configured to:
cut the sentences in the information according to the punctuation marks in the obtained information, to obtain at least one clause.
9. The apparatus according to claim 6, characterized in that the marking unit is further configured to:
build a word lookup tree according to the argument set;
determine, according to the word lookup tree, whether each clause contains an argument in the argument set, and if so, mark that argument.
10. The apparatus according to claim 6, characterized in that the word segmentation unit is further configured to:
perform word segmentation on each clause in which arguments have been marked, using a full segmentation method in combination with a domain dictionary, to obtain at least one word.
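By way of illustration only, the word segmentation recited in claims 5 and 10 can be sketched as an enumeration of every way to cut a clause into dictionary words or single characters. This reading of the segmentation method, the function name `full_segmentation`, and the `max_len` parameter are assumptions and are not taken from the claims.

```python
def full_segmentation(clause, dictionary, max_len=4):
    """Enumerate every cut of clause into dictionary words or single characters.

    dictionary: set of known words, e.g. a general lexicon merged with a
                domain dictionary (the merge itself is assumed here).
    max_len:    longest word length to try; a hypothetical tuning parameter.
    """
    if not clause:
        return [[]]
    results = []
    for end in range(1, min(max_len, len(clause)) + 1):
        piece = clause[:end]
        # Single characters are always allowed so that every clause can be cut.
        if end == 1 or piece in dictionary:
            for rest in full_segmentation(clause[end:], dictionary, max_len):
                results.append([piece] + rest)
    return results


candidates = full_segmentation("电池耐用", {"电池", "耐用"})
for words in candidates:
    print(words)
# One possible heuristic (an assumption): prefer the cut with the fewest words.
print(min(candidates, key=len))
```

The number of candidate cuts grows quickly with clause length, so this sketch is only suited to short clauses; a practical implementation would typically prune or score candidates rather than list them all.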

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510918463.0A CN105573980A (en) 2015-12-10 2015-12-10 Information segment generation method and device

Publications (1)

Publication Number Publication Date
CN105573980A (en) 2016-05-11

Family

ID=55884132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510918463.0A Pending CN105573980A (en) 2015-12-10 2015-12-10 Information segment generation method and device

Country Status (1)

Country Link
CN (1) CN105573980A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106802887A (en) * 2016-12-30 2017-06-06 北京三快在线科技有限公司 Word segmentation processing method and device, electronic equipment
CN107908792A (en) * 2017-12-13 2018-04-13 北京百度网讯科技有限公司 Information-pushing method and device
CN109582949A (en) * 2018-09-14 2019-04-05 阿里巴巴集团控股有限公司 Event element extraction method and device, computing equipment and storage medium
CN110807311A (en) * 2018-07-18 2020-02-18 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN112686024A (en) * 2020-12-31 2021-04-20 竹间智能科技(上海)有限公司 Syntax parsing method and apparatus, electronic device, and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20100082980A (en) * 2009-01-12 2010-07-21 울산대학교 산학협력단 Method for tagging part of speech and homograph, terminal device using the same
CN103117919A (en) * 2013-01-21 2013-05-22 南京邮电大学 Method of collecting Sina microblog group purchase information based on position service
CN104536950A (en) * 2014-12-11 2015-04-22 北京百度网讯科技有限公司 Text summarization generating method and device

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106802887A (en) * 2016-12-30 2017-06-06 北京三快在线科技有限公司 Word segmentation processing method and device, electronic equipment
CN107908792A (en) * 2017-12-13 2018-04-13 北京百度网讯科技有限公司 Information-pushing method and device
CN110807311A (en) * 2018-07-18 2020-02-18 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN110807311B (en) * 2018-07-18 2023-06-23 百度在线网络技术(北京)有限公司 Method and device for generating information
CN109582949A (en) * 2018-09-14 2019-04-05 阿里巴巴集团控股有限公司 Event element extraction method and device, computing equipment and storage medium
CN112686024A (en) * 2020-12-31 2021-04-20 竹间智能科技(上海)有限公司 Syntax parsing method and apparatus, electronic device, and storage medium
CN112686024B (en) * 2020-12-31 2023-12-22 竹间智能科技(上海)有限公司 Syntax analysis method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US20220350965A1 (en) Method for generating pre-trained language model, electronic device and storage medium
CN113807098B (en) Model training method and device, electronic equipment and storage medium
CN107247707B (en) Enterprise association relation information extraction method and device based on completion strategy
US20160342578A1 (en) Systems, Methods, and Media for Generating Structured Documents
CN105573980A (en) Information segment generation method and device
US20150121290A1 (en) Semantic Lexicon-Based Input Method Editor
CN108319586B (en) Information extraction rule generation and semantic analysis method and device
KR20220064016A (en) Method for extracting construction safety accident based data mining using big data
US8880391B2 (en) Natural language processing apparatus, natural language processing method, natural language processing program, and computer-readable recording medium storing natural language processing program
US20190303437A1 (en) Status reporting with natural language processing risk assessment
CN113704420A (en) Method and device for identifying role in text, electronic equipment and storage medium
US9038004B2 (en) Automated integrated circuit design documentation
Khemani et al. A review on reddit news headlines with nltk tool
CN109992651A (en) A kind of problem target signature automatic identification and abstracting method
CN112667208A (en) Translation error recognition method and device, computer equipment and readable storage medium
CN111931491A (en) Domain dictionary construction method and device
CN107908792B (en) Information pushing method and device
US20060085366A1 (en) Method and system for creating hierarchical classifiers of software components
EP4246365A1 (en) Webpage identification method and apparatus, electronic device, and medium
CN110750989B (en) Statement analysis method and device
CN114490709A (en) Text generation method and device, electronic equipment and storage medium
US20150324333A1 (en) Systems and methods for automatically generating hyperlinks
CN116909533B (en) Method and device for editing computer program statement, storage medium and electronic equipment
CN109214005A (en) A kind of clue extracting method and system based on Chinese word segmentation
KR102640887B1 (en) Method and electronic device for generating multilingual website content

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160511